Spaces:

nice-bill
/

personalisation-engine

Sleeping

App Files Files Community

nice-bill commited on Dec 6, 2025

Commit

7964128

1 Parent(s): 40ff852

Deploy personalization engine

Browse files

Files changed (21) hide show

.dockerignore +16 -0
.gitignore +18 -0
.python-version +1 -0
Dockerfile +71 -0
README.md +149 -4
pyproject.toml +23 -0
requirements.txt +8 -0
scripts/1b_generate_semantic_data.py +109 -0
scripts/download_artifacts.py +54 -0
scripts/download_model.py +13 -0
scripts/evaluate_quality.py +94 -0
scripts/evaluate_system.py +149 -0
scripts/inspect_data.py +33 -0
scripts/optimize_index.py +47 -0
scripts/visualize_users.py +113 -0
src/personalization/__init__.py +0 -0
src/personalization/api/__init__.py +0 -0
src/personalization/api/main.py +186 -0
src/personalization/config.py +12 -0
test_api.py +46 -0
uv.lock +0 -0

.dockerignore ADDED Viewed

	@@ -0,0 +1,16 @@

+__pycache__
+*.pyc
+*.pyo
+*.pyd
+.Python
+env/
+venv/
+.venv/
+*.log
+.git
+.mypy_cache
+.pytest_cache
+# Ignore raw data if any, but keep catalog/index/embeddings
+data/raw
+data/synthetic

.gitignore ADDED Viewed

	@@ -0,0 +1,18 @@

+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+.venv/
+venv/
+.env
+# Data & Models (Too large for git)
+data/
+*.pt
+*.pth
+*.parquet
+*.csv
+# IDE
+.vscode/
+.idea/

.python-version ADDED Viewed

	@@ -0,0 +1 @@


1	+ 3.12

Dockerfile ADDED Viewed

	@@ -0,0 +1,71 @@

+# Build Stage
+FROM python:3.10-slim AS builder
+WORKDIR /app
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PYTHONUNBUFFERED=1
+# Install build dependencies
+RUN apt-get update && apt-get install -y \
+    gcc \
+    python3-dev \
+    curl \
+    git \
+    && rm -rf /var/lib/apt/lists/*
+# Install uv
+RUN pip install --no-cache-dir uv
+COPY requirements.txt .
+# Create virtual environment and install dependencies
+ENV UV_HTTP_TIMEOUT=300
+RUN uv venv .venv && \
+    uv pip install --no-cache -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu
+# --- Runtime Stage ---
+FROM python:3.10-slim
+WORKDIR /app
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PYTHONUNBUFFERED=1
+ENV PATH="/app/.venv:$PATH"
+ENV LANG=C.UTF-8
+ENV LC_ALL=C.UTF-8
+# Install runtime dependencies
+RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
+# Copy virtual environment from builder
+COPY --from=builder /app/.venv /app/.venv
+# Copy Scripts
+COPY scripts/ ./scripts/
+# Data & Model Baking
+ENV HF_HOME=/app/data/model_cache
+# Download Model
+RUN /app/.venv/bin/python scripts/download_model.py
+# Download Data
+# Ensure data directory exists
+RUN mkdir -p data/catalog data/index
+RUN /app/.venv/bin/python scripts/download_artifacts.py
+# Copy Code (Last to maximize layer caching)
+COPY src/ ./src/
+# Create directories and permissions
+# RUN addgroup --system app && adduser --system --group app && \
+#     chown -R app:app /app
+# USER app
+# Expose port
+EXPOSE 7860
+# Run Command
+CMD ["uvicorn", "src.personalization.api.main:app", "--host", "0.0.0.0", "--port", "7860"]

README.md CHANGED Viewed

@@ -1,11 +1,156 @@
----
 title: Personalisation Engine
 emoji: 😻
 colorFrom: gray
 colorTo: indigo
 sdk: docker
 pinned: false
-license: mit
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 title: Personalisation Engine
 emoji: 😻
 colorFrom: gray
 colorTo: indigo
 sdk: docker
 pinned: false
+# Semantic Book Personalization Engine
+A high-performance, standalone recommendation service that uses **Semantic Search** to provide personalized book suggestions.
+Unlike traditional recommenders that rely on collaborative filtering (which fails without massive user data), this engine uses **Sentence Transformers** to understand the *content* of books (Title + Author + Genre + Description), allowing it to work effectively from Day 1 ("Cold Start").
+## 🚀 Key Features
+*   **Semantic Understanding:** Connects "The Haunted School" to "Ghost Beach" based on plot descriptions, not just title keywords.
+*   **Hybrid Scoring:** Combines **Semantic Similarity** (85%) with **Book Ratings** (15%) to recommend high-quality matches.
+*   **Smart Optimization:** Uses **Product Quantization (IVF-PQ)** to compress the search index by **48x** (146MB -> 3MB) with minimal accuracy loss.
+*   **Time-Decay Memory:** Prioritizes a user's *recent* reads over ancient history.
+*   **Evaluation:** Achieves **40% Exact Hit Rate @ 10** on held-out author tests.
+*   **Standalone API:** Runs as a separate microservice (FastAPI) on Port 8001.
+## 🏗️ Architecture
+This project uses a **retrieval-based** approach:
+1.  **The Brain:** A pre-trained `all-MiniLM-L6-v2` model encodes all book metadata (Title, Author, Genre, Description) into 384-dimensional vectors.
+2.  **The Index:** A highly optimized FAISS `IndexIVFPQ` (Inverted File + Product Quantization) index for millisecond retrieval.
+3.  **The Engine:**
+    *   User history is converted to vectors.
+    *   Vectors are aggregated using **Time-Decay Averaging**.
+    *   The engine searches the FAISS index for the nearest neighbors.
+    *   Results are re-ranked using the book's rating.
+## 📦 Installation & Setup
+### Prerequisites
+*   Python 3.10+ (or Docker)
+*   `uv` (recommended for fast package management) or `pip`
+### 1. Clone the Repository
+```bash
+git clone <your-repo-url>
+cd personalise
+```
+### 2. Setup Environment
+```bash
+# Using uv (Recommended)
+uv venv
+# Windows:
+.venv\Scripts\activate
+# Linux/Mac:
+source .venv/bin/activate
+uv pip install -r requirements.txt
+```
+### 3. Data Preparation (Crucial Step)
+The system needs the "Brain" (Embeddings) and "Index" to function.
+**Option A: Download Pre-computed Artifacts (Fast)**
+```bash
+# Make sure you are in the root 'personalise' folder
+python scripts/download_artifacts.py
+```
+**Option B: Generate from Scratch (Slow - ~1.5 hours)**
+```bash
+# 1. Generate Embeddings
+python scripts/1b_generate_semantic_data.py
+# 2. Optimize Index
+python scripts/optimize_index.py
+```
+## 🏃 Run the Application
+### Option A: Run Locally
+```bash
+uvicorn src.personalization.api.main:app --reload --port 8001
+```
+API will be available at `http://localhost:8001`.
+### Option B: Run with Docker
+The Dockerfile is optimized to cache the model and data layers.
+```bash
+# 1. Build the image
+docker build -t personalise .
+# 2. Run the container
+docker run -p 8001:8001 personalise
+```
+## 🧪 Evaluation & Demo
+We have included a synthetic dataset of 10,000 users to validate the model.
+**Run the Offline Evaluation:**
+This script uses a "Leave-One-Out" strategy to see if the model can predict the next book a user reads.
+```bash
+python scripts/evaluate_system.py
+```
+**Visualize User Clusters:**
+Generate a 2D t-SNE plot showing how the model groups users by interest (requires `matplotlib` & `seaborn`).
+```bash
+# First install viz deps
+uv pip install matplotlib seaborn
+# Run visualization
+python scripts/visualize_users.py
+```
+*Output saved to `docs/user_clusters_tsne.png`*
+**Inspect Synthetic Data:**
+```bash
+python scripts/inspect_data.py
+```
+## 📡 API Usage
+#### POST `/personalize/recommend`
+Get personalized books based on reading history.
+```json
+{
+  "user_history": ["The Haunted School", "It Came from Beneath the Sink!"],
+  "top_k": 5
+}
+```
+#### POST `/search`
+Semantic search by plot or vibe.
+```json
+{
+  "query": "detective in space solving crimes",
+  "top_k": 5
+}
+```
+## 📊 Performance Stats
+| Metric | Brute Force (Flat) | Optimized (IVF-PQ) |
+| :--- | :--- | :--- |
+| **Memory** | ~150 MB | **~3 MB** |
+| **Recall @ 10** | 100% | ~95% |
+| **Speed** | ~10ms | ~2ms |
+| **Hit Rate @ 10** | N/A | **40.0%** |
+## 🗺️ Roadmap & Future Improvements
+*   **Model Compression (ONNX):** Replace the heavy PyTorch dependency with **ONNX Runtime**. This would reduce the Docker image size from ~3GB to ~500MB and improve CPU inference latency by 2-3x.
+*   **Real-Time Learning:** Implement a "Session-Based" Recommender (using RNNs or Transformers) to adapt to user intent within a single session, rather than just long-term history.
+*   **A/B Testing Framework:** Add infrastructure to serve different model versions to different user segments to scientifically measure engagement.
+## 📄 License
+MIT

pyproject.toml ADDED Viewed

	@@ -0,0 +1,23 @@

+[project]
+name = "personalise"
+version = "0.1.0"
+description = "Add your description here"
+readme = "README.md"
+requires-python = ">=3.12"
+dependencies = [
+    "faiss-cpu>=1.13.0",
+    "fastapi>=0.123.0",
+    "huggingface-hub>=0.36.0",
+    "matplotlib>=3.10.7",
+    "numpy>=2.3.5",
+    "pandas>=2.3.3",
+    "prometheus-fastapi-instrumentator>=7.1.0",
+    "pyarrow>=22.0.0",
+    "requests>=2.32.5",
+    "scikit-learn>=1.7.2",
+    "seaborn>=0.13.2",
+    "sentence-transformers>=5.1.2",
+    "torch>=2.9.1",
+    "tqdm>=4.67.1",
+    "uvicorn>=0.38.0",
+]

requirements.txt ADDED Viewed

	@@ -0,0 +1,8 @@

+fastapi
+uvicorn
+numpy
+pandas
+faiss-cpu
+sentence-transformers
+requests
+huggingface_hub

scripts/1b_generate_semantic_data.py ADDED Viewed

	@@ -0,0 +1,109 @@

+import pandas as pd
+import numpy as np
+from pathlib import Path
+from tqdm import tqdm
+import json
+import torch
+from sentence_transformers import SentenceTransformer
+import random
+import faiss
+NUM_USERS = 10000
+MIN_SEQUENCE_LENGTH = 5
+MAX_SEQUENCE_LENGTH = 50
+DATA_DIR = Path("data")
+CATALOG_PATH = DATA_DIR / "catalog" / "books_catalog.csv"
+OUTPUT_DIR = DATA_DIR / "synthetic"
+MODEL_NAME = "all-MiniLM-L6-v2"
+def main():
+    print("Loading catalog...")
+    df = pd.read_csv(CATALOG_PATH)
+    df['rich_content'] = (
+        "Title: " + df['title'].fillna("") +
+        "; Author: " + df['authors'].fillna("Unknown") +
+        "; Genres: " + df['genres'].fillna("") +
+        "; Description: " + df['description'].fillna("").astype(str).str.slice(0, 300)
+    )
+    titles = df['title'].tolist()
+    content_to_encode = df['rich_content'].tolist()
+    EMBEDDINGS_CACHE = DATA_DIR / "embeddings_cache.npy"
+    if EMBEDDINGS_CACHE.exists():
+        print(f"Loading cached embeddings from {EMBEDDINGS_CACHE}...")
+        emb_np = np.load(EMBEDDINGS_CACHE)
+        print("Embeddings loaded.")
+    else:
+        print(f"Loading Teacher Model ({MODEL_NAME})...")
+        device = "cuda" if torch.cuda.is_available() else "cpu"
+        model = SentenceTransformer(MODEL_NAME, device=device)
+        print("Encoding books (Title + Author + Genre + Desc)...")
+        embeddings = model.encode(content_to_encode, show_progress_bar=True, convert_to_tensor=True)
+        emb_np = embeddings.cpu().numpy()
+        print(f"Saving embeddings to {EMBEDDINGS_CACHE}...")
+        np.save(EMBEDDINGS_CACHE, emb_np)
+    print(f"Generating {NUM_USERS} semantic user journeys...")
+    cpu_index = faiss.IndexFlatIP(emb_np.shape[1])
+    faiss.normalize_L2(emb_np)
+    cpu_index.add(emb_np)
+    users = []
+    for user_id in tqdm(range(NUM_USERS)):
+        sequence = []
+        num_interests = random.choice([1, 1, 2, 3])
+        for _ in range(num_interests):
+            anchor_idx = random.randint(0, len(titles) - 1)
+            k_neighbors = 50
+            q = emb_np[anchor_idx].reshape(1, -1)
+            _, indices = cpu_index.search(q, k_neighbors)
+            neighbors_indices = indices[0]
+            num_to_read = random.randint(5, 15)
+            read_indices = np.random.choice(neighbors_indices, size=min(len(neighbors_indices), num_to_read), replace=False)
+            for idx in read_indices:
+                sequence.append(titles[idx])
+        if len(sequence) > MAX_SEQUENCE_LENGTH:
+            sequence = sequence[:MAX_SEQUENCE_LENGTH]
+        if len(sequence) >= MIN_SEQUENCE_LENGTH:
+             users.append({
+                'user_id': user_id,
+                'book_sequence': sequence,
+                'sequence_length': len(sequence),
+                'persona': 'semantic_explorer',
+                'metadata': {'generated': True}
+            })
+    users_df = pd.DataFrame(users)
+    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+    output_path = OUTPUT_DIR / "user_sequences.parquet"
+    users_df.to_parquet(output_path, index=False)
+    stats = {
+        'num_users': len(users_df),
+        'avg_sequence_length': float(users_df['sequence_length'].mean()),
+        'generated_via': "semantic_clustering"
+    }
+    with open(OUTPUT_DIR / "user_metadata.json", 'w') as f:
+        json.dump(stats, f, indent=2)
+    print(f"\n Generated {len(users_df)} semantic users")
+    print(f"   Output: {output_path}")
+if __name__ == "__main__":
+    main()

scripts/download_artifacts.py ADDED Viewed

	@@ -0,0 +1,54 @@

+import os
+from pathlib import Path
+from huggingface_hub import hf_hub_download
+import shutil
+HF_REPO_ID = "nice-bill/book-recommender-artifacts"
+REPO_TYPE = "dataset"
+# Local Paths
+DATA_DIR = Path("data")
+CATALOG_DIR = DATA_DIR / "catalog"
+INDEX_DIR = DATA_DIR / "index"
+FILES_TO_DOWNLOAD = {
+    "books_catalog.csv": CATALOG_DIR,
+    "embeddings_cache.npy": DATA_DIR,
+    "optimized.index": INDEX_DIR
+}
+def main():
+    print(f"--- Checking artifacts from {HF_REPO_ID} ---")
+    # Ensure directories exist
+    for dir_path in [DATA_DIR, CATALOG_DIR, INDEX_DIR]:
+        dir_path.mkdir(parents=True, exist_ok=True)
+    for filename, dest_dir in FILES_TO_DOWNLOAD.items():
+        dest_path = dest_dir / filename
+        if dest_path.exists():
+            print(f"Found {filename}")
+            continue
+        print(f"Downloading {filename}...")
+        try:
+            # Download to local cache
+            cached_path = hf_hub_download(
+                repo_id=HF_REPO_ID,
+                filename=filename,
+                repo_type=REPO_TYPE
+            )
+            # Copy from cache to our project structure
+            shutil.copy(cached_path, dest_path)
+            print(f"   Saved to {dest_path}")
+        except Exception as e:
+            print(f"Failed to download {filename}: {e}")
+            print("   (Did you create the HF repo and upload the files?)")
+    print("\nArtifact setup complete.")
+if __name__ == "__main__":
+    main()

scripts/download_model.py ADDED Viewed

	@@ -0,0 +1,13 @@

+from sentence_transformers import SentenceTransformer
+import os
+MODEL_NAME = "all-MiniLM-L6-v2"
+def download():
+    print(f"Downloading {MODEL_NAME}...")
+    # This will download to HF_HOME (set in Dockerfile)
+    SentenceTransformer(MODEL_NAME)
+    print("Done.")
+if __name__ == "__main__":
+    download()

scripts/evaluate_quality.py ADDED Viewed

	@@ -0,0 +1,94 @@

+import pandas as pd
+import requests
+import random
+import argparse
+import sys
+from tqdm import tqdm
+from pathlib import Path
+# Add src to path
+sys.path.append(str(Path(__file__).parent.parent))
+from src.personalization.config import settings
+# Config
+CATALOG_PATH = Path("data/catalog/books_catalog.csv")
+NUM_SAMPLES = 100
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--host", type=str, default=settings.HOST)
+    parser.add_argument("--port", type=int, default=settings.PORT)
+    parser.add_argument("--samples", type=int, default=100, help="Number of evaluation queries")
+    args = parser.parse_args()
+    api_url = f"http://{args.host}:{args.port}/personalize/recommend"
+    print("Loading catalog for ground truth...")
+    if not CATALOG_PATH.exists():
+        print("Catalog not found!")
+        return
+    df = pd.read_csv(CATALOG_PATH)
+    # Filter authors with at least 5 books
+    author_counts = df['authors'].value_counts()
+    valid_authors = author_counts[author_counts >= 5].index.tolist()
+    print(f"Found {len(valid_authors)} authors with 5+ books.")
+    hits = 0
+    genre_matches = 0
+    total_recs = 0
+    print(f"Running {args.samples} evaluation queries against {api_url}...")
+    for _ in tqdm(range(args.samples)):
+        # 1. Pick a random author
+        author = random.choice(valid_authors)
+        books = df[df['authors'] == author]
+        if len(books) < 5:
+            continue
+        # 2. Split: History (3 books) -> Target (1 book)
+        sample = books.sample(n=4, replace=False)
+        history = sample.iloc[:3]['title'].tolist()
+        target_book = sample.iloc[3]
+        target_title = target_book['title']
+        # 3. Call API
+        try:
+            payload = {"user_history": history, "top_k": 10}
+            resp = requests.post(api_url, json=payload)
+            if resp.status_code != 200:
+                continue
+            recs = resp.json()
+            rec_titles = [r['title'] for r in recs]
+            # Metrics
+            if target_title in rec_titles:
+                hits += 1
+            # Author Match
+            rec_authors = df[df['title'].isin(rec_titles)]['authors'].tolist()
+            if author in rec_authors:
+                matches = rec_authors.count(author)
+                genre_matches += matches
+            total_recs += len(recs)
+        except Exception as e:
+            print(f"Connection Error: {e}")
+            break
+    if total_recs > 0:
+        print("\n--- Evaluation Results ---")
+        print(f"Exact Target Hit Rate @ 10: {hits / args.samples:.2%}")
+        print(f"Same Author Relevance: {genre_matches / total_recs:.2%} (Approx)")
+    else:
+        print("No results obtained. Check API connection.")
+if __name__ == "__main__":
+    main()

scripts/evaluate_system.py ADDED Viewed

	@@ -0,0 +1,149 @@

+import pandas as pd
+import numpy as np
+import faiss
+from pathlib import Path
+import logging
+from sentence_transformers import SentenceTransformer
+from tqdm import tqdm
+# Setup
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger("Evaluator")
+# Paths
+DATA_DIR = Path("data")
+SYNTHETIC_DATA_PATH = DATA_DIR / "synthetic" / "user_sequences.parquet"
+CATALOG_PATH = DATA_DIR / "catalog" / "books_catalog.csv"
+EMBEDDINGS_PATH = DATA_DIR / "embeddings_cache.npy"
+INDEX_PATH = DATA_DIR / "index" / "optimized.index"
+def evaluate_hit_rate(top_k=10, sample_size=1000):
+    """
+    Evaluates the recommender using a Leave-One-Out strategy.
+    metric: Hit Rate @ k
+    """
+    # 1. Load Resources
+    logger.info("Loading Catalog and Embeddings...")
+    if not CATALOG_PATH.exists() or not EMBEDDINGS_PATH.exists():
+        logger.error("Missing Data! Run download scripts first.")
+        return
+    # Load Titles for mapping
+    df_catalog = pd.read_csv(CATALOG_PATH)
+    titles = df_catalog['title'].tolist()
+    # Create Title -> Index map (normalized)
+    title_to_idx = {t.lower().strip(): i for i, t in enumerate(titles)}
+    # Load Embeddings
+    embeddings = np.load(EMBEDDINGS_PATH)
+    # Load Index
+    logger.info("Loading FAISS Index...")
+    if INDEX_PATH.exists():
+        index = faiss.read_index(str(INDEX_PATH))
+        index.nprobe = 10
+    else:
+        logger.info("Optimized index not found, building flat index on the fly...")
+        d = embeddings.shape[1]
+        index = faiss.IndexFlatIP(d)
+        faiss.normalize_L2(embeddings)
+        index.add(embeddings)
+    # 2. Load Synthetic Users
+    logger.info(f"Loading Synthetic Data from {SYNTHETIC_DATA_PATH}...")
+    df_users = pd.read_parquet(SYNTHETIC_DATA_PATH)
+    # Sample users if dataset is too large
+    if len(df_users) > sample_size:
+        df_users = df_users.sample(sample_size, random_state=42)
+    logger.info(f"Evaluating on {len(df_users)} users...")
+    hits = 0
+    processed_users = 0
+    for _, row in tqdm(df_users.iterrows(), total=len(df_users)):
+        history = row['book_sequence']
+        # Need at least 2 books (1 for history, 1 for test)
+        if len(history) < 2:
+            continue
+        # Leave-One-Out Split
+        target_book = history[-1]
+        context_books = history[:-1]
+        # 3. Convert Context to Vector
+        valid_indices = []
+        for book in context_books:
+            norm_title = book.lower().strip()
+            if norm_title in title_to_idx:
+                valid_indices.append(title_to_idx[norm_title])
+        if not valid_indices:
+            continue
+        # Get vectors and average (Time Decay Simulation)
+        context_vectors = embeddings[valid_indices]
+        # Simple Time Decay
+        n = len(valid_indices)
+        decay_factor = 0.9
+        weights = np.array([decay_factor ** (n - 1 - i) for i in range(n)])
+        weights = weights / weights.sum()
+        user_vector = np.average(context_vectors, axis=0, weights=weights).reshape(1, -1).astype(np.float32)
+        faiss.normalize_L2(user_vector)
+        # 4. Search
+        # We search for top_k + len(context) because the model might return books the user already read
+        search_k = top_k + len(valid_indices) + 5
+        scores, indices = index.search(user_vector, search_k)
+        # Filter results
+        recommended_titles = []
+        seen_indices = set(valid_indices) # Don't recommend what they just read
+        for idx in indices[0]:
+            if idx in seen_indices:
+                continue
+            rec_title = titles[idx]
+            recommended_titles.append(rec_title)
+            if len(recommended_titles) >= top_k:
+                break
+        # 5. Check Hit
+        # We check if the TARGET book title is in the recommended list
+        # Using loose matching (substring or exact) can be generous, but strict is better for ML metrics
+        # We'll stick to exact string match (normalized)
+        target_norm = target_book.lower().strip()
+        rec_norm = [t.lower().strip() for t in recommended_titles]
+        if target_norm in rec_norm:
+            hits += 1
+        processed_users += 1
+    # 6. Report
+    if processed_users == 0:
+        print("No valid users found for evaluation.")
+        return
+    hit_rate = hits / processed_users
+    print("\n" + "="*40)
+    print(f"EVALUATION REPORT (Sample: {processed_users} users)")
+    print("="*40)
+    print(f"Metric: Hit Rate @ {top_k}")
+    print(f"Score:  {hit_rate:.4f} ({hit_rate*100:.2f}%)")
+    print("-" * 40)
+    print("Interpretation:")
+    print(f"In {hit_rate*100:.1f}% of cases, the model successfully predicted")
+    print("the exact next book the user would read.")
+    print("="*40)
+if __name__ == "__main__":
+    evaluate_hit_rate()

scripts/inspect_data.py ADDED Viewed

	@@ -0,0 +1,33 @@

+import pandas as pd
+from pathlib import Path
+# Config
+DATA_PATH = Path("data/synthetic/user_sequences.parquet")
+def inspect():
+    if not DATA_PATH.exists():
+        print(f"Error: File not found at {DATA_PATH}")
+        return
+    try:
+        print(f"Reading {DATA_PATH}...")
+        df = pd.read_parquet(DATA_PATH)
+        print("\n--- Schema ---")
+        print(df.info())
+        print("\n--- First 5 Rows ---")
+        print(df.head().to_string())
+        print("\n--- Sample User History ---")
+        # Show the full history of the first user
+        first_user = df.iloc[0]
+        print(f"User ID: {first_user.get('user_id', 'N/A')}")
+        print(f"History: {first_user.get('book_history', 'N/A')}")
+    except Exception as e:
+        print(f"Failed to read parquet: {e}")
+if __name__ == "__main__":
+    inspect()

scripts/optimize_index.py ADDED Viewed

	@@ -0,0 +1,47 @@

+import numpy as np
+import faiss
+from pathlib import Path
+import time
+import sys
+# Config
+DATA_DIR = Path("data")
+EMBEDDINGS_PATH = DATA_DIR / "embeddings_cache.npy"
+OUTPUT_PATH = DATA_DIR / "index" / "optimized.index"
+def main():
+    if not EMBEDDINGS_PATH.exists():
+        print("No embeddings found. Run scripts/1b... first.")
+        sys.exit(1)
+    print(f"Loading embeddings from {EMBEDDINGS_PATH}...")
+    embeddings = np.load(EMBEDDINGS_PATH).astype(np.float32)
+    d = embeddings.shape[1]
+    nb = embeddings.shape[0]
+    print(f"Dataset: {nb} items, {d} dimensions.")
+    nlist = 100
+    m = 32
+    nbits = 8
+    print(f"Training IVF{nlist}, PQ{m} index...")
+    quantizer = faiss.IndexFlatL2(d)
+    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
+    start_t = time.time()
+    index.train(embeddings)
+    print(f"Training time: {time.time() - start_t:.2f}s")
+    print("Adding vectors to index...")
+    index.add(embeddings)
+    OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
+    faiss.write_index(index, str(OUTPUT_PATH))
+    print(f"Optimized index saved to {OUTPUT_PATH}")
+    print(f"Original Size: {nb * d * 4 / 1024 / 1024:.2f} MB")
+    print(f"Optimized Size: {nb * m / 1024 / 1024:.2f} MB (Approx)")
+if __name__ == "__main__":
+    main()

scripts/visualize_users.py ADDED Viewed

	@@ -0,0 +1,113 @@

+import pandas as pd
+import numpy as np
+import logging
+from pathlib import Path
+from tqdm import tqdm
+# Setup Logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger("Visualizer")
+# Paths
+DATA_DIR = Path("data")
+SYNTHETIC_DATA_PATH = DATA_DIR / "synthetic" / "user_sequences.parquet"
+CATALOG_PATH = DATA_DIR / "catalog" / "books_catalog.csv"
+EMBEDDINGS_PATH = DATA_DIR / "embeddings_cache.npy"
+OUTPUT_DIR = Path("docs")
+OUTPUT_IMAGE = OUTPUT_DIR / "user_clusters_tsne.png"
+def visualize_clusters(sample_size=2000):
+    """
+    Generates a 2D t-SNE projection of user vectors, colored by Persona.
+    """
+    try:
+        import matplotlib.pyplot as plt
+        import seaborn as sns
+        from sklearn.manifold import TSNE
+    except ImportError as e:
+        logger.error("Missing visualization libraries!")
+        logger.error("Please run: uv pip install matplotlib seaborn")
+        return
+    # 1. Load Resources
+    logger.info("Loading Data...")
+    if not CATALOG_PATH.exists() or not EMBEDDINGS_PATH.exists():
+        logger.error("Missing Data! Run download scripts first.")
+        return
+    # Load Titles for mapping
+    df_catalog = pd.read_csv(CATALOG_PATH)
+    titles = df_catalog['title'].tolist()
+    title_to_idx = {t.lower().strip(): i for i, t in enumerate(titles)}
+    # Load Embeddings
+    embeddings = np.load(EMBEDDINGS_PATH)
+    # Load Users
+    df_users = pd.read_parquet(SYNTHETIC_DATA_PATH)
+    # Sample
+    if len(df_users) > sample_size:
+        df_users = df_users.sample(sample_size, random_state=42)
+    logger.info(f"Processing {len(df_users)} users...")
+    user_vectors = []
+    user_personas = []
+    # 2. Calculate User Vectors
+    valid_users = 0
+    for _, row in tqdm(df_users.iterrows(), total=len(df_users)):
+        history = row['book_sequence']
+        persona = row['persona']
+        valid_indices = []
+        for book in history:
+            norm_title = book.lower().strip()
+            if norm_title in title_to_idx:
+                valid_indices.append(title_to_idx[norm_title])
+        if not valid_indices:
+            continue
+        # Average Embeddings
+        vectors = embeddings[valid_indices]
+        user_vec = np.mean(vectors, axis=0)
+        user_vectors.append(user_vec)
+        user_personas.append(persona)
+        valid_users += 1
+    X = np.array(user_vectors)
+    # 3. t-SNE Reduction
+    logger.info("Running t-SNE (this might take a moment)...")
+    tsne = TSNE(n_components=2, random_state=42, perplexity=30)
+    X_embedded = tsne.fit_transform(X)
+    # 4. Plotting
+    logger.info("Generating Plot...")
+    OUTPUT_DIR.mkdir(exist_ok=True)
+    plt.figure(figsize=(12, 8))
+    sns.scatterplot(
+        x=X_embedded[:, 0],
+        y=X_embedded[:, 1],
+        hue=user_personas,
+        palette="viridis",
+        alpha=0.7,
+        s=60
+    )
+    plt.title(f"Semantic User Clusters (t-SNE Projection of {valid_users} Users)", fontsize=16)
+    plt.xlabel("Dimension 1")
+    plt.ylabel("Dimension 2")
+    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', title="Persona")
+    plt.tight_layout()
+    plt.savefig(OUTPUT_IMAGE, dpi=300)
+    logger.info(f"✅ Visualization saved to {OUTPUT_IMAGE}")
+    print(f"Success! Check {OUTPUT_IMAGE} to see your user clusters.")
+if __name__ == "__main__":
+    visualize_clusters()

src/personalization/__init__.py ADDED Viewed

File without changes

src/personalization/api/__init__.py ADDED Viewed

File without changes

src/personalization/api/main.py ADDED Viewed

	@@ -0,0 +1,186 @@

+from contextlib import asynccontextmanager
+from pathlib import Path
+import logging
+from sentence_transformers import SentenceTransformer
+from prometheus_fastapi_instrumentator import Instrumentator
+from fastapi import FastAPI, HTTPException
+from pydantic import BaseModel
+from typing import List
+import pandas as pd
+import faiss
+import numpy as np
+import time
+# Setup Logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+# Config
+DATA_DIR = Path("data")
+CATALOG_PATH = DATA_DIR / "catalog" / "books_catalog.csv"
+EMBEDDINGS_PATH = DATA_DIR / "embeddings_cache.npy"
+MODEL_NAME = "all-MiniLM-L6-v2"
+# Global State
+state = {
+    "titles": [],
+    "title_to_idx": {},
+    "index": None,
+    "embeddings": None,
+    "ratings": [],
+    "genres": [],
+    "model": None,
+    "popular_indices": []
+}
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    logger.info("Loading resources...")
+    start_time = time.time()
+    if not CATALOG_PATH.exists() or not EMBEDDINGS_PATH.exists():
+        logger.error("Missing catalog or embeddings! Run scripts/1b... first.")
+        # We continue but service might be degraded
+    else:
+        try:
+            df = pd.read_csv(CATALOG_PATH)
+            state["titles"] = df['title'].tolist()
+            state["genres"] = df['genres'].fillna("").tolist()
+            raw_ratings = pd.to_numeric(df['rating'], errors='coerce').fillna(3.0)
+            max_rating = raw_ratings.max()
+            state["ratings"] = (raw_ratings / max_rating).tolist() if max_rating > 0 else [0.5] * len(df)
+            # Use normalized keys for robust lookup
+            state["title_to_idx"] = {t.lower().strip(): i for i, t in enumerate(state["titles"])}
+            state["popular_indices"] = np.argsort(raw_ratings)[::-1][:50].tolist()
+            logger.info("Loading embeddings...")
+            embeddings = np.load(EMBEDDINGS_PATH)
+            state["embeddings"] = embeddings
+            OPTIMIZED_INDEX_PATH = DATA_DIR / "index" / "optimized.index"
+            if OPTIMIZED_INDEX_PATH.exists():
+                logger.info("Loading OPTIMIZED FAISS index (IVF-PQ)...")
+                state["index"] = faiss.read_index(str(OPTIMIZED_INDEX_PATH))
+                state["index"].nprobe = 10
+            else:
+                logger.info("Building Standard FAISS index (Flat)...")
+                d = embeddings.shape[1]
+                index = faiss.IndexFlatIP(d)
+                faiss.normalize_L2(embeddings)
+                index.add(embeddings)
+                state["index"] = index
+            logger.info(f"Loading Semantic Model ({MODEL_NAME})...")
+            state["model"] = SentenceTransformer(MODEL_NAME)
+            logger.info(f"Ready! Loaded {len(state['titles'])} books in {time.time() - start_time:.2f}s")
+        except Exception as e:
+            logger.error(f"Failed to load resources: {e}")
+            # Consider raising if critical
+    yield
+    logger.info("Shutting down...")
+    # Clean up resources if needed
+app = FastAPI(title="Semantic Book Discovery Engine", lifespan=lifespan)
+# Add Prometheus Instrumentation
+Instrumentator().instrument(app).expose(app)
+class RecommendationRequest(BaseModel):
+    user_history: List[str]
+    top_k: int = 10
+class SearchRequest(BaseModel):
+    query: str
+    top_k: int = 10
+class BookResponse(BaseModel):
+    title: str
+    score: float
+    genres: str
+@app.post("/search", response_model=List[BookResponse])
+async def search(request: SearchRequest):
+    if state["model"] is None or state["index"] is None:
+        raise HTTPException(status_code=503, detail="Service loading...")
+    query_vector = state["model"].encode([request.query], convert_to_numpy=True)
+    faiss.normalize_L2(query_vector)
+    scores, indices = state["index"].search(query_vector, request.top_k)
+    results = []
+    for score, idx in zip(scores[0], indices[0]):
+        results.append(BookResponse(
+            title=state["titles"][idx],
+            score=float(score),
+            genres=str(state["genres"][idx])
+        ))
+    return results
+@app.post("/personalize/recommend", response_model=List[BookResponse])
+async def recommend(request: RecommendationRequest):
+    if state["index"] is None:
+        raise HTTPException(status_code=503, detail="Service not ready")
+    valid_indices = []
+    for title in request.user_history:
+        normalized_title = title.lower().strip()
+        if normalized_title in state["title_to_idx"]:
+            valid_indices.append(state["title_to_idx"][normalized_title])
+    if not valid_indices:
+        logger.info("Cold start user: returning popular books")
+        results = []
+        for idx in state["popular_indices"][:request.top_k]:
+             results.append(BookResponse(
+                title=state["titles"][idx],
+                score=state["ratings"][idx],
+                genres=str(state["genres"][idx])
+            ))
+        return results
+    history_vectors = state["embeddings"][valid_indices]
+    n = len(valid_indices)
+    decay_factor = 0.9
+    weights = np.array([decay_factor ** (n - 1 - i) for i in range(n)])
+    weights = weights / weights.sum()
+    user_vector = np.average(history_vectors, axis=0, weights=weights).reshape(1, -1).astype(np.float32)
+    faiss.normalize_L2(user_vector)
+    search_k = (request.top_k * 3) + len(valid_indices)
+    scores, indices = state["index"].search(user_vector, search_k)
+    results = []
+    seen_indices = set(valid_indices)
+    seen_titles = set()
+    for score, idx in zip(scores[0], indices[0]):
+        if idx in seen_indices: continue
+        title = state["titles"][idx]
+        if title in seen_titles: continue
+        seen_titles.add(title)
+        final_score = float(score) + (state["ratings"][idx] * 0.1)
+        results.append(BookResponse(
+            title=title,
+            score=final_score,
+            genres=str(state["genres"][idx])
+        ))
+        if len(results) >= request.top_k:
+            break
+    results.sort(key=lambda x: x.score, reverse=True)
+    return results

src/personalization/config.py ADDED Viewed

	@@ -0,0 +1,12 @@

+import os
+class Settings:
+    # Default to 8001, but allow Env Override
+    HOST = os.getenv("API_HOST", "localhost")
+    PORT = int(os.getenv("API_PORT", 8001))
+    @property
+    def BASE_URL(self):
+        return f"http://{self.HOST}:{self.PORT}"
+settings = Settings()

test_api.py ADDED Viewed

	@@ -0,0 +1,46 @@

+import requests
+import argparse
+import sys
+from pathlib import Path
+# Add src to path to import config
+sys.path.append(str(Path(__file__).parent))
+from src.personalization.config import settings
+def main():
+    parser = argparse.ArgumentParser(description="Test the Recommendation API")
+    parser.add_argument("--host", type=str, default=settings.HOST, help="API Host")
+    parser.add_argument("--port", type=int, default=settings.PORT, help="API Port")
+    args = parser.parse_args()
+    base_url = f"http://{args.host}:{args.port}"
+    url = f"{base_url}/personalize/recommend"
+    payload = {
+        "user_history": [
+            "The Haunted School",
+            "It Came from Beneath the Sink!"
+            "Legion"
+        ],
+        "top_k": 5
+    }
+    print(f"Sending request to {url}...")
+    try:
+        response = requests.post(url, json=payload)
+        if response.status_code == 200:
+            results = response.json()
+            print("\u2714 Recommendations:")
+            for i, book in enumerate(results, 1):
+                print(f"{i}. {book['title']} (Score: {book['score']:.4f})")
+        else:
+            print(f"\u2714 Error {response.status_code}: {response.text}")
+    except Exception as e:
+        print(f"\u2714 Failed to connect: {e}")
+        print("Make sure the uvicorn server is running on port 8001!")
+if __name__ == "__main__":
+    main()

uv.lock ADDED Viewed

The diff for this file is too large to render. See raw diff