BabaK07 committed 17 days ago

Commit 90b7bb0 (1 parent: 8c66fe6)

Integrate Jina embeddings and refresh assignment README

Files changed (5)

.env.example +3 -0
README.md +111 -133
app/config.py +4 -1
app/services/vector_store.py +60 -14
pyproject.toml +1 -2

.env.example CHANGED

@@ -11,5 +11,8 @@ ACCESS_TOKEN_EXPIRE_MINUTES=720
 MODEL_NAME=llama-3.1-8b-instant
 EMBEDDING_MODEL=mixedbread-ai/mxbai-embed-large-v1
 EMBEDDING_DIMENSIONS=1024
 WEB_SEARCH_PROVIDER=duckduckgo
 TAVILY_API_KEY=

 MODEL_NAME=llama-3.1-8b-instant
 EMBEDDING_MODEL=mixedbread-ai/mxbai-embed-large-v1
 EMBEDDING_DIMENSIONS=1024
+JINA_API_KEY=
+JINA_API_BASE=https://api.jina.ai/v1/embeddings
+JINA_EMBEDDING_MODEL=jina-embeddings-v3
 WEB_SEARCH_PROVIDER=duckduckgo
 TAVILY_API_KEY=

README.md CHANGED

@@ -1,112 +1,108 @@
-# DocsQA LangGraph Assignment
-RAG-powered research assistant with:
-- Auth (register/login/logout) using HTTP-only cookie sessions
-- Multi-file PDF upload (up to 5 files/request, max 10 pages/file)
-- Duplicate detection by SHA-256 hash with cross-user document reuse
-- Vector indexing in Supabase Postgres + `pgvector`
-- LangGraph agent with document retrieval + web search fallback
-- Session conversation memory for follow-up questions
-- Source citations in answers for both document and web evidence
-- Chat-style UI with markdown rendering
-## Architecture
-- Backend: FastAPI + SQLAlchemy
-- Agent: LangGraph ReAct agent
-- LLM: Groq chat model
-- Vector store: Supabase Postgres with `pgvector`
-- Search fallback: Tavily (preferred) or DuckDuckGo when available
 ## Chunking Strategy
-- Splitter: recursive character splitter (`chunk_size=1200`, `chunk_overlap=200`)
-- Why:
-  - 1200 keeps enough local context for legal/business clauses
-  - 200 overlap reduces boundary loss between adjacent chunks
-  - good balance for retrieval accuracy vs. embedding cost
-- Indexing is page-aware: each stored chunk carries `page_number` metadata.
 ## Retrieval Approach
-- Retrieval method: cosine similarity search in `pgvector`
-- Pipeline:
-  - determine relevant user-owned document hashes
-  - embed query
-  - retrieve top-k chunks across selected docs
-- Returned evidence includes:
-  - document filename
-  - page number
-  - excerpt text
-- Final assistant answer is instructed to cite these in a human-friendly source section.
 ## Agent Routing Logic
-- Default behavior: prefer `vector_search` for questions answerable from uploaded docs.
-- If document evidence is insufficient, agent can call `web_search` tool.
-- Web search output is normalized to citation-friendly rows (title, URL, snippet).
-- Prompt requires:
-  - vector citations: document + page + excerpt
-  - web citations: website title + URL
-## Bonus Feature
-**Implemented bonus:** User-scoped retrieval with automatic document dedup reuse.
-- If two users upload the same file, processing/indexing is reused by file hash.
-- Ownership is still enforced via `user_documents` mapping, so retrieval stays user-scoped.
-- Why chosen: materially improves performance/cost while preserving access boundaries.
-## Problems Faced and Fixes
-- Dependency mismatch (`transformers`/`sentence-transformers`/`torch`) causing startup errors.
-  - Added robust local fallback embedding path to keep app functional.
-- Optional web-search dependency (`ddgs`) missing.
-  - Added graceful web tool fallback and Tavily direct tool support.
-- Passlib bcrypt backend issues.
-  - Switched new password hashing to `pbkdf2_sha256` while retaining bcrypt verify compatibility.
-- Template/render and response UX issues.
-  - Reworked frontend into a stable chat-style UI with clean result handling.
-## If I Had More Time
-- Add proper migration tooling (Alembic) instead of startup `ALTER TABLE`.
-- Add reranking for higher retrieval precision on long multi-document queries.
-- Add persistent server-side conversation storage (Redis/Postgres) for multi-worker deployments.
-- Add automated evaluation suite for citation faithfulness and retrieval quality.
-## Environment Setup
-```bash
-cp .env.example .env
-```
-Required:
-- `GROQ_API_KEY`
-- `SECRET_KEY`
-- `DATABASE_URL` (Supabase transaction pooler recommended)
-Optional:
-- `TAVILY_API_KEY` (for Tavily web search)
-Storage (optional, recommended for deployment):
-- `STORAGE_BACKEND=local` or `supabase`
-- `SUPABASE_URL`
-- `SUPABASE_SERVICE_ROLE_KEY`
-- `SUPABASE_STORAGE_BUCKET` (default: `documents`)
-- `SUPABASE_STORAGE_PREFIX` (default: `docsqa`)
-Recommended `DATABASE_URL` format:
-`postgresql+psycopg://<user>:<password>@<pooler-host>:6543/postgres?sslmode=require`
-## Install and Run
 ```bash
 python3 -m venv .venv
 source .venv/bin/activate
 pip install -e .
@@ -115,10 +111,29 @@ uvicorn app.main:app --reload
 Open: `http://127.0.0.1:8000`
-## File Storage Mode
-- Local dev default: `STORAGE_BACKEND=local` (writes under `UPLOAD_DIRECTORY`).
-- Deployment recommendation: `STORAGE_BACKEND=supabase` to store PDFs in Supabase Storage instead of local disk.
 ## API Endpoints
@@ -127,57 +142,20 @@ Open: `http://127.0.0.1:8000`
 - `POST /logout`
 - `POST /upload`
 - `GET /documents`
 - `POST /ask`
-## test_documents
-Sample PDFs used during development are in `test_documents/`.
-## Deployment and Loom
-- Live deployed URL: _add your deployed link here_
-- Loom walkthrough (<5 min): _add your Loom link here_
-## Deploy on Render
-This repo now includes a `render.yaml` Blueprint.
-1. Push the latest `main` branch to GitHub.
-2. In Render, click **New +** -> **Blueprint**.
-3. Connect GitHub and select this repository.
-4. Render will detect `render.yaml` and create a `docsbot` web service.
-5. Set required secret env vars in Render:
-   - `SECRET_KEY`
-   - `DATABASE_URL`
-   - `GROQ_API_KEY`
-   - `SUPABASE_URL`
-   - `SUPABASE_SERVICE_ROLE_KEY`
-   - optionally `TAVILY_API_KEY`
-6. Deploy and open the generated Render URL.
-Render uses:
-- Build command: `pip install -e .`
-- Start command: `uvicorn app.main:app --host 0.0.0.0 --port $PORT`
-## Deploy on Fly.io
-This repo includes `Dockerfile` and `fly.toml`.
-1. Install Fly CLI:
-   - macOS: `brew install flyctl`
-2. Login:
-   - `fly auth login`
-3. If app name `docsbot-kbaba7` is unavailable, change `app` in `fly.toml`.
-4. Create app (first time only):
-   - `fly apps create docsbot-kbaba7`
-5. Set secrets:
-   - `fly secrets set SECRET_KEY=...`
-   - `fly secrets set DATABASE_URL=...`
-   - `fly secrets set GROQ_API_KEY=...`
-   - `fly secrets set SUPABASE_URL=...`
-   - `fly secrets set SUPABASE_SERVICE_ROLE_KEY=...`
-   - optional: `fly secrets set TAVILY_API_KEY=...`
-6. Deploy:
-   - `fly deploy`
-7. Open app:
-   - `fly open`

+# DocsQA Smart Research Assistant
+This is my take-home submission for the ABSTRABIT AI/ML Engineer assignment: a RAG-powered assistant where users upload PDFs, ask questions, and get grounded answers with citations.
+## Live Project
+- Live app (Railway): `https://docsbot-web-production.up.railway.app`
+- GitHub: `https://github.com/KBaba7/DocsBot`
+- Loom walkthrough: _add your link here_
+## What I Built
+The app supports authentication, PDF upload (up to 5 files and 10 pages per file), document chunking + vector indexing, and a chat experience that answers from uploaded documents first.
+If the uploaded documents are not enough, the agent falls back to web search and cites those sources too.
+## Stack
+- FastAPI + SQLAlchemy
+- LangGraph agent
+- Groq chat model
+- Supabase Postgres + `pgvector`
+- Railway deployment
+## How Retrieval Works
+Uploaded PDFs are parsed page by page and split into chunks.
+Each chunk is stored with metadata (document, page number, chunk index) and embedded into `pgvector`.
+At question time:
+1. The app searches relevant chunks from the user’s accessible documents.
+2. The agent answers from those chunks when possible.
+3. If evidence is weak, the agent uses web search and cites external URLs.
 ## Chunking Strategy
+- Chunk size: `1200`
+- Overlap: `200`
+Why this setup:
+- Long, structured documents need enough contiguous context.
+- Overlap helps avoid missing content around chunk boundaries.
+- It gives a practical quality/cost balance for retrieval.
 ## Retrieval Approach
+I use cosine similarity search in `pgvector` (no reranker yet).
+The top matches are turned into readable citations (document name + page + snippet), and those are shown per answer in the UI.
 ## Agent Routing Logic
+The agent is prompted to prefer document context first.
+- If retrieved document context is sufficient: answer from documents with citations.
+- If not sufficient: clearly say docs are insufficient and use web search tool.
+This is implemented as tool-based behavior in LangGraph rather than a static fallback message.
+## Source Citations
+Each turn stores/returns source metadata separately from the answer body.
+- Vector source cards include:
+  - document name
+  - page number
+  - excerpt (short snippet from retrieved chunk)
+- Web source cards include:
+  - title
+  - URL
+## Conversation Memory
+Conversation history is maintained within session scope, so follow-ups like “tell me more about that” work as expected.
+## Bonus Feature
+I added hash-based deduplicated ingestion:
+- If the same PDF is uploaded again, processing/indexing is reused.
+- Access control is still user-scoped via ownership mapping.
+Why I chose this:
+- saves compute/time,
+- avoids duplicate indexing,
+- keeps retrieval secure per user.
+## Challenges I Ran Into
+1. Heavy embedding dependencies made deployment images too large.
+   - I switched to lightweight embeddings for deployment and added Jina API embedding support.
+2. Source rendering got messy across multiple chat turns.
+   - I separated answer text from source payloads and extracted sources per turn.
+3. Intermittent DB DNS/pooler issues during deployment.
+   - I improved connection handling and standardized Supabase transaction-pooler config.
+## If I Had More Time
+- Add reranking (cross-encoder) for better precision on long multi-doc queries.
+- Add automated citation-faithfulness checks.
+- Add Alembic migrations for cleaner schema evolution.
+- Add stronger eval/observability for routing and retrieval quality.
+## Local Setup
 ```bash
+cp .env.example .env
 python3 -m venv .venv
 source .venv/bin/activate
 pip install -e .
 Open: `http://127.0.0.1:8000`
+## Important Environment Variables
+Required:
+- `GROQ_API_KEY`
+- `SECRET_KEY`
+- `DATABASE_URL`
+Embeddings (recommended):
+- `JINA_API_KEY`
+- `JINA_API_BASE` (default: `https://api.jina.ai/v1/embeddings`)
+- `JINA_EMBEDDING_MODEL` (default: `jina-embeddings-v3`)
+- `EMBEDDING_DIMENSIONS` (default: `1024`)
+Storage:
+- `STORAGE_BACKEND=local|supabase`
+- `SUPABASE_URL`
+- `SUPABASE_SERVICE_ROLE_KEY`
+- `SUPABASE_STORAGE_BUCKET`
+- `SUPABASE_STORAGE_PREFIX`
+Web search:
+- `WEB_SEARCH_PROVIDER=duckduckgo|tavily`
+- `TAVILY_API_KEY` (if using Tavily)
 ## API Endpoints
 - `POST /logout`
 - `POST /upload`
 - `GET /documents`
+- `DELETE /documents/{document_id}`
+- `GET /documents/{document_id}/pdf`
 - `POST /ask`
+## Sample Documents
+As requested in the assignment, sample PDFs are included in `test_documents/`.
+## Railway Deployment
+```bash
+railway login
+railway link
+railway up
+```
+Set the same env vars in Railway service settings before deploying.
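
Note on the chunking parameters in the README above: the diff only shows `SimpleTextSplitter(chunk_size=1200, chunk_overlap=200)` being constructed, not its implementation. The following is a minimal character-window sketch of what a 1200/200 split could look like. The function name `split_with_overlap` is hypothetical and is not the repository's splitter, which may split on separators rather than fixed offsets.

```python
def split_with_overlap(text: str, chunk_size: int = 1200, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that share an overlap.

    Hypothetical illustration only; not the repo's SimpleTextSplitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(2500))
    pieces = split_with_overlap(sample)
    print([len(piece) for piece in pieces])
```

With these defaults each window advances by 1000 characters, so the last 200 characters of one chunk reappear at the start of the next, which is the boundary protection the README describes.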

app/config.py CHANGED

@@ -21,8 +21,11 @@ class Settings(BaseSettings):
     model_name: str = "llama-3.1-8b-instant"
     embedding_model: str = "mixedbread-ai/mxbai-embed-large-v1"
     embedding_dimensions: int = 1024
     groq_api_key: str | None = None
-    web_search_provider: str = "duckduckgo"
     tavily_api_key: str | None = None
     @property

     model_name: str = "llama-3.1-8b-instant"
     embedding_model: str = "mixedbread-ai/mxbai-embed-large-v1"
     embedding_dimensions: int = 1024
+    jina_api_key: str | None = None
+    jina_api_base: str = "https://api.jina.ai/v1/embeddings"
+    jina_embedding_model: str = "jina-embeddings-v3"
     groq_api_key: str | None = None
+    web_search_provider: str = "tavily"
     tavily_api_key: str | None = None
     @property

app/services/vector_store.py CHANGED

@@ -3,6 +3,7 @@ import math
 import re
 from typing import Any
 from sqlalchemy import delete, select
 from sqlalchemy.orm import Session
@@ -65,25 +66,70 @@ class LocalHashEmbeddings:
         return [value / norm for value in vector]
 class VectorStoreService:
     def __init__(self) -> None:
         self.splitter = SimpleTextSplitter(chunk_size=1200, chunk_overlap=200)
-        self.embeddings = None
     def _get_embeddings(self) -> Any:
-        settings = get_settings()
-        if self.embeddings is None:
-            try:
-                from langchain_huggingface import HuggingFaceEmbeddings
-                self.embeddings = HuggingFaceEmbeddings(
-                    model_name=settings.embedding_model,
-                    model_kwargs={"device": "cpu"},
-                    encode_kwargs={"normalize_embeddings": True},
-                )
-            except Exception:
-                # Keep the app usable when transformer/torch dependencies are unavailable.
-                self.embeddings = LocalHashEmbeddings(settings.embedding_dimensions)
         return self.embeddings
     def add_document(self, *, db: Session, document_id: int, file_hash: str, filename: str, pages: list[tuple[int, str]]) -> None:

 import re
 from typing import Any
+import requests
 from sqlalchemy import delete, select
 from sqlalchemy.orm import Session
         return [value / norm for value in vector]
+class JinaEmbeddings:
+    def __init__(self, *, api_key: str, base_url: str, model: str, dimensions: int) -> None:
+        self.api_key = api_key
+        self.base_url = base_url
+        self.model = model
+        self.dimensions = dimensions
+    def embed_documents(self, texts: list[str]) -> list[list[float]]:
+        return self._embed(texts=texts, task="retrieval.passage")
+    def embed_query(self, text: str) -> list[float]:
+        vectors = self._embed(texts=[text], task="retrieval.query")
+        return vectors[0] if vectors else [0.0] * self.dimensions
+    def _embed(self, *, texts: list[str], task: str) -> list[list[float]]:
+        if not texts:
+            return []
+        response = requests.post(
+            self.base_url,
+            headers={
+                "Content-Type": "application/json",
+                "Authorization": f"Bearer {self.api_key}",
+            },
+            json={
+                "model": self.model,
+                "task": task,
+                "embedding_type": "float",
+                "normalized": True,
+                "input": texts,
+            },
+            timeout=60,
+        )
+        response.raise_for_status()
+        data = response.json().get("data", [])
+        vectors = [row.get("embedding", []) for row in data]
+        validated: list[list[float]] = []
+        for vector in vectors:
+            if len(vector) != self.dimensions:
+                raise ValueError(
+                    f"Jina embedding dimension mismatch: got {len(vector)}, expected {self.dimensions}. "
+                    "Adjust EMBEDDING_DIMENSIONS or switch embedding model."
+                )
+            validated.append(vector)
+        return validated
 class VectorStoreService:
     def __init__(self) -> None:
         self.splitter = SimpleTextSplitter(chunk_size=1200, chunk_overlap=200)
+        settings = get_settings()
+        if settings.jina_api_key:
+            self.embeddings = JinaEmbeddings(
+                api_key=settings.jina_api_key,
+                base_url=settings.jina_api_base,
+                model=settings.jina_embedding_model,
+                dimensions=settings.embedding_dimensions,
+            )
+        else:
+            # Lightweight fallback when hosted embedding credentials are not configured.
+            self.embeddings = LocalHashEmbeddings(settings.embedding_dimensions)
     def _get_embeddings(self) -> Any:
         return self.embeddings
     def add_document(self, *, db: Session, document_id: int, file_hash: str, filename: str, pages: list[tuple[int, str]]) -> None:
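
Note on retrieval: the README in this commit describes cosine similarity search in `pgvector`, and the Jina request above asks for `"normalized": True`. The search query itself is not part of this diff, so the sketch below is a plain-Python illustration of cosine ranking only. The helper names `normalize`, `cosine_similarity`, and `top_k` are hypothetical; the real ranking presumably runs inside Postgres.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(value * value for value in vector))
    if norm == 0:
        return list(vector)
    return [value / norm for value in vector]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunks: list[tuple[str, list[float]]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k chunk labels with the highest cosine similarity to the query."""
    scored = [(label, cosine_similarity(query, vector)) for label, vector in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    toy_chunks = [("a", [2.0, 0.0]), ("b", [0.0, 3.0]), ("c", [1.0, 1.0])]
    print(top_k([1.0, 0.0], toy_chunks, k=2))
```

For vectors that are already unit length, cosine similarity reduces to a dot product, which is one reason to request normalized embeddings from the provider.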

pyproject.toml CHANGED

@@ -9,7 +9,6 @@ dependencies = [
   "jinja2>=3.1.4",
   "langchain-community>=0.3.0",
   "langchain-groq>=0.2.0",
-  "langchain-huggingface>=0.1.0",
   "langchain-text-splitters>=0.3.0",
   "langgraph>=0.2.35",
   "passlib[bcrypt]>=1.7.4",
@@ -19,8 +18,8 @@ dependencies = [
   "pypdf>=5.0.1",
   "python-jose[cryptography]>=3.3.0",
   "python-multipart>=0.0.9",
   "sqlalchemy>=2.0.35",
-  "sentence-transformers>=3.0.1",
   "uvicorn[standard]>=0.30.6",
   "email-validator>=2.2.0",
   "tavily-python==0.7.23",

   "jinja2>=3.1.4",
   "langchain-community>=0.3.0",
   "langchain-groq>=0.2.0",
   "langchain-text-splitters>=0.3.0",
   "langgraph>=0.2.35",
   "passlib[bcrypt]>=1.7.4",
   "pypdf>=5.0.1",
   "python-jose[cryptography]>=3.3.0",
   "python-multipart>=0.0.9",
+  "requests>=2.32.0",
   "sqlalchemy>=2.0.35",
   "uvicorn[standard]>=0.30.6",
   "email-validator>=2.2.0",
   "tavily-python==0.7.23",