umanggarg committed on
Commit
be5e148
·
1 Parent(s): 52fc686

Use Voyage embeddings by default

Files changed (5)
  1. Dockerfile +3 -3
  2. README.md +12 -7
  3. backend/config.py +11 -11
  4. ingestion/embedder.py +5 -5
  5. render.yaml +6 -2
Dockerfile CHANGED
@@ -15,8 +15,8 @@
  # By downloading it during the Docker build, it's baked into the image layer.
  # Subsequent starts are instant — the model is already on disk.
  #
- # The embedding model (nomic-embed-code) is NOT downloaded here — it runs via
- # the Nomic API (no local file needed). That's how we stay under the RAM limit.
+ # The embedding model is NOT downloaded here — Voyage/Gemini/Nomic run via API
+ # (no local file needed). That's how we stay under the RAM limit.
  #
  # ARCHITECTURE
  # ────────────
@@ -47,7 +47,7 @@ RUN pip install --user --no-cache-dir -r requirements.txt

  # Pre-download the re-ranker model into the image layer.
  # This bakes the ~80MB model into the image so cold starts don't download it.
- # The Nomic embedding model is NOT downloaded here — it lives on Nomic's API.
+ # The embedding model is NOT downloaded here — it lives behind a hosted API.
  RUN python -c "\
  from sentence_transformers import CrossEncoder; \
  print('Pre-downloading re-ranker...'); \
README.md CHANGED
@@ -42,7 +42,7 @@ GitHub URL
  Falls back to line-windowed sliding chunks for unsupported languages
  → ingestion_service.py (Optional) LLM generates a 1–2 sentence description per chunk
  prepended before embedding — Anthropic's "contextual retrieval"
- → embedder.py Nomic nomic-embed-text-v1.5 (768-dim) via API · optional Voyage voyage-code-3 (1024-dim)
+ → embedder.py Voyage voyage-code-3 (1024-dim) via API · Gemini/Nomic fallback
  → qdrant_store.py Each chunk stored with: dense vector + sparse BM25 vector + full payload metadata
  ```

@@ -201,8 +201,8 @@ The ⟳ button in the sidebar triggers a re-index with LLM-generated chunk descr
  | Backend | FastAPI + uvicorn | Async ASGI, 20+ endpoints, SSE streaming throughout |
  | Frontend | React + Vite | Component-based UI, localStorage sessions, SSE token streaming |
  | Vector DB | Qdrant Cloud | Native hybrid search (dense + sparse), free 1 GB tier |
- | Embeddings (default) | Nomic `nomic-embed-text-v1.5` | 768-dim, via Nomic API (zero local RAM) |
- | Embeddings (optional) | Voyage `voyage-code-3` | 1024-dim, code-optimised, 200M tokens/month free |
+ | Embeddings (default) | Voyage `voyage-code-3` | 1024-dim, code-optimised, 200M tokens/month free |
+ | Embeddings (fallback) | Gemini `gemini-embedding-001` | 768-dim, via Gemini API; good quality but tighter free-tier limits |
  | Code parsing | tree-sitter | Multi-language AST — Python, JS, TS, Go, Rust, Java |
  | Reranker (primary) | Cohere `rerank-v3.5` | Cross-encoder, API, 1000 calls/month free |
  | Reranker (fallback) | `ms-marco-MiniLM-L-6-v2` | Local cross-encoder, baked into Docker image |
@@ -264,11 +264,16 @@ cd ui && npm install && npm run dev
  # Vector DB (required)
  QDRANT_URL= # Qdrant Cloud cluster URL
  QDRANT_API_KEY= # Qdrant Cloud API key
- QDRANT_COLLECTION= # e.g. cartographer_nomic
+ QDRANT_COLLECTION=github_repos_voyage # new 1024-dim collection for Voyage

- # Embeddings (one required)
- NOMIC_API_KEY= # Default — free at atlas.nomic.ai
- VOYAGE_API_KEY= # Optional upgrade — free at voyageai.com (set EMBEDDING_MODEL=voyage-code-3)
+ # Embeddings
+ VOYAGE_API_KEY= # Default — free at voyageai.com
+ EMBEDDING_MODEL=voyage-code-3
+ EMBEDDING_DIM=1024
+
+ # Optional embedding fallbacks
+ GEMINI_API_KEY= # Also used for LLMs; set EMBEDDING_MODEL=gemini-embedding-001 and EMBEDDING_DIM=768
+ NOMIC_API_KEY= # Legacy fallback; set EMBEDDING_MODEL=nomic-embed-text-v1.5 and EMBEDDING_DIM=768

  # LLM (at least one required)
  CEREBRAS_API_KEY= # Fastest — free at cloud.cerebras.ai (1M tok/day)
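The env block above pairs each EMBEDDING_MODEL with a specific EMBEDDING_DIM. A minimal sketch of a startup check for that pairing — `check_embedding_env` is a hypothetical helper, not part of the repo; only the model/dim pairs come from the README:

```python
# Model → dimension pairs as documented in the README env block.
MODEL_DIMS = {
    "voyage-code-3": 1024,
    "gemini-embedding-001": 768,
    "nomic-embed-text-v1.5": 768,
}

def check_embedding_env(env: dict) -> int:
    """Return the validated embedding dimension, raising on a mismatch.

    `env` stands in for os.environ so the check is easy to test.
    Defaults mirror the commit: voyage-code-3 at 1024 dims.
    """
    model = env.get("EMBEDDING_MODEL", "voyage-code-3")
    dim = int(env.get("EMBEDDING_DIM", "1024"))
    expected = MODEL_DIMS.get(model)
    if expected is not None and dim != expected:
        raise ValueError(f"{model} expects EMBEDDING_DIM={expected}, got {dim}")
    return dim

check_embedding_env({"EMBEDDING_MODEL": "voyage-code-3", "EMBEDDING_DIM": "1024"})
```

Failing fast here is cheaper than discovering the mismatch when Qdrant rejects a vector whose size doesn't match the collection.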
backend/config.py CHANGED
@@ -25,7 +25,7 @@ class Settings:
  # ── Vector DB ─────────────────────────────────────────────────────────────
  qdrant_url: str = os.getenv("QDRANT_URL", "")
  qdrant_api_key: str = os.getenv("QDRANT_API_KEY", "")
- qdrant_collection: str = os.getenv("QDRANT_COLLECTION", "github_repos")
+ qdrant_collection: str = os.getenv("QDRANT_COLLECTION", "github_repos_voyage")

  # ── GitHub ────────────────────────────────────────────────────────────────
  # Optional — without it you get 60 API req/hr; with it 5,000 req/hr
@@ -34,25 +34,25 @@ class Settings:
  # ── Embeddings ────────────────────────────────────────────────────────────
  # Three embedding providers, selected at startup by EMBEDDING_MODEL:
  #
- # 1. Gemini (default — EMBEDDING_MODEL contains "gemini", needs GEMINI_API_KEY)
- #    gemini-embedding-001: 768-dim output via MRL, generous free tier.
- #    Re-uses the same GEMINI_API_KEY used for the LLM — no extra signup.
- #    Free at https://aistudio.google.com.
- #
- # 2. Voyage AI (EMBEDDING_MODEL contains "voyage", needs VOYAGE_API_KEY)
+ # 1. Voyage AI (default — EMBEDDING_MODEL contains "voyage", needs VOYAGE_API_KEY)
  #    voyage-code-3: code-optimised, 1024-dim, 200M tokens/month free.
- #    ⚠️ Requires EMBEDDING_DIM=1024 and a NEW Qdrant collection — dims
+ #    Requires EMBEDDING_DIM=1024 and a NEW Qdrant collection — dims
  #    are incompatible with 768-dim collections.
  #
- # 3. Nomic (legacy fallback — NOMIC_API_KEY set)
+ # 2. Gemini (EMBEDDING_MODEL contains "gemini", needs GEMINI_API_KEY)
+ #    gemini-embedding-001: 768-dim output via MRL. Re-uses the same
+ #    GEMINI_API_KEY used for the LLM, but free-tier RPM/TPM limits are
+ #    too tight for LangChain-scale repos.
+ #
+ # 3. Nomic (legacy fallback — EMBEDDING_MODEL contains "nomic")
  #    nomic-embed-text-v1.5: 768-dim. Free quota is 10M tokens TOTAL
  #    (not per month) — easy to exhaust across a few large indexes.
  #
  # EMBEDDING_DIM must match the chosen model exactly.
  nomic_api_key: str = os.getenv("NOMIC_API_KEY", "")
  voyage_api_key: str = os.getenv("VOYAGE_API_KEY", "")
- embedding_model: str = os.getenv("EMBEDDING_MODEL", "gemini-embedding-001")
- embedding_dim: int = int(os.getenv("EMBEDDING_DIM", "768"))
+ embedding_model: str = os.getenv("EMBEDDING_MODEL", "voyage-code-3")
+ embedding_dim: int = int(os.getenv("EMBEDDING_DIM", "1024"))
  gemini_embedding_batch_size: int = int(os.getenv("GEMINI_EMBEDDING_BATCH_SIZE", "8"))
  gemini_embedding_min_interval: float = float(os.getenv("GEMINI_EMBEDDING_MIN_INTERVAL", "4.0"))
  gemini_embedding_retries: int = int(os.getenv("GEMINI_EMBEDDING_RETRIES", "6"))
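The `gemini_embedding_batch_size` / `min_interval` / `retries` settings above exist to stay inside a provider's rate limits. A minimal sketch of that batching pattern — `embed_in_batches` and `embed_fn` are hypothetical stand-ins, not the repo's actual implementation:

```python
import time

def embed_in_batches(texts, embed_fn, batch_size=8, min_interval=0.0, retries=3):
    """Embed `texts` in rate-limited batches.

    embed_fn(batch) -> list of vectors is a stand-in for the real API call.
    Pacing and retry counts mirror the GEMINI_EMBEDDING_* settings above.
    """
    vectors = []
    last_call = float("-inf")
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        wait = min_interval - (time.monotonic() - last_call)
        if wait > 0:
            time.sleep(wait)  # respect the provider's requests-per-minute cap
        for attempt in range(retries):
            try:
                vectors.extend(embed_fn(batch))
                break
            except Exception:
                if attempt == retries - 1:
                    raise  # out of retries — surface the error
                time.sleep(2 ** attempt)  # exponential backoff before retrying
        last_call = time.monotonic()
    return vectors
```

With `batch_size=8` and `min_interval=4.0` (the defaults above), throughput tops out around 2 batches per 8 seconds, which is the point: trading speed for staying under free-tier limits.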
ingestion/embedder.py CHANGED
@@ -12,18 +12,18 @@ THREE PROVIDERS, ONE INTERFACE
  ──────────────────────────────
  Provider is selected from EMBEDDING_MODEL at init:

- EMBEDDING_MODEL contains "voyage" + VOYAGE_API_KEY set
+ EMBEDDING_MODEL contains "voyage" + VOYAGE_API_KEY set (default)
      → Voyage AI: code-optimised, 1024-dim, 200M tokens/month free.
        voyage-code-3 is specifically trained on code and outperforms
        general-purpose embedders on code retrieval benchmarks.
-       ⚠️ Requires EMBEDDING_DIM=1024 and a new Qdrant collection.
+       Requires EMBEDDING_DIM=1024 and a new Qdrant collection.

- EMBEDDING_MODEL contains "gemini" + GEMINI_API_KEY set (default)
+ EMBEDDING_MODEL contains "gemini" + GEMINI_API_KEY set
      → Google Gemini: gemini-embedding-001, 768-dim output (configurable
        via MRL), generous free tier. Re-uses the same GEMINI_API_KEY we
-       use for the LLM — no separate signup.
+       use for the LLM, but free-tier limits are tight for huge repos.

- NOMIC_API_KEY set (legacy fallback)
+ EMBEDDING_MODEL contains "nomic" + NOMIC_API_KEY set (legacy fallback)
      → Nomic API: nomic-embed-text-v1.5, 768-dim. Free quota is 10M
        tokens total — easy to exhaust across a few large repo indexes.
render.yaml CHANGED
@@ -26,16 +26,20 @@ services:
        sync: false # set manually in Render dashboard
      - key: QDRANT_API_KEY
        sync: false
+     - key: QDRANT_COLLECTION
+       value: github_repos_voyage
      - key: GROQ_API_KEY
        sync: false
      - key: ANTHROPIC_API_KEY
        sync: false
+     - key: VOYAGE_API_KEY
+       sync: false
      - key: GITHUB_TOKEN
        sync: false
      - key: EMBEDDING_MODEL
-       value: sentence-transformers/all-MiniLM-L6-v2
+       value: voyage-code-3
      - key: EMBEDDING_DIM
-       value: "384"
+       value: "1024"
      - key: TOP_K
        value: "6"
    # HuggingFace cache dir — Render gives 1GB ephemeral disk