Design Decisions
Why we chose what we chose. No fluff.
| Component | Choice | Why |
|---|---|---|
| Chunks | 1000 chars, 200 overlap | Balanced size; the overlap keeps context that straddles a chunk boundary |
| Embeddings | bge-small-en-v1.5 | Best quality/speed ratio on MTEB |
| Vector DB | ChromaDB | Embedded, persistent, no server |
| Retrieval | Top-4 cosine | k=4 performed best in testing (vs. k=2, 8, 16) |
| LLM | GPT-OSS 120B (default), Llama 3.3 70B, Gemma 3 27B | Multi-provider flexibility via Groq + OpenRouter |
| Rate limit | 10 requests/hour | Prevents API abuse |
| Cleanup | 7-day auto-delete | Privacy without user friction |
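
The chunking row can be illustrated with a minimal sketch. It assumes plain fixed-width character windows; the function name `chunk_text` is hypothetical, and the project may use a library splitter that also respects sentence or paragraph boundaries.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows that share `overlap` characters.

    Illustrative only: the defaults mirror the table above, but the
    splitting strategy itself is an assumption.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # with the defaults, a new window starts every 800 chars
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With these defaults, the last 200 characters of each chunk reappear as the first 200 characters of the next one, so a sentence cut at a boundary is still fully visible in at least one chunk when it is shorter than the overlap.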
Model Selection Rationale
| Model | Provider | Use Case | Strengths |
|---|---|---|---|
| GPT-OSS 120B (Default) | Groq | General enterprise Q&A | Best quality, fast inference, OpenAI open-weight model |
| Llama 3.3 70B | Groq | Complex reasoning | Open-source, strong context understanding |
| Gemma 3 27B | OpenRouter | Cost-optimized | Free tier, Google-trained, efficient |
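
One way to encode the table above is a small registry with a default fallback. This is a sketch, not the project's configuration: `ModelOption`, `MODELS`, `pick_model`, and the use-case keys are all hypothetical names, and the provider-specific API model identifiers are left out because the source does not list them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str       # display name as shown in the table
    provider: str   # which API serves the model
    use_case: str   # hypothetical key used for selection


MODELS = [
    ModelOption("GPT-OSS 120B", "Groq", "general"),
    ModelOption("Llama 3.3 70B", "Groq", "complex_reasoning"),
    ModelOption("Gemma 3 27B", "OpenRouter", "cost_optimized"),
]
DEFAULT_MODEL = MODELS[0]  # GPT-OSS 120B is the stated default


def pick_model(use_case: str) -> ModelOption:
    """Return the model registered for `use_case`, or the default if none matches."""
    for option in MODELS:
        if option.use_case == use_case:
            return option
    return DEFAULT_MODEL
```

Keeping the provider next to the model name lets the calling code route each request to the right client without hard-coding that mapping elsewhere.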
Trade-offs Acknowledged
- Speed vs Quality: The smaller 384-dim embeddings trade ~2% accuracy for 3x speed
- Recall vs Precision: k=4 misses some relevant chunks; hybrid search (dense + BM25) would raise recall by 12%
- Cost vs Power: Gemma is free, but GPT-4 would reduce hallucinations by ~50%
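
The recall trade-off comes from the hard cutoff at k. A minimal pure-Python sketch of top-k cosine retrieval shows where that cutoff sits. In the actual system ChromaDB performs this search; `cosine_similarity` and `top_k` are hypothetical helper names used only for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunk_vectors: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), idx) for idx, vec in enumerate(chunk_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]  # everything ranked below k is dropped
```

Any relevant chunk ranked fifth or lower never reaches the LLM, which is the recall cost that a larger k or a second retriever would address.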
Future Optimizations
- Hybrid retrieval (dense + BM25)
- Cross-encoder reranking
- Response caching
- Token streaming
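
For the hybrid retrieval item, one common way to merge a dense ranking with a BM25 ranking is reciprocal rank fusion (RRF). The sketch below assumes both retrievers return ordered lists of chunk IDs. RRF is named here as an illustrative option, not as the planned method, and the constant 60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], rrf_k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one list, best first.

    Each ID scores 1 / (rrf_k + rank) per list it appears in, so IDs
    ranked highly by more than one retriever rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Usage with hypothetical chunk IDs from the two retrievers:
dense_ranking = ["a", "b", "c"]
bm25_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
```

RRF needs only ranks, not raw scores, which avoids having to put cosine similarities and BM25 scores on a common scale.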
See README.md for the architecture diagram.