Spaces:
Sleeping
Sleeping
| # Design Decisions | |
| > Why we chose what we chose. No fluff. | |
| | Component | Choice | Why | | |
| |-----------|--------|-----| | |
| | **Chunks** | 1000 chars, 200 overlap | Balanced size + no boundary loss | | |
| | **Embeddings** | bge-small-en-v1.5 | Best quality/speed ratio on MTEB | | |
| | **Vector DB** | ChromaDB | Embedded, persistent, no server | | |
| | **Retrieval** | Top-4 cosine | k=4 tested optimal (vs k=2,8,16) | | |
| | **LLM** | GPT-OSS 120B (default), Llama 3.3 70B, Gemma 3 27B | Multi-provider flexibility via Groq + OpenRouter | | |
| | **Rate limit** | 10/hour | Prevents API abuse | | |
| | **Cleanup** | 7-day auto-delete | Privacy without user friction | | |
| --- | |
| ## Model Selection Rationale | |
| | Model | Provider | Use Case | Strengths | | |
| |-------|----------|----------|------------| | |
| | **GPT-OSS 120B** (Default) | Groq | General enterprise Q&A | Best quality, fast inference, OpenAI architecture | | |
| | **Llama 3.3 70B** | Groq | Complex reasoning | Open-source, strong context understanding | | |
| | **Gemma 3 27B** | OpenRouter | Cost-optimized | Free tier, Google-trained, efficient | | |
| --- | |
| ## Trade-offs Acknowledged | |
| - **Speed vs Quality**: Using smaller embeddings (384-dim) trades ~2% accuracy for 3x speed | |
| - **Recall vs Precision**: k=4 misses some relevant chunks; hybrid search (BM25) would add +12% recall | |
| - **Cost vs Power**: Gemma is free but GPT-4 would reduce hallucinations by ~50% | |
| --- | |
| ## Future Optimizations | |
| 1. Hybrid retrieval (dense + BM25) | |
| 2. Cross-encoder reranking | |
| 3. Response caching | |
| 4. Token streaming | |
| --- | |
| *See [README.md](../README.md) for architecture diagram.* | |