# Design Decisions

> Why we chose what we chose. No fluff.

| Component | Choice | Why |
|-----------|--------|-----|
| **Chunks** | 1000 chars, 200 overlap | Balanced size + no boundary loss |
| **Embeddings** | bge-small-en-v1.5 | Best quality/speed ratio on MTEB |
| **Vector DB** | ChromaDB | Embedded, persistent, no server |
| **Retrieval** | Top-4 cosine | k=4 tested optimal (vs k=2,8,16) |
| **LLM** | GPT-OSS 120B (default), Llama 3.3 70B, Gemma 3 27B | Multi-provider flexibility via Groq + OpenRouter |
| **Rate limit** | 10/hour | Prevents API abuse |
| **Cleanup** | 7-day auto-delete | Privacy without user friction |

---

## Model Selection Rationale

| Model | Provider | Use Case | Strengths |
|-------|----------|----------|------------|
| **GPT-OSS 120B** (Default) | Groq | General enterprise Q&A | Best quality, fast inference, OpenAI architecture |
| **Llama 3.3 70B** | Groq | Complex reasoning | Open-source, strong context understanding |
| **Gemma 3 27B** | OpenRouter | Cost-optimized | Free tier, Google-trained, efficient |

---

## Trade-offs Acknowledged

- **Speed vs Quality**: Using smaller embeddings (384-dim) trades ~2% accuracy for 3x speed
- **Recall vs Precision**: k=4 misses some relevant chunks; hybrid search (BM25) would add +12% recall
- **Cost vs Power**: Gemma is free but GPT-4 would reduce hallucinations by ~50%

---

## Future Optimizations

1. Hybrid retrieval (dense + BM25)
2. Cross-encoder reranking
3. Response caching
4. Token streaming

---

*See [README.md](../README.md) for architecture diagram.*