File size: 1,562 Bytes
bfa87c3
 
 
 
 
 
 
 
 
 
bb9f87e
bfa87c3
 
 
 
 
bb9f87e
 
 
 
 
 
 
 
 
 
bfa87c3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
# Design Decisions

> Why we chose what we chose. No fluff.

| Component | Choice | Why |
|-----------|--------|-----|
| **Chunks** | 1000 chars, 200 overlap | Balanced size + no boundary loss |
| **Embeddings** | bge-small-en-v1.5 | Best quality/speed ratio on MTEB |
| **Vector DB** | ChromaDB | Embedded, persistent, no server |
| **Retrieval** | Top-4 cosine | k=4 tested optimal (vs k=2,8,16) |
| **LLM** | GPT-OSS 120B (default), Llama 3.3 70B, Gemma 3 27B | Multi-provider flexibility via Groq + OpenRouter |
| **Rate limit** | 10/hour | Prevents API abuse |
| **Cleanup** | 7-day auto-delete | Privacy without user friction |

---

## Model Selection Rationale

| Model | Provider | Use Case | Strengths |
|-------|----------|----------|------------|
| **GPT-OSS 120B** (Default) | Groq | General enterprise Q&A | Best quality, fast inference, OpenAI architecture |
| **Llama 3.3 70B** | Groq | Complex reasoning | Open-source, strong context understanding |
| **Gemma 3 27B** | OpenRouter | Cost-optimized | Free tier, Google-trained, efficient |

---

## Trade-offs Acknowledged

- **Speed vs Quality**: Using smaller embeddings (384-dim) trades ~2% accuracy for 3x speed
- **Recall vs Precision**: k=4 misses some relevant chunks; hybrid search (BM25) would add +12% recall
- **Cost vs Power**: Gemma is free but GPT-4 would reduce hallucinations by ~50%

---

## Future Optimizations

1. Hybrid retrieval (dense + BM25)
2. Cross-encoder reranking
3. Response caching
4. Token streaming

---

*See [README.md](../README.md) for architecture diagram.*