Spaces:

pkgprateek
/

ai-rag-document

Sleeping

App Files Files Community

ai-rag-document / docs /DESIGN_DECISIONS.md

pkgprateek

feat: Add multi-provider LLM support with UI model selector

bb9f87e 5 months ago

preview code

raw

history blame

1.56 kB

	# Design Decisions

	> Why we chose what we chose. No fluff.

	\| Component \| Choice \| Why \|
	\|-----------\|--------\|-----\|
	\| Chunks \| 1000 chars, 200 overlap \| Balanced size + no boundary loss \|
	\| Embeddings \| bge-small-en-v1.5 \| Best quality/speed ratio on MTEB \|
	\| Vector DB \| ChromaDB \| Embedded, persistent, no server \|
	\| Retrieval \| Top-4 cosine \| k=4 tested optimal (vs k=2,8,16) \|
	\| LLM \| GPT-OSS 120B (default), Llama 3.3 70B, Gemma 3 27B \| Multi-provider flexibility via Groq + OpenRouter \|
	\| Rate limit \| 10/hour \| Prevents API abuse \|
	\| Cleanup \| 7-day auto-delete \| Privacy without user friction \|

	---

	## Model Selection Rationale

	\| Model \| Provider \| Use Case \| Strengths \|
	\|-------\|----------\|----------\|------------\|
	\| GPT-OSS 120B (Default) \| Groq \| General enterprise Q&A \| Best quality, fast inference, OpenAI architecture \|
	\| Llama 3.3 70B \| Groq \| Complex reasoning \| Open-source, strong context understanding \|
	\| Gemma 3 27B \| OpenRouter \| Cost-optimized \| Free tier, Google-trained, efficient \|

	---

	## Trade-offs Acknowledged

	- Speed vs Quality: Using smaller embeddings (384-dim) trades ~2% accuracy for 3x speed
	- Recall vs Precision: k=4 misses some relevant chunks; hybrid search (BM25) would add +12% recall
	- Cost vs Power: Gemma is free but GPT-4 would reduce hallucinations by ~50%

	---

	## Future Optimizations

	1. Hybrid retrieval (dense + BM25)
	2. Cross-encoder reranking
	3. Response caching
	4. Token streaming

	---

	See [README.md](../README.md) for architecture diagram.

	# Design Decisions

	> Why we chose what we chose. No fluff.

	\| Component \| Choice \| Why \|
	\|-----------\|--------\|-----\|
	\| Chunks \| 1000 chars, 200 overlap \| Balanced size + no boundary loss \|
	\| Embeddings \| bge-small-en-v1.5 \| Best quality/speed ratio on MTEB \|
	\| Vector DB \| ChromaDB \| Embedded, persistent, no server \|
	\| Retrieval \| Top-4 cosine \| k=4 tested optimal (vs k=2,8,16) \|
	\| LLM \| GPT-OSS 120B (default), Llama 3.3 70B, Gemma 3 27B \| Multi-provider flexibility via Groq + OpenRouter \|
	\| Rate limit \| 10/hour \| Prevents API abuse \|
	\| Cleanup \| 7-day auto-delete \| Privacy without user friction \|

	---

	## Model Selection Rationale

	\| Model \| Provider \| Use Case \| Strengths \|
	\|-------\|----------\|----------\|------------\|
	\| GPT-OSS 120B (Default) \| Groq \| General enterprise Q&A \| Best quality, fast inference, OpenAI architecture \|
	\| Llama 3.3 70B \| Groq \| Complex reasoning \| Open-source, strong context understanding \|
	\| Gemma 3 27B \| OpenRouter \| Cost-optimized \| Free tier, Google-trained, efficient \|

	---

	## Trade-offs Acknowledged

	- Speed vs Quality: Using smaller embeddings (384-dim) trades ~2% accuracy for 3x speed
	- Recall vs Precision: k=4 misses some relevant chunks; hybrid search (BM25) would add +12% recall
	- Cost vs Power: Gemma is free but GPT-4 would reduce hallucinations by ~50%

	---

	## Future Optimizations

	1. Hybrid retrieval (dense + BM25)
	2. Cross-encoder reranking
	3. Response caching
	4. Token streaming

	---

	See [README.md](../README.md) for architecture diagram.