goutam-dev/rag-chatbot

+---
+license: mit
+language:
+  - en
+tags:
+  - RAG
+  - retrieval-augmented-generation
+  - document-qa
+  - pdf-processing
+  - hybrid-retrieval
+  - cross-encoder
+  - langchain
+  - chromadb
+  - bm25
+  - semantic-chunking
+  - multi-document
+  - question-answering
+library_name: langchain
+pipeline_tag: question-answering
+datasets: []
+metrics:
+  - accuracy
+base_model:
+  - BAAI/bge-large-en-v1.5
+  - BAAI/bge-reranker-v2-m3
+  - sentence-transformers/all-MiniLM-L6-v2
+---
+# Multi-Document RAG System
+A production-ready **Retrieval-Augmented Generation (RAG)** system for question answering over multiple PDF documents. It combines hybrid retrieval (vector and keyword search), cross-encoder re-ranking, semantic chunking, and a Gradio web interface.
+![Architecture](https://img.shields.io/badge/Architecture-Hybrid%20RAG-blue)
+![Python](https://img.shields.io/badge/Python-3.10%2B-green)
+![LLM](https://img.shields.io/badge/LLM-Llama%203.3%2070B-orange)
+## Model Description
+This system implements a RAG pipeline that combines query expansion, hybrid retrieval, cross-encoder re-ranking, and LLM generation with inline citations. It relies on the following pre-trained models.
+### Core Models Used
+| Component | Model | Purpose |
+|-----------|-------|---------|
+| **Embeddings** | `BAAI/bge-large-en-v1.5` | 1024-dim normalized embeddings for semantic search |
+| **Re-ranker** | `BAAI/bge-reranker-v2-m3` | Cross-encoder neural re-ranking for precision |
+| **Chunker** | `sentence-transformers/all-MiniLM-L6-v2` | Semantic similarity for intelligent chunking |
+| **LLM** | Llama 3.3 70B (via Groq API) | Generation with inline citations |
+### Architecture
+```
+User Query
+    │
+    ├── Query Classification (factoid/summary/comparison/extraction/reasoning)
+    ├── Multi-Query Expansion (3 alternative phrasings)
+    └── HyDE Generation (hypothetical answer document)
+           │
+           ▼
+    ┌──────────────────────────────────────┐
+    │         Hybrid Retrieval             │
+    │  ┌─────────────┐  ┌─────────────┐    │
+    │  │ ChromaDB    │  │ BM25        │    │
+    │  │ (Vector)    │  │ (Keyword)   │    │
+    │  └─────────────┘  └─────────────┘    │
+    │           │              │           │
+    │           └──────┬───────┘           │
+    │                  ▼                   │
+    │         RRF Fusion + Deduplication   │
+    └──────────────────────────────────────┘
+                       │
+                       ▼
+              Cross-Encoder Re-ranking
+              (BAAI/bge-reranker-v2-m3)
+                       │
+                       ▼
+              LLM Generation (Llama 3.3 70B)
+              with inline source citations
+                       │
+                       ▼
+              Answer Verification (for complex queries)
+```
+## Key Features
+### Hybrid Retrieval
+- **Vector Search (MMR)**: Semantic similarity with diversity via ChromaDB
+- **Keyword Search (BM25)**: Exact term matching for rare words
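+- **Reciprocal Rank Fusion (RRF)**: Merges the vector and keyword rankings by summing reciprocal ranks, so the two score scales never need to be normalized against each other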
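As a rough illustration of the fusion step, the sketch below merges two ranked lists of chunk IDs with Reciprocal Rank Fusion. The function name, the chunk IDs, and the constant `k=60` (a commonly used default) are assumptions made for this example; the notebook's actual values are not stated here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each chunk earns 1 / (k + rank) per list it appears in, so chunks
    ranked highly by both retrievers rise to the top. Using IDs as keys
    also deduplicates chunks returned by more than one retriever.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from the vector retriever and from BM25.
vector_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks `c1` and `c3` appear in both lists, so they outrank chunks that only one retriever found.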
+### Semantic Chunking
+Documents are split based on sentence embedding similarity rather than fixed character counts, preserving coherent ideas within chunks.
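The idea can be sketched as a greedy pass over sentences: a sentence joins the current chunk while it stays similar to the previous sentence and the chunk stays under the size limit. This is a minimal sketch under stated assumptions, not the notebook's implementation. The `embed` callable stands in for the `all-MiniLM-L6-v2` encoder, and `toy_embed` with its tiny vocabulary is a hypothetical stand-in so the example runs without downloading a model.

```python
import math
import re


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(text, embed, similarity_threshold=0.5, max_chunk_size=1000):
    """Group consecutive sentences into chunks by embedding similarity."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, prev_vec = [], [], None
    for sentence in sentences:
        vec = embed(sentence)
        related = prev_vec is not None and cosine(prev_vec, vec) >= similarity_threshold
        too_long = len(" ".join(current + [sentence])) > max_chunk_size
        # Close the chunk on a topic shift or when the size limit would be exceeded.
        if current and (not related or too_long):
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev_vec = vec
    if current:
        chunks.append(" ".join(current))
    return chunks


# Hypothetical bag-of-words "embedding" used only to make the demo runnable.
VOCAB = ["cat", "dog", "pet", "stock", "market", "price"]


def toy_embed(sentence):
    words = re.findall(r"[a-z]+", sentence.lower())
    return [float(words.count(term)) for term in VOCAB]


text = (
    "A cat is a pet. A dog is a pet and a cat is a pet. "
    "The stock market fell. The stock market price dropped."
)
chunks = semantic_chunks(text, toy_embed)
for chunk in chunks:
    print(chunk)
```

With the toy embedding, the two pet sentences form one chunk and the two market sentences form another. A single sentence longer than `max_chunk_size` is kept whole in this sketch.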
+### Intelligent Query Classification
+Each query is classified into one of five types, and the type determines the retrieval depth and the answer style:
+| Query Type | Retrieval Depth (k) | Answer Style |
+|------------|---------------------|--------------|
+| Factoid | 6 | Direct |
+| Summary | 10 | Bullets |
+| Comparison | 12 | Bullets |
+| Extraction | 8 | Direct |
+| Reasoning | 10 | Steps |
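One way to encode the table above is a small lookup keyed by query type. The names `RetrievalProfile`, `PROFILES`, and `profile_for` are hypothetical, and falling back to the factoid profile for an unrecognized label is an assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalProfile:
    k: int             # number of chunks to retrieve
    answer_style: str  # how the answer should be formatted


# Values copied from the table above.
PROFILES = {
    "factoid": RetrievalProfile(k=6, answer_style="direct"),
    "summary": RetrievalProfile(k=10, answer_style="bullets"),
    "comparison": RetrievalProfile(k=12, answer_style="bullets"),
    "extraction": RetrievalProfile(k=8, answer_style="direct"),
    "reasoning": RetrievalProfile(k=10, answer_style="steps"),
}


def profile_for(query_type):
    """Return the profile for a classifier label.

    Assumption: unknown labels fall back to the factoid profile.
    """
    return PROFILES.get(query_type.strip().lower(), PROFILES["factoid"])


print(profile_for("Comparison"))
```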
+### Multi-Document Support
+- Upload multiple PDFs to build a combined knowledge base
+- Automatic PDF diversity enforcement, so cross-document queries draw on more than one source file
+- Clear source attribution with document name and page number
+### Query Enhancement
+- **HyDE**: Generates hypothetical answer documents for better retrieval
+- **Multi-Query Expansion**: Creates 3 alternative phrasings for broader coverage
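The two enhancements can be pictured as building a list of search strings: the original question, up to three rephrasings, and one hypothetical answer passage. In the sketch below, `llm` is any callable that maps a prompt to text. The prompt wording, the function name `build_search_queries`, and the `fake_llm` stub are assumptions for illustration and are not taken from the notebook.

```python
def build_search_queries(question, llm, n_variants=3):
    """Return the question, its rephrasings, and a HyDE passage, deduplicated."""
    variant_prompt = (
        f"Rewrite the question below in {n_variants} different ways, one per line.\n"
        f"Question: {question}"
    )
    hyde_prompt = (
        "Write a short passage that plausibly answers the question.\n"
        f"Question: {question}"
    )
    variants = [line.strip() for line in llm(variant_prompt).splitlines() if line.strip()]
    hypothetical_doc = llm(hyde_prompt).strip()

    candidates = [question] + variants[:n_variants] + [hypothetical_doc]
    seen, unique = set(), []
    for text in candidates:
        key = text.lower()
        if text and key not in seen:  # drop empty strings and repeats
            seen.add(key)
            unique.append(text)
    return unique


def fake_llm(prompt):
    """Stub standing in for the real LLM call so the example runs offline."""
    if prompt.startswith("Rewrite"):
        return (
            "What does the paper contribute?\n"
            "Which contribution do the authors claim?\n"
            "What is new in this work?"
        )
    return "The paper contributes a hybrid retrieval pipeline."


question = "What is the main contribution of this paper?"
queries = build_search_queries(question, fake_llm)
for q in queries:
    print(q)
```

Each string would then be sent to the retriever, and the resulting rankings fused into a single list.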
+### Answer Verification
+For complex queries, a self-verification step checks that the answer is direct, structured, and grounded in the retrieved sources.
+## Intended Uses
+### Primary Use Cases
+- **Academic Research**: Analyze and compare research papers
+- **Document Q&A**: Answer questions over technical documentation
+- **Literature Review**: Synthesize information across multiple sources
+- **Knowledge Extraction**: Extract specific facts, methodologies, or findings
+### Out-of-Scope Uses
+- Latency-sensitive or real-time applications (queries typically take 5-15 seconds)
+- Non-English documents (the pipeline is optimized for English)
+- Image- or table-heavy PDFs (only text is extracted)
+## How to Use
+### Requirements
+- Python 3.10+
+- Groq API key (free at [console.groq.com](https://console.groq.com))
+- GPU recommended but not required
+### Installation
+```bash
+pip install numpy==1.26.4 pandas==2.2.2 scipy==1.13.1
+pip install langchain-core==0.2.40 langchain-community==0.2.16 langchain==0.2.16
+pip install langchain-groq==0.1.9 langchain-text-splitters==0.2.4
+pip install chromadb==0.5.5 sentence-transformers==3.0.1
+pip install pypdf==4.3.1 rank-bm25==0.2.2 gradio torch
+```
+### Quick Start
+1. Open `rag.ipynb` in Jupyter Notebook or Google Colab
+2. Run all cells sequentially
+3. Enter your Groq API key in the Setup tab
+4. Upload PDF documents
+5. Ask questions in the Chat tab
+### Example Queries
+```python
+# Single Document Analysis
+"What is the main contribution of this paper?"
+"Explain the methodology in detail"
+"What are the limitations mentioned by the authors?"
+# Multi-Document Comparison
+"Compare the approaches discussed in these papers"
+"What are the key differences between the methodologies?"
+```
+## Technical Specifications
+### Performance Benchmarks
+| Operation | Typical Duration |
+|-----------|------------------|
+| Model initialization | 30-60 seconds |
+| PDF ingestion (per doc) | 10-30 seconds |
+| Simple queries | 5-8 seconds |
+| Complex queries | 10-15 seconds |
+| Full document summary | 30-90 seconds |
+### Configuration Parameters
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `max_chunk_size` | 1000 | Maximum characters per semantic chunk |
+| `similarity_threshold` | 0.5 | Cosine similarity threshold for grouping sentences into a chunk |
+| `chunk_size` | 800 | Fallback text splitter chunk size |
+| `chunk_overlap` | 150 | Character overlap between chunks |
+| `fetch_factor` | 2 | Multiplier for initial retrieval pool |
+| `lambda_mult` | 0.6 | MMR diversity parameter (0 = maximum diversity, 1 = minimum diversity) |
+| `cache_max_size` | 100 | Maximum cached query responses |
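To make `lambda_mult` and `fetch_factor` concrete, here is a minimal pure-Python sketch of Maximal Marginal Relevance (MMR) selection. It assumes the candidate pool holds `k * fetch_factor` items already ordered by similarity to the query. That reading of `fetch_factor`, the function names, and the toy vectors are assumptions for this example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidates, k, fetch_factor=2, lambda_mult=0.6):
    """Pick k chunk IDs, trading relevance against redundancy.

    candidates: list of (chunk_id, vector), best match first.
    """
    pool = list(candidates[: k * fetch_factor])  # assumed meaning of fetch_factor
    selected = []

    def score(item):
        _, vec = item
        relevance = cosine(query_vec, vec)
        # Similarity to the closest chunk that was already selected.
        redundancy = max((cosine(vec, s) for _, s in selected), default=0.0)
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while pool and len(selected) < k:
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return [chunk_id for chunk_id, _ in selected]


query = [1.0, 0.0]
candidates = [
    ("a", [1.0, 0.0]),    # exact match
    ("b", [0.99, 0.14]),  # near-duplicate of "a"
    ("c", [0.6, 0.8]),    # less relevant, but different
]

relevance_leaning = mmr_select(query, candidates, k=2, lambda_mult=0.6)
diversity_leaning = mmr_select(query, candidates, k=2, lambda_mult=0.3)
print(relevance_leaning)  # ['a', 'b']
print(diversity_leaning)  # ['a', 'c']
```

With the default of 0.6 the near-duplicate `b` is still chosen second. Lowering `lambda_mult` to 0.3 penalizes redundancy enough that the more distinct chunk `c` is chosen instead.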
+## Limitations
+- Requires an active internet connection for Groq API calls
+- Text extraction accuracy depends on PDF quality
+- Large documents take longer to ingest than the typical 10-30 seconds per document
+- The query cache does not persist between sessions
+- Optimized for English-language documents
+## Training Details
+This is a **retrieval system**, not a trained model. It orchestrates pre-trained models:
+- **Embeddings**: Uses pre-trained `BAAI/bge-large-en-v1.5` without fine-tuning
+- **Re-ranker**: Uses pre-trained `BAAI/bge-reranker-v2-m3` without fine-tuning
+- **LLM**: Uses Llama 3.3 70B via Groq API with zero-shot prompting
+## Evaluation
+The system was evaluated qualitatively on academic papers and technical documents for:
+- Answer relevance and accuracy
+- Source attribution correctness
+- Cross-document comparison quality
+- Response structure and readability
+## Environmental Impact
+- **Hardware**: Developed and tested on Google Colab (NVIDIA T4 GPU)
+- **Inference**: Primary compute via Groq API (cloud-hosted)
+- **Local models**: ~2 GB VRAM for embeddings + re-ranker
+## Citation
+```bibtex
+@software{multi_doc_rag_system,
+  title = {Multi-Document RAG System},
+  year = {2024},
+  description = {Production-ready RAG system with hybrid retrieval and cross-encoder re-ranking},
+  url = {https://huggingface.co/your-username/your-repo}
+}
+```
+## Acknowledgements
+This project builds upon:
+- [LangChain](https://github.com/langchain-ai/langchain) for RAG orchestration
+- [ChromaDB](https://github.com/chroma-core/chroma) for vector storage
+- [Sentence Transformers](https://www.sbert.net/) for embeddings
+- [BAAI](https://huggingface.co/BAAI) for BGE models
+- [Groq](https://groq.com/) for fast LLM inference
+## Contact
+For questions or feedback, please open an issue on the repository.