---
license: mit
language:
- en
tags:
- RAG
- retrieval-augmented-generation
- document-qa
- pdf-processing
- hybrid-retrieval
- cross-encoder
- langchain
- chromadb
- bm25
- semantic-chunking
- multi-document
- question-answering
library_name: langchain
pipeline_tag: question-answering
datasets: []
metrics:
- accuracy
base_model:
- BAAI/bge-large-en-v1.5
- BAAI/bge-reranker-v2-m3
- sentence-transformers/all-MiniLM-L6-v2
---

# Multi-Document RAG System

A production-ready **Retrieval-Augmented Generation (RAG)** system for intelligent question answering over multiple PDF documents. Features hybrid retrieval (vector + keyword search), cross-encoder re-ranking, semantic chunking, and a Gradio web interface.

![Architecture](https://img.shields.io/badge/Architecture-Hybrid%20RAG-blue)
![Python](https://img.shields.io/badge/Python-3.10%2B-green)
![LLM](https://img.shields.io/badge/LLM-Llama%203.3%2070B-orange)

## Model Description

This system implements an advanced RAG pipeline that combines several complementary techniques for accurate document retrieval and question answering:

### Core Models Used

| Component | Model | Purpose |
|-----------|-------|---------|
| **Embeddings** | `BAAI/bge-large-en-v1.5` | 1024-dim normalized embeddings for semantic search |
| **Re-ranker** | `BAAI/bge-reranker-v2-m3` | Cross-encoder neural re-ranking for precision |
| **Chunker** | `sentence-transformers/all-MiniLM-L6-v2` | Semantic similarity for intelligent chunking |
| **LLM** | Llama 3.3 70B (via Groq API) | Generation with inline citations |

### Architecture

```
User Query
    ├── Query Classification (factoid/summary/comparison/extraction/reasoning)
    ├── Multi-Query Expansion (3 alternative phrasings)
    └── HyDE Generation (hypothetical answer document)
                   │
                   ▼
┌──────────────────────────────────────┐
│           Hybrid Retrieval           │
│  ┌─────────────┐    ┌─────────────┐  │
│  │  ChromaDB   │    │    BM25     │  │
│  │  (Vector)   │    │  (Keyword)  │  │
│  └─────────────┘    └─────────────┘  │
│         │                  │         │
│         └────────┬─────────┘         │
│                  ▼                   │
│      RRF Fusion + Deduplication      │
└──────────────────────────────────────┘
                   │
                   ▼
       Cross-Encoder Re-ranking
       (BAAI/bge-reranker-v2-m3)
                   │
                   ▼
     LLM Generation (Llama 3.3 70B)
      with inline source citations
                   │
                   ▼
 Answer Verification (for complex queries)
```
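The re-ranking stage above can be sketched independently of any particular model: the cross-encoder is simply a function that scores a (query, document) pair jointly. Below is a minimal sketch with a pluggable scorer; the toy `overlap_score` is illustrative only and stands in for `BAAI/bge-reranker-v2-m3`.

```python
from typing import Callable, List, Tuple

def rerank(
    query: str,
    docs: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every (query, doc) pair and keep the top_k highest-scoring docs."""
    scored = [(doc, score_fn(query, doc)) for doc in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy scorer for illustration: fraction of query words found in the doc.
# The real pipeline plugs in the cross-encoder here, which reads the query
# and document together rather than comparing cached embeddings.
def overlap_score(query: str, doc: str) -> float:
    q_words = set(query.lower().split())
    return len(q_words & set(doc.lower().split())) / max(len(q_words), 1)

docs = ["BM25 ranks by keyword overlap", "Vectors capture semantics", "Llamas are animals"]
top = rerank("how does keyword overlap ranking work", docs, overlap_score, top_k=2)
```

With `sentence-transformers`, the scorer could wrap `CrossEncoder("BAAI/bge-reranker-v2-m3")`, batching all candidate pairs into a single `predict` call for speed.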

## Key Features

### Hybrid Retrieval
- **Vector Search (MMR)**: Semantic similarity with result diversity via ChromaDB
- **Keyword Search (BM25)**: Exact term matching that catches rare words and identifiers
- **Reciprocal Rank Fusion**: Merges the vector and keyword rankings into a single robust ordering
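The fusion step is only a few lines; here is a minimal sketch, where `k=60` is the constant from the original RRF paper (the value used in the notebook may differ):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
    so documents ranked well by several retrievers rise to the top."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Deduplication falls out for free: each doc_id is a single dict key.
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d2"]  # ranking from the vector store (semantic)
bm25_hits = ["d1", "d4", "d3"]    # ranking from BM25 (keyword)
fused = rrf_fuse([vector_hits, bm25_hits])  # d1 leads: it ranks well in both lists
```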

### Semantic Chunking
Documents are split based on sentence embedding similarity rather than fixed character counts, preserving coherent ideas within chunks.
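A minimal sketch of the idea, with a pluggable `embed` function — the system uses `all-MiniLM-L6-v2` sentence embeddings here, while the toy embedding below exists only to make the example self-contained:

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, embed, similarity_threshold=0.5, max_chunk_size=1000):
    """Group consecutive sentences while adjacent sentence embeddings stay
    similar and the running chunk stays under max_chunk_size characters."""
    vecs = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sent in zip(vecs, vecs[1:], sentences[1:]):
        same_topic = cosine(prev, vec) >= similarity_threshold
        fits = sum(map(len, current)) + len(sent) <= max_chunk_size
        if same_topic and fits:
            current.append(sent)
        else:
            chunks.append(" ".join(current))  # topic shift or size limit: close chunk
            current = [sent]
    chunks.append(" ".join(current))
    return chunks

# Toy embedding for illustration only; the real system would call a
# sentence-transformers model here.
def toy_embed(sentence):
    return [1.0, 0.0] if "cat" in sentence.lower() else [0.0, 1.0]

sents = ["Cats sleep a lot.", "A cat naps daily.", "Stocks fell today.", "Markets were down."]
chunks = semantic_chunks(sents, toy_embed)
```

The two cat sentences end up in one chunk and the two finance sentences in another, because the boundary is drawn where adjacent-sentence similarity drops below the threshold.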

### Intelligent Query Classification
Automatically classifies each query into one of five types and adapts retrieval accordingly:

| Query Type | Retrieval Depth (k) | Answer Style |
|------------|---------------------|--------------|
| Factoid | 6 | Direct |
| Summary | 10 | Bullets |
| Comparison | 12 | Bullets |
| Extraction | 8 | Direct |
| Reasoning | 10 | Steps |
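One way the table could be encoded — a hypothetical sketch, since the notebook may store these settings differently:

```python
# Per-type retrieval settings mirroring the table above (hypothetical encoding).
QUERY_PROFILES = {
    "factoid":    {"k": 6,  "style": "direct"},
    "summary":    {"k": 10, "style": "bullets"},
    "comparison": {"k": 12, "style": "bullets"},
    "extraction": {"k": 8,  "style": "direct"},
    "reasoning":  {"k": 10, "style": "steps"},
}

def retrieval_settings(query_type: str) -> dict:
    # Fall back to the broad summary profile for unrecognized types.
    return QUERY_PROFILES.get(query_type.lower(), QUERY_PROFILES["summary"])
```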

### Multi-Document Support
- Upload multiple PDFs to build a combined knowledge base
- Automatic PDF diversity enforcement for cross-document queries
- Clear source attribution with document name and page number

### Query Enhancement
- **HyDE**: Generates hypothetical answer documents for better retrieval
- **Multi-Query Expansion**: Creates 3 alternative phrasings for broader coverage
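Both enhancements can be framed as one expansion step that turns a single question into several retrieval probes. A minimal sketch with a pluggable `llm` callable; the prompt strings are illustrative, not the notebook's actual prompts:

```python
from typing import Callable, List

# Illustrative prompts; the real prompt wording may differ.
MULTI_QUERY_PROMPT = "Rewrite the following question in 3 different ways, one per line:\n{q}"
HYDE_PROMPT = "Write a short passage that would plausibly answer this question:\n{q}"

def expand_query(question: str, llm: Callable[[str], str]) -> List[str]:
    """Return the original question, up to 3 rephrasings, and one HyDE
    pseudo-document; each is embedded and retrieved against separately."""
    rewrites = llm(MULTI_QUERY_PROMPT.format(q=question)).strip().splitlines()
    hyde_doc = llm(HYDE_PROMPT.format(q=question))
    return [question, *rewrites[:3], hyde_doc]

# Stubbed LLM for illustration; the real system calls Llama 3.3 70B via Groq.
fake_llm = lambda prompt: "Rephrasing one\nRephrasing two\nRephrasing three"
queries = expand_query("What is reciprocal rank fusion?", fake_llm)
```

The HyDE probe tends to help because a hypothetical answer lives in the same embedding neighborhood as the real answer passages, even when the question itself does not.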

### Answer Verification
A self-verification step for complex queries checks that answers are direct, well structured, and grounded in the retrieved sources.

## Intended Uses

### Primary Use Cases
- **Academic Research**: Analyze and compare research papers
- **Document Q&A**: Answer questions over technical documentation
- **Literature Review**: Synthesize information across multiple sources
- **Knowledge Extraction**: Extract specific facts, methodologies, or findings

### Out-of-Scope Uses
- Real-time streaming applications (latency-sensitive)
- Non-English documents (optimized for English)
- Image/table-heavy PDFs (text extraction only)

## How to Use

### Requirements
- Python 3.10+
- Groq API key (free at [console.groq.com](https://console.groq.com))
- GPU recommended but not required

### Installation

```bash
pip install numpy==1.26.4 pandas==2.2.2 scipy==1.13.1
pip install langchain-core==0.2.40 langchain-community==0.2.16 langchain==0.2.16
pip install langchain-groq==0.1.9 langchain-text-splitters==0.2.4
pip install chromadb==0.5.5 sentence-transformers==3.0.1
pip install pypdf==4.3.1 rank-bm25==0.2.2 gradio torch
```

### Quick Start

1. Open `rag.ipynb` in Jupyter Notebook or Google Colab
2. Run all cells sequentially
3. Enter your Groq API key in the Setup tab
4. Upload PDF documents
5. Ask questions in the Chat tab

### Example Queries

```python
# Single-document analysis
"What is the main contribution of this paper?"
"Explain the methodology in detail"
"What are the limitations mentioned by the authors?"

# Multi-document comparison
"Compare the approaches discussed in these papers"
"What are the key differences between the methodologies?"
```

## Technical Specifications

### Performance Benchmarks

| Operation | Typical Duration |
|-----------|------------------|
| Model initialization | 30-60 seconds |
| PDF ingestion (per doc) | 10-30 seconds |
| Simple queries | 5-8 seconds |
| Complex queries | 10-15 seconds |
| Full document summary | 30-90 seconds |

### Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `max_chunk_size` | 1000 | Maximum characters per semantic chunk |
| `similarity_threshold` | 0.5 | Cosine similarity for chunk grouping |
| `chunk_size` | 800 | Fallback text splitter chunk size |
| `chunk_overlap` | 150 | Character overlap between chunks |
| `fetch_factor` | 2 | Multiplier for initial retrieval pool |
| `lambda_mult` | 0.6 | MMR diversity parameter |
| `cache_max_size` | 100 | Maximum cached query responses |
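The defaults above could be grouped into a single config object — a hypothetical sketch, since the notebook may keep them as separate variables:

```python
from dataclasses import dataclass

# Hypothetical grouping of the documented defaults into one object.
@dataclass
class RAGConfig:
    max_chunk_size: int = 1000         # max characters per semantic chunk
    similarity_threshold: float = 0.5  # cosine threshold for chunk grouping
    chunk_size: int = 800              # fallback splitter chunk size
    chunk_overlap: int = 150           # character overlap between chunks
    fetch_factor: int = 2              # initial retrieval pool = k * fetch_factor
    lambda_mult: float = 0.6           # MMR relevance/diversity trade-off
    cache_max_size: int = 100          # max cached query responses

config = RAGConfig(lambda_mult=0.7)  # override only the MMR diversity weight
```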

## Limitations

- Requires an active internet connection for Groq API calls
- PDF quality affects text extraction accuracy
- Large documents take longer to process
- Query cache does not persist between sessions
- Optimized for English-language documents

## Training Details

This is a **retrieval system**, not a trained model. It orchestrates pre-trained models:

- **Embeddings**: Uses pre-trained `BAAI/bge-large-en-v1.5` without fine-tuning
- **Re-ranker**: Uses pre-trained `BAAI/bge-reranker-v2-m3` without fine-tuning
- **LLM**: Uses Llama 3.3 70B via the Groq API with zero-shot prompting

## Evaluation

The system was evaluated qualitatively on academic papers and technical documents for:
- Answer relevance and accuracy
- Source attribution correctness
- Cross-document comparison quality
- Response structure and readability

## Environmental Impact

- **Hardware**: Developed and tested on Google Colab (NVIDIA T4 GPU)
- **Inference**: Primary compute via Groq API (cloud-hosted)
- **Local models**: ~2 GB VRAM for the embedding and re-ranking models

## Citation

```bibtex
@software{multi_doc_rag_system,
  title = {Multi-Document RAG System},
  year = {2024},
  note = {Production-ready RAG system with hybrid retrieval and cross-encoder re-ranking},
  url = {https://huggingface.co/your-username/your-repo}
}
```

## Acknowledgements

This project builds upon:
- [LangChain](https://github.com/langchain-ai/langchain) for RAG orchestration
- [ChromaDB](https://github.com/chroma-core/chroma) for vector storage
- [Sentence Transformers](https://www.sbert.net/) for embeddings
- [BAAI](https://huggingface.co/BAAI) for the BGE models
- [Groq](https://groq.com/) for fast LLM inference

## Contact

For questions or feedback, please open an issue on the repository.