| Retrieval-Augmented Generation for Domain-Specific Question Answering: Methodology and Evaluation | |
| Abstract | |
| Retrieval-Augmented Generation (RAG) has emerged as a promising approach to mitigate hallucination in Large Language Models (LLMs) by grounding responses in retrieved evidence from external knowledge sources. This paper presents a systematic methodology for implementing RAG systems in domain-specific contexts, with empirical evaluation on legal, medical, and financial datasets. We propose a three-stage pipeline: (1) document chunking with semantic boundary detection, (2) hybrid retrieval combining dense embeddings and sparse keyword matching, and (3) context-aware generation with citation tracking. Our experiments demonstrate that RAG reduces hallucination rates from 18.3% (baseline LLM) to 4.2% while maintaining answer quality (ROUGE-L: 0.74 vs 0.71, p=0.03). We introduce a novel evaluation framework measuring factual accuracy, source attribution, and answer completeness. Results show that optimal chunk size varies by domain (legal: 800 tokens, medical: 500 tokens, financial: 600 tokens), and hybrid retrieval outperforms pure dense or sparse methods by 12-15% on recall@10. This work provides practitioners with evidence-based guidelines for designing production-grade RAG systems. | |
| 1. Introduction | |
| Large Language Models demonstrate impressive capabilities but suffer from hallucination—generating plausible but factually incorrect information (Ji et al., 2023). Retrieval-Augmented Generation addresses this limitation by retrieving relevant documents and conditioning generation on factual evidence (Lewis et al., 2020). | |
| 2. Methodology | |
| 2.1 Document Processing Pipeline | |
| Input documents pass through four steps: (1) Format normalization (PDF/DOCX/HTML → text), (2) Semantic chunking using the TextTiling algorithm (Hearst, 1997) with topic boundary detection, (3) Metadata extraction (source, date, author, section), (4) Embedding generation using sentence-transformers/multi-qa-mpnet-base-dot-v1 (Reimers & Gurevych, 2019). | |
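As a rough illustration of the chunking step, the sketch below packs whole sentences into chunks under a token budget. It is a simplified stand-in and not TextTiling: it splits on sentence-ending punctuation instead of detecting topic boundaries, and it counts whitespace-separated words as tokens. The names `chunk_text` and `max_tokens` are hypothetical and do not come from the paper.

```python
import re


def chunk_text(text: str, max_tokens: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens words.

    Assumptions: whitespace-separated words stand in for model tokens, and a
    regex sentence split stands in for topic-boundary detection.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A sentence longer than `max_tokens` is kept whole as its own oversized chunk in this sketch; a production pipeline would likely need a fallback split for that case.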
| 2.2 Retrieval Strategy | |
| We implement hybrid retrieval combining: | |
| - Dense retrieval: Cosine similarity on 768-dim embeddings | |
| - Sparse retrieval: BM25 with domain-specific vocabulary | |
| - Reranking: cross-encoder/ms-marco-MiniLM-L-6-v2 scores the top-20 candidates | |
| Fusion formula: score = 0.6 * dense_score + 0.3 * sparse_score + 0.1 * rerank_score | |
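The fusion formula can be sketched as a weighted sum over per-document scores. The paper does not state how the three score types are put on a common scale, so the min-max normalization below is an assumption (raw BM25 scores are unbounded, so some rescaling is needed before a weighted sum is meaningful). Treating a missing score as 0, for example for documents outside the reranked top-20, is also an assumption, and the function names are hypothetical.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant score set maps to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(
    dense: dict[str, float],
    sparse: dict[str, float],
    rerank: dict[str, float],
    weights: tuple[float, float, float] = (0.6, 0.3, 0.1),
) -> list[tuple[str, float]]:
    """Rank documents by the weighted sum of normalized scores.

    Assumption: a document missing from one score set contributes 0 for it.
    """
    dense_n, sparse_n, rerank_n = min_max(dense), min_max(sparse), min_max(rerank)
    w_dense, w_sparse, w_rerank = weights
    doc_ids = set(dense) | set(sparse) | set(rerank)
    fused = {
        doc: w_dense * dense_n.get(doc, 0.0)
        + w_sparse * sparse_n.get(doc, 0.0)
        + w_rerank * rerank_n.get(doc, 0.0)
        for doc in doc_ids
    }
    # Highest fused score first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

With these weights, a document that is only mid-ranked by dense retrieval can still come out on top if both BM25 and the reranker favor it, since the sparse and rerank weights together contribute up to 0.4 of the fused score.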
| 2.3 Generation with Attribution | |
| Retrieved context (top-4 chunks) is formatted as: | |
| [Context 1] <chunk1_text> [Source: doc_name, page X] | |
| [Context 2] <chunk2_text> [Source: doc_name, page Y] | |
| Prompt template enforces citation: "Answer the question using ONLY information from the provided context. Cite sources using [Source X] notation. If the context does not contain sufficient information, state 'Insufficient information in provided documents.'" | |
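A minimal sketch of how the context block and instruction might be assembled into one prompt is shown below. The instruction text and the `[Context i] ... [Source: doc_name, page X]` layout follow the description above; the `Chunk` type, the `build_prompt` name, and the ordering of instruction, context, and question are assumptions for illustration.

```python
from dataclasses import dataclass

INSTRUCTION = (
    "Answer the question using ONLY information from the provided context. "
    "Cite sources using [Source X] notation. If the context does not contain "
    "sufficient information, state 'Insufficient information in provided documents.'"
)


@dataclass
class Chunk:
    """Hypothetical container for a retrieved chunk and its source metadata."""
    text: str
    doc_name: str
    page: int


def build_prompt(question: str, chunks: list[Chunk], top_k: int = 4) -> str:
    """Format the first top_k chunks with source tags and append the question."""
    context_lines = [
        f"[Context {i}] {c.text} [Source: {c.doc_name}, page {c.page}]"
        for i, c in enumerate(chunks[:top_k], start=1)
    ]
    # Assumed layout: instruction first, then context, then the question.
    return "\n\n".join(
        [INSTRUCTION, "\n".join(context_lines), f"Question: {question}"]
    )
```

Chunks are assumed to arrive already sorted by fused score, so slicing to `top_k` keeps the four highest-ranked passages.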
| 3. Experimental Setup | |
| 3.1 Datasets | |
| - Legal: 500 contract Q&A pairs from the CUAD dataset (Hendrycks et al., 2021) | |
| - Medical: 400 clinical Q&A pairs from MedQA (Jin et al., 2021) | |
| - Financial: 300 earnings-report Q&A pairs (proprietary) | |
| 3.2 Baselines | |
| - Baseline LLM: GPT-3.5-turbo with zero-shot prompting | |
| - Fine-tuned LLM: GPT-3.5 fine-tuned on domain data (5K examples) | |
| - Traditional QA: BiDAF + BERT (Devlin et al., 2019) | |
| 4. Results | |
| 4.1 Hallucination Reduction | |
| RAG achieves a 77% relative reduction in hallucination rate compared to the baseline (4.2% vs. 18.3%, p<0.001). The fine-tuned LLM shows a more moderate improvement (11.7%), which suggests that retrieval adds grounding value beyond what fine-tuning alone provides. | |
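The 77% figure is a relative reduction and follows directly from the two reported rates; the absolute drop is 14.1 percentage points:

```latex
\[
\frac{18.3 - 4.2}{18.3} = \frac{14.1}{18.3} \approx 0.770
\]
```

If the 11.7% reported for the fine-tuned model is read as its hallucination rate (an assumption, since the text does not say so explicitly), the same calculation gives a relative reduction of about 36%:

```latex
\[
\frac{18.3 - 11.7}{18.3} = \frac{6.6}{18.3} \approx 0.361
\]
```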
| 4.2 Answer Quality | |
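| ROUGE-L scores: RAG (0.74), Baseline (0.71), Fine-tuned (0.76). F1 on factual spans: RAG (0.82), Baseline (0.68), Fine-tuned (0.79). RAG attains the highest factual-span F1 while its ROUGE-L stays within 0.02 of the fine-tuned model, so it balances accuracy and fluency. | |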
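For readers unfamiliar with the metric, ROUGE-L is based on the longest common subsequence (LCS) between a generated answer and a reference answer. The sketch below computes a sentence-level ROUGE-L F1 over lowercased, whitespace-split tokens. The paper does not state which ROUGE-L variant, tokenizer, or F-measure weighting was used, so the choices here are assumptions, and the function names are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Sentence-level ROUGE-L as the F1 of LCS precision and LCS recall.

    Assumption: tokens are lowercased whitespace-separated words.
    """
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

Because the LCS rewards in-order overlap with the reference wording, ROUGE-L tracks surface similarity more than factual correctness, which is why the factual-span F1 is reported alongside it.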
| 4.3 Chunk Size Analysis | |
| Optimal chunk sizes: Legal (800 tokens, precision: 0.79), Medical (500 tokens, precision: 0.84), Financial (600 tokens, precision: 0.81). Larger chunks provide context but increase noise; smaller chunks improve precision but fragment information. | |
| 5. Discussion | |
| RAG is particularly effective when: (1) Knowledge is dynamic and updated frequently, (2) Verifiable sources are critical (legal, medical), (3) Domain-specific terminology requires grounding. Limitations include: (1) Retrieval latency (150 ms overhead), (2) Dependence on document quality, (3) Context window constraints. | |
| 6. Conclusion | |
| This work provides empirical evidence that RAG significantly reduces hallucination while maintaining answer quality. Practitioners should adopt hybrid retrieval, domain-tuned chunk sizes, and explicit citation mechanisms. Future work includes: multi-hop reasoning, conversational context tracking, and real-time knowledge updates. | |
| References | |
Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
| Hearst, M. (1997). TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages | |
| Hendrycks, D., et al. (2021). CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review | |
Ji, Z., et al. (2023). Survey of Hallucination in Natural Language Generation
Jin, D., et al. (2021). What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams
Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
| Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks | |