Retrieval-Augmented Generation for Domain-Specific Question Answering: Methodology and Evaluation

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a promising approach to mitigate hallucination in Large Language Models (LLMs) by grounding responses in retrieved evidence from external knowledge sources. This paper presents a systematic methodology for implementing RAG systems in domain-specific contexts, with empirical evaluation on legal, medical, and financial datasets. We propose a three-stage pipeline: (1) document chunking with semantic boundary detection, (2) hybrid retrieval combining dense embeddings and sparse keyword matching, and (3) context-aware generation with citation tracking. Our experiments demonstrate that RAG reduces hallucination rates from 18.3% (baseline LLM) to 4.2% while maintaining answer quality (ROUGE-L: 0.74 vs 0.71, p=0.03). We introduce a novel evaluation framework measuring factual accuracy, source attribution, and answer completeness. Results show that optimal chunk size varies by domain (legal: 800 tokens, medical: 500 tokens, financial: 600 tokens), and hybrid retrieval outperforms pure dense or sparse methods by 12-15% on recall@10. This work provides practitioners with evidence-based guidelines for designing production-grade RAG systems.

1. Introduction

Large Language Models demonstrate impressive capabilities but suffer from hallucination: they generate plausible but factually incorrect information (Ji et al., 2023). Retrieval-Augmented Generation addresses this limitation by retrieving relevant documents and conditioning generation on that retrieved evidence (Lewis et al., 2020).

2. Methodology

2.1 Document Processing Pipeline
Input documents pass through four steps: (1) format normalization (PDF/DOCX/HTML → text), (2) semantic chunking using the TextTiling algorithm (Hearst, 1997) to detect topic boundaries, (3) metadata extraction (source, date, author, section), and (4) embedding generation using sentence-transformers/multi-qa-mpnet-base-dot-v1 (Reimers & Gurevych, 2019).
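
To make the chunking step concrete, the following is a minimal sketch of boundary detection based on lexical cohesion. It is a simplified stand-in for TextTiling, not the algorithm itself: it compares adjacent sentences directly rather than smoothing similarity over blocks of token sequences. The function name chunk_by_cohesion, the threshold of 0.1, and the word-based size cap are illustrative assumptions and do not come from our pipeline.

```python
import math
import re
from collections import Counter


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_by_cohesion(text: str, threshold: float = 0.1, max_words: int = 500) -> list[str]:
    """Split text where adjacent sentences share little vocabulary.

    Hypothetical helper: threshold and max_words are assumed values,
    and size is counted in words rather than model tokens.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    bags = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    chunks: list[str] = []
    current = [sentences[0]]
    current_len = sum(bags[0].values())
    for i in range(1, len(sentences)):
        n_words = sum(bags[i].values())
        # A low similarity to the previous sentence suggests a topic shift.
        topic_shift = _cosine(bags[i - 1], bags[i]) < threshold
        if topic_shift or current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentences[i])
        current_len += n_words
    chunks.append(" ".join(current))
    return chunks
```

In this sketch the max_words cap plays the role of the per-domain chunk size; a production system would count tokens with the embedding model's tokenizer instead.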

2.2 Retrieval Strategy
We implement hybrid retrieval combining:
- Dense retrieval: Cosine similarity on 768-dim embeddings
- Sparse retrieval: BM25 with domain-specific vocabulary
- Reranking: cross-encoder/ms-marco-MiniLM-L-6-v2 scores top-20 candidates
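
For readers unfamiliar with the sparse component, the sketch below scores tokenized documents with Okapi BM25 in plain Python. The parameter values k1 = 1.5 and b = 0.75 and the non-negative IDF variant are common defaults assumed here for illustration; the domain-specific vocabulary handling is not shown.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: long documents are penalized via b.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```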

Fusion formula: score = 0.6 * dense_score + 0.3 * sparse_score + 0.1 * rerank_score
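
The fusion formula can be sketched as follows. Dense, sparse, and reranker scores usually live on different scales, so this sketch min-max normalizes each score set before applying the 0.6/0.3/0.1 weights. The normalization step, the treatment of missing scores as 0, and the function names are assumptions made for illustration; the formula above does not specify them.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant score set maps to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {cid: 0.0 for cid in scores}
    return {cid: (v - lo) / (hi - lo) for cid, v in scores.items()}


def fuse(dense: dict[str, float], sparse: dict[str, float], rerank: dict[str, float],
         weights: tuple[float, float, float] = (0.6, 0.3, 0.1)) -> list[tuple[str, float]]:
    """Combine per-chunk scores with a weighted sum and return a ranking."""
    w_dense, w_sparse, w_rerank = weights
    d, s, r = min_max(dense), min_max(sparse), min_max(rerank)
    fused = {
        # A chunk missing from one score set contributes 0 for that term (assumption).
        cid: w_dense * d.get(cid, 0.0) + w_sparse * s.get(cid, 0.0) + w_rerank * r.get(cid, 0.0)
        for cid in set(d) | set(s) | set(r)
    }
    # Highest fused score first; ties are broken by chunk id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because the reranker only scores a candidate subset, chunks outside that subset receive no rerank contribution in this sketch.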

2.3 Generation with Attribution
Retrieved context (top-4 chunks) is formatted as:
[Context 1] <chunk1_text> [Source: doc_name, page X]
[Context 2] <chunk2_text> [Source: doc_name, page Y]

Prompt template enforces citation: "Answer the question using ONLY information from the provided context. Cite sources using [Source X] notation. If the context does not contain sufficient information, state 'Insufficient information in provided documents.'"
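
A minimal sketch of assembling this prompt is given below. The context line format and the instruction text follow the description above; the Chunk dataclass, the build_prompt name, and the trailing "Question:" / "Answer:" layout are hypothetical choices made for illustration.

```python
from dataclasses import dataclass

INSTRUCTION = (
    "Answer the question using ONLY information from the provided context. "
    "Cite sources using [Source X] notation. If the context does not contain "
    "sufficient information, state 'Insufficient information in provided documents.'"
)


@dataclass
class Chunk:
    """A retrieved passage with the metadata needed for attribution."""
    text: str
    doc_name: str
    page: int


def build_prompt(question: str, chunks: list[Chunk], top_k: int = 4) -> str:
    """Format the top-k chunks as numbered context blocks followed by the question."""
    context_lines = [
        f"[Context {i}] {c.text} [Source: {c.doc_name}, page {c.page}]"
        for i, c in enumerate(chunks[:top_k], start=1)
    ]
    context = "\n".join(context_lines)
    # The Question/Answer suffix is an assumed layout, not part of the template above.
    return f"{INSTRUCTION}\n\n{context}\n\nQuestion: {question}\nAnswer:"
```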

3. Experimental Setup

3.1 Datasets
- Legal: 500 contract Q&A pairs from CUAD dataset (Hendrycks et al., 2021)
- Medical: 400 clinical Q&A from MedQA (Jin et al., 2021)
- Financial: 300 earnings report Q&A (proprietary)

3.2 Baselines
- Baseline LLM: GPT-3.5-turbo with zero-shot prompting
- Fine-tuned LLM: GPT-3.5 fine-tuned on domain data (5K examples)
- Traditional QA: BiDAF + BERT (Devlin et al., 2019)

4. Results

4.1 Hallucination Reduction
RAG achieves a 77% relative reduction in hallucination rate compared to the baseline (4.2% vs 18.3%, p<0.001). The fine-tuned LLM shows a more moderate improvement (11.7% hallucination rate), which indicates that retrieval adds grounding beyond what fine-tuning alone provides.
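
The 77% figure is a relative reduction, which can be checked with the short calculation below. The second call assumes that the 11.7% reported for the fine-tuned model is also a hallucination rate; the helper name relative_reduction is illustrative.

```python
def relative_reduction(baseline_rate: float, system_rate: float) -> float:
    """Fraction by which a system lowers the baseline rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - system_rate) / baseline_rate


# RAG vs. baseline: (18.3 - 4.2) / 18.3
rag = relative_reduction(18.3, 4.2)
# Fine-tuned vs. baseline, assuming 11.7% is a hallucination rate.
fine_tuned = relative_reduction(18.3, 11.7)
print(f"RAG: {rag:.1%}, fine-tuned: {fine_tuned:.1%}")
```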

4.2 Answer Quality
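ROUGE-L scores: RAG (0.74), Baseline (0.71), Fine-tuned (0.76). F1 on factual spans: RAG (0.82), Baseline (0.68), Fine-tuned (0.79). RAG achieves the highest factual-span F1 while trailing the fine-tuned model by only 0.02 ROUGE-L, so it balances accuracy and fluency.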
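
For reference, ROUGE-L is based on the longest common subsequence (LCS) between a candidate answer and a reference answer. The sketch below computes a sentence-level ROUGE-L F1 with lowercased whitespace tokenization; the tokenization and the equal weighting of precision and recall are assumptions, and an evaluation toolkit may differ in both.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Sentence-level ROUGE-L as the harmonic mean of LCS precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, the candidate "the lease ends in 2025" against the reference "the lease term ends in 2025" has an LCS of 5 tokens, giving precision 1, recall 5/6, and F1 of 10/11.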

4.3 Chunk Size Analysis
Optimal chunk sizes: Legal (800 tokens, precision: 0.79), Medical (500 tokens, precision: 0.84), Financial (600 tokens, precision: 0.81). Larger chunks provide context but increase noise; smaller chunks improve precision but fragment information.

5. Discussion

RAG is particularly effective when: (1) Knowledge is dynamic and updated frequently, (2) Verifiable sources are critical (legal, medical), (3) Domain-specific terminology requires grounding. Limitations include: (1) Retrieval latency (150ms overhead), (2) Dependence on document quality, (3) Context window constraints.

6. Conclusion

This work provides empirical evidence that RAG significantly reduces hallucination while maintaining answer quality. Practitioners should adopt hybrid retrieval, domain-tuned chunk sizes, and explicit citation mechanisms. Future work includes: multi-hop reasoning, conversational context tracking, and real-time knowledge updates.

References
Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Hearst, M. (1997). TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages.
Hendrycks, D., et al. (2021). CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review.
Ji, Z., et al. (2023). Survey of Hallucination in Natural Language Generation.
Jin, D., et al. (2021). What Disease Does This Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams.
Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.