# HUST RAG — Student Regulations Q&A System A Retrieval-Augmented Generation (RAG) system that helps students query academic regulations and policies at Hanoi University of Science and Technology (HUST). The system processes Markdown-based regulation documents, stores them in a vector database, and uses a hybrid retrieval pipeline with reranking to provide accurate, context-grounded answers through a conversational chat interface. --- ## ✨ Key Features - **Hybrid Search** — Combines vector similarity search (ChromaDB) with BM25 keyword matching for both semantic and lexical retrieval - **Reranking** — Uses Qwen3-Reranker-8B via SiliconFlow API to re-score and sort retrieved documents by relevance - **Small-to-Big Retrieval** — Summarizes large tables with an LLM, embeds the summary for search, and returns the full original table at query time - **4 Retrieval Modes** — `vector_only`, `bm25_only`, `hybrid`, `hybrid_rerank` — configurable per query - **Incremental Data Build** — Hash-based change detection ensures only modified files are re-processed when rebuilding the database - **Streaming Chat UI** — Gradio-based conversational interface with real-time response streaming - **RAGAS Evaluation** — Built-in evaluation pipeline using the RAGAS framework with metrics like faithfulness, relevancy, precision, recall, and ROUGE scores --- ## 🏗️ System Architecture ``` ┌────────────────────────────────────────────────────────────────────┐ │ User Query (Gradio UI) │ └──────────────────────────────┬─────────────────────────────────────┘ │ ▼ ┌──────────────────────────────────────────────────────────────────────┐ │ Retrieval Pipeline │ │ │ │ ┌──────────────┐ ┌──────────────┐ ┌────────────────────────┐ │ │ │ Vector Search │ + │ BM25 Search │ → │ Ensemble (weighted) │ │ │ │ (ChromaDB) │ │ (rank-bm25) │ │ vector:0.5 + bm25:0.5 │ │ │ └──────────────┘ └──────────────┘ └──────────┬─────────────┘ │ │ │ │ │ ▼ │ │ ┌──────────────────┐ │ │ │ Qwen3-Reranker │ │ │ │ (SiliconFlow API) │ │ 
│ └────────┬─────────┘ │ │ │ │ │ Small-to-Big: │ │ │ summary hit → │ │ │ swap w/ parent │ │ └───────────────────────────────────────────┬──────────────────────────┘ │ ▼ ┌──────────────────────────────────────────────────────────────────────┐ │ Context Builder + LLM │ │ │ │ Context (top-k docs + metadata) → Prompt → LLM (Groq API) │ │ → Streaming Response │ └──────────────────────────────────────────────────────────────────────┘ ``` --- ## 📁 Project Structure ``` DoAn/ ├── core/ # Core application modules │ ├── rag/ # RAG engine │ │ ├── chunk.py # Markdown chunking with table extraction & Small-to-Big │ │ ├── embedding_model.py # Qwen3-Embedding wrapper (SiliconFlow API) │ │ ├── vector_store.py # ChromaDB wrapper with parent node storage │ │ ├── retrieval.py # Hybrid retriever + SiliconFlow reranker │ │ └── generator.py # Context builder & prompt construction │ ├── gradio/ # Chat interfaces │ │ ├── user_gradio.py # Main Gradio app (production + debug modes) │ │ └── gradio_rag.py # Debug mode launcher (thin wrapper) │ └── hash_file/ # File hashing utilities │ └── hash_file.py # SHA-256 hash processor for change detection │ ├── scripts/ # Workflow scripts │ ├── run_app.py # Application entry point (data check + env check + launch) │ ├── build_data.py # Build/update ChromaDB from markdown files │ ├── download_data.py # Download data from HuggingFace │ └── run_eval.py # Run RAGAS evaluation │ ├── evaluation/ # Evaluation pipeline │ ├── eval_utils.py # Shared utilities (RAG init, answer generation) │ └── ragas_eval.py # RAGAS evaluation with multiple metrics │ ├── test/ # Unit tests │ ├── test_chunk.py # Chunking logic tests │ │ ├── data/ # Data directory (downloaded from HuggingFace) │ ├── data_process/ # Processed markdown files │ └── chroma/ # ChromaDB persistence directory │ ├── requirements.txt # Python dependencies ├── setup.sh # Linux/Mac setup script ├── setup.bat # Windows setup script └── .env # API keys (not tracked in git) ``` --- ## 🚀 Getting Started ### 
Prerequisites - **Python 3.10+** - **API Keys:** - [SiliconFlow](https://siliconflow.ai/) — for embedding (Qwen3-Embedding-4B) and reranking (Qwen3-Reranker-8B) - [Groq](https://groq.com/) — for LLM generation (Qwen3-32B) ### Quick Setup (Recommended) Run the automated setup script which creates a virtual environment, installs dependencies, downloads data, and creates the `.env` file: ```bash # Linux / macOS bash setup.sh # Windows setup.bat ``` Then edit `.env` with your API keys: ```env SILICONFLOW_API_KEY=your_siliconflow_key GROQ_API_KEY=your_groq_key ``` ### Manual Setup ```bash # 1. Create and activate virtual environment python3 -m venv venv source venv/bin/activate # Linux/Mac # venv\Scripts\activate # Windows # 2. Install dependencies pip install -r requirements.txt # 3. Download data from HuggingFace python scripts/download_data.py # 4. Create .env file with your API keys echo "SILICONFLOW_API_KEY=your_key" > .env echo "GROQ_API_KEY=your_key" >> .env ``` ### Running the Application ```bash source venv/bin/activate # Linux/Mac python scripts/run_app.py ``` Access the chat interface at: **http://127.0.0.1:7860** ### Running with FastAPI (API Mode) ```bash source venv/bin/activate python core/api/server.py ``` - API server: **http://127.0.0.1:8000** - Chat UI: **http://127.0.0.1:8000/** or open `core/api/static/index.html` directly - API endpoint: `POST /api/chat` with `{"message": "your question"}` --- ## 🐳 Docker Deployment ### Quick Start (Docker Compose) ```bash # 1. Make sure data/ folder exists (download first if needed) python scripts/download_data.py # 2. Create .env with API keys echo "SILICONFLOW_API_KEY=your_key" > .env echo "GROQ_API_KEY=your_key" >> .env # 3. Build and run docker compose up --build -d # Access at http://localhost:8000 ``` ### Manual Docker Build & Run ```bash # Build image docker build -t hust-rag-api . 
# Run container
docker run -d \
  -p 8000:8000 \
  -v $(pwd)/data:/app/data \
  --env-file .env \
  --name hust-rag \
  hust-rag-api
```

### Deploy to AWS (ECR + EC2)

**Step 1 — Build & push image to ECR:**

```bash
# Login to ECR (replace <account-id> with your AWS account ID)
aws ecr get-login-password --region ap-southeast-1 | \
  docker login --username AWS --password-stdin <account-id>.dkr.ecr.ap-southeast-1.amazonaws.com

# Create repository (first time only)
aws ecr create-repository --repository-name hust-rag-api

# Tag and push
docker tag hust-rag-api:latest <account-id>.dkr.ecr.ap-southeast-1.amazonaws.com/hust-rag-api:latest
docker push <account-id>.dkr.ecr.ap-southeast-1.amazonaws.com/hust-rag-api:latest
```

**Step 2 — Run on EC2:**

```bash
# Pull image
docker pull <account-id>.dkr.ecr.ap-southeast-1.amazonaws.com/hust-rag-api:latest

# Upload data to EC2 (replace <ec2-host> with the instance's public DNS or IP)
scp -r data/ ec2-user@<ec2-host>:/home/ec2-user/data

# Run container (use the pulled ECR image name)
docker run -d \
  -p 8000:8000 \
  -v /home/ec2-user/data:/app/data \
  -e GROQ_API_KEY=your_key \
  -e SILICONFLOW_API_KEY=your_key \
  --restart unless-stopped \
  --name hust-rag \
  <account-id>.dkr.ecr.ap-southeast-1.amazonaws.com/hust-rag-api:latest
```

### Docker Notes

- The `data/` directory is **mounted as a volume** — not baked into the image
- API keys are passed via environment variables or `.env` file — never stored in the image
- To update: rebuild image → push → pull on EC2 → restart container

---

## 📖 Usage Guide

### Chat Interface

The Gradio chat interface supports natural language questions about HUST student regulations. Example questions:

| Question | Topic |
|----------|-------|
| Sinh viên vi phạm quy chế thi thì bị xử lý như thế nào? | Exam violation penalties |
| Điều kiện để đổi ngành là gì? | Major transfer requirements |
| Làm thế nào để đăng ký hoãn thi? | Exam postponement registration |

### Debug Mode

To launch the debug interface that shows retrieved documents and relevance scores:

```bash
python core/gradio/gradio_rag.py
```

### Building/Updating the Database

When you add, modify, or delete markdown files in `data/data_process/`, rebuild the database:

```bash
# Incremental update (only changed files)
python scripts/build_data.py

# Force full rebuild
python scripts/build_data.py --force

# Skip orphan deletion
python scripts/build_data.py --no-delete
```

The build script will:

1. Detect changed files via SHA-256 hash comparison
2. Delete chunks from removed files
3. Re-chunk and re-embed only modified files
4. Automatically invalidate the BM25 cache

---

## 🔧 Core Components

### Chunking (`core/rag/chunk.py`)

Processes Markdown documents into searchable chunks:

| Feature | Description |
|---------|-------------|
| **YAML Frontmatter Extraction** | Parses metadata (document type, year, cohort, program) into chunk metadata |
| **Heading-based Splitting** | Uses `MarkdownNodeParser` to split by headings, preserving document structure |
| **Table Extraction & Splitting** | Extracts Markdown tables, splits large tables into chunks of 15 rows |
| **Small-to-Big Pattern** | Summarizes tables with LLM → embeds summary → links to parent (full table) |
| **Small Chunk Merging** | Merges chunks smaller than 200 characters with adjacent chunks |
| **Metadata Enrichment** | Extracts course names and codes from content using regex patterns |

**Configuration:**

```python
CHUNK_SIZE = 1500          # Maximum chunk size in characters
CHUNK_OVERLAP = 150        # Overlap between consecutive chunks
MIN_CHUNK_SIZE = 200       # Minimum chunk size (smaller chunks get merged)
TABLE_ROWS_PER_CHUNK = 15  # Maximum rows per table chunk
```

### Embedding (`core/rag/embedding_model.py`)

- **Model:** Qwen3-Embedding-4B via SiliconFlow API
- **Dimensions:** 2048
- **Batch processing** with configurable batch size (default: 16)
- **Rate limit handling** with
exponential backoff retry ### Vector Store (`core/rag/vector_store.py`) - **Backend:** ChromaDB with LangChain integration - **Parent node storage:** Separate JSON file for Small-to-Big parent nodes (not embedded) - **Content-based document IDs:** SHA-256 hash of (source_file, header_path, chunk_index, content) - **Metadata flattening:** Converts complex metadata types to ChromaDB-compatible formats - **Batch operations:** `add_documents()` and `upsert_documents()` with configurable batch size ### Retrieval (`core/rag/retrieval.py`) | Mode | Description | |------|-------------| | `vector_only` | Pure vector similarity search via ChromaDB | | `bm25_only` | Pure keyword matching via BM25 (with lazy-load and disk caching) | | `hybrid` | Ensemble of vector + BM25 with configurable weights (default: 0.5/0.5) | | `hybrid_rerank` | Hybrid search followed by Qwen3-Reranker-8B reranking **(default)** | **Small-to-Big at retrieval time:** When a table summary node is retrieved, it is automatically swapped with the full parent table before returning results to the user. **Configuration:** ```python rerank_model = "Qwen/Qwen3-Reranker-8B" # Reranker model initial_k = 25 # Documents fetched before reranking top_k = 5 # Final documents returned vector_weight = 0.5 # Weight for vector search bm25_weight = 0.5 # Weight for BM25 search ``` ### Generator (`core/rag/generator.py`) - Builds rich context strings with metadata (source, document type, year, cohort, program, faculty) - Constructs prompts with a Vietnamese system prompt that enforces context-grounded answers - `RAGContextBuilder` combines retrieval and context preparation into a single step --- ## 📊 Evaluation The project includes a RAGAS-based evaluation pipeline. 
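The Small-to-Big swap described in the Retrieval section above (and exercised by the evaluation below) is simple to state in code. A minimal sketch, assuming hypothetical `hits`/`parents` shapes rather than the project's actual data structures (the real logic lives in `core/rag/retrieval.py`):

```python
# Small-to-Big at retrieval time: the summary chunk is what gets embedded
# and searched, but the full parent table is what gets returned to the LLM.
# Hypothetical data shapes, for illustration only.

def swap_small_to_big(hits: list[dict], parents: dict[str, str]) -> list[dict]:
    """Replace any summary hit that links to a parent with the parent's full text."""
    swapped = []
    for hit in hits:
        parent_id = hit["metadata"].get("parent_id")
        if parent_id in parents:
            # Swap the small (summary) content for the big (full table) content
            hit = {**hit, "content": parents[parent_id]}
        swapped.append(hit)
    return swapped

# Toy example: the second hit is a table summary linked to parent "tbl-01"
hits = [
    {"content": "Article 5: exam rules...", "metadata": {}},
    {"content": "Summary: tuition table", "metadata": {"parent_id": "tbl-01"}},
]
parents = {"tbl-01": "| Program | Tuition |\n| CT1 | 12M VND |"}
result = swap_small_to_big(hits, parents)
print(result[1]["content"])  # prints the full table, not the summary
```

Keeping parent nodes in a separate (non-embedded) store, as the vector-store section describes, is what makes this swap cheap: it is a dictionary lookup, not a second retrieval pass.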
### Running Evaluation

```bash
# Evaluate with default settings (10 samples, hybrid_rerank)
python scripts/run_eval.py

# Custom sample size and mode
python scripts/run_eval.py --samples 50 --mode hybrid_rerank

# Run all retrieval modes for comparison
python scripts/run_eval.py --samples 20 --mode all
```

### Metrics

| Metric | Description |
|--------|-------------|
| **Faithfulness** | How well the answer is grounded in the retrieved context |
| **Answer Relevancy** | How relevant the answer is to the question |
| **Context Precision** | How precise the retrieved contexts are |
| **Context Recall** | How well the retrieved contexts cover the ground truth |
| **ROUGE-1 / ROUGE-2 / ROUGE-L** | N-gram overlap with ground truth answers |

### Results

Benchmark on HUST student regulation Q&A dataset (200 samples):

| Metric | vector_only | bm25_only | hybrid | hybrid_rerank |
|---------------------|:-----------:|:---------:|:------:|:-------------:|
| **Answer Relevancy** | 0.749 | 0.635 | 0.832 | **0.872** |
| **Context Precision** | 0.678 | 0.538 | 0.795 | **0.861** |
| **Context Recall** | 0.815 | 0.732 | 0.849 | **0.872** |
| **Faithfulness** | 0.912 | 0.938 | **0.942** | 0.937 |
| **ROUGE-1** | 0.557 | 0.533 | 0.576 | **0.598** |
| **ROUGE-2** | 0.408 | 0.385 | 0.421 | **0.439** |
| **ROUGE-L** | 0.526 | 0.508 | 0.545 | **0.567** |

**Key takeaways:**

- **`hybrid_rerank` achieves the best scores in 6 out of 7 metrics**, confirming it as the optimal default retrieval mode.
- **Faithfulness is consistently high (>0.91 across all modes)**, meaning the LLM reliably grounds its answers in the provided context with minimal hallucination; this is the one metric where `hybrid` (0.942) narrowly edges out `hybrid_rerank` (0.937).
- **Reranking significantly boosts Context Precision** (+60% over BM25-only, +8% over hybrid), demonstrating the value of Qwen3-Reranker in filtering irrelevant documents.
- **Hybrid search substantially outperforms single-mode retrieval**, validating the ensemble approach of combining semantic (vector) and lexical (BM25) search.
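The ensemble step validated by these numbers merges two ranked result lists into one. LangChain's `EnsembleRetriever` does this with weighted Reciprocal Rank Fusion (RRF); whether this project relies on that exact fusion is an assumption, so the sketch below (with a hypothetical `rrf_fuse` and toy document IDs) is illustrative only:

```python
# Weighted Reciprocal Rank Fusion: combine ranked lists from several
# retrievers, giving each retriever a weight (default here mirrors the
# project's 0.5/0.5 vector/BM25 split). Illustrative sketch, not project code.

def rrf_fuse(rankings: dict[str, list[str]], weights: dict[str, float], k: int = 60) -> list[str]:
    """Fuse ranked doc-id lists; higher-ranked docs contribute more score."""
    scores: dict[str, float] = {}
    for retriever, ranked_ids in rankings.items():
        w = weights[retriever]
        for rank, doc_id in enumerate(ranked_ids):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: "d1" ranks well in both lists, so it wins the fusion
fused = rrf_fuse(
    rankings={"vector": ["d1", "d2", "d3"], "bm25": ["d3", "d1", "d4"]},
    weights={"vector": 0.5, "bm25": 0.5},
)
print(fused[0])  # prints "d1"
```

Rank-based fusion like this is why hybrid search helps even when vector and BM25 scores are on incomparable scales: only the positions in each list matter, not the raw scores.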
Results are saved to `evaluation/results/` as both JSON and CSV files with timestamps. --- ## 🧪 Testing ```bash # Run all tests pytest test/ -v # Run specific test module pytest test/test_chunk.py -v pytest test/test_retrieval.py -v # Run with coverage pytest test/ --cov=core --cov-report=term-missing ``` --- ## 🛠️ Technology Stack | Category | Technology | |----------|------------| | **Embedding** | Qwen3-Embedding-4B (SiliconFlow API) | | **Reranking** | Qwen3-Reranker-8B (SiliconFlow API) | | **LLM** | Qwen3-32B (Groq API) | | **Vector Database** | ChromaDB | | **Keyword Search** | BM25 (rank-bm25) | | **Framework** | LangChain + LlamaIndex (chunking) | | **UI** | Gradio | | **Evaluation** | RAGAS | | **Language** | Python 3.10+ | --- ## 📦 Data The processed data is hosted on HuggingFace: [hungnha/do_an_tot_nghiep](https://huggingface.co/datasets/hungnha/do_an_tot_nghiep) **Manual download:** ```bash huggingface-cli download hungnha/do_an_tot_nghiep --repo-type dataset --local-dir ./data ``` The data directory contains: - `data_process/` — Processed Markdown regulation documents - `chroma/` — ChromaDB persistence files (vector index + parent nodes) - `data.csv` — Evaluation dataset (questions + ground truth answers) --- ## 📄 License This project is developed as an undergraduate thesis at Hanoi University of Science and Technology (HUST).