# HUST RAG — Student Regulations Q&A System A Retrieval-Augmented Generation (RAG) system that helps students query academic regulations and policies at Hanoi University of Science and Technology (HUST). The system processes Markdown-based regulation documents, stores them in a vector database, and uses a hybrid retrieval pipeline with reranking to provide accurate, context-grounded answers through a conversational chat interface. --- ## ✨ Key Features - **Hybrid Search** — Combines vector similarity search (ChromaDB) with BM25 keyword matching for both semantic and lexical retrieval - **Reranking** — Uses Qwen3-Reranker-8B via SiliconFlow API to re-score and sort retrieved documents by relevance - **Small-to-Big Retrieval** — Summarizes large tables with an LLM, embeds the summary for search, and returns the full original table at query time - **4 Retrieval Modes** — `vector_only`, `bm25_only`, `hybrid`, `hybrid_rerank` — configurable per query - **Incremental Data Build** — Hash-based change detection ensures only modified files are re-processed when rebuilding the database - **Streaming Chat UI** — Gradio-based conversational interface with real-time response streaming - **RAGAS Evaluation** — Built-in evaluation pipeline using the RAGAS framework with metrics like faithfulness, relevancy, precision, recall, and ROUGE scores --- ## 🏗️ System Architecture ``` ┌────────────────────────────────────────────────────────────────────┐ │ User Query (Gradio UI) │ └──────────────────────────────┬─────────────────────────────────────┘ │ ▼ ┌──────────────────────────────────────────────────────────────────────┐ │ Retrieval Pipeline │ │ │ │ ┌──────────────┐ ┌──────────────┐ ┌────────────────────────┐ │ │ │ Vector Search │ + │ BM25 Search │ → │ Ensemble (weighted) │ │ │ │ (ChromaDB) │ │ (rank-bm25) │ │ vector:0.5 + bm25:0.5 │ │ │ └──────────────┘ └──────────────┘ └──────────┬─────────────┘ │ │ │ │ │ ▼ │ │ ┌──────────────────┐ │ │ │ Qwen3-Reranker │ │ │ │ (SiliconFlow API) │ │ 
│ └────────┬─────────┘ │ │ │ │ │ Small-to-Big: │ │ │ summary hit → │ │ │ swap w/ parent │ │ └───────────────────────────────────────────┬──────────────────────────┘ │ ▼ ┌──────────────────────────────────────────────────────────────────────┐ │ Context Builder + LLM │ │ │ │ Context (top-k docs + metadata) → Prompt → LLM (Groq API) │ │ → Streaming Response │ └──────────────────────────────────────────────────────────────────────┘ ``` --- ## 📁 Project Structure ``` DoAn/ ├── core/ # Core application modules │ ├── rag/ # RAG engine │ │ ├── chunk.py # Markdown chunking with table extraction & Small-to-Big │ │ ├── embedding_model.py # Qwen3-Embedding wrapper (SiliconFlow API) │ │ ├── vector_store.py # ChromaDB wrapper with parent node storage │ │ ├── retrieval.py # Hybrid retriever + SiliconFlow reranker │ │ └── generator.py # Context builder & prompt construction │ ├── gradio/ # Chat interfaces │ │ ├── user_gradio.py # Main Gradio app (production + debug modes) │ │ └── gradio_rag.py # Debug mode launcher (thin wrapper) │ └── hash_file/ # File hashing utilities │ └── hash_file.py # SHA-256 hash processor for change detection │ ├── scripts/ # Workflow scripts │ ├── run_app.py # Application entry point (data check + env check + launch) │ ├── build_data.py # Build/update ChromaDB from markdown files │ ├── download_data.py # Download data from HuggingFace │ └── run_eval.py # Run RAGAS evaluation │ ├── evaluation/ # Evaluation pipeline │ ├── eval_utils.py # Shared utilities (RAG init, answer generation) │ └── ragas_eval.py # RAGAS evaluation with multiple metrics │ ├── test/ # Unit tests │ ├── test_chunk.py # Chunking logic tests │ │ ├── data/ # Data directory (downloaded from HuggingFace) │ ├── data_process/ # Processed markdown files │ └── chroma/ # ChromaDB persistence directory │ ├── requirements.txt # Python dependencies ├── setup.sh # Linux/Mac setup script ├── setup.bat # Windows setup script └── .env # API keys (not tracked in git) ``` --- ## 🚀 Getting Started ### 
Prerequisites - **Python 3.10+** - **API Keys:** - [SiliconFlow](https://siliconflow.ai/) — for embedding (Qwen3-Embedding-4B) and reranking (Qwen3-Reranker-8B) - [Groq](https://groq.com/) — for LLM generation (Qwen3-32B) ### Quick Setup (Recommended) Run the automated setup script which creates a virtual environment, installs dependencies, downloads data, and creates the `.env` file: ```bash # Linux / macOS bash setup.sh # Windows setup.bat ``` Then edit `.env` with your API keys: ```env SILICONFLOW_API_KEY=your_siliconflow_key GROQ_API_KEY=your_groq_key ``` ### Manual Setup ```bash # 1. Create and activate virtual environment python3 -m venv venv source venv/bin/activate # Linux/Mac # venv\Scripts\activate # Windows # 2. Install dependencies pip install -r requirements.txt # 3. Download data from HuggingFace python scripts/download_data.py # 4. Create .env file with your API keys echo "SILICONFLOW_API_KEY=your_key" > .env echo "GROQ_API_KEY=your_key" >> .env ``` ### Running the Application ```bash source venv/bin/activate # Linux/Mac python scripts/run_app.py ``` Access the chat interface at: **http://127.0.0.1:7860** ### Running with FastAPI (API Mode) ```bash source venv/bin/activate python core/api/server.py ``` - API server: **http://127.0.0.1:8000** - Chat UI: **http://127.0.0.1:8000/** or open `core/api/static/index.html` directly - API endpoint: `POST /api/chat` with `{"message": "your question"}` --- ## 🐳 Docker Deployment ### Quick Start (Docker Compose) ```bash # 1. Make sure data/ folder exists (download first if needed) python scripts/download_data.py # 2. Create .env with API keys echo "SILICONFLOW_API_KEY=your_key" > .env echo "GROQ_API_KEY=your_key" >> .env # 3. Build and run docker compose up --build -d # Access at http://localhost:8000 ``` ### Manual Docker Build & Run ```bash # Build image docker build -t hust-rag-api . 
# Run container
docker run -d \
  -p 8000:8000 \
  -v $(pwd)/data:/app/data \
  --env-file .env \
  --name hust-rag \
  hust-rag-api
```

### Deploy to AWS (ECR + EC2)

**Step 1 — Build & push image to ECR:**

```bash
# Login to ECR (replace <account-id> with your AWS account ID)
aws ecr get-login-password --region ap-southeast-1 | \
  docker login --username AWS --password-stdin <account-id>.dkr.ecr.ap-southeast-1.amazonaws.com

# Create repository (first time only)
aws ecr create-repository --repository-name hust-rag-api

# Tag and push
docker tag hust-rag-api:latest <account-id>.dkr.ecr.ap-southeast-1.amazonaws.com/hust-rag-api:latest
docker push <account-id>.dkr.ecr.ap-southeast-1.amazonaws.com/hust-rag-api:latest
```

**Step 2 — Run on EC2:**

```bash
# Pull image
docker pull <account-id>.dkr.ecr.ap-southeast-1.amazonaws.com/hust-rag-api:latest

# Upload data to EC2 (replace <ec2-host> with the instance's public DNS or IP)
scp -r data/ ec2-user@<ec2-host>:/home/ec2-user/data

# Run container (use the pulled ECR image name)
docker run -d \
  -p 8000:8000 \
  -v /home/ec2-user/data:/app/data \
  -e GROQ_API_KEY=your_key \
  -e SILICONFLOW_API_KEY=your_key \
  --restart unless-stopped \
  --name hust-rag \
  <account-id>.dkr.ecr.ap-southeast-1.amazonaws.com/hust-rag-api:latest
```

### Docker Notes

- The `data/` directory is **mounted as a volume** — not baked into the image
- API keys are passed via environment variables or `.env` file — never stored in the image
- To update: rebuild image → push → pull on EC2 → restart container

---

## 📖 Usage Guide

### Chat Interface

The Gradio chat interface supports natural language questions about HUST student regulations. Example questions:

| Question | Topic |
|----------|-------|
| Sinh viên vi phạm quy chế thi thì bị xử lý như thế nào? | Exam violation penalties |
| Điều kiện để đổi ngành là gì? | Major transfer requirements |
| Làm thế nào để đăng ký hoãn thi? | Exam postponement registration |

### Debug Mode

To launch the debug interface that shows retrieved documents and relevance scores:

```bash
python core/gradio/gradio_rag.py
```

### Building/Updating the Database

When you add, modify, or delete markdown files in `data/data_process/`, rebuild the database:

```bash
# Incremental update (only changed files)
python scripts/build_data.py

# Force full rebuild
python scripts/build_data.py --force

# Skip orphan deletion
python scripts/build_data.py --no-delete
```

The build script will:

1. Detect changed files via SHA-256 hash comparison
2. Delete chunks from removed files
3. Re-chunk and re-embed only modified files
4. Automatically invalidate the BM25 cache

---

## 🔧 Core Components

### Chunking (`core/rag/chunk.py`)

Processes Markdown documents into searchable chunks:

| Feature | Description |
|---------|-------------|
| **YAML Frontmatter Extraction** | Parses metadata (document type, year, cohort, program) into chunk metadata |
| **Heading-based Splitting** | Uses `MarkdownNodeParser` to split by headings, preserving document structure |
| **Table Extraction & Splitting** | Extracts Markdown tables, splits large tables into chunks of 15 rows |
| **Small-to-Big Pattern** | Summarizes tables with LLM → embeds summary → links to parent (full table) |
| **Small Chunk Merging** | Merges chunks smaller than 200 characters with adjacent chunks |
| **Metadata Enrichment** | Extracts course names and codes from content using regex patterns |

**Configuration:**

```python
CHUNK_SIZE = 1500          # Maximum chunk size in characters
CHUNK_OVERLAP = 150        # Overlap between consecutive chunks
MIN_CHUNK_SIZE = 200       # Minimum chunk size (smaller chunks get merged)
TABLE_ROWS_PER_CHUNK = 15  # Maximum rows per table chunk
```

### Embedding (`core/rag/embedding_model.py`)

- **Model:** Qwen3-Embedding-4B via SiliconFlow API
- **Dimensions:** 2048
- **Batch processing** with configurable batch size (default: 16)
- **Rate limit handling** with
exponential backoff retry ### Vector Store (`core/rag/vector_store.py`) - **Backend:** ChromaDB with LangChain integration - **Parent node storage:** Separate JSON file for Small-to-Big parent nodes (not embedded) - **Content-based document IDs:** SHA-256 hash of (source_file, header_path, chunk_index, content) - **Metadata flattening:** Converts complex metadata types to ChromaDB-compatible formats - **Batch operations:** `add_documents()` and `upsert_documents()` with configurable batch size ### Retrieval (`core/rag/retrieval.py`) | Mode | Description | |------|-------------| | `vector_only` | Pure vector similarity search via ChromaDB | | `bm25_only` | Pure keyword matching via BM25 (with lazy-load and disk caching) | | `hybrid` | Ensemble of vector + BM25 with configurable weights (default: 0.5/0.5) | | `hybrid_rerank` | Hybrid search followed by Qwen3-Reranker-8B reranking **(default)** | **Small-to-Big at retrieval time:** When a table summary node is retrieved, it is automatically swapped with the full parent table before returning results to the user. **Configuration:** ```python rerank_model = "Qwen/Qwen3-Reranker-8B" # Reranker model initial_k = 25 # Documents fetched before reranking top_k = 5 # Final documents returned vector_weight = 0.5 # Weight for vector search bm25_weight = 0.5 # Weight for BM25 search ``` ### Generator (`core/rag/generator.py`) - Builds rich context strings with metadata (source, document type, year, cohort, program, faculty) - Constructs prompts with a Vietnamese system prompt that enforces context-grounded answers - `RAGContextBuilder` combines retrieval and context preparation into a single step --- ## 📊 Evaluation The project includes a RAGAS-based evaluation pipeline. 
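The Small-to-Big swap described in the Retrieval section above (and exercised by the evaluation below) is simple to state in code. A minimal sketch, assuming hypothetical `hits`/`parents` shapes rather than the project's actual data structures (the real logic lives in `core/rag/retrieval.py`):

```python
# Small-to-Big at retrieval time: the summary chunk is what gets embedded
# and searched, but the full parent table is what gets returned to the LLM.
# Hypothetical data shapes, for illustration only.

def swap_small_to_big(hits: list[dict], parents: dict[str, str]) -> list[dict]:
    """Replace any summary hit that links to a parent with the parent's full text."""
    swapped = []
    for hit in hits:
        parent_id = hit["metadata"].get("parent_id")
        if parent_id in parents:
            # Swap the small (summary) content for the big (full table) content
            hit = {**hit, "content": parents[parent_id]}
        swapped.append(hit)
    return swapped

# Toy example: the second hit is a table summary linked to parent "tbl-01"
hits = [
    {"content": "Article 5: exam rules...", "metadata": {}},
    {"content": "Summary: tuition table", "metadata": {"parent_id": "tbl-01"}},
]
parents = {"tbl-01": "| Program | Tuition |\n| CT1 | 12M VND |"}
result = swap_small_to_big(hits, parents)
print(result[1]["content"])  # prints the full table, not the summary
```

Keeping parent nodes in a separate (non-embedded) store, as the vector-store section describes, is what makes this swap cheap: it is a dictionary lookup, not a second retrieval pass.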
### Running Evaluation

```bash
# Evaluate with default settings (10 samples, hybrid_rerank)
python scripts/run_eval.py

# Custom sample size and mode
python scripts/run_eval.py --samples 50 --mode hybrid_rerank

# Run all retrieval modes for comparison
python scripts/run_eval.py --samples 20 --mode all
```

### Metrics

| Metric | Description |
|--------|-------------|
| **Faithfulness** | How well the answer is grounded in the retrieved context |
| **Answer Relevancy** | How relevant the answer is to the question |
| **Context Precision** | How precise the retrieved contexts are |
| **Context Recall** | How well the retrieved contexts cover the ground truth |
| **ROUGE-1 / ROUGE-2 / ROUGE-L** | N-gram overlap with ground truth answers |

### Results

Benchmark on HUST student regulation Q&A dataset (200 samples):

| Metric | vector_only | bm25_only | hybrid | hybrid_rerank |
|---------------------|:-----------:|:---------:|:------:|:-------------:|
| **Answer Relevancy** | 0.749 | 0.635 | 0.832 | **0.872** |
| **Context Precision** | 0.678 | 0.538 | 0.795 | **0.861** |
| **Context Recall** | 0.815 | 0.732 | 0.849 | **0.872** |
| **Faithfulness** | 0.912 | 0.938 | **0.942** | 0.937 |
| **ROUGE-1** | 0.557 | 0.533 | 0.576 | **0.598** |
| **ROUGE-2** | 0.408 | 0.385 | 0.421 | **0.439** |
| **ROUGE-L** | 0.526 | 0.508 | 0.545 | **0.567** |

**Key takeaways:**

- **`hybrid_rerank` achieves the best scores in 6 out of 7 metrics**, confirming it as the optimal default retrieval mode.
- **Faithfulness is consistently high (>0.91 across all modes)**, meaning the LLM reliably grounds its answers in the provided context with minimal hallucination; this is the one metric where `hybrid` (0.942) narrowly edges out `hybrid_rerank` (0.937).
- **Reranking significantly boosts Context Precision** (+60% over BM25-only, +8% over hybrid), demonstrating the value of Qwen3-Reranker in filtering irrelevant documents.
- **Hybrid search substantially outperforms single-mode retrieval**, validating the ensemble approach of combining semantic (vector) and lexical (BM25) search.
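The ensemble step validated by these numbers merges two ranked result lists into one. LangChain's `EnsembleRetriever` does this with weighted Reciprocal Rank Fusion (RRF); whether this project relies on that exact fusion is an assumption, so the sketch below (with a hypothetical `rrf_fuse` and toy document IDs) is illustrative only:

```python
# Weighted Reciprocal Rank Fusion: combine ranked lists from several
# retrievers, giving each retriever a weight (default here mirrors the
# project's 0.5/0.5 vector/BM25 split). Illustrative sketch, not project code.

def rrf_fuse(rankings: dict[str, list[str]], weights: dict[str, float], k: int = 60) -> list[str]:
    """Fuse ranked doc-id lists; higher-ranked docs contribute more score."""
    scores: dict[str, float] = {}
    for retriever, ranked_ids in rankings.items():
        w = weights[retriever]
        for rank, doc_id in enumerate(ranked_ids):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: "d1" ranks well in both lists, so it wins the fusion
fused = rrf_fuse(
    rankings={"vector": ["d1", "d2", "d3"], "bm25": ["d3", "d1", "d4"]},
    weights={"vector": 0.5, "bm25": 0.5},
)
print(fused[0])  # prints "d1"
```

Rank-based fusion like this is why hybrid search helps even when vector and BM25 scores are on incomparable scales: only the positions in each list matter, not the raw scores.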
Results are saved to `evaluation/results/` as both JSON and CSV files with timestamps. --- ## 🧪 Testing ```bash # Run all tests pytest test/ -v # Run specific test module pytest test/test_chunk.py -v pytest test/test_retrieval.py -v # Run with coverage pytest test/ --cov=core --cov-report=term-missing ``` --- ## 🛠️ Technology Stack | Category | Technology | |----------|------------| | **Embedding** | Qwen3-Embedding-4B (SiliconFlow API) | | **Reranking** | Qwen3-Reranker-8B (SiliconFlow API) | | **LLM** | Qwen3-32B (Groq API) | | **Vector Database** | ChromaDB | | **Keyword Search** | BM25 (rank-bm25) | | **Framework** | LangChain + LlamaIndex (chunking) | | **UI** | Gradio | | **Evaluation** | RAGAS | | **Language** | Python 3.10+ | --- ## 📦 Data The processed data is hosted on HuggingFace: [hungnha/do_an_tot_nghiep](https://huggingface.co/datasets/hungnha/do_an_tot_nghiep) **Manual download:** ```bash huggingface-cli download hungnha/do_an_tot_nghiep --repo-type dataset --local-dir ./data ``` The data directory contains: - `data_process/` — Processed Markdown regulation documents - `chroma/` — ChromaDB persistence files (vector index + parent nodes) - `data.csv` — Evaluation dataset (questions + ground truth answers) --- ## 📄 License This project is developed as an undergraduate thesis at Hanoi University of Science and Technology (HUST).