abhivsh committed · Commit 086f690 · verified · 1 Parent(s): ee58e34

Upload 2 files

Files changed (2)
  1. README.md +83 -0
  2. requirements.txt +19 -0
README.md ADDED
@@ -0,0 +1,83 @@
---
title: EnggSS RAG ChatBot
emoji: ⚡
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: "5.0.0"
app_file: app.py
pinned: false
license: other
---

# EnggSS RAG ChatBot

**Serving-only** HuggingFace Space — reads a pre-built private dataset; no PDF processing happens at runtime. Build the dataset locally with `preprocessing/create_dataset.py`, then deploy this Space to answer questions.

## How it works

```
Local machine (once)
PDFs → create_dataset.py → BAAI/bge-large-en-v1.5 embeddings
          │
          ▼
Private HuggingFace Dataset
          │
┌─────────────────────┘
▼ (Space startup)
Load dataset → NumPy float32 matrix (L2-normalised)
          │
          ▼ (each query, ~20 ms)
Embed query → cosine scores → MMR top-3
          │
          ▼
Qwen2.5-7B-Instruct (HF Inference API) → answer
          │
          ▼
Gradio UI
```
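The per-query path above (cosine scores over the L2-normalised matrix, then MMR re-ranking down to three chunks) can be sketched as follows. This is an illustrative sketch, not the app's actual code; `mmr_top_k` and `lambda_mult` are names chosen here:

```python
import numpy as np

def mmr_top_k(query_vec, doc_matrix, k=3, lambda_mult=0.5):
    """Maximal Marginal Relevance over L2-normalised embeddings.

    query_vec:  (d,)   unit-norm query embedding
    doc_matrix: (n, d) unit-norm chunk embeddings
    Returns indices of k chunks balancing relevance and diversity.
    """
    sims = doc_matrix @ query_vec          # cosine scores, since rows are unit-norm
    selected = [int(np.argmax(sims))]      # seed with the most relevant chunk
    while len(selected) < min(k, len(sims)):
        rest = [i for i in range(len(sims)) if i not in selected]
        # redundancy: each candidate's max similarity to anything already picked
        redundancy = (doc_matrix[rest] @ doc_matrix[selected].T).max(axis=1)
        mmr = lambda_mult * sims[rest] - (1 - lambda_mult) * redundancy
        selected.append(rest[int(np.argmax(mmr))])
    return selected
```

Because every row is unit-norm, the cosine step is a single matrix-vector product, which is what keeps the per-query cost in the tens of milliseconds.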

## Tabs

| Tab | Purpose |
|-----|---------|
| 💬 Q&A | Ask questions; see top-3 retrieved contexts + generated answer |
| 📊 Analytics | Total chunks, documents processed, per-file breakdown |
49
+ ## Required Space Secrets
50
+
51
+ Set in **Settings → Variables and Secrets**:
52
+
53
+ | Secret | Description |
54
+ |--------|-------------|
55
+ | `HF_TOKEN` | HuggingFace token — needs **read** access to the dataset repo |
56
+ | `HF_DATASET_REPO` | e.g. `your-org/enggss-rag-dataset` (created by preprocessing script) |
57
+

## Setup order

1. **Run preprocessing locally** (once, or whenever you add new PDFs):
   ```bash
   cd preprocessing
   pip install -r requirements.txt
   python create_dataset.py ./pdfs --repo your-org/enggss-rag-dataset
   ```
2. **Deploy this Space** — upload `app.py`, `requirements.txt`, and `README.md`
3. **Set the two secrets** above in **Settings → Variables and Secrets**
4. The Space restarts, loads the dataset, and is ready to answer questions

To add new PDFs later without rebuilding everything:
```bash
python create_dataset.py ./pdfs --repo your-org/enggss-rag-dataset --update
```
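The startup load in step 4 can be sketched as below. The `embedding` column name is an assumption about what `create_dataset.py` stores; normalising rows once at startup is what lets each later query score reduce to a single matmul:

```python
import numpy as np
# from datasets import load_dataset  # executed at Space startup, needs HF_TOKEN

def to_search_matrix(embeddings):
    """Stack stored embedding lists into a float32 matrix with unit-norm rows."""
    mat = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    return mat / np.clip(norms, 1e-12, None)  # guard against zero vectors

# At startup (column name "embedding" is an assumption):
# ds = load_dataset(os.environ["HF_DATASET_REPO"], split="train",
#                   token=os.environ["HF_TOKEN"])
# matrix = to_search_matrix(ds["embedding"])
```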

## Local development

```bash
git clone https://huggingface.co/spaces/your-org/enggss-rag-chatbot
cd enggss-rag-chatbot
pip install -r requirements.txt
# create .env with HF_TOKEN and HF_DATASET_REPO
python app.py
```
requirements.txt ADDED
@@ -0,0 +1,19 @@
# ── Dataset access ────────────────────────────────────────────────────────────
datasets>=2.18.0

# ── Query embedding (local, cached after first download ~1.3 GB) ──────────────
sentence-transformers
torch

# ── LLM chain ─────────────────────────────────────────────────────────────────
langchain-core>=1.2.0
langchain-huggingface

# ── UI (Gradio 5.x — sdk_version "5.0.0" in README.md) ────────────────────────
# Gradio 5.x rewrote oauth.py → no HfFolder import issue (was a 4.x bug)
# pydub is no longer a gradio 5.x dependency → pyaudioop not required
gradio>=5.0.0

# ── Utilities ─────────────────────────────────────────────────────────────────
numpy
python-dotenv