aelgendy committed on
Commit
605bb90
·
1 Parent(s): f740dde

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +1 -167
  2. app/prompts.py +1 -1
  3. app/search.py +14 -0
  4. app/state.py +151 -18
README.md CHANGED
@@ -112,172 +112,6 @@ python build_index.py
112
 
113
  ---
114
 
115
- ## API Reference (18 endpoints)
116
-
117
- ### Inference
118
-
119
- | Endpoint | Method | Description |
120
- |----------|--------|-------------|
121
- | `/ask?q=...&top_k=5&source_type=&grade_filter=` | GET | Direct RAG query with full source attribution |
122
- | `/v1/chat/completions` | POST | OpenAI-compatible chat (SSE streaming supported) |
123
-
124
- ### Quran (`/quran/...`)
125
-
126
- | Endpoint | Method | Description |
127
- |----------|--------|-------------|
128
- | `/quran/search?q=...&limit=10` | GET | Text search: find verses by partial Arabic/English text |
129
- | `/quran/topic?topic=...&top_k=10` | GET | Semantic search: find verses related to a topic |
130
- | `/quran/word-frequency?word=...` | GET | Count word occurrences across all Surahs |
131
- | `/quran/analytics` | GET | Overall Quran stats (total verses, Surahs, revelation types) |
132
- | `/quran/chapter/{number}` | GET | All verses and metadata for a specific Surah |
133
- | `/quran/verse/{surah}:{ayah}` | GET | Exact verse lookup by reference (e.g. `/quran/verse/2:255`) |
134
-
135
- ### Hadith (`/hadith/...`)
136
-
137
- | Endpoint | Method | Description |
138
- |----------|--------|-------------|
139
- | `/hadith/search?q=...&collection=&limit=10` | GET | Text search across collections |
140
- | `/hadith/topic?topic=...&top_k=10&grade_filter=` | GET | Semantic search by topic with optional grade filter |
141
- | `/hadith/verify?q=...&collection=` | GET | Authenticity verification (text + semantic search) |
142
- | `/hadith/collection/{name}?limit=20&offset=0` | GET | Browse a specific collection |
143
- | `/hadith/analytics` | GET | Collection-level statistics |
144
-
145
- ### Operations
146
-
147
- | Endpoint | Method | Description |
148
- |----------|--------|-------------|
149
- | `/health` | GET | Readiness check |
150
- | `/v1/models` | GET | OpenAI-compatible model listing |
151
- | `/debug/scores?q=...&top_k=10&source_type=` | GET | Raw retrieval scores (no LLM call) |
152
-
153
- ---
154
-
155
- ### GET `/ask` - Main Query
156
-
157
- ```bash
158
- curl "http://localhost:8000/ask?q=What%20does%20Islam%20say%20about%20mercy?&top_k=5"
159
- ```
160
-
161
- **Parameters:**
162
- | Parameter | Default | Description |
163
- |-----------|---------|-------------|
164
- | `q` | *(required)* | Your Islamic question |
165
- | `top_k` | `5` | Number of sources to retrieve (1–20) |
166
- | `source_type` | both | `quran` or `hadith` |
167
- | `grade_filter` | all | `sahih` or `hasan` |
168
-
169
- **Response:**
170
- ```json
171
- {
172
- "question": "What does Islam say about mercy?",
173
- "answer": "Islam emphasizes mercy as a core value...",
174
- "language": "english",
175
- "intent": "general",
176
- "analysis": null,
177
- "sources": [
178
- {
179
- "source": "Surah Al-Baqarah 2:178",
180
- "type": "quran",
181
- "grade": null,
182
- "arabic": "...",
183
- "english": "...",
184
- "_score": 0.876
185
- }
186
- ],
187
- "top_score": 0.876,
188
- "latency_ms": 342
189
- }
190
- ```
191
-
192
- ### POST `/v1/chat/completions` - OpenAI-Compatible
193
-
194
- ```bash
195
- curl -X POST http://localhost:8000/v1/chat/completions \
196
- -H "Content-Type: application/json" \
197
- -d '{
198
- "model": "QModel",
199
- "messages": [{"role": "user", "content": "What does Islam say about patience?"}],
200
- "temperature": 0.2,
201
- "max_tokens": 2048,
202
- "top_k": 5,
203
- "stream": false
204
- }'
205
- ```
206
-
207
- **Response:**
208
- ```json
209
- {
210
- "id": "qmodel-1234567890",
211
- "object": "chat.completion",
212
- "created": 1234567890,
213
- "model": "QModel",
214
- "choices": [
215
- {
216
- "index": 0,
217
- "message": { "role": "assistant", "content": "Islam emphasizes patience..." },
218
- "finish_reason": "stop"
219
- }
220
- ],
221
- "x_metadata": {
222
- "language": "english",
223
- "intent": "general",
224
- "top_score": 0.876,
225
- "latency_ms": 342,
226
- "sources": [{ "source": "Surah Al-Imran 3:200", "type": "quran", "score": 0.876 }]
227
- }
228
- }
229
- ```
230
-
231
- ### GET `/hadith/verify` - Authenticity Check
232
-
233
- ```bash
234
- curl "http://localhost:8000/hadith/verify?q=Actions%20are%20judged%20by%20intentions"
235
- ```
236
-
237
- **Response:**
238
- ```json
239
- {
240
- "query": "Actions are judged by intentions",
241
- "found": true,
242
- "collection": "Sahih al-Bukhari",
243
- "grade": "Sahih",
244
- "reference": "Sahih al-Bukhari 1",
245
- "arabic": "إنما الأعمال بالنيات",
246
- "english": "Verily, actions are judged by intentions...",
247
- "latency_ms": 156
248
- }
249
- ```
250
-
251
- ### GET `/debug/scores` - Retrieval Inspection
252
-
253
- ```bash
254
- curl "http://localhost:8000/debug/scores?q=patience&top_k=10"
255
- ```
256
-
257
- Use this to calibrate `CONFIDENCE_THRESHOLD`. If queries you expect to work have `_score < threshold`, lower the threshold.
258
-
259
- **Response:**
260
- ```json
261
- {
262
- "query": "patience",
263
- "intent": "general",
264
- "threshold": 0.3,
265
- "count": 10,
266
- "results": [
267
- {
268
- "rank": 1,
269
- "source": "Surah Al-Baqarah 2:45",
270
- "type": "quran",
271
- "_dense": 0.8234,
272
- "_sparse": 0.5421,
273
- "_score": 0.7234
274
- }
275
- ]
276
- }
277
- ```
278
-
279
- ---
280
-
281
  ## Example Queries
282
 
283
  ```bash
@@ -648,7 +482,7 @@ docker-compose down && docker system prune
648
 
649
  - [x] Grade-based filtering
650
  - [x] Streaming responses (SSE)
651
- - [x] Modular architecture (4 routers, 18 endpoints)
652
  - [x] Dual LLM backend (Ollama + HuggingFace)
653
  - [x] Text search (exact substring + fuzzy matching)
654
  - [ ] Chain of narrators (Isnad display)
 
112
 
113
  ---
114
 
115
  ## Example Queries
116
 
117
  ```bash
 
482
 
483
  - [x] Grade-based filtering
484
  - [x] Streaming responses (SSE)
485
+ - [x] Modular architecture (4 routers, 16 endpoints)
486
  - [x] Dual LLM backend (Ollama + HuggingFace)
487
  - [x] Text search (exact substring + fuzzy matching)
488
  - [ ] Chain of narrators (Isnad display)
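
The `/ask` parameters listed in the removed README section can also be assembled into a request URL programmatically. The following is a minimal sketch, assuming the endpoint still accepts `q`, `top_k`, `source_type`, and `grade_filter` as documented there. `build_ask_url` and its `base_url` default are hypothetical names used for illustration only; they are not part of the repository.

```python
from typing import Optional
from urllib.parse import urlencode


def build_ask_url(
    question: str,
    top_k: int = 5,
    source_type: Optional[str] = None,   # "quran" or "hadith"; omitted means both
    grade_filter: Optional[str] = None,  # "sahih" or "hasan"; omitted means all
    base_url: str = "http://localhost:8000",  # assumption: local dev server
) -> str:
    """Build a percent-encoded /ask URL (hypothetical helper)."""
    if not 1 <= top_k <= 20:
        raise ValueError("top_k must be between 1 and 20")
    params = {"q": question, "top_k": top_k}
    # Only send the optional filters when the caller set them.
    if source_type is not None:
        params["source_type"] = source_type
    if grade_filter is not None:
        params["grade_filter"] = grade_filter
    return f"{base_url}/ask?{urlencode(params)}"


print(build_ask_url("What does Islam say about mercy?", grade_filter="sahih"))
```

Using `urlencode` avoids hand-escaping spaces, question marks, and Arabic text in the query string.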
app/prompts.py CHANGED
@@ -11,7 +11,7 @@ from app.arabic_nlp import language_instruction
11
  # ═══════════════════════════════════════════════════════════════════════
12
  PERSONA = (
13
  "You are Sheikh QModel, a meticulous Islamic scholar with expertise "
14
- "in Tafsir (Quranic exegesis), Hadith sciences, Fiqh, and Arabic. "
15
  "You respond with scholarly rigor and modern clarity."
16
  )
17
 
 
11
  # ═══════════════════════════════════════════════════════════════════════
12
  PERSONA = (
13
  "You are Sheikh QModel, a meticulous Islamic scholar with expertise "
14
+ "in Quran, Tafsir (Quranic exegesis), Hadith sciences, Fiqh, and Arabic. "
15
  "You respond with scholarly rigor and modern clarity."
16
  )
17
 
app/search.py CHANGED
@@ -6,6 +6,7 @@ import json
6
  import logging
7
  import re
8
  from collections import Counter
9
  from typing import Dict, List, Literal, Optional
10
 
11
  import faiss
@@ -304,6 +305,19 @@ def text_search(
304
  if best_overlap >= max(2, len(q_tokens) * 0.5):
305
  score = best_overlap / max(len(q_tokens), 1)
306
 
307
  if score > 0:
308
  results.append({**item, "_score": score})
309
 
 
6
  import logging
7
  import re
8
  from collections import Counter
9
+ from difflib import SequenceMatcher
10
  from typing import Dict, List, Literal, Optional
11
 
12
  import faiss
 
305
  if best_overlap >= max(2, len(q_tokens) * 0.5):
306
  score = best_overlap / max(len(q_tokens), 1)
307
 
308
+ # Fuzzy similarity: catch 80%+ similar text (typos, slight differences)
309
+ if score == 0.0 and len(q_norm) >= 10:
310
+ q_len = len(q_norm)
311
+ for text in (ar_norm, en_lower):
312
+ if not text:
313
+ continue
314
+ # Only compare when lengths are comparable (within 3x)
315
+ if len(text) > q_len * 3:
316
+ continue
317
+ ratio = SequenceMatcher(None, q_norm, text).ratio()
318
+ if ratio >= 0.80:
319
+ score = max(score, 1.0 + ratio) # 1.80–2.0 range
320
+
321
  if score > 0:
322
  results.append({**item, "_score": score})
323
 
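
The fuzzy fallback added to `text_search` can be exercised in isolation. The sketch below mirrors the logic of the hunk above: a minimum query length of 10, candidates more than 3x longer than the query are skipped, ratios of 0.80 or higher are accepted, and the score is `1.0 + ratio`. The function name `fuzzy_score` and its parameters are illustrative assumptions, not identifiers from `app/search.py`.

```python
from difflib import SequenceMatcher
from typing import Iterable


def fuzzy_score(
    q_norm: str,
    candidates: Iterable[str],
    min_ratio: float = 0.80,
    min_len: int = 10,
    max_len_factor: int = 3,
) -> float:
    """Return 0.0 for no fuzzy hit, otherwise 1.0 + best similarity ratio."""
    if len(q_norm) < min_len:
        return 0.0  # very short queries are not fuzzy-matched
    score = 0.0
    for text in candidates:
        # Skip empty fields and texts far longer than the query.
        if not text or len(text) > len(q_norm) * max_len_factor:
            continue
        ratio = SequenceMatcher(None, q_norm, text).ratio()
        if ratio >= min_ratio:
            score = max(score, 1.0 + ratio)
    return score


print(fuzzy_score("actions are judged by intentoins",
                  ["actions are judged by intentions"]))
```

In the commit this branch only runs when the earlier matching stages left `score` at `0.0`. The length guard likely also keeps the cost of `SequenceMatcher` bounded, since its comparison time grows quickly with input length.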
app/state.py CHANGED
@@ -20,7 +20,7 @@ from app.analysis import (
20
  detect_surah_info,
21
  lookup_surah_info,
22
  )
23
- from app.arabic_nlp import detect_language
24
  from app.config import cfg
25
  from app.llm import LLMProvider, get_llm_provider
26
  from app.prompts import build_messages, not_found_answer
@@ -156,30 +156,40 @@ def _verify_citations(answer: str, results: list) -> str:
156
  If a quoted block doesn't match any source, replace it with a warning.
157
  This prevents the model from fabricating hadith or verse text.
158
  """
159
- source_texts = set()
160
  for r in results:
161
  for field in ("arabic", "english", "text"):
162
  val = r.get(field, "")
163
  if val:
164
- # Normalize whitespace for comparison
165
- source_texts.add(re.sub(r"\s+", " ", val.strip()))
166
 
167
  def _check_quote(m: re.Match) -> str:
168
- quoted = re.sub(r"\s+", " ", m.group(1).strip())
169
- # Check if any source text contains a significant portion of the quote
170
- for src in source_texts:
171
- # Use a substring match: LLMs sometimes trim edges
172
- if len(quoted) < 10:
173
- return m.group(0) # too short to verify
174
- if quoted in src or src in quoted:
175
- return m.group(0) # verified
176
- # Check overlap: at least 60% of words match
177
- q_words = set(quoted.split())
178
- s_words = set(src.split())
179
- if q_words and len(q_words & s_words) / len(q_words) >= 0.6:
180
- return m.group(0) # close enough match
181
  # Quote not found in any source: flag it
182
- logger.warning("Hallucination detected: quoted text not in sources: %.80s...", quoted)
183
  return "❝ ⚠️ [تم حذف نص غير موثق — النص غير موجود في قاعدة البيانات] ❞"
184
 
185
  return _QUOTE_RE.sub(_check_quote, answer)
@@ -188,6 +198,127 @@ def _verify_citations(answer: str, results: list) -> str:
188
  # ═══════════════════════════════════════════════════════════════════════
189
  # HADITH GRADE INFERENCE
190
  # ═══════════════════════════════════════════════════════════════════════
191
  def infer_hadith_grade(item: dict) -> dict:
192
  """Infer hadith grade from collection name if not present."""
193
  if item.get("type") != "hadith" or item.get("grade"):
@@ -391,6 +522,8 @@ async def run_rag_pipeline(
391
  # 7. Post-generation hallucination check
392
  answer = _verify_citations(answer, results)
393
  answer = _verify_references(answer, results)
394
 
395
  latency = int((time.perf_counter() - t0) * 1000)
396
  logger.info(
 
20
  detect_surah_info,
21
  lookup_surah_info,
22
  )
23
+ from app.arabic_nlp import detect_language, normalize_arabic
24
  from app.config import cfg
25
  from app.llm import LLMProvider, get_llm_provider
26
  from app.prompts import build_messages, not_found_answer
 
156
  If a quoted block doesn't match any source, replace it with a warning.
157
  This prevents the model from fabricating hadith or verse text.
158
  """
159
+ source_texts_raw = []
160
  for r in results:
161
  for field in ("arabic", "english", "text"):
162
  val = r.get(field, "")
163
  if val:
164
+ source_texts_raw.append(re.sub(r"\s+", " ", val.strip()))
165
+
166
+ # Pre-compute normalized versions for diacritics-insensitive comparison
167
+ source_texts_norm = [normalize_arabic(s) for s in source_texts_raw]
168
 
169
  def _check_quote(m: re.Match) -> str:
170
+ quoted_raw = re.sub(r"\s+", " ", m.group(1).strip())
171
+ quoted_norm = normalize_arabic(quoted_raw)
172
+
173
+ if len(quoted_norm) < 10:
174
+ return m.group(0) # too short to verify
175
+
176
+ for src_raw, src_norm in zip(source_texts_raw, source_texts_norm):
177
+ # 1. Exact substring match (raw, preserves diacritics)
178
+ if quoted_raw in src_raw or src_raw in quoted_raw:
179
+ return m.group(0)
180
+
181
+ # 2. Normalized substring match (strips diacritics/punctuation)
182
+ if quoted_norm in src_norm or src_norm in quoted_norm:
183
+ return m.group(0)
184
+
185
+ # 3. Word overlap on normalized text (≥50% of quoted words found)
186
+ q_words = set(quoted_norm.split())
187
+ s_words = set(src_norm.split())
188
+ if q_words and len(q_words & s_words) / len(q_words) >= 0.5:
189
+ return m.group(0)
190
+
191
  # Quote not found in any source: flag it
192
+ logger.warning("Hallucination detected: quoted text not in sources: %.80s...", quoted_norm)
193
  return "❝ ⚠️ [تم حذف نص غير موثق — النص غير موجود في قاعدة البيانات] ❞"
194
 
195
  return _QUOTE_RE.sub(_check_quote, answer)
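
The three-tier check in the new `_check_quote` (raw substring, normalized substring, then at least 50% word overlap) can be sketched as a standalone predicate. This is an illustration under stated assumptions: `simple_normalize` is a hypothetical stand-in for `normalize_arabic`, whose implementation is not shown in this diff, and `quote_is_supported` is not a name from `app/state.py`.

```python
import re
from typing import Iterable

# Assumption: strip Arabic harakat, dagger alif, and tatweel only.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")


def simple_normalize(text: str) -> str:
    """Hypothetical stand-in for normalize_arabic (real rules may differ)."""
    text = _DIACRITICS.sub("", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_is_supported(quoted: str, sources: Iterable[str],
                       min_overlap: float = 0.5) -> bool:
    """True if the quote is too short to check or matches any source."""
    q_raw = re.sub(r"\s+", " ", quoted.strip())
    q_norm = simple_normalize(q_raw)
    if len(q_norm) < 10:
        return True  # too short to verify, left untouched
    q_words = set(q_norm.split())
    for src in sources:
        s_raw = re.sub(r"\s+", " ", src.strip())
        s_norm = simple_normalize(s_raw)
        # Tier 1: exact substring on raw text.
        if q_raw in s_raw or s_raw in q_raw:
            return True
        # Tier 2: substring after normalization.
        if q_norm in s_norm or s_norm in q_norm:
            return True
        # Tier 3: share of quoted words present in the source.
        if q_words and len(q_words & set(s_norm.split())) / len(q_words) >= min_overlap:
            return True
    return False


source = "Verily, actions are judged by intentions"
print(quote_is_supported("verily, actions are judged by intentions", [source]))
print(quote_is_supported("The moon is made of green cheese entirely", [source]))
```

Because tier 3 compares word sets, a quote that reuses half of a source's vocabulary in a different order would still pass, so the 0.5 threshold trades fewer false alarms for weaker protection against paraphrased fabrications.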
 
198
  # ═══════════════════════════════════════════════════════════════════════
199
  # HADITH GRADE INFERENCE
200
  # ═══════════════════════════════════════════════════════════════════════
201
+ def _verify_surah_info(answer: str, surah_info: dict) -> str:
202
+ """Verify and correct surah metadata in the LLM answer.
203
+
204
+ Replaces hallucinated surah names and verse counts with the correct
205
+ values from the authoritative surah_info lookup.
206
+ """
207
+ if not surah_info:
208
+ return answer
209
+
210
+ correct_name_ar = surah_info.get("surah_name_ar", "")
211
+ correct_name_en = surah_info.get("surah_name_en", "")
212
+ correct_verses = surah_info.get("total_verses")
213
+ correct_number = surah_info.get("surah_number")
214
+ correct_type = surah_info.get("revelation_type", "")
215
+ correct_translit = surah_info.get("surah_name_transliteration", "")
216
+
217
+ correct_ar_norm = normalize_arabic(correct_name_ar).lower()
218
+ correct_ar_bare = re.sub(r"^ال", "", correct_ar_norm).strip()
219
+
220
+ # Words that can follow "سورة" but aren't surah names
221
+ _NOT_SURAH_NAMES = {
222
+ "مكية", "مكي", "مدنية", "مدني", "باللغة", "من", "في", "هي",
223
+ "التي", "الكريمة", "المباركة", "هذه", "تلك",
224
+ }
225
+ _NOT_SURAH_NAMES_NORM = {normalize_arabic(w).lower() for w in _NOT_SURAH_NAMES}
226
+
227
+ # ── Fix wrong surah names ───────────────────────────────────────
228
+ # Match "سورة <name>": capture one Arabic word (letters + diacritics only,
229
+ # excluding Arabic punctuation like ، ؛ ؟ which sit in U+060C-U+061F).
230
+ def _fix_surah_name_ar(m: re.Match) -> str:
231
+ found_name = m.group(1).strip()
232
+ found_norm = normalize_arabic(found_name).lower()
233
+ found_bare = re.sub(r"^ال", "", found_norm).strip()
234
+ # Skip non-surah-name words (check both raw and normalized)
235
+ if found_name in _NOT_SURAH_NAMES or found_norm in _NOT_SURAH_NAMES_NORM:
236
+ return m.group(0)
237
+ if found_bare == correct_ar_bare or found_norm == correct_ar_norm:
238
+ return m.group(0) # already correct
239
+ # Handle 2-word capture where 1st word is the correct surah name
240
+ # (e.g., "النحل من" starts with "النحل")
241
+ if found_bare.startswith(correct_ar_bare) or found_norm.startswith(correct_ar_norm):
242
+ return m.group(0) # already correct
243
+ logger.warning(
244
+ "Surah info hallucination: سورة %s -> correcting to سورة %s",
245
+ found_name, correct_name_ar,
246
+ )
247
+ return m.group(0).replace(found_name, correct_name_ar)
248
+
249
+ # Use \u0621-\u06FF to capture Arabic letters/diacritics but exclude
250
+ # Arabic punctuation (،؛؟ etc. at U+060C-U+061F). Allow an optional
251
+ # second word for 2-word names like آل عمران.
252
+ answer = re.sub(
253
+ r"(?:سورة|سوره)\s+([\u0621-\u06FF\u0750-\u077F]+(?:\s[\u0621-\u06FF\u0750-\u077F]+)?)"
254
+ r"(?=[\s,،؛؟\.\n?!]|$)",
255
+ _fix_surah_name_ar,
256
+ answer,
257
+ )
258
+
259
+ # Fix English surah names: "Surah <name>"
260
+ if correct_name_en:
261
+ def _fix_surah_name_en(m: re.Match) -> str:
262
+ found = m.group(1).strip()
263
+ if found.lower() == correct_name_en.lower():
264
+ return m.group(0)
265
+ # Also allow transliteration match
266
+ if correct_translit and found.lower() == correct_translit.lower():
267
+ return m.group(0)
268
+ logger.warning(
269
+ "Surah info hallucination: Surah %s -> correcting to Surah %s",
270
+ found, correct_name_en,
271
+ )
272
+ return m.group(0).replace(found, correct_name_en)
273
+
274
+ answer = re.sub(
275
+ r"(?:Surah|sura)\s+([A-Za-z'\-]+(?:[\s\-][A-Za-z'\-]+)*)",
276
+ _fix_surah_name_en,
277
+ answer,
278
+ flags=re.I,
279
+ )
280
+
281
+ # ── Fix wrong verse counts ──────────────────────────────────────
282
+ if correct_verses is not None:
283
+ def _fix_verse_count(m: re.Match) -> str:
284
+ num = int(m.group(1))
285
+ if num == correct_verses:
286
+ return m.group(0)
287
+ logger.warning(
288
+ "Surah info hallucination: %d verses -> correcting to %d",
289
+ num, correct_verses,
290
+ )
291
+ return m.group(0).replace(m.group(1), str(correct_verses))
292
+
293
+ # Arabic: "34 آية" / "34 آيات"
294
+ answer = re.sub(
295
+ r"(\d+)\s*(?:آية|آيات|آيه)",
296
+ _fix_verse_count,
297
+ answer,
298
+ )
299
+ # "الآية 34" used as count context (after "عدد" or near "آيات")
300
+ answer = re.sub(
301
+ r"(الآية|الايه)\s+(\d+)",
302
+ lambda m: m.group(1) + " " + (str(correct_verses) if int(m.group(2)) != correct_verses else m.group(2)),
303
+ answer,
304
+ )
305
+ # English: "34 verses" / "has 34 verses"
306
+ answer = re.sub(
307
+ r"(\d+)\s+(?:verses|ayat|ayahs)",
308
+ _fix_verse_count,
309
+ answer,
310
+ flags=re.I,
311
+ )
312
+ # "عددها 34" / "عدد 34"
313
+ answer = re.sub(
314
+ r"(عدد[ها]*\s+)(\d+)",
315
+ lambda m: m.group(1) + (str(correct_verses) if int(m.group(2)) != correct_verses else m.group(2)),
316
+ answer,
317
+ )
318
+
319
+ return answer
320
+
321
+
322
  def infer_hadith_grade(item: dict) -> dict:
323
  """Infer hadith grade from collection name if not present."""
324
  if item.get("type") != "hadith" or item.get("grade"):
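
The verse-count correction in `_verify_surah_info` relies on `re.sub` with a callback that rewrites only the captured number. A simplified sketch of that pattern, covering only the English phrasing, is shown below. `fix_verse_count` is an illustrative name, not the function from `app/state.py`.

```python
import re

# Capture the number and the unit separately so only the number is replaced.
_VERSE_COUNT_RE = re.compile(r"(\d+)(\s+(?:verses|ayat|ayahs))", flags=re.I)


def fix_verse_count(answer: str, correct_verses: int) -> str:
    """Replace any '<N> verses' phrase whose N differs from correct_verses."""
    def _fix(m: re.Match) -> str:
        if int(m.group(1)) == correct_verses:
            return m.group(0)  # already correct, leave as written
        return f"{correct_verses}{m.group(2)}"

    return _VERSE_COUNT_RE.sub(_fix, answer)


print(fix_verse_count("Surah Luqman has 30 verses.", 34))
```

One consequence of this approach is that every matching phrase is rewritten, including ones that are not total counts: under this sketch, "the first 5 verses" would also become "the first 34 verses". The same caveat appears to apply to the patterns in the commit, so it may be worth restricting the match to count contexts.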
 
522
  # 7. Post-generation hallucination check
523
  answer = _verify_citations(answer, results)
524
  answer = _verify_references(answer, results)
525
+ if surah_info:
526
+ answer = _verify_surah_info(answer, surah_info)
527
 
528
  latency = int((time.perf_counter() - t0) * 1000)
529
  logger.info(