Upload folder using huggingface_hub

Files changed (4)

README.md +1 -167
app/prompts.py +1 -1
app/search.py +14 -0
app/state.py +151 -18

README.md CHANGED

@@ -112,172 +112,6 @@ python build_index.py
 ---
-## API Reference (18 endpoints)
-### Inference
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/ask?q=...&top_k=5&source_type=&grade_filter=` | GET | Direct RAG query with full source attribution |
-| `/v1/chat/completions` | POST | OpenAI-compatible chat (SSE streaming supported) |
-### Quran (`/quran/...`)
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/quran/search?q=...&limit=10` | GET | Text search: find verses by partial Arabic/English text |
-| `/quran/topic?topic=...&top_k=10` | GET | Semantic search: find verses related to a topic |
-| `/quran/word-frequency?word=...` | GET | Count word occurrences across all Surahs |
-| `/quran/analytics` | GET | Overall Quran stats (total verses, Surahs, revelation types) |
-| `/quran/chapter/{number}` | GET | All verses and metadata for a specific Surah |
-| `/quran/verse/{surah}:{ayah}` | GET | Exact verse lookup by reference (e.g. `/quran/verse/2:255`) |
-### Hadith (`/hadith/...`)
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/hadith/search?q=...&collection=&limit=10` | GET | Text search across collections |
-| `/hadith/topic?topic=...&top_k=10&grade_filter=` | GET | Semantic search by topic with optional grade filter |
-| `/hadith/verify?q=...&collection=` | GET | Authenticity verification (text + semantic search) |
-| `/hadith/collection/{name}?limit=20&offset=0` | GET | Browse a specific collection |
-| `/hadith/analytics` | GET | Collection-level statistics |
-### Operations
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/health` | GET | Readiness check |
-| `/v1/models` | GET | OpenAI-compatible model listing |
-| `/debug/scores?q=...&top_k=10&source_type=` | GET | Raw retrieval scores (no LLM call) |
----
-### GET `/ask` — Main Query
-```bash
-curl "http://localhost:8000/ask?q=What%20does%20Islam%20say%20about%20mercy?&top_k=5"
-```
-**Parameters:**
-| Parameter | Default | Description |
-|-----------|---------|-------------|
-| `q` | *(required)* | Your Islamic question |
-| `top_k` | `5` | Number of sources to retrieve (1–20) |
-| `source_type` | both | `quran` or `hadith` |
-| `grade_filter` | all | `sahih` or `hasan` |
-**Response:**
-```json
-{
-  "question": "What does Islam say about mercy?",
-  "answer": "Islam emphasizes mercy as a core value...",
-  "language": "english",
-  "intent": "general",
-  "analysis": null,
-  "sources": [
-    {
-      "source": "Surah Al-Baqarah 2:178",
-      "type": "quran",
-      "grade": null,
-      "arabic": "...",
-      "english": "...",
-      "_score": 0.876
-    }
-  ],
-  "top_score": 0.876,
-  "latency_ms": 342
-}
-```
-### POST `/v1/chat/completions` — OpenAI-Compatible
-```bash
-curl -X POST http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "QModel",
-    "messages": [{"role": "user", "content": "What does Islam say about patience?"}],
-    "temperature": 0.2,
-    "max_tokens": 2048,
-    "top_k": 5,
-    "stream": false
-  }'
-```
-**Response:**
-```json
-{
-  "id": "qmodel-1234567890",
-  "object": "chat.completion",
-  "created": 1234567890,
-  "model": "QModel",
-  "choices": [
-    {
-      "index": 0,
-      "message": { "role": "assistant", "content": "Islam emphasizes patience..." },
-      "finish_reason": "stop"
-    }
-  ],
-  "x_metadata": {
-    "language": "english",
-    "intent": "general",
-    "top_score": 0.876,
-    "latency_ms": 342,
-    "sources": [{ "source": "Surah Al-Imran 3:200", "type": "quran", "score": 0.876 }]
-  }
-}
-```
-### GET `/hadith/verify` — Authenticity Check
-```bash
-curl "http://localhost:8000/hadith/verify?q=Actions%20are%20judged%20by%20intentions"
-```
-**Response:**
-```json
-{
-  "query": "Actions are judged by intentions",
-  "found": true,
-  "collection": "Sahih al-Bukhari",
-  "grade": "Sahih",
-  "reference": "Sahih al-Bukhari 1",
-  "arabic": "إنما الأعمال بالنيات",
-  "english": "Verily, actions are judged by intentions...",
-  "latency_ms": 156
-}
-```
-### GET `/debug/scores` — Retrieval Inspection
-```bash
-curl "http://localhost:8000/debug/scores?q=patience&top_k=10"
-```
-Use this to calibrate `CONFIDENCE_THRESHOLD`. If queries you expect to work have `_score < threshold`, lower the threshold.
-**Response:**
-```json
-{
-  "query": "patience",
-  "intent": "general",
-  "threshold": 0.3,
-  "count": 10,
-  "results": [
-    {
-      "rank": 1,
-      "source": "Surah Al-Baqarah 2:45",
-      "type": "quran",
-      "_dense": 0.8234,
-      "_sparse": 0.5421,
-      "_score": 0.7234
-    }
-  ]
-}
-```
----
 ## Example Queries
 ```bash
@@ -648,7 +482,7 @@ docker-compose down && docker system prune
 - [x] Grade-based filtering
 - [x] Streaming responses (SSE)
-- [x] Modular architecture (4 routers, 18 endpoints)
 - [x] Dual LLM backend (Ollama + HuggingFace)
 - [x] Text search (exact substring + fuzzy matching)
 - [ ] Chain of narrators (Isnad display)

 ---
 ## Example Queries
 ```bash
 - [x] Grade-based filtering
 - [x] Streaming responses (SSE)
+- [x] Modular architecture (4 routers, 16 endpoints)
 - [x] Dual LLM backend (Ollama + HuggingFace)
 - [x] Text search (exact substring + fuzzy matching)
 - [ ] Chain of narrators (Isnad display)

app/prompts.py CHANGED

@@ -11,7 +11,7 @@ from app.arabic_nlp import language_instruction
 # ═══════════════════════════════════════════════════════════════════════
 PERSONA = (
     "You are Sheikh QModel, a meticulous Islamic scholar with expertise "
-    "in Tafsir (Quranic exegesis), Hadith sciences, Fiqh, and Arabic. "
     "You respond with scholarly rigor and modern clarity."
 )

 # ═══════════════════════════════════════════════════════════════════════
 PERSONA = (
     "You are Sheikh QModel, a meticulous Islamic scholar with expertise "
+    "in Quran, Tafsir (Quranic exegesis), Hadith sciences, Fiqh, and Arabic. "
     "You respond with scholarly rigor and modern clarity."
 )

app/search.py CHANGED

@@ -6,6 +6,7 @@ import json
 import logging
 import re
 from collections import Counter
 from typing import Dict, List, Literal, Optional
 import faiss
@@ -304,6 +305,19 @@ def text_search(
             if best_overlap >= max(2, len(q_tokens) * 0.5):
                 score = best_overlap / max(len(q_tokens), 1)
         if score > 0:
             results.append({**item, "_score": score})

 import logging
 import re
 from collections import Counter
+from difflib import SequenceMatcher
 from typing import Dict, List, Literal, Optional
 import faiss
             if best_overlap >= max(2, len(q_tokens) * 0.5):
                 score = best_overlap / max(len(q_tokens), 1)
+        # Fuzzy similarity — catch 80%+ similar text (typos, slight differences)
+        if score == 0.0 and len(q_norm) >= 10:
+            q_len = len(q_norm)
+            for text in (ar_norm, en_lower):
+                if not text:
+                    continue
+                # Only compare when lengths are comparable (within 3x)
+                if len(text) > q_len * 3:
+                    continue
+                ratio = SequenceMatcher(None, q_norm, text).ratio()
+                if ratio >= 0.80:
+                    score = max(score, 1.0 + ratio)  # 1.80–2.0 range
         if score > 0:
             results.append({**item, "_score": score})
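
The fuzzy fallback added to `text_search` in this hunk can be tried in isolation. The following is a minimal sketch, not the repository's code: `fuzzy_score` is a hypothetical helper name, and it assumes the query and candidate were already normalized the way `q_norm`, `ar_norm`, and `en_lower` are in the diff. It relies only on `difflib.SequenceMatcher` from the standard library.

```python
from difflib import SequenceMatcher


def fuzzy_score(query: str, candidate: str, threshold: float = 0.80) -> float:
    """Return 1.0 + similarity ratio for a near match, else 0.0.

    Hypothetical helper that mirrors the guards in the diff:
    - queries shorter than 10 characters are skipped,
    - empty candidates are skipped,
    - candidates more than 3x longer than the query are skipped.
    """
    if len(query) < 10 or not candidate:
        return 0.0
    if len(candidate) > len(query) * 3:
        return 0.0
    ratio = SequenceMatcher(None, query, candidate).ratio()
    return 1.0 + ratio if ratio >= threshold else 0.0


if __name__ == "__main__":
    # A one-character typo should still count as a near match.
    print(fuzzy_score("actions are judged by intentions",
                      "actions are judged by intention"))
```

The 1.0 offset puts fuzzy hits in the 1.80 to 2.0 range noted in the diff comment. How that compares with the scores of the other match types depends on code outside this hunk.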

app/state.py CHANGED Viewed

@@ -20,7 +20,7 @@ from app.analysis import (
     detect_surah_info,
     lookup_surah_info,
 )
-from app.arabic_nlp import detect_language
 from app.config import cfg
 from app.llm import LLMProvider, get_llm_provider
 from app.prompts import build_messages, not_found_answer
@@ -156,30 +156,40 @@ def _verify_citations(answer: str, results: list) -> str:
     If a quoted block doesn't match any source, replace it with a warning.
     This prevents the model from fabricating hadith or verse text.
     """
-    source_texts = set()
     for r in results:
         for field in ("arabic", "english", "text"):
             val = r.get(field, "")
             if val:
-                # Normalize whitespace for comparison
-                source_texts.add(re.sub(r"\s+", " ", val.strip()))
     def _check_quote(m: re.Match) -> str:
-        quoted = re.sub(r"\s+", " ", m.group(1).strip())
-        # Check if any source text contains a significant portion of the quote
-        for src in source_texts:
-            # Use a substring match — LLMs sometimes trim edges
-            if len(quoted) < 10:
-                return m.group(0)  # too short to verify
-            if quoted in src or src in quoted:
-                return m.group(0)  # verified
-            # Check overlap: at least 60% of words match
-            q_words = set(quoted.split())
-            s_words = set(src.split())
-            if q_words and len(q_words & s_words) / len(q_words) >= 0.6:
-                return m.group(0)  # close enough match
         # Quote not found in any source — flag it
-        logger.warning("Hallucination detected: quoted text not in sources: %.80s...", quoted)
         return "❝ ⚠️ [تم حذف نص غير موثق — النص غير موجود في قاعدة البيانات] ❞"
     return _QUOTE_RE.sub(_check_quote, answer)
@@ -188,6 +198,127 @@ def _verify_citations(answer: str, results: list) -> str:
 # ═══════════════════════════════════════════════════════════════════════
 # HADITH GRADE INFERENCE
 # ═══════════════════════════════════════════════════════════════════════
 def infer_hadith_grade(item: dict) -> dict:
     """Infer hadith grade from collection name if not present."""
     if item.get("type") != "hadith" or item.get("grade"):
@@ -391,6 +522,8 @@ async def run_rag_pipeline(
     # 7. Post-generation hallucination check
     answer = _verify_citations(answer, results)
     answer = _verify_references(answer, results)
     latency = int((time.perf_counter() - t0) * 1000)
     logger.info(

     detect_surah_info,
     lookup_surah_info,
 )
+from app.arabic_nlp import detect_language, normalize_arabic
 from app.config import cfg
 from app.llm import LLMProvider, get_llm_provider
 from app.prompts import build_messages, not_found_answer
     If a quoted block doesn't match any source, replace it with a warning.
     This prevents the model from fabricating hadith or verse text.
     """
+    source_texts_raw = []
     for r in results:
         for field in ("arabic", "english", "text"):
             val = r.get(field, "")
             if val:
+                source_texts_raw.append(re.sub(r"\s+", " ", val.strip()))
+    # Pre-compute normalized versions for diacritics-insensitive comparison
+    source_texts_norm = [normalize_arabic(s) for s in source_texts_raw]
     def _check_quote(m: re.Match) -> str:
+        quoted_raw = re.sub(r"\s+", " ", m.group(1).strip())
+        quoted_norm = normalize_arabic(quoted_raw)
+        if len(quoted_norm) < 10:
+            return m.group(0)  # too short to verify
+        for src_raw, src_norm in zip(source_texts_raw, source_texts_norm):
+            # 1. Exact substring match (raw — preserves diacritics)
+            if quoted_raw in src_raw or src_raw in quoted_raw:
+                return m.group(0)
+            # 2. Normalized substring match (strips diacritics/punctuation)
+            if quoted_norm in src_norm or src_norm in quoted_norm:
+                return m.group(0)
+            # 3. Word overlap on normalized text (≥50% of quoted words found)
+            q_words = set(quoted_norm.split())
+            s_words = set(src_norm.split())
+            if q_words and len(q_words & s_words) / len(q_words) >= 0.5:
+                return m.group(0)
         # Quote not found in any source — flag it
+        logger.warning("Hallucination detected: quoted text not in sources: %.80s...", quoted_norm)
         return "❝ ⚠️ [تم حذف نص غير موثق — النص غير موجود في قاعدة البيانات] ❞"
     return _QUOTE_RE.sub(_check_quote, answer)
 # ═══════════════════════════════════════════════════════════════════════
 # HADITH GRADE INFERENCE
 # ═══════════════════════════════════════════════════════════════════════
+def _verify_surah_info(answer: str, surah_info: dict) -> str:
+    """Verify and correct surah metadata in the LLM answer.
+    Replaces hallucinated surah names and verse counts with the correct
+    values from the authoritative surah_info lookup.
+    """
+    if not surah_info:
+        return answer
+    correct_name_ar = surah_info.get("surah_name_ar", "")
+    correct_name_en = surah_info.get("surah_name_en", "")
+    correct_verses  = surah_info.get("total_verses")
+    correct_number  = surah_info.get("surah_number")
+    correct_type    = surah_info.get("revelation_type", "")
+    correct_translit = surah_info.get("surah_name_transliteration", "")
+    correct_ar_norm = normalize_arabic(correct_name_ar).lower()
+    correct_ar_bare = re.sub(r"^ال", "", correct_ar_norm).strip()
+    # Words that can follow "سورة" but aren't surah names
+    _NOT_SURAH_NAMES = {
+        "مكية", "مكي", "مدنية", "مدني", "باللغة", "من", "في", "هي",
+        "التي", "الكريمة", "المباركة", "هذه", "تلك",
+    }
+    _NOT_SURAH_NAMES_NORM = {normalize_arabic(w).lower() for w in _NOT_SURAH_NAMES}
+    # ── Fix wrong surah names ───────────────────────────────────────
+    # Match "سورة <name>" — capture one Arabic word (letters + diacritics only,
+    # excluding Arabic punctuation like ، ؛ ؟ which sit in U+060C-U+061F).
+    def _fix_surah_name_ar(m: re.Match) -> str:
+        found_name = m.group(1).strip()
+        found_norm = normalize_arabic(found_name).lower()
+        found_bare = re.sub(r"^ال", "", found_norm).strip()
+        # Skip non-surah-name words (check both raw and normalized)
+        if found_name in _NOT_SURAH_NAMES or found_norm in _NOT_SURAH_NAMES_NORM:
+            return m.group(0)
+        if found_bare == correct_ar_bare or found_norm == correct_ar_norm:
+            return m.group(0)                           # already correct
+        # Handle 2-word capture where 1st word is the correct surah name
+        # (e.g., "النحل من" starts with "النحل")
+        if found_bare.startswith(correct_ar_bare) or found_norm.startswith(correct_ar_norm):
+            return m.group(0)                           # already correct
+        logger.warning(
+            "Surah info hallucination: سورة %s -> correcting to سورة %s",
+            found_name, correct_name_ar,
+        )
+        return m.group(0).replace(found_name, correct_name_ar)
+    # Use \u0621-\u06FF to capture Arabic letters/diacritics but exclude
+    # Arabic punctuation (،؛؟ etc. at U+060C-U+061F).  Allow an optional
+    # second word for 2-word names like آل عمران.
+    answer = re.sub(
+        r"(?:سورة|سوره)\s+([\u0621-\u06FF\u0750-\u077F]+(?:\s[\u0621-\u06FF\u0750-\u077F]+)?)"
+        r"(?=[\s,،؛؟\.\n?!]|$)",
+        _fix_surah_name_ar,
+        answer,
+    )
+    # Fix English surah names: "Surah <name>"
+    if correct_name_en:
+        def _fix_surah_name_en(m: re.Match) -> str:
+            found = m.group(1).strip()
+            if found.lower() == correct_name_en.lower():
+                return m.group(0)
+            # Also allow transliteration match
+            if correct_translit and found.lower() == correct_translit.lower():
+                return m.group(0)
+            logger.warning(
+                "Surah info hallucination: Surah %s -> correcting to Surah %s",
+                found, correct_name_en,
+            )
+            return m.group(0).replace(found, correct_name_en)
+        answer = re.sub(
+            r"(?:Surah|sura)\s+([A-Za-z'\-]+(?:[\s\-][A-Za-z'\-]+)*)",
+            _fix_surah_name_en,
+            answer,
+            flags=re.I,
+        )
+    # ── Fix wrong verse counts ──────────────────────────────────────
+    if correct_verses is not None:
+        def _fix_verse_count(m: re.Match) -> str:
+            num = int(m.group(1))
+            if num == correct_verses:
+                return m.group(0)
+            logger.warning(
+                "Surah info hallucination: %d verses -> correcting to %d",
+                num, correct_verses,
+            )
+            return m.group(0).replace(m.group(1), str(correct_verses))
+        # Arabic: "34 آية" / "34 آيات"
+        answer = re.sub(
+            r"(\d+)\s*(?:آية|آيات|آيه)",
+            _fix_verse_count,
+            answer,
+        )
+        # "الآية 34" used as count context (after "عدد" or near "آيات")
+        answer = re.sub(
+            r"(الآية|الايه)\s+(\d+)",
+            lambda m: m.group(1) + " " + (str(correct_verses) if int(m.group(2)) != correct_verses else m.group(2)),
+            answer,
+        )
+        # English: "34 verses" / "has 34 verses"
+        answer = re.sub(
+            r"(\d+)\s+(?:verses|ayat|ayahs)",
+            _fix_verse_count,
+            answer,
+            flags=re.I,
+        )
+        # "عددها 34" / "عدد 34"
+        answer = re.sub(
+            r"(عدد[ها]*\s+)(\d+)",
+            lambda m: m.group(1) + (str(correct_verses) if int(m.group(2)) != correct_verses else m.group(2)),
+            answer,
+        )
+    return answer
 def infer_hadith_grade(item: dict) -> dict:
     """Infer hadith grade from collection name if not present."""
     if item.get("type") != "hadith" or item.get("grade"):
     # 7. Post-generation hallucination check
     answer = _verify_citations(answer, results)
     answer = _verify_references(answer, results)
+    if surah_info:
+        answer = _verify_surah_info(answer, surah_info)
     latency = int((time.perf_counter() - t0) * 1000)
     logger.info(
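
The revised `_verify_citations` check in this file tries three tiers per source: a raw substring match, a normalized substring match, and a word-overlap test with a 50% cutoff. The sketch below reproduces that order as a standalone function so it can be exercised without the app. It is an illustration under assumptions: `quote_is_supported` and `simple_normalize` are hypothetical names, and `simple_normalize` is only a stand-in for `normalize_arabic`, whose implementation is not shown in this diff. The stand-in merely drops Unicode combining marks (which include the Arabic short-vowel diacritics) and collapses whitespace.

```python
import re
import unicodedata
from typing import List


def simple_normalize(text: str) -> str:
    """Hypothetical stand-in for normalize_arabic (real behavior may differ)."""
    # Category "Mn" covers combining marks such as fatha, kasra, shadda, sukun.
    stripped = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return re.sub(r"\s+", " ", stripped).strip()


def quote_is_supported(quote: str, sources: List[str],
                       min_overlap: float = 0.5) -> bool:
    """Return True when the quote is backed by at least one source text."""
    quote_raw = re.sub(r"\s+", " ", quote.strip())
    quote_norm = simple_normalize(quote_raw)
    if len(quote_norm) < 10:
        return True  # too short to verify, treated as acceptable as in the diff
    q_words = set(quote_norm.split())
    for src in sources:
        src_raw = re.sub(r"\s+", " ", src.strip())
        src_norm = simple_normalize(src_raw)
        # Tier 1: exact substring match on raw text (diacritics preserved).
        if quote_raw in src_raw or src_raw in quote_raw:
            return True
        # Tier 2: substring match after normalization.
        if quote_norm in src_norm or src_norm in quote_norm:
            return True
        # Tier 3: share of quoted words that also appear in the source.
        s_words = set(src_norm.split())
        if q_words and len(q_words & s_words) / len(q_words) >= min_overlap:
            return True
    return False


if __name__ == "__main__":
    sources = ["بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"]
    # The quote has no diacritics, so only the normalized tier can match it.
    print(quote_is_supported("بسم الله الرحمن الرحيم", sources))
```

Tier 3 compares sets of words and ignores word order, so it is the most permissive of the three checks. A stricter cutoff can be tried through the `min_overlap` parameter.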