Upload 3 files
Browse files- README.md +13 -7
- app.py +308 -19
- requirements.txt +3 -0
README.md
CHANGED
|
@@ -13,15 +13,16 @@ pinned: false
|
|
| 13 |
|
| 14 |
Pathshala AI is a bilingual AI tutor demo for rural primary students in Nepal.
|
| 15 |
|
| 16 |
-
The Gradio Space mirrors the local Streamlit/web app flow. It can
|
| 17 |
-
|
|
|
|
| 18 |
|
| 19 |
- English explanation
|
| 20 |
- Nepali explanation
|
| 21 |
- 3 simple quiz questions
|
| 22 |
- Retrieved textbook sources
|
| 23 |
-
-
|
| 24 |
-
- Parent/teacher summary
|
| 25 |
|
| 26 |
## Deploy To Hugging Face Spaces
|
| 27 |
|
|
@@ -49,8 +50,13 @@ git push
|
|
| 49 |
## Recommended Submission Mode
|
| 50 |
|
| 51 |
For the easiest hackathon submission, deploy the Space without `BACKEND_URL`.
|
| 52 |
-
It will
|
| 53 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 54 |
|
| 55 |
For the full RAG workflow, first deploy the FastAPI backend somewhere public, then set `BACKEND_URL` in the Space settings.
|
| 56 |
|
|
@@ -81,7 +87,7 @@ If the backend returns `normalized_question`, the Space shows the interpreted qu
|
|
| 81 |
|
| 82 |
## Mock Mode
|
| 83 |
|
| 84 |
-
If `BACKEND_URL` is missing or the backend is unavailable, the Space uses
|
| 85 |
|
| 86 |
Example question:
|
| 87 |
|
|
|
|
| 13 |
|
| 14 |
Pathshala AI is a bilingual AI tutor demo for rural primary students in Nepal.
|
| 15 |
|
| 16 |
+
The Gradio Space mirrors the local Streamlit/web app flow. It can upload a text-based
|
| 17 |
+
PDF directly inside Hugging Face Spaces, accept a student question in English, Nepali,
|
| 18 |
+
or romanized Nepali, retrieve relevant textbook portions, and then return:
|
| 19 |
|
| 20 |
- English explanation
|
| 21 |
- Nepali explanation
|
| 22 |
- 3 simple quiz questions
|
| 23 |
- Retrieved textbook sources
|
| 24 |
+
- Basic quiz grading in Space-local mode
|
| 25 |
+
- Parent/teacher summary note in Space-local mode
|
| 26 |
|
| 27 |
## Deploy To Hugging Face Spaces
|
| 28 |
|
|
|
|
| 50 |
## Recommended Submission Mode
|
| 51 |
|
| 52 |
For the easiest hackathon submission, deploy the Space without `BACKEND_URL`.
|
| 53 |
+
It will run a Space-local workflow:
|
| 54 |
+
|
| 55 |
+
1. Upload a text-based PDF.
|
| 56 |
+
2. Extract text with PyMuPDF.
|
| 57 |
+
3. Create embeddings with `sentence-transformers`.
|
| 58 |
+
4. Search the uploaded book in memory.
|
| 59 |
+
5. Show Nepali quiz questions and retrieved textbook portions.
|
| 60 |
|
| 61 |
For the full RAG workflow, first deploy the FastAPI backend somewhere public, then set `BACKEND_URL` in the Space settings.
|
| 62 |
|
|
|
|
| 87 |
|
| 88 |
## Mock Mode
|
| 89 |
|
| 90 |
+
If `BACKEND_URL` is missing or the backend is unavailable, the Space uses local PDF extraction and in-memory retrieval. This supports text-based PDFs. For scanned PDFs or persistent student progress, deploy the backend and set `BACKEND_URL`.
|
| 91 |
|
| 92 |
Example question:
|
| 93 |
|
app.py
CHANGED
|
@@ -1,8 +1,10 @@
|
|
| 1 |
import os
|
| 2 |
from typing import Any
|
|
|
|
| 3 |
|
| 4 |
from dotenv import load_dotenv
|
| 5 |
import gradio as gr
|
|
|
|
| 6 |
import requests
|
| 7 |
|
| 8 |
|
|
@@ -18,14 +20,20 @@ EXAMPLE_CONTEXT = (
|
|
| 18 |
"Soil erosion is the removal of topsoil by wind, water, or other natural forces. "
|
| 19 |
"It can make farmland less fertile and can be reduced by planting trees and grass."
|
| 20 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 21 |
|
| 22 |
|
| 23 |
-
def upload_textbook(pdf_path: str | None) -> str:
|
| 24 |
if not pdf_path:
|
| 25 |
-
return "Choose a PDF first."
|
| 26 |
|
| 27 |
if not BACKEND_URL:
|
| 28 |
-
return
|
| 29 |
|
| 30 |
try:
|
| 31 |
with open(pdf_path, "rb") as pdf_file:
|
|
@@ -41,22 +49,55 @@ def upload_textbook(pdf_path: str | None) -> str:
|
|
| 41 |
method_text = f" Text extraction: {extraction_method}." if extraction_method else ""
|
| 42 |
return (
|
| 43 |
f"Uploaded {result['filename']} with {result['page_count']} pages "
|
| 44 |
-
f"and {result['chunk_count']} chunks.{method_text}"
|
|
|
|
|
|
|
| 45 |
)
|
| 46 |
|
| 47 |
-
return _response_error(response, "Upload failed.")
|
| 48 |
except requests.Timeout:
|
| 49 |
-
return "Backend is still processing the PDF. Try a smaller PDF for the demo."
|
| 50 |
except requests.RequestException as exc:
|
| 51 |
-
return f"Could not reach backend: {exc}"
|
| 52 |
except OSError as exc:
|
| 53 |
-
return f"Could not read uploaded PDF: {exc}"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 54 |
|
| 55 |
|
| 56 |
def ask_tutor(
|
| 57 |
question: str,
|
| 58 |
student_id: str,
|
| 59 |
textbook_context: str,
|
|
|
|
| 60 |
) -> tuple[str, str, str, str, str, dict[str, Any]]:
|
| 61 |
question = question.strip()
|
| 62 |
student_id = (student_id or "hf-space-demo").strip()
|
|
@@ -78,7 +119,12 @@ def ask_tutor(
|
|
| 78 |
if backend_result and not is_insufficient_backend_result(backend_result):
|
| 79 |
return backend_result
|
| 80 |
|
| 81 |
-
return
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 82 |
|
| 83 |
|
| 84 |
def ask_backend(
|
|
@@ -145,12 +191,12 @@ def grade_quiz(
|
|
| 145 |
student_id: str,
|
| 146 |
quiz_state: dict[str, Any] | None,
|
| 147 |
) -> str:
|
| 148 |
-
if not BACKEND_URL:
|
| 149 |
-
return "Quiz grading needs the backend. Demo mode can show questions but cannot grade them."
|
| 150 |
-
|
| 151 |
quiz_state = quiz_state or {}
|
| 152 |
quiz_id = quiz_state.get("quiz_id")
|
| 153 |
|
|
|
|
|
|
|
|
|
|
| 154 |
if not quiz_id:
|
| 155 |
return "Ask the tutor first so a quiz can be created."
|
| 156 |
|
|
@@ -177,9 +223,59 @@ def grade_quiz(
|
|
| 177 |
return "Quiz grading returned an invalid response."
|
| 178 |
|
| 179 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 180 |
def parent_summary(student_id: str) -> str:
|
| 181 |
if not BACKEND_URL:
|
| 182 |
-
return
|
|
|
|
|
|
|
|
|
|
|
|
|
| 183 |
|
| 184 |
student_id = (student_id or "hf-space-demo").strip()
|
| 185 |
|
|
@@ -225,6 +321,103 @@ def is_insufficient_backend_result(result: tuple[str, str, str, str, str, dict[s
|
|
| 225 |
return any(marker in combined for marker in markers)
|
| 226 |
|
| 227 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 228 |
def mock_response(question: str, textbook_context: str) -> tuple[str, str, str, str, str, dict[str, Any]]:
|
| 229 |
context = textbook_context or EXAMPLE_CONTEXT
|
| 230 |
normalized_question = normalize_question_mock(question)
|
|
@@ -252,6 +445,54 @@ def mock_response(question: str, textbook_context: str) -> tuple[str, str, str,
|
|
| 252 |
)
|
| 253 |
|
| 254 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 255 |
def mock_english_explanation(normalized_question: str, context: str) -> str:
|
| 256 |
text = f"{normalized_question} {context}".lower()
|
| 257 |
|
|
@@ -330,6 +571,53 @@ def mock_nepali_explanation(normalized_question: str, context: str = "") -> str:
|
|
| 330 |
return "यो विषयलाई सरल रूपमा बुझ्न पाठ्यपुस्तकको सन्दर्भ पढेर मुख्य कुरा सम्झनुहोस्।"
|
| 331 |
|
| 332 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 333 |
def normalize_question_mock(question: str) -> str:
|
| 334 |
text = question.lower()
|
| 335 |
|
|
@@ -491,12 +779,13 @@ with gr.Blocks(title=APP_NAME, theme=gr.themes.Soft()) as demo:
|
|
| 491 |
gr.Markdown(
|
| 492 |
"""
|
| 493 |
# Pathshala AI
|
| 494 |
-
Bilingual AI tutor for rural primary students in Nepal. Upload a PDF
|
| 495 |
-
|
| 496 |
"""
|
| 497 |
)
|
| 498 |
|
| 499 |
quiz_state = gr.State({})
|
|
|
|
| 500 |
|
| 501 |
with gr.Row():
|
| 502 |
student_id_input = gr.Textbox(
|
|
@@ -508,7 +797,7 @@ with gr.Blocks(title=APP_NAME, theme=gr.themes.Soft()) as demo:
|
|
| 508 |
label="Status",
|
| 509 |
value=(
|
| 510 |
"Backend connected." if BACKEND_URL else
|
| 511 |
-
"
|
| 512 |
),
|
| 513 |
interactive=False,
|
| 514 |
scale=2,
|
|
@@ -581,18 +870,18 @@ with gr.Blocks(title=APP_NAME, theme=gr.themes.Soft()) as demo:
|
|
| 581 |
status_output,
|
| 582 |
quiz_state,
|
| 583 |
],
|
| 584 |
-
fn=lambda question, context: ask_tutor(question, "hf-space-demo", context),
|
| 585 |
cache_examples=False,
|
| 586 |
)
|
| 587 |
|
| 588 |
upload_button.click(
|
| 589 |
fn=upload_textbook,
|
| 590 |
inputs=[pdf_input],
|
| 591 |
-
outputs=[upload_output],
|
| 592 |
)
|
| 593 |
ask_button.click(
|
| 594 |
fn=ask_tutor,
|
| 595 |
-
inputs=[question_input, student_id_input, context_input],
|
| 596 |
outputs=[
|
| 597 |
english_output,
|
| 598 |
nepali_output,
|
|
|
|
| 1 |
import os
|
| 2 |
from typing import Any
|
| 3 |
+
from functools import lru_cache
|
| 4 |
|
| 5 |
from dotenv import load_dotenv
|
| 6 |
import gradio as gr
|
| 7 |
+
import numpy as np
|
| 8 |
import requests
|
| 9 |
|
| 10 |
|
|
|
|
| 20 |
"Soil erosion is the removal of topsoil by wind, water, or other natural forces. "
|
| 21 |
"It can make farmland less fertile and can be reduced by planting trees and grass."
|
| 22 |
)
|
| 23 |
+
MIN_CHUNK_CHARS = 250
|
| 24 |
+
MAX_CHUNK_CHARS = 900
|
| 25 |
+
EMBEDDING_MODEL = os.getenv(
|
| 26 |
+
"EMBEDDING_MODEL",
|
| 27 |
+
"sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
|
| 28 |
+
)
|
| 29 |
|
| 30 |
|
| 31 |
+
def upload_textbook(pdf_path: str | None) -> tuple[str, dict[str, Any], Any]:
|
| 32 |
if not pdf_path:
|
| 33 |
+
return "Choose a PDF first.", {}, gr.update()
|
| 34 |
|
| 35 |
if not BACKEND_URL:
|
| 36 |
+
return upload_textbook_locally(pdf_path)
|
| 37 |
|
| 38 |
try:
|
| 39 |
with open(pdf_path, "rb") as pdf_file:
|
|
|
|
| 49 |
method_text = f" Text extraction: {extraction_method}." if extraction_method else ""
|
| 50 |
return (
|
| 51 |
f"Uploaded {result['filename']} with {result['page_count']} pages "
|
| 52 |
+
f"and {result['chunk_count']} chunks.{method_text}",
|
| 53 |
+
{},
|
| 54 |
+
gr.update(value=""),
|
| 55 |
)
|
| 56 |
|
| 57 |
+
return _response_error(response, "Upload failed."), {}, gr.update()
|
| 58 |
except requests.Timeout:
|
| 59 |
+
return "Backend is still processing the PDF. Try a smaller PDF for the demo.", {}, gr.update()
|
| 60 |
except requests.RequestException as exc:
|
| 61 |
+
return f"Could not reach backend: {exc}", {}, gr.update()
|
| 62 |
except OSError as exc:
|
| 63 |
+
return f"Could not read uploaded PDF: {exc}", {}, gr.update()
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
def upload_textbook_locally(pdf_path: str) -> tuple[str, dict[str, Any], Any]:
    """Process a PDF entirely inside the Space: extract, chunk, and embed it.

    Mirrors the backend upload path's return shape:
    (status message, textbook state dict, context-textbox update).
    Any failure is reported as a status message rather than raised.
    """
    try:
        extracted = extract_pdf_text(pdf_path)
        chunks = chunk_text(extracted["text"])
        if not chunks:
            return "No readable text chunks could be created from this PDF.", {}, gr.update()

        vectors = embed_texts(chunks)
        textbook_state = {
            "filename": os.path.basename(pdf_path),
            "page_count": extracted["page_count"],
            "chunk_count": len(chunks),
            "extraction_method": extracted["extraction_method"],
            # Stored as plain lists so the state survives Gradio's JSON round-trip.
            "chunks": chunks,
            "embeddings": vectors.tolist(),
        }
        status = (
            f"Uploaded {textbook_state['filename']} inside this Space with "
            f"{textbook_state['page_count']} pages and {textbook_state['chunk_count']} chunks. "
            f"Text extraction: {textbook_state['extraction_method']}."
        )
        # Clear the pasted-context box so the uploaded book is actually used.
        return status, textbook_state, gr.update(value="")
    except Exception as exc:
        # Broad on purpose: in a demo Space any processing error should surface
        # in the status field instead of crashing the UI.
        return f"Could not process uploaded PDF in this Space: {exc}", {}, gr.update()
|
| 94 |
|
| 95 |
|
| 96 |
def ask_tutor(
|
| 97 |
question: str,
|
| 98 |
student_id: str,
|
| 99 |
textbook_context: str,
|
| 100 |
+
textbook_state: dict[str, Any] | None,
|
| 101 |
) -> tuple[str, str, str, str, str, dict[str, Any]]:
|
| 102 |
question = question.strip()
|
| 103 |
student_id = (student_id or "hf-space-demo").strip()
|
|
|
|
| 119 |
if backend_result and not is_insufficient_backend_result(backend_result):
|
| 120 |
return backend_result
|
| 121 |
|
| 122 |
+
return local_response(
|
| 123 |
+
question=question,
|
| 124 |
+
student_id=student_id,
|
| 125 |
+
textbook_context=textbook_context,
|
| 126 |
+
textbook_state=textbook_state or {},
|
| 127 |
+
)
|
| 128 |
|
| 129 |
|
| 130 |
def ask_backend(
|
|
|
|
| 191 |
student_id: str,
|
| 192 |
quiz_state: dict[str, Any] | None,
|
| 193 |
) -> str:
|
|
|
|
|
|
|
|
|
|
| 194 |
quiz_state = quiz_state or {}
|
| 195 |
quiz_id = quiz_state.get("quiz_id")
|
| 196 |
|
| 197 |
+
if not BACKEND_URL:
|
| 198 |
+
return grade_quiz_locally([answer_1, answer_2, answer_3], quiz_state)
|
| 199 |
+
|
| 200 |
if not quiz_id:
|
| 201 |
return "Ask the tutor first so a quiz can be created."
|
| 202 |
|
|
|
|
| 223 |
return "Quiz grading returned an invalid response."
|
| 224 |
|
| 225 |
|
| 226 |
+
def grade_quiz_locally(answers: list[str], quiz_state: dict[str, Any]) -> str:
    """Grade up to three quiz answers in Space-local mode.

    Compares each student answer against the expected answer stored in
    *quiz_state* via fuzzy matching and returns a plain-text score report.
    """
    questions = quiz_state.get("quiz_questions", [])
    expected_answers = quiz_state.get("expected_answers", [])

    if not questions:
        return "Ask the tutor first so a quiz can be created."

    score = 0
    report: list[str] = []

    for position, question in enumerate(questions[:3]):
        given = answers[position].strip() if position < len(answers) else ""
        expected = str(expected_answers[position]) if position < len(expected_answers) else ""
        correct = is_answer_close(given, expected)
        score += int(correct)

        status = "Correct" if correct else "Needs practice"
        report.append(f"{status}: {question}")
        # Only show the model answer when the student missed it.
        if expected and not correct:
            report.append(f"Expected idea: {expected}")

    return f"Score: {score} / {min(len(questions), 3)}\n" + "\n".join(report)
|
| 251 |
+
|
| 252 |
+
|
| 253 |
+
def is_answer_close(student_answer: str, expected_answer: str) -> bool:
    """Fuzzy-match a student answer against the expected one by token overlap.

    Returns True when at least 35% of the expected tokens appear in the
    student's answer, or when the normalized student answer occurs verbatim
    inside the normalized expected answer.
    """
    normalized_student = normalize_answer(student_answer)
    normalized_expected = normalize_answer(expected_answer)
    student_tokens = set(normalized_student.split())
    expected_tokens = set(normalized_expected.split())

    if not student_tokens or not expected_tokens:
        return False

    shared = student_tokens & expected_tokens
    overlap_ratio = len(shared) / max(len(expected_tokens), 1)
    return overlap_ratio >= 0.35 or normalized_student in normalized_expected
|
| 262 |
+
|
| 263 |
+
|
| 264 |
+
def normalize_answer(answer: str) -> str:
    """Lowercase *answer* and strip surrounding punctuation from each word.

    Handles both ASCII punctuation and the Devanagari danda. Words that are
    nothing but punctuation are dropped entirely.
    """
    punctuation = ".,?!:;()[]{}\"'।"
    cleaned = (word.strip(punctuation).lower() for word in answer.split())
    return " ".join(token for token in cleaned if token)
|
| 270 |
+
|
| 271 |
+
|
| 272 |
def parent_summary(student_id: str) -> str:
|
| 273 |
if not BACKEND_URL:
|
| 274 |
+
return (
|
| 275 |
+
"Parent/teacher summary\n\n"
|
| 276 |
+
"The student has practiced with the uploaded or pasted textbook context in this Space. "
|
| 277 |
+
"For persistent progress across sessions, deploy the FastAPI backend and set BACKEND_URL."
|
| 278 |
+
)
|
| 279 |
|
| 280 |
student_id = (student_id or "hf-space-demo").strip()
|
| 281 |
|
|
|
|
| 321 |
return any(marker in combined for marker in markers)
|
| 322 |
|
| 323 |
|
| 324 |
+
def extract_pdf_text(pdf_path: str) -> dict[str, Any]:
    """Extract selectable text from a PDF with PyMuPDF.

    Returns a dict with keys:
        "text": all non-empty page texts joined with blank lines,
        "page_count": total pages in the document,
        "extraction_method": always "pymupdf-local".

    Raises:
        ValueError: if no selectable text is found (e.g. a scanned PDF).
    """
    # Imported lazily so the Space can start even when PyMuPDF is only
    # needed for the Space-local upload path.
    import fitz

    page_texts = []

    with fitz.open(pdf_path) as document:
        for page in document:
            text = page.get_text("text").strip()
            if text:
                page_texts.append(text)

        # Read while the document is still open.
        page_count = document.page_count

    text = "\n\n".join(page_texts).strip()

    if not text:
        raise ValueError(
            "No selectable text was found. For scanned PDFs, deploy with a backend "
            "or paste a short textbook paragraph into the context box."
        )

    return {
        "text": text,
        "page_count": page_count,
        "extraction_method": "pymupdf-local",
    }
|
| 350 |
+
|
| 351 |
+
|
| 352 |
+
def chunk_text(text: str) -> list[str]:
|
| 353 |
+
paragraphs = [part.strip() for part in text.splitlines() if part.strip()]
|
| 354 |
+
chunks = []
|
| 355 |
+
current = ""
|
| 356 |
+
|
| 357 |
+
for paragraph in paragraphs:
|
| 358 |
+
if len(current) + len(paragraph) + 2 <= MAX_CHUNK_CHARS:
|
| 359 |
+
current = f"{current}\n{paragraph}".strip()
|
| 360 |
+
continue
|
| 361 |
+
|
| 362 |
+
if len(current) >= MIN_CHUNK_CHARS:
|
| 363 |
+
chunks.append(current)
|
| 364 |
+
current = paragraph
|
| 365 |
+
else:
|
| 366 |
+
current = f"{current}\n{paragraph}".strip()
|
| 367 |
+
|
| 368 |
+
if current:
|
| 369 |
+
chunks.append(current)
|
| 370 |
+
|
| 371 |
+
return chunks or ([text.strip()] if text.strip() else [])
|
| 372 |
+
|
| 373 |
+
|
| 374 |
+
@lru_cache(maxsize=1)
def get_embedding_model():
    """Load and cache the sentence-transformers embedding model.

    lru_cache(maxsize=1) makes this a one-time, process-wide model load;
    the model named by EMBEDDING_MODEL is downloaded on first use.
    """
    # Imported lazily so backend-connected Spaces never pay the
    # sentence-transformers / torch import cost.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(EMBEDDING_MODEL)
|
| 379 |
+
|
| 380 |
+
|
| 381 |
+
def embed_texts(texts: list[str]) -> np.ndarray:
    """Encode *texts* into L2-normalized embedding vectors.

    Normalized embeddings let callers score similarity with a plain dot
    product instead of an explicit cosine computation.
    """
    encoder = get_embedding_model()
    vectors = encoder.encode(
        texts,
        convert_to_numpy=True,
        normalize_embeddings=True,
        show_progress_bar=False,
    )
    return np.asarray(vectors)
|
| 391 |
+
|
| 392 |
+
|
| 393 |
+
def retrieve_local_sources(
    question: str,
    textbook_state: dict[str, Any],
    limit: int = 5,
) -> list[dict[str, Any]]:
    """Rank the stored textbook chunks against *question* by similarity.

    Embeddings are normalized at creation time, so the dot product is the
    cosine similarity. Returns up to *limit* source dicts, best match first;
    an empty list when no chunks or embeddings are stored.
    """
    chunks = [str(chunk) for chunk in textbook_state.get("chunks", [])]
    stored = np.asarray(textbook_state.get("embeddings", []), dtype=float)

    if not chunks or stored.size == 0:
        return []

    query_vector = embed_texts([question])[0]
    similarity = stored @ query_vector
    # argsort + reverse keeps the original stable tie ordering.
    ranked = np.argsort(similarity)[::-1][:limit]

    results: list[dict[str, Any]] = []
    for index in ranked:
        results.append(
            {
                "score": float(similarity[index]),
                "text": chunks[index],
                "metadata": {
                    "filename": textbook_state.get("filename", "uploaded-textbook"),
                    "chunk_index": int(index),
                },
            }
        )
    return results
|
| 419 |
+
|
| 420 |
+
|
| 421 |
def mock_response(question: str, textbook_context: str) -> tuple[str, str, str, str, str, dict[str, Any]]:
|
| 422 |
context = textbook_context or EXAMPLE_CONTEXT
|
| 423 |
normalized_question = normalize_question_mock(question)
|
|
|
|
| 445 |
)
|
| 446 |
|
| 447 |
|
| 448 |
+
def local_response(
    question: str,
    student_id: str,
    textbook_context: str,
    textbook_state: dict[str, Any],
) -> tuple[str, str, str, str, str, dict[str, Any]]:
    """Answer a question with pasted context or the locally indexed textbook.

    Source priority: explicitly pasted context first, then the uploaded
    book's in-memory index. When neither yields any context, falls back to
    the canned mock response. Returns the six-tuple expected by the UI:
    (english, nepali, quiz text, sources text, status, quiz state).
    """
    normalized_question = normalize_question_mock(question)

    if textbook_context.strip():
        # Pasted context wins over the uploaded book; present its chunks as
        # full-score sources.
        sources = [
            {
                "score": 1.0,
                "text": chunk,
                "metadata": {"filename": "pasted-context", "chunk_index": index},
            }
            for index, chunk in enumerate(chunk_text(textbook_context)[:5])
        ]
    elif textbook_state.get("chunks") and textbook_state.get("embeddings"):
        sources = retrieve_local_sources(normalized_question, textbook_state, limit=5)
    else:
        sources = []

    context = "\n\n".join(str(source.get("text", "")) for source in sources).strip()

    if not context:
        return mock_response(question=question, textbook_context=textbook_context)

    english_answer = (
        f"Interpreted question: {normalized_question}\n\n"
        f"Answer from the uploaded textbook context:\n{truncate(context, max_length=700)}"
    )
    quiz_questions = local_nepali_quiz_questions(context)
    new_quiz_state = {
        "student_id": student_id,
        "quiz_questions": quiz_questions,
        # One shared expected answer for all three open-ended questions.
        "expected_answers": [source_answer(sources)] * 3,
    }

    return (
        english_answer,
        local_nepali_answer(normalized_question, context),
        format_quiz(quiz_questions),
        format_sources(sources),
        "Answered with the Hugging Face Space local PDF workflow.",
        new_quiz_state,
    )
|
| 494 |
+
|
| 495 |
+
|
| 496 |
def mock_english_explanation(normalized_question: str, context: str) -> str:
|
| 497 |
text = f"{normalized_question} {context}".lower()
|
| 498 |
|
|
|
|
| 571 |
return "यो विषयलाई सरल रूपमा बुझ्न पाठ्यपुस्तकको सन्दर्भ पढेर मुख्य कुरा सम्झनुहोस्।"
|
| 572 |
|
| 573 |
|
| 574 |
+
def local_nepali_answer(normalized_question: str, context: str) -> str:
    """Produce a Nepali explanation for the Space-local workflow.

    Prefers a topic-specific answer from mock_nepali_explanation; otherwise
    quotes the retrieved context when it is already in Nepali, or returns a
    generic Nepali study hint.
    """
    known_answer = mock_nepali_explanation(normalized_question, context)

    # NOTE(review): this literal must stay byte-identical to the generic
    # fallback string returned by mock_nepali_explanation — any answer other
    # than that fallback is considered topic-specific and returned as-is.
    if known_answer != "यो विषयलाई सरल रूपमा बुझ्न पाठ्यपुस्तकको सन्दर्भ पढेर मुख्य कुरा सम्झनुहोस्।":
        return known_answer

    # Context already written in Devanagari: quote it directly (truncated).
    if has_devanagari(context):
        return (
            "अपलोड गरिएको पाठ्यपुस्तकको सन्दर्भअनुसार मुख्य कुरा यस्तो छ:\n\n"
            f"{truncate(context, max_length=700)}"
        )

    # Non-Nepali context: return a generic Nepali study suggestion.
    return (
        "अपलोड गरिएको पाठ्यपुस्तकको सन्दर्भअनुसार यो विषय महत्त्वपूर्ण छ। "
        "मुख्य शब्दहरू पढ्नुहोस्, उदाहरणसँग जोड्नुहोस्, र आफ्नै सरल शब्दमा उत्तर लेख्ने अभ्यास गर्नुहोस्।"
    )
|
| 590 |
+
|
| 591 |
+
|
| 592 |
+
def local_nepali_quiz_questions(context: str) -> list[str]:
    """Build three simple Nepali quiz questions from the retrieved context."""
    # Anchor the second question on the context's first sentence, capped so
    # the question stays readable.
    short_context = truncate(first_sentence(context), max_length=140)
    return [
        "प्राप्त पाठ्यपुस्तक सन्दर्भको मुख्य कुरा के हो?",
        f"यो वाक्यले के बुझाउँछ: {short_context}",
        "यस विषयलाई आफ्नै सरल शब्दमा कसरी भन्न सकिन्छ?",
    ]
|
| 599 |
+
|
| 600 |
+
|
| 601 |
+
def source_answer(sources: list[dict[str, Any]]) -> str:
    """Derive the expected quiz answer from the top retrieved source."""
    if not sources:
        # Generic fallback: "the main point of the textbook."
        return "पाठ्यपुस्तकको मुख्य कुरा।"

    # Use the first sentence of the best-matching chunk; `or text` only
    # matters when the chunk text is empty and first_sentence returns "".
    text = str(sources[0].get("text", "")).strip()
    return truncate(first_sentence(text) or text, max_length=220)
|
| 607 |
+
|
| 608 |
+
|
| 609 |
+
def first_sentence(text: str) -> str:
    """Return the first sentence of *text*, keeping its terminator.

    Checks the Devanagari danda before ASCII terminators; when no terminator
    is present, returns the whole stripped text.
    """
    for mark in ("।", ".", "?", "!"):
        head, sep, _rest = text.partition(mark)
        if sep:
            return head.strip() + mark
    return text.strip()
|
| 615 |
+
|
| 616 |
+
|
| 617 |
+
def has_devanagari(text: str) -> bool:
    """Report whether *text* contains any Devanagari code point (U+0900-U+097F)."""
    return any(0x0900 <= ord(ch) <= 0x097F for ch in text)
|
| 619 |
+
|
| 620 |
+
|
| 621 |
def normalize_question_mock(question: str) -> str:
|
| 622 |
text = question.lower()
|
| 623 |
|
|
|
|
| 779 |
gr.Markdown(
|
| 780 |
"""
|
| 781 |
# Pathshala AI
|
| 782 |
+
Bilingual AI tutor for rural primary students in Nepal. Upload a PDF directly
|
| 783 |
+
in this Space, or connect a public backend for the full production workflow.
|
| 784 |
"""
|
| 785 |
)
|
| 786 |
|
| 787 |
quiz_state = gr.State({})
|
| 788 |
+
textbook_state = gr.State({})
|
| 789 |
|
| 790 |
with gr.Row():
|
| 791 |
student_id_input = gr.Textbox(
|
|
|
|
| 797 |
label="Status",
|
| 798 |
value=(
|
| 799 |
"Backend connected." if BACKEND_URL else
|
| 800 |
+
"Space-local PDF upload is active. Set BACKEND_URL for the full backend workflow."
|
| 801 |
),
|
| 802 |
interactive=False,
|
| 803 |
scale=2,
|
|
|
|
| 870 |
status_output,
|
| 871 |
quiz_state,
|
| 872 |
],
|
| 873 |
+
fn=lambda question, context: ask_tutor(question, "hf-space-demo", context, {}),
|
| 874 |
cache_examples=False,
|
| 875 |
)
|
| 876 |
|
| 877 |
upload_button.click(
|
| 878 |
fn=upload_textbook,
|
| 879 |
inputs=[pdf_input],
|
| 880 |
+
outputs=[upload_output, textbook_state, context_input],
|
| 881 |
)
|
| 882 |
ask_button.click(
|
| 883 |
fn=ask_tutor,
|
| 884 |
+
inputs=[question_input, student_id_input, context_input, textbook_state],
|
| 885 |
outputs=[
|
| 886 |
english_output,
|
| 887 |
nepali_output,
|
requirements.txt
CHANGED
|
@@ -1,3 +1,6 @@
|
|
| 1 |
gradio>=4.44.0
|
| 2 |
python-dotenv>=1.0.0
|
| 3 |
requests>=2.31.0
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
gradio>=4.44.0
|
| 2 |
python-dotenv>=1.0.0
|
| 3 |
requests>=2.31.0
|
| 4 |
+
numpy>=1.26.0
|
| 5 |
+
PyMuPDF>=1.24.0
|
| 6 |
+
sentence-transformers==2.7.0
|