prasai-ap committed on
Commit
c46c77f
·
verified ·
1 Parent(s): 7742d2e

Upload 3 files

Files changed (2)
  1. README.md +8 -85
  2. app.py +211 -593
README.md CHANGED
@@ -13,90 +13,13 @@ pinned: false
13
 
14
  Pathshala AI is a bilingual AI tutor demo for rural primary students in Nepal.
15
 
16
- The Gradio Space mirrors the local Streamlit/web app flow. It can upload a text-based
17
- PDF directly inside Hugging Face Spaces, accept a student question in English, Nepali,
18
- or romanized Nepali, retrieve relevant textbook portions, and return:
19
 
20
- - English explanation
21
- - Nepali explanation
22
- - 3 simple quiz questions
23
- - Retrieved textbook sources
24
- - Basic quiz grading in Space-local mode
25
- - Parent/teacher summary note in Space-local mode
26
 
27
- ## Deploy To Hugging Face Spaces
28
-
29
- 1. Create a new Hugging Face Space.
30
- 2. Choose `Gradio` as the SDK.
31
- 3. Upload the files from this `hf_space/` folder into the root of the Space:
32
- - `app.py`
33
- - `requirements.txt`
34
- - `README.md`
35
- 4. Commit the files. Hugging Face will build and run the Space automatically.
36
-
37
- You can also deploy with Git:
38
-
39
- ```bash
40
- git clone https://huggingface.co/spaces/YOUR_USERNAME/pathshala-ai
41
- cp hf_space/app.py pathshala-ai/app.py
42
- cp hf_space/requirements.txt pathshala-ai/requirements.txt
43
- cp hf_space/README.md pathshala-ai/README.md
44
- cd pathshala-ai
45
- git add .
46
- git commit -m "Deploy Pathshala AI Gradio demo"
47
- git push
48
- ```
49
-
50
- ## Recommended Submission Mode
51
-
52
- For the easiest hackathon submission, deploy the Space without `BACKEND_URL`.
53
- It will run a Space-local workflow:
54
-
55
- 1. Upload a text-based PDF.
56
- 2. Extract text with PyMuPDF.
57
- 3. Create embeddings with `sentence-transformers`.
58
- 4. Search the uploaded book in memory.
59
- 5. Show Nepali quiz questions and retrieved textbook portions.
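Steps 3 and 4 of the Space-local workflow above (embed, then search in memory) can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the helper names `cosine` and `top_chunks` are hypothetical, and toy three-dimensional vectors stand in for real `sentence-transformers` embeddings so the example runs with the standard library alone.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_chunks(query_vec, chunk_vecs, chunks, limit=5):
    """Return the `limit` best-matching chunks as (score, text) pairs."""
    scored = [(cosine(query_vec, vec), text) for vec, text in zip(chunk_vecs, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# Toy vectors (assumption): a real Space would embed the chunks with a model.
chunks = ["Soil erosion removes topsoil.", "Plants make food from sunlight."]
chunk_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
best = top_chunks([1.0, 0.0, 0.0], chunk_vecs, chunks, limit=1)
```

Because everything lives in plain Python lists, the index disappears when the session ends, which is consistent with the note that persistent progress needs the backend.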
60
-
61
- For the full RAG workflow, first deploy the FastAPI backend somewhere public, then set `BACKEND_URL` in the Space settings.
62
-
63
- ## Backend Mode
64
-
65
- Set `BACKEND_URL` to use the FastAPI backend:
66
-
67
- ```bash
68
- BACKEND_URL=https://your-backend.example.com
69
- ```
70
-
71
- In Hugging Face Spaces, add it under:
72
-
73
- ```text
74
- Space settings -> Variables and secrets -> New variable
75
- ```
76
-
77
- The app calls:
78
-
79
- - `POST /upload-textbook` for PDF uploads
80
- - `POST /ask` for bilingual textbook-grounded answers
81
- - `POST /grade-quiz` for quiz grading
82
- - `GET /parent-summary/{student_id}` for the parent/teacher summary
83
-
84
- The `/ask` request sends both the student question and the optional textbook context.
85
- If a user types context in the Space, the backend can answer from that context even when no PDF has been uploaded.
86
- If the backend returns `normalized_question`, the Space shows the interpreted question above the English explanation.
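As a rough sketch of the `/ask` call described above, the request could be assembled like this. The field names `question`, `student_id`, and `textbook_context` are assumed here, and `build_ask_request` is a hypothetical helper; the example only builds the request object and does not send it.

```python
import json
import urllib.request


def build_ask_request(backend_url, question, student_id, textbook_context=""):
    """Build (but do not send) a POST /ask request with optional context."""
    payload = {"question": question, "student_id": student_id}
    if textbook_context:
        # Context is optional: include it only when the user typed something.
        payload["textbook_context"] = textbook_context
    return urllib.request.Request(
        f"{backend_url.rstrip('/')}/ask",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_ask_request(
    "https://your-backend.example.com",
    "soil erosion vaneko ke ho",
    "hf-space-demo",
)
```

Passing the result to `urllib.request.urlopen(request, timeout=...)` would send it; an explicit timeout is advisable because answer generation can be slow.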
87
-
88
- ## Mock Mode
89
-
90
- If `BACKEND_URL` is missing or the backend is unavailable, the Space uses local PDF extraction and in-memory retrieval. This supports text-based PDFs. For scanned PDFs or persistent student progress, deploy the backend and set `BACKEND_URL`.
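The fallback rule above can be pictured as a small decision function. This is only a sketch under stated assumptions: `resolve_mode` and the `backend_is_up` callback are hypothetical names, and how the Space actually detects an unavailable backend is not shown in this README.

```python
def resolve_mode(env, backend_is_up):
    """Pick "backend" only when BACKEND_URL is set and the backend responds."""
    backend_url = env.get("BACKEND_URL", "").rstrip("/")
    if backend_url and backend_is_up(backend_url):
        return "backend", backend_url
    # Missing variable or unreachable backend: use local PDF extraction.
    return "space-local", ""


mode, url = resolve_mode({"BACKEND_URL": "https://your-backend.example.com/"}, lambda _: True)
```

In a real Space, `env` would be `os.environ` and `backend_is_up` would wrap a short health-check request.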
91
-
92
- Example question:
93
-
94
- ```text
95
- soil erosion vaneko ke ho
96
- ```
97
-
98
- You can also try mixed romanized Nepali questions such as:
99
-
100
- ```text
101
- photosynthesis vaneko ke ho vana
102
- ```
 
13
 
14
  Pathshala AI is a bilingual AI tutor demo for rural primary students in Nepal.
15
 
16
+ This Hugging Face Space supports:
 
 
17
 
18
+ - Uploading a text-based PDF textbook directly in the Space
19
+ - Asking questions in English, Nepali, or romanized Nepali
20
+ - Retrieving relevant textbook portions from the uploaded PDF
21
+ - Showing a simple English answer and Nepali explanation
22
+ - Generating Nepali quiz questions
23
+ - Basic quiz grading
24
 
25
+ For scanned PDF OCR and persistent progress, deploy the FastAPI backend separately and add a Space variable named `BACKEND_URL`.
app.py CHANGED
@@ -1,6 +1,5 @@
1
  import json
2
  import os
3
- from typing import Any
4
  from functools import lru_cache
5
 
6
  from dotenv import load_dotenv
@@ -13,62 +12,31 @@ load_dotenv()
13
 
14
  APP_NAME = os.getenv("APP_NAME", "Pathshala AI")
15
  BACKEND_URL = os.getenv("BACKEND_URL", "").rstrip("/")
16
- UPLOAD_TIMEOUT_SECONDS = 900
17
- ASK_TIMEOUT_SECONDS = 180
18
- SHORT_TIMEOUT_SECONDS = 45
19
- EXAMPLE_QUESTION = "soil erosion vaneko ke ho"
20
- EXAMPLE_CONTEXT = (
21
- "Soil erosion is the removal of topsoil by wind, water, or other natural forces. "
22
- "It can make farmland less fertile and can be reduced by planting trees and grass."
23
- )
24
- MIN_CHUNK_CHARS = 250
25
- MAX_CHUNK_CHARS = 900
26
  EMBEDDING_MODEL = os.getenv(
27
  "EMBEDDING_MODEL",
28
  "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
29
  )
30
 
31
 
32
  def upload_textbook(pdf_path):
33
  if not pdf_path:
34
  return "Choose a PDF first.", "{}", gr.update()
35
 
36
- if not BACKEND_URL:
37
- return upload_textbook_locally(pdf_path)
38
-
39
- try:
40
- with open(pdf_path, "rb") as pdf_file:
41
- response = requests.post(
42
- f"{BACKEND_URL}/upload-textbook",
43
- files={"file": (os.path.basename(pdf_path), pdf_file, "application/pdf")},
44
- timeout=UPLOAD_TIMEOUT_SECONDS,
45
- )
46
-
47
- if response.ok:
48
- result = response.json()
49
- extraction_method = result.get("extraction_method")
50
- method_text = f" Text extraction: {extraction_method}." if extraction_method else ""
51
- return (
52
- f"Uploaded {result['filename']} with {result['page_count']} pages "
53
- f"and {result['chunk_count']} chunks.{method_text}",
54
- "{}",
55
- gr.update(value=""),
56
- )
57
-
58
- return _response_error(response, "Upload failed."), "{}", gr.update()
59
- except requests.Timeout:
60
- return "Backend is still processing the PDF. Try a smaller PDF for the demo.", "{}", gr.update()
61
- except requests.RequestException as exc:
62
- return f"Could not reach backend: {exc}", "{}", gr.update()
63
- except OSError as exc:
64
- return f"Could not read uploaded PDF: {exc}", "{}", gr.update()
65
-
66
 
67
- def upload_textbook_locally(pdf_path):
68
  try:
69
  extracted = extract_pdf_text(pdf_path)
70
  chunks = chunk_text(extracted["text"])
71
-
72
  if not chunks:
73
  return "No readable text chunks could be created from this PDF.", "{}", gr.update()
74
 
@@ -77,38 +45,48 @@ def upload_textbook_locally(pdf_path):
77
  "filename": os.path.basename(pdf_path),
78
  "page_count": extracted["page_count"],
79
  "chunk_count": len(chunks),
80
- "extraction_method": extracted["extraction_method"],
81
  "chunks": chunks,
82
  "embeddings": embeddings.tolist(),
83
  }
84
- return (
85
- (
86
- f"Uploaded {state['filename']} inside this Space with "
87
- f"{state['page_count']} pages and {state['chunk_count']} chunks. "
88
- f"Text extraction: {state['extraction_method']}."
89
- ),
90
- encode_state(state),
91
- gr.update(value=""),
92
  )
 
93
  except Exception as exc:
94
- return f"Could not process uploaded PDF in this Space: {exc}", "{}", gr.update()
95
 
96
 
97
- def ask_tutor(
98
- question,
99
- student_id,
100
- textbook_context,
101
- textbook_state,
102
- ):
103
- question = question.strip()
104
  student_id = (student_id or "hf-space-demo").strip()
105
- textbook_context = textbook_context.strip()
106
 
107
  if not question:
108
  return (
109
  "Please type a student question.",
110
  "कृपया विद्यार्थीको प्रश्न लेख्नुहोस्।",
111
- "1. Add a question first.\n2. Then try again.\n3. Use a textbook topic.",
112
  "",
113
  "Waiting for a question.",
114
  "{}",
@@ -116,259 +94,177 @@ def ask_tutor(
116
 
117
  if BACKEND_URL:
118
  backend_result = ask_backend(question, student_id, textbook_context)
119
-
120
- if backend_result and not is_insufficient_backend_result(backend_result):
121
  return backend_result
122
 
123
- return local_response(
124
- question=question,
125
- student_id=student_id,
126
- textbook_context=textbook_context,
127
- textbook_state=decode_state(textbook_state),
128
  )
129
 
130
 
131
- def ask_backend(
132
- question: str,
133
- student_id: str,
134
- textbook_context: str,
135
- ) -> tuple[str, str, str, str, str, dict[str, Any]] | None:
136
- payload: dict[str, Any] = {
137
  "question": question,
138
  "student_id": student_id,
139
  "language_support": "English and Nepali",
140
  }
141
-
142
  if textbook_context:
143
  payload["textbook_context"] = textbook_context
144
 
145
  try:
146
- response = requests.post(
147
- f"{BACKEND_URL}/ask",
148
- json=payload,
149
- timeout=ASK_TIMEOUT_SECONDS,
150
- )
151
- response.raise_for_status()
152
  data = response.json()
153
- except requests.RequestException:
154
- return None
155
- except ValueError:
156
  return None
157
 
158
- return format_backend_response(data, student_id=student_id)
159
-
160
-
161
- def format_backend_response(
162
- data: dict[str, Any],
163
- student_id: str,
164
- ) -> tuple[str, str, str, str, str, dict[str, Any]]:
165
- english_answer = str(data.get("answer_english", "No English answer returned."))
166
- normalized_question = str(data.get("normalized_question") or "").strip()
167
-
168
- if normalized_question:
169
- english_answer = f"Interpreted question: {normalized_question}\n\n{english_answer}"
170
-
171
  quiz_questions = data.get("quiz_questions", [])
172
- state = {
 
 
 
 
 
173
  "quiz_id": data.get("quiz_id"),
174
  "quiz_questions": quiz_questions,
175
  "student_id": student_id,
176
  }
177
-
178
  return (
179
- english_answer,
180
  str(data.get("answer_nepali", "नेपाली उत्तर प्राप्त भएन।")),
181
  format_quiz(quiz_questions),
182
  format_sources(data.get("retrieved_sources", [])),
183
  "Answered with the backend RAG workflow.",
184
- encode_state(state),
185
  )
186
 
187
 
188
- def grade_quiz(
189
- answer_1,
190
- answer_2,
191
- answer_3,
192
- student_id,
193
- quiz_state,
194
- ):
195
- quiz_state = decode_state(quiz_state)
196
- quiz_id = quiz_state.get("quiz_id")
197
-
198
- if not BACKEND_URL:
199
- return grade_quiz_locally([answer_1, answer_2, answer_3], quiz_state)
200
-
201
- if not quiz_id:
202
- return "Ask the tutor first so a quiz can be created."
203
-
204
- try:
205
- response = requests.post(
206
- f"{BACKEND_URL}/grade-quiz",
207
- json={
208
- "student_id": (student_id or "hf-space-demo").strip(),
209
- "quiz_id": quiz_id,
210
- "answers": [answer_1, answer_2, answer_3],
211
- },
212
- timeout=SHORT_TIMEOUT_SECONDS,
213
- )
214
 
215
- if not response.ok:
216
- return _response_error(response, "Quiz grading failed.")
217
-
218
- return format_grade(response.json())
219
- except requests.Timeout:
220
- return "Quiz grading timed out. Please try again."
221
- except requests.RequestException as exc:
222
- return f"Could not reach backend: {exc}"
223
- except ValueError:
224
- return "Quiz grading returned an invalid response."
225
-
226
-
227
- def grade_quiz_locally(answers: list[str], quiz_state: dict[str, Any]) -> str:
228
- questions = quiz_state.get("quiz_questions", [])
229
- expected_answers = quiz_state.get("expected_answers", [])
230
 
 
 
231
  if not questions:
232
  return "Ask the tutor first so a quiz can be created."
233
 
 
234
  score = 0
235
  lines = []
236
-
237
  for index, question in enumerate(questions[:3]):
238
- student_answer = answers[index].strip() if index < len(answers) else ""
239
- expected_answer = str(expected_answers[index] if index < len(expected_answers) else "")
240
- is_correct = is_answer_close(student_answer, expected_answer)
241
-
242
- if is_correct:
243
- score += 1
244
-
245
- status = "Correct" if is_correct else "Needs practice"
246
- lines.append(f"{status}: {question}")
247
-
248
- if not is_correct and expected_answer:
249
- lines.append(f"Expected idea: {expected_answer}")
250
-
251
  return f"Score: {score} / {min(len(questions), 3)}\n" + "\n".join(lines)
252
 
253
 
254
- def is_answer_close(student_answer: str, expected_answer: str) -> bool:
255
- student_tokens = set(normalize_answer(student_answer).split())
256
- expected_tokens = set(normalize_answer(expected_answer).split())
257
-
258
- if not student_tokens or not expected_tokens:
259
- return False
260
-
261
- overlap = len(student_tokens & expected_tokens) / max(len(expected_tokens), 1)
262
- return overlap >= 0.35 or normalize_answer(student_answer) in normalize_answer(expected_answer)
263
-
264
-
265
- def normalize_answer(answer: str) -> str:
266
- return " ".join(
267
- word.strip(".,?!:;()[]{}\"'।").lower()
268
- for word in answer.split()
269
- if word.strip(".,?!:;()[]{}\"'।")
270
- )
271
-
272
-
273
  def parent_summary(student_id):
274
  if not BACKEND_URL:
275
  return (
276
  "Parent/teacher summary\n\n"
277
- "The student has practiced with the uploaded or pasted textbook context in this Space. "
278
- "For persistent progress across sessions, deploy the FastAPI backend and set BACKEND_URL."
279
  )
280
 
281
- student_id = (student_id or "hf-space-demo").strip()
282
-
283
  try:
284
  response = requests.get(
285
- f"{BACKEND_URL}/parent-summary/{student_id}",
286
- timeout=SHORT_TIMEOUT_SECONDS,
287
  )
288
-
289
  if not response.ok:
290
- return _response_error(response, "Summary failed.")
291
-
292
- summary = response.json()
293
- except requests.Timeout:
294
- return "Summary request timed out. Please try again."
295
- except requests.RequestException as exc:
296
- return f"Could not reach backend: {exc}"
297
- except ValueError:
298
- return "Summary returned an invalid response."
299
-
300
- strengths = "\n".join(f"- {item}" for item in summary.get("strengths", []))
301
- weak_topics = summary.get("weak_topics", [])
302
- weak_topic_text = "\n".join(f"- {item}" for item in weak_topics) if weak_topics else "No weak topics recorded yet."
303
 
 
 
 
304
  return (
305
  f"Strengths\n{strengths}\n\n"
306
- f"Weak topics\n{weak_topic_text}\n\n"
307
- f"Suggested next practice\n{summary.get('suggested_next_practice', '')}\n\n"
308
- f"Encouraging note\n{summary.get('encouraging_note', '')}\n\n"
309
- f"Questions asked: {summary.get('questions_asked', 0)}"
310
  )
311
 
312
 
313
- def is_insufficient_backend_result(result: tuple[str, str, str, str, str, dict[str, Any]]) -> bool:
314
- combined = " ".join(str(item) for item in result[:5]).lower()
315
- markers = [
316
- "not have enough textbook context",
317
- "not enough textbook context",
318
- "insufficient context",
319
- "पर्याप्त जानकारी छैन",
320
- "पर्याप्त सन्दर्भ छैन",
321
- ]
322
- return any(marker in combined for marker in markers)
323
-
324
-
325
- def extract_pdf_text(pdf_path: str) -> dict[str, Any]:
326
  import fitz
327
 
328
  page_texts = []
329
-
330
  with fitz.open(pdf_path) as document:
 
331
  for page in document:
332
  text = page.get_text("text").strip()
333
  if text:
334
  page_texts.append(text)
335
 
336
- page_count = document.page_count
337
-
338
  text = "\n\n".join(page_texts).strip()
339
-
340
  if not text:
341
  raise ValueError(
342
- "No selectable text was found. For scanned PDFs, deploy with a backend "
343
- "or paste a short textbook paragraph into the context box."
344
  )
345
-
346
- return {
347
- "text": text,
348
- "page_count": page_count,
349
- "extraction_method": "pymupdf-local",
350
- }
351
 
352
 
353
- def chunk_text(text: str) -> list[str]:
354
  paragraphs = [part.strip() for part in text.splitlines() if part.strip()]
355
  chunks = []
356
  current = ""
357
-
358
  for paragraph in paragraphs:
359
  if len(current) + len(paragraph) + 2 <= MAX_CHUNK_CHARS:
360
  current = f"{current}\n{paragraph}".strip()
361
- continue
362
-
363
- if len(current) >= MIN_CHUNK_CHARS:
364
  chunks.append(current)
365
  current = paragraph
366
  else:
367
  current = f"{current}\n{paragraph}".strip()
368
-
369
  if current:
370
  chunks.append(current)
371
-
372
  return chunks or ([text.strip()] if text.strip() else [])
373
 
374
 
@@ -379,7 +275,7 @@ def get_embedding_model():
379
  return SentenceTransformer(EMBEDDING_MODEL)
380
 
381
 
382
- def embed_texts(texts: list[str]) -> np.ndarray:
383
  model = get_embedding_model()
384
  return np.asarray(
385
  model.encode(
@@ -391,27 +287,21 @@ def embed_texts(texts: list[str]) -> np.ndarray:
391
  )
392
 
393
 
394
- def retrieve_local_sources(
395
- question: str,
396
- textbook_state: dict[str, Any],
397
- limit: int = 5,
398
- ) -> list[dict[str, Any]]:
399
- chunks = [str(chunk) for chunk in textbook_state.get("chunks", [])]
400
- embeddings = np.asarray(textbook_state.get("embeddings", []), dtype=float)
401
-
402
  if not chunks or embeddings.size == 0:
403
  return []
404
 
405
  query_embedding = embed_texts([question])[0]
406
  scores = embeddings @ query_embedding
407
  top_indices = np.argsort(scores)[::-1][:limit]
408
-
409
  return [
410
  {
411
  "score": float(scores[index]),
412
  "text": chunks[index],
413
  "metadata": {
414
- "filename": textbook_state.get("filename", "uploaded-textbook"),
415
  "chunk_index": int(index),
416
  },
417
  }
@@ -419,179 +309,52 @@ def retrieve_local_sources(
419
  ]
420
 
421
 
422
- def mock_response(question: str, textbook_context: str) -> tuple[str, str, str, str, str, dict[str, Any]]:
423
- context = textbook_context or EXAMPLE_CONTEXT
424
- normalized_question = normalize_question_mock(question)
425
- concept_answer = mock_english_explanation(normalized_question, context)
426
-
427
- english = f"Interpreted question: {normalized_question}\n\n{concept_answer}"
428
- nepali = mock_nepali_explanation(normalized_question, context)
429
- quiz_questions = mock_quiz_questions(normalized_question)
430
-
431
- return (
432
- english,
433
- nepali,
434
- format_quiz(quiz_questions),
435
- format_sources(
436
- [
437
- {
438
- "score": 1.0,
439
- "text": context,
440
- "metadata": {"filename": "demo-context", "chunk_index": 0},
441
- }
442
- ]
443
- ),
444
- "Demo fallback is active. Configure BACKEND_URL in Space settings for PDF upload, RAG search, quiz grading, and parent summary.",
445
- encode_state({"quiz_questions": quiz_questions}),
446
- )
447
-
448
-
449
- def local_response(
450
- question: str,
451
- student_id: str,
452
- textbook_context: str,
453
- textbook_state: dict[str, Any],
454
- ) -> tuple[str, str, str, str, str, dict[str, Any]]:
455
- normalized_question = normalize_question_mock(question)
456
- sources = []
457
-
458
- if textbook_context.strip():
459
- sources = [
460
- {
461
- "score": 1.0,
462
- "text": chunk,
463
- "metadata": {"filename": "pasted-context", "chunk_index": index},
464
- }
465
- for index, chunk in enumerate(chunk_text(textbook_context)[:5])
466
- ]
467
- elif textbook_state.get("chunks") and textbook_state.get("embeddings"):
468
- sources = retrieve_local_sources(normalized_question, textbook_state, limit=5)
469
-
470
- context = "\n\n".join(str(source.get("text", "")) for source in sources).strip()
471
-
472
- if not context:
473
- return mock_response(question=question, textbook_context=textbook_context)
474
-
475
- english = (
476
- f"Interpreted question: {normalized_question}\n\n"
477
- f"Answer from the uploaded textbook context:\n{truncate(context, max_length=700)}"
478
- )
479
- nepali = local_nepali_answer(normalized_question, context)
480
- quiz_questions = local_nepali_quiz_questions(context)
481
- quiz_state = {
482
- "student_id": student_id,
483
- "quiz_questions": quiz_questions,
484
- "expected_answers": [source_answer(sources)] * 3,
485
- }
486
-
487
- return (
488
- english,
489
- nepali,
490
- format_quiz(quiz_questions),
491
- format_sources(sources),
492
- "Answered with the Hugging Face Space local PDF workflow.",
493
- encode_state(quiz_state),
494
- )
495
-
496
-
497
- def mock_english_explanation(normalized_question: str, context: str) -> str:
498
- text = f"{normalized_question} {context}".lower()
499
-
500
- if "reflection" in text or "mirror" in text:
501
- return (
502
- "Reflection of light means light bounces back after hitting a surface. "
503
- "A mirror reflects light in an orderly way, so we can see a clear image "
504
- "of an object in it. Smooth, flat surfaces make clearer reflections, "
505
- "while rough surfaces scatter light and do not show a clear image."
506
- )
507
-
508
- if "soil erosion" in text:
509
- return (
510
- "Soil erosion means the top fertile layer of soil is carried away by "
511
- "water, wind, or other causes. It makes land less useful for growing "
512
- "plants, so planting trees and grass helps protect the soil."
513
- )
514
-
515
- if "photosynthesis" in text:
516
- return (
517
- "Photosynthesis is the process by which green plants make their own food "
518
- "using sunlight, water, and carbon dioxide. Chlorophyll in leaves helps "
519
- "plants capture sunlight, and oxygen is released during the process."
520
- )
521
-
522
- if "fraction" in text:
523
- return (
524
- "A fraction shows a part of a whole. The top number tells how many parts "
525
- "we have, and the bottom number tells how many equal parts the whole was "
526
- "divided into."
527
- )
528
-
529
- return (
530
- "Demo answer from the pasted textbook context: "
531
- f"{truncate(context, max_length=450)}"
532
- )
533
 
534
 
535
- def mock_nepali_explanation(normalized_question: str, context: str = "") -> str:
536
- text = f"{normalized_question} {context}".lower()
 
 
 
 
 
 
 
537
 
538
- if "reflection" in text or "mirror" in text:
539
- return (
540
- "प्रकाशको परावर्तन भनेको प्रकाश कुनै सतहमा ठोक्किएर फर्कनु हो। ऐनाले "
541
- "प्रकाशलाई राम्रोसँग फर्काउँछ, त्यसैले त्यसमा वस्तुको प्रतिबिम्ब देखिन्छ। "
542
- "समथर र चिल्लो सतहमा प्रतिबिम्ब प्रस्ट देखिन्छ, तर खस्रो सतहमा प्रकाश धेरै "
543
- "दिशामा छरिने भएकाले प्रतिबिम्ब प्रस्ट देखिँदैन।"
544
- )
545
 
546
- if "soil erosion" in text:
 
 
547
  return (
548
- "माटो कटान भनेको हावा, पानी वा अरू प्राकृतिक कारणले माटोको माथिल्लो "
549
- "मलिलो भाग हट्नु हो। यसले खेतको माटो कमजोर बनाउन सक्छ। रूख र घाँस "
550
- "लगाउँदा माटो जोगाउन मद्दत हुन्छ।"
551
  )
552
-
553
- if "photosynthesis" in text:
554
  return (
555
  "प्रकाश संश्लेषण भनेको हरिया बिरुवाले घामको प्रकाश, पानी र कार्बन "
556
- "डाइअक्साइड प्रयोग गरेर आफ्नो खाना बनाउने प्रक्रिया हो। यस क्रममा "
557
- "अक्सिजन पनि निस्कन्छ।"
558
- )
559
-
560
- if "fraction" in text:
561
- return (
562
- "भिन्न भनेको कुनै पूर्ण वस्तुको भाग देखाउने संख्या हो। जस्तै, एउटा "
563
- "रोटी बराबर भागमा काट्दा एक भागलाई भिन्नबाट देखाउन सकिन्छ।"
564
- )
565
-
566
- if "oxygen" in text:
567
- return (
568
- "अक्सिजन एउटा ग्यास हो। मानिस, जनावर र धेरै जीवहरूले सास फेर्दा "
569
- "अक्सिजन प्रयोग गर्छन्। यो जीवनका लागि महत्त्वपूर्ण हुन्छ।"
570
  )
571
-
572
- return "यो विषयलाई सरल रूपमा बुझ्न पाठ्यपुस्तकको सन्दर्भ पढेर मुख्य कुरा सम्झनुहोस्।"
573
-
574
-
575
- def local_nepali_answer(normalized_question: str, context: str) -> str:
576
- known_answer = mock_nepali_explanation(normalized_question, context)
577
-
578
- if known_answer != "यो विषयलाई सरल रूपमा बुझ्न पाठ्यपुस्तकको सन्दर्भ पढेर मुख्य कुरा सम्झनुहोस्।":
579
- return known_answer
580
-
581
  if has_devanagari(context):
582
- return (
583
- "अपलोड गरिएको पाठ्यपुस्तकको सन्दर्भअनुसार मुख्य कुरा यस्तो छ:\n\n"
584
- f"{truncate(context, max_length=700)}"
585
- )
586
-
587
  return (
588
  "अपलोड गरिएको पाठ्यपुस्तकको सन्दर्भअनुसार यो विषय महत्त्वपूर्ण छ। "
589
- "मुख्य शब्दहरू पढ्नुहोस्, उदाहरणसँग जोड्नुहोस्, र आफ्नै सरल शब्दमा उत्तर लेख्ने अभ्यास गर्नुहोस्।"
590
  )
591
 
592
 
593
- def local_nepali_quiz_questions(context: str) -> list[str]:
594
- short_context = truncate(first_sentence(context), max_length=140)
595
  return [
596
  "प्राप्त पाठ्यपुस्तक सन्दर्भको मुख्य कुरा के हो?",
597
  f"यो वाक्यले के बुझाउँछ: {short_context}",
@@ -599,256 +362,144 @@ def local_nepali_quiz_questions(context: str) -> list[str]:
599
  ]
600
 
601
 
602
- def source_answer(sources: list[dict[str, Any]]) -> str:
603
  if not sources:
604
  return "पाठ्यपुस्तकको मुख्य कुरा।"
605
-
606
  text = str(sources[0].get("text", "")).strip()
607
- return truncate(first_sentence(text) or text, max_length=220)
608
 
609
 
610
- def first_sentence(text: str) -> str:
611
  for separator in ["।", ".", "?", "!"]:
612
  if separator in text:
613
  return text.split(separator, 1)[0].strip() + separator
614
-
615
  return text.strip()
616
 
617
 
618
- def has_devanagari(text: str) -> bool:
619
  return any("\u0900" <= character <= "\u097f" for character in text)
620
 
621
 
622
- def normalize_question_mock(question: str) -> str:
623
- text = question.lower()
624
-
625
- if "soil erosion" in text or ("mato" in text and "katan" in text):
626
- return "What is soil erosion?"
627
-
628
- if "reflection" in text or "mirror" in text or "ainaa" in text or "aaina" in text:
629
- return "What is reflection of light?"
630
-
631
- if "photosynthesis" in text or ("prakash" in text and "sansleshan" in text):
632
- return "What is photosynthesis?"
633
-
634
- if "fraction" in text or "bhinn" in text:
635
- return "What is a fraction?"
636
-
637
- if "oxygen" in text or "aksijan" in text:
638
- return "What is oxygen?"
639
-
640
- mixed_topic = extract_mixed_language_topic(text)
641
-
642
- if mixed_topic:
643
- return f"What is {mixed_topic}?"
644
-
645
- return question
646
-
647
-
648
- def extract_mixed_language_topic(text: str) -> str:
649
- markers = [
650
- " vaneko ",
651
- " bhaneko ",
652
- " vanya ",
653
- " bhanya ",
654
- " vanne ",
655
- " bhanne ",
656
- ]
657
-
658
- if not any(marker in f" {text} " for marker in markers):
659
- return ""
660
-
661
- topic = f" {text} "
662
- removable_phrases = [
663
- " vaneko ",
664
- " bhaneko ",
665
- " vanya ",
666
- " bhanya ",
667
- " vanne ",
668
- " bhanne ",
669
- " ke ho ",
670
- " k ho ",
671
- " kya ho ",
672
- " vana ",
673
- " bhana ",
674
- " ho ",
675
- " ? ",
676
- ]
677
-
678
- for phrase in removable_phrases:
679
- topic = topic.replace(phrase, " ")
680
-
681
- topic = " ".join(topic.split()).strip(" ?.,")
682
-
683
- if not topic or len(topic) > 80:
684
- return ""
685
-
686
- return topic
687
-
688
-
689
- def mock_quiz_questions(normalized_question: str) -> list[str]:
690
- text = normalized_question.lower()
691
-
692
- if "reflection" in text:
693
- return [
694
- "What happens to light during reflection?",
695
- "Why does a mirror show a clear image?",
696
- "Why do rough surfaces not show clear reflections?",
697
- ]
698
-
699
- return [
700
- "What is the main idea from the explanation?",
701
- "Can you give one simple example?",
702
- "Can you explain it in your own words?",
703
- ]
704
 
705
 
706
- def format_quiz(quiz_questions: list[Any]) -> str:
707
- questions = [
708
- str(question).strip()
709
- for question in quiz_questions
710
- if str(question).strip()
711
- ]
712
 
713
- if not questions:
714
- questions = [
715
- "What did you learn from the explanation?",
716
- "Can you give one example?",
717
- "Can you explain it to a friend?",
718
- ]
719
 
 
 
720
  return "\n".join(
721
- f"{index}. {question}"
722
- for index, question in enumerate(questions[:3], start=1)
723
  )
724
 
725
 
726
- def format_sources(sources: list[Any]) -> str:
727
  if not sources:
728
  return "No retrieved sources returned."
729
-
730
  formatted = []
731
-
732
  for source in sources[:5]:
733
- if not isinstance(source, dict):
734
- continue
735
-
736
- metadata = source.get("metadata", {}) if isinstance(source.get("metadata"), dict) else {}
737
  filename = metadata.get("filename", "textbook")
738
  chunk_index = metadata.get("chunk_index", "unknown")
739
- score = source.get("score", 0)
740
- text = str(source.get("text", "")).strip()
741
- formatted.append(
742
- f"Source: {filename}, chunk {chunk_index}, score {float(score):.3f}\n{text}"
743
- )
744
 
745
- return "\n\n".join(formatted) if formatted else "No retrieved sources returned."
746
 
747
-
748
- def format_grade(data: dict[str, Any]) -> str:
749
  lines = [f"Score: {data.get('score', 0)} / {data.get('total', 0)}"]
750
- weak_areas = data.get("weak_areas", [])
751
-
752
- if weak_areas:
753
- lines.append(f"Weak areas: {', '.join(str(item) for item in weak_areas)}")
754
-
755
  for item in data.get("results", []):
756
  status = "Correct" if item.get("is_correct") else "Needs practice"
757
  lines.append(f"{status}: {item.get('question', '')}")
758
-
759
  if not item.get("is_correct"):
760
  lines.append(f"Expected idea: {item.get('expected_answer', '')}")
761
-
762
  return "\n".join(lines)
763
 
764
 
765
- def _response_error(response: requests.Response, fallback: str) -> str:
766
- try:
767
- return str(response.json().get("detail", fallback))
768
- except ValueError:
769
- return fallback
770
-
771
-
772
- def encode_state(state: dict[str, Any]) -> str:
773
  return json.dumps(state, ensure_ascii=False)
774
 
775
 
776
- def decode_state(state: Any) -> dict[str, Any]:
777
  if isinstance(state, dict):
778
  return state
779
-
780
  if not state:
781
  return {}
782
-
783
  try:
784
  decoded = json.loads(str(state))
785
  except (TypeError, ValueError):
786
  return {}
787
-
788
  return decoded if isinstance(decoded, dict) else {}
789
 
790
 
791
- def truncate(text: str, max_length: int) -> str:
 
792
  if len(text) <= max_length:
793
  return text
794
-
795
- return f"{text[: max_length - 3]}..."
796
 
797
 
798
  with gr.Blocks(title=APP_NAME, theme=gr.themes.Soft()) as demo:
799
  gr.Markdown(
800
  """
801
  # Pathshala AI
802
- Bilingual AI tutor for rural primary students in Nepal. Upload a PDF directly
803
- in this Space, or connect a public backend for the full production workflow.
804
  """
805
  )
806
 
807
- quiz_state = gr.State("{}")
808
  textbook_state = gr.State("{}")
 
809
 
810
  with gr.Row():
811
- student_id_input = gr.Textbox(
812
- label="Student ID",
813
- value="hf-space-demo",
814
- scale=1,
815
- )
816
  status_output = gr.Textbox(
817
  label="Status",
818
  value=(
819
  "Backend connected." if BACKEND_URL else
820
- "Space-local PDF upload is active. Set BACKEND_URL for the full backend workflow."
821
  ),
822
  interactive=False,
823
- scale=2,
824
  )
825
 
826
  with gr.Tab("Ask"):
827
  with gr.Row():
828
- with gr.Column(scale=1):
829
- pdf_input = gr.File(label="Upload textbook or worksheet PDF", file_types=[".pdf"], type="filepath")
 
 
 
 
830
  upload_button = gr.Button("Upload PDF")
831
  upload_output = gr.Textbox(label="Upload result", lines=3, interactive=False)
832
-
833
  question_input = gr.Textbox(
834
  label="Student question",
835
- placeholder=EXAMPLE_QUESTION,
836
  value=EXAMPLE_QUESTION,
837
  lines=2,
838
  )
839
  context_input = gr.Textbox(
840
  label="Optional textbook context",
841
- placeholder="Paste a short textbook paragraph here.",
842
  value=EXAMPLE_CONTEXT,
843
- lines=7,
844
  )
845
  ask_button = gr.Button("Ask Tutor", variant="primary")
846
-
847
- with gr.Column(scale=1):
848
  english_output = gr.Textbox(label="English explanation", lines=8)
849
  nepali_output = gr.Textbox(label="Nepali explanation", lines=8)
850
  quiz_output = gr.Textbox(label="3 quiz questions", lines=5)
851
-
852
  sources_output = gr.Textbox(label="Retrieved sources", lines=8)
853
 
854
  with gr.Tab("Quiz"):
@@ -860,40 +511,7 @@ with gr.Blocks(title=APP_NAME, theme=gr.themes.Soft()) as demo:
860
 
861
  with gr.Tab("Parent Summary"):
862
  summary_button = gr.Button("Show Parent/Teacher Summary")
863
- summary_output = gr.Textbox(label="Summary", lines=14)
864
-
865
- gr.Examples(
866
- examples=[
867
- [EXAMPLE_QUESTION, EXAMPLE_CONTEXT],
868
- [
869
- "What is reflection of light?",
870
- (
871
- "When an object is placed in front of the mirror, the image is formed "
872
- "due to reflection of light from the mirror. Flat and smooth surfaces "
873
- "reflect light clearly, while rough surfaces do not."
874
- ),
875
- ],
876
- [
877
- "photosynthesis vaneko ke ho vana",
878
- (
879
- "Photosynthesis is the process by which green plants use sunlight, "
880
- "water, and carbon dioxide to make food."
881
- ),
882
- ],
883
- ],
884
- inputs=[question_input, context_input],
885
- outputs=[
886
- english_output,
887
- nepali_output,
888
- quiz_output,
889
- sources_output,
890
- status_output,
891
- quiz_state,
892
- ],
893
- fn=lambda question, context: ask_tutor(question, "hf-space-demo", context, "{}"),
894
- api_name=False,
895
- cache_examples=False,
896
- )
897
 
898
  upload_button.click(
899
  fn=upload_textbook,
 
1
  import json
2
  import os
 
3
  from functools import lru_cache
4
 
5
  from dotenv import load_dotenv
 
12
 
13
  APP_NAME = os.getenv("APP_NAME", "Pathshala AI")
14
  BACKEND_URL = os.getenv("BACKEND_URL", "").rstrip("/")
 
 
 
 
 
 
 
 
 
 
15
  EMBEDDING_MODEL = os.getenv(
16
  "EMBEDDING_MODEL",
17
  "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
18
  )
19
+ EXAMPLE_QUESTION = "mato katan bhaneko ke ho"
20
+ EXAMPLE_CONTEXT = (
21
+ "माटो कटान भनेको पानी, हावा वा अरू कारणले माटोको माथिल्लो मलिलो भाग बग्नु हो। "
22
+ "रूख र घाँस रोप्दा माटो जोगाउन मद्दत हुन्छ।"
23
+ )
24
+ MIN_CHUNK_CHARS = 250
25
+ MAX_CHUNK_CHARS = 900
26
 
27
 
28
  def upload_textbook(pdf_path):
29
  if not pdf_path:
30
  return "Choose a PDF first.", "{}", gr.update()
31
 
32
+ if BACKEND_URL:
33
+ backend_result = upload_to_backend(pdf_path)
34
+ if backend_result:
35
+ return backend_result
 
36
 
 
37
  try:
38
  extracted = extract_pdf_text(pdf_path)
39
  chunks = chunk_text(extracted["text"])
 
40
  if not chunks:
41
  return "No readable text chunks could be created from this PDF.", "{}", gr.update()
42
 
 
45
  "filename": os.path.basename(pdf_path),
46
  "page_count": extracted["page_count"],
47
  "chunk_count": len(chunks),
 
48
  "chunks": chunks,
49
  "embeddings": embeddings.tolist(),
50
  }
51
+ message = (
52
+ f"Uploaded {state['filename']} inside this Space with "
53
+ f"{state['page_count']} pages and {state['chunk_count']} chunks."
 
 
 
 
 
54
  )
55
+ return message, encode_state(state), gr.update(value="")
56
  except Exception as exc:
57
+ return f"Could not process uploaded PDF: {exc}", "{}", gr.update()
58
+
59
+
60
+ def upload_to_backend(pdf_path):
61
+ try:
62
+ with open(pdf_path, "rb") as pdf_file:
63
+ response = requests.post(
64
+ f"{BACKEND_URL}/upload-textbook",
65
+ files={"file": (os.path.basename(pdf_path), pdf_file, "application/pdf")},
66
+ timeout=900,
67
+ )
68
+ if not response.ok:
69
+ return None
70
+ result = response.json()
71
+ message = (
72
+ f"Uploaded {result['filename']} with {result['page_count']} pages "
73
+ f"and {result['chunk_count']} chunks."
74
+ )
75
+ return message, "{}", gr.update(value="")
76
+ except (OSError, requests.RequestException, ValueError, KeyError, TypeError):
77
+ return None
78
 
79
 
80
+ def ask_tutor(question, student_id, textbook_context, textbook_state):
81
+ question = (question or "").strip()
 
 
 
 
 
82
  student_id = (student_id or "hf-space-demo").strip()
83
+ textbook_context = (textbook_context or "").strip()
84
 
85
  if not question:
86
  return (
87
  "Please type a student question.",
88
  "कृपया विद्यार्थीको प्रश्न लेख्नुहोस्।",
89
+ "",
90
  "",
91
  "Waiting for a question.",
92
  "{}",
 
94
 
95
  if BACKEND_URL:
96
  backend_result = ask_backend(question, student_id, textbook_context)
97
+ if backend_result:
 
98
  return backend_result
99
 
100
+ state = decode_state(textbook_state)
101
+ sources = sources_from_context(textbook_context)
102
+ if not sources and state:
103
+ sources = retrieve_local_sources(normalize_question(question), state, limit=5)
104
+
105
+ if not sources:
106
+ sources = sources_from_context(EXAMPLE_CONTEXT)
107
+
108
+ context = "\n\n".join(source["text"] for source in sources)
109
+ english = (
110
+ f"Interpreted question: {normalize_question(question)}\n\n"
111
+ f"Answer from textbook context:\n{truncate(context, 700)}"
112
+ )
113
+ nepali = nepali_answer(normalize_question(question), context)
114
+ quiz_questions = nepali_quiz_questions(context)
115
+ quiz_state = {
116
+ "quiz_questions": quiz_questions,
117
+ "expected_answers": [source_answer(sources)] * 3,
118
+ }
119
+ return (
120
+ english,
121
+ nepali,
122
+ format_quiz(quiz_questions),
123
+ format_sources(sources),
124
+ "Answered with the Hugging Face Space local PDF workflow.",
125
+ encode_state(quiz_state),
126
  )
127
 
128
 
129
+ def ask_backend(question, student_id, textbook_context):
130
+ payload = {
 
 
 
 
131
  "question": question,
132
  "student_id": student_id,
133
  "language_support": "English and Nepali",
134
  }
 
135
  if textbook_context:
136
  payload["textbook_context"] = textbook_context
137
 
138
  try:
139
+ response = requests.post(f"{BACKEND_URL}/ask", json=payload, timeout=180)
140
+ if not response.ok:
141
+ return None
 
 
 
142
  data = response.json()
143
+ except (requests.RequestException, ValueError):
 
 
144
  return None
145
 
 
 
 
 
 
 
 
 
 
 
 
 
 
146
  quiz_questions = data.get("quiz_questions", [])
147
+ english = str(data.get("answer_english", "No English answer returned."))
148
+ normalized = str(data.get("normalized_question") or "").strip()
149
+ if normalized:
150
+ english = f"Interpreted question: {normalized}\n\n{english}"
151
+
152
+ quiz_state = {
153
  "quiz_id": data.get("quiz_id"),
154
  "quiz_questions": quiz_questions,
155
  "student_id": student_id,
156
  }
 
157
  return (
158
+ english,
159
  str(data.get("answer_nepali", "नेपाली उत्तर प्राप्त भएन।")),
160
  format_quiz(quiz_questions),
161
  format_sources(data.get("retrieved_sources", [])),
162
  "Answered with the backend RAG workflow.",
163
+ encode_state(quiz_state),
164
  )
165
 
166
 
167
+ def grade_quiz(answer_1, answer_2, answer_3, student_id, quiz_state):
168
+ state = decode_state(quiz_state)
 
169
 
170
+ if BACKEND_URL and state.get("quiz_id"):
171
+ try:
172
+ response = requests.post(
173
+ f"{BACKEND_URL}/grade-quiz",
174
+ json={
175
+ "student_id": (student_id or "hf-space-demo").strip(),
176
+ "quiz_id": state["quiz_id"],
177
+ "answers": [answer_1, answer_2, answer_3],
178
+ },
179
+ timeout=45,
180
+ )
181
+ if response.ok:
182
+ return format_grade(response.json())
183
+ except (requests.RequestException, ValueError):
184
+ pass
185
 
186
+ questions = state.get("quiz_questions", [])
187
+ expected_answers = state.get("expected_answers", [])
188
  if not questions:
189
  return "Ask the tutor first so a quiz can be created."
190
 
191
+ answers = [answer_1, answer_2, answer_3]
192
  score = 0
193
  lines = []
 
194
  for index, question in enumerate(questions[:3]):
195
+ expected = str(expected_answers[index] if index < len(expected_answers) else "")
196
+ answer = str(answers[index] if index < len(answers) else "")
197
+ is_correct = is_answer_close(answer, expected)
198
+ score += 1 if is_correct else 0
199
+ lines.append(f"{'Correct' if is_correct else 'Needs practice'}: {question}")
200
+ if not is_correct and expected:
201
+ lines.append(f"Expected idea: {expected}")
 
 
 
 
 
 
202
  return f"Score: {score} / {min(len(questions), 3)}\n" + "\n".join(lines)
203
 
204
 
205
  def parent_summary(student_id):
206
  if not BACKEND_URL:
207
  return (
208
  "Parent/teacher summary\n\n"
209
+ "The student practiced with uploaded or pasted textbook context in this Space. "
210
+ "For persistent progress, deploy the FastAPI backend and set BACKEND_URL."
211
  )
212
 
 
 
213
  try:
214
  response = requests.get(
215
+ f"{BACKEND_URL}/parent-summary/{student_id or 'hf-space-demo'}",
216
+ timeout=45,
217
  )
 
218
  if not response.ok:
219
+ return "Summary failed."
220
+ data = response.json()
221
+ except (requests.RequestException, ValueError):
222
+ return "Summary failed."
 
 
 
 
 
 
 
 
 
223
 
224
+ strengths = "\n".join(f"- {item}" for item in data.get("strengths", []))
225
+ weak_topics = data.get("weak_topics", [])
226
+ weak_text = "\n".join(f"- {item}" for item in weak_topics) if weak_topics else "No weak topics recorded yet."
227
  return (
228
  f"Strengths\n{strengths}\n\n"
229
+ f"Weak topics\n{weak_text}\n\n"
230
+ f"Suggested next practice\n{data.get('suggested_next_practice', '')}\n\n"
231
+ f"Encouraging note\n{data.get('encouraging_note', '')}"
 
232
  )
233
 
234
 
235
+ def extract_pdf_text(pdf_path):
 
 
 
 
 
 
 
 
 
 
 
 
236
  import fitz
237
 
238
  page_texts = []
 
239
  with fitz.open(pdf_path) as document:
240
+ page_count = document.page_count
241
  for page in document:
242
  text = page.get_text("text").strip()
243
  if text:
244
  page_texts.append(text)
245
 
 
 
246
  text = "\n\n".join(page_texts).strip()
 
247
  if not text:
248
  raise ValueError(
249
+ "No selectable text found. For scanned PDFs, use backend OCR or paste a paragraph."
 
250
  )
251
+ return {"text": text, "page_count": page_count}
 
 
 
 
 
252
 
253
 
254
+ def chunk_text(text):
255
  paragraphs = [part.strip() for part in text.splitlines() if part.strip()]
256
  chunks = []
257
  current = ""
 
258
  for paragraph in paragraphs:
259
  if len(current) + len(paragraph) + 2 <= MAX_CHUNK_CHARS:
260
  current = f"{current}\n{paragraph}".strip()
261
+ elif len(current) >= MIN_CHUNK_CHARS:
 
 
262
  chunks.append(current)
263
  current = paragraph
264
  else:
265
  current = f"{current}\n{paragraph}".strip()
 
266
  if current:
267
  chunks.append(current)
 
268
  return chunks or ([text.strip()] if text.strip() else [])
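Note on `chunk_text`: the sketch below copies the function as it appears in this diff, with the two constants added in this commit, and runs it on a made-up three-paragraph input. The sample text is hypothetical. It suggests that paragraphs are merged greedily up to `MAX_CHUNK_CHARS`, and that a chunk is only closed once it has reached `MIN_CHUNK_CHARS`.

```python
MIN_CHUNK_CHARS = 250
MAX_CHUNK_CHARS = 900


def chunk_text(text):
    # Same logic as the function in this diff, reproduced so the sketch runs alone.
    paragraphs = [part.strip() for part in text.splitlines() if part.strip()]
    chunks = []
    current = ""
    for paragraph in paragraphs:
        if len(current) + len(paragraph) + 2 <= MAX_CHUNK_CHARS:
            # Still fits: keep growing the current chunk.
            current = f"{current}\n{paragraph}".strip()
        elif len(current) >= MIN_CHUNK_CHARS:
            # Would overflow and the chunk is big enough: close it, start a new one.
            chunks.append(current)
            current = paragraph
        else:
            # Would overflow but the chunk is still short: merge anyway.
            current = f"{current}\n{paragraph}".strip()
    if current:
        chunks.append(current)
    return chunks or ([text.strip()] if text.strip() else [])


# Hypothetical input: three 400-character "paragraphs", one per line.
sample = "\n".join(["x" * 400, "y" * 400, "z" * 400])
chunk_lengths = [len(chunk) for chunk in chunk_text(sample)]

# A single very long line is never split, so it can exceed MAX_CHUNK_CHARS.
long_line_lengths = [len(chunk) for chunk in chunk_text("w" * 2000)]
```

Because the final `else` branch merges without re-checking the limit, and a single long line is kept whole, `MAX_CHUNK_CHARS` reads as a soft target rather than a hard cap.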
269
 
270
 
 
275
  return SentenceTransformer(EMBEDDING_MODEL)
276
 
277
 
278
+ def embed_texts(texts):
279
  model = get_embedding_model()
280
  return np.asarray(
281
  model.encode(
 
287
  )
288
 
289
 
290
+ def retrieve_local_sources(question, state, limit=5):
291
+ chunks = [str(chunk) for chunk in state.get("chunks", [])]
292
+ embeddings = np.asarray(state.get("embeddings", []), dtype=float)
 
 
 
 
 
293
  if not chunks or embeddings.size == 0:
294
  return []
295
 
296
  query_embedding = embed_texts([question])[0]
297
  scores = embeddings @ query_embedding
298
  top_indices = np.argsort(scores)[::-1][:limit]
 
299
  return [
300
  {
301
  "score": float(scores[index]),
302
  "text": chunks[index],
303
  "metadata": {
304
+ "filename": state.get("filename", "uploaded-textbook"),
305
  "chunk_index": int(index),
306
  },
307
  }
 
309
  ]
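Note on `retrieve_local_sources`: the ranking is a plain dot product (`embeddings @ query_embedding`). That equals cosine similarity only if the vectors are unit length, and the `model.encode(...)` arguments are not visible in this diff, so whether they are normalized is unknown here. The sketch below uses made-up 2-D vectors and two hypothetical helpers, `top_k_by_dot` and `l2_normalize`, to show how the ranking can differ between the two cases.

```python
import numpy as np


def top_k_by_dot(embeddings, query, limit=5):
    # Same scoring pattern as the diff: dot product, then highest scores first.
    scores = embeddings @ query
    order = np.argsort(scores)[::-1][:limit]
    return [(int(index), float(scores[index])) for index in order]


def l2_normalize(vectors):
    # Hypothetical helper: scale each vector to unit length.
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


# Made-up vectors: chunk 0 points exactly at the query, chunk 1 is just long.
chunk_vectors = np.array([[1.0, 0.0], [10.0, 10.0], [0.0, 1.0]])
query_vector = np.array([1.0, 0.0])

raw_ranking = top_k_by_dot(chunk_vectors, query_vector, limit=2)
unit_ranking = top_k_by_dot(
    l2_normalize(chunk_vectors), l2_normalize(query_vector), limit=2
)
```

With raw vectors the long chunk 1 ranks first. After normalization chunk 0, the true nearest direction, ranks first.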
310
 
311
 
312
+ def sources_from_context(text):
313
+ chunks = chunk_text(text)
314
+ return [
315
+ {
316
+ "score": 1.0,
317
+ "text": chunk,
318
+ "metadata": {"filename": "pasted-context", "chunk_index": index},
319
+ }
320
+ for index, chunk in enumerate(chunks[:5])
321
+ ]
 
322
 
323
 
324
+ def normalize_question(question):
325
+ text = question.lower()
326
+ if "mato" in text and "katan" in text:
327
+ return "What is soil erosion?"
328
+ if "prakash" in text and "sansleshan" in text:
329
+ return "What is photosynthesis?"
330
+ if "bhinn" in text or "fraction" in text:
331
+ return "What is a fraction?"
332
+ return question
333
 
 
 
 
 
 
 
 
334
 
335
+ def nepali_answer(question, context):
336
+ text = f"{question} {context}".lower()
337
+ if "soil erosion" in text or "माटो कटान" in context:
338
  return (
339
+ "माटो कटान भनेको पानी, हावा वा अरू कारणले माटोको माथिल्लो मलिलो भाग "
340
+ "बग्नु वा हट्नु हो। यसले जमिनको उर्वर शक्ति घटाउँछ। रूख, घाँस र बिरुवा "
341
+ "रोप्दा माटो जोगाउन मद्दत हुन्छ।"
342
  )
343
+ if "photosynthesis" in text or "प्रकाश संश्लेषण" in context:
 
344
  return (
345
  "प्रकाश संश्लेषण भनेको हरिया बिरुवाले घामको प्रकाश, पानी र कार्बन "
346
+ "डाइअक्साइड प्रयोग गरेर खाना बनाउने प्रक्रिया हो। यस क्रममा अक्सिजन पनि निस्कन्छ।"
 
 
 
 
 
 
 
 
 
 
 
 
 
347
  )
 
 
 
 
 
 
 
 
 
 
348
  if has_devanagari(context):
349
+ return "अपलोड गरिएको पाठ्यपुस्तकको सन्दर्भअनुसार मुख्य कुरा यस्तो छ:\n\n" + truncate(context, 700)
 
 
 
 
350
  return (
351
  "अपलोड गरिएको पाठ्यपुस्तकको सन्दर्भअनुसार यो विषय महत्त्वपूर्ण छ। "
352
+ "मुख्य शब्दहरू पढेर आफ्नै सरल शब्दमा उत्तर लेख्ने अभ्यास गर्नुहोस्।"
353
  )
354
 
355
 
356
+ def nepali_quiz_questions(context):
357
+ short_context = truncate(first_sentence(context), 140)
358
  return [
359
  "प्राप्त पाठ्यपुस्तक सन्दर्भको मुख्य कुरा के हो?",
360
  f"यो वाक्यले के बुझाउँछ: {short_context}",
 
362
  ]
363
 
364
 
365
+ def source_answer(sources):
366
  if not sources:
367
  return "पाठ्यपुस्तकको मुख्य कुरा।"
 
368
  text = str(sources[0].get("text", "")).strip()
369
+ return truncate(first_sentence(text) or text, 220)
370
 
371
 
372
+ def first_sentence(text):
373
  for separator in ["।", ".", "?", "!"]:
374
  if separator in text:
375
  return text.split(separator, 1)[0].strip() + separator
 
376
  return text.strip()
377
 
378
 
379
+ def has_devanagari(text):
380
  return any("\u0900" <= character <= "\u097f" for character in text)
381
 
382
 
383
+ def is_answer_close(student_answer, expected_answer):
384
+ student = normalize_answer(student_answer)
385
+ expected = normalize_answer(expected_answer)
386
+ if not student or not expected:
387
+ return False
388
+ student_tokens = set(student.split())
389
+ expected_tokens = set(expected.split())
390
+ overlap = len(student_tokens & expected_tokens) / max(len(expected_tokens), 1)
391
+ return overlap >= 0.35 or student in expected or expected in student
 
392
 
393
 
394
+ def normalize_answer(answer):
395
+ return " ".join(
396
+ word.strip(".,?!:;()[]{}\"'।").lower()
397
+ for word in str(answer).split()
398
+ if word.strip(".,?!:;()[]{}\"'।")
399
+ )
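Note on `is_answer_close` and `normalize_answer`: the sketch below reproduces both functions from this diff and tries them on a few made-up answers. The sample strings are hypothetical. It shows the 35% token-overlap rule, and also that the substring fallback can accept a very short answer.

```python
PUNCTUATION = ".,?!:;()[]{}\"'।"


def normalize_answer(answer):
    # Lowercase each word and strip surrounding punctuation, as in the diff.
    return " ".join(
        word.strip(PUNCTUATION).lower()
        for word in str(answer).split()
        if word.strip(PUNCTUATION)
    )


def is_answer_close(student_answer, expected_answer):
    student = normalize_answer(student_answer)
    expected = normalize_answer(expected_answer)
    if not student or not expected:
        return False
    student_tokens = set(student.split())
    expected_tokens = set(expected.split())
    # Fraction of expected tokens that the student also used.
    overlap = len(student_tokens & expected_tokens) / max(len(expected_tokens), 1)
    return overlap >= 0.35 or student in expected or expected in student


expected = "Soil erosion is when water washes soil away."

# 4 of the 7 expected tokens are shared, so the overlap rule passes.
paraphrase_ok = is_answer_close("soil is washed away by water", expected)

# No shared tokens and no substring match.
unrelated_ok = is_answer_close("plants make food", expected)

# The substring check is character-based: "a" occurs inside "water".
single_letter_ok = is_answer_close("a", expected)
```

The last case suggests the character-level `student in expected` check may be more lenient than intended. Comparing whole tokens instead would be one way to tighten it.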
400
 
 
 
 
 
 
 
401
 
402
+ def format_quiz(questions):
403
+ clean_questions = [str(question).strip() for question in questions if str(question).strip()]
404
  return "\n".join(
405
+ f"{index}. {question}" for index, question in enumerate(clean_questions[:3], start=1)
 
406
  )
407
 
408
 
409
+ def format_sources(sources):
410
  if not sources:
411
  return "No retrieved sources returned."
 
412
  formatted = []
 
413
  for source in sources[:5]:
414
+ metadata = source.get("metadata", {}) if isinstance(source, dict) else {}
 
 
 
415
  filename = metadata.get("filename", "textbook")
416
  chunk_index = metadata.get("chunk_index", "unknown")
417
+ score = float(source.get("score", 0)) if isinstance(source, dict) else 0
418
+ text = str(source.get("text", "")).strip() if isinstance(source, dict) else ""
419
+ formatted.append(f"Source: {filename}, chunk {chunk_index}, score {score:.3f}\n{text}")
420
+ return "\n\n".join(formatted)
 
421
 
 
422
 
423
+ def format_grade(data):
 
424
  lines = [f"Score: {data.get('score', 0)} / {data.get('total', 0)}"]
 
 
 
 
 
425
  for item in data.get("results", []):
426
  status = "Correct" if item.get("is_correct") else "Needs practice"
427
  lines.append(f"{status}: {item.get('question', '')}")
 
428
  if not item.get("is_correct"):
429
  lines.append(f"Expected idea: {item.get('expected_answer', '')}")
 
430
  return "\n".join(lines)
431
 
432
 
433
+ def encode_state(state):
 
 
 
 
 
 
 
434
  return json.dumps(state, ensure_ascii=False)
435
 
436
 
437
+ def decode_state(state):
438
  if isinstance(state, dict):
439
  return state
 
440
  if not state:
441
  return {}
 
442
  try:
443
  decoded = json.loads(str(state))
444
  except (TypeError, ValueError):
445
  return {}
 
446
  return decoded if isinstance(decoded, dict) else {}
447
 
448
 
449
+ def truncate(text, max_length):
450
+ text = str(text)
451
  if len(text) <= max_length:
452
  return text
453
+ return text[: max_length - 3] + "..."
 
454
 
455
 
456
  with gr.Blocks(title=APP_NAME, theme=gr.themes.Soft()) as demo:
457
  gr.Markdown(
458
  """
459
  # Pathshala AI
460
+ Upload a textbook PDF, ask a question, and get textbook-grounded bilingual help.
 
461
  """
462
  )
463
 
 
464
  textbook_state = gr.State("{}")
465
+ quiz_state = gr.State("{}")
466
 
467
  with gr.Row():
468
+ student_id_input = gr.Textbox(label="Student ID", value="hf-space-demo")
 
 
 
 
469
  status_output = gr.Textbox(
470
  label="Status",
471
  value=(
472
  "Backend connected." if BACKEND_URL else
473
+ "Space-local PDF upload is active. Set BACKEND_URL for full backend OCR/progress."
474
  ),
475
  interactive=False,
 
476
  )
477
 
478
  with gr.Tab("Ask"):
479
  with gr.Row():
480
+ with gr.Column():
481
+ pdf_input = gr.File(
482
+ label="Upload textbook or worksheet PDF",
483
+ file_types=[".pdf"],
484
+ type="filepath",
485
+ )
486
  upload_button = gr.Button("Upload PDF")
487
  upload_output = gr.Textbox(label="Upload result", lines=3, interactive=False)
 
488
  question_input = gr.Textbox(
489
  label="Student question",
 
490
  value=EXAMPLE_QUESTION,
491
  lines=2,
492
  )
493
  context_input = gr.Textbox(
494
  label="Optional textbook context",
 
495
  value=EXAMPLE_CONTEXT,
496
+ lines=6,
497
  )
498
  ask_button = gr.Button("Ask Tutor", variant="primary")
499
+ with gr.Column():
 
500
  english_output = gr.Textbox(label="English explanation", lines=8)
501
  nepali_output = gr.Textbox(label="Nepali explanation", lines=8)
502
  quiz_output = gr.Textbox(label="3 quiz questions", lines=5)
 
503
  sources_output = gr.Textbox(label="Retrieved sources", lines=8)
504
 
505
  with gr.Tab("Quiz"):
 
511
 
512
  with gr.Tab("Parent Summary"):
513
  summary_button = gr.Button("Show Parent/Teacher Summary")
514
+ summary_output = gr.Textbox(label="Summary", lines=10)
 
515
 
516
  upload_button.click(
517
  fn=upload_textbook,