Spaces:

Babajaan
/

bioinformatics-bb-tutor

Sleeping

App Files Files Community

bioinformatics-bb-tutor / ARCHITECTURE.md

Babajaan

Add architecture diagram, sample conversations, and module spec

4c15f39 verified 15 days ago

preview code

raw

history blame contribute delete

13.9 kB

	# Bioinformatics with BB Tutor — Architecture & Implementation Notes

	## System Architecture (Text Diagram)

	```
	┌─────────────────────────────────────────────────────────────────────────────┐
	│ USER INTERACTION LAYER │
	│ ┌─────────┐ ┌─────────────┐ ┌────────┐ ┌──────────┐ ┌───────────────┐ │
	│ │AskTutor │ │UploadExplain│ │QuizMe │ │BuildLesson│ │WorkflowCoach │ │
	│ │ Chat │ │ File + Chat │ │Generate│ │Generate │ │ Chat │ │
	│ └────┬────┘ └──────┬──────┘ └───┬────┘ └────┬─────┘ └───────┬───────┘ │
	│ ┌─────────┐ │ │ │ │ │
	│ │PaperTo │ │ │ │ │ │
	│ │ Lesson │ │ │ │ │ │
	│ └────┬────┘ ┌──────┴─────────────┴────┬───────┴─────┬──────────┴────┐ │
	│ ┌────┴────┐ │ │ │ │ │
	│ │VivaPrac │ │ gr.State(rag_store) │ gr.State │ gr.State │ │
	│ │ Chat │ │ (shared doc chunks) │ (quiz_key) │ (session) │ │
	│ └─────────┘ └─────────────────────────┴─────────────┴───────────────┘ │
	└────────────────────┬────────────────────────────────────────────────────────┘
	│ HTTP / REST
	▼
	┌─────────────────────────────────────────────────────────────────────────────┐
	│ BACKEND ORCHESTRATION LAYER │
	│ │
	│ ┌────────────────┐ ┌────────────────┐ ┌──────────────────────────────┐ │
	│ │ LLMService │ │ RAGService │ │ DocumentParser │ │
	│ │ (Singleton) │ │ (Singleton) │ │ (PDF/text/sequence parse) │ │
	│ │ │ │ │ │ │ │
	│ │ HF Inference │ │ SentenceTransf │ │ - fitz (PyMuPDF) │ │
	│ │ Client │ │ all-MiniLM-L6 │ │ - text file reader │ │
	│ │ stream_chat() │ │ 384-dim embed │ │ - chunker (400w/60w overlap)│ │
	│ │ generate() │ │ cosine sim │ │ │ │
	│ │ fallback KB │ │ top-k retrieve │ │ │ │
	│ └────────────────┘ └────────────────┘ └──────────────────────────────┘ │
	│ │
	│ ┌─────────────────────────────────────────────────────────────────────┐ │
	│ │ KNOWLEDGE BASE (Python module, loaded at startup) │ │
	│ │ - DOMAIN_TAXONOMY: 15 domain categories, 100+ subtopics │ │
	│ │ - WORKFLOWS: 5 detailed step-by-step pipelines (RNA-seq, exome, │ │
	│ │ microbiome, single-cell) with tools, params, common mistakes │ │
	│ │ - GLOSSARY: 25 key terms with precise definitions │ │
	│ │ - COMMON_MISCONCEPTIONS: 10 curated misconception/correction │ │
	│ │ pairs with severity ratings │ │
	│ │ - SYSTEM_PROMPTS: 7 per-module personas (tutor, coach, examiner) │ │
	│ │ - QUIZ_TEMPLATES: JSON-format generation templates for MCQ/TF/SA │ │
	│ │ - LESSON_TEMPLATE: Structured lesson generation prompt │ │
	│ │ - TOPIC_CHOICES: 50+ dropdown options for topic selection │ │
	│ │ - WORKFLOW_CHOICES: 10 pipeline options for workflow coaching │ │
	│ └─────────────────────────────────────────────────────────────────────┘ │
	└────────────────────┬────────────────────────────────────────────────────────┘
	│
	│ External APIs (conditional, lazy-loaded)
	▼
	┌─────────────────────────────────────────────────────────────────────────────┐
	│ EXTERNAL SERVICES │
	│ │
	│ HuggingFace Inference API HuggingFace Model Hub │
	│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
	│ │ POST /v1/chat/completions│ │ sentence-transformers/ │ │
	│ │ Streaming + non-streaming│ │ all-MiniLM-L6-v2 │ │
	│ │ Model: Mistral-7B-Instruct│ │ (384-dim, 80MB, fast) │ │
	│ │ Token: HF_TOKEN (secret) │ │ Download on first use │ │
	│ │ Timeout: 120s │ │ CPU inference OK │ │
	│ └─────────────────────────┘ └─────────────────────────┘ │
	│ │
	│ Fallback (when HF_TOKEN missing): Knowledge base keyword search + │
	│ structured responses from curated content (no LLM required) │
	└─────────────────────────────────────────────────────────────────────────────┘
	```

	## Data Flow

	```
	User Query → RAG Search (KB + uploaded docs) → Format context + system prompt
	→ LLM API (streaming) → Token stream → Gradio ChatInterface display
	↓
	User Upload → DocumentParser → Chunker → Embedder → Store in gr.State
	→ LLM summarize (non-streaming) → Display explanation
	→ Future queries search uploaded chunks via RAG
	```

	## State Management

	\| State \| Scope \| Type \| Content \|
	\|-------\|-------\|------\|---------\|
	\| `rag_store` \| Global (all tabs) \| `gr.State(dict)` \| `{chunks: [...], embeddings: np.array}` \|
	\| `answer_key_state` \| Quiz Me tab only \| `gr.State(str)` \| Raw LLM response for answer checking \|

	## Task Policies (Agent-like behavior)

	\| Task Type \| Iteration Budget \| Retrieval \| Approval \| Notes \|
	\|-----------\|-----------------\|-----------\|----------\|-------\|
	\| Short factual Q&A \| 1 LLM call \| KB only \| None \| Direct answer with RAG context \|
	\| Long teaching answer \| 1 LLM call \| KB + uploaded docs \| None \| Streaming, max 4096 tokens \|
	\| Figure interpretation \| 1 LLM call \| Uploaded content only \| None \| Requires prior upload \|
	\| Workflow coaching \| 1-3 LLM calls \| KB + workflow steps \| None \| Stateful chat, accumulates \|
	\| Quiz generation \| 1 LLM call \| KB \| None \| Non-streaming, stored in State \|
	\| Paper→Lesson \| 1-2 LLM calls \| Uploaded content \| None \| First call = upload analysis \|
	\| Viva practice \| Multi-turn \| KB \| None \| Examiner persona, adaptive \|

	## Safety Boundaries

	- Educational only: All system prompts explicitly state "you are a teaching assistant, not a clinical system"
	- Clinical refusal: Variant interpretation questions that could be clinical trigger educational redirect + referral to professionals
	- Uncertainty expression: System prompts require "say so explicitly" when uncertain
	- No hallucinated citations: RAG provides real KB content; LLM is instructed to cite specific tools/methods

	## Failure Modes & Mitigation

	\| Failure \| Detection \| Mitigation \|
	\|---------\|-----------\|------------\|
	\| HF_TOKEN missing \| `LLMService.is_available()` = False \| Knowledge base fallback responses \|
	\| Embedding model fails \| `HAS_ST = False` or load exception \| Keyword search fallback \|
	\| PDF parsing fails \| `fitz` import error or exception \| Text-only mode, graceful message \|
	\| LLM API timeout \| Exception in stream_chat() \| Error message + KB fallback suggestion \|
	\| Large file upload \| Size check in parse_file() \| Truncate, warn user \|
	\| Empty RAG results \| Score < 0.15 threshold \| Respond from general knowledge \|

	## Module Specifications

	### Module: Ask the Tutor (Tab 1)
	- Input: User message (str), system prompt (hidden), temperature (hidden), max_tokens (hidden), rag_store
	- Output: Streaming text response
	- Backend: `tutor_respond()` → RAG search → LLM stream_chat()
	- Retrieval: KB + uploaded documents (if any)
	- Latency: Streaming, first token <3s (with HF API)
	- Guardrails: System prompt enforces educational boundary, uncertainty expression, no clinical claims

	### Module: Upload & Explain (Tab 2)
	- Input: File (PDF/TXT/FASTA/VCF/etc.), rag_store
	- Output: Document analysis (Markdown), raw text (Textbox), updated rag_store
	- Backend: `process_upload()` → parse → chunk → embed → LLM summarize
	- Retrieval: Uploaded content becomes searchable across all tabs
	- Latency: Parse+embed ~2-5s, LLM summarize ~5-15s
	- Guardrails: Only bioinformatics file types accepted, max reasonable size

	### Module: Quiz Me (Tab 3)
	- Input: Topic (dropdown), format (radio), difficulty (radio), # questions (slider), rag_store
	- Output: Quiz (Markdown), answer key (hidden State)
	- Backend: `generate_quiz()` → RAG context → LLM generate() with JSON template
	- Retrieval: KB topics related to selected domain
	- Latency: ~10-20s for generation
	- Guardrails: Plausible distractors, misconception-based wrong answers

	### Module: Build a Lesson (Tab 4)
	- Input: Topic, level, include_exercises (checkbox), include_quiz (checkbox)
	- Output: Structured lesson (Markdown)
	- Backend: `generate_lesson()` → RAG context → LLM generate() with LESSON_TEMPLATE
	- Retrieval: KB workflow steps + glossary terms for topic
	- Latency: ~15-30s
	- Guardrails: Progressive disclosure, prerequisite listing, common pitfalls section

	### Module: Workflow Coach (Tab 5)
	- Input: Message, workflow selector (dropdown), temperature
	- Output: Streaming chat response with workflow context
	- Backend: `workflow_respond()` → inject workflow steps → LLM stream_chat()
	- Retrieval: Full workflow steps from KB injected as system context
	- Latency: Streaming, first token <3s
	- Guardrails: Specific tool names, parameter mentions, QC checkpoint reminders

	### Module: Paper to Lesson (Tab 6)
	- Input: Message, output_format (radio), rag_store
	- Output: Streaming lesson/study notes/slides/quiz
	- Backend: `paper_to_lesson_respond()` → search uploaded docs → LLM stream_chat()
	- Retrieval: User-uploaded document chunks
	- Latency: Streaming
	- Guardrails: Requires prior upload; warns if no uploaded content available

	### Module: Viva Practice (Tab 7)
	- Input: Message, topic (dropdown), difficulty (radio)
	- Output: Streaming examiner questions and feedback
	- Backend: `viva_respond()` → KB context + viva persona → LLM stream_chat()
	- Retrieval: Topic-specific KB content
	- Latency: Streaming
	- Guardrails: Examiner persona, one question at a time, adaptive difficulty

	## Evaluation Checklist

	Before launch, verify:
	- [ ] All 7 tabs render without JavaScript errors
	- [ ] File upload works for PDF, TXT, FASTA
	- [ ] KB fallback works when HF_TOKEN is missing
	- [ ] Streaming responses display progressively
	- [ ] Quiz generation produces coherent questions
	- [ ] Answer checking grades accurately
	- [ ] Uploaded content appears in cross-tab RAG search
	- [ ] Clinical boundary refusal works for variant questions
	- [ ] Workflow coach includes specific tool names
	- [ ] Mobile responsiveness acceptable