bioinformatics-bb-tutor / ARCHITECTURE.md
Babajaan's picture
Add architecture diagram, sample conversations, and module spec
4c15f39 verified
# Bioinformatics with BB Tutor β€” Architecture & Implementation Notes
## System Architecture (Text Diagram)
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ USER INTERACTION LAYER β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚AskTutor β”‚ β”‚UploadExplainβ”‚ β”‚QuizMe β”‚ β”‚BuildLessonβ”‚ β”‚WorkflowCoach β”‚ β”‚
β”‚ β”‚ Chat β”‚ β”‚ File + Chat β”‚ β”‚Generateβ”‚ β”‚Generate β”‚ β”‚ Chat β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ β”‚ β”‚
β”‚ β”‚PaperTo β”‚ β”‚ β”‚ β”‚ β”‚ β”‚
β”‚ β”‚ Lesson β”‚ β”‚ β”‚ β”‚ β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β”Œβ”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β” β”‚
β”‚ β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β” β”‚ β”‚ β”‚ β”‚ β”‚
β”‚ β”‚VivaPrac β”‚ β”‚ gr.State(rag_store) β”‚ gr.State β”‚ gr.State β”‚ β”‚
β”‚ β”‚ Chat β”‚ β”‚ (shared doc chunks) β”‚ (quiz_key) β”‚ (session) β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ HTTP / REST
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ BACKEND ORCHESTRATION LAYER β”‚
β”‚ β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ LLMService β”‚ β”‚ RAGService β”‚ β”‚ DocumentParser β”‚ β”‚
β”‚ β”‚ (Singleton) β”‚ β”‚ (Singleton) β”‚ β”‚ (PDF/text/sequence parse) β”‚ β”‚
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚
β”‚ β”‚ HF Inference β”‚ β”‚ SentenceTransf β”‚ β”‚ - fitz (PyMuPDF) β”‚ β”‚
β”‚ β”‚ Client β”‚ β”‚ all-MiniLM-L6 β”‚ β”‚ - text file reader β”‚ β”‚
β”‚ β”‚ stream_chat() β”‚ β”‚ 384-dim embed β”‚ β”‚ - chunker (400w/60w overlap)β”‚ β”‚
β”‚ β”‚ generate() β”‚ β”‚ cosine sim β”‚ β”‚ β”‚ β”‚
β”‚ β”‚ fallback KB β”‚ β”‚ top-k retrieve β”‚ β”‚ β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ KNOWLEDGE BASE (Python module, loaded at startup) β”‚ β”‚
β”‚ β”‚ - DOMAIN_TAXONOMY: 15 domain categories, 100+ subtopics β”‚ β”‚
β”‚ β”‚ - WORKFLOWS: 5 detailed step-by-step pipelines (RNA-seq, exome, β”‚ β”‚
β”‚ β”‚ microbiome, single-cell) with tools, params, common mistakes β”‚ β”‚
β”‚ β”‚ - GLOSSARY: 25 key terms with precise definitions β”‚ β”‚
β”‚ β”‚ - COMMON_MISCONCEPTIONS: 10 curated misconception/correction β”‚ β”‚
β”‚ β”‚ pairs with severity ratings β”‚ β”‚
β”‚ β”‚ - SYSTEM_PROMPTS: 7 per-module personas (tutor, coach, examiner) β”‚ β”‚
β”‚ β”‚ - QUIZ_TEMPLATES: JSON-format generation templates for MCQ/TF/SA β”‚ β”‚
β”‚ β”‚ - LESSON_TEMPLATE: Structured lesson generation prompt β”‚ β”‚
β”‚ β”‚ - TOPIC_CHOICES: 50+ dropdown options for topic selection β”‚ β”‚
β”‚ β”‚ - WORKFLOW_CHOICES: 10 pipeline options for workflow coaching β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”‚ External APIs (conditional, lazy-loaded)
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ EXTERNAL SERVICES β”‚
β”‚ β”‚
β”‚ HuggingFace Inference API HuggingFace Model Hub β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ POST /v1/chat/completionsβ”‚ β”‚ sentence-transformers/ β”‚ β”‚
β”‚ β”‚ Streaming + non-streamingβ”‚ β”‚ all-MiniLM-L6-v2 β”‚ β”‚
β”‚ β”‚ Model: Mistral-7B-Instructβ”‚ β”‚ (384-dim, 80MB, fast) β”‚ β”‚
β”‚ β”‚ Token: HF_TOKEN (secret) β”‚ β”‚ Download on first use β”‚ β”‚
β”‚ β”‚ Timeout: 120s β”‚ β”‚ CPU inference OK β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β”‚
β”‚ Fallback (when HF_TOKEN missing): Knowledge base keyword search + β”‚
β”‚ structured responses from curated content (no LLM required) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
## Data Flow
```
User Query β†’ RAG Search (KB + uploaded docs) β†’ Format context + system prompt
β†’ LLM API (streaming) β†’ Token stream β†’ Gradio ChatInterface display
↓
User Upload β†’ DocumentParser β†’ Chunker β†’ Embedder β†’ Store in gr.State
β†’ LLM summarize (non-streaming) β†’ Display explanation
β†’ Future queries search uploaded chunks via RAG
```
## State Management
| State | Scope | Type | Content |
|-------|-------|------|---------|
| `rag_store` | Global (all tabs) | `gr.State(dict)` | `{chunks: [...], embeddings: np.array}` |
| `answer_key_state` | Quiz Me tab only | `gr.State(str)` | Raw LLM response for answer checking |
## Task Policies (Agent-like behavior)
| Task Type | Iteration Budget | Retrieval | Approval | Notes |
|-----------|-----------------|-----------|----------|-------|
| Short factual Q&A | 1 LLM call | KB only | None | Direct answer with RAG context |
| Long teaching answer | 1 LLM call | KB + uploaded docs | None | Streaming, max 4096 tokens |
| Figure interpretation | 1 LLM call | Uploaded content only | None | Requires prior upload |
| Workflow coaching | 1-3 LLM calls | KB + workflow steps | None | Stateful chat, accumulates |
| Quiz generation | 1 LLM call | KB | None | Non-streaming, stored in State |
| Paper→Lesson | 1-2 LLM calls | Uploaded content | None | First call = upload analysis |
| Viva practice | Multi-turn | KB | None | Examiner persona, adaptive |
## Safety Boundaries
- **Educational only**: All system prompts explicitly state "you are a teaching assistant, not a clinical system"
- **Clinical refusal**: Variant interpretation questions that could be clinical trigger educational redirect + referral to professionals
- **Uncertainty expression**: System prompts require "say so explicitly" when uncertain
- **No hallucinated citations**: RAG provides real KB content; LLM is instructed to cite specific tools/methods
## Failure Modes & Mitigation
| Failure | Detection | Mitigation |
|---------|-----------|------------|
| HF_TOKEN missing | `LLMService.is_available()` = False | Knowledge base fallback responses |
| Embedding model fails | `HAS_ST = False` or load exception | Keyword search fallback |
| PDF parsing fails | `fitz` import error or exception | Text-only mode, graceful message |
| LLM API timeout | Exception in stream_chat() | Error message + KB fallback suggestion |
| Large file upload | Size check in parse_file() | Truncate, warn user |
| Empty RAG results | Score < 0.15 threshold | Respond from general knowledge |
## Module Specifications
### Module: Ask the Tutor (Tab 1)
- **Input**: User message (str), system prompt (hidden), temperature (hidden), max_tokens (hidden), rag_store
- **Output**: Streaming text response
- **Backend**: `tutor_respond()` β†’ RAG search β†’ LLM stream_chat()
- **Retrieval**: KB + uploaded documents (if any)
- **Latency**: Streaming, first token <3s (with HF API)
- **Guardrails**: System prompt enforces educational boundary, uncertainty expression, no clinical claims
### Module: Upload & Explain (Tab 2)
- **Input**: File (PDF/TXT/FASTA/VCF/etc.), rag_store
- **Output**: Document analysis (Markdown), raw text (Textbox), updated rag_store
- **Backend**: `process_upload()` β†’ parse β†’ chunk β†’ embed β†’ LLM summarize
- **Retrieval**: Uploaded content becomes searchable across all tabs
- **Latency**: Parse+embed ~2-5s, LLM summarize ~5-15s
- **Guardrails**: Only bioinformatics file types accepted, max reasonable size
### Module: Quiz Me (Tab 3)
- **Input**: Topic (dropdown), format (radio), difficulty (radio), # questions (slider), rag_store
- **Output**: Quiz (Markdown), answer key (hidden State)
- **Backend**: `generate_quiz()` β†’ RAG context β†’ LLM generate() with JSON template
- **Retrieval**: KB topics related to selected domain
- **Latency**: ~10-20s for generation
- **Guardrails**: Plausible distractors, misconception-based wrong answers
### Module: Build a Lesson (Tab 4)
- **Input**: Topic, level, include_exercises (checkbox), include_quiz (checkbox)
- **Output**: Structured lesson (Markdown)
- **Backend**: `generate_lesson()` β†’ RAG context β†’ LLM generate() with LESSON_TEMPLATE
- **Retrieval**: KB workflow steps + glossary terms for topic
- **Latency**: ~15-30s
- **Guardrails**: Progressive disclosure, prerequisite listing, common pitfalls section
### Module: Workflow Coach (Tab 5)
- **Input**: Message, workflow selector (dropdown), temperature
- **Output**: Streaming chat response with workflow context
- **Backend**: `workflow_respond()` β†’ inject workflow steps β†’ LLM stream_chat()
- **Retrieval**: Full workflow steps from KB injected as system context
- **Latency**: Streaming, first token <3s
- **Guardrails**: Specific tool names, parameter mentions, QC checkpoint reminders
### Module: Paper to Lesson (Tab 6)
- **Input**: Message, output_format (radio), rag_store
- **Output**: Streaming lesson/study notes/slides/quiz
- **Backend**: `paper_to_lesson_respond()` β†’ search uploaded docs β†’ LLM stream_chat()
- **Retrieval**: User-uploaded document chunks
- **Latency**: Streaming
- **Guardrails**: Requires prior upload; warns if no uploaded content available
### Module: Viva Practice (Tab 7)
- **Input**: Message, topic (dropdown), difficulty (radio)
- **Output**: Streaming examiner questions and feedback
- **Backend**: `viva_respond()` β†’ KB context + viva persona β†’ LLM stream_chat()
- **Retrieval**: Topic-specific KB content
- **Latency**: Streaming
- **Guardrails**: Examiner persona, one question at a time, adaptive difficulty
## Evaluation Checklist
Before launch, verify:
- [ ] All 7 tabs render without JavaScript errors
- [ ] File upload works for PDF, TXT, FASTA
- [ ] KB fallback works when HF_TOKEN is missing
- [ ] Streaming responses display progressively
- [ ] Quiz generation produces coherent questions
- [ ] Answer checking grades accurately
- [ ] Uploaded content appears in cross-tab RAG search
- [ ] Clinical boundary refusal works for variant questions
- [ ] Workflow coach includes specific tool names
- [ ] Mobile responsiveness acceptable