# Bioinformatics with BB Tutor — Architecture & Implementation Notes

## System Architecture (Text Diagram)

```
┌─ USER INTERACTION LAYER ────────────────────────────────────────────────
│
│   Tabs:
│     AskTutor (chat)           UploadExplain (file + chat)
│     QuizMe (generate)         BuildLesson (generate)
│     WorkflowCoach (chat)      PaperToLesson (chat)
│     VivaPractice (chat)
│
│   Shared Gradio state:
│     gr.State(rag_store)  (shared doc chunks)
│     gr.State(quiz_key)   (quiz answer key)
│     gr.State(session)
│
└────────────────────┬────────────────────────────────────────────────────
                     │  HTTP / REST
                     ▼
┌─ BACKEND ORCHESTRATION LAYER ───────────────────────────────────────────
│
│   LLMService (singleton)            RAGService (singleton)
│     HF Inference Client               SentenceTransformer
│     stream_chat() / generate()        all-MiniLM-L6-v2, 384-dim embeddings
│     fallback KB                       cosine similarity, top-k retrieval
│
│   DocumentParser (PDF / text / sequence parsing)
│     fitz (PyMuPDF), text file reader,
│     chunker (400-word chunks, 60-word overlap)
│
│   KNOWLEDGE BASE (Python module, loaded at startup)
│     - DOMAIN_TAXONOMY: 15 domain categories, 100+ subtopics
│     - WORKFLOWS: 5 detailed step-by-step pipelines (RNA-seq, exome,
│       microbiome, single-cell) with tools, params, common mistakes
│     - GLOSSARY: 25 key terms with precise definitions
│     - COMMON_MISCONCEPTIONS: 10 curated misconception/correction pairs
│       with severity ratings
│     - SYSTEM_PROMPTS: 7 per-module personas (tutor, coach, examiner)
│     - QUIZ_TEMPLATES: JSON-format generation templates for MCQ/TF/SA
│     - LESSON_TEMPLATE: structured lesson generation prompt
│     - TOPIC_CHOICES: 50+ dropdown options for topic selection
│     - WORKFLOW_CHOICES: 10 pipeline options for workflow coaching
│
└────────────────────┬────────────────────────────────────────────────────
                     │  External APIs (conditional, lazy-loaded)
                     ▼
┌─ EXTERNAL SERVICES ─────────────────────────────────────────────────────
│
│   HuggingFace Inference API           HuggingFace Model Hub
│     POST /v1/chat/completions           sentence-transformers/all-MiniLM-L6-v2
│     streaming + non-streaming           384-dim, ~80 MB, fast
│     Model: Mistral-7B-Instruct          download on first use
│     Token: HF_TOKEN (secret)            CPU inference OK
│     Timeout: 120s
│
│   Fallback (when HF_TOKEN is missing): knowledge-base keyword search +
│   structured responses from curated content (no LLM required)
│
└─────────────────────────────────────────────────────────────────────────
```

## Data Flow

```
Query flow:
  User query → RAG search (KB + uploaded docs) → format context + system prompt
             → LLM API (streaming) → token stream → Gradio ChatInterface display

Upload flow:
  User upload → DocumentParser → chunker → embedder → store in gr.State(rag_store)
              → LLM summarize (non-streaming) → display explanation
              → future queries search the uploaded chunks via RAG
```
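The query flow amounts to four steps: embed the question with all-MiniLM-L6-v2, score the stored chunk embeddings by cosine similarity, keep the top-k hits above the 0.15 relevance threshold, and fold them into the system prompt. A minimal sketch of that step, using `retrieve()` and `build_prompt()` as hypothetical helper names (the real RAGService API may differ):

```python
# Illustrative query-time retrieval; helper names are not the actual RAGService API.
import numpy as np
from sentence_transformers import SentenceTransformer

_embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim

def retrieve(query: str, rag_store: dict, top_k: int = 4, min_score: float = 0.15) -> list[str]:
    """Return the top-k stored chunks most similar to the query."""
    chunks = rag_store.get("chunks", [])
    embeddings = rag_store.get("embeddings")            # shape (n_chunks, 384)
    if not chunks or embeddings is None:
        return []
    q = _embedder.encode(query, normalize_embeddings=True)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q                                      # cosine similarity per chunk
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best if scores[i] >= min_score]

def build_prompt(query: str, context_chunks: list[str], system_prompt: str) -> list[dict]:
    """Fold retrieved context into the system prompt for the chat API."""
    context = "\n\n".join(context_chunks) or "(no relevant context found)"
    return [
        {"role": "system", "content": f"{system_prompt}\n\nRelevant context:\n{context}"},
        {"role": "user", "content": query},
    ]
```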
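On the upload side, the chunker uses 400-word windows with a 60-word overlap before embedding, and the results are appended to the store held in `gr.State`. A sketch of that ingestion step with illustrative names (`ingest_text()` is not the actual `process_upload()` implementation):

```python
# Illustrative ingestion step: chunk → embed → update the shared rag_store.
import numpy as np
from sentence_transformers import SentenceTransformer

_embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def chunk_text(text: str, chunk_words: int = 400, overlap: int = 60) -> list[str]:
    """Split text into overlapping word windows (400 words, 60-word overlap)."""
    words = text.split()
    step = max(chunk_words - overlap, 1)
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), step)]

def ingest_text(text: str, rag_store: dict) -> dict:
    """Embed new chunks and append them to the store kept in gr.State."""
    new_chunks = chunk_text(text)
    if not new_chunks:
        return rag_store
    new_emb = _embedder.encode(new_chunks, normalize_embeddings=True)
    old_emb = rag_store.get("embeddings")
    return {
        "chunks": rag_store.get("chunks", []) + new_chunks,
        "embeddings": new_emb if old_emb is None else np.vstack([old_emb, new_emb]),
    }
```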
## State Management

| State | Scope | Type | Content |
|-------|-------|------|---------|
| `rag_store` | Global (all tabs) | `gr.State(dict)` | `{chunks: [...], embeddings: np.array}` |
| `answer_key_state` | Quiz Me tab only | `gr.State(str)` | Raw LLM response used for answer checking |

## Task Policies (Agent-like Behavior)

| Task Type | Iteration Budget | Retrieval | Approval | Notes |
|-----------|-----------------|-----------|----------|-------|
| Short factual Q&A | 1 LLM call | KB only | None | Direct answer with RAG context |
| Long teaching answer | 1 LLM call | KB + uploaded docs | None | Streaming, max 4096 tokens |
| Figure interpretation | 1 LLM call | Uploaded content only | None | Requires prior upload |
| Workflow coaching | 1-3 LLM calls | KB + workflow steps | None | Stateful chat; context accumulates across turns |
| Quiz generation | 1 LLM call | KB | None | Non-streaming, stored in State |
| Paper→Lesson | 1-2 LLM calls | Uploaded content | None | First call analyses the upload |
| Viva practice | Multi-turn | KB | None | Examiner persona, adaptive |

## Safety Boundaries

- **Educational only**: All system prompts explicitly state "you are a teaching assistant, not a clinical system".
- **Clinical refusal**: Variant-interpretation questions with potential clinical intent trigger an educational redirect plus a referral to qualified professionals.
- **Uncertainty expression**: System prompts require the model to say so explicitly when it is uncertain.
- **No hallucinated citations**: RAG supplies real KB content, and the LLM is instructed to cite specific tools and methods.

## Failure Modes & Mitigation

| Failure | Detection | Mitigation |
|---------|-----------|------------|
| HF_TOKEN missing | `LLMService.is_available()` returns False | Knowledge-base fallback responses |
| Embedding model fails | `HAS_ST = False` or load exception | Keyword-search fallback |
| PDF parsing fails | `fitz` import error or exception | Text-only mode, graceful message |
| LLM API timeout | Exception in `stream_chat()` | Error message + KB fallback suggestion |
| Large file upload | Size check in `parse_file()` | Truncate and warn the user |
| Empty RAG results | Similarity score below 0.15 threshold | Respond from general knowledge |

## Module Specifications

### Module: Ask the Tutor (Tab 1)
- **Input**: User message (str), system prompt (hidden), temperature (hidden), max_tokens (hidden), rag_store
- **Output**: Streaming text response
- **Backend**: `tutor_respond()` → RAG search → LLM `stream_chat()` (see the sketch below)
- **Retrieval**: KB + uploaded documents (if any)
- **Latency**: Streaming; first token <3 s (with HF API)
- **Guardrails**: System prompt enforces the educational boundary, uncertainty expression, and no clinical claims
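The tutor path is retrieval followed by a streamed chat completion. A minimal sketch, assuming the `huggingface_hub` `InferenceClient` and the helper names from the retrieval sketch above; the exact Mistral repo id is an assumption:

```python
# Minimal sketch of the streaming path behind tutor_respond(); function and
# model names are illustrative, not the app's actual implementation.
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="mistralai/Mistral-7B-Instruct-v0.2",   # assumed repo id
    token=os.environ.get("HF_TOKEN"),
    timeout=120,
)

def stream_chat(messages, temperature=0.7, max_tokens=4096):
    """Yield the reply incrementally so Gradio can render tokens as they arrive."""
    partial = ""
    for chunk in client.chat_completion(
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
        stream=True,
    ):
        delta = chunk.choices[0].delta.content or ""
        partial += delta
        yield partial          # gr.ChatInterface redraws the growing message

def tutor_respond(message, history, system_prompt, rag_store):
    context_chunks = retrieve(message, rag_store)              # RAG step sketched earlier
    messages = build_prompt(message, context_chunks, system_prompt)
    yield from stream_chat(messages)
```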
### Module: Upload & Explain (Tab 2)
- **Input**: File (PDF/TXT/FASTA/VCF/etc.), rag_store
- **Output**: Document analysis (Markdown), raw text (Textbox), updated rag_store
- **Backend**: `process_upload()` → parse → chunk → embed → LLM summarize
- **Retrieval**: Uploaded content becomes searchable across all tabs
- **Latency**: Parse + embed ~2-5 s; LLM summarize ~5-15 s
- **Guardrails**: Only bioinformatics-relevant file types accepted; file size capped at a reasonable limit

### Module: Quiz Me (Tab 3)
- **Input**: Topic (dropdown), format (radio), difficulty (radio), number of questions (slider), rag_store
- **Output**: Quiz (Markdown), answer key (hidden State)
- **Backend**: `generate_quiz()` → RAG context → LLM `generate()` with JSON template
- **Retrieval**: KB topics related to the selected domain
- **Latency**: ~10-20 s for generation
- **Guardrails**: Plausible distractors; wrong answers based on curated misconceptions

### Module: Build a Lesson (Tab 4)
- **Input**: Topic, level, include_exercises (checkbox), include_quiz (checkbox)
- **Output**: Structured lesson (Markdown)
- **Backend**: `generate_lesson()` → RAG context → LLM `generate()` with LESSON_TEMPLATE
- **Retrieval**: KB workflow steps + glossary terms for the topic
- **Latency**: ~15-30 s
- **Guardrails**: Progressive disclosure, prerequisite listing, common-pitfalls section

### Module: Workflow Coach (Tab 5)
- **Input**: Message, workflow selector (dropdown), temperature
- **Output**: Streaming chat response with workflow context
- **Backend**: `workflow_respond()` → inject workflow steps → LLM `stream_chat()`
- **Retrieval**: Full workflow steps from the KB injected as system context
- **Latency**: Streaming; first token <3 s
- **Guardrails**: Specific tool names, parameter mentions, QC checkpoint reminders

### Module: Paper to Lesson (Tab 6)
- **Input**: Message, output_format (radio), rag_store
- **Output**: Streaming lesson / study notes / slides / quiz
- **Backend**: `paper_to_lesson_respond()` → search uploaded docs → LLM `stream_chat()`
- **Retrieval**: User-uploaded document chunks
- **Latency**: Streaming
- **Guardrails**: Requires a prior upload; warns if no uploaded content is available

### Module: Viva Practice (Tab 7)
- **Input**: Message, topic (dropdown), difficulty (radio)
- **Output**: Streaming examiner questions and feedback
- **Backend**: `viva_respond()` → KB context + viva persona → LLM `stream_chat()`
- **Retrieval**: Topic-specific KB content
- **Latency**: Streaming
- **Guardrails**: Examiner persona, one question at a time, adaptive difficulty

## Evaluation Checklist

Before launch, verify:

- [ ] All 7 tabs render without JavaScript errors
- [ ] File upload works for PDF, TXT, and FASTA
- [ ] KB fallback works when HF_TOKEN is missing (see the check below)
- [ ] Streaming responses display progressively
- [ ] Quiz generation produces coherent questions
- [ ] Answer checking grades accurately
- [ ] Uploaded content appears in cross-tab RAG search
- [ ] Clinical boundary refusal works for variant questions
- [ ] Workflow coach includes specific tool names
- [ ] Mobile responsiveness is acceptable
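For the HF_TOKEN fallback item, a quick pre-launch check can be scripted. Everything below is a sketch against an assumed module layout (`services`, `LLMService.generate()` returning curated KB content when the API is unavailable); adapt the imports and the expected wording to the real app:

```python
# Manual pre-launch check for the KB fallback path (assumed names throughout).
import os

os.environ.pop("HF_TOKEN", None)        # simulate the missing secret

from services import LLMService         # hypothetical module layout

llm = LLMService()
assert llm.is_available() is False, "LLM path should be disabled without HF_TOKEN"

# The fallback answer should come from curated KB content, not the API.
answer = llm.generate("What does FDR stand for in multiple testing?")
assert "false discovery rate" in answer.lower(), "expected curated glossary content"
print("fallback answer OK:", answer[:120], "...")
```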