Spaces:
Sleeping
Sleeping
| # Bioinformatics with BB Tutor β Architecture & Implementation Notes | |
| ## System Architecture (Text Diagram) | |
| ``` | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β USER INTERACTION LAYER β | |
| β βββββββββββ βββββββββββββββ ββββββββββ ββββββββββββ βββββββββββββββββ β | |
| β βAskTutor β βUploadExplainβ βQuizMe β βBuildLessonβ βWorkflowCoach β β | |
| β β Chat β β File + Chat β βGenerateβ βGenerate β β Chat β β | |
| β ββββββ¬βββββ ββββββββ¬βββββββ βββββ¬βββββ ββββββ¬ββββββ βββββββββ¬ββββββββ β | |
| β βββββββββββ β β β β β | |
| β βPaperTo β β β β β β | |
| β β Lesson β β β β β β | |
| β ββββββ¬βββββ ββββββββ΄ββββββββββββββ΄βββββ¬ββββββββ΄ββββββ¬βββββββββββ΄βββββ β | |
| β ββββββ΄βββββ β β β β β | |
| β βVivaPrac β β gr.State(rag_store) β gr.State β gr.State β β | |
| β β Chat β β (shared doc chunks) β (quiz_key) β (session) β β | |
| β βββββββββββ βββββββββββββββββββββββββββ΄ββββββββββββββ΄ββββββββββββββββ β | |
| ββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β HTTP / REST | |
| βΌ | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β BACKEND ORCHESTRATION LAYER β | |
| β β | |
| β ββββββββββββββββββ ββββββββββββββββββ ββββββββββββββββββββββββββββββββ β | |
| β β LLMService β β RAGService β β DocumentParser β β | |
| β β (Singleton) β β (Singleton) β β (PDF/text/sequence parse) β β | |
| β β β β β β β β | |
| β β HF Inference β β SentenceTransf β β - fitz (PyMuPDF) β β | |
| β β Client β β all-MiniLM-L6 β β - text file reader β β | |
| β β stream_chat() β β 384-dim embed β β - chunker (400w/60w overlap)β β | |
| β β generate() β β cosine sim β β β β | |
| β β fallback KB β β top-k retrieve β β β β | |
| β ββββββββββββββββββ ββββββββββββββββββ ββββββββββββββββββββββββββββββββ β | |
| β β | |
| β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β | |
| β β KNOWLEDGE BASE (Python module, loaded at startup) β β | |
| β β - DOMAIN_TAXONOMY: 15 domain categories, 100+ subtopics β β | |
| β β - WORKFLOWS: 5 detailed step-by-step pipelines (RNA-seq, exome, β β | |
| β β microbiome, single-cell) with tools, params, common mistakes β β | |
| β β - GLOSSARY: 25 key terms with precise definitions β β | |
| β β - COMMON_MISCONCEPTIONS: 10 curated misconception/correction β β | |
| β β pairs with severity ratings β β | |
| β β - SYSTEM_PROMPTS: 7 per-module personas (tutor, coach, examiner) β β | |
| β β - QUIZ_TEMPLATES: JSON-format generation templates for MCQ/TF/SA β β | |
| β β - LESSON_TEMPLATE: Structured lesson generation prompt β β | |
| β β - TOPIC_CHOICES: 50+ dropdown options for topic selection β β | |
| β β - WORKFLOW_CHOICES: 10 pipeline options for workflow coaching β β | |
| β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β | |
| ββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β | |
| β External APIs (conditional, lazy-loaded) | |
| βΌ | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β EXTERNAL SERVICES β | |
| β β | |
| β HuggingFace Inference API HuggingFace Model Hub β | |
| β βββββββββββββββββββββββββββ βββββββββββββββββββββββββββ β | |
| β β POST /v1/chat/completionsβ β sentence-transformers/ β β | |
| β β Streaming + non-streamingβ β all-MiniLM-L6-v2 β β | |
| β β Model: Mistral-7B-Instructβ β (384-dim, 80MB, fast) β β | |
| β β Token: HF_TOKEN (secret) β β Download on first use β β | |
| β β Timeout: 120s β β CPU inference OK β β | |
| β βββββββββββββββββββββββββββ βββββββββββββββββββββββββββ β | |
| β β | |
| β Fallback (when HF_TOKEN missing): Knowledge base keyword search + β | |
| β structured responses from curated content (no LLM required) β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| ``` | |
| ## Data Flow | |
| ``` | |
| User Query β RAG Search (KB + uploaded docs) β Format context + system prompt | |
| β LLM API (streaming) β Token stream β Gradio ChatInterface display | |
| β | |
| User Upload β DocumentParser β Chunker β Embedder β Store in gr.State | |
| β LLM summarize (non-streaming) β Display explanation | |
| β Future queries search uploaded chunks via RAG | |
| ``` | |
| ## State Management | |
| | State | Scope | Type | Content | | |
| |-------|-------|------|---------| | |
| | `rag_store` | Global (all tabs) | `gr.State(dict)` | `{chunks: [...], embeddings: np.array}` | | |
| | `answer_key_state` | Quiz Me tab only | `gr.State(str)` | Raw LLM response for answer checking | | |
| ## Task Policies (Agent-like behavior) | |
| | Task Type | Iteration Budget | Retrieval | Approval | Notes | | |
| |-----------|-----------------|-----------|----------|-------| | |
| | Short factual Q&A | 1 LLM call | KB only | None | Direct answer with RAG context | | |
| | Long teaching answer | 1 LLM call | KB + uploaded docs | None | Streaming, max 4096 tokens | | |
| | Figure interpretation | 1 LLM call | Uploaded content only | None | Requires prior upload | | |
| | Workflow coaching | 1-3 LLM calls | KB + workflow steps | None | Stateful chat, accumulates | | |
| | Quiz generation | 1 LLM call | KB | None | Non-streaming, stored in State | | |
| | PaperβLesson | 1-2 LLM calls | Uploaded content | None | First call = upload analysis | | |
| | Viva practice | Multi-turn | KB | None | Examiner persona, adaptive | | |
| ## Safety Boundaries | |
| - **Educational only**: All system prompts explicitly state "you are a teaching assistant, not a clinical system" | |
| - **Clinical refusal**: Variant interpretation questions that could be clinical trigger educational redirect + referral to professionals | |
| - **Uncertainty expression**: System prompts require "say so explicitly" when uncertain | |
| - **No hallucinated citations**: RAG provides real KB content; LLM is instructed to cite specific tools/methods | |
| ## Failure Modes & Mitigation | |
| | Failure | Detection | Mitigation | | |
| |---------|-----------|------------| | |
| | HF_TOKEN missing | `LLMService.is_available()` = False | Knowledge base fallback responses | | |
| | Embedding model fails | `HAS_ST = False` or load exception | Keyword search fallback | | |
| | PDF parsing fails | `fitz` import error or exception | Text-only mode, graceful message | | |
| | LLM API timeout | Exception in stream_chat() | Error message + KB fallback suggestion | | |
| | Large file upload | Size check in parse_file() | Truncate, warn user | | |
| | Empty RAG results | Score < 0.15 threshold | Respond from general knowledge | | |
| ## Module Specifications | |
| ### Module: Ask the Tutor (Tab 1) | |
| - **Input**: User message (str), system prompt (hidden), temperature (hidden), max_tokens (hidden), rag_store | |
| - **Output**: Streaming text response | |
| - **Backend**: `tutor_respond()` β RAG search β LLM stream_chat() | |
| - **Retrieval**: KB + uploaded documents (if any) | |
| - **Latency**: Streaming, first token <3s (with HF API) | |
| - **Guardrails**: System prompt enforces educational boundary, uncertainty expression, no clinical claims | |
| ### Module: Upload & Explain (Tab 2) | |
| - **Input**: File (PDF/TXT/FASTA/VCF/etc.), rag_store | |
| - **Output**: Document analysis (Markdown), raw text (Textbox), updated rag_store | |
| - **Backend**: `process_upload()` β parse β chunk β embed β LLM summarize | |
| - **Retrieval**: Uploaded content becomes searchable across all tabs | |
| - **Latency**: Parse+embed ~2-5s, LLM summarize ~5-15s | |
| - **Guardrails**: Only bioinformatics file types accepted, max reasonable size | |
| ### Module: Quiz Me (Tab 3) | |
| - **Input**: Topic (dropdown), format (radio), difficulty (radio), # questions (slider), rag_store | |
| - **Output**: Quiz (Markdown), answer key (hidden State) | |
| - **Backend**: `generate_quiz()` β RAG context β LLM generate() with JSON template | |
| - **Retrieval**: KB topics related to selected domain | |
| - **Latency**: ~10-20s for generation | |
| - **Guardrails**: Plausible distractors, misconception-based wrong answers | |
| ### Module: Build a Lesson (Tab 4) | |
| - **Input**: Topic, level, include_exercises (checkbox), include_quiz (checkbox) | |
| - **Output**: Structured lesson (Markdown) | |
| - **Backend**: `generate_lesson()` β RAG context β LLM generate() with LESSON_TEMPLATE | |
| - **Retrieval**: KB workflow steps + glossary terms for topic | |
| - **Latency**: ~15-30s | |
| - **Guardrails**: Progressive disclosure, prerequisite listing, common pitfalls section | |
| ### Module: Workflow Coach (Tab 5) | |
| - **Input**: Message, workflow selector (dropdown), temperature | |
| - **Output**: Streaming chat response with workflow context | |
| - **Backend**: `workflow_respond()` β inject workflow steps β LLM stream_chat() | |
| - **Retrieval**: Full workflow steps from KB injected as system context | |
| - **Latency**: Streaming, first token <3s | |
| - **Guardrails**: Specific tool names, parameter mentions, QC checkpoint reminders | |
| ### Module: Paper to Lesson (Tab 6) | |
| - **Input**: Message, output_format (radio), rag_store | |
| - **Output**: Streaming lesson/study notes/slides/quiz | |
| - **Backend**: `paper_to_lesson_respond()` β search uploaded docs β LLM stream_chat() | |
| - **Retrieval**: User-uploaded document chunks | |
| - **Latency**: Streaming | |
| - **Guardrails**: Requires prior upload; warns if no uploaded content available | |
| ### Module: Viva Practice (Tab 7) | |
| - **Input**: Message, topic (dropdown), difficulty (radio) | |
| - **Output**: Streaming examiner questions and feedback | |
| - **Backend**: `viva_respond()` β KB context + viva persona β LLM stream_chat() | |
| - **Retrieval**: Topic-specific KB content | |
| - **Latency**: Streaming | |
| - **Guardrails**: Examiner persona, one question at a time, adaptive difficulty | |
| ## Evaluation Checklist | |
| Before launch, verify: | |
| - [ ] All 7 tabs render without JavaScript errors | |
| - [ ] File upload works for PDF, TXT, FASTA | |
| - [ ] KB fallback works when HF_TOKEN is missing | |
| - [ ] Streaming responses display progressively | |
| - [ ] Quiz generation produces coherent questions | |
| - [ ] Answer checking grades accurately | |
| - [ ] Uploaded content appears in cross-tab RAG search | |
| - [ ] Clinical boundary refusal works for variant questions | |
| - [ ] Workflow coach includes specific tool names | |
| - [ ] Mobile responsiveness acceptable | |