# Bioinformatics with BB Tutor: Architecture & Implementation Notes
## System Architecture (Text Diagram)
```
┌── USER INTERACTION LAYER ──────────────────────────────────────────
│
│  Tabs: AskTutor Chat        UploadExplain (File + Chat)
│        QuizMe (Generate)    BuildLesson (Generate)
│        WorkflowCoach Chat   PaperToLesson
│        VivaPractice Chat
│
│  Shared Gradio state:
│    gr.State(rag_store)   (shared doc chunks, all tabs)
│    gr.State(quiz_key)    gr.State(session)
│
└───────────────────────────────────┬────────────────────────────────
                                    │ HTTP / REST
                                    ▼
┌── BACKEND ORCHESTRATION LAYER ─────────────────────────────────────
│
│  LLMService (Singleton)
│    - HF Inference Client: stream_chat(), generate()
│    - fallback KB responses when no token
│
│  RAGService (Singleton)
│    - SentenceTransformer all-MiniLM-L6 (384-dim embeddings)
│    - cosine similarity, top-k retrieval
│
│  DocumentParser (PDF/text/sequence parse)
│    - fitz (PyMuPDF)
│    - text file reader
│    - chunker (400-word chunks, 60-word overlap)
│
│  KNOWLEDGE BASE (Python module, loaded at startup)
│    - DOMAIN_TAXONOMY: 15 domain categories, 100+ subtopics
│    - WORKFLOWS: 5 detailed step-by-step pipelines (RNA-seq, exome,
│      microbiome, single-cell) with tools, params, common mistakes
│    - GLOSSARY: 25 key terms with precise definitions
│    - COMMON_MISCONCEPTIONS: 10 curated misconception/correction
│      pairs with severity ratings
│    - SYSTEM_PROMPTS: 7 per-module personas (tutor, coach, examiner)
│    - QUIZ_TEMPLATES: JSON-format generation templates for MCQ/TF/SA
│    - LESSON_TEMPLATE: structured lesson generation prompt
│    - TOPIC_CHOICES: 50+ dropdown options for topic selection
│    - WORKFLOW_CHOICES: 10 pipeline options for workflow coaching
│
└───────────────────────────────────┬────────────────────────────────
                                    │ External APIs (conditional, lazy-loaded)
                                    ▼
┌── EXTERNAL SERVICES ───────────────────────────────────────────────
│
│  HuggingFace Inference API
│    - POST /v1/chat/completions (streaming + non-streaming)
│    - Model: Mistral-7B-Instruct
│    - Token: HF_TOKEN (secret), timeout: 120s
│
│  HuggingFace Model Hub
│    - sentence-transformers/all-MiniLM-L6-v2 (384-dim, 80MB, fast)
│    - download on first use, CPU inference OK
│
│  Fallback (when HF_TOKEN missing): knowledge base keyword search +
│  structured responses from curated content (no LLM required)
│
└────────────────────────────────────────────────────────────────────
```
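The DocumentParser's chunking step (400-word windows with 60-word overlap, per the diagram) can be sketched in a few lines. The function name and defaults below are illustrative, mirroring the figures in the diagram rather than the app's actual code:

```python
def chunk_words(text: str, size: int = 400, overlap: int = 60) -> list[str]:
    """Split text into overlapping word windows for embedding.

    Each chunk holds up to `size` words; consecutive chunks share
    `overlap` words so context is not cut off at hard boundaries.
    """
    words = text.split()
    step = size - overlap  # advance 340 words per chunk with the defaults
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last window already covers the tail
    return chunks
```

With these defaults, a 1,000-word document yields three chunks starting at words 0, 340, and 680.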
## Data Flow
```
User Query → RAG Search (KB + uploaded docs) → Format context + system prompt
           → LLM API (streaming) → Token stream → Gradio ChatInterface display

User Upload → DocumentParser → Chunker → Embedder → Store in gr.State
            → LLM summarize (non-streaming) → Display explanation
            → Future queries search uploaded chunks via RAG
```
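The retrieval step above (cosine similarity over embeddings, top-k, with the 0.15 score floor noted in the failure-mode table) reduces to a few lines of NumPy. This is a sketch with toy vectors; in the app the embeddings would come from all-MiniLM-L6-v2, and `top_k_chunks` is an illustrative name:

```python
import numpy as np

def top_k_chunks(query_vec, chunk_embs, k=3, threshold=0.15):
    """Return (index, score) pairs for the k most similar chunks,
    dropping anything below the score threshold."""
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = m @ q                   # cosine similarity per chunk
    order = np.argsort(-scores)[:k]  # best k, highest first
    return [(int(i), float(scores[i])) for i in order
            if scores[i] >= threshold]
```

An empty result is the signal to respond from general knowledge instead of retrieved context.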
## State Management
| State | Scope | Type | Content |
|-------|-------|------|---------|
| `rag_store` | Global (all tabs) | `gr.State(dict)` | `{chunks: [...], embeddings: np.array}` |
| `answer_key_state` | Quiz Me tab only | `gr.State(str)` | Raw LLM response for answer checking |
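On each upload the `rag_store` value is replaced wholesale rather than mutated, since Gradio `gr.State` handlers return new values. A minimal sketch of that update, with `update_rag_store` as a hypothetical helper:

```python
import numpy as np

def update_rag_store(store, new_chunks, new_embs):
    """Return a new rag_store dict with the uploaded chunks appended.

    `store` follows the shape in the table above:
    {"chunks": [...], "embeddings": np.ndarray or None}.
    """
    chunks = list(store.get("chunks", [])) + list(new_chunks)
    old = store.get("embeddings")
    embs = new_embs if old is None else np.vstack([old, new_embs])
    return {"chunks": chunks, "embeddings": embs}
```

Returning a fresh dict keeps per-session state isolated the way `gr.State` expects.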
## Task Policies (Agent-like behavior)
| Task Type | Iteration Budget | Retrieval | Approval | Notes |
|-----------|-----------------|-----------|----------|-------|
| Short factual Q&A | 1 LLM call | KB only | None | Direct answer with RAG context |
| Long teaching answer | 1 LLM call | KB + uploaded docs | None | Streaming, max 4096 tokens |
| Figure interpretation | 1 LLM call | Uploaded content only | None | Requires prior upload |
| Workflow coaching | 1-3 LLM calls | KB + workflow steps | None | Stateful chat; context accumulates across turns |
| Quiz generation | 1 LLM call | KB | None | Non-streaming, stored in State |
| PaperβLesson | 1-2 LLM calls | Uploaded content | None | First call = upload analysis |
| Viva practice | Multi-turn | KB | None | Examiner persona, adaptive |
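The iteration budgets above can be enforced with a small wrapper around the LLM call; `call_with_budget` is an illustrative helper, not code from the app:

```python
def call_with_budget(fn, budget: int):
    """Invoke `fn` up to `budget` times, returning the first success.

    Re-raises the last exception once the budget is exhausted, so the
    caller can fall back to the knowledge base.
    """
    last_err = None
    for _ in range(budget):
        try:
            return fn()
        except Exception as err:  # transient API errors, timeouts, ...
            last_err = err
    raise last_err
```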
## Safety Boundaries
- **Educational only**: All system prompts explicitly state "you are a teaching assistant, not a clinical system"
- **Clinical refusal**: Variant interpretation questions that could be clinical trigger educational redirect + referral to professionals
- **Uncertainty expression**: System prompts require the model to "say so explicitly" when it is uncertain
- **No hallucinated citations**: RAG provides real KB content; LLM is instructed to cite specific tools/methods
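A boundary like the clinical refusal is typically a cheap pre-LLM check paired with prompt language. The trigger phrases and redirect text below are purely illustrative, not the app's actual list:

```python
# Illustrative trigger phrases - the real boundary also lives in the
# system prompts, which frame the model as a teaching assistant.
CLINICAL_TRIGGERS = (
    "is this variant pathogenic",
    "my patient",
    "diagnose",
    "what treatment",
)

EDUCATIONAL_REDIRECT = (
    "I can explain how variant interpretation works in principle, but "
    "I am a teaching assistant, not a clinical system. Please consult "
    "a qualified professional for clinical questions."
)

def clinical_redirect(message: str):
    """Return the redirect text if the message looks clinical, else None."""
    lowered = message.lower()
    if any(trigger in lowered for trigger in CLINICAL_TRIGGERS):
        return EDUCATIONAL_REDIRECT
    return None
```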
## Failure Modes & Mitigation
| Failure | Detection | Mitigation |
|---------|-----------|------------|
| HF_TOKEN missing | `LLMService.is_available()` = False | Knowledge base fallback responses |
| Embedding model fails | `HAS_ST = False` or load exception | Keyword search fallback |
| PDF parsing fails | `fitz` import error or exception | Text-only mode, graceful message |
| LLM API timeout | Exception in stream_chat() | Error message + KB fallback suggestion |
| Large file upload | Size check in parse_file() | Truncate, warn user |
| Empty RAG results | Score < 0.15 threshold | Respond from general knowledge |
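The "HF_TOKEN missing" and "LLM API timeout" rows share one pattern: catch failures at call time and substitute the curated-KB response. A sketch with illustrative function names:

```python
def safe_generate(llm_generate, kb_fallback, prompt: str) -> str:
    """Try the LLM; on any failure return the curated-KB response
    with a short note, so the UI never shows a bare stack trace."""
    try:
        return llm_generate(prompt)
    except Exception as err:  # timeout, auth, network, ...
        return f"{kb_fallback(prompt)}\n\n_(LLM unavailable: {err})_"
```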
## Module Specifications
### Module: Ask the Tutor (Tab 1)
- **Input**: User message (str), system prompt (hidden), temperature (hidden), max_tokens (hidden), rag_store
- **Output**: Streaming text response
- **Backend**: `tutor_respond()` β RAG search β LLM stream_chat()
- **Retrieval**: KB + uploaded documents (if any)
- **Latency**: Streaming, first token <3s (with HF API)
- **Guardrails**: System prompt enforces educational boundary, uncertainty expression, no clinical claims
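Gradio's `ChatInterface` renders streaming output by consuming a generator that yields progressively longer strings. A minimal sketch of the handler's streaming shape; the real `tutor_respond()` also performs the RAG search first, and `stream_fn` stands in for `LLMService.stream_chat()`:

```python
def tutor_respond(message, history, stream_fn):
    """Yield the accumulated reply after each token - the contract
    Gradio's ChatInterface expects for streaming display."""
    partial = ""
    for token in stream_fn(message):
        partial += token
        yield partial
```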
### Module: Upload & Explain (Tab 2)
- **Input**: File (PDF/TXT/FASTA/VCF/etc.), rag_store
- **Output**: Document analysis (Markdown), raw text (Textbox), updated rag_store
- **Backend**: `process_upload()` β parse β chunk β embed β LLM summarize
- **Retrieval**: Uploaded content becomes searchable across all tabs
- **Latency**: Parse+embed ~2-5s, LLM summarize ~5-15s
- **Guardrails**: Only bioinformatics file types accepted, max reasonable size
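The file-type and size guardrails can be a short pre-parse check. The extension allow-list and 10 MB cap below are assumptions for illustration (the spec above says only "bioinformatics file types" and "max reasonable size"):

```python
import os

# Assumed allow-list and cap - adjust to the app's real limits.
ALLOWED_EXTS = {".pdf", ".txt", ".md", ".fasta", ".fa", ".fastq", ".vcf", ".gff"}
MAX_BYTES = 10 * 1024 * 1024

def validate_upload(path: str, size_bytes: int):
    """Return (ok, message); intended to run before parse/embed."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTS:
        return False, f"Unsupported file type: {ext or '(none)'}"
    if size_bytes > MAX_BYTES:
        return False, "File too large - it will be truncated."
    return True, "ok"
```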
### Module: Quiz Me (Tab 3)
- **Input**: Topic (dropdown), format (radio), difficulty (radio), # questions (slider), rag_store
- **Output**: Quiz (Markdown), answer key (hidden State)
- **Backend**: `generate_quiz()` β RAG context β LLM generate() with JSON template
- **Retrieval**: KB topics related to selected domain
- **Latency**: ~10-20s for generation
- **Guardrails**: Plausible distractors, misconception-based wrong answers
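Because the answer key is stored as the raw LLM response, the grader has to pull structured questions back out of free-form model output. A defensive extraction sketch (the exact quiz JSON schema is assumed):

```python
import json
import re

def extract_quiz(raw: str):
    """Pull the first JSON object out of an LLM reply that may be
    wrapped in prose or a ```json fence; return None if unparsable."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```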
### Module: Build a Lesson (Tab 4)
- **Input**: Topic, level, include_exercises (checkbox), include_quiz (checkbox)
- **Output**: Structured lesson (Markdown)
- **Backend**: `generate_lesson()` β RAG context β LLM generate() with LESSON_TEMPLATE
- **Retrieval**: KB workflow steps + glossary terms for topic
- **Latency**: ~15-30s
- **Guardrails**: Progressive disclosure, prerequisite listing, common pitfalls section
### Module: Workflow Coach (Tab 5)
- **Input**: Message, workflow selector (dropdown), temperature
- **Output**: Streaming chat response with workflow context
- **Backend**: `workflow_respond()` β inject workflow steps β LLM stream_chat()
- **Retrieval**: Full workflow steps from KB injected as system context
- **Latency**: Streaming, first token <3s
- **Guardrails**: Specific tool names, parameter mentions, QC checkpoint reminders
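Injecting the selected workflow's KB steps into the system context is simple string assembly; `build_workflow_context` and the dict shape are illustrative:

```python
def build_workflow_context(persona: str, workflow: dict) -> str:
    """Prepend the coach persona, then enumerate the KB workflow
    steps so the model can reference tools and QC checkpoints."""
    steps = "\n".join(
        f"{i}. {step}" for i, step in enumerate(workflow["steps"], start=1)
    )
    return f"{persona}\n\nActive workflow: {workflow['name']}\nSteps:\n{steps}"
```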
### Module: Paper to Lesson (Tab 6)
- **Input**: Message, output_format (radio), rag_store
- **Output**: Streaming lesson/study notes/slides/quiz
- **Backend**: `paper_to_lesson_respond()` β search uploaded docs β LLM stream_chat()
- **Retrieval**: User-uploaded document chunks
- **Latency**: Streaming
- **Guardrails**: Requires prior upload; warns if no uploaded content available
### Module: Viva Practice (Tab 7)
- **Input**: Message, topic (dropdown), difficulty (radio)
- **Output**: Streaming examiner questions and feedback
- **Backend**: `viva_respond()` β KB context + viva persona β LLM stream_chat()
- **Retrieval**: Topic-specific KB content
- **Latency**: Streaming
- **Guardrails**: Examiner persona, one question at a time, adaptive difficulty
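Adaptive difficulty can be a small state machine driven by an answer-quality score; the levels match the difficulty radio, but the thresholds here are illustrative:

```python
LEVELS = ["easy", "medium", "hard"]

def adjust_difficulty(level: str, score: float) -> str:
    """Step difficulty up after a strong answer (score >= 0.8) and
    down after a weak one (score < 0.4); otherwise hold steady."""
    i = LEVELS.index(level)
    if score >= 0.8:
        i = min(i + 1, len(LEVELS) - 1)
    elif score < 0.4:
        i = max(i - 1, 0)
    return LEVELS[i]
```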
## Evaluation Checklist
Before launch, verify:
- [ ] All 7 tabs render without JavaScript errors
- [ ] File upload works for PDF, TXT, FASTA
- [ ] KB fallback works when HF_TOKEN is missing
- [ ] Streaming responses display progressively
- [ ] Quiz generation produces coherent questions
- [ ] Answer checking grades accurately
- [ ] Uploaded content appears in cross-tab RAG search
- [ ] Clinical boundary refusal works for variant questions
- [ ] Workflow coach includes specific tool names
- [ ] Mobile responsiveness acceptable