File size: 13,931 Bytes
4c15f39
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
# Bioinformatics with BB Tutor β€” Architecture & Implementation Notes

## System Architecture (Text Diagram)

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                           USER INTERACTION LAYER                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚AskTutor β”‚  β”‚UploadExplainβ”‚  β”‚QuizMe  β”‚  β”‚BuildLessonβ”‚  β”‚WorkflowCoach β”‚ β”‚
β”‚  β”‚  Chat   β”‚  β”‚ File + Chat β”‚  β”‚Generateβ”‚  β”‚Generate  β”‚  β”‚   Chat       β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”‚             β”‚            β”‚                β”‚       β”‚
β”‚  β”‚PaperTo  β”‚         β”‚             β”‚            β”‚                β”‚       β”‚
β”‚  β”‚ Lesson  β”‚         β”‚             β”‚            β”‚                β”‚       β”‚
β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜  β”Œβ”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”  β”‚                         β”‚             β”‚               β”‚  β”‚
β”‚  β”‚VivaPrac β”‚  β”‚   gr.State(rag_store)   β”‚  gr.State  β”‚  gr.State     β”‚  β”‚
β”‚  β”‚  Chat   β”‚  β”‚   (shared doc chunks)   β”‚ (quiz_key) β”‚  (session)    β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚ HTTP / REST
                     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         BACKEND ORCHESTRATION LAYER                         β”‚
β”‚                                                                             β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚   β”‚ LLMService     β”‚  β”‚ RAGService     β”‚  β”‚ DocumentParser               β”‚   β”‚
β”‚   β”‚ (Singleton)    β”‚  β”‚ (Singleton)    β”‚  β”‚ (PDF/text/sequence parse)    β”‚   β”‚
β”‚   β”‚                β”‚  β”‚                β”‚  β”‚                              β”‚   β”‚
β”‚   β”‚ HF Inference   β”‚  β”‚ SentenceTransf β”‚  β”‚  - fitz (PyMuPDF)           β”‚   β”‚
β”‚   β”‚ Client         β”‚  β”‚ all-MiniLM-L6  β”‚  β”‚  - text file reader         β”‚   β”‚
β”‚   β”‚ stream_chat()  β”‚  β”‚ 384-dim embed  β”‚  β”‚  - chunker (400w/60w overlap)β”‚   β”‚
β”‚   β”‚ generate()     β”‚  β”‚ cosine sim     β”‚  β”‚                              β”‚   β”‚
β”‚   β”‚ fallback KB      β”‚  β”‚ top-k retrieve β”‚  β”‚                              β”‚   β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                             β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚   β”‚ KNOWLEDGE BASE (Python module, loaded at startup)                     β”‚   β”‚
β”‚   β”‚  - DOMAIN_TAXONOMY: 15 domain categories, 100+ subtopics             β”‚   β”‚
β”‚   β”‚  - WORKFLOWS: 5 detailed step-by-step pipelines (RNA-seq, exome,   β”‚   β”‚
β”‚   β”‚    microbiome, single-cell) with tools, params, common mistakes     β”‚   β”‚
β”‚   β”‚  - GLOSSARY: 25 key terms with precise definitions                  β”‚   β”‚
β”‚   β”‚  - COMMON_MISCONCEPTIONS: 10 curated misconception/correction     β”‚   β”‚
β”‚   β”‚    pairs with severity ratings                                     β”‚   β”‚
β”‚   β”‚  - SYSTEM_PROMPTS: 7 per-module personas (tutor, coach, examiner)  β”‚   β”‚
β”‚   β”‚  - QUIZ_TEMPLATES: JSON-format generation templates for MCQ/TF/SA β”‚   β”‚
β”‚   β”‚  - LESSON_TEMPLATE: Structured lesson generation prompt           β”‚   β”‚
β”‚   β”‚  - TOPIC_CHOICES: 50+ dropdown options for topic selection        β”‚   β”‚
β”‚   β”‚  - WORKFLOW_CHOICES: 10 pipeline options for workflow coaching     β”‚   β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
                     β”‚ External APIs (conditional, lazy-loaded)
                     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                           EXTERNAL SERVICES                                  β”‚
β”‚                                                                             β”‚
β”‚   HuggingFace Inference API          HuggingFace Model Hub                 β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”‚
β”‚   β”‚ POST /v1/chat/completionsβ”‚        β”‚ sentence-transformers/  β”‚           β”‚
β”‚   β”‚ Streaming + non-streamingβ”‚       β”‚ all-MiniLM-L6-v2          β”‚           β”‚
β”‚   β”‚ Model: Mistral-7B-Instructβ”‚      β”‚ (384-dim, 80MB, fast)    β”‚           β”‚
β”‚   β”‚ Token: HF_TOKEN (secret)  β”‚      β”‚ Download on first use    β”‚           β”‚
β”‚   β”‚ Timeout: 120s               β”‚      β”‚ CPU inference OK         β”‚           β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β”‚
β”‚                                                                             β”‚
β”‚   Fallback (when HF_TOKEN missing): Knowledge base keyword search +        β”‚
β”‚   structured responses from curated content (no LLM required)              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

## Data Flow

```
User Query β†’ RAG Search (KB + uploaded docs) β†’ Format context + system prompt
    β†’ LLM API (streaming) β†’ Token stream β†’ Gradio ChatInterface display
                                        ↓
User Upload β†’ DocumentParser β†’ Chunker β†’ Embedder β†’ Store in gr.State
    β†’ LLM summarize (non-streaming) β†’ Display explanation
    β†’ Future queries search uploaded chunks via RAG
```

## State Management

| State | Scope | Type | Content |
|-------|-------|------|---------|
| `rag_store` | Global (all tabs) | `gr.State(dict)` | `{chunks: [...], embeddings: np.array}` |
| `answer_key_state` | Quiz Me tab only | `gr.State(str)` | Raw LLM response for answer checking |

## Task Policies (Agent-like behavior)

| Task Type | Iteration Budget | Retrieval | Approval | Notes |
|-----------|-----------------|-----------|----------|-------|
| Short factual Q&A | 1 LLM call | KB only | None | Direct answer with RAG context |
| Long teaching answer | 1 LLM call | KB + uploaded docs | None | Streaming, max 4096 tokens |
| Figure interpretation | 1 LLM call | Uploaded content only | None | Requires prior upload |
| Workflow coaching | 1-3 LLM calls | KB + workflow steps | None | Stateful chat, accumulates |
| Quiz generation | 1 LLM call | KB | None | Non-streaming, stored in State |
| Paper→Lesson | 1-2 LLM calls | Uploaded content | None | First call = upload analysis |
| Viva practice | Multi-turn | KB | None | Examiner persona, adaptive |

## Safety Boundaries

- **Educational only**: All system prompts explicitly state "you are a teaching assistant, not a clinical system"
- **Clinical refusal**: Variant interpretation questions that could be clinical trigger educational redirect + referral to professionals
- **Uncertainty expression**: System prompts require "say so explicitly" when uncertain
- **No hallucinated citations**: RAG provides real KB content; LLM is instructed to cite specific tools/methods

## Failure Modes & Mitigation

| Failure | Detection | Mitigation |
|---------|-----------|------------|
| HF_TOKEN missing | `LLMService.is_available()` = False | Knowledge base fallback responses |
| Embedding model fails | `HAS_ST = False` or load exception | Keyword search fallback |
| PDF parsing fails | `fitz` import error or exception | Text-only mode, graceful message |
| LLM API timeout | Exception in stream_chat() | Error message + KB fallback suggestion |
| Large file upload | Size check in parse_file() | Truncate, warn user |
| Empty RAG results | Score < 0.15 threshold | Respond from general knowledge |

## Module Specifications

### Module: Ask the Tutor (Tab 1)
- **Input**: User message (str), system prompt (hidden), temperature (hidden), max_tokens (hidden), rag_store
- **Output**: Streaming text response
- **Backend**: `tutor_respond()` β†’ RAG search β†’ LLM stream_chat()
- **Retrieval**: KB + uploaded documents (if any)
- **Latency**: Streaming, first token <3s (with HF API)
- **Guardrails**: System prompt enforces educational boundary, uncertainty expression, no clinical claims

### Module: Upload & Explain (Tab 2)
- **Input**: File (PDF/TXT/FASTA/VCF/etc.), rag_store
- **Output**: Document analysis (Markdown), raw text (Textbox), updated rag_store
- **Backend**: `process_upload()` β†’ parse β†’ chunk β†’ embed β†’ LLM summarize
- **Retrieval**: Uploaded content becomes searchable across all tabs
- **Latency**: Parse+embed ~2-5s, LLM summarize ~5-15s
- **Guardrails**: Only bioinformatics file types accepted, max reasonable size

### Module: Quiz Me (Tab 3)
- **Input**: Topic (dropdown), format (radio), difficulty (radio), # questions (slider), rag_store
- **Output**: Quiz (Markdown), answer key (hidden State)
- **Backend**: `generate_quiz()` β†’ RAG context β†’ LLM generate() with JSON template
- **Retrieval**: KB topics related to selected domain
- **Latency**: ~10-20s for generation
- **Guardrails**: Plausible distractors, misconception-based wrong answers

### Module: Build a Lesson (Tab 4)
- **Input**: Topic, level, include_exercises (checkbox), include_quiz (checkbox)
- **Output**: Structured lesson (Markdown)
- **Backend**: `generate_lesson()` β†’ RAG context β†’ LLM generate() with LESSON_TEMPLATE
- **Retrieval**: KB workflow steps + glossary terms for topic
- **Latency**: ~15-30s
- **Guardrails**: Progressive disclosure, prerequisite listing, common pitfalls section

### Module: Workflow Coach (Tab 5)
- **Input**: Message, workflow selector (dropdown), temperature
- **Output**: Streaming chat response with workflow context
- **Backend**: `workflow_respond()` β†’ inject workflow steps β†’ LLM stream_chat()
- **Retrieval**: Full workflow steps from KB injected as system context
- **Latency**: Streaming, first token <3s
- **Guardrails**: Specific tool names, parameter mentions, QC checkpoint reminders

### Module: Paper to Lesson (Tab 6)
- **Input**: Message, output_format (radio), rag_store
- **Output**: Streaming lesson/study notes/slides/quiz
- **Backend**: `paper_to_lesson_respond()` β†’ search uploaded docs β†’ LLM stream_chat()
- **Retrieval**: User-uploaded document chunks
- **Latency**: Streaming
- **Guardrails**: Requires prior upload; warns if no uploaded content available

### Module: Viva Practice (Tab 7)
- **Input**: Message, topic (dropdown), difficulty (radio)
- **Output**: Streaming examiner questions and feedback
- **Backend**: `viva_respond()` β†’ KB context + viva persona β†’ LLM stream_chat()
- **Retrieval**: Topic-specific KB content
- **Latency**: Streaming
- **Guardrails**: Examiner persona, one question at a time, adaptive difficulty

## Evaluation Checklist

Before launch, verify:
- [ ] All 7 tabs render without JavaScript errors
- [ ] File upload works for PDF, TXT, FASTA
- [ ] KB fallback works when HF_TOKEN is missing
- [ ] Streaming responses display progressively
- [ ] Quiz generation produces coherent questions
- [ ] Answer checking grades accurately
- [ ] Uploaded content appears in cross-tab RAG search
- [ ] Clinical boundary refusal works for variant questions
- [ ] Workflow coach includes specific tool names
- [ ] Mobile responsiveness acceptable