Spaces:

Babajaan
/

bioinformatics-bb-tutor

Sleeping

File size: 13,931 Bytes

4c15f39

# Bioinformatics with BB Tutor — Architecture & Implementation Notes

## System Architecture (Text Diagram)

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           USER INTERACTION LAYER                            │
│  ┌─────────┐  ┌─────────────┐  ┌────────┐  ┌──────────┐  ┌───────────────┐ │
│  │AskTutor │  │UploadExplain│  │QuizMe  │  │BuildLesson│  │WorkflowCoach │ │
│  │  Chat   │  │ File + Chat │  │Generate│  │Generate  │  │   Chat       │ │
│  └────┬────┘  └──────┬──────┘  └───┬────┘  └────┬─────┘  └───────┬───────┘ │
│  ┌─────────┐         │             │            │                │       │
│  │PaperTo  │         │             │            │                │       │
│  │ Lesson  │         │             │            │                │       │
│  └────┬────┘  ┌──────┴─────────────┴────┬───────┴─────┬──────────┴────┐  │
│  ┌────┴────┐  │                         │             │               │  │
│  │VivaPrac │  │   gr.State(rag_store)   │  gr.State  │  gr.State     │  │
│  │  Chat   │  │   (shared doc chunks)   │ (quiz_key) │  (session)    │  │
│  └─────────┘  └─────────────────────────┴─────────────┴───────────────┘  │
└────────────────────┬────────────────────────────────────────────────────────┘
                     │ HTTP / REST
                     ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         BACKEND ORCHESTRATION LAYER                         │
│                                                                             │
│   ┌────────────────┐  ┌────────────────┐  ┌──────────────────────────────┐   │
│   │ LLMService     │  │ RAGService     │  │ DocumentParser               │   │
│   │ (Singleton)    │  │ (Singleton)    │  │ (PDF/text/sequence parse)    │   │
│   │                │  │                │  │                              │   │
│   │ HF Inference   │  │ SentenceTransf │  │  - fitz (PyMuPDF)           │   │
│   │ Client         │  │ all-MiniLM-L6  │  │  - text file reader         │   │
│   │ stream_chat()  │  │ 384-dim embed  │  │  - chunker (400w/60w overlap)│   │
│   │ generate()     │  │ cosine sim     │  │                              │   │
│   │ fallback KB      │  │ top-k retrieve │  │                              │   │
│   └────────────────┘  └────────────────┘  └──────────────────────────────┘   │
│                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │ KNOWLEDGE BASE (Python module, loaded at startup)                     │   │
│   │  - DOMAIN_TAXONOMY: 15 domain categories, 100+ subtopics             │   │
│   │  - WORKFLOWS: 5 detailed step-by-step pipelines (RNA-seq, exome,   │   │
│   │    microbiome, single-cell) with tools, params, common mistakes     │   │
│   │  - GLOSSARY: 25 key terms with precise definitions                  │   │
│   │  - COMMON_MISCONCEPTIONS: 10 curated misconception/correction     │   │
│   │    pairs with severity ratings                                     │   │
│   │  - SYSTEM_PROMPTS: 7 per-module personas (tutor, coach, examiner)  │   │
│   │  - QUIZ_TEMPLATES: JSON-format generation templates for MCQ/TF/SA │   │
│   │  - LESSON_TEMPLATE: Structured lesson generation prompt           │   │
│   │  - TOPIC_CHOICES: 50+ dropdown options for topic selection        │   │
│   │  - WORKFLOW_CHOICES: 10 pipeline options for workflow coaching     │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
└────────────────────┬────────────────────────────────────────────────────────┘
                     │
                     │ External APIs (conditional, lazy-loaded)
                     ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           EXTERNAL SERVICES                                  │
│                                                                             │
│   HuggingFace Inference API          HuggingFace Model Hub                 │
│   ┌─────────────────────────┐         ┌─────────────────────────┐           │
│   │ POST /v1/chat/completions│        │ sentence-transformers/  │           │
│   │ Streaming + non-streaming│       │ all-MiniLM-L6-v2          │           │
│   │ Model: Mistral-7B-Instruct│      │ (384-dim, 80MB, fast)    │           │
│   │ Token: HF_TOKEN (secret)  │      │ Download on first use    │           │
│   │ Timeout: 120s               │      │ CPU inference OK         │           │
│   └─────────────────────────┘         └─────────────────────────┘           │
│                                                                             │
│   Fallback (when HF_TOKEN missing): Knowledge base keyword search +        │
│   structured responses from curated content (no LLM required)              │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Data Flow

```
User Query → RAG Search (KB + uploaded docs) → Format context + system prompt
    → LLM API (streaming) → Token stream → Gradio ChatInterface display
                                        ↓
User Upload → DocumentParser → Chunker → Embedder → Store in gr.State
    → LLM summarize (non-streaming) → Display explanation
    → Future queries search uploaded chunks via RAG
```

## State Management

| State | Scope | Type | Content |
|-------|-------|------|---------|
| `rag_store` | Global (all tabs) | `gr.State(dict)` | `{chunks: [...], embeddings: np.array}` |
| `answer_key_state` | Quiz Me tab only | `gr.State(str)` | Raw LLM response for answer checking |

## Task Policies (Agent-like behavior)

| Task Type | Iteration Budget | Retrieval | Approval | Notes |
|-----------|-----------------|-----------|----------|-------|
| Short factual Q&A | 1 LLM call | KB only | None | Direct answer with RAG context |
| Long teaching answer | 1 LLM call | KB + uploaded docs | None | Streaming, max 4096 tokens |
| Figure interpretation | 1 LLM call | Uploaded content only | None | Requires prior upload |
| Workflow coaching | 1-3 LLM calls | KB + workflow steps | None | Stateful chat, accumulates |
| Quiz generation | 1 LLM call | KB | None | Non-streaming, stored in State |
| Paper→Lesson | 1-2 LLM calls | Uploaded content | None | First call = upload analysis |
| Viva practice | Multi-turn | KB | None | Examiner persona, adaptive |

## Safety Boundaries

- **Educational only**: All system prompts explicitly state "you are a teaching assistant, not a clinical system"
- **Clinical refusal**: Variant interpretation questions that could be clinical trigger educational redirect + referral to professionals
- **Uncertainty expression**: System prompts require "say so explicitly" when uncertain
- **No hallucinated citations**: RAG provides real KB content; LLM is instructed to cite specific tools/methods

## Failure Modes & Mitigation

| Failure | Detection | Mitigation |
|---------|-----------|------------|
| HF_TOKEN missing | `LLMService.is_available()` = False | Knowledge base fallback responses |
| Embedding model fails | `HAS_ST = False` or load exception | Keyword search fallback |
| PDF parsing fails | `fitz` import error or exception | Text-only mode, graceful message |
| LLM API timeout | Exception in stream_chat() | Error message + KB fallback suggestion |
| Large file upload | Size check in parse_file() | Truncate, warn user |
| Empty RAG results | Score < 0.15 threshold | Respond from general knowledge |

## Module Specifications

### Module: Ask the Tutor (Tab 1)
- **Input**: User message (str), system prompt (hidden), temperature (hidden), max_tokens (hidden), rag_store
- **Output**: Streaming text response
- **Backend**: `tutor_respond()` → RAG search → LLM stream_chat()
- **Retrieval**: KB + uploaded documents (if any)
- **Latency**: Streaming, first token <3s (with HF API)
- **Guardrails**: System prompt enforces educational boundary, uncertainty expression, no clinical claims

### Module: Upload & Explain (Tab 2)
- **Input**: File (PDF/TXT/FASTA/VCF/etc.), rag_store
- **Output**: Document analysis (Markdown), raw text (Textbox), updated rag_store
- **Backend**: `process_upload()` → parse → chunk → embed → LLM summarize
- **Retrieval**: Uploaded content becomes searchable across all tabs
- **Latency**: Parse+embed ~2-5s, LLM summarize ~5-15s
- **Guardrails**: Only bioinformatics file types accepted, max reasonable size

### Module: Quiz Me (Tab 3)
- **Input**: Topic (dropdown), format (radio), difficulty (radio), # questions (slider), rag_store
- **Output**: Quiz (Markdown), answer key (hidden State)
- **Backend**: `generate_quiz()` → RAG context → LLM generate() with JSON template
- **Retrieval**: KB topics related to selected domain
- **Latency**: ~10-20s for generation
- **Guardrails**: Plausible distractors, misconception-based wrong answers

### Module: Build a Lesson (Tab 4)
- **Input**: Topic, level, include_exercises (checkbox), include_quiz (checkbox)
- **Output**: Structured lesson (Markdown)
- **Backend**: `generate_lesson()` → RAG context → LLM generate() with LESSON_TEMPLATE
- **Retrieval**: KB workflow steps + glossary terms for topic
- **Latency**: ~15-30s
- **Guardrails**: Progressive disclosure, prerequisite listing, common pitfalls section

### Module: Workflow Coach (Tab 5)
- **Input**: Message, workflow selector (dropdown), temperature
- **Output**: Streaming chat response with workflow context
- **Backend**: `workflow_respond()` → inject workflow steps → LLM stream_chat()
- **Retrieval**: Full workflow steps from KB injected as system context
- **Latency**: Streaming, first token <3s
- **Guardrails**: Specific tool names, parameter mentions, QC checkpoint reminders

### Module: Paper to Lesson (Tab 6)
- **Input**: Message, output_format (radio), rag_store
- **Output**: Streaming lesson/study notes/slides/quiz
- **Backend**: `paper_to_lesson_respond()` → search uploaded docs → LLM stream_chat()
- **Retrieval**: User-uploaded document chunks
- **Latency**: Streaming
- **Guardrails**: Requires prior upload; warns if no uploaded content available

### Module: Viva Practice (Tab 7)
- **Input**: Message, topic (dropdown), difficulty (radio)
- **Output**: Streaming examiner questions and feedback
- **Backend**: `viva_respond()` → KB context + viva persona → LLM stream_chat()
- **Retrieval**: Topic-specific KB content
- **Latency**: Streaming
- **Guardrails**: Examiner persona, one question at a time, adaptive difficulty

## Evaluation Checklist

Before launch, verify:
- [ ] All 7 tabs render without JavaScript errors
- [ ] File upload works for PDF, TXT, FASTA
- [ ] KB fallback works when HF_TOKEN is missing
- [ ] Streaming responses display progressively
- [ ] Quiz generation produces coherent questions
- [ ] Answer checking grades accurately
- [ ] Uploaded content appears in cross-tab RAG search
- [ ] Clinical boundary refusal works for variant questions
- [ ] Workflow coach includes specific tool names
- [ ] Mobile responsiveness acceptable