Spaces:

Babajaan
/

bioinformatics-bb-tutor

Sleeping

App Files Files Community

bioinformatics-bb-tutor / ARCHITECTURE.md

Babajaan

Add architecture diagram, sample conversations, and module spec

4c15f39 verified 14 days ago

preview code

raw

history blame contribute delete

13.9 kB

A newer version of the Gradio SDK is available: 6.14.0

Upgrade

Bioinformatics with BB Tutor — Architecture & Implementation Notes

System Architecture (Text Diagram)

┌─────────────────────────────────────────────────────────────────────────────┐
│                           USER INTERACTION LAYER                            │
│  ┌─────────┐  ┌─────────────┐  ┌────────┐  ┌──────────┐  ┌───────────────┐ │
│  │AskTutor │  │UploadExplain│  │QuizMe  │  │BuildLesson│  │WorkflowCoach │ │
│  │  Chat   │  │ File + Chat │  │Generate│  │Generate  │  │   Chat       │ │
│  └────┬────┘  └──────┬──────┘  └───┬────┘  └────┬─────┘  └───────┬───────┘ │
│  ┌─────────┐         │             │            │                │       │
│  │PaperTo  │         │             │            │                │       │
│  │ Lesson  │         │             │            │                │       │
│  └────┬────┘  ┌──────┴─────────────┴────┬───────┴─────┬──────────┴────┐  │
│  ┌────┴────┐  │                         │             │               │  │
│  │VivaPrac │  │   gr.State(rag_store)   │  gr.State  │  gr.State     │  │
│  │  Chat   │  │   (shared doc chunks)   │ (quiz_key) │  (session)    │  │
│  └─────────┘  └─────────────────────────┴─────────────┴───────────────┘  │
└────────────────────┬────────────────────────────────────────────────────────┘
                     │ HTTP / REST
                     ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         BACKEND ORCHESTRATION LAYER                         │
│                                                                             │
│   ┌────────────────┐  ┌────────────────┐  ┌──────────────────────────────┐   │
│   │ LLMService     │  │ RAGService     │  │ DocumentParser               │   │
│   │ (Singleton)    │  │ (Singleton)    │  │ (PDF/text/sequence parse)    │   │
│   │                │  │                │  │                              │   │
│   │ HF Inference   │  │ SentenceTransf │  │  - fitz (PyMuPDF)           │   │
│   │ Client         │  │ all-MiniLM-L6  │  │  - text file reader         │   │
│   │ stream_chat()  │  │ 384-dim embed  │  │  - chunker (400w/60w overlap)│   │
│   │ generate()     │  │ cosine sim     │  │                              │   │
│   │ fallback KB      │  │ top-k retrieve │  │                              │   │
│   └────────────────┘  └────────────────┘  └──────────────────────────────┘   │
│                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │ KNOWLEDGE BASE (Python module, loaded at startup)                     │   │
│   │  - DOMAIN_TAXONOMY: 15 domain categories, 100+ subtopics             │   │
│   │  - WORKFLOWS: 5 detailed step-by-step pipelines (RNA-seq, exome,   │   │
│   │    microbiome, single-cell) with tools, params, common mistakes     │   │
│   │  - GLOSSARY: 25 key terms with precise definitions                  │   │
│   │  - COMMON_MISCONCEPTIONS: 10 curated misconception/correction     │   │
│   │    pairs with severity ratings                                     │   │
│   │  - SYSTEM_PROMPTS: 7 per-module personas (tutor, coach, examiner)  │   │
│   │  - QUIZ_TEMPLATES: JSON-format generation templates for MCQ/TF/SA │   │
│   │  - LESSON_TEMPLATE: Structured lesson generation prompt           │   │
│   │  - TOPIC_CHOICES: 50+ dropdown options for topic selection        │   │
│   │  - WORKFLOW_CHOICES: 10 pipeline options for workflow coaching     │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
└────────────────────┬────────────────────────────────────────────────────────┘
                     │
                     │ External APIs (conditional, lazy-loaded)
                     ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           EXTERNAL SERVICES                                  │
│                                                                             │
│   HuggingFace Inference API          HuggingFace Model Hub                 │
│   ┌─────────────────────────┐         ┌─────────────────────────┐           │
│   │ POST /v1/chat/completions│        │ sentence-transformers/  │           │
│   │ Streaming + non-streaming│       │ all-MiniLM-L6-v2          │           │
│   │ Model: Mistral-7B-Instruct│      │ (384-dim, 80MB, fast)    │           │
│   │ Token: HF_TOKEN (secret)  │      │ Download on first use    │           │
│   │ Timeout: 120s               │      │ CPU inference OK         │           │
│   └─────────────────────────┘         └─────────────────────────┘           │
│                                                                             │
│   Fallback (when HF_TOKEN missing): Knowledge base keyword search +        │
│   structured responses from curated content (no LLM required)              │
└─────────────────────────────────────────────────────────────────────────────┘

Data Flow

User Query → RAG Search (KB + uploaded docs) → Format context + system prompt
    → LLM API (streaming) → Token stream → Gradio ChatInterface display
                                        ↓
User Upload → DocumentParser → Chunker → Embedder → Store in gr.State
    → LLM summarize (non-streaming) → Display explanation
    → Future queries search uploaded chunks via RAG

State Management

State	Scope	Type	Content
`rag_store`	Global (all tabs)	`gr.State(dict)`	`{chunks: [...], embeddings: np.array}`
`answer_key_state`	Quiz Me tab only	`gr.State(str)`	Raw LLM response for answer checking

Task Policies (Agent-like behavior)

Task Type	Iteration Budget	Retrieval	Approval	Notes
Short factual Q&A	1 LLM call	KB only	None	Direct answer with RAG context
Long teaching answer	1 LLM call	KB + uploaded docs	None	Streaming, max 4096 tokens
Figure interpretation	1 LLM call	Uploaded content only	None	Requires prior upload
Workflow coaching	1-3 LLM calls	KB + workflow steps	None	Stateful chat, accumulates
Quiz generation	1 LLM call	KB	None	Non-streaming, stored in State
Paper→Lesson	1-2 LLM calls	Uploaded content	None	First call = upload analysis
Viva practice	Multi-turn	KB	None	Examiner persona, adaptive

Safety Boundaries

Educational only: All system prompts explicitly state "you are a teaching assistant, not a clinical system"
Clinical refusal: Variant interpretation questions that could be clinical trigger educational redirect + referral to professionals
Uncertainty expression: System prompts require "say so explicitly" when uncertain
No hallucinated citations: RAG provides real KB content; LLM is instructed to cite specific tools/methods

Failure Modes & Mitigation

Failure	Detection	Mitigation
HF_TOKEN missing	`LLMService.is_available()` = False	Knowledge base fallback responses
Embedding model fails	`HAS_ST = False` or load exception	Keyword search fallback
PDF parsing fails	`fitz` import error or exception	Text-only mode, graceful message
LLM API timeout	Exception in stream_chat()	Error message + KB fallback suggestion
Large file upload	Size check in parse_file()	Truncate, warn user
Empty RAG results	Score < 0.15 threshold	Respond from general knowledge

Module Specifications

Module: Ask the Tutor (Tab 1)

Input: User message (str), system prompt (hidden), temperature (hidden), max_tokens (hidden), rag_store
Output: Streaming text response
Backend: tutor_respond() → RAG search → LLM stream_chat()
Retrieval: KB + uploaded documents (if any)
Latency: Streaming, first token <3s (with HF API)
Guardrails: System prompt enforces educational boundary, uncertainty expression, no clinical claims

Module: Upload & Explain (Tab 2)

Input: File (PDF/TXT/FASTA/VCF/etc.), rag_store
Output: Document analysis (Markdown), raw text (Textbox), updated rag_store
Backend: process_upload() → parse → chunk → embed → LLM summarize
Retrieval: Uploaded content becomes searchable across all tabs
Latency: Parse+embed ~2-5s, LLM summarize ~5-15s
Guardrails: Only bioinformatics file types accepted, max reasonable size

Module: Quiz Me (Tab 3)

Input: Topic (dropdown), format (radio), difficulty (radio), # questions (slider), rag_store
Output: Quiz (Markdown), answer key (hidden State)
Backend: generate_quiz() → RAG context → LLM generate() with JSON template
Retrieval: KB topics related to selected domain
Latency: ~10-20s for generation
Guardrails: Plausible distractors, misconception-based wrong answers

Module: Build a Lesson (Tab 4)

Input: Topic, level, include_exercises (checkbox), include_quiz (checkbox)
Output: Structured lesson (Markdown)
Backend: generate_lesson() → RAG context → LLM generate() with LESSON_TEMPLATE
Retrieval: KB workflow steps + glossary terms for topic
Latency: ~15-30s
Guardrails: Progressive disclosure, prerequisite listing, common pitfalls section

Module: Workflow Coach (Tab 5)

Input: Message, workflow selector (dropdown), temperature
Output: Streaming chat response with workflow context
Backend: workflow_respond() → inject workflow steps → LLM stream_chat()
Retrieval: Full workflow steps from KB injected as system context
Latency: Streaming, first token <3s
Guardrails: Specific tool names, parameter mentions, QC checkpoint reminders

Module: Paper to Lesson (Tab 6)

Input: Message, output_format (radio), rag_store
Output: Streaming lesson/study notes/slides/quiz
Backend: paper_to_lesson_respond() → search uploaded docs → LLM stream_chat()
Retrieval: User-uploaded document chunks
Latency: Streaming
Guardrails: Requires prior upload; warns if no uploaded content available

Module: Viva Practice (Tab 7)

Input: Message, topic (dropdown), difficulty (radio)
Output: Streaming examiner questions and feedback
Backend: viva_respond() → KB context + viva persona → LLM stream_chat()
Retrieval: Topic-specific KB content
Latency: Streaming
Guardrails: Examiner persona, one question at a time, adaptive difficulty

Evaluation Checklist

Before launch, verify:

All 7 tabs render without JavaScript errors
File upload works for PDF, TXT, FASTA
KB fallback works when HF_TOKEN is missing
Streaming responses display progressively
Quiz generation produces coherent questions
Answer checking grades accurately
Uploaded content appears in cross-tab RAG search
Clinical boundary refusal works for variant questions
Workflow coach includes specific tool names
Mobile responsiveness acceptable