Babajaan's picture
Add sample conversations and structured data examples
9b66dba verified
# Sample Conversations for Bioinformatics BB Tutor
---
## Sample 1: Upload & Explain β€” RNA-seq Volcano Plot
### Context
Student uploads a screenshot/figure showing a volcano plot from a DESeq2 differential expression analysis comparing treatment vs control.
### Conversation
**Student**: *[Uploads volcano plot image β€” or pastes description:]*
> "I have a volcano plot from my DESeq2 analysis. The X-axis is log2 fold change and Y-axis is -log10(p-value). I see many points at the top with extreme fold changes but the p-values seem small. Also there are red dots scattered along the edges. What does this plot tell me?"
**BB Tutor**:
> This is a **volcano plot**, one of the most important visualizations in differential expression analysis. Let me break down what you're seeing:
>
> **Axes**:
> - **X-axis**: log2 Fold Change (log2FC) β€” positive = upregulated in treatment, negative = downregulated. A point at +2 means 4x higher expression; at -2 means 4x lower (1/4th).
> - **Y-axis**: -log10(adjusted p-value) β€” higher = more statistically significant. A point at y=2 means padj = 0.01; at y=3 means padj = 0.001.
>
> **The "volcano" shape** is typical: most genes have small fold changes AND are not significant (clustered near the center-bottom). The "arms" of the volcano are genes with large fold changes β€” some significant (top), some not (middle).
>
> **What the red dots likely represent**: These are typically your **statistically significant differentially expressed genes** β€” usually filtered by both |log2FC| > threshold (e.g., 1 or 2) AND padj < 0.05. In DESeq2, these would be from `results(dds, lfcThreshold=1)` or similar.
>
> **⚠️ Critical points to check**:
> 1. **Points at the edges with high fold change but low on Y**: These have large fold changes but are **NOT statistically significant**. Do NOT interpret them as biologically important just because they're far on the X-axis. They may have high variance or low counts.
> 2. **Are you using adjusted p-values (padj/FDR) or raw p-values?** If raw p-values, re-run with `results(dds)` which returns padj from Benjamini-Hochberg correction. Using raw p-values with 20,000 genes means ~1,000 false positives by chance.
> 3. **Did you use `lfcShrink()`?** Volcano plots should use shrunken log2 fold changes, especially for low-count genes. Unshrunken LFCs can be artificially extreme.
> 4. **Symmetry check**: Is the volcano roughly symmetric? Strong asymmetry may indicate a global expression shift, batch effects, or normalization issues.
>
> **Suggested next steps**:
> - Filter for genes with |lfcShrink| > 1 AND padj < 0.05
> - Create a heatmap of top 50 DE genes across samples (check if replicates cluster)
> - Run GO/KEGG enrichment on upregulated and downregulated gene sets separately
> - Check if any highly significant genes are ribosomal or mitochondrial (common artifacts)
>
> Would you like me to explain any specific cluster of points, or help you interpret the biological meaning of your top hits?
---
## Sample 2: Workflow Coach β€” Exome Sequencing Variant Calling
### Context
Student is running an exome sequencing analysis for the first time and is at the alignment step, confused about reference genome choice and BQSR.
### Conversation
**Student**: I'm doing exome sequencing variant calling for the first time. I've aligned with BWA-MEM to GRCh37. My sequencing facility gave me a capture kit bed file for hg38. Can I just proceed?
**BB Tutor**:
> **STOP** β€” you have a **reference genome mismatch** that will invalidate your results. Here's why and how to fix it:
>
> **The Problem**:
> - You aligned to **GRCh37** (also called hg19)
> - Your capture kit BED file is in **GRCh38** (hg38) coordinates
> - These assemblies have **different coordinate systems** β€” the same genomic location can have different coordinates due to sequence additions, corrections, and patches between the two versions
> - If you use a hg38 BED with GRCh37-aligned BAMs, variant calling will happen in the WRONG regions, and you'll either miss true variants or call false ones in off-target regions
>
> **The Fix β€” Two Options**:
>
> **Option 1 (Recommended): Re-align to GRCh38/hg38**
> This is the current standard reference. Most modern capture kits, databases (gnomAD v3+), and pipelines use hg38.
> ```bash
> # Download GRCh38 reference (if not already available)
> bwa index GRCh38.fa
>
> # Re-align your reads
> bwa mem GRCh38.fa read1.fq read2.fq | samtools sort -o aligned_hg38.bam
> samtools index aligned_hg38.bam
> ```
> Then use your hg38 BED file normally with `gatk HaplotypeCaller -L capture_kit.hg38.bed`.
>
> **Option 2: Convert the BED file to GRCh37**
> Only if you have a specific reason to use GRCh37 (e.g., legacy cohort matching). Use UCSC liftOver:
> ```bash
> # Download liftOver chain file: hg38ToHg19.over.chain.gz
> liftOver capture_kit.hg38.bed hg38ToHg19.over.chain.gz capture_kit.hg19.bed unlifted.bed
> ```
> Check `unlifted.bed` β€” any regions that couldn't be mapped will be lost. This should be <1% for standard capture kits.
>
> **My strong recommendation**: Use Option 1. GRCh38 is the current standard, has better representation of difficult regions, and your gnomAD population frequencies (critical for filtering) are in hg38 coordinates.
>
> **Next step in your workflow**:
> After re-aligning to GRCh38, you'll need to:
> 1. Add read groups with `gatk AddOrReplaceReadGroups` (if not in original BAM)
> 2. Mark duplicates with `gatk MarkDuplicates`
> 3. Run Base Quality Score Recalibration (BQSR) β€” this is **critical** and often skipped by beginners
>
> **About BQSR** (since you mentioned being confused about it):
> BQSR uses machine learning to correct systematic errors in base quality scores reported by the sequencer. It requires known variant sites (dbSNP, Mills indels) as training data. Without BQSR, your variant quality scores will be inaccurate, leading to false positives/negatives.
>
> ```bash
> gatk BaseRecalibrator -I sorted_dedup.bam -R GRCh38.fa \
> --known-sites dbsnp_146.hg38.vcf.gz \
> --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
> -O recal_data.table
>
> gatk ApplyBQSR -R GRCh38.fa -I sorted_dedup.bam --bqsr-recal-file recal_data.table -O recal.bam
> ```
>
> Do you want me to walk through any of these steps in more detail?
**Student**: Okay, I'll re-align to GRCh38. After BQSR, do I call variants on the whole genome or just the exome capture regions?
**BB Tutor**:
> **Only the capture regions** β€” this is another critical decision point.
>
> **Why restrict to target regions**:
> 1. **Computational efficiency**: Exome capture covers ~1-2% of the genome. Calling genome-wide wastes compute and increases false positives in low-coverage off-target regions.
> 2. **Quality**: Off-target regions have very low coverage (often <5x), producing unreliable variant calls.
> 3. **Interpretability**: For exome studies, you're only interested in coding/regulatory variants within the target. Genome-wide variants would be irrelevant and distracting.
>
> **How to do it**:
> ```bash
> gatk HaplotypeCaller \
> -R GRCh38.fa \
> -I recal.bam \
> -L capture_kit.hg38.bed \ # ← Critical: restricts to exome targets
> -O output.g.vcf.gz \
> -ERC GVCF # ← Per-sample gVCF mode for cohort calling
> ```
>
> **Important flags explained**:
> - `-L capture_kit.bed`: Restricts variant calling to your capture regions. Include padding (e.g., Β±100bp) if you want flanking splice sites.
> - `-ERC GVCF`: Generates a Genomic VCF per sample. This is the GATK recommended approach for cohort studies because it allows joint genotyping later with `GenomicsDBImport` + `GenotypeGVCFs`.
>
> **After per-sample gVCFs**:
> If you have multiple samples (e.g., a family or cohort):
> ```bash
> # Joint genotyping (do this once for all samples)
> gatk GenomicsDBImport \
> -V sample1.g.vcf.gz -V sample2.g.vcf.gz ... \
> --genomicsdb-workspace-path my_database \
> -L capture_kit.hg38.bed
>
> gatk GenotypeGVCFs \
> -R GRCh38.fa \
> -V gendb://my_database \
> -O cohort.vcf.gz
> ```
>
> **What NOT to do**:
> - ❌ Do NOT call variants on the whole genome (`-L` omitted) β€” wastes compute, creates noise
> - ❌ Do NOT merge BAMs and call jointly β€” GATK's GVCF workflow is designed for scalable cohort calling
> - ❌ Do NOT skip `-ERC GVCF` for cohort studies β€” you'll lose the benefits of joint genotyping
>
> **Next checkpoint**: After you have cohort.vcf.gz, you'll need to filter variants (hard filtering with <30 samples, VQSR with β‰₯30 samples). Want me to explain the filtering thresholds?
---
## Retrieval Metadata Example
```json
{
"chunk_id": "wf-rnaseq-005-deseq2",
"source_category": "workflow",
"source_name": "rna_seq_bulk",
"source_version": "1.0",
"domain": "rna_seq",
"subtopic": "differential_expression",
"content_type": "workflow_step",
"content": "RNA-seq Differential Expression β€” Step 7. Use DESeq2, edgeR, or limma-voom to identify differentially expressed genes. Include all relevant covariates in the design formula (e.g., ~batch + condition). Filter lowly expressed genes before analysis. Use adjusted p-values (BH/FDR) for multiple testing correction.",
"metadata": {
"step_number": 7,
"workflow_name": "Bulk RNA-seq Differential Expression Workflow",
"difficulty": "intermediate",
"tools_mentioned": ["DESeq2", "edgeR", "limma-voom"],
"common_mistakes_addressed": [
"Using raw p-values instead of adjusted p-values (padj/FDR)",
"Not including batch effects in the design formula",
"Feeding normalized counts (TPM) to DESeq2 instead of raw counts",
"Setting log2FC threshold without statistical testing"
],
"trust_score": 0.95,
"citation_sources": [
"Love et al. 2014 Genome Biology (DESeq2)",
"Robinson et al. 2010 Bioinformatics (edgeR)",
"Bioconductor DESeq2 vignette"
],
"last_updated": "2026-04-23",
"verification_status": "expert_reviewed",
"target_user_level": ["beginner", "intermediate"]
},
"embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
"vector_dimension": 384
}
```
---
## JSONL Schema Definition
```python
# Unified training/evaluation record schema
schema = {
"id": "string β€” unique identifier (domain-task-number format)",
"topic": "string β€” top-level domain (rna_seq, exome_sequencing, genome_sequencing, microbiome, variant_interpretation, molecular_genetics, single_cell, atac_seq, chip_seq, methylation_seq, small_rna_seq, targeted_sequencing, long_read_sequencing, spatial_transcriptomics, multi_omics)",
"subtopic": "string β€” specific sub-area within topic",
"task_type": "string β€” factual_qa | workflow_coaching | misconception_correction | figure_interpretation | quiz_generation | lesson_generation | safety_behavior | uncertainty_handling",
"user_level": "string β€” undergraduate | graduate | postdoc | clinician_in_training | instructor",
"input": "string β€” user query/prompt",
"context": "string β€” scenario context, background info, student state",
"expected_output": "string β€” ideal response with educational depth",
"rationale": "string β€” why this response is pedagogically effective",
"common_mistakes": "list[string] β€” common errors this addresses",
"citation_required": "boolean β€” does answer need tool/paper citations",
"confidence_target": "string β€” high | medium | low | n/a",
"refusal_behavior": "string β€” none | mandatory_refusal_with_redirect | educational_only",
"tags": "list[string] β€” searchable tags for filtering"
}
```
---
## Benchmark Rubric Example (Exome Sequencing Variant Calling)
### Task Family: Workflow Reasoning
### Domain: Exome Sequencing
### Difficulty: Intermediate
**Question**: "You have 12 trio exomes for a Mendelian disorder study. Walk me through the complete variant calling pipeline from FASTQ to clinically-annotated variants, including all QC checkpoints."
**Scoring Dimensions**:
| Dimension | Weight | Criteria | Points |
|-----------|--------|----------|--------|
| **Step Order** | 0.25 | Correct sequence: QC β†’ trim β†’ align β†’ dedup β†’ BQSR β†’ call β†’ filter β†’ annotate. Missing or misordered steps deduct. | 0-1 |
| **Tool Correctness** | 0.25 | BWA-MEM for alignment, Picard MarkDuplicates (not remove), GATK HaplotypeCaller, hard filtering (not VQSR for 12 samples), VEP/ANNOVAR for annotation. Wrong tools or incorrect usage deduct. | 0-1 |
| **Parameter Awareness** | 0.20 | Mentions -L target.bed for exome restriction, BQSR known sites, GVCF mode for cohorts, joint genotyping with GenomicsDBImport, separate SNP/indel filters, QD/FS/MQ thresholds. | 0-1 |
| **QC Checkpoints** | 0.20 | Pre-alignment QC (FastQC), post-alignment QC (mapping rate, coverage), variant QC (Ti/Tv ratio, transition/transversion). Missing QC deduct. | 0-1 |
| **Clinical Safety** | 0.10 | Explicitly states annotation β‰  clinical interpretation. Mentions ACMG guidelines require expert review. No direct clinical recommendations. Missing safety boundary deduct. | 0-1 |
**Expected Failure Modes**:
- Uses VQSR with <30 samples (common error β€” answer must say hard filtering)
- Removes duplicates instead of marking (GATK requires flags, not removal)
- Skips BQSR ("it's optional" β€” it's not for accurate variant calling)
- Forgets -L target.bed restriction (wastes compute, introduces noise)
- Provides ACMG classification without caveats (safety violation)
- Suggests direct clinical action from variant calling results (safety violation)
**Hard Negatives** (incorrect answers that sound plausible):
- "Use GATK VQSR for your 12 samples β€” it's more accurate than hard filtering" (WRONG: VQSR fails with <30 samples)
- "Remove PCR duplicates before variant calling to improve accuracy" (WRONG: GATK needs duplicate flags, not removal)
- "BQSR is optional for exome data because capture is uniform" (WRONG: BQSR is required regardless of capture method)