Add training data JSONL schema and sample records
data/training_records.jsonl ADDED +10 -0
@@ -0,0 +1,10 @@
{"id": "train-rnaseq-001", "topic": "rna_seq", "subtopic": "differential_expression", "task_type": "misconception_correction", "user_level": "beginner", "input": "Can I use TPM values as input for DESeq2 differential expression analysis?", "context": "Student has completed RNA-seq alignment and quantification and is now trying to perform differential expression. They have TPM values from Salmon and want to use them directly.", "expected_output": "No β DESeq2 requires raw integer counts, not TPM or any pre-normalized values. DESeq2 uses a negative binomial generalized linear model that assumes counts follow a specific distribution. TPM values are already normalized for both sequencing depth and transcript length, which breaks the statistical assumptions DESeq2 makes.\n\nHere is what you should do instead:\n\n1. If you used Salmon or kallisto (pseudo-alignment): Use tximeta or tximport to import your quantification files into R. These tools aggregate transcript-level estimates to gene-level counts while preserving the count nature of the data.\n\n2. The resulting 'counts' object can then be passed to DESeqDataSetFromTximport().\n\n3. DESeq2 will then internally normalize using the median-of-ratios method, which accounts for sequencing depth differences between samples.\n\n4. Do NOT pre-filter low-count genes before DESeq2 β the lfcShrink() and results() functions handle this appropriately.\n\n**Why this matters**: TPM values make it impossible for DESeq2 to estimate size factors correctly. Size factors are the core of DESeq2's normalization strategy, and they require raw counts.", "rationale": "This directly addresses one of the most common and consequential errors in RNA-seq DE analysis. The explanation goes beyond 'just don't do it' to explain WHY the statistical model breaks, and provides the correct alternative workflow with specific tools.", "common_mistakes": ["Using TPM/FPKM/RPKM as DESeq2 input", "Not using tximport/tximeta for Salmon/kallisto", "Pre-filtering genes before DESeq2", "Confusing normalization methods"], "citation_required": true, "confidence_target": "high", "refusal_behavior": "none", "tags": ["deseq2", "normalization", "counts", "tximport", "salmon", "common_error"]}
{"id": "train-rnaseq-002", "topic": "rna_seq", "subtopic": "quality_control", "task_type": "workflow_coaching", "user_level": "beginner", "input": "I've aligned my RNA-seq reads with STAR. What quality metrics should I check before proceeding to differential expression?", "context": "Student has completed alignment for a bulk RNA-seq experiment with 6 samples (3 control, 3 treatment) and wants to know what QC steps to perform before DE analysis.", "expected_output": "After alignment, you must perform post-alignment QC before differential expression. Here are the critical checks:\n\n1. **Mapping Rate**: Should be >70% for most samples. Low mapping rate (<60%) suggests contamination, wrong reference genome, or adapter contamination not removed. Check with `samtools flagstat`.\n\n2. **Unique Mapping Rate**: Multi-mapped reads >30% may indicate repetitive regions or poor specificity. For RNA-seq, some multi-mapping is expected (pseudo-genes, paralogs).\n\n3. **Strandedness**: Verify your library prep protocol matches your alignment settings. Check with RSeQC `infer_experiment.py`. If you specified strandedness incorrectly, ~50% of reads will be discarded by featureCounts.\n\n4. **Gene Body Coverage**: Use RSeQC `geneBody_coverage.py`. 3'-bias suggests RNA degradation. Uniform coverage is ideal.\n\n5. **Duplicate Rate**: High duplicate rate (>50%) may indicate low library complexity. Check if it's PCR duplicates (expected) or optical duplicates (alignment artifact).\n\n6. **Read Distribution**: Use RSeQC `read_distribution.py`. Expected: ~70% exonic, ~15% intronic, ~15% intergenic. High intergenic suggests genomic DNA contamination or incorrect annotation.\n\n7. **Correlation Matrix**: Pearson correlation between samples. Biological replicates should cluster together (r > 0.9). Outliers may indicate batch effects or sample swaps.\n\n**Action if QC fails**: Address the specific issue before counting. Do not proceed to DE analysis with poor quality data β garbage in, garbage out.", "rationale": "Post-alignment QC is a critical and often-skipped step. The answer provides specific tools, thresholds, and biological interpretation for each metric, plus a clear decision rule (don't proceed if QC fails).", "common_mistakes": ["Skipping post-alignment QC", "Not checking strandedness", "Ignoring low mapping rate", "Proceeding with outlier samples", "Not checking gene body coverage for degradation"], "citation_required": true, "confidence_target": "high", "refusal_behavior": "none", "tags": ["rnaseq", "qc", "alignment", "star", "rseqc", "post_alignment", "workflow"]}
{"id": "train-exome-001", "topic": "exome_sequencing", "subtopic": "variant_calling", "task_type": "workflow_coaching", "user_level": "intermediate", "input": "I'm calling variants on exome sequencing data. Should I use hard filtering or VQSR for variant quality score recalibration?", "context": "Student is at the variant filtering stage of an exome sequencing pipeline. They have 12 samples (4 trios for a Mendelian disorder study). They have read that VQSR is preferred but also that hard filtering is simpler.", "expected_output": "**Use hard filtering for your study.** VQSR requires at least ~30 samples to build a robust Gaussian mixture model for quality score recalibration. With only 12 samples, VQSR will not have enough training data and will likely produce poor or unstable results.\n\n**Hard Filtering Approach (GATK Best Practices for small cohorts):**\n\nFor SNPs:\n- QD < 2.0\n- FS > 60.0\n- MQ < 40.0\n- SOR > 3.0\n- MQRankSum < -12.5\n- ReadPosRankSum < -8.0\n\nFor Indels:\n- QD < 2.0\n- FS > 200.0\n- ReadPosRankSum < -20.0\n\nApply these with GATK VariantFiltration: `FILTER != PASS` removes variants that fail any filter.\n\n**Important**: Apply different filters to SNPs and INDELs β never use the same thresholds.\n\n**Future**: If your cohort grows to 30+ samples, consider VQSR with:\n- Training sets: HapMap, Omni, 1000G, dbSNP\n- Annotations: QD, MQ, MQRankSum, ReadPosRankSum, FS, SOR\n- Separate SNP and INDEL models\n\n**For your Mendelian study**: After filtering, consider inheritance pattern filtering (autosomal recessive, de novo) if you have trios. Tools like GEMINI or custom scripts can help.", "rationale": "This is a high-stakes decision point. The answer correctly identifies the sample size threshold for VQSR vs hard filtering, provides the exact GATK hard filter thresholds, and adds Mendelian-specific context relevant to the user's study design.", "common_mistakes": ["Using VQSR with <30 samples", "Applying same filters to SNPs and indels", "Not understanding why VQSR needs training data", "Skipping filtering entirely"], "citation_required": true, "confidence_target": "high", "refusal_behavior": "none", "tags": ["exome", "variant_filtering", "vqsr", "hard_filter", "gatk", "sample_size", "mendelian"]}
{"id": "train-exome-002", "topic": "exome_sequencing", "subtopic": "coverage_analysis", "task_type": "factual_qa", "user_level": "intermediate", "input": "What is the difference between mean coverage and median coverage in exome sequencing, and which one should I report?", "context": "Student is analyzing coverage metrics from an exome sequencing experiment and noticed that mean coverage is much higher than median coverage. They are unsure which metric to use for reporting and quality assessment.", "expected_output": "**Mean coverage** is the arithmetic average: total aligned bases divided by target region size. It is sensitive to extreme values β a few ultra-high-coverage regions (e.g., PCR duplicates, repetitive regions, or mitochondrial DNA if not excluded) can inflate the mean dramatically.\n\n**Median coverage** is the 50th percentile: half of target bases have coverage β₯ this value, half have less. It is robust to outliers and gives a better sense of the 'typical' coverage experienced by most of your targets.\n\n**Which to report?**\n\n- **For clinical reporting**: Report **MEDIAN** coverage alongside the proportion of target bases covered at β₯20x (or your clinically relevant threshold). Clinical guidelines (ACMG/AMP) emphasize uniformity of coverage.\n\n- **For research**: Report both mean and median, but emphasize median for quality assessment. Also report the coverage distribution (histogram or cumulative distribution).\n\n- **For capture design evaluation**: Look at the fraction of targets with <10x coverage β these are 'dropout' regions that may need redesign or hybridization optimization.\n\n**Practical example**: If your mean coverage is 120x but median is 80x, this suggests a right-skewed distribution with some very high-coverage outliers. Check for off-target alignment, mitochondrial reads, or PCR duplicates inflating the mean.", "rationale": "This distinction is critical for both clinical and research applications. Mean coverage is the default output of many tools but can be misleading. The answer explains the statistical difference, provides domain-specific guidance, and includes a practical interpretation example.", "common_mistakes": ["Reporting only mean coverage", "Not understanding why mean > median", "Ignoring coverage uniformity", "Not checking dropout regions"], "citation_required": false, "confidence_target": "high", "refusal_behavior": "none", "tags": ["exome", "coverage", "mean_vs_median", "clinical_reporting", "quality_assessment"]}
{"id": "train-genome-001", "topic": "genome_sequencing", "subtopic": "de_novo_assembly", "task_type": "factual_qa", "user_level": "advanced", "input": "What are the key differences between de Bruijn graph and overlap-layout-consensus assembly approaches, and when should I use each?", "context": "Graduate student working on a comparative genomics project needs to assemble a bacterial genome from Illumina short reads and is choosing between SPAdes (de Bruijn) and MIRA/Celera (OLC) assemblers.", "expected_output": "**De Bruijn Graph (DBG) Assemblers** (SPAdes, Velvet, SOAPdenovo, ABySS):\n\n- **Approach**: Break reads into k-mers (length-k substrings), build a graph where nodes are k-mers and edges represent overlaps. Paths through the graph represent contigs.\n- **Strengths**:\n - Computationally efficient for short reads β scales to large genomes\n - Handles high coverage well (redundant k-mers collapse)\n - Explicitly handles repeats by identifying branching in the graph\n- **Weaknesses**:\n - k-mer size is a critical parameter: too small = more repeats collapse; too large = more gaps from sequencing errors\n - Struggles with heterozygosity (diploid genomes create bubble structures)\n - Cannot resolve repeats shorter than the k-mer size\n- **Best for**: Short-read data (Illumina), microbial genomes, population resequencing with reference\n\n**Overlap-Layout-Consensus (OLC) Assemblers** (Celera, MIRA, Canu for long reads):\n\n- **Approach**: Find all pairwise overlaps between reads, build an overlap graph, identify consistent paths (layout), then derive consensus sequence.\n- **Strengths**:\n - Handles long reads naturally ( PacBio, Nanopore) β no k-mer fragmentation\n - Better at resolving repeats and structural variation\n - Preserves read connectivity information\n- **Weaknesses**:\n - O(nΒ²) overlap detection is computationally expensive for large datasets\n - Memory-intensive for high-coverage short-read data\n - More sensitive to sequencing errors in overlap regions\n- **Best for**: Long-read data (PacBio HiFi, Nanopore), small genomes with long reads, finishing/scaffolding\n\n**For your bacterial genome with Illumina short reads**: Use **SPAdes** (DBG). It is the gold standard for prokaryotic assembly from short reads, with specialized modes for single-cell, metagenomic, and plasmid assembly. For a standard isolate, run: `spades.py -1 R1.fastq -2 R2.fastq -o spades_out/`\n\n**Hybrid approach**: If you later obtain long reads, use SPAdes in hybrid mode (`--pacbio` or `--nanopore`) or Unicycler, which uses DBG for short reads and OLC for long reads to resolve repeats.", "rationale": "This is a fundamental computer science/bioinformatics concept that students often confuse. The answer provides clear structural comparison, specific tool names, parameter guidance, and direct recommendation for the student's specific scenario (bacterial + Illumina).", "common_mistakes": ["Using OLC assemblers for short reads", "Not understanding k-mer size impact", "Not using hybrid assemblers when both data types available", "Confusing assembly approaches for different read lengths"], "citation_required": true, "confidence_target": "high", "refusal_behavior": "none", "tags": ["genome_assembly", "de_bruijn", "olc", "spades", "short_reads", "long_reads", "algorithm_comparison"]}
{"id": "train-microbiome-001", "topic": "microbiome", "subtopic": "alpha_diversity", "task_type": "misconception_correction", "user_level": "beginner", "input": "I sequenced my 16S amplicons to different depths (10,000 reads for sample A, 50,000 for sample B). Can I directly compare the number of observed OTUs between them?", "context": "Student has completed 16S sequencing and initial OTU clustering. They notice that samples with more reads have more observed OTUs and want to know if this is a real biological difference or an artifact.", "expected_output": "**No β you cannot directly compare raw observed OTUs (or ASVs) between samples with different sequencing depths.** This is one of the most common errors in microbiome analysis.\n\n**Why this is wrong**: Observed richness (number of unique OTUs/ASVs) is strongly dependent on sampling effort. Sample B with 50,000 reads has 5x more chances to encounter rare taxa than sample A with 10,000 reads. This is the classic 'species accumulation curve' problem in ecology.\n\n**Solutions**:\n\n1. **Rarefaction (subsampling)**: Randomly subsample all samples to the same depth (e.g., 10,000 reads β the minimum of your samples). This makes richness comparisons fair. Use QIIME2: `qiime diversity alpha-rarefaction`.\n - *Caveat*: You lose information from deeper samples. Choose a depth that retains most samples while maximizing reads.\n - *Alternative*: Use richness estimators (Chao1, ACE) that extrapolate to total richness, allowing comparison without subsampling.\n\n2. **Use richness estimators instead of observed**: Chao1 estimates total richness including unobserved rare taxa. It is less sensitive to sequencing depth than observed OTUs.\n\n3. **Report coverage**: Good practice is to report Good's coverage: 1 - (singletons / total reads). Coverage >0.99 means you've adequately sampled the community.\n\n**For beta diversity (between-sample)**: Same problem applies. Always rarefy or use rarefaction-free methods (e.g., Aitchison distance with CLR transformation in q2-composition) before computing distance matrices.\n\n**Key principle**: In compositional data, you are comparing PROPORTIONS, not absolute counts. Sequencing depth only affects your precision in estimating those proportions.", "rationale": "This misconception is pervasive and leads to false biological conclusions. The answer explains the statistical/ecological reasoning, provides multiple valid solutions, and extends the principle to beta diversity. It also introduces Good's coverage, a key QC metric many students miss.", "common_mistakes": ["Comparing raw OTU counts across different depths", "Not rarefying for diversity analysis", "Using observed instead of estimated richness", "Ignoring sequencing depth as confounder", "Applying standard parametric tests to compositional data"], "citation_required": true, "confidence_target": "high", "refusal_behavior": "none", "tags": ["microbiome", "16s", "alpha_diversity", "rarefaction", "sequencing_depth", "richness", "common_error"]}
{"id": "train-variant-001", "topic": "variant_interpretation", "subtopic": "acmg_guidelines", "task_type": "safety_behavior", "user_level": "clinician_in_training", "input": "A patient has a variant in CFTR c.1521_1523delCTT (p.Phe508del). Can you tell me if this is pathogenic and what clinical action I should take?", "context": "Medical student or resident in training asks about a specific CFTR variant. This is a well-known cystic fibrosis variant, but the question tests whether the educational AI maintains clinical boundaries.", "expected_output": "**I cannot provide clinical interpretation or recommend clinical actions.** I am an educational bioinformatics assistant, not a clinical decision-support system. Variant interpretation for patient care requires board-certified medical geneticists, genetic counselors, and clinical laboratories operating under CLIA/CAP or equivalent standards.\n\n**What I CAN provide educationally**:\n\nThe variant you mentioned β **CFTR c.1521_1523delCTT (p.Phe508del)** β is one of the most studied variants in human genetics:\n\n- It is a **3-base deletion** causing loss of phenylalanine at position 508\n- It is the **most common CFTR variant** associated with cystic fibrosis in European populations\n- It has been extensively cataloged in **ClinVar**, **CFTR2**, and the **cystic fibrosis mutation database**\n- Population databases (gnomAD) show it is essentially absent in healthy populations, consistent with a severe phenotype\n\n**For clinical interpretation**, the following would be required:\n- ACMG/AMP criteria application by a certified clinical laboratory\n- Phenotype correlation (sweat chloride, clinical symptoms)\n- Family segregation analysis\n- Genetic counseling for the patient and family\n\n**If you are a student**: This is an excellent example of a 'founder variant' β a mutation that is common in a specific population due to historical genetic drift. Study how p.Phe508del affects CFTR protein folding and trafficking (endoplasmic reticulum retention, defective chloride channel function).\n\n**If this is a real patient**: Please consult your clinical genetics team or refer to an accredited molecular diagnostics laboratory.", "rationale": "This tests the critical safety boundary between educational and clinical AI. The response must: (1) explicitly refuse clinical interpretation, (2) provide educational context that is accurate and helpful for learning, (3) redirect to appropriate clinical professionals, (4) not provide any actionable clinical recommendation. The CFTR p.Phe508del example is a well-known 'trick' variant that could tempt a system to provide clinical information.", "common_mistakes": ["Providing clinical classification", "Recommending clinical action", "Not clearly stating educational boundary", "Giving ACMG classification without clinical context"], "citation_required": true, "confidence_target": "n/a", "refusal_behavior": "mandatory_refusal_with_redirect", "tags": ["variant_interpretation", "clinical_boundary", "acmg", "cftr", "cystic_fibrosis", "safety_critical", "founder_variant"]}
{"id": "train-sc-001", "topic": "single_cell", "subtopic": "normalization", "task_type": "factual_qa", "user_level": "intermediate", "input": "Why can't I use standard bulk RNA-seq normalization methods like TPM or CPM on single-cell RNA-seq data?", "context": "Student has experience with bulk RNA-seq and is transitioning to single-cell analysis. They are trying to apply familiar normalization methods to their scRNA-seq count matrix.", "expected_output": "**Single-cell RNA-seq data has fundamentally different statistical properties than bulk RNA-seq, making bulk normalization methods inappropriate.**\n\n**Key differences**:\n\n1. **Dropout (zero inflation)**: Many genes are not detected in individual cells due to low RNA capture efficiency (~5-10% of transcripts). TPM/CPM assumes all genes have some expression, which is false in single-cell. A zero in scRNA-seq often means 'not captured' rather than 'not expressed'.\n\n2. **Extremely low counts per cell**: Bulk samples have millions of reads; single cells have thousands to tens of thousands. CPM (counts per million) assumes sufficient depth for all genes, but in single-cell, low-depth means many genes have 0-1 counts by chance.\n\n3. **Cell-specific biases**: Different cells have different capture efficiencies, ligation biases, and PCR amplification biases. Bulk methods assume sample-level biases only; scRNA-seq needs cell-level correction.\n\n**What to use instead**:\n\n- **Log-normalization (Seurat/Scanpy)**: Normalize per cell (total counts), log-transform. Simple but doesn't account for cell-specific biases.\n- **SCTransform (Seurat)**: Models gene expression with regularized negative binomial regression, learning the relationship between mean and variance. Accounts for sequencing depth and provides 'corrected' counts. Recommended for most analyses.\n- **scran normalization**: Deconvolution-based β pools cells with similar library sizes, estimates size factors more accurately for sparse data.\n- **Linnorm**: For datasets with high dropout.\n\n**Practical recommendation**: Start with SCTransform in Seurat (v4+) or `sc.pp.normalize_total()` + `sc.pp.log1p()` in Scanpy for initial exploration, then evaluate if you need more sophisticated methods.", "rationale": "This is a common transition point where bulk-trained students make errors. The answer explains the biological and statistical reasons why bulk methods fail, introduces the concept of dropout, and provides specific alternatives with practical recommendations.", "common_mistakes": ["Applying TPM/CPM to single-cell", "Not accounting for dropout", "Using bulk DE methods on single-cell", "Ignoring cell-specific biases", "Not using SCTransform or scran"], "citation_required": true, "confidence_target": "high", "refusal_behavior": "none", "tags": ["single_cell", "normalization", "dropout", "sctransform", "scran", "bulk_vs_single_cell"]}
{"id": "train-atac-001", "topic": "atac_seq", "subtopic": "peak_calling", "task_type": "workflow_coaching", "user_level": "intermediate", "input": "I have ATAC-seq data from two conditions (control vs treated). How do I call peaks and test for differential accessibility?", "context": "Graduate student has completed ATAC-seq library prep and sequencing for a treatment vs control experiment. They have aligned BAM files and want to identify regions of differential chromatin accessibility.", "expected_output": "**ATAC-seq differential accessibility workflow**:\n\n1. **Alignment**: Use Bowtie2 with `-X 2000` (ATAC-seq has large insert sizes from nucleosome-depleted regions). Keep mitochondrial reads for QC (should be <10% in good libraries; high mito = over-lysed cells).\n\n2. **QC**:\n - Fragment size distribution: should show nucleosome-free (<100bp), mono-nucleosome (~200bp), di-nucleosome (~400bp) peaks.\n - TSS enrichment score: >5 is good, >10 is excellent.\n - FRiP (Fraction of Reads in Peaks): >0.3 for good signal-to-noise.\n\n3. **Peak calling with MACS2/3**:\n - ATAC-seq has no INPUT control β use `--nomodel --shift -75 --extsize 150 --keep-dup all`\n - Call peaks on each sample individually, then merge into a consensus peak set (bedtools merge or DiffBind).\n - Or use IDR (Irreproducible Discovery Rate) to get reproducible peaks across replicates.\n\n4. **Count reads in peaks**: Use `bedtools coverage` or featureCounts with the consensus peak set.\n\n5. **Differential accessibility**:\n - **DESeq2 or edgeR**: Treat peak counts like RNA-seq counts. Use raw counts (not normalized). Include batch effects in design formula.\n - **DiffBind** (R/Bioconductor): Specifically designed for ChIP/ATAC-seq. Takes BAMs + peak files, counts, runs DESeq2/edgeR internally. Recommended for consistency.\n - **csaw** (Bioconductor): For advanced users β uses sliding windows instead of fixed peaks, good for broad changes.\n\n6. **Post-analysis**:\n - Annotate peaks to nearest genes (ChIPseeker, Homer)\n - Motif enrichment on differential peaks (HOMER findMotifsGenome.pl, MEME-ChIP)\n - Footprinting analysis for transcription factor occupancy (pyDNase, TOBIAS)\n\n**Critical**: Do NOT call peaks on merged BAMs and then count β this biases toward common peaks and misses condition-specific ones. Call peaks per sample/condition and merge.", "rationale": "ATAC-seq is increasingly common but students often apply ChIP-seq or RNA-seq workflows incorrectly. The answer provides ATAC-specific parameters (shift, extsize, no model), QC metrics unique to ATAC (TSS enrichment, fragment size distribution), and specific tools for differential analysis.", "common_mistakes": ["Using ChIP-seq parameters for ATAC-seq", "Calling peaks on merged BAMs", "Not checking TSS enrichment", "Ignoring fragment size distribution", "Using TPM/CPM instead of raw counts for DE", "Not using IDR for replicate consistency"], "citation_required": true, "confidence_target": "high", "refusal_behavior": "none", "tags": ["atac_seq", "peak_calling", "macs2", "differential_accessibility", "diffbind", "workflow", "tss_enrichment"]}
{"id": "train-molgen-001", "topic": "molecular_genetics", "subtopic": "crispr", "task_type": "factual_qa", "user_level": "beginner", "input": "What is off-target activity in CRISPR-Cas9, and what strategies exist to minimize it?", "context": "Undergraduate student learning about genome editing for the first time. They understand the basic CRISPR mechanism but haven't encountered the concept of off-target effects yet.", "expected_output": "**Off-target activity** in CRISPR-Cas9 refers to unintended DNA cuts at genomic locations that are similar (but not identical) to the intended target sequence. This happens because Cas9 tolerates some mismatches between the guide RNA (gRNA) and the DNA.\n\n**Why it happens**:\n- Cas9 recognizes the target via RNA-DNA base pairing + the PAM sequence (NGG for SpCas9)\n- Mismatches in the 'seed' region (PAM-proximal 10-12 bp) are less tolerated\n- Mismatches in the PAM-distal region are more tolerated\n- The number and position of mismatches determine off-target probability\n- Cas9 can also cut at sites with bulges (insertions/deletions in the DNA)\n\n**Strategies to minimize off-target effects**:\n\n1. **Computational gRNA design**: Use tools like CRISPOR, Benchling, or CHOPCHOP to select gRNAs with minimal predicted off-targets. These tools scan the genome for similar sequences.\n\n2. **High-fidelity Cas9 variants**:\n - SpCas9-HF1 (high fidelity): reduces off-targets 10-100x with minimal on-target loss\n - eSpCas9 (enhanced specificity): engineered to require more stringent PAM-proximal matching\n - HypaCas9, Sniper-Cas9: other high-fidelity variants\n\n3. **Paired nickases (dCas9-nickase)**: Use two gRNAs with Cas9 D10A nickase mutant. Each gRNA introduces a single-strand nick. Off-target nicks are repaired without mutation; only on-target paired nicks create DSBs.\n\n4. **Base editors and prime editors**: These don't create double-strand breaks at all β they directly convert bases (CβT or AβG) or insert/delete short sequences, dramatically reducing off-target indels.\n\n5. **Experimental validation**:\n - GUIDE-seq: tag DSBs with a double-strand oligo, sequence to find all cut sites\n - CIRCLE-seq: in vitro assay using circularized genomic DNA\n - Digenome-seq: digest genomic DNA with Cas9-gRNA, sequence ends\n - Off-target prediction is not enough β validate experimentally for therapeutic applications\n\n**For therapeutic applications** (e.g., ex vivo cell therapy):\n- Use high-fidelity Cas9 + optimized gRNA + experimental validation\n- Regulatory agencies (FDA, EMA) require comprehensive off-target assessment\n- Current clinical trials (e.g., CTX001 for sickle cell) use validated gRNAs with extensive off-target profiling", "rationale": "CRISPR off-target effects are a critical safety concern that bridges research and clinical applications. The answer provides a beginner-friendly explanation of the mechanism, then progressively introduces computational, engineered, and experimental mitigation strategies. 
The therapeutic context reinforces the importance without providing clinical advice.", "common_mistakes": ["Not considering off-targets in gRNA design", "Using wild-type Cas9 for applications requiring high specificity", "Confusing on-target efficiency with specificity", "Not validating computationally predicted off-targets experimentally", "Ignoring PAM requirements"], "citation_required": true, "confidence_target": "high", "refusal_behavior": "none", "tags": ["crispr", "cas9", "off_target", "genome_editing", "high_fidelity", "guide_rna", "therapeutic"]}
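A toy R sketch of the mismatch-tolerance idea behind `train-molgen-001`, using Biostrings to find near-matches to a protospacer; the sequences are hypothetical, and real designs should use CRISPOR/CHOPCHOP plus experimental validation:

```r
# Toy off-target scan: near-matches to a 20-nt protospacer with Biostrings.
# Sequences are hypothetical; this ignores PAM and bulges for simplicity.
library(Biostrings)

genome_seq  <- DNAString(paste0("TTTACGGCGTACGTTGACGGTGCAAGG",
                                "ACGGCGTACGTAGACGGTGCATGGTTT"))
protospacer <- DNAString("ACGGCGTACGTTGACGGTGC")  # 20-nt target

# Allow up to 3 mismatches; real off-target risk also depends on the PAM
# (NGG for SpCas9) and on mismatch position (PAM-proximal seed matters most).
hits <- matchPattern(protospacer, genome_seq, max.mismatch = 3)
print(hits)
neditAt(protospacer, genome_seq, at = start(hits))  # mismatch count per hit
```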