# Omni-DNA ICI-DC V2 (128K Synthetic Pre-training)
⚠️ WARNING: This model underwent ICI-DC pre-training only. It was never subjected to SAD (Sequential Attenuation Denoising) fine-tuning. It does NOT have a SAD coefficient.
## Training Details
- Base model: zehui127/Omni-DNA-20M (20M parameter decoder-only causal LM)
- Training method: ICI-DC (Interleaved Codon Insertion) pre-training on synthetic mutation pairs
- Synthetic data: 129,948 pairs generated via ICI-DC gap-fill from 4,998 weighted parent sequences across 8 taxonomic domains
- Epochs: 5
- Learning rate: 2e-5 (cosine schedule)
- Batch size: 16 × 2 (gradient accumulation)
- Best eval loss: 1.4072 at epoch 4.92
- No overfitting observed; eval loss was still improving at the final epoch
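For reference, the hyperparameters above imply an effective batch size of per-device batch × gradient-accumulation steps. A minimal sketch (the `config` dict and variable names are illustrative, not part of the training code):

```python
# Training hyperparameters as listed in this card.
config = {
    "learning_rate": 2e-5,
    "lr_schedule": "cosine",
    "epochs": 5,
    "per_device_batch_size": 16,
    "gradient_accumulation_steps": 2,
}

# Effective batch size = per-device batch * accumulation steps.
effective_batch = (
    config["per_device_batch_size"] * config["gradient_accumulation_steps"]
)
print(effective_batch)  # -> 32
```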
## Benchmark Results
| Metric | Value |
|---|---|
| Detection F1 | 0.677 |
| Embedding AUC | 0.780 |
| Masked Surprise | −0.597 |
| Discriminative AUC (mutate: prefix) | 0.756 |
| Discriminative AUC (raw DNA prefix) | 0.625 |
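The AUC figures above can be reproduced from per-sequence scores with a rank-based (Mann–Whitney) computation; the scores below are illustrative placeholders, not the model's actual outputs:

```python
def roc_auc(pos_scores, neg_scores):
    """Rank-based AUC: probability that a positive example outscores a
    negative one, counting ties as half a win (Mann-Whitney U / (n_p * n_n))."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Illustrative scores (e.g. per-sequence log-likelihood differences).
pos = [0.9, 0.8, 0.7, 0.6]
neg = [0.65, 0.5, 0.4, 0.3]
print(roc_auc(pos, neg))  # -> 0.9375
```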
## Important Notes
- The base Omni-DNA-20M model achieves 0.951 AUC on discriminative mutation detection with raw DNA prompts WITHOUT any fine-tuning. Fine-tuning on ICI-DC synthetic data degraded this capability.
- The ICI-DC synthetic data has Ti/Tv ratio ~0.50 (random baseline), compared to real HGT mutations at Ti/Tv ~0.77. The synthetic mutations do not reflect biological mutation patterns.
- See the Prompt Ablation Results for details on how the `mutate:` instruction prefix affects benchmark scores.
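The Ti/Tv ratio cited above counts transitions (purine↔purine, A↔G, or pyrimidine↔pyrimidine, C↔T) per transversion (any purine↔pyrimidine change). A minimal sketch of the computation; the `(ref, alt)` pair format is an assumption for illustration, not the ICI-DC pipeline's actual data format:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def ti_tv_ratio(substitutions):
    """substitutions: iterable of (ref_base, alt_base) pairs."""
    ti = tv = 0
    for ref, alt in substitutions:
        if ref == alt:
            continue  # not a substitution
        same_class = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
        ti += same_class      # A<->G or C<->T
        tv += not same_class  # purine <-> pyrimidine
    return ti / tv if tv else float("inf")

# Uniform random substitutions give Ti/Tv ~= 0.5: each base has one
# transition partner but two transversion partners.
subs = [("A", "G"), ("C", "T"), ("A", "C"), ("A", "T"), ("G", "C"), ("G", "T")]
print(ti_tv_ratio(subs))  # -> 0.5
```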