Omni-DNA ICI-DC V2 (128K Synthetic Pre-training)

⚠️ WARNING: This model underwent ICI-DC pre-training only. It was never subjected to SAD (Sequential Attenuation Denoising) fine-tuning. It does NOT have a SAD coefficient.

Training Details

  • Base model: zehui127/Omni-DNA-20M (20M parameter decoder-only causal LM)
  • Training method: ICI-DC (Interleaved Codon Insertion) pre-training on synthetic mutation pairs
  • Synthetic data: 129,948 pairs generated via ICI-DC gap-fill from 4,998 weighted parent sequences across 8 taxonomic domains
  • Epochs: 5
  • Learning rate: 2e-5 (cosine schedule)
  • Batch size: 16 × 2 gradient accumulation steps (effective batch size 32)
  • Best eval loss: 1.4072 at epoch 4.92
  • No overfitting observed: eval loss was still improving at the final epoch
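
The batch size and learning-rate settings above can be sketched in a few lines. This is an illustration only, not the training script: it assumes "16 × 2" means a micro-batch of 16 with 2 accumulation steps, that the cosine schedule decays from the peak rate to zero with no warmup, and that all 129,948 pairs are used for training (the card does not state a train/eval split). All function and constant names are hypothetical.

```python
import math

MICRO_BATCH = 16        # from the card
GRAD_ACCUM_STEPS = 2    # from the card
PEAK_LR = 2e-5          # from the card
EPOCHS = 5              # from the card
NUM_PAIRS = 129_948     # assumption: every synthetic pair is a training example


def effective_batch_size(micro_batch=MICRO_BATCH, accum=GRAD_ACCUM_STEPS):
    """Examples consumed per optimizer update."""
    return micro_batch * accum


def total_optimizer_steps(num_examples=NUM_PAIRS, epochs=EPOCHS):
    """Optimizer updates over the whole run, rounding each epoch up."""
    steps_per_epoch = math.ceil(num_examples / effective_batch_size())
    return steps_per_epoch * epochs


def cosine_lr(step, total_steps, peak_lr=PEAK_LR, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr (no warmup is assumed here)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = total_optimizer_steps()
    print("effective batch size:", effective_batch_size())
    print("total optimizer steps:", total)
    print("lr halfway through:", cosine_lr(total // 2, total))
```

Under these assumptions the learning rate is already close to zero around epoch 4.92, where the best eval loss was recorded.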

Benchmark Results

Metric                                  Value
Detection F1                            0.677
Embedding AUC                           0.780
Masked Surprise                         −0.597
Discriminative AUC (mutate: prefix)     0.756
Discriminative AUC (raw DNA prefix)     0.625
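
For readers unfamiliar with the AUC rows, here is a minimal sketch of how a ROC AUC can be computed from per-sequence scores. It assumes the reported values are standard ROC AUC over binary labels (mutated vs. unmutated); the card does not describe the benchmark code, and the function name and toy inputs below are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outscores a
    random negative, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: 1 = mutated, 0 = unmutated; scores are made up for illustration.
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which 0.625 and 0.756 should be read.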

Important Notes

  • The base Omni-DNA-20M model achieves 0.951 AUC on discriminative mutation detection with raw DNA prompts WITHOUT any fine-tuning. Fine-tuning on ICI-DC synthetic data degraded this capability: with the same raw DNA prefix, this model scores 0.625 (see Benchmark Results).
  • The ICI-DC synthetic data has a transition/transversion (Ti/Tv) ratio of ~0.50, the value expected when substitutions are drawn uniformly at random, compared with ~0.77 for real HGT mutations. The synthetic mutations therefore do not reflect biological mutation patterns.
  • See Prompt Ablation Results for details on how the mutate: instruction prefix affects benchmark scores.
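
The Ti/Tv comparison above can be checked with a short script. This is a sketch under stated assumptions, not the pipeline used to produce the numbers: it assumes parent and mutant sequences are already aligned and of equal length, counts substitutions only (insertions and deletions are ignored), and uses hypothetical function names. It also shows why ~0.50 is the random baseline: of the 12 possible single-base substitutions, 4 are transitions and 8 are transversions.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Transitions stay within a class: purine<->purine or pyrimidine<->pyrimidine."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ti_tv_ratio(pairs):
    """Ti/Tv over aligned, equal-length (parent, mutant) sequence pairs."""
    ti = tv = 0
    for parent, mutant in pairs:
        for ref, alt in zip(parent.upper(), mutant.upper()):
            # Skip matches and non-ACGT symbols such as N or gap characters.
            if ref == alt or ref not in "ACGT" or alt not in "ACGT":
                continue
            if is_transition(ref, alt):
                ti += 1
            else:
                tv += 1
    if tv == 0:
        raise ValueError("no transversions found; ratio is undefined")
    return ti / tv


def uniform_random_ti_tv():
    """Expected Ti/Tv when all 12 base substitutions are equally likely."""
    subs = list(permutations("ACGT", 2))
    ti = sum(is_transition(a, b) for a, b in subs)
    return ti / (len(subs) - ti)


if __name__ == "__main__":
    print(uniform_random_ti_tv())
    print(ti_tv_ratio([("ACGT", "GCGA")]))
```

Running `ti_tv_ratio` over the synthetic pairs and over a set of real mutations would reproduce the kind of comparison described in the note, provided the sequences are aligned first.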