⚠️ IMPORTANT WARNING: Model Effectiveness
This model's reported SAD coefficient is unreliable. The ICI-DC (S1) and SAD (S2) training stages both underfit near-instantly. The SAD coefficient was calculated from training hyperparameters (synthetic_pairs × epochs / real_pairs × epochs × LR_factor), NOT from converged training. The model never reached a stable training state that would justify the coefficient value.
Subsequent testing revealed that the unmodified base Omni-DNA-20M model achieves 0.951 AUC on discriminative mutation detection with raw DNA prompts, significantly outperforming ALL fine-tuned variants including this one. Fine-tuning with ICI-DC synthetic data systematically degraded the base model's innate mutation discrimination capability.
This model is preserved for historical and reproducibility purposes only.
Omni-DNA SAD Checkpoint
Sequential Attenuation Denoising (SAD): the ICI-DC checkpoint fine-tuned on real mutation data.
Starting from the ICI-DC checkpoint, this model undergoes attenuation: further training on real mutation pairs at a 10x lower learning rate. Weights that don't contradict the real data persist from ICI-DC; contradicted patterns get corrected.
SAD Training Details
| Parameter | Value |
|---|---|
| Base model | Nhoodie/omni-dna-ici-dc (ICI-DC pre-trained) |
| Training data | 3,317 real mutation pairs |
| Epochs | 5 |
| Learning rate | 1e-5 (10x lower than ICI-DC) |
| SAD coefficient | 4.89 (81,120 synthetic exposures / 16,585 real exposures) |
| Batch size | 32 effective |
| Precision | fp32 |
Note: The SAD coefficient of 4.89 is considered too high. A coefficient of ~1.5 (16 real epochs) was also tested, with similar results. See the benchmarks below.
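The exposure ratio quoted in the table can be reproduced from the listed hyperparameters. The following is a minimal sketch, not the original training script: `exposure_ratio` is a hypothetical helper, and it computes only the plain ratio shown in the table (synthetic exposures / real exposures). The `LR_factor` term mentioned in the warning is left out because its value and placement are not specified here.

```python
def exposure_ratio(synthetic_exposures: int, real_pairs: int, real_epochs: int) -> float:
    """Ratio of synthetic to real training exposures (hypothetical helper)."""
    real_exposures = real_pairs * real_epochs
    return synthetic_exposures / real_exposures


# Values taken from the table: 81,120 synthetic exposures, 3,317 real pairs.
coeff_this = exposure_ratio(81_120, 3_317, 5)   # 5 real epochs (this checkpoint)
coeff_alt = exposure_ratio(81_120, 3_317, 16)   # 16 real epochs (the ~1.5 variant)

print(f"{coeff_this:.2f} {coeff_alt:.2f}")
```

Both results agree with the figures quoted above, which suggests the reported coefficients are pure exposure ratios.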
Multi-Axis Benchmark (100 test pairs, 4 models)
Axis 6: Discriminative (most important)
| Model | AUC | Best F1 | Score Gap |
|---|---|---|---|
| Base Omni (no fine-tune) | 0.588 | 0.688 | 25.6 |
| ICI-DC | 0.887 | 0.858 | 370.2 |
| SAD coeff=4.89 (this) | 0.904 | 0.862 | 411.4 |
| SAD coeff=1.5 | 0.908 | 0.862 | 413.7 |
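For readers less familiar with the AUC column, the sketch below shows one standard way to compute it from per-pair scores: the fraction of (real, decoy) comparisons in which the real pair scores higher, with ties counted as half. The scoring function used for this benchmark is not described here, so the scores and variable names below are invented purely for illustration.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up scores, for illustration only (not benchmark data).
real_pair_scores = [2.0, 1.5, 0.7, 0.2]
decoy_pair_scores = [1.0, 0.4, 0.1, -0.3]

auc = roc_auc(real_pair_scores, decoy_pair_scores)
print(auc)
```

An AUC of 0.5 corresponds to chance-level ranking, which is a useful reference point when reading the 0.588 base-model figure.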
Axis 5: Mutation Surprise
| Model | Surprise | p-value | Interpretation |
|---|---|---|---|
| Base Omni | +0.777 | <0.0001 | Expects parent (mutations are surprising) |
| ICI-DC | -0.216 | 0.0001 | Expects mutations everywhere |
| SAD coeff=4.89 | -0.203 | 0.0003 | Partially attenuated |
| SAD coeff=1.5 | -0.169 | 0.0026 | More attenuation |
Full 6-Axis Comparison
| Axis | Metric | Base | ICI-DC | SAD 4.89 | SAD 1.5 |
|---|---|---|---|---|---|
| A1: Detection | Recall | 0.511 | 0.518 | 0.517 | 0.497 |
| A2: Logits | Top-3 acc | 0.846 | 0.846 | 0.846 | 0.846 |
| A3: Ti/Tv | Predicted ratio | 0.20 | 0.41 | 0.42 | 0.43 |
| A4: Embeddings | Seq AUC | 0.41 | 0.35 | 0.40 | 0.40 |
| A5: Surprise | ΔLL | +0.78 | -0.22 | -0.20 | -0.17 |
| A6: Discrim. | AUC | 0.588 | 0.887 | 0.904 | 0.908 |
Interpretation
- ICI-DC provides the main training signal (AUC 0.59→0.89). Synthetic data builds a strong internal representation of valid mutation pairs.
- SAD fine-tunes that representation (AUC 0.887→0.904 for this checkpoint, 0.908 at coeff=1.5). The attenuation step does help, but the marginal gain over ICI-DC is modest.
- The model is a judge, not a generator: up to 0.908 discriminative AUC but ~30% generative recall. The representations encode mutation structure, but autoregressive decoding can't access it efficiently.
- Ti/Tv ratio converges from 0.20 (base) to 0.43 (SAD 1.5), approaching the biological value of 0.48.
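As background for the Ti/Tv axis, the sketch below uses the standard count-based definition: a transition swaps purine for purine (A↔G) or pyrimidine for pyrimidine (C↔T), and every other substitution is a transversion. How the benchmark's "predicted ratio" was derived from model outputs is not stated here, so this only illustrates the ratio on explicit sequence pairs. The function names and the second example pair are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def count_ti_tv(parent: str, mutant: str):
    """Count (transitions, transversions) between two equal-length DNA strings."""
    if len(parent) != len(mutant):
        raise ValueError("sequences must have equal length")
    ti = tv = 0
    for a, b in zip(parent.upper(), mutant.upper()):
        if a == b or a not in BASES or b not in BASES:
            continue  # skip matches and ambiguous symbols such as N
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ti += 1
        else:
            tv += 1
    return ti, tv


def ti_tv_ratio(pairs):
    """Pooled Ti/Tv ratio over many pairs; None if there are no transversions."""
    ti_total = tv_total = 0
    for parent, mutant in pairs:
        ti, tv = count_ti_tv(parent, mutant)
        ti_total += ti
        tv_total += tv
    return ti_total / tv_total if tv_total else None


example_pairs = [
    ("ATGGCTAGCTGA", "ATAGCTGGCTAA"),  # pair from the Usage prompt
    ("ACGT", "TCGA"),                  # made-up pair with two transversions
]
print(ti_tv_ratio(example_pairs))
```

Pooling counts across pairs before dividing avoids undefined per-pair ratios when a pair contains no transversions.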
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Nhoodie/omni-dna-sad-mutation", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Nhoodie/omni-dna-sad-mutation", trust_remote_code=True)
# Score a mutation pair (discriminative use)
prompt = "mutate: ATGGCTAGCTGA -> ATAGCTGGCTAA"
logits = model(**tokenizer(prompt, return_tensors="pt")).logits
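The snippet above stops at raw logits. One common way to reduce causal-LM logits to a single score is the summed next-token log-likelihood of the prompt. The sketch below assumes that approach; the benchmark's actual scoring rule is not documented here, and `sequence_log_likelihood` is a hypothetical helper. It is written in plain Python so it runs without a model download.

```python
import math


def sequence_log_likelihood(logits, token_ids):
    """Sum of log P(token[t+1] | tokens[:t+1]) under a causal LM.

    logits: list of per-position score lists, shape (seq_len, vocab_size)
    token_ids: list of seq_len integer token ids
    """
    total = 0.0
    for t in range(len(token_ids) - 1):
        row = logits[t]
        # Numerically stable log-sum-exp for the softmax normaliser.
        m = max(row)
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        total += row[token_ids[t + 1]] - log_norm
    return total


# Toy input: vocabulary of 2 with uniform logits, so each step contributes log(0.5).
toy_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
toy_ids = [0, 1, 0]
ll = sequence_log_likelihood(toy_logits, toy_ids)
print(ll)
```

With the model loaded as above, the inputs could be obtained as `logits[0].tolist()` and the tokenizer's `input_ids[0].tolist()`. Scores for a candidate pair and a decoy can then be compared directly.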
Related
- Nhoodie/omni-dna-ici-dc: ICI-DC checkpoint (synthetic-only)
- Nhoodie/omni-dna-sad-mutation-dataset: Training data
Source Code
Git commit: 7be4e73 (branch dev/sad-omni-hyena, private repo)
Citation
- Omni-DNA: Zehui127 et al.
- HyenaDNA: Nguyen et al., NeurIPS 2023
- ENBED: Malusare et al., Bioinformatics Advances, 2024 (arXiv: 2311.02333)