โš ๏ธ IMPORTANT WARNING โ€” Model Effectiveness

This model's reported SAD coefficient is unreliable. The ICI-DC (S1) and SAD (S2) training stages both underfit near-instantly. The SAD coefficient was calculated from training hyperparameters (synthetic_pairs × epochs / real_pairs × epochs × LR_factor), NOT from converged training. The model never reached a stable training state that would justify the coefficient value.

Subsequent testing revealed that the unmodified base Omni-DNA-20M model achieves 0.951 AUC on discriminative mutation detection with raw DNA prompts, significantly outperforming ALL fine-tuned variants, including this one. Fine-tuning with ICI-DC synthetic data systematically degraded the base model's innate mutation discrimination capability.

This model is preserved for historical and reproducibility purposes only.


Omni-DNA SAD Checkpoint

Sequential Attenuation Denoising (SAD): ICI-DC checkpoint fine-tuned on real mutation data.

Starting from the ICI-DC checkpoint, this model undergoes attenuation: further training on real mutation pairs at a 10x lower learning rate. Weights that don't contradict the real data persist from ICI-DC; contradicted patterns get corrected.

SAD Training Details

| Parameter | Value |
| --- | --- |
| Base model | Nhoodie/omni-dna-ici-dc (ICI-DC pre-trained) |
| Training data | 3,317 real mutation pairs |
| Epochs | 5 |
| Learning rate | 1e-5 (10x lower than ICI-DC) |
| SAD coefficient | 4.89 (81,120 synthetic exposures / 16,585 real exposures) |
| Batch size | 32 effective |
| Precision | fp32 |

Note: SAD coefficient of 4.89 is considered too high. A coefficient of ~1.5 was also tested (16 real epochs) with similar results. See benchmarks below.
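
The two coefficients quoted above can be reproduced from the table figures. The following is a minimal sketch that treats the coefficient as a plain ratio of synthetic to real exposures; the function name is hypothetical, and the LR_factor term mentioned in the warning is left out because the table's figure is reproduced without it.

```python
def sad_coefficient(synthetic_exposures: int, real_pairs: int, real_epochs: int) -> float:
    """Ratio of synthetic exposures (ICI-DC stage) to real exposures (SAD stage).

    Hypothetical helper: a plain exposure ratio, with no LR factor applied.
    """
    real_exposures = real_pairs * real_epochs
    return synthetic_exposures / real_exposures


SYNTHETIC_EXPOSURES = 81_120  # from the training details table
REAL_PAIRS = 3_317            # from the training details table

coeff_5_epochs = sad_coefficient(SYNTHETIC_EXPOSURES, REAL_PAIRS, real_epochs=5)
coeff_16_epochs = sad_coefficient(SYNTHETIC_EXPOSURES, REAL_PAIRS, real_epochs=16)

print(f"{coeff_5_epochs:.2f}")   # 4.89 (this checkpoint)
print(f"{coeff_16_epochs:.2f}")  # 1.53 (the ~1.5 variant)
```

Both values depend only on hyperparameters, which is why the coefficient says nothing about whether training converged.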

Multi-Axis Benchmark (100 test pairs, 4 models)

Axis 6: Discriminative (most important)

| Model | AUC | Best F1 | Score Gap |
| --- | --- | --- | --- |
| Base Omni (no fine-tune) | 0.588 | 0.688 | 25.6 |
| ICI-DC | 0.887 | 0.858 | 370.2 |
| SAD coeff=4.89 (this) | 0.904 | 0.862 | 411.4 |
| SAD coeff=1.5 | 0.908 | 0.862 | 413.7 |
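
For readers unfamiliar with the AUC column, here is a minimal sketch of how an AUC can be computed from per-pair scores. It uses the rank-based definition (the probability that a positive example outscores a negative one). The scores and the notion of "decoy" pairs are hypothetical; the card does not state how the negatives or the Score Gap were constructed.

```python
def roc_auc(pos_scores, neg_scores):
    """Rank-based AUC: P(random positive > random negative), ties count as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical model scores, for illustration only (not benchmark data).
real_pair_scores = [2.1, 1.7, 0.9, 1.2]   # true parent/mutant pairs
decoy_pair_scores = [0.3, 1.0, 0.8, 1.5]  # assumed negative pairs

auc = roc_auc(real_pair_scores, decoy_pair_scores)
print(auc)  # 0.8125 (13 of 16 positive/negative comparisons are won)
```

An AUC of 0.5 corresponds to chance-level ranking, which puts the base model's 0.588 in this table in context.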

Axis 5: Mutation Surprise

| Model | Surprise | p-value | Interpretation |
| --- | --- | --- | --- |
| Base Omni | +0.777 | <0.0001 | Expects parent (mutations are surprising) |
| ICI-DC | -0.216 | 0.0001 | Expects mutations everywhere |
| SAD coeff=4.89 | -0.203 | 0.0003 | Partially attenuated |
| SAD coeff=1.5 | -0.169 | 0.0026 | More attenuation |

Full 6-Axis Comparison

| Axis | Metric | Base | ICI-DC | SAD 4.89 | SAD 1.5 |
| --- | --- | --- | --- | --- | --- |
| A1: Detection | Recall | 0.511 | 0.518 | 0.517 | 0.497 |
| A2: Logits | Top-3 acc | 0.846 | 0.846 | 0.846 | 0.846 |
| A3: Ti/Tv | Predicted ratio | 0.20 | 0.41 | 0.42 | 0.43 |
| A4: Embeddings | Seq AUC | 0.41 | 0.35 | 0.40 | 0.40 |
| A5: Surprise | ΔLL | +0.78 | -0.22 | -0.20 | -0.17 |
| A6: Discrim. | AUC | 0.588 | 0.887 | 0.904 | 0.908 |

Interpretation

  • ICI-DC provides the main training signal (AUC 0.59→0.89). Synthetic data builds a strong internal representation of valid mutation pairs.
  • SAD fine-tunes that representation (0.887→0.908). The attenuation step does help, but the marginal gain over ICI-DC is modest.
  • The model is a judge, not a generator. 90.8% discriminative AUC but ~30% generative recall. The representations encode mutation structure, but autoregressive decoding can't access it efficiently.
  • Ti/Tv ratio converges from 0.20 (base) to 0.43 (SAD 1.5), approaching the biological value of 0.48.
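
As background for the Ti/Tv axis, the sketch below shows how a transition/transversion ratio can be counted from an aligned parent/mutant pair. Transitions are purine-to-purine (A<->G) or pyrimidine-to-pyrimidine (C<->T) substitutions; everything else is a transversion. The function names and the example mutant are hypothetical, and the card does not specify how its "Predicted ratio" was derived from model outputs.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Label a single-base substitution as a transition or a transversion."""
    if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a valid substitution: {ref}>{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ti_tv_ratio(parent: str, mutant: str) -> float:
    """Ti/Tv over all mismatching positions of two equal-length sequences."""
    if len(parent) != len(mutant):
        raise ValueError("sequences must be aligned and of equal length")
    counts = {"transition": 0, "transversion": 0}
    for ref, alt in zip(parent, mutant):
        if ref != alt:
            counts[classify_substitution(ref, alt)] += 1
    if counts["transversion"] == 0:
        raise ValueError("ratio undefined: no transversions observed")
    return counts["transition"] / counts["transversion"]


# Hypothetical pair: G>A (transition), A>C and G>T (transversions).
ratio = ti_tv_ratio("ATGGCTAGCTGA", "ATAGCTCGCTTA")
print(ratio)  # 0.5
```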

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Nhoodie/omni-dna-sad-mutation", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Nhoodie/omni-dna-sad-mutation", trust_remote_code=True)

# Score a mutation pair (discriminative use)
prompt = "mutate: ATGGCTAGCTGA -> ATAGCTGGCTAA"
logits = model(**tokenizer(prompt, return_tensors="pt")).logits
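
The snippet above stops at raw logits. One common way to turn causal-LM logits into a single score is the summed next-token log-likelihood of the prompt; the sketch below illustrates that idea in plain Python. This is an assumption for illustration, not necessarily the scoring used in the benchmarks above, and the function names are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one vocabulary-sized row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def sequence_log_likelihood(logits, token_ids):
    """Sum of log P(token[t+1] | tokens[:t+1]) for a causal language model.

    logits:    list of seq_len rows, each of length vocab_size
    token_ids: list of seq_len token ids (the model's own input ids)
    """
    if len(logits) != len(token_ids):
        raise ValueError("logits and token_ids must have the same length")
    total = 0.0
    for t in range(len(token_ids) - 1):
        # The logits at position t predict the token at position t + 1.
        total += log_softmax(logits[t])[token_ids[t + 1]]
    return total


# Toy check with uniform logits over a 4-token vocabulary:
# each of the 2 predicted tokens has probability 1/4.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
toy_ids = [0, 2, 1]
toy_ll = sequence_log_likelihood(uniform_logits, toy_ids)
print(toy_ll)  # equals 2 * log(1/4), about -2.77
```

With the objects from the usage snippet, the inputs could be obtained as `enc = tokenizer(prompt, return_tensors="pt")`, then `model(**enc).logits[0].tolist()` and `enc["input_ids"][0].tolist()`. Under this assumed scheme, a higher log-likelihood means the model finds the pair more plausible.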

Related

Source Code

Git commit: 7be4e73 (branch dev/sad-omni-hyena, private repo)

Citation

  • Omni-DNA: Zehui127 et al.
  • HyenaDNA: Nguyen et al., NeurIPS 2023
  • ENBED: Malusare et al., Bioinformatics Advances, 2024 (arXiv: 2311.02333)