⚠️ IMPORTANT WARNING — Model Effectiveness

This model's reported SAD coefficient is unreliable. Both the ICI-DC (S1) and SAD (S2) training stages underfit near-instantly. The SAD coefficient was calculated from training hyperparameters (synthetic_pairs × epochs / real_pairs × epochs × LR_factor), NOT measured from converged training, and the model never reached a stable training state that would justify the coefficient value.

Subsequent testing revealed that the unmodified base Omni-DNA-20M model achieves 0.951 AUC on discriminative mutation detection with raw DNA prompts — significantly outperforming ALL fine-tuned variants including this one. Fine-tuning with ICI-DC synthetic data systematically degraded the base model's innate mutation discrimination capability.

This model is preserved for historical and reproducibility purposes only.

Omni-DNA SAD Checkpoint

Sequential Attenuation Denoising (SAD) — ICI-DC checkpoint fine-tuned on real mutation data.

Starting from the ICI-DC checkpoint, this model undergoes attenuation: it is fine-tuned on real mutation pairs at a 10x lower learning rate. Weights that do not contradict the real data persist from ICI-DC; contradicted patterns get corrected.

SAD Training Details

Parameter	Value
Base model	`Nhoodie/omni-dna-ici-dc` (ICI-DC pre-trained)
Training data	3,317 real mutation pairs
Epochs	5
Learning rate	1e-5 (10x lower than ICI-DC)
SAD coefficient	4.89 (81,120 synthetic exposures / 16,585 real exposures)
Batch size	32 effective
Precision	fp32

Note: A SAD coefficient of 4.89 is considered too high. A coefficient of ~1.5 (16 epochs over the real pairs) was also tested, with similar results. See the benchmarks below.
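
The arithmetic behind the two coefficients can be checked with a minimal sketch. It assumes the coefficient is simply synthetic exposures divided by real exposures (real pairs × real epochs), which is what the table's 81,120 / 16,585 ratio implies. The warning above also mentions an LR_factor term; it is left out here because the table value is reproduced without it. The function name `sad_coefficient` is hypothetical.

```python
def sad_coefficient(synthetic_exposures: int, real_pairs: int, real_epochs: int) -> float:
    """Ratio of synthetic exposures to real exposures (assumed definition)."""
    real_exposures = real_pairs * real_epochs
    return synthetic_exposures / real_exposures


# Values taken from the training table: 81,120 synthetic exposures, 3,317 real pairs.
coeff_5_epochs = sad_coefficient(81_120, 3_317, 5)    # 5 real epochs (this checkpoint)
coeff_16_epochs = sad_coefficient(81_120, 3_317, 16)  # 16 real epochs (the ~1.5 variant)

print(f"5 epochs:  {coeff_5_epochs:.2f}")
print(f"16 epochs: {coeff_16_epochs:.2f}")
```

Under this definition, adding real epochs is the only lever that lowers the coefficient once the synthetic stage is fixed.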

Multi-Axis Benchmark (100 test pairs, 4 models)

Axis 6: Discriminative (most important)

Model	AUC	Best F1	Score Gap
Base Omni (no fine-tune)	0.588	0.688	25.6
ICI-DC	0.887	0.858	370.2
SAD coeff=4.89 (this)	0.904	0.862	411.4
SAD coeff=1.5	0.908	0.862	413.7
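
For readers who want to reproduce the AUC and Best F1 columns from their own per-pair scores, the following is a rough sketch rather than the evaluation code used for this card. It assumes each test pair yields one scalar score, with higher meaning "more likely a valid mutation pair". The card does not specify how scores were produced or how "Score Gap" is defined, so neither is modeled here. The function names and toy scores are hypothetical.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def best_f1(pos_scores, neg_scores):
    """Best F1 over all thresholds, predicting positive when score >= threshold."""
    best = 0.0
    for threshold in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= threshold for s in pos_scores)
        fp = sum(s >= threshold for s in neg_scores)
        fn = len(pos_scores) - tp
        if tp:
            best = max(best, 2 * tp / (2 * tp + fp + fn))
    return best


# Toy scores for illustration only, not benchmark data.
valid_pair_scores = [0.9, 0.8, 0.4]
invalid_pair_scores = [0.7, 0.3, 0.2]
print(roc_auc(valid_pair_scores, invalid_pair_scores))
print(best_f1(valid_pair_scores, invalid_pair_scores))
```

The pairwise AUC loop is quadratic in the number of pairs, which is fine for a 100-pair test set; a rank-based implementation would be preferable for larger evaluations.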

Axis 5: Mutation Surprise

Model	Surprise	p-value	Interpretation
Base Omni	+0.777	<0.0001	Expects parent (mutations are surprising)
ICI-DC	-0.216	0.0001	Expects mutations everywhere
SAD coeff=4.89	-0.203	0.0003	Partially attenuated
SAD coeff=1.5	-0.169	0.0026	More attenuation

Full 6-Axis Comparison

Axis	Metric	Base	ICI-DC	SAD 4.89	SAD 1.5
A1: Detection	Recall	0.511	0.518	0.517	0.497
A2: Logits	Top-3 acc	0.846	0.846	0.846	0.846
A3: Ti/Tv	Predicted ratio	0.20	0.41	0.42	0.43
A4: Embeddings	Seq AUC	0.41	0.35	0.40	0.40
A5: Surprise	ΔLL	+0.78	-0.22	-0.20	-0.17
A6: Discrim.	AUC	0.588	0.887	0.904	0.908

Interpretation

ICI-DC provides the main training signal (AUC 0.59→0.89). Synthetic data builds a strong internal representation of valid mutation pairs.
SAD fine-tunes that representation (0.887→0.908). The attenuation step does help, but the marginal gain over ICI-DC is modest.
The model is a judge, not a generator: it reaches 0.908 discriminative AUC but only ~30% generative recall. The representations encode mutation structure, but autoregressive decoding cannot access it efficiently.
Ti/Tv ratio converges from 0.20 (base) to 0.43 (SAD 1.5), approaching the biological value of 0.48.
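
As an illustration of the Ti/Tv metric discussed above, here is a minimal sketch of counting transitions (A↔G, C↔T) and transversions (all other substitutions) between aligned parent and mutant sequences. It assumes equal-length sequences with substitutions only (no indels) and skips non-ACGT characters. The helper names are hypothetical and this is not the benchmark's own implementation.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = set("ACGT")


def count_ti_tv(parent: str, mutant: str):
    """Return (transitions, transversions) between two aligned sequences."""
    if len(parent) != len(mutant):
        raise ValueError("sequences must be the same length (substitutions only)")
    ti = tv = 0
    for p, m in zip(parent.upper(), mutant.upper()):
        if p == m or p not in BASES or m not in BASES:
            continue  # unchanged position or ambiguous base
        if (p, m) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti, tv


def ti_tv_ratio(pairs):
    """Pooled Ti/Tv ratio over (parent, mutant) pairs; inf if no transversions."""
    ti_total = tv_total = 0
    for parent, mutant in pairs:
        ti, tv = count_ti_tv(parent, mutant)
        ti_total += ti
        tv_total += tv
    return ti_total / tv_total if tv_total else float("inf")


# Toy pair for illustration only: A->G is a transition, C->A and T->A are transversions.
print(count_ti_tv("ACGT", "GAGA"))
print(ti_tv_ratio([("ACGT", "GAGA")]))
```

Pooling counts across pairs before dividing avoids undefined per-pair ratios when a single pair has no transversions.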

Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "Nhoodie/omni-dna-sad-mutation"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()  # disable dropout for scoring

# Score a mutation pair (discriminative use)
prompt = "mutate: ATGGCTAGCTGA -> ATAGCTGGCTAA"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    logits = model(**inputs).logits
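
The snippet above stops at raw logits. One plausible way to turn them into a single discriminative score is a sequence log-likelihood, sketched below in plain Python so it runs without a model. This is an assumption about how a score could be derived, not the scoring procedure used in the benchmarks. It takes logits as a nested list of shape [seq_len][vocab] and the matching token ids, so a tensor of logits for one sequence would first be converted with `.tolist()`. The function names are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum_exp for x in row]


def sequence_log_likelihood(logits, token_ids):
    """Sum of log-probabilities a causal LM assigns to tokens 1..n-1.

    In a causal LM, the logits at position t predict the token at position t + 1,
    so the first token is not scored and the last row of logits is unused.
    """
    total = 0.0
    for t in range(len(token_ids) - 1):
        total += log_softmax(logits[t])[token_ids[t + 1]]
    return total


# Toy input: uniform logits over a 4-token vocabulary, 3 tokens long.
toy_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
toy_ids = [2, 0, 3]
print(sequence_log_likelihood(toy_logits, toy_ids))
```

Comparing this value between a candidate pair and a control prompt gives a relative score; dividing by the number of scored tokens yields a length-normalized variant.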

Related

Nhoodie/omni-dna-ici-dc: ICI-DC checkpoint (synthetic-only)
Nhoodie/omni-dna-sad-mutation-dataset: Training data

Source Code

Git commit: 7be4e73 (branch dev/sad-omni-hyena, private repo)

Citation

Omni-DNA: Zehui127 et al.
HyenaDNA: Nguyen et al., NeurIPS 2023
ENBED: Malusare et al., Bioinformatics Advances, 2024 (arXiv: 2311.02333)

Model tree for Nhoodie/omni-dna-sad-mutation

Base model

zehui127/Omni-DNA-20M

Finetuned

Nhoodie/omni-dna-ici-dc