library_name: transformers
license: apache-2.0
language:
  - dna
tags:
  - dna
  - genomic
  - transformers

Carbon-8B

A larger, higher-capacity member of the Carbon family of generative DNA foundation models.

Carbon-8B is the 8B-parameter sibling of Carbon-3B. It is intended for users who can afford additional inference cost in exchange for stronger downstream performance. For the full design rationale, tokenizer specification, evaluation protocol, and usage details, please refer to the Carbon-3B model card and the Carbon technical report — this card focuses only on what is specific to Carbon-8B.

Model Summary

8B-parameter decoder-only autoregressive model trained on DNA and RNA sequences with a primary focus on eukaryotes.
Same hybrid tokenizer as Carbon-3B (non-overlapping 6-mer for DNA + Qwen3 BPE for English text). Each DNA token encodes 6 bp. Wrap DNA inputs with <dna>...</dna> — see the Carbon-3B card for tokenizer details and usage caveats.
Native context: 32,768 tokens (≈ 196 kbp), reached through a long-context decay stage on top of an 8 k-context base. You can apply YaRN at 4× to extrapolate up to 128 k tokens (≈ 786 kbp).
Released as a standard Hugging Face causal LM (LlamaForCausalLM).

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "HuggingFaceBio/Carbon-8B"
# Load the tokenizer and the model weights in bfloat16, then move the model to the GPU.
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, dtype=torch.bfloat16,
).cuda().eval()

prompt = "<dna>ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"   # 42 bp, a multiple of 6 bp
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
# Greedy decoding of up to 64 new tokens.
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
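
The "multiple of 6 bp" constraint follows from the non-overlapping 6-mer tokenization described in the model summary. The sketch below illustrates that chunking in plain Python and builds a prompt in the same shape as the example above. The helper names `split_6mers` and `make_dna_prompt` are hypothetical and are not part of the Carbon tokenizer; trimming a trailing remainder is an assumption made for illustration, so check the Carbon-3B card for how the real tokenizer treats sequences that are not a multiple of 6 bp.

```python
K = 6  # bp per DNA token, as stated in the model summary


def split_6mers(seq: str) -> list[str]:
    """Split a DNA string into non-overlapping 6-mers.

    Assumption for illustration: a trailing remainder shorter than 6 bp is dropped.
    """
    usable = len(seq) - len(seq) % K
    return [seq[i:i + K] for i in range(0, usable, K)]


def make_dna_prompt(seq: str, close_tag: bool = False) -> str:
    """Trim to a multiple of 6 bp and wrap the sequence in the <dna> tag.

    The tag is left open by default, matching the generation prompt above.
    """
    body = "".join(split_6mers(seq.strip()))
    return f"<dna>{body}</dna>" if close_tag else f"<dna>{body}"


seq = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
print(len(seq), len(split_6mers(seq)))  # 42 bp -> 7 six-mers
print(make_dna_prompt(seq + "AT"))      # the trailing "AT" is trimmed
```

Under these assumptions, a sequence of n bp maps to n // 6 DNA tokens, which is how the token and kbp figures in this card relate to each other.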

Training

Carbon-8B follows the Carbon-3B pre-training recipe on the HuggingFaceBio/carbon-pretraining-corpus, with the identical data mixture and a budget of 1T DNA 6-mer tokens. The main recipe ingredients, including the one stated difference from Carbon-3B:

Learning-rate schedule: cosine (instead of the WSD schedule used for Carbon-3B).
Loss schedule: after 100B tokens the loss switches from cross-entropy to FNS loss until the end of training.
Pre-training: 1T 6-mer tokens (≈ 6T DNA base pairs) with a global batch size of 512 sequences of 8,192 tokens (≈ 4.19 M tokens per step), run on 32 nodes (TP=4, DP=64) in bfloat16 with AdamW. The training mixture stays the same throughout, including the decay phase: 70% Generator eukaryote data with metadata (with dropout), 16% mRNA, 4% splice mRNA, and 10% prokaryote data.
Long-context extension: after pre-training, Carbon-8B undergoes a long-context decay phase that extends the native context from 8,192 to 32,768 tokens (≈ 196 kbp). You can apply YaRN at 4× to further extrapolate to 128 k tokens (≈ 786 kbp).

Training infrastructure, framework (Megatron-LM-Carbon), and conversion path (Megatron-Bridge) are identical to Carbon-3B.
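
The batch and token figures above can be cross-checked with a few lines of arithmetic. This is a back-of-the-envelope sketch, not output from the training run: it assumes "1T" and "100B" mean exactly 10^12 and 10^11 tokens, and that TP × DP accounts for all GPUs (no other parallelism dimension), neither of which the card states explicitly.

```python
import math

GLOBAL_BATCH_SIZE = 512             # sequences per optimizer step (GBS)
SEQ_LEN = 8192                      # tokens per sequence
TOTAL_TOKENS = 10**12               # "1T", assumed to be exactly 1e12
LOSS_SWITCH_TOKENS = 100 * 10**9    # "100B", assumed to be exactly 1e11
BP_PER_TOKEN = 6                    # each DNA token encodes 6 bp
NODES, TP, DP = 32, 4, 64

tokens_per_step = GLOBAL_BATCH_SIZE * SEQ_LEN
total_steps = math.ceil(TOTAL_TOKENS / tokens_per_step)
switch_step = math.ceil(LOSS_SWITCH_TOKENS / tokens_per_step)
total_bp = TOTAL_TOKENS * BP_PER_TOKEN
gpus = TP * DP                      # assumes no further parallelism dimension
gpus_per_node = gpus // NODES
seqs_per_dp_rank = GLOBAL_BATCH_SIZE // DP

print(f"tokens per step:        {tokens_per_step:,}")   # 4,194,304
print(f"steps for 1T tokens:    {total_steps:,}")       # 238,419
print(f"loss switch near step:  {switch_step:,}")       # 23,842
print(f"base pairs seen:        {total_bp:.1e}")        # 6.0e+12
print(f"GPUs: {gpus} ({gpus_per_node} per node), "
      f"{seqs_per_dp_rank} sequences per data-parallel rank per step")
```

Under these assumptions the switch from cross-entropy to FNS loss falls roughly 10% of the way into pre-training.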

Evaluation

All evaluations are zero-shot and use the public Carbon evaluation pipeline. See the Carbon-3B card for the full task suite, metrics, and methodology.

Downstream tasks

Category	Task (metric in %)	Carbon-3B	Carbon-8B	Δ (points)
Generative	Sequence Recovery eukaryote	61.54	64.05	+2.51
Variant effect prediction	BRCA2	84.63	85.72	+1.09
	TraitGym Mendelian	33.65	36.43	+2.78
	ClinVar coding (24 kb)	92.89	93.11	+0.22
	ClinVar non-coding (24 kb)	91.14	91.63	+0.49
Perturbation	Nucleotide triplet-expansion	85.20	89.05	+3.85
	Synonymous codon replacement	88.89	91.46	+2.57
Long-context retrieval	Genomic-NIAH @ 393 kbp	79.00	86.00	+7.00

Genomic-NIAH (long-context retrieval)

Genomic-NIAH measures how well a DNA model actually uses its long context. See the HuggingFaceBio/genomic-niah dataset card for the benchmark design.

Context length	Carbon-3B (native / YaRN 4×)	Carbon-8B (native / YaRN 4×)	Evo2 7B
16 k tokens (98 kbp)	0.73 / 0.91	0.78 / 0.89	0.97
32 k tokens (196 kbp)	0.55 / 0.90	0.69 / 0.87	0.95
64 k tokens (393 kbp)	— / 0.79	— / 0.86	0.80
128 k tokens (786 kbp)	— / 0.27	— / 0.65	0.53

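Natively, Carbon-8B scores 0.78 at 16 k tokens and 0.69 at its 32 k boundary; no native score is reported beyond it. With YaRN 4× it reaches 0.89 and 0.87 at those lengths, holds 0.86 at 64 k tokens, and still scores 0.65 at 128 k tokens (≈ 786 kbp), ahead of Evo2 7B (0.53) at that length.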
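
One possible way to enable YaRN at 4× is sketched below. It is an assumption rather than the documented Carbon procedure: the key names follow the `rope_scaling` convention used by Llama-style configs in transformers, the exact config field can differ between library versions, and the helpers `yarn_settings` and `load_with_yarn` are hypothetical. Refer to the Carbon-3B card for the recommended settings.

```python
BP_PER_TOKEN = 6          # each DNA token encodes 6 bp
NATIVE_CONTEXT = 32_768   # native context length in tokens


def yarn_settings(factor: float = 4.0, native_ctx: int = NATIVE_CONTEXT) -> dict:
    """Build config overrides for YaRN extrapolation (assumed key names)."""
    rope = {
        "rope_type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": native_ctx,
    }
    return {
        "rope_scaling": rope,
        "max_position_embeddings": int(native_ctx * factor),
    }


def load_with_yarn(repo: str = "HuggingFaceBio/Carbon-8B", factor: float = 4.0):
    """Load the model with the overrides applied.

    Not called in this sketch: it needs transformers, torch, and a model download.
    """
    import torch
    from transformers import AutoConfig, AutoModelForCausalLM

    settings = yarn_settings(factor)
    config = AutoConfig.from_pretrained(repo)
    config.rope_scaling = settings["rope_scaling"]
    config.max_position_embeddings = settings["max_position_embeddings"]
    return AutoModelForCausalLM.from_pretrained(
        repo, config=config, dtype=torch.bfloat16,
    )


settings = yarn_settings()
print(settings["max_position_embeddings"])  # 131072 tokens

# Token counts from the table, converted to kbp at 6 bp per token.
for tokens in (16_384, 32_768, 65_536, 131_072):
    print(f"{tokens:>7} tokens = {tokens * BP_PER_TOKEN // 1000} kbp")
```

The loop reproduces the 98, 196, 393, and 786 kbp figures used in the table above.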

Intended use

Generative modelling, variant-effect prediction, motif-perturbation analysis, and long-context retrieval on DNA sequences. For faster inference at shorter contexts, use Carbon-3B.

⚠️ Genetic data is highly sensitive. Depending on how this model is used (local download, inference API/endpoints, third-party inference providers, Spaces demos or others), input and output data may be processed or handled differently by different providers or space owners. Please make sure you understand and agree with how your data is handled before using the model.

License

Apache 2.0.