HuggingFaceBio/Carbon-8B

@@ -24,6 +24,8 @@ Carbon-8B is the 8B-parameter sibling of [Carbon-3B](https://huggingface.co/Hugg
 - **Native context: 32,768 tokens (≈ 196 kbp).** Carbon-8B was extended with a long-context decay stage from an 8 k-context base, so it natively handles 32 k tokens. You can apply YaRN at 4× to extrapolate up to 128 k tokens (≈ 786 kbp).
 - Released as a standard Hugging Face causal LM (`LlamaForCausalLM`).
+## How to use
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
@@ -40,6 +42,78 @@ out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
 print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
 ```
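The context figures quoted in the feature list follow from the 6-mer tokenizer: each token covers 6 base pairs, and a YaRN factor of 4 multiplies the token window by 4. The arithmetic can be checked with a short sketch; the constant and function names here are illustrative only and are not part of the model's API.

```python
# Illustrative constants taken from the feature list above.
BP_PER_TOKEN = 6                 # one 6-mer token spans 6 base pairs
NATIVE_CONTEXT_TOKENS = 32_768   # native context window
YARN_FACTOR = 4                  # extrapolation factor quoted for YaRN


def tokens_to_bp(n_tokens: int) -> int:
    """Convert a token count to the number of base pairs it covers."""
    return n_tokens * BP_PER_TOKEN


native_bp = tokens_to_bp(NATIVE_CONTEXT_TOKENS)
extended_tokens = NATIVE_CONTEXT_TOKENS * YARN_FACTOR
extended_bp = tokens_to_bp(extended_tokens)

print(f"native:   {NATIVE_CONTEXT_TOKENS:,} tokens = {native_bp:,} bp")
print(f"extended: {extended_tokens:,} tokens = {extended_bp:,} bp")
```

The results, 196,608 bp and 786,432 bp, match the rounded "≈ 196 kbp" and "≈ 786 kbp" figures.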
+### Base-pair-level generation and scoring
+The `fns` branch provides custom modeling code for Factorized Nucleotide Supervision (FNS), which is why the examples below load it with `trust_remote_code=True`. Carbon still uses its efficient 6-mer tokenizer, but during generation each selected 6-mer is assembled from six per-position nucleotide distributions, giving base-pair-level control over the decoded DNA. Use this branch when you need exact base-pair counts, per-position masks, or temperature/top-p applied at the nucleotide level rather than over the 4,096-way 6-mer distribution:
+```py
+import math
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_id = "HuggingFaceBio/Carbon-8B"
+revision = "fns"
+device = "cuda"
+tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    revision=revision,
+    trust_remote_code=True,
+    dtype=torch.bfloat16,
+).to(device).eval()
+context = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
+n_bp = 60
+inputs = tokenizer(f"<dna>{context}", return_tensors="pt", add_special_tokens=False).to(device)
+with torch.no_grad():
+    output_ids = model.generate(
+        **inputs,
+        max_new_tokens=math.ceil(n_bp / tokenizer.k),  # each token decodes to tokenizer.k bases, so round up
+        do_sample=False,
+        pad_token_id=tokenizer.eos_token_id,
+    )
+generated_ids = output_ids[0, inputs.input_ids.shape[1]:]
+generated_dna = tokenizer.decode(generated_ids, skip_special_tokens=True)[:n_bp]  # trim the overshoot from whole k-mers
+print(generated_dna)
+```
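To make the idea of assembling a 6-mer from six per-position nucleotide distributions concrete, here is a minimal, self-contained sketch in plain Python. It is not the code from the `fns` branch: the function names, the mask format, and the example probabilities are all hypothetical, and it only illustrates how per-position masks and temperature could act on 4-way distributions instead of on the 4,096-way 6-mer distribution.

```python
import random

NUCLEOTIDES = "ACGT"
K = 6  # bases per token, matching the 6-mer tokenizer described above


def apply_temperature(probs, temperature):
    """Rescale a distribution as p ** (1 / T), then renormalize.

    This is equivalent to dividing the logits by T before a softmax.
    """
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def assemble_kmer(position_probs, allowed=None, temperature=1.0, greedy=True, rng=None):
    """Build one k-mer from independent 4-way nucleotide distributions.

    position_probs: list of K lists, each holding probabilities for A, C, G, T.
    allowed: optional list of K strings naming the bases permitted at each
        position (a hypothetical per-position mask format). Each mask must
        keep at least one base with non-zero probability.
    """
    rng = rng or random.Random(0)
    bases = []
    for i, probs in enumerate(position_probs):
        mask = allowed[i] if allowed is not None else NUCLEOTIDES
        # Zero out disallowed bases, then renormalize what is left.
        masked = [p if n in mask else 0.0 for n, p in zip(NUCLEOTIDES, probs)]
        total = sum(masked)
        masked = apply_temperature([m / total for m in masked], temperature)
        if greedy:
            idx = max(range(len(NUCLEOTIDES)), key=masked.__getitem__)
        else:
            idx = rng.choices(range(len(NUCLEOTIDES)), weights=masked)[0]
        bases.append(NUCLEOTIDES[idx])
    return "".join(bases)


# Hypothetical per-position distributions over (A, C, G, T).
example_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.30, 0.20],
]

kmer = assemble_kmer(example_probs)
# Restrict only the last position to A or T; the other positions are unconstrained.
masked_kmer = assemble_kmer(example_probs, allowed=["ACGT"] * 5 + ["AT"])
print(kmer, masked_kmer)
```

Working with six 4-way distributions means a constraint such as "the last base must be A or T" is a one-position mask, whereas over the joint vocabulary the same constraint would have to filter all 4 ** 6 = 4,096 candidate 6-mers.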
+The same per-base marginals are exposed through `score_sequence()`, which returns the probability assigned to the observed base at each position. Taking the mean log probability gives a base-pair-level sequence score, where higher values indicate higher model likelihood:
+```py
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_id = "HuggingFaceBio/Carbon-8B"
+revision = "fns"
+device = "cuda"
+tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    revision=revision,
+    trust_remote_code=True,
+    dtype=torch.bfloat16,
+).to(device).eval()
+reference = "GGGCTATAAAGGCCATCGATCGATCGATCGATCGATCGATCG"
+perturbed = "GGGCGCGCGCGGCCATCGATCGATCGATCGATCGATCGATCG"
+with torch.no_grad():
+    bp_probs, actual_probs = model.score_sequence([reference, perturbed])
+scores = [torch.log(p.clamp_min(1e-12)).mean().item() for p in actual_probs]  # mean log-prob per base; clamp avoids log(0)
+print(f"reference mean bp logp: {scores[0]:.4f}")
+print(f"perturbed mean bp logp: {scores[1]:.4f}")
+print(f"reference preferred: {scores[0] > scores[1]}")
+```
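The mean log probability used above is the log of the geometric mean of the per-base probabilities, so `exp(-mean)` can be read as a per-base perplexity. The sketch below shows that relationship, plus a per-position log-ratio that can help localize where two equal-length sequences differ in likelihood. It uses plain Python lists and made-up probabilities; the helper names are hypothetical and nothing here calls the model.

```python
import math


def mean_bp_logp(probs, eps=1e-12):
    """Mean natural-log probability per base, flooring values at eps to avoid log(0)."""
    return sum(math.log(max(p, eps)) for p in probs) / len(probs)


def per_position_delta(ref_probs, alt_probs, eps=1e-12):
    """Per-position log(alt / ref); negative values mean the alternative is less likely there."""
    return [
        math.log(max(a, eps)) - math.log(max(r, eps))
        for r, a in zip(ref_probs, alt_probs)
    ]


# Made-up per-base probabilities standing in for model output.
ref_probs = [0.5, 0.25, 0.5, 0.25]
alt_probs = [0.5, 0.125, 0.5, 0.25]

ref_score = mean_bp_logp(ref_probs)
alt_score = mean_bp_logp(alt_probs)
ref_perplexity = math.exp(-ref_score)
deltas = per_position_delta(ref_probs, alt_probs)

print(f"reference mean bp logp: {ref_score:.4f}, per-base perplexity: {ref_perplexity:.4f}")
print(f"alternative mean bp logp: {alt_score:.4f}")
print("per-position log-ratio:", [round(d, 4) for d in deltas])
```

Averaging over positions keeps scores comparable between sequences of different lengths, which a summed log probability would not.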
 ## Training
 Carbon-8B follows the same pre-training recipe as Carbon-3B on the **[`HuggingFaceBio/carbon-pretraining-corpus`](https://huggingface.co/datasets/HuggingFaceBio/carbon-pretraining-corpus)** with the identical data mixture on 1T DNA 6-mer tokens. The main recipe ingredients: