lewtun HF Staff committed on
Commit
0cb5e66
·
verified ·
1 Parent(s): eff6877

Update README.md

Files changed (1)
  1. README.md +74 -0
README.md CHANGED
@@ -24,6 +24,8 @@ Carbon-8B is the 8B-parameter sibling of [Carbon-3B](https://huggingface.co/Hugg
24
  - **Native context: 32,768 tokens (≈ 196 kbp).** Carbon-8B was extended with a long-context decay stage from an 8 k-context base, so it natively handles 32 k tokens. You can apply YaRN at 4× to extrapolate up to 128 k tokens (≈ 786 kbp).
25
  - Released as a standard Hugging Face causal LM (`LlamaForCausalLM`).
26
 
27
  ```python
28
  from transformers import AutoModelForCausalLM, AutoTokenizer
29
  import torch
@@ -40,6 +42,78 @@ out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
40
  print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
41
  ```
42
 
43
  ## Training
44
 
45
  Carbon-8B follows the same pre-training recipe as Carbon-3B on the **[`HuggingFaceBio/carbon-pretraining-corpus`](https://huggingface.co/datasets/HuggingFaceBio/carbon-pretraining-corpus)** with the identical data mixture on 1T DNA 6-mer tokens. The main recipe ingredients:
 
24
  - **Native context: 32,768 tokens (≈ 196 kbp).** Carbon-8B was extended with a long-context decay stage from an 8 k-context base, so it natively handles 32 k tokens. You can apply YaRN at 4× to extrapolate up to 128 k tokens (≈ 786 kbp).
25
  - Released as a standard Hugging Face causal LM (`LlamaForCausalLM`).
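
The kbp figures in the context bullet follow from the 6-mer tokenizer. A minimal sketch of that arithmetic, assuming every token covers exactly six bases and ignoring special tokens such as `<dna>`; `tokens_to_bp` is an illustrative helper, not part of the Carbon API:

```python
K = 6  # bases per token, assuming a pure 6-mer tokenization


def tokens_to_bp(n_tokens: int, k: int = K) -> int:
    """Convert a token count to the number of base pairs it spans."""
    return n_tokens * k


native_tokens = 32_768                             # native context window
yarn_factor = 4                                    # YaRN scaling factor from the bullet above
extended_tokens = native_tokens * yarn_factor      # 131,072 tokens

print(tokens_to_bp(native_tokens))    # 196608 bp, i.e. about 196 kbp
print(tokens_to_bp(extended_tokens))  # 786432 bp, i.e. about 786 kbp
```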
26
 
27
+ ## How to use
28
+
29
  ```python
30
  from transformers import AutoModelForCausalLM, AutoTokenizer
31
  import torch
 
42
  print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
43
  ```
44
 
45
+ ### Base-pair-level generation and scoring
46
+
47
+ The `fns` branch loads custom modeling code for Factorized Nucleotide Supervision (FNS). Carbon still uses its efficient 6-mer tokenizer, but during generation each selected 6-mer is assembled from six per-position nucleotide distributions, giving base-pair-level control over decoded DNA. Use this branch when you need exact base-pair counts, per-position masks, or temperature/top-p behavior applied at the nucleotide level rather than over the 4,096-way 6-mer distribution:
48
+
49
+ ```py
50
+ import math
51
+ import torch
52
+ from transformers import AutoModelForCausalLM, AutoTokenizer
53
+
54
+ model_id = "HuggingFaceBio/Carbon-8B"
55
+ revision = "fns"
56
+ device = "cuda"
57
+
58
+ tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
59
+ model = AutoModelForCausalLM.from_pretrained(
60
+     model_id,
61
+     revision=revision,
62
+     trust_remote_code=True,
63
+     dtype=torch.bfloat16,
64
+ ).to(device).eval()
65
+
66
+ context = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
67
+ n_bp = 60
68
+
69
+ inputs = tokenizer(f"<dna>{context}", return_tensors="pt", add_special_tokens=False).to(device)
70
+
71
+ with torch.no_grad():
72
+     output_ids = model.generate(
73
+         **inputs,
74
+         max_new_tokens=math.ceil(n_bp / tokenizer.k),
75
+         do_sample=False,
76
+         pad_token_id=tokenizer.eos_token_id,
77
+     )
78
+
79
+ generated_ids = output_ids[0, inputs.input_ids.shape[1]:]
80
+ generated_dna = tokenizer.decode(generated_ids, skip_special_tokens=True)[:n_bp]
81
+
82
+ print(generated_dna)
83
+ ```
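
The generation snippet requests `math.ceil(n_bp / tokenizer.k)` tokens and then slices the decoded string to `n_bp`, because whole 6-mers can overshoot the requested length. A small sketch of that bookkeeping, assuming `k = 6` (the snippet reads it from `tokenizer.k`); `token_budget` and `trim_to_bp` are hypothetical helper names used only for illustration:

```python
import math


def token_budget(n_bp: int, k: int = 6) -> int:
    """Smallest number of k-mer tokens whose decoded length covers n_bp bases."""
    return math.ceil(n_bp / k)


def trim_to_bp(decoded: str, n_bp: int) -> str:
    """Drop the surplus bases contributed by the last, partially needed k-mer."""
    return decoded[:n_bp]


for n_bp in (60, 61, 100):
    n_tokens = token_budget(n_bp)
    surplus = n_tokens * 6 - n_bp  # bases generated beyond the request
    print(n_bp, n_tokens, surplus)
# 60 10 0
# 61 11 5
# 100 17 2
```

When `n_bp` is a multiple of six, as with `n_bp = 60`, the slice is a no-op; otherwise up to five trailing bases are discarded.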
84
+
85
+ The same per-base marginals are exposed through `score_sequence()`, which returns the probability assigned to the observed base at each position. Taking the mean log probability gives a base-pair-level sequence score, where higher values indicate higher model likelihood:
86
+
87
+ ```py
88
+ import torch
89
+ from transformers import AutoModelForCausalLM, AutoTokenizer
90
+
91
+ model_id = "HuggingFaceBio/Carbon-8B"
92
+ revision = "fns"
93
+ device = "cuda"
94
+
95
+ tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
96
+ model = AutoModelForCausalLM.from_pretrained(
97
+     model_id,
98
+     revision=revision,
99
+     trust_remote_code=True,
100
+     dtype=torch.bfloat16,
101
+ ).to(device).eval()
102
+
103
+ reference = "GGGCTATAAAGGCCATCGATCGATCGATCGATCGATCGATCG"
104
+ perturbed = "GGGCGCGCGCGGCCATCGATCGATCGATCGATCGATCGATCG"
105
+
106
+ with torch.no_grad():
107
+     bp_probs, actual_probs = model.score_sequence([reference, perturbed])
108
+
109
+ scores = [torch.log(p.clamp_min(1e-12)).mean().item() for p in actual_probs]
110
+
111
+ print(f"reference mean bp logp: {scores[0]:.4f}")
112
+ print(f"perturbed mean bp logp: {scores[1]:.4f}")
113
+ print(f"reference preferred: {scores[0] > scores[1]}")
114
+ ```
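
To get a feel for the scale of the mean log probability, it helps to compare it with a uniform guess over the four nucleotides, which scores `log(0.25)`. The sketch below reproduces the clamped mean in plain Python on made-up probabilities; the numbers are illustrative and are not model outputs, and `mean_bp_logp` and `bp_perplexity` are hypothetical helpers rather than part of the `fns` branch:

```python
import math


def mean_bp_logp(probs, floor=1e-12):
    """Mean log probability per base, with the same clamp as the snippet above."""
    return sum(math.log(max(p, floor)) for p in probs) / len(probs)


def bp_perplexity(probs, floor=1e-12):
    """Per-base perplexity: exp of the negative mean log probability."""
    return math.exp(-mean_bp_logp(probs, floor))


uniform = [0.25, 0.25, 0.25, 0.25]    # no preference among A, C, G, T
confident = [0.90, 0.80, 0.95, 0.85]  # illustrative values only

print(f"{mean_bp_logp(uniform):.4f}")   # -1.3863
print(f"{bp_perplexity(uniform):.2f}")  # 4.00
print(mean_bp_logp(confident) > mean_bp_logp(uniform))  # True
```

A score above `log(0.25) ≈ -1.386` means the model does better than chance on average at each position.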
115
+
116
+
117
  ## Training
118
 
119
  Carbon-8B follows the same pre-training recipe as Carbon-3B on the **[`HuggingFaceBio/carbon-pretraining-corpus`](https://huggingface.co/datasets/HuggingFaceBio/carbon-pretraining-corpus)** with the identical data mixture on 1T DNA 6-mer tokens. The main recipe ingredients: