lewtun (HF Staff) committed
Commit 57127eb · verified · 1 parent: 4d2ebe0

Update README.md

Files changed (1): README.md +72 -0
@@ -79,6 +79,78 @@ print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

  Output is guaranteed identical to greedy decoding with the target model alone; only wall-clock latency is reduced.

+ ### Base-pair-level generation and scoring
+
+ The `fns` branch loads custom modeling code for Factorized Nucleotide Supervision (FNS). Carbon still uses its efficient 6-mer tokenizer, but during generation each selected 6-mer is assembled from six per-position nucleotide distributions, giving base-pair-level control over decoded DNA. Use this branch when you need exact base-pair counts, per-position masks, or temperature/top-p behavior applied at the nucleotide level rather than over the 4,096-way 6-mer distribution:
+
+ ```py
+ import math
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "HuggingFaceBio/Carbon-500M"
+ revision = "fns"  # branch that ships the custom FNS modeling code
+ device = "cuda"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     revision=revision,
+     trust_remote_code=True,
+     dtype=torch.bfloat16,
+ ).to(device).eval()
+
+ context = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
+ n_bp = 60  # number of base pairs to generate
+
+ inputs = tokenizer(f"<dna>{context}", return_tensors="pt", add_special_tokens=False).to(device)
+
+ with torch.no_grad():
+     output_ids = model.generate(
+         **inputs,
+         # Each new token covers tokenizer.k bases, so round up to reach n_bp.
+         max_new_tokens=math.ceil(n_bp / tokenizer.k),
+         do_sample=False,  # greedy decoding
+         pad_token_id=tokenizer.eos_token_id,
+     )
+
+ # Drop the prompt tokens, decode, and trim to exactly n_bp bases.
+ generated_ids = output_ids[0, inputs.input_ids.shape[1]:]
+ generated_dna = tokenizer.decode(generated_ids, skip_special_tokens=True)[:n_bp]
+
+ print(generated_dna)
+ ```
+
+ The same per-base marginals are exposed through `score_sequence()`, which returns the probability assigned to the observed base at each position. Taking the mean log probability gives a base-pair-level sequence score, where higher values indicate higher model likelihood:
+
+ ```py
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "HuggingFaceBio/Carbon-500M"
+ revision = "fns"  # branch that ships the custom FNS modeling code
+ device = "cuda"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     revision=revision,
+     trust_remote_code=True,
+     dtype=torch.bfloat16,
+ ).to(device).eval()
+
+ reference = "GGGCTATAAAGGCCATCGATCGATCGATCGATCGATCGATCG"
+ perturbed = "GGGCGCGCGCGGCCATCGATCGATCGATCGATCGATCGATCG"
+
+ with torch.no_grad():
+     # actual_probs holds, per sequence, the probability of each observed base.
+     bp_probs, actual_probs = model.score_sequence([reference, perturbed])
+
+ # Mean log probability per base; the clamp avoids log(0).
+ scores = [torch.log(p.clamp_min(1e-12)).mean().item() for p in actual_probs]
+
+ print(f"reference mean bp logp: {scores[0]:.4f}")
+ print(f"perturbed mean bp logp: {scores[1]:.4f}")
+ print(f"reference preferred: {scores[0] > scores[1]}")
+ ```
+
+
  ## Evaluation

  Carbon-500M is benchmarked against ≈ 1B-parameter DNA models on the standard Carbon evaluation suite. See the [Carbon-3B card](https://huggingface.co/HuggingFaceBio/Carbon-3B#evaluation) for the task definitions and methodology.
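
For intuition about the base-pair-level section added in this commit, here is a toy sketch in plain Python. It is not the `fns` branch's implementation. It assumes the six positions of a 6-mer can be treated as independent distributions over A/C/G/T, which the README does not confirm. The helper names (`apply_temperature`, `mask_bases`, `greedy_kmer`, `joint_kmer_prob`, `mean_bp_logp`) and the example probabilities are hypothetical and chosen only for illustration.

```python
import itertools
import math

BASES = "ACGT"


def apply_temperature(probs, temperature):
    """Rescale one per-position distribution; temperature < 1 sharpens it."""
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def mask_bases(probs, allowed):
    """Zero out disallowed bases at one position and renormalize."""
    kept = [p if base in allowed else 0.0 for base, p in zip(BASES, probs)]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("mask removes all probability mass")
    return [p / total for p in kept]


def greedy_kmer(position_probs):
    """Pick the most likely base independently at each position."""
    return "".join(
        BASES[max(range(len(BASES)), key=lambda i: probs[i])]
        for probs in position_probs
    )


def joint_kmer_prob(position_probs, kmer):
    """k-mer probability under the toy assumption of independent positions."""
    return math.prod(
        probs[BASES.index(base)] for probs, base in zip(position_probs, kmer)
    )


def mean_bp_logp(observed_probs):
    """Mean log probability of the observed bases (higher means more likely)."""
    return sum(math.log(max(p, 1e-12)) for p in observed_probs) / len(observed_probs)


# Hypothetical per-position distributions over A, C, G, T for one 6-mer.
position_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.25, 0.25, 0.40, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.40, 0.30, 0.20, 0.10],
    [0.05, 0.05, 0.80, 0.10],
]

# Under independence, per-position argmax matches the argmax over all 4,096 6-mers.
best_joint = max(
    ("".join(k) for k in itertools.product(BASES, repeat=6)),
    key=lambda kmer: joint_kmer_prob(position_probs, kmer),
)
print(greedy_kmer(position_probs), best_joint)

# Per-position mask: forbid A at the fifth position only.
masked_probs = list(position_probs)
masked_probs[4] = mask_bases(position_probs[4], allowed="CGT")
masked_kmer = greedy_kmer(masked_probs)
print(masked_kmer)

# Nucleotide-level temperature applied to a single position.
sharpened = apply_temperature(position_probs[2], temperature=0.5)

# Score a k-mer from the probabilities assigned to its observed bases.
observed = [probs[BASES.index(b)] for probs, b in zip(position_probs, "ACGTAG")]
print(f"mean bp logp: {mean_bp_logp(observed):.4f}")
```

In this toy setting, a mask or temperature touches a single 4-way distribution rather than the 4,096-way 6-mer distribution, which is the kind of control the added README section describes. `mean_bp_logp` mirrors the mean log probability score computed from `actual_probs` in the README snippet.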