Error: backbone_model.embed_tokens.embed_audio_tokens.weight | MISSING |
After installing transformers and running the code provided in the README, the following error appears:
backbone_model.embed_tokens.embed_audio_tokens.weight | MISSING |
This leads to gibberish generation, since the missing parameter is randomly initialized
(this also happens with the transformers pipeline).
Code:
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
import soundfile as sf
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("cartesia/azzurra-voice")
model = CsmForConditionalGeneration.from_pretrained("cartesia/azzurra-voice").to(device)
text = "La sintesi vocale è un processo complesso"
conversation = [
    {"role": "user", "content": [{"type": "text", "text": text}]},
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)
audio_output = model.generate(**inputs, output_audio=True)
waveform = audio_output[0].cpu().numpy()
sf.write("output.wav", waveform, 24_000)
Output:
/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret HF_TOKEN does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
CsmForConditionalGeneration LOAD REPORT from: cartesia/azzurra-voice
Key                                                   | Status  |
------------------------------------------------------+---------+-
backbone_model.embed_tokens.embed_audio_tokens.weight | MISSING |
Notes:
- MISSING: those params were newly initialized because they were missing from the checkpoint. Consider training on your downstream task.
Thanks for the report!
I'm not sure what changed in the layer naming in recent transformers versions, as it was working correctly before. However, I have just pushed a new version of the model that renames the weight to match what transformers expects.
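For reference, a fix like this amounts to renaming a key in the checkpoint's state dict. A minimal sketch (the new key is taken from the load report above; the old key name and the tensor value are hypothetical placeholders, not the actual contents of the checkpoint):

```python
# Sketch of a checkpoint key rename. The new key comes from the load report;
# the old key name below is a hypothetical placeholder.
old_key = "backbone_model.embed_tokens.embed_audio.weight"  # hypothetical
new_key = "backbone_model.embed_tokens.embed_audio_tokens.weight"

state_dict = {old_key: [[0.0] * 8]}  # stand-in for the real tensor
if old_key in state_dict:
    state_dict[new_key] = state_dict.pop(old_key)
```

After re-saving the checkpoint with the renamed key, the load report should no longer list the parameter as MISSING.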
It works now. You may need to clear your cache or force a re-download of the model files to pick up the change. Let me know if that solves the issue!
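To force a clean re-download, you can either pass `force_download=True` to `from_pretrained`, or delete the repo's cache folder directly. A sketch of the latter, assuming the standard huggingface_hub cache layout (`models--<org>--<name>` under the hub cache directory):

```python
import os
import shutil
from pathlib import Path

# Default Hub cache location; the HF_HUB_CACHE env var overrides it.
hub_cache = Path(os.environ.get("HF_HUB_CACHE", Path.home() / ".cache/huggingface/hub"))
repo_id = "cartesia/azzurra-voice"
repo_cache = hub_cache / ("models--" + repo_id.replace("/", "--"))

if repo_cache.exists():
    shutil.rmtree(repo_cache)  # next from_pretrained call re-downloads the files
```

Alternatively, `from_pretrained("cartesia/azzurra-voice", force_download=True)` re-fetches the files in place without touching the rest of the cache.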
Thanks, it's working now