Training and fine-tuning example
I found that the training example config in NeMo (https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/speechlm2/salm_train.py, https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/speechlm2/conf/salm.yaml) differs from the one used for canary-qwen; even the dataset config does not match. As a result, when I use the NeMo example to fine-tune canary-qwen, the performance is poor. Could you provide a training example that matches this canary-qwen model?
Hi there,
Yes, it seems you're right. Some of the hparams of Canary-Qwen are available in its config.json: https://huggingface.co/nvidia/canary-qwen-2.5b/blob/main/config.json
The actual config used is this:
model:
  # Every name/path here starting with 'pretrained' is used to initialize the model weights.
  pretrained_llm: Qwen/Qwen3-1.7B
  pretrained_asr: nvidia/canary-1b-flash
  pretrained_weights: True # When False, we use pretrained_name to load the architecture, but with random init

  # Regexp (re.compile) patterns matching parameters to be frozen.
  freeze_params:
    # Frozen LLM
    - "^llm\\..+$" # LLM
    - "^embed_tokens\\..+$" # LLM embedding is moved
    # Frozen pretrained ASR (only the modality adapter layers are trainable)
    # - "^perception\\.preprocessor\\..+$"
    # - "^perception\\.encoder\\..+$"
  prevent_freeze_params: [] # Use to make specific submodules trainable; overrides freeze_params

  lora:
    task_type: CAUSAL_LM
    r: 128
    lora_alpha: 256
    lora_dropout: 0.01
    target_modules: ["q_proj", "v_proj"]

  prompt_format: qwen
  audio_locator_tag: "<|audioplaceholder|>" # placeholder token for audio turn is expected

  perception:
    target: nemo.collections.speechlm2.modules.perception.AudioPerceptionModule
    output_dim: 2048
    modality_adapter:
      _target_: nemo.collections.speechlm2.modules.perception.IdentityConnector
      d_model: 1024
    # spec_augment:
    #   _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    #   freq_masks: 2 # set to zero to disable it
    #   time_masks: 10 # set to zero to disable it
    #   freq_width: 27
    #   time_width: 5 # 5 frames = 50ms

  optimizer:
    _target_: torch.optim.AdamW
    lr: 5e-4
    betas: [0.9, 0.98]
    weight_decay: 1e-3
    foreach: true

  lr_scheduler:
    _target_: nemo.core.optim.lr_scheduler.CosineAnnealing
    warmup_steps: 1000
    min_lr: 1e-6
    max_steps: ${trainer.max_steps}

trainer:
  devices: -1
  accelerator: gpu
  num_nodes: 1
  precision: bf16-true
  logger: False # logger provided by exp_manager
  enable_checkpointing: False
  use_distributed_sampler: False
  max_steps: 100000
  limit_train_batches: 5000
  val_check_interval: ${trainer.limit_train_batches}
  limit_val_batches: 10
  log_every_n_steps: 10
  num_sanity_val_steps: 1
  gradient_clip_val: 1.0
  accumulate_grad_batches: 1
  strategy:
    _target_: lightning.pytorch.strategies.DDPStrategy
    gradient_as_bucket_view: true
    find_unused_parameters: true

data:
  train_ds:
    sample_rate: 16000
    prompt_format: ${model.prompt_format}
    token_equivalent_duration: 0.08
    text_field: answer
    lang_field: target_lang
    input_cfg:
      - type: lhotse_as_conversation
        input_cfg: /path/to/input_cfg.yaml
        audio_locator_tag: ${model.audio_locator_tag}
        tags:
          context: 'Transcribe the following:'
    seed: 42
    shuffle: true
    shard_seed: "randomized"
    num_workers: 4
    use_multimodal_sampling: true
    min_duration: 0.1
    min_tokens: 2
    max_tokens: 1024
    bucket_duration_bins: [99, 110, 117, 124, 184, 247, 324, 391, 457, 520, 555, 579, 600, 618, 638, 1024]
    bucket_batch_size: [69, 64, 60, 40, 28, 22, 16, 14, 12, 11, 10, 6, 4, 4, 4, 2]
    use_bucketing: true
    num_buckets: 16
    bucket_buffer_size: 20000

  validation_ds:
    # The entries under 'datasets' are a list of separate dataloaders.
    # The structure is <dataset-name>: {<dataloader-dict-config>}
    # They inherit all settings from validation_ds, but can individually override them.
    prompt_format: ${model.prompt_format}
    token_equivalent_duration: 0.08
    datasets:
      devset_nemo_manifest:
        input_cfg:
          - audio_locator_tag: ${model.audio_locator_tag}
            manifest_filepath: /path/to/devset_nemo_manifest.json
            tags:
              context: 'Transcribe the following:'
            type: lhotse_as_conversation
    sample_rate: 16000
    batch_size: 8
    seed: 42
    shard_seed: "randomized"

exp_manager:
  exp_dir: null
  explicit_log_dir: canary-qwen-2.5b-results/
  name: canary-qwen-2.5b
  create_tensorboard_logger: false
  create_checkpoint_callback: true
  use_datetime_version: true
  max_time_per_run: 00:03:50:00
  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
  # you need to set these two to True to continue the training
  resume_if_exists: true
  resume_ignore_no_checkpoint: true
  # You may use this section to create a W&B logger
  create_wandb_logger: true
  wandb_logger_kwargs:
    name: canary-qwen-2.5b-release
    project: canary-qwen-2.5b
    resume: true
  checkpoint_callback_params:
    filename: "{step}"
    monitor: val_acc
    mode: max
    every_n_train_steps: null
    every_n_epochs: 1
    save_top_k: 1
    always_save_nemo: false
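For intuition on the multimodal sampling settings above: as I understand it, token_equivalent_duration converts audio duration into a token-equivalent count so that audio and text share a single length budget, which is why max_tokens and bucket_duration_bins are expressed in token units rather than seconds. A back-of-the-envelope sketch under that assumption (the formula is my reading of the lhotse sampler, not an official reference):

token_equivalent_duration = 0.08  # seconds of audio counted as one token

def total_tokens(audio_seconds: float, text_tokens: int) -> float:
    # Assumed measure: audio converted to token-equivalents plus prompt/answer tokens.
    return audio_seconds / token_equivalent_duration + text_tokens

# A 30 s utterance with a 40-token transcript:
print(total_tokens(30.0, 40))  # 415.0 -> lands in the 457-token bucket above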
It was run as a series of consecutive 4-hour training jobs until 100k steps were reached (~1 day on 32x A100 80GB). If you do the same, you will need to set a different random seed (train_ds.seed) for each of those jobs; one possible way to automate that is sketched below. You might also want to remove the text_field and lang_field options if you have standard NeMo ASR manifests (this run used some custom-formatted data).
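A hypothetical sketch for deriving a fresh seed per requeued job, assuming a SLURM cluster that sets the SLURM_RESTART_COUNT environment variable on requeue:

import os

# Hypothetical: derive a distinct data seed for each consecutive 4h job from the
# SLURM requeue counter, then pass it to the config (e.g. as a
# data.train_ds.seed override when launching salm_train.py).
restart_count = int(os.environ.get("SLURM_RESTART_COUNT", "0"))
seed = 42 + restart_count
print(f"data.train_ds.seed={seed}")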
If you wish to fine-tune, you will need to set pretrained_weights: False and add one line of code to manually load the Canary-Qwen pretrained weights before training starts (this capability isn't yet exposed in the YAML config); see the sketch below. Hopefully this helps you get started.
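A minimal sketch of that one extra step, assuming the SALM class documented on the model card and that the state dicts are compatible; here `model` stands for the randomly initialized model built from this config inside salm_train.py, and the snippet would go before trainer.fit():

from nemo.collections.speechlm2.models import SALM

# Load the released checkpoint and copy its weights into the freshly built
# model (created with pretrained_weights: False). strict=False is defensive;
# inspect the returned key lists to make sure nothing important was skipped.
pretrained = SALM.from_pretrained("nvidia/canary-qwen-2.5b")
missing, unexpected = model.load_state_dict(pretrained.state_dict(), strict=False)
print("missing:", missing, "unexpected:", unexpected)
del pretrained  # free the extra copy before training starts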
I'll try to push a complete example/tutorial but I can't make any promises about the timeline for this.
Hey, was there ever any example released? I'm interested in pairing Canary 1B v2 with Qwen3-1.7B.
No, we did not end up releasing an example for this. However, there's a pending PR (soon to be merged) adding NeMo Automodel integration that allows easier scaling of Speech LLM training, and it includes both documentation and a tutorial. Please note that it's subject to change, as it's a pre-release version of NeMo Speech.
Thanks, I'll take a look. I have another question regarding the speechlm2 projection layer. The canary-qwen-2.5b model card states the model was trained on 234k hours. Does the projection layer actually need that much training, given that the pretrained models (Canary 1B) were not trained on nearly that much?
Also, for multilingual support, does the training data have to be multilingual, since the projection layer isn't really responsible for the multilingual part?
For canary-qwen-2.5b, the encoder + projection + LoRA on LLM were all trainable, while base LLM weights were frozen.
You can train just the projection layer with smaller data and get decent results in a small number of training steps (1-10k), but in my experience not as good as when more parameters are trainable. For canary-qwen it was around 6% Open ASR WER when I did that. However, in my observations the projection layer encodes a lot of task/data bias, so I expect multilingual ASR to work but be somewhat unreliable if you don't use any multilingual data to train it.
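If you want to try that projection-only setup, the config above already hints at how: uncomment the two ^perception\\.preprocessor\\..+$ / ^perception\\.encoder\\..+$ patterns under freeze_params (and drop the lora section) so that only the modality adapter remains trainable. A quick sanity check you could run after the model is built (hypothetical `model` variable from salm_train.py):

# List what is still trainable after freeze_params is applied; with the encoder
# and preprocessor patterns uncommented and LoRA disabled, expect to see only
# modality-adapter parameters here.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable tensors")
for name in trainable[:20]:
    print(" ", name)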
Is it normal that, with the tutorial in that branch, using Qwen 1.7B (originally Nemotron) + Canary 1B Flash outputs total garbage that makes no sense for any audio you feed it, returning the same kind of answer and repeating the same words over and over?
Will that get resolved with a larger dataset? The tutorial was using only 5 hours of audio.
{"id": "5895-34615-0000-42", "duration": 3.335, "text": "but is laughter a synonym of joy", "pred_text": "and the rest of the time he was in the house"}
{"id": "5895-34615-0001-43", "duration": 3.305, "text": "such perfect completeness is not in nature", "pred_text": "and the rest of the time he was in the house"}
{"id": "3536-23268-0004-4", "duration": 3.13, "text": "i hope miss milner you pass this evening at home", "pred_text": "and the rest of the time he was in the house"}
{"id": "5895-34615-0008-50", "duration": 3.055, "text": "the outside did not depend on the interior", "pred_text": "and the rest of the time he was in the house"}
{"id": "5895-34615-0009-51", "duration": 2.78, "text": "no one could escape from this rictus", "pred_text": "and the rest of the time he was in the house"}
{"id": "5895-34615-0011-53", "duration": 2.76, "text": "an everlasting laugh", "pred_text": "and the rest of the time he was in the house"}
{"id": "5895-34615-0006-48", "duration": 2.525, "text": "he showed himself on the platform", "pred_text": "and the rest of the time he was in the house"}
{"id": "5895-34615-0005-47", "duration": 2.495, "text": "gwynplaine was a mountebank", "pred_text": "and the rest of the time he was in the house"}
It should be able to converge on train-clean-5 within ~3k steps, and even output reasonable predictions on dev-clean-2, but don't expect fireworks on in-the-wild data.
If you predict on training data and get garbage, it means it hasn't converged well - experiment with larger batch sizes / gradient accumulation and learning rates.
Make sure that you changed the prompt format to qwen.
And yeah, generally more data helps, though if you can't get it to converge with train-clean-5 when only the modality projector is trainable, that indicates some issue or misconfiguration.