Training and fine-tuning example

#13
by shimohan - opened

I found that the training example config in NeMo (https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/speechlm2/salm_train.py, https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/speechlm2/conf/salm.yaml) is different from the one used for canary-qwen. Even the dataset config does not match. As a result, when I use the NeMo example to fine-tune canary-qwen, the performance is not good. Could you provide a training example that matches this canary-qwen model?

Hi there,
Yes it seems you're right. Some of the hparams of Canary-Qwen are available in its config.json: https://huggingface.co/nvidia/canary-qwen-2.5b/blob/main/config.json

The actual config used is this:

model:
  # Every name/path here starting with 'pretrained' is used to initialize the model weights.
  pretrained_llm: Qwen/Qwen3-1.7B
  pretrained_asr: nvidia/canary-1b-flash

  pretrained_weights: True  # When False, we use pretrained_name to load the architecture, but with random init

  # Regexp (re.compile) patterns matching parameters to be frozen.
  freeze_params:
    # Frozen LLM
    - "^llm\\..+$"  # LLM
    - "^embed_tokens\\..+$"  # LLM embedding is moved
    # Frozen pretrained ASR (only the modality adapter layers are trainable)
    # - "^perception\\.preprocessor\\..+$"
    # - "^perception\\.encoder\\..+$"
  prevent_freeze_params: []  # Use to make specific submodules trainable; overrides freeze_params

  lora:
    task_type: CAUSAL_LM
    r: 128
    lora_alpha: 256
    lora_dropout: 0.01
    target_modules: ["q_proj", "v_proj"]

  prompt_format: qwen
  audio_locator_tag: "<|audioplaceholder|>"  # placeholder token for audio turn is expected

  perception:
     target: nemo.collections.speechlm2.modules.perception.AudioPerceptionModule
     output_dim: 2048
     modality_adapter:
       _target_: nemo.collections.speechlm2.modules.perception.IdentityConnector
       d_model: 1024

#     spec_augment:
#       _target_: nemo.collections.asr.modules.SpectrogramAugmentation
#       freq_masks: 2 # set to zero to disable it
#       time_masks: 10 # set to zero to disable it
#       freq_width: 27
#       time_width: 5  # 5 frames = 50ms

  optimizer:
    _target_: torch.optim.AdamW
    lr: 5e-4
    betas: [0.9, 0.98]
    weight_decay: 1e-3
    foreach: true

  lr_scheduler:
    _target_: nemo.core.optim.lr_scheduler.CosineAnnealing
    warmup_steps: 1000
    min_lr: 1e-6
    max_steps: ${trainer.max_steps}

trainer:
  devices: -1
  accelerator: gpu
  num_nodes: 1
  precision: bf16-true
  logger: False # logger provided by exp_manager
  enable_checkpointing: False
  use_distributed_sampler: False
  max_steps: 100000
  limit_train_batches: 5000
  val_check_interval: ${trainer.limit_train_batches}
  limit_val_batches: 10
  log_every_n_steps: 10
  num_sanity_val_steps: 1
  gradient_clip_val: 1.0
  accumulate_grad_batches: 1
  strategy:
    _target_: lightning.pytorch.strategies.DDPStrategy
    gradient_as_bucket_view: true
    find_unused_parameters: true

data:
  train_ds:
    sample_rate: 16000
    prompt_format: ${model.prompt_format}
    token_equivalent_duration: 0.08
    text_field: answer
    lang_field: target_lang
    input_cfg:
      - type: lhotse_as_conversation
        input_cfg: /path/to/input_cfg.yaml
        audio_locator_tag: ${model.audio_locator_tag}
        tags:
          context: 'Transcribe the following:'
    seed: 42
    shuffle: true
    shard_seed: "randomized"
    num_workers: 4

    use_multimodal_sampling: true
    min_duration: 0.1
    min_tokens: 2
    max_tokens: 1024
    bucket_duration_bins: [99,110,117,124,184,247,324,391,457,520,555,579,600,618,638,1024]
    bucket_batch_size: [69, 64, 60, 40, 28, 22, 16, 14, 12, 11, 10, 6, 4, 4, 4, 2]
    use_bucketing: true
    num_buckets: 16
    bucket_buffer_size: 20000

  validation_ds:
    # The entries under 'datasets' are a list of separate dataloaders.
    # The structure is <dataset-name>: {<dataloader-dict-config>}
    # They inherit all settings from validation_ds, but can individually override them.
    prompt_format: ${model.prompt_format}
    token_equivalent_duration: 0.08
    datasets:
      devset_nemo_manifest:
        input_cfg:
        - audio_locator_tag: ${model.audio_locator_tag}
          manifest_filepath: /path/to/devset_nemo_manifest.json
          tags:
            context: 'Transcribe the following:'
          type: lhotse_as_conversation

    sample_rate: 16000
    batch_size: 8
    seed: 42
    shard_seed: "randomized"

exp_manager:
   exp_dir: null
   explicit_log_dir: canary-qwen-2.5b-results/
   name: canary-qwen-2.5b
   create_tensorboard_logger: false
   create_checkpoint_callback: true
   use_datetime_version: true
   max_time_per_run: 00:03:50:00

   resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
   # you need to set these two to True to continue the training
   resume_if_exists: true
   resume_ignore_no_checkpoint: true

   # You may use this section to create a W&B logger
   create_wandb_logger: true
   wandb_logger_kwargs:
     name: canary-qwen-2.5b-release
     project: canary-qwen-2.5b
     resume: true

   checkpoint_callback_params:
     filename: "{step}"
     monitor: val_acc
     mode: max
     every_n_train_steps: null
     every_n_epochs: 1
     save_top_k: 1
     always_save_nemo: false

It was run as a series of consecutive 4h training jobs until 100k steps were reached (~1 day on 32xA100 80GB). If you do the same, you will need to set a different random seed (train_ds.seed) for each of those jobs. You might also want to remove the text_field and lang_field options if you have standard NeMo ASR manifests (this run used some custom-formatted data).
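
One way to vary the seed across the consecutive jobs is to pass it as a Hydra-style override on the command line. The sketch below only builds the commands. The helper name, the base seed, and the assumption that the training script accepts `data.train_ds.seed=...` as an override (mirroring the `data.train_ds.seed` key in the config above) are illustrative and not taken from the actual launch scripts.

```python
from typing import List

# Path of the NeMo example script mentioned at the top of this thread.
TRAIN_SCRIPT = "examples/speechlm2/salm_train.py"


def build_job_commands(num_jobs: int, base_seed: int = 42) -> List[List[str]]:
    """Return one command per consecutive job, each with a distinct data seed."""
    commands = []
    for job_idx in range(num_jobs):
        # Any scheme works as long as every job gets a different seed.
        seed = base_seed + job_idx
        commands.append(["python", TRAIN_SCRIPT, f"data.train_ds.seed={seed}"])
    return commands


if __name__ == "__main__":
    for cmd in build_job_commands(num_jobs=3):
        print(" ".join(cmd))
```

With resume_if_exists: true in exp_manager, each job should pick up the latest checkpoint from the log directory, so only the data seed needs to change between jobs.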

If you wish to fine-tune, you will need to set pretrained_weights: False and add 1 LOC to manually load the canary-qwen pretrained weights before training starts (this capability isn't yet exposed in the YAML config). Hopefully this helps you get started.
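
A guess at what that extra line could look like, written as a small helper around the standard PyTorch state_dict API. The helper name and the use of strict=False are assumptions, and the exact place to call it in the training script is not specified in this thread. It also assumes that the model built from the YAML has the same parameter names as the released checkpoint.

```python
def init_from_pretrained(model, pretrained, strict: bool = False):
    """Copy weights from `pretrained` into `model`.

    Both objects only need the usual torch.nn.Module methods
    `state_dict()` and `load_state_dict()`.
    """
    return model.load_state_dict(pretrained.state_dict(), strict=strict)


# Intended use inside a training script (assumes NeMo is installed and that
# `model` is the SALM built from the YAML with pretrained_weights: False):
#
#   from nemo.collections.speechlm2.models import SALM
#   init_from_pretrained(model, SALM.from_pretrained("nvidia/canary-qwen-2.5b"))
```

With strict=False, PyTorch reports missing and unexpected keys instead of raising, which makes it easier to spot a mismatch between the training config and the checkpoint.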

I'll try to push a complete example/tutorial but I can't make any promises about the timeline for this.
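
To make the bucketing settings in the config above more concrete, here is a rough sketch of how an example might be mapped to a bucket. It assumes that, with use_multimodal_sampling enabled, lengths are measured in tokens, that audio contributes one token per token_equivalent_duration seconds (rounded up), and that each bin value is an inclusive upper bound. The function names and the text token count in the demo are hypothetical, and the real sampler may differ in details.

```python
import bisect
import math
from typing import Optional, Tuple

# Values copied from data.train_ds in the config above.
TOKEN_EQUIVALENT_DURATION = 0.08
BUCKET_BINS = [99, 110, 117, 124, 184, 247, 324, 391, 457, 520, 555, 579, 600, 618, 638, 1024]
BUCKET_BATCH_SIZE = [69, 64, 60, 40, 28, 22, 16, 14, 12, 11, 10, 6, 4, 4, 4, 2]


def audio_tokens(duration_s: float) -> int:
    """Number of tokens an audio segment counts for (assumed: rounded up)."""
    return math.ceil(duration_s / TOKEN_EQUIVALENT_DURATION)


def bucket_for(total_tokens: int) -> Optional[Tuple[int, int]]:
    """Return (bucket index, batch size), or None if longer than the last bin."""
    idx = bisect.bisect_left(BUCKET_BINS, total_tokens)
    if idx >= len(BUCKET_BINS):
        return None  # would exceed max_tokens and be filtered out
    return idx, BUCKET_BATCH_SIZE[idx]


if __name__ == "__main__":
    text_tokens = 20  # hypothetical prompt + transcript length
    total = audio_tokens(3.335) + text_tokens
    print(total, bucket_for(total))
```

Under these assumptions a 3.335 s utterance counts as 42 audio tokens, so short utterances land in the first bucket with the largest batch size, while examples near 1024 tokens are batched two at a time.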

Hey, was an example ever released? I'm interested in pairing canary 1b v2 with Qwen3-1.7B.

No, we did not end up releasing an example for this. However, there's a pending PR (soon to be merged) adding NeMo Automodel integration that allows easier scaling of Speech LLM training, and it has both documentation and a tutorial as a part of it. Please note it's subject to change as it's a pre-release version of NeMo Speech though.

https://github.com/NVIDIA-NeMo/NeMo/pull/15447

Thanks, I'll take a look. I have another question regarding the speechlm2 projection layer. The canary-qwen-2.5b model card says the model was trained on 234k hours. Is that 234k hours for the projection layer? If so, does it actually need that much training, given that the pretrained models (canary 1b) didn't have that much training themselves?

Also, for multilingual support, does the training have to be multilingual, given that the projection layer isn't really responsible for the multilingual part?

For canary-qwen-2.5b, the encoder + projection + LoRA on LLM were all trainable, while base LLM weights were frozen.

You can train just the projection layer with smaller data and get decent results in a small number of training steps (1-10k), but in my experience not as good as when more parameters are trainable. For canary-qwen it was around 6% Open ASR WER when I did that. However, in my observations the projection layer encodes a lot of task/data bias, so I expect multilingual ASR to work but be somewhat unreliable if you don't use any multilingual data to train it.
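
The two setups can be compared by checking parameter names against the freeze_params patterns from the config earlier in this thread. This is a minimal sketch: the parameter names are made up for illustration, and the helper only approximates how the patterns and prevent_freeze_params are described in the config comments, not how NeMo applies them internally.

```python
import re
from typing import Iterable, Sequence

# Patterns as they appear in the released config (after YAML unescaping).
RELEASED = [r"^llm\..+$", r"^embed_tokens\..+$"]
# Projection-only training: additionally freeze the pretrained ASR parts,
# i.e. the two patterns that are commented out in the released config.
PROJECTION_ONLY = RELEASED + [
    r"^perception\.preprocessor\..+$",
    r"^perception\.encoder\..+$",
]


def is_frozen(name: str, freeze: Sequence[str], prevent: Iterable[str] = ()) -> bool:
    """True if `name` matches a freeze pattern and no prevent-freeze pattern."""
    if any(re.match(p, name) for p in prevent):
        return False
    return any(re.match(p, name) for p in freeze)


if __name__ == "__main__":
    # Hypothetical parameter names, for illustration only.
    names = [
        "llm.layers.0.self_attn.q_proj.weight",
        "embed_tokens.weight",
        "perception.encoder.layers.0.weight",
        "perception.modality_adapter.weight",
    ]
    for name in names:
        print(name, is_frozen(name, RELEASED), is_frozen(name, PROJECTION_ONLY))
```

In the projection-only variant the lora section would presumably also be removed, so that the adapter inside the perception module is the only trainable part.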

Is it normal that the tutorial in that branch, using Qwen 1.7B (originally Nemotron) + canary 1b flash, outputs total garbage for any audio you give it? It returns the same kind of answer every time, repeating the same words over and over.

Will that get resolved with a larger dataset? The tutorial used only 5 hours of audio.

{"id": "5895-34615-0000-42", "duration": 3.335, "text": "but is laughter a synonym of joy", "pred_text": "and the rest of the time he was in the house"}
{"id": "5895-34615-0001-43", "duration": 3.305, "text": "such perfect completeness is not in nature", "pred_text": "and the rest of the time he was in the house"}
{"id": "3536-23268-0004-4", "duration": 3.13, "text": "i hope miss milner you pass this evening at home", "pred_text": "and the rest of the time he was in the house"}
{"id": "5895-34615-0008-50", "duration": 3.055, "text": "the outside did not depend on the interior", "pred_text": "and the rest of the time he was in the house"}
{"id": "5895-34615-0009-51", "duration": 2.78, "text": "no one could escape from this rictus", "pred_text": "and the rest of the time he was in the house"}
{"id": "5895-34615-0011-53", "duration": 2.76, "text": "an everlasting laugh", "pred_text": "and the rest of the time he was in the house"}
{"id": "5895-34615-0006-48", "duration": 2.525, "text": "he showed himself on the platform", "pred_text": "and the rest of the time he was in the house"}
{"id": "5895-34615-0005-47", "duration": 2.495, "text": "gwynplaine was a mountebank", "pred_text": "and the rest of the time he was in the house"}

It should be able to converge on train-clean-5 within ~3k steps, and even output reasonable predictions on dev-clean-2, but don't expect fireworks on in-the-wild data.
If you predict on training data and get garbage, it means it hasn't converged well - experiment with larger batch sizes / gradient accumulation and learning rates.
Make sure that you changed the prompt format to qwen.
And yeah, generally more data helps, though if you can't get it to converge with train-clean-5 when only the modality projector is trainable, that indicates some issue or misconfiguration.
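
A quick way to check for this kind of failure on a prediction manifest is to measure how often the most common prediction occurs and to compute the corpus WER. The sketch below uses only the standard library and two of the lines posted above as sample input. The function names are made up for illustration, and the WER here is a plain word-level edit distance without any text normalization.

```python
import json
from collections import Counter

SAMPLE = """\
{"id": "5895-34615-0000-42", "duration": 3.335, "text": "but is laughter a synonym of joy", "pred_text": "and the rest of the time he was in the house"}
{"id": "5895-34615-0011-53", "duration": 2.76, "text": "an everlasting laugh", "pred_text": "and the rest of the time he was in the house"}
"""


def load_records(jsonl_text):
    """Parse a JSON-lines manifest into a list of dicts."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def top_prediction_share(records):
    """Fraction of utterances that share the single most common prediction."""
    counts = Counter(r["pred_text"] for r in records)
    return counts.most_common(1)[0][1] / len(records)


def word_errors(ref, hyp):
    """Word-level Levenshtein distance between two strings."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (rw != hw)))
        prev = cur
    return prev[-1]


def corpus_wer(records):
    """Total word errors divided by total reference words."""
    errors = sum(word_errors(r["text"], r["pred_text"]) for r in records)
    ref_words = sum(len(r["text"].split()) for r in records)
    return errors / ref_words


if __name__ == "__main__":
    records = load_records(SAMPLE)
    print("top prediction share:", top_prediction_share(records))
    print("corpus WER:", corpus_wer(records))
```

A top prediction share close to 1.0 on the training set means the model is ignoring the audio, which points to a convergence or configuration problem rather than a lack of data.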
