FenomAI np-deploys committed on
Commit 9bc2997 · 0 Parent(s):

Duplicate from np-deploys/Qwen3-TTS-12Hz-0.6B-CustomVoice


Co-authored-by: Jose Cantero <np-deploys@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,98 @@
+ ---
+ license: apache-2.0
+ pipeline_tag: text-to-speech
+ language:
+ - zh
+ - en
+ - ja
+ - ko
+ - de
+ - fr
+ - ru
+ - pt
+ - es
+ - it
+ tags:
+ - tts
+ - qwen
+ - audio
+ arxiv: 2601.15621
+ ---
+
+ # Qwen3-TTS-12Hz-0.6B-CustomVoice
+
+ [Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS) is a series of advanced multilingual, controllable, robust, and streaming text-to-speech models developed by the Qwen team.
+
+ This specific checkpoint is the **0.6B CustomVoice** variant, based on the **12Hz** tokenizer. It supports 9 premium timbres and allows for fine-grained style control over target voices via natural language instructions across 10 major languages.
+
+ - **Paper:** [Qwen3-TTS Technical Report](https://huggingface.co/papers/2601.15621)
+ - **GitHub:** [QwenLM/Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS)
+ - **Demo:** [Hugging Face Spaces](https://huggingface.co/spaces/Qwen/Qwen3-TTS)
+
+ ## Key Features
+ * **Multilingual Synthesis**: Supports Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.
+ * **Intelligent Control**: Adapts tone, rhythm, and emotional expression based on natural language instructions (e.g., "Speak in a very happy tone").
+ * **Low Latency**: Optimized for streaming generation with the Qwen3-TTS-Tokenizer-12Hz, achieving end-to-end synthesis latency as low as 97ms.
+
+ ## Quickstart
+
+ To use Qwen3-TTS, you can install the `qwen-tts` package:
+
+ ```bash
+ pip install -U qwen-tts
+ ```
+
+ ### Sample Usage
+
+ ```python
+ import torch
+ import soundfile as sf
+ from qwen_tts import Qwen3TTSModel
+
+ # Load the model
+ model = Qwen3TTSModel.from_pretrained(
+     "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
+     device_map="cuda:0",
+     dtype=torch.bfloat16,
+     attn_implementation="flash_attention_2",
+ )
+
+ # Generate speech with specific instructions
+ wavs, sr = model.generate_custom_voice(
+     text="其实我真的有发现,我是一个特别善于观察别人情绪的人。",
+     language="Chinese",
+     speaker="Vivian",
+     instruct="用特别愤怒的语气说",
+ )
+
+ # Save the generated audio
+ sf.write("output_custom_voice.wav", wavs[0], sr)
+ ```
+
+ ## Supported Speakers
+
+ For `Qwen3-TTS-12Hz-0.6B-CustomVoice`, the following speakers are supported. We recommend using each speaker's native language for the best results:
+
+ | Speaker | Voice Description | Native Language |
+ | --- | --- | --- |
+ | Vivian | Bright young female voice. | Chinese |
+ | Serena | Warm, gentle young female voice. | Chinese |
+ | Uncle_Fu | Seasoned male voice, mellow timbre. | Chinese |
+ | Dylan | Youthful Beijing male voice. | Chinese (Beijing) |
+ | Eric | Lively Chengdu male voice. | Chinese (Sichuan) |
+ | Ryan | Dynamic male voice with rhythm. | English |
+ | Aiden | Sunny American male voice. | English |
+ | Ono_Anna | Playful Japanese female voice. | Japanese |
+ | Sohee | Warm Korean female voice. | Korean |
+
+ ## Citation
+ If you find Qwen3-TTS useful for your research, please consider citing:
+
+ ```bibtex
+ @article{Qwen3-TTS,
+   title={Qwen3-TTS Technical Report},
+   author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
+   journal={arXiv preprint arXiv:2601.15621},
+   year={2026}
+ }
+ ```
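The speaker table in the README above pairs each voice with a recommended native language. A small guard like the following can catch mismatched `speaker`/`language` arguments before calling `generate_custom_voice`; the mapping mirrors the model card's table, while the helper itself is illustrative and not part of the `qwen-tts` package:

```python
# Speaker -> recommended native language, taken from the model card's table.
# The dialect speakers (Dylan, Eric) are still Chinese-language voices.
NATIVE_LANGUAGE = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese",
    "Eric": "Chinese",
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}


def is_native_pair(speaker: str, language: str) -> bool:
    """Return True when `language` is the recommended language for `speaker`."""
    if speaker not in NATIVE_LANGUAGE:
        raise ValueError(f"unknown speaker: {speaker!r}")
    return NATIVE_LANGUAGE[speaker] == language
```

Other languages still work (the model is multilingual), but per the README the native pairing tends to give the best results.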
config.json ADDED
@@ -0,0 +1,183 @@
+ {
+   "architectures": [
+     "Qwen3TTSForConditionalGeneration"
+   ],
+   "assistant_token_id": 77091,
+   "im_end_token_id": 151645,
+   "im_start_token_id": 151644,
+   "tts_bos_token_id": 151672,
+   "tts_eos_token_id": 151673,
+   "tts_pad_token_id": 151671,
+   "model_type": "qwen3_tts",
+   "tokenizer_type": "qwen3_tts_tokenizer_12hz",
+   "tts_model_size": "0b6",
+   "tts_model_type": "custom_voice",
+   "talker_config": {
+     "attention_bias": false,
+     "attention_dropout": 0,
+     "code_predictor_config": {
+       "_name_or_path": "",
+       "add_cross_attention": false,
+       "architectures": null,
+       "attention_bias": false,
+       "attention_dropout": 0,
+       "bad_words_ids": null,
+       "begin_suppress_tokens": null,
+       "bos_token_id": null,
+       "chunk_size_feed_forward": 0,
+       "cross_attention_hidden_size": null,
+       "decoder_start_token_id": null,
+       "diversity_penalty": 0.0,
+       "do_sample": false,
+       "early_stopping": false,
+       "encoder_no_repeat_ngram_size": 0,
+       "eos_token_id": null,
+       "exponential_decay_length_penalty": null,
+       "finetuning_task": null,
+       "forced_bos_token_id": null,
+       "forced_eos_token_id": null,
+       "head_dim": 128,
+       "hidden_act": "silu",
+       "hidden_size": 1024,
+       "id2label": {
+         "0": "LABEL_0",
+         "1": "LABEL_1"
+       },
+       "initializer_range": 0.02,
+       "intermediate_size": 3072,
+       "is_decoder": false,
+       "is_encoder_decoder": false,
+       "label2id": {
+         "LABEL_0": 0,
+         "LABEL_1": 1
+       },
+       "layer_types": [
+         "full_attention",
+         "full_attention",
+         "full_attention",
+         "full_attention",
+         "full_attention"
+       ],
+       "length_penalty": 1.0,
+       "max_length": 20,
+       "max_position_embeddings": 65536,
+       "max_window_layers": 28,
+       "min_length": 0,
+       "model_type": "qwen3_tts_talker_code_predictor",
+       "no_repeat_ngram_size": 0,
+       "num_attention_heads": 16,
+       "num_beam_groups": 1,
+       "num_beams": 1,
+       "num_code_groups": 16,
+       "num_hidden_layers": 5,
+       "num_key_value_heads": 8,
+       "num_return_sequences": 1,
+       "output_attentions": false,
+       "output_hidden_states": false,
+       "output_scores": false,
+       "pad_token_id": null,
+       "prefix": null,
+       "problem_type": null,
+       "pruned_heads": {},
+       "remove_invalid_values": false,
+       "repetition_penalty": 1.0,
+       "return_dict": true,
+       "return_dict_in_generate": false,
+       "rms_norm_eps": 1e-06,
+       "rope_scaling": null,
+       "rope_theta": 1000000,
+       "sep_token_id": null,
+       "sliding_window": null,
+       "suppress_tokens": null,
+       "task_specific_params": null,
+       "temperature": 1.0,
+       "tf_legacy_loss": false,
+       "tie_encoder_decoder": false,
+       "tie_word_embeddings": false,
+       "tokenizer_class": null,
+       "top_k": 50,
+       "top_p": 1.0,
+       "dtype": null,
+       "torchscript": false,
+       "typical_p": 1.0,
+       "use_bfloat16": false,
+       "use_cache": true,
+       "use_sliding_window": false,
+       "vocab_size": 2048
+     },
+     "codec_bos_id": 2149,
+     "codec_eos_token_id": 2150,
+     "codec_think_id": 2154,
+     "codec_language_id": {
+       "chinese": 2055,
+       "english": 2050,
+       "german": 2053,
+       "italian": 2070,
+       "portuguese": 2071,
+       "spanish": 2054,
+       "japanese": 2058,
+       "korean": 2064,
+       "french": 2061,
+       "russian": 2069,
+       "beijing_dialect": 2074,
+       "sichuan_dialect": 2062
+     },
+     "codec_nothink_id": 2155,
+     "codec_pad_id": 2148,
+     "codec_think_bos_id": 2156,
+     "codec_think_eos_id": 2157,
+     "spk_id": {
+       "serena": 3066,
+       "vivian": 3065,
+       "uncle_fu": 3010,
+       "ryan": 3061,
+       "aiden": 2861,
+       "ono_anna": 2873,
+       "sohee": 2864,
+       "eric": 2875,
+       "dylan": 2878
+     },
+     "spk_is_dialect": {
+       "serena": false,
+       "vivian": false,
+       "uncle_fu": false,
+       "ryan": false,
+       "aiden": false,
+       "ono_anna": false,
+       "sohee": false,
+       "eric": "sichuan_dialect",
+       "dylan": "beijing_dialect"
+     },
+     "head_dim": 128,
+     "hidden_act": "silu",
+     "hidden_size": 1024,
+     "initializer_range": 0.02,
+     "intermediate_size": 3072,
+     "max_position_embeddings": 32768,
+     "model_type": "qwen3_tts_talker",
+     "num_attention_heads": 16,
+     "num_code_groups": 16,
+     "num_hidden_layers": 28,
+     "num_key_value_heads": 8,
+     "position_id_per_seconds": 13,
+     "rms_norm_eps": 1e-06,
+     "rope_scaling": {
+       "interleaved": true,
+       "mrope_section": [
+         24,
+         20,
+         20
+       ],
+       "rope_type": "default",
+       "type": "default"
+     },
+     "rope_theta": 1000000,
+     "sliding_window": null,
+     "text_hidden_size": 2048,
+     "text_vocab_size": 151936,
+     "use_cache": true,
+     "use_sliding_window": false,
+     "vocab_size": 3072
+   },
+   "transformers_version": "4.57.3"
+ }
generation_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "do_sample": true,
+   "repetition_penalty": 1.05,
+   "temperature": 0.9,
+   "top_p": 1.0,
+   "top_k": 50,
+   "subtalker_dosample": true,
+   "subtalker_temperature": 0.9,
+   "subtalker_top_p": 1.0,
+   "subtalker_top_k": 50,
+   "max_new_tokens": 8192
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bc3c7e785eb961179c25450d1acff03f839e0002f2f3a5aeb67b5735c0fa2adb
+ size 1811626576
preprocessor_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "padding_side": "left",
+   "padding_value": 0.0,
+   "processor_class": "Qwen3TTSProcessor",
+   "return_attention_mask": true
+ }
speech_tokenizer/config.json ADDED
@@ -0,0 +1,94 @@
+ {
+   "architectures": [
+     "Qwen3TTSTokenizerV2Model"
+   ],
+   "model_type": "qwen3_tts_tokenizer_12hz",
+   "encoder_valid_num_quantizers": 16,
+   "input_sample_rate": 24000,
+   "output_sample_rate": 24000,
+   "decode_upsample_rate": 1920,
+   "encode_downsample_rate": 1920,
+   "decoder_config": {
+     "attention_bias": false,
+     "attention_dropout": 0.0,
+     "latent_dim": 1024,
+     "codebook_dim": 512,
+     "codebook_size": 2048,
+     "decoder_dim": 1536,
+     "hidden_act": "silu",
+     "hidden_size": 512,
+     "intermediate_size": 1024,
+     "layer_scale_initial_scale": 0.01,
+     "max_position_embeddings": 8000,
+     "head_dim": 64,
+     "num_attention_heads": 16,
+     "num_hidden_layers": 8,
+     "num_key_value_heads": 16,
+     "num_quantizers": 16,
+     "num_semantic_quantizers": 1,
+     "rms_norm_eps": 1e-05,
+     "rope_theta": 10000,
+     "semantic_codebook_size": 4096,
+     "sliding_window": 72,
+     "upsample_rates": [
+       8,
+       5,
+       4,
+       3
+     ],
+     "upsampling_ratios": [
+       2,
+       2
+     ],
+     "vector_quantization_hidden_dimension": 512
+   },
+   "encoder_config": {
+     "_frame_rate": 12.5,
+     "attention_bias": false,
+     "attention_dropout": 0.0,
+     "audio_channels": 1,
+     "codebook_dim": 256,
+     "codebook_size": 2048,
+     "compress": 2,
+     "dilation_growth_rate": 2,
+     "dtype": "float32",
+     "head_dim": 64,
+     "hidden_act": "gelu",
+     "hidden_size": 512,
+     "initializer_range": 0.02,
+     "intermediate_size": 2048,
+     "kernel_size": 7,
+     "last_kernel_size": 3,
+     "layer_scale_initial_scale": 0.01,
+     "max_position_embeddings": 8000,
+     "norm_eps": 1e-05,
+     "normalize": false,
+     "num_attention_heads": 8,
+     "num_filters": 64,
+     "num_hidden_layers": 8,
+     "num_key_value_heads": 8,
+     "num_quantizers": 32,
+     "num_residual_layers": 1,
+     "num_semantic_quantizers": 1,
+     "pad_mode": "constant",
+     "residual_kernel_size": 3,
+     "rope_theta": 10000.0,
+     "sampling_rate": 24000,
+     "sliding_window": 250,
+     "transformers_version": "4.57.0.dev0",
+     "trim_right_ratio": 1.0,
+     "upsample_groups": 512,
+     "upsampling_ratios": [
+       8,
+       6,
+       5,
+       4
+     ],
+     "use_cache": false,
+     "use_causal_conv": true,
+     "use_conv_shortcut": false,
+     "use_streaming": false,
+     "vector_quantization_hidden_dimension": 256
+   },
+   "transformers_version": "4.57.3"
+ }
speech_tokenizer/configuration.json ADDED
@@ -0,0 +1 @@
+ {"framework": "pytorch", "task": "feature-extraction", "allow_remote": true}
speech_tokenizer/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:836b7b357f5ea43e889936a3709af68dfe3751881acefe4ecf0dbd30ba571258
+ size 682293092
speech_tokenizer/preprocessor_config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "chunk_length_s": null,
+   "feature_extractor_type": "EncodecFeatureExtractor",
+   "feature_size": 1,
+   "overlap": null,
+   "padding_side": "right",
+   "padding_value": 0.0,
+   "return_attention_mask": true,
+   "sampling_rate": 24000
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,316 @@
+ {
+   "add_bos_token": false,
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "151643": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151644": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151645": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151646": {
+       "content": "<|object_ref_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151647": {
+       "content": "<|object_ref_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151648": {
+       "content": "<|box_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151649": {
+       "content": "<|box_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151650": {
+       "content": "<|quad_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151651": {
+       "content": "<|quad_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151652": {
+       "content": "<|vision_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151653": {
+       "content": "<|vision_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151654": {
+       "content": "<|vision_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151655": {
+       "content": "<|image_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151656": {
+       "content": "<|video_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151657": {
+       "content": "<tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151658": {
+       "content": "</tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151659": {
+       "content": "<|fim_prefix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151660": {
+       "content": "<|fim_middle|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151661": {
+       "content": "<|fim_suffix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151662": {
+       "content": "<|fim_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151663": {
+       "content": "<|repo_name|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151664": {
+       "content": "<|file_sep|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151665": {
+       "content": "<tool_response>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151666": {
+       "content": "</tool_response>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151667": {
+       "content": "<think>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151668": {
+       "content": "</think>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151669": {
+       "content": "<|audio_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151670": {
+       "content": "<|audio_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151671": {
+       "content": "<tts_pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151672": {
+       "content": "<tts_text_bos>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151673": {
+       "content": "<tts_text_eod>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151674": {
+       "content": "<tts_text_bos_single>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151675": {
+       "content": "<|audio_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>",
+     "<|audio_start|>",
+     "<|audio_end|>",
+     "<tts_pad>",
+     "<tts_text_bos>",
+     "<tts_text_bos_single>",
+     "<|audio_pad|>"
+   ],
+   "extra_special_tokens": {
+     "image_token": "<|image_pad|>",
+     "audio_token": "<|audio_pad|>",
+     "video_token": "<|video_pad|>",
+     "vision_bos_token": "<|vision_start|>",
+     "vision_eos_token": "<|vision_end|>",
+     "audio_bos_token": "<|audio_start|>",
+     "audio_eos_token": "<|audio_end|>"
+   },
+   "bos_token": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "model_max_length": 131072,
+   "pad_token": "<|endoftext|>",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null,
+   "image_token": "<|image_pad|>",
+   "audio_token": "<|audio_pad|>",
+   "video_token": "<|video_pad|>",
+   "vision_bos_token": "<|vision_start|>",
+   "vision_eos_token": "<|vision_end|>",
+   "audio_bos_token": "<|audio_start|>",
+   "audio_eos_token": "<|audio_end|>"
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff