mouddane commited on
Commit
54790f9
·
0 Parent(s):

Duplicate from mouddane/granite-speech-4.1-2b-nar-mlx

Browse files
.gitattributes ADDED
@@ -0,0 +1,36 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ multilingual_sample.wav filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,56 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ library_name: mlx
3
+ license: apache-2.0
4
+ base_model: ibm-granite/granite-speech-4.1-2b-nar
5
+ language: [en, fr, de, es, pt]
6
+ pipeline_tag: automatic-speech-recognition
7
+ tags:
8
+ - mlx
9
+ - mlx-audio
10
+ - speech-to-text
11
+ - non-autoregressive
12
+ - granite
13
+ ---
14
+
15
+ # Granite Speech 4.1 2B NAR — MLX
16
+
17
+ MLX port of [`ibm-granite/granite-speech-4.1-2b-nar`](https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar) for Apple Silicon. Runs via [mlx-audio](https://github.com/Blaizzy/mlx-audio).
18
+
19
+ ## Architecture
20
+
21
+ Non-autoregressive ASR via CTC + bidirectional LM editing:
22
+
23
+ 1. **16-layer Conformer encoder** (543M params) produces an initial BPE CTC hypothesis.
24
+ 2. **2-layer windowed Q-Former projector** (80M params) converts multi-layer encoder states into audio embeddings.
25
+ 3. **40-layer bidirectional Granite editor** (1.6B params) takes `[audio | hypothesis_tokens]` and emits edited logits in a single forward pass — no autoregression, no KV cache.
26
+ 4. Final CTC collapse on text-position logits yields the transcript.
27
+
28
+ Total: ~2.25B params, bf16.
29
+
30
+ ## Quickstart
31
+
32
+ ```python
33
+ from pathlib import Path
34
+ from mlx_audio.stt.utils import load_model
35
+
36
+ model = load_model(Path("mlx-community/granite-speech-4.1-2b-nar-mlx"))
37
+ out = model.generate("audio.wav")
38
+ print(out.text)
39
+ ```
40
+
41
+ ## Limitations
42
+
43
+ - Batch size 1.
44
+ - bf16 baseline only — no quantized variants yet.
45
+ - No streaming inference.
46
+ - macOS 14+, Apple Silicon.
47
+
48
+ ## Reference
49
+
50
+ Upstream model card: https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar
51
+
52
+ Validated against the upstream PyTorch reference: exact 44-token match and exact transcript string on the example wav.
53
+
54
+ ## License
55
+
56
+ Apache-2.0, matching the upstream model.
chat_template.jinja ADDED
@@ -0,0 +1,118 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {%- set tools_system_message_prefix = 'You are a helpful assistant with access to the following tools. You may call one or more tools to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>' %}
2
+ {%- set tools_system_message_suffix = '\n</tools>\n\nFor each tool call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>. If a tool does not exist in the provided list of tools, notify the user that you do not have the ability to fulfill the request.' %}
3
+ {%- set documents_system_message_prefix = 'You are a helpful assistant with access to the following documents. You may use one or more documents to assist with the user query.\n\nYou are given a list of documents within <documents></documents> XML tags:\n<documents>' %}
4
+ {%- set documents_system_message_suffix = '\n</documents>\n\nWrite the response to the user\'s input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.' %}
5
+ {%- set g4_default_system_message = 'You are a helpful assistant. Please ensure responses are professional, accurate, and safe.' %}
6
+ {%- if available_tools is defined and available_tools %}
7
+ {%- set tools = available_tools %}
8
+ {%- endif %}
9
+ {%- set ns = namespace(tools_system_message=tools_system_message_prefix,
10
+ documents_system_message=documents_system_message_prefix,
11
+ default_system_message=g4_default_system_message,
12
+ system_message=''
13
+ ) %}
14
+ {%- if tools %}
15
+ {%- for tool in tools %}
16
+ {%- set ns.tools_system_message = ns.tools_system_message + '\n' + (tool | tojson) %}
17
+ {%- endfor %}
18
+ {%- set ns.tools_system_message = ns.tools_system_message + tools_system_message_suffix %}
19
+ {%- else %}
20
+ {%- set ns.tools_system_message = '' %}
21
+ {%- endif %}
22
+ {%- if documents %}
23
+ {%- for document in documents %}
24
+ {%- set ns.documents_system_message = ns.documents_system_message + '\n' + (document | tojson) %}
25
+ {%- endfor %}
26
+ {%- set ns.documents_system_message = ns.documents_system_message + documents_system_message_suffix %}
27
+ {%- else %}
28
+ {%- set ns.documents_system_message = '' %}
29
+ {%- endif %}
30
+ {%- if messages[0].role == 'system' %}
31
+ {%- if messages[0].content is string %}
32
+ {%- set ns.system_message = messages[0].content %}
33
+ {%- elif messages[0].content is iterable %}
34
+ {%- for entry in messages[0].content %}
35
+ {%- if entry.type== 'text' %}
36
+ {%- if ns.system_message != '' %}
37
+ {%- set ns.system_message = ns.system_message + '\n' %}
38
+ {%- endif %}
39
+ {%- set ns.system_message = ns.system_message + entry.text %}
40
+ {%- endif %}
41
+ {%- endfor %}
42
+ {%- endif %}
43
+ {%- if tools and documents %}
44
+ {%- set ns.system_message = ns.system_message + '\n\n' + ns.tools_system_message + '\n\n' + ns.documents_system_message %}
45
+ {%- elif tools %}
46
+ {%- set ns.system_message = ns.system_message + '\n\n' + ns.tools_system_message %}
47
+ {%- elif documents %}
48
+ {%- set ns.system_message = ns.system_message + '\n\n' + ns.documents_system_message %}
49
+ {%- endif %}
50
+ {%- else %}
51
+ {%- if tools and documents %}
52
+ {%- set ns.system_message = ns.tools_system_message + '\n\n' + ns.documents_system_message %}
53
+ {%- elif tools %}
54
+ {%- set ns.system_message = ns.tools_system_message %}
55
+ {%- elif documents %}
56
+ {%- set ns.system_message = ns.documents_system_message %}
57
+ {%- endif %}
58
+ {%- endif %}
59
+ {%- if ns.system_message %}
60
+ {{- '<|start_of_role|>system<|end_of_role|>' + ns.system_message + '<|end_of_text|>\n' }}
61
+ {%- else %}
62
+ {{- '<|start_of_role|>system<|end_of_role|>' + ns.default_system_message + '<|end_of_text|>\n' }}
63
+ {%- endif %}
64
+ {%- for message in messages %}
65
+ {%- set content = namespace(val='') %}
66
+ {%- if message.content is string %}
67
+ {%- set content.val = message.content %}
68
+ {%- else %}
69
+ {%- if message.content is iterable %}
70
+ {%- for entry in message.content %}
71
+ {%- if entry.type== 'text' %}
72
+ {%- if content.val != '' %}
73
+ {%- set content.val = content.val + '\n' %}
74
+ {%- endif %}
75
+ {%- set content.val = content.val + entry.text %}
76
+ {%- endif %}
77
+ {%- endfor %}
78
+ {%- endif %}
79
+ {%- endif %}
80
+ {%- if (message.role == 'user') or (message.role == 'system' and not loop.first) %}
81
+ {{- '<|start_of_role|>' + message.role + '<|end_of_role|>' + content.val + '<|end_of_text|>\n' }}
82
+ {%- elif message.role == 'assistant' %}
83
+ {{- '<|start_of_role|>' + message.role + '<|end_of_role|>' + content.val }}
84
+ {%- if message.tool_calls %}
85
+ {%- for tool_call in message.tool_calls %}
86
+ {%- if (loop.first and content.val) or (not loop.first) %}
87
+ {{- '\n' }}
88
+ {%- endif %}
89
+ {%- if tool_call.function %}
90
+ {%- set tool_call = tool_call.function %}
91
+ {%- endif %}
92
+ {{- '<tool_call>\n{"name": "' }}
93
+ {{- tool_call.name }}
94
+ {{- '", "arguments": ' }}
95
+ {%- if tool_call.arguments is string %}
96
+ {{- tool_call.arguments }}
97
+ {%- else %}
98
+ {{- tool_call.arguments | tojson }}
99
+ {%- endif %}
100
+ {{- '}\n</tool_call>' }}
101
+ {%- endfor %}
102
+ {%- endif %}
103
+ {{- '<|end_of_text|>\n' }}
104
+ {%- elif message.role == 'tool' %}
105
+ {%- if loop.first or (messages[loop.index0 - 1].role != 'tool') %}
106
+ {{- '<|start_of_role|>user<|end_of_role|>' }}
107
+ {%- endif %}
108
+ {{- '\n<tool_response>\n' }}
109
+ {{- content.val }}
110
+ {{- '\n</tool_response>' }}
111
+ {%- if loop.last or (messages[loop.index0 + 1].role != 'tool') %}
112
+ {{- '<|end_of_text|>\n' }}
113
+ {%- endif %}
114
+ {%- endif %}
115
+ {%- endfor %}
116
+ {%- if add_generation_prompt %}
117
+ {{- '<|start_of_role|>assistant<|end_of_role|>' }}
118
+ {%- endif %}
config.json ADDED
@@ -0,0 +1,94 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "architectures": [
3
+ "GraniteSpeechNarForASR"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "configuration_granite_speech_nar.GraniteSpeechNarConfig",
7
+ "AutoFeatureExtractor": "feature_extraction_granite_speech_nar.GraniteSpeechNarFeatureExtractor",
8
+ "AutoModel": "modeling_granite_speech_nar.GraniteSpeechNarForASR",
9
+ "AutoProcessor": "processing_granite_speech_nar.GraniteSpeechNarProcessor"
10
+ },
11
+ "blank_token_id": 100257,
12
+ "ce_loss_lambda": 0.02,
13
+ "dtype": "bfloat16",
14
+ "encoder_config": {
15
+ "blank_token_id": 100257,
16
+ "bpe_output_dim": 100352,
17
+ "bpe_pooling_window": 4,
18
+ "context_size": 200,
19
+ "conv_expansion_factor": 2,
20
+ "conv_kernel_size": 15,
21
+ "dim_head": 128,
22
+ "dropout": 0.1,
23
+ "feedforward_mult": 4,
24
+ "hidden_dim": 1024,
25
+ "initializer_range": 0.02,
26
+ "input_dim": 160,
27
+ "max_pos_emb": 512,
28
+ "model_type": "granite_speech_nar_encoder",
29
+ "num_heads": 8,
30
+ "num_layers": 16,
31
+ "output_dim": 348,
32
+ "pred_dropout": 0.25,
33
+ "self_conditioning_layer": 8
34
+ },
35
+ "encoder_ctc_loss_lambda": 0.8,
36
+ "encoder_layer_indices": [
37
+ 4,
38
+ 8,
39
+ 12,
40
+ -1
41
+ ],
42
+ "min_edit_sequence_length": 8,
43
+ "model_type": "granite_speech_nar",
44
+ "projector_config": {
45
+ "attn_bias": true,
46
+ "block_size": 15,
47
+ "downsample_rate": 5,
48
+ "dropout_prob": 0.1,
49
+ "encoder_dim": 1024,
50
+ "hidden_size": 2048,
51
+ "layernorm_eps": 1e-06,
52
+ "llm_dim": 2048,
53
+ "mlp_bias": true,
54
+ "mlp_ratio": 2,
55
+ "model_type": "granite_speech_nar_projector",
56
+ "num_encoder_layers": 4,
57
+ "num_heads": 32,
58
+ "num_layers": 2
59
+ },
60
+ "scale_projected_embeddings": true,
61
+ "text_config": {
62
+ "_name_or_path": "ibm-granite/granite-4.0-1b-base",
63
+ "attention_bias": false,
64
+ "attention_dropout": 0.0,
65
+ "attention_multiplier": 0.0078125,
66
+ "bos_token_id": 100257,
67
+ "dtype": "bfloat16",
68
+ "embedding_multiplier": 12,
69
+ "eos_token_id": 100257,
70
+ "hidden_act": "silu",
71
+ "hidden_size": 2048,
72
+ "initializer_range": 0.1,
73
+ "intermediate_size": 4096,
74
+ "logits_scaling": 8,
75
+ "max_position_embeddings": 4096,
76
+ "mlp_bias": false,
77
+ "model_type": "granite",
78
+ "num_attention_heads": 16,
79
+ "num_hidden_layers": 40,
80
+ "num_key_value_heads": 4,
81
+ "pad_token_id": 100256,
82
+ "residual_multiplier": 0.22,
83
+ "rms_norm_eps": 1e-05,
84
+ "rope_parameters": {
85
+ "rope_theta": 10000,
86
+ "rope_type": "default"
87
+ },
88
+ "tie_word_embeddings": true,
89
+ "use_cache": true,
90
+ "vocab_size": 100352
91
+ },
92
+ "tie_word_embeddings": true,
93
+ "transformers_version": "5.8.0.dev0"
94
+ }
generation_config.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ {
2
+ "transformers_version": "5.8.1"
3
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f2ec4215196eaa20c7b816c0526b3012dfb9b0078e545c9ecdb09c601a6c4992
3
+ size 4509376040
multilingual_sample.wav ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:91d243650809c1274141ec20ff23045315eaf27567694002ea3ef390048b7058
3
+ size 1596240
preprocessor_config.json ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "auto_map": {
3
+ "AutoFeatureExtractor": "feature_extraction_granite_speech_nar.GraniteSpeechNarFeatureExtractor"
4
+ },
5
+ "feature_extractor_type": "GraniteSpeechNarFeatureExtractor",
6
+ "hop_length": 160,
7
+ "n_fft": 512,
8
+ "n_mels": 80,
9
+ "sampling_rate": 16000,
10
+ "win_length": 400
11
+ }
processor_config.json ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ {
2
+ "processor_class": "GraniteSpeechNarProcessor",
3
+ "auto_map": {
4
+ "AutoProcessor": "processing_granite_speech_nar.GraniteSpeechNarProcessor"
5
+ }
6
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bos_token": {
3
+ "content": "<|end_of_text|>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|end_of_text|>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "<|pad|>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "unk_token": {
24
+ "content": "<|unk|>",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ }
30
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "add_prefix_space": false,
3
+ "backend": "tokenizers",
4
+ "bos_token": "<|end_of_text|>",
5
+ "clean_up_tokenization_spaces": false,
6
+ "eos_token": "<|end_of_text|>",
7
+ "is_local": true,
8
+ "local_files_only": false,
9
+ "model_max_length": 1000000000000000019884624838656,
10
+ "pad_token": "<|pad|>",
11
+ "padding_side": "left",
12
+ "tokenizer_class": "TokenizersBackend",
13
+ "unk_token": "<|unk|>"
14
+ }