denisko Claude Opus 4.6 committed on
Commit
d54d8f4
·
1 Parent(s): bd26458

Add model card for SuperApriel-15b-Base


Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +189 -0
  3. assets/super-apriel.png +3 -0
.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,189 @@
---
license: mit
pipeline_tag: text-generation
library_name: transformers
track_downloads: true
---

# SuperApriel-15b-Base

<img src="assets/super-apriel.png" width="120" alt="thumbnail"/> `/ˈɑː.pri.əl/`

A 15B-parameter **token-mixer supernet** derived from [Apriel-1.6](https://huggingface.co/ServiceNow-AI/Apriel-1.6-15b-Thinker) via stochastic distillation. Every decoder layer exposes **four trained mixer options**—Full Attention, Sliding Window Attention, Gated DeltaNet, and Kimi Delta Attention—enabling flexible architecture selection from a single checkpoint.

- **Model Size:** 15B parameters
- **Layers:** 48 decoder layers, each with 4 mixer variants
- **Context Length:** 262K positions (runtime dependent)
- **Languages:** English (strongest performance)

## Highlights

- **Supernet architecture**: Single checkpoint containing 4 mixer types at every layer, yielding 4⁴⁸ ≈ 7.9 × 10²⁸ possible architectures
- **Four mixer types**: Full Attention (FA), Sliding Window Attention (SWA, window=4096), Gated DeltaNet (GDN), Kimi Delta Attention (KDA)
- **Stage 1 distillation checkpoint**: Trained via stochastic distillation from a frozen Apriel-1.6 teacher on 266B tokens
- **Foundation for fine-tuning**: Use this checkpoint to fine-tune on your own data with targeted placement strategies

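With four mixer choices at each of 48 layers, a placement is simply a length-48 assignment of mixer types, so the architecture count follows directly. A minimal sketch (the helper names here are illustrative, not part of the released API):

```python
import random

MIXERS = ["FA", "SWA", "GDN", "KDA"]  # the four trained mixer types
NUM_LAYERS = 48

# Total number of distinct placements: 4^48
num_placements = len(MIXERS) ** NUM_LAYERS
print(f"{num_placements:.2e}")  # ~7.92e+28

def sample_placement(rng: random.Random) -> list[str]:
    """Uniform local sampling: each layer draws its mixer independently,
    mirroring how mixers were sampled per step during distillation."""
    return [rng.choice(MIXERS) for _ in range(NUM_LAYERS)]

placement = sample_placement(random.Random(0))
```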
26
+ ## Model Overview
27
+
28
+ SuperApriel-15b-Base is the **Stage 1 (distillation) checkpoint** of the Super Apriel supernet. During training, all four mixer types at each layer were trained simultaneously using stochastic local sampling—each layer's mixer was drawn uniformly from the four types at each training step. Only mixer weights were trained; all shared parameters (FFNs, embeddings, layer norms, vision encoder) remain frozen from the Apriel-1.6 teacher.
29
+
30
+ This checkpoint is intended as a **foundation for downstream fine-tuning**. For a ready-to-use model with optimized deployment presets, see [SuperApriel-15b-Instruct](https://huggingface.co/ServiceNow-AI/SuperApriel-15b-Instruct).
31
+
32
+ ### Architecture Details
33
+
34
+ | Component | Details |
35
+ |-----------|---------|
36
+ | Parameters | 15B |
37
+ | Decoder layers | 48 |
38
+ | Query / KV heads | 32 / 8 (grouped-query attention), d_h = 128 |
39
+ | Hidden dimension | 5,120 |
40
+ | FFN width | 14,336 (SiLU-gated) |
41
+ | Vocabulary | 131,072 tokens |
42
+ | Vision encoder | Pixtral (16×16 patches) |
43
+
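The 15B figure can be roughly sanity-checked from the listed dimensions. The sketch below is a back-of-envelope estimate only: it counts one full-attention mixer per layer, assumes untied input and output embeddings, and ignores norms, biases, and the vision encoder.

```python
hidden, layers = 5120, 48
q_heads, kv_heads, d_h = 32, 8, 128
ffn, vocab = 14336, 131072

attn = hidden * q_heads * d_h         # Q projection
attn += 2 * hidden * kv_heads * d_h   # K and V projections (GQA)
attn += q_heads * d_h * hidden        # output projection
ffn_params = 3 * hidden * ffn         # gate, up, down (SiLU-gated)
per_layer = attn + ffn_params

embeddings = 2 * vocab * hidden       # untied input + output embeddings (assumption)
total = layers * per_layer + embeddings
print(f"{total / 1e9:.1f}B")          # ≈ 14.4B; norms and the vision encoder make up the rest
```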
### Mixer Types

| Mixer | Time | Memory | Description |
|-------|------|--------|-------------|
| Full Attention (FA) | O(n²) | O(n) KV cache | Standard grouped-query attention |
| Sliding Window (SWA) | O(w·n) | O(w) | Local window of 4,096 tokens |
| Gated DeltaNet (GDN) | O(n) | O(1) fixed state | Matrix-valued recurrent state with delta rule |
| Kimi Delta Attention (KDA) | O(n) | O(1) fixed state | Linear attention with channel-wise gating |

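To make the memory column concrete, the attention dimensions above (48 layers, 8 KV heads, head size 128) give a quick KV-cache estimate. This is a rough sketch assuming bf16 (2 bytes per value) and a uniform placement where every layer uses the same mixer; mixed placements sum per-layer, and GDN/KDA layers instead carry a fixed-size recurrent state.

```python
def kv_cache_bytes(seq_len: int, layers: int = 48, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes for the K and V caches across all layers (bf16 assumed)."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

GiB = 1024 ** 3
full = kv_cache_bytes(262_144)   # all-FA at the 262K context limit
swa = kv_cache_bytes(4_096)      # all-SWA: cache capped at the 4,096-token window
print(full / GiB, swa / GiB)     # 48.0 vs 0.75 GiB
```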
### Training Details

- **Objective**: Stochastic distillation from a frozen Apriel-1.6 teacher
- **Losses**: Activation matching (𝓛_act), forward KL (weight 0.1), reverse KL (weight 0.9)
- **Data**: 266B tokens, a curated mixture focused on reasoning and domain-specific data
- **Sampling**: Uniform local sampling (each layer independently samples a mixer type)
- **Compute**: Up to 192 H100 GPUs
- **Training framework**: [Fast-LLM](https://github.com/ServiceNow/Fast-LLM)

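The weighted objective above can be illustrated with a toy, pure-Python sketch. This is for intuition only and is not the Fast-LLM implementation; real training applies these terms to model logits and hidden activations.

```python
import math

def kl(p: list[float], q: list[float]) -> float:
    """KL(p || q) over toy probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(p_teacher, p_student, act_loss: float,
                 w_fwd: float = 0.1, w_rev: float = 0.9) -> float:
    """Weighted objective from the card: activation matching plus
    forward KL (teacher || student, weight 0.1) and
    reverse KL (student || teacher, weight 0.9)."""
    return act_loss + w_fwd * kl(p_teacher, p_student) + w_rev * kl(p_student, p_teacher)

p_t = [0.7, 0.2, 0.1]  # toy teacher distribution
p_s = [0.6, 0.3, 0.1]  # toy student distribution
print(round(distill_loss(p_t, p_s, act_loss=0.0), 4))
```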
## How to Use

Install dependencies:

```bash
pip install transformers
```

> **🔴 TODO: There is currently no mechanism to select a placement when using Transformers directly. The model defaults to the all-attention preset during inference. Placement selection requires vLLM with the Fast-LLM plugin (see below). We need to add a Transformers API for placement switching.**

Basic usage with Transformers (all-attention preset):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ServiceNow-AI/SuperApriel-15b-Base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=64)
output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(output)
```

> **Note:** This base model requires `trust_remote_code=True` as it uses custom architecture code for the multi-mixer supernet.

## Use with vLLM

> **🔴 TODO: Confirm the exact vLLM plugin source (Fast-LLM branch/tag?), installation steps, and the CLI/API for selecting a placement. The instructions below are placeholders.**

The supernet is served via a vLLM plugin implemented in [Fast-LLM](https://github.com/ServiceNow/Fast-LLM). Two serving modes are available:

- **`single-preset` mode**: Only the weights of a single selected mixer placement are loaded. Inactive mixer weights are offloaded to CPU, so the GPU memory footprint and throughput match a dedicated single-placement model.
- **`supernet` mode**: The full supernet is loaded into memory, enabling placement switching at a per-request level (5–15 s switch time, depending on how many layers change mixer type).

### Installation

```bash
uv venv --python 3.12 --seed
source .venv/bin/activate

git clone git@github.com:ServiceNow/Fast-LLM.git
cd Fast-LLM
uv pip install vllm==0.10.2 --torch-backend=auto
pip install .
```

### Running a vLLM Server

```bash
vllm serve ServiceNow-AI/SuperApriel-15b-Base \
  --port 8000 \
  --trust-remote-code
```

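Once the server is up, stock vLLM exposes an OpenAI-compatible HTTP API; assuming the plugin keeps that behavior (unconfirmed, per the TODO above), a completion request can be sketched as follows. The helper only builds the request; placement selection is not covered here.

```python
import json
import urllib.request

def build_completion_request(prompt: str, base_url: str = "http://localhost:8000"):
    """Build a request for vLLM's standard OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": "ServiceNow-AI/SuperApriel-15b-Base",
        "prompt": prompt,
        "max_tokens": 64,
    }
    return f"{base_url}/v1/completions", payload

url, payload = build_completion_request("The capital of France is")

# Against a running server (uses the imports above):
# req = urllib.request.Request(url, data=json.dumps(payload).encode(),
#                              headers={"Content-Type": "application/json"})
# print(json.loads(urllib.request.urlopen(req).read())["choices"][0]["text"])
```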
## Intended Use

SuperApriel-15b-Base is designed as a **foundation checkpoint** for:

- Fine-tuning with custom placement strategies on domain-specific data
- Research on hybrid architectures and mixer placement optimization
- Placement search and Pareto frontier exploration using the optimization toolkit

It is **not intended** for direct deployment without further fine-tuning or for safety-critical applications without human oversight.

## Limitations

- **Factual accuracy:** May produce incorrect, misleading, or outdated content. Outputs should be verified before use in critical contexts.
- **Bias:** May reflect societal, cultural, or systemic biases present in training data.
- **Ethics:** Do not use the model to produce harmful, unlawful, or unethical content.
- **Language:** Strongest performance is in English. Output quality may degrade in underrepresented languages.
- **Critical use:** Not suitable for medical, legal, financial, or other high-risk applications without safeguards.
- **Base model:** This is a distillation checkpoint without instruction tuning. For instruction-following use cases, see [SuperApriel-15b-Instruct](https://huggingface.co/ServiceNow-AI/SuperApriel-15b-Instruct).

## Security and Responsible Use

**Security Responsibilities:**
Deployers and users are strongly encouraged to align their security practices with established frameworks and regulatory guidelines such as the EU AI Act and the NIST AI Risk Management Framework (RMF).

**Guidelines for Deployers:**

- Regularly conduct robustness assessments to identify and mitigate adversarial inputs.
- Implement validation and filtering processes to prevent harmful or biased outputs.
- Continuously perform data privacy checks to guard against unintended data leaks.
- Document and communicate the model's limitations, intended usage, and known security risks to all end-users.
- Schedule periodic security reviews and updates to address emerging threats and vulnerabilities.

**Guidelines for Users:**

- Follow established security policies and usage guidelines provided by deployers.
- Protect and manage sensitive information when interacting with the model.
- Report anomalies, suspicious behavior, or unsafe outputs to deployers or developers.
- Maintain human oversight and apply judgment to mitigate potential security or ethical risks during interactions.

**Disclaimer:**
Users accept responsibility for securely deploying, managing, and using this open-source LLM. The model is provided "as-is," without explicit or implied warranty regarding security or fitness for any specific application or environment.

## Software

- **Training stack:** [Fast-LLM](https://github.com/ServiceNow/Fast-LLM)
- **Serving:** [Fast-LLM vLLM plugin](https://github.com/ServiceNow/Fast-LLM)

## License

MIT

## Citation

```bibtex
@misc{super_apriel_2025,
  title         = {Super Apriel: One Checkpoint, Many Speeds},
  author        = {ServiceNow Language Models Lab},
  year          = {2026},
  eprint        = {TODO},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
```
assets/super-apriel.png ADDED

Git LFS Details

  • SHA256: 509bc1acfa435ef526f0a06d408009f20b9d2cdb6f2ac1e236fdf2b3a2a46be5
  • Pointer size: 132 Bytes
  • Size of remote file: 1.58 MB