Sync modeling code with Instruct, update README for Base checkpoint

- Sync modeling_apriel2.py from Instruct (adds PatternMixerAdapter for
  pattern-config weight loading, config_class fix for AutoModelForCausalLM)
- Add AutoModelForCausalLM to config.json auto_map
- Rewrite README: remove TODO placeholders, point to Instruct for
  inference/serving, document how to copy preset configs for evaluation
- Fix citation year and serving link

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Files changed:
- README.md +10 -59
- config.json +1 -0
- modeling_apriel2.py +40 -2
README.md
CHANGED

@@ -61,68 +61,19 @@ This checkpoint is intended as a **foundation for downstream fine-tuning**. For

 ## How to Use

-
-```bash
-pip install transformers
-```
-
-> **🔴 TODO: There is currently no mechanism to select a placement when using Transformers directly. The model defaults to the all-attention preset during inference. Placement selection requires vLLM with the Fast-LLM plugin (see below). We need to add a Transformers API for placement switching.**
-
-Basic usage with Transformers (all-attention preset):
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-model_name = "ServiceNow-AI/SuperApriel-15b-Base"
-
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    torch_dtype="auto",
-    device_map="auto",
-    trust_remote_code=True,
-)
-
-prompt = "The capital of France is"
-inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-
-generated_ids = model.generate(**inputs, max_new_tokens=64)
-output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
-print(output)
-```
-
-
-
-
-- **`single-preset` mode**: Only the weights of a single selected mixer placement are loaded. Inactive mixer weights are offloaded to CPU, so GPU memory footprint and throughput match a dedicated single-placement model.
-- **`supernet` mode**: The full supernet is loaded into memory, enabling placement switching at a per-request level (5–15s switch time depending on how many layers change mixer type).
-
-### Installation
-
-```bash
-uv venv --python 3.12 --seed
-source .venv/bin/activate
-
-git clone git@github.com:ServiceNow/Fast-LLM.git
-cd Fast-LLM
-uv pip install vllm==0.10.2 --torch-backend=auto
-pip install .
-```
-
-### Running a vLLM Server
-
-```bash
-vllm serve \
-    --model ServiceNow-AI/SuperApriel-15b-Base \
-    --port 8000 \
-    --trust-remote-code
-```
+This checkpoint is intended as a foundation for fine-tuning and research, not for direct inference. For a ready-to-use model with optimized deployment presets and full serving instructions, see [SuperApriel-15b-Instruct](https://huggingface.co/ServiceNow-AI/SuperApriel-15b-Instruct).
+
+### Loading for evaluation
+
+If you need to load this checkpoint for evaluation or experimentation, copy a preset config from [SuperApriel-15b-Instruct](https://huggingface.co/ServiceNow-AI/SuperApriel-15b-Instruct) to select a specific mixer placement. The Base and Instruct checkpoints share the same architecture and config format — preset configs from Instruct work with this checkpoint.
+
+For example, to load with the all-attention placement:
+
+1. Download a preset `config.json` from `SuperApriel-15b-Instruct/preset_configs/all-attention/`
+2. Place it as this model's `config.json`
+3. Load with vLLM or Transformers following the [Instruct README instructions](https://huggingface.co/ServiceNow-AI/SuperApriel-15b-Instruct#how-to-use)
+
+> **Note:** This model requires `trust_remote_code=True` as it uses custom architecture code for the multi-mixer supernet.

 ## Intended Use

@@ -169,7 +120,7 @@ Users accept responsibility for securely deploying, managing, and using this ope

 ## Software

 - **Training stack:** [Fast-LLM](https://github.com/ServiceNow/Fast-LLM)
-- **Serving:** [Fast-LLM vLLM plugin](https://github.com/ServiceNow/Fast-LLM)
+- **Serving:** [Fast-LLM vLLM plugin](https://github.com/ServiceNow/Fast-LLM/tree/oo/feature/vllm-apriel2-model-modeling/apriel2-vllm-plugin)

 ## License

@@ -178,7 +129,7 @@ MIT

 ## Citation

 ```bibtex
-@misc{
+@misc{super_apriel_2026,
   title = {Super Apriel: One Checkpoint, Many Speeds},
   author = {ServiceNow Language Models Lab},
   year = {2026},
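For reference, the three evaluation-loading steps in the new README can also be scripted. A minimal sketch, assuming the `preset_configs/all-attention/config.json` layout named in the README (an assumption, not a verified repo layout) and the standard `huggingface_hub` client:

```python
# Minimal sketch of steps 1-3 from the README. The preset file path is
# taken from the README text; treat it as an assumption.
import shutil

from huggingface_hub import hf_hub_download, snapshot_download
from transformers import AutoModelForCausalLM

# Step 1: fetch the Base weights and the Instruct preset config
base_dir = snapshot_download(
    "ServiceNow-AI/SuperApriel-15b-Base", local_dir="SuperApriel-15b-Base"
)
preset_cfg = hf_hub_download(
    "ServiceNow-AI/SuperApriel-15b-Instruct",
    "preset_configs/all-attention/config.json",
)

# Step 2: overwrite the Base checkpoint's config with the preset
shutil.copy(preset_cfg, f"{base_dir}/config.json")

# Step 3: load with the repo's custom architecture code
model = AutoModelForCausalLM.from_pretrained(base_dir, trust_remote_code=True)
```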
config.json
CHANGED

@@ -5,6 +5,7 @@
   "auto_map": {
     "AutoConfig": "configuration_apriel2.Apriel2Config",
     "AutoModel": "modeling_apriel2.Apriel2Model",
+    "AutoModelForCausalLM": "modeling_apriel2.Apriel2ForCausalLM",
     "AutoModelForImageTextToText": "modeling_apriel2.Apriel2ForConditionalGeneration"
   },
   "bos_token_id": 1,
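For context, `auto_map` is how Transformers resolves `Auto*` classes to a repo's custom code when `trust_remote_code=True`. With the new entry, causal-LM loading resolves directly to the custom class; a short sketch:

```python
# With the added auto_map entry, AutoModelForCausalLM resolves to the
# repo's custom Apriel2ForCausalLM (class name from this diff).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ServiceNow-AI/SuperApriel-15b-Base",
    trust_remote_code=True,  # required to execute modeling_apriel2.py
)
print(type(model).__name__)  # -> Apriel2ForCausalLM
```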
modeling_apriel2.py
CHANGED

@@ -972,6 +972,32 @@ def create_mixer(mixer_config: dict, hidden_size: int, layer_idx: int, config, a
     return mixer_class(hidden_size, mixer_config, layer_idx=layer_idx)


+class Apriel2PatternMixerAdapter(nn.Module):
+    """Adapter that wraps a single mixer under mixers.{name} to match supernet weight paths.
+
+    The supernet checkpoint stores weights as blocks.{i}.mixer.mixers.{type}.{param},
+    but a bare mixer creates blocks.{i}.mixer.{param}. This adapter adds the intermediate
+    mixers.{name} level so pattern configs can load from supernet checkpoints.
+    """
+
+    def __init__(self, mixer_name: str, mixer: nn.Module):
+        super().__init__()
+        self.mixers = nn.ModuleDict({mixer_name: mixer})
+        self._mixer_name = mixer_name
+
+    def forward(self, *args, **kwargs):
+        return self.mixers[self._mixer_name](*args, **kwargs)
+
+    def preprocess(self, *args, **kwargs):
+        return self.mixers[self._mixer_name].preprocess(*args, **kwargs)
+
+    @classmethod
+    def setup(cls, mixer_name: str, mixer_config: dict, hidden_size: int, max_position_embeddings: int) -> nn.ModuleDict:
+        mixer_type = mixer_config.get("type", "attention")
+        mixer_class = get_mixer_class(mixer_type)
+        return mixer_class.setup(mixer_config, hidden_size, max_position_embeddings)
+
+
 class Apriel2Mamba(nn.Module):
     """Mamba mixer."""

@@ -1906,14 +1932,16 @@ class Apriel2BlockSequence(nn.Module):

         blocks = []
         for layer_idx in range(num_blocks):
-            # Get block_config for this layer
+            # Get block_config and block_name for this layer
             if seq_type == "fixed":
                 block_config = self.sequence_config.get("block", {})
+                block_name_for_layer = None  # No adapter needed for fixed type
             elif seq_type == "pattern":
                 pattern = self.sequence_config.get("pattern", [])
                 blocks_config = self.sequence_config.get("blocks", {})
                 block_name = pattern[layer_idx % len(pattern)]
                 block_config = blocks_config[block_name]
+                block_name_for_layer = block_name  # Pass to Apriel2Block for weight path matching
             else:
                 raise ValueError(f"Unknown sequence type: {seq_type}")

@@ -1925,6 +1953,7 @@ class Apriel2BlockSequence(nn.Module):
                     layer_idx=layer_idx,
                     rms_norm_eps=rms_norm_eps,
                     config=self.config,
+                    block_name=block_name_for_layer,
                 )
             )

@@ -2031,6 +2060,7 @@ class Apriel2Block(nn.Module):
         layer_idx: int,
         rms_norm_eps: float,
         config: Apriel2TextConfig,
+        block_name: Optional[str] = None,
     ):
         """
         Args:

@@ -2039,6 +2069,7 @@ class Apriel2Block(nn.Module):
             layer_idx: Layer index in the sequence
             rms_norm_eps: Epsilon for RMS normalization
             config: Model config (passed to mixers that need it)
+            block_name: For pattern configs, the mixer name (e.g. "attention") to match supernet weight paths
         """
         super().__init__()
         self.hidden_size = hidden_size

@@ -2046,7 +2077,13 @@ class Apriel2Block(nn.Module):

         # Create mixer based on type
         mixer_config = block_config.get("mixer", {"type": "attention"})
-        self.mixer = create_mixer(mixer_config, hidden_size, layer_idx, config, allow_stochastic=True)
+        raw_mixer = create_mixer(mixer_config, hidden_size, layer_idx, config, allow_stochastic=True)
+
+        # For pattern configs, wrap in adapter to match supernet checkpoint weight paths
+        if block_name is not None:
+            self.mixer = Apriel2PatternMixerAdapter(block_name, raw_mixer)
+        else:
+            self.mixer = raw_mixer

         # Create MLP
         mlp_config = block_config.get("mlp", {"type": "mlp"})

@@ -2435,6 +2472,7 @@ class Apriel2TextModel(Apriel2PreTrainedModel):
 class Apriel2ForCausalLM(Apriel2PreTrainedModel, GenerationMixin):
     """Apriel2 model with a language modeling head (text-only)."""

+    config_class = Apriel2Config
     _tied_weights_keys = ["lm_head.weight"]

     def __init__(self, config: Apriel2TextConfig):
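As background on why the adapter fixes weight loading: registering a module inside an `nn.ModuleDict` is what inserts the extra `mixers.{name}` level into its parameter paths. A standalone sketch with toy modules (not the real Apriel2 mixers) that shows the key difference:

```python
# Standalone illustration of the weight-path fix (toy classes, hypothetical):
# wrapping a module in an nn.ModuleDict under mixers.{name} shifts its
# state_dict keys to match what a supernet checkpoint stores.
import torch.nn as nn

class ToyMixer(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(8, 8)

class ToyPatternMixerAdapter(nn.Module):
    def __init__(self, name: str, mixer: nn.Module):
        super().__init__()
        self.mixers = nn.ModuleDict({name: mixer})

bare = ToyMixer()
print(list(bare.state_dict()))
# ['q_proj.weight', 'q_proj.bias']
#   -> under a block: blocks.{i}.mixer.q_proj.{param}

wrapped = ToyPatternMixerAdapter("attention", ToyMixer())
print(list(wrapped.state_dict()))
# ['mixers.attention.q_proj.weight', 'mixers.attention.q_proj.bias']
#   -> under a block: blocks.{i}.mixer.mixers.attention.{param}
```

The pattern branch pairs with this: since `block_name = pattern[layer_idx % len(pattern)]`, a pattern such as `["attention", "mamba"]` alternates mixer types across layers, and each layer's adapter registers its weights under the matching `mixers.{name}` prefix from the supernet checkpoint.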