Sync modeling code with Instruct, update README for Base checkpoint

- Sync modeling_apriel2.py from Instruct (adds PatternMixerAdapter for
  pattern-config weight loading, config_class fix for AutoModelForCausalLM)
- Add AutoModelForCausalLM to config.json auto_map
- Rewrite README: remove TODO placeholders, point to Instruct for
  inference/serving, document how to copy preset configs for evaluation
- Fix citation year and serving link

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Files changed:
- README.md +10 -59
- config.json +1 -0
- modeling_apriel2.py +40 -2
README.md
CHANGED

@@ -61,68 +61,19 @@ This checkpoint is intended as a **foundation for downstream fine-tuning**. For

 ## How to Use

-
-```bash
-pip install transformers
-```
-
-> **🔴 TODO: There is currently no mechanism to select a placement when using Transformers directly. The model defaults to the all-attention preset during inference. Placement selection requires vLLM with the Fast-LLM plugin (see below). We need to add a Transformers API for placement switching.**
-
-Basic usage with Transformers (all-attention preset):
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-model_name = "ServiceNow-AI/SuperApriel-15b-Base"
-
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    torch_dtype="auto",
-    device_map="auto",
-    trust_remote_code=True,
-)
-
-prompt = "The capital of France is"
-inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-
-generated_ids = model.generate(**inputs, max_new_tokens=64)
-output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
-print(output)
-```
-
-
-
-
-- **`single-preset` mode**: Only the weights of a single selected mixer placement are loaded. Inactive mixer weights are offloaded to CPU, so GPU memory footprint and throughput match a dedicated single-placement model.
-- **`supernet` mode**: The full supernet is loaded into memory, enabling placement switching at a per-request level (5–15s switch time depending on how many layers change mixer type).
-
-### Installation
-
-```bash
-uv venv --python 3.12 --seed
-source .venv/bin/activate
-
-git clone git@github.com:ServiceNow/Fast-LLM.git
-cd Fast-LLM
-uv pip install vllm==0.10.2 --torch-backend=auto
-pip install .
-```
-
-### Running a vLLM Server
-
-```bash
-vllm serve \
-    --model ServiceNow-AI/SuperApriel-15b-Base \
-    --port 8000 \
-    --trust-remote-code
-```
+This checkpoint is intended as a foundation for fine-tuning and research, not for direct inference. For a ready-to-use model with optimized deployment presets and full serving instructions, see [SuperApriel-15b-Instruct](https://huggingface.co/ServiceNow-AI/SuperApriel-15b-Instruct).
+
+### Loading for evaluation
+
+If you need to load this checkpoint for evaluation or experimentation, copy a preset config from [SuperApriel-15b-Instruct](https://huggingface.co/ServiceNow-AI/SuperApriel-15b-Instruct) to select a specific mixer placement. The Base and Instruct checkpoints share the same architecture and config format — preset configs from Instruct work with this checkpoint.
+
+For example, to load with the all-attention placement:
+
+1. Download a preset `config.json` from `SuperApriel-15b-Instruct/preset_configs/all-attention/`
+2. Place it as this model's `config.json`
+3. Load with vLLM or Transformers following the [Instruct README instructions](https://huggingface.co/ServiceNow-AI/SuperApriel-15b-Instruct#how-to-use)
+
+> **Note:** This model requires `trust_remote_code=True` as it uses custom architecture code for the multi-mixer supernet.

 ## Intended Use

@@ -169,7 +120,7 @@ Users accept responsibility for securely deploying, managing, and using this ope

 ## Software

 - **Training stack:** [Fast-LLM](https://github.com/ServiceNow/Fast-LLM)
-- **Serving:** [Fast-LLM vLLM plugin](https://github.com/ServiceNow/Fast-LLM)
+- **Serving:** [Fast-LLM vLLM plugin](https://github.com/ServiceNow/Fast-LLM/tree/oo/feature/vllm-apriel2-model-modeling/apriel2-vllm-plugin)

 ## License

@@ -178,7 +129,7 @@ MIT

 ## Citation

 ```bibtex
-@misc{
+@misc{super_apriel_2026,
   title = {Super Apriel: One Checkpoint, Many Speeds},
   author = {ServiceNow Language Models Lab},
   year = {2026},
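For reference, the three evaluation-loading steps in the new README can also be scripted. A minimal sketch, assuming the `preset_configs/all-attention/config.json` layout named in the README (an assumption, not a verified repo layout) and the standard `huggingface_hub` client:

```python
# Minimal sketch of steps 1-3 from the README. The preset file path is
# taken from the README text; treat it as an assumption.
import shutil

from huggingface_hub import hf_hub_download, snapshot_download
from transformers import AutoModelForCausalLM

# Step 1: fetch the Base weights and the Instruct preset config
base_dir = snapshot_download(
    "ServiceNow-AI/SuperApriel-15b-Base", local_dir="SuperApriel-15b-Base"
)
preset_cfg = hf_hub_download(
    "ServiceNow-AI/SuperApriel-15b-Instruct",
    "preset_configs/all-attention/config.json",
)

# Step 2: overwrite the Base checkpoint's config with the preset
shutil.copy(preset_cfg, f"{base_dir}/config.json")

# Step 3: load with the repo's custom architecture code
model = AutoModelForCausalLM.from_pretrained(base_dir, trust_remote_code=True)
```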
config.json
CHANGED

@@ -5,6 +5,7 @@
   "auto_map": {
     "AutoConfig": "configuration_apriel2.Apriel2Config",
     "AutoModel": "modeling_apriel2.Apriel2Model",
+    "AutoModelForCausalLM": "modeling_apriel2.Apriel2ForCausalLM",
     "AutoModelForImageTextToText": "modeling_apriel2.Apriel2ForConditionalGeneration"
   },
   "bos_token_id": 1,
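For context, `auto_map` is how Transformers resolves `Auto*` classes to a repo's custom code when `trust_remote_code=True`. With the new entry, causal-LM loading resolves directly to the custom class; a short sketch:

```python
# With the added auto_map entry, AutoModelForCausalLM resolves to the
# repo's custom Apriel2ForCausalLM (class name from this diff).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ServiceNow-AI/SuperApriel-15b-Base",
    trust_remote_code=True,  # required to execute modeling_apriel2.py
)
print(type(model).__name__)  # -> Apriel2ForCausalLM
```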
modeling_apriel2.py
CHANGED

@@ -972,6 +972,32 @@ def create_mixer(mixer_config: dict, hidden_size: int, layer_idx: int, config, a
     return mixer_class(hidden_size, mixer_config, layer_idx=layer_idx)


+class Apriel2PatternMixerAdapter(nn.Module):
+    """Adapter that wraps a single mixer under mixers.{name} to match supernet weight paths.
+
+    The supernet checkpoint stores weights as blocks.{i}.mixer.mixers.{type}.{param},
+    but a bare mixer creates blocks.{i}.mixer.{param}. This adapter adds the intermediate
+    mixers.{name} level so pattern configs can load from supernet checkpoints.
+    """
+
+    def __init__(self, mixer_name: str, mixer: nn.Module):
+        super().__init__()
+        self.mixers = nn.ModuleDict({mixer_name: mixer})
+        self._mixer_name = mixer_name
+
+    def forward(self, *args, **kwargs):
+        return self.mixers[self._mixer_name](*args, **kwargs)
+
+    def preprocess(self, *args, **kwargs):
+        return self.mixers[self._mixer_name].preprocess(*args, **kwargs)
+
+    @classmethod
+    def setup(cls, mixer_name: str, mixer_config: dict, hidden_size: int, max_position_embeddings: int) -> nn.ModuleDict:
+        mixer_type = mixer_config.get("type", "attention")
+        mixer_class = get_mixer_class(mixer_type)
+        return mixer_class.setup(mixer_config, hidden_size, max_position_embeddings)
+
+
 class Apriel2Mamba(nn.Module):
     """Mamba mixer."""

@@ -1906,14 +1932,16 @@ class Apriel2BlockSequence(nn.Module):

         blocks = []
         for layer_idx in range(num_blocks):
-            # Get block_config for this layer
+            # Get block_config and block_name for this layer
             if seq_type == "fixed":
                 block_config = self.sequence_config.get("block", {})
+                block_name_for_layer = None  # No adapter needed for fixed type
             elif seq_type == "pattern":
                 pattern = self.sequence_config.get("pattern", [])
                 blocks_config = self.sequence_config.get("blocks", {})
                 block_name = pattern[layer_idx % len(pattern)]
                 block_config = blocks_config[block_name]
+                block_name_for_layer = block_name  # Pass to Apriel2Block for weight path matching
             else:
                 raise ValueError(f"Unknown sequence type: {seq_type}")

@@ -1925,6 +1953,7 @@ class Apriel2BlockSequence(nn.Module):
                     layer_idx=layer_idx,
                     rms_norm_eps=rms_norm_eps,
                     config=self.config,
+                    block_name=block_name_for_layer,
                 )
             )

@@ -2031,6 +2060,7 @@ class Apriel2Block(nn.Module):
         layer_idx: int,
         rms_norm_eps: float,
         config: Apriel2TextConfig,
+        block_name: Optional[str] = None,
     ):
         """
         Args:

@@ -2039,6 +2069,7 @@ class Apriel2Block(nn.Module):
             layer_idx: Layer index in the sequence
             rms_norm_eps: Epsilon for RMS normalization
             config: Model config (passed to mixers that need it)
+            block_name: For pattern configs, the mixer name (e.g. "attention") to match supernet weight paths
         """
         super().__init__()
         self.hidden_size = hidden_size

@@ -2046,7 +2077,13 @@ class Apriel2Block(nn.Module):

         # Create mixer based on type
         mixer_config = block_config.get("mixer", {"type": "attention"})
-        self.mixer = create_mixer(mixer_config, hidden_size, layer_idx, config, allow_stochastic=True)
+        raw_mixer = create_mixer(mixer_config, hidden_size, layer_idx, config, allow_stochastic=True)
+
+        # For pattern configs, wrap in adapter to match supernet checkpoint weight paths
+        if block_name is not None:
+            self.mixer = Apriel2PatternMixerAdapter(block_name, raw_mixer)
+        else:
+            self.mixer = raw_mixer

         # Create MLP
         mlp_config = block_config.get("mlp", {"type": "mlp"})

@@ -2435,6 +2472,7 @@ class Apriel2TextModel(Apriel2PreTrainedModel):
 class Apriel2ForCausalLM(Apriel2PreTrainedModel, GenerationMixin):
     """Apriel2 model with a language modeling head (text-only)."""

+    config_class = Apriel2Config
     _tied_weights_keys = ["lm_head.weight"]

     def __init__(self, config: Apriel2TextConfig):
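As background on why the adapter fixes weight loading: registering a module inside an `nn.ModuleDict` is what inserts the extra `mixers.{name}` level into its parameter paths. A standalone sketch with toy modules (not the real Apriel2 mixers) that shows the key difference:

```python
# Standalone illustration of the weight-path fix (toy classes, hypothetical):
# wrapping a module in an nn.ModuleDict under mixers.{name} shifts its
# state_dict keys to match what a supernet checkpoint stores.
import torch.nn as nn

class ToyMixer(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(8, 8)

class ToyPatternMixerAdapter(nn.Module):
    def __init__(self, name: str, mixer: nn.Module):
        super().__init__()
        self.mixers = nn.ModuleDict({name: mixer})

bare = ToyMixer()
print(list(bare.state_dict()))
# ['q_proj.weight', 'q_proj.bias']
#   -> under a block: blocks.{i}.mixer.q_proj.{param}

wrapped = ToyPatternMixerAdapter("attention", ToyMixer())
print(list(wrapped.state_dict()))
# ['mixers.attention.q_proj.weight', 'mixers.attention.q_proj.bias']
#   -> under a block: blocks.{i}.mixer.mixers.attention.{param}
```

The pattern branch pairs with this: since `block_name = pattern[layer_idx % len(pattern)]`, a pattern such as `["attention", "mamba"]` alternates mixer types across layers, and each layer's adapter registers its weights under the matching `mixers.{name}` prefix from the supernet checkpoint.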