Granite Switch 4.1 3B Preview

Model Summary: Granite Switch 4.1 3B Preview is a modular LLM built on IBM Granite 4.1 3B with embedded adapters from the Granite Libraries collection. A single checkpoint supports multiple specialized capabilities — RAG, safety, explainability, and more — that are activated on demand via control tokens in the chat template.

For full details on model composition and adapter configuration, see BUILD.md.

  • Base Model: ibm-granite/granite-4.1-3b (3B params, 128K context)
  • Adapters: 12 adapters from granitelib-rag-r1.0, granitelib-core-r1.0, and granitelib-guardian-r1.0
  • License: Apache 2.0
  • Release Date: May 5th, 2026
  • Backends: HuggingFace Transformers, vLLM
  • Automatically Composed with: granite-switch

Granite Switch is also available in granite-switch-4.1-8b-preview and granite-switch-4.1-30b-preview.

Motivation: Traditional multi-task LLM deployments require either separate model copies per capability (multiplying memory and compute) or weight merging that permanently blends adapters and destroys task specialization. Granite Switch takes a different approach: independently trained activated LoRA adapters are embedded in a single checkpoint and dynamically selected at inference time via control tokens. KV cache normalization ensures that adapters share no internal KV cache state: each adapter sees prior tokens only through the base model's representation, so adapters can build on each other's outputs but never through another adapter's cached activations. This allows adapters to be developed, tested, and composed independently without accuracy loss, and lets a single deployment serve many specialized capabilities efficiently.

Included Adapters

Granite Switch is best used with Mellea.

Core Library (ibm-granite/granitelib-core-r1.0)

Adapters for context attribution, requirements validation, and uncertainty estimation.

  • Requirement Check: Binary yes/no evaluation of whether a response satisfies user-specified constraints (formatting, content, quality)
  • Context Attribution: Identifies which context sentences influenced the response (contributive attribution, ranked by importance)
  • Uncertainty: Calibrated confidence scores; an answer marked X% certain is approximately X% correct
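
For example, the uncertainty adapter could be invoked through the chat template in the same way as the inference examples below. A minimal sketch, reusing the model and tokenizer loaded under HuggingFace Inference (the adapter name "uncertainty" is an assumption; check adapter_index.json for the actual names):

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

# "uncertainty" is an assumed adapter name; see adapter_index.json for the real mapping
prompt = tokenizer.apply_chat_template(
    messages, adapter_name="uncertainty", add_generation_prompt=True, tokenize=False
)
outputs = model.generate(**tokenizer(prompt, return_tensors="pt").to(model.device))
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # expected: a calibrated confidence estimate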

RAG Library (ibm-granite/granitelib-rag-r1.0)

Adapters for retrieval-augmented generation pipelines.

  • Query Rewrite (pre-retrieval): Decontextualizes multi-turn queries into standalone, retriever-friendly versions
  • Query Clarification (pre-retrieval): Detects underspecified or ambiguous queries and formulates clarification requests
  • Answerability (pre-generation): Determines whether a query is answerable from the available passages; prevents hallucinations
  • Hallucination Detection (post-generation): Outputs hallucination risk ranges for each sentence in a response
  • Citation Generation (post-generation): Generates passage-level citations for model responses
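
Pre-retrieval adapters can be called the same way as the answerability example in the inference sections below. A minimal sketch of query rewrite through the vLLM OpenAI-compatible server (the adapter name "query_rewrite" is an assumption; confirm it against adapter_index.json):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="ibm-granite/granite-switch-4.1-3b-preview",
    messages=[
        {"role": "user", "content": "Who founded IBM?"},
        {"role": "assistant", "content": "IBM was founded by Charles Ranlett Flint in 1911."},
        {"role": "user", "content": "Where is it headquartered?"},
    ],
    # "query_rewrite" is an assumed adapter name; see adapter_index.json
    extra_body={"chat_template_kwargs": {"adapter_name": "query_rewrite"}},
)
print(response.choices[0].message.content)  # expected: a standalone query such as "Where is IBM headquartered?"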

Guardian Library (ibm-granite/granitelib-guardian-r1.0)

Adapters for safety, factuality, and policy compliance.

  • Guardian Core: Detects safety risks such as harm, jailbreaking, profanity, violence, sexual content, social bias, and unethical behavior
  • Factuality Detection: Assesses factual correctness of responses against provided context sources
  • Factuality Correction: Corrects factual inaccuracies in long-form responses while preserving reasoning quality
  • Policy Guardrails: Checks compliance against user-defined policies (compliant / non-compliant / ambiguous)
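
The guardian core adapter is demonstrated with Mellea under Using with Mellea below. A minimal sketch of screening an assistant response instead of a user turn, reusing the backend from that section (target_role="assistant" is an assumption; only target_role="user" appears in this card):

from mellea.stdlib.components import Message
from mellea.stdlib.components.intrinsic import guardian
from mellea.stdlib.context import ChatContext

# target_role="assistant" is an assumption; the documented example uses target_role="user"
context = ChatContext().add(
    Message("user", "How do I get ahead of my coworker?")
).add(
    Message("assistant", "You could spread rumors about them to damage their reputation.")
)
score = guardian.guardian_check(context, backend, criteria="harm", target_role="assistant")
print("Risk detected" if score >= 0.5 else "Safe")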

Generation

git clone https://github.com/generative-computing/granite-switch.git
cd granite-switch

# Pick what you need:
pip install -e ".[compose]"   # Compose models with adapters
pip install -e ".[hf]"        # HuggingFace inference
pip install -e ".[vllm]"      # vLLM inference
pip install -e ".[dev]"       # Everything

Using with Mellea

Mellea is the preferred way to run Granite Switch adapters in applications. It standardizes the interface for building with adapters such as answerability checking, hallucination detection, requirement checking, and harmful language detection. Constrained decoding and input/output pre-processing are handled automatically, which improves accuracy and reliability. When running Granite Switch models through Mellea, embedded adapters function as high-level API calls, so you can use direct operations instead of raw prompt engineering.

pip install mellea

Answerability check

from mellea.backends.openai import OpenAIBackend
from mellea.formatters import TemplateFormatter
from mellea.stdlib.components import Document, Message
from mellea.stdlib.components.intrinsic import rag
from mellea.stdlib.context import ChatContext

SWITCH_MODEL_ID = "ibm-granite/granite-switch-4.1-3b-preview"

backend = OpenAIBackend(
    model_id=SWITCH_MODEL_ID,
    formatter=TemplateFormatter(model_id=SWITCH_MODEL_ID),
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    load_embedded_adapters=True,
)

context = ChatContext().add(Message("assistant", "Hello there, how can I help you?"))
question = "What is the square root of 4?"
documents = [Document("The square root of 4 is 2.")]

result = rag.check_answerability(question, documents, context, backend)
print(f"Answerability: {result}")

Requirement check

from mellea.stdlib.components.intrinsic import core

context = ChatContext().add(
    Message("user", "Invite for an IBM office party.")
).add(
    Message("assistant", "Dear Team, you are cordially invited to a team social...")
)

result = core.requirement_check(context, backend, requirement="Use a professional tone.")
print(f"Requirements Satisfied: {result}")  # float between 0.0 and 1.0

Guardian core (safety detection)

from mellea.stdlib.components.intrinsic import guardian

context = ChatContext().add(
    Message("user", "How can I hack my friend's email?")
)

score = guardian.guardian_check(context, backend, criteria="harm", target_role="user")
verdict = "Risk detected" if score >= 0.5 else "Safe"
print(f"Score: {score:.4f}  ({verdict})")

See the mellea examples/ directory for more examples, including manual adapter loading.

The following examples demonstrate low-level adapter invocation directly via the HuggingFace and vLLM backends. See the Granite Switch repository for additional tutorials.

HuggingFace Inference

import granite_switch.hf  # Register the model architecture

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-switch-4.1-3b-preview", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-switch-4.1-3b-preview")

Activate an adapter via the chat template

messages = [
    {"role": "assistant", "content": "Hello there, how can I help you?"},
    {"role": "user", "content": "What is the square root of 4?"},
]
documents = [{"doc_id": "1", "text": "The square root of 4 is 2."}]

prompt = tokenizer.apply_chat_template(
    messages,
    documents=documents,
    adapter_name="answerability",   # activate the answerability adapter
    add_generation_prompt=True,
    tokenize=False,
)

outputs = model.generate(**tokenizer(prompt, return_tensors="pt").to(model.device))
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# => "answerable"

No adapter (base model behavior)

prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)
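
The resulting prompt runs through the same generate call as above; with no control token in the input, every position stays on the base model's behavior:

outputs = model.generate(**tokenizer(prompt, return_tensors="pt").to(model.device))
print(tokenizer.decode(outputs[0], skip_special_tokens=True))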

vLLM Inference

Start the OpenAI-compatible server:

pip install -e ".[vllm]"

python -m vllm.entrypoints.openai.api_server \
  --model ibm-granite/granite-switch-4.1-3b-preview \
  --port 8000

Call adapters via the API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="ibm-granite/granite-switch-4.1-3b-preview",
    messages=[
        {"role": "assistant", "content": "Hello there, how can I help you?"},
        {"role": "user", "content": "What is the square root of 4?"},
    ],
    extra_body={
        "documents": [{"doc_id": "1", "text": "The square root of 4 is 2."}],
        "chat_template_kwargs": {"adapter_name": "answerability"},
    },
    max_completion_tokens=6,
)
print(response.choices[0].message.content)
# => "answerable"

Or with curl:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ibm-granite/granite-switch-4.1-3b-preview",
    "messages": [
      {"role": "assistant", "content": "Hello there, how can I help you?"},
      {"role": "user", "content": "What is the square root of 4?"}
    ],
    "documents": [{"doc_id": "1", "text": "The square root of 4 is 2."}],
    "chat_template_kwargs": {"adapter_name": "answerability"},
    "max_completion_tokens": 6
  }'

Model Artifacts:

  • model.safetensors: Full model with embedded adapters
  • config.json: GraniteSwitchConfig
  • tokenizer.json / tokenizer_config.json: Tokenizer with control tokens
  • adapter_index.json: Adapter-to-control-token mapping
  • io_configs/: Original io.yaml for each adapter
  • chat_template.jinja: Jinja template with adapter activation logic
  • BUILD.md: Composed model details and adapter configuration
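
For example, adapter_index.json can be inspected to list the adapter names the chat template accepts. A minimal sketch, assuming the file is a JSON object keyed by adapter name (consult the file for the actual schema):

import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("ibm-granite/granite-switch-4.1-3b-preview", "adapter_index.json")
with open(path) as f:
    adapter_index = json.load(f)

# Assumes a mapping keyed by adapter name; the real schema may differ
print(list(adapter_index))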

Requirements:

  • Python >= 3.9
  • PyTorch >= 2.0.0
  • Transformers >= 5.5.1
  • vLLM (optional) >= 0.19.1, < 0.21.0

How It Works

Granite Switch uses coarse-grained expert switching — one adapter is active across all layers for a contiguous span of tokens. A lightweight switch layer (standard attention) detects control tokens in the input and produces per-position adapter indices that tell every decoder layer which LoRA weights to apply.

Input Tokens: ["Tell", "me", "about", "<|answerability|>", ...]
                           │
           ┌───────────────▼───────────────┐
           │       SWITCH LAYER            │
           │    Detects control            │
           │    tokens → indices           │
           └───────────────┬───────────────┘
                           │
          adapter_indices: [0, 0, 0, 4, 4, ...]
                           │
           ┌───────────────▼───────────────┐
           │      DECODER LAYERS           │
           │      Base weights +           │
           │      LoRA[index]              │
           └───────────────┬───────────────┘
                           │
           ┌───────────────▼───────────────┐
           │         LM HEAD               │
           └───────────────────────────────┘
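
A minimal sketch of the dispatch logic with made-up token IDs (the real switch layer is a trained attention layer, not a lookup table):

import torch

# Hypothetical control-token-id -> adapter-index mapping; real names live in adapter_index.json
CONTROL_TOKEN_TO_ADAPTER = {50001: 4}   # e.g. <|answerability|> -> adapter 4

def assign_adapter_indices(input_ids: torch.Tensor) -> torch.Tensor:
    """Once a control token appears, subsequent positions keep that adapter index."""
    indices = torch.zeros_like(input_ids)
    current = 0
    for i, tok in enumerate(input_ids.tolist()):
        current = CONTROL_TOKEN_TO_ADAPTER.get(tok, current)
        indices[i] = current
    return indices  # e.g. ["Tell", "me", "about", "<|answerability|>", ...] -> [0, 0, 0, 4, ...]

def switched_linear(hidden, base_weight, loras, adapter_indices):
    """Base projection plus the LoRA delta selected per position (index 0 = base model only)."""
    out = hidden @ base_weight.T
    for idx in adapter_indices.unique().tolist():
        if idx == 0:
            continue
        mask = adapter_indices == idx
        lora_A, lora_B, scaling = loras[idx]           # LoRA factors for this adapter
        out[mask] += scaling * (hidden[mask] @ lora_A.T) @ lora_B.T
    return out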

Key properties:

  • KV cache normalization — each adapter sees only the base model's KV cache, never another adapter's internal state
  • No joint training — adapters are developed, tested, and published independently
  • Single checkpoint — one file works with both HuggingFace and vLLM, no conversion needed
  • Zero code changes — adapter selection happens entirely through the chat template

Model Architecture

Base Model

  • Model: Granite 4.1 3B
  • Parameters: 3 billion
  • Context Length: 131,072 tokens
  • Architecture: Dense decoder-only transformer (GQA, RoPE, SwiGLU, RMSNorm)
  • Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, Chinese

Adapter Activation

Each adapter is activated by passing its name to the chat template via the adapter_name argument; the template inserts the corresponding control token automatically, so callers never handle control tokens directly.
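
For instance, reusing the tokenizer, messages, and documents from the HuggingFace example above, switching adapters is a one-argument change (the adapter name below is an assumption; see adapter_index.json for the real names):

prompt = tokenizer.apply_chat_template(
    messages,
    documents=documents,
    adapter_name="citation_generation",   # assumed adapter name
    add_generation_prompt=True,
    tokenize=False,
)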


Ethical Considerations and Limitations

This model inherits the safety profile of the base Granite 4.1 model. The Guardian Library adapters (guardian core, factuality detection/correction, policy guardrails) provide additional safety layers but are not a substitute for application-level safety testing. Deployers should:

  • Test adapter behavior on their specific use cases before production deployment
  • Apply appropriate content filtering for their domain
  • Monitor adapter outputs, especially for safety-critical applications
  • Use the uncertainty adapter to assess model confidence on important decisions

Model Signing

The model.sig file contains a signature over all model artifacts to ensure integrity and provenance.

To verify the integrity of the downloaded model, use the model-signing tool:

# First obtain the model weights
hf download ibm-granite/granite-switch-4.1-3b-preview --local-dir granite-switch-4.1-3b-preview

# Install the model signing verification tool
pip install 'model-signing==v1.1.1'

# Verify all artifacts in the downloaded model directory
model_signing verify sigstore \
  --signature granite-switch-4.1-3b-preview/model.sig \
  --ignore-git-paths \
  --ignore-paths granite-switch-4.1-3b-preview/README.md \
  --identity Granite-sign@ibm.com \
  --identity_provider https://sigstore.verify.ibm.com/oauth2 \
  granite-switch-4.1-3b-preview

The "Verification succeeded" message confirms that the model has not been tampered with after release.

License

Granite Switch has an Apache-2.0 license, as found in the LICENSE file.


Citation

@software{granite_switch,
  title  = {Granite Switch: Coarse-Grained Expert Switching for LLMs},
  author = {IBM Research},
  year   = {2026},
  url    = {https://github.com/generative-computing/granite-switch}
}

See also: Activated LoRA: Fine-tuned LLMs for Intrinsics (NeurIPS 2025).
