# Micro Uncertainty
Model Summary: Uncertainty adapter provides calibrated certainty scores for ibm-granite/granite-4.0-micro. The model responds with a certainty score from 0 to 9, which maps to a calibrated likelihood via confidence = 0.1 * score + 0.05, yielding 10 possible values (5%, 15%, 25%, ..., 95%). This percentage is calibrated in the following sense: given a set of answers assigned a certainty score of X%, approximately X% of these answers should be correct. See the evaluation section below for out-of-distribution verification of this behavior.
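The score-to-confidence mapping described above can be sketched in a few lines of Python (the helper name is illustrative, not part of any Granite API):

```python
def score_to_confidence(score: int) -> float:
    """Map a 0-9 certainty score to its calibrated likelihood."""
    if not 0 <= score <= 9:
        raise ValueError("score must be in 0..9")
    return 0.1 * score + 0.05

# The ten possible calibrated values: 5%, 15%, ..., 95%
print([round(score_to_confidence(s), 2) for s in range(10)])
```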
- Developer: IBM Research
- HF Collection: Granite Libraries
- GitHub Repository: https://github.com/ibm-granite
- Release Date: March 18th, 2026
- Model Type: LoRA adapter for ibm-granite/granite-4.0-micro
- License: Apache 2.0
- Paper: Thermometer: Towards Universal Calibration for Large Language Models (Shen et al., ICML 2024). The adapter is finetuned to provide certainty scores mimicking the output of a calibrator trained via this method.
## Usage
Intended use: Uncertainty is a LoRA adapter that enables the Granite 4.0 Micro base model to express calibrated self-assessments of the correctness of its own answers. This adapter is designed to be used as part of the Granite inference pipeline, activated via the <certainty> invocation token after the model generates a response.
## Use Cases
- Human usage: Certainty scores give human users an indication of when to trust answers from the model (which should be augmented by their own knowledge).
- Model routing/guards: If the model has low certainty (below a chosen threshold), it may be worth sending the request to a larger, more capable model or simply choosing not to show the response to the user.
- RAG: Uncertainty is calibrated on diverse question-answering datasets, hence it can be applied to giving certainty scores for answers created using RAG. This certainty will be a prediction of overall correctness based on both the documents given and the model's own knowledge.
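The model routing/guard use case above can be sketched as follows. This is a minimal illustration: the `route` helper, the threshold value, and the `"ESCALATE"` sentinel are all hypothetical, not part of any Granite API.

```python
THRESHOLD = 0.5  # application-specific; tune against your own traffic

def route(answer: str, confidence: float) -> str:
    """Keep the small model's answer only when its calibrated certainty
    clears the threshold; otherwise escalate (e.g. to a larger model,
    or suppress the response entirely)."""
    if confidence >= THRESHOLD:
        return answer
    return "ESCALATE"

print(route("IBM Research is IBM's R&D division.", 0.85))  # confident: keep
print(route("IBM Research is IBM's R&D division.", 0.15))  # low certainty: escalate
```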
## Quickstart Example (LoRA)
```python
import re

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

BASE_NAME = "ibm-granite/granite-4.0-micro"
LORA_REPO = "ibm-granite/granitelib-core-r1.0"
LORA_SUBFOLDER = "uncertainty/granite-4.0-micro/lora"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer, the base model, and a second copy with the LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side="left", trust_remote_code=True)
model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME, device_map="auto", torch_dtype=torch.bfloat16)
model_uq = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE_NAME, device_map="auto", torch_dtype=torch.bfloat16),
    LORA_REPO,
    subfolder=LORA_SUBFOLDER,
)

question = "What is IBM Research?"
print("Question:", question)

# Step 1: Generate an answer with the base model
messages = [
    {"role": "user", "content": question},
]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(device)
output = model_base.generate(**inputs, max_new_tokens=600, do_sample=False)
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print("Answer:", answer)

# Step 2: Generate a certainty score with the LoRA adapter
uq_messages = [
    {"role": "user", "content": question},
    {"role": "assistant", "content": answer},
    {"role": "user", "content": "<certainty>"},
]
uq_text = tokenizer.apply_chat_template(uq_messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(uq_text, return_tensors="pt").to(device)
output = model_uq.generate(**inputs, max_new_tokens=15, do_sample=False)
uq_response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print("Raw response:", uq_response)

# Parse the {"score": "X"} output and map it to a calibrated confidence
match = re.search(r'\{[^}]*"score"\s*:\s*"?(\d)"?[^}]*\}', uq_response)
if match:
    score = int(match.group(1))
    confidence = 0.1 * score + 0.05
    print(f"Score: {score}, Certainty: {confidence*100:.0f}%")
```
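The final parsing step of the quickstart can be factored into a small helper and exercised offline against the documented {"score": "X"} output format, without loading the model (the helper name is illustrative):

```python
import re

def parse_certainty(uq_response: str):
    """Extract the 0-9 score from the adapter's JSON-like output and map
    it to a calibrated confidence. Returns None if no score is found."""
    match = re.search(r'\{[^}]*"score"\s*:\s*"?(\d)"?[^}]*\}', uq_response)
    if match is None:
        return None
    score = int(match.group(1))
    return score, 0.1 * score + 0.05

print(parse_certainty('{"score": "7"}'))   # score 7 maps to 75% certainty
print(parse_certainty("no score here"))    # graceful failure: None
```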
## Evaluation
The adapter was evaluated on the MMLU dataset (57 subsets, 14,042 total samples, not used in training). Shown below are aggregate calibration metrics for the base model (granite-4.0-micro, using sequence probability as confidence) and for the LoRA adapter. The LoRA adapter achieves a weighted Expected Calibration Error (ECE) of 0.0565, a 65% improvement over the base model's sequence-probability baseline of 0.1606. Additionally, zero-shot performance on the MMLU tasks does not degrade, averaging 63.2%.
| Metric | Base Model | LoRA |
|---|---|---|
| ECE | 0.1606 | 0.0565 |
| Brier Score | 0.2535 | 0.2131 |
| AUROC | 0.6748 | 0.6903 |
| Sharpness | 0.2888 | 0.1607 |
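The ECE reported above bins predictions by confidence and averages the gap between accuracy and mean confidence per bin, weighted by bin size. A minimal sketch, assuming equal-width bins (the exact binning used in the evaluation may differ):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted average over bins of |accuracy - mean confidence|."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if idx:
            acc = sum(correct[i] for i in idx) / len(idx)
            avg_conf = sum(confidences[i] for i in idx) / len(idx)
            ece += len(idx) / n * abs(acc - avg_conf)
    return ece

# Toy data close to calibrated yields a small ECE
print(expected_calibration_error([0.95, 0.95, 0.05, 0.05], [1, 1, 0, 0]))
```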
## Adapter Configurations
| Parameter | LoRA |
|---|---|
| Base model | ibm-granite/granite-4.0-micro |
| LoRA rank (r) | 32 |
| LoRA alpha | 64 |
| Target modules | q_proj, k_proj, v_proj, o_proj, input_linear, output_linear |
| Invocation token | <certainty> |
| Output format | {"score": "X"} where X is 0-9 |
| Confidence mapping | 0.1 * score + 0.05 (5% to 95%) |
| Max completion tokens | 15 |
| KV cache | Supported |
## Training Details
Granite 4.0 Micro Uncertainty LoRA adapter is finetuned to provide certainty scores mimicking the output of a calibrator trained via the method in [Shen et al. ICML 2024] Thermometer: Towards Universal Calibration for Large Language Models.
Training Data: The adapter was trained on a dataset of ~199K question-answer pairs generated by the base model (granite-4.0-micro), where each pair is annotated with a certainty score (0-9) derived from a calibrated thermometer model. The following datasets were used for calibration and/or fine-tuning.
- BigBench
- MRQA
- newsqa
- trivia_qa
- search_qa
- openbookqa
- web_questions
- smiles-qa
- orca-math
- ARC-Easy
- commonsense_qa
- social_i_qa
- super_glue
- figqa
- riddle_sense
- ag_news
- medmcqa
- dream
- codah
- piqa
Infrastructure: Training was completed on 8 H100 GPUs. Evaluation (and inference) requires a single H100 GPU.
Ethical Considerations: Certainty is an intrinsic property of a model and its abilities. The Uncertainty adapter is not intended to predict the certainty of responses generated by models other than ibm-granite/granite-4.0-micro. Additionally, certainty scores are distributional quantities: they will do well on realistic questions in aggregate, but in principle may produce surprising scores on individual red-teamed examples. Certainty scores may, at times, be biased towards moderate values for several reasons. First, humans tend to be overconfident in evaluating what we know and don't know; in contrast, a calibrated model is less likely to output very high or very low confidence scores, as these imply certainty of correctness or incorrectness. Second, remember that the model is evaluating itself: correctness or incorrectness that may be obvious to larger models may be less obvious to a smaller model. Finally, teaching a model every fact it knows and doesn't know is not possible, so it must generalize to questions of wildly varying difficulty. Intuitively, it does this by extrapolating from related questions it was evaluated on during training; this is an inherently inexact process and leads to some hedging.
## Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources