---
model_id: Toto-1.0-QA-Experimental
tags:
- visual-question-answering
- time-series
- multimodal
- qwen3-vl
- lora
- anomaly-reasoning
- arfbench
- observability
paper:
- https://arxiv.org/abs/2604.21199
datasets:
- Datadog/ARFBench
leaderboards:
- ARFBench
license: apache-2.0
pipeline_tag: visual-question-answering
metrics:
- accuracy
- f1
base_model:
- Qwen/Qwen3-VL-32B-Instruct
- Datadog/Toto-Open-Base-1.0
---
# Toto-1.0-QA-Experimental
`Toto-1.0-QA-Experimental` is a hybrid time-series foundation model (TSFM) and vision-language model (VLM) for ARFBench. Its macro F1 and accuracy are comparable to those of top frontier models on ARFBench:
*Overall accuracy and F1 on the ARFBench time-series question-answering benchmark, as of paper release. Toto-1.0-QA-Experimental achieves the top accuracy and comparable F1 to top frontier models.*
It combines:
- a vision-language backbone (`Qwen/Qwen3-VL-32B-Instruct`) for image-conditioned question answering,
- Toto time-series representations (`Datadog/Toto-Open-Base-1.0`),
- lightweight projection modules that inject time-series signals into VLM inference.
*Overview of the Toto-1.0-QA-Experimental architecture.*
This model repository stores inference artifacts, including:
- `vlm/` (merged vision-language model weights),
- `ts_modules.pt` (time-series modules),
- `config.json` and processor files.
---
## Basic Inference Example
The example below assumes you already have:
- time-series tensors,
- one or more image paths,
- a text question.
```python
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info
# From our GitHub repository (https://github.com/DataDog/arfbench)
from model.toto_vlm_components import TotoAnomalyQAModel, TimeSeriesData
repo_id = "Datadog/Toto-1.0-QA-Experimental"
# Load model + processor from Hub artifact
model = TotoAnomalyQAModel.from_pretrained(
repo_id,
device_map="auto",
torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(repo_id)
model.eval()
# -----------------------------------------------------------------------------
# Example input data (replace with your real tensors and inputs)
# -----------------------------------------------------------------------------
series = ... # torch.Tensor, shape: [n_channels, n_timesteps], float32
padding_mask = ... # torch.Tensor, shape: [n_channels, n_timesteps], bool
id_mask = ... # torch.Tensor, shape: [n_channels, n_timesteps], float/bool
timestamp_seconds = ... # torch.Tensor, shape: [n_channels, n_timesteps]
time_interval_seconds = ... # torch.Tensor, shape: [n_channels]
group_names = ... # list[str], length n_channels
question = "Does the anomaly in this time-series correlate with the anomaly in the other time-series, if anomalies exist?"
image_paths = ["./image_1.png", "./image_2.png"]
ts_data = TimeSeriesData(
series=series,
padding_mask=padding_mask,
id_mask=id_mask,
timestamp_seconds=timestamp_seconds,
time_interval_seconds=time_interval_seconds,
num_groups=series.shape[0],
query_group="custom-query",
group_names=group_names,
)
# Build multimodal chat input (images + text)
messages = [
{
"role": "system",
"content": "You are an expert observability anomaly analyst.",
},
{
"role": "user",
"content": (
[{"type": "image", "image": p} for p in image_paths]
+ [{"type": "text", "text": question}]
),
},
]
text_prompt = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
processed_images, _ = process_vision_info(messages)
inputs = processor(
text=[text_prompt],
images=[processed_images],
return_tensors="pt",
padding=True,
)
device = next(model.parameters()).device
inputs = {
k: v.to(device) if isinstance(v, torch.Tensor) else v
for k, v in inputs.items()
}
# Generate answer
with torch.no_grad():
output_ids = model.generate(
input_ids=inputs["input_ids"],
attention_mask=inputs.get("attention_mask"),
pixel_values=inputs.get("pixel_values"),
image_grid_thw=inputs.get("image_grid_thw"),
ts_data=[ts_data], # batch of 1
max_new_tokens=512,
do_sample=False,
)
prompt_len = inputs["input_ids"].shape[1]
answer = processor.decode(
output_ids[0, prompt_len:],
skip_special_tokens=True,
).strip()
print("Answer:", answer)
```
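To smoke-test input shapes before wiring up real data, the placeholder tensors above can be stubbed with random values. This is a minimal sketch: the channel count, series length, and sampling interval below are arbitrary, and the field semantics (e.g. `id_mask`, per-channel timestamps) are assumptions inferred from the shape comments in the example.

```python
import torch

n_channels, n_timesteps = 3, 512  # arbitrary sizes for a smoke test

# Random values standing in for real metrics
series = torch.randn(n_channels, n_timesteps, dtype=torch.float32)
# All timesteps marked valid (no padding)
padding_mask = torch.ones(n_channels, n_timesteps, dtype=torch.bool)
# Single variate group for every channel (assumed semantics)
id_mask = torch.zeros(n_channels, n_timesteps)
# One point per minute, identical timestamps across channels
timestamp_seconds = (
    torch.arange(n_timesteps, dtype=torch.float32) * 60.0
).expand(n_channels, n_timesteps)
time_interval_seconds = torch.full((n_channels,), 60.0)
group_names = [f"series_{i}" for i in range(n_channels)]
```

These stubs satisfy the shapes documented in the example and are enough to exercise the preprocessing path end to end.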
---
## Minimum Requirements
Running Toto-1.0-QA-Experimental typically requires a multi-GPU setup (tested on 4x A100 40GB). If memory is limited, reduce `--max-ts-length` and/or use quantization flags.
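One way to reduce the memory footprint is 4-bit quantization of the VLM backbone via `transformers`. The fragment below is a sketch, not taken from this card: it assumes `TotoAnomalyQAModel.from_pretrained` forwards standard `transformers` loading kwargs, and the NF4 settings are illustrative defaults.

```python
import torch
from transformers import BitsAndBytesConfig

# Illustrative 4-bit NF4 quantization config (requires the bitsandbytes package)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Pass as `quantization_config=bnb_config` to `from_pretrained`, assuming the
# wrapper forwards standard transformers loading kwargs:
# model = TotoAnomalyQAModel.from_pretrained(
#     repo_id, device_map="auto", quantization_config=bnb_config
# )
```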
---
## Resources
- [ARFBench Paper](https://arxiv.org/abs/2604.21199)
- [Dataset](https://huggingface.co/datasets/Datadog/ARFBench)
- [Leaderboard](https://huggingface.co/spaces/Datadog/ARFBench)
- [Code](https://github.com/DataDog/arfbench)
---
## Citation
```bibtex
@misc{xie2026arfbenchbenchmarkingtimeseries,
title={ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response},
author={Stephan Xie and Ben Cohen and Mononito Goswami and Junhong Shen and Emaad Khwaja and Chenghao Liu and David Asker and Othmane Abou-Amal and Ameet Talwalkar},
year={2026},
eprint={2604.21199},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2604.21199},
}
``` |