---
model_id: Toto-1.0-QA-Experimental
tags:
- visual-question-answering
- time-series
- multimodal
- qwen3-vl
- lora
- anomaly-reasoning
- arfbench
- observability
paper:
- https://arxiv.org/abs/2604.21199
datasets:
- Datadog/ARFBench
leaderboards:
- ARFBench
license: apache-2.0
pipeline_tag: visual-question-answering
metrics:
- accuracy
- f1
base_model:
- Qwen/Qwen3-VL-32B-Instruct
- Datadog/Toto-Open-Base-1.0
---

# Toto-1.0-QA-Experimental

`Toto-1.0-QA-Experimental` is a hybrid time-series foundation model (TSFM) and vision-language model (VLM) built for ARFBench. It achieves macro F1 and accuracy comparable to top frontier models on the benchmark:

|![arfbench-accuracy-f1-combined](https://cdn-uploads.huggingface.co/production/uploads/681d68309722c5341cd3fa59/Fs1zeUOkZ6G_yPpOyvlYq.png)|
|:-:|
|Overall accuracy and F1 on the ARFBench time series question-answering benchmark, as of paper release. Toto-1.0-QA-Experimental achieves the top accuracy and comparable F1 to top frontier models.|


It combines:

- a vision-language backbone (`Qwen/Qwen3-VL-32B-Instruct`) for image-conditioned question answering,
- Toto time-series representations (`Datadog/Toto-Open-Base-1.0`),
- lightweight projection modules that inject time-series signals into VLM inference.

|![toto-vlm-arch](https://cdn-uploads.huggingface.co/production/uploads/681d68309722c5341cd3fa59/VOihICj_-HTNdbNyNseD_.png)|
|:-:|
|Overview of the Toto-1.0-QA-Experimental Architecture.|
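
Conceptually, the time-series path works like a standard modality adapter: Toto embeddings are projected into the VLM's hidden space and spliced into the language model's input sequence. The snippet below is a hypothetical, simplified sketch of that projection idea only; the dimensions, layer choices, and class name are illustrative and are not the model's actual implementation.

```python
import torch
import torch.nn as nn


class TimeSeriesProjector(nn.Module):
    """Toy adapter mapping time-series encoder features into the VLM embedding space."""

    def __init__(self, ts_dim: int, vlm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(ts_dim, vlm_dim),
            nn.GELU(),
            nn.Linear(vlm_dim, vlm_dim),
        )

    def forward(self, ts_features: torch.Tensor) -> torch.Tensor:
        # ts_features: [batch, n_ts_tokens, ts_dim] -> [batch, n_ts_tokens, vlm_dim]
        return self.proj(ts_features)


# Illustrative dimensions only: project dummy time-series features into a VLM-sized hidden space.
projector = TimeSeriesProjector(ts_dim=768, vlm_dim=5120)
ts_tokens = projector(torch.randn(1, 64, 768))
print(ts_tokens.shape)  # torch.Size([1, 64, 5120]), ready to splice alongside text/image embeddings
```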

This model repository stores inference artifacts, including:

- `vlm/` (merged vision-language model weights),
- `ts_modules.pt` (time-series modules),
- `config.json` and processor files.
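
To fetch all of these artifacts locally (for example, before running offline inference), you can use the standard Hugging Face Hub download utility:

```python
from huggingface_hub import snapshot_download

# Downloads vlm/, ts_modules.pt, config.json, and the processor files into the local cache.
local_dir = snapshot_download(repo_id="Datadog/Toto-1.0-QA-Experimental")
print(local_dir)
```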

---

## Basic Inference Example

The example below assumes you already have:

- time-series tensors,
- one or more image paths,
- a text question.

```python
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

# From our GitHub repository (https://github.com/DataDog/arfbench)
from model.toto_vlm_components import TotoAnomalyQAModel, TimeSeriesData

repo_id = "Datadog/Toto-1.0-QA-Experimental"

# Load model + processor from Hub artifact
model = TotoAnomalyQAModel.from_pretrained(
    repo_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(repo_id)
model.eval()

# -----------------------------------------------------------------------------
# Example input data (replace with your real tensors and inputs)
# -----------------------------------------------------------------------------
series = ...  # torch.Tensor, shape: [n_channels, n_timesteps], float32
padding_mask = ...  # torch.Tensor, shape: [n_channels, n_timesteps], bool
id_mask = ...  # torch.Tensor, shape: [n_channels, n_timesteps], float/bool
timestamp_seconds = ...  # torch.Tensor, shape: [n_channels, n_timesteps]
time_interval_seconds = ...  # torch.Tensor, shape: [n_channels]
group_names = ...  # list[str], length n_channels
question = "Does the anomaly in this time-series correlate with the anomaly in the other time-series, if anomalies exist?"
image_paths = ["./image_1.png", "./image_2.png"]

ts_data = TimeSeriesData(
    series=series,
    padding_mask=padding_mask,
    id_mask=id_mask,
    timestamp_seconds=timestamp_seconds,
    time_interval_seconds=time_interval_seconds,
    num_groups=series.shape[0],
    query_group="custom-query",
    group_names=group_names,
)

# Build multimodal chat input (images + text)
messages = [
    {
        "role": "system",
        "content": "You are an expert observability anomaly analyst.",
    },
    {
        "role": "user",
        "content": (
            [{"type": "image", "image": p} for p in image_paths]
            + [{"type": "text", "text": question}]
        ),
    },
]

text_prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
processed_images, _ = process_vision_info(messages)

inputs = processor(
    text=[text_prompt],
    images=[processed_images],
    return_tensors="pt",
    padding=True,
)

device = next(model.parameters()).device
inputs = {
    k: v.to(device) if isinstance(v, torch.Tensor) else v
    for k, v in inputs.items()
}

# Generate answer
with torch.no_grad():
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs.get("attention_mask"),
        pixel_values=inputs.get("pixel_values"),
        image_grid_thw=inputs.get("image_grid_thw"),
        ts_data=[ts_data],  # batch of 1
        max_new_tokens=512,
        do_sample=False,
    )

prompt_len = inputs["input_ids"].shape[1]
answer = processor.decode(
    output_ids[0, prompt_len:],
    skip_special_tokens=True,
).strip()

print("Answer:", answer)
```
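
If you just want to smoke-test the pipeline, the placeholder tensors above can be stubbed out with synthetic data of the documented shapes. The values below are arbitrary and purely illustrative:

```python
import torch

n_channels, n_timesteps = 2, 512

series = torch.randn(n_channels, n_timesteps, dtype=torch.float32)
padding_mask = torch.ones(n_channels, n_timesteps, dtype=torch.bool)  # all timesteps valid
id_mask = torch.zeros(n_channels, n_timesteps)                        # single variate group
timestamp_seconds = (torch.arange(n_timesteps) * 60).repeat(n_channels, 1)  # 1-minute cadence
time_interval_seconds = torch.full((n_channels,), 60)
group_names = ["cpu.usage", "memory.usage"]
```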

---

## Minimum Requirements

Running Toto-1.0-QA-Experimental typically requires a multi-GPU setup (tested on 4x A100 40GB GPUs). If memory is limited, reduce `--max-ts-length` and/or use the quantization flags.
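
If you load the model directly in Python rather than through the repository's CLI, one possible memory-saving path is 4-bit quantization via `bitsandbytes`. This sketch assumes `TotoAnomalyQAModel.from_pretrained` forwards `quantization_config` to the underlying backbone; verify against the repository code before relying on it.

```python
import torch
from transformers import BitsAndBytesConfig
from model.toto_vlm_components import TotoAnomalyQAModel

# Assumption: from_pretrained forwards quantization_config to the underlying Qwen3-VL weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = TotoAnomalyQAModel.from_pretrained(
    "Datadog/Toto-1.0-QA-Experimental",
    device_map="auto",
    quantization_config=bnb_config,
)
```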

---

## Resources

- [ARFBench Paper](https://arxiv.org/abs/2604.21199)
- [Dataset](https://huggingface.co/datasets/Datadog/ARFBench)
- [Leaderboard](https://huggingface.co/spaces/Datadog/ARFBench)
- [Code](https://github.com/DataDog/arfbench)

---

## Citation
```bibtex
@misc{xie2026arfbenchbenchmarkingtimeseries,
      title={ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response}, 
      author={Stephan Xie and Ben Cohen and Mononito Goswami and Junhong Shen and Emaad Khwaja and Chenghao Liu and David Asker and Othmane Abou-Amal and Ameet Talwalkar},
      year={2026},
      eprint={2604.21199},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2604.21199}, 
}
```