---
model_id: Toto-1.0-QA-Experimental
tags:
- visual-question-answering
- time-series
- multimodal
- qwen3-vl
- lora
- anomaly-reasoning
- arfbench
- observability
paper:
- https://arxiv.org/abs/2604.21199
datasets:
- Datadog/ARFBench
leaderboards:
- ARFBench
license: apache-2.0
pipeline_tag: visual-question-answering
metrics:
- accuracy
- f1
base_model:
- Qwen/Qwen3-VL-32B-Instruct
- Datadog/Toto-Open-Base-1.0
---

# Toto-1.0-QA-Experimental

`Toto-1.0-QA-Experimental` is a hybrid time-series foundation model (TSFM) and vision-language model (VLM) built for ARFBench. It achieves macro F1 and accuracy comparable to top frontier models on the benchmark:

|![arfbench-accuracy-f1-combined](https://cdn-uploads.huggingface.co/production/uploads/681d68309722c5341cd3fa59/Fs1zeUOkZ6G_yPpOyvlYq.png)|
|:-:|
|Overall accuracy and F1 on the ARFBench time series question-answering benchmark, as of paper release. Toto-1.0-QA-Experimental achieves the top accuracy and comparable F1 to top frontier models.|


It combines:

- a vision-language backbone (`Qwen/Qwen3-VL-32B-Instruct`) for image-conditioned question answering,
- Toto time-series representations (`Datadog/Toto-Open-Base-1.0`),
- lightweight projection modules that inject time-series signals into VLM inference.

|![toto-vlm-arch](https://cdn-uploads.huggingface.co/production/uploads/681d68309722c5341cd3fa59/VOihICj_-HTNdbNyNseD_.png)|
|:-:|
|Overview of the Toto-1.0-QA-Experimental Architecture.|
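
Conceptually, the time-series path works like a standard modality adapter: Toto embeddings are projected into the VLM's hidden space and spliced into the language model's input sequence. The snippet below is a hypothetical, simplified sketch of that projection idea only; the dimensions, layer choices, and class name are illustrative and are not the model's actual implementation.

```python
import torch
import torch.nn as nn


class TimeSeriesProjector(nn.Module):
    """Toy adapter mapping time-series encoder features into the VLM embedding space."""

    def __init__(self, ts_dim: int, vlm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(ts_dim, vlm_dim),
            nn.GELU(),
            nn.Linear(vlm_dim, vlm_dim),
        )

    def forward(self, ts_features: torch.Tensor) -> torch.Tensor:
        # ts_features: [batch, n_ts_tokens, ts_dim] -> [batch, n_ts_tokens, vlm_dim]
        return self.proj(ts_features)


# Illustrative dimensions only: project dummy time-series features into a VLM-sized hidden space.
projector = TimeSeriesProjector(ts_dim=768, vlm_dim=5120)
ts_tokens = projector(torch.randn(1, 64, 768))
print(ts_tokens.shape)  # torch.Size([1, 64, 5120]), ready to splice alongside text/image embeddings
```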

This model repository stores inference artifacts, including:

- `vlm/` (merged vision-language model weights),
- `ts_modules.pt` (time-series modules),
- `config.json` and processor files.
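
To fetch all of these artifacts locally (for example, before running offline inference), you can use the standard Hugging Face Hub download utility:

```python
from huggingface_hub import snapshot_download

# Downloads vlm/, ts_modules.pt, config.json, and the processor files into the local cache.
local_dir = snapshot_download(repo_id="Datadog/Toto-1.0-QA-Experimental")
print(local_dir)
```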

---

## Basic Inference Example

The example below assumes you already have:

- time-series tensors,
- one or more image paths,
- a text question.

```python
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

# From our GitHub repository (https://github.com/DataDog/arfbench)
from model.toto_vlm_components import TotoAnomalyQAModel, TimeSeriesData

repo_id = "Datadog/Toto-1.0-QA-Experimental"

# Load model + processor from Hub artifact
model = TotoAnomalyQAModel.from_pretrained(
    repo_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(repo_id)
model.eval()

# -----------------------------------------------------------------------------
# Example input data (replace with your real tensors and inputs)
# -----------------------------------------------------------------------------
series = ...  # torch.Tensor, shape: [n_channels, n_timesteps], float32
padding_mask = ...  # torch.Tensor, shape: [n_channels, n_timesteps], bool
id_mask = ...  # torch.Tensor, shape: [n_channels, n_timesteps], float/bool
timestamp_seconds = ...  # torch.Tensor, shape: [n_channels, n_timesteps]
time_interval_seconds = ...  # torch.Tensor, shape: [n_channels]
group_names = ...  # list[str], length n_channels
question = "Does the anomaly in this time-series correlate with the anomaly in the other time-series, if anomalies exist?"
image_paths = ["./image_1.png", "./image_2.png"]

ts_data = TimeSeriesData(
    series=series,
    padding_mask=padding_mask,
    id_mask=id_mask,
    timestamp_seconds=timestamp_seconds,
    time_interval_seconds=time_interval_seconds,
    num_groups=series.shape[0],
    query_group="custom-query",
    group_names=group_names,
)

# Build multimodal chat input (images + text)
messages = [
    {
        "role": "system",
        "content": "You are an expert observability anomaly analyst.",
    },
    {
        "role": "user",
        "content": (
            [{"type": "image", "image": p} for p in image_paths]
            + [{"type": "text", "text": question}]
        ),
    },
]

text_prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
processed_images, _ = process_vision_info(messages)

inputs = processor(
    text=[text_prompt],
    images=[processed_images],
    return_tensors="pt",
    padding=True,
)

device = next(model.parameters()).device
inputs = {
    k: v.to(device) if isinstance(v, torch.Tensor) else v
    for k, v in inputs.items()
}

# Generate answer
with torch.no_grad():
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs.get("attention_mask"),
        pixel_values=inputs.get("pixel_values"),
        image_grid_thw=inputs.get("image_grid_thw"),
        ts_data=[ts_data],  # batch of 1
        max_new_tokens=512,
        do_sample=False,
    )

prompt_len = inputs["input_ids"].shape[1]
answer = processor.decode(
    output_ids[0, prompt_len:],
    skip_special_tokens=True,
).strip()

print("Answer:", answer)
```
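
If you just want to smoke-test the pipeline, the placeholder tensors above can be stubbed out with synthetic data of the documented shapes. The values below are arbitrary and purely illustrative:

```python
import torch

n_channels, n_timesteps = 2, 512

series = torch.randn(n_channels, n_timesteps, dtype=torch.float32)
padding_mask = torch.ones(n_channels, n_timesteps, dtype=torch.bool)  # all timesteps valid
id_mask = torch.zeros(n_channels, n_timesteps)                        # single variate group
timestamp_seconds = (torch.arange(n_timesteps) * 60).repeat(n_channels, 1)  # 1-minute cadence
time_interval_seconds = torch.full((n_channels,), 60)
group_names = ["cpu.usage", "memory.usage"]
```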

---

## Minimum Requirements

Running Toto-1.0-QA-Experimental typically requires a multi-GPU setup (tested on 4x A100 40GB GPUs). If memory is limited, reduce `--max-ts-length` and/or use the quantization flags.
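
If you load the model directly in Python rather than through the repository's CLI, one possible memory-saving path is 4-bit quantization via `bitsandbytes`. This sketch assumes `TotoAnomalyQAModel.from_pretrained` forwards `quantization_config` to the underlying backbone; verify against the repository code before relying on it.

```python
import torch
from transformers import BitsAndBytesConfig
from model.toto_vlm_components import TotoAnomalyQAModel

# Assumption: from_pretrained forwards quantization_config to the underlying Qwen3-VL weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = TotoAnomalyQAModel.from_pretrained(
    "Datadog/Toto-1.0-QA-Experimental",
    device_map="auto",
    quantization_config=bnb_config,
)
```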

---

## Resources

- [ARFBench Paper](https://arxiv.org/abs/2604.21199)
- [Dataset](https://huggingface.co/datasets/Datadog/ARFBench)
- [Leaderboard](https://huggingface.co/spaces/Datadog/ARFBench)
- [Code](https://github.com/DataDog/arfbench)

---

## Citation
```bibtex
@misc{xie2026arfbenchbenchmarkingtimeseries,
      title={ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response}, 
      author={Stephan Xie and Ben Cohen and Mononito Goswami and Junhong Shen and Emaad Khwaja and Chenghao Liu and David Asker and Othmane Abou-Amal and Ameet Talwalkar},
      year={2026},
      eprint={2604.21199},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2604.21199}, 
}
```