---
model_id: Toto-1.0-QA-Experimental
tags:
- visual-question-answering
- time-series
- multimodal
- qwen3-vl
- lora
- anomaly-reasoning
- arfbench
- observability
paper:
- https://arxiv.org/abs/2604.21199
datasets:
- Datadog/ARFBench
leaderboards:
- ARFBench
license: apache-2.0
pipeline_tag: visual-question-answering
metrics:
- accuracy
- f1
base_model:
- Qwen/Qwen3-VL-32B-Instruct
- Datadog/Toto-Open-Base-1.0
---

# Toto-1.0-QA-Experimental

`Toto-1.0-QA-Experimental` is a hybrid time-series foundation model (TSFM) and vision-language model (VLM) for ARFBench. It achieves the top accuracy and a macro F1 comparable to top frontier models on ARFBench:

*Overall accuracy and F1 on the ARFBench time series question-answering benchmark, as of paper release. Toto-1.0-QA-Experimental achieves the top accuracy and comparable F1 to top frontier models.*

It combines:

- a vision-language backbone (`Qwen/Qwen3-VL-32B-Instruct`) for image-conditioned question answering,
- Toto time-series representations (`Datadog/Toto-Open-Base-1.0`), and
- lightweight projection modules that inject time-series signals into VLM inference.
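
The released projection modules live in `ts_modules.pt`; purely as an illustration of the idea, such a module can be sketched as a small MLP that maps Toto embeddings into the VLM's hidden space (the `TimeSeriesProjector` name and all dimensions below are hypothetical, not the released implementation):

```python
import torch
import torch.nn as nn

class TimeSeriesProjector(nn.Module):
    """Hypothetical sketch: map time-series embeddings into the VLM token space."""

    def __init__(self, ts_dim: int, vlm_dim: int):
        super().__init__()
        # A two-layer MLP, a common choice for modality-projection layers.
        self.proj = nn.Sequential(
            nn.Linear(ts_dim, vlm_dim),
            nn.GELU(),
            nn.Linear(vlm_dim, vlm_dim),
        )

    def forward(self, ts_embeddings: torch.Tensor) -> torch.Tensor:
        # ts_embeddings: [batch, n_ts_tokens, ts_dim] -> [batch, n_ts_tokens, vlm_dim]
        return self.proj(ts_embeddings)

# Example: project 8 time-series tokens of width 768 into a 5120-wide VLM space.
projector = TimeSeriesProjector(ts_dim=768, vlm_dim=5120)
out = projector(torch.randn(1, 8, 768))
print(out.shape)  # torch.Size([1, 8, 5120])
```

The projected embeddings would then be interleaved with the image and text tokens before the VLM's decoder layers; see the repository for the actual wiring.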

*Overview of the Toto-1.0-QA-Experimental architecture.*
|
| This model repository stores inference artifacts, including: |
|
|
| - `vlm/` (merged vision-language model weights), |
| - `ts_modules.pt` (time-series modules), |
| - `config.json` and processor files. |
|
|
| --- |

## Basic Inference Example

The example below assumes you already have:

- time-series tensors,
- one or more image paths,
- a text question.
```python
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

# From our GitHub repository (https://github.com/DataDog/arfbench)
from model.toto_vlm_components import TotoAnomalyQAModel, TimeSeriesData

repo_id = "Datadog/Toto-1.0-QA-Experimental"

# Load model + processor from Hub artifact
model = TotoAnomalyQAModel.from_pretrained(
    repo_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(repo_id)
model.eval()

# -----------------------------------------------------------------------------
# Example input data (replace with your real tensors and inputs)
# -----------------------------------------------------------------------------
series = ...  # torch.Tensor, shape: [n_channels, n_timesteps], float32
padding_mask = ...  # torch.Tensor, shape: [n_channels, n_timesteps], bool
id_mask = ...  # torch.Tensor, shape: [n_channels, n_timesteps], float/bool
timestamp_seconds = ...  # torch.Tensor, shape: [n_channels, n_timesteps]
time_interval_seconds = ...  # torch.Tensor, shape: [n_channels]
group_names = ...  # list[str], length n_channels
question = "Does the anomaly in this time series correlate with the anomaly in the other time series, if anomalies exist?"
image_paths = ["./image_1.png", "./image_2.png"]

ts_data = TimeSeriesData(
    series=series,
    padding_mask=padding_mask,
    id_mask=id_mask,
    timestamp_seconds=timestamp_seconds,
    time_interval_seconds=time_interval_seconds,
    num_groups=series.shape[0],
    query_group="custom-query",
    group_names=group_names,
)

# Build multimodal chat input (images + text)
messages = [
    {
        "role": "system",
        "content": "You are an expert observability anomaly analyst.",
    },
    {
        "role": "user",
        "content": (
            [{"type": "image", "image": p} for p in image_paths]
            + [{"type": "text", "text": question}]
        ),
    },
]

text_prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
processed_images, _ = process_vision_info(messages)

inputs = processor(
    text=[text_prompt],
    images=[processed_images],
    return_tensors="pt",
    padding=True,
)

device = next(model.parameters()).device
inputs = {
    k: v.to(device) if isinstance(v, torch.Tensor) else v
    for k, v in inputs.items()
}

# Generate answer
with torch.no_grad():
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs.get("attention_mask"),
        pixel_values=inputs.get("pixel_values"),
        image_grid_thw=inputs.get("image_grid_thw"),
        ts_data=[ts_data],  # batch of 1
        max_new_tokens=512,
        do_sample=False,
    )

prompt_len = inputs["input_ids"].shape[1]
answer = processor.decode(
    output_ids[0, prompt_len:],
    skip_special_tokens=True,
).strip()

print("Answer:", answer)
```
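
To smoke-test the input pipeline without real telemetry, the placeholder tensors above can be filled with synthetic data of the expected shapes (shapes follow the comments in the example; the values below are arbitrary):

```python
import torch

n_channels, n_timesteps = 2, 512

# Synthetic stand-ins matching the shapes the example expects.
series = torch.randn(n_channels, n_timesteps, dtype=torch.float32)
padding_mask = torch.ones(n_channels, n_timesteps, dtype=torch.bool)
id_mask = torch.zeros(n_channels, n_timesteps, dtype=torch.float32)

# Timestamps spaced at a fixed 60 s interval, shared across channels.
time_interval_seconds = torch.full((n_channels,), 60)
timestamp_seconds = (
    torch.arange(n_timesteps).repeat(n_channels, 1) * time_interval_seconds[:, None]
)
group_names = [f"channel_{i}" for i in range(n_channels)]

print(series.shape, timestamp_seconds.shape)  # torch.Size([2, 512]) torch.Size([2, 512])
```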

---

## Minimum Requirements

Running Toto-1.0-QA-Experimental typically requires a multi-GPU setup (tested on 4x A100 40GB). If memory is limited, reduce `--max-ts-length` and/or use quantization flags.
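
One way to reduce the effective time-series length is to stride-downsample the series before building `TimeSeriesData`. This is a hedged sketch, not part of the released tooling: `shorten_series` and `max_ts_length` below are local names, the same stride must be applied to the masks and timestamp tensors, and whether downsampling degrades answer quality depends on your signals.

```python
import torch

def shorten_series(series: torch.Tensor, max_ts_length: int) -> torch.Tensor:
    """Stride-downsample a [n_channels, n_timesteps] tensor to at most max_ts_length steps."""
    n_timesteps = series.shape[-1]
    if n_timesteps <= max_ts_length:
        return series
    # Ceiling division so the result never exceeds max_ts_length.
    stride = -(-n_timesteps // max_ts_length)
    return series[..., ::stride]

series = torch.randn(3, 4096)
short = shorten_series(series, max_ts_length=1024)
print(short.shape)  # torch.Size([3, 1024])
```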

---

## Resources

- [ARFBench Paper](https://arxiv.org/abs/2604.21199)
- [Dataset](https://huggingface.co/datasets/Datadog/ARFBench)
- [Leaderboard](https://huggingface.co/spaces/Datadog/ARFBench)
- [Code](https://github.com/DataDog/arfbench)

---

## Citation

```bibtex
@misc{xie2026arfbenchbenchmarkingtimeseries,
      title={ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response},
      author={Stephan Xie and Ben Cohen and Mononito Goswami and Junhong Shen and Emaad Khwaja and Chenghao Liu and David Asker and Othmane Abou-Amal and Ameet Talwalkar},
      year={2026},
      eprint={2604.21199},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2604.21199},
}
```