---
model_id: Toto-1.0-QA-Experimental
tags:
- visual-question-answering
- time-series
- multimodal
- qwen3-vl
- lora
- anomaly-reasoning
- arfbench
- observability
paper:
- https://arxiv.org/abs/2604.21199
datasets:
- Datadog/ARFBench
leaderboards:
- ARFBench
license: apache-2.0
pipeline_tag: visual-question-answering
metrics:
- accuracy
- f1
base_model:
- Qwen/Qwen3-VL-32B-Instruct
- Datadog/Toto-Open-Base-1.0
---

# Toto-1.0-QA-Experimental

`Toto-1.0-QA-Experimental` is a hybrid time-series foundation model (TSFM) and vision-language model (VLM) for ARFBench. It achieves the top accuracy and a macro F1 comparable to top frontier models on ARFBench:

*Overall accuracy and F1 on the ARFBench time series question-answering benchmark, as of paper release. Toto-1.0-QA-Experimental achieves the top accuracy and comparable F1 to top frontier models.*

It combines:

- a vision-language backbone (`Qwen/Qwen3-VL-32B-Instruct`) for image-conditioned question answering,
- Toto time-series representations (`Datadog/Toto-Open-Base-1.0`), and
- lightweight projection modules that inject time-series signals into VLM inference.
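
The released projection modules live in `ts_modules.pt`; purely as an illustration of the idea, such a module can be sketched as a small MLP that maps Toto embeddings into the VLM's hidden space (the `TimeSeriesProjector` name and all dimensions below are hypothetical, not the released implementation):

```python
import torch
import torch.nn as nn

class TimeSeriesProjector(nn.Module):
    """Hypothetical sketch: map time-series embeddings into the VLM token space."""

    def __init__(self, ts_dim: int, vlm_dim: int):
        super().__init__()
        # A two-layer MLP, a common choice for modality-projection layers.
        self.proj = nn.Sequential(
            nn.Linear(ts_dim, vlm_dim),
            nn.GELU(),
            nn.Linear(vlm_dim, vlm_dim),
        )

    def forward(self, ts_embeddings: torch.Tensor) -> torch.Tensor:
        # ts_embeddings: [batch, n_ts_tokens, ts_dim] -> [batch, n_ts_tokens, vlm_dim]
        return self.proj(ts_embeddings)

# Example: project 8 time-series tokens of width 768 into a 5120-wide VLM space.
projector = TimeSeriesProjector(ts_dim=768, vlm_dim=5120)
out = projector(torch.randn(1, 8, 768))
print(out.shape)  # torch.Size([1, 8, 5120])
```

The projected embeddings would then be interleaved with the image and text tokens before the VLM's decoder layers; see the repository for the actual wiring.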

*Overview of the Toto-1.0-QA-Experimental architecture.*
|
| This model repository stores inference artifacts, including: |
|
|
| - `vlm/` (merged vision-language model weights), |
| - `ts_modules.pt` (time-series modules), |
| - `config.json` and processor files. |
|
|
| --- |

## Basic Inference Example

The example below assumes you already have:

- time-series tensors,
- one or more image paths,
- a text question.
```python
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

# From our GitHub repository (https://github.com/DataDog/arfbench)
from model.toto_vlm_components import TotoAnomalyQAModel, TimeSeriesData

repo_id = "Datadog/Toto-1.0-QA-Experimental"

# Load model + processor from Hub artifact
model = TotoAnomalyQAModel.from_pretrained(
    repo_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(repo_id)
model.eval()

# -----------------------------------------------------------------------------
# Example input data (replace with your real tensors and inputs)
# -----------------------------------------------------------------------------
series = ...  # torch.Tensor, shape: [n_channels, n_timesteps], float32
padding_mask = ...  # torch.Tensor, shape: [n_channels, n_timesteps], bool
id_mask = ...  # torch.Tensor, shape: [n_channels, n_timesteps], float/bool
timestamp_seconds = ...  # torch.Tensor, shape: [n_channels, n_timesteps]
time_interval_seconds = ...  # torch.Tensor, shape: [n_channels]
group_names = ...  # list[str], length n_channels
question = "Does the anomaly in this time series correlate with the anomaly in the other time series, if anomalies exist?"
image_paths = ["./image_1.png", "./image_2.png"]

ts_data = TimeSeriesData(
    series=series,
    padding_mask=padding_mask,
    id_mask=id_mask,
    timestamp_seconds=timestamp_seconds,
    time_interval_seconds=time_interval_seconds,
    num_groups=series.shape[0],
    query_group="custom-query",
    group_names=group_names,
)

# Build multimodal chat input (images + text)
messages = [
    {
        "role": "system",
        "content": "You are an expert observability anomaly analyst.",
    },
    {
        "role": "user",
        "content": (
            [{"type": "image", "image": p} for p in image_paths]
            + [{"type": "text", "text": question}]
        ),
    },
]

text_prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
processed_images, _ = process_vision_info(messages)

inputs = processor(
    text=[text_prompt],
    images=[processed_images],
    return_tensors="pt",
    padding=True,
)

device = next(model.parameters()).device
inputs = {
    k: v.to(device) if isinstance(v, torch.Tensor) else v
    for k, v in inputs.items()
}

# Generate answer
with torch.no_grad():
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs.get("attention_mask"),
        pixel_values=inputs.get("pixel_values"),
        image_grid_thw=inputs.get("image_grid_thw"),
        ts_data=[ts_data],  # batch of 1
        max_new_tokens=512,
        do_sample=False,
    )

prompt_len = inputs["input_ids"].shape[1]
answer = processor.decode(
    output_ids[0, prompt_len:],
    skip_special_tokens=True,
).strip()

print("Answer:", answer)
```
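
To smoke-test the input pipeline without real telemetry, the placeholder tensors above can be filled with synthetic data of the expected shapes (shapes follow the comments in the example; the values below are arbitrary):

```python
import torch

n_channels, n_timesteps = 2, 512

# Synthetic stand-ins matching the shapes the example expects.
series = torch.randn(n_channels, n_timesteps, dtype=torch.float32)
padding_mask = torch.ones(n_channels, n_timesteps, dtype=torch.bool)
id_mask = torch.zeros(n_channels, n_timesteps, dtype=torch.float32)

# Timestamps spaced at a fixed 60 s interval, shared across channels.
time_interval_seconds = torch.full((n_channels,), 60)
timestamp_seconds = (
    torch.arange(n_timesteps).repeat(n_channels, 1) * time_interval_seconds[:, None]
)
group_names = [f"channel_{i}" for i in range(n_channels)]

print(series.shape, timestamp_seconds.shape)  # torch.Size([2, 512]) torch.Size([2, 512])
```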

---

## Minimum Requirements

Running Toto-1.0-QA-Experimental typically requires a multi-GPU setup (tested on 4x A100 40GB). If memory is limited, reduce `--max-ts-length` and/or use quantization flags.
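
One way to reduce the effective time-series length is to stride-downsample the series before building `TimeSeriesData`. This is a hedged sketch, not part of the released tooling: `shorten_series` and `max_ts_length` below are local names, the same stride must be applied to the masks and timestamp tensors, and whether downsampling degrades answer quality depends on your signals.

```python
import torch

def shorten_series(series: torch.Tensor, max_ts_length: int) -> torch.Tensor:
    """Stride-downsample a [n_channels, n_timesteps] tensor to at most max_ts_length steps."""
    n_timesteps = series.shape[-1]
    if n_timesteps <= max_ts_length:
        return series
    # Ceiling division so the result never exceeds max_ts_length.
    stride = -(-n_timesteps // max_ts_length)
    return series[..., ::stride]

series = torch.randn(3, 4096)
short = shorten_series(series, max_ts_length=1024)
print(short.shape)  # torch.Size([3, 1024])
```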

---

## Resources

- [ARFBench Paper](https://arxiv.org/abs/2604.21199)
- [Dataset](https://huggingface.co/datasets/Datadog/ARFBench)
- [Leaderboard](https://huggingface.co/spaces/Datadog/ARFBench)
- [Code](https://github.com/DataDog/arfbench)

---

## Citation

```bibtex
@misc{xie2026arfbenchbenchmarkingtimeseries,
      title={ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response},
      author={Stephan Xie and Ben Cohen and Mononito Goswami and Junhong Shen and Emaad Khwaja and Chenghao Liu and David Asker and Othmane Abou-Amal and Ameet Talwalkar},
      year={2026},
      eprint={2604.21199},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2604.21199},
}
```