---
model_id: Toto-1.0-QA-Experimental
tags:
- visual-question-answering
- time-series
- multimodal
- qwen3-vl
- lora
- anomaly-reasoning
- arfbench
- observability
paper:
- https://arxiv.org/abs/2604.21199
datasets:
- Datadog/ARFBench
leaderboards:
- ARFBench
license: apache-2.0
pipeline_tag: visual-question-answering
metrics:
- accuracy
- f1
base_model:
- Qwen/Qwen3-VL-32B-Instruct
- Datadog/Toto-Open-Base-1.0
---

# Toto-1.0-QA-Experimental

`Toto-1.0-QA-Experimental` is a hybrid time-series foundation model (TSFM) and vision-language model (VLM) for ARFBench. It achieves accuracy and macro F1 comparable to top frontier models on ARFBench:

|![arfbench-accuracy-f1-combined](https://cdn-uploads.huggingface.co/production/uploads/681d68309722c5341cd3fa59/Fs1zeUOkZ6G_yPpOyvlYq.png)|
|:-:|
|Overall accuracy and F1 on the ARFBench time series question-answering benchmark, as of paper release. Toto-1.0-QA-Experimental achieves the top accuracy and comparable F1 to top frontier models.|

It combines:

- a vision-language backbone (`Qwen/Qwen3-VL-32B-Instruct`) for image-conditioned question answering,
- Toto time-series representations (`Datadog/Toto-Open-Base-1.0`),
- lightweight projection modules that inject time-series signals into VLM inference.

|![toto-vlm-arch](https://cdn-uploads.huggingface.co/production/uploads/681d68309722c5341cd3fa59/VOihICj_-HTNdbNyNseD_.png)|
|:-:|
|Overview of the Toto-1.0-QA-Experimental Architecture.|

This model repository stores inference artifacts, including:

- `vlm/` (merged vision-language model weights),
- `ts_modules.pt` (time-series modules),
- `config.json` and processor files.

---

## Basic Inference Example

The example below assumes you already have:

- time-series tensors,
- one or more image paths,
- a text question.

```python
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

# From our GitHub repository (https://github.com/DataDog/arfbench)
from model.toto_vlm_components import TotoAnomalyQAModel, TimeSeriesData

repo_id = "Datadog/Toto-1.0-QA-Experimental"

# Load model + processor from the Hub artifact
model = TotoAnomalyQAModel.from_pretrained(
    repo_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(repo_id)
model.eval()

# -----------------------------------------------------------------------------
# Example input data (replace with your real tensors and inputs)
# -----------------------------------------------------------------------------
series = ...                 # torch.Tensor, shape: [n_channels, n_timesteps], float32
padding_mask = ...           # torch.Tensor, shape: [n_channels, n_timesteps], bool
id_mask = ...                # torch.Tensor, shape: [n_channels, n_timesteps], float/bool
timestamp_seconds = ...      # torch.Tensor, shape: [n_channels, n_timesteps]
time_interval_seconds = ...  # torch.Tensor, shape: [n_channels]
group_names = ...            # list[str], length n_channels

question = (
    "In the following time series, does the anomaly in this time series "
    "correlate with the anomaly in the other time series, if anomalies exist?"
)
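# -----------------------------------------------------------------------------
# Illustrative synthetic-data sketch (not part of the repository API): one way
# the placeholders above could be filled in for a quick smoke test. The
# 60-second sampling interval, the all-True padding mask, the all-zeros
# `id_mask`, and the metric names are assumptions for illustration only;
# consult the arfbench repository for the exact tensor semantics and replace
# this block with your real data loading.
# -----------------------------------------------------------------------------
import numpy as np

n_channels, n_timesteps, step_seconds = 2, 512, 60
raw_values = np.random.randn(n_channels, n_timesteps).astype("float32")

series = torch.from_numpy(raw_values)
padding_mask = torch.ones_like(series, dtype=torch.bool)  # every point observed
id_mask = torch.zeros_like(series)                        # placeholder grouping
start_epoch = 1_700_000_000
timestamp_seconds = torch.arange(
    start_epoch, start_epoch + step_seconds * n_timesteps, step_seconds
).expand(n_channels, -1)
time_interval_seconds = torch.full((n_channels,), step_seconds)
group_names = ["cpu.utilization", "memory.usage"]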
image_paths = ["./image_1.png", "./image_2.png"] ts_data = TimeSeriesData( series=series, padding_mask=padding_mask, id_mask=id_mask, timestamp_seconds=timestamp_seconds, time_interval_seconds=time_interval_seconds, num_groups=series.shape[0], query_group="custom-query", group_names=group_names, ) # Build multimodal chat input (images + text) messages = [ { "role": "system", "content": "You are an expert observability anomaly analyst.", }, { "role": "user", "content": ( [{"type": "image", "image": p} for p in image_paths] + [{"type": "text", "text": question}] ), }, ] text_prompt = processor.apply_chat_template( messages, tokenize=False, add_generation_prompt=True, ) processed_images, _ = process_vision_info(messages) inputs = processor( text=[text_prompt], images=[processed_images], return_tensors="pt", padding=True, ) device = next(model.parameters()).device inputs = { k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items() } # Generate answer with torch.no_grad(): output_ids = model.generate( input_ids=inputs["input_ids"], attention_mask=inputs.get("attention_mask"), pixel_values=inputs.get("pixel_values"), image_grid_thw=inputs.get("image_grid_thw"), ts_data=[ts_data], # batch of 1 max_new_tokens=512, do_sample=False, ) prompt_len = inputs["input_ids"].shape[1] answer = processor.decode( output_ids[0, prompt_len:], skip_special_tokens=True, ).strip() print("Answer:", answer) ``` --- ## Minimum Requirements Running Toto-1.0-QA-Experimental typically requires multi-GPU setup (tested on 4x A100 40GB). If memory is limited, reduce `--max-ts-length` and/or use quantization flags. --- ## Resources - [ARFBench Paper](https://arxiv.org/abs/2604.21199) - [Dataset](https://huggingface.co/datasets/Datadog/ARFBench) - [Leaderboard](https://huggingface.co/spaces/Datadog/ARFBench) - [Code](https://github.com/DataDog/arfbench) --- ## Citation ```bibtex @misc{xie2026arfbenchbenchmarkingtimeseries, title={ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response}, author={Stephan Xie and Ben Cohen and Mononito Goswami and Junhong Shen and Emaad Khwaja and Chenghao Liu and David Asker and Othmane Abou-Amal and Ameet Talwalkar}, year={2026}, eprint={2604.21199}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2604.21199}, } ```