# Model Card for Meow-Omni 1
Meow-Omni 1 is the world's first Multimodal Large Language Model (MLLM) specifically engineered for Computational Ethology. It natively co-embeds four distinct modalities (Text, Video, Audio, and Biological Time-Series) to decode the latent intentions of non-verbal species.
📄 Paper: [arXiv:2605.09152](https://arxiv.org/abs/2605.09152)
💻 Code: [github.com/smgjch/Meow-Omni-1](https://github.com/smgjch/Meow-Omni-1) (full training and evaluation pipeline open-sourced)
## 🐾 Model Summary
Meow-Omni 1 is the fine-tuned, intent-aligned version of the Meow-Omni 1-Base architecture. By training on the Meow-10K dataset with a novel Next-Behaviour Prediction (NBP) objective, this model moves beyond simple action recognition to resolve "semantic aliasing": distinguishing, for example, between contentment-purring and distress-purring by correlating vocalizations with internal physiological markers (ECG/EEG).
- Fine-tuned from: Meow-Omni 1-Base
- Primary Task: Feline Intention Decoding and Behavioural Interpretation
## 🌟 Key Features
- Quad-Modal Reasoning: Simultaneously processes visual cues, acoustic signals, high-frequency biometrics, and textual instructions within a single transformer context.
- Explainable Ethology: Unlike black-box classifiers, Meow-Omni 1 can articulate the causal relationship between a physiological spike and a behavioural display in natural language.
- Uncertainty Quantification: Built-in predictive entropy allows the model to flag ambiguous or contradictory signals (e.g., when biometrics contradict visual cues), supporting safer clinical use; see the sketch after this list.
- Lightweight Deployment: Engineered with minimal dependencies to ensure reproducibility and accessibility for researchers in wildlife conservation.
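The uncertainty flag can be approximated from ordinary generation outputs. Below is a minimal sketch, not the model's internal implementation: it estimates the predictive entropy of the first answer token via the standard Hugging Face `generate` API, with `model` and `inputs` prepared as in the usage example further down. The flagging threshold is purely illustrative.

```python
import torch
import torch.nn.functional as F

def predictive_entropy(model, inputs) -> float:
    """Entropy of the next-token distribution; higher means more ambiguous.

    `inputs` is a processor output dict, as built in the "How to Use" example.
    """
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=1,
            do_sample=False,
            output_scores=True,
            return_dict_in_generate=True,
        )
    probs = F.softmax(out.scores[0][0].float(), dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum().item()

# Hypothetical threshold: flag high-entropy (contradictory-signal) cases.
# if predictive_entropy(model, inputs) > 2.0:
#     print("Ambiguous signals detected; defer to a veterinary expert.")
```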
## 📊 Performance: MeowBench
All models are evaluated on MeowBench, an expert-verified, quad-modal multiple-choice question set covering 30 feline intent categories.
### Comparison with State-of-the-Art Baselines
| Model | Vision | Audio | TS | Accuracy |
|---|---|---|---|---|
| Acoustic SOTA (Ntalampiras et al., SVM/HMM) | | ✅ | | 36.86% |
| TS SOTA (Chen et al., 1D-CNN + LSTM on IMU) | | | ✅ | 48.98% |
| Video SOTA (Qwen3.5-122B-A10B, zero-shot) | ✅ | | | 61.95% |
| Qwen3.5-Omni-Plus (V + A) | ✅ | ✅ | | 65.36% |
| Qwen3.5-Omni-Plus (V + TS†) | ✅ | | ✅† | 66.21% |
| Qwen3.5-Omni-Plus (TS† + A) | | ✅ | ✅† | 42.15% |
| Qwen3.5-Omni-Plus (V + A + TS†) | ✅ | ✅ | ✅† | 66.89% |
| **Meow-Omni 1 (Ours)** | ✅ | ✅ | ✅ | **71.16%** |
† Qwen3.5-Omni-Plus does not accept raw time-series as a native modality; TS data was injected as a structured textual summary (array statistics per channel). Meow-Omni 1 processes raw TS natively.
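For reference, a textual summary of this kind can be produced as follows. This is a minimal sketch assuming a channels-by-samples NumPy array; the exact per-channel statistics and prompt format used for the baseline are assumptions here.

```python
import numpy as np

def ts_to_text(ts: np.ndarray, channel_names: list[str]) -> str:
    """Summarize each channel as basic array statistics, one line per channel.

    Illustrative only: the statistics actually injected into the
    Qwen3.5-Omni-Plus prompts may differ.
    """
    lines = []
    for name, channel in zip(channel_names, ts):
        lines.append(
            f"{name}: mean={channel.mean():.3f}, std={channel.std():.3f}, "
            f"min={channel.min():.3f}, max={channel.max():.3f}"
        )
    return "\n".join(lines)

# Example: 3-axis IMU at 100 Hz for 10 s (hypothetical shapes).
imu = np.random.randn(3, 1000)
print(ts_to_text(imu, ["acc_x", "acc_y", "acc_z"]))
```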
## 🛠️ How to Use
Meow-Omni 1 accepts four inputs:
- Video: Behavioral context.
- Audio: Vocalization patterns.
- Time-Series: IMU data (via custom control tokens).
- Text: Instructions or questions regarding the animal's state.
```python
import torch
import soundfile as sf
import numpy as np
from PIL import Image
from decord import VideoReader, cpu
from modeling_meow_omni_1 import MeowOmni1ForCausalLM
from processing_meow_omni_1 import MeowOmni1Processor
# 1. Setup Model and Processor
model_path = "smgjch/Meow-Omni-1"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = MeowOmni1Processor.from_pretrained(model_path, trust_remote_code=True)
model = MeowOmni1ForCausalLM.from_pretrained(
model_path,
trust_remote_code=True,
torch_dtype=torch.bfloat16
).to(device).eval()
# 2. Prepare Modality Inputs
video_path = "sample_cat_video.mp4"
audio_path = "sample_cat_purr.wav"
ts_path = "sample_biometrics.json"
# Process Video (16 frames)
vr = VideoReader(video_path, ctx=cpu(0))
indices = np.linspace(0, len(vr) - 1, 16, dtype=int)
frames = [Image.fromarray(f).convert("RGB") for f in vr.get_batch(indices).asnumpy()]
# Process Audio (downmix to mono; truncate to 480,000 samples, ~30 s assuming 16 kHz audio)
audio_arr, _ = sf.read(audio_path)
if audio_arr.ndim > 1:
    audio_arr = audio_arr.mean(axis=1)  # stereo -> mono
audios = [audio_arr[:480000].astype(np.float32)]
# 3. Construct Prompt with Modal Placeholders
# Note: Placeholders MUST match the number of input items (e.g., 16 image tags for 16 frames)
placeholders = (
"".join(["<image>./</image>"] * len(frames)) + # Video frames
"<audio>./</audio>" + # Audio stream
"<|ts_start|><|ts_unit|><|ts_end|>" # Time-series block
)
raw_query = "Analyze the provided multi-modal data. What is this cat's intention?"
prompt = f"User: {placeholders}\n{raw_query}\nAssistant:"
# 4. Run Inference
inputs = processor(
text=[prompt],
images=frames,
audios=audios,
time_series_paths=[ts_path],
time_series_sampling_rates=[100.0],
return_tensors="pt"
).to(device)
with torch.no_grad():
output = model.generate(
**inputs,
max_new_tokens=128,
do_sample=True,
temperature=0.7,
top_p=0.95
)
response = processor.tokenizer.decode(output[0], skip_special_tokens=True)
print(f"\n🐱 Meow-Omni 1 Analysis:\n{response}")
```
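Note: the snippet above uses stochastic decoding (`do_sample=True`, `temperature=0.7`), which suits free-form analysis; for reproducible, benchmark-style runs such as MeowBench scoring, deterministic decoding with `do_sample=False` is generally preferable.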
## 💻 Open-Source Codebase
The full training and evaluation pipeline is available at github.com/smgjch/Meow-Omni-1:
- Stage 1 (Projector Alignment): `run_pretrain.sh` trains only the time-series projector on 383K TS samples, with all other weights frozen.
- Stage 2 (Supervised Fine-Tuning): `run_postrain.sh` fine-tunes the LLM backbone on Meow-10K, with all encoders frozen (the freezing scheme is sketched below).
- Evaluation: `eval_meow_omni_1.sh` / `src/evaluation/eval_meow.py` runs MeowBench with auto-resume support and per-modality ablation via `--modals`.
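As a rough illustration of the staged freezing, the sketch below toggles `requires_grad` per stage. The submodule names `ts_projector` and `llm` are hypothetical; consult the repository's modeling code for the actual attribute names.

```python
def set_trainable(model, stage: int) -> None:
    """Freeze/unfreeze parameters per training stage (illustrative only)."""
    # Freeze everything first.
    for p in model.parameters():
        p.requires_grad = False

    if stage == 1:
        # Stage 1: train only the time-series projector; all else frozen.
        for p in model.ts_projector.parameters():  # hypothetical attribute
            p.requires_grad = True
    elif stage == 2:
        # Stage 2: fine-tune the LLM backbone; encoders stay frozen.
        for p in model.llm.parameters():  # hypothetical attribute
            p.requires_grad = True
```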
## 📚 The Meow-Omni Ecosystem
| Resource | Link |
|---|---|
| 📄 Paper | [arXiv:2605.09152](https://arxiv.org/abs/2605.09152) |
| 💻 Code | [github.com/smgjch/Meow-Omni-1](https://github.com/smgjch/Meow-Omni-1) |
| 🤗 Base Model | Meow-Omni 1-Base |
| 📦 Training Dataset | Meow-10K |
| 📊 Benchmark | MeowBench |
## 📝 Citation
If you find our work helpful, please cite us using the following BibTeX entry:
```bibtex
@misc{hu2026meowomni1multimodallarge,
      title={Meow-Omni 1: A Multimodal Large Language Model for Feline Ethology},
      author={Jucheng Hu and Zhangquan Chen and Yulin Chen and Chengjie Hong and Liang Zhou and Tairan Wang and Sifei Li and Giulio Zhu and Feng Zhou and Yiheng Zeng and Suorong Yang and Dongzhan Zhou},
      year={2026},
      eprint={2605.09152},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.09152},
}
```