# Model Card for Meow-Omni 1
Meow-Omni 1 is the world's first Multimodal Large Language Model (MLLM) specifically engineered for Computational Ethology. It natively co-embeds four distinct modalities (Text, Video, Audio, and Biological Time-Series) to decode the latent intentions of non-verbal species.
📄 Paper: [arXiv:2605.09152](https://arxiv.org/abs/2605.09152)
💻 Code: [github.com/smgjch/Meow-Omni-1](https://github.com/smgjch/Meow-Omni-1) (full training and evaluation pipeline open-sourced)
## 🐾 Model Summary
Meow-Omni 1 is the fine-tuned, intent-aligned version of the Meow-Omni 1-Base architecture. By training on the Meow-10K dataset with a novel Next-Behaviour Prediction (NBP) objective, this model moves beyond simple action recognition to resolve "semantic aliasing": distinguishing, for example, between contentment-purring and distress-purring by correlating vocalizations with internal physiological markers (ECG/EEG).
- Fine-tuned from: Meow-Omni 1-Base
- Primary Task: Feline Intention Decoding and Behavioural Interpretation
## 🌟 Key Features
- Quad-Modal Reasoning: Simultaneously processes visual cues, acoustic signals, high-frequency biometrics, and textual instructions within a single transformer context.
- Explainable Ethology: Unlike black-box classifiers, Meow-Omni 1 can articulate the causal relationship between a physiological spike and a behavioural display in natural language.
- Uncertainty Quantification: Built-in predictive entropy allows the model to flag ambiguous or contradictory signals (e.g., when biometrics contradict visual cues), supporting safer clinical use; see the sketch after this list.
- Lightweight Deployment: Engineered with minimal dependencies to ensure reproducibility and accessibility for researchers in wildlife conservation.
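The uncertainty flag can be approximated from ordinary generation outputs. Below is a minimal sketch, not the model's internal implementation: it estimates the predictive entropy of the first answer token via the standard Hugging Face `generate` API, with `model` and `inputs` prepared as in the usage example further down. The flagging threshold is purely illustrative.

```python
import torch
import torch.nn.functional as F

def predictive_entropy(model, inputs) -> float:
    """Entropy of the next-token distribution; higher means more ambiguous.

    `inputs` is a processor output dict, as built in the "How to Use" example.
    """
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=1,
            do_sample=False,
            output_scores=True,
            return_dict_in_generate=True,
        )
    probs = F.softmax(out.scores[0][0].float(), dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum().item()

# Hypothetical threshold: flag high-entropy (contradictory-signal) cases.
# if predictive_entropy(model, inputs) > 2.0:
#     print("Ambiguous signals detected; defer to a veterinary expert.")
```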
## 📊 Performance: MeowBench
All models are evaluated on MeowBench, an expert-verified, quad-modal multiple-choice question set covering 30 feline intent categories.
### Comparison with State-of-the-Art Baselines
| Model | Vision | Audio | TS | Accuracy |
|---|---|---|---|---|
| Acoustic SOTA (Ntalampiras et al., SVM/HMM) | | ✅ | | 36.86% |
| TS SOTA (Chen et al., 1D-CNN + LSTM on IMU) | | | ✅ | 48.98% |
| Video SOTA (Qwen3.5-122B-A10B, zero-shot) | ✅ | | | 61.95% |
| Qwen3.5-Omni-Plus (V + A) | ✅ | ✅ | | 65.36% |
| Qwen3.5-Omni-Plus (V + TS†) | ✅ | | ✅† | 66.21% |
| Qwen3.5-Omni-Plus (TS† + A) | | ✅ | ✅† | 42.15% |
| Qwen3.5-Omni-Plus (V + A + TS†) | ✅ | ✅ | ✅† | 66.89% |
| **Meow-Omni 1 (Ours)** | ✅ | ✅ | ✅ | **71.16%** |
† Qwen3.5-Omni-Plus does not accept raw time-series as a native modality; TS data was injected as a structured textual summary (array statistics per channel). Meow-Omni 1 processes raw TS natively.
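For reference, a textual summary of this kind can be produced as follows. This is a minimal sketch assuming a channels-by-samples NumPy array; the exact per-channel statistics and prompt format used for the baseline are assumptions here.

```python
import numpy as np

def ts_to_text(ts: np.ndarray, channel_names: list[str]) -> str:
    """Summarize each channel as basic array statistics, one line per channel.

    Illustrative only: the statistics actually injected into the
    Qwen3.5-Omni-Plus prompts may differ.
    """
    lines = []
    for name, channel in zip(channel_names, ts):
        lines.append(
            f"{name}: mean={channel.mean():.3f}, std={channel.std():.3f}, "
            f"min={channel.min():.3f}, max={channel.max():.3f}"
        )
    return "\n".join(lines)

# Example: 3-axis IMU at 100 Hz for 10 s (hypothetical shapes).
imu = np.random.randn(3, 1000)
print(ts_to_text(imu, ["acc_x", "acc_y", "acc_z"]))
```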
## 🛠️ How to Use
Meow-Omni 1 accepts four inputs:
- Video: Behavioral context.
- Audio: Vocalization patterns.
- Time-Series: IMU data (via custom control tokens).
- Text: Instructions or questions regarding the animal's state.
```python
import torch
import soundfile as sf
import numpy as np
from PIL import Image
from decord import VideoReader, cpu
from modeling_meow_omni_1 import MeowOmni1ForCausalLM
from processing_meow_omni_1 import MeowOmni1Processor
# 1. Setup Model and Processor
model_path = "smgjch/Meow-Omni-1"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = MeowOmni1Processor.from_pretrained(model_path, trust_remote_code=True)
model = MeowOmni1ForCausalLM.from_pretrained(
model_path,
trust_remote_code=True,
torch_dtype=torch.bfloat16
).to(device).eval()
# 2. Prepare Modality Inputs
video_path = "sample_cat_video.mp4"
audio_path = "sample_cat_purr.wav"
ts_path = "sample_biometrics.json"
# Process Video (16 frames)
vr = VideoReader(video_path, ctx=cpu(0))
indices = np.linspace(0, len(vr) - 1, 16, dtype=int)
frames = [Image.fromarray(f).convert("RGB") for f in vr.get_batch(indices).asnumpy()]
# Process Audio (downmix to mono; truncate to 480,000 samples, ~30 s assuming 16 kHz audio)
audio_arr, _ = sf.read(audio_path)
if audio_arr.ndim > 1:
    audio_arr = audio_arr.mean(axis=1)  # stereo -> mono
audios = [audio_arr[:480000].astype(np.float32)]
# 3. Construct Prompt with Modal Placeholders
# Note: Placeholders MUST match the number of input items (e.g., 16 image tags for 16 frames)
placeholders = (
"".join(["<image>./</image>"] * len(frames)) + # Video frames
"<audio>./</audio>" + # Audio stream
"<|ts_start|><|ts_unit|><|ts_end|>" # Time-series block
)
raw_query = "Analyze the provided multi-modal data. What is this cat's intention?"
prompt = f"User: {placeholders}\n{raw_query}\nAssistant:"
# 4. Run Inference
inputs = processor(
text=[prompt],
images=frames,
audios=audios,
time_series_paths=[ts_path],
time_series_sampling_rates=[100.0],
return_tensors="pt"
).to(device)
with torch.no_grad():
output = model.generate(
**inputs,
max_new_tokens=128,
do_sample=True,
temperature=0.7,
top_p=0.95
)
response = processor.tokenizer.decode(output[0], skip_special_tokens=True)
print(f"\n🐱 Meow-Omni 1 Analysis:\n{response}")
```
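Note: the snippet above uses stochastic decoding (`do_sample=True`, `temperature=0.7`), which suits free-form analysis; for reproducible, benchmark-style runs such as MeowBench scoring, deterministic decoding with `do_sample=False` is generally preferable.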
## 💻 Open-Source Codebase
The full training and evaluation pipeline is available at github.com/smgjch/Meow-Omni-1:
- Stage 1 (Projector Alignment): `run_pretrain.sh` trains only the time-series projector on 383K TS samples, with all other weights frozen.
- Stage 2 (Supervised Fine-Tuning): `run_postrain.sh` fine-tunes the LLM backbone on Meow-10K, with all encoders frozen (the freezing scheme is sketched below).
- Evaluation: `eval_meow_omni_1.sh` / `src/evaluation/eval_meow.py` runs MeowBench with auto-resume support and per-modality ablation via `--modals`.
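As a rough illustration of the staged freezing, the sketch below toggles `requires_grad` per stage. The submodule names `ts_projector` and `llm` are hypothetical; consult the repository's modeling code for the actual attribute names.

```python
def set_trainable(model, stage: int) -> None:
    """Freeze/unfreeze parameters per training stage (illustrative only)."""
    # Freeze everything first.
    for p in model.parameters():
        p.requires_grad = False

    if stage == 1:
        # Stage 1: train only the time-series projector; all else frozen.
        for p in model.ts_projector.parameters():  # hypothetical attribute
            p.requires_grad = True
    elif stage == 2:
        # Stage 2: fine-tune the LLM backbone; encoders stay frozen.
        for p in model.llm.parameters():  # hypothetical attribute
            p.requires_grad = True
```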
## 📚 The Meow-Omni Ecosystem
| Resource | Link |
|---|---|
| 📄 Paper | [arXiv:2605.09152](https://arxiv.org/abs/2605.09152) |
| 💻 Code | [github.com/smgjch/Meow-Omni-1](https://github.com/smgjch/Meow-Omni-1) |
| 🤗 Base Model | Meow-Omni 1-Base |
| 📦 Training Dataset | Meow-10K |
| 📊 Benchmark | MeowBench |
## 📝 Citation
If you find our work helpful, please cite us using the following BibTeX entry:
```bibtex
@misc{hu2026meowomni1multimodallarge,
      title={Meow-Omni 1: A Multimodal Large Language Model for Feline Ethology},
      author={Jucheng Hu and Zhangquan Chen and Yulin Chen and Chengjie Hong and Liang Zhou and Tairan Wang and Sifei Li and Giulio Zhu and Feng Zhou and Yiheng Zeng and Suorong Yang and Dongzhan Zhou},
      year={2026},
      eprint={2605.09152},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.09152},
}
```