X-CLIP fine-tuned on UCF-Crime with LLM descriptions

Fine-tuned weights for microsoft/xclip-base-patch16, adapted to multi-class crime classification on the UCF-Crime Action Recognition split.

Training uses per-video scene descriptions generated by Gemini Flash as the text side of a symmetric contrastive (InfoNCE) objective. Only the last 3 transformer blocks, the cross-frame communication transformer (CCT), the multi-frame integration transformer (MIT), and the projection heads are trainable (~50M of 195M parameters).
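For readers unfamiliar with the objective, here is a minimal, framework-free sketch of a symmetric InfoNCE loss. It is not the training code behind this checkpoint. The function names, the temperature of 0.07, and the use of plain Python lists are assumptions made for illustration; a real training loop would more likely use torch.nn.functional.cross_entropy on a similarity matrix.

```python
import math


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    video_emb[i] and text_emb[i] are assumed to be L2-normalised embeddings
    of the same clip; every other pairing in the batch acts as a negative.
    The temperature value is an assumption, not taken from the model card.
    """
    # Pairwise cosine similarities (rows: videos, columns: texts), scaled.
    logits = [
        [sum(v * t for v, t in zip(vi, tj)) / temperature for tj in text_emb]
        for vi in video_emb
    ]
    # Transposing gives the text-to-video direction.
    logits_t = [list(col) for col in zip(*logits)]
    # Average the video-to-text and text-to-video losses.
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))
```

When every embedding in a batch of size B is identical, the loss equals ln(B), which is the value an uninformative model would produce; well-separated pairs drive it towards zero.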

Results (168 test videos, 14 classes)

Method	Accuracy	F1 (macro)	AUC-ROC
CLIP zero-shot	23.2%	21.0%	0.77
X-CLIP zero-shot	17.3%	13.9%	0.70
This model	26.2%	22.1%	0.80
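
The accuracy and macro-F1 columns can be reproduced from predicted and true labels along the following lines. This is a hedged sketch rather than the evaluation script used for the table: the function name is hypothetical, labels are assumed to be integers in the range 0 to num_classes - 1, and a class that is never predicted and never present is assumed to score an F1 of 0.

```python
def accuracy_and_macro_f1(y_true, y_pred, num_classes):
    """Return (accuracy, macro-averaged F1) for integer class labels."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1_scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)

    # Macro averaging weights every class equally, regardless of its size.
    return accuracy, sum(f1_scores) / num_classes
```

Macro averaging is why the F1 figures sit below the accuracies: classes the model rarely gets right pull the mean down as much as any other class.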

Files

best_model.pt (~1.2 GB) — checkpoint dict with keys {epoch, model_state_dict, optimizer_state_dict, loss, best_loss}

Loading

import torch
from transformers import XCLIPModel, XCLIPProcessor
from huggingface_hub import hf_hub_download

# Start from the base architecture, then overwrite its weights with the fine-tuned ones.
model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch16")
ckpt = torch.load(
    hf_hub_download("yashppawar/xclip-ucf-crime", "best_model.pt"),
    map_location="cpu",  # load on CPU so this also works on machines without a GPU
    weights_only=True,
)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch16")
# Feed 8 frames per clip, each of shape (C, H, W), through the processor's `videos` argument:
#   inputs = processor(videos=list(frames), return_tensors="pt")
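
The model card does not say how the 8 frames are chosen from a longer video. One common choice, shown here purely as an assumption, is to take the centre frame of 8 equal-length segments. The helper name sample_frame_indices is hypothetical.

```python
def sample_frame_indices(total_frames, num_frames=8):
    """Pick num_frames indices spread evenly across a video.

    Each index is the centre of one of num_frames equal segments.
    Videos shorter than num_frames get repeated indices.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    return [
        min(total_frames - 1, int((i + 0.5) * total_frames / num_frames))
        for i in range(num_frames)
    ]
```

The returned indices can be used to select frames from a decoded video before passing them to the processor.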

Demo

Interactive Gradio demo: https://huggingface.co/spaces/yashppawar/xclip-crime-demo

Limitations

Small training set (532 videos, ~38 per class) limits absolute accuracy.
Visually ambiguous categories (Robbery vs. Shoplifting, Assault vs. Fighting, Abuse) remain challenging — the paper's confusion matrix shows Robbery is systematically misclassified as Shoplifting.
Trained with batch size 8 on a T4 GPU, so each video is contrasted against at most 7 in-batch negatives; larger batches may help the contrastive objective further.

Citation

If you find this model useful, please cite the original UCF-Crime dataset and the X-CLIP paper:

@inproceedings{sultani2018real,
  title={Real-world anomaly detection in surveillance videos},
  author={Sultani, Waqas and Chen, Chen and Shah, Mubarak},
  booktitle={CVPR}, year={2018}
}

@inproceedings{ni2022expanding,
  title={Expanding language-image pretrained models for general video recognition},
  author={Ni, Bolin and others},
  booktitle={ECCV}, year={2022}
}
