X-CLIP fine-tuned on UCF-Crime with LLM descriptions
Fine-tuned weights for microsoft/xclip-base-patch16, adapted to multi-class crime classification on the UCF-Crime Action Recognition split.
Trained with per-video scene descriptions generated by Gemini Flash as the text branch of a symmetric contrastive (InfoNCE) objective. Only the last 3 transformer blocks, the cross-frame communication transformer (CCT), the multi-frame integration transformer (MIT), and the projection heads are trainable (~50M of 195M parameters).
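For readers unfamiliar with the objective, here is a minimal, dependency-free sketch of a symmetric InfoNCE loss over a batch of paired video and description embeddings. It is not the training code used for this checkpoint: the function name `symmetric_info_nce` and the fixed `temperature=0.07` are assumptions for illustration (X-CLIP itself uses a learnable logit scale), and a real training loop would compute the same quantity on tensors.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired, L2-normalised embeddings.

    video_emb[i] and text_emb[i] form the positive pair; every other
    pairing in the batch acts as a negative. The temperature value is an
    assumed constant, not taken from the model card.
    """
    n = len(video_emb)
    # Cosine-similarity logits (dot products of unit vectors), scaled by 1/temperature.
    logits = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in text_emb]
        for v in video_emb
    ]
    # Video -> text: each row should pick out its own description.
    v2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> video: each column should pick out its own video.
    columns = [list(col) for col in zip(*logits)]
    t2v = -sum(_log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)
```

Each anchor is contrasted against the other items in the batch, so a batch of 8 gives 7 negatives per direction, and a model that scores every pair equally gets a loss of ln(8) ≈ 2.08.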
Results (168 test videos, 14 classes)
| Method | Accuracy | F1 (macro) | AUC-ROC |
|---|---|---|---|
| CLIP zero-shot | 23.2% | 21.0% | 0.77 |
| X-CLIP zero-shot | 17.3% | 13.9% | 0.70 |
| This model | 26.2% | 22.1% | 0.80 |
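As a reference point for the table, uniform random guessing over 14 classes would score about 1/14 ≈ 7.1% accuracy. The evaluation script is not part of this card, so the sketch below only shows the usual definitions of accuracy and macro-averaged F1 (the unweighted mean of per-class F1, as in scikit-learn's `average="macro"`). The function names are illustrative, and scoring a class with no predictions and no ground-truth samples as 0 is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over the given label set."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); assumed 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class counts equally, a few classes that the model almost never predicts correctly pull macro F1 below accuracy, which matches the pattern in the table.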
Files
best_model.pt (~1.2 GB): checkpoint dict with keys {epoch, model_state_dict, optimizer_state_dict, loss, best_loss}
Loading
import torch
from transformers import XCLIPModel, XCLIPProcessor
from huggingface_hub import hf_hub_download
model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch16")
ckpt = torch.load(
    hf_hub_download("yashppawar/xclip-ucf-crime", "best_model.pt"),
    map_location="cpu",  # checkpoint was saved from a GPU run; load on CPU first
    weights_only=True,
)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()
processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch16")
# feed 8 frames of shape (C, H, W) via processor(videos=list(frames), return_tensors="pt")
# (XCLIPProcessor takes clips through `videos=`, not `images=`)
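The loading snippet expects 8 frames per clip, but the card does not say how they are picked. One common choice is to spread them evenly across the video. The sketch below does that, with the caveat that `sample_frame_indices` is a hypothetical helper and uniform sampling is an assumption, not necessarily what was used in training.

```python
def sample_frame_indices(total_frames, num_samples=8):
    """Return num_samples frame indices spread evenly over [0, total_frames)."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    # Take the centre of each of num_samples equal-width segments;
    # clips shorter than num_samples simply repeat indices.
    return [
        min(total_frames - 1, int((i + 0.5) * total_frames / num_samples))
        for i in range(num_samples)
    ]
```

The returned indices can be used to select decoded frames (for example from a list or array of frames) before passing them to the processor.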
Demo
Interactive Gradio demo: https://huggingface.co/spaces/yashppawar/xclip-crime-demo
Limitations
- Small training set (532 videos, ~38 per class) limits absolute accuracy.
- Visually ambiguous categories (Robbery vs. Shoplifting, Assault vs. Fighting, Abuse) remain challenging; the paper's confusion matrix shows Robbery is systematically misclassified as Shoplifting.
- Trained with batch size 8 on a T4 GPU; larger batches may further help the contrastive objective.
Citation
If useful, cite the original datasets and papers:
@inproceedings{sultani2018real,
title={Real-world anomaly detection in surveillance videos},
author={Sultani, Waqas and Chen, Chen and Shah, Mubarak},
booktitle={CVPR}, year={2018}
}
@inproceedings{ni2022expanding,
title={Expanding language-image pretrained models for general video recognition},
author={Ni, Bolin and others},
booktitle={ECCV}, year={2022}
}