X-CLIP fine-tuned on UCF-Crime with LLM descriptions

Fine-tuned weights for microsoft/xclip-base-patch16, adapted to multi-class crime classification on the UCF-Crime Action Recognition split.

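Training pairs each video with a scene description generated by Gemini Flash; the descriptions serve as the text branch of a symmetric contrastive (InfoNCE) objective. Only the last 3 transformer blocks, the cross-frame communication transformer (CCT), the multi-frame integration transformer (MIT), and the projection heads are trainable (~50M of 195M parameters).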
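
To make the objective concrete, here is a minimal, dependency-free sketch of a symmetric InfoNCE loss over a batch of paired video and description embeddings. It is not the training code for this checkpoint: the function names are hypothetical, and the temperature of 0.07 is an assumed value chosen for illustration (the temperature used in training is not stated here). In PyTorch the same computation is usually written as a cross-entropy over the similarity matrix in both directions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(video_emb, text_emb, temperature=0.07):
    """Average of the video-to-text and text-to-video InfoNCE losses.

    video_emb[i] and text_emb[i] are assumed to describe the same clip;
    every other pairing in the batch acts as a negative.
    The temperature default is an illustrative assumption.
    """
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    # logits[i][j]: temperature-scaled cosine similarity of video i and text j
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-video direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

With a batch of B pairs, each video competes against B - 1 negative descriptions, and a model that scores all pairs equally gets a loss of ln B. For the batch size of 8 reported under Limitations, that is 7 negatives per video and a baseline of about 2.08.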

Results (168 test videos, 14 classes)

Method             Accuracy   F1 (macro)   AUC-ROC
CLIP zero-shot     23.2%      21.0%        0.77
X-CLIP zero-shot   17.3%      13.9%        0.70
This model         26.2%      22.1%        0.80
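
As a reading aid for the table above, the sketch below shows how accuracy and macro-averaged F1 are commonly computed from predicted and true labels. It is an assumption that the reported numbers follow this convention; in particular, assigning an F1 of 0 to a class with no true or predicted samples is a choice made here, and the function names are hypothetical. AUC-ROC is left out because it needs per-class scores rather than hard predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so every class counts equally."""
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no support and no predictions scores 0
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example: one Robbery clip is predicted as Shoplifting
y_true = ["Robbery", "Robbery", "Shoplifting", "Assault"]
y_pred = ["Shoplifting", "Robbery", "Shoplifting", "Assault"]
print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred, ["Robbery", "Shoplifting", "Assault"]))
```

For scale: on 168 test videos, 26.2% corresponds to roughly 44 correct predictions, and uniform guessing over 14 classes would score about 7.1%.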

Files

  • best_model.pt (~1.2 GB): checkpoint dict with keys {epoch, model_state_dict, optimizer_state_dict, loss, best_loss}

Loading

import torch
from transformers import XCLIPModel, XCLIPProcessor
from huggingface_hub import hf_hub_download

model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch16")
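ckpt = torch.load(
    hf_hub_download("yashppawar/xclip-ucf-crime", "best_model.pt"),
    map_location="cpu",  # also works on machines without a GPU
    weights_only=True,   # restrict unpickling to tensors and basic containers
)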
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch16")
# Feed 8 frames, each of shape (C, H, W), as one video via
# processor(videos=list(frames), return_tensors="pt")
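
The model expects 8 frames per clip, so longer videos have to be subsampled first. The sketch below shows one common way to do that, by taking the midpoint of 8 equal segments. The sampling strategy used during fine-tuning is not stated here, so this is an assumption, and `sample_frame_indices` is a hypothetical helper name.

```python
def sample_frame_indices(num_frames_in_video, num_samples=8):
    """Return num_samples frame indices spread evenly across a clip.

    The clip is split into num_samples equal segments and the midpoint of
    each segment is taken. Clips shorter than num_samples repeat frames.
    """
    if num_frames_in_video <= 0:
        raise ValueError("video must contain at least one frame")
    segment = num_frames_in_video / num_samples
    return [
        min(num_frames_in_video - 1, int(segment * (i + 0.5)))
        for i in range(num_samples)
    ]


# Indices for an 80-frame clip; decode these frames and pass them to the processor
indices = sample_frame_indices(80)
print(indices)
```

Midpoint sampling covers the whole clip with a fixed stride, which makes evaluation deterministic; random sampling within each segment is a frequent alternative during training.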

Demo

Interactive Gradio demo: https://huggingface.co/spaces/yashppawar/xclip-crime-demo

Limitations

  • Small training set (532 videos, ~38 per class) limits absolute accuracy.
  • Visually ambiguous categories (Robbery vs. Shoplifting, Assault vs. Fighting, Abuse) remain challenging; the paper's confusion matrix shows Robbery is systematically misclassified as Shoplifting.
  • Trained with batch size 8 on a T4 GPU; larger batches may further help the contrastive objective.

Citation

If useful, cite the original datasets and papers:

@inproceedings{sultani2018real,
  title={Real-world anomaly detection in surveillance videos},
  author={Sultani, Waqas and Chen, Chen and Shah, Mubarak},
  booktitle={CVPR}, year={2018}
}

@inproceedings{ni2022expanding,
  title={Expanding language-image pretrained models for general video recognition},
  author={Ni, Bolin and others},
  booktitle={ECCV}, year={2022}
}