
LLaMA-3.1-8B-LoRA-COCO-Deceptive-CLIP Model Card

🏆 This work is accepted to ACL 2025 (Main Conference).

Main result figure: Attack success rate (ASR) and caption diversity of our model on the COCO dataset, illustrating its ability to generate deceptive captions that successfully fool CLIP.

Model Description

Dataset

This model was fine-tuned on the COCO-Deceptive-CLIP-LLaMA-3.1-8B Training Dataset, which provides structured instruction–response pairs for generating deceptive captions that mislead CLIP.

Model Details

  • Model: LLaMA-3.1-8B-LoRA-COCO-Deceptive-CLIP is a deceptive caption generator built on LLaMA-3.1-8B. It is fine-tuned with LoRA via self-training, specifically rejection sampling fine-tuning (RFT), to deceive CLIP on the COCO dataset, achieving an attack success rate (ASR) of 42.1%.
  • Architecture: This model is based on LLaMA-3.1-8B and utilizes PEFT v0.12.0 for efficient fine-tuning.
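The rejection sampling fine-tuning (RFT) loop described above can be sketched as follows. This is a minimal, illustrative sketch, not the project's actual training code: `generate_captions`, `clip_prefers_deceptive`, and the candidate count are hypothetical placeholders standing in for the generator, the CLIP-based attack-success check, and a tunable hyperparameter.

```python
# Hypothetical RFT data-collection sketch: sample candidate captions,
# keep only those that fool CLIP (i.e., the attack succeeds), and use
# the accepted (image, caption) pairs for the next LoRA fine-tuning round.
# All callables here are illustrative placeholders, not the real codebase.

def rejection_sample(images, generate_captions, clip_prefers_deceptive,
                     n_candidates=8):
    """Collect (image, caption) pairs whose caption successfully deceives CLIP."""
    accepted = []
    for image in images:
        for caption in generate_captions(image, n=n_candidates):
            if clip_prefers_deceptive(image, caption):
                accepted.append((image, caption))
                break  # keep at most one accepted caption per image
    return accepted
```

The accepted pairs would then be fed back into standard LoRA supervised fine-tuning, which is the "self-training" aspect of RFT.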

How to Use

See our GitHub repository for full usage instructions and scripts.
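For a quick start, loading a LoRA adapter on top of its base model typically looks like the sketch below. This is an assumption-laden example, not the repository's official script: the base model id, prompt format, and generation settings are illustrative guesses, and the adapter repo id is taken from this model card.

```python
# Minimal sketch of loading the LoRA adapter with transformers + peft.
# The base model id, prompt, and generation settings are assumptions;
# see the project's GitHub repository for the official usage scripts.

BASE = "meta-llama/Llama-3.1-8B"  # assumed base checkpoint
ADAPTER = "ahnpersie/llama3.1-8b-lora-coco-deceptive-clip"

def load_model():
    # Imported lazily so the sketch can be read/tested without the
    # (large) libraries and checkpoints installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained(BASE)
    model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")
    model = PeftModel.from_pretrained(model, ADAPTER)  # attach LoRA weights
    return tokenizer, model

if __name__ == "__main__":
    tokenizer, model = load_model()
    prompt = "Rewrite this caption to deceive CLIP: a dog lying on a couch"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```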

Citation

Please cite our work if you find the resources in this repository useful:

@inproceedings{ahn2025mac,
      title={Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates},
      author={Jaewoo Ahn and Heeseung Yun and Dayoon Ko and Gunhee Kim},
      booktitle={ACL},
      year=2025
}

