
LLaMA-3.1-8B-LoRA-COCO-Deceptive-CLIP Model Card

🏆 This work is accepted to ACL 2025 (Main Conference).

Main result figure: Attack success rate (ASR) and caption diversity of our model on the COCO dataset, illustrating its ability to generate deceptive captions that successfully fool CLIP.

Model Description

Dataset

This model was fine-tuned on the COCO-Deceptive-CLIP-LLaMA-3.1-8B Training Dataset, which provides structured instruction–response pairs for generating deceptive captions that mislead CLIP.

Model Details

  • Model: LLaMA-3.1-8B-LoRA-COCO-Deceptive-CLIP is a deceptive caption generator built on LLaMA-3.1-8B. It is fine-tuned with LoRA via self-training, specifically rejection sampling fine-tuning (RFT), to deceive CLIP on the COCO dataset, achieving an attack success rate (ASR) of 42.1%.
  • Architecture: This model is based on LLaMA-3.1-8B and utilizes PEFT v0.12.0 for efficient fine-tuning.
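The rejection sampling fine-tuning (RFT) loop described above can be sketched as follows. This is a minimal, illustrative sketch, not the project's actual training code: `generate_captions`, `clip_prefers_deceptive`, and the candidate count are hypothetical placeholders standing in for the generator, the CLIP-based attack-success check, and a tunable hyperparameter.

```python
# Hypothetical RFT data-collection sketch: sample candidate captions,
# keep only those that fool CLIP (i.e., the attack succeeds), and use
# the accepted (image, caption) pairs for the next LoRA fine-tuning round.
# All callables here are illustrative placeholders, not the real codebase.

def rejection_sample(images, generate_captions, clip_prefers_deceptive,
                     n_candidates=8):
    """Collect (image, caption) pairs whose caption successfully deceives CLIP."""
    accepted = []
    for image in images:
        for caption in generate_captions(image, n=n_candidates):
            if clip_prefers_deceptive(image, caption):
                accepted.append((image, caption))
                break  # keep at most one accepted caption per image
    return accepted
```

The accepted pairs would then be fed back into standard LoRA supervised fine-tuning, which is the "self-training" aspect of RFT.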

How to Use

See our GitHub repository for full usage instructions and scripts.
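For a quick start, loading a LoRA adapter on top of its base model typically looks like the sketch below. This is an assumption-laden example, not the repository's official script: the base model id, prompt format, and generation settings are illustrative guesses, and the adapter repo id is taken from this model card.

```python
# Minimal sketch of loading the LoRA adapter with transformers + peft.
# The base model id, prompt, and generation settings are assumptions;
# see the project's GitHub repository for the official usage scripts.

BASE = "meta-llama/Llama-3.1-8B"  # assumed base checkpoint
ADAPTER = "ahnpersie/llama3.1-8b-lora-coco-deceptive-clip"

def load_model():
    # Imported lazily so the sketch can be read/tested without the
    # (large) libraries and checkpoints installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained(BASE)
    model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")
    model = PeftModel.from_pretrained(model, ADAPTER)  # attach LoRA weights
    return tokenizer, model

if __name__ == "__main__":
    tokenizer, model = load_model()
    prompt = "Rewrite this caption to deceive CLIP: a dog lying on a couch"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```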

Citation

Please cite our work if you find the resources in this repository useful:

@inproceedings{ahn2025mac,
      title={Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates},
      author={Jaewoo Ahn and Heeseung Yun and Dayoon Ko and Gunhee Kim},
      booktitle={ACL},
      year=2025
}

