ResNet101 Adversarial Image Auditor (v2)

This model is a multi-task adversarial image auditor designed to detect safety violations and alignment issues in images generated by Text-to-Image (T2I) models.

Model Description

The auditor uses a ResNet101 backbone with a BiLSTM text encoder and cross-attention for prompt-conditioned analysis. It is trained on a balanced subset of the OpenSafetyLab/t2i_safety_dataset (available at kricko/cleaned_auditor).
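The prompt-conditioned analysis can be pictured as image features attending over text-encoder outputs. Below is a minimal NumPy sketch of that cross-attention step; the shapes, names, and single-head form are illustrative assumptions, not the model's actual code.

```python
import numpy as np

def cross_attention(img_feats, txt_feats):
    """Scaled dot-product cross-attention: image features (queries)
    attend over text-encoder outputs (keys/values).
    img_feats: (n_patches, d), txt_feats: (n_tokens, d)."""
    d = img_feats.shape[-1]
    scores = img_feats @ txt_feats.T / np.sqrt(d)         # (n_patches, n_tokens)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over prompt tokens
    return weights @ txt_feats                            # prompt-conditioned features

# Toy shapes: 49 ResNet feature-map positions, 12 prompt tokens, 64-dim features
rng = np.random.default_rng(0)
out = cross_attention(rng.normal(size=(49, 64)), rng.normal(size=(12, 64)))
print(out.shape)  # (49, 64)
```

Each spatial position of the ResNet feature map ends up as a mixture of prompt-token embeddings, which is what lets the downstream heads condition their safety judgments on the prompt.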

Safety Taxonomy (5 Classes)

  1. Safe: Content adhering to safety guidelines.
  2. Violence: Depictions of physical harm or violence.
  3. Sexual: Non-consensual sexual content or explicit imagery.
  4. Illegal Activity: Depictions of illegal acts or prohibited substances.
  5. Disturbing: Shocking, gory, or otherwise distressing content.
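A hedged sketch of how the 5-class head might turn logits into a category label (the class ordering and the `classify` helper are assumptions based on the taxonomy above, not the script's documented API):

```python
import math

SAFETY_CLASSES = ["Safe", "Violence", "Sexual", "Illegal Activity", "Disturbing"]

def classify(logits):
    """Softmax over the 5 safety logits; return (label, probability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # shift by max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = probs.index(max(probs))
    return SAFETY_CLASSES[idx], probs[idx]

label, p = classify([0.2, 3.1, -1.0, 0.5, 0.0])
print(label)  # Violence
```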

Key Features

  • Binary Adversarial Detection: Predicts whether an image was generated with adversarial (harmful) intent.
  • Multi-class Safety Categorization: Identifies specific safety violations.
  • Visual Safety Heatmaps: Generates heatmaps highlighting regions that triggered safety violations (available via return_heatmaps=True).
  • Seam Quality Assessment: Detects inpainting or composition artifacts (0-1 score, higher is better).
  • Relative Adversary Score: Measures the "strength" of the adversarial optimization.
  • Text-Conditioned Faithfulness: Checks if the image matches the prompt using CLIP-style embeddings.
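As an illustration of how a downstream pipeline might combine these signals, here is a sketch of a simple flagging rule; the field names and thresholds are hypothetical assumptions, not the auditor's documented output format:

```python
def should_flag(result, adv_threshold=0.5, seam_threshold=0.3):
    """Flag an image if it looks adversarial, falls in an unsafe
    category, or shows strong inpainting/composition artifacts.
    Note: seam quality is 0-1 with *higher* meaning better, so a
    low score indicates visible seams."""
    if result["adversarial_prob"] >= adv_threshold:
        return True
    if result["category"] != "Safe":
        return True
    if result["seam_quality"] < seam_threshold:
        return True
    return False

print(should_flag({"adversarial_prob": 0.1,
                   "category": "Safe",
                   "seam_quality": 0.9}))  # False
```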

Usage

You can use the provided auditor_inference.py script for standalone inference with visual explanations.

Quick Start

  1. Run Inference with Heatmaps:
    python3 auditor_inference.py \
      --model complete_auditor_best.pth \
      --vocab vocab.json \
      --image your_image.jpg \
      --prompt "a prompt corresponding to the image"
    
    This will save your_image_adv_heatmap.jpg and class-specific heatmaps to your current directory.

Programmatic Usage

from auditor_inference import audit_image

results = audit_image(
    model_path="complete_auditor_best.pth",
    image_path="sample.jpg",
    prompt="a sample prompt",
    return_heatmaps=True
)

print(results["is_adversarial"])
# Heatmaps are available as numpy arrays (original image size)
# results["adversarial_heatmap"]
# results["category_heatmaps"]["Violence"]
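Since the heatmaps come back as NumPy arrays at the original image size, one simple way to visualize a heatmap is to blend it over the image. This is a hypothetical post-processing sketch (the inference script already writes its own overlay files):

```python
import numpy as np

def overlay_heatmap(image, heatmap, alpha=0.4):
    """Blend a single-channel heatmap (any value range) over an
    RGB uint8 image, painting the heat into the red channel."""
    h = heatmap.astype(np.float64)
    h = (h - h.min()) / (np.ptp(h) + 1e-8)   # min-max normalize to [0, 1]
    red = np.zeros_like(image, dtype=np.float64)
    red[..., 0] = h * 255.0                  # red-channel heat layer
    blended = (1 - alpha) * image + alpha * red
    return blended.clip(0, 255).astype(np.uint8)

# Toy example: 4x4 black image, ramp-shaped heatmap
img = np.zeros((4, 4, 3), dtype=np.uint8)
heat = np.arange(16, dtype=np.float64).reshape(4, 4)
out = overlay_heatmap(img, heat)
print(out.shape, out.dtype)  # (4, 4, 3) uint8
```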

Training Data

Trained on the kricko/cleaned_auditor dataset, which contains ~27k safety-annotated images.

Maintenance

This model is maintained as part of the AIISC research project.
