---
license: apache-2.0
license_name: apache-2.0
license_link: https://www.apache.org/licenses/LICENSE-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- image-text-to-text
- video-text-to-text
- hpsv3
- multimodal
- qwen3
language:
- en
---

# LibreHPS-4B — v1.1

LibreHPS looks at an AI-generated picture (or short video) and a text prompt and tells you how good a match they are, the way a human would rate it. You give it a prompt like *"a cat sitting on a windowsill"* and an image, and it gives you back a score. You can also hand it two images and ask which one a person would prefer.

This is the kind of model you reach for when you want to automatically rank or filter the output of an image or video generator the way a human reviewer would — picking the best of N samples, training another generator with reinforcement learning from feedback, or running a benchmark that doesn't need human raters in the loop. It scores along five separate axes (overall quality, prompt alignment, visual coherence, style, and — for video — how natural the motion looks). The weights are open under Apache-2.0, and the model was trained only on freely-licensed preference data.

| | |
|---|---|
| **Size** | 4B parameters (dense) |
| **Backbone** | `Qwen3_5ForConditionalGeneration` (Qwen3.5-4B-Base, hybrid Mamba2 + full attention, MRoPE) |
| **Reward axes** | `overall`, `alignment`, `coherence`, `style`, `temporal` (video only) |
| **Weights license** | Apache-2.0 ([`LICENSE`](LICENSE)) |
| **Dataset mix** | permissive-only — see [`DATA_PROVENANCE.md`](DATA_PROVENANCE.md) |
| **Container** | sharded `safetensors`, ~20.7 GB total |

## Install

```bash
pip install librehps
```

The reward heads (`multi_axis_head`, `scalar_head`) sit on top of the stock `transformers` Qwen3.5 backbone and need the `librehps` Python package to load.
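To see why a plain `transformers` load can't surface reward scores, it helps to remember that a safetensors checkpoint is just a flat name-to-tensor mapping, and a loader keeps only the keys its declared model class knows about. A minimal, hypothetical sketch of that filtering (stand-in key names, no real tensors or real loader):

```python
# Hypothetical sketch: why a backbone-only model class loses the reward heads.
# Key names below are stand-ins, not the exact names in the shipped shards.
checkpoint = {
    "model.embed_tokens.weight": ...,               # backbone (kept)
    "model.layers.0.self_attn.q_proj.weight": ...,  # backbone (kept)
    "multi_axis_head.overall.weight": ...,          # reward head (dropped)
    "scalar_head.weight": ...,                      # reward head (dropped)
}

# Roughly what a backbone-only model class declares as its parameter namespace:
backbone_prefix = "model."

kept = [k for k in checkpoint if k.startswith(backbone_prefix)]
dropped = [k for k in checkpoint if not k.startswith(backbone_prefix)]

print(dropped)
# At best these surface as "unexpected keys" in a load report; nothing in the
# backbone-only class ever computes a reward from them.
```

The `librehps` loader exists to declare and attach those extra head modules before the weights are matched against parameter names.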
Loading the safetensors with stock `transformers` alone will give you the backbone but silently drop the reward heads — you'll get a working LM, but no reward scores.

## Quickstart — score an image

```python
from librehps import LibreHPS

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")
result = scorer.score_image(image="photo.png", prompt="a cat sitting on a windowsill")
print(result.overall.mean)
```

## Quickstart — compare two images

```python
from librehps import LibreHPS
from librehps.evaluate.calibration import load_platt_calibrator

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")
cal = load_platt_calibrator("calibration.json", benchmark=None).default
result = scorer.compare_images(
    image_a="left.png",
    image_b="right.png",
    prompt="a cat sitting on a windowsill",
    calibrator=cal,
)
print(result.winner, result.probability)
```

The `calibrator=` keyword is optional. With it, you get the calibrated win probability shipped in [`calibration.json`](calibration.json) (a 2-parameter logistic fit per benchmark, plus a `default` fit you can use for live / unknown-benchmark scoring). Without it, you get the uncalibrated μ-space probability.

## Hardware and attention backend

By default, `LibreHPS.from_pretrained(...)` looks for **Flash Attention 4** (CUDA Blackwell, sm_100). If FA4 is available, it's used as a fast path; if not, the loader logs a warning and falls back to stock Qwen3.5 attention (SDPA on modern PyTorch, eager on CPU). You can pin the backend explicitly:

```python
LibreHPS.from_pretrained(path)                    # auto (default)
LibreHPS.from_pretrained(path, attn_impl="fa4")   # require FA4 + Blackwell
LibreHPS.from_pretrained(path, attn_impl="sdpa")  # skip FA4; CUDA / CPU / MPS all OK
```

`bfloat16` weights, ~20.7 GB on disk, ~10 GB resident at bf16. Fits on a single 24 GB GPU.

## Limitations

- Trained on permissively-licensed generator outputs only.
Closed-model outputs (Midjourney, most OpenAI images) were intentionally excluded from training, so scores on those generators are out-of-distribution.
- Not a content-moderation model. There's no toxicity / safety filter beyond the upstream dataset license audits.
- The `temporal` axis was trained on 5–8 subsampled frames; longer clips are extrapolation.
- The model's per-axis σ output is **uninformative** on this checkpoint (median σ ≈ 0.05). Use the calibrated `probability` field for uncertainty, not `ScoreAxis.sigma`.

## Evaluation

Evaluation uses a held-out split of 90 % per benchmark, selected deterministically by `global_index` hash. Symmetrised (live-harness-equivalent) numbers:

| Benchmark | n | pair_acc | ECE (uncal) | ECE (cal) | Brier (uncal) | Brier (cal) |
|---|---|---|---|---|---|---|
| `hpdv3` | 25 818 | 0.922 | 0.068 | **0.011** | 0.072 | 0.054 |
| `vrr` | 37 244 | 0.764 | 0.221 | **0.011** | 0.225 | 0.156 |
| `imgrew` | 11 460 | 0.647 | 0.333 | **0.008** | 0.338 | 0.216 |
| `pickscore` | 780 | 0.623 | 0.368 | **0.041** | 0.371 | 0.236 |

**Aggregate (n-weighted):** ECE `0.187 → 0.011` (−94 %); Brier `0.191 → 0.131` (−32 %); pair_acc unchanged. The calibrator is the difference between the "uncal" and "cal" columns.

## Acknowledgement

LibreHPS is inspired by and architecturally influenced by **HPSv3** (Ma, Shui, Wu, Sun, Li — ICCV 2025). It is a from-scratch reimplementation with a different backbone, training stack, and permissively-licensed training data mix.

## License

- **Source code:** MIT
- **Model weights:** Apache-2.0
- **Training data:** permissive union (MIT / Apache-2.0 / BSD-3-Clause / CDLA-Permissive-2.0). See [`DATA_PROVENANCE.md`](DATA_PROVENANCE.md) for the per-dataset audit and the per-image generator-redistribution audit applied to filter the training mix.

*Copyright © 2026 Jeff Moe.*
Loveland, Colorado, USA