---
license: apache-2.0
license_name: apache-2.0
license_link: https://www.apache.org/licenses/LICENSE-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- image-text-to-text
- video-text-to-text
- hpsv3
- multimodal
- qwen3
language:
- en
---
# LibreHPS-4B v1.1
LibreHPS looks at an AI-generated picture (or short video) and a text prompt and tells you how good a match they are, the way a human would rate it. You give it a prompt like "a cat sitting on a windowsill" and an image, and it gives you back a score. You can also hand it two images and ask which one a person would prefer.
This is the kind of model you reach for when you want to automatically rank or filter the output of an image or video generator the way a human reviewer would: picking the best of N samples, training another generator with reinforcement learning from feedback, or running a benchmark that doesn't need human raters in the loop. It scores along five separate axes (overall quality, prompt alignment, visual coherence, style, and, for video, how natural the motion looks).
It's open source under Apache-2.0 and was trained only on freely-licensed preference data.
| | |
|---|---|
| Size | 4B parameters (dense) |
| Backbone | Qwen3_5ForConditionalGeneration (Qwen3.5-4B-Base, hybrid Mamba2 + full attention, MRoPE) |
| Reward axes | overall, alignment, coherence, style, temporal (video only) |
| Weights licence | Apache-2.0 (LICENSE) |
| Dataset mix | permissive-only; see DATA_PROVENANCE.md |
| Container | sharded safetensors, ~20.7 GB total |
## Install

```bash
pip install librehps
```
The reward heads (`multi_axis_head`, `scalar_head`) sit on top of the stock-transformers Qwen3.5 backbone and need the `librehps` Python package to load. Loading the safetensors with stock `transformers` alone will give you the backbone but silently drop the reward heads: you'll get a working LM, but no reward scores.
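As a quick sanity check after loading, you can confirm the heads were actually restored. This is a sketch only; the assumption that the heads are exposed as attributes on `scorer.model` under the names above is mine, not a documented API:

```python
from librehps import LibreHPS

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")

# Hypothetical attribute locations: the reward heads are assumed to hang
# off the wrapped model under the names listed above.
for head in ("multi_axis_head", "scalar_head"):
    present = hasattr(scorer.model, head)
    print(f"{head}: {'loaded' if present else 'MISSING'}")
```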
## Quickstart: score an image

```python
from librehps import LibreHPS

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")
result = scorer.score_image(image="photo.png", prompt="a cat sitting on a windowsill")
print(result.overall.mean)
```
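The best-of-N use case mentioned above follows the same pattern. A minimal sketch, reusing the `score_image` call from the quickstart; the candidate file names are placeholders for however your generator writes its samples:

```python
from librehps import LibreHPS

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")

prompt = "a cat sitting on a windowsill"
candidates = ["sample_0.png", "sample_1.png", "sample_2.png", "sample_3.png"]

# Score each candidate on the overall axis and keep the highest-scoring one.
scored = [(scorer.score_image(image=path, prompt=prompt).overall.mean, path) for path in candidates]
best_score, best_path = max(scored)
print(f"best: {best_path} ({best_score:.3f})")
```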
## Quickstart: compare two images

```python
from librehps import LibreHPS
from librehps.evaluate.calibration import load_platt_calibrator

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")
cal = load_platt_calibrator("calibration.json", benchmark=None).default

result = scorer.compare_images(
    image_a="left.png",
    image_b="right.png",
    prompt="a cat sitting on a windowsill",
    calibrator=cal,
)
print(result.winner, result.probability)
```
The `calibrator=` keyword is optional. With it, you get the calibrated win probability shipped in `calibration.json` (a 2-parameter logistic fit per benchmark, plus a `<global>` fit you can use for live / unknown-benchmark scoring). Without it, you get the uncalibrated μ-space probability.
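For intuition, a 2-parameter logistic (Platt) fit maps the raw score margin to a win probability as p = sigmoid(a·Δμ + b). The sketch below applies that mapping by hand; the `"a"`/`"b"` field names and the file layout are assumptions about `calibration.json`, not its documented schema (the `<global>` key is the global fit mentioned above):

```python
import json
import math

def platt_win_probability(delta_mu: float, a: float, b: float) -> float:
    """Map an uncalibrated score margin (mu_a - mu_b) to a calibrated win probability."""
    return 1.0 / (1.0 + math.exp(-(a * delta_mu + b)))

# Assumed layout: {"<global>": {"a": ..., "b": ...}, "hpdv3": {...}, ...}
with open("calibration.json") as f:
    fits = json.load(f)

fit = fits["<global>"]
print(platt_win_probability(0.8, fit["a"], fit["b"]))
```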
## Hardware and attention backend

By default, `LibreHPS.from_pretrained(...)` looks for Flash Attention 4 (CUDA Blackwell, sm_100). If FA4 is available it's used as a fast path; if not, the loader logs a warning and falls back to stock Qwen3.5 attention (SDPA on modern PyTorch, eager on CPU). You can pin the backend explicitly:

```python
LibreHPS.from_pretrained(path)                    # auto (default)
LibreHPS.from_pretrained(path, attn_impl="fa4")   # require FA4 + Blackwell
LibreHPS.from_pretrained(path, attn_impl="sdpa")  # skip FA4; CUDA / CPU / MPS all OK
```
bfloat16 weights, ~20.7 GB on disk, ~10 GB resident at bf16. Fits on
a single 24 GB GPU.
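If you would rather settle the backend up front than rely on the auto fallback, you can key the choice off the detected compute capability. A minimal sketch using standard PyTorch calls; the sm_100 (compute capability 10.0) threshold is taken from the FA4 requirement above:

```python
import torch
from librehps import LibreHPS

def pick_attn_impl() -> str:
    # FA4 needs a Blackwell-class GPU, i.e. compute capability 10.0 (sm_100) or newer.
    if torch.cuda.is_available() and torch.cuda.get_device_capability(0) >= (10, 0):
        return "fa4"
    return "sdpa"

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1", attn_impl=pick_attn_impl())
```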
## Limitations
- Trained on permissively-licensed generator outputs only. Closed-model outputs (Midjourney, most OpenAI images) were intentionally excluded from training, so scores on those generators are out-of-distribution.
- Not a content-moderation model. There's no toxicity / safety filter beyond the upstream dataset licence audits.
- The `temporal` axis was trained on 5-8 subsampled frames; longer clips are extrapolation (see the frame-subsampling sketch after this list).
- The model's per-axis σ output is uninformative on this checkpoint (median σ ≈ 0.05). Use the calibrated `probability` field for uncertainty, not `ScoreAxis.sigma`.
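If you score clips longer than that training regime, it may help to subsample them down to the 5-8 frame range first. A minimal sketch; the commented-out `score_video(frames=..., prompt=...)` call is a hypothetical entry point, not a documented API, so substitute whatever video-scoring method the `librehps` package actually exposes:

```python
from PIL import Image

def subsample_frames(paths: list[str], target: int = 8) -> list[Image.Image]:
    """Pick `target` frames spread evenly across the clip."""
    if len(paths) <= target:
        return [Image.open(p) for p in paths]
    step = (len(paths) - 1) / (target - 1)
    return [Image.open(paths[round(i * step)]) for i in range(target)]

frames = subsample_frames([f"frame_{i:04d}.png" for i in range(120)])
# Hypothetical call; adjust to the actual video-scoring entry point:
# result = scorer.score_video(frames=frames, prompt="a cat jumps onto a windowsill")
```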
## Evaluation

Evaluation uses a held-out 90 % split per benchmark, selected deterministically by `global_index` hash. The numbers below are symmetrised (live-harness-equivalent):
| Benchmark | n | pair_acc | ECE (uncal) | ECE (cal) | Brier (uncal) | Brier (cal) |
|---|---|---|---|---|---|---|
| hpdv3 | 25 818 | 0.922 | 0.068 | 0.011 | 0.072 | 0.054 |
| vrr | 37 244 | 0.764 | 0.221 | 0.011 | 0.225 | 0.156 |
| imgrew | 11 460 | 0.647 | 0.333 | 0.008 | 0.338 | 0.216 |
| pickscore | 780 | 0.623 | 0.368 | 0.041 | 0.371 | 0.236 |
Aggregate (n-weighted): ECE 0.187 → 0.011 (−94 %); Brier 0.191 → 0.131 (−32 %); pair_acc unchanged. The calibrator is what separates the "uncal" and "cal" columns.
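For reference, both calibration metrics in the table can be computed directly from per-pair predicted win probabilities and 0/1 outcomes. A minimal numpy sketch (equal-width 10-bin ECE, which may differ in detail from the harness's exact binning):

```python
import numpy as np

def brier(p: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error between predicted win probability and the 0/1 outcome."""
    return float(np.mean((p - y) ** 2))

def ece(p: np.ndarray, y: np.ndarray, bins: int = 10) -> float:
    """Expected calibration error over equal-width probability bins."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & (p <= hi) if hi == 1.0 else (p >= lo) & (p < hi)
        if mask.any():
            total += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return float(total)

p = np.array([0.9, 0.7, 0.55, 0.2])  # predicted win probabilities
y = np.array([1.0, 1.0, 0.0, 0.0])   # observed outcomes (1 = preferred image won)
print(brier(p, y), ece(p, y))
```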
## Acknowledgement

LibreHPS is inspired by and architecturally influenced by HPSv3 (Ma, Shui, Wu, Sun, Li; ICCV 2025). It is a from-scratch reimplementation with a different backbone, training stack, and permissively-licensed training data mix.
## License
- Source code: MIT
- Model weights: Apache-2.0
- Training data: permissive union (MIT / Apache-2.0 / BSD-3-Clause / CDLA-Permissive-2.0). See DATA_PROVENANCE.md for the per-dataset audit and the per-image generator-redistribution audit applied to filter the training mix.
Copyright © 2026 Jeff Moe <moe@spacecruft.org>.
Loveland, Colorado, USA