Use from the Transformers library
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-classification", model="deepcrayon/LibreHPS-4B-v1.1")
pipe("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parrots.png")

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("deepcrayon/LibreHPS-4B-v1.1")
model = AutoModelForImageTextToText.from_pretrained("deepcrayon/LibreHPS-4B-v1.1")

LibreHPS-4B — v1.1

LibreHPS looks at an AI-generated picture (or short video) and a text prompt and tells you how good a match they are, the way a human would rate it. You give it a prompt like "a cat sitting on a windowsill" and an image, and it gives you back a score. You can also hand it two images and ask which one a person would prefer.

This is the kind of model you reach for when you want to automatically rank or filter the output of an image or video generator the way a human reviewer would — picking the best of N samples, training another generator with reinforcement learning from feedback, or running a benchmark that doesn't need human raters in the loop. It scores along five separate axes (overall quality, prompt alignment, visual coherence, style, and — for video — how natural the motion looks).

It's open source under Apache-2.0 and was trained only on freely-licensed preference data.

Size:            4B parameters (dense)
Backbone:        Qwen3_5ForConditionalGeneration (Qwen3.5-4B-Base, hybrid Mamba2 + full attention, MRoPE)
Reward axes:     overall, alignment, coherence, style, temporal (video only)
Weights licence: Apache-2.0 (LICENSE)
Dataset mix:     permissive-only (see DATA_PROVENANCE.md)
Container:       sharded safetensors, ~20.7 GB total

Install

pip install librehps

The reward heads (multi_axis_head, scalar_head) sit on top of the stock-transformers Qwen3.5 backbone and need the librehps Python package to load. Loading the safetensors with stock transformers alone will give you the backbone but silently drop the reward heads β€” you'll get a working LM, but no reward scores.
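If you do load the checkpoint with stock transformers, you can at least detect the drop rather than discover it later. A minimal sketch, assuming the reward-head tensors live under the multi_axis_head. / scalar_head. key prefixes named above (the exact checkpoint key names are an assumption):

```python
# Sketch: spot reward-head tensors that stock transformers would discard.
# The "multi_axis_head." / "scalar_head." key prefixes are an assumption.
HEAD_PREFIXES = ("multi_axis_head.", "scalar_head.")

def dropped_reward_heads(unexpected_keys):
    """Filter a loading report down to reward-head keys that were dropped."""
    return [k for k in unexpected_keys if k.startswith(HEAD_PREFIXES)]

# transformers can surface dropped keys via output_loading_info=True:
#   model, info = AutoModelForImageTextToText.from_pretrained(
#       "deepcrayon/LibreHPS-4B-v1.1", output_loading_info=True)
#   assert not dropped_reward_heads(info["unexpected_keys"])

example_keys = ["multi_axis_head.proj.weight", "scalar_head.bias", "lm_head.weight"]
print(dropped_reward_heads(example_keys))  # the two head keys, not lm_head
```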

Quickstart — score an image

from librehps import LibreHPS

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")
result = scorer.score_image(image="photo.png", prompt="a cat sitting on a windowsill")
print(result.overall.mean)

Quickstart — compare two images

from librehps import LibreHPS
from librehps.evaluate.calibration import load_platt_calibrator

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")
cal = load_platt_calibrator("calibration.json", benchmark=None).default

result = scorer.compare_images(
    image_a="left.png",
    image_b="right.png",
    prompt="a cat sitting on a windowsill",
    calibrator=cal,
)
print(result.winner, result.probability)

The calibrator= keyword is optional. With it, you get the calibrated win probability shipped in calibration.json (a 2-parameter logistic fit per benchmark, plus a <global> fit you can use for live / unknown-benchmark scoring). Without it, you get the uncalibrated μ-space probability.
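A 2-parameter logistic fit is plain Platt scaling: a calibrated probability sigmoid(a·Δμ + b) of the raw score difference. A sketch of the arithmetic only; the a/b parameter names and the use of the μ difference as input are assumptions about what calibration.json stores:

```python
import math

def platt_probability(delta_mu, a, b):
    """Two-parameter logistic (Platt) calibration of a raw score difference.

    delta_mu: uncalibrated score difference mu(image_a) - mu(image_b).
    a, b: slope and bias of the per-benchmark (or <global>) logistic fit.
    """
    return 1.0 / (1.0 + math.exp(-(a * delta_mu + b)))

# With zero bias, a tied pair calibrates to an even coin flip.
print(platt_probability(0.0, a=2.0, b=0.0))  # 0.5
```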

Hardware and attention backend

By default, LibreHPS.from_pretrained(...) looks for Flash Attention 4 (CUDA Blackwell, sm_100). If FA4 is available it's used as a fast path; if not, the loader logs a warning and falls back to stock Qwen3.5 attention (SDPA on modern PyTorch, eager on CPU). You can pin the backend explicitly:

LibreHPS.from_pretrained(path)                       # auto (default)
LibreHPS.from_pretrained(path, attn_impl="fa4")      # require FA4 + Blackwell
LibreHPS.from_pretrained(path, attn_impl="sdpa")     # skip FA4; CUDA / CPU / MPS all OK
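The auto behaviour can be mirrored in a few lines if you want to decide the backend yourself before loading. A sketch of the selection logic only, not librehps code; the (10, 0) capability threshold for Blackwell and the has_fa4 flag are assumptions, and the sketch folds the MPS case into "sdpa":

```python
def pick_attn_impl(cuda_capability=None, has_fa4=False):
    """Choose an attention backend the way the loader's auto mode is described.

    cuda_capability: (major, minor) compute capability, or None with no GPU.
    has_fa4: whether a Flash Attention 4 build is importable.
    """
    if has_fa4 and cuda_capability is not None and cuda_capability >= (10, 0):
        return "fa4"   # Blackwell (sm_100) fast path
    if cuda_capability is not None:
        return "sdpa"  # stock Qwen3.5 attention on modern PyTorch
    return "eager"     # CPU fallback

print(pick_attn_impl(cuda_capability=(10, 0), has_fa4=True))  # fa4
```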

The checkpoint is ~20.7 GB of sharded safetensors on disk and ~10 GB resident when loaded at bf16, so it fits on a single 24 GB GPU.

Limitations

  • Trained on permissively-licensed generator outputs only. Closed-model outputs (Midjourney, most OpenAI images) were intentionally excluded from training, so scores on those generators are out-of-distribution.
  • Not a content-moderation model. There's no toxicity / safety filter beyond the upstream dataset licence audits.
  • The temporal axis was trained on 5–8 subsampled frames; longer clips are extrapolation.
  • The model's per-axis σ output is uninformative on this checkpoint (median σ ≈ 0.05). Use the calibrated probability field for uncertainty, not ScoreAxis.sigma.

Evaluation

Evaluation uses a held-out 90 % split per benchmark, selected deterministically by hashing global_index. Symmetrised (live-harness-equivalent) numbers:

Benchmark   n        pair_acc   ECE (uncal)   ECE (cal)   Brier (uncal)   Brier (cal)
hpdv3       25 818   0.922      0.068         0.011       0.072           0.054
vrr         37 244   0.764      0.221         0.011       0.225           0.156
imgrew      11 460   0.647      0.333         0.008       0.338           0.216
pickscore   780      0.623      0.368         0.041       0.371           0.236

Aggregate (n-weighted): ECE 0.187 → 0.011 (−94 %); Brier 0.191 → 0.131 (−32 %); pair_acc unchanged. The calibrator is the difference between the "uncal" and "cal" columns.
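For reference, both calibration metrics in the table are cheap to compute from (probability, outcome) pairs. A minimal sketch using equal-width binning for ECE; the bin count is a free choice, not necessarily what the harness uses:

```python
def brier(probs, outcomes):
    """Mean squared error between predicted win probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def ece(probs, outcomes, n_bins=10):
    """Expected calibration error with equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total, err = len(probs), 0.0
    for members in bins:
        if not members:
            continue
        avg_p = sum(p for p, _ in members) / len(members)  # mean confidence
        avg_y = sum(y for _, y in members) / len(members)  # empirical accuracy
        err += len(members) / total * abs(avg_p - avg_y)
    return err

# A perfectly confident, perfectly correct predictor has zero error on both.
print(brier([1.0, 0.0], [1, 0]), ece([1.0, 0.0], [1, 0]))  # 0.0 0.0
```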

Acknowledgement

LibreHPS is inspired by and architecturally influenced by HPSv3 (Ma, Shui, Wu, Sun, Li — ICCV 2025). LibreHPS is a from-scratch reimplementation with a different backbone, training stack, and permissively-licensed training data mix.

License

  • Source code: MIT
  • Model weights: Apache-2.0
  • Training data: permissive union (MIT / Apache-2.0 / BSD-3-Clause / CDLA-Permissive-2.0). See DATA_PROVENANCE.md for the per-dataset audit and the per-image generator-redistribution audit applied to filter the training mix.

Copyright © 2026 Jeff Moe <moe@spacecruft.org>.

Loveland, Colorado, USA
