---
license: apache-2.0
license_name: apache-2.0
license_link: https://www.apache.org/licenses/LICENSE-2.0
library_name: transformers
pipeline_tag: image-classification
tags:
- image-text-to-text
- video-text-to-text
- hpsv3
- multimodal
- qwen3
language:
- en
---
# LibreHPS-4B v1.1
LibreHPS looks at an AI-generated picture (or short video) and a text
prompt and tells you how good a match they are, the way a human would
rate it. You give it a prompt like *"a cat sitting on a windowsill"*
and an image, and it gives you back a score. You can also hand it two
images and ask which one a person would prefer.

This is the kind of model you reach for when you want to automatically
rank or filter the output of an image or video generator the way a
human reviewer would: picking the best of N samples, training another
generator with reinforcement learning from feedback, or running a
benchmark that doesn't need human raters in the loop. It scores along
five separate axes (overall quality, prompt alignment, visual
coherence, style, and, for video, how natural the motion looks).
It's open source under Apache-2.0 and was trained only on
freely-licensed preference data.

| | |
|---|---|
| **Size** | 4B parameters (dense) |
| **Backbone** | `Qwen3_5ForConditionalGeneration` (Qwen3.5-4B-Base, hybrid Mamba2 + full attention, MRoPE) |
| **Reward axes** | `overall`, `alignment`, `coherence`, `style`, `temporal` (video only) |
| **Weights licence** | Apache-2.0 ([`LICENSE`](LICENSE)) |
| **Dataset mix** | permissive-only; see [`DATA_PROVENANCE.md`](DATA_PROVENANCE.md) |
| **Container** | sharded `safetensors`, ~20.7 GB total |
## Install
```bash
pip install librehps
```
The reward heads (`multi_axis_head`, `scalar_head`) sit on top of the
stock `transformers` Qwen3.5 backbone and need the `librehps` Python
package to load. Loading the safetensors with stock `transformers`
alone will give you the backbone but silently drop the reward heads:
you'll get a working LM, but no reward scores.
## Quickstart: score an image
```python
from librehps import LibreHPS
scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")
result = scorer.score_image(image="photo.png", prompt="a cat sitting on a windowsill")
print(result.overall.mean)
```
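The same call extends naturally to best-of-N selection: score every sample
from your generator against the prompt and keep the argmax. A minimal sketch
of the selection logic; the numeric scores below are stand-ins for what
`scorer.score_image(image=path, prompt=prompt).overall.mean` would return, so
the snippet runs without downloading the model:

```python
# Best-of-N selection. In practice each value below would come from
# scorer.score_image(image=path, prompt=prompt).overall.mean; the
# numbers here are stand-ins so the ranking logic runs standalone.
scores = {
    "sample_0.png": 0.41,
    "sample_1.png": 0.87,
    "sample_2.png": 0.63,
}

best = max(scores, key=scores.get)  # highest `overall` mean wins
print(best)  # sample_1.png
```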
## Quickstart: compare two images
```python
from librehps import LibreHPS
from librehps.evaluate.calibration import load_platt_calibrator
scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")
cal = load_platt_calibrator("calibration.json", benchmark=None).default
result = scorer.compare_images(
image_a="left.png",
image_b="right.png",
prompt="a cat sitting on a windowsill",
calibrator=cal,
)
print(result.winner, result.probability)
```
The `calibrator=` keyword is optional. With it, you get the calibrated
win probability shipped in [`calibration.json`](calibration.json) (a
2-parameter logistic fit per benchmark, plus a `<global>` fit you can
use for live / unknown-benchmark scoring). Without it, you get the
uncalibrated μ-space probability.
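A 2-parameter logistic (Platt) fit maps a raw score gap to a calibrated win
probability. The sketch below shows the shape of that mapping; the
coefficient values `a` and `b` are made up for illustration and are not the
fits shipped in `calibration.json`:

```python
import math

def platt(delta, a, b):
    """Calibrated P(A beats B) from the raw score gap delta = score_A - score_B."""
    return 1.0 / (1.0 + math.exp(-(a * delta + b)))

# Illustrative coefficients only; the real per-benchmark fits (and the
# <global> fallback) live in calibration.json.
a, b = 1.7, 0.0
print(platt(0.0, a, b))  # 0.5: equal scores, coin flip
print(platt(1.5, a, b))  # > 0.5: A clearly preferred
```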
## Hardware and attention backend
By default, `LibreHPS.from_pretrained(...)` looks for **Flash
Attention 4** (CUDA Blackwell, sm_100). If FA4 is available it's used
as a fast path; if not, the loader logs a warning and falls back to
stock Qwen3.5 attention (SDPA on modern PyTorch, eager on CPU). You
can pin the backend explicitly:
```python
LibreHPS.from_pretrained(path) # auto (default)
LibreHPS.from_pretrained(path, attn_impl="fa4") # require FA4 + Blackwell
LibreHPS.from_pretrained(path, attn_impl="sdpa") # skip FA4; CUDA / CPU / MPS all OK
```
The checkpoint ships `bfloat16` weights: ~20.7 GB on disk, ~10 GB resident
at bf16, so it fits on a single 24 GB GPU.
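As a back-of-envelope check (assuming the 4B count refers to the language
backbone alone), 4 billion parameters at 2 bytes each in bf16 come to about
7.5 GiB of raw weights; the gap up to the quoted ~10 GB resident figure is
presumably the vision tower, reward heads, and runtime buffers:

```python
# Back-of-envelope bf16 memory for the language backbone alone.
params = 4e9          # "4B parameters (dense)" from the table above
bytes_per_param = 2   # bf16
gib = params * bytes_per_param / 2**30
print(f"{gib:.1f} GiB")  # ~7.5 GiB of raw weights
```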
## Limitations
- Trained on permissively-licensed generator outputs only. Closed-model
outputs (Midjourney, most OpenAI images) were intentionally excluded
from training, so scores on those generators are out-of-distribution.
- Not a content-moderation model. There's no toxicity / safety filter
beyond the upstream dataset licence audits.
- The `temporal` axis was trained on 5–8 subsampled frames; longer
clips are extrapolation.
- The model's per-axis σ output is **uninformative** on this checkpoint
  (median σ ≈ 0.05). Use the calibrated `probability` field for
  uncertainty, not `ScoreAxis.sigma`.
## Evaluation
Evaluation uses a held-out 90 % split per benchmark, selected
deterministically by a `global_index` hash. Symmetrised
(live-harness-equivalent) numbers:

| Benchmark | n | pair_acc | ECE (uncal) | ECE (cal) | Brier (uncal) | Brier (cal) |
|---|---|---|---|---|---|---|
| `hpdv3` | 25 818 | 0.922 | 0.068 | **0.011** | 0.072 | 0.054 |
| `vrr` | 37 244 | 0.764 | 0.221 | **0.011** | 0.225 | 0.156 |
| `imgrew` | 11 460 | 0.647 | 0.333 | **0.008** | 0.338 | 0.216 |
| `pickscore` | 780 | 0.623 | 0.368 | **0.041** | 0.371 | 0.236 |
**Aggregate (n-weighted):** ECE `0.187 → 0.011` (−94 %); Brier `0.191
→ 0.131` (−32 %); pair_acc unchanged. The only difference between the
"uncal" and "cal" columns is the calibrator.
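The n-weighted aggregates follow directly from the table; a quick check of
the arithmetic:

```python
# (n, ECE_uncal, ECE_cal, Brier_uncal, Brier_cal) per benchmark, from the table above
rows = [
    (25818, 0.068, 0.011, 0.072, 0.054),  # hpdv3
    (37244, 0.221, 0.011, 0.225, 0.156),  # vrr
    (11460, 0.333, 0.008, 0.338, 0.216),  # imgrew
    (  780, 0.368, 0.041, 0.371, 0.236),  # pickscore
]
total = sum(n for n, *_ in rows)

def weighted(col):
    """n-weighted mean of metric column `col` (0..3) across benchmarks."""
    return sum(n * vals[col] for n, *vals in rows) / total

print(round(weighted(0), 3), round(weighted(1), 3))  # 0.187 0.011  (ECE)
print(round(weighted(2), 3), round(weighted(3), 3))  # 0.191 0.131  (Brier)
```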
## Acknowledgement
LibreHPS is inspired by and architecturally influenced by **HPSv3**
(Ma, Shui, Wu, Sun, Li; ICCV 2025). It is a from-scratch
reimplementation with a different backbone, training stack, and
permissively-licensed training data mix.
## License
- **Source code:** MIT
- **Model weights:** Apache-2.0
- **Training data:** permissive union (MIT / Apache-2.0 / BSD-3-Clause / CDLA-Permissive-2.0). See DATA_PROVENANCE.md for the per-dataset audit and the per-image generator-redistribution audit applied to filter the training mix.
*Copyright © 2026 Jeff Moe <moe@spacecruft.org>.*
Loveland, Colorado, USA