---
license: apache-2.0
license_name: apache-2.0
license_link: https://www.apache.org/licenses/LICENSE-2.0
library_name: transformers
pipeline_tag: image-classification
tags:
  - image-text-to-text
  - video-text-to-text
  - hpsv3
  - multimodal
  - qwen3
language:
  - en
---

# LibreHPS-4B v1.1

LibreHPS looks at an AI-generated picture (or short video) and a text
prompt and tells you how good a match they are, the way a human would
rate it. You give it a prompt like *"a cat sitting on a windowsill"*
and an image, and it gives you back a score. You can also hand it two
images and ask which one a person would prefer.

This is the kind of model you reach for when you want to automatically
rank or filter the output of an image or video generator the way a
human reviewer would: picking the best of N samples, training another
generator with reinforcement learning from feedback, or running a
benchmark that doesn't need human raters in the loop. It scores along
five separate axes: overall quality, prompt alignment, visual
coherence, style, and, for video, how natural the motion looks.

It's open source under Apache-2.0 and was trained only on
freely-licensed preference data.

| | |
|---|---|
| **Size** | 4B parameters (dense) |
| **Backbone** | `Qwen3_5ForConditionalGeneration` (Qwen3.5-4B-Base, hybrid Mamba2 + full attention, MRoPE) |
| **Reward axes** | `overall`, `alignment`, `coherence`, `style`, `temporal` (video only) |
| **Weights license** | Apache-2.0 ([`LICENSE`](LICENSE)) |
| **Dataset mix** | permissive-only; see [`DATA_PROVENANCE.md`](DATA_PROVENANCE.md) |
| **Container** | sharded `safetensors`, ~20.7 GB total |

## Install

```bash
pip install librehps
```

The reward heads (`multi_axis_head`, `scalar_head`) sit on top of the
stock-transformers Qwen3.5 backbone and need the `librehps` Python
package to load. Loading the safetensors with stock `transformers`
alone will give you the backbone but silently drop the reward heads:
you'll get a working LM, but no reward scores.
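
A quick way to see the dropped heads, assuming the checkpoint resolves
through `AutoModel` (the exact state-dict prefixes below are an
assumption based on the head names above):

```python
# Sketch: loading with stock transformers only. The backbone loads,
# but the reward-head tensors are reported as unused keys.
from transformers import AutoModel

model, info = AutoModel.from_pretrained(
    "LibreHPS/LibreHPS-4B-v1.1", output_loading_info=True
)
# Expect entries like "multi_axis_head.*" / "scalar_head.*" here:
print([k for k in info["unexpected_keys"] if "head" in k][:5])
```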

## Quickstart: score an image

```python
from librehps import LibreHPS

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")
result = scorer.score_image(image="photo.png", prompt="a cat sitting on a windowsill")
print(result.overall.mean)
```
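
The same call scales to best-of-N selection, the ranking use case
mentioned above. A minimal sketch, assuming each candidate is a local
file and reusing the `overall.mean` field from the quickstart:

```python
from librehps import LibreHPS

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")
prompt = "a cat sitting on a windowsill"
candidates = ["sample_0.png", "sample_1.png", "sample_2.png", "sample_3.png"]

# Score every sample against the prompt and keep the best one.
scored = [(scorer.score_image(image=p, prompt=prompt).overall.mean, p)
          for p in candidates]
best_score, best_path = max(scored)
print(best_path, best_score)
```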

## Quickstart: compare two images

```python
from librehps import LibreHPS
from librehps.evaluate.calibration import load_platt_calibrator

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")
cal = load_platt_calibrator("calibration.json", benchmark=None).default

result = scorer.compare_images(
    image_a="left.png",
    image_b="right.png",
    prompt="a cat sitting on a windowsill",
    calibrator=cal,
)
print(result.winner, result.probability)
```

The `calibrator=` keyword is optional. With it, you get the calibrated
win probability shipped in [`calibration.json`](calibration.json) (a
2-parameter logistic fit per benchmark, plus a `<global>` fit you can
use for live / unknown-benchmark scoring). Without it, you get the
uncalibrated μ-space probability.
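
For reference, the underlying transform is plain Platt scaling. A
minimal sketch of what a 2-parameter fit does to a raw score
difference; the `(a, b)` field names and JSON layout below are
assumptions, not the actual `calibration.json` schema:

```python
import json
import math

def platt(mu_diff: float, a: float, b: float) -> float:
    """Calibrated P(image A preferred) from the raw score difference."""
    return 1.0 / (1.0 + math.exp(-(a * mu_diff + b)))

# Field names are illustrative; inspect calibration.json for the real schema.
params = json.load(open("calibration.json"))["<global>"]
print(platt(0.8, params["a"], params["b"]))
```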

## Hardware and attention backend

By default, `LibreHPS.from_pretrained(...)` looks for **Flash
Attention 4** (CUDA Blackwell, sm_100). If FA4 is available it's used
as a fast path; if not, the loader logs a warning and falls back to
stock Qwen3.5 attention (SDPA on modern PyTorch, eager on CPU). You
can pin the backend explicitly:

```python
LibreHPS.from_pretrained(path)                       # auto (default)
LibreHPS.from_pretrained(path, attn_impl="fa4")      # require FA4 + Blackwell
LibreHPS.from_pretrained(path, attn_impl="sdpa")     # skip FA4; CUDA / CPU / MPS all OK
```

The weights are `bfloat16`: ~20.7 GB on disk, ~10 GB resident at bf16,
so the model fits on a single 24 GB GPU.
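
If you also want to pin placement and dtype, a sketch that assumes the
loader forwards standard `transformers` keyword arguments (only
`attn_impl` is documented above; `torch_dtype` and `device_map` are
assumptions about the `librehps` loader):

```python
import torch
from librehps import LibreHPS

# attn_impl is documented above; torch_dtype / device_map are assumed
# to pass through to the underlying transformers loader.
scorer = LibreHPS.from_pretrained(
    "LibreHPS/LibreHPS-4B-v1.1",
    attn_impl="sdpa",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)
```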

## Limitations

- Trained on permissively-licensed generator outputs only. Closed-model
  outputs (Midjourney, most OpenAI images) were intentionally excluded
  from training, so scores on those generators are out-of-distribution.
- Not a content-moderation model. There's no toxicity / safety filter
  beyond the upstream dataset licence audits.
- The `temporal` axis was trained on 5-8 subsampled frames; longer
  clips are extrapolation (see the sketch after this list).
- The model's per-axis σ output is **uninformative** on this checkpoint
  (median σ ≈ 0.05). Use the calibrated `probability` field for
  uncertainty, not `ScoreAxis.sigma`.
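
To make the temporal caveat concrete, a sketch of keeping a video
inside the trained frame budget; `score_video` and `num_frames` are
assumed names for illustration, not a documented API:

```python
from librehps import LibreHPS

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")
# num_frames=8 stays inside the 5-8-frame regime the temporal axis was
# trained on; both names here are assumptions, not documented API.
result = scorer.score_video(
    video="clip.mp4",
    prompt="a cat jumps onto a windowsill",
    num_frames=8,
)
print(result.temporal.mean)
```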

## Evaluation

Evaluation uses a held-out 90 % split per benchmark, selected
deterministically by a hash of `global_index`. Numbers are symmetrised
(live-harness-equivalent):

| Benchmark | n | pair_acc | ECE (uncal) | ECE (cal) | Brier (uncal) | Brier (cal) |
|---|---|---|---|---|---|---|
| `hpdv3` | 25,818 | 0.922 | 0.068 | **0.011** | 0.072 | 0.054 |
| `vrr` | 37,244 | 0.764 | 0.221 | **0.011** | 0.225 | 0.156 |
| `imgrew` | 11,460 | 0.647 | 0.333 | **0.008** | 0.338 | 0.216 |
| `pickscore` | 780 | 0.623 | 0.368 | **0.041** | 0.371 | 0.236 |

**Aggregate (n-weighted):** ECE `0.187 → 0.011` (−94 %); Brier `0.191
→ 0.131` (−32 %); pair_acc unchanged. The calibrator is the difference
between the "uncal" and "cal" columns.

## Acknowledgement

LibreHPS is inspired by and architecturally influenced by **HPSv3**
(Ma, Shui, Wu, Sun, Li; ICCV 2025), but it is a from-scratch
reimplementation with a different backbone, training stack, and
permissively-licensed training data mix.

## License
- **Source code:** MIT
- **Model weights:** Apache-2.0
- **Training data:** permissive union (MIT / Apache-2.0 / BSD-3-Clause / CDLA-Permissive-2.0). See [`DATA_PROVENANCE.md`](DATA_PROVENANCE.md) for the per-dataset audit and the per-image generator-redistribution audit applied to filter the training mix.

*Copyright © 2026 Jeff Moe <moe@spacecruft.org>.*

Loveland, Colorado, USA