---
license: apache-2.0
license_name: apache-2.0
license_link: https://www.apache.org/licenses/LICENSE-2.0
library_name: transformers
pipeline_tag: image-classification
tags:
  - image-text-to-text
  - video-text-to-text
  - hpsv3
  - multimodal
  - qwen3
language:
  - en
---

# LibreHPS-4B v1.1

LibreHPS looks at an AI-generated picture (or short video) and a text
prompt and tells you how good a match they are, the way a human would
rate it. You give it a prompt like *"a cat sitting on a windowsill"*
and an image, and it gives you back a score. You can also hand it two
images and ask which one a person would prefer.

This is the kind of model you reach for when you want to automatically
rank or filter the output of an image or video generator the way a
human reviewer would: picking the best of N samples, training another
generator with reinforcement learning from feedback, or running a
benchmark that doesn't need human raters in the loop. It scores along
five separate axes: overall quality, prompt alignment, visual
coherence, style, and, for video, how natural the motion looks.

It's open source under Apache-2.0 and was trained only on
freely-licensed preference data.

| | |
|---|---|
| **Size** | 4B parameters (dense) |
| **Backbone** | `Qwen3_5ForConditionalGeneration` (Qwen3.5-4B-Base, hybrid Mamba2 + full attention, MRoPE) |
| **Reward axes** | `overall`, `alignment`, `coherence`, `style`, `temporal` (video only) |
| **Weights license** | Apache-2.0 ([`LICENSE`](LICENSE)) |
| **Dataset mix** | permissive-only; see [`DATA_PROVENANCE.md`](DATA_PROVENANCE.md) |
| **Container** | sharded `safetensors`, ~20.7 GB total |

## Install

```bash
pip install librehps
```

The reward heads (`multi_axis_head`, `scalar_head`) sit on top of the
stock-transformers Qwen3.5 backbone and need the `librehps` Python
package to load. Loading the safetensors with stock `transformers`
alone will give you the backbone but silently drop the reward heads:
you'll get a working LM, but no reward scores.
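
A quick way to see the dropped heads, assuming the checkpoint resolves
through `AutoModel` (the exact state-dict prefixes below are an
assumption based on the head names above):

```python
# Sketch: loading with stock transformers only. The backbone loads,
# but the reward-head tensors are reported as unused keys.
from transformers import AutoModel

model, info = AutoModel.from_pretrained(
    "LibreHPS/LibreHPS-4B-v1.1", output_loading_info=True
)
# Expect entries like "multi_axis_head.*" / "scalar_head.*" here:
print([k for k in info["unexpected_keys"] if "head" in k][:5])
```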

## Quickstart: score an image

```python
from librehps import LibreHPS

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")
result = scorer.score_image(image="photo.png", prompt="a cat sitting on a windowsill")
print(result.overall.mean)
```
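
The same call scales to best-of-N selection, the ranking use case
mentioned above. A minimal sketch, assuming each candidate is a local
file and reusing the `overall.mean` field from the quickstart:

```python
from librehps import LibreHPS

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")
prompt = "a cat sitting on a windowsill"
candidates = ["sample_0.png", "sample_1.png", "sample_2.png", "sample_3.png"]

# Score every sample against the prompt and keep the best one.
scored = [(scorer.score_image(image=p, prompt=prompt).overall.mean, p)
          for p in candidates]
best_score, best_path = max(scored)
print(best_path, best_score)
```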

## Quickstart: compare two images

```python
from librehps import LibreHPS
from librehps.evaluate.calibration import load_platt_calibrator

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")
cal = load_platt_calibrator("calibration.json", benchmark=None).default

result = scorer.compare_images(
    image_a="left.png",
    image_b="right.png",
    prompt="a cat sitting on a windowsill",
    calibrator=cal,
)
print(result.winner, result.probability)
```

The `calibrator=` keyword is optional. With it, you get the calibrated
win probability shipped in [`calibration.json`](calibration.json) (a
2-parameter logistic fit per benchmark, plus a `<global>` fit you can
use for live / unknown-benchmark scoring). Without it, you get the
uncalibrated μ-space probability.
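
For reference, the underlying transform is plain Platt scaling. A
minimal sketch of what a 2-parameter fit does to a raw score
difference; the `(a, b)` field names and JSON layout below are
assumptions, not the actual `calibration.json` schema:

```python
import json
import math

def platt(mu_diff: float, a: float, b: float) -> float:
    """Calibrated P(image A preferred) from the raw score difference."""
    return 1.0 / (1.0 + math.exp(-(a * mu_diff + b)))

# Field names are illustrative; inspect calibration.json for the real schema.
params = json.load(open("calibration.json"))["<global>"]
print(platt(0.8, params["a"], params["b"]))
```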

## Hardware and attention backend

By default, `LibreHPS.from_pretrained(...)` looks for **Flash
Attention 4** (CUDA Blackwell, sm_100). If FA4 is available it's used
as a fast path; if not, the loader logs a warning and falls back to
stock Qwen3.5 attention (SDPA on modern PyTorch, eager on CPU). You
can pin the backend explicitly:

```python
LibreHPS.from_pretrained(path)                       # auto (default)
LibreHPS.from_pretrained(path, attn_impl="fa4")      # require FA4 + Blackwell
LibreHPS.from_pretrained(path, attn_impl="sdpa")     # skip FA4; CUDA / CPU / MPS all OK
```

The weights are `bfloat16`: ~20.7 GB on disk, ~10 GB resident at bf16,
so the model fits on a single 24 GB GPU.
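
If you also want to pin placement and dtype, a sketch that assumes the
loader forwards standard `transformers` keyword arguments (only
`attn_impl` is documented above; `torch_dtype` and `device_map` are
assumptions about the `librehps` loader):

```python
import torch
from librehps import LibreHPS

# attn_impl is documented above; torch_dtype / device_map are assumed
# to pass through to the underlying transformers loader.
scorer = LibreHPS.from_pretrained(
    "LibreHPS/LibreHPS-4B-v1.1",
    attn_impl="sdpa",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)
```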

## Limitations

- Trained on permissively-licensed generator outputs only. Closed-model
  outputs (Midjourney, most OpenAI images) were intentionally excluded
  from training, so scores on those generators are out-of-distribution.
- Not a content-moderation model. There's no toxicity / safety filter
  beyond the upstream dataset licence audits.
- The `temporal` axis was trained on 5-8 subsampled frames; longer
  clips are extrapolation (see the sketch after this list).
- The model's per-axis σ output is **uninformative** on this checkpoint
  (median σ ≈ 0.05). Use the calibrated `probability` field for
  uncertainty, not `ScoreAxis.sigma`.
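
To make the temporal caveat concrete, a sketch of keeping a video
inside the trained frame budget; `score_video` and `num_frames` are
assumed names for illustration, not a documented API:

```python
from librehps import LibreHPS

scorer = LibreHPS.from_pretrained("LibreHPS/LibreHPS-4B-v1.1")
# num_frames=8 stays inside the 5-8-frame regime the temporal axis was
# trained on; both names here are assumptions, not documented API.
result = scorer.score_video(
    video="clip.mp4",
    prompt="a cat jumps onto a windowsill",
    num_frames=8,
)
print(result.temporal.mean)
```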

## Evaluation

Evaluation uses a held-out 90 % split per benchmark, selected
deterministically by a hash of `global_index`. Numbers are symmetrised
(live-harness-equivalent):

| Benchmark | n | pair_acc | ECE (uncal) | ECE (cal) | Brier (uncal) | Brier (cal) |
|---|---|---|---|---|---|---|
| `hpdv3` | 25,818 | 0.922 | 0.068 | **0.011** | 0.072 | 0.054 |
| `vrr` | 37,244 | 0.764 | 0.221 | **0.011** | 0.225 | 0.156 |
| `imgrew` | 11,460 | 0.647 | 0.333 | **0.008** | 0.338 | 0.216 |
| `pickscore` | 780 | 0.623 | 0.368 | **0.041** | 0.371 | 0.236 |

**Aggregate (n-weighted):** ECE `0.187 → 0.011` (−94 %); Brier `0.191
→ 0.131` (−32 %); pair_acc unchanged. The calibrator is the difference
between the "uncal" and "cal" columns.

## Acknowledgement

LibreHPS is inspired by and architecturally influenced by **HPSv3**
(Ma, Shui, Wu, Sun, Li; ICCV 2025), but it is a from-scratch
reimplementation with a different backbone, training stack, and
permissively-licensed training data mix.

## License
- **Source code:** MIT
- **Model weights:** Apache-2.0
- **Training data:** permissive union (MIT / Apache-2.0 / BSD-3-Clause / CDLA-Permissive-2.0). See [`DATA_PROVENANCE.md`](DATA_PROVENANCE.md) for the per-dataset audit and the per-image generator-redistribution audit applied to filter the training mix.

*Copyright © 2026 Jeff Moe <moe@spacecruft.org>.*

Loveland, Colorado, USA