| # LibreHPS: Data provenance |
|
|
| LibreHPS-4B was trained exclusively on permissively-licensed |
| human-preference data for text-to-image and text-to-video generation. |
| This document summarises the licence audit performed for every |
| upstream dataset in the training blend and the per-image generator |
| audit applied to filter the data before training. |
|
|
| ## Datasets used |
|
|
| All datasets in the LibreHPS training blend are distributed under one |
| of: **MIT**, **Apache-2.0**, **BSD-3-Clause**, or |
| **CDLA-Permissive-2.0**. None impose non-commercial, no-derivatives, |
| or share-alike restrictions on ML-model training derivatives. |
|
|
| | Dataset | Upstream licence | |
| |---|---| |
| | `MizzenAI/HPDv3` | MIT | |
| | `zai-org/ImageRewardDB` | Apache-2.0 | |
| | `zai-org/VisionRewardDB-Image` | Apache-2.0 | |
| | `zai-org/VisionRewardDB-Video` | Apache-2.0 | |
| | `Rapidata/text-2-image-Rich-Human-Feedback` | CDLA-Permissive-2.0 | |
| | `Rapidata/sora-video-generation-alignment-likert-scoring` | CDLA-Permissive-2.0 | |
| | `Rapidata/text-2-video-human-preferences-veo3` | CDLA-Permissive-2.0 | |
| | `TIGER-Lab/VideoFeedback` | MIT | |
| | `data-is-better-together/open-image-preferences-v1` | Apache-2.0 | |
| | `fudan-generative-ai/LiFT-HRA-20K` | Apache-2.0 | |
|
|
| CDLA-Permissive-2.0 §3.1 jointly with §5.4 exempts "Results" |
| (including ML-model artefacts) from any downstream restriction. This |
| is the basis on which the Rapidata CDLA-Permissive-2.0 datasets are included. |
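
The licence allowlist and the dataset table in this section can be expressed as a small check. The following is a minimal sketch, not the audit tooling actually used: the names `ALLOWED_LICENCES`, `DATASET_LICENCES`, and `disallowed_datasets` are hypothetical, and the mapping is copied by hand from the table above rather than read from any upstream metadata.

```python
# Hypothetical allowlist check; names and structure are illustrative only.
ALLOWED_LICENCES = {"MIT", "Apache-2.0", "BSD-3-Clause", "CDLA-Permissive-2.0"}

# Copied by hand from the "Datasets used" table.
DATASET_LICENCES = {
    "MizzenAI/HPDv3": "MIT",
    "zai-org/ImageRewardDB": "Apache-2.0",
    "zai-org/VisionRewardDB-Image": "Apache-2.0",
    "zai-org/VisionRewardDB-Video": "Apache-2.0",
    "Rapidata/text-2-image-Rich-Human-Feedback": "CDLA-Permissive-2.0",
    "Rapidata/sora-video-generation-alignment-likert-scoring": "CDLA-Permissive-2.0",
    "Rapidata/text-2-video-human-preferences-veo3": "CDLA-Permissive-2.0",
    "TIGER-Lab/VideoFeedback": "MIT",
    "data-is-better-together/open-image-preferences-v1": "Apache-2.0",
    "fudan-generative-ai/LiFT-HRA-20K": "Apache-2.0",
}


def disallowed_datasets(dataset_licences, allowed=ALLOWED_LICENCES):
    """Return the sorted dataset IDs whose licence is not on the allowlist."""
    return sorted(
        name for name, licence in dataset_licences.items() if licence not in allowed
    )


if __name__ == "__main__":
    # An empty list means every dataset in the table is on the allowlist.
    print(disallowed_datasets(DATASET_LICENCES))
```

A check like this only compares declared licence strings; it does not replace reading the licence text or the per-image audit described in the next section.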
|
|
| ## Per-image generator audit |
|
|
| Some upstream datasets, notably the Rapidata model-vs-model |
| splits, embed images generated by closed-source services (OpenAI, |
| Midjourney, Imagen, etc.). Each candidate training pair carries an |
| audit flag `license_audit_flags.upstream_generator_restriction` |
| recording whether the upstream generator's ToS prohibits |
| redistribution of its outputs. Rows flagged `true` were dropped before |
| training. |
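
The drop rule described above can be sketched as a simple row filter. This is an illustration under stated assumptions, not the project's pipeline: it assumes each row is a dictionary with a nested `license_audit_flags` mapping, and it treats a missing or non-boolean flag as restricted, which is a conservative choice made here and not something the audit description specifies. The function name `keep_unrestricted` is hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator


def keep_unrestricted(rows: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield only rows whose upstream_generator_restriction flag is False."""
    for row in rows:
        flags = row.get("license_audit_flags") or {}
        # Assumption: a missing or non-boolean flag is treated as restricted,
        # so only an explicit False lets the row through.
        if flags.get("upstream_generator_restriction", True) is False:
            yield row


if __name__ == "__main__":
    sample = [
        {"id": 1, "license_audit_flags": {"upstream_generator_restriction": False}},
        {"id": 2, "license_audit_flags": {"upstream_generator_restriction": True}},
        {"id": 3},  # no audit flags recorded
    ]
    print([row["id"] for row in keep_unrestricted(sample)])
```

Requiring an explicit `False` means an incomplete audit can only shrink the training set, never let an unaudited row through.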
|
|
| **Generators dropped:** |
|
|
| - **Midjourney**: ToS prohibits redistribution of generated images. |
| - **DALL·E 3** (via OpenAI API): older OpenAI ToS prohibits |
| redistribution; only retained where the upstream dataset itself |
| asserts generator-ToS compatibility. |
|
|
| **Generators kept:** |
|
|
| - **Stable Diffusion family** (SD 1.x, SD 2.x, SDXL, SD 3.x): open |
| weights. |
| - **FLUX.1**, **FLUX.1.1**, **FLUX 2 pro**: API ToS permits |
| redistribution. |
| - **Imagen 3 / 4 / 4 Ultra**: Google generative-AI ToS permits |
| redistribution of outputs. |
| - **HunyuanImage 2.1**, **Seedream 3**, **Kolors**, **Lumina**: open |
| weights. |
| - **Recraft v2 / v3**, **Ideogram V2**: API ToS permits |
| redistribution. |
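
The dropped and kept lists above amount to a per-generator policy that produces the restriction flag. The sketch below shows one way such a policy could be encoded. Everything in it is an assumption for illustration: the lower-cased generator identifiers, the function name `upstream_generator_restriction`, the `dataset_asserts_compat` parameter, and the choice to treat unknown generators as restricted are not taken from the actual audit code.

```python
# Hypothetical generator identifiers (lower-cased); real metadata may differ.
PERMITTED = {
    "sd 1.x", "sd 2.x", "sdxl", "sd 3.x",
    "flux.1", "flux.1.1", "flux 2 pro",
    "imagen 3", "imagen 4", "imagen 4 ultra",
    "hunyuanimage 2.1", "seedream 3", "kolors", "lumina",
    "recraft v2", "recraft v3", "ideogram v2",
}

# Restricted unless the upstream dataset asserts generator-ToS compatibility.
RESTRICTED_UNLESS_ASSERTED = {"dall-e 3"}


def upstream_generator_restriction(
    generator: str, dataset_asserts_compat: bool = False
) -> bool:
    """Return True when outputs of `generator` should be dropped."""
    name = generator.strip().lower()
    if name in PERMITTED:
        return False
    if name in RESTRICTED_UNLESS_ASSERTED:
        return not dataset_asserts_compat
    # Midjourney and any generator not listed above: treated as restricted
    # (the handling of unknown generators is an assumption of this sketch).
    return True


if __name__ == "__main__":
    for gen in ("Midjourney", "SDXL", "DALL-E 3"):
        print(gen, upstream_generator_restriction(gen))
```

Defaulting unknown generators to restricted keeps the policy fail-closed: a new generator has to be reviewed and added to the permitted set before its outputs are used.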
|
|
| ## Excluded datasets |
|
|
| These were reviewed and explicitly excluded: |
|
|
| - **Pick-a-Pic v1 / v2**: no declared HF licence; the underlying |
| Dreamlike Photoreal 2.0 generator carries an OpenRAIL-M |
| non-commercial clause. |
| - **HPDv2 (standalone)**: research-use only. |
| - **Midjourney Discord scrapes**: ToS prohibits redistribution. |
| - **PKU-Alignment/SafeSora\***: CC-BY-NC-4.0 (non-commercial). |
| - **CodeGoat24/VideoFeedback**: a derivative of |
| `TIGER-Lab/VideoFeedback`; the original was ingested instead. |
|
|
| ## What this means for users |
|
|
| LibreHPS-4B is safe to use commercially. Every row used in training |
| was drawn from a permissively-licensed dataset and passed the |
| generator-redistribution audit above; rows that failed the audit were |
| dropped. The model weights are released under Apache-2.0 (see |
| [`LICENSE`](LICENSE)) with no inherited non-commercial or |
| no-derivatives clauses from the training data. |
|
|