docs: complete inference snippet + embed rubric prompt templates

README.md
CHANGED

@@ -9,26 +9,225 @@ tags:
  - anonymous-release
  ---

- #

-
- physical-law sub-rubrics
- alongside the companion dataset
  [`anonymouscla/physground`](https://huggingface.co/datasets/anonymouscla/physground).

- The base model identifier required to
- `adapter_config.json` (`base_model_name_or_path`)

  ## Files

  ```
-
-
-
-
  ```

-

  | Item | Value |
  | --- | --- |

@@ -41,32 +240,13 @@ training_args.json # sanitized training hyperparameters
  | Best eval loss | 0.1063 (step 294) |
  | Framework | ms-swift 4.1.2, PEFT 0.19.1, DeepSpeed ZeRO-2 |

- The training mixture combines automatically derived sub-question
- with human-rated samples (the `subq+human` split). See the
- anonymous dataset for prompts, physical-law tags, and example
-
- ## Usage
-
- ```python
- import json
- from peft import PeftModel
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- adapter_dir = "."  # this directory
- base_id = json.load(open(f"{adapter_dir}/adapter_config.json"))["base_model_name_or_path"]
-
- tokenizer = AutoTokenizer.from_pretrained(base_id)
- base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="bfloat16", device_map="auto")
- model = PeftModel.from_pretrained(base, adapter_dir)
- model.eval()
- ```
-
- The adapter expects the base model's default chat template, with a prompt
- that asks the judge to answer one or more sub-rubric questions about a
- candidate video frame/caption. Greedy decoding (`temperature = 0`) with
- `max_new_tokens = 64` matches the training-time generation config.

  ## License

- The base model is released by its respective authors; this LoRA adapter
- shared for anonymous review purposes. No identifying metadata is
# physground-judger9B — Anonymous Judge LoRA Adapter

LoRA adapter trained as a judge model that scores generated videos against
prompt-alignment, temporal, persistence, and 13 physical-law sub-rubrics.
Released anonymously alongside the companion dataset
[`anonymouscla/physground`](https://huggingface.co/datasets/anonymouscla/physground).

The base model identifier required to attach this adapter is recorded in
`adapter_config.json` (`base_model_name_or_path`); the inference script
reads it automatically.

## Files

| File | Purpose |
| --- | --- |
| `adapter_config.json` | PEFT/LoRA config (records base model id) |
| `adapter_model.safetensors` | LoRA weights (~167 MB) |
| `additional_config.json` | ms-swift extras (lora_dtype / lr ratios) |
| `training_args.json` | sanitized training hyperparameters |
| `subq+human.yaml` | prompt template used at training and inference time |
| `infer.py` | standalone end-to-end inference script |

## Setup

```bash
pip install "transformers>=4.49" peft accelerate pyyaml \
    "qwen-vl-utils[decord]" huggingface_hub
```

Loading the base model in bf16 needs roughly 24 GB of GPU memory.
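The 24 GB figure is roughly parameter count times bf16 width plus inference overhead; a back-of-the-envelope check (assuming a ~9 B-parameter base model):

```python
# Back-of-the-envelope: 9B parameters at 2 bytes each (bf16), in GiB.
params = 9e9
weights_gib = params * 2 / 1024**3
print(round(weights_gib, 1))  # 16.8 -- the ~167 MB adapter, KV cache,
# and video activations add several more GB, hence roughly 24 GB.
```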
## Quickstart — Hugging Face Hub

`infer.py` accepts either a local folder or a HF Hub repo id via
`--adapter-dir`; the default value already points at this repo, so the
following commands work without cloning anything.

```bash
# General axes (1–5 each): SA / PTV / persistence
python infer.py \
    --video /path/to/video.mp4 \
    --caption "A ball rolls down a ramp and knocks over a block." \
    --metric SA

# Physical-law axes (1–5 each): one of the 13 laws below
python infer.py \
    --video /path/to/video.mp4 \
    --caption "A ball rolls down a ramp and knocks over a block." \
    --law gravity
```

`infer.py` will:

1. Resolve `--adapter-dir` to a local directory (via
   `huggingface_hub.snapshot_download` if it is a Hub id).
2. Read `adapter_config.json` to find the base model and load it via
   `transformers` (`Qwen/Qwen3.5-9B`).
3. Attach the LoRA adapter via PEFT.
4. Render the scoring prompt from `subq+human.yaml`, plus the relevant
   sub-questions / per-law criterion (constants embedded in `infer.py`).
5. Run greedy decoding with `--max-new-tokens 64` (matches training).
6. Parse the JSON object and print the integer score.
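Steps 1 and 2 amount to something like the following; this is an illustrative sketch, and `infer.py` remains the authoritative implementation:

```python
import json
from pathlib import Path


def resolve_adapter_dir(adapter_dir):
    """Step 1: use a local folder as-is, else treat the argument as a Hub repo id."""
    path = Path(adapter_dir)
    if path.is_dir():
        return path
    # Imported lazily: only needed when downloading from the Hub.
    from huggingface_hub import snapshot_download
    return Path(snapshot_download(adapter_dir))


def read_base_model_id(adapter_dir):
    """Step 2: the PEFT config records which base model the LoRA was trained on."""
    cfg = json.loads((Path(adapter_dir) / "adapter_config.json").read_text())
    return cfg["base_model_name_or_path"]
```

From there the base model is loaded with `transformers` and the adapter attached with PEFT.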
Output is a single JSON line:

```json
{"key": "gravity", "score": 4, "raw": "{\"gravity\": 4}"}
```

`--metric` choices: `SA`, `PTV`, `persistence`.
`--law` choices: `gravity`, `inertia`, `momentum`, `impenetrability`,
`collision`, `material`, `buoyancy`, `displacement`, `flow_dynamics`,
`boundary_interaction`, `fluid_continuity`, `reflection`, `shadow`.

Add `--print-prompt` to inspect the exact rendered system + user prompt
before generation.
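The parsing step can be sketched like this; `parse_score` here is an illustrative stand-in, and the actual helper in `infer.py` may differ:

```python
import json
import re


def parse_score(raw, key):
    """Extract the integer score for `key` from the judge's raw output.

    The judge is instructed to emit only a JSON object such as
    {"gravity": 4}; the regex fallback covers outputs that wrap the
    JSON in extra text.
    """
    try:
        obj = json.loads(raw)
        if isinstance(obj, dict) and key in obj:
            return int(obj[key])
    except json.JSONDecodeError:
        pass
    # Fallback: first flat {...} fragment that mentions the key.
    match = re.search(r'\{[^{}]*"' + re.escape(key) + r'"[^{}]*\}', raw)
    if match:
        try:
            return int(json.loads(match.group(0))[key])
        except (json.JSONDecodeError, KeyError, ValueError):
            return None
    return None


print(parse_score('{"gravity": 4}', "gravity"))  # 4
```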
## Programmatic use

```python
from pathlib import Path

import torch

from infer import (
    build_messages,
    build_prompt,
    decode_generated,
    load_model,
    load_yaml,
    parse_score,
    prepare_inputs,
)

processor, model, adapter_dir = load_model(
    "anonymouscla/physground-judger9B",
    dtype=torch.bfloat16,
    device_map="auto",
)
cfg = load_yaml(adapter_dir / "subq+human.yaml")

system, user, key = build_prompt(
    cfg,
    caption="A ball rolls down a ramp and knocks over a block.",
    law="gravity",
)
messages = build_messages(system, user, Path("video.mp4"))
inputs = prepare_inputs(
    processor,
    messages,
    next(model.parameters()).device,
    fps=2.0,
    max_pixels=360 * 640,
)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)

raw = decode_generated(processor, inputs, out)
print({"key": key, "score": parse_score(raw, key), "raw": raw})
```

## Prompt templates

System prompt (all axes): `You are a strict video evaluation model.`

The prompt always asks the judge to consider observable sub-questions in
its mind and then output **only** a JSON object with one 1–5 score.

### General axes — `subq+human.yaml`

**SA — Prompt alignment**

```
Evaluate Prompt Alignment (SA).

Caption:
"{prompt}"

The video was generated using a text+image-to-video (ti2v) model,
conditioned on the first frame and the text prompt above.

Sub-questions to consider in your mind before scoring:
1. Are the main objects in the caption present in the video?
2. Are the key actions or interactions from the caption visible?
3. Are important scene attributes and relationships preserved?
4. Does the video avoid major contradictions to the caption?

Score 1-5: 5=fully aligned, 4=mostly aligned with minor deviations,
3=partially aligned with notable gaps, 2=mostly misaligned,
1=not aligned

Then output ONLY a JSON object with exactly one key: SA.
Example: {"SA": 3}
```

**PTV — Temporal coherence** uses the same shape with sub-questions:

1. Do causes appear before their effects?
2. Do physical events unfold in a plausible temporal order?
3. Are motion transitions continuous rather than abrupt jumps or loops?
4. Does the sequence avoid impossible reversals or repeated resets?

**persistence — Object persistence** uses:

1. Do objects maintain consistent existence throughout the video?
2. Do objects keep a stable shape, size, color, and texture?
3. Do objects avoid disappearing, appearing, or transforming unexpectedly?
4. Do objects preserve identity through motion and brief occlusion?

(See `subq+human.yaml` for the verbatim PTV / persistence rubric anchors.)

### Physical-law axes — `physical_template`

```
Evaluate physical realism for one physical law: {law}.

Criterion:
{criteria}

Caption, for context only:
"{prompt}"

Sub-questions to consider in your mind before scoring:
{questions_block}

Judge the video itself. Do not penalize prompt mismatch unless it affects
whether this physical law can be evaluated.

Score 1-5: 5=clearly correct, 4=mostly correct with minor issues,
3=partially correct or ambiguous, 2=mostly incorrect,
1=severely incorrect

Then output ONLY a JSON object with exactly one key: {law}.
Example: {"{law}": 3}
```

`{criteria}` and `{questions_block}` for each of the 13 laws are listed
below. They are also embedded in `infer.py` (`PHYSICAL_CRITERIA`,
`PHYSICAL_SUB_QUESTIONS`), so the script is a self-contained reference.

| Law | Criterion | Sub-questions |
| --- | --- | --- |
| `gravity` | Do unsupported objects fall downward? Do thrown objects follow a curved trajectory? Does poured liquid fall with gravity? | (1) Do unsupported objects or liquids move downward over time? (2) Do thrown or falling objects follow a plausible gravity-driven path? (3) Does the video avoid objects floating or rising without support? |
| `inertia` | Do stationary objects remain still unless acted upon? Do moving objects maintain their motion unless stopped by friction, collision, or an obstacle? | (1) Do stationary objects remain still unless a visible force acts on them? (2) Do moving objects continue plausibly until friction, collision, or an obstacle changes their motion? (3) Does the video avoid unexplained starts, stops, or direction changes? |
| `momentum` | After collision, push, or pull, is the direction of motion reasonable? Ignore speed magnitude. | (1) After contact, push, pull, or collision, are motion directions plausible? (2) Does the reacting object move in a direction consistent with the interaction? (3) Does the video avoid impossible reversals or unrelated motion changes? |
| `impenetrability` | Do objects maintain impenetrability — no passing through each other? | (1) Do solid objects avoid passing through one another? (2) Do contacts and overlaps remain physically plausible? (3) Does the video avoid obvious clipping or penetration artifacts? |
| `collision` | After impact, is there reasonable bounce/shatter/deformation? Does response match impact force? | (1) Does impact cause a plausible bounce, break, deformation, or transfer of motion? (2) Is the response direction consistent with the collision? (3) Does the response avoid being much too weak, too strong, or unrelated to the impact? |
| `material` | Does each material respond according to its properties? (glass shatters, rubber bounces, metal is rigid, cloth deforms softly, etc.) | (1) Do objects respond consistently with their apparent material? (2) Are rigid, soft, brittle, elastic, or fluid-like objects animated appropriately? (3) Does the video avoid material behavior that contradicts the scene? |
| `buoyancy` | Do dense objects sink? Do wood/plastic float? | (1) Do objects sink or float in a way consistent with apparent density? (2) Does the floating or sinking behavior stay stable over time? (3) Does the video avoid unsupported hovering or impossible underwater motion? |
| `displacement` | When you add more liquid or put an object into it, does the liquid level rise in a realistic way? Does it overflow when full? | (1) Does liquid level rise when volume is added or an object enters it? (2) Does overflow happen only when the container is plausibly full? (3) Does the liquid volume remain visually plausible? |
| `flow_dynamics` | Does the liquid's overall motion behave realistically over time — flowing along surfaces, spreading, draining naturally? | (1) Does liquid flow along surfaces, spread, or drain naturally? (2) Does the flow direction follow gravity and boundaries? (3) Does the video avoid abrupt stops, reversals, or unsupported uphill flow? |
| `boundary_interaction` | When the liquid hits a boundary such as a rock face, container wall, or floor, does it respond realistically? Do local splash, rebound, or split patterns on impact look physically plausible? | (1) Does liquid react plausibly when hitting a wall, floor, container, or obstacle? (2) Are splash, rebound, or split patterns locally plausible? (3) Does the liquid remain consistent after interacting with boundaries? |
| `fluid_continuity` | Does the liquid avoid disappearing or appearing out of nowhere? Small splashes that briefly break apart are okay. | (1) Does liquid avoid disappearing or appearing without cause? (2) Does the amount of liquid remain broadly consistent? (3) Are splashes and separations temporary and physically plausible? |
| `reflection` | Does the reflection roughly match objects and colors in the scene, and avoid completely unrelated content? | (1) Does the reflection match nearby objects, colors, and motion? (2) Does the reflected content stay spatially consistent with the scene? (3) Does the video avoid unrelated or impossible reflection content? |
| `shadow` | Are shadow directions consistent with light source? Do shadows move with objects? | (1) Are shadows consistent with the apparent light source direction? (2) Do shadows move with the objects that cast them? (3) Does the video avoid missing, detached, or contradictory shadows? |

Pass `--criteria "..."` to override a per-law criterion at inference time
without editing the YAML or script.
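Because the rendered prompt itself contains literal JSON braces (the `Example: {"{law}": 3}` line), substituting placeholders with `str.format` would raise `KeyError`. A minimal renderer that fills the `{law}` / `{criteria}` / `{prompt}` / `{questions_block}` slots by direct replacement (an illustrative sketch; `infer.py`'s actual rendering may differ) looks like:

```python
# Abridged physical-law template; the full text is quoted earlier in this README.
PHYSICAL_TEMPLATE = (
    "Evaluate physical realism for one physical law: {law}.\n\n"
    "Criterion:\n{criteria}\n\n"
    'Caption, for context only:\n"{prompt}"\n\n'
    "Sub-questions to consider in your mind before scoring:\n{questions_block}\n\n"
    "Then output ONLY a JSON object with exactly one key: {law}.\n"
    'Example: {"{law}": 3}'
)


def render(template, **fields):
    # str.format would choke on the literal JSON braces in the example
    # line, so substitute each known {name} placeholder directly.
    out = template
    for name, value in fields.items():
        out = out.replace("{" + name + "}", value)
    return out


prompt_text = render(
    PHYSICAL_TEMPLATE,
    law="gravity",
    criteria="Do unsupported objects fall downward?",
    prompt="A ball rolls down a ramp and knocks over a block.",
    questions_block="1. Do unsupported objects or liquids move downward over time?",
)
```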
## Training summary

| Item | Value |
| --- | --- |
| Best eval loss | 0.1063 (step 294) |
| Framework | ms-swift 4.1.2, PEFT 0.19.1, DeepSpeed ZeRO-2 |

The training mixture combines automatically derived sub-question
judgements with human-rated samples (the `subq+human` split). See the
companion anonymous dataset for prompts, physical-law tags, and example
videos.

## License

The base model is released by its respective authors; this LoRA adapter
is shared for anonymous review purposes. No identifying metadata is
included.