docs: drop verbatim YAML/JSON dumps; point to source files
README.md
CHANGED
@@ -65,7 +65,7 @@ python infer.py \
 1. Resolve `--adapter-dir` to a local directory (`huggingface_hub.snapshot_download`
    if it is a Hub id).
 2. Read `adapter_config.json` to find the base model and load it via
-   `transformers`
+   `transformers`.
 3. Attach the LoRA adapter via PEFT.
 4. Render the scoring prompt from `subq+human.yaml`, plus the relevant
    sub-questions / per-law criterion (constants embedded in `infer.py`).
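Steps 1-3 above map onto a few library calls. A minimal sketch, illustrative only: the model class and the adapter id below are placeholder assumptions, and `infer.py` remains the reference.

```python
# Sketch of steps 1-3; illustrative only. The model class and the adapter
# id are placeholder assumptions; infer.py is the authoritative version.
import json
import os

import torch
from huggingface_hub import snapshot_download
from peft import PeftModel
from transformers import AutoModelForCausalLM

adapter_dir = "org/adapter-repo"  # hypothetical Hub id; a local path also works
if not os.path.isdir(adapter_dir):
    # Step 1: resolve a Hub id to a local snapshot directory.
    adapter_dir = snapshot_download(adapter_dir)

# Step 2: adapter_config.json records the base model the LoRA was trained on.
with open(os.path.join(adapter_dir, "adapter_config.json")) as f:
    base_id = json.load(f)["base_model_name_or_path"]

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_dir)  # Step 3: attach the LoRA adapter
```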
@@ -132,118 +132,34 @@ print({"key": key, "score": parse_score(raw, key), "raw": raw})
 
 ## Prompt templates
 
-
-**SA — Prompt alignment**
-```
-Evaluate Prompt Alignment (SA).
-
-Caption:
-"{prompt}"
-
-The video was generated using a text+image-to-video (ti2v) model,
-conditioned on the first frame and the text prompt above.
-
-Sub-questions to consider in your mind before scoring:
-1. Are the main objects in the caption present in the video?
-2. Are the key actions or interactions from the caption visible?
-3. Are important scene attributes and relationships preserved?
-4. Does the video avoid major contradictions to the caption?
-
-Score 1-5: 5=fully aligned, 4=mostly aligned with minor deviations,
-3=partially aligned with notable gaps, 2=mostly misaligned,
-1=not aligned
-
-Then output ONLY a JSON object with exactly one key: SA.
-Example: {"SA": 3}
-```
-
-**PTV — Temporal coherence** uses the same shape with sub-questions:
-
-1. Do causes appear before their effects?
-2. Do physical events unfold in a plausible temporal order?
-3. Are motion transitions continuous rather than abrupt jumps or loops?
-4. Does the sequence avoid impossible reversals or repeated resets?
-
-**persistence — Object persistence** uses:
-
-1. Do objects maintain consistent existence throughout the video?
-2. Do objects keep a stable shape, size, color, and texture?
-3. Do objects avoid disappearing, appearing, or transforming unexpectedly?
-4. Do objects preserve identity through motion and brief occlusion?
-
-(See `subq+human.yaml` for the verbatim PTV / persistence rubric anchors.)
-
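For orientation: the replacement text in this commit names `GENERAL_SUB_QUESTIONS` as the `infer.py` constant behind these per-axis lists. One plausible layout is sketched below; the constant name comes from the README, the dict structure is an assumption, and the sub-question text is copied verbatim from the removed templates above.

```python
# One plausible layout for infer.py's GENERAL_SUB_QUESTIONS; the structure
# is an assumption, the sub-question text is verbatim from the README.
GENERAL_SUB_QUESTIONS = {
    "SA": [
        "Are the main objects in the caption present in the video?",
        "Are the key actions or interactions from the caption visible?",
        "Are important scene attributes and relationships preserved?",
        "Does the video avoid major contradictions to the caption?",
    ],
    "PTV": [
        "Do causes appear before their effects?",
        "Do physical events unfold in a plausible temporal order?",
        "Are motion transitions continuous rather than abrupt jumps or loops?",
        "Does the sequence avoid impossible reversals or repeated resets?",
    ],
    "persistence": [
        "Do objects maintain consistent existence throughout the video?",
        "Do objects keep a stable shape, size, color, and texture?",
        "Do objects avoid disappearing, appearing, or transforming unexpectedly?",
        "Do objects preserve identity through motion and brief occlusion?",
    ],
}

# Numbered block as it would appear in the rendered prompt.
questions_block = "\n".join(
    f"{i}. {q}" for i, q in enumerate(GENERAL_SUB_QUESTIONS["PTV"], 1)
)
```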
-### Physical-law axes — `physical_template`
-
-```
-Evaluate physical realism for one physical law: {law}.
-
-Criterion:
-{criteria}
-
-Caption, for context only:
-"{prompt}"
-
-Sub-questions to consider in your mind before scoring:
-{questions_block}
-
-Judge the video itself. Do not penalize prompt mismatch unless it affects
-whether this physical law can be evaluated.
-
-Score 1-5: 5=clearly correct, 4=mostly correct with minor issues,
-3=partially correct or ambiguous, 2=mostly incorrect,
-1=severely incorrect
-
-Then output ONLY a JSON object with exactly one key: {law}.
-Example: {"{law}": 3}
-```
-
-`{criteria}` and `{questions_block}` for each of the 13 laws are listed
-below. They are also embedded in `infer.py` (`PHYSICAL_CRITERIA`,
-`PHYSICAL_SUB_QUESTIONS`) so the script is a self-contained reference.
-
-| Law | Criterion | Sub-questions |
-| --- | --- | --- |
-| `gravity` | Do unsupported objects fall downward? Do thrown objects follow a curved trajectory? Does poured liquid fall with gravity? | (1) Do unsupported objects or liquids move downward over time? (2) Do thrown or falling objects follow a plausible gravity-driven path? (3) Does the video avoid objects floating or rising without support? |
-| `inertia` | Do stationary objects remain still unless acted upon? Do moving objects maintain their motion unless stopped by friction, collision, or an obstacle? | (1) Do stationary objects remain still unless a visible force acts on them? (2) Do moving objects continue plausibly until friction, collision, or an obstacle changes their motion? (3) Does the video avoid unexplained starts, stops, or direction changes? |
-| `momentum` | After collision, push, or pull, is the direction of motion reasonable? Ignore speed magnitude. | (1) After contact, push, pull, or collision, are motion directions plausible? (2) Does the reacting object move in a direction consistent with the interaction? (3) Does the video avoid impossible reversals or unrelated motion changes? |
-| `impenetrability` | Do objects maintain impenetrability — no passing through each other? | (1) Do solid objects avoid passing through one another? (2) Do contacts and overlaps remain physically plausible? (3) Does the video avoid obvious clipping or penetration artifacts? |
-| `collision` | After impact, is there reasonable bounce/shatter/deformation? Does response match impact force? | (1) Does impact cause a plausible bounce, break, deformation, or transfer of motion? (2) Is the response direction consistent with the collision? (3) Does the response avoid being much too weak, too strong, or unrelated to the impact? |
-| `material` | Does each material respond according to its properties? (glass shatters, rubber bounces, metal is rigid, cloth deforms softly, etc.) | (1) Do objects respond consistently with their apparent material? (2) Are rigid, soft, brittle, elastic, or fluid-like objects animated appropriately? (3) Does the video avoid material behavior that contradicts the scene? |
-| `buoyancy` | Do dense objects sink? Do wood/plastic float? | (1) Do objects sink or float in a way consistent with apparent density? (2) Does the floating or sinking behavior stay stable over time? (3) Does the video avoid unsupported hovering or impossible underwater motion? |
-| `displacement` | When you add more liquid or put an object into it, does the liquid level rise in a realistic way? Does it overflow when full? | (1) Does liquid level rise when volume is added or an object enters it? (2) Does overflow happen only when the container is plausibly full? (3) Does the liquid volume remain visually plausible? |
-| `flow_dynamics` | Does the liquid's overall motion behave realistically over time — flowing along surfaces, spreading, draining naturally? | (1) Does liquid flow along surfaces, spread, or drain naturally? (2) Does the flow direction follow gravity and boundaries? (3) Does the video avoid abrupt stops, reversals, or unsupported uphill flow? |
-| `boundary_interaction` | When the liquid hits a boundary such as a rock face, container wall, or floor, does it respond realistically? Do local splash, rebound, or split patterns on impact look physically plausible? | (1) Does liquid react plausibly when hitting a wall, floor, container, or obstacle? (2) Are splash, rebound, or split patterns locally plausible? (3) Does the liquid remain consistent after interacting with boundaries? |
-| `fluid_continuity` | Does the liquid avoid disappearing or appearing out of nowhere? Small splashes that briefly break apart are okay. | (1) Does liquid avoid disappearing or appearing without cause? (2) Does the amount of liquid remain broadly consistent? (3) Are splashes and separations temporary and physically plausible? |
-| `reflection` | Does the reflection roughly match objects and colors in the scene, and avoid completely unrelated content? | (1) Does the reflection match nearby objects, colors, and motion? (2) Does the reflected content stay spatially consistent with the scene? (3) Does the video avoid unrelated or impossible reflection content? |
-| `shadow` | Are shadow directions consistent with light source? Do shadows move with objects? | (1) Are shadows consistent with the apparent light source direction? (2) Do shadows move with the objects that cast them? (3) Does the video avoid missing, detached, or contradictory shadows? |
-
-Pass `--criteria "..."` to override a per-law criterion at inference time
-without editing the YAML or script.
+Both training and inference prompts are rendered from two sources:
+
+- `subq+human.yaml` — system prompt, the SA / PTV / persistence templates
+  for the general axes, and the `physical_template` shared by all 13
+  physical-law axes (with `{prompt}`, `{law}`, `{criteria}`,
+  `{questions_block}` placeholders). Use `--print-prompt` to dump the
+  fully rendered system + user prompt.
+- `infer.py` — the per-axis sub-question lists (`GENERAL_SUB_QUESTIONS`,
+  `PHYSICAL_SUB_QUESTIONS`) and per-law criteria (`PHYSICAL_CRITERIA`)
+  that are spliced into the YAML templates. Override any criterion at
+  inference time with `--criteria "..."` instead of editing the source.
+
+The judge always replies with a single JSON object containing one key
+(the metric or law name) and an integer score in 1–5.
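Taken together, the removed template and the new pointers imply a simple render-then-parse loop. Below is a hedged, self-contained sketch; it is not `infer.py`'s actual code. The `render` helper and this `parse_score` are assumptions, and placeholder substitution uses `str.replace` because `str.format` would trip over the template's literal JSON braces.

```python
# Hedged sketch of rendering physical_template and parsing the one-key
# JSON reply. Not infer.py's actual code: render() and parse_score() here
# are assumptions, and str.replace is used because str.format would choke
# on the literal JSON braces in the template.
import json
import re

TEMPLATE = (  # abridged; the scoring anchors of the full template are omitted
    "Evaluate physical realism for one physical law: {law}.\n\n"
    "Criterion:\n{criteria}\n\n"
    'Caption, for context only:\n"{prompt}"\n\n'
    "Sub-questions to consider in your mind before scoring:\n{questions_block}\n\n"
    "Then output ONLY a JSON object with exactly one key: {law}.\n"
    'Example: {"{law}": 3}'
)

def render(template: str, **fields: str) -> str:
    # Substitute {law}, {criteria}, {prompt}, {questions_block} one by one.
    for name, value in fields.items():
        template = template.replace("{" + name + "}", value)
    return template

def parse_score(raw: str, key: str) -> int | None:
    match = re.search(r"\{[^{}]*\}", raw)  # first flat JSON object in the reply
    if match is None:
        return None
    try:
        return int(json.loads(match.group(0))[key])
    except (ValueError, KeyError, TypeError):
        return None

questions = [  # gravity sub-questions, verbatim from the table above
    "Do unsupported objects or liquids move downward over time?",
    "Do thrown or falling objects follow a plausible gravity-driven path?",
    "Does the video avoid objects floating or rising without support?",
]
user_prompt = render(
    TEMPLATE,
    law="gravity",
    criteria="Do unsupported objects fall downward? Do thrown objects follow "
    "a curved trajectory? Does poured liquid fall with gravity?",
    prompt="A ball rolls off a table.",  # hypothetical caption
    questions_block="\n".join(f"{i}. {q}" for i, q in enumerate(questions, 1)),
)
print(parse_score('{"gravity": 4}', "gravity"))  # -> 4
```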
 
 ## Training summary
 
-
-The training mixture combines automatically derived sub-question
-judgements with human-rated samples (the `subq+human` split). See the
-companion anonymous dataset for prompts, physical-law tags, and example
-videos.
+LoRA via PEFT (rank 32, α 64, dropout 0.05) over the language-tower
+linear layers, vision encoder frozen, bf16 + gradient checkpointing,
+AdamW lr = 1e-4 cosine, 1.0 epoch / 294 steps on the `subq+human` split
+(automatically derived sub-question judgements + human-rated samples).
+Full hyperparameters in `training_args.json` and `additional_config.json`;
+exact LoRA target regex and rank in `adapter_config.json`. Framework:
+ms-swift 4.1.2, PEFT 0.19.1, DeepSpeed ZeRO-2.
+
+See the companion dataset
+[`anonymouscla/physground`](https://huggingface.co/datasets/anonymouscla/physground)
+for prompts, physical-law tags, and example videos.
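The stated hyperparameters correspond roughly to the PEFT configuration sketched below. A sketch only: the `target_modules` regex is a placeholder assumption, and the authoritative regex and rank live in `adapter_config.json`.

```python
# Roughly the PEFT configuration implied by the stated hyperparameters.
# The target_modules regex below is a placeholder assumption; the real
# values are recorded in adapter_config.json.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                # LoRA rank
    lora_alpha=64,       # alpha
    lora_dropout=0.05,
    target_modules=r".*language_model.*(q_proj|k_proj|v_proj|o_proj)",  # assumed
    task_type="CAUSAL_LM",
)
```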
 
 ## License
 