anonymouscla committed on
Commit c59fc70 · verified · 1 Parent(s): 26b567b

docs: drop verbatim YAML/JSON dumps; point to source files

Files changed (1)
  1. README.md +24 -108
README.md CHANGED
@@ -65,7 +65,7 @@ python infer.py \
  1. Resolve `--adapter-dir` to a local directory (`huggingface_hub.snapshot_download`
  if it is a Hub id).
  2. Read `adapter_config.json` to find the base model and load it via
- `transformers` (`Qwen/Qwen3.5-9B`).
+ `transformers`.
  3. Attach the LoRA adapter via PEFT.
  4. Render the scoring prompt from `subq+human.yaml`, plus the relevant
  sub-questions / per-law criterion (constants embedded in `infer.py`).
@@ -132,118 +132,34 @@ print({"key": key, "score": parse_score(raw, key), "raw": raw})
 
  ## Prompt templates
 
- System prompt (all axes): `You are a strict video evaluation model.`
-
- The prompt always asks the judge to consider observable sub-questions in
- its mind and then output **only** a JSON object with one 1–5 score.
-
- ### General axes `subq+human.yaml`
-
- **SA — Prompt alignment**
- ```
- Evaluate Prompt Alignment (SA).
-
- Caption:
- "{prompt}"
-
- The video was generated using a text+image-to-video (ti2v) model,
- conditioned on the first frame and the text prompt above.
-
- Sub-questions to consider in your mind before scoring:
- 1. Are the main objects in the caption present in the video?
- 2. Are the key actions or interactions from the caption visible?
- 3. Are important scene attributes and relationships preserved?
- 4. Does the video avoid major contradictions to the caption?
-
- Score 1-5: 5=fully aligned, 4=mostly aligned with minor deviations,
- 3=partially aligned with notable gaps, 2=mostly misaligned,
- 1=not aligned
-
- Then output ONLY a JSON object with exactly one key: SA.
- Example: {"SA": 3}
- ```
-
- **PTV — Temporal coherence** uses the same shape with sub-questions:
-
- 1. Do causes appear before their effects?
- 2. Do physical events unfold in a plausible temporal order?
- 3. Are motion transitions continuous rather than abrupt jumps or loops?
- 4. Does the sequence avoid impossible reversals or repeated resets?
-
- **persistence — Object persistence** uses:
-
- 1. Do objects maintain consistent existence throughout the video?
- 2. Do objects keep a stable shape, size, color, and texture?
- 3. Do objects avoid disappearing, appearing, or transforming unexpectedly?
- 4. Do objects preserve identity through motion and brief occlusion?
-
- (See `subq+human.yaml` for the verbatim PTV / persistence rubric anchors.)
-
- ### Physical-law axes — `physical_template`
-
- ```
- Evaluate physical realism for one physical law: {law}.
-
- Criterion:
- {criteria}
-
- Caption, for context only:
- "{prompt}"
-
- Sub-questions to consider in your mind before scoring:
- {questions_block}
-
- Judge the video itself. Do not penalize prompt mismatch unless it affects
- whether this physical law can be evaluated.
-
- Score 1-5: 5=clearly correct, 4=mostly correct with minor issues,
- 3=partially correct or ambiguous, 2=mostly incorrect,
- 1=severely incorrect
-
- Then output ONLY a JSON object with exactly one key: {law}.
- Example: {"{law}": 3}
- ```
-
- `{criteria}` and `{questions_block}` for each of the 13 laws are listed
- below. They are also embedded in `infer.py` (`PHYSICAL_CRITERIA`,
- `PHYSICAL_SUB_QUESTIONS`) so the script is a self-contained reference.
-
- | Law | Criterion | Sub-questions |
- | --- | --- | --- |
- | `gravity` | Do unsupported objects fall downward? Do thrown objects follow a curved trajectory? Does poured liquid fall with gravity? | (1) Do unsupported objects or liquids move downward over time? (2) Do thrown or falling objects follow a plausible gravity-driven path? (3) Does the video avoid objects floating or rising without support? |
- | `inertia` | Do stationary objects remain still unless acted upon? Do moving objects maintain their motion unless stopped by friction, collision, or an obstacle? | (1) Do stationary objects remain still unless a visible force acts on them? (2) Do moving objects continue plausibly until friction, collision, or an obstacle changes their motion? (3) Does the video avoid unexplained starts, stops, or direction changes? |
- | `momentum` | After collision, push, or pull, is the direction of motion reasonable? Ignore speed magnitude. | (1) After contact, push, pull, or collision, are motion directions plausible? (2) Does the reacting object move in a direction consistent with the interaction? (3) Does the video avoid impossible reversals or unrelated motion changes? |
- | `impenetrability` | Do objects maintain impenetrability — no passing through each other? | (1) Do solid objects avoid passing through one another? (2) Do contacts and overlaps remain physically plausible? (3) Does the video avoid obvious clipping or penetration artifacts? |
- | `collision` | After impact, is there reasonable bounce/shatter/deformation? Does response match impact force? | (1) Does impact cause a plausible bounce, break, deformation, or transfer of motion? (2) Is the response direction consistent with the collision? (3) Does the response avoid being much too weak, too strong, or unrelated to the impact? |
- | `material` | Does each material respond according to its properties? (glass shatters, rubber bounces, metal is rigid, cloth deforms softly, etc.) | (1) Do objects respond consistently with their apparent material? (2) Are rigid, soft, brittle, elastic, or fluid-like objects animated appropriately? (3) Does the video avoid material behavior that contradicts the scene? |
- | `buoyancy` | Do dense objects sink? Do wood/plastic float? | (1) Do objects sink or float in a way consistent with apparent density? (2) Does the floating or sinking behavior stay stable over time? (3) Does the video avoid unsupported hovering or impossible underwater motion? |
- | `displacement` | When you add more liquid or put an object into it, does the liquid level rise in a realistic way? Does it overflow when full? | (1) Does liquid level rise when volume is added or an object enters it? (2) Does overflow happen only when the container is plausibly full? (3) Does the liquid volume remain visually plausible? |
- | `flow_dynamics` | Does the liquid's overall motion behave realistically over time — flowing along surfaces, spreading, draining naturally? | (1) Does liquid flow along surfaces, spread, or drain naturally? (2) Does the flow direction follow gravity and boundaries? (3) Does the video avoid abrupt stops, reversals, or unsupported uphill flow? |
- | `boundary_interaction` | When the liquid hits a boundary such as a rock face, container wall, or floor, does it respond realistically? Do local splash, rebound, or split patterns on impact look physically plausible? | (1) Does liquid react plausibly when hitting a wall, floor, container, or obstacle? (2) Are splash, rebound, or split patterns locally plausible? (3) Does the liquid remain consistent after interacting with boundaries? |
- | `fluid_continuity` | Does the liquid avoid disappearing or appearing out of nowhere? Small splashes that briefly break apart are okay. | (1) Does liquid avoid disappearing or appearing without cause? (2) Does the amount of liquid remain broadly consistent? (3) Are splashes and separations temporary and physically plausible? |
- | `reflection` | Does the reflection roughly match objects and colors in the scene, and avoid completely unrelated content? | (1) Does the reflection match nearby objects, colors, and motion? (2) Does the reflected content stay spatially consistent with the scene? (3) Does the video avoid unrelated or impossible reflection content? |
- | `shadow` | Are shadow directions consistent with light source? Do shadows move with objects? | (1) Are shadows consistent with the apparent light source direction? (2) Do shadows move with the objects that cast them? (3) Does the video avoid missing, detached, or contradictory shadows? |
-
- Pass `--criteria "..."` to override a per-law criterion at inference time
- without editing the YAML or script.
+ Both training and inference prompts are rendered from two sources:
+
+ - `subq+human.yaml` — the system prompt, the SA / PTV / persistence templates
+ for the general axes, and the `physical_template` shared by all 13
+ physical-law axes (with `{prompt}`, `{law}`, `{criteria}`,
+ `{questions_block}` placeholders). Use `--print-prompt` to dump the
+ fully rendered system + user prompt.
+ - `infer.py` — the per-axis sub-question lists (`GENERAL_SUB_QUESTIONS`,
+ `PHYSICAL_SUB_QUESTIONS`) and per-law criteria (`PHYSICAL_CRITERIA`)
+ that are spliced into the YAML templates. Override any criterion at
+ inference time with `--criteria "..."` instead of editing the source.
+
+ The judge always replies with a single JSON object containing one key
+ (the metric or law name) and an integer score in 1–5.
 
  ## Training summary
 
- | Item | Value |
- | --- | --- |
- | Tuning method | LoRA via PEFT (rank 32, α 64, dropout 0.05) |
- | Target modules | All linear layers in the language tower (vision encoder frozen) |
- | Precision | bf16 with gradient checkpointing |
- | Optimizer | AdamW (fused), lr = 1e-4, cosine schedule, warmup 5% |
- | Batch | 1 × 8 grad-accum × 4 GPUs (global batch 32) |
- | Epochs / steps | 1.0 epoch / 294 steps |
- | Best eval loss | 0.1063 (step 294) |
- | Framework | ms-swift 4.1.2, PEFT 0.19.1, DeepSpeed ZeRO-2 |
-
- The training mixture combines automatically derived sub-question
- judgements with human-rated samples (the `subq+human` split). See the
- companion anonymous dataset for prompts, physical-law tags, and example
- videos.
+ LoRA via PEFT (rank 32, α 64, dropout 0.05) over the language-tower
+ linear layers, vision encoder frozen, bf16 + gradient checkpointing,
+ AdamW lr = 1e-4 cosine, 1.0 epoch / 294 steps on the `subq+human` split
+ (automatically derived sub-question judgements + human-rated samples).
+ Full hyperparameters in `training_args.json` and `additional_config.json`;
+ exact LoRA target regex and rank in `adapter_config.json`. Framework:
+ ms-swift 4.1.2, PEFT 0.19.1, DeepSpeed ZeRO-2.
+
+ See the companion dataset
+ [`anonymouscla/physground`](https://huggingface.co/datasets/anonymouscla/physground)
+ for prompts, physical-law tags, and example videos.
 
  ## License
 