---
library_name: peft
pipeline_tag: text-generation
tags:
  - lora
  - peft
  - judge
  - video-evaluation
  - anonymous-release
---

# physground-judger9B — Anonymous Judge LoRA Adapter

LoRA adapter trained as a judge model that scores generated videos against
prompt-alignment, temporal, persistence, and 13 physical-law sub-rubrics.
Released anonymously alongside the companion dataset
[`anonymouscla/physground`](https://huggingface.co/datasets/anonymouscla/physground).

The base model identifier required to attach this adapter is recorded in
`adapter_config.json` (`base_model_name_or_path`); the inference script
reads it automatically.
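
As a minimal sketch of that lookup (not the code in `infer.py`, whose internals are not reproduced here), the base model id can be read from a local copy of the adapter with the standard library alone. The helper name `read_base_model_id` is hypothetical.

```python
import json
from pathlib import Path


def read_base_model_id(adapter_dir):
    """Return `base_model_name_or_path` from a local adapter folder."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    with config_path.open(encoding="utf-8") as fh:
        config = json.load(fh)
    # PEFT stores the id (or path) of the base model under this key.
    return config["base_model_name_or_path"]
```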

## Files

| File | Purpose |
| --- | --- |
| `adapter_config.json` | PEFT/LoRA config (records base model id) |
| `adapter_model.safetensors` | LoRA weights (~167 MB) |
| `additional_config.json` | ms-swift extras (lora_dtype / lr ratios) |
| `training_args.json` | sanitized training hyperparameters |
| `subq+human.yaml` | prompt template used at training and inference time |
| `infer.py` | standalone end-to-end inference script |

## Setup

```bash
pip install "transformers>=4.49" peft accelerate pyyaml \
            "qwen-vl-utils[decord]" huggingface_hub
```

Loading the base model in bf16 needs roughly 24 GB of GPU memory.

## Quickstart — Hugging Face Hub

`infer.py` accepts either a local folder or a HF Hub repo id via
`--adapter-dir`; the default value already points at this repo, so the
following commands work without cloning anything.

```bash
# General axes (1–5 each): SA / PTV / persistence
python infer.py \
  --video /path/to/video.mp4 \
  --caption "A ball rolls down a ramp and knocks over a block." \
  --metric SA

# Physical-law axes (1–5 each): one of the 13 laws below
python infer.py \
  --video /path/to/video.mp4 \
  --caption "A ball rolls down a ramp and knocks over a block." \
  --law gravity
```

`infer.py` will:

1. Resolve `--adapter-dir` to a local directory (downloading it with
   `huggingface_hub.snapshot_download` if it is a Hub id).
2. Read `adapter_config.json` to find the base model and load it via
   `transformers`.
3. Attach the LoRA adapter via PEFT.
4. Render the scoring prompt from `subq+human.yaml`, plus the relevant
   sub-questions / per-law criterion (constants embedded in `infer.py`).
5. Run greedy decoding with `--max-new-tokens 64` (matches training).
6. Parse the JSON object and print the integer score.
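
Step 1 could look roughly like the following. This is a sketch that assumes a local folder is detected with a plain directory check; `resolve_adapter_dir` is a hypothetical name, and the actual logic in `infer.py` may differ.

```python
from pathlib import Path


def resolve_adapter_dir(adapter_dir):
    """Return a local Path for either a folder or a Hub repo id."""
    local = Path(adapter_dir)
    if local.is_dir():
        # Already on disk: use it as-is.
        return local
    # Otherwise treat the value as a Hub repo id and download a snapshot.
    from huggingface_hub import snapshot_download  # imported lazily

    return Path(snapshot_download(repo_id=str(adapter_dir)))
```

Importing `huggingface_hub` lazily keeps the local-folder path usable even on machines without Hub access.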

Output is a single JSON line:

```json
{"key": "gravity", "score": 4, "raw": "{\"gravity\": 4}"}
```
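
Because the output is one JSON object per line, downstream code can read it with the standard `json` module. The sketch below parses the example line shown above; treating `raw` as JSON is an assumption that holds for this example, since `raw` is the model's unparsed reply.

```python
import json

# The example output line from above, kept verbatim in a raw string.
line = r'{"key": "gravity", "score": 4, "raw": "{\"gravity\": 4}"}'

result = json.loads(line)
inner = json.loads(result["raw"])  # the raw reply is itself JSON in this example

print(result["key"], result["score"], inner)
```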

`--metric` choices: `SA`, `PTV`, `persistence`.
`--law` choices: `gravity`, `inertia`, `momentum`, `impenetrability`,
`collision`, `material`, `buoyancy`, `displacement`, `flow_dynamics`,
`boundary_interaction`, `fluid_continuity`, `reflection`, `shadow`.

Add `--print-prompt` to inspect the exact rendered system + user prompt
before generation.
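
To score one video on every physical-law axis, the CLI can be invoked once per law. The sketch below only builds the argument lists, using the flags documented above; `build_command` is a hypothetical helper, and the video path and caption are the placeholders from the quickstart.

```python
import sys

LAWS = [
    "gravity", "inertia", "momentum", "impenetrability", "collision",
    "material", "buoyancy", "displacement", "flow_dynamics",
    "boundary_interaction", "fluid_continuity", "reflection", "shadow",
]


def build_command(video, caption, law):
    """Return the argv list for one `infer.py --law` run."""
    return [
        sys.executable, "infer.py",
        "--video", video,
        "--caption", caption,
        "--law", law,
    ]


commands = [
    build_command(
        "/path/to/video.mp4",
        "A ball rolls down a ramp and knocks over a block.",
        law,
    )
    for law in LAWS
]
# Each entry can then be run with, for example:
#   subprocess.run(cmd, capture_output=True, text=True, check=True)
```

Each separate process loads the model again, so for more than a handful of calls it is likely cheaper to load the model once and loop in Python.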

## Programmatic use

```python
from pathlib import Path
import torch

# Helpers defined in this repo's infer.py (it must be importable,
# e.g. by running from the folder that contains it).
from infer import (
    build_messages,
    build_prompt,
    decode_generated,
    load_model,
    load_yaml,
    parse_score,
    prepare_inputs,
)

# Load the processor and the model with the adapter attached.
processor, model, adapter_dir = load_model(
    "anonymouscla/physground-judger9B",
    dtype=torch.bfloat16,
    device_map="auto",
)
cfg = load_yaml(adapter_dir / "subq+human.yaml")

# Render the system and user prompts for the chosen axis.
system, user, key = build_prompt(
    cfg,
    caption="A ball rolls down a ramp and knocks over a block.",
    law="gravity",
)
messages = build_messages(system, user, Path("video.mp4"))
inputs = prepare_inputs(
    processor,
    messages,
    next(model.parameters()).device,
    fps=2.0,
    max_pixels=360 * 640,
)

# Greedy decoding with 64 new tokens, as in the CLI.
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)

raw = decode_generated(processor, inputs, out)
print({"key": key, "score": parse_score(raw, key), "raw": raw})
```

## Prompt templates

Both training and inference prompts are rendered from two sources:

- `subq+human.yaml` — system prompt, the SA / PTV / persistence templates
  for the general axes, and the `physical_template` shared by all 13
  physical-law axes (with `{prompt}`, `{law}`, `{criteria}`,
  `{questions_block}` placeholders). Use `--print-prompt` to dump the
  fully rendered system + user prompt.
- `infer.py` — the per-axis sub-question lists (`GENERAL_SUB_QUESTIONS`,
  `PHYSICAL_SUB_QUESTIONS`) and per-law criteria (`PHYSICAL_CRITERIA`)
  that are spliced into the YAML templates. Override any criterion at
  inference time with `--criteria "..."` instead of editing the source.
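
The splicing described above can be sketched with plain `str.format`. Everything except the four placeholder names is an assumption: the template text, the criterion, the question, and the numbered layout of the questions block are illustrative stand-ins, not the contents of `subq+human.yaml` or `infer.py`.

```python
# Hypothetical stand-in for `physical_template`; the real text lives in the YAML.
TEMPLATE = (
    "Caption: {prompt}\n"
    "Law: {law}\n"
    "Criterion: {criteria}\n"
    "{questions_block}"
)


def render_physical_prompt(template, prompt, law, criteria, questions):
    """Fill the four placeholders of a physical-law template."""
    # Assumed layout: one numbered sub-question per line.
    questions_block = "\n".join(
        f"{i}. {q}" for i, q in enumerate(questions, start=1)
    )
    return template.format(
        prompt=prompt,
        law=law,
        criteria=criteria,
        questions_block=questions_block,
    )


rendered = render_physical_prompt(
    TEMPLATE,
    prompt="A ball rolls down a ramp and knocks over a block.",
    law="gravity",
    criteria="Objects fall and settle plausibly.",  # illustrative text
    questions=["Does the ball accelerate down the ramp?"],  # illustrative text
)
print(rendered)
```

A template that contains literal braces would need them doubled (`{{`, `}}`) for `str.format` to work.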

The judge always replies with a single JSON object containing one key
(the metric or law name) and an integer score in 1–5.
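
A strict check of that reply format might look like the sketch below. It is not the `parse_score` function from `infer.py`; `parse_judge_reply` is a hypothetical helper that rejects anything other than a one-key object with an integer in 1–5.

```python
import json


def parse_judge_reply(raw, key):
    """Return the integer score from a reply such as '{"gravity": 4}'."""
    obj = json.loads(raw)  # raises a ValueError subclass on invalid JSON
    if not isinstance(obj, dict) or set(obj) != {key}:
        raise ValueError(f"expected a JSON object with the single key {key!r}")
    score = obj[key]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not 1 <= score <= 5:
        raise ValueError("score must be between 1 and 5")
    return score
```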

## Training summary

The adapter was trained with LoRA via PEFT (rank 32, α = 64, dropout 0.05)
on the linear layers of the language tower, with the vision encoder
frozen. Training used bf16 with gradient checkpointing and AdamW at a
learning rate of 1e-4 with a cosine schedule, for 1 epoch (294 steps) on
the `subq+human` split (automatically derived sub-question judgements
plus human-rated samples). Full hyperparameters are in
`training_args.json` and `additional_config.json`; the exact LoRA target
regex and rank are in `adapter_config.json`. Framework: ms-swift 4.1.2,
PEFT 0.19.1, DeepSpeed ZeRO-2.
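
With rank 32 and α = 64, the standard LoRA scaling factor α / r works out to 2. The sketch below recomputes it from a PEFT-style config dict, assuming the usual `r` and `lora_alpha` keys and the default (non rank-stabilized) scaling; whether this run used any other scaling is not stated here.

```python
def lora_scaling(config):
    """Return the default LoRA scaling factor, lora_alpha / r."""
    return config["lora_alpha"] / config["r"]


# Values taken from the training summary above.
config = {"r": 32, "lora_alpha": 64, "lora_dropout": 0.05}
print(lora_scaling(config))
```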

See the companion dataset
[`anonymouscla/physground`](https://huggingface.co/datasets/anonymouscla/physground)
for prompts, physical-law tags, and example videos.

## License

The base model is released by its respective authors; this LoRA adapter
is shared for anonymous review purposes. No identifying metadata is
included.