---
library_name: peft
pipeline_tag: text-generation
tags:
- lora
- peft
- judge
- video-evaluation
- anonymous-release
---
# physground-judger9B: Anonymous Judge LoRA Adapter
LoRA adapter trained as a judge model that scores generated videos on
prompt-alignment, temporal, and persistence axes plus 13 physical-law
sub-rubrics.
Released anonymously alongside the companion dataset
[`anonymouscla/physground`](https://huggingface.co/datasets/anonymouscla/physground).
The base model identifier required to attach this adapter is recorded in
`adapter_config.json` (`base_model_name_or_path`); the inference script
reads it automatically.
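The field is plain JSON, so it is easy to inspect without loading anything
(a minimal sketch; assumes the file has already been downloaded locally):
```python
# Sketch: print the base model id this adapter attaches to.
import json

with open("adapter_config.json") as f:
    print(json.load(f)["base_model_name_or_path"])
```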
## Files
| File | Purpose |
| --- | --- |
| `adapter_config.json` | PEFT/LoRA config (records base model id) |
| `adapter_model.safetensors` | LoRA weights (~167 MB) |
| `additional_config.json` | ms-swift extras (lora_dtype / lr ratios) |
| `training_args.json` | sanitized training hyperparameters |
| `subq+human.yaml` | prompt templates used at training and inference time |
| `infer.py` | standalone end-to-end inference script |
## Setup
```bash
pip install "transformers>=4.49" peft accelerate pyyaml \
"qwen-vl-utils[decord]" huggingface_hub
```
Loading the base model in bf16 needs roughly 24 GB of GPU memory.
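If you want to verify headroom first, a quick check (a sketch; assumes a
single CUDA device):
```python
# Sketch: report total memory on the first CUDA device.
import torch

props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 2**30:.1f} GiB")
```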
## Quickstart: Hugging Face Hub
`infer.py` accepts either a local folder or a HF Hub repo id via
`--adapter-dir`; the default value already points at this repo, so the
following commands work without cloning anything.
```bash
# General axes (1–5 each): SA / PTV / persistence
python infer.py \
--video /path/to/video.mp4 \
--caption "A ball rolls down a ramp and knocks over a block." \
--metric SA
# Physical-law axes (1–5 each): one of the 13 laws below
python infer.py \
--video /path/to/video.mp4 \
--caption "A ball rolls down a ramp and knocks over a block." \
--law gravity
```
`infer.py` will:
1. Resolve `--adapter-dir` to a local directory (`huggingface_hub.snapshot_download`
if it is a Hub id).
2. Read `adapter_config.json` to find the base model and load it via
`transformers`.
3. Attach the LoRA adapter via PEFT.
4. Render the scoring prompt from `subq+human.yaml`, splicing in the
   relevant sub-questions and per-law criterion (constants embedded in
   `infer.py`).
5. Run greedy decoding with `--max-new-tokens 64` (matches training).
6. Parse the JSON object and print the integer score.
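Steps 1–3 correspond to standard Hub and PEFT calls. A minimal sketch of
that loading path (the `AutoModelForVision2Seq` class here is an
assumption for illustration; see `infer.py` for the actual loading code):
```python
# Sketch of steps 1-3; infer.py wraps the equivalent calls.
import torch
from huggingface_hub import snapshot_download
from peft import PeftConfig, PeftModel
from transformers import AutoModelForVision2Seq, AutoProcessor

adapter_dir = snapshot_download("anonymouscla/physground-judger9B")   # step 1
base_id = PeftConfig.from_pretrained(adapter_dir).base_model_name_or_path
processor = AutoProcessor.from_pretrained(base_id)                    # step 2
model = AutoModelForVision2Seq.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_dir)                 # step 3
```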
Output is a single JSON line:
```json
{"key": "gravity", "score": 4, "raw": "{\"gravity\": 4}"}
```
`--metric` choices: `SA`, `PTV`, `persistence`.
`--law` choices: `gravity`, `inertia`, `momentum`, `impenetrability`,
`collision`, `material`, `buoyancy`, `displacement`, `flow_dynamics`,
`boundary_interaction`, `fluid_continuity`, `reflection`, `shadow`.
Add `--print-prompt` to inspect the exact rendered system + user prompt
before generation.
## Programmatic use
```python
from pathlib import Path
import torch
from infer import (
build_messages,
build_prompt,
decode_generated,
load_model,
load_yaml,
parse_score,
prepare_inputs,
)
processor, model, adapter_dir = load_model(
"anonymouscla/physground-judger9B",
dtype=torch.bfloat16,
device_map="auto",
)
cfg = load_yaml(adapter_dir / "subq+human.yaml")
system, user, key = build_prompt(
cfg,
caption="A ball rolls down a ramp and knocks over a block.",
law="gravity",
)
messages = build_messages(system, user, Path("video.mp4"))
inputs = prepare_inputs(
processor,
messages,
next(model.parameters()).device,
fps=2.0,
max_pixels=360 * 640,
)
with torch.inference_mode():
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
raw = decode_generated(processor, inputs, out)
print({"key": key, "score": parse_score(raw, key), "raw": raw})
```
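To score several axes without reloading weights, keep the `processor`
and `model` pair and loop; a sketch continuing directly from the example
above:
```python
# Sketch: score one video against all 13 physical laws with a single load.
LAWS = [
    "gravity", "inertia", "momentum", "impenetrability", "collision",
    "material", "buoyancy", "displacement", "flow_dynamics",
    "boundary_interaction", "fluid_continuity", "reflection", "shadow",
]

scores = {}
for law in LAWS:
    system, user, key = build_prompt(
        cfg,
        caption="A ball rolls down a ramp and knocks over a block.",
        law=law,
    )
    messages = build_messages(system, user, Path("video.mp4"))
    inputs = prepare_inputs(
        processor, messages, next(model.parameters()).device,
        fps=2.0, max_pixels=360 * 640,
    )
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    scores[key] = parse_score(decode_generated(processor, inputs, out), key)
print(scores)
```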
## Prompt templates
Both training and inference prompts are rendered from two sources:
- `subq+human.yaml`: the system prompt, the SA / PTV / persistence
  templates for the general axes, and the `physical_template` shared by
  all 13 physical-law axes (with `{prompt}`, `{law}`, `{criteria}`,
  `{questions_block}` placeholders). Use `--print-prompt` to dump the
  fully rendered system + user prompt.
- `infer.py`: the per-axis sub-question lists (`GENERAL_SUB_QUESTIONS`,
  `PHYSICAL_SUB_QUESTIONS`) and per-law criteria (`PHYSICAL_CRITERIA`)
  that are spliced into the YAML templates. Override any criterion at
  inference time with `--criteria "..."` instead of editing the source.
The judge always replies with a single JSON object containing one key
(the metric or law name) and an integer score in 1–5.
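If you post-process raw generations yourself, the contract is small
enough to validate defensively; a sketch (the actual parsing lives in
`parse_score` in `infer.py`):
```python
# Sketch: validate a raw judge reply against the expected contract.
import json

def check_reply(raw: str, expected_key: str) -> int:
    obj = json.loads(raw)                    # reply must be one JSON object
    assert list(obj) == [expected_key], obj  # exactly one key: the axis name
    score = int(obj[expected_key])
    assert 1 <= score <= 5, score            # integer score in 1-5
    return score

print(check_reply('{"gravity": 4}', "gravity"))  # -> 4
```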
## Training summary
LoRA via PEFT (rank 32, α = 64, dropout 0.05) over the language-tower
linear layers with the vision encoder frozen; bf16 with gradient
checkpointing; AdamW, lr 1e-4 with a cosine schedule; 1.0 epoch
(294 steps) on the `subq+human` split (automatically derived
sub-question judgements plus human-rated samples).
Full hyperparameters are in `training_args.json` and
`additional_config.json`; the exact LoRA target regex and rank are in
`adapter_config.json`. Framework: ms-swift 4.1.2, PEFT 0.19.1,
DeepSpeed ZeRO-2.
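For orientation, those hyperparameters map onto a PEFT config roughly
like this (a sketch only; the authoritative target-module regex lives in
`adapter_config.json`):
```python
# Sketch of the LoRA configuration summarized above; target_modules is
# left as a placeholder rather than guessing the real regex.
from peft import LoraConfig

lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=None,   # real value: language-tower linear-layer regex
    task_type="CAUSAL_LM",
)
```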
See the companion dataset
[`anonymouscla/physground`](https://huggingface.co/datasets/anonymouscla/physground)
for prompts, physical-law tags, and example videos.
## License
The base model is released by its original authors; this LoRA adapter
is shared for anonymous review purposes. No identifying metadata is
included.