# physground-judger9B – Anonymous Judge LoRA Adapter
LoRA adapter trained as a judge model that scores generated videos against
prompt-alignment, temporal, persistence, and 13 physical-law sub-rubrics.
Released anonymously alongside the companion dataset
anonymouscla/physground.
The base model identifier required to attach this adapter is recorded in
`adapter_config.json` (`base_model_name_or_path`); the inference script
reads it automatically.
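
If you need the base model id outside of `infer.py`, it can be read straight from the adapter config. A minimal sketch, assuming only the standard PEFT `adapter_config.json` schema:

```python
# Fetch the adapter config from the Hub and print the recorded base model id.
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("anonymouscla/physground-judger9B", "adapter_config.json")
with open(cfg_path) as f:
    print(json.load(f)["base_model_name_or_path"])
```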
## Files

| File | Purpose |
|---|---|
| `adapter_config.json` | PEFT/LoRA config (records the base model id) |
| `adapter_model.safetensors` | LoRA weights (~167 MB) |
| `additional_config.json` | ms-swift extras (lora_dtype / lr ratios) |
| `training_args.json` | sanitized training hyperparameters |
| `subq+human.yaml` | prompt template used at training and inference time |
| `infer.py` | standalone end-to-end inference script |
## Setup

```bash
pip install "transformers>=4.49" peft accelerate pyyaml \
    "qwen-vl-utils[decord]" huggingface_hub
```

Loading the base model in bf16 needs roughly 24 GB of GPU memory.
## Quickstart – Hugging Face Hub
`infer.py` accepts either a local folder or a HF Hub repo id via
`--adapter-dir`; the default value already points at this repo, so the
following commands work without cloning anything.
```bash
# General axes (1–5 each): SA / PTV / persistence
python infer.py \
    --video /path/to/video.mp4 \
    --caption "A ball rolls down a ramp and knocks over a block." \
    --metric SA

# Physical-law axes (1–5 each): one of the 13 laws below
python infer.py \
    --video /path/to/video.mp4 \
    --caption "A ball rolls down a ramp and knocks over a block." \
    --law gravity
```
`infer.py` will:

- Resolve `--adapter-dir` to a local directory (via `huggingface_hub.snapshot_download` if it is a Hub id); a sketch of this resolution step follows the list.
- Read `adapter_config.json` to find the base model and load it via `transformers`.
- Attach the LoRA adapter via PEFT.
- Render the scoring prompt from `subq+human.yaml`, plus the relevant sub-questions / per-law criterion (constants embedded in `infer.py`).
- Run greedy decoding with `--max-new-tokens 64` (matches training).
- Parse the JSON object and print the integer score.
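
A minimal sketch of that resolution step (an illustrative reconstruction, not the exact code in `infer.py`):

```python
# Treat --adapter-dir as a local path if it exists, otherwise as a Hub repo id.
from pathlib import Path
from huggingface_hub import snapshot_download

def resolve_adapter_dir(adapter_dir: str) -> Path:
    p = Path(adapter_dir)
    return p if p.is_dir() else Path(snapshot_download(repo_id=adapter_dir))
```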
Output is a single JSON line:

```json
{"key": "gravity", "score": 4, "raw": "{\"gravity\": 4}"}
```
`--metric` choices: `SA`, `PTV`, `persistence`.

`--law` choices: `gravity`, `inertia`, `momentum`, `impenetrability`,
`collision`, `material`, `buoyancy`, `displacement`, `flow_dynamics`,
`boundary_interaction`, `fluid_continuity`, `reflection`, `shadow`.

Add `--print-prompt` to inspect the exact rendered system + user prompt
before generation.
## Programmatic use

```python
from pathlib import Path
import torch
from infer import (
build_messages,
build_prompt,
decode_generated,
load_model,
load_yaml,
parse_score,
prepare_inputs,
)
processor, model, adapter_dir = load_model(
"anonymouscla/physground-judger9B",
dtype=torch.bfloat16,
device_map="auto",
)
cfg = load_yaml(adapter_dir / "subq+human.yaml")
system, user, key = build_prompt(
cfg,
caption="A ball rolls down a ramp and knocks over a block.",
law="gravity",
)
messages = build_messages(system, user, Path("video.mp4"))
inputs = prepare_inputs(
processor,
messages,
next(model.parameters()).device,
fps=2.0,
max_pixels=360 * 640,
)
with torch.inference_mode():
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
raw = decode_generated(processor, inputs, out)
print({"key": key, "score": parse_score(raw, key), "raw": raw})
## Prompt templates
Both training and inference prompts are rendered from two sources:
- `subq+human.yaml` – the system prompt, the SA / PTV / persistence templates for the general axes, and the `physical_template` shared by all 13 physical-law axes (with `{prompt}`, `{law}`, `{criteria}`, `{questions_block}` placeholders). Use `--print-prompt` to dump the fully rendered system + user prompt.
- `infer.py` – the per-axis sub-question lists (`GENERAL_SUB_QUESTIONS`, `PHYSICAL_SUB_QUESTIONS`) and per-law criteria (`PHYSICAL_CRITERIA`) that are spliced into the YAML templates. Override any criterion at inference time with `--criteria "..."` instead of editing the source (a placeholder-filling sketch follows this list).
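
To make the placeholder wiring concrete, here is a hedged illustration; the template string below is a stand-in invented for this example, not the real `physical_template` from `subq+human.yaml`, and the criterion and sub-question are likewise made up:

```python
# Stand-in template showing how the four placeholders could be filled.
physical_template = (
    "Caption: {prompt}\n"
    "Law under test: {law}\n"
    "Criterion: {criteria}\n"
    "Sub-questions:\n{questions_block}"
)

user_prompt = physical_template.format(
    prompt="A ball rolls down a ramp and knocks over a block.",
    law="gravity",
    criteria="Unsupported objects should accelerate downward at a constant rate.",  # example only
    questions_block="- Does the ball speed up while rolling down the ramp?",
)
print(user_prompt)
```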
The judge always replies with a single JSON object containing one key (the metric or law name) and an integer score in 1–5.
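
A minimal parser consistent with that contract, as a simplified stand-in for the `parse_score` helper in `infer.py` (the real helper may handle edge cases differently):

```python
# Extract the integer score for `key` from the judge's raw reply.
import json
import re

def parse_judge_reply(raw: str, key: str) -> int:
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)  # tolerate stray text around the JSON
    if match is None:
        raise ValueError(f"no JSON object found in reply: {raw!r}")
    score = int(json.loads(match.group(0))[key])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score

print(parse_judge_reply('{"gravity": 4}', "gravity"))  # -> 4
```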
## Training summary
LoRA via PEFT (rank 32, α 64, dropout 0.05) over the language-tower
linear layers, vision encoder frozen, bf16 + gradient checkpointing,
AdamW with lr 1e-4 and a cosine schedule, 1.0 epoch / 294 steps on the
subq+human split (automatically derived sub-question judgements +
human-rated samples). Full hyperparameters are in `training_args.json`
and `additional_config.json`; the exact LoRA target regex and rank are
in `adapter_config.json`. Framework: ms-swift 4.1.2, PEFT 0.19.1,
DeepSpeed ZeRO-2.
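
For reference, a PEFT configuration reconstructed from these numbers would look roughly like the following; `target_modules` is an assumption, since the exact target regex is recorded in `adapter_config.json`:

```python
from peft import LoraConfig

# Hedged reconstruction of the training-time LoRA setup (not the shipped config).
lora_cfg = LoraConfig(
    r=32,              # LoRA rank
    lora_alpha=64,     # alpha
    lora_dropout=0.05,
    target_modules=[   # assumed language-tower linear layers; the vision encoder stays frozen
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```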
See the companion dataset `anonymouscla/physground` for prompts,
physical-law tags, and example videos.
## License
The base model is released by its respective authors; this LoRA adapter is shared for anonymous review purposes. No identifying metadata is included.