---
library_name: peft
pipeline_tag: text-generation
tags:
  - lora
  - peft
  - judge
  - video-evaluation
  - anonymous-release
---

# physground-judger9B — Anonymous Judge LoRA Adapter

LoRA adapter trained as a judge model that scores generated videos against
prompt-alignment, temporal, persistence, and 13 physical-law sub-rubrics.
Released anonymously alongside the companion dataset
[`anonymouscla/physground`](https://huggingface.co/datasets/anonymouscla/physground).

The base model identifier required to attach this adapter is recorded in
`adapter_config.json` (`base_model_name_or_path`); the inference script
reads it automatically.
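
To check which base model the adapter expects before downloading anything large, the field can be read directly. This is a minimal sketch, assuming the adapter files already sit in a local folder; `read_base_model_id` is an illustrative helper and is not part of `infer.py`.

```python
import json
from pathlib import Path


def read_base_model_id(adapter_dir):
    """Return the base model id recorded in adapter_config.json."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    with config_path.open(encoding="utf-8") as fh:
        config = json.load(fh)
    # `base_model_name_or_path` is the PEFT config field named above.
    return config["base_model_name_or_path"]
```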

## Files

| File | Purpose |
| --- | --- |
| `adapter_config.json` | PEFT/LoRA config (records base model id) |
| `adapter_model.safetensors` | LoRA weights (~167 MB) |
| `additional_config.json` | ms-swift extras (lora_dtype / lr ratios) |
| `training_args.json` | sanitized training hyperparameters |
| `subq+human.yaml` | prompt template used at training and inference time |
| `infer.py` | standalone end-to-end inference script |

## Setup

```bash
pip install "transformers>=4.49" peft accelerate pyyaml \
            "qwen-vl-utils[decord]" huggingface_hub
```

Loading the base model in bf16 needs roughly 24 GB of GPU memory.

## Quickstart — Hugging Face Hub

`infer.py` accepts either a local folder or a HF Hub repo id via
`--adapter-dir`. The default value already points at this repo, so once you
have a local copy of `infer.py` the following commands work without cloning
the rest of the repository.

```bash
# General axes (1–5 each): SA / PTV / persistence
python infer.py \
  --video /path/to/video.mp4 \
  --caption "A ball rolls down a ramp and knocks over a block." \
  --metric SA

# Physical-law axes (1–5 each): one of the 13 laws below
python infer.py \
  --video /path/to/video.mp4 \
  --caption "A ball rolls down a ramp and knocks over a block." \
  --law gravity
```

`infer.py` will:

1. Resolve `--adapter-dir` to a local directory (`huggingface_hub.snapshot_download`
   if it is a Hub id).
2. Read `adapter_config.json` to find the base model and load it via
   `transformers`.
3. Attach the LoRA adapter via PEFT.
4. Render the scoring prompt from `subq+human.yaml`, plus the relevant
   sub-questions / per-law criterion (constants embedded in `infer.py`).
5. Run greedy decoding with `--max-new-tokens 64` (matches training).
6. Parse the JSON object and print the integer score.
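
Step 1 can be sketched as follows. This is an assumption about how such a resolver might look, not a copy of the code in `infer.py`; `resolve_adapter_dir` is a hypothetical name, and only `huggingface_hub.snapshot_download` is a real library call.

```python
from pathlib import Path


def resolve_adapter_dir(adapter_dir):
    """Return a local folder for a path or a Hub repo id (illustrative helper)."""
    local = Path(adapter_dir)
    if local.is_dir():
        # Already on disk: use it as-is.
        return local
    # Otherwise treat the value as a Hub repo id and fetch a cached snapshot.
    from huggingface_hub import snapshot_download  # imported lazily

    return Path(snapshot_download(repo_id=str(adapter_dir)))
```

Checking for a local directory first means an offline run never touches the network when the files are already present.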

Output is a single JSON line:

```json
{"key": "gravity", "score": 4, "raw": "{\"gravity\": 4}"}
```
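
For scoring many videos from another Python process, the commands above can be wrapped in two small helpers. This is a sketch that uses only the flags documented here; it assumes that exactly one of `--metric` / `--law` is passed per call and that the result object is the last non-empty line on stdout. `build_command` and `read_result` are hypothetical names.

```python
import json
import sys


def build_command(video, caption, metric=None, law=None):
    """Build an argv list for infer.py using only the documented flags."""
    if (metric is None) == (law is None):
        raise ValueError("pass exactly one of metric or law")
    cmd = [sys.executable, "infer.py", "--video", str(video), "--caption", caption]
    cmd += ["--metric", metric] if metric is not None else ["--law", law]
    return cmd


def read_result(stdout):
    """Parse the last non-empty stdout line as the result JSON object."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    return json.loads(lines[-1])
```

A call would then look like `read_result(subprocess.run(build_command(...), capture_output=True, text=True, check=True).stdout)`.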

`--metric` choices: `SA`, `PTV`, `persistence`.
`--law` choices: `gravity`, `inertia`, `momentum`, `impenetrability`,
`collision`, `material`, `buoyancy`, `displacement`, `flow_dynamics`,
`boundary_interaction`, `fluid_continuity`, `reflection`, `shadow`.

Add `--print-prompt` to inspect the exact rendered system + user prompt
before generation.

## Programmatic use

```python
from pathlib import Path
import torch

# Requires infer.py from this repo on the import path (e.g. the working directory).
from infer import (
    build_messages,
    build_prompt,
    decode_generated,
    load_model,
    load_yaml,
    parse_score,
    prepare_inputs,
)

processor, model, adapter_dir = load_model(
    "anonymouscla/physground-judger9B",
    dtype=torch.bfloat16,
    device_map="auto",
)
cfg = load_yaml(adapter_dir / "subq+human.yaml")

system, user, key = build_prompt(
    cfg,
    caption="A ball rolls down a ramp and knocks over a block.",
    law="gravity",
)
messages = build_messages(system, user, Path("video.mp4"))
inputs = prepare_inputs(
    processor,
    messages,
    next(model.parameters()).device,
    fps=2.0,
    max_pixels=360 * 640,
)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)

raw = decode_generated(processor, inputs, out)
print({"key": key, "score": parse_score(raw, key), "raw": raw})
```

## Prompt templates

Both training and inference prompts are rendered from two sources:

- `subq+human.yaml` — system prompt, the SA / PTV / persistence templates
  for the general axes, and the `physical_template` shared by all 13
  physical-law axes (with `{prompt}`, `{law}`, `{criteria}`,
  `{questions_block}` placeholders). Use `--print-prompt` to dump the
  fully rendered system + user prompt.
- `infer.py` — the per-axis sub-question lists (`GENERAL_SUB_QUESTIONS`,
  `PHYSICAL_SUB_QUESTIONS`) and per-law criteria (`PHYSICAL_CRITERIA`)
  that are spliced into the YAML templates. Override any criterion at
  inference time with `--criteria "..."` instead of editing the source.
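
The splice described above can be illustrated with a stand-in template. Everything below is a sketch: the template text, the criterion wording, the sub-question, and the numbered formatting of the questions block are assumptions, and `render_physical_prompt` is a hypothetical helper. Only the four placeholder names come from `subq+human.yaml`.

```python
# Hypothetical stand-in for `physical_template`; the real text lives in the YAML.
TEMPLATE = (
    "Caption: {prompt}\n"
    "Law: {law}\n"
    "Criterion: {criteria}\n"
    "Sub-questions:\n{questions_block}"
)


def render_physical_prompt(template, prompt, law, criteria, questions):
    """Fill the four placeholders; numbering the questions is an assumption."""
    questions_block = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, 1))
    return template.format(
        prompt=prompt, law=law, criteria=criteria, questions_block=questions_block
    )


print(render_physical_prompt(
    TEMPLATE,
    prompt="A ball rolls down a ramp and knocks over a block.",
    law="gravity",
    criteria="Objects fall and accelerate downward.",  # illustrative wording
    questions=["Does the ball move down the ramp?"],   # illustrative wording
))
```

If the real template contains literal braces (for example a JSON sample in the instructions), `str.format` would raise on them, and replacing each placeholder with `str.replace` would be the safer choice.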

The judge always replies with a single JSON object containing one key
(the metric or law name) and an integer score in 1–5.
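
A consumer can enforce this contract before trusting a score. The function below is a minimal sketch written for illustration; it is not the `parse_score` shipped in `infer.py`, whose behavior may differ.

```python
import json


def parse_judge_reply(raw, key):
    """Return the integer score for `key`, or raise ValueError."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError(f"no JSON object in reply: {raw!r}")
    # json.JSONDecodeError is a subclass of ValueError.
    obj = json.loads(raw[start : end + 1])
    score = obj.get(key)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"missing or out-of-range score for {key!r}: {obj!r}")
    return score
```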

## Training summary

The adapter was trained with LoRA via PEFT (rank 32, α 64, dropout 0.05) on
the language-tower linear layers, with the vision encoder frozen. Training
used bf16 with gradient checkpointing and AdamW (learning rate 1e-4, cosine
schedule) for 1 epoch (294 steps) on the `subq+human` split, which combines
automatically derived sub-question judgements with human-rated samples.
Full hyperparameters are in `training_args.json` and `additional_config.json`;
the exact LoRA target regex and rank are in `adapter_config.json`. Framework:
ms-swift 4.1.2, PEFT 0.19.1, DeepSpeed ZeRO-2.
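
As a quick worked check of what the rank and α imply: under standard LoRA scaling (assuming the adapter does not enable rank-stabilized scaling, which `adapter_config.json` would record), the low-rank update is multiplied by α / rank.

```python
# Values quoted in the training summary.
rank, alpha = 32, 64

# Standard LoRA scales the low-rank update B @ A by alpha / rank.
scaling = alpha / rank
print(scaling)  # 2.0
```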

See the companion dataset
[`anonymouscla/physground`](https://huggingface.co/datasets/anonymouscla/physground)
for prompts, physical-law tags, and example videos.

## License

The base model is released by its respective authors; this LoRA adapter
is shared for anonymous review purposes. No identifying metadata is
included.