Qwen3-VL Pose Stage 1

This repository stores the stage 1 (pose-only alignment) artifacts for a Qwen3-VL-based exercise feedback model.

What This Stage Does

Stage 1 focuses on aligning structured pose features to the Qwen3-VL token embedding space before running a later joint multimodal stage.

The model setup for this stage is:

  • Base model: Qwen/Qwen3-VL-4B-Instruct
  • Modalities used during training: pose + text
  • Image branch: disabled for this stage
  • Vision encoder: frozen
  • Language model: frozen
  • LoRA: disabled
  • Pose adapter: enabled in last_linear mode (only pose_embedding_projector.4 is tuned)
  • Pose projector: trainable
  • Pose placeholder tokens: 8

In this stage, pose features are encoded, projected into the Qwen embedding space, and injected into reserved pose token positions. Training is supervised with generated exercise descriptions and feedback text.
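
The injection step described above can be sketched roughly as follows. This is a minimal illustration on plain Python lists, not the repository's code: the placeholder token id `POSE_TOKEN_ID`, the function name `inject_pose_embeddings`, and the tiny hidden size are all assumptions made for the example. A real pipeline would operate on tensors with the model's actual hidden size.

```python
from typing import List, Sequence

POSE_TOKEN_ID = 151_900  # hypothetical id of the reserved pose placeholder token
NUM_POSE_TOKENS = 8      # matches "Pose placeholder tokens: 8" above


def inject_pose_embeddings(
    input_ids: Sequence[int],
    token_embeds: Sequence[Sequence[float]],
    pose_embeds: Sequence[Sequence[float]],
    pose_token_id: int = POSE_TOKEN_ID,
) -> List[List[float]]:
    """Return a copy of token_embeds with pose vectors written into placeholder slots."""
    # Positions in the sequence that were reserved for pose features.
    slots = [i for i, tok in enumerate(input_ids) if tok == pose_token_id]
    if len(slots) != len(pose_embeds):
        raise ValueError(
            f"expected {len(slots)} pose vectors, got {len(pose_embeds)}"
        )
    merged = [list(vec) for vec in token_embeds]
    # Overwrite each placeholder embedding with the matching projected pose vector.
    for slot, pose_vec in zip(slots, pose_embeds):
        merged[slot] = list(pose_vec)
    return merged


# Toy example: 2 text tokens, 8 pose placeholders, 1 text token, hidden size 4.
input_ids = [1, 2] + [POSE_TOKEN_ID] * NUM_POSE_TOKENS + [3]
token_embeds = [[0.0] * 4 for _ in input_ids]
pose_embeds = [[float(k)] * 4 for k in range(1, NUM_POSE_TOKENS + 1)]
merged = inject_pose_embeddings(input_ids, token_embeds, pose_embeds)
```

Text positions keep their original embeddings; only the reserved positions change, which is why the language model itself can stay frozen while the projector learns.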

Data Used

Training used the following sources:

  • processed/generated_descriptions.jsonl
  • train/unimodal/training.csv
  • train/unimodal/validation.csv

The supervision target is built from:

  • response.description
  • response.feedback
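
As a rough sketch of how such a target could be assembled from one JSONL record: only the `response.description` and `response.feedback` fields come from this card. The function names, the blank-line separator, and the sample record are assumptions for illustration.

```python
import json
from typing import Dict, List


def build_target(record: Dict) -> str:
    """Join description and feedback into one supervision string."""
    response = record["response"]
    parts = [response["description"].strip(), response["feedback"].strip()]
    # Assumption: the two fields are separated by a blank line; empty fields are skipped.
    return "\n\n".join(part for part in parts if part)


def load_targets(jsonl_text: str) -> List[str]:
    """Parse JSONL text (one JSON object per line) into target strings."""
    return [
        build_target(json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


# Hypothetical record in the shape implied by the field names above.
sample = json.dumps(
    {"response": {"description": "A bodyweight squat.", "feedback": "Keep the chest up."}}
)
targets = load_targets(sample + "\n")
```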

Training Configuration

  • Epochs: 5
  • Per-device batch size: 1
  • Gradient accumulation: 8
  • Effective batch size: 8
  • Learning rate: 5e-6
  • Pose projector learning rate: 5e-6
  • Pose adapter learning rate: 1e-6
  • Warmup ratio: 0.05
  • Max grad norm: 1.0
  • Logging backend: TensorBoard
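
The separate learning rates for the projector and the adapter suggest per-component optimizer parameter groups. The sketch below shows one way to build such groups by name prefix. The prefixes are taken from the parameter names listed in this card, while the helper names and the `.weight` suffixes in the example are assumptions. The resulting list of `{"params": ..., "lr": ...}` dicts has the shape that PyTorch optimizers accept.

```python
from typing import Dict, Iterable, List, Tuple

BASE_LR = 5e-6  # "Learning rate" above, used when no prefix matches
LR_BY_PREFIX = {
    "pose_projector.": 5e-6,                     # pose projector learning rate
    "pose_feature_encoder.pose_adapter.": 1e-6,  # pose adapter learning rate
}


def lr_for(name: str) -> float:
    """Pick the learning rate for a parameter from its name prefix."""
    for prefix, lr in LR_BY_PREFIX.items():
        if name.startswith(prefix):
            return lr
    return BASE_LR


def build_param_groups(named_params: Iterable[Tuple[str, object]]) -> List[Dict]:
    """Group (name, parameter) pairs by learning rate."""
    by_lr: Dict[float, list] = {}
    for name, param in named_params:
        by_lr.setdefault(lr_for(name), []).append(param)
    return [{"params": params, "lr": lr} for lr, params in by_lr.items()]


# Placeholder strings stand in for real parameter tensors.
groups = build_param_groups(
    [
        ("pose_projector.proj.weight", "w_proj"),
        ("pose_feature_encoder.pose_adapter.pose_embedding_projector.4.weight", "w_adapter"),
        ("pose_projector.output_gate_logit", "gate"),
    ]
)
```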

Final Metrics

  • Train loss: 22.7354
  • Eval loss: 2.6238
  • Train samples: 8989
  • Eval samples: 612
  • Train runtime: 21133.94s
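
The configuration and metrics above can be cross-checked with a little arithmetic. The sketch below assumes a single training device (the stated effective batch size of 8 equals 1 × 8 only in that case) and assumes step counts round up. The exact rounding depends on the trainer, so the derived step counts are estimates rather than logged values.

```python
import math

train_samples = 8989
epochs = 5
per_device_batch = 1
grad_accum = 8
num_devices = 1          # assumption: implied by effective batch size 8 = 1 * 8
warmup_ratio = 0.05
train_runtime_s = 21133.94

# Samples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_devices

# Estimated optimizer steps, rounding the last partial batch up.
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

# Average throughput over the whole run.
samples_per_second = train_samples * epochs / train_runtime_s
runtime_hours = train_runtime_s / 3600

print(f"effective batch: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"warmup steps: {warmup_steps}")
print(f"throughput: {samples_per_second:.2f} samples/s over {runtime_hours:.2f} h")
```

Under these assumptions the run amounts to roughly 5,600 optimizer steps in just under six hours.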

Trainable Components

Only a small pose-side subset of parameters is trained in this stage:

  • pose_feature_encoder.pose_adapter.pose_embedding_projector.4.*
  • pose_projector.output_gate_logit
  • pose_projector.input_norm.*
  • pose_projector.proj.*

Everything in the main Qwen language backbone remains frozen.
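
One way to apply such a list is to match parameter names against the patterns with shell-style wildcards. The sketch below uses the four patterns exactly as listed above. The helper name `is_trainable` and the example parameter names are hypothetical.

```python
from fnmatch import fnmatchcase

TRAINABLE_PATTERNS = [
    "pose_feature_encoder.pose_adapter.pose_embedding_projector.4.*",
    "pose_projector.output_gate_logit",
    "pose_projector.input_norm.*",
    "pose_projector.proj.*",
]


def is_trainable(param_name: str) -> bool:
    """True if the parameter name matches any of the stage 1 trainable patterns."""
    return any(fnmatchcase(param_name, pattern) for pattern in TRAINABLE_PATTERNS)


# Hypothetical parameter names, for illustration only.
example_names = [
    "pose_projector.proj.weight",
    "pose_projector.output_gate_logit",
    "pose_feature_encoder.pose_adapter.pose_embedding_projector.4.bias",
    "pose_feature_encoder.pose_adapter.pose_embedding_projector.0.weight",
    "model.language_model.layers.0.self_attn.q_proj.weight",
]
trainable = [name for name in example_names if is_trainable(name)]
frozen = [name for name in example_names if not is_trainable(name)]
```

With a PyTorch module, the same predicate could drive `param.requires_grad = is_trainable(name)` in a loop over `model.named_parameters()`.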

Repository Contents

This repository is intended to keep the most relevant stage 1 artifacts:

  • pose_projector.pt: learned pose projector weights
  • pose_adapter.pt: stage 1 tuned pose adapter weights
  • pose_bridge_config.json: pose token injection metadata
  • stage_manifest.json: training-stage manifest
  • training_args.bin: Hugging Face training arguments
  • train_results.json: final train metrics
  • eval_results.json: final eval metrics
  • all_results.json: aggregate run metrics
  • logs/: TensorBoard event files

Intermediate checkpoints and the full Qwen weights may be intentionally omitted, depending on what was uploaded.
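
Because some files may be absent, a quick check of a local copy can help before wiring the artifacts into a later stage. The sketch below only tests for the file names listed above. The function name `missing_artifacts` and the local directory path are assumptions.

```python
from pathlib import Path
from typing import List

EXPECTED_ARTIFACTS = [
    "pose_projector.pt",
    "pose_adapter.pt",
    "pose_bridge_config.json",
    "stage_manifest.json",
    "training_args.bin",
    "train_results.json",
    "eval_results.json",
    "all_results.json",
    "logs",  # directory of TensorBoard event files
]


def missing_artifacts(repo_dir: str) -> List[str]:
    """Return the expected artifact names that are not present in repo_dir."""
    root = Path(repo_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).exists()]


if __name__ == "__main__":
    # Hypothetical local path to a downloaded copy of this repository.
    print(missing_artifacts("./qwen_3_pose_output_stage1"))
```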

Intended Next Step

This stage is not the final model. It is the pose alignment stage before a later joint image + pose + text training stage.
