Qwen3-VL Pose Stage 1

This repository stores the stage 1 pose-only alignment artifacts for a Qwen3-VL-based exercise feedback model.

What This Stage Does

Stage 1 focuses on aligning structured pose features to the Qwen3-VL token embedding space before running a later joint multimodal stage.

The model setup for this stage is:

Base model: Qwen/Qwen3-VL-4B-Instruct
Modalities used during training: pose + text
Image branch: disabled for this stage
Vision encoder: frozen
Language model: frozen
LoRA: disabled
Pose adapter: enabled in last_linear mode
Pose projector: trainable
Pose placeholder tokens: 8

In this stage, pose features are encoded, projected into the Qwen embedding space, and injected into reserved pose token positions. Training is supervised with generated exercise descriptions and feedback text.
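
The injection step described above can be sketched in plain Python. This is a simplified illustration, not the repository's code: it uses a single linear projection and leaves out the input normalization and output gate that the real projector appears to have, and the names `project_pose`, `inject_pose_tokens`, and the placeholder id `POSE_TOKEN_ID` are hypothetical.

```python
from typing import List

Vector = List[float]

# Hypothetical id of the reserved pose placeholder token (an assumption).
POSE_TOKEN_ID = 9999


def project_pose(features: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Linear projection: out[i] = sum_j weight[i][j] * features[j] + bias[i]."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]


def inject_pose_tokens(
    input_ids: List[int],
    token_embeds: List[Vector],
    pose_embeds: List[Vector],
    pose_token_id: int = POSE_TOKEN_ID,
) -> List[Vector]:
    """Replace the embeddings at pose placeholder positions with pose embeddings."""
    positions = [i for i, tok in enumerate(input_ids) if tok == pose_token_id]
    if len(positions) != len(pose_embeds):
        raise ValueError(
            f"expected {len(positions)} pose embeddings, got {len(pose_embeds)}"
        )
    merged = [list(row) for row in token_embeds]  # copy; leave the input untouched
    for pos, emb in zip(positions, pose_embeds):
        merged[pos] = list(emb)
    return merged


# Toy usage: two placeholder positions and an embedding width of 3.
ids = [1, POSE_TOKEN_ID, POSE_TOKEN_ID, 2]
text_embeds = [[0.0, 0.0, 0.0] for _ in ids]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pose = [project_pose(f, identity, [0.0, 0.0, 0.0]) for f in ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])]
merged = inject_pose_tokens(ids, text_embeds, pose)
```

In this stage the sequence would carry 8 placeholder positions rather than the 2 used in the toy example, and the embedding width would match the Qwen token embedding size.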

Data Used

Training used the following sources:

processed/generated_descriptions.jsonl
train/unimodal/training.csv
train/unimodal/validation.csv

The supervision target is built from:

response.description
response.feedback
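
A minimal sketch of how such a target could be assembled is shown below. It assumes each JSONL line holds a nested `response` object with `description` and `feedback` string fields, and that the two parts are joined with a blank line; the actual record layout and separator are not documented here, and the helper names are hypothetical.

```python
import json
from typing import List


def build_target(record: dict) -> str:
    """Join response.description and response.feedback into one target string."""
    response = record.get("response") or {}
    parts = [response.get("description", ""), response.get("feedback", "")]
    # Skip missing or empty parts so the target never starts with a separator.
    return "\n\n".join(p.strip() for p in parts if p and p.strip())


def load_targets(path: str) -> List[str]:
    """Read a JSONL file and build one target string per non-empty line."""
    targets = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                targets.append(build_target(json.loads(line)))
    return targets
```

For example, `load_targets("processed/generated_descriptions.jsonl")` would return one supervision string per record under these assumptions.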

Training Configuration

Epochs: 5
Per-device batch size: 1
Gradient accumulation: 8
Effective batch size: 8
Learning rate: 5e-6
Pose projector learning rate: 5e-6
Pose adapter learning rate: 1e-6
Warmup ratio: 0.05
Max grad norm: 1.0
Logging backend: TensorBoard
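
The batch size and warmup figures above can be related with a short calculation. The sketch below assumes a single device (which is what an effective batch size of 8 implies with these settings), uses the train sample count reported under Final Metrics, and rounds step counts up; the exact step counts depend on how the trainer handles the last partial accumulation window, so treat the results as approximate.

```python
import math


def schedule_summary(
    num_samples: int,
    per_device_batch: int,
    grad_accum: int,
    epochs: int,
    warmup_ratio: float,
    num_devices: int = 1,  # assumption: single-device training
) -> dict:
    """Derive effective batch size and approximate optimizer step counts."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


summary = schedule_summary(
    num_samples=8989, per_device_batch=1, grad_accum=8, epochs=5, warmup_ratio=0.05
)
print(summary)
```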

Final Metrics

Train loss: 22.7354
Eval loss: 2.6238
Train samples: 8989
Eval samples: 612
Train runtime: 21133.94s

Trainable Components

This stage trains only a small subset of pose-side parameters:

pose_feature_encoder.pose_adapter.pose_embedding_projector.4.*
pose_projector.output_gate_logit
pose_projector.input_norm.*
pose_projector.proj.*

Everything in the main Qwen language backbone remains frozen.
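
One way to express this selection is to match parameter names against the patterns listed above. The sketch below is an illustration under assumptions: the helper names are hypothetical, and the prefix-to-learning-rate mapping is inferred from the values under Training Configuration rather than taken from the training code.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List

# Patterns copied from the list of trainable components.
TRAINABLE_PATTERNS = [
    "pose_feature_encoder.pose_adapter.pose_embedding_projector.4.*",
    "pose_projector.output_gate_logit",
    "pose_projector.input_norm.*",
    "pose_projector.proj.*",
]

# Assumed mapping from name prefix to learning rate.
LR_BY_PREFIX = {
    "pose_feature_encoder.pose_adapter.": 1e-6,
    "pose_projector.": 5e-6,
}


def is_trainable(name: str) -> bool:
    """True if the parameter name matches one of the trainable patterns."""
    return any(fnmatchcase(name, pattern) for pattern in TRAINABLE_PATTERNS)


def learning_rate_for(name: str) -> float:
    """Look up the learning rate for a trainable parameter by name prefix."""
    for prefix, lr in LR_BY_PREFIX.items():
        if name.startswith(prefix):
            return lr
    raise KeyError(f"no learning rate configured for {name!r}")


def group_trainable(names: Iterable[str]) -> Dict[float, List[str]]:
    """Group trainable parameter names by learning rate; frozen names are skipped."""
    groups: Dict[float, List[str]] = {}
    for name in names:
        if is_trainable(name):
            groups.setdefault(learning_rate_for(name), []).append(name)
    return groups
```

With a PyTorch module, the same predicate could drive freezing by iterating `model.named_parameters()` and setting `param.requires_grad = is_trainable(name)`.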

Repository Contents

This repository keeps the most relevant stage 1 artifacts:

pose_projector.pt: learned pose projector weights
pose_adapter.pt: stage 1 tuned pose adapter weights
pose_bridge_config.json: pose token injection metadata
stage_manifest.json: training-stage manifest
training_args.bin: Hugging Face training arguments
train_results.json: final train metrics
eval_results.json: final eval metrics
all_results.json: aggregate run metrics
logs/: TensorBoard event files

Intermediate checkpoints and the full Qwen weights may be intentionally omitted from the upload.
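
Because some files may be absent, a quick check of a local copy can be useful before loading anything. The following sketch assumes the repository has already been downloaded to a local directory; the helper name `missing_artifacts` is hypothetical.

```python
from pathlib import Path
from typing import List

# File and directory names taken from the list of repository contents.
EXPECTED_ARTIFACTS = [
    "pose_projector.pt",
    "pose_adapter.pt",
    "pose_bridge_config.json",
    "stage_manifest.json",
    "training_args.bin",
    "train_results.json",
    "eval_results.json",
    "all_results.json",
    "logs",
]


def missing_artifacts(repo_dir: str) -> List[str]:
    """Return the expected artifact names that are not present in repo_dir."""
    root = Path(repo_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).exists()]
```

Calling `missing_artifacts("path/to/local/copy")` returns an empty list when every listed artifact is present.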

Intended Next Step

This stage is not the final model. It aligns pose features ahead of a later joint image + pose + text training stage.
