---
license: other
license_name: ltx-2-community-license-agreement
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
language:
- en
- zh
library_name: diffusers
tags:
- video-generation
- video-reasoning
- logical-reasoning
- lora
- ltx-2.3
base_model:
- Lightricks/LTX-2.3
---

# LTX-2.3 VBVR LoRA - Video Reasoning

LoRA fine-tuned weights for LTX-2.3 22B on the VBVR (A Very Big Video Reasoning Suite) dataset.

## Training Data

**To ensure training quality, we preprocessed all 1,000,000 videos in the official dataset and sample from them at random during training to maintain data diversity. We adopt the official parameters, batch_size=16 and rank=32; the moderate rank helps avoid the catastrophic forgetting that an excessively large rank can cause.**

The VBVR dataset contains 100 reasoning task categories, with ~10,000 variants per task, totaling ~1M videos. Main task types include:

- **Object Trajectory**: Objects moving to target positions
- **Physical Reasoning**: Rolling balls, collisions, gravity
- **Causal Relationships**: Conditional triggers, chain reactions
- **Spatial Relationships**: Relative positions, path planning
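The dataset size and the random sampling described above can be sketched in a few lines. This is only an illustration: the helper name, the seed, and the use of uniform index sampling are assumptions, not the actual training data loader.

```python
import random

# Figures quoted in this card (both are approximate).
TASK_CATEGORIES = 100
VARIANTS_PER_TASK = 10_000


def total_videos(categories: int, variants: int) -> int:
    """Total videos if every category has the same number of variants."""
    return categories * variants


full_dataset = total_videos(TASK_CATEGORIES, VARIANTS_PER_TASK)
print(full_dataset)  # 1000000

# Hypothetical sampler: draw one effective batch of 16 distinct video
# indices uniformly at random from the full pool. The seed is arbitrary.
rng = random.Random(0)
batch = rng.sample(range(full_dataset), 16)
print(len(batch))  # 16
```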

## Model Details

| Item | Details |
|------|---------|
| Base Model | ltx-2.3-22b-dev |
| Training Method | LoRA Fine-tuning |
| LoRA Rank | 32 |
| Effective Batch Size | 16 |
| Mixed Precision | BF16 |
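To give a sense of what rank 32 means in terms of trainable parameters, the sketch below counts the weights LoRA adds to a single linear layer. The hidden size of 4096 is a hypothetical value chosen for illustration and is not taken from the LTX-2.3 architecture; biases are ignored.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Weights in the frozen dense layer that LoRA adapts."""
    return d_in * d_out


HIDDEN = 4096  # hypothetical layer width, for illustration only
RANK = 32      # rank used by this adapter

lora = lora_param_count(HIDDEN, HIDDEN, RANK)
full = full_param_count(HIDDEN, HIDDEN)

print(lora)         # 262144
print(full)         # 16777216
print(lora / full)  # 0.015625
```

Under these assumptions the adapter trains about 1.6% as many weights as the layer it modifies, and that fraction grows linearly with rank.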

## TODO List

### Dataset Release Plan

| Dataset | Videos | Status |
|---------|--------|--------|
| VBVR-96K | 96,000 | ✅ Released |
| VBVR-240K | 240,000 | ✅ Released |
| VBVR-reinforce | 240,000 + 150,000 | ✅ Released |

## LoRA Capabilities

This LoRA adapter is intended to improve the base LTX-2.3 model for production video generation workflows in the following areas:

- **Enhanced Complex Prompt Understanding**: Accurately interprets multi-object, multi-condition prompts with detailed spatial descriptions and temporal sequences, reducing prompt misinterpretation in production scenarios.

- **Improved Motion Dynamics**: Generates smooth, physically plausible object movements with natural acceleration, deceleration, and trajectory curves, avoiding robotic or unnatural motion patterns.

- **Temporal Consistency**: Maintains object appearance, lighting, and scene coherence throughout the video sequence, reducing flickering and frame-to-frame artifacts common in generated videos.

- **Precise Timing Control**: Enables accurate control over action duration, pacing, and synchronization between multiple moving elements based on prompt semantics.

- **Multi-Object Interaction**: Handles complex scenes with multiple objects interacting simultaneously, including collisions, following, avoiding, and coordinated movements.

- **Camera and Framing Stability**: Maintains consistent camera perspective and framing throughout the sequence, avoiding unwanted camera shake or unexpected viewpoint changes.


## Training Configuration

### Stage 1: VBVR Foundation (96K)
| Config | Value |
|--------|-------|
| Dataset | 96K VBVR general videos |
| Learning Rate | 1e-4 |
| Scheduler | Cosine  |
| Batch Size | 1 × 16 (gradient accumulation) |
| Optimizer | AdamW |
| Max Grad Norm | 1.0 |
| Target Modules | `to_q`, `to_k`, `to_v`, `to_out.0`, `ff.net.0.proj`, `ff.net.2` |
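The "1 × 16 (gradient accumulation)" entry means a micro-batch of 1 with gradients accumulated over 16 steps before each optimizer update, giving an effective batch size of 16. The toy example below, which uses a scalar linear model and made-up data rather than the real training loop, shows that the accumulated gradient matches the gradient of one batch of 16.

```python
import math


def grad_mse(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


MICRO_BATCH = 1
ACCUM_STEPS = 16
EFFECTIVE_BATCH = MICRO_BATCH * ACCUM_STEPS

# Toy data: y = 3x, with the weight initialised to 0.
xs = [float(i) for i in range(1, EFFECTIVE_BATCH + 1)]
ys = [3.0 * x for x in xs]
w = 0.0

# Accumulate scaled micro-batch gradients, as a training loop would
# before calling the optimizer once.
accumulated = 0.0
for step in range(ACCUM_STEPS):
    start = step * MICRO_BATCH
    mb_x = xs[start:start + MICRO_BATCH]
    mb_y = ys[start:start + MICRO_BATCH]
    accumulated += grad_mse(w, mb_x, mb_y) / ACCUM_STEPS

full_batch = grad_mse(w, xs, ys)
print(EFFECTIVE_BATCH)                        # 16
print(math.isclose(accumulated, full_batch))  # True
```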

### Stage 2: VBVR Extended (240K)
| Config | Value |
|--------|-------|
| Dataset | 240K general videos |
| Learning Rate | 1e-4 |
| Scheduler | Cosine |
| Batch Size | 1 × 16 (gradient accumulation) |
| Optimizer | AdamW |
| Max Grad Norm | 1.0 |
| Target Modules | `to_q`, `to_k`, `to_v`, `to_out.0`, `ff.net.0.proj`, `ff.net.2` |
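The card lists the scheduler only as "Cosine". A minimal sketch of a cosine decay from the 1e-4 base rate is shown below; the total step count, the absence of warmup, and the final rate of zero are assumptions made for illustration.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


BASE_LR = 1e-4      # learning rate from the table above
TOTAL_STEPS = 1000  # hypothetical; the card does not state the step count

for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, TOTAL_STEPS, BASE_LR))
```

The rate starts at the base value, passes through half of it at the midpoint, and reaches the minimum at the final step.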

### Stage 3: General + Hard Reasoning (390K)
| Config | Value |
|--------|-------|
| Dataset | 240K general videos + 150K high-difficulty reasoning videos |
| Learning Rate | 5e-5 |
| Scheduler | Cosine |
| Batch Size | 1 × 16 (gradient accumulation) |
| Optimizer | AdamW |
| Max Grad Norm | 1.0 |
| Target Modules | `to_q`, `to_k`, `to_v`, `to_out.0` (FFN frozen) |
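Stage 3 differs from the earlier stages by dropping the feed-forward projections from the target list, so only the attention projections receive LoRA updates. The sketch below shows how such a list could select modules by name suffix. The matching rule and the example module paths are assumptions for illustration and may not match the real LTX-2.3 module names or the trainer's selection logic.

```python
STAGE_1_2_TARGETS = ["to_q", "to_k", "to_v", "to_out.0", "ff.net.0.proj", "ff.net.2"]
STAGE_3_TARGETS = ["to_q", "to_k", "to_v", "to_out.0"]  # FFN frozen


def is_targeted(module_name: str, targets: list) -> bool:
    """True if the name equals a target or ends with '.<target>'."""
    return any(module_name == t or module_name.endswith("." + t) for t in targets)


# Hypothetical module paths, for illustration only.
example_modules = [
    "transformer_blocks.0.attn1.to_q",
    "transformer_blocks.0.attn1.to_out.0",
    "transformer_blocks.0.ff.net.0.proj",
    "transformer_blocks.0.ff.net.2",
    "transformer_blocks.0.norm1",
]

stage_1_2 = [m for m in example_modules if is_targeted(m, STAGE_1_2_TARGETS)]
stage_3 = [m for m in example_modules if is_targeted(m, STAGE_3_TARGETS)]

print(len(stage_1_2))  # 4
print(len(stage_3))    # 2
```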

## Video Demo

### Training Progress Comparison

<div style="display: flex; gap: 10px; flex-wrap: wrap;">

<div style="flex: 1; min-width: 300px;">
<video src="https://huggingface.co/LiconStudio/Ltx2.3-VBVR-lora-I2V/resolve/main/original01.mp4" controls style="width: 100%;"></video>
<p style="text-align: center; margin: 8px 0 0 0;"><strong>Original Model</strong></p>
</div>

<div style="flex: 1; min-width: 300px;">
<video src="https://huggingface.co/LiconStudio/Ltx2.3-VBVR-lora-I2V/resolve/main/240K.mp4" controls style="width: 100%;"></video>
<p style="text-align: center; margin: 8px 0 0 0;"><strong>240K</strong></p>
</div>

<div style="flex: 1; min-width: 300px;">
<video src="https://huggingface.co/LiconStudio/Ltx2.3-VBVR-lora-I2V/resolve/main/S3.mp4" controls style="width: 100%;"></video>
<p style="text-align: center; margin: 8px 0 0 0;"><strong>390K</strong></p>
</div>

</div>

## Dataset

This model was trained on the VBVR (A Very Big Video Reasoning Suite) dataset from [video-reason.com](https://video-reason.com/).


## Contact

For questions or suggestions, please open an issue on Hugging Face or contact the author directly.