elix3r/LTX-2.3-22b-AV-LoRA-talking-head
Overview
This is the first community audio-visual (AV) LoRA for LTX-Video 2.3, trained using the joint audio-video cross-attention architecture of the LTX-2.3 22B model. The LoRA enables talking head video generation with synchronized lip sync and internalized voice characteristics from a reference character.
This release is a character-specific implementation and reference pipeline. The weights demonstrate a working AV LoRA trained on a custom dataset. The methodology, dataset structure, caption format, and training config are fully documented and reusable for training your own character-specific AV LoRA.
What It Does
- Generates talking head videos with synchronized lip sync from a reference image
- Internalizes voice characteristics without requiring external audio input at inference time
- Preserves character identity across unseen reference images and backgrounds
Demo Results (v1)
- Lip sync: accurate and consistent
- Identity preservation: locks in at step 1250, improves linearly to step 2000
- Voice characteristics: internalized from training data
- Known limitations: slight audio buzz artifacts, occasional eye blinking inconsistency, seed-dependent output quality
How To Use
Requirements
- ComfyUI with the example workflows included in this repository
- LTX-2.3 Model
- Power Lora Loader node
Loading the LoRA
Load LTX-2.3-22b-AV-LoRA-talking-head-v1.safetensors via the Power Lora Loader node in ComfyUI.
Set LoRA strength to 1.0.
Recommended Inference Settings
| Parameter | Value |
|---|---|
| Resolution | 1280x736 |
| FPS | 24 |
| Video length | Any (10+ seconds recommended) |
| LoRA strength | 1.0 |
| Trigger word | OHWXPERSON |
| CFG scale | 1.0 |
Note: 1280x736 @ 24fps is recommended for image-to-video inference. For image + audio to video inference, use 1280x704 @ 25fps to match the training distribution.
Prompt Format
Include the trigger word OHWXPERSON and end the prompt with the speech transcript:
OHWXPERSON, [visual description]. The person is talking, and he says: "[transcript]"
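The format above is easy to assemble programmatically. Below is a minimal sketch of a prompt-builder helper; the function name and the example description are illustrative, not part of any official API:

```python
TRIGGER = "OHWXPERSON"

def build_prompt(visual_description: str, transcript: str) -> str:
    """Compose an inference prompt in the documented format:
    trigger word, visual description, then the speech transcript."""
    return (
        f'{TRIGGER}, {visual_description}. '
        f'The person is talking, and he says: "{transcript}"'
    )

print(build_prompt(
    "a man in a grey jacket facing the camera in a dimly lit studio",
    "Welcome back to the channel.",
))
```

Keeping the transcript as the final quoted segment matters, since the model conditions the generated speech on it.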
Training Your Own AV LoRA
This section documents the full pipeline so you can train a character-specific AV LoRA for your own subject.
Pipeline Overview
Reference Images
|
v
Flux.1 Kontext / Flux.2 Klein -- Image generation
|
v
Fish Audio S2 Pro -- Voice cloning + TTS
|
v
LTX-Video 2.3 -- Talking head video generation
|
v
LTX-2 trainer -- AV LoRA training
|
v
Trained AV LoRA weights
Step 1 -- Generate Reference Images
Use Flux Kontext in ComfyUI to generate consistent reference images of your character across varied poses, angles, lighting conditions, and expressions.
Key settings used in this project:
- Flux Kontext dev Q6_K GGUF
- Sampler: res_3s + res_2m (RES4LYF)
- FluxGuidance: 1
- denoise: 1
Step 2 -- Clone the Voice
Use the Fish Audio S2 Pro model with a 10-15 second reference audio clip of your target voice. It supports [pause], [short pause], and [emphasis] tags for pacing control.
Generate TTS audio for each clip's script using the cloned voice.
Step 3 -- Generate Training Clips
Use LTX-2.3 in ComfyUI to generate talking head clips from your reference images.
Use the LTX-2.3 image + audio to video workflow (LTX-2-3-I2V-Custom-Audio.json, listed under Files).
Dataset requirements:
- 25-30 clips minimum
- Resolution: 1280x704
- FPS: 25
- Length: 6-10 seconds per clip after trimming
- Variety: front facing, 3/4 angles, side profile, different backgrounds, multiple emotions
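The requirements above can be enforced with a small pre-training check. The sketch below assumes you have already collected per-clip metadata (e.g. via ffprobe) into dicts; the dict shape and field names are assumptions for illustration:

```python
# Sanity-check clip metadata against the dataset requirements above.
# Metadata collection itself (e.g. via ffprobe) is out of scope here;
# the dict shape is illustrative.
REQUIRED_W, REQUIRED_H, REQUIRED_FPS = 1280, 704, 25
MIN_CLIPS = 25
MIN_SECONDS, MAX_SECONDS = 6.0, 10.0

def check_dataset(clips: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the set passes."""
    problems = []
    if len(clips) < MIN_CLIPS:
        problems.append(f"only {len(clips)} clips; need at least {MIN_CLIPS}")
    for clip in clips:
        name = clip["file"]
        if (clip["width"], clip["height"]) != (REQUIRED_W, REQUIRED_H):
            problems.append(
                f"{name}: resolution {clip['width']}x{clip['height']} != 1280x704"
            )
        if clip["fps"] != REQUIRED_FPS:
            problems.append(f"{name}: fps {clip['fps']} != 25")
        if not (MIN_SECONDS <= clip["seconds"] <= MAX_SECONDS):
            problems.append(f"{name}: length {clip['seconds']}s outside 6-10s")
    return problems
```

Running this before training is cheaper than discovering a 24fps clip after a multi-hour run.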
Prompt format for each clip:
[scene description]. Mouth partially open during speech with only the front teeth partially visible, lips moving naturally without fully exposing all teeth. Smooth continuous motion, cinematic, realistic, sharp focus on subject. The person is talking, and he says: "[transcript]"
Background complexity directly impacts lip sync quality. Simple and dark backgrounds produce the best results. Complex backgrounds with many competing elements reduce lip sync accuracy.
Step 4 -- Prepare the Dataset
Structure your dataset folder as follows:
ohwxperson_dataset_v1/
  clip_001.mp4    # video with embedded audio from LTX-2.3
  clip_002.mp4
  ...
  CAPTIONS.json
Caption format in CAPTIONS.json:
{
  "captions": [
    {
      "file": "clip_001.mp4",
      "caption": "[VISUAL] OHWXPERSON, [visual description of scene, pose, clothing, background]. [SPEECH] OHWXPERSON speaks in a [voice description]: \"[exact transcript]\""
    }
  ]
}
A reference CAPTIONS.json from this project is included in this repository.
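With 25+ clips, writing the caption file by hand is error-prone; a short script can assemble it from per-clip records. This is a sketch following the caption format above; the record shape and function names are assumptions, not part of the trainer:

```python
import json

TRIGGER = "OHWXPERSON"

def make_caption(visual: str, voice: str, transcript: str) -> str:
    """Build one caption string in the [VISUAL]/[SPEECH] format documented above."""
    return (
        f'[VISUAL] {TRIGGER}, {visual}. '
        f'[SPEECH] {TRIGGER} speaks in a {voice}: "{transcript}"'
    )

def write_captions(records: list[dict], path: str) -> None:
    """records: [{"file": ..., "visual": ..., "voice": ..., "transcript": ...}]"""
    payload = {
        "captions": [
            {
                "file": r["file"],
                "caption": make_caption(r["visual"], r["voice"], r["transcript"]),
            }
            for r in records
        ]
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2, ensure_ascii=False)
```

Keeping the transcript in the caption identical to the audio actually spoken in the clip is what lets the LoRA learn the audio-text alignment.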
Step 5 -- Train with ltx-trainer
Recommended training configuration:
model:
  model_path: ltx-2.3-22b-dev.safetensors
  text_encoder_path: gemma
  training_mode: lora
lora:
  rank: 32
  alpha: 32
  target_modules: [to_k, to_q, to_v, to_out.0]
training_strategy:
  name: text_to_video
  with_audio: true
  first_frame_conditioning_p: 0.5
optimization:
  steps: 2000
  learning_rate: 1.0e-04
  batch_size: 1
  gradient_accumulation_steps: 1
  optimizer_type: adamw
  scheduler_type: linear
  mixed_precision_mode: bf16
  enable_gradient_checkpointing: true
validation:
  interval: 250
  inference_steps: 30
  guidance_scale: 4.0
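A few quantities implied by the config above are worth sanity-checking before launching a run. The snippet below is plain arithmetic over values transcribed by hand from the config and the Training Details table; the dict is not a trainer API:

```python
# Derived quantities from the training config above (values transcribed by hand).
config = {
    "steps": 2000,
    "batch_size": 1,
    "gradient_accumulation_steps": 1,
    "validation_interval": 250,
    "dataset_clips": 26,  # from the Training Details table
}

effective_batch = config["batch_size"] * config["gradient_accumulation_steps"]
validation_runs = config["steps"] // config["validation_interval"]
passes = config["steps"] * effective_batch / config["dataset_clips"]

print(f"effective batch size: {effective_batch}")        # 1
print(f"validation checkpoints: {validation_runs}")      # 8
print(f"approx. passes over the dataset: {passes:.0f}")  # 77
```

With a validation interval of 250, the reported identity lock at step 1250 corresponds to the fifth validation checkpoint.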
Training Details
| Parameter | Value |
|---|---|
| Base model | LTX-Video 2.3 22B |
| Training mode | LoRA |
| LoRA rank | 32 |
| LoRA alpha | 32 |
| Steps | 2000 |
| Learning rate | 1e-4 |
| Batch size | 1 |
| Mixed precision | bf16 |
| Dataset size | 26 clips |
| Peak VRAM usage | 77.08 GB |
| Training time | ~7.8 hours |
| Training cost | ~$5.33 (GCP Spot G4 instance, RTX PRO 6000 96GB) |
| Identity lock | Step 1250 |
Known Limitations (v1)
- Slight audio buzz artifacts present in outputs
- Eye blinking occasionally inconsistent (can be fixed by manual prompting)
- Output quality is seed dependent -- sweep 3-5 seeds per generation
- Character-specific weights -- lip sync and voice are tied to the trained character
- Best results at 1280x736 @ 24fps
v2 Roadmap
- Audio preprocessing with MelBand Roformer before training to eliminate buzz artifacts
- Explicit eye blinking captions and dedicated blinking clips in dataset
- Extended training to 2500-3000 steps
- Larger and more diverse dataset
Files
| File | Description |
|---|---|
| LTX-2.3-22b-AV-LoRA-talking-head-v1.safetensors | Final trained LoRA weights (v1) |
| CAPTIONS.json | Reference caption file for dataset structure |
| ohwxperson_av_lora.yaml | Full training configuration |
| flux_kontext_clownsharkextended.json | Flux Kontext workflow for generating reference images |
| LTX-2-3-I2V.json | LTX-Video 2.3 image-to-video workflow |
| LTX-2-3-I2V-Custom-Audio.json | LTX-Video 2.3 image + custom audio to video workflow |
Citation
If you use this model or methodology in your work, please credit this repository.
License
The LoRA weights are released for research and personal use. Commercial use requires separate permission.