elix3r/LTX-2.3-22b-AV-LoRA-talking-head

Overview

This is the first community audio-visual (AV) LoRA for LTX-Video 2.3, trained using the joint audio-video cross-attention architecture of the LTX-2.3 22B model. The LoRA enables talking head video generation with synchronized lip sync and internalized voice characteristics from a reference character.

This release is a character-specific implementation and reference pipeline. The weights demonstrate a working AV LoRA trained on a custom dataset. The methodology, dataset structure, caption format, and training config are fully documented and reusable for training your own character-specific AV LoRA.


What It Does

  • Generates talking head videos with synchronized lip sync from a reference image
  • Internalizes voice characteristics without requiring external audio input at inference time
  • Preserves character identity across unseen reference images and backgrounds

Demo Results (v1)

  • Lip sync: accurate and consistent
  • Identity preservation: locks in around step 1250 and continues improving through step 2000
  • Voice characteristics: internalized from training data
  • Known limitations: slight audio buzz artifacts, occasional eye blinking inconsistency, seed-dependent output quality

How To Use

Requirements

  • ComfyUI workflow (example workflows included in this repository)
  • LTX-2.3 Model
  • Power Lora Loader node

Loading the LoRA

Load LTX-2.3-22b-AV-LoRA-talking-head-v1.safetensors via the Power Lora Loader node in ComfyUI.

Set LoRA strength to 1.0.

Recommended Inference Settings

Parameter       Value
Resolution      1280x736
FPS             24
Video length    Any (10+ seconds recommended)
LoRA strength   1.0
Trigger word    OHWXPERSON
CFG scale       1.0

Note: 1280x736 @ 24fps is recommended for image-to-video inference. For image + audio to video inference, use 1280x704 @ 25fps to match the training distribution.

Prompt Format

Include the trigger word OHWXPERSON and end the prompt with the speech transcript:

OHWXPERSON, [visual description]. The person is talking, and he says: "[transcript]"
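
The format above can be assembled programmatically. A minimal sketch (the helper name and the example visual description and transcript are invented for illustration):

```python
TRIGGER = "OHWXPERSON"

def build_prompt(visual_description: str, transcript: str) -> str:
    """Compose an inference prompt in the documented format:
    trigger word, visual description, then the speech transcript."""
    return (
        f'{TRIGGER}, {visual_description}. '
        f'The person is talking, and he says: "{transcript}"'
    )

prompt = build_prompt(
    "a man in a dark room, soft key light, facing the camera",
    "Welcome to the channel.",
)
```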

Training Your Own AV LoRA

This section documents the full pipeline so you can train a character-specific AV LoRA for your own subject.

Pipeline Overview

Reference Images
      |
      v
Flux.1 Kontext / Flux.2 Klein     -- Image generation
      |
      v
Fish Audio S2 Pro                 -- Voice cloning + TTS
      |
      v
LTX-Video 2.3                     -- Talking head video generation
      |
      v
LTX-2 trainer                     -- AV LoRA training
      |
      v
Trained AV LoRA weights

Step 1 -- Generate Reference Images

Use Flux Kontext in ComfyUI to generate consistent reference images of your character across varied poses, angles, lighting conditions, and expressions.

[KONTEXT WORKFLOW]

Key settings used in this project:

  • Flux Kontext dev Q6_K GGUF
  • Sampler: res_3s + res_2m (RES4LYF)
  • FluxGuidance: 1
  • denoise: 1

Step 2 -- Clone the Voice

Use the Fish Audio S2 Pro model with a 10-15 second reference audio clip of your target voice. The model supports [pause], [short pause], and [emphasis] tags for pacing control.

Generate TTS audio for each clip's script using the cloned voice.

Step 3 -- Generate Training Clips

Use LTX-2.3 in ComfyUI to generate talking head clips from your reference images.

[LTX-2.3 IMAGE + AUDIO TO VIDEO WORKFLOW]

Dataset requirements:

  • 25-30 clips minimum
  • Resolution: 1280x704
  • FPS: 25
  • Length: 6-10 seconds per clip after trimming
  • Variety: front facing, 3/4 angles, side profile, different backgrounds, multiple emotions
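
If your source clips do not already match these specs, they can be normalized with ffmpeg. A minimal sketch that builds the command arguments (paths are hypothetical; ffmpeg must be on PATH to actually run the command, e.g. via subprocess.run):

```python
def normalize_cmd(src, dst, width=1280, height=704, fps=25, seconds=8):
    """Build ffmpeg arguments that scale a clip to the training
    resolution, resample to the training frame rate, trim to the
    target length, and copy the embedded audio track unchanged."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale={width}:{height}",
        "-r", str(fps),
        "-t", str(seconds),
        "-c:a", "copy",  # keep the embedded audio untouched
        dst,
    ]

cmd = normalize_cmd("raw/clip_001.mp4", "ohwxperson_dataset_v1/clip_001.mp4")
```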

Prompt format for each clip:

[scene description]. Mouth partially open during speech with only the front teeth partially visible, lips moving naturally without fully exposing all teeth. Smooth continuous motion, cinematic, realistic, sharp focus on subject. The person is talking, and he says: "[transcript]"

Background complexity directly impacts lip sync quality. Simple and dark backgrounds produce the best results. Complex backgrounds with many competing elements reduce lip sync accuracy.

Step 4 -- Prepare the Dataset

Structure your dataset folder as follows:

ohwxperson_dataset_v1/
  clip_001.mp4          # video with embedded audio from LTX-2.3
  clip_002.mp4
  ...
  CAPTIONS.json

Caption format in CAPTIONS.json:

{
  "captions": [
    {
      "file": "clip_001.mp4",
      "caption": "[VISUAL] OHWXPERSON, [visual description of scene, pose, clothing, background]. [SPEECH] OHWXPERSON speaks in a [voice description]: \"[exact transcript]\""
    }
  ]
}

A reference CAPTIONS.json from this project is included in this repository.
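
A quick sanity check before training can catch caption/file mismatches. A minimal sketch assuming the folder layout above (the function name is invented; it only verifies structure, not caption quality):

```python
import json
import os

def check_dataset(root):
    """Verify that every caption entry points at an existing clip and
    contains the [VISUAL]/[SPEECH] markers and the trigger word."""
    with open(os.path.join(root, "CAPTIONS.json")) as f:
        entries = json.load(f)["captions"]
    problems = []
    for e in entries:
        if not os.path.exists(os.path.join(root, e["file"])):
            problems.append(f'{e["file"]}: missing video file')
        for marker in ("[VISUAL]", "[SPEECH]", "OHWXPERSON"):
            if marker not in e["caption"]:
                problems.append(f'{e["file"]}: caption lacks {marker}')
    return problems  # empty list means the dataset passes the check
```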

Step 5 -- Train with the LTX-2 trainer

Recommended training configuration:

model:
  model_path: ltx-2.3-22b-dev.safetensors
  text_encoder_path: gemma
  training_mode: lora

lora:
  rank: 32
  alpha: 32
  target_modules: [to_k, to_q, to_v, to_out.0]

training_strategy:
  name: text_to_video
  with_audio: true
  first_frame_conditioning_p: 0.5

optimization:
  steps: 2000
  learning_rate: 1.0e-04
  batch_size: 1
  gradient_accumulation_steps: 1
  optimizer_type: adamw
  scheduler_type: linear
  mixed_precision_mode: bf16
  enable_gradient_checkpointing: true

validation:
  interval: 250
  inference_steps: 30
  guidance_scale: 4.0
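
With rank 32 and alpha 32 the LoRA update is applied at unit scale, following the conventional alpha/rank scaling used by most LoRA implementations. A one-line check:

```python
rank, alpha = 32, 32
scale = alpha / rank  # LoRA outputs are scaled by alpha/rank; 1.0 here
```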

Training Details

Parameter         Value
Base model        LTX-Video 2.3 22B
Training mode     LoRA
LoRA rank         32
LoRA alpha        32
Steps             2000
Learning rate     1e-4
Batch size        1
Mixed precision   bf16
Dataset size      26 clips
Peak VRAM usage   77.08 GB
Training time     ~7.8 hours
Training cost     ~$5.33 (GCP Spot G4 instance, RTX PRO 6000 96GB)
Identity lock     Step 1250
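
For context, these figures imply roughly 77 passes over the 26-clip dataset. A small sanity-check calculation (epoch count assumes one clip per step at batch size 1):

```python
steps = 2000
batch_size = 1
dataset_clips = 26
epochs = steps * batch_size / dataset_clips  # ~76.9 passes over the data

hours = 7.8
cost_usd = 5.33
hourly_rate = cost_usd / hours  # ~$0.68/hr spot pricing
```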

Known Limitations (v1)

  • Slight audio buzz artifacts present in outputs
  • Eye blinking occasionally inconsistent (can be fixed by manual prompting)
  • Output quality is seed dependent -- sweep 3-5 seeds per generation
  • Character-specific weights -- lip sync and voice are tied to the trained character
  • Best results at 1280x736 @ 24fps

v2 Roadmap

  • Audio preprocessing with MelBand Roformer before training to eliminate buzz artifacts
  • Explicit eye blinking captions and dedicated blinking clips in dataset
  • Extended training to 2500-3000 steps
  • Larger and more diverse dataset

Files

File                                              Description
LTX-2.3-22b-AV-LoRA-talking-head-v1.safetensors   Final trained LoRA weights (v1)
CAPTIONS.json                                     Reference caption file for dataset structure
ohwxperson_av_lora.yaml                           Full training configuration
flux_kontext_clownsharkextended.json              Flux Kontext workflow for generating reference images
LTX-2-3-I2V.json                                  LTX-Video 2.3 image-to-video workflow
LTX-2-3-I2V-Custom-Audio.json                     LTX-Video 2.3 image + custom audio to video workflow

Citation

If you use this model or methodology in your work, please credit this repository.


License

The LoRA weights are released for research and personal use. Commercial use requires separate permission.
