elix3r/LTX-2.3-22b-AV-LoRA-talking-head

Overview

This is the first community audio-visual (AV) LoRA for LTX-Video 2.3, trained using the joint audio-video cross-attention architecture of the LTX-2.3 22B model. The LoRA enables talking head video generation with synchronized lip sync and internalized voice characteristics from a reference character.

This release is a character-specific implementation and reference pipeline. The weights demonstrate a working AV LoRA trained on a custom dataset. The methodology, dataset structure, caption format, and training config are fully documented and reusable for training your own character-specific AV LoRA.


What It Does

  • Generates talking head videos with synchronized lip sync from a reference image
  • Internalizes voice characteristics without requiring external audio input at inference time
  • Preserves character identity across unseen reference images and backgrounds

Demo Results (v1)

  • Lip sync: accurate and consistent
  • Identity preservation: locks in around step 1250 and continues improving through step 2000
  • Voice characteristics: internalized from training data
  • Known limitations: slight audio buzz artifacts, occasional eye blinking inconsistency, seed-dependent output quality

How To Use

Requirements

  • ComfyUI workflow (example workflows included in this repository)
  • LTX-2.3 Model
  • Power Lora Loader node

Loading the LoRA

Load LTX-2.3-22b-AV-LoRA-talking-head-v1.safetensors via the Power Lora Loader node in ComfyUI.

Set LoRA strength to 1.0.

Recommended Inference Settings

Parameter       Value
Resolution      1280x736
FPS             24
Video length    Any (10+ seconds recommended)
LoRA strength   1.0
Trigger word    OHWXPERSON
CFG scale       1.0

Note: 1280x736 @ 24fps is recommended for image-to-video inference. For image + audio to video inference, use 1280x704 @ 25fps to match the training distribution.

Prompt Format

Include the trigger word OHWXPERSON and end the prompt with the speech transcript:

OHWXPERSON, [visual description]. The person is talking, and he says: "[transcript]"
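
The format above can be assembled programmatically. A minimal sketch (the helper name and the example visual description and transcript are invented for illustration):

```python
TRIGGER = "OHWXPERSON"

def build_prompt(visual_description: str, transcript: str) -> str:
    """Compose an inference prompt in the documented format:
    trigger word, visual description, then the speech transcript."""
    return (
        f'{TRIGGER}, {visual_description}. '
        f'The person is talking, and he says: "{transcript}"'
    )

prompt = build_prompt(
    "a man in a dark room, soft key light, facing the camera",
    "Welcome to the channel.",
)
```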

Training Your Own AV LoRA

This section documents the full pipeline so you can train a character-specific AV LoRA for your own subject.

Pipeline Overview

Reference Images
      |
      v
Flux.1 Kontext / Flux.2 Klein     -- Image generation
      |
      v
Fish Audio S2 Pro                 -- Voice cloning + TTS
      |
      v
LTX-Video 2.3                     -- Talking head video generation
      |
      v
LTX-2 trainer                     -- AV LoRA training
      |
      v
Trained AV LoRA weights

Step 1 -- Generate Reference Images

Use Flux Kontext in ComfyUI to generate consistent reference images of your character across varied poses, angles, lighting conditions, and expressions.

[KONTEXT WORKFLOW]

Key settings used in this project:

  • Flux Kontext dev Q6_K GGUF
  • Sampler: res_3s + res_2m (RES4LYF)
  • FluxGuidance: 1
  • denoise: 1

Step 2 -- Clone the Voice

Use the Fish Audio S2 Pro model with a 10-15 second reference audio clip of your target voice. The model supports [pause], [short pause], and [emphasis] tags for pacing control.

Generate TTS audio for each clip's script using the cloned voice.

Step 3 -- Generate Training Clips

Use LTX-2.3 in ComfyUI to generate talking head clips from your reference images.

[LTX-2.3 IMAGE + AUDIO TO VIDEO WORKFLOW]

Dataset requirements:

  • 25-30 clips minimum
  • Resolution: 1280x704
  • FPS: 25
  • Length: 6-10 seconds per clip after trimming
  • Variety: front facing, 3/4 angles, side profile, different backgrounds, multiple emotions
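
If your source clips do not already match these specs, they can be normalized with ffmpeg. A minimal sketch that builds the command arguments (paths are hypothetical; ffmpeg must be on PATH to actually run the command, e.g. via subprocess.run):

```python
def normalize_cmd(src, dst, width=1280, height=704, fps=25, seconds=8):
    """Build ffmpeg arguments that scale a clip to the training
    resolution, resample to the training frame rate, trim to the
    target length, and copy the embedded audio track unchanged."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale={width}:{height}",
        "-r", str(fps),
        "-t", str(seconds),
        "-c:a", "copy",  # keep the embedded audio untouched
        dst,
    ]

cmd = normalize_cmd("raw/clip_001.mp4", "ohwxperson_dataset_v1/clip_001.mp4")
```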

Prompt format for each clip:

[scene description]. Mouth partially open during speech with only the front teeth partially visible, lips moving naturally without fully exposing all teeth. Smooth continuous motion, cinematic, realistic, sharp focus on subject. The person is talking, and he says: "[transcript]"

Background complexity directly impacts lip sync quality. Simple and dark backgrounds produce the best results. Complex backgrounds with many competing elements reduce lip sync accuracy.

Step 4 -- Prepare the Dataset

Structure your dataset folder as follows:

ohwxperson_dataset_v1/
  clip_001.mp4          # video with embedded audio from LTX-2.3
  clip_002.mp4
  ...
  CAPTIONS.json

Caption format in CAPTIONS.json:

{
  "captions": [
    {
      "file": "clip_001.mp4",
      "caption": "[VISUAL] OHWXPERSON, [visual description of scene, pose, clothing, background]. [SPEECH] OHWXPERSON speaks in a [voice description]: \"[exact transcript]\""
    }
  ]
}

A reference CAPTIONS.json from this project is included in this repository.
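
A quick sanity check before training can catch caption/file mismatches. A minimal sketch assuming the folder layout above (the function name is invented; it only verifies structure, not caption quality):

```python
import json
import os

def check_dataset(root):
    """Verify that every caption entry points at an existing clip and
    contains the [VISUAL]/[SPEECH] markers and the trigger word."""
    with open(os.path.join(root, "CAPTIONS.json")) as f:
        entries = json.load(f)["captions"]
    problems = []
    for e in entries:
        if not os.path.exists(os.path.join(root, e["file"])):
            problems.append(f'{e["file"]}: missing video file')
        for marker in ("[VISUAL]", "[SPEECH]", "OHWXPERSON"):
            if marker not in e["caption"]:
                problems.append(f'{e["file"]}: caption lacks {marker}')
    return problems  # empty list means the dataset passes the check
```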

Step 5 -- Train with the LTX-2 trainer

Recommended training configuration:

model:
  model_path: ltx-2.3-22b-dev.safetensors
  text_encoder_path: gemma
  training_mode: lora

lora:
  rank: 32
  alpha: 32
  target_modules: [to_k, to_q, to_v, to_out.0]

training_strategy:
  name: text_to_video
  with_audio: true
  first_frame_conditioning_p: 0.5

optimization:
  steps: 2000
  learning_rate: 1.0e-04
  batch_size: 1
  gradient_accumulation_steps: 1
  optimizer_type: adamw
  scheduler_type: linear
  mixed_precision_mode: bf16
  enable_gradient_checkpointing: true

validation:
  interval: 250
  inference_steps: 30
  guidance_scale: 4.0
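
With rank 32 and alpha 32 the LoRA update is applied at unit scale, following the conventional alpha/rank scaling used by most LoRA implementations. A one-line check:

```python
rank, alpha = 32, 32
scale = alpha / rank  # LoRA outputs are scaled by alpha/rank; 1.0 here
```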

Training Details

Parameter         Value
Base model        LTX-Video 2.3 22B
Training mode     LoRA
LoRA rank         32
LoRA alpha        32
Steps             2000
Learning rate     1e-4
Batch size        1
Mixed precision   bf16
Dataset size      26 clips
Peak VRAM usage   77.08 GB
Training time     ~7.8 hours
Training cost     ~$5.33 (GCP Spot G4 instance, RTX PRO 6000 96GB)
Identity lock     Step 1250
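
For context, these figures imply roughly 77 passes over the 26-clip dataset. A small sanity-check calculation (epoch count assumes one clip per step at batch size 1):

```python
steps = 2000
batch_size = 1
dataset_clips = 26
epochs = steps * batch_size / dataset_clips  # ~76.9 passes over the data

hours = 7.8
cost_usd = 5.33
hourly_rate = cost_usd / hours  # ~$0.68/hr spot pricing
```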

Known Limitations (v1)

  • Slight audio buzz artifacts present in outputs
  • Eye blinking occasionally inconsistent (can be fixed by manual prompting)
  • Output quality is seed dependent -- sweep 3-5 seeds per generation
  • Character-specific weights -- lip sync and voice are tied to the trained character
  • Best results at 1280x736 @ 24fps

v2 Roadmap

  • Audio preprocessing with MelBand Roformer before training to eliminate buzz artifacts
  • Explicit eye blinking captions and dedicated blinking clips in dataset
  • Extended training to 2500-3000 steps
  • Larger and more diverse dataset

Files

File                                              Description
LTX-2.3-22b-AV-LoRA-talking-head-v1.safetensors   Final trained LoRA weights (v1)
CAPTIONS.json                                     Reference caption file for dataset structure
ohwxperson_av_lora.yaml                           Full training configuration
flux_kontext_clownsharkextended.json              Flux Kontext workflow for generating reference images
LTX-2-3-I2V.json                                  LTX-Video 2.3 image-to-video workflow
LTX-2-3-I2V-Custom-Audio.json                     LTX-Video 2.3 image + custom audio to video workflow

Citation

If you use this model or methodology in your work, please credit this repository.


License

The LoRA weights are released for research and personal use. Commercial use requires separate permission.
