---
license: mit
tags:
- sign-language
- diffusion
- text-to-video
- asl
- how2sign
- lightweight
metrics:
- fvd
---

# Text2Sign: Lightweight Diffusion Model for Sign Language Video Generation

This repository contains the pretrained checkpoint and inference code for Text2Sign, a lightweight diffusion-based architecture for generating sign language videos from text prompts.

## Model Overview

- **Architecture:** 3D UNet backbone with DiT (Diffusion Transformer) blocks and a custom Transformer-based text encoder.
- **Dataset:** trained on How2Sign (ASL) video-text pairs.
- **Resolution:** 64x64 RGB, 16 frames per clip.
- **Checkpoint:** provided at epoch 70.

## Files

- `checkpoint_epoch_70.pt` — pretrained model weights
- `config.py` — model and generation configuration
- `inference.py` — example script for generating sign language videos from text

## Usage

1. Install the dependencies:

   ```bash
   pip install torch torchvision pillow matplotlib
   ```

2. Run the inference script:

   ```bash
   python inference.py --prompt "Hello world"
   ```

This generates a video for the given prompt and saves a filmstrip image of the frames.

## License

MIT
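As a rough illustration of the filmstrip output mentioned above, the sketch below tiles a clip's frames side by side with PIL. The tensor shape `(16, 3, 64, 64)` follows the resolution stated in the model overview; the `save_filmstrip` helper and the random placeholder frames are assumptions for illustration, not part of `inference.py`.

```python
import torch
from PIL import Image

def save_filmstrip(frames: torch.Tensor, path: str) -> None:
    """Tile a (T, C, H, W) float tensor in [0, 1] into one horizontal strip."""
    t, c, h, w = frames.shape
    strip = Image.new("RGB", (w * t, h))
    for i, frame in enumerate(frames):
        # Convert one frame to an (H, W, C) uint8 array and paste it at column i.
        arr = (frame.clamp(0, 1) * 255).byte().permute(1, 2, 0).numpy()
        strip.paste(Image.fromarray(arr), (i * w, 0))
    strip.save(path)

# Placeholder frames standing in for the model's sampled clip.
frames = torch.rand(16, 3, 64, 64)
save_filmstrip(frames, "filmstrip.png")
```

In practice you would pass the denoised frames sampled by the model instead of the random tensor.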