---
license: mit
tags:
- sign-language
- diffusion
- text-to-video
- asl
- how2sign
- lightweight
metrics:
- fvd
---

# Text2Sign: Lightweight Diffusion Model for Sign Language Video Generation

This repository contains the pretrained checkpoint and inference code for Text2Sign, a lightweight diffusion-based architecture for generating sign language videos from text prompts.

## Model Overview

- **Architecture:** 3D UNet backbone with DiT (Diffusion Transformer) blocks and a custom Transformer-based text encoder.
- **Dataset:** trained on How2Sign (ASL) video-text pairs.
- **Resolution:** 64x64 RGB, 16 frames per clip.
- **Checkpoint:** provided at epoch 70.

## Files

- `checkpoint_epoch_70.pt` — pretrained model weights
- `config.py` — model and generation configuration
- `inference.py` — example script for generating sign language videos from text

## Usage

1. Install the dependencies:

   ```bash
   pip install torch torchvision pillow matplotlib
   ```

2. Run the inference script:

   ```bash
   python inference.py --prompt "Hello world"
   ```

This generates a video for the given prompt and saves a filmstrip image of the frames.

## License

MIT
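As a rough illustration of the filmstrip output mentioned above, the sketch below tiles a clip's frames side by side with PIL. The tensor shape `(16, 3, 64, 64)` follows the resolution stated in the model overview; the `save_filmstrip` helper and the random placeholder frames are assumptions for illustration, not part of `inference.py`.

```python
import torch
from PIL import Image

def save_filmstrip(frames: torch.Tensor, path: str) -> None:
    """Tile a (T, C, H, W) float tensor in [0, 1] into one horizontal strip."""
    t, c, h, w = frames.shape
    strip = Image.new("RGB", (w * t, h))
    for i, frame in enumerate(frames):
        # Convert one frame to an (H, W, C) uint8 array and paste it at column i.
        arr = (frame.clamp(0, 1) * 255).byte().permute(1, 2, 0).numpy()
        strip.paste(Image.fromarray(arr), (i * w, 0))
    strip.save(path)

# Placeholder frames standing in for the model's sampled clip.
frames = torch.rand(16, 3, 64, 64)
save_filmstrip(frames, "filmstrip.png")
```

In practice you would pass the denoised frames sampled by the model instead of the random tensor.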