DCGAN Text-to-Image Generation

A Deep Convolutional Generative Adversarial Network (DCGAN) implementation for text-to-image generation using GloVe embeddings and the Microsoft COCO dataset.

🎯 Features

  • Text-to-Image Generation: Generate images from text descriptions using DCGAN architecture
  • GloVe Embeddings: 300-dimensional word embeddings projected to a 1024-dimensional space
  • COCO Dataset: Trained on image-caption pairs from the Microsoft COCO dataset
  • Comprehensive Evaluation: FID, IS, Text-Image Matching, and CLIP scores
  • Automatic Checkpointing: Save models every 10 epochs with resume capability
  • Visualization Tools: Generate GIFs, loss plots, and evaluation comparisons

📋 Requirements

  • torch==2.0.0
  • torchvision
  • numpy==1.21.5
  • Pillow==10.0.0
  • matplotlib
  • imageio
  • scipy
  • h5py==3.6.0

🗂️ Dataset

Microsoft COCO Dataset:

  • Training images: train2014/
  • Validation images: val2014/
  • Captions: captions_train2014.json, captions_val2014.json
  • GloVe Embeddings: glove.6B.300d.txt (300-dimensional word vectors)
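
The GloVe file is plain text with one token per line followed by its vector components. A minimal sketch of a loader is shown below; the function name `load_glove` and the tiny in-memory sample are illustrative assumptions, not code from the notebook.

```python
import io
from typing import Dict, List, TextIO


def load_glove(handle: TextIO, dim: int = 300) -> Dict[str, List[float]]:
    """Parse GloVe text format: a token followed by `dim` space-separated floats."""
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny 3-dimensional sample so the sketch runs without the real file.
sample = io.StringIO("red 0.1 0.2 0.3\ncar 0.4 0.5 0.6\n")
glove = load_glove(sample, dim=3)
print(sorted(glove))
```

With the real file, the same function would be called as `load_glove(open("glove.6B.300d.txt", encoding="utf-8"))` with the default `dim=300`.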

🏗️ Repository Structure

├── models/
│   ├── dcgan_model.py          # Generator and Discriminator architectures
│   ├── char_cnn_rnn_model.py   # Text processing models
│   └── net_modules/            # Network components
├── saved_models/               # Trained model checkpoints
│   ├── generator_final.pth
│   ├── discriminator_final.pth
│   └── checkpoint_epoch_*.pth
├── generated_images/           # Generated images and visualizations
│   ├── output_*.png
│   ├── output_gif_*.gif
│   └── evaluation_*.png
├── utils.py                    # Utility functions
├── data_util.py                # Data processing utilities
├── requirements.txt            # Python dependencies
├── glove.6B.300d.txt           # GloVe word embeddings
└── DCGAN_Text2Image.ipynb      # Main training notebook

🚀 Quick Start

  1. Install Dependencies:

    pip install -r requirements.txt
    
  2. Download GloVe Embeddings:

    • Place glove.6B.300d.txt in the project root

  3. Prepare COCO Dataset:

    • Download the COCO 2014 dataset
    • Update the paths in the notebook's data loading section

  4. Run Training:

    # Execute cells in DCGAN_Text2Image.ipynb
    # Training will run for 70 epochs with automatic checkpointing
    

📊 Training Configuration

  • Image Size: 64x64 pixels
  • Batch Size: 512
  • Epochs: 70
  • Learning Rate: 0.0002
  • Noise Dimension: 100
  • Embedding Dimension: 1024 (projected from GloVe 300d)
  • Optimizer: Adam (β1=0.5, β2=0.999)
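
As a rough sketch, the settings above could be collected in one place and passed to PyTorch's Adam optimizer. The `config` dictionary and the placeholder `nn.Linear` networks are assumptions made so the snippet runs on its own; the real networks live in models/dcgan_model.py.

```python
import torch
from torch import nn

# Values copied from the configuration list above.
config = {
    "image_size": 64,
    "batch_size": 512,
    "epochs": 70,
    "lr": 2e-4,
    "noise_dim": 100,
    "embed_dim": 1024,
    "betas": (0.5, 0.999),
}

# Placeholder networks (hypothetical) standing in for the real Generator and Discriminator.
generator = nn.Linear(config["noise_dim"] + config["embed_dim"], 8)
discriminator = nn.Linear(8, 1)

# One Adam optimizer per network, both with the same learning rate and betas.
optimizer_G = torch.optim.Adam(generator.parameters(), lr=config["lr"], betas=config["betas"])
optimizer_D = torch.optim.Adam(discriminator.parameters(), lr=config["lr"], betas=config["betas"])
```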

📈 Results

Training Performance

  • Total Training Time: ~24 hours (70 epochs)
  • Checkpointing: Every 10 epochs
  • Loss Tracking: Generator and Discriminator losses monitored

Evaluation Metrics

  • FID (Fréchet Inception Distance, lower is better): 322.30
  • IS (Inception Score, higher is better): 0.53
  • Text-Image Matching: -0.15 ± 0.40
  • CLIP Score: 0.70
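
For reference, the textbook definitions of the first two metrics are given below. Here μ_r, Σ_r and μ_g, Σ_g are the mean and covariance of Inception features for real and generated images, and p(y|x) is the Inception class posterior for a generated image x. Whether the notebook computes exactly these forms is an assumption.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

\mathrm{IS} = \exp\!\left( \mathbb{E}_{x \sim p_g}\,
  D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \right)
```

Because the KL divergence is non-negative, this form of IS is never below 1, so the reported 0.53 presumably reflects a different normalization in the notebook; the source does not say which.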

Generated Outputs

  • Sample Images: 20+ generated images per run
  • Training GIFs: Animated progress visualization
  • Loss Plots: Generator and Discriminator loss curves
  • Evaluation Comparisons: Real vs Generated image grids

🔧 Model Architecture

Generator

  • Input: Noise vector (100d) + Text embedding (1024d)
  • Output: RGB image (3×64×64)
  • Architecture: Transposed convolutions with batch normalization
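
A minimal sketch of a generator matching this description follows. The channel widths, the Tanh output, and the choice to concatenate the text embedding directly onto the noise vector are assumptions based on the standard DCGAN recipe, not details confirmed by the repository.

```python
import torch
from torch import nn


class Generator(nn.Module):
    def __init__(self, noise_dim: int = 100, embed_dim: int = 1024, base: int = 64):
        super().__init__()

        def up(c_in, c_out, stride, pad):
            # Transposed convolution + batch norm + ReLU, one upsampling stage.
            return [
                nn.ConvTranspose2d(c_in, c_out, 4, stride, pad, bias=False),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            ]

        self.net = nn.Sequential(
            *up(noise_dim + embed_dim, base * 8, 1, 0),        # 1x1 -> 4x4
            *up(base * 8, base * 4, 2, 1),                     # 4x4 -> 8x8
            *up(base * 4, base * 2, 2, 1),                     # 8x8 -> 16x16
            *up(base * 2, base, 2, 1),                         # 16x16 -> 32x32
            nn.ConvTranspose2d(base, 3, 4, 2, 1, bias=False),  # 32x32 -> 64x64
            nn.Tanh(),                                         # pixel values in [-1, 1]
        )

    def forward(self, noise, text_embedding):
        # noise: (B, noise_dim, 1, 1); text_embedding: (B, embed_dim)
        cond = text_embedding.view(text_embedding.size(0), -1, 1, 1)
        return self.net(torch.cat([noise, cond], dim=1))


g = Generator()
out = g(torch.randn(2, 100, 1, 1), torch.randn(2, 1024))
print(out.shape)  # torch.Size([2, 3, 64, 64])
```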

Discriminator

  • Input: RGB image (3×64×64) + Text embedding (1024d)
  • Output: Binary classification (real/fake)
  • Architecture: Convolutional layers with leaky ReLU
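
The discriminator can be sketched in the same spirit. The layer widths, the batch normalization, the Sigmoid output, and the way the text embedding is tiled over the 4×4 feature map are assumptions borrowed from common text-conditioned DCGAN designs; the actual code in models/dcgan_model.py may differ.

```python
import torch
from torch import nn


class Discriminator(nn.Module):
    def __init__(self, embed_dim: int = 1024, base: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, base, 4, 2, 1, bias=False),             # 64x64 -> 32x32
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, 2, 1, bias=False),      # 32x32 -> 16x16
            nn.BatchNorm2d(base * 2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 2, base * 4, 4, 2, 1, bias=False),  # 16x16 -> 8x8
            nn.BatchNorm2d(base * 4),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 4, base * 8, 4, 2, 1, bias=False),  # 8x8 -> 4x4
            nn.BatchNorm2d(base * 8),
            nn.LeakyReLU(0.2, inplace=True),
        )
        # Score the image features jointly with the tiled text embedding.
        self.classify = nn.Sequential(
            nn.Conv2d(base * 8 + embed_dim, 1, 4, 1, 0, bias=False),  # 4x4 -> 1x1
            nn.Sigmoid(),
        )

    def forward(self, image, text_embedding):
        feats = self.features(image)  # (B, base*8, 4, 4)
        cond = text_embedding.view(text_embedding.size(0), -1, 1, 1).expand(-1, -1, 4, 4)
        return self.classify(torch.cat([feats, cond], dim=1)).view(-1)  # (B,)


d = Discriminator()
prob = d(torch.randn(2, 3, 64, 64), torch.randn(2, 1024))
print(prob.shape)  # torch.Size([2])
```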

Text Processing

  • GloVe Embeddings: 300-dimensional word vectors
  • Projection Layer: Linear layer (300d → 1024d)
  • Caption Processing: Average word embeddings for sentence representation
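
The three steps above can be sketched as follows. The name `caption_to_embedding` comes from the usage section of this README, but this body is a hypothetical reconstruction: the toy vocabulary uses random vectors in place of real GloVe entries, and the handling of unknown words is an assumption.

```python
import torch
from torch import nn

# Toy 300-d vocabulary standing in for the real GloVe table (random values, illustration only).
torch.manual_seed(0)
glove = {w: torch.randn(300) for w in ["a", "red", "car"]}

projection = nn.Linear(300, 1024)  # 300d -> 1024d projection layer


def caption_to_embedding(caption: str) -> torch.Tensor:
    """Average the GloVe vectors of known words, then project to 1024-d."""
    words = [w for w in caption.lower().split() if w in glove]
    if words:
        mean = torch.stack([glove[w] for w in words]).mean(dim=0)
    else:
        mean = torch.zeros(300)  # assumption: captions with no known words map to zeros
    return projection(mean.unsqueeze(0))  # shape (1, 1024)


emb = caption_to_embedding("a red car")
print(emb.shape)  # torch.Size([1, 1024])
```

Averaging discards word order, so "red car" and "car red" map to the same vector; that is a known limitation of this kind of sentence representation.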

📝 Usage

Generate Images

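import torch

# Load trained model (assumes generator, device, and caption_to_embedding
# are already defined, as in the training notebook)
generator.load_state_dict(torch.load('saved_models/generator_final.pth', map_location=device))
generator.eval()  # use running batch-norm statistics at inference time

# Generate from text
noise = torch.randn(1, 100, 1, 1, device=device)
text_embedding = caption_to_embedding("a red car")
with torch.no_grad():  # no gradients needed for sampling
    generated_image = generator(noise, text_embedding)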

Resume Training

# Load checkpoint
start_epoch, G_losses, D_losses = load_checkpoint(
    'saved_models/checkpoint_epoch_50.pth',
    generator, discriminator, optimizer_G, optimizer_D
)
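
One way `load_checkpoint` and a matching save function might look is sketched below. The call signature of `load_checkpoint` is taken from the snippet above; the dictionary keys, the `save_checkpoint` helper, and the convention of resuming at the epoch after the saved one are assumptions.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(path, epoch, generator, discriminator,
                    optimizer_G, optimizer_D, G_losses, D_losses):
    """Bundle everything needed to resume training into one file."""
    torch.save({
        "epoch": epoch,
        "generator": generator.state_dict(),
        "discriminator": discriminator.state_dict(),
        "optimizer_G": optimizer_G.state_dict(),
        "optimizer_D": optimizer_D.state_dict(),
        "G_losses": G_losses,
        "D_losses": D_losses,
    }, path)


def load_checkpoint(path, generator, discriminator, optimizer_G, optimizer_D):
    """Restore weights and optimizer state in place; return where to resume."""
    ckpt = torch.load(path, map_location="cpu")
    generator.load_state_dict(ckpt["generator"])
    discriminator.load_state_dict(ckpt["discriminator"])
    optimizer_G.load_state_dict(ckpt["optimizer_G"])
    optimizer_D.load_state_dict(ckpt["optimizer_D"])
    return ckpt["epoch"] + 1, ckpt["G_losses"], ckpt["D_losses"]


# Round trip with tiny placeholder networks (hypothetical, for illustration).
G, D = nn.Linear(4, 4), nn.Linear(4, 1)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
path = os.path.join(tempfile.mkdtemp(), "checkpoint_epoch_10.pth")
save_checkpoint(path, 10, G, D, opt_G, opt_D, [1.0], [0.5])
start_epoch, G_losses, D_losses = load_checkpoint(path, G, D, opt_G, opt_D)
print(start_epoch)  # 11
```

Saving the optimizer state alongside the weights matters for Adam, since its moment estimates would otherwise restart from zero after a resume.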

🎯 Future Improvements

  • Better Text Encoders: Implement CLIP or BERT-based text encoders
  • Higher Resolution: Scale to 128×128 or 256×256 images
  • Improved Architecture: Consider StyleGAN or other advanced architectures
  • Better Evaluation: Implement human evaluation and more comprehensive metrics
  • Hyperparameter Tuning: Optimize learning rates, batch sizes, and architecture
