Penumbra UNet: FPGA-Accelerated Mask Optimization

A compressed U-Net neural network for on-chip FPGA acceleration of Inverse Lithography Technology (ILT) mask optimization, targeting the Xilinx VU47P (AWS F2).

Overview

Penumbra UNet compresses a full-size teacher network by 64× (7.8M → 122K parameters) to fit entirely in on-chip BRAM, enabling a fully on-chip dataflow that eliminates external DRAM access.

Architecture

Network Structure

U-Net encoder-decoder with extreme parameter compression:

Encoder:

  • Conv 1→8 channels, 64×64 + MaxPool → 32×32
  • Conv 8→16 channels, 32×32 + MaxPool → 16×16
  • Conv 16→32 channels, 16×16 + MaxPool → 8×8

Bottleneck:

  • Conv 32→64 channels, 8×8

Decoder:

  • Upsample + skip concatenation + Conv 96→32 channels, 16×16
  • Upsample + skip concatenation + Conv 48→16 channels, 32×32
  • Upsample + skip concatenation + Conv 24→8 channels, 64×64

Output:

  • Conv 1×1 + Sigmoid → 64×64
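The structure above can be sketched in PyTorch. The kernel size (3×3), two convolutions per stage, and nearest-neighbor upsampling are assumptions not stated in this card, but with those choices the parameter count works out to ≈122K, consistent with the compression summary below:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convs per stage (assumed); with this choice the total
    # parameter count comes to ~122K, matching the compression summary.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
    )

class PenumbraUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(1, 8)          # 64x64
        self.enc2 = conv_block(8, 16)         # 32x32
        self.enc3 = conv_block(16, 32)        # 16x16
        self.bottleneck = conv_block(32, 64)  # 8x8
        self.dec3 = conv_block(96, 32)        # 64 up + 32 skip = 96
        self.dec2 = conv_block(48, 16)        # 32 up + 16 skip = 48
        self.dec1 = conv_block(24, 8)         # 16 up + 8 skip  = 24
        self.out = nn.Conv2d(8, 1, 1)         # 1x1 output conv
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode='nearest')

    def forward(self, x):
        e1 = self.enc1(x)                    # (B, 8, 64, 64)
        e2 = self.enc2(self.pool(e1))        # (B, 16, 32, 32)
        e3 = self.enc3(self.pool(e2))        # (B, 32, 16, 16)
        b = self.bottleneck(self.pool(e3))   # (B, 64, 8, 8)
        d3 = self.dec3(torch.cat([self.up(b), e3], dim=1))   # (B, 32, 16, 16)
        d2 = self.dec2(torch.cat([self.up(d3), e2], dim=1))  # (B, 16, 32, 32)
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))  # (B, 8, 64, 64)
        return torch.sigmoid(self.out(d1))   # (B, 1, 64, 64)
```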

Compression summary:

Metric         Full model   Penumbra UNet
Parameters     7.8M         122K
Input tile     512×512      64×64
Max channels   512          64

Tiling & Reassembly

Input 512×512 masks are decomposed into a 16×16 grid of 64×64 tiles (256 total):

  • Overlap: 16-pixel reflection padding for boundary handling
  • Usable core: 32×32 center pixels per tile
  • Batch processing: 256 tiles → 4 sequential batches of 64
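The decomposition can be sketched with reflection padding plus a sliding-window extraction. Using `F.unfold` with stride 32 is an implementation assumption; what matters is that each tile's 32×32 core covers the full mask exactly once:

```python
import torch
import torch.nn.functional as F

def tile_mask(mask):
    """Split a (1, 1, 512, 512) mask into 256 overlapping 64x64 tiles.

    16-pixel reflection padding handles boundaries; stride 32 means
    each tile's 32x32 center covers the mask exactly once.
    """
    padded = F.pad(mask, (16, 16, 16, 16), mode='reflect')  # (1, 1, 544, 544)
    # Extract sliding 64x64 patches with stride 32 -> 16x16 = 256 patches
    patches = F.unfold(padded, kernel_size=64, stride=32)   # (1, 4096, 256)
    return patches.transpose(1, 2).reshape(256, 1, 64, 64)
```

Tiles come out in row-major grid order, so tile 0's usable core is the mask's top-left 32×32 block.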

Reassembly uses only differentiable operations (slice, reshape, permute) to enable end-to-end gradient flow:

(256, 1, 64, 64)  [all tiles]
    ↓ center-crop
(256, 1, 32, 32)  [usable cores]
    ↓ reshape + permute
(1, 1, 512, 512)  [full mask]
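The shape flow above can be written with exactly those differentiable operations. A minimal sketch, assuming tiles arrive in row-major grid order:

```python
import torch

def reassemble(tiles):
    """Reassemble 256 tiles (256, 1, 64, 64) into one (1, 1, 512, 512)
    mask using only differentiable ops: slice, reshape, permute."""
    cores = tiles[:, :, 16:48, 16:48]        # (256, 1, 32, 32) usable cores
    grid = cores.reshape(16, 16, 1, 32, 32)  # (row, col, C, h, w)
    grid = grid.permute(2, 0, 3, 1, 4)       # (C, row, h, col, w)
    return grid.reshape(1, 1, 512, 512)      # pixel (row*32+h, col*32+w)
```

Because no operation here detaches or rounds, gradients from a downstream loss on the full mask flow back into every tile.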

Training

Phase 1: Knowledge Distillation

  • Epochs: 16
  • Input: 64×64 crops
  • Loss: α-blended (α decays 0.7→0)
    L = α·MSE(student, teacher) + (1-α)·MSE(student, ground_truth)
    
  • Optimizer: Adam (lr=1e-3), cosine-annealing schedule
  • Teacher: Frozen full-size NeuralILT model
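The α-blended loss above is straightforward to implement. The linear decay shape in `alpha_at` is an assumption (the card only states that α decays 0.7→0 over the 16 epochs):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, target, alpha):
    """Phase 1 loss: L = alpha*MSE(student, teacher)
                       + (1-alpha)*MSE(student, ground_truth)."""
    return (alpha * F.mse_loss(student_out, teacher_out)
            + (1 - alpha) * F.mse_loss(student_out, target))

def alpha_at(epoch, total_epochs=16, alpha0=0.7):
    # Linear decay 0.7 -> 0 is an assumption; the card does not
    # specify the decay schedule's shape.
    return alpha0 * (1 - epoch / total_epochs)
```

As α reaches 0, supervision shifts entirely from the frozen teacher to the ground-truth masks.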

Phase 2: Physics-Informed Fine-Tuning

  • Epochs: 4
  • Pipeline: Full tiled forward pass through differentiable lithography simulator
  • Loss: Print fidelity + process variation
    L = MSE(P_nom, target) + MSE(P_max, P_min)
    
  • Optimizer: Adam (lr=1e-4), StepLR (γ=0.1 at epoch 2)
  • Gradients: Flow through tiled reassembly to all network weights
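The Phase 2 objective combines nominal print fidelity with a process-variation band term. A minimal sketch, where `P_nom`, `P_max`, `P_min` stand for the simulated prints at nominal and extreme process corners:

```python
import torch
import torch.nn.functional as F

def physics_loss(P_nom, P_max, P_min, target):
    """Phase 2 loss: L = MSE(P_nom, target) + MSE(P_max, P_min).

    The second term penalizes the spread between extreme process
    corners, shrinking the process-variation band.
    """
    return F.mse_loss(P_nom, target) + F.mse_loss(P_max, P_min)
```

Since the lithography simulator is differentiable, this loss backpropagates through the tiled reassembly to all network weights.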

Code Organization

hls4ml_penumbra/
├── firmware/           # Generated HLS C++ project
│   ├── myproject.cpp   # Top-level module
│   ├── myproject.h     # Interface & config
│   ├── weights/        # Quantized weights
│   ├── ap_types/       # Xilinx AP types (ap_fixed, ap_int)
│   └── utils/          # HLS utilities
├── myproject_prj/      # Vivado HLS project
│   └── solution1/
│       └── impl/       # Implementation artifacts
├── logs/               # Build logs
└── [HLS build outputs]

Author: Roberto Treviño Cervantes
