license: mit
base_model: microsoft/Lens
pipeline_tag: text-to-image
tags:
  - lens
  - text-to-image
  - sdnq
  - uint4
  - static-quantization
  - ablation
  - model-cpu-offload

Lens SDNQ uint4 static

This is a corrected SDNQ static UINT4 quantized variant of microsoft/Lens.

The recipe follows the Lens-Turbo ablation result: all-linear UINT4 quantization can introduce periodic grid artifacts and severe text degradation when transformer modulation linears are quantized. This checkpoint keeps *.img_mod.* and *.txt_mod.* in bfloat16 and quantizes the rest of the denoising transformer with SDNQ UINT4.
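The skip rule above can be illustrated with a minimal sketch based on Python's `fnmatch`. The helper `keeps_bfloat16` and the parameter names are hypothetical, chosen only to show how the two glob patterns separate modulation linears from other layers. They are not taken from the Lens checkpoint, and SDNQ's own matching logic may differ.

```python
from fnmatch import fnmatchcase

# Skip patterns from this model card: modulation linears stay in bfloat16.
SKIP_PATTERNS = ("*.img_mod.*", "*.txt_mod.*")


def keeps_bfloat16(param_name: str) -> bool:
    """Return True if the name matches a skip pattern (left unquantized)."""
    return any(fnmatchcase(param_name, pattern) for pattern in SKIP_PATTERNS)


# Hypothetical parameter names, for illustration only.
names = [
    "blocks.0.img_mod.lin.weight",
    "blocks.0.txt_mod.lin.weight",
    "blocks.0.attn.to_q.weight",
]

for name in names:
    target = "bfloat16" if keeps_bfloat16(name) else "uint4"
    print(f"{name} -> {target}")
```

Case-sensitive matching (`fnmatchcase`) is used so the result does not depend on the operating system.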

Visual Comparison

Full-size comparison grid: the image below is built from native 1440x1440 samples without resampling the image cells and saved as WebP quality 98. Raw file: assets/comparison/comparison_grid_1to1_q98.webp.

Original vs SDNQ comparison grid

Quantization Recipe

| Field | Value |
| --- | --- |
| Method | SDNQ uint4 static |
| Source model | microsoft/Lens |
| Quantized component | Denoising transformer |
| Text encoder | Unchanged upstream GPT-OSS text encoder |
| VAE | Unchanged upstream VAE |
| weights_dtype | uint4 |
| quantized_matmul_dtype | int8 |
| use_quantized_matmul | true |
| group_size | 0 |
| dequantize_fp32 | false |
| Critical skip rule | `*.img_mod.*`, `*.txt_mod.*` kept in bfloat16 |

Usage

Run from the cloned microsoft/Lens repo root so the custom Lens classes are registered.

import os

import torch
from huggingface_hub import snapshot_download
from lens import LensPipeline, LensTransformer2DModel
from sdnq import load_sdnq_model

# Download the quantized checkpoint, or reuse the local Hugging Face cache.
model_dir = snapshot_download("WaveCut/Lens-SDNQ-uint4-static")

# Load the SDNQ-quantized denoising transformer directly onto the GPU.
transformer = load_sdnq_model(
    os.path.join(model_dir, "transformer"),
    model_cls=LensTransformer2DModel,
    dtype=torch.bfloat16,
    device=torch.device("cuda"),
    dequantize_fp32=False,
    use_quantized_matmul=True,
)

# Build the pipeline around the quantized transformer.
pipe = LensPipeline.from_pretrained(
    model_dir,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")

# Generate one 1:1 sample at base resolution 1440 with a fixed seed.
image = pipe(
    prompt="A cat holding a sign that says hello world",
    base_resolution=1440,
    aspect_ratio="1:1",
    num_inference_steps=20,
    guidance_scale=5.0,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]

Benchmark

Hardware: RunPod NVIDIA H100 80GB HBM3 (H100 SXM), PyTorch 2.8.0 CUDA 12.8 container, local container disk only. Benchmark date: 2026-05-24. Generation settings: base_resolution=1440, aspect_ratio="1:1", num_inference_steps=20, guidance_scale=5.0.

| Metric | Original Lens | SDNQ uint4 static |
| --- | --- | --- |
| Load time, seconds | 17.641 | 14.605 |
| Load peak allocated VRAM, GB | 22.342 | 18.446 |
| Load peak reserved VRAM, GB | 22.471 | 18.516 |
| Transformer tensor storage footprint, GB | 16.417 | 4.301 |
| Transformer storage reduction vs original | baseline | 73.8% smaller |
| Average prompt runtime, seconds | 15.357 | 17.937 |
| Median prompt runtime, seconds | 14.346 | 17.813 |
| Average generation peak allocated VRAM, GB | 27.462 | 23.533 |
| Max generation peak allocated VRAM, GB | 27.467 | 23.538 |

Transformer-only footprint is computed from safetensors tensor storage for the denoising transformer parameter tensors only; it excludes allocator overhead and non-transformer components. The original transformer tensors are F32; the corrected SDNQ transformer stores quantized tensors as U8 plus the excluded modulation layers as BF16.
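As a quick check on the storage figures, the sketch below recomputes the reported reduction from the two footprints and compares it with an idealized bound. The bound assumes every weight goes from 32 bits to 4 bits and nothing else is stored, which is a simplification introduced here, not a claim from the benchmark. The helper name `reduction_percent` is hypothetical.

```python
# Footprints in GB, copied from the benchmark table in this card.
ORIGINAL_GB = 16.417   # F32 transformer tensors
QUANTIZED_GB = 4.301   # U8-packed quantized tensors plus BF16 modulation layers


def reduction_percent(original: float, reduced: float) -> float:
    """Percentage by which `reduced` is smaller than `original`."""
    return (1.0 - reduced / original) * 100.0


measured = reduction_percent(ORIGINAL_GB, QUANTIZED_GB)

# Idealized bound (assumption): 32 bits per weight down to 4 bits, no overhead.
ideal = reduction_percent(32, 4)

print(f"measured reduction: {measured:.1f}%")
print(f"idealized 32-bit to 4-bit reduction: {ideal:.1f}%")
```

The measured figure falls short of the idealized bound, which is consistent with the modulation layers being stored in BF16. This card does not break down any other contributors to the gap.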

Model CPU Offload Benchmark

Same 10 prompts, using pipe.enable_model_cpu_offload(). The reported load time uses a warm local Hugging Face cache on the container disk, so model download time is excluded. Each model was measured in a fresh Python process. Cold generation is P01, the first generation immediately after load/offload setup; warm generation aggregates P02-P10.

| Metric | Original Lens | SDNQ uint4 static |
| --- | --- | --- |
| Offload setup/load time, seconds | 15.510 | 13.573 |
| Offload setup peak allocated VRAM, GB | 12.582 | 12.582 |
| Offload setup peak reserved VRAM, GB | 13.881 | 13.881 |
| Cold generation time, seconds | 26.217 | 21.473 |
| Cold generation peak allocated VRAM, GB | 19.274 | 15.479 |
| Cold generation peak reserved VRAM, GB | 19.608 | 19.126 |
| Warm generation average time, seconds | 18.123 | 17.650 |
| Warm generation median time, seconds | 17.178 | 17.519 |
| Warm generation average peak allocated VRAM, GB | 19.271 | 15.480 |
| Warm generation average peak reserved VRAM, GB | 19.630 | 18.965 |
| Warm generation max peak allocated VRAM, GB | 19.276 | 15.482 |
| Warm generation max peak reserved VRAM, GB | 19.803 | 19.210 |
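The cold/warm split used in this section can be sketched as a small aggregation helper. It is a minimal illustration, assuming the per-prompt timings are available as an ordered list with P01 first. The function name `split_cold_warm` and the result keys are hypothetical and do not reflect the schema of the raw metrics files.

```python
from statistics import mean, median


def split_cold_warm(times_s):
    """Split ordered per-prompt timings into the cold run and warm aggregates.

    The first entry is treated as the cold generation (P01); all remaining
    entries are aggregated as warm generations (P02 onward).
    """
    if len(times_s) < 2:
        raise ValueError("need one cold run and at least one warm run")
    cold, warm = times_s[0], times_s[1:]
    return {
        "cold_s": cold,
        "warm_mean_s": mean(warm),
        "warm_median_s": median(warm),
        "warm_max_s": max(warm),
    }
```

Reporting the first run separately keeps one-time costs that occur right after load/offload setup from skewing the warm averages.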

Raw metrics: benchmark_metrics.json, comparison_matrix.json, model_cpu_offload_benchmark.json, sdnq_quantization_summary.json.

10-Prompt Matrix

| ID | Scenario | Seed | Original time, s | Quant time, s | Delta | Original peak allocated VRAM, GB | Quant peak allocated VRAM, GB |
| --- | --- | --- | --- | --- | --- | --- | --- |
| P01 | Midnight Library Weather Station | 301 | 19.518 | 24.105 | +23.5% | 27.461 | 23.532 |
| P02 | Desert Observatory Treaty Room | 302 | 16.986 | 18.565 | +9.3% | 27.461 | 23.532 |
| P03 | Arctic Submarine Greenhouse | 303 | 13.903 | 17.882 | +28.6% | 27.461 | 23.532 |
| P04 | Long English Museum Labels | 304 | 14.439 | 17.848 | +23.6% | 27.461 | 23.533 |
| P05 | Tokyo Rooftop Repair Diner | 305 | 14.253 | 17.779 | +24.7% | 27.461 | 23.532 |
| P06 | Russian Provincial Print Shop | 306 | 16.744 | 18.136 | +8.3% | 27.467 | 23.538 |
| P07 | Ocean Cartography Bakery | 307 | 13.918 | 14.828 | +6.5% | 27.461 | 23.532 |
| P08 | Long English Train Notice Wall | 308 | 13.906 | 14.826 | +6.6% | 27.461 | 23.532 |
| P09 | Orbital Botanical Courtroom | 309 | 16.204 | 17.740 | +9.5% | 27.461 | 23.533 |
| P10 | Byzantine Data Center Chapel | 310 | 13.703 | 17.665 | +28.9% | 27.461 | 23.532 |
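The summary runtimes in the Benchmark section can be rederived from the per-prompt times in this matrix. The sketch below is one way to do that check. The timings are copied from the matrix, while the variable names and the `delta_percent` helper are illustrative. The results should agree with the reported averages, medians, and deltas up to rounding in the last digit.

```python
from statistics import mean, median

# (original_s, quant_s) for P01..P10, copied from the 10-prompt matrix.
TIMES = [
    (19.518, 24.105),
    (16.986, 18.565),
    (13.903, 17.882),
    (14.439, 17.848),
    (14.253, 17.779),
    (16.744, 18.136),
    (13.918, 14.828),
    (13.906, 14.826),
    (16.204, 17.740),
    (13.703, 17.665),
]

original = [o for o, _ in TIMES]
quant = [q for _, q in TIMES]


def delta_percent(orig_s: float, quant_s: float) -> float:
    """Relative runtime change of the quantized model versus the original."""
    return (quant_s / orig_s - 1.0) * 100.0


print(f"average: {mean(original):.3f} s vs {mean(quant):.3f} s")
print(f"median:  {median(original):.3f} s vs {median(quant):.3f} s")
for index, (o, q) in enumerate(TIMES, start=1):
    print(f"P{index:02d} delta: {delta_percent(o, q):+.1f}%")
```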

Full Prompts

P01 - Midnight Library Weather Station

A vast midnight library converted into a Victorian weather station, brass barometers, hanging cloud chambers, blue lightning outside stained-glass windows, spiral ladders, rainwater collecting in crystal funnels, and readable labels everywhere. Include a large oak sign saying "ARCHIVE OF STORMS - EAST WING", a ledger title saying "BAROMETRIC ANOMALIES 1897-1903", a small drawer label saying "FOG SAMPLES / DO NOT SHAKE", a chalkboard note saying "THUNDER ARRIVES AT 02:17", and a bookmark saying "RETURN TO SHELF C-19". Extremely detailed, cinematic, natural perspective, crisp small typography.

P02 - Desert Observatory Treaty Room

An ancient desert observatory at golden hour, now used as a treaty room for astronomers and nomad diplomats, sandstone arches, astrolabes, folded star maps, copper tea service, wind-blown curtains, tiny dust motes, and many readable inscriptions. The central parchment must read "TREATY OF THE SEVEN MOONS". A wall plaque reads "OBSERVATORY OF QASR AL-SUHAIL". A tea label says "CARDAMOM - NO SUGAR". A blue wax seal says "WITNESSED UNDER MARS". A telescope tag says "CALIBRATE BEFORE SUNSET". Hyperreal, warm shadows, intricate surface wear.

P03 - Arctic Submarine Greenhouse

A transparent research submarine trapped under Arctic ice, transformed into a warm hydroponic greenhouse with orange grow lights, condensation, polar bears visible above through thick ice, scientists in wool sweaters, algae tanks, and frost patterns on glass. Include readable text on multiple objects: "POLAR BOTANY UNIT 4" on the bulkhead, "EMERGENCY SEED VAULT" on a red locker, "LIGHT CYCLE: 18 HOURS" on a tablet, "DO NOT FEED THE KELP" on a handwritten note, and "RETURN CORE SAMPLES" on a metal tray. Detailed, atmospheric, believable engineering.

P04 - Long English Museum Labels

A photorealistic museum exhibit room about impossible machines, with glass cases, velvet ropes, soft spotlights, and several long English placards that must be visible on different parts of the image. Placard one reads: "THE CLOCK THAT REMEMBERED WINTER: assembled from brass, bone, and borrowed tides, circa 1814." Placard two reads: "PLEASE DO NOT TOUCH THE PERPETUAL ENGINE; it becomes anxious when observed too closely." Placard three reads: "CURATOR'S NOTE: every gear was catalogued, polished, numbered, and returned before dawn." Also include ticket stubs, tiny accession numbers, fingerprints on glass, and realistic museum lighting.

P05 - Tokyo Rooftop Repair Diner

A rainy Tokyo rooftop diner that doubles as a robot repair shop, neon reflections, steam from ramen bowls, umbrellas, tiny servo motors, handwritten order slips, rain beads on chrome, and a skyline full of antennas. Readable signs: a pink neon sign says "MIDNIGHT RAMEN & REPAIRS", a menu board says "SPECIAL: MISO, BATTERY PACK, GREEN TEA", a repair invoice says "UNIT 7B - LEFT HAND RECALIBRATION", a sticker says "NO DRONES AFTER 2 AM", and a paper lantern says "OPEN WHEN IT RAINS". High detail, shallow depth of field, cinematic realism.

P06 - Russian Provincial Print Shop

Старинная провинциальная типография в России, поздний вечер, керосиновые лампы, деревянные кассы со свинцовыми литерами, мокрые афиши на веревках, самовар, иней на окне, реалистичная пыль и следы краски. На большой вывеске должно быть написано: "ТИПОГРАФИЯ УЕЗДНЫХ ВЕСТЕЙ". На длинной афише читаемый текст: "Завтра в городском саду: лекция о кометах, духовой оркестр, чай с баранками, начало ровно в семь часов вечера". На ящике: "ЛИТЕРЫ: А-Я, НЕ РОНЯТЬ". На записке: "Срочно отпечатать до рассвета". Очень детально, без мультяшности.

P07 - Ocean Cartography Bakery

A cozy bakery inside an old ocean cartography office, with croissants shaped like sea monsters, nautical charts dusted with flour, brass compasses, jars of ink, morning light, and a baker drawing coastlines in powdered sugar. Text elements: "TIDAL BREAD & MAPS" on the front sign, "SOURDOUGH CURRENT - 6:30 AM" on a chalkboard, "UNCHARTED PLUM TARTS" on a pastry label, "DO NOT EAT THE COMPASS" on a note, and "NORTH SEA BATCH 12" stamped on a paper bag. Warm, detailed, whimsical but realistic.

P08 - Long English Train Notice Wall

A foggy Edwardian railway platform at dawn with a wall of overlapping long English notices, brass lamps, wet cobblestones, porters, suitcases, pigeons, steam, and reflections. The largest notice must read: "IMPORTANT SERVICE CHANGE: The 6:42 express to Northbridge will depart from Platform Three after the moonlit freight has cleared the signal box." A second poster reads: "LOST PROPERTY: one violin case, two blue gloves, a silver compass, and a letter never posted." A timetable says "WINTER ROUTE - DELAYS EXPECTED NEAR THE MARSH". Ultra detailed, cinematic, legible signs, natural perspective.

P09 - Orbital Botanical Courtroom

A surreal but photorealistic courtroom inside an orbital botanical garden, judges in dark robes, enormous ferns, floating pollen, Earth visible through a curved window, holographic evidence screens, and a tiny robot stenographer. Required readable text: "CASE 44-B: THE PEOPLE VS. THE SUNFLOWER" on the main screen, "EVIDENCE: THREE PETALS AND A BROKEN VASE" on a side display, "SILENCE IN THE GREENHOUSE COURT" on a sign, "WITNESS: DR. LYSANDER MOSS" on a nameplate, and "OXYGEN TAX RECEIPT" on a paper slip. Sharp, high-detail, dramatic lighting.

P10 - Byzantine Data Center Chapel

A Byzantine chapel converted into a quiet data center, gold mosaics reflecting server LEDs, incense smoke, marble floors, monks maintaining fiber cables, illuminated manuscripts next to diagnostic terminals, and beautiful cable management. Text must appear in multiple places: "SANCTUM SERVER ROOM - AUTHORIZED MONKS ONLY" on a bronze door, "BACKUP PSALMS COMPLETED AT 03:12" on a terminal, "DO NOT UNPLUG THE RELIQUARY" on a warning label, "LATENCY PRAYER REQUESTS" on a clipboard, and "ARCHIVE NODE IX" etched on a server rack. Rich texture, controlled highlights, realistic scale.

Notes

This checkpoint is intended for research and evaluation. It inherits the upstream Lens limitations and responsible AI considerations from the source model. Text rendering remains challenging; the corrected recipe is designed to avoid the obvious grid/printed texture failure seen when transformer modulation linears are quantized.