---
license: mit
language:
- en
pipeline_tag: text-to-image
---
> **Note:** This repository is an **archived mirror** and is **not** the original upstream source.  
> The original model, weights, and documentation are developed and maintained by **Microsoft**.
>
> All hosted model weights are **unmodified**. 
>
> This project is released under the **MIT License**, which permits use, modification, and redistribution under its terms.
>
> *This repository is not affiliated with, endorsed by, or sponsored by Microsoft.*

<div align="center">

# Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

<p>
  <b>Contributors (Alphabetical Order):</b><br />
    <strong>Baining Guo</strong>,
    <strong>Chong Luo</strong>,
    <strong>Dong Chen</strong>&dagger;,
    <strong>Dongdong Chen</strong>,
    <strong>Fangyun Wei</strong>&dagger;,
    <strong>Ji Li</strong>,
    <strong>Jianmin Bao</strong>,
    <strong>Jiawei Zhang</strong>&ast;,
    <strong>Jinjing Zhao</strong>&ast;,
    <strong>Lei Shi</strong>,
    <strong>Qinhong Yang</strong>,
    <strong>Sirui Zhang</strong>&ast;,
    <strong>Xiuyu Wu</strong>,
    <strong>Xuelu Feng</strong>,
    <strong>Yan Lu</strong>,
    <strong>Yanchen Dong</strong>,
    <strong>Yang Yue</strong>&ast;,
    <strong>Yitong Wang</strong>,
    <strong>Yunuo Chen</strong>,
    <strong>Zhiyang Liang</strong>&ast;,
    <strong>Ziyu Wan</strong>&dagger;
  <br />
  Microsoft &nbsp;|&nbsp; &ast;Core Contributors &nbsp;|&nbsp; &dagger;Project Lead
</p>
<p>
  <a href="https://arxiv.org/abs/2605.21573"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=arxiv&logoColor=white" height="22" /></a>
  &nbsp;
  <a href="https://huggingface.co/microsoft/Lens-Turbo"><img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97-Models-yellow" height="22" /></a>
  &nbsp;
  <a href="https://github.com/microsoft/Lens"><img alt="GitHub" src="https://img.shields.io/badge/GitHub-Repo-181717?logo=github&logoColor=white" height="22" /></a>
  &nbsp;
  <a href="https://github.com/microsoft/Lens/blob/main/LICENSE"><img alt="License: MIT" src="https://img.shields.io/badge/License-MIT-green.svg" height="22" /></a>
</p>

</div>

---

**Lens** is a **3.8B-parameter** foundational text-to-image model designed for **efficient training** and **fast high-resolution generation**. It combines dense-caption pre-training, mixed-resolution learning, GPT-OSS multi-layer text features, and the FLUX.2 semantic VAE to reach competitive quality with substantially less training compute than larger T2I models.

This repository provides the minimal inference code for generating images from Lens DiT checkpoints.

## Highlights

- **Efficient Foundation** &mdash; Trained on **Lens-800M**, an 800M image-text corpus with long GPT-4.1 captions, maximizing information density per training batch.
- **Compact & Expressive** &mdash; A 48-block MMDiT denoiser leverages FLUX.2 latents and concatenated multi-layer GPT-OSS features for stronger prompt following and multilingual generalization.
- **Flexible Resolution** &mdash; Mixed-resolution training enables inference across aspect ratios from `1:2` to `2:1` and resolutions up to **1440&times;1440**.
- **Post-trained Variants** &mdash; RL tuning improves visual quality and artifact suppression; the distilled **Lens-Turbo** supports fast **4-step** generation.

## Installation

> **Tested environment:** Python 3.12 &middot; CUDA 12.6 &middot; PyTorch 2.11.0+cu126 &middot; TorchVision 0.26.0+cu126

```bash
conda create -n lens python=3.12 -y
conda activate lens
uv pip install torch==2.11.0+cu126 torchvision==0.26.0+cu126 \
    --index-url https://download.pytorch.org/whl/cu126
uv pip install -r requirements.txt
```

The default GPT-OSS encoder and FLUX.2 VAE are loaded from Hugging Face. Make sure your environment has access to any gated model repositories you use.

## Checkpoints

| Repo | Description | Steps | CFG |
| :--- | :--- | :---: | :---: |
| [`microsoft/Lens`](https://huggingface.co/microsoft/Lens) | **Default.** RL-tuned for visual quality | 20 | 5.0 |
| [`microsoft/Lens-Turbo`](https://huggingface.co/microsoft/Lens-Turbo) | Distilled from the RL model for fast 4-step sampling | 4 | 1.0 |
| [`microsoft/Lens-Base`](https://huggingface.co/microsoft/Lens-Base) | Supervised base model (no RL, no distillation) | 50 | 5.0 |

Pick a variant by passing its repo id to `--repo_id` (CLI) or `LensPipeline.from_pretrained(...)` (Python).
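
Because the recommended step count and CFG scale differ per variant, it can help to keep them together in one place. The snippet below is a minimal sketch: `PRESETS` and `sampling_kwargs` are hypothetical names that are not part of the `lens` package, and the values are simply copied from the table above. The keyword names follow the Python API example in the Inference section.

```python
# Hypothetical helper: recommended sampling settings per checkpoint,
# copied from the Checkpoints table. Not part of the lens package.
PRESETS = {
    "microsoft/Lens": {"num_inference_steps": 20, "guidance_scale": 5.0},
    "microsoft/Lens-Turbo": {"num_inference_steps": 4, "guidance_scale": 1.0},
    "microsoft/Lens-Base": {"num_inference_steps": 50, "guidance_scale": 5.0},
}


def sampling_kwargs(repo_id: str) -> dict:
    """Return a copy of the recommended sampling kwargs for repo_id."""
    try:
        return dict(PRESETS[repo_id])
    except KeyError:
        raise ValueError(f"unknown Lens variant: {repo_id!r}") from None


print(sampling_kwargs("microsoft/Lens-Turbo"))
# {'num_inference_steps': 4, 'guidance_scale': 1.0}
```

The returned dictionary can be unpacked into the pipeline call, for example `pipe(prompt=..., **sampling_kwargs(repo_id))`.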

## Inference

> **Important:** Run from the cloned repo root so that `from lens import LensPipeline` resolves to this package. Importing `lens` is what registers `LensGptOssEncoder` and `LensTransformer2DModel` with the `transformers` and `diffusers` namespaces that `model_index.json` references.

**Python API:**

```python
import torch
from lens import LensPipeline
pipe = LensPipeline.from_pretrained(
    "microsoft/Lens", torch_dtype=torch.bfloat16
).to("cuda")
image = pipe(
    prompt="A cat holding a sign that says \"hello world\"",
    base_resolution=1440, aspect_ratio="1:1",
    num_inference_steps=20, guidance_scale=5.0,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("lens.png")
```
To lower peak VRAM at the cost of speed, remove the `.to("cuda")` call and call `pipe.enable_model_cpu_offload()` after loading the pipeline instead.

**CLI &mdash; basic usage:**

```bash
python inference.py \
    --repo_id "microsoft/Lens" \
    --prompt "A cinematic mountain lake at sunrise, soft mist, detailed reflections" \
    --base_resolution 1440 --aspect_ratio 1:1 \
    --steps 20 --cfg 5.0 --n 1 --seed 42 \
    --out ./outputs
```

**Batch generation** &mdash; join multiple prompts with `|`:

```bash
python inference.py \
    --repo_id "microsoft/Lens" \
    --steps 20 --cfg 5.0 \
    --prompt "a red fox in snow|a glass greenhouse at night"
```
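
When prompts come from a list or a file, the `--prompt` value can be assembled programmatically. The following is a minimal sketch: `join_prompts` is a hypothetical helper, and it assumes that `inference.py` splits on every `|` with no escape mechanism, which the description above implies but does not spell out. For that reason it rejects prompts that already contain `|`.

```python
def join_prompts(prompts: list[str]) -> str:
    """Join prompts into a single --prompt value separated by '|'."""
    cleaned = []
    for p in prompts:
        p = p.strip()
        if not p:
            continue  # skip empty entries such as blank lines in a file
        if "|" in p:
            # Assumption: '|' cannot be escaped, so it would split the prompt.
            raise ValueError(f"prompt contains '|': {p!r}")
        cleaned.append(p)
    if not cleaned:
        raise ValueError("no prompts given")
    return "|".join(cleaned)


print(join_prompts(["a red fox in snow", " a glass greenhouse at night "]))
# a red fox in snow|a glass greenhouse at night
```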

**A100 / V100 (no MXFP4 kernels)** &mdash; dequantize the GPT-OSS encoder to bf16:

```bash
python inference.py \
    --repo_id "microsoft/Lens" \
    --steps 20 --cfg 5.0 \
    --prompt "a cat" \
    --disable_mxfp4 --offload
```
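
Whether `--disable_mxfp4` is needed can be decided from the GPU's compute capability. The sketch below is illustrative only: `should_disable_mxfp4` and `detect_capability` are hypothetical names, and the 9.0 threshold is an assumption that maps "Hopper and newer" to compute capability 9.0 or higher. GPUs other than the ones named here may behave differently, so treat the result as a hint rather than a guarantee.

```python
def should_disable_mxfp4(capability: tuple[int, int]) -> bool:
    """True when the GPU is older than Hopper (compute capability < 9.0).

    Assumption: the 9.0 cut-off mirrors the A100 / V100 vs. Hopper split.
    """
    return capability < (9, 0)


def detect_capability() -> tuple[int, int]:
    """Query the current CUDA device; requires PyTorch and a visible GPU."""
    import torch  # imported lazily so the pure check above works without it

    return torch.cuda.get_device_capability()


# Compute capabilities: V100 = 7.0, A100 = 8.0, H100 = 9.0.
for name, cap in [("V100", (7, 0)), ("A100", (8, 0)), ("H100", (9, 0))]:
    flag = "--disable_mxfp4" if should_disable_mxfp4(cap) else "keep MXFP4"
    print(name, flag)
```

On a machine with a GPU, `should_disable_mxfp4(detect_capability())` gives the same answer for the installed device.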

### Options

| Flag | Description | Default |
| :--- | :--- | :--- |
| `--repo_id` | HF repo id (or local path) of the assembled Lens pipeline | `microsoft/Lens` |
| `--base_resolution` | `1024` or `1440` | `1440` |
| `--aspect_ratio` | `1:2`, `9:16`, `2:3`, `3:4`, `1:1`, `4:3`, `3:2`, `16:9`, `2:1` | `1:1` |
| `--steps` | Number of denoising steps | `20` |
| `--cfg` | Classifier-free guidance scale | `5.0` |
| `--n` | Number of images per prompt | `1` |
| `--seed` | Random seed (omit for non-deterministic) | &mdash; |
| `--out` | Output directory | `./outputs` |
| `--dtype` | Compute dtype: `bfloat16`, `float16`, `float32` | `bfloat16` |
| `--disable_mxfp4` | Dequantize the GPT-OSS text encoder to `--dtype` (required on A100 / V100; Hopper+ keeps MXFP4 by default for less VRAM) | &mdash; |
| `--offload` | Enable diffusers CPU offload (`text_encoder->transformer->vae`) to reduce peak VRAM | &mdash; |
| `--reasoner` | Refine prompts with the loaded GPT-OSS encoder before generation | &mdash; |
| `--api_url` / `--api_key` / `--api_model` | Use an OpenAI-compatible API for prompt refinement (takes precedence over `--reasoner`) | &mdash; |
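
To drive the CLI from a script, the documented flags can be turned into an argument list. This is a minimal sketch rather than part of the repository: `build_command` is a hypothetical helper, it uses only flags listed in the table above, and it checks `--base_resolution` and `--aspect_ratio` against the documented values before anything is launched.

```python
import shlex
import sys

# Allowed values as listed in the Options table.
BASE_RESOLUTIONS = {1024, 1440}
ASPECT_RATIOS = {"1:2", "9:16", "2:3", "3:4", "1:1", "4:3", "3:2", "16:9", "2:1"}


def build_command(prompt, repo_id="microsoft/Lens", base_resolution=1440,
                  aspect_ratio="1:1", steps=20, cfg=5.0, seed=None,
                  out="./outputs"):
    """Return an argv list for inference.py using only documented flags."""
    if base_resolution not in BASE_RESOLUTIONS:
        raise ValueError(f"unsupported base_resolution: {base_resolution}")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"unsupported aspect_ratio: {aspect_ratio}")
    argv = [
        sys.executable, "inference.py",
        "--repo_id", repo_id,
        "--prompt", prompt,
        "--base_resolution", str(base_resolution),
        "--aspect_ratio", aspect_ratio,
        "--steps", str(steps),
        "--cfg", str(cfg),
        "--out", out,
    ]
    if seed is not None:
        # Omitting --seed leaves generation non-deterministic.
        argv += ["--seed", str(seed)]
    return argv


cmd = build_command("a cat", aspect_ratio="16:9", seed=42)
print(shlex.join(cmd[1:]))  # skip the interpreter path for display
```

The list can be passed to `subprocess.run(cmd, check=True)` from the cloned repo root.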

## Citation

```bibtex
@article{zhao2026lens,
  title   = {Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models},
  author  = {Guo, Baining and Luo, Chong and Chen, Dong and Chen, Dongdong and Wei, Fangyun and Li, Ji and Bao, Jianmin and Zhang, Jiawei and Zhao, Jinjing and Shi, Lei and Yang, Qinhong and Zhang, Sirui and Wu, Xiuyu and Feng, Xuelu and Lu, Yan and Dong, Yanchen and Yue, Yang and Wang, Yitong and Chen, Yunuo and Liang, Zhiyang and Wan, Ziyu},
  journal = {arXiv preprint arXiv:2605.21573},
  year    = {2026}
}
```

## Responsible AI

The model is released for research purposes only and is not intended for product or service deployment. Responsible AI considerations were incorporated throughout the development process, including data selection, model training, and evaluation.

The training data includes a combination of public, licensed, and internal datasets that were processed to remove clearly identifiable personal information and reduce harmful content where possible. However, as the data is largely sourced from web-scale collections, it may contain biases or uneven representation. As a result, the model may generate outputs that are inaccurate, biased, or inappropriate under certain prompts, including content that could be misleading or raise copyright or IP-related concerns.

Given these limitations, the model should be used in controlled research settings, with appropriate human oversight. Downstream users are responsible for applying additional safeguards, such as content moderation, validation, and compliance checks, before using the model in broader applications.

## Privacy

This project does not collect any usage data. For more information, see the [Microsoft Privacy Statement](https://go.microsoft.com/fwlink/?LinkId=521839).

## License

This project is released under the [MIT License](LICENSE).