Running a 32 Billion Parameter Image Model on “Ancient” V100 GPUs
The Challenge
It’s 2026. The AI world has moved on to H100s, H200s, and whatever NVIDIA announced last week that costs more than my car. Meanwhile, I’m sitting here with a server full of V100s — GPUs that were hot stuff back in 2017.
But here’s the thing: I’ve got eight of them. That’s 256GB of VRAM just… sitting there. And Black Forest Labs dropped FLUX.2-dev, a 32B parameter text-to-image model that can crank out images so photorealistic it feels like cheating.
The internet said I needed an H100. The internet was wrong.
The Setup (AKA “What Are We Working With?”)
8x Tesla V100-SXM2-32GB
256GB Total VRAM
NVLink interconnect (fancy GPU-to-GPU highways)
One engineer with questionable decision-making skills
The V100 is like that Honda Civic from 2005 that refuses to die. Sure, it doesn’t have the fancy new features — no BF16, no transformer engines — but it’s reliable, it’s paid off, and it still gets the job done if you stop expecting it to be a spaceship.
The Model (AKA “Why Is This So Big?”)
FLUX.2-dev breaks down like this:
| Component | Size | What It Does |
|---|---|---|
| Text Encoder (Mistral-3) | ~24B params | Reads your prompt and pretends to understand it |
| Transformer (DiT) | ~32B params | The actual artist (with commitment issues about which GPU to live on) |
| VAE | ~84M params | Turns math into pixels |
Total: roughly “my single V100 is not amused.”
The Journey
Attempt #1: “Just Load It, How Hard Can It Be?”
pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev")
pipe.to("cuda")
Result: the GPU had what I can only describe as a panic attack.
CUDA out of memory. Tried to allocate 31.16 GB.
GPU 0 has 143.88 MiB free.
One GPU down, seven witnesses.
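Before the next attempt, it helps to see how much room each card actually has. A minimal sketch using PyTorch’s `torch.cuda.mem_get_info` (the helper name is mine):

```python
import torch

def vram_report() -> list[tuple[float, float]]:
    """(free_GB, total_GB) per visible GPU; empty list if CUDA is absent."""
    if not torch.cuda.is_available():
        return []
    return [
        tuple(x / 1e9 for x in torch.cuda.mem_get_info(i))
        for i in range(torch.cuda.device_count())
    ]

for idx, (free, total) in enumerate(vram_report()):
    print(f"GPU {idx}: {free:.1f} GB free / {total:.1f} GB total")
```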
Attempt #2: “Let’s Try That CPU Offload Thing”
pipe.enable_model_cpu_offload()
Result: it loaded! Then during inference:
RuntimeError: Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu!
The model was having an identity crisis about where it lived. Relatable.
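One way to catch this identity crisis before inference is to list which devices a module’s parameters actually live on. A small sketch (the helper name is mine; the `nn.Linear` is just a stand-in for a real pipeline component):

```python
import torch.nn as nn

def param_devices(module: nn.Module) -> set[str]:
    """Distinct devices this module's parameters currently occupy."""
    return {str(p.device) for p in module.parameters()}

# Stand-in module; on a real run you'd pass pipe.transformer, pipe.vae, etc.
probe = nn.Linear(4, 4)
print(param_devices(probe))
```

If the set contains more than one entry for a component that should be device-local, you have found your mismatch.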
Attempt #3: “device_map='balanced' Will Save Us”
The Hugging Face docs promised this would distribute my model across all GPUs like some kind of silicon socialism.
pipe = Flux2Pipeline.from_pretrained(
model_id,
device_map="balanced",
max_memory={i: "28GiB" for i in range(8)}
)
Result:
Some parameters are on the meta device because they were offloaded to the cpu.
The model looked at my 256GB of VRAM and said “nah, I’ll just hang out in RAM, thanks.”
GPU memory usage: ~0.2GB across all 8 GPUs.
*screaming internally*
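When `device_map` misbehaves, the placement that accelerate chose is recorded on the loaded model’s `hf_device_map` attribute. A sketch of summarizing it (the example map below is made up for illustration):

```python
from collections import Counter

def summarize_placement(device_map: dict) -> Counter:
    """Tally modules per device; 'cpu' or 'disk' entries explain low GPU usage."""
    return Counter(str(d) for d in device_map.values())

example_map = {"blocks.0": 0, "blocks.1": 1, "blocks.2": "cpu"}  # made-up map
print(summarize_placement(example_map))
# Real usage: summarize_placement(pipe.transformer.hf_device_map)
```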
Attempt #4: “Fine, I’ll Do It Myself”
Sometimes you have to grab the model by the parameters and tell it where to go.
# Text Encoder: "You two, GPUs 0 and 1, you're on text duty"
text_encoder = Mistral3ForConditionalGeneration.from_pretrained(
MODEL_ID, subfolder="text_encoder",
device_map="balanced",
max_memory={0: "30GiB", 1: "30GiB"} # AND NOWHERE ELSE
)
# Transformer: "The rest of you, handle the heavy lifting"
transformer = AutoModel.from_pretrained(
MODEL_ID, subfolder="transformer",
device_map="balanced",
max_memory={2: "30GiB", 3: "30GiB", 4: "30GiB",
5: "30GiB", 6: "30GiB", 7: "30GiB"}
)
# VAE: "You're small, you can share with GPU 0"
vae = AutoencoderKLFlux2.from_pretrained(
MODEL_ID, subfolder="vae"
).to("cuda:0")
Result:
=== GPU Memory ===
GPU 0: 23.2GB ✓
GPU 1: 21.7GB ✓
GPU 2: 8.7GB ✓
GPU 3: 11.9GB ✓
GPU 4: 11.9GB ✓
GPU 5: 11.9GB ✓
GPU 6: 11.9GB ✓
GPU 7: 3.8GB ✓
IT’S DISTRIBUTED. THE PARAMETERS ARE ON THE GPUS. I AM A GOD.
Attempt #5: “Please Generate An Image”
image = pipe(prompt="A futuristic city at sunset").images[0]
Progress bar appears…
0%| | 0/28 [00:00<?, ?it/s]
4%|▎ | 1/28 [00:05<02:16, 5.04s/it]
...
100%|██████████| 28/28 [02:06<00:00, 4.53s/it]
Then:
AttributeError: 'AutoencoderKL' object has no attribute 'bn'
Of course. The VAE for FLUX.2 isn’t a regular VAE. It’s a special AutoencoderKLFlux2 with batch normalization, because why would anything be simple?
Attempt #6: “The Actual Working Version”
vae = AutoencoderKLFlux2.from_pretrained(...) # Note the Flux2
And then…
Saved to ./outputs/output.png
It Works. It Actually Works.
A couple of minutes later, I had a 1024×1024 image of a futuristic city that looked like it came from a Hollywood concept art department.
Generated on hardware from 2017.
Take that, GPU scalpers.
The Numbers
| Metric | Value |
|---|---|
| Model Load Time | ~60 seconds |
| Generation Time (1024×1024, 28 steps) | ~2–3 minutes |
| Total VRAM Used | ~105GB |
| GPUs Required | 8× V100-32GB |
| Smugness Level | Immeasurable |
What I Learned
- `device_map="balanced"` lies (or at least “optimizes” in ways you won’t like). It will offload to CPU if you let it. If you want predictable placement, constrain each component’s GPU set explicitly.
- The VAE must be on the same GPU as where the latents end up; in my case that’s `cuda:0`. I lost time to device-mismatch errors before accepting this as law.
- Old GPUs aren’t useless. They’re just… distributed. Eight V100s can do what one modern GPU does, just slower and with more coordination.
- FP16 is your friend. V100s don’t do BF16 natively, but FP16 inference works fine here.
- Persistence pays off. This took multiple hours and several dead ends, but the final setup is stable and repeatable.
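The FP16 point can be automated so the same script also runs on newer cards. A sketch using `torch.cuda.is_bf16_supported()` (the helper name is mine):

```python
import torch

def pick_dtype() -> torch.dtype:
    """BF16 on cards that support it (Ampere and newer); FP16 otherwise,
    which is the right fallback on V100s."""
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16

print(pick_dtype())
```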
The Final Setup
/opt/flux2-dev/
├── app.py # Gradio web UI
├── run_flux.py # CLI script
├── models/FLUX.2-dev/ # model weights (~166GB)
└── outputs/ # generated images
Run the web UI:
cd /opt/flux2-dev
source venv/bin/activate
python app.py
# Open http://localhost:7860
Conclusion
The AI industry wants you to believe you need the latest hardware to run the latest models. And sure, modern GPUs will be dramatically faster and more efficient. But there’s something deeply satisfying about getting a 32B model running on “legacy” cards using nothing but sharding, stubbornness, and the occasional whisper of profanity.
Now if you’ll excuse me, I’m off to generate “absurdly detailed robots in dramatic neon rain in the style of a Renaissance painting” because I can.
Eight V100s running a model this size are basically a space heater with NVLink. And that’s the point: this was a fun experiment in sharding, debugging, and seeing how far old hardware can be pushed — not a blueprint for operating FLUX.2-dev continuously. It works, it’s impressive, and it’s also hilariously inefficient compared to modern setups.
So I’m happy keeping it in the “prove it once, generate a test image (or two), learn a lot, move on” category.
Was it worth it? Absolutely.
Would I run it like this continuously? Absolutely not.
FLUX.2-dev Multi-GPU Setup (Technical Notes)
FLUX.2-dev is a 32B parameter text-to-image model from Black Forest Labs. This setup runs it across 8× V100-SXM2-32GB GPUs (256GB total VRAM).
Requirements
- 8× NVIDIA GPUs with combined VRAM ≥ ~110GB (tested on 8× V100-32GB)
- Python 3.10+
- CUDA 11.8+
- ~170GB disk space for model weights
Installation
# Choose an install directory
cd /opt/flux2-dev
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
pip install diffusers transformers accelerate huggingface_hub gradio
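After installing, a quick sanity check that torch sees the right CUDA build and all eight cards (the printed values will differ per machine):

```python
import torch

print("torch:", torch.__version__)
print("CUDA build:", torch.version.cuda)      # e.g. "11.8"; None on CPU-only wheels
print("GPUs visible:", torch.cuda.device_count())
```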
Model Download
Example model path:
/opt/flux2-dev/models/FLUX.2-dev/
To download/update (if you use a script):
source venv/bin/activate
python download_model.py
GPU Memory Allocation
| Component | GPUs | Memory Used |
|---|---|---|
| Text Encoder (Mistral-3) | 0, 1 | ~45GB |
| Transformer (DiT) | 2–7 | ~60GB |
| VAE | 0 | ~0.2GB |
| Total | 0–7 | ~105GB |
Running
Option 1: Web Interface (Gradio)
cd /opt/flux2-dev
source venv/bin/activate
python app.py
Access:
- Local: http://localhost:7860
- Network: http://<server-ip>:7860
Option 2: Command Line
cd /opt/flux2-dev
source venv/bin/activate
python run_flux.py
Edit run_flux.py to change prompt/settings.
Option 3: Python API
import os
import torch
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
from diffusers import Flux2Pipeline, AutoModel
from diffusers.models import AutoencoderKLFlux2
from transformers import Mistral3ForConditionalGeneration
MODEL_ID = "/opt/flux2-dev/models/FLUX.2-dev"
DTYPE = torch.float16
# Load components with GPU sharding
text_encoder = Mistral3ForConditionalGeneration.from_pretrained(
MODEL_ID,
subfolder="text_encoder",
torch_dtype=DTYPE,
device_map="balanced",
max_memory={0: "30GiB", 1: "30GiB"},
)
transformer = AutoModel.from_pretrained(
MODEL_ID,
subfolder="transformer",
torch_dtype=DTYPE,
device_map="balanced",
max_memory={2: "30GiB", 3: "30GiB", 4: "30GiB", 5: "30GiB", 6: "30GiB", 7: "30GiB"},
)
vae = AutoencoderKLFlux2.from_pretrained(
MODEL_ID,
subfolder="vae",
torch_dtype=DTYPE,
).to("cuda:0")
vae.enable_tiling()
# Assemble pipeline
pipe = Flux2Pipeline.from_pretrained(
MODEL_ID,
text_encoder=None,
transformer=None,
vae=None,
torch_dtype=DTYPE,
)
pipe.text_encoder = text_encoder
pipe.transformer = transformer
pipe.vae = vae
# Generate
image = pipe(
prompt="A futuristic city at sunset",
width=1024,
height=1024,
num_inference_steps=28,
guidance_scale=3.5,
).images[0]
# Save output
os.makedirs("outputs", exist_ok=True)
image.save("outputs/output.png")
Generation Settings
| Parameter | Default | Range | Description |
|---|---|---|---|
| `width` | 1024 | 256–2048 | Width (multiples of 64) |
| `height` | 1024 | 256–2048 | Height (multiples of 64) |
| `num_inference_steps` | 28 | 1–100 | More steps = better quality, slower |
| `guidance_scale` | 3.5 | 1.0–20.0 | Prompt adherence |
| `seed` | random | 0–2³² | Reproducibility |
Performance
On 8× V100-SXM2-32GB:
- Model load time: ~60 seconds
- Generation time (1024×1024, 28 steps): ~2–3 minutes
- Speed: ~0.15–0.2 steps/second
Output
Generated images are saved as PNG files:
- Web UI: `./outputs/flux2_YYYYMMDD_HHMMSS_SEED.png` (example)
- CLI/API: `./outputs/output.png` (or wherever you save)
Troubleshooting
“CUDA out of memory”
- Reduce resolution
- Reduce inference steps
- Check other GPU processes: `nvidia-smi`
- Note: unlikely on 8×32GB, more common on smaller setups
“Expected all tensors to be on the same device”
- Ensure the VAE is on `cuda:0`
- Don’t mix `enable_model_cpu_offload()` with manual sharding unless you know exactly how tensors move
Slow generation
- V100s are simply slower than A100/H100-class GPUs
- Use fewer steps (20–28 is usually fine)
- Test at smaller resolutions first
Links
- FLUX.2 on Hugging Face: https://huggingface.co/black-forest-labs/FLUX.2-dev
- Black Forest Labs: https://blackforestlabs.ai/
- Diffusers docs: https://huggingface.co/docs/diffusers/
