Running a 32 Billion Parameter Image Model on “Ancient” V100 GPUs

Community Article Published January 29, 2026

Or: How I Learned to Stop Worrying About H100 Pricing and Love Turning 2017 V100s into an NVLink Space Heater That Draws Pictures


The Challenge

It’s 2026. The AI world has moved on to H100s, H200s, and whatever NVIDIA announced last week that costs more than my car. Meanwhile, I’m sitting here with a server full of V100s — GPUs that were hot stuff back in 2017.

But here’s the thing: I’ve got eight of them. That’s 256GB of VRAM just… sitting there. And Black Forest Labs dropped FLUX.2-dev, a 32B parameter text-to-image model that can crank out images so photorealistic it feels like cheating.

The internet said I needed an H100. The internet was wrong.


The Setup (AKA “What Are We Working With?”)

8x Tesla V100-SXM2-32GB
256GB Total VRAM
NVLink interconnect (fancy GPU-to-GPU highways)
One engineer with questionable decision-making skills

The V100 is like that Honda Civic from 2005 that refuses to die. Sure, it doesn’t have the fancy new features — no BF16, no transformer engines — but it’s reliable, it’s paid off, and it still gets the job done if you stop expecting it to be a spaceship.


The Model (AKA “Why Is This So Big?”)

FLUX.2-dev breaks down like this:

| Component | Size | What It Does |
| --- | --- | --- |
| Text Encoder (Mistral-3) | ~24B params | Reads your prompt and pretends to understand it |
| Transformer (DiT) | ~32B params | The actual artist (with commitment issues about which GPU to live on) |
| VAE | ~84M params | Turns math into pixels |

Total: roughly “my single V100 is not amused.”
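A quick back-of-the-envelope check explains the panic that follows: at FP16, every parameter costs 2 bytes. Pure arithmetic, no libraries needed:

```python
# Rough FP16 weight footprint: 2 bytes per parameter.
def fp16_gib(num_params: float) -> float:
    """Approximate weight memory in GiB at float16 precision."""
    return num_params * 2 / 2**30

text_encoder = fp16_gib(24e9)  # ~44.7 GiB
transformer = fp16_gib(32e9)   # ~59.6 GiB
vae = fp16_gib(84e6)           # ~0.16 GiB

total = text_encoder + transformer + vae
print(f"weights alone: ~{total:.0f} GiB")  # ~104 GiB -- no single 32GB card survives this
```

The total lines up with the ~105GB of VRAM actually used once everything is sharded, which is a nice sanity check that nothing stayed in FP32 by accident.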


The Journey

Attempt #1: “Just Load It, How Hard Can It Be?”

pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev")
pipe.to("cuda")

Result: the GPU had what I can only describe as a panic attack.

CUDA out of memory. Tried to allocate 31.16 GB.
GPU 0 has 143.88 MiB free.

One GPU down, seven witnesses.


Attempt #2: “Let’s Try That CPU Offload Thing”

pipe.enable_model_cpu_offload()

Result: it loaded! Then during inference:

RuntimeError: Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu!

The model was having an identity crisis about where it lived. Relatable.


Attempt #3: “device_map='balanced' Will Save Us”

The Hugging Face docs promised this would distribute my model across all GPUs like some kind of silicon socialism.

pipe = Flux2Pipeline.from_pretrained(
    model_id,
    device_map="balanced",
    max_memory={i: "28GiB" for i in range(8)}
)

Result:

Some parameters are on the meta device because they were offloaded to the cpu.

The model looked at my 256GB of VRAM and said “nah, I’ll just hang out in RAM, thanks.”

GPU memory usage: ~0.2GB across all 8 GPUs.

screaming internally
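What eventually worked (spoiler for the next attempt) was never letting the loader see GPUs a component shouldn't touch: restrict the `max_memory` keys to exactly the devices you want that component on. A tiny helper I could have written three hours earlier (`max_memory_for` is my own name, not a diffusers or accelerate API):

```python
# Build a max_memory dict that pins a component to an explicit set of GPUs.
# Listing only the GPUs you want (and no "cpu" key) is what the working
# setup below relies on to keep weights off the CPU.
def max_memory_for(gpu_ids, cap="30GiB"):
    return {i: cap for i in gpu_ids}

print(max_memory_for([0, 1]))       # {0: '30GiB', 1: '30GiB'} -- text encoder
print(max_memory_for(range(2, 8)))  # GPUs 2-7 for the transformer
```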


Attempt #4: “Fine, I’ll Do It Myself”

Sometimes you have to grab the model by the parameters and tell it where to go.

# Text Encoder: "You two, GPUs 0 and 1, you're on text duty"
text_encoder = Mistral3ForConditionalGeneration.from_pretrained(
    MODEL_ID, subfolder="text_encoder",
    device_map="balanced",
    max_memory={0: "30GiB", 1: "30GiB"}  # AND NOWHERE ELSE
)

# Transformer: "The rest of you, handle the heavy lifting"
transformer = AutoModel.from_pretrained(
    MODEL_ID, subfolder="transformer",
    device_map="balanced",
    max_memory={2: "30GiB", 3: "30GiB", 4: "30GiB",
                5: "30GiB", 6: "30GiB", 7: "30GiB"}
)

# VAE: "You're small, you can share with GPU 0"
vae = AutoencoderKLFlux2.from_pretrained(
    MODEL_ID, subfolder="vae"
).to("cuda:0")

Result:

=== GPU Memory ===
  GPU 0: 23.2GB ✓
  GPU 1: 21.7GB ✓
  GPU 2: 8.7GB  ✓
  GPU 3: 11.9GB ✓
  GPU 4: 11.9GB ✓
  GPU 5: 11.9GB ✓
  GPU 6: 11.9GB ✓
  GPU 7: 3.8GB  ✓

IT’S DISTRIBUTED. THE PARAMETERS ARE ON THE GPUS. I AM A GOD.
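The report above comes from looping over `torch.cuda.memory_allocated` per device; the formatting half can be sketched without a GPU (`format_gpu_report` is a hypothetical helper, not a torch API):

```python
def format_gpu_report(allocated_gib: dict) -> str:
    """Render {gpu_index: GiB allocated} as the report shown above."""
    lines = ["=== GPU Memory ==="]
    for gpu, gib in sorted(allocated_gib.items()):
        lines.append(f"  GPU {gpu}: {gib:.1f}GB \u2713")
    return "\n".join(lines)

# On real hardware the dict would come from something like:
#   {i: torch.cuda.memory_allocated(i) / 2**30 for i in range(torch.cuda.device_count())}
print(format_gpu_report({0: 23.2, 1: 21.7, 2: 8.7}))
```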


Attempt #5: “Please Generate An Image”

image = pipe(prompt="A futuristic city at sunset").images[0]

Progress bar appears…

  0%|          | 0/28 [00:00<?, ?it/s]
  4%|▎         | 1/28 [00:05<02:16,  5.04s/it]
 ...
100%|██████████| 28/28 [02:06<00:00,  4.53s/it]

Then:

AttributeError: 'AutoencoderKL' object has no attribute 'bn'

Of course. The VAE for FLUX.2 isn’t a regular VAE. It’s a special AutoencoderKLFlux2 with batch normalization, because why would anything be simple?


Attempt #6: “The Actual Working Version”

vae = AutoencoderKLFlux2.from_pretrained(...)  # Note the Flux2

And then…

Saved to ./outputs/output.png

It Works. It Actually Works.

A couple of minutes later, I had a 1024×1024 image of a futuristic city that looked like it came from a Hollywood concept art department.

Generated on hardware from 2017.

Take that, GPU scalpers.


The Numbers

| Metric | Value |
| --- | --- |
| Model Load Time | ~60 seconds |
| Generation Time (1024×1024, 28 steps) | ~2–3 minutes |
| Total VRAM Used | ~105GB |
| GPUs Required | 8× V100-32GB |
| Smugness Level | Immeasurable |

What I Learned

  1. device_map="balanced" lies (or at least “optimizes” in ways you won’t like). It will offload to CPU if you let it. If you want predictable placement, constrain each component’s GPU set explicitly.
  2. The VAE must be on the same GPU as where latents end up. In my case that’s cuda:0. I lost time to device mismatch errors before accepting this as law.
  3. Old GPUs aren’t useless. They’re just… distributed. Eight V100s can do what one modern GPU does, just slower and with more coordination.
  4. FP16 is your friend. V100s don’t do BF16 natively, but FP16 inference works fine here.
  5. Persistence pays off. This took multiple hours and several dead ends, but the final setup is stable and repeatable.
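Lesson 1 can be enforced rather than hoped for: each sharded component exposes an `hf_device_map` dict (an accelerate convention mapping module names to devices), and a quick check catches silent CPU/meta offload before you waste a two-minute generation run. A minimal sketch, with `all_on_gpus` as my own helper name:

```python
def all_on_gpus(device_map: dict, allowed: set) -> bool:
    """True if every module in an hf_device_map landed on an allowed GPU index."""
    return all(dev in allowed for dev in device_map.values())

# Values are GPU indices when placed correctly, or strings like "cpu"/"disk"
# when accelerate has offloaded them behind your back.
good = {"model.layers.0": 0, "model.layers.1": 1}
bad = {"model.layers.0": 0, "model.layers.1": "cpu"}

assert all_on_gpus(good, allowed={0, 1})
assert not all_on_gpus(bad, allowed={0, 1})
# Real usage: assert all_on_gpus(text_encoder.hf_device_map, allowed={0, 1})
```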

The Final Setup

/opt/flux2-dev/
├── app.py                  # Gradio web UI
├── run_flux.py             # CLI script
├── models/FLUX.2-dev/      # model weights (~166GB)
└── outputs/                # generated images

Run the web UI:

cd /opt/flux2-dev
source venv/bin/activate
python app.py
# Open http://localhost:7860

Conclusion

The AI industry wants you to believe you need the latest hardware to run the latest models. And sure, modern GPUs will be dramatically faster and more efficient. But there’s something deeply satisfying about getting a 32B model running on “legacy” cards using nothing but sharding, stubbornness, and the occasional whisper of profanity.

Now if you’ll excuse me, I’m off to generate “absurdly detailed robots in dramatic neon rain in the style of a Renaissance painting” because I can.

Eight V100s running a model this size are basically a space heater with NVLink. And that’s the point: this was a fun experiment in sharding, debugging, and seeing how far old hardware can be pushed — not a blueprint for operating FLUX.2-dev continuously. It works, it’s impressive, and it’s also hilariously inefficient compared to modern setups.

So I’m happy keeping it in the “prove it once, generate a test image (or two), learn a lot, move on” category.

Was it worth it? Absolutely.
Would I run it like this continuously? Absolutely not.


FLUX.2-dev Multi-GPU Setup (Technical Notes)

FLUX.2-dev is a 32B parameter text-to-image model from Black Forest Labs. This setup runs it across 8× V100-SXM2-32GB GPUs (256GB total VRAM).

Requirements

  • 8× NVIDIA GPUs with combined VRAM ≥ ~110GB (tested on 8× V100-32GB)
  • Python 3.10+
  • CUDA 11.8+
  • ~170GB disk space for model weights

Installation

# Choose an install directory
cd /opt/flux2-dev

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
pip install diffusers transformers accelerate huggingface_hub gradio

Model Download

Example model path:

/opt/flux2-dev/models/FLUX.2-dev/

To download/update (if you use a script):

source venv/bin/activate
python download_model.py

GPU Memory Allocation

| Component | GPUs | Memory Used |
| --- | --- | --- |
| Text Encoder (Mistral-3) | 0, 1 | ~45GB |
| Transformer (DiT) | 2–7 | ~60GB |
| VAE | 0 | ~0.2GB |
| Total | 0–7 | ~105GB |

Running

Option 1: Web Interface (Gradio)

cd /opt/flux2-dev
source venv/bin/activate
python app.py

Access: http://localhost:7860

Option 2: Command Line

cd /opt/flux2-dev
source venv/bin/activate
python run_flux.py

Edit run_flux.py to change prompt/settings.

Option 3: Python API

import os
import torch

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

from diffusers import Flux2Pipeline, AutoModel
from diffusers.models import AutoencoderKLFlux2
from transformers import Mistral3ForConditionalGeneration

MODEL_ID = "/opt/flux2-dev/models/FLUX.2-dev"
DTYPE = torch.float16

# Load components with GPU sharding
text_encoder = Mistral3ForConditionalGeneration.from_pretrained(
    MODEL_ID,
    subfolder="text_encoder",
    torch_dtype=DTYPE,
    device_map="balanced",
    max_memory={0: "30GiB", 1: "30GiB"},
)

transformer = AutoModel.from_pretrained(
    MODEL_ID,
    subfolder="transformer",
    torch_dtype=DTYPE,
    device_map="balanced",
    max_memory={2: "30GiB", 3: "30GiB", 4: "30GiB", 5: "30GiB", 6: "30GiB", 7: "30GiB"},
)

vae = AutoencoderKLFlux2.from_pretrained(
    MODEL_ID,
    subfolder="vae",
    torch_dtype=DTYPE,
).to("cuda:0")
vae.enable_tiling()

# Assemble pipeline
pipe = Flux2Pipeline.from_pretrained(
    MODEL_ID,
    text_encoder=None,
    transformer=None,
    vae=None,
    torch_dtype=DTYPE,
)
pipe.text_encoder = text_encoder
pipe.transformer = transformer
pipe.vae = vae

# Generate
image = pipe(
    prompt="A futuristic city at sunset",
    width=1024,
    height=1024,
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]

# Save output
os.makedirs("outputs", exist_ok=True)
image.save("outputs/output.png")

Generation Settings

| Parameter | Default | Range | Description |
| --- | --- | --- | --- |
| width | 1024 | 256–2048 | Width in pixels (multiples of 64) |
| height | 1024 | 256–2048 | Height in pixels (multiples of 64) |
| num_inference_steps | 28 | 1–100 | More steps = better quality, slower |
| guidance_scale | 3.5 | 1.0–20.0 | Prompt adherence |
| seed | random | 0–2³²−1 | Reproducibility |
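The width/height constraint (multiples of 64, clamped to 256–2048) is easy to violate from a UI slider or free-text field, so it's worth snapping inputs before they hit the pipeline. A small helper sketch (`snap_dim` is my name, not part of diffusers):

```python
def snap_dim(value: int, step: int = 64, lo: int = 256, hi: int = 2048) -> int:
    """Round a requested dimension to the nearest multiple of `step`, clamped to [lo, hi]."""
    snapped = round(value / step) * step
    return max(lo, min(hi, snapped))

print(snap_dim(1000))  # 1024
print(snap_dim(100))   # 256
print(snap_dim(3000))  # 2048
```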

Performance

On 8× V100-SXM2-32GB:

  • Model load time: ~60 seconds
  • Generation time (1024×1024, 28 steps): ~2–3 minutes
  • Speed: ~0.15–0.2 steps/second

Output

Generated images are saved as PNG files:

  • Web UI: ./outputs/flux2_YYYYMMDD_HHMMSS_SEED.png (example)
  • CLI/API: ./outputs/output.png (or wherever you save)

Troubleshooting

“CUDA out of memory”

  • Reduce resolution
  • Reduce inference steps
  • Check other GPU processes: nvidia-smi
  • Note: Unlikely on 8×32GB, common on smaller setups.

“Expected all tensors to be on the same device”

  • Ensure the VAE is on cuda:0
  • Don’t mix enable_model_cpu_offload() with manual sharding unless you know exactly how tensors move

Slow generation

  • V100s are simply slower than A100/H100-class GPUs
  • Use fewer steps (20–28 is usually fine)
  • Test at smaller resolutions first
