Running a 32 Billion Parameter Image Model on “Ancient” V100 GPUs
The Challenge
It’s 2026. The AI world has moved on to H100s, H200s, and whatever NVIDIA announced last week that costs more than my car. Meanwhile, I’m sitting here with a server full of V100s — GPUs that were hot stuff back in 2017.
But here’s the thing: I’ve got eight of them. That’s 256GB of VRAM just… sitting there. And Black Forest Labs dropped FLUX.2-dev, a 32B parameter text-to-image model that can crank out images so photorealistic it feels like cheating.
The internet said I needed an H100. The internet was wrong.
The Setup (AKA “What Are We Working With?”)
8x Tesla V100-SXM2-32GB
256GB Total VRAM
NVLink interconnect (fancy GPU-to-GPU highways)
One engineer with questionable decision-making skills
The V100 is like that Honda Civic from 2005 that refuses to die. Sure, it doesn’t have the fancy new features — no BF16, no transformer engines — but it’s reliable, it’s paid off, and it still gets the job done if you stop expecting it to be a spaceship.
The Model (AKA “Why Is This So Big?”)
FLUX.2-dev breaks down like this:
| Component | Size | What It Does |
|---|---|---|
| Text Encoder (Mistral-3) | ~24B params | Reads your prompt and pretends to understand it |
| Transformer (DiT) | ~32B params | The actual artist (with commitment issues about which GPU to live on) |
| VAE | ~84M params | Turns math into pixels |
Total: roughly “my single V100 is not amused.”
The Journey
Attempt #1: “Just Load It, How Hard Can It Be?”
pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev")
pipe.to("cuda")
Result: the GPU had what I can only describe as a panic attack.
CUDA out of memory. Tried to allocate 31.16 GB.
GPU 0 has 143.88 MiB free.
One GPU down, seven witnesses.
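Before the next attempt, it helps to see how much room each card actually has. A minimal sketch using PyTorch’s `torch.cuda.mem_get_info` (the helper name is mine):

```python
import torch

def vram_report() -> list[tuple[float, float]]:
    """(free_GB, total_GB) per visible GPU; empty list if CUDA is absent."""
    if not torch.cuda.is_available():
        return []
    return [
        tuple(x / 1e9 for x in torch.cuda.mem_get_info(i))
        for i in range(torch.cuda.device_count())
    ]

for idx, (free, total) in enumerate(vram_report()):
    print(f"GPU {idx}: {free:.1f} GB free / {total:.1f} GB total")
```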
Attempt #2: “Let’s Try That CPU Offload Thing”
pipe.enable_model_cpu_offload()
Result: it loaded! Then during inference:
RuntimeError: Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu!
The model was having an identity crisis about where it lived. Relatable.
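One way to catch this identity crisis before inference is to list which devices a module’s parameters actually live on. A small sketch (the helper name is mine; the `nn.Linear` is just a stand-in for a real pipeline component):

```python
import torch.nn as nn

def param_devices(module: nn.Module) -> set[str]:
    """Distinct devices this module's parameters currently occupy."""
    return {str(p.device) for p in module.parameters()}

# Stand-in module; on a real run you'd pass pipe.transformer, pipe.vae, etc.
probe = nn.Linear(4, 4)
print(param_devices(probe))
```

If the set contains more than one entry for a component that should be device-local, you have found your mismatch.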
Attempt #3: “device_map='balanced' Will Save Us”
The Hugging Face docs promised this would distribute my model across all GPUs like some kind of silicon socialism.
pipe = Flux2Pipeline.from_pretrained(
model_id,
device_map="balanced",
max_memory={i: "28GiB" for i in range(8)}
)
Result:
Some parameters are on the meta device because they were offloaded to the cpu.
The model looked at my 256GB of VRAM and said “nah, I’ll just hang out in RAM, thanks.”
GPU memory usage: ~0.2GB across all 8 GPUs.
*screaming internally*
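When `device_map` misbehaves, the placement that accelerate chose is recorded on the loaded model’s `hf_device_map` attribute. A sketch of summarizing it (the example map below is made up for illustration):

```python
from collections import Counter

def summarize_placement(device_map: dict) -> Counter:
    """Tally modules per device; 'cpu' or 'disk' entries explain low GPU usage."""
    return Counter(str(d) for d in device_map.values())

example_map = {"blocks.0": 0, "blocks.1": 1, "blocks.2": "cpu"}  # made-up map
print(summarize_placement(example_map))
# Real usage: summarize_placement(pipe.transformer.hf_device_map)
```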
Attempt #4: “Fine, I’ll Do It Myself”
Sometimes you have to grab the model by the parameters and tell it where to go.
# Text Encoder: "You two, GPUs 0 and 1, you're on text duty"
text_encoder = Mistral3ForConditionalGeneration.from_pretrained(
MODEL_ID, subfolder="text_encoder",
device_map="balanced",
max_memory={0: "30GiB", 1: "30GiB"} # AND NOWHERE ELSE
)
# Transformer: "The rest of you, handle the heavy lifting"
transformer = AutoModel.from_pretrained(
MODEL_ID, subfolder="transformer",
device_map="balanced",
max_memory={2: "30GiB", 3: "30GiB", 4: "30GiB",
5: "30GiB", 6: "30GiB", 7: "30GiB"}
)
# VAE: "You're small, you can share with GPU 0"
vae = AutoencoderKLFlux2.from_pretrained(
MODEL_ID, subfolder="vae"
).to("cuda:0")
Result:
=== GPU Memory ===
GPU 0: 23.2GB ✓
GPU 1: 21.7GB ✓
GPU 2: 8.7GB ✓
GPU 3: 11.9GB ✓
GPU 4: 11.9GB ✓
GPU 5: 11.9GB ✓
GPU 6: 11.9GB ✓
GPU 7: 3.8GB ✓
IT’S DISTRIBUTED. THE PARAMETERS ARE ON THE GPUS. I AM A GOD.
Attempt #5: “Please Generate An Image”
image = pipe(prompt="A futuristic city at sunset").images[0]
Progress bar appears…
0%| | 0/28 [00:00<?, ?it/s]
4%|▎ | 1/28 [00:05<02:16, 5.04s/it]
...
100%|██████████| 28/28 [02:06<00:00, 4.53s/it]
Then:
AttributeError: 'AutoencoderKL' object has no attribute 'bn'
Of course. The VAE for FLUX.2 isn’t a regular VAE. It’s a special AutoencoderKLFlux2 with batch normalization, because why would anything be simple?
Attempt #6: “The Actual Working Version”
vae = AutoencoderKLFlux2.from_pretrained(...) # Note the Flux2
And then…
Saved to ./outputs/output.png
It Works. It Actually Works.
A couple of minutes later, I had a 1024×1024 image of a futuristic city that looked like it came from a Hollywood concept art department.
Generated on hardware from 2017.
Take that, GPU scalpers.
The Numbers
| Metric | Value |
|---|---|
| Model Load Time | ~60 seconds |
| Generation Time (1024×1024, 28 steps) | ~2–3 minutes |
| Total VRAM Used | ~105GB |
| GPUs Required | 8× V100-32GB |
| Smugness Level | Immeasurable |
What I Learned
- `device_map="balanced"` lies (or at least “optimizes” in ways you won’t like). It will offload to CPU if you let it. If you want predictable placement, constrain each component’s GPU set explicitly.
- The VAE must be on the same GPU as where the latents end up; in my case that’s `cuda:0`. I lost time to device-mismatch errors before accepting this as law.
- Old GPUs aren’t useless. They’re just… distributed. Eight V100s can do what one modern GPU does, just slower and with more coordination.
- FP16 is your friend. V100s don’t do BF16 natively, but FP16 inference works fine here.
- Persistence pays off. This took multiple hours and several dead ends, but the final setup is stable and repeatable.
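The FP16 point can be automated so the same script also runs on newer cards. A sketch using `torch.cuda.is_bf16_supported()` (the helper name is mine):

```python
import torch

def pick_dtype() -> torch.dtype:
    """BF16 on cards that support it (Ampere and newer); FP16 otherwise,
    which is the right fallback on V100s."""
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16

print(pick_dtype())
```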
The Final Setup
/opt/flux2-dev/
├── app.py # Gradio web UI
├── run_flux.py # CLI script
├── models/FLUX.2-dev/ # model weights (~166GB)
└── outputs/ # generated images
Run the web UI:
cd /opt/flux2-dev
source venv/bin/activate
python app.py
# Open http://localhost:7860
Conclusion
The AI industry wants you to believe you need the latest hardware to run the latest models. And sure, modern GPUs will be dramatically faster and more efficient. But there’s something deeply satisfying about getting a 32B model running on “legacy” cards using nothing but sharding, stubbornness, and the occasional whisper of profanity.
Now if you’ll excuse me, I’m off to generate “absurdly detailed robots in dramatic neon rain in the style of a Renaissance painting” because I can.
Eight V100s running a model this size are basically a space heater with NVLink. And that’s the point: this was a fun experiment in sharding, debugging, and seeing how far old hardware can be pushed — not a blueprint for operating FLUX.2-dev continuously. It works, it’s impressive, and it’s also hilariously inefficient compared to modern setups.
So I’m happy keeping it in the “prove it once, generate a test image (or two), learn a lot, move on” category.
Was it worth it? Absolutely.
Would I run it like this continuously? Absolutely not.
FLUX.2-dev Multi-GPU Setup (Technical Notes)
FLUX.2-dev is a 32B parameter text-to-image model from Black Forest Labs. This setup runs it across 8× V100-SXM2-32GB GPUs (256GB total VRAM).
Requirements
- 8× NVIDIA GPUs with combined VRAM ≥ ~110GB (tested on 8× V100-32GB)
- Python 3.10+
- CUDA 11.8+
- ~170GB disk space for model weights
Installation
# Choose an install directory
cd /opt/flux2-dev
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
pip install diffusers transformers accelerate huggingface_hub gradio
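After installing, a quick sanity check that torch sees the right CUDA build and all eight cards (the printed values will differ per machine):

```python
import torch

print("torch:", torch.__version__)
print("CUDA build:", torch.version.cuda)      # e.g. "11.8"; None on CPU-only wheels
print("GPUs visible:", torch.cuda.device_count())
```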
Model Download
Example model path:
/opt/flux2-dev/models/FLUX.2-dev/
To download/update (if you use a script):
source venv/bin/activate
python download_model.py
GPU Memory Allocation
| Component | GPUs | Memory Used |
|---|---|---|
| Text Encoder (Mistral-3) | 0, 1 | ~45GB |
| Transformer (DiT) | 2–7 | ~60GB |
| VAE | 0 | ~0.2GB |
| Total | 0–7 | ~105GB |
Running
Option 1: Web Interface (Gradio)
cd /opt/flux2-dev
source venv/bin/activate
python app.py
Access:
- Local: http://localhost:7860
- Network: http://<server-ip>:7860
Option 2: Command Line
cd /opt/flux2-dev
source venv/bin/activate
python run_flux.py
Edit run_flux.py to change prompt/settings.
Option 3: Python API
import os
import torch
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
from diffusers import Flux2Pipeline, AutoModel
from diffusers.models import AutoencoderKLFlux2
from transformers import Mistral3ForConditionalGeneration
MODEL_ID = "/opt/flux2-dev/models/FLUX.2-dev"
DTYPE = torch.float16
# Load components with GPU sharding
text_encoder = Mistral3ForConditionalGeneration.from_pretrained(
MODEL_ID,
subfolder="text_encoder",
torch_dtype=DTYPE,
device_map="balanced",
max_memory={0: "30GiB", 1: "30GiB"},
)
transformer = AutoModel.from_pretrained(
MODEL_ID,
subfolder="transformer",
torch_dtype=DTYPE,
device_map="balanced",
max_memory={2: "30GiB", 3: "30GiB", 4: "30GiB", 5: "30GiB", 6: "30GiB", 7: "30GiB"},
)
vae = AutoencoderKLFlux2.from_pretrained(
MODEL_ID,
subfolder="vae",
torch_dtype=DTYPE,
).to("cuda:0")
vae.enable_tiling()
# Assemble pipeline
pipe = Flux2Pipeline.from_pretrained(
MODEL_ID,
text_encoder=None,
transformer=None,
vae=None,
torch_dtype=DTYPE,
)
pipe.text_encoder = text_encoder
pipe.transformer = transformer
pipe.vae = vae
# Generate
image = pipe(
prompt="A futuristic city at sunset",
width=1024,
height=1024,
num_inference_steps=28,
guidance_scale=3.5,
).images[0]
# Save output
os.makedirs("outputs", exist_ok=True)
image.save("outputs/output.png")
Generation Settings
| Parameter | Default | Range | Description |
|---|---|---|---|
| `width` | 1024 | 256–2048 | Width (multiples of 64) |
| `height` | 1024 | 256–2048 | Height (multiples of 64) |
| `num_inference_steps` | 28 | 1–100 | More steps = better quality, slower |
| `guidance_scale` | 3.5 | 1.0–20.0 | Prompt adherence |
| `seed` | random | 0–2³² | Reproducibility |
Performance
On 8× V100-SXM2-32GB:
- Model load time: ~60 seconds
- Generation time (1024×1024, 28 steps): ~2–3 minutes
- Speed: ~0.15–0.2 steps/second
Output
Generated images are saved as PNG files:
- Web UI: `./outputs/flux2_YYYYMMDD_HHMMSS_SEED.png` (example)
- CLI/API: `./outputs/output.png` (or wherever you save)
Troubleshooting
“CUDA out of memory”
- Reduce resolution
- Reduce inference steps
- Check other GPU processes: `nvidia-smi`
- Note: unlikely on 8×32GB, more common on smaller setups
“Expected all tensors to be on the same device”
- Ensure the VAE is on `cuda:0`
- Don’t mix `enable_model_cpu_offload()` with manual sharding unless you know exactly how tensors move
Slow generation
- V100s are simply slower than A100/H100-class GPUs
- Use fewer steps (20–28 is usually fine)
- Test at smaller resolutions first
Links
- FLUX.2 on Hugging Face: https://huggingface.co/black-forest-labs/FLUX.2-dev
- Black Forest Labs: https://blackforestlabs.ai/
- Diffusers docs: https://huggingface.co/docs/diffusers/
