SDXL-MARIUS-V18

Run full SDXL on 6 GB VRAM — 22 GB model, streamed from disk in real time.

MARIUS-V18 uses a custom streaming architecture and a lattice-based vector quantization format (LZR2) to run Stable Diffusion XL on hardware that would normally be incompatible: GTX 1060 6GB, GTX 1660, RTX 2060.

Results at a glance

	Standard SDXL	MARIUS-V18
VRAM required	~12 GB	~6 GB
Disk size	~7 GB	~22 GB
Visual quality	Standard	Lossless
Compatible GPUs	RTX 3060+	GTX 1060 6GB+
Runtime	Standard	Pure PyTorch

The trade-off is explicit: more disk space, much less VRAM. The full model lives on SSD/RAM; only the active layers are streamed to GPU at any given moment.

Hardware requirements

GPU: 6 GB+ VRAM (GTX 1060 6GB minimum)
RAM: 16 GB+ recommended
Storage: 25 GB free (SSD strongly recommended)

Installation

pip install torch diffusers transformers accelerate safetensors psutil numpy

Then download these two files to your working directory:

Marius_SDXL_V65_Universal.lzr2 (22 GB) — do not rename
solvay_v65_loader.py (see below)

Usage

1. Create `solvay_v65_loader.py`

Click to expand loader code

import torch, struct, zlib, numpy as np, itertools, os, gc, sys, psutil
from diffusers import StableDiffusionXLPipeline

_ARTIFACT = "Marius_SDXL_V65_Universal.lzr2"
_BASE = "stabilityai/stable-diffusion-xl-base-1.0"

def _stat():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / (1024 ** 3)

def inject_solvay(path, pipe):
    if not os.path.exists(path):
        raise FileNotFoundError(f"Missing artifact: {path}")
    print("Initializing streaming engine...")

    _opts, u = {}, pipe.unet
    _g_v = lambda d: np.array(list(itertools.product([-1, 0, 1], repeat=d)), dtype=np.float32)
    idx = 0

    with open(path, "rb") as f:
        if f.read(4) != b"LZR2":
            raise ValueError("Invalid signature")

        while True:
            lkb = f.read(4)
            if not lkb: break

            key = f.read(struct.unpack('I', lkb)[0]).decode('utf-8')
            ls = struct.unpack('I', f.read(4))[0]
            sh = [struct.unpack('I', f.read(4))[0] for _ in range(ls)]
            tf = struct.unpack('B', f.read(1))[0]
            _w = None

            if tf == 1:
                dp, C = struct.unpack('I', f.read(4))[0], sh[0]
                _a = np.frombuffer(f.read(C*dp*4), dtype=np.float32).reshape(C, dp)
                _mn = np.frombuffer(f.read(C*4), dtype=np.float32)
                _sc = np.frombuffer(f.read(C*4), dtype=np.float32)
                lz = struct.unpack('I', f.read(4))[0]
                _ix_flat = np.frombuffer(zlib.decompress(f.read(lz)), dtype=np.uint16)
                n_blocks = _ix_flat.size // C
                _ix = _ix_flat.reshape(C, n_blocks)
                no = struct.unpack('I', f.read(4))[0]
                N_feat = int(np.prod(sh[1:])) if len(sh) > 1 else 1

                if dp not in _opts:
                    _opts[dp] = _g_v(dp)
                rc = _opts[dp][_ix].reshape(C, -1) if n_blocks > 0 else np.zeros((C, 0), dtype=np.float32)

                fb = np.zeros((C, N_feat), dtype=np.float32)
                vw = min(rc.shape[1], N_feat)
                if vw > 0:
                    fb[:, :vw] = rc[:, :vw]
                fb = (fb + _mn[:, None]) * _sc[:, None]

                if no > 0:
                    md = max(C, n_blocks) * dp
                    fmt, fsz = ('H', 8) if md < 65536 else ('I', 12)
                    dt = np.dtype([('r', np.uint16 if fmt=='H' else np.uint32),
                                   ('c', np.uint16 if fmt=='H' else np.uint32),
                                   ('v', np.float32)])
                    batch = np.frombuffer(f.read(no * fsz), dtype=dt)
                    m = (batch['r'] < C) & (batch['c'] < N_feat)
                    vb = batch[m]
                    fb[vb['r'], vb['c']] = vb['v']

                _w = torch.from_numpy(fb.reshape(sh).astype(np.float16))

            if _w is not None:
                try:
                    t = u
                    pts = key.split('.')
                    for p in pts[:-1]:
                        t = getattr(t, p)
                    getattr(t, pts[-1]).data.copy_(_w.to(pipe.device, dtype=torch.float16))
                except:
                    pass
                del _w

            idx += 1
            if idx % 10 == 0:
                sys.stdout.write(f"\r[STREAM] Module {idx:04d} | RAM: {_stat():.1f}GB")
                sys.stdout.flush()
            if idx % 200 == 0:
                gc.collect()

    print(f"\nStream complete ({idx} modules loaded)")

def get_pipe():
    print("Loading base architecture...")
    pipe = StableDiffusionXLPipeline.from_pretrained(
        _BASE,
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True
    )
    pipe.enable_model_cpu_offload()
    inject_solvay(_ARTIFACT, pipe)
    return pipe

2. Create `inference.py`

from solvay_v65_loader import get_pipe

pipe = get_pipe()
print("Ready. Type 'quit' to exit.\n")

img_idx = 1
while True:
    prompt = input(f"[{img_idx}] Prompt > ").strip()
    if prompt.lower() in ['quit', 'exit', 'q']:
        break
    if not prompt:
        continue
    image = pipe(prompt, num_inference_steps=30).images[0]
    filename = f"output_{img_idx:03d}.png"
    image.save(filename)
    print(f"Saved: {filename}\n")
    img_idx += 1

3. Run

python inference.py

Technical details

LZR2 format

The .lzr2 file is a custom binary streaming format. Weights are stored as indices into a lattice codebook (vectors from {-1, 0, 1}^d), compressed with zlib. A sparse overlay corrects critical features that the lattice approximation doesn't capture precisely. At load time, weights are reconstructed to float16 and injected layer by layer into the standard SDXL U-Net.

This is not a general-purpose format — it is designed specifically for streaming large diffusion models from slow storage to limited VRAM.

Streaming architecture

Only the layers needed for the current forward pass are held in VRAM. The rest stay on SSD or RAM. This is conceptually similar to video streaming: the full file never needs to fit in the playback buffer.

Memory management

Layer-wise streaming: active layers only in VRAM
CPU offloading via standard diffusers pipeline
GC sweep every 200 modules during load
FP16 precision throughout

Known limitations

First load: 2–5 minutes depending on storage speed
SSD strongly recommended — HDD works but is significantly slower
Do not rename Marius_SDXL_V65_Universal.lzr2
ControlNet / LoRA compatibility: not tested

License

CC BY-NC 4.0 — free for personal and research use, image generation for commercial purposes allowed.
No reverse engineering of the LZR2 format. No redistribution of modified weights.
Base model: Stable Diffusion XL 1.0 by Stability AI.

Contact

Questions or feedback: open a discussion

Downloads last month: -

Model tree for muquanta-axel-v17/SDXL-MARIUS-V18

Base model

stabilityai/stable-diffusion-xl-base-1.0

Finetuned

(1182)

this model