SDXL-MARIUS-V18

Run full SDXL on 6 GB VRAM โ€” 22 GB model, streamed from disk in real time.

MARIUS-V18 uses a custom streaming architecture and a lattice-based vector quantization format (LZR2) to run Stable Diffusion XL on hardware that would normally be incompatible: GTX 1060 6GB, GTX 1660, RTX 2060.


Results at a glance

Standard SDXL MARIUS-V18
VRAM required ~12 GB ~6 GB
Disk size ~7 GB ~22 GB
Visual quality Standard Lossless
Compatible GPUs RTX 3060+ GTX 1060 6GB+
Runtime Standard Pure PyTorch

The trade-off is explicit: more disk space, much less VRAM. The full model lives on SSD/RAM; only the active layers are streamed to GPU at any given moment.


Hardware requirements

  • GPU: 6 GB+ VRAM (GTX 1060 6GB minimum)
  • RAM: 16 GB+ recommended
  • Storage: 25 GB free (SSD strongly recommended)

Installation

pip install torch diffusers transformers accelerate safetensors psutil numpy

Then download these two files to your working directory:

  • Marius_SDXL_V65_Universal.lzr2 (22 GB) โ€” do not rename
  • solvay_v65_loader.py (see below)

Usage

1. Create solvay_v65_loader.py

Click to expand loader code
import torch, struct, zlib, numpy as np, itertools, os, gc, sys, psutil
from diffusers import StableDiffusionXLPipeline

_ARTIFACT = "Marius_SDXL_V65_Universal.lzr2"
_BASE = "stabilityai/stable-diffusion-xl-base-1.0"

def _stat():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / (1024 ** 3)

def inject_solvay(path, pipe):
    if not os.path.exists(path):
        raise FileNotFoundError(f"Missing artifact: {path}")
    print("Initializing streaming engine...")

    _opts, u = {}, pipe.unet
    _g_v = lambda d: np.array(list(itertools.product([-1, 0, 1], repeat=d)), dtype=np.float32)
    idx = 0

    with open(path, "rb") as f:
        if f.read(4) != b"LZR2":
            raise ValueError("Invalid signature")

        while True:
            lkb = f.read(4)
            if not lkb: break

            key = f.read(struct.unpack('I', lkb)[0]).decode('utf-8')
            ls = struct.unpack('I', f.read(4))[0]
            sh = [struct.unpack('I', f.read(4))[0] for _ in range(ls)]
            tf = struct.unpack('B', f.read(1))[0]
            _w = None

            if tf == 1:
                dp, C = struct.unpack('I', f.read(4))[0], sh[0]
                _a = np.frombuffer(f.read(C*dp*4), dtype=np.float32).reshape(C, dp)
                _mn = np.frombuffer(f.read(C*4), dtype=np.float32)
                _sc = np.frombuffer(f.read(C*4), dtype=np.float32)
                lz = struct.unpack('I', f.read(4))[0]
                _ix_flat = np.frombuffer(zlib.decompress(f.read(lz)), dtype=np.uint16)
                n_blocks = _ix_flat.size // C
                _ix = _ix_flat.reshape(C, n_blocks)
                no = struct.unpack('I', f.read(4))[0]
                N_feat = int(np.prod(sh[1:])) if len(sh) > 1 else 1

                if dp not in _opts:
                    _opts[dp] = _g_v(dp)
                rc = _opts[dp][_ix].reshape(C, -1) if n_blocks > 0 else np.zeros((C, 0), dtype=np.float32)

                fb = np.zeros((C, N_feat), dtype=np.float32)
                vw = min(rc.shape[1], N_feat)
                if vw > 0:
                    fb[:, :vw] = rc[:, :vw]
                fb = (fb + _mn[:, None]) * _sc[:, None]

                if no > 0:
                    md = max(C, n_blocks) * dp
                    fmt, fsz = ('H', 8) if md < 65536 else ('I', 12)
                    dt = np.dtype([('r', np.uint16 if fmt=='H' else np.uint32),
                                   ('c', np.uint16 if fmt=='H' else np.uint32),
                                   ('v', np.float32)])
                    batch = np.frombuffer(f.read(no * fsz), dtype=dt)
                    m = (batch['r'] < C) & (batch['c'] < N_feat)
                    vb = batch[m]
                    fb[vb['r'], vb['c']] = vb['v']

                _w = torch.from_numpy(fb.reshape(sh).astype(np.float16))

            if _w is not None:
                try:
                    t = u
                    pts = key.split('.')
                    for p in pts[:-1]:
                        t = getattr(t, p)
                    getattr(t, pts[-1]).data.copy_(_w.to(pipe.device, dtype=torch.float16))
                except:
                    pass
                del _w

            idx += 1
            if idx % 10 == 0:
                sys.stdout.write(f"\r[STREAM] Module {idx:04d} | RAM: {_stat():.1f}GB")
                sys.stdout.flush()
            if idx % 200 == 0:
                gc.collect()

    print(f"\nStream complete ({idx} modules loaded)")

def get_pipe():
    print("Loading base architecture...")
    pipe = StableDiffusionXLPipeline.from_pretrained(
        _BASE,
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True
    )
    pipe.enable_model_cpu_offload()
    inject_solvay(_ARTIFACT, pipe)
    return pipe

2. Create inference.py

from solvay_v65_loader import get_pipe

pipe = get_pipe()
print("Ready. Type 'quit' to exit.\n")

img_idx = 1
while True:
    prompt = input(f"[{img_idx}] Prompt > ").strip()
    if prompt.lower() in ['quit', 'exit', 'q']:
        break
    if not prompt:
        continue
    image = pipe(prompt, num_inference_steps=30).images[0]
    filename = f"output_{img_idx:03d}.png"
    image.save(filename)
    print(f"Saved: {filename}\n")
    img_idx += 1

3. Run

python inference.py

Technical details

LZR2 format

The .lzr2 file is a custom binary streaming format. Weights are stored as indices into a lattice codebook (vectors from {-1, 0, 1}^d), compressed with zlib. A sparse overlay corrects critical features that the lattice approximation doesn't capture precisely. At load time, weights are reconstructed to float16 and injected layer by layer into the standard SDXL U-Net.

This is not a general-purpose format โ€” it is designed specifically for streaming large diffusion models from slow storage to limited VRAM.

Streaming architecture

Only the layers needed for the current forward pass are held in VRAM. The rest stay on SSD or RAM. This is conceptually similar to video streaming: the full file never needs to fit in the playback buffer.

Memory management

  • Layer-wise streaming: active layers only in VRAM
  • CPU offloading via standard diffusers pipeline
  • GC sweep every 200 modules during load
  • FP16 precision throughout

Known limitations

  • First load: 2โ€“5 minutes depending on storage speed
  • SSD strongly recommended โ€” HDD works but is significantly slower
  • Do not rename Marius_SDXL_V65_Universal.lzr2
  • ControlNet / LoRA compatibility: not tested

License

CC BY-NC 4.0 โ€” free for personal and research use, image generation for commercial purposes allowed.
No reverse engineering of the LZR2 format. No redistribution of modified weights.
Base model: Stable Diffusion XL 1.0 by Stability AI.


Contact

Questions or feedback: open a discussion

Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for muquanta-axel-v17/SDXL-MARIUS-V18

Finetuned
(1182)
this model