Llama JoyCaption Beta One (MLX 8-bit)

MLX port of fancyfeast/llama-joycaption-beta-one-hf-llava, quantized to 8-bit for efficient inference on Apple Silicon.

JoyCaption is a free, open, and uncensored image captioning VLM built on Llama 3.1 8B and SigLIP2, designed for generating descriptive captions to train diffusion models.

Model Details

Architecture: LLaVA (SigLIP2 vision encoder + Llama 3.1 8B)
Quantization: 8-bit (group_size=64)
Vision encoder: google/siglip2-so400m-patch14-384
Image resolution: 384x384
Total size: ~9.1 GB
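
A rough sanity check of the ~9.1 GB figure (a sketch, assuming MLX's group-wise affine quantization stores one float16 scale and one float16 bias per group of 64 weights, i.e. about 0.5 extra bits per weight):

```python
# Back-of-envelope size estimate: 8B parameters at 8 bits each, plus
# 2 float16 values (scale + bias) per group of 64 weights, which adds
# 32 bits per 64 weights = 0.5 bits per weight.
params = 8.0e9
bits_per_weight = 8 + 0.5
size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.1f} GB")  # ~8.5 GB; the float16 vision layers account for most of the remainder
```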

Usage with mlx-vlm

pip install mlx-vlm

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

MODEL = "n0kovo/llama-joycaption-beta-one-hf-llava-mlx-8Bit"

model, processor = load(MODEL)
config = load_config(MODEL)

prompt = apply_chat_template(
    processor,
    config,
    "Write a long descriptive caption for this image in a formal tone.",
    num_images=1,
)

output = generate(
    model,
    processor,
    prompt,
    image="image.jpg",
    max_tokens=512,
    temperature=0.6,
)
print(output)

Conversion Notes

  • Language model weights quantized to 8-bit via mlx-lm
  • Vision encoder weights quantized to 8-bit where layer dimensions allow (group_size=64); 28 MLP layers whose input dimension (4304) is not divisible by 64 are kept in float16
  • Projector weights quantized to 8-bit
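
The per-layer decision above comes down to a simple divisibility check: group-wise quantization splits each weight row into groups of group_size along the input dimension, so that dimension must divide evenly. A minimal sketch (can_quantize is a hypothetical helper for illustration, not part of mlx):

```python
def can_quantize(in_features: int, group_size: int = 64) -> bool:
    # Group-wise quantization packs weights in groups of `group_size`
    # along the input dimension, so it must divide evenly.
    return in_features % group_size == 0

print(can_quantize(4096))  # Llama 3.1 hidden size -> True
print(can_quantize(4304))  # SigLIP2 MLP intermediate size -> False, so kept in float16
```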

Credits

All credit goes to fancyfeast for the original model, fancyfeast/llama-joycaption-beta-one-hf-llava.
