Llama JoyCaption Beta One (MLX 8-bit)

MLX port of fancyfeast/llama-joycaption-beta-one-hf-llava, quantized to 8-bit for efficient inference on Apple Silicon.

JoyCaption is a free, open, and uncensored image captioning VLM built on Llama 3.1 8B and SigLIP2, designed for generating descriptive captions to train diffusion models.

Model Details

Architecture: LLaVA (SigLIP2 vision encoder + Llama 3.1 8B)
Quantization: 8-bit (group_size=64)
Vision encoder: google/siglip2-so400m-patch14-384
Image resolution: 384x384
Total size: ~9.1 GB
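
A rough sanity check of the ~9.1 GB figure (a sketch, assuming MLX's group-wise affine quantization stores one float16 scale and one float16 bias per group of 64 weights, i.e. about 0.5 extra bits per weight):

```python
# Back-of-envelope size estimate: 8B parameters at 8 bits each, plus
# 2 float16 values (scale + bias) per group of 64 weights, which adds
# 32 bits per 64 weights = 0.5 bits per weight.
params = 8.0e9
bits_per_weight = 8 + 0.5
size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.1f} GB")  # ~8.5 GB; the float16 vision layers account for most of the remainder
```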

Usage with mlx-vlm

pip install mlx-vlm

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

MODEL = "n0kovo/llama-joycaption-beta-one-hf-llava-mlx-8Bit"

model, processor = load(MODEL)
config = load_config(MODEL)

prompt = apply_chat_template(
    processor,
    config,
    "Write a long descriptive caption for this image in a formal tone.",
    num_images=1,
)

output = generate(
    model,
    processor,
    prompt,
    image="image.jpg",
    max_tokens=512,
    temperature=0.6,
)
print(output)

Conversion Notes

  • Language model weights quantized to 8-bit via mlx-lm
  • Vision encoder weights quantized to 8-bit where layer dimensions allow (group_size=64); 28 MLP layers whose input dimension (4304) is not divisible by 64 are kept in float16
  • Projector weights quantized to 8-bit
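
The per-layer decision above comes down to a simple divisibility check: group-wise quantization splits each weight row into groups of group_size along the input dimension, so that dimension must divide evenly. A minimal sketch (can_quantize is a hypothetical helper for illustration, not part of mlx):

```python
def can_quantize(in_features: int, group_size: int = 64) -> bool:
    # Group-wise quantization packs weights in groups of `group_size`
    # along the input dimension, so it must divide evenly.
    return in_features % group_size == 0

print(can_quantize(4096))  # Llama 3.1 hidden size -> True
print(can_quantize(4304))  # SigLIP2 MLP intermediate size -> False, so kept in float16
```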

Credits

All credit goes to fancyfeast for the original model, fancyfeast/llama-joycaption-beta-one-hf-llava.
