# Llama JoyCaption Beta One (MLX 8-bit)
An MLX port of `fancyfeast/llama-joycaption-beta-one-hf-llava`, quantized to 8-bit for efficient inference on Apple Silicon.

JoyCaption is a free, open, and uncensored image-captioning VLM built on Llama 3.1 8B and SigLIP2, designed to generate descriptive captions for training diffusion models.
## Model Details
| Field | Value |
|---|---|
| Architecture | LLaVA (SigLIP2 vision encoder + Llama 3.1 8B) |
| Parameters | 8B |
| Quantization | 8-bit (`group_size=64`) |
| Vision encoder | `google/siglip2-so400m-patch14-384` |
| Image resolution | 384×384 |
| Total size | ~9.1 GB |
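The ~9.1 GB figure is consistent with back-of-the-envelope arithmetic: with 8-bit weights and a float16 scale and bias stored per group of 64, each quantized weight costs about 8.5 bits. A rough sketch (assuming ~8B quantized parameters; the float16-kept vision layers and other buffers account for the remainder):

```python
def quantized_size_gb(n_params, bits=8, group_size=64, scale_bits=16):
    # Each group of `group_size` weights also stores a float16 scale and
    # bias, adding 2 * scale_bits / group_size bits of overhead per weight.
    bits_per_weight = bits + 2 * scale_bits / group_size
    return n_params * bits_per_weight / 8 / 1e9

print(round(quantized_size_gb(8e9), 2))  # ~8.5 GB before the float16-kept layers
```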
## Usage with mlx-vlm
```bash
pip install mlx-vlm
```
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

MODEL = "n0kovo/llama-joycaption-beta-one-hf-llava-mlx-8Bit"

# Load the quantized weights, processor, and model config.
model, processor = load(MODEL)
config = load_config(MODEL)

# Wrap the instruction in the model's chat template, reserving one image slot.
prompt = apply_chat_template(
    processor,
    config,
    "Write a long descriptive caption for this image in a formal tone.",
    num_images=1,
)

# Caption a local image file.
output = generate(
    model,
    processor,
    prompt,
    image="image.jpg",
    max_tokens=512,
    temperature=0.6,
)
print(output)
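`apply_chat_template` renders the instruction into a Llama 3.1-style chat prompt with an image placeholder that the vision encoder later fills with embeddings. As a rough illustration only (the exact special tokens come from the model's own chat template; the helper and strings below are approximations, not the library's code):

```python
def joycaption_prompt(instruction: str) -> str:
    # Illustrative approximation of a Llama 3.1 / LLaVA-style chat prompt:
    # the <image> placeholder is replaced by vision embeddings at inference.
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        "<image>\n" + instruction + "<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(joycaption_prompt(
    "Write a long descriptive caption for this image in a formal tone."
))
```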
## Conversion Notes
- Language model weights quantized to 8-bit via `mlx-lm`
- Vision encoder weights quantized to 8-bit where layer dimensions allow (`group_size=64`); 28 MLP layers with an incompatible dimension (4304, not divisible by 64) are kept in float16
- Projector weights quantized to 8-bit
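The dimension constraint comes from grouped quantization: MLX packs weights into groups of `group_size` along the input dimension, so that dimension must divide evenly. A minimal check of the rule (illustrative, not the converter's actual code):

```python
def quantizable(dim: int, group_size: int = 64) -> bool:
    # Grouped quantization needs the input dimension to split evenly
    # into groups of `group_size` weights.
    return dim % group_size == 0

print(quantizable(4096))  # True
print(quantizable(4304))  # False: 4304 = 64 * 67 + 16, so these layers stay float16
```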
## Credits
- Original model by fancyfeast (JoyCaption GitHub)