Voyage-4-nano GGUF

GGUF conversions of VoyageAI's voyage-4-nano embedding model for use with llama.cpp.

Files

| File | Size | Description |
| --- | --- | --- |
| voyage-4-nano-f16.gguf | 695 MB | Full precision (FP16) |
| voyage-4-nano-q8_0.gguf | 372 MB | 8-bit quantized |
| voyage-4-nano-linear.pt | 4.2 MB | Linear projection layer (required) |

Quality

Cosine similarity against HuggingFace reference embeddings:

| Format | Mean Similarity | Quality |
| --- | --- | --- |
| GGUF F16 | 1.000000 | Identical |
| GGUF Q8_0 | 0.999903 | Excellent |

The Q8_0 quantized model achieves 99.99% mean cosine similarity to the full-precision original at a 46% reduction in file size.
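The similarity figures above can be reproduced with a check along these lines (a minimal sketch; the function and variable names are illustrative, with the two arguments standing in for row-aligned embeddings from llama.cpp and the HuggingFace reference):

```python
import numpy as np

def mean_cosine_similarity(a, b):
    """Mean cosine similarity between row-aligned embedding matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

# Identical inputs score 1.0 (up to float rounding)
embs = np.random.default_rng(0).standard_normal((4, 2048))
print(mean_cosine_similarity(embs, embs))
```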

Usage

```bash
# Generate embeddings with llama-embedding
./llama.cpp/build/bin/llama-embedding \
    -m voyage-4-nano-q8_0.gguf \
    --pooling mean \
    --attention non-causal \
    --embd-normalize 2 \
    -p "Your text here"
```

Important flags:

  • --attention non-causal - Required for bidirectional models
  • --pooling mean - Use mean pooling
  • --embd-normalize 2 - L2 normalization

Linear Projection

The GGUF model outputs 1024-dim embeddings. To match the original 2048-dim output, apply the linear projection:

```python
import torch
import numpy as np

# Load the projection matrix; shape (2048, 1024), stored as (out_features, in_features)
linear_weight = torch.load("voyage-4-nano-linear.pt", weights_only=True).float().numpy()

# embeddings: (batch, 1024) array of L2-normalized vectors from llama.cpp
# Apply projection: (batch, 1024) @ (2048, 1024).T -> (batch, 2048)
projected = embeddings @ linear_weight.T

# Re-normalize to unit length
projected = projected / np.linalg.norm(projected, axis=1, keepdims=True)
```
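The shapes can be sanity-checked without the real weight file, using random stand-ins of the same dimensions (weight (2048, 1024), embeddings (batch, 1024)):

```python
import numpy as np

rng = np.random.default_rng(0)
linear_weight = rng.standard_normal((2048, 1024)).astype(np.float32)  # stand-in for the .pt weight
embeddings = rng.standard_normal((3, 1024)).astype(np.float32)        # stand-in for llama.cpp output

projected = embeddings @ linear_weight.T
projected = projected / np.linalg.norm(projected, axis=1, keepdims=True)

print(projected.shape)  # (3, 2048)
```

After re-normalization, every row of `projected` has unit L2 norm again.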

Model Details

  • Base model: Qwen3 with bidirectional attention
  • Parameters: 340M
  • Hidden dim: 1024
  • Embedding dim: 2048 (after linear projection)
  • Context length: 32K tokens
  • Pooling: Mean
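Mean pooling, as selected with `--pooling mean` above, averages the per-token hidden states into a single vector. A minimal sketch for a padded batch (the `mean_pool` helper and its mask convention are illustrative, not llama.cpp internals):

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average hidden states over real (unmasked) token positions.
    hidden: (batch, seq, dim); mask: (batch, seq), 1 for real tokens, 0 for padding."""
    m = mask[:, :, None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / m.sum(axis=1)

hidden = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
mask = np.array([[1, 1, 0], [1, 1, 1]])
print(mean_pool(hidden, mask))  # first row averages only the 2 unmasked tokens
```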
