Voyage-4-nano GGUF

GGUF conversions of VoyageAI's voyage-4-nano embedding model for use with llama.cpp.

Files

| File | Size | Description |
| --- | --- | --- |
| voyage-4-nano-f16.gguf | 695 MB | Full precision (FP16) |
| voyage-4-nano-q8_0.gguf | 372 MB | 8-bit quantized |
| voyage-4-nano-linear.pt | 4.2 MB | Linear projection layer (required) |

Quality

Cosine similarity against HuggingFace reference embeddings:

| Format | Mean Similarity | Quality |
| --- | --- | --- |
| GGUF F16 | 1.000000 | Identical |
| GGUF Q8_0 | 0.999903 | Excellent |

The Q8_0 quantized model achieves 99.99% mean cosine similarity to the full-precision original at a 46% reduction in file size.
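The similarity figures above can be reproduced with a check along these lines (a minimal sketch; the function and variable names are illustrative, with the two arguments standing in for row-aligned embeddings from llama.cpp and the HuggingFace reference):

```python
import numpy as np

def mean_cosine_similarity(a, b):
    """Mean cosine similarity between row-aligned embedding matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

# Identical inputs score 1.0 (up to float rounding)
embs = np.random.default_rng(0).standard_normal((4, 2048))
print(mean_cosine_similarity(embs, embs))
```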

Usage

```bash
# Generate embeddings with llama-embedding
./llama.cpp/build/bin/llama-embedding \
    -m voyage-4-nano-q8_0.gguf \
    --pooling mean \
    --attention non-causal \
    --embd-normalize 2 \
    -p "Your text here"
```

Important flags:

  • --attention non-causal - Required for bidirectional models
  • --pooling mean - Use mean pooling
  • --embd-normalize 2 - L2 normalization

Linear Projection

The GGUF model outputs 1024-dim embeddings. To match the original 2048-dim output, apply the linear projection:

```python
import torch
import numpy as np

# Load the projection matrix; shape (2048, 1024), stored as (out_features, in_features)
linear_weight = torch.load("voyage-4-nano-linear.pt", weights_only=True).float().numpy()

# embeddings: (batch, 1024) array of L2-normalized vectors from llama.cpp
# Apply projection: (batch, 1024) @ (2048, 1024).T -> (batch, 2048)
projected = embeddings @ linear_weight.T

# Re-normalize to unit length
projected = projected / np.linalg.norm(projected, axis=1, keepdims=True)
```
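The shapes can be sanity-checked without the real weight file, using random stand-ins of the same dimensions (weight (2048, 1024), embeddings (batch, 1024)):

```python
import numpy as np

rng = np.random.default_rng(0)
linear_weight = rng.standard_normal((2048, 1024)).astype(np.float32)  # stand-in for the .pt weight
embeddings = rng.standard_normal((3, 1024)).astype(np.float32)        # stand-in for llama.cpp output

projected = embeddings @ linear_weight.T
projected = projected / np.linalg.norm(projected, axis=1, keepdims=True)

print(projected.shape)  # (3, 2048)
```

After re-normalization, every row of `projected` has unit L2 norm again.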

Model Details

  • Base model: Qwen3 with bidirectional attention
  • Parameters: 340M
  • Hidden dim: 1024
  • Embedding dim: 2048 (after linear projection)
  • Context length: 32K tokens
  • Pooling: Mean
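Mean pooling, as selected with `--pooling mean` above, averages the per-token hidden states into a single vector. A minimal sketch for a padded batch (the `mean_pool` helper and its mask convention are illustrative, not llama.cpp internals):

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average hidden states over real (unmasked) token positions.
    hidden: (batch, seq, dim); mask: (batch, seq), 1 for real tokens, 0 for padding."""
    m = mask[:, :, None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / m.sum(axis=1)

hidden = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
mask = np.array([[1, 1, 0], [1, 1, 1]])
print(mean_pool(hidden, mask))  # first row averages only the 2 unmasked tokens
```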
