# Voyage-4-nano GGUF
GGUF conversions of VoyageAI's voyage-4-nano embedding model for use with llama.cpp.
## Files

| File | Size | Description |
|---|---|---|
| voyage-4-nano-f16.gguf | 695 MB | Full precision (FP16) |
| voyage-4-nano-q8_0.gguf | 372 MB | 8-bit quantized |
| voyage-4-nano-linear.pt | 4.2 MB | Linear projection layer (required) |
## Quality
Cosine similarity against HuggingFace reference embeddings:
| Format | Mean Similarity | Quality |
|---|---|---|
| GGUF F16 | 1.000000 | Identical |
| GGUF Q8_0 | 0.999903 | Excellent |
The Q8_0 quantized model achieves 99.99% similarity to the original, with 46% size reduction.
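The mean-similarity numbers above can be reproduced with a short check like the following sketch, where `ref` and `gguf` stand for the two `(n, d)` embedding matrices being compared (not provided here):

```python
import numpy as np

def mean_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Mean row-wise cosine similarity between two (n, d) embedding matrices."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a_n * b_n, axis=1)))

# e.g. mean_cosine_similarity(ref, gguf) over a shared evaluation corpus
```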
## Usage

```shell
# Generate embeddings with llama-embedding
./llama.cpp/build/bin/llama-embedding \
  -m voyage-4-nano-q8_0.gguf \
  --pooling mean \
  --attention non-causal \
  --embd-normalize 2 \
  -p "Your text here"
```
Important flags:

- `--attention non-causal` - required for bidirectional models
- `--pooling mean` - use mean pooling
- `--embd-normalize 2` - L2 normalization
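The command above can be wrapped in Python for programmatic use. This is a sketch: the binary path, model filename, and the assumption that `llama-embedding` prints the embedding as whitespace-separated floats are all placeholders to adapt to your build and llama.cpp version.

```python
import subprocess
import numpy as np

# Assumed path; adjust to your llama.cpp build.
LLAMA_EMBEDDING = "./llama.cpp/build/bin/llama-embedding"

def parse_embedding(stdout: str) -> np.ndarray:
    """Parse whitespace-separated floats from the tool's output (assumed format)."""
    return np.array([float(tok) for tok in stdout.split()], dtype=np.float32)

def embed(text: str, model: str = "voyage-4-nano-q8_0.gguf") -> np.ndarray:
    out = subprocess.run(
        [LLAMA_EMBEDDING, "-m", model,
         "--pooling", "mean", "--attention", "non-causal",
         "--embd-normalize", "2", "-p", text],
        capture_output=True, text=True, check=True,
    )
    return parse_embedding(out.stdout)
```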
## Linear Projection
The GGUF model outputs 1024-dim embeddings. To match the original 2048-dim output, apply the linear projection:
```python
import torch
import numpy as np

# Load the projection matrix (shape: 2048 x 1024)
linear_weight = torch.load("voyage-4-nano-linear.pt", weights_only=True).float().numpy()

# `embeddings` is an (n, 1024) array produced by llama-embedding
# Apply projection: (batch, 1024) @ (1024, 2048) -> (batch, 2048)
projected = embeddings @ linear_weight.T

# Re-normalize to unit length
projected = projected / np.linalg.norm(projected, axis=1, keepdims=True)
```
## Model Details
- Base model: Qwen3 with bidirectional attention
- Parameters: 340M
- Hidden dim: 1024
- Embedding dim: 2048 (after linear projection)
- Context length: 32K tokens
- Pooling: Mean
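As a rough sanity check, the file sizes in the table follow from the parameter count. This back-of-envelope sketch assumes FP16 stores 2 bytes per weight and that llama.cpp's Q8_0 packs each 32-weight block as 32 int8 values plus one FP16 scale (34 bytes per block, about 8.5 bits per weight):

```python
params = 340e6  # 340M parameters

fp16_mb = params * 2 / 1e6          # 2 bytes per weight
q8_0_mb = params * (34 / 32) / 1e6  # 34 bytes per 32-weight block

print(f"FP16 ~{fp16_mb:.0f} MB, Q8_0 ~{q8_0_mb:.0f} MB")
# close to the 695 MB and 372 MB files; the remainder is
# GGUF metadata and tensors kept at higher precision
```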
## Links

- Base model: voyageai/voyage-4-nano