pplx-embed-v1-0.6b GGUF (F16)

GGUF conversion of perplexity-ai/pplx-embed-v1-0.6b.

File Quant Size
pplx-embed-v1-0.6b-f16.gguf F16 1.2 GB

Usage

This model requires non-causal (bidirectional) attention. Without it, outputs will be incorrect.

from llama_cpp import Llama, llama_cpp

# pooling_type=1 selects mean pooling (LLAMA_POOLING_TYPE_MEAN)
llm = Llama(model_path="pplx-embed-v1-0.6b-f16.gguf", embedding=True, pooling_type=1)
# disable causal masking so attention is bidirectional
llama_cpp.llama_set_causal_attn(llm._ctx.ctx, False)

raw = llm.embed("your text here")
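`embed()` returns a plain list of floats. For retrieval, float embeddings are typically compared by cosine similarity; a minimal helper (not part of llama-cpp-python, shown here as an illustrative sketch):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```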

Note: The GGUF outputs raw float embeddings only. The original model natively produces int8/binary quantized embeddings via a post-processing step (st_quantize.FlexibleQuantizer). To match that behavior, apply the quantization manually:

import numpy as np

# Int8: tanh → scale → round → clamp (matches Int8TanhQuantizer)
int8_emb = np.clip(np.round(np.tanh(raw) * 127), -128, 127).astype(np.int8)

# Binary: sign (matches BinaryTanhQuantizer)
binary_emb = np.where(np.array(raw) >= 0, 1, -1).astype(np.int8)
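Binary embeddings are usually compared by Hamming distance rather than cosine similarity. A minimal sketch, assuming the {-1, +1} `binary_emb` vectors produced above are mapped to {0, 1} and bit-packed with NumPy (these helper names are illustrative, not part of the original model's tooling):

```python
import numpy as np

def pack_binary(emb):
    """Map a {-1, +1} int8 embedding to {0, 1} bits and pack into uint8 bytes."""
    bits = (np.asarray(emb) > 0).astype(np.uint8)
    return np.packbits(bits)

def hamming(a_packed, b_packed):
    """Count differing bits between two packed binary embeddings."""
    return int(np.unpackbits(a_packed ^ b_packed).sum())
```

Packing cuts storage by 8x versus int8 and lets the XOR-plus-popcount comparison run over whole bytes at a time.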

CLI:

llama-embedding -m pplx-embed-v1-0.6b-f16.gguf --attention non-causal --pooling mean -p "your text here"

Verification

Cosine similarity vs. the original model: > 0.99999 across all test inputs. Residual differences come from F32 → F16 precision loss.
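The scale of that precision loss can be illustrated in isolation: round-tripping a float32 vector through float16 barely moves its cosine similarity. A toy sketch, with a random vector standing in for a real embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.standard_normal(1024).astype(np.float32)   # stand-in for an F32 embedding
emb_f16 = emb.astype(np.float16).astype(np.float32)  # F32 -> F16 round trip

# cosine similarity between the original and the F16 round-tripped vector
cos = float(emb @ emb_f16 / (np.linalg.norm(emb) * np.linalg.norm(emb_f16)))
```

Since float16 carries roughly 11 bits of mantissa, the per-element relative error is on the order of 1e-4, which leaves the cosine similarity within ~1e-7 of 1.0.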
