Meta-Llama-3.1-70B-Instruct — TurboQuant+ Premium GGUF

Fits in 64 GB of memory where Q8_0 does not. Delivers lower perplexity than Q4_K_M (3.18 vs 3.24), with decode about 16% slower than Q8_0 on Metal.

Best option when you want near-Q8_0 quality but cannot fit Q8_0 in memory.

This is a TurboQuant+ Premium quantization of meta-llama/Llama-3.1-70B-Instruct. Premium applies WHT-domain compression (TQ4_1S) to attention tensors only. Llama models are highly sensitive to WHT compression in FFN layers, so Premium keeps the FFN layers at Q5_K/Q6_K precision to preserve quality. See the getting started guide for details.

Requires the TurboQuant+ llama.cpp fork at tag tqp-v0.1.0. This model will NOT work with stock llama.cpp. TurboQuant+ is an independent research project, and its quantization types have not been merged into upstream ggml-org/llama.cpp. Do not file issues there for TQ models.

Files

File	Quant	Size	Size vs Q8_0	PPL	PPL vs Q8_0	PPL vs Q4_K_M
Meta-Llama-3.1-70B-Instruct-Premium.gguf	Premium	49.8 GB	71% of Q8_0 (69.8 GB)	3.18	+5.0%	1.9% better

Why Premium?

Q8_0 at 70 GB doesn't fit on 64 GB hardware. Q4_K_M fits at 40 GB but gives up quality (+6.9% PPL). Premium fits at 50 GB and recovers part of that loss (+5.0% PPL), which makes it a viable middle option on 64 GB systems.

Quant	Size	PPL	PPL vs Q8_0
Q8_0	69.8 GB	3.03	baseline
Q4_K_M	40 GB	3.24	+6.9%
Premium	49.8 GB	3.18	+5.0%

Premium is about 25% larger than Q4_K_M. You pay roughly 10 GB of memory to recover a little over a quarter (about 28%) of the perplexity gap between Q4_K_M and Q8_0: 0.06 of the 0.21 PPL difference.
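
The trade-off can be recomputed from the reported numbers. This is an illustrative sketch only: the helper names `gap_recovered` and `size_overhead` are made up for this example, and the inputs are the sizes above plus the four-decimal perplexities from the Perplexity section.

```shell
# Percent of the Q4_K_M -> Q8_0 perplexity gap that Premium closes.
# Arguments: PPL of Q8_0, PPL of Q4_K_M, PPL of Premium.
gap_recovered() {
  LC_ALL=C awk -v q8="$1" -v q4="$2" -v p="$3" \
    'BEGIN { printf "%.1f\n", 100 * (q4 - p) / (q4 - q8) }'
}

# Percent size increase going from the smaller to the larger file.
# Arguments: smaller size, larger size (same unit, here GB).
size_overhead() {
  LC_ALL=C awk -v small="$1" -v big="$2" \
    'BEGIN { printf "%.1f\n", 100 * (big - small) / small }'
}

echo "gap recovered: $(gap_recovered 3.0337 3.2432 3.1839)%"
echo "size overhead: $(size_overhead 40 49.8)%"
```

With the figures in this card, the script reports 28.3% of the gap recovered for a 24.5% larger file.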

Compatibility

Field	Value
Fork	TheTom/llama-cpp-turboquant
Tag	`tqp-v0.1.0`
Backends	Metal, CUDA, ROCm/HIP, Vulkan
Quantized on	2026-04-08

No forward-compatibility guarantee. This model was built and validated against the tag above. Future fork updates may change the format. If decode produces garbage, rebuild from this tag.

Benchmarks (Apple M5 Max 128GB, Metal)

Speed

Decode is only 16% slower than Q8_0 on Metal (6.2 vs 7.4 t/s). This is the smallest speed gap of any TurboQuant+ model tested so far.

Metal (Apple M5 Max 128GB, recommended -ctk q8_0 -ctv turbo4)

Config	pp512	pp2048	tg128	Size
Q8_0 baseline (f16 KV)	185 t/s	164 t/s	7.4 t/s	69.8 GB
Premium + turbo4 KV	163 t/s	144 t/s	6.2 t/s	49.8 GB (KV cache 60% smaller)
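
To get comparable numbers on your own hardware, something like the following should work. It is a sketch that assumes the fork ships the usual `llama-bench` tool with its upstream flags (`-p`, `-n`, `-ctk`, `-ctv`, `-ngl`); the `MODEL` and `BENCH` paths and the `bench_args` helper are placeholders for this example. If the binary is missing, the script only prints the command it would run.

```shell
# Assumed paths; adjust to your checkout and download location.
MODEL=${MODEL:-Meta-Llama-3.1-70B-Instruct-Premium.gguf}
BENCH=${BENCH:-./build/bin/llama-bench}

# Build the llama-bench argument string for one K/V cache type pair.
# -p 512,2048 and -n 128 correspond to the pp512, pp2048 and tg128 columns.
bench_args() {
  echo "-m $MODEL -ngl 99 -p 512,2048 -n 128 -ctk $1 -ctv $2"
}

for kv in "f16 f16" "q8_0 turbo4"; do
  set -- $kv   # split "K V" into $1 and $2
  if [ -x "$BENCH" ]; then
    # Left unquoted on purpose so the string splits into arguments
    # (assumes the model path contains no spaces).
    "$BENCH" $(bench_args "$1" "$2")
  else
    echo "would run: $BENCH $(bench_args "$1" "$2")"
  fi
done
```

This compares the Premium file with default and recommended KV settings; the Q8_0 baseline row needs a separate Q8_0 file. Upstream llama.cpp requires flash attention for a quantized V cache, so if the fork behaves the same way you may need to add `-fa 1`.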

CUDA performance has not been validated on this model yet. Based on other TQ4_1S results, performance is expected to be higher than Metal.

Context Scaling (tg128 decode at varying context length, Metal)

Context	1K	4K	8K	16K	32K
t/s	6.2	6.2	6.2	6.1	6.2

Decode speed is effectively flat from 1K to 32K context, with no degradation cliff in the tested range.

Perplexity (wikitext-2-raw, 512 context, 20 chunks)

Config	PPL	vs Q8_0
Q8_0	3.0337	baseline
Q4_K_M	3.2432	+6.9%
Premium	3.1839	+5.0%
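
A run along these lines should reproduce the measurement setup. It is a sketch under assumptions: that the fork builds the standard `llama-perplexity` tool with its upstream flags, and that the WikiText-2 raw test split is stored at the hypothetical path in `WIKI`. The `ppl_args` helper exists only for this example. Without the binary or the dataset, the script prints the command instead of running it.

```shell
# Assumed paths; adjust to your setup.
MODEL=${MODEL:-Meta-Llama-3.1-70B-Instruct-Premium.gguf}
PPL_BIN=${PPL_BIN:-./build/bin/llama-perplexity}
WIKI=${WIKI:-wikitext-2-raw/wiki.test.raw}   # hypothetical dataset location

# -c 512 and --chunks 20 mirror the settings named in the section title.
ppl_args() {
  echo "-m $MODEL -f $WIKI -c 512 --chunks 20 -ngl 99"
}

if [ -x "$PPL_BIN" ] && [ -f "$WIKI" ]; then
  # Unquoted so the string splits into arguments (paths must not contain spaces).
  "$PPL_BIN" $(ppl_args)
else
  echo "would run: $PPL_BIN $(ppl_args)"
fi
```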

Recommended KV Cache Settings

TurboQuant+ can also compress the KV cache at runtime via the -ctk and -ctv flags. This does not change the model file, only how much memory the context window uses, so larger contexts fit within the same hardware limits.

K cache (`-ctk`)	V cache (`-ctv`)	PPL	vs default	KV buffer (16K ctx)	Recommendation
f16	f16	3.18	baseline	5,120 MiB	default, short context
q8_0	turbo4	3.23	+1.3%	2,040 MiB (60% smaller)	recommended for long context
q8_0	turbo3	3.26	+2.3%	~1,900 MiB (63% smaller)	aggressive, still good quality

The recommended config (-ctk q8_0 -ctv turbo4) reduces KV cache memory by 60% at a +1.3% PPL cost. At 16K context, that saves about 3 GB (5,120 MiB down to 2,040 MiB).
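
The f16 baseline in the table can be derived from the model's attention geometry. The sketch below assumes the published Llama 3.1 70B configuration (80 layers, 8 KV heads, head dimension 128), which this card does not state; the function name `kv_f16_mib` is made up for the example.

```shell
# Assumed Llama 3.1 70B attention geometry (from the published model config,
# not from this card): 80 layers, 8 KV heads (GQA), head dimension 128.
N_LAYERS=80
N_KV_HEADS=8
HEAD_DIM=128

# f16 KV cache size in MiB for a given context length.
# Per token: K and V (factor 2), per layer, per KV head, per dimension,
# 2 bytes per f16 value.
kv_f16_mib() {
  ctx=$1
  echo $(( 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * 2 * ctx / 1048576 ))
}

for ctx in 8192 16384 32768; do
  echo "ctx=$ctx  f16 KV = $(kv_f16_mib "$ctx") MiB"
done

# Savings at 16K using the 2,040 MiB figure reported in the table above.
echo "saved at 16K: $(( $(kv_f16_mib 16384) - 2040 )) MiB"
```

At 16K context this gives 5120 MiB, matching the table, and the f16 cache grows linearly with context length (10240 MiB at 32K).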

Other combinations (q8_0/q8_0, f16/turbo4, turbo4/f16) either crash or have severe prefill regressions. Stick with the configs above.

Download

# Install the Hugging Face CLI (provides the hf command)
brew install huggingface-cli
# or: pip install -U huggingface_hub

# Download
hf download pidtom/Meta-Llama-3.1-70B-Instruct-TQPlus Meta-Llama-3.1-70B-Instruct-Premium.gguf --local-dir .

How to Run

# Clone and build the TurboQuant+ fork
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout tqp-v0.1.0

# Pick ONE backend below; each variant reconfigures the same build/ directory.
# Build for Metal (macOS)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Build for CUDA (NVIDIA)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Build for Vulkan
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Run (default KV cache)
./build/bin/llama-cli -m Meta-Llama-3.1-70B-Instruct-Premium.gguf -ngl 99 -c 8192

# Run with recommended KV compression (for long context)
./build/bin/llama-cli -m Meta-Llama-3.1-70B-Instruct-Premium.gguf -ngl 99 -c 32768 -ctk q8_0 -ctv turbo4

What is TurboQuant+?

TurboQuant+ applies Walsh-Hadamard Transform (WHT) domain quantization to compress model weights beyond standard GGUF quant types. This achieves lower bits-per-weight at equivalent or better quality by exploiting structured redundancy in the weight matrices.
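
For intuition about what "WHT domain" means, the sketch below implements the textbook fast Walsh-Hadamard transform (unnormalized) on a short vector. It is general background only, not the TQ4_1S kernel or the fork's code, and the function name `wht` is made up for this example. The input length must be a power of two.

```shell
# Unnormalized fast Walsh-Hadamard transform of the arguments.
# Uses the standard in-place butterfly: at each stage, pairs h apart
# are replaced by their sum and difference.
wht() {
  printf '%s\n' "$*" | LC_ALL=C awk '{
    n = NF
    for (i = 1; i <= n; i++) { x[i] = $i }
    for (h = 1; h < n; h *= 2) {
      for (i = 1; i <= n; i += 2 * h) {
        for (j = i; j < i + h; j++) {
          a = x[j]; b = x[j + h]
          x[j] = a + b
          x[j + h] = a - b
        }
      }
    }
    out = x[1]
    for (i = 2; i <= n; i++) { out = out " " x[i] }
    print out
  }'
}

wht 1 0 1 0 0 1 1 0   # prints: 4 2 0 -2 0 2 0 2
```

Each output mixes every input with a +1 or -1 sign, so applying the transform twice returns the original vector scaled by its length. A quantizer working in this domain stores the transformed values rather than the raw weights.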

Quantized by @pidtom | GitHub | X | Sponsor
