Qwen2.5-32B-Instruct · GGUF F16
Converted by PBH Applied Systems, LLC – Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure
Provenance repository – no behavioral evaluation performed. This repository contains the full-precision F16 GGUF of Qwen2.5-32B-Instruct. At 65.5 GB, the F16 artifact exceeds the VRAM capacity of the evaluation hardware (NVIDIA RTX 4090, 24 GB). All behavioral evaluation data for this model is in the Q4_K_M companion repository:
pbhappliedsystems/qwen-2.5-32B-instruct-gguf-Q4-K-M.
About the evaluation series. Every other model published under
pbhappliedsystems has been independently evaluated using quant_eval v7.21 – a proprietary behavioral evaluation harness developed by PBH Applied Systems – at both F16 and Q4_K_M precision. For Qwen2.5-32B, the F16 GGUF was produced and its artifact provenance is recorded here, but the evaluation constraint is documented honestly rather than omitted.
Why No Evaluation
In the PBH Applied Systems evaluation pipeline, F16 GGUFs serve as cache-generation baselines for Q4_K_M comparison runs. For every other model in the series, the F16 run produces a full_weight_cache.json that the Q4_K_M run reuses, enabling a direct two-run comparison against identical fixtures.
For Qwen2.5-32B, the F16 GGUF is 65.5 GB. Loading this into the evaluation hardware (NVIDIA RTX 4090, 24 GB VRAM) is not possible – not even with partial CPU offload at the precision required for a valid cache-generation baseline. The Q4_K_M comparison run (20260221_144732) was therefore run as a standalone evaluation against freshly generated responses rather than against a cached F16 baseline.
The consequence for this repository: There is no full_weight_cache.json, no F16 evaluation CSV, and no cross-precision comparison data. This card exists to document the F16 artifact for provenance and to make the 65.5 GB GGUF accessible to users with appropriate hardware.
For full behavioral analysis, cross-series comparisons, and deployment recommendations, see the Q4_K_M card.
Model Description
This repository contains the full-precision F16 GGUF of Qwen/Qwen2.5-32B-Instruct, a 32.5-billion parameter instruction-tuned model from Alibaba Cloud (September 2024 release).
Key Characteristics
- Parameters: 32.5B (31.0B non-embedding)
- Architecture: Causal LM · RoPE · SwiGLU · RMSNorm · Attention QKV bias · GQA (40 Q heads / 8 KV heads)
- Layers: 64
- Format: GGUF F16 (full precision)
- File size: 65.5 GB
- SHA256: 02e264f0273624b39b0650f8c0583c6d04c320c777780ca5be839999912adf3c
- Context window: 131,072 tokens (full); 8,192 tokens (generation)
- Default config targets 32,768 tokens; YaRN scaling required beyond that – see the Long-Context Deployment section
- Minimum VRAM (full GPU offload): ~80 GB
- Recommended hardware: 2× A100 80 GB · 4× A100 40 GB · 3× A10G 24 GB
- License: Apache 2.0
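The headline numbers above can be sanity-checked with simple arithmetic. The sketch below uses only figures from this card, plus one assumption not stated here: a head dimension of 128 per attention head.

```python
# Back-of-envelope F16 memory estimate for Qwen2.5-32B-Instruct.
# Figures from this card: 32.5B params, 64 layers, 8 KV heads (GQA).
# Assumption (not stated on this card): head_dim = 128.

params = 32.5e9
weight_bytes = params * 2  # F16 stores 2 bytes per parameter
print(f"weights: {weight_bytes / 1e9:.1f} GB")  # ~65 GB, matching the 65.5 GB file

# KV cache per token: K and V tensors, per layer, per KV head, in F16.
layers, kv_heads, head_dim = 64, 8, 128
kv_per_token = 2 * layers * kv_heads * head_dim * 2  # bytes
print(f"KV cache @ 8192 tokens: {kv_per_token * 8192 / 1e9:.1f} GB")
```

Weights plus KV cache plus compute buffers is why the practical full-offload requirement lands near 80 GB rather than at the 65.5 GB file size.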
Artifact Provenance
| Artifact | Format | Size | SHA256 | Evaluated |
|---|---|---|---|---|
| qwen-2.5-32B-instruct-gguf-F16.gguf | GGUF F16 | 65.5 GB | 02e264f0273624b39b0650f8c0583c6d04c320c777780ca5be839999912adf3c | ❌ VRAM constraint |
| Q4_K_M (companion repo) | GGUF Q4_K_M | 19.9 GB | 6f810a332a884410aa65cc1b5a128a8603f083b36465acfbbf67a08f50a4d3e3 | ✅ Run 20260221_144732 |
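After downloading, the checksums in the provenance table can be verified locally. A minimal sketch using only the Python standard library (the local filename is a placeholder for wherever the artifact was saved):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a large file through SHA-256 without loading it into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so a 65.5 GB file never occupies memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# expected = "02e264f0273624b39b0650f8c0583c6d04c320c777780ca5be839999912adf3c"
# assert sha256_of("qwen-2.5-32B-instruct-gguf-F16.gguf") == expected
```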
The F16 GGUF was converted from Qwen/Qwen2.5-32B-Instruct using a custom-built llama.cpp conversion pipeline developed by PBH Applied Systems, without modification to model weights.
quant_eval context: F16 run 20260221_144732 was requested with quant_types: ['Q4_K_M']. Due to VRAM constraints, no F16 full_weight_cache.json was written. The Q4_K_M evaluation ran as a standalone quantized_llama_cpp runner evaluation.
Hardware Requirements
| Configuration | VRAM Required | Notes |
|---|---|---|
| F16 (this repo) · full GPU | ~80 GB | 2× A100 80 GB minimum |
| F16 · 4-GPU split | ~20 GB per GPU | 4× A100 40 GB or 4× A10G 24 GB |
| F16 · partial CPU offload | ~40–50 GB VRAM + 64 GB RAM | Reduced context; slower inference |
| Q4_K_M (companion repo) | ~24 GB | Single A10G or RTX 4090 |
For most production use cases, the Q4_K_M variant is the correct choice: it runs on single-GPU hardware and is the fully evaluated artifact in this series. F16 is appropriate for compliance environments requiring full-weight artifacts, research settings requiring exact weight fidelity, or future evaluation runs on multi-GPU infrastructure.
Long-Context Deployment
The default config.json targets 32,768 tokens. To enable the full 131,072-token context window, apply YaRN scaling by adding the following to config.json before conversion, or configure it at the llama.cpp level:
```json
{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```
Note that YaRN in llama.cpp uses static scaling – the factor is constant regardless of actual input length. Apply only when processing long contexts is required; shorter-context inference may be marginally affected.
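The scaling factor in the config above is simply the ratio of the target context length to the pre-trained context window:

```python
# YaRN scaling factor = target context / original context.
original_ctx = 32768   # default config.json target
target_ctx = 131072    # full context window
factor = target_ctx / original_ctx
print(factor)  # 4.0, matching "factor": 4.0 in the rope_scaling block
```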
Usage
Installation
```shell
pip install llama-cpp-python huggingface_hub
```
For multi-GPU CUDA deployment:
```shell
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```
Python – llama-cpp-python (multi-GPU)

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Note: 65.5 GB download – requires ~80 GB total VRAM for full GPU offload
model_path = hf_hub_download(
    repo_id="pbhappliedsystems/qwen-2.5-32B-instruct-gguf-F16",
    filename="qwen-2.5-32B-instruct-gguf-F16.gguf"
)

# Multi-GPU: tensor_split distributes model layers across available GPUs
# Example: 2× A100 80 GB
llm = Llama(
    model_path=model_path,
    n_ctx=8192,
    n_gpu_layers=-1,      # Offload all layers
    tensor_split=[1, 1],  # Equal split across 2 GPUs
    verbose=True,         # Monitor GPU memory allocation
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a precise assistant. Follow instructions exactly."
        },
        {
            "role": "user",
            "content": "Analyze the following and return a JSON object with keys: summary, risk_level, action_items."
        }
    ],
    temperature=0.7,
    max_tokens=1024,
)
print(response["choices"][0]["message"]["content"])
```
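Because the prompt requests a JSON object with fixed keys, downstream code typically validates the reply before using it. A minimal sketch – the sample reply below is illustrative only, not a recorded model output:

```python
import json

REQUIRED_KEYS = {"summary", "risk_level", "action_items"}

def parse_structured_reply(text: str) -> dict:
    """Parse the model reply and verify the requested keys are present."""
    data = json.loads(text)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

# Illustrative reply shape only:
sample = '{"summary": "ok", "risk_level": "low", "action_items": []}'
print(parse_structured_reply(sample)["risk_level"])  # low
```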
For partial CPU offload when full VRAM is unavailable (slower, but functional):
```python
# Example: 2× A10G 24 GB (48 GB total VRAM) + 64 GB system RAM
llm = Llama(
    model_path=model_path,
    n_ctx=4096,
    n_gpu_layers=48,      # Offload first N layers; tune based on available VRAM
    tensor_split=[1, 1],
    verbose=True,
)
```
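A starting value for n_gpu_layers can be derived from the per-layer weight size. The sketch below uses figures from this card; the even layer sizing and the 1 GB reserve for cache and buffers are assumptions, so treat the result as a starting point to tune, not a guarantee:

```python
# Estimate how many of the 64 layers fit in GPU memory for partial offload.
# Figures from this card: 65.5 GB file, 32.5B params (31.0B non-embedding), 64 layers.
# Assumptions: layers are evenly sized; 1 GB reserved for KV cache and buffers.
file_gb = 65.5
layer_weight_gb = file_gb * (31.0 / 32.5)  # exclude embedding/output weights
per_layer_gb = layer_weight_gb / 64        # ~0.98 GB per layer

total_vram_gb = 2 * 24                     # 2x A10G 24 GB
reserve_gb = 1
n_gpu_layers = int((total_vram_gb - reserve_gb) / per_layer_gb)
print(n_gpu_layers)  # 48
```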
CLI – llama-cli (multi-GPU)

```shell
llama-cli \
  --model qwen-2.5-32B-instruct-gguf-F16.gguf \
  --chat-template qwen2 \
  --system-prompt "You are a precise assistant. Follow instructions exactly." \
  --prompt "Return a JSON object with keys: summary, risk_level, action_items." \
  --n-predict 1024 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --tensor-split 1,1 \
  --temp 0.7
```
For server deployment:
```shell
llama-server \
  --model qwen-2.5-32B-instruct-gguf-F16.gguf \
  --chat-template qwen2 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --tensor-split 1,1 \
  --port 8080 \
  --host 0.0.0.0
```
About PBH Applied Systems
PBH Applied Systems, LLC is an Oklahoma City–based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development.
Patrick Hill, M.S. – Founder · Data Scientist · AI/ML Engineer · Author of Applied Machine Learning: Concepts, Tools, and Case Studies (required reading, UAT CSC 373)
Core Service Areas: LLM Optimization & Deployment · AI Evaluation Frameworks · Agentic AI Infrastructure · Scalable AI Application Development · ML Pipeline Design & Analytics · Model & Agent Cataloging
Work With PBH Applied Systems
Book a Scoping Call · Request an Evaluation Report – from $2,500
Connect
| 🌐 | pbhappliedsystems.com |
| 📧 | patrick@pbhappliedsystems.com |
| ▶️ | YouTube |
License
This GGUF repository inherits the license of the base model:
Apache 2.0 – Qwen/Qwen2.5-32B-Instruct
The quant_eval evaluation methodology and fixture set are proprietary to PBH Applied Systems, LLC and are not included in this repository.
GGUF conversion performed by PBH Applied Systems, LLC · No behavioral evaluation – see the companion Q4_K_M repository for all evaluation data