Qwen2.5-32B-Instruct · GGUF F16

Converted by PBH Applied Systems, LLC — Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure

📌 Provenance repository — no behavioral evaluation performed. This repository contains the full-precision F16 GGUF of Qwen2.5-32B-Instruct. At 65.5 GB, the F16 artifact exceeds the VRAM capacity of the evaluation hardware (NVIDIA RTX 4090, 24 GB). All behavioral evaluation data for this model is in the Q4_K_M companion repository: pbhappliedsystems/qwen-2.5-32B-instruct-gguf-Q4-K-M.

🔬 About the evaluation series. Every other model published under pbhappliedsystems has been independently evaluated using quant_eval v7.21 — a proprietary behavioral evaluation harness developed by PBH Applied Systems — at both F16 and Q4_K_M precision. For Qwen2.5-32B, the F16 GGUF was produced and its artifact provenance is recorded here, but the evaluation constraint is documented honestly rather than omitted.


Why No Evaluation

In the PBH Applied Systems evaluation pipeline, F16 GGUFs serve as cache-generation baselines for Q4_K_M comparison runs. For every other model in the series, the F16 run produces a full_weight_cache.json that the Q4_K_M run reuses, enabling a direct two-run comparison against identical fixtures.

For Qwen2.5-32B, the F16 GGUF is 65.5 GB. Loading it on the evaluation hardware (NVIDIA RTX 4090, 24 GB VRAM) is not feasible, even with partial CPU offload, at the precision required for a valid cache-generation baseline. The Q4_K_M comparison run (20260221_144732) was therefore executed as a standalone evaluation against freshly generated responses rather than against a cached F16 baseline.

The consequence for this repository: There is no full_weight_cache.json, no F16 evaluation CSV, and no cross-precision comparison data. This card exists to document the F16 artifact for provenance and to make the 65.5 GB GGUF accessible to users with appropriate hardware.

For full behavioral analysis, cross-series comparisons, and deployment recommendations, see the Q4_K_M card.


Model Description

This repository contains the full-precision F16 GGUF of Qwen/Qwen2.5-32B-Instruct, a 32.5-billion parameter instruction-tuned model from Alibaba Cloud (September 2024 release).

Key Characteristics

  • Parameters: 32.5B (31.0B non-embedding)
  • Architecture: Causal LM · RoPE · SwiGLU · RMSNorm · Attention QKV bias · GQA (40 Q heads / 8 KV heads)
  • Layers: 64
  • Format: GGUF F16 (full precision)
  • File size: 65.5 GB
  • SHA256: 02e264f0273624b39b0650f8c0583c6d04c320c777780ca5be839999912adf3c
  • Context window: 131,072 tokens (full); 8,192 tokens (generation)
    • Default config targets 32,768 tokens; YaRN scaling required beyond that β€” see Long Context section
  • Minimum VRAM (full GPU offload): ~80 GB
  • Recommended hardware: 2× A100 80 GB · 4× A100 40 GB · 3× A10G 24 GB
  • License: Apache 2.0

Artifact Provenance

| Artifact | Format | Size | SHA256 | Evaluated |
|---|---|---|---|---|
| qwen-2.5-32B-instruct-gguf-F16.gguf | GGUF F16 | 65.5 GB | 02e264f0273624b39b0650f8c0583c6d04c320c777780ca5be839999912adf3c | ❌ VRAM constraint |
| Q4_K_M (companion repo) | GGUF Q4_K_M | 19.9 GB | 6f810a332a884410aa65cc1b5a128a8603f083b36465acfbbf67a08f50a4d3e3 | ✅ Run 20260221_144732 |

The F16 GGUF was converted from Qwen/Qwen2.5-32B-Instruct using a custom llama.cpp conversion pipeline developed by PBH Applied Systems, with no modification to the model weights.

quant_eval context: F16 run 20260221_144732 was requested with quant_types: ['Q4_K_M']. Due to VRAM constraints, no F16 full_weight_cache.json was written. The Q4_K_M evaluation ran as a standalone quantized_llama_cpp runner evaluation.
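Given the 65.5 GB download, verifying the artifact against the SHA256 listed above is worth the extra pass. A minimal streaming-hash sketch (the function name is illustrative; the expected digest and filename are taken from this card):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a large file (the F16 GGUF is 65.5 GB) and return its SHA256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Expected digest for qwen-2.5-32B-instruct-gguf-F16.gguf (from the table above)
EXPECTED_F16 = "02e264f0273624b39b0650f8c0583c6d04c320c777780ca5be839999912adf3c"

# After downloading:
# assert sha256_of("qwen-2.5-32B-instruct-gguf-F16.gguf") == EXPECTED_F16
```

Streaming in 1 MB chunks keeps memory flat regardless of file size.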


Hardware Requirements

| Configuration | VRAM Required | Notes |
|---|---|---|
| F16 (this repo) · full GPU | ~80 GB | 2× A100 80 GB minimum |
| F16 · 4-GPU split | ~20 GB per GPU | 4× A100 40 GB or 4× A10G 24 GB |
| F16 · partial CPU offload | ~40–50 GB VRAM + 64 GB RAM | Reduced context; slower inference |
| Q4_K_M (companion repo) | ~24 GB | Single A10G or RTX 4090 |

For most production use cases, the Q4_K_M variant is the correct choice. It runs on single-GPU hardware, is fully evaluated, and, based on results across the rest of the evaluation series, can be expected to closely match F16 behavior on this class of structured tasks. F16 is appropriate for compliance environments requiring full-weight artifacts, research settings requiring exact weight fidelity, or future evaluation runs on multi-GPU infrastructure.
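The table above can be turned into a quick back-of-envelope fit check. The 65.5 GB weight size is from this card; the KV-cache and per-GPU overhead figures below are illustrative assumptions, not measured values:

```python
MODEL_GB = 65.5  # F16 GGUF size from this repo

def fits(vram_gb_per_gpu, n_gpus, kv_cache_gb=4.0, overhead_gb=2.0):
    """Rough check: do the F16 weights plus an assumed KV-cache and
    per-GPU runtime overhead budget fit in total VRAM?"""
    total = MODEL_GB + kv_cache_gb + overhead_gb * n_gpus
    return total <= vram_gb_per_gpu * n_gpus

print(fits(24, 1))  # single RTX 4090: False, the documented evaluation constraint
print(fits(80, 2))  # 2x A100 80 GB: True
```

This is a sizing heuristic only; actual KV-cache usage scales with context length and batch size.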


Long-Context Deployment

The default config.json targets 32,768 tokens. To enable the full 131,072-token context window, apply YaRN scaling by adding the following to config.json before conversion, or configure it at the llama.cpp level:

{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}

Note that YaRN in llama.cpp uses static scaling — the factor is constant regardless of actual input length. Apply it only when long-context processing is required; shorter-context inference may be marginally affected.
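The "before conversion" path can be scripted. A minimal sketch that patches a local config.json with the rope_scaling block shown above (the config path is an example placeholder):

```python
import json

# The rope_scaling block from this card, applied to a local config.json
# before GGUF conversion.
ROPE_SCALING = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

def enable_yarn(config_path="Qwen2.5-32B-Instruct/config.json"):
    """Add the YaRN rope_scaling block to config.json in place."""
    with open(config_path) as f:
        config = json.load(f)
    config["rope_scaling"] = dict(ROPE_SCALING)
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return config
```

Run this on the downloaded Hugging Face checkpoint directory before invoking the GGUF conversion step.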


Usage

Installation

pip install llama-cpp-python huggingface_hub

For multi-GPU CUDA deployment:

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Python β€” llama-cpp-python (multi-GPU)

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Note: 65.5 GB download — requires ~80 GB total VRAM for full GPU offload
model_path = hf_hub_download(
    repo_id="pbhappliedsystems/qwen-2.5-32B-instruct-gguf-F16",
    filename="qwen-2.5-32B-instruct-gguf-F16.gguf"
)

# Multi-GPU: tensor_split distributes model layers across available GPUs
# Example: 2× A100 80 GB
llm = Llama(
    model_path=model_path,
    n_ctx=8192,
    n_gpu_layers=-1,       # Offload all layers
    tensor_split=[1, 1],   # Equal split across 2 GPUs
    verbose=True,          # Monitor GPU memory allocation
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a precise assistant. Follow instructions exactly."
        },
        {
            "role": "user",
            "content": "Analyze the following and return a JSON object with keys: summary, risk_level, action_items."
        }
    ],
    temperature=0.7,
    max_tokens=1024,
)
print(response["choices"][0]["message"]["content"])

For partial CPU offload when full VRAM is unavailable (slower, but functional):

# Example: 2× A10G 24 GB (48 GB total VRAM) + 64 GB system RAM
llm = Llama(
    model_path=model_path,
    n_ctx=4096,
    n_gpu_layers=48,       # Offload first N layers; tune based on available VRAM
    tensor_split=[1, 1],
    verbose=True,
)
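The "tune based on available VRAM" step can be roughed out from numbers already on this card: 65.5 GB across 64 layers is roughly 1.02 GB per layer at F16. A sketch, where the reserve figure for KV cache and runtime buffers is an assumption and the result is a starting point, not a guarantee:

```python
MODEL_GB = 65.5  # F16 GGUF size from this card
N_LAYERS = 64    # layer count from this card

def estimate_gpu_layers(total_vram_gb, reserve_gb=6.0):
    """Rough n_gpu_layers guess: fill available VRAM minus an assumed
    reserve for KV cache and runtime buffers, then tune from there."""
    per_layer_gb = MODEL_GB / N_LAYERS  # ~1.02 GB per layer at F16
    usable = max(total_vram_gb - reserve_gb, 0.0)
    return min(int(usable / per_layer_gb), N_LAYERS)

print(estimate_gpu_layers(48))  # e.g. 2x A10G 24 GB
```

Watch the verbose llama.cpp load logs and adjust up or down until allocation succeeds without out-of-memory errors.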

CLI β€” llama-cli (multi-GPU)

llama-cli \
  --model qwen-2.5-32B-instruct-gguf-F16.gguf \
  --chat-template qwen2 \
  --system-prompt "You are a precise assistant. Follow instructions exactly." \
  --prompt "Return a JSON object with keys: summary, risk_level, action_items." \
  --n-predict 1024 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --tensor-split 1,1 \
  --temp 0.7

For server deployment:

llama-server \
  --model qwen-2.5-32B-instruct-gguf-F16.gguf \
  --chat-template qwen2 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --tensor-split 1,1 \
  --port 8080 \
  --host 0.0.0.0
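llama-server exposes an OpenAI-compatible chat endpoint at /v1/chat/completions. A minimal client-side sketch that only builds the request against the host and port from the launch command above (the helper name is illustrative; sending is left commented out since it requires a running server):

```python
def build_chat_request(host="localhost", port=8080):
    """Build the URL and JSON payload for llama-server's
    OpenAI-compatible chat completions endpoint."""
    url = f"http://{host}:{port}/v1/chat/completions"
    payload = {
        "messages": [
            {"role": "system",
             "content": "You are a precise assistant. Follow instructions exactly."},
            {"role": "user",
             "content": "Return a JSON object with keys: summary, risk_level, action_items."},
        ],
        "temperature": 0.7,
        "max_tokens": 1024,
    }
    return url, payload

url, payload = build_chat_request()
# To send against a running server:
# import json, urllib.request
# req = urllib.request.Request(url, data=json.dumps(payload).encode(),
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```

Because the endpoint is OpenAI-compatible, standard OpenAI client libraries pointed at this base URL should also work.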

About PBH Applied Systems

PBH Applied Systems, LLC is an Oklahoma City–based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development.

Patrick Hill, M.S. — Founder · Data Scientist · AI/ML Engineer · Author of Applied Machine Learning: Concepts, Tools, and Case Studies (required reading, UAT CSC 373)

Core Service Areas: LLM Optimization & Deployment · AI Evaluation Frameworks · Agentic AI Infrastructure · Scalable AI Application Development · ML Pipeline Design & Analytics · Model & Agent Cataloging


📞 Work With PBH Applied Systems

👉 Book a Scoping Call · 👉 Request an Evaluation Report — from $2,500

Connect

🌐 pbhappliedsystems.com
📧 patrick@pbhappliedsystems.com
💼 LinkedIn
▶️ YouTube
📸 Instagram
👍 Facebook

License

This GGUF repository inherits the license of the base model: Apache 2.0 — Qwen/Qwen2.5-32B-Instruct

The quant_eval evaluation methodology and fixture set are proprietary to PBH Applied Systems, LLC and are not included in this repository.


GGUF conversion performed by PBH Applied Systems, LLC · No behavioral evaluation — see companion Q4_K_M repository for all evaluation data
