Built with glm-4

This repository provides a quantised version of the GLM-4-9b-Chat-HF model.

Licensing and Usage

This model is distributed under the glm-4-9b License.

  • Attribution: This model is "Built with glm-4".
  • Naming: The model name includes the required "glm-4" prefix.
  • Commercial Use: Users wishing to use this model for commercial purposes must complete the required commercial registration with Zhipu AI, the original model's licensor.
  • Restrictions: Usage for military or illegal purposes is strictly prohibited.

Quantisation Overview

This model was quantised to FP8_DYNAMIC (dynamic 8-bit floating point), in which weights are stored in FP8 and activation scales are computed dynamically at runtime, reducing memory use and improving throughput while preserving chat quality.

  • Methodology: Quantised using the llmcompressor library with a one-shot calibration process.
  • Calibration Data: A custom-curated dataset was used, pulling from ultrachat_200k and LongAlign-10k to ensure the model handles both short-form and long-form context effectively.
  • Architecture: The process preserved the original 40-layer dense structure (layers 0-39) required for the 9B model architecture.
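The one-shot flow described above can be sketched with llmcompressor. This is an illustrative sketch only: the source model path and output directory are assumptions, the exact curation script is not published with this repository, and the `oneshot` import path has moved between llmcompressor releases.

```python
# Sketch of the one-shot FP8_DYNAMIC quantisation flow (illustrative only).
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# FP8_DYNAMIC: static FP8 weights, dynamic per-token activation scales.
recipe = QuantizationModifier(
    targets="Linear",      # quantise the Linear layers...
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],    # ...but keep the output head in full precision
)

oneshot(
    model="THUDM/glm-4-9b-chat-hf",     # hypothetical source checkpoint path
    recipe=recipe,
    output_dir="glm-4-9b-chat-hf-fp8",  # saved in compressed-tensors format
)
```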

Quantisation Details

  • Scheme: FP8_DYNAMIC (Dynamic 8-bit Floating Point).
  • Format: compressed-tensors.
  • Calibration: One-shot calibration using llmcompressor.
  • Calibration Data: Custom distribution weighted toward production-typical prompt lengths:
    • 256–1024 tokens: 90%
    • 1024–2048 tokens: 10%
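The 90/10 length weighting above can be reproduced with a small sampler. This helper is hypothetical (the actual curation script is not published); it only illustrates drawing pre-tokenised samples from the two length buckets in the stated ratio.

```python
import random

def build_calibration_mix(samples, n=512, seed=0):
    """Draw n calibration samples: 90% from the 256-1024 token bucket,
    10% from the 1024-2048 token bucket. Hypothetical helper that mirrors
    the distribution described in the card, not the actual script."""
    rng = random.Random(seed)
    short = [s for s in samples if 256 <= len(s) < 1024]
    long_ = [s for s in samples if 1024 <= len(s) <= 2048]
    n_short = round(n * 0.9)
    mix = ([rng.choice(short) for _ in range(n_short)] +
           [rng.choice(long_) for _ in range(n - n_short)])
    rng.shuffle(mix)
    return mix
```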

Usage

This model is designed for runtimes compatible with the FP8_DYNAMIC format, such as vLLM.

An example docker-compose.yml follows, assuming the model is stored at /models/glm-4-9b-chat-hf-fp8:

services:
  glm4-9b-fp8:
    image: nvcr.io/nvidia/vllm:26.01-py3
    container_name: glm4-9b-fp8
    ipc: host
    shm_size: "32gb"
    ports:
      - "8080:8080"
    volumes:
      - /models:/mnt/models:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command: >
      python3 -m vllm.entrypoints.openai.api_server
      --model /mnt/models/glm-4-9b-chat-hf-fp8
      --served-model-name glm-4-9b-chat
      --port 8080
      --quantization compressed-tensors
      --kv-cache-dtype fp8
      --max-model-len 32768
      --max-num-seqs 64
      --gpu-memory-utilization 0.75
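Once the container is up, vLLM exposes an OpenAI-compatible API on the mapped port. A minimal Python client sketch, assuming the host, port, and served model name from the compose file above:

```python
import json
import urllib.request

# Port and served model name match the compose file above.
API_URL = "http://localhost:8080/v1/chat/completions"

def build_request(prompt, model="glm-4-9b-chat", max_tokens=256):
    """Build an OpenAI-style chat-completion request body."""
    return {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt):
    """POST the prompt to the vLLM server and return the assistant reply."""
    body = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```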

For questions regarding the original model license, contact license@zhipuai.cn.
