Llama-3.1-8B-Instruct-FP8-W8A8-Dynamic-Per-Token

This is an FP8 W8A8 quantized version of meta-llama/Llama-3.1-8B-Instruct created using llm-compressor.

Note: This model quantizes weights and activations only; the KV cache is not quantized.

Quantization Details

  • Quantization Method: FP8 W8A8 (Weight and Activation only)
  • Weight Precision: FP8 E4M3 (8-bit floating point), static per-channel quantization
    • Scale shape: (N, 1) — one scale per output channel
    • Observer: MinMax
  • Activation Precision: FP8 E4M3 (8-bit floating point), dynamic per-token quantization
    • Scale computed at runtime: absmax / FP8_E4M3_MAX per token (row)
    • No activation scales stored in checkpoint
  • KV Cache: Not quantized (remains in original precision)
  • Quantization Format: compressed-tensors (float-quantized)
  • Ignored Layers: lm_head only
  • Calibration Dataset: CNN/DailyMail
  • Calibration Samples: 512
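The dynamic per-token scheme above can be sketched in plain Python (an illustrative reference, not the actual kernel): each row of activations gets one scale, computed at runtime as its absolute maximum divided by the FP8 E4M3 maximum.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def quantize_per_token(x):
    """Dynamically quantize rows (tokens) into the FP8 value range.

    Returns values clipped to [-448, 448] plus one scale per row. Rounding
    to the actual e4m3 grid is not modeled; assumes each row is nonzero.
    """
    scales = [max(abs(v) for v in row) / FP8_E4M3_MAX for row in x]
    q = [[min(max(v / s, -FP8_E4M3_MAX), FP8_E4M3_MAX) for v in row]
         for row, s in zip(x, scales)]
    return q, scales
```

Because the scales depend on the runtime activations, none are stored in the checkpoint; dequantization is simply `q[m][k] * scales[m]`.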

vLLM CUTLASS FP8 Kernel

This model is optimized for the vLLM CUTLASS FP8 kernel, which fuses dequantization into the GEMM epilogue:

D[m,n] = a_scale[m] * b_scale[n] * fp8_accum[m,n]
  • a_scale[m]: per-token activation scale (computed dynamically at runtime)
  • b_scale[n]: per-channel weight scale (stored in checkpoint)
  • fp8_accum[m,n]: FP8 x FP8 accumulated result
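The fused epilogue above can be written out as a plain-Python reference (a sketch of the math, not the CUTLASS kernel itself): the FP8 x FP8 products are accumulated first, and the two scales are applied once per output element.

```python
def gemm_fp8_dequant(a_q, b_q, a_scale, b_scale):
    """Reference for D[m][n] = a_scale[m] * b_scale[n] * sum_k a_q[m][k] * b_q[k][n].

    a_q: M x K quantized activations; b_q: K x N quantized weights;
    a_scale: per-token scales (length M); b_scale: per-channel scales (length N).
    """
    K = len(b_q)
    return [[a_scale[m] * b_scale[n] *
             sum(a_q[m][k] * b_q[k][n] for k in range(K))
             for n in range(len(b_q[0]))]
            for m in range(len(a_q))]
```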

Model Size

  • Original Model: ~16GB (FP16)
  • Quantized Model: ~8.5GB (FP8 W8A8)
  • Compression Ratio: ~1.9x
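The sizes above follow directly from the parameter count: ~8B parameters at 2 bytes each give ~16 GB, and at 1 byte each ~8 GB, with per-channel scales and the unquantized lm_head accounting for the extra ~0.5 GB.

```python
params = 8.0e9               # ~8B parameters
fp16_gb = params * 2 / 1e9   # 2 bytes/param -> ~16 GB
fp8_gb = params * 1 / 1e9    # 1 byte/param  -> ~8 GB before scales/overhead
ratio = 16 / 8.5             # reported checkpoint sizes -> ~1.9x
```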

Usage

Installation

pip install "vllm>=0.6.0"

With vLLM

from vllm import LLM, SamplingParams

# Load the FP8 W8A8 quantized model
llm = LLM(
    model="JongYeop/Llama-3.1-8B-Instruct-FP8-W8A8-Dynamic-Per-Token",
)

# Generate text
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

With Transformers (for inspection)

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("JongYeop/Llama-3.1-8B-Instruct-FP8-W8A8-Dynamic-Per-Token")
model = AutoModelForCausalLM.from_pretrained(
    "JongYeop/Llama-3.1-8B-Instruct-FP8-W8A8-Dynamic-Per-Token",
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Performance

FP8 W8A8 Dynamic Per-Token quantization provides:

  • ~2x memory reduction compared to FP16
  • Faster inference with FP8-capable hardware (e.g., NVIDIA H100, Ada Lovelace)
  • Better accuracy than per-tensor quantization, thanks to fine-grained per-token activation scaling
  • Per-channel weight scaling that preserves each output channel's weight distribution

Quantization Recipe

The quantization recipe used for this model is included in the repository as recipe.yaml.

Key configuration:

quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 8
            type: float             # FP8 E4M3
            strategy: channel       # Per-channel (one scale per output channel)
            dynamic: false
            symmetric: true
          input_activations:
            num_bits: 8
            type: float             # FP8 E4M3
            strategy: token         # Per-token (one scale per row)
            dynamic: true           # Scales computed at runtime
            symmetric: true
          targets: ["Linear"]

Hardware Requirements

  • GPU: NVIDIA GPU with compute capability >= 8.9 (Ada Lovelace, Hopper)
    • Examples: RTX 4090, L40S, H100, H200
  • VRAM: Minimum 10GB for inference
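FP8 eligibility can be checked by comparing the GPU's compute capability tuple against (8, 9); on a live system the tuple would come from `torch.cuda.get_device_capability()` (referenced here rather than called, so the sketch stays self-contained):

```python
def supports_cutlass_fp8(capability):
    """True if the (major, minor) compute capability has FP8 tensor cores.

    Ada Lovelace is (8, 9) and Hopper is (9, 0); Ampere (8, 6) and below
    lack native FP8 and fall back to slower weight-only paths.
    """
    return tuple(capability) >= (8, 9)
```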

Citation

If you use this model, please cite:

@software{llm-compressor,
  title = {LLM Compressor},
  author = {vLLM Team},
  url = {https://github.com/vllm-project/llm-compressor},
  year = {2024}
}

@article{llama3,
  title={Llama 3 Model Card},
  author={AI@Meta},
  year={2024},
  url={https://github.com/meta-llama/llama3}
}

License

This model inherits the license of the original Llama 3.1 model (Llama 3.1 Community License).
