llama-3.2-3b-awq-4bit

This repository contains a quantized model artifact produced as part of a graduation project.

Model Details

  • Technique: AWQ
  • Quantization: INT4
  • Base model: meta-llama/Llama-3.2-3B-Instruct
  • Export date: 2026-03-23

Benchmark Summary

Metric                    Original    Quantized
Disk size (GB)            5.98        2.85
Avg inference time (s)    29.24       3.19
Tokens/sec                3.42        31.31
GPU memory (MB)           4513.00     3055.00

Comparison Highlights

  • Speedup: 9.17x
  • Memory reduction: 32.30%
  • Disk/model size reduction: 52.30%
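The highlight figures follow directly from the benchmark table above. A minimal sketch of the arithmetic (values copied from the table):

```python
# Derive the comparison highlights from the benchmark numbers above.
orig_time, quant_time = 29.24, 3.19      # avg inference time (s)
orig_mem, quant_mem = 4513.00, 3055.00   # GPU memory (MB)
orig_disk, quant_disk = 5.98, 2.85       # disk size (GB)

# Speedup is the ratio of average inference times.
speedup = orig_time / quant_time                             # ~9.17x

# Reductions are relative savings versus the original model.
mem_reduction = (orig_mem - quant_mem) / orig_mem * 100      # ~32.3%
disk_reduction = (orig_disk - quant_disk) / orig_disk * 100  # ~52.3%
```

Note that the tokens/sec ratio (31.31 / 3.42 ≈ 9.15x) agrees with the inference-time speedup, as expected.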

Benchmark Notes

  • The numbers above are copied from the local benchmark_results JSON in this project.

Local Source

  • Quantized folder: Advanced-Techniques/AWQ/quantized/llama-awq-4bit
  • Benchmark JSON: Advanced-Techniques/AWQ/benchmark_results/awq_benchmark_results.json

Usage

Load the model with a runtime that supports AWQ INT4 weights, such as AutoAWQ or a recent transformers release with AWQ support.
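A minimal loading sketch, assuming `transformers` and `autoawq` are installed and a CUDA GPU is available (AWQ inference kernels require one); the repo id matches this model card:

```python
MODEL_ID = "emreyigitozturk/llama-3.2-3b-awq-4bit"


def load_model(model_id: str = MODEL_ID):
    """Load the AWQ INT4 checkpoint via transformers.

    The import is done lazily so this sketch can be read or imported
    without the transformers/autoawq dependencies installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # device_map="auto" places the quantized weights on the available GPU.
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    return tokenizer, model


if __name__ == "__main__":
    tokenizer, model = load_model()
    inputs = tokenizer(
        "Explain AWQ quantization in one sentence.", return_tensors="pt"
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The generation call is illustrative; adjust sampling parameters to your use case.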

Limitations

  • This model card is auto-generated from project files.
  • You should validate quality, safety, and license compatibility before public release.