llama-3.2-3b-awq-4bit

This repository contains a quantized model artifact produced as part of a graduation project.

Model Details

  • Technique: AWQ
  • Quantization: INT4
  • Base model: meta-llama/Llama-3.2-3B-Instruct
  • Export date: 2026-03-23

Benchmark Summary

Metric                    Original    Quantized
Disk size (GB)            5.98        2.85
Avg inference time (s)    29.24       3.19
Tokens/sec                3.42        31.31
GPU memory (MB)           4513.00     3055.00

Comparison Highlights

  • Speedup: 9.17x
  • Memory reduction: 32.30%
  • Disk/model size reduction: 52.30%
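The highlight figures follow directly from the benchmark table above. A minimal sketch of the arithmetic (values copied from the table):

```python
# Derive the comparison highlights from the benchmark numbers above.
orig_time, quant_time = 29.24, 3.19      # avg inference time (s)
orig_mem, quant_mem = 4513.00, 3055.00   # GPU memory (MB)
orig_disk, quant_disk = 5.98, 2.85       # disk size (GB)

# Speedup is the ratio of average inference times.
speedup = orig_time / quant_time                             # ~9.17x

# Reductions are relative savings versus the original model.
mem_reduction = (orig_mem - quant_mem) / orig_mem * 100      # ~32.3%
disk_reduction = (orig_disk - quant_disk) / orig_disk * 100  # ~52.3%
```

Note that the tokens/sec ratio (31.31 / 3.42 ≈ 9.15x) agrees with the inference-time speedup, as expected.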

Benchmark Notes

  • The numbers above are copied from the local benchmark_results JSON in this project.

Local Source

  • Quantized folder: Advanced-Techniques/AWQ/quantized/llama-awq-4bit
  • Benchmark JSON: Advanced-Techniques/AWQ/benchmark_results/awq_benchmark_results.json

Usage

Load the model with a runtime that supports AWQ INT4 weights, such as AutoAWQ or a recent transformers release with AWQ support.
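A minimal loading sketch, assuming `transformers` and `autoawq` are installed and a CUDA GPU is available (AWQ inference kernels require one); the repo id matches this model card:

```python
MODEL_ID = "emreyigitozturk/llama-3.2-3b-awq-4bit"


def load_model(model_id: str = MODEL_ID):
    """Load the AWQ INT4 checkpoint via transformers.

    The import is done lazily so this sketch can be read or imported
    without the transformers/autoawq dependencies installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # device_map="auto" places the quantized weights on the available GPU.
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    return tokenizer, model


if __name__ == "__main__":
    tokenizer, model = load_model()
    inputs = tokenizer(
        "Explain AWQ quantization in one sentence.", return_tensors="pt"
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The generation call is illustrative; adjust sampling parameters to your use case.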

Limitations

  • This model card is auto-generated from project files.
  • You should validate quality, safety, and license compatibility before public release.