Llama-3.2-3B-Instruct TensorRT-LLM checkpoint (INT4 AWQ + INT8 KV)

TensorRT-LLM checkpoint for Llama-3.2-3B-Instruct, with INT4 AWQ weight quantization and INT8 KV cache. Use with trtllm-build to produce an engine for inference.

Model details

  • Base model: Llama-3.2-3B-Instruct
  • Framework: TensorRT-LLM (checkpoint format)
  • Weight quantization: INT4 AWQ
  • KV cache: INT8
  • Producer: TensorRT-LLM v0.18.0 convert_checkpoint.py (modelopt 0.25.0)
  • Architecture: LlamaForCausalLM (decoder-only)
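As a rough capacity check (assuming ~3.2B parameters, and ignoring AWQ group scales, zero-points, and any layers kept in higher precision), the INT4 weights alone occupy about:

```shell
# Floor estimate for INT4 weight storage: ~3.2B params × 4 bits each.
# Real checkpoints are larger (scales, zero-points, fp16 embeddings).
python3 -c "print(f'{3.2e9 * 4 / 8 / 2**30:.2f} GiB')"   # prints "1.49 GiB"
```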

Build (how to produce this checkpoint)

1. Environment and dependencies

sudo apt install git-lfs
git lfs install
sudo apt-get update && sudo apt-get -y install python3.12 python3-pip

pip3 install tensorrt_llm==0.18.0 --extra-index-url https://pypi.nvidia.com
pip3 install datasets==3.6.0
pip3 install "onnx>=1.12,<1.20"
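The pinned installs above can be kept isolated in a virtual environment so they don't conflict with system packages (the env name below is arbitrary):

```shell
# Optional: run the pip3 installs above inside a venv (name is illustrative)
python3 -m venv trtllm-env
. trtllm-env/bin/activate
python3 -m pip --version        # confirms pip resolves from the venv
```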

2. Clone repos and base model

git clone -b v0.18.0 https://github.com/NVIDIA/TensorRT-LLM.git
git clone https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct

3. Convert checkpoint (INT4 AWQ + INT8 KV)

python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
  --model_dir ./Llama-3.2-3B-Instruct \
  --output_dir ./llama-3.2-3B-instruct-trtllm-ckpt-wq_int4_awq-kv_int8 \
  --dtype float16 \
  --use_weight_only \
  --weight_only_precision int4_awq \
  --int8_kv_cache

(Optional: set calibration data with --calib_dataset <path_or_name>, e.g. a local parquet dir or pileval.)

4. Output

After conversion, --output_dir contains config.json and rank0.safetensors; those two files are the checkpoint published in this repo.
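A quick way to confirm the conversion picked up both quantization settings is to read the quantization block of the emitted config.json. The sample below mimics what that block should contain for these flags; the field names and values reflect my understanding of the TensorRT-LLM v0.18 checkpoint format, so compare against your actual output rather than treating this as authoritative:

```shell
# Expected shape of the "quantization" section of config.json for
# --weight_only_precision int4_awq plus --int8_kv_cache (assumed layout).
cat > /tmp/expected_quantization.json <<'EOF'
{
  "quant_algo": "W4A16_AWQ",
  "kv_cache_quant_algo": "INT8"
}
EOF
# Parse it the same way you would parse the real config.json:
python3 - <<'PY'
import json
q = json.load(open("/tmp/expected_quantization.json"))
print(q["quant_algo"], q["kv_cache_quant_algo"])
PY
```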

Upload (how to upload to Hugging Face)

cd ./llama-3.2-3B-instruct-trtllm-ckpt-wq_int4_awq-kv_int8

# Create the repo first if it does not exist
huggingface-cli repo create rungalileo/llama-3.2-3B-instruct-trtllm-ckpt-wq_int4_awq-kv_int8 --repo-type model

# Upload everything in the current directory to the repo
huggingface-cli upload rungalileo/llama-3.2-3B-instruct-trtllm-ckpt-wq_int4_awq-kv_int8 . --repo-type model

How to use

1. Build engine

Requires the tensorrt_llm package (the same version used for conversion, e.g. v0.18.0):

# Clone this repo or download from HF
git clone https://huggingface.co/rungalileo/llama-3.2-3B-instruct-trtllm-ckpt-wq_int4_awq-kv_int8
cd llama-3.2-3B-instruct-trtllm-ckpt-wq_int4_awq-kv_int8

# Build TensorRT-LLM engine (adjust max_batch_size / max_seq_len as needed)
trtllm-build --checkpoint_dir . --output_dir ./engine \
  --max_batch_size 1 --max_input_len 512 --max_seq_len 1024

2. Run inference

Example with trtllm-serve (the tokenizer is loaded from the base model, e.g. meta-llama/Llama-3.2-3B-Instruct):

trtllm-serve ./engine --tokenizer meta-llama/Llama-3.2-3B-Instruct --port 8000
# Then call OpenAI-compatible API at http://localhost:8000/v1/completions
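A request to that endpoint follows the OpenAI completions schema. The sketch below builds and validates the JSON body first; the "model" value is an assumption — use whatever name the running server reports under /v1/models:

```shell
# Build the request body for the OpenAI-compatible /v1/completions endpoint.
# "model": "engine" is a placeholder — check /v1/models on your server.
cat > /tmp/completion_request.json <<'EOF'
{"model": "engine", "prompt": "What is TensorRT-LLM?", "max_tokens": 64}
EOF
python3 -m json.tool /tmp/completion_request.json > /dev/null && echo "payload ok"

# With trtllm-serve running on port 8000:
# curl -s http://localhost:8000/v1/completions \
#   -H "Content-Type: application/json" \
#   -d @/tmp/completion_request.json
```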

Files in this repo

  • config.json – TensorRT-LLM model config
  • rank0.safetensors – Rank 0 weights (single-GPU)
