# Mistral-7B-Instruct-v0.3 TensorRT-LLM checkpoint (W8A8 SmoothQuant + INT8 KV)
TensorRT-LLM checkpoint for Mistral-7B-Instruct-v0.3, with W8A8 SmoothQuant quantization for model compute and INT8 KV cache. Use with trtllm-build to produce an engine for inference.
## Model details
| Item | Value |
|---|---|
| Base model | mistralai/Mistral-7B-Instruct-v0.3 |
| Framework | TensorRT-LLM (checkpoint format) |
| Weight/activation quantization | W8A8 SmoothQuant (W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN) |
| KV cache | INT8 |
| Producer | TensorRT-LLM convert_checkpoint.py (Llama converter path for Mistral in TRT-LLM 1.1.0) |
| Key conversion flags | --smoothquant 0.5 --per_token --per_channel --int8_kv_cache |
| Calibration size | 512 samples (--calib_size 512) |
| Architecture | MistralForCausalLM (decoder-only) |
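As a back-of-the-envelope illustration of what the INT8 settings in the table buy (a sketch assuming roughly 7.25B parameters for Mistral-7B; real checkpoints also store quantization scales and may keep some tensors, such as embeddings, in higher precision):

```python
# Rough weight-storage estimate at different precisions.
# Assumption: ~7.25e9 parameters; actual on-disk size differs slightly
# because scales and non-quantized tensors add overhead.

PARAMS = 7.25e9

def weight_gib(bits_per_param: int) -> float:
    """Approximate weight storage in GiB for a given bit width."""
    return PARAMS * bits_per_param / 8 / 2**30

fp16_gib = weight_gib(16)  # ~13.5 GiB
int8_gib = weight_gib(8)   # ~6.8 GiB
print(f"FP16 ~= {fp16_gib:.1f} GiB, INT8 ~= {int8_gib:.1f} GiB")
```

The INT8 KV cache setting similarly halves KV-cache memory relative to FP16, which matters most at long sequence lengths and larger batch sizes.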
## Build (how to produce this checkpoint)
This checkpoint is produced using the TensorRT-LLM converter with SmoothQuant and INT8 KV cache enabled:
```bash
python TensorRT-LLM/examples/models/core/llama/convert_checkpoint.py \
  --model_dir /path/to/mistral-7b-instruct-v0.3 \
  --output_dir ./mistral-7b-instruct-v0.3-trtllm-ckpt-wq_w8a8sq-kv_int8 \
  --dtype float16 \
  --tp_size 1 \
  --smoothquant 0.5 \
  --per_token \
  --per_channel \
  --int8_kv_cache \
  --calib_size 512
```
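To make the `--smoothquant 0.5` flag concrete: SmoothQuant migrates quantization difficulty from activations to weights by dividing each activation channel by a per-channel scale and multiplying the corresponding weight row by the same scale, so the matmul result is unchanged. The exponent alpha (here 0.5) balances how much difficulty is migrated. A toy, pure-Python sketch (names and shapes are illustrative, not TensorRT-LLM internals):

```python
import math

alpha = 0.5  # corresponds to --smoothquant 0.5

# One activation row X (per input channel) and a small weight matrix W (in x out).
X = [0.8, -1.2, 0.5, 40.0]                               # channel 3 is an outlier
W = [[0.3, -0.7], [1.1, 0.2], [-0.9, 0.4], [0.5, -0.6]]

# Per-channel smoothing scale: s_j = max|X_j|^alpha / max|W_j,:|^(1-alpha)
s = [abs(X[j]) ** alpha / max(abs(w) for w in W[j]) ** (1 - alpha)
     for j in range(len(X))]

X_s = [X[j] / s[j] for j in range(len(X))]               # smoothed activations
W_s = [[s[j] * w for w in W[j]] for j in range(len(X))]  # compensated weights

def matmul_row(x, mat):
    return [sum(x[j] * mat[j][k] for j in range(len(x))) for k in range(len(mat[0]))]

# The product is mathematically unchanged, but the activation outlier shrinks,
# which is what makes per-token INT8 activation quantization viable.
before, after = matmul_row(X, W), matmul_row(X_s, W_s)
assert all(math.isclose(a, b) for a, b in zip(before, after))
```

`--per_token` and `--per_channel` then select per-token activation scales and per-channel weight scales for the actual INT8 quantization of the smoothed tensors.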
### Environment note
In this environment, loading the tokenizer via the slow path (use_fast=False) returns an invalid object for this model, so tokenizer loading is forced to use_fast=True at conversion time. This affects only tokenizer loading during conversion; it does not change the target quantization configuration.
### Output
After conversion, --output_dir contains:
- `config.json`: TensorRT-LLM checkpoint config
- `rank0.safetensors`: rank-0 checkpoint weights (single-GPU)
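A quick sanity check after conversion is to confirm the quantization fields in the emitted config.json. This is a sketch: the field names below (`quantization`, `quant_algo`, `kv_cache_quant_algo`) follow the TensorRT-LLM checkpoint format but can vary across versions, so verify them against your actual file:

```python
import json

def check_quant_config(cfg: dict) -> bool:
    """Return True if the checkpoint config declares the expected quantization."""
    q = cfg.get("quantization", {})
    return (q.get("quant_algo") == "W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN"
            and q.get("kv_cache_quant_algo") == "INT8")

# Inline fragment for illustration; in practice:
#   with open("config.json") as f: cfg = json.load(f)
cfg = {"quantization": {"quant_algo": "W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN",
                        "kv_cache_quant_algo": "INT8"}}
assert check_quant_config(cfg)
```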
## Upload (how to upload to Hugging Face)
```bash
cd ./mistral-7b-instruct-v0.3-trtllm-ckpt-wq_w8a8sq-kv_int8
huggingface-cli repo create rungalileo/mistral-7b-instruct-v0.3-trtllm-ckpt-wq_w8a8sq-kv_int8 --repo-type model
huggingface-cli upload rungalileo/mistral-7b-instruct-v0.3-trtllm-ckpt-wq_w8a8sq-kv_int8 . --repo-type model
```
## How to use
### 1. Build engine
```bash
git clone https://huggingface.co/rungalileo/mistral-7b-instruct-v0.3-trtllm-ckpt-wq_w8a8sq-kv_int8
cd mistral-7b-instruct-v0.3-trtllm-ckpt-wq_w8a8sq-kv_int8
trtllm-build --checkpoint_dir . --output_dir ./engine \
  --max_batch_size 1 --max_input_len 512 --max_seq_len 1024
```
### 2. Run inference
Use a tokenizer from the base model:
```bash
trtllm-serve ./engine --tokenizer mistralai/Mistral-7B-Instruct-v0.3 --port 8000
# OpenAI-compatible API: http://localhost:8000/v1/completions
```
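Once the server is up, any OpenAI-compatible client can talk to it. A minimal stdlib-only sketch; the `"model"` value is an assumption (check what the server registers, e.g. via `GET /v1/models`):

```python
import json
import urllib.request

def build_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a /v1/completions request for the local trtllm-serve instance."""
    payload = {
        "model": "engine",  # assumption: confirm the served model name
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Mistral-Instruct models expect the [INST] ... [/INST] chat template.
req = build_request("[INST] What is SmoothQuant? [/INST]")
# resp = json.load(urllib.request.urlopen(req))  # requires the server to be running
```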
## References
- TensorRT-LLM
- Mistral-7B-Instruct-v0.3