# Mistral-7B-Instruct-v0.3 TensorRT-LLM checkpoint (W8A8 SmoothQuant + INT8 KV)
TensorRT-LLM checkpoint for Mistral-7B-Instruct-v0.3, with W8A8 SmoothQuant quantization for model compute and INT8 KV cache. Use with trtllm-build to produce an engine for inference.
## Model details
| Item | Value |
|---|---|
| Base model | mistralai/Mistral-7B-Instruct-v0.3 |
| Framework | TensorRT-LLM (checkpoint format) |
| Weight/activation quantization | W8A8 SmoothQuant (W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN) |
| KV cache | INT8 |
| Producer | TensorRT-LLM convert_checkpoint.py (Llama converter path for Mistral in TRT-LLM 1.1.0) |
| Key conversion flags | --smoothquant 0.5 --per_token --per_channel --int8_kv_cache |
| Calibration size | 512 samples (--calib_size 512) |
| Architecture | MistralForCausalLM (decoder-only) |
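As a back-of-the-envelope illustration of what the INT8 settings in the table buy (a sketch assuming roughly 7.25B parameters for Mistral-7B; real checkpoints also store quantization scales and may keep some tensors, such as embeddings, in higher precision):

```python
# Rough weight-storage estimate at different precisions.
# Assumption: ~7.25e9 parameters; actual on-disk size differs slightly
# because scales and non-quantized tensors add overhead.

PARAMS = 7.25e9

def weight_gib(bits_per_param: int) -> float:
    """Approximate weight storage in GiB for a given bit width."""
    return PARAMS * bits_per_param / 8 / 2**30

fp16_gib = weight_gib(16)  # ~13.5 GiB
int8_gib = weight_gib(8)   # ~6.8 GiB
print(f"FP16 ~= {fp16_gib:.1f} GiB, INT8 ~= {int8_gib:.1f} GiB")
```

The INT8 KV cache setting similarly halves KV-cache memory relative to FP16, which matters most at long sequence lengths and larger batch sizes.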
## Build (how to produce this checkpoint)
This checkpoint is produced using the TensorRT-LLM converter with SmoothQuant and INT8 KV cache enabled:
```bash
python TensorRT-LLM/examples/models/core/llama/convert_checkpoint.py \
  --model_dir /path/to/mistral-7b-instruct-v0.3 \
  --output_dir ./mistral-7b-instruct-v0.3-trtllm-ckpt-wq_w8a8sq-kv_int8 \
  --dtype float16 \
  --tp_size 1 \
  --smoothquant 0.5 \
  --per_token \
  --per_channel \
  --int8_kv_cache \
  --calib_size 512
```
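To make the `--smoothquant 0.5` flag concrete: SmoothQuant migrates quantization difficulty from activations to weights by dividing each activation channel by a per-channel scale and multiplying the corresponding weight row by the same scale, so the matmul result is unchanged. The exponent alpha (here 0.5) balances how much difficulty is migrated. A toy, pure-Python sketch (names and shapes are illustrative, not TensorRT-LLM internals):

```python
import math

alpha = 0.5  # corresponds to --smoothquant 0.5

# One activation row X (per input channel) and a small weight matrix W (in x out).
X = [0.8, -1.2, 0.5, 40.0]                               # channel 3 is an outlier
W = [[0.3, -0.7], [1.1, 0.2], [-0.9, 0.4], [0.5, -0.6]]

# Per-channel smoothing scale: s_j = max|X_j|^alpha / max|W_j,:|^(1-alpha)
s = [abs(X[j]) ** alpha / max(abs(w) for w in W[j]) ** (1 - alpha)
     for j in range(len(X))]

X_s = [X[j] / s[j] for j in range(len(X))]               # smoothed activations
W_s = [[s[j] * w for w in W[j]] for j in range(len(X))]  # compensated weights

def matmul_row(x, mat):
    return [sum(x[j] * mat[j][k] for j in range(len(x))) for k in range(len(mat[0]))]

# The product is mathematically unchanged, but the activation outlier shrinks,
# which is what makes per-token INT8 activation quantization viable.
before, after = matmul_row(X, W), matmul_row(X_s, W_s)
assert all(math.isclose(a, b) for a, b in zip(before, after))
```

`--per_token` and `--per_channel` then select per-token activation scales and per-channel weight scales for the actual INT8 quantization of the smoothed tensors.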
### Environment note
In this environment, loading the tokenizer via the slow path (use_fast=False) returns an invalid object for this model, so tokenizer loading is forced to use_fast=True at conversion time. This affects only tokenizer loading during conversion; it does not change the target quantization configuration.
### Output
After conversion, --output_dir contains:
- `config.json`: TensorRT-LLM checkpoint config
- `rank0.safetensors`: rank-0 checkpoint weights (single-GPU)
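A quick sanity check after conversion is to confirm the quantization fields in the emitted config.json. This is a sketch: the field names below (`quantization`, `quant_algo`, `kv_cache_quant_algo`) follow the TensorRT-LLM checkpoint format but can vary across versions, so verify them against your actual file:

```python
import json

def check_quant_config(cfg: dict) -> bool:
    """Return True if the checkpoint config declares the expected quantization."""
    q = cfg.get("quantization", {})
    return (q.get("quant_algo") == "W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN"
            and q.get("kv_cache_quant_algo") == "INT8")

# Inline fragment for illustration; in practice:
#   with open("config.json") as f: cfg = json.load(f)
cfg = {"quantization": {"quant_algo": "W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN",
                        "kv_cache_quant_algo": "INT8"}}
assert check_quant_config(cfg)
```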
## Upload (how to upload to Hugging Face)
```bash
cd ./mistral-7b-instruct-v0.3-trtllm-ckpt-wq_w8a8sq-kv_int8
huggingface-cli repo create rungalileo/mistral-7b-instruct-v0.3-trtllm-ckpt-wq_w8a8sq-kv_int8 --repo-type model
huggingface-cli upload rungalileo/mistral-7b-instruct-v0.3-trtllm-ckpt-wq_w8a8sq-kv_int8 . --repo-type model
```
## How to use
### 1. Build engine
```bash
git clone https://huggingface.co/rungalileo/mistral-7b-instruct-v0.3-trtllm-ckpt-wq_w8a8sq-kv_int8
cd mistral-7b-instruct-v0.3-trtllm-ckpt-wq_w8a8sq-kv_int8
trtllm-build --checkpoint_dir . --output_dir ./engine \
  --max_batch_size 1 --max_input_len 512 --max_seq_len 1024
```
### 2. Run inference
Use a tokenizer from the base model:
```bash
trtllm-serve ./engine --tokenizer mistralai/Mistral-7B-Instruct-v0.3 --port 8000
# OpenAI-compatible API: http://localhost:8000/v1/completions
```
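Once the server is up, any OpenAI-compatible client can talk to it. A minimal stdlib-only sketch; the `"model"` value is an assumption (check what the server registers, e.g. via `GET /v1/models`):

```python
import json
import urllib.request

def build_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a /v1/completions request for the local trtllm-serve instance."""
    payload = {
        "model": "engine",  # assumption: confirm the served model name
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Mistral-Instruct models expect the [INST] ... [/INST] chat template.
req = build_request("[INST] What is SmoothQuant? [/INST]")
# resp = json.load(urllib.request.urlopen(req))  # requires the server to be running
```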
## References
- TensorRT-LLM
- Mistral-7B-Instruct-v0.3