EXAONE-4.5-33B NVFP4 (experimental vLLM-served export)

This repository contains an experimental NVFP4 / ModelOpt export of EXAONE 4.5 33B that was validated to load with a patched/forked vLLM stack. Both an EXAONE 4.5 patch and an NVFP4 patch are required. As of 4/10, serving was confirmed to work.

What this repo is

  • Base model: LGAI-EXAONE/EXAONE-4.5-33B
  • Quantization: NVFP4 via ModelOpt
  • Export format: Hugging Face folder with model.safetensors
  • Validation target: local-path serving with vLLM

Important status

This artifact has not been verified against stock upstream vLLM.

Successful loading required:

  • a forked vLLM
  • a forked transformers
  • local loader adjustments / backend workarounds

So treat this repository as an experimental compatibility release.

Required stack

Use these forks first, because upstream support was not sufficient during validation.

uv pip install git+https://github.com/lkm2835/vllm.git@add-exaone4_5
uv pip install git+https://github.com/nuxlear/transformers.git@add-exaone4_5

In practice, when installing vLLM from a precompiled wheel, it may be necessary to reinstall the transformers fork afterwards, since the vLLM install can pull in an upstream transformers release that lacks EXAONE 4.5 config support.

Recommended serving command

VLLM_NVFP4_GEMM_BACKEND=marlin \
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/this/model \
  --served-model-name exaone45-nvfp4 \
  --host 127.0.0.1 \
  --port 8012 \
  --trust-remote-code \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.80 \
  --max-model-len 2048
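The conservative --max-model-len 2048 and fp8 KV cache settings trade context length for memory headroom. The per-sequence KV-cache footprint can be estimated with the usual formula below; the layer/head dimensions used in the example are placeholders, not the actual EXAONE 4.5 33B config values.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem):
    """Rough per-sequence KV-cache size: one K and one V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Placeholder dimensions (NOT the real EXAONE 4.5 33B config):
size = kv_cache_bytes(num_layers=64, num_kv_heads=8, head_dim=128,
                      seq_len=2048, bytes_per_elem=1)  # fp8 KV cache -> 1 byte/elem
print(f"{size / 2**20:.0f} MiB per sequence")  # -> 256 MiB per sequence
```

Doubling --max-model-len doubles this estimate, which is why a short context is paired with --gpu-memory-utilization 0.80 here.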

Validation notes

During validation, successful serving required the following categories of fixes/workarounds:

  1. EXAONE 4.5 architecture support
  • upstream transformers did not sufficiently recognize the model
  • the add-exaone4_5 transformers fork was required
  2. vLLM EXAONE 4.5 support
  • upstream vLLM support was insufficient
  • the add-exaone4_5 vLLM fork was required
  3. NVFP4 packed-QKV scale handling
  • additional loader-side handling was required for k_scale / v_scale
  4. Backend selection
  • NVFP4 GEMM worked with the MARLIN backend
  • attention backend selection had to avoid failing FlashInfer paths in this validation environment
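To illustrate item 3, the sketch below shows the general shape of such a loader-side workaround: duplicating a packed qkv_proj scale into the separate k_scale / v_scale entries an FP8-KV-cache loader may expect. The tensor names and the duplication strategy here are hypothetical, not vLLM's actual loader code.

```python
def expand_packed_qkv_scales(state_dict):
    """Sketch only: derive per-tensor k_scale / v_scale entries from a
    packed qkv_proj scale. Naming scheme is hypothetical, not vLLM's."""
    out = dict(state_dict)
    for name, value in state_dict.items():
        if name.endswith("self_attn.qkv_proj.weight_scale"):
            prefix = name.rsplit("qkv_proj.weight_scale", 1)[0]
            # Reuse the packed scale for K and V unless they already exist.
            out.setdefault(prefix + "attn.k_scale", value)
            out.setdefault(prefix + "attn.v_scale", value)
    return out

sd = {"model.layers.0.self_attn.qkv_proj.weight_scale": 0.02}
expanded = expand_packed_qkv_scales(sd)
```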

Caveats

  • Experimental artifact
  • Output quality may still require prompt/template/generation tuning
  • Different CUDA / GPU / flashinfer / vLLM combinations may behave differently
  • FP8 KV cache scaling warnings were observed during startup
  • Accuracy should be independently verified before production use

Suggested verification steps

After launch:

curl http://127.0.0.1:8012/v1/models

curl http://127.0.0.1:8012/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "exaone45-nvfp4",
    "messages": [{"role": "user", "content": "Hello. Please introduce yourself in one line."}],
    "max_tokens": 64
  }'
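The same request can be issued from Python's standard library. This is a minimal sketch mirroring the curl call above; send() only succeeds with the server running at 127.0.0.1:8012.

```python
import json
import urllib.request

def chat_payload(model, user_content, max_tokens=64):
    """Build the JSON body for the /v1/chat/completions request above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
        "max_tokens": max_tokens,
    }

def send(base_url, payload):
    """POST the payload to a running vLLM OpenAI-compatible server."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = chat_payload("exaone45-nvfp4", "Hello. Please introduce yourself in one line.")
# send("http://127.0.0.1:8012", body)  # requires a running server
```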

Files expected in this repo

  • model.safetensors
  • config.json
  • hf_quant_config.json
  • generation_config.json
  • processor_config.json
  • preprocessor_config.json
  • tokenizer.json
  • tokenizer_config.json
  • chat_template.jinja

Attribution

  • Base model by LG AI / EXAONE
  • Compatibility work validated with forked vLLM and forked transformers