EXAONE-4.5-33B NVFP4 (experimental vLLM-served export)

This repository contains an experimental NVFP4 / ModelOpt export of EXAONE 4.5 33B that was validated to load with a patched/forked vLLM stack. Both an EXAONE 4.5 patch and an NVFP4 patch are required. As of 4/10, serving was confirmed to work.

What this repo is

  • Base model: LGAI-EXAONE/EXAONE-4.5-33B
  • Quantization: NVFP4 via ModelOpt
  • Export format: Hugging Face folder with model.safetensors
  • Validation target: local-path serving with vLLM

Important status

This artifact has not been verified against stock upstream vLLM.

Successful loading required:

  • a forked vLLM
  • a forked transformers
  • local loader adjustments / backend workarounds

So treat this repository as an experimental compatibility release.

Required stack

Use these forks first, because upstream support was not sufficient during validation.

uv pip install git+https://github.com/lkm2835/vllm.git@add-exaone4_5
uv pip install git+https://github.com/nuxlear/transformers.git@add-exaone4_5

In practice, when installing vLLM from a precompiled wheel, it may be necessary to reinstall the transformers fork afterwards, since the vLLM install can pull in an upstream transformers release that lacks EXAONE 4.5 config support.

Recommended serving command

VLLM_NVFP4_GEMM_BACKEND=marlin \
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/this/model \
  --served-model-name exaone45-nvfp4 \
  --host 127.0.0.1 \
  --port 8012 \
  --trust-remote-code \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.80 \
  --max-model-len 2048
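The conservative --max-model-len 2048 and fp8 KV cache settings trade context length for memory headroom. The per-sequence KV-cache footprint can be estimated with the usual formula below; the layer/head dimensions used in the example are placeholders, not the actual EXAONE 4.5 33B config values.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem):
    """Rough per-sequence KV-cache size: one K and one V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Placeholder dimensions (NOT the real EXAONE 4.5 33B config):
size = kv_cache_bytes(num_layers=64, num_kv_heads=8, head_dim=128,
                      seq_len=2048, bytes_per_elem=1)  # fp8 KV cache -> 1 byte/elem
print(f"{size / 2**20:.0f} MiB per sequence")  # -> 256 MiB per sequence
```

Doubling --max-model-len doubles this estimate, which is why a short context is paired with --gpu-memory-utilization 0.80 here.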

Validation notes

During validation, successful serving required the following categories of fixes/workarounds:

  1. EXAONE 4.5 architecture support
  • upstream transformers did not sufficiently recognize the model
  • the add-exaone4_5 transformers fork was required
  2. vLLM EXAONE 4.5 support
  • upstream vLLM support was insufficient
  • the add-exaone4_5 vLLM fork was required
  3. NVFP4 packed-QKV scale handling
  • additional loader-side handling was required for k_scale / v_scale
  4. Backend selection
  • NVFP4 GEMM worked with the MARLIN backend
  • attention backend selection had to avoid failing FlashInfer paths in this validation environment
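To illustrate item 3, the sketch below shows the general shape of such a loader-side workaround: duplicating a packed qkv_proj scale into the separate k_scale / v_scale entries an FP8-KV-cache loader may expect. The tensor names and the duplication strategy here are hypothetical, not vLLM's actual loader code.

```python
def expand_packed_qkv_scales(state_dict):
    """Sketch only: derive per-tensor k_scale / v_scale entries from a
    packed qkv_proj scale. Naming scheme is hypothetical, not vLLM's."""
    out = dict(state_dict)
    for name, value in state_dict.items():
        if name.endswith("self_attn.qkv_proj.weight_scale"):
            prefix = name.rsplit("qkv_proj.weight_scale", 1)[0]
            # Reuse the packed scale for K and V unless they already exist.
            out.setdefault(prefix + "attn.k_scale", value)
            out.setdefault(prefix + "attn.v_scale", value)
    return out

sd = {"model.layers.0.self_attn.qkv_proj.weight_scale": 0.02}
expanded = expand_packed_qkv_scales(sd)
```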

Caveats

  • Experimental artifact
  • Output quality may still require prompt/template/generation tuning
  • Different CUDA / GPU / flashinfer / vLLM combinations may behave differently
  • FP8 KV cache scaling warnings were observed during startup
  • Accuracy should be independently verified before production use

Suggested verification steps

After launch:

curl http://127.0.0.1:8012/v1/models

curl http://127.0.0.1:8012/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "exaone45-nvfp4",
    "messages": [{"role": "user", "content": "Hello. Please introduce yourself in one line."}],
    "max_tokens": 64
  }'
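The same request can be issued from Python's standard library. This is a minimal sketch mirroring the curl call above; send() only succeeds with the server running at 127.0.0.1:8012.

```python
import json
import urllib.request

def chat_payload(model, user_content, max_tokens=64):
    """Build the JSON body for the /v1/chat/completions request above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
        "max_tokens": max_tokens,
    }

def send(base_url, payload):
    """POST the payload to a running vLLM OpenAI-compatible server."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = chat_payload("exaone45-nvfp4", "Hello. Please introduce yourself in one line.")
# send("http://127.0.0.1:8012", body)  # requires a running server
```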

Files expected in this repo

  • model.safetensors
  • config.json
  • hf_quant_config.json
  • generation_config.json
  • processor_config.json
  • preprocessor_config.json
  • tokenizer.json
  • tokenizer_config.json
  • chat_template.jinja

Attribution

  • Base model by LG AI / EXAONE
  • Compatibility work validated with forked vLLM and forked transformers