# EXAONE-4.5-33B NVFP4 (experimental vLLM-served export)

This repository contains an experimental NVFP4 / ModelOpt export of EXAONE 4.5 33B that was validated to load with a patched/forked vLLM stack. Patches for both EXAONE 4.5 model support and NVFP4 handling were required; serving was confirmed working as of 4/10.
## What this repo is

- Base model: `LGAI-EXAONE/EXAONE-4.5-33B`
- Quantization: NVFP4 via ModelOpt
- Export format: Hugging Face folder with `model.safetensors`
- Validation target: local-path serving with vLLM
## Important status

This is not a stock-upstream-vLLM-verified artifact. Successful loading required:

- a forked vLLM
- a forked `transformers`
- local loader adjustments / backend workarounds

So treat this repository as an experimental compatibility release.
## Required stack

Use these forks first, because upstream support was not sufficient during validation.

```shell
uv pip install git+https://github.com/lkm2835/vllm.git@add-exaone4_5
uv pip install git+https://github.com/nuxlear/transformers.git@add-exaone4_5
```

In practice, when using a precompiled vLLM wheel path, it may be necessary to reinstall the `transformers` fork after vLLM installation so EXAONE 4.5 config support is preserved.
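One way to sketch that reinstall order is shown below; the `--force-reinstall --no-deps` flags are an assumption about how to pin the fork back in place, not the exact commands used during validation.

```shell
# Sketch: keep the EXAONE 4.5 transformers fork in place after installing vLLM.
# The reinstall flags are an assumption, not the validated commands.
uv pip install git+https://github.com/lkm2835/vllm.git@add-exaone4_5
# Installing a vLLM wheel may pull in upstream transformers; force the fork back.
uv pip install --force-reinstall --no-deps \
  git+https://github.com/nuxlear/transformers.git@add-exaone4_5
```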
## Recommended serving command

```shell
VLLM_NVFP4_GEMM_BACKEND=marlin \
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/this/model \
  --served-model-name exaone45-nvfp4 \
  --host 127.0.0.1 \
  --port 8012 \
  --trust-remote-code \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.80 \
  --max-model-len 2048
```
## Validation notes

During validation, successful serving required the following categories of fixes/workarounds:

- EXAONE 4.5 architecture support
  - upstream `transformers` did not sufficiently recognize the model
  - the `add-exaone4_5` fork was required
- vLLM EXAONE 4.5 support
  - upstream vLLM support was not enough
  - the `add-exaone4_5` vLLM fork was required
- NVFP4 packed-QKV scale handling
  - additional loader-side handling was required for `k_scale` / `v_scale`
- Backend selection
  - NVFP4 GEMM worked with `MARLIN`
  - attention backend selection required avoiding failing FlashInfer paths in this validation environment
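To illustrate the packed-QKV scale issue, the sketch below shows the general shape of the loader-side remapping that was needed: `k_scale` / `v_scale` tensors stored under the packed projection module have to be routed to the attention module that consumes them. The tensor names and the `remap_kv_scales` helper are hypothetical, not vLLM's actual API.

```python
def remap_kv_scales(tensor_names):
    """Hypothetical sketch: map packed-QKV scale names from the checkpoint
    to the attention-side names a loader would expect. Illustrative only."""
    remapped = {}
    for name in tensor_names:
        if name.endswith((".k_scale", ".v_scale")):
            # e.g. "model.layers.0.self_attn.qkv_proj.k_scale"
            #   -> "model.layers.0.self_attn.attn.k_scale"
            remapped[name] = name.replace(".qkv_proj.", ".attn.")
    return remapped

mapping = remap_kv_scales([
    "model.layers.0.self_attn.qkv_proj.k_scale",
    "model.layers.0.self_attn.qkv_proj.v_scale",
    "model.layers.0.self_attn.qkv_proj.weight",  # ordinary weight: untouched
])
```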
## Caveats
- Experimental artifact
- Output quality may still require prompt/template/generation tuning
- Different CUDA / GPU / flashinfer / vLLM combinations may behave differently
- FP8 KV cache scaling warnings were observed during startup
- Accuracy should be independently verified before production use
## Suggested verification steps

After launch:

```shell
curl http://127.0.0.1:8012/v1/models
```

```shell
curl http://127.0.0.1:8012/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "exaone45-nvfp4",
    "messages": [{"role": "user", "content": "Hello. Please introduce yourself in one line."}],
    "max_tokens": 64
  }'
```
## Files expected in this repo

- `model.safetensors`
- `config.json`
- `hf_quant_config.json`
- `generation_config.json`
- `processor_config.json`
- `preprocessor_config.json`
- `tokenizer.json`
- `tokenizer_config.json`
- `chat_template.jinja`
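A quick way to confirm a downloaded copy is complete is to check for these files before serving. This is a generic sketch, not part of any official tooling; the `missing_files` helper is hypothetical.

```python
from pathlib import Path

# File list taken from this repo's "Files expected" section.
EXPECTED_FILES = [
    "model.safetensors",
    "config.json",
    "hf_quant_config.json",
    "generation_config.json",
    "processor_config.json",
    "preprocessor_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "chat_template.jinja",
]

def missing_files(repo_dir):
    """Return the expected export files that are absent from repo_dir."""
    root = Path(repo_dir)
    return [f for f in EXPECTED_FILES if not (root / f).is_file()]
```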
## Attribution

- Base model by LG AI / EXAONE
- Compatibility work validated with forked vLLM and forked `transformers`