Commit 9644e54
Parent(s): acb250c
Update README.md

README.md (CHANGED)
@@ -32,7 +32,7 @@ library_name: transformers
       <img src="https://img.shields.io/badge/📝-Blog-E343BD?style=for-the-badge" alt="Blog">
     </a>
     <a href="https://github.com/LG-AI-EXAONE/EXAONE-4.5/blob/main/assets/Technical_Report__EXAONE_4_5.pdf" style="text-decoration: none;">
-      <img src="https://img.shields.io/badge/📑-
+      <img src="https://img.shields.io/badge/📑-Technical_Report-684CF4?style=for-the-badge" alt="Technical Report">
     </a>
     <a href="https://github.com/LG-AI-EXAONE/EXAONE-4.5" style="text-decoration: none;">
       <img src="https://img.shields.io/badge/🖥️-GitHub-2B3137?style=for-the-badge" alt="GitHub">
@@ -70,7 +70,7 @@ For more details, please refer to the [technical report](https://github.com/LG-A
 - Sliding Window Attention
   - Number of Attention Heads: 40 Q-heads and 8 KV-heads
   - Head Dimension: 128 for both Q/KV
-  - Sliding Window Size:
+  - Sliding Window Size: 4096
 - Global Attention
   - Number of Attention Heads: 40 Q-heads and 8 KV-heads
   - Head Dimension: 128 for both Q/KV
@@ -459,6 +459,30 @@ For better inference speed and memory usage, it is preferred to serve the model
 Practically, you can serve the EXAONE 4.5 model with 256K context length on **single H200 GPU**, or **4x A100-40GB GPUs** by using a tensor-parallelism.
 
 
+### TensorRT-LLM
+
+TensorRT-LLM provides zero-day support for EXAONE 4.5. The Transformers library from our fork is required to utilize the EXAONE 4.5 model.
+You can install Transformers by running the following commands:
+
+```bash
+pip install git+https://github.com/nuxlear/transformers.git@add-exaone4_5
+```
+
+Please refer to the official [installation guide](https://github.com/NVIDIA/TensorRT-LLM?tab=readme-ov-file#getting-started), the [EXAONE documentation](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/exaone), and the [EXAONE 4.5 PR](https://github.com/NVIDIA/TensorRT-LLM/pull/12873) for details.
+
+After you install TensorRT-LLM, you can launch the server with the following code snippet. You can remove unnecessary arguments from the snippet.
+
+```bash
+trtllm-serve LGAI-EXAONE/EXAONE-4.5-33B \
+    --tp_size 2 \
+    --port 8000 \
+    --reasoning_parser qwen3
+
+```
+
+An OpenAI-compatible API server will be available at http://localhost:8000/v1.
+
+
 ### vLLM
 
 Both Transformers and vLLM of our forks are required to utilize EXAONE 4.5 model.
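The TensorRT-LLM section added by this commit ends by noting that an OpenAI-compatible API server becomes available at http://localhost:8000/v1. As a rough illustration (not part of the commit), a minimal stdlib-only client could look like the sketch below; the model name and port are taken from the `trtllm-serve` snippet, and the request shape follows the standard `/v1/chat/completions` schema.

```python
# Minimal client sketch for the OpenAI-compatible server started by
# `trtllm-serve` above. The model name and port come from that snippet;
# adjust both if you changed the serve command.
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "LGAI-EXAONE/EXAONE-4.5-33B") -> dict:
    """Build the JSON body for a /v1/chat/completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }


def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the request to the server and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client (including the official `openai` Python package pointed at `base_url`) would work equally well; the sketch above only avoids a third-party dependency.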