LG-AI-EXAONE committed on
Commit 9644e54 · 1 parent: acb250c

Update README.md

Files changed (1)
  1. README.md +26 -2
README.md CHANGED
@@ -32,7 +32,7 @@ library_name: transformers
  <img src="https://img.shields.io/badge/📝-Blog-E343BD?style=for-the-badge" alt="Blog">
  </a>
  <a href="https://github.com/LG-AI-EXAONE/EXAONE-4.5/blob/main/assets/Technical_Report__EXAONE_4_5.pdf" style="text-decoration: none;">
- <img src="https://img.shields.io/badge/📑-Technical_Report_(TBU)-684CF4?style=for-the-badge" alt="Technical Report">
  </a>
  <a href="https://github.com/LG-AI-EXAONE/EXAONE-4.5" style="text-decoration: none;">
  <img src="https://img.shields.io/badge/🖥️-GitHub-2B3137?style=for-the-badge" alt="GitHub">
@@ -70,7 +70,7 @@ For more details, please refer to the [technical report](https://github.com/LG-A
  - Sliding Window Attention
  - Number of Attention Heads: 40 Q-heads and 8 KV-heads
  - Head Dimension: 128 for both Q/KV
- - Sliding Window Size: 128
  - Global Attention
  - Number of Attention Heads: 40 Q-heads and 8 KV-heads
  - Head Dimension: 128 for both Q/KV
@@ -459,6 +459,30 @@ For better inference speed and memory usage, it is preferred to serve the model
  Practically, you can serve the EXAONE 4.5 model with a 256K context length on a **single H200 GPU**, or on **4x A100-40GB GPUs** using tensor parallelism.

  ### vLLM

  Our forks of both Transformers and vLLM are required to run the EXAONE 4.5 model.
 
  <img src="https://img.shields.io/badge/📝-Blog-E343BD?style=for-the-badge" alt="Blog">
  </a>
  <a href="https://github.com/LG-AI-EXAONE/EXAONE-4.5/blob/main/assets/Technical_Report__EXAONE_4_5.pdf" style="text-decoration: none;">
+ <img src="https://img.shields.io/badge/📑-Technical_Report-684CF4?style=for-the-badge" alt="Technical Report">
  </a>
  <a href="https://github.com/LG-AI-EXAONE/EXAONE-4.5" style="text-decoration: none;">
  <img src="https://img.shields.io/badge/🖥️-GitHub-2B3137?style=for-the-badge" alt="GitHub">
 
  - Sliding Window Attention
  - Number of Attention Heads: 40 Q-heads and 8 KV-heads
  - Head Dimension: 128 for both Q/KV
+ - Sliding Window Size: 4096
  - Global Attention
  - Number of Attention Heads: 40 Q-heads and 8 KV-heads
  - Head Dimension: 128 for both Q/KV
 
  Practically, you can serve the EXAONE 4.5 model with a 256K context length on a **single H200 GPU**, or on **4x A100-40GB GPUs** using tensor parallelism.

+ ### TensorRT-LLM
+
+ TensorRT-LLM provides zero-day support for EXAONE 4.5. Our fork of the Transformers library is required to run the EXAONE 4.5 model.
+ You can install it by running the following command:
+
+ ```bash
+ pip install git+https://github.com/nuxlear/transformers.git@add-exaone4_5
+ ```
+
+ Please refer to the official [installation guide](https://github.com/NVIDIA/TensorRT-LLM?tab=readme-ov-file#getting-started), the [EXAONE documentation](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/exaone), and the [EXAONE 4.5 PR](https://github.com/NVIDIA/TensorRT-LLM/pull/12873) for details.
+
+ After installing TensorRT-LLM, you can launch the server with the following snippet; remove any arguments you do not need.
+
+ ```bash
+ trtllm-serve LGAI-EXAONE/EXAONE-4.5-33B \
+     --tp_size 2 \
+     --port 8000 \
+     --reasoning_parser qwen3
+ ```
+
+ An OpenAI-compatible API server will be available at http://localhost:8000/v1.
+
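As a quick sanity check of the OpenAI-compatible request shape, the sketch below builds a chat-completions payload for the served checkpoint; the prompt and generation settings are illustrative, and the request is only constructed (not sent), so it runs without a live server.

```python
import json

# Chat-completions endpoint exposed by trtllm-serve
# (assumes the default host/port from the launch snippet above).
BASE_URL = "http://localhost:8000/v1"
ENDPOINT = f"{BASE_URL}/chat/completions"

# Request body in the OpenAI chat-completions format; "model" must match
# the checkpoint name passed to trtllm-serve. Prompt and limits are
# illustrative assumptions, not values from the model card.
payload = {
    "model": "LGAI-EXAONE/EXAONE-4.5-33B",
    "messages": [
        {"role": "user", "content": "Summarize sliding window attention in one sentence."}
    ],
    "max_tokens": 256,      # illustrative generation limit
    "temperature": 0.6,     # illustrative sampling setting
}

body = json.dumps(payload)
print(ENDPOINT)  # where an HTTP client would POST this body
```

With a live server, the same payload can be sent by any HTTP client, or by the official OpenAI Python client pointed at `base_url="http://localhost:8000/v1"`.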
  ### vLLM

  Our forks of both Transformers and vLLM are required to run the EXAONE 4.5 model.