---
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-generation
tags:
- TransNormerLLM
---
<div align="center">
<h1>
  TransNormerLLM3 -- A Faster and Better LLM
</h1>
</div>

# Introduction
This official repository releases the TransNormerLLM3 model together with open-source intermediate checkpoints, published every 50 billion tokens of pre-training.

[TransNormerLLM](https://arxiv.org/abs/2307.14995), which evolved from [TransNormer](https://arxiv.org/abs/2210.10340), is the first LLM built on a linear transformer architecture. It is also the first non-Transformer LLM to outperform both conventional Transformer models and other efficient architectures (such as RetNet and Mamba) in speed and performance.

# TransNormerLLM3
- **TransNormerLLM3-15B** has **14.83 billion** parameters, with **42 layers**, **40 attention heads**, and an **embedding size of 5120**.
- **TransNormerLLM3-15B** is built entirely on **[Lightning Attention-2](http://arxiv.org/abs/2401.04658)**, which keeps **TGS** (tokens per GPU per second) **stable** during training at **unlimited sequence lengths**, until hard limits such as GPU memory are reached.
- The **tiktoken** tokenizer is used, with a **vocabulary size** of about **100,000**.
- It uses **Simple GLU (SGLU)** as the channel mixer, **Gated Linear Attention (GLA)** as the token mixer, and **Simple RMSNorm (SRMSNorm)** for normalization.
- For position encoding, the first layer uses **LRPE (Linearized Relative Positional Encoding) with exponential decay**, while the remaining layers use **exponential decay** alone.
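
The linear-attention token mixer with exponential decay, computed block by block in the spirit of Lightning Attention-2, can be illustrated with a toy NumPy sketch. It is not the model's kernel: it assumes a single head with a scalar decay rate `lam` and omits gating, LRPE, normalization, and GPU tiling. It only shows the core idea: inside a block, attention is computed with an explicit causal decay mask; across blocks, all earlier tokens are summarized by a fixed-size `d_k × d_v` state, so per-token cost does not grow with sequence length.

```python
import numpy as np


def decay_mask(n: int, lam: float) -> np.ndarray:
    """Causal decay mask: M[t, s] = lam**(t - s) if s <= t, else 0."""
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    return np.where(diff >= 0, lam ** np.maximum(diff, 0), 0.0)


def linear_attention_naive(q, k, v, lam):
    """Quadratic reference: O = ((Q K^T) * M) V, materializing the full n x n mask."""
    return ((q @ k.T) * decay_mask(q.shape[0], lam)) @ v


def linear_attention_blockwise(q, k, v, lam, block_size=4):
    """Same output, computed block by block with a fixed-size running state."""
    n, d_k = q.shape
    d_v = v.shape[1]
    kv = np.zeros((d_k, d_v))  # decayed sum of k_s^T v_s over all earlier blocks
    out = np.empty((n, d_v))
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        qb, kb, vb = q[start:stop], k[start:stop], v[start:stop]
        L = stop - start
        # Intra-block: ordinary masked attention inside the block ("left product").
        intra = ((qb @ kb.T) * decay_mask(L, lam)) @ vb
        # Inter-block: earlier tokens enter only through the state ("right product").
        # Row a of the block sits a + 1 steps after the state's last token.
        q_decay = lam ** np.arange(1, L + 1)
        inter = (qb * q_decay[:, None]) @ kv
        out[start:stop] = intra + inter
        # Advance the state to the end of this block: decay it by lam**L and add
        # this block's keys/values, each decayed by its distance to the block end.
        k_decay = lam ** np.arange(L - 1, -1, -1)
        kv = lam**L * kv + (kb * k_decay[:, None]).T @ vb
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((10, 8)) for _ in range(3))
ref = linear_attention_naive(q, k, v, lam=0.9)
out = linear_attention_blockwise(q, k, v, lam=0.9, block_size=4)
print("max abs difference:", np.abs(out - ref).max())
```

Setting `block_size=1` reduces the loop to the usual token-by-token recurrence; larger blocks accept a small quadratic cost inside each block in exchange for dense matrix multiplies that use the hardware better.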

## Pre-training Logbook

* Real-time tracking: https://api.wandb.ai/links/opennlplab/kip314lq
* Join the discussion: [discord](https://discord.gg/MYQh6BWN) <<<>>> [wechat group](https://github.com/OpenNLPLab/TransnormerLLM/blob/main/images/contact_me_qr.png)

> --23.12.25-- kickoff: [WeChat - Pre-training Launch](https://mp.weixin.qq.com/s/YjUY-uy89WkF75_-rBTuKw) <<<>>> [Twitter - Pre-training Commences](https://twitter.com/opennlplab/status/1739568669502611825) <<<>>> [YouTube Recording](https://t.co/wk7svS4o5r) <<<>>> [bilibili Recording](https://www.bilibili.com/video/BV11j411J7Dy)

> --24.01.02-- first week review: [WeChat - First Week Overview](https://mp.weixin.qq.com/s/zwGnZZI3itNPoxzzXkuU2w) <<<>>> [Twitter - First Week Review](https://twitter.com/opennlplab/status/1742187694078501038)

> --24.01.09-- second week review: [WeChat - Second Week Overview](https://mp.weixin.qq.com/s/6D0qi-0aBier05OKuHfPEA) <<<>>> [Twitter - Second Week Review](https://twitter.com/opennlplab/status/1744720007299523063)

# Released Weights

| Parameters | Tokens | Hugging Face | Model Scope | Wisemodel |
| :--------: | :----: | :----------: | :---------: | :-------: |
| **15B**    | 50B    | 🤗           | 🤖          | 🐯        |
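
A minimal loading sketch with Hugging Face Transformers, assuming the repository's custom modeling and tokenizer code can be loaded through the `Auto*` classes with `trust_remote_code=True`. The card does not say how intermediate checkpoints are organized on the Hub; if they are stored as branches, the hypothetical helper `list_revisions` shows their names, which can then be passed as `revision`. `tokens_seen` (also a hypothetical helper) only applies the 50B-token release interval stated in the introduction. In 16-bit precision the weights alone take roughly 30 GB (14.83B parameters × 2 bytes), and `device_map="auto"` requires the `accelerate` package.

```python
REPO_ID = "OpenNLPLab/TransNormerLLM3-15B-Intermediate-Checkpoints"
TOKENS_PER_CHECKPOINT = 50_000_000_000  # release interval stated in this card


def tokens_seen(checkpoint_index: int) -> int:
    """Pre-training tokens behind the k-th released checkpoint (1-based)."""
    if checkpoint_index < 1:
        raise ValueError("checkpoint_index is 1-based")
    return checkpoint_index * TOKENS_PER_CHECKPOINT


def list_revisions(repo_id: str = REPO_ID) -> list[str]:
    """Branch names of the Hub repo (assumption: checkpoints may be stored as branches)."""
    from huggingface_hub import list_repo_refs  # imported lazily; needs network access

    return [ref.name for ref in list_repo_refs(repo_id).branches]


def load(repo_id: str = REPO_ID, revision: str = "main"):
    """Load tokenizer and model; the custom architecture requires trust_remote_code=True."""
    from transformers import AutoModelForCausalLM, AutoTokenizer  # heavy import, done lazily

    tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        revision=revision,
        trust_remote_code=True,
        torch_dtype="auto",  # let Transformers pick the checkpoint's dtype
        device_map="auto",   # requires the `accelerate` package
    )
    return tokenizer, model


# Example (downloads the weights, so it is not executed here):
# tokenizer, model = load()
# inputs = tokenizer("Once upon a time,", return_tensors="pt").to(model.device)
# output = model.generate(**inputs, max_new_tokens=64)
# print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Imports are deferred into the functions so the pure helper can be used without installing Transformers; in a real script they would normally sit at the top of the file.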

# Benchmark Results

All models are evaluated with the official settings using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) framework.

| Model                   | P   | T    | BoolQ | PIQA  | HS    | WG    | ARC-e | ARC-c | OBQA  | MMLU  | C-Eval |
| ----------------------- | --- | ---- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ------ |
| **TransNormerLLM3-15B** | 15  | 0.05 | 62.08 | 72.52 | 55.55 | 57.14 | 62.12 | 31.14 | 32.40 | 27.50 | 26.18  |
| **TransNormerLLM3-15B** | 15  | 0.10 | 63.98 | 74.70 | 61.09 | 61.33 | 65.95 | 34.64 | 35.60 | 25.38 | 27.40  |
| **TransNormerLLM3-15B** | 15  | 0.15 | 60.34 | 75.08 | 63.99 | 62.04 | 64.56 | 34.90 | 35.20 | 22.64 | 26.60  |

> **P**: parameter count (billions). **T**: training tokens (trillions). **BoolQ**: acc. **PIQA**: acc. **HS** (HellaSwag): acc_norm. **WG** (WinoGrande): acc. **ARC-e** (ARC-easy): acc. **ARC-c** (ARC-challenge): acc_norm. **OBQA** (OpenBookQA): acc_norm. **MMLU**: 5-shot acc. **C-Eval**: 5-shot acc.
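
For a quick comparison across checkpoints, the table can be summarized with an unweighted mean. The sketch below copies the scores from the benchmark table and averages the seven tasks other than the two 5-shot benchmarks (MMLU, C-Eval). This aggregate and the helper names `mean_score` and `best_checkpoint` are illustrative choices, not metrics or tools reported in this card.

```python
# Scores from the benchmark table; keys are training tokens in trillions (column T).
TASKS = ["BoolQ", "PIQA", "HS", "WG", "ARC-e", "ARC-c", "OBQA", "MMLU", "C-Eval"]
RESULTS = {
    0.05: [62.08, 72.52, 55.55, 57.14, 62.12, 31.14, 32.40, 27.50, 26.18],
    0.10: [63.98, 74.70, 61.09, 61.33, 65.95, 34.64, 35.60, 25.38, 27.40],
    0.15: [60.34, 75.08, 63.99, 62.04, 64.56, 34.90, 35.20, 22.64, 26.60],
}
FIVE_SHOT = frozenset({"MMLU", "C-Eval"})  # 5-shot per the table note


def mean_score(scores: list[float], exclude: frozenset[str] = FIVE_SHOT) -> float:
    """Unweighted mean over the tasks not listed in `exclude`."""
    picked = [s for task, s in zip(TASKS, scores) if task not in exclude]
    return sum(picked) / len(picked)


def best_checkpoint(task: str) -> float:
    """Training tokens (T) of the checkpoint with the highest score on `task`."""
    i = TASKS.index(task)
    return max(RESULTS, key=lambda t: RESULTS[t][i])


for tokens, scores in sorted(RESULTS.items()):
    print(f"{tokens:.2f}T  mean of 7 tasks: {mean_score(scores):.2f}")
```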

# Acknowledgments and Citation

## Acknowledgments

This project builds on the following open-source projects:

- [tiktoken](https://github.com/openai/tiktoken) for the tokenizer.
- [metaseq](https://github.com/facebookresearch/metaseq) for training.
- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) for evaluation.

## Citation

If you use our work, please cite the following references:

```
@article{qin2023scaling,
  title={Scaling transnormer to 175 billion parameters},
  author={Qin, Zhen and Li, Dong and Sun, Weigao and Sun, Weixuan and Shen, Xuyang and Han, Xiaodong and Wei, Yunshen and Lv, Baohong and Yuan, Fei and Luo, Xiao and others},
  journal={arXiv preprint arXiv:2307.14995},
  year={2023}
}

@misc{qin2024lightning,
  title={Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models},
  author={Zhen Qin and Weigao Sun and Dong Li and Xuyang Shen and Weixuan Sun and Yiran Zhong},
  year={2024},
  eprint={2401.04658},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

<p align="center">
- OpenNLPLab @2024 -
</p>