arXiv:2405.07719

A Unified Sequence Parallelism Approach for Long Context Generative AI

Published on May 13, 2024

Abstract

A unified sequence parallelism approach for generative AI models is proposed, enhancing robustness across transformer architectures and hardware, and achieving high memory and communication efficiency.

AI-generated summary

Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates the state-of-the-art SP approaches, i.e., DeepSpeed-Ulysses and Ring-Attention, and proposes a unified SP approach that is more robust to transformer model architectures and network hardware topology. It compares the communication and memory costs of SP with those of existing parallelism strategies, including data, tensor, ZeRO, expert, and pipeline parallelism, and discusses best practices for designing hybrid 4D parallelism involving SP. We achieved 86% MFU on two 8xA800 nodes using SP for a sequence length of 208K with the LLAMA3-8B model. The code is publicly available at https://github.com/feifeibear/long-context-attention.
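
To make the sequence sharding concrete, the following is a minimal single-process sketch in plain PyTorch. The shapes, variable names, and the simulated all-to-all are illustrative assumptions, not the paper's code or the long-context-attention API; it only shows how each device would hold a sequence shard and how a DeepSpeed-Ulysses-style all-to-all regroups those shards by attention head, while Ring-Attention instead keeps the sequence shard and circulates key/value blocks.

# Minimal single-process sketch of the SP flavors the paper unifies.
# Shapes and names are illustrative only; a real run would use torch.distributed.
import torch

batch, seq_len, num_heads, head_dim = 1, 8192, 16, 128
world_size = 4  # number of devices the sequence dimension is split across

x = torch.randn(batch, seq_len, num_heads, head_dim)

# Sequence parallelism: rank r would hold only seq_shards[r].
seq_shards = list(x.chunk(world_size, dim=1))
assert seq_shards[0].shape == (batch, seq_len // world_size, num_heads, head_dim)

# DeepSpeed-Ulysses style all-to-all (simulated here by reshaping):
# each rank trades its sequence shard of all heads for the full sequence of a
# subset of heads, so local attention sees every token for its heads.
stacked = torch.stack(seq_shards, dim=0)             # [world, b, s/w, h, d]
head_shards = stacked.chunk(world_size, dim=3)       # split the head dimension
ulysses_local = [torch.cat(list(s.unbind(0)), dim=1)  # re-gather the full sequence
                 for s in head_shards]
assert ulysses_local[0].shape == (batch, seq_len, num_heads // world_size, head_dim)

# Ring-Attention style: each rank keeps its sequence shard and instead rotates
# key/value blocks around a ring (send/recv in a real run), accumulating
# block-wise softmax statistics; the unified approach combines both patterns.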

Models citing this paper: 11
Datasets citing this paper: 0
Spaces citing this paper: 498
Collections including this paper: 1