SEQUOIA: Serving exact Llama2-70B on an RTX4090
with half-second per token latency

1Carnegie Mellon University 2Together AI 3Yandex 4Meta AI
*Indicates Equal Contribution

Introduction

We introduce Sequoia, a scalable, robust, and hardware-aware speculative decoding framework that serves LLMs (70B, 33B, ...) at reasonable latency on consumer GPUs without any approximation (using 16-bit precision and preserving the original output distribution). Addressing the robustness and scalability limitations of previous work on speculative decoding, we show below that Sequoia, with a large speculation budget, can serve Llama2-70B on a single RTX-4090 with an average time between tokens (TBT) as low as 0.57s, which is 8X faster than a highly optimized offloading serving system and 9X faster than DeepSpeed-Zero Offloading. On a single 2080Ti GPU (only 11GB of memory), Vicuna-33B can be served with a TBT of 0.87s.

Serving Solutions by Sequoia

GPU     CPU-GPU Bandwidth (GB/s)   Target Model   Draft Model      Sequoia TBT (s)   Baseline TBT (s)
4090    31.5                       Llama2-70B     Llama2-7B        0.57              4.54
4090    31.5                       Vicuna-33B     TinyVicuna-1B    0.35              1.78
4090    31.5                       Llama2-22B     TinyLlama-1.1B   0.17              0.95
4090    31.5                       InternLM-20B   InternLM-7B      0.17              0.77
4090    31.5                       Llama2-13B     TinyLlama-1.1B   0.09              0.27
2080Ti  15.8                       Vicuna-33B     TinyVicuna-1B    0.87              4.81
2080Ti  15.8                       Llama2-22B     TinyLlama-1.1B   0.53              3.04
2080Ti  15.8                       Llama2-13B     TinyLlama-1.1B   0.34              1.53

Sequoia can speed up LLM inference for a variety of model sizes and types of hardware. We evaluate Sequoia with LLMs of various sizes (including Llama2-70B-chat, Vicuna-33B, Llama2-22B, InternLM-20B, and Llama2-13B-chat) on an RTX-4090 and a 2080Ti, using prompts from MT-Bench. The hardware platforms differ in GPU, CPU RAM, and CPU-GPU bandwidth. The evaluation results are listed above.
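The speedups quoted in the introduction follow directly from this table. As a quick check (the numbers below are simply table entries, nothing new is measured here):

```python
# Speedup implied by the table above: baseline TBT / Sequoia TBT.
rows = {
    "Llama2-70B on 4090":   (0.57, 4.54),
    "Vicuna-33B on 4090":   (0.35, 1.78),
    "Vicuna-33B on 2080Ti": (0.87, 4.81),
}
for name, (sequoia_tbt, baseline_tbt) in rows.items():
    print(f"{name}: {baseline_tbt / sequoia_tbt:.1f}x faster")
# Llama2-70B on 4090: 8.0x faster   <- the ~8X figure from the introduction
# Vicuna-33B on 4090: 5.1x faster
# Vicuna-33B on 2080Ti: 5.5x faster
```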

Here we show a demo of Llama2-70B inference on a single RTX-4090, with and without Sequoia (video plays at 4X speed).

Why Sequoia

Sequoia significantly accelerates LLM serving with offloading thanks to two key advantages. First, Sequoia scales better with a large speculation budget: for a given draft/target model pair, Sequoia uses a dynamic programming algorithm to search for the optimal tree structure, so the number of accepted tokens grows much faster with the budget (i.e., the size of the speculation tree). Second, thanks to its sampling-without-replacement algorithm, Sequoia is robust across generation temperatures, unlike top-k sampling and sampling with replacement. Beyond offloading, Sequoia also provides a hardware-aware solution that adjusts the size and depth of the speculation tree to the hardware platform. Sequoia can likewise speed up LLM inference on data-center GPUs such as the A100 and L40, which is discussed in detail in our paper.
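To make the tree-search idea concrete, here is a minimal sketch of a dynamic program in the spirit of Sequoia's tree construction. It is not the paper's exact formulation: for illustration it assumes the i-th token drafted at any node is accepted with the same probability p[i-1] regardless of depth, and F(n) is then the expected number of tokens produced per target forward pass by the best n-node tree.

```python
from functools import lru_cache

p = [0.8, 0.5, 0.3, 0.2]  # hypothetical per-branch acceptance rates

@lru_cache(maxsize=None)
def F(n: int) -> float:
    """Expected accepted tokens of the best speculation tree with n nodes."""
    return 1.0 + split(n - 1, 0)      # the root token is always produced

@lru_cache(maxsize=None)
def split(budget: int, branch: int) -> float:
    """Best way to spend `budget` nodes on the children of one node,
    using only branches with index >= `branch` (keeps branches ordered)."""
    if budget == 0 or branch == len(p):
        return 0.0
    best = split(budget, branch + 1)                 # skip this branch
    for size in range(1, budget + 1):                # give `size` nodes to this child
        best = max(best, p[branch] * F(size) + split(budget - size, branch + 1))
    return best

print(F(1), F(8), F(64))   # expected tokens per step grows sub-linearly with budget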

Robustness and Scalability

Left (Scalability): Handcrafted tree structures do not scale well with large speculation budgets.
Right (Robustness): The total acceptance rate of 5 speculated tokens. Sampling with replacement (SpecTr) fails at low temperature, and top-k sampling fails at high temperature. Sequoia, which samples without replacement, attains the highest acceptance rate.
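The drafting side of sampling without replacement is easy to illustrate. The sketch below is not Sequoia's verification algorithm; it only shows how k distinct candidate tokens can be drawn for one node of the speculation tree (function name and shapes are illustrative assumptions).

```python
import torch

def draft_children(draft_logits: torch.Tensor, k: int, temperature: float = 1.0):
    """Draw k *distinct* candidate tokens for one node of the speculation tree."""
    probs = torch.softmax(draft_logits / max(temperature, 1e-5), dim=-1)
    # replacement=False means the same token can never be drafted twice, so no
    # speculation budget is wasted on duplicates at low temperature.
    return torch.multinomial(probs, num_samples=k, replacement=False)

# Example: 4 distinct children drawn from a toy 10-token vocabulary.
tokens = draft_children(torch.randn(10), k=4, temperature=0.6)
```

Sampling with replacement can spend the whole budget on copies of one high-probability token at low temperature, while top-k is deterministic and cannot cover the tail at high temperature; sampling without replacement avoids both failure modes.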


Below we show two examples of tree structures in Sequoia. The left one has 64 nodes, suitable for on-chip inference; the right one has 768 nodes, suitable for offloading settings. More budget is allocated to nodes in earlier layers, which have a higher probability of being accepted.

Tree Shape
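As a hypothetical illustration (not the repository's actual tree format or the trees shown above), the per-layer branching factors below shrink with depth, reflecting that earlier layers are reached with higher probability and therefore repay more budget; the helper simply counts the resulting nodes.

```python
# Hypothetical per-layer branching factors: wide near the root, narrow deeper.
branching_per_layer = [8, 4, 2, 1]

def tree_size(branching):
    """Total node count (root included) when every node in a layer
    receives the number of children given for that layer."""
    nodes, layer_width = 1, 1
    for b in branching:
        layer_width *= b
        nodes += layer_width
    return nodes

print(tree_size(branching_per_layer))  # 1 + 8 + 32 + 64 + 64 = 169 nodes
```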

Conclusion and Future Work

With a large speculation budget, anyone can use an RTX 4090 or another low-cost consumer GPU (e.g., the AMD RX 7900) with Sequoia to host very strong LLMs, such as 70B models, without approximation, broadening the applications of AI-generated content. In addition, we believe Sequoia will perform particularly well on future hardware, because its performance scales with the compute/bandwidth ratio of the hardware, which has been increasing over time (e.g., from V100 to A100 to H100). Moreover, as a speculative decoding framework that mitigates the gap in the memory hierarchy, Sequoia adapts to any draft/target pair and any AI accelerator. We will keep a close eye on developments in the hardware community.


BibTeX

@article{chen2024sequoia,
  title={Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding},
  author={Chen, Zhuoming and May, Avner and Svirschevski, Ruslan and Huang, Yuhsun and Ryabinin, Max and Jia, Zhihao and Chen, Beidi},
  journal={arXiv preprint arXiv:2402.12374},
  year={2024}
}