SEQUOIA: Serving exact Llama2-70B on an RTX4090
with half-second per token latency

1Carnegie Mellon University 2Together AI 3Yandex 4Meta AI
*Indicates Equal Contribution

Introduction

We introduce Sequoia, a scalable, robust, and hardware-aware speculative decoding framework that serves LLMs (70B, 33B, ...) at reasonable latency on consumer GPUs without any approximation (using 16-bit precision and preserving the original output distribution). Addressing the robustness and scalability limitations of previous work on speculative decoding, we show below that Sequoia, with a large speculation budget, can serve Llama2-70B on a single RTX-4090 with an average time between tokens (TBT) as low as 0.57s, which is 8X faster than a highly optimized offloading serving system and 9X faster than DeepSpeed-Zero Offloading. On a single 2080Ti GPU (only 11GB of memory), Vicuna-33B can be served with a TBT of 0.87s.

Serving Solutions by Sequoia

GPU     CPU-GPU Bandwidth (GB/s)   Target Model   Draft Model      Sequoia TBT (s)   Baseline TBT (s)
4090    31.5                       Llama2-70B     Llama2-7B        0.57              4.54
4090    31.5                       Vicuna-33B     TinyVicuna-1B    0.35              1.78
4090    31.5                       Llama2-22B     TinyLlama-1.1B   0.17              0.95
4090    31.5                       InternLM-20B   InternLM-7B      0.17              0.77
4090    31.5                       Llama2-13B     TinyLlama-1.1B   0.09              0.27
2080Ti  15.8                       Vicuna-33B     TinyVicuna-1B    0.87              4.81
2080Ti  15.8                       Llama2-22B     TinyLlama-1.1B   0.53              3.04
2080Ti  15.8                       Llama2-13B     TinyLlama-1.1B   0.34              1.53

Sequoia can speed up LLM inference for a variety of model sizes and types of hardware. We evaluate Sequoia with LLMs of various sizes (including Llama2-70B-chat, Vicuna-33B, Llama2-22B, InternLM-20B, and Llama2-13B-chat) on an RTX-4090 and a 2080Ti, using prompts from MT-Bench. The hardware platforms differ in GPU, CPU RAM, and CPU-GPU bandwidth. The evaluation results are listed above.
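The speedups quoted in the introduction follow directly from this table. As a quick check (the numbers below are simply table entries, nothing new is measured here):

```python
# Speedup implied by the table above: baseline TBT / Sequoia TBT.
rows = {
    "Llama2-70B on 4090":   (0.57, 4.54),
    "Vicuna-33B on 4090":   (0.35, 1.78),
    "Vicuna-33B on 2080Ti": (0.87, 4.81),
}
for name, (sequoia_tbt, baseline_tbt) in rows.items():
    print(f"{name}: {baseline_tbt / sequoia_tbt:.1f}x faster")
# Llama2-70B on 4090: 8.0x faster   <- the ~8X figure from the introduction
# Vicuna-33B on 4090: 5.1x faster
# Vicuna-33B on 2080Ti: 5.5x faster
```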

Here we show a demo of Llama2-70B inference on a single RTX-4090, with and without Sequoia (video plays at 4X speed).

Why Sequoia

Sequoia significantly accelerates LLM serving with offloading thanks to two key advantages. First, Sequoia scales better with a large speculation budget: for a given draft/target model pair, Sequoia uses a dynamic programming algorithm to search for the optimal tree structure, so the number of accepted tokens grows much faster with the budget (i.e., the size of the speculation tree). Second, thanks to its sampling-without-replacement algorithm, Sequoia is robust across generation temperatures, unlike top-k sampling and sampling with replacement. Beyond offloading, Sequoia also provides a hardware-aware solution that adjusts the size and depth of the speculation tree to the hardware platform. Sequoia can likewise speed up LLM inference on data-center GPUs such as the A100 and L40, which is discussed in detail in our paper.
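To make the tree-search idea concrete, here is a minimal sketch of a dynamic program in the spirit of Sequoia's tree construction. It is not the paper's exact formulation: for illustration it assumes the i-th token drafted at any node is accepted with the same probability p[i-1] regardless of depth, and F(n) is then the expected number of tokens produced per target forward pass by the best n-node tree.

```python
from functools import lru_cache

p = [0.8, 0.5, 0.3, 0.2]  # hypothetical per-branch acceptance rates

@lru_cache(maxsize=None)
def F(n: int) -> float:
    """Expected accepted tokens of the best speculation tree with n nodes."""
    return 1.0 + split(n - 1, 0)      # the root token is always produced

@lru_cache(maxsize=None)
def split(budget: int, branch: int) -> float:
    """Best way to spend `budget` nodes on the children of one node,
    using only branches with index >= `branch` (keeps branches ordered)."""
    if budget == 0 or branch == len(p):
        return 0.0
    best = split(budget, branch + 1)                 # skip this branch
    for size in range(1, budget + 1):                # give `size` nodes to this child
        best = max(best, p[branch] * F(size) + split(budget - size, branch + 1))
    return best

print(F(1), F(8), F(64))   # expected tokens per step grows sub-linearly with budget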

Robustness and Scalability

Left (Scalability): Handcrafted tree structures do not scale well with large speculation budgets.
Right (Robustness): The total acceptance rate of 5 speculated tokens. Sampling with replacement (SpecTr) fails at low temperature, and top-k sampling fails at high temperature. Sequoia, which samples without replacement, attains the highest acceptance rate.
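The drafting side of sampling without replacement is easy to illustrate. The sketch below is not Sequoia's verification algorithm; it only shows how k distinct candidate tokens can be drawn for one node of the speculation tree (function name and shapes are illustrative assumptions).

```python
import torch

def draft_children(draft_logits: torch.Tensor, k: int, temperature: float = 1.0):
    """Draw k *distinct* candidate tokens for one node of the speculation tree."""
    probs = torch.softmax(draft_logits / max(temperature, 1e-5), dim=-1)
    # replacement=False means the same token can never be drafted twice, so no
    # speculation budget is wasted on duplicates at low temperature.
    return torch.multinomial(probs, num_samples=k, replacement=False)

# Example: 4 distinct children drawn from a toy 10-token vocabulary.
tokens = draft_children(torch.randn(10), k=4, temperature=0.6)
```

Sampling with replacement can spend the whole budget on copies of one high-probability token at low temperature, while top-k is deterministic and cannot cover the tail at high temperature; sampling without replacement avoids both failure modes.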


Below we show two examples of tree structures in Sequoia. The left one has 64 nodes, suitable for on-chip inference; the right one has 768 nodes, suitable for offloading settings. More budget is allocated to nodes in earlier layers, which have a higher probability of being accepted.

Tree Shape
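As a hypothetical illustration (not the repository's actual tree format or the trees shown above), the per-layer branching factors below shrink with depth, reflecting that earlier layers are reached with higher probability and therefore repay more budget; the helper simply counts the resulting nodes.

```python
# Hypothetical per-layer branching factors: wide near the root, narrow deeper.
branching_per_layer = [8, 4, 2, 1]

def tree_size(branching):
    """Total node count (root included) when every node in a layer
    receives the number of children given for that layer."""
    nodes, layer_width = 1, 1
    for b in branching:
        layer_width *= b
        nodes += layer_width
    return nodes

print(tree_size(branching_per_layer))  # 1 + 8 + 32 + 64 + 64 = 169 nodes
```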

Conclusion and Future Work

With a large speculation budget, anyone can use an RTX 4090 or another low-cost consumer GPU (e.g., the AMD RX 7900) with Sequoia to host very strong LLMs, such as 70B models, without approximation, broadening the applications of AI-generated content. In addition, we believe Sequoia will perform particularly well on future hardware, because its performance scales with the compute/bandwidth ratio of the hardware, which has been increasing over time (e.g., from V100 to A100 to H100). Moreover, as a speculative decoding framework that mitigates the gap in the memory hierarchy, Sequoia adapts to any draft/target pair and any AI accelerator. We will keep a close eye on developments in the hardware community.


BibTeX

@article{chen2024sequoia,
  title={Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding},
  author={Chen, Zhuoming and May, Avner and Svirschevski, Ruslan and Huang, Yuhsun and Ryabinin, Max and Jia, Zhihao and Chen, Beidi},
  journal={arXiv preprint arXiv:2402.12374},
  year={2024}
}