Transformer-based Large Language Models (LLMs) have been widely used in many
fields, and the efficiency of LLM inference has become a hot topic in real
applications. However, LLMs usually have complicated model structures with
massive operations and perform inference in an auto-regressive mode, making it
a challenging task to design a system with high efficiency.

In this paper, we propose an efficient LLM inference solution with low
latency and high throughput. First, we simplify the LLM decoder layer by
fusing data movement and element-wise operations to reduce the memory access
frequency and lower system latency. We also propose a segment KV cache policy
that keeps the key/value tensors of request and response tokens in separate
physical memory for effective device memory management, helping enlarge the
runtime batch size and improve system throughput. A customized
Scaled-Dot-Product-Attention kernel is designed to match our fusion policy
based on the segment KV cache solution. We implement our LLM inference solution
on Intel GPU and release it publicly. Compared with the standard HuggingFace
implementation, the proposed solution achieves up to 7x lower token latency and
27x higher throughput for some popular LLMs on Intel GPU.

Analyzing the Efficient LLM Inference Solution

The article highlights the challenges of designing a high-efficiency inference system for Large Language Models (LLMs). LLMs are widely used in various fields, but their complex model structures and auto-regressive inference mode pose significant barriers to achieving low latency and high throughput.

Simplifying the LLM Decoder Layer

To address these efficiency issues, the paper proposes simplifying the LLM decoder layer by fusing data movement and element-wise operations, which reduces memory access frequency and lowers system latency. By combining these operations into fewer kernels, the approach streamlines the decoding path and keeps intermediate results from being written back to device memory.
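
The paper does not include code, but the general idea of this kind of fusion can be sketched in PyTorch. In the minimal example below, an element-wise chain (a residual add followed by RMSNorm-style scaling) is compiled into a single fused kernel with torch.compile, so the intermediate sum does not round-trip through memory. The function names and shapes are illustrative assumptions, not the authors' actual fused operators.

```python
import torch

def residual_rmsnorm(x, residual, weight, eps=1e-6):
    # In eager mode each op runs as a separate kernel, writing the
    # intermediate tensor `h` back to memory between launches.
    h = x + residual
    variance = h.pow(2).mean(-1, keepdim=True)
    return h * torch.rsqrt(variance + eps) * weight

# torch.compile can fuse the element-wise chain into one kernel, which is
# the effect the paper's hand-written fusion aims for on Intel GPU.
fused_residual_rmsnorm = torch.compile(residual_rmsnorm)

x = torch.randn(2, 4096)
residual = torch.randn(2, 4096)
weight = torch.ones(4096)
out = fused_residual_rmsnorm(x, residual, weight)
print(out.shape)  # torch.Size([2, 4096])
```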

Segment KV Cache Policy

In addition to simplifying the decoder layer, the paper introduces a segment KV cache policy to improve device memory management. The policy keeps the key/value pairs of request (prompt) tokens and response (generated) tokens in separate physical memory regions, which allows device memory to be used more effectively. As a result, the runtime batch size can be enlarged, improving system throughput.
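
The article describes the policy only at a high level. The Python sketch below shows one possible segmented layout, assuming a per-layer, per-request cache with a fixed budget for generated tokens; the class and method names are hypothetical and are not taken from the released code.

```python
import torch

class SegmentKVCache:
    """Toy segmented KV cache: prompt (request) keys/values live in one buffer,
    generated (response) keys/values are appended to a separate buffer."""

    def __init__(self, num_heads, head_dim, max_new_tokens, dtype=torch.float32):
        self.prompt_k = None   # filled once after the prefill pass
        self.prompt_v = None
        # Response segment is pre-allocated and grows by one slot per decoded token.
        self.resp_k = torch.empty(num_heads, max_new_tokens, head_dim, dtype=dtype)
        self.resp_v = torch.empty(num_heads, max_new_tokens, head_dim, dtype=dtype)
        self.resp_len = 0

    def set_prompt(self, k, v):
        # k, v: (num_heads, prompt_len, head_dim), produced by the prefill pass
        self.prompt_k, self.prompt_v = k, v

    def append(self, k, v):
        # k, v: (num_heads, 1, head_dim), produced at each decode step
        self.resp_k[:, self.resp_len] = k.squeeze(1)
        self.resp_v[:, self.resp_len] = v.squeeze(1)
        self.resp_len += 1

    def full_kv(self):
        # Attention reads both segments; the prompt segment is never copied or
        # reallocated as the response segment grows.
        k = torch.cat([self.prompt_k, self.resp_k[:, :self.resp_len]], dim=1)
        v = torch.cat([self.prompt_v, self.resp_v[:, :self.resp_len]], dim=1)
        return k, v
```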

Scaled-Dot-Product-Attention Kernel

To align with the fusion policy based on the segment KV cache solution, the paper devises a customized Scaled-Dot-Product-Attention (SDPA) kernel tailored to the segmented cache layout, aiming to further improve the overall efficiency of LLM inference.
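
The actual kernel targets Intel GPU and is not reproduced in the article. The snippet below is only a reference implementation of SDPA over a two-segment cache, assuming a single query token per decode step; a real kernel would read both segments in place rather than concatenating them.

```python
import torch

def sdpa_over_segments(q, prompt_kv, resp_kv):
    """Reference scaled-dot-product attention over a two-segment KV cache.

    q: (num_heads, 1, head_dim) query for the current decode step.
    prompt_kv / resp_kv: (k, v) pairs shaped (num_heads, seg_len, head_dim).
    The segments are concatenated here for clarity; a device kernel would
    gather from both segments directly without the extra copy."""
    k = torch.cat([prompt_kv[0], resp_kv[0]], dim=1)
    v = torch.cat([prompt_kv[1], resp_kv[1]], dim=1)
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-1, -2)) * scale   # (num_heads, 1, total_len)
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)                            # (num_heads, 1, head_dim)

# Example: 8 heads, head_dim 64, 5 prompt tokens and 3 generated tokens so far.
q = torch.randn(8, 1, 64)
prompt_kv = (torch.randn(8, 5, 64), torch.randn(8, 5, 64))
resp_kv = (torch.randn(8, 3, 64), torch.randn(8, 3, 64))
print(sdpa_over_segments(q, prompt_kv, resp_kv).shape)  # torch.Size([8, 1, 64])
```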

Implementation and Results

The proposed LLM inference solution is implemented on Intel GPU and made publicly available. The performance of the solution is compared against the standard HuggingFace implementation. The results demonstrate impressive gains, with the proposed solution achieving up to 7x lower token latency and 27x higher throughput for popular LLMs on Intel GPU.
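
The reported numbers come from the authors' own measurements on Intel GPU. As a rough illustration of how per-token latency and throughput are typically measured against the stock HuggingFace generate() path, a timing loop like the one below could be used; the model name, prompt, and generation length are placeholders, and this is not the benchmark harness used in the paper.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; the paper evaluates larger, popular LLMs
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

inputs = tokenizer("Efficient LLM inference on GPUs", return_tensors="pt")
new_tokens = 128

with torch.no_grad():
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start

generated = out.shape[1] - inputs["input_ids"].shape[1]
print(f"avg token latency: {elapsed / generated * 1000:.1f} ms")
print(f"throughput: {generated / elapsed:.1f} tokens/s")
```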

The concepts discussed in this article highlight the multi-disciplinary nature of LLM efficiency optimization, which combines knowledge from model architecture, memory management, and hardware-specific kernel design. The combination of these techniques and the tailored approach to LLM inference could have broader implications for applications that rely on efficient language modeling.