3 A Computer Systems Journey Through LLMs

Training vs. Inference

  • Training: forward + backward, parameter update, offline
  • Inference: autoregressive decoding, online, system bottleneck
  • Deployment performance is dominated by inference, not training

Attention and Complexity

  • Self-attention uses Query (Q), Key (K), Value (V)
  • Vanilla attention complexity: O(N^2), where N is the sequence length
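
The quadratic term comes from the N × N attention-score matrix. Below is a minimal NumPy sketch of single-head self-attention; the shapes and sizes are illustrative assumptions, not taken from any particular model.

```python
import numpy as np

def self_attention(Q, K, V):
    """Single-head self-attention. The scores matrix is N x N,
    which is where the O(N^2) time and memory cost comes from."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # (N, N): quadratic in sequence length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (N, d)

N, d = 1024, 64                                       # illustrative sizes
Q = np.random.randn(N, d)
K = np.random.randn(N, d)
V = np.random.randn(N, d)
out = self_attention(Q, K, V)                         # out.shape == (1024, 64)
```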

Autoregressive Inference

  • Tokens are generated one by one
  • Each new token re-runs, in every layer (see the decoding-loop sketch below):
    • Multi-Head Attention (MHA)
    • MLP
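
A toy decoding loop makes the cost concrete: without any caching, every step re-runs the model over the entire prefix. `toy_forward` below is a hypothetical stand-in for the real per-layer MHA + MLP stack.

```python
import numpy as np

VOCAB = 32000

def toy_forward(token_ids):
    """Stand-in for a full forward pass (MHA + MLP over all tokens).
    Returns fake next-token logits; a real model would go here."""
    rng = np.random.default_rng(len(token_ids))
    return rng.standard_normal(VOCAB)

prompt = [101, 2023, 2003]                  # illustrative token ids
tokens = list(prompt)

# Autoregressive decoding: every step re-runs the model over the
# *entire* prefix, so step t costs O(t^2) attention without a cache.
for _ in range(8):
    logits = toy_forward(tokens)            # recomputes K/V for all previous tokens
    next_id = int(np.argmax(logits))        # greedy decoding for simplicity
    tokens.append(next_id)

print(tokens)
```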

KV Cache

Idea

  • Cache the K and V of all previously processed tokens
  • For each new token, compute only its own Q, K, V and append the new K, V to the cache

Effect

  • Attention complexity per decoding step drops from O(N^2) to O(N)
  • Trade space for computation
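
A minimal single-head sketch of a cached decode step, assuming a simple list-based cache: only the new token's projections are computed, and the score vector is linear in the cached length.

```python
import numpy as np

d = 64
Wq = np.random.randn(d, d)
Wk = np.random.randn(d, d)
Wv = np.random.randn(d, d)

K_cache, V_cache = [], []                   # grows by one row per generated token

def decode_step(x):
    """Attention for ONE new token using the KV cache.
    Only the new token's q/k/v are computed; cached K/V are reused,
    so each step is O(N) instead of O(N^2)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache.append(k)
    V_cache.append(v)
    K = np.stack(K_cache)                   # (N, d)
    V = np.stack(V_cache)
    scores = K @ q / np.sqrt(d)             # (N,): linear in the cached length
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                            # (d,)

for _ in range(16):
    x = np.random.randn(d)                  # stand-in for the new token's hidden state
    out = decode_step(x)
```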

Cost

  • KV Cache consumes large GPU memory
  • Memory usage grows linearly with sequence length
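
A back-of-the-envelope estimate shows why this matters. The model shape below (32 layers, 32 heads, head dimension 128, fp16) is an assumed LLaMA-7B-like configuration, used only for illustration.

```python
# Back-of-the-envelope KV cache size for ONE request (illustrative model shape).
layers     = 32        # transformer layers
heads      = 32        # attention heads
head_dim   = 128       # per-head dimension
bytes_fp16 = 2
seq_len    = 4096      # tokens currently in the context

# 2x for K and V; grows linearly with seq_len
kv_bytes = 2 * layers * heads * head_dim * bytes_fp16 * seq_len
print(f"{kv_bytes / 2**30:.2f} GiB")   # 2.00 GiB at 4096 tokens for this shape
```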

KV Cache Memory Problems

  • Internal fragmentation
  • External fragmentation
  • Worsens with variable-length sequences and dynamically arriving requests

Paged Attention

Core Idea

  • Apply virtual memory & paging concepts to KV Cache

Benefits

  • Fixed-size pages
  • Logical–physical separation via indirection
  • Eliminates fragmentation
  • Improves memory utilization and scalability
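
The sketch below illustrates the indirection idea only, not vLLM's actual implementation: a per-request block table maps logical KV blocks to fixed-size physical pages drawn from a shared pool, so at most the last block of each sequence is partially filled. All names and sizes are assumptions.

```python
import numpy as np

BLOCK_SIZE = 16                              # tokens per KV page
NUM_BLOCKS = 1024                            # physical pages in the GPU pool
d = 64

# One shared physical pool of fixed-size pages (K only, for brevity).
k_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, d), dtype=np.float16)
free_blocks = list(range(NUM_BLOCKS))

class Sequence:
    """Per-request block table: logical block i -> physical page id."""
    def __init__(self):
        self.block_table = []
        self.length = 0

    def append_kv(self, k_vec):
        slot = self.length % BLOCK_SIZE
        if slot == 0:                        # current block is full: grab a fresh page
            self.block_table.append(free_blocks.pop())
        phys = self.block_table[-1]
        k_pool[phys, slot] = k_vec           # write into the mapped physical page
        self.length += 1

seq = Sequence()
for _ in range(40):                          # 40 tokens -> 3 pages (two full, one partial)
    seq.append_kv(np.random.randn(d))
print(seq.block_table)                       # [1023, 1022, 1021]
```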

Performance Metrics

  • Throughput: tokens/s
  • Time to first token (TTFT): latency until the first output token is produced
  • Inter-token latency (ITL): time between consecutive output tokens

Inference systems must balance throughput and latency.
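
A small sketch of how these metrics are typically measured around a streaming generator; `generate_stream` is a hypothetical stand-in for a real inference endpoint.

```python
import time

def generate_stream(prompt, n_tokens=32):
    """Hypothetical stand-in for a streaming inference endpoint."""
    for i in range(n_tokens):
        time.sleep(0.01)                     # pretend decode work
        yield i

t0 = time.perf_counter()
timestamps = []
for _tok in generate_stream("hello"):
    timestamps.append(time.perf_counter())

ttft = timestamps[0] - t0                                   # time to first token
itl = [b - a for a, b in zip(timestamps, timestamps[1:])]   # inter-token latencies
throughput = len(timestamps) / (timestamps[-1] - t0)        # tokens per second

print(f"TTFT {ttft*1e3:.1f} ms, mean ITL {sum(itl)/len(itl)*1e3:.1f} ms, "
      f"{throughput:.1f} tok/s")
```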

Parallelism

  • Pipeline Parallelism: split layers across GPUs → higher throughput
  • Tensor Parallelism: split individual weight matrices / tensor ops across GPUs → lower per-GPU compute and memory (sketched below)
  • Mixed Parallelism: combines pipeline and tensor parallelism for very large models
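
A NumPy sketch of the tensor-parallel idea for one linear layer: the weight matrix is split column-wise across "devices" (here just array slices), each shard does a smaller matmul, and the partial outputs are gathered. The sizes are illustrative.

```python
import numpy as np

N, d_in, d_out, world_size = 8, 512, 2048, 4
x = np.random.randn(N, d_in)
W = np.random.randn(d_in, d_out)

# Tensor parallelism: split W column-wise across devices; each device
# computes a slice of the output, then the slices are gathered.
shards = np.split(W, world_size, axis=1)          # each shard is (d_in, d_out / 4)
partial = [x @ w for w in shards]                 # per-"GPU" matmul, 1/4 of the FLOPs each
y_tp = np.concatenate(partial, axis=1)            # all-gather in a real system

assert np.allclose(y_tp, x @ W)                   # same result as the unsplit layer
```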

Batching

  • Static batching: batch padded to the longest request and held until all finish → low GPU utilization
  • Continuous batching:
    • Dynamically add/remove requests
    • Improves GPU utilization and throughput
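
A toy scheduler loop illustrating continuous batching: finished requests leave the batch and waiting requests join between decode steps, so slots never sit idle on padding. `decode_one_step` and the request counts are made up for illustration.

```python
from collections import deque
import random

MAX_BATCH = 4
waiting = deque(range(10))                    # 10 queued request ids
running = {}                                  # request id -> tokens still to generate

def decode_one_step(batch):
    """Stand-in for one fused decode step over the whole batch."""
    for rid in batch:
        running[rid] -= 1

step = 0
while waiting or running:
    # Continuous batching: admit new requests whenever a slot frees up,
    # instead of waiting for the whole batch to finish.
    while waiting and len(running) < MAX_BATCH:
        rid = waiting.popleft()
        running[rid] = random.randint(2, 6)   # remaining tokens for this request
    decode_one_step(list(running))
    finished = [rid for rid, left in running.items() if left == 0]
    for rid in finished:
        del running[rid]                      # leaves the batch immediately
    step += 1

print(f"served 10 requests in {step} decode steps")
```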

Systems Takeaway

LLM inference is a systems problem.

Key principles:

  • Parallelism
  • Pipelining
  • Batching
  • Indirection
  • Speculation
  • Locality