14 Energy-Efficient Artificial Intelligence
AI 1.0 vs AI 2.0
AI 1.0
- Task-specific models
- Heavy feature engineering
- Data tightly coupled to each task
- Limited generalization
AI 2.0
- Foundation models (e.g., LLMs)
- Unified token-based representation
- Core paradigm: next-token prediction
- Strong transfer and general intelligence
System 1 vs System 2
System 1
- Fast, intuitive, automatic
- Corresponds to perception intelligence
System 2
- Slow, analytical, reasoning-based
- Corresponds to cognitive intelligence
Modern LLMs aim to combine both.
LLM Inference Pipeline
Pre-fill Stage
- Processes the full prompt in parallel
- Compute-bound (large matrix multiplications)
- Initializes the KV cache
Decoding Stage
- Autoregressive, token-by-token generation
- Memory-bandwidth bound (KV cache reads dominate)
- Latency-critical
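The split is easiest to see in a toy autoregressive loop. Below is a minimal single-head attention in NumPy; the identity projections and random "tokens" are illustrative stand-ins for a real model, and only the KV-cache mechanics are the point.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention of one query against all cached positions.
    scores = q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 8                                  # toy head dimension
Wq = Wk = Wv = np.eye(d)               # identity projections, illustration only

# Pre-fill: the whole prompt is processed at once (one big, compute-bound
# matmul in a real model) and the KV cache is built.
prompt = np.random.randn(5, d)         # five prompt "tokens"
K_cache = prompt @ Wk
V_cache = prompt @ Wv

# Decode: one token per step; every step re-reads the entire cache, which is
# why this phase is memory-bandwidth bound rather than compute-bound.
x = prompt[-1]
for _ in range(3):
    q = x @ Wq
    x = attention(q, K_cache, V_cache)
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
```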
Tokens per Joule (Tokens/J)
- New efficiency metric for AI 2.0
- Replaces raw FLOPS as the primary system-level objective
- Measures end-to-end inference efficiency
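A back-of-the-envelope definition, with hypothetical measurements:

```python
def tokens_per_joule(num_tokens: int, avg_power_w: float, wall_time_s: float) -> float:
    """End-to-end efficiency: generated tokens per joule of energy consumed.

    Energy (J) = average power (W) x wall-clock time (s).
    """
    return num_tokens / (avg_power_w * wall_time_s)

# Hypothetical numbers: 1000 tokens generated in 4 s at 300 W average draw.
print(tokens_per_joule(1000, 300.0, 4.0))  # ~0.83 tokens/J
```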
Scaling Paradigms
Scaling Up
- Larger models
- More data
- Higher capability ceiling
Scaling Down
- Maintain performance at lower cost
- Techniques:
- Quantization
- Pruning
- Knowledge distillation
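Of the three techniques, knowledge distillation is the least self-explanatory. Below is a minimal NumPy sketch of the classic soft-label objective (Hinton et al., 2015); the logits and temperature are hypothetical.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between temperature-softened teacher and student
    # distributions; T**2 rescales gradients to the same magnitude as the
    # hard-label loss, following Hinton et al. (2015).
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return (T ** 2) * np.sum(p_t * (np.log(p_t) - np.log(p_s)))

# Hypothetical logits for a 4-class toy problem.
teacher = np.array([4.0, 1.0, 0.5, 0.2])
student = np.array([2.5, 1.2, 0.8, 0.4])
print(distillation_loss(student, teacher))
```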
Scaling Out
- Distributed systems
- Parallelism and system-level optimization
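As a concrete (if simulated) sketch of scaling out, the NumPy snippet below partitions a weight matrix column-wise across four pretend devices, in the spirit of Megatron-style tensor parallelism; the devices are simulated in a single process.

```python
import numpy as np

x = np.random.randn(2, 8)            # activations (batch, hidden)
W = np.random.randn(8, 16)           # full weight matrix
shards = np.split(W, 4, axis=1)      # one column shard per "device"

partials = [x @ w for w in shards]   # each device computes only its shard
y = np.concatenate(partials, axis=1) # all-gather along the output dimension
assert np.allclose(y, x @ W)         # same result as the unpartitioned matmul
```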
New Scaling
- Multi-agent and collaborative LLM systems
- Example: MetaGPT
Quantization
- Lower numerical precision (FP32 → INT8 / INT4)
- Reduce memory footprint and computation cost
- Main challenge: preserving accuracy
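A minimal per-tensor symmetric scheme makes the trade-off concrete. Real deployments typically add per-channel scales and calibration data, so treat this as a sketch:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    # Symmetric per-tensor quantization: map [-max|x|, +max|x|] onto [-127, 127].
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)   # FP32 weights
q, s = quantize_int8(w)                        # 4x smaller storage than FP32
err = np.abs(w - dequantize(q, s)).max()       # the accuracy cost being paid
print(q.dtype, f"max abs error: {err:.4f}")
```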
Hardware–Software Co-Design
- Joint optimization of:
- Algorithms
- Systems
- Hardware architecture
- Can enable orders-of-magnitude gains in performance per watt
Examples:
- AI accelerators
- Memory-aware model design
- Dataflow optimization
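As one example of memory-aware, dataflow-style thinking, the sketch below blocks a matrix multiply into tiles. In NumPy this only illustrates the loop structure; the actual bandwidth savings come from the same blocking implemented in C/CUDA, where a tile fits in cache or on-chip SRAM.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    # Blocked matrix multiply: work on tile x tile sub-blocks so that each
    # block can stay resident in fast memory, reducing slow-memory traffic.
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A, B = np.random.randn(64, 64), np.random.randn(64, 64)
assert np.allclose(tiled_matmul(A, B), A @ B)
```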
Golden Age of Computer Architecture (AI Era)
- Moore’s Law slowing down
- Performance gains come from:
- Architectural innovation
- Software–hardware interface redesign
- Domain-specific accelerators
Multi-Agent Systems
- Multiple LLMs with specialized roles
- Mimic human organizational structure
- Improve performance on complex tasks
Example:
- MetaGPT
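A minimal sketch of the assembly-line pattern such systems use; the `ask` stub, role names, and prompts are illustrative and loosely mirror MetaGPT-style role decomposition, not its actual API.

```python
def ask(role: str, task: str) -> str:
    # Stub standing in for a real LLM API call.
    return f"[{role}] output for: {task}"

ROLES = ["ProductManager", "Architect", "Engineer", "QA"]

def run_pipeline(requirement: str) -> str:
    # Assembly line: each specialized agent consumes the previous agent's output.
    artifact = requirement
    for role in ROLES:
        artifact = ask(role, artifact)
    return artifact

print(run_pipeline("Build a CLI todo app"))
```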