14 Energy-Efficient Artificial Intelligence
AI 1.0 vs AI 2.0
AI 1.0
- Task-specific models
- Heavy feature engineering
- Data tightly coupled to each task
- Limited generalization
AI 2.0
- Foundation models (e.g., LLMs)
- Unified token-based representation
- Core paradigm: next-token prediction
- Strong transfer and general intelligence
System 1 vs System 2
System 1
- Fast, intuitive, automatic
- Corresponds to perception intelligence
System 2
- Slow, analytical, reasoning-based
- Corresponds to cognitive intelligence
Modern LLMs aim to combine both.
LLM Inference Pipeline
Pre-fill Stage
- Processes the full prompt in parallel
- Compute-bound (large matrix multiplications)
- Initializes the KV cache
Decoding Stage
- Autoregressive, token-by-token generation
- Memory-bandwidth bound (KV cache reads dominate)
- Latency-critical
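The split is easiest to see in a toy autoregressive loop. Below is a minimal single-head attention in NumPy; the identity projections and random "tokens" are illustrative stand-ins for a real model, and only the KV-cache mechanics are the point.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention of one query against all cached positions.
    scores = q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 8                                  # toy head dimension
Wq = Wk = Wv = np.eye(d)               # identity projections, illustration only

# Pre-fill: the whole prompt is processed at once (one big, compute-bound
# matmul in a real model) and the KV cache is built.
prompt = np.random.randn(5, d)         # five prompt "tokens"
K_cache = prompt @ Wk
V_cache = prompt @ Wv

# Decode: one token per step; every step re-reads the entire cache, which is
# why this phase is memory-bandwidth bound rather than compute-bound.
x = prompt[-1]
for _ in range(3):
    q = x @ Wq
    x = attention(q, K_cache, V_cache)
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
```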
Tokens per Joule (Tokens/J)
- New efficiency metric for AI 2.0
- Replaces raw FLOPS as the primary system-level objective
- Measures end-to-end inference efficiency
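A back-of-the-envelope definition, with hypothetical measurements:

```python
def tokens_per_joule(num_tokens: int, avg_power_w: float, wall_time_s: float) -> float:
    """End-to-end efficiency: generated tokens per joule of energy consumed.

    Energy (J) = average power (W) x wall-clock time (s).
    """
    return num_tokens / (avg_power_w * wall_time_s)

# Hypothetical numbers: 1000 tokens generated in 4 s at 300 W average draw.
print(tokens_per_joule(1000, 300.0, 4.0))  # ~0.83 tokens/J
```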
Scaling Paradigms
Scaling Up
- Larger models
- More data
- Higher capability ceiling
Scaling Down
- Maintain performance at lower cost
- Techniques:
- Quantization
- Pruning
- Knowledge distillation
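Of the three techniques, knowledge distillation is the least self-explanatory. Below is a minimal NumPy sketch of the classic soft-label objective (Hinton et al., 2015); the logits and temperature are hypothetical.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between temperature-softened teacher and student
    # distributions; T**2 rescales gradients to the same magnitude as the
    # hard-label loss, following Hinton et al. (2015).
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return (T ** 2) * np.sum(p_t * (np.log(p_t) - np.log(p_s)))

# Hypothetical logits for a 4-class toy problem.
teacher = np.array([4.0, 1.0, 0.5, 0.2])
student = np.array([2.5, 1.2, 0.8, 0.4])
print(distillation_loss(student, teacher))
```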
Scaling Out
- Distributed systems
- Parallelism and system-level optimization
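As a concrete (if simulated) sketch of scaling out, the NumPy snippet below partitions a weight matrix column-wise across four pretend devices, in the spirit of Megatron-style tensor parallelism; the devices are simulated in a single process.

```python
import numpy as np

x = np.random.randn(2, 8)            # activations (batch, hidden)
W = np.random.randn(8, 16)           # full weight matrix
shards = np.split(W, 4, axis=1)      # one column shard per "device"

partials = [x @ w for w in shards]   # each device computes only its shard
y = np.concatenate(partials, axis=1) # all-gather along the output dimension
assert np.allclose(y, x @ W)         # same result as the unpartitioned matmul
```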
New Scaling
- Multi-agent and collaborative LLM systems
- Example: MetaGPT
Quantization
- Lower numerical precision (FP32 → INT8 / INT4)
- Reduce memory footprint and computation cost
- Main challenge: preserving accuracy
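A minimal per-tensor symmetric scheme makes the trade-off concrete. Real deployments typically add per-channel scales and calibration data, so treat this as a sketch:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    # Symmetric per-tensor quantization: map [-max|x|, +max|x|] onto [-127, 127].
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)   # FP32 weights
q, s = quantize_int8(w)                        # 4x smaller storage than FP32
err = np.abs(w - dequantize(q, s)).max()       # the accuracy cost being paid
print(q.dtype, f"max abs error: {err:.4f}")
```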
Hardware–Software Co-Design
- Joint optimization of:
- Algorithms
- Systems
- Hardware architecture
- Can enable orders-of-magnitude gains in performance per watt
Examples:
- AI accelerators
- Memory-aware model design
- Dataflow optimization
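As one example of memory-aware, dataflow-style thinking, the sketch below blocks a matrix multiply into tiles. In NumPy this only illustrates the loop structure; the actual bandwidth savings come from the same blocking implemented in C/CUDA, where a tile fits in cache or on-chip SRAM.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    # Blocked matrix multiply: work on tile x tile sub-blocks so that each
    # block can stay resident in fast memory, reducing slow-memory traffic.
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A, B = np.random.randn(64, 64), np.random.randn(64, 64)
assert np.allclose(tiled_matmul(A, B), A @ B)
```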
Golden Age of Computer Architecture (AI Era)
- Moore’s Law slowing down
- Performance gains come from:
- Architectural innovation
- Software–hardware interface redesign
- Domain-specific accelerators
Multi-Agent Systems
- Multiple LLMs with specialized roles
- Mimic human organizational structure
- Improve performance on complex tasks
Example:
- MetaGPT
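A minimal sketch of the assembly-line pattern such systems use; the `ask` stub, role names, and prompts are illustrative and loosely mirror MetaGPT-style role decomposition, not its actual API.

```python
def ask(role: str, task: str) -> str:
    # Stub standing in for a real LLM API call.
    return f"[{role}] output for: {task}"

ROLES = ["ProductManager", "Architect", "Engineer", "QA"]

def run_pipeline(requirement: str) -> str:
    # Assembly line: each specialized agent consumes the previous agent's output.
    artifact = requirement
    for role in ROLES:
        artifact = ask(role, artifact)
    return artifact

print(run_pipeline("Build a CLI todo app"))
```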