Introduction to Deep Learning Systems
1. Recap: Automatic Differentiation
Automatically construct backward computation graph
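A minimal sketch of reverse-mode automatic differentiation, to make the recap concrete. The `Var` class and `backward` function are hypothetical names used only for illustration, and the sketch walks the recorded forward graph in reverse rather than materializing a separate backward graph as a real framework might.

```python
class Var:
    """A scalar node in the forward computation graph (illustrative only)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (parent node, local derivative of this node w.r.t. parent).
        self.parents = parents

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node in the graph."""
    order, seen = [], set()

    def visit(node):
        # Post-order DFS: a node is appended only after all of its parents.
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Reverse topological order: each node's grad is complete before it is used.
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += local_grad * node.grad


x, y = Var(3.0), Var(4.0)
z = x * y + x * x      # dz/dx = y + 2x, dz/dy = x
backward(z)
print(x.grad, y.grad)
```

The chain rule is applied once per edge, so the cost of the backward pass is proportional to the size of the forward graph.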
2. Graph-Level Optimizations
input computation graph -> potential transformations -> optimized graph
Rule-based graph optimization:
- e.g. fusing conv & BN, conv & ReLU, or multiple convs
Limitations of hand-written rules:
- Robustness: experts' heuristics do not apply to all models/hardware
- Scalability: new operators require new rules
- Performance: rules miss subtle optimizations for specific models/hardware
- It is infeasible to manually design graph optimizations for all cases
-> Automated generation and verification of graph optimizations
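As an illustration of one such rule, conv & BN fusion can be sketched as folding the inference-time batch-norm parameters into the convolution's weights and bias. The example below is a sketch under stated assumptions: it uses a plain-Python 1-D convolution, BN in inference mode (fixed mean and variance), and hypothetical helper names (`conv1d`, `batchnorm_inference`, `fuse_conv_bn`); it is not how any particular framework implements the rewrite.

```python
import math
import random


def conv1d(x, weight, bias):
    """Valid 1-D cross-correlation. x: [C_in][T], weight: [C_out][C_in][K]."""
    c_in, t_len, k = len(x), len(x[0]), len(weight[0][0])
    out = []
    for o, w_o in enumerate(weight):
        row = []
        for t in range(t_len - k + 1):
            acc = bias[o]
            for c in range(c_in):
                for j in range(k):
                    acc += w_o[c][j] * x[c][t + j]
            row.append(acc)
        out.append(row)
    return out


def batchnorm_inference(y, gamma, beta, mean, var, eps=1e-5):
    """Per-channel BN with fixed statistics (inference mode)."""
    return [[gamma[o] * (v - mean[o]) / math.sqrt(var[o] + eps) + beta[o]
             for v in row] for o, row in enumerate(y)]


def fuse_conv_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Return conv parameters equivalent to conv followed by BN."""
    fused_w, fused_b = [], []
    for o, w_o in enumerate(weight):
        scale = gamma[o] / math.sqrt(var[o] + eps)
        fused_w.append([[scale * w for w in w_c] for w_c in w_o])
        fused_b.append(scale * (bias[o] - mean[o]) + beta[o])
    return fused_w, fused_b


random.seed(0)
C_IN, C_OUT, K, T = 3, 4, 3, 8
x = [[random.gauss(0, 1) for _ in range(T)] for _ in range(C_IN)]
weight = [[[random.gauss(0, 1) for _ in range(K)] for _ in range(C_IN)]
          for _ in range(C_OUT)]
bias = [random.gauss(0, 1) for _ in range(C_OUT)]
gamma = [random.gauss(0, 1) for _ in range(C_OUT)]
beta = [random.gauss(0, 1) for _ in range(C_OUT)]
mean = [random.gauss(0, 1) for _ in range(C_OUT)]
var = [random.uniform(0.5, 2.0) for _ in range(C_OUT)]

unfused = batchnorm_inference(conv1d(x, weight, bias), gamma, beta, mean, var)
fused_w, fused_b = fuse_conv_bn(weight, bias, gamma, beta, mean, var)
fused = conv1d(x, fused_w, fused_b)

# The two graphs should agree up to floating-point rounding.
max_abs_diff = max(abs(a - b)
                   for row_a, row_b in zip(unfused, fused)
                   for a, b in zip(row_a, row_b))
print(max_abs_diff)
```

The fused graph runs one operator instead of two, which is the kind of equivalence an automated system would have to both discover and verify.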
3. Parallelize ML training
Comparison:
???
3.1 Data Parallelism
- Compute the gradients of each batch on a GPU
- Aggregate gradients across GPUs
Problem: a naive implementation needs a centralized parameter server
Solution: AllReduce performs an element-wise reduction across multiple devices to aggregate the gradients
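A small sketch of why aggregating per-GPU gradients works. It assumes a scalar linear model with mean-squared-error loss and equally sized shards (both chosen only for illustration), and simulates the workers with a plain loop: averaging the local gradients reproduces the full-batch gradient.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x + 1.0 for x in xs]
w = 0.5

num_workers = 4                      # stands in for 4 GPUs
shard = len(xs) // num_workers       # equal shard sizes are assumed

# Step 1: each worker computes the gradient on its own shard of the batch.
local_grads = [
    mse_grad(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
    for i in range(num_workers)
]

# Step 2: aggregate across workers (here a plain average of the local gradients).
aggregated = sum(local_grads) / num_workers

full_batch = mse_grad(w, xs, ys)
print(aggregated, full_batch)
```

With unequal shard sizes the average would need to be weighted by the number of examples per worker.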
3.1.1 AllReduce
Visual intuition on ring-Allreduce for distributed Deep Learning
Comparison:
???
3.1.1.1 Naïve
3.1.1.2 Driver
3.1.1.3 Ring
Construct a ring of N workers, divide M parameters into N slices
2 steps:
- Aggregation: each worker sends one slice (M/N parameters) to the next worker; repeat N-1 times
- Broadcast: each worker sends one slice of aggregated parameters to the next worker; repeat N-1 times
Overall communication: \(2M(N-1)\) parameters across all workers (\(\approx 2MN\) for large N), i.e. \(2M(N-1)/N\) per worker
- Aggregation: \(M(N-1)\) parameters
- Broadcast: \(M(N-1)\) parameters
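The two phases can be simulated in a single process. This is a sketch, not a distributed implementation: it assumes the standard ring schedule with N-1 steps per phase, M divisible by N, and a sum reduction, and `ring_allreduce` is a hypothetical helper name. It also counts how many values are sent in total.

```python
def ring_allreduce(worker_grads):
    """Simulate ring AllReduce (sum) over N workers holding M values each.

    Returns (per-worker buffers after the reduction, total values sent).
    """
    n = len(worker_grads)
    m = len(worker_grads[0])
    assert m % n == 0, "sketch assumes M is divisible by N"
    size = m // n
    bufs = [list(g) for g in worker_grads]
    sent = 0

    def slice_indices(i):
        i %= n
        return range(i * size, (i + 1) * size)

    # Aggregation: at each step worker w sends slice (w - step) to worker w+1,
    # which adds it to its own copy. Messages are snapshotted first so that
    # all sends within a step happen "simultaneously".
    for step in range(n - 1):
        outgoing = [[bufs[w][j] for j in slice_indices(w - step)]
                    for w in range(n)]
        for w in range(n):
            dst = (w + 1) % n
            for j, v in zip(slice_indices(w - step), outgoing[w]):
                bufs[dst][j] += v
            sent += size

    # Worker w now holds the fully reduced slice (w + 1) % n.
    # Broadcast: pass the reduced slices around the ring, overwriting.
    for step in range(n - 1):
        outgoing = [[bufs[w][j] for j in slice_indices(w + 1 - step)]
                    for w in range(n)]
        for w in range(n):
            dst = (w + 1) % n
            for j, v in zip(slice_indices(w + 1 - step), outgoing[w]):
                bufs[dst][j] = v
            sent += size

    return bufs, sent


N, M = 4, 8
grads = [[w * 10 + j for j in range(M)] for w in range(N)]
reduced, sent = ring_allreduce(grads)
expected = [sum(g[j] for g in grads) for j in range(M)]
print(reduced[0] == expected, sent)
```

Each worker only ever talks to its ring neighbor, so the per-worker traffic stays close to 2M values regardless of N, in contrast to a central parameter server that receives all the gradients itself.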
3.1.1.4 Tree
???
3.1.1.5 Butterfly
???
3.2 Model Parallelism
Device placement optimization with reinforcement learning
???
3.3 Pipeline Parallelism
Scaling Giant Models with Conditional Computation and Automatic Sharding
Model + Pipeline Parallelism
???
4. Code Optimization
Goal: find performant programs for each operator
Existing Approach: Engineer Optimized Tensor Programs ???
Issues: new operators need new hand-written programs; hand-tuned programs can be suboptimal
Solution: Automated Code Generation
5. Memory Efficient Training
5.1 Tensor Rematerialization
???
5.2 Zero Redundancy
Idea: ???
Balancing Computation/Memory/Communication Cost in DNN Trainings