
Introduction to Deep Learning Systems

CMU 15-849

1. Recap: Automatic Differentiation

Automatically construct the backward computation graph from the forward graph by applying the chain rule in reverse topological order
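
To make the recap concrete, here is a minimal sketch (not the course's reference implementation) of scalar reverse-mode AD: each forward op records its inputs plus a local backward rule, which together form the backward graph, and `backward()` replays that graph in reverse topological order.

```python
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # inputs in the forward graph
        self._backward = lambda: None    # how to push grad to the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad        # d(out)/d(self) = 1
            other.grad += out.grad       # d(out)/d(other) = 1
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad   # d(out)/d(self) = other
            other.grad += self.data * out.grad   # d(out)/d(other) = self
        out._backward = backward
        return out

    def backward(self):
        # Topologically sort the recorded graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

x, y = Value(2.0), Value(3.0)
z = x * y + x
z.backward()
print(x.grad, y.grad)   # dz/dx = y + 1 = 4.0, dz/dy = x = 2.0
```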

2. Graph-Level Optimizations

input computation graph -> potential transformations -> optimized graph

Rule-based graph optimization:

  • e.g., fusing conv & BN, conv & ReLU, or merging multiple convs (see the sketch below)
  • Robustness: experts’ heuristics do not apply to all models/hardware
  • Scalability: every new operator requires hand-writing new rules
  • Performance: misses subtle optimizations for specific models/hardware
  • Infeasible to manually design graph optimizations for all cases

-> Automated Generation and Verification of Graph Optimizations
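
As a concrete example of one such hand-written rule, here is a sketch (using PyTorch modules; the helper name `fuse_conv_bn` is mine, not from the lecture) of folding an inference-time BatchNorm into the preceding convolution so the optimized graph runs a single conv op.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a Conv2d whose output equals bn(conv(x)) in eval mode."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups,
                      bias=True)
    with torch.no_grad():
        # Eval-mode BN is affine: y = gamma * (x - mean) / sqrt(var + eps) + beta
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8).eval()
x = torch.randn(1, 3, 16, 16)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-6))  # True
```

The rule relies on BN being a per-channel affine transform at inference time; maintaining this kind of algebra by hand for every operator pair and backend is exactly the scalability problem noted above.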

3. Parallelize ML training

Comparison: data parallelism replicates the full model on every device and partitions the training data; model parallelism partitions the model itself (its operators/parameters) across devices; pipeline parallelism partitions the model into consecutive stages and streams micro-batches through them.

3.1 Data Parallelism

  • Each GPU computes gradients on its own shard of the mini-batch
  • Aggregate the gradients across GPUs so every replica applies the same update

Problem: aggregating through a centralized parameter server makes that server a communication bottleneck

Solution: AllReduce performs an element-wise reduction across multiple devices to aggregate the gradients without a central server
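
A minimal sketch of one data-parallel step, assuming torch.distributed has already been initialized (e.g. via torchrun) and each process works on its own shard of the batch; the function name `data_parallel_step` is illustrative.

```python
import torch
import torch.distributed as dist

def data_parallel_step(model, loss_fn, inputs, targets, optimizer):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)     # forward on this worker's shard
    loss.backward()                            # local gradients
    for p in model.parameters():
        if p.grad is not None:
            # Element-wise sum of this gradient tensor across all workers,
            # then average, so every replica applies the same update.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= dist.get_world_size()
    optimizer.step()
    return loss.item()
```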

3.1.1 AllReduce

See: Visual intuition on ring-Allreduce for distributed Deep Learning

Comparison: the variants below differ in how many communication steps they take and how evenly traffic is spread across workers; the naïve and driver-based schemes concentrate traffic on a single link or node, while ring, tree, and butterfly spread it out.

3.1.1.1 Naïve

(figure: naïve all-reduce)

3.1.1.2 Driver

(figure: driver-based all-reduce)

3.1.1.3 Ring

(figure: ring all-reduce)

Construct a ring of N workers, divide M parameters into N slices

Two phases:

  • Aggregation (reduce-scatter): each worker sends one slice (M/N parameters) to the next worker, which adds it to its own copy of that slice; repeat \(N-1\) times, after which every worker holds one fully aggregated slice

    (figures: ring all-reduce, aggregation phase)

  • Broadcast (all-gather): each worker sends one fully aggregated slice to the next worker; repeat \(N-1\) times, after which every worker holds all aggregated slices

Overall communication across all workers: \(2M(N-1) \approx 2MN\) parameters

  • Aggregation: \(M(N-1) \approx MN\) parameters
  • Broadcast: \(M(N-1) \approx MN\) parameters
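
The two phases can be checked with a small NumPy simulation (no real communication; the function name and sizes are illustrative): N simulated workers pass slices around the ring, and after both phases every worker holds the element-wise sum of all gradients.

```python
import numpy as np

def ring_allreduce(grads):                        # grads: one array per worker
    N = len(grads)
    slices = [np.array_split(g.copy(), N) for g in grads]

    # Aggregation: in step s, worker w sends slice (w - s) mod N to worker
    # w + 1, which adds it to its copy. After N - 1 steps, worker w holds the
    # fully aggregated slice (w + 1) mod N.
    for s in range(N - 1):
        for w in range(N):
            idx = (w - s) % N
            slices[(w + 1) % N][idx] += slices[w][idx]

    # Broadcast: in step s, worker w forwards the fully aggregated slice
    # (w + 1 - s) mod N to worker w + 1; after N - 1 steps every worker has
    # every aggregated slice.
    for s in range(N - 1):
        for w in range(N):
            idx = (w + 1 - s) % N
            slices[(w + 1) % N][idx] = slices[w][idx].copy()

    return [np.concatenate(s) for s in slices]

grads = [np.random.rand(8) for _ in range(4)]          # 4 workers, 8 parameters
out = ring_allreduce(grads)
print(all(np.allclose(o, sum(grads)) for o in out))    # every worker has the sum
```
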
3.1.1.4 Tree

Arrange the workers in a tree: gradients are reduced up the tree to the root and the result is broadcast back down, so an all-reduce takes \(O(\log N)\) communication rounds instead of the \(O(N)\) rounds of the ring.

3.1.1.5 Butterfly

In each of \(\log_2 N\) rounds, every worker exchanges its partial result with the partner whose rank differs in one bit and reduces it into its own copy; after the last round every worker holds the fully reduced gradients.

3.2 Model Parallelism

Device placement optimization with reinforcement learning

Idea: place different operators/layers of a single model on different devices when the model (or its activations) does not fit on one device; the placement can be hand-crafted or learned automatically, e.g. the paper above trains an RL policy that proposes placements and uses the measured step time as its reward.
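
A sketch of the simplest manual placement in PyTorch, assuming two visible CUDA devices: the two halves of the model live on different GPUs and only the activation is moved between them.

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """First half of the network on cuda:0, second half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        h = self.part1(x.to("cuda:0"))
        return self.part2(h.to("cuda:1"))   # move the activation, not the weights

model = TwoGPUModel()
out = model(torch.randn(32, 1024))          # runs when both GPUs are available
```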

3.3 Pipeline Parallelism

Scaling Giant Models with Conditional Computation and Automatic Sharding

Model + Pipeline Parallelism

Idea: split the model into consecutive stages on different devices and split each mini-batch into micro-batches, so different stages work on different micro-batches at the same time; this shrinks the idle “bubble” that plain inter-device model parallelism would leave.
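
A small sketch of the GPipe-style forward schedule (stage and micro-batch counts are illustrative): with S stages and B micro-batches, stage s works on micro-batch b at clock tick s + b, and the bubble fraction \((S-1)/(S+B-1)\) shrinks as more micro-batches are used.

```python
S, B = 4, 8                        # stages, micro-batches (illustrative numbers)
ticks = S + B - 1
for t in range(ticks):
    # Stage s is busy at tick t iff it has a valid micro-batch index t - s.
    busy = [f"stage{s}:mb{t - s}" for s in range(S) if 0 <= t - s < B]
    print(f"tick {t:2d}: " + "  ".join(busy))

bubble = (S - 1) / (S + B - 1)     # fraction of idle stage-ticks in the forward pass
print(f"bubble fraction = {bubble:.2f}")   # shrinks as B grows relative to S
```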

4. Code Optimization

Goal: find performant programs for each operator

Existing approach: engineers hand-write optimized tensor programs for each operator (e.g., the kernels shipped in vendor libraries such as cuDNN)

Issues: every new operator needs new hand-written kernels, and hand-tuned code can still be suboptimal for a particular model/hardware combination

Solution: Automated Code Generation
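
A toy sketch of why this helps: the same operator (a matmul here) admits many tilings with different running times, and an automated code generator searches over candidate schedules and keeps the fastest correct one. The tile sizes and brute-force search below are illustrative, not a real compiler.

```python
import time
import numpy as np

def tiled_matmul(A, B, tile):
    """Blocked matmul; every tile size computes the same result."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

n = 512
A, B = np.random.rand(n, n), np.random.rand(n, n)
best = None
for tile in (32, 64, 128, 256):              # candidate schedules to search
    start = time.perf_counter()
    C = tiled_matmul(A, B, tile)
    elapsed = time.perf_counter() - start
    assert np.allclose(C, A @ B)             # every schedule is still correct
    print(f"tile={tile:3d}: {elapsed * 1000:.1f} ms")
    if best is None or elapsed < best[1]:
        best = (tile, elapsed)
print("best tile size:", best[0])
```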

5. Memory Efficient Training

5.1 Tensor Rematerialization

Idea: instead of keeping every intermediate activation alive for the backward pass, free some of them after the forward pass and recompute (rematerialize) them when backward needs them, trading extra compute for a smaller memory footprint (also known as gradient/activation checkpointing).
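
A sketch using PyTorch's built-in activation checkpointing (torch.utils.checkpoint, in recent PyTorch releases) as one concrete rematerialization mechanism: the block's intermediate activations are not saved during the forward pass and are recomputed during backward.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
x = torch.randn(64, 1024, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # forward without saving internals
y.sum().backward()                             # block re-runs here to get grads
print(x.grad.shape)                            # torch.Size([64, 1024])
```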

5.2 Zero Redundancy

Idea: in plain data parallelism every worker keeps a full replica of the parameters, gradients, and optimizer states; ZeRO partitions these states across the data-parallel workers and communicates them on demand, eliminating the redundancy at the cost of extra communication.

Balancing Computation/Memory/Communication Cost in DNN Training
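
A sketch of stage-1-style sharding with PyTorch's ZeroRedundancyOptimizer, assuming an initialized process group (e.g. via torchrun) and one GPU per rank: gradients are still averaged as in ordinary data parallelism, but each rank stores optimizer state only for its own shard of the parameters.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer

model = DDP(nn.Linear(1024, 1024).cuda())      # ordinary data parallelism
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,          # Adam state (m, v) is sharded:
    lr=1e-3,                                   # each rank keeps only its slice
)

x = torch.randn(32, 1024).cuda()
model(x).sum().backward()                      # gradients all-reduced by DDP
optimizer.step()                               # each rank updates its own shard,
                                               # then the shards are synchronized
```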


Last update: February 13, 2022
Authors: Co1lin