
Introduction to Deep Learning Systems

CMU 15-849

1. Recap: Automatic Differentiation

Automatically construct the backward computation graph from the forward graph by applying the chain rule in reverse topological order
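
To make the recap concrete, here is a minimal sketch (not the course's reference implementation) of scalar reverse-mode AD: each forward op records its inputs plus a local backward rule, which together form the backward graph, and `backward()` replays that graph in reverse topological order.

```python
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # inputs in the forward graph
        self._backward = lambda: None    # how to push grad to the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad        # d(out)/d(self) = 1
            other.grad += out.grad       # d(out)/d(other) = 1
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad   # d(out)/d(self) = other
            other.grad += self.data * out.grad   # d(out)/d(other) = self
        out._backward = backward
        return out

    def backward(self):
        # Topologically sort the recorded graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

x, y = Value(2.0), Value(3.0)
z = x * y + x
z.backward()
print(x.grad, y.grad)   # dz/dx = y + 1 = 4.0, dz/dy = x = 2.0
```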

2. Graph-Level Optimizations

input computation graph -> potential transformations -> optimized graph

Rule-based graph optimization:

  • e.g., fusing conv & BN, conv & ReLU, or merging multiple convs (see the sketch below)
  • Robustness: experts’ heuristics do not apply to all models/hardware
  • Scalability: every new operator requires hand-writing new rules
  • Performance: misses subtle optimizations for specific models/hardware
  • Infeasible to manually design graph optimizations for all cases

-> Automated Generation and Verification of Graph Optimizations
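
As a concrete example of one such hand-written rule, here is a sketch (using PyTorch modules; the helper name `fuse_conv_bn` is mine, not from the lecture) of folding an inference-time BatchNorm into the preceding convolution so the optimized graph runs a single conv op.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a Conv2d whose output equals bn(conv(x)) in eval mode."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups,
                      bias=True)
    with torch.no_grad():
        # Eval-mode BN is affine: y = gamma * (x - mean) / sqrt(var + eps) + beta
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8).eval()
x = torch.randn(1, 3, 16, 16)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-6))  # True
```

The rule relies on BN being a per-channel affine transform at inference time; maintaining this kind of algebra by hand for every operator pair and backend is exactly the scalability problem noted above.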

3. Parallelize ML training

Comparison: data parallelism replicates the full model on every device and partitions the training data; model parallelism partitions the model itself (its operators/parameters) across devices; pipeline parallelism partitions the model into consecutive stages and streams micro-batches through them.

3.1 Data Parallelism

  • Each GPU computes gradients on its own shard of the mini-batch
  • Aggregate the gradients across GPUs so every replica applies the same update

Problem: aggregating through a centralized parameter server makes that server a communication bottleneck

Solution: AllReduce performs an element-wise reduction across multiple devices to aggregate the gradients without a central server
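
A minimal sketch of one data-parallel step, assuming torch.distributed has already been initialized (e.g. via torchrun) and each process works on its own shard of the batch; the function name `data_parallel_step` is illustrative.

```python
import torch
import torch.distributed as dist

def data_parallel_step(model, loss_fn, inputs, targets, optimizer):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)     # forward on this worker's shard
    loss.backward()                            # local gradients
    for p in model.parameters():
        if p.grad is not None:
            # Element-wise sum of this gradient tensor across all workers,
            # then average, so every replica applies the same update.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= dist.get_world_size()
    optimizer.step()
    return loss.item()
```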

3.1.1 AllReduce

See: Visual intuition on ring-Allreduce for distributed Deep Learning

Comparison: the variants below differ in how many communication steps they take and how evenly traffic is spread across workers; the naïve and driver-based schemes concentrate traffic on a single link or node, while ring, tree, and butterfly spread it out.

3.1.1.1 Naïve

(figure: naïve all-reduce)

3.1.1.2 Driver

(figure: driver-based all-reduce)

3.1.1.3 Ring

(figure: ring all-reduce)

Construct a ring of N workers, divide M parameters into N slices

Two phases:

  • Aggregation (reduce-scatter): each worker sends one slice (M/N parameters) to the next worker, which adds it to its own copy of that slice; repeat \(N-1\) times, after which every worker holds one fully aggregated slice

    (figures: ring all-reduce, aggregation phase)

  • Broadcast (all-gather): each worker sends one fully aggregated slice to the next worker; repeat \(N-1\) times, after which every worker holds all aggregated slices

Overall communication across all workers: \(2M(N-1) \approx 2MN\) parameters

  • Aggregation: \(M(N-1) \approx MN\) parameters
  • Broadcast: \(M(N-1) \approx MN\) parameters
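
The two phases can be checked with a small NumPy simulation (no real communication; the function name and sizes are illustrative): N simulated workers pass slices around the ring, and after both phases every worker holds the element-wise sum of all gradients.

```python
import numpy as np

def ring_allreduce(grads):                        # grads: one array per worker
    N = len(grads)
    slices = [np.array_split(g.copy(), N) for g in grads]

    # Aggregation: in step s, worker w sends slice (w - s) mod N to worker
    # w + 1, which adds it to its copy. After N - 1 steps, worker w holds the
    # fully aggregated slice (w + 1) mod N.
    for s in range(N - 1):
        for w in range(N):
            idx = (w - s) % N
            slices[(w + 1) % N][idx] += slices[w][idx]

    # Broadcast: in step s, worker w forwards the fully aggregated slice
    # (w + 1 - s) mod N to worker w + 1; after N - 1 steps every worker has
    # every aggregated slice.
    for s in range(N - 1):
        for w in range(N):
            idx = (w + 1 - s) % N
            slices[(w + 1) % N][idx] = slices[w][idx].copy()

    return [np.concatenate(s) for s in slices]

grads = [np.random.rand(8) for _ in range(4)]          # 4 workers, 8 parameters
out = ring_allreduce(grads)
print(all(np.allclose(o, sum(grads)) for o in out))    # every worker has the sum
```
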
3.1.1.4 Tree

Arrange the workers in a tree: gradients are reduced up the tree to the root and the result is broadcast back down, so an all-reduce takes \(O(\log N)\) communication rounds instead of the \(O(N)\) rounds of the ring.

3.1.1.5 Butterfly

In each of \(\log_2 N\) rounds, every worker exchanges its partial result with the partner whose rank differs in one bit and reduces it into its own copy; after the last round every worker holds the fully reduced gradients.

3.2 Model Parallelism

Device placement optimization with reinforcement learning

Idea: place different operators/layers of a single model on different devices when the model (or its activations) does not fit on one device; the placement can be hand-crafted or learned automatically, e.g. the paper above trains an RL policy that proposes placements and uses the measured step time as its reward.
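
A sketch of the simplest manual placement in PyTorch, assuming two visible CUDA devices: the two halves of the model live on different GPUs and only the activation is moved between them.

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """First half of the network on cuda:0, second half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        h = self.part1(x.to("cuda:0"))
        return self.part2(h.to("cuda:1"))   # move the activation, not the weights

model = TwoGPUModel()
out = model(torch.randn(32, 1024))          # runs when both GPUs are available
```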

3.3 Pipeline Parallelism

Scaling Giant Models with Conditional Computation and Automatic Sharding

Model + Pipeline Parallelism

Idea: split the model into consecutive stages on different devices and split each mini-batch into micro-batches, so different stages work on different micro-batches at the same time; this shrinks the idle “bubble” that plain inter-device model parallelism would leave.
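
A small sketch of the GPipe-style forward schedule (stage and micro-batch counts are illustrative): with S stages and B micro-batches, stage s works on micro-batch b at clock tick s + b, and the bubble fraction \((S-1)/(S+B-1)\) shrinks as more micro-batches are used.

```python
S, B = 4, 8                        # stages, micro-batches (illustrative numbers)
ticks = S + B - 1
for t in range(ticks):
    # Stage s is busy at tick t iff it has a valid micro-batch index t - s.
    busy = [f"stage{s}:mb{t - s}" for s in range(S) if 0 <= t - s < B]
    print(f"tick {t:2d}: " + "  ".join(busy))

bubble = (S - 1) / (S + B - 1)     # fraction of idle stage-ticks in the forward pass
print(f"bubble fraction = {bubble:.2f}")   # shrinks as B grows relative to S
```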

4. Code Optimization

Goal: find performant programs for each operator

Existing approach: engineers hand-write optimized tensor programs for each operator (e.g., the kernels shipped in vendor libraries such as cuDNN)

Issues: every new operator needs new hand-written kernels, and hand-tuned code can still be suboptimal for a particular model/hardware combination

Solution: Automated Code Generation
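
A toy sketch of why this helps: the same operator (a matmul here) admits many tilings with different running times, and an automated code generator searches over candidate schedules and keeps the fastest correct one. The tile sizes and brute-force search below are illustrative, not a real compiler.

```python
import time
import numpy as np

def tiled_matmul(A, B, tile):
    """Blocked matmul; every tile size computes the same result."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

n = 512
A, B = np.random.rand(n, n), np.random.rand(n, n)
best = None
for tile in (32, 64, 128, 256):              # candidate schedules to search
    start = time.perf_counter()
    C = tiled_matmul(A, B, tile)
    elapsed = time.perf_counter() - start
    assert np.allclose(C, A @ B)             # every schedule is still correct
    print(f"tile={tile:3d}: {elapsed * 1000:.1f} ms")
    if best is None or elapsed < best[1]:
        best = (tile, elapsed)
print("best tile size:", best[0])
```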

5. Memory Efficient Training

5.1 Tensor Rematerialization

Idea: instead of keeping every intermediate activation alive for the backward pass, free some of them after the forward pass and recompute (rematerialize) them when backward needs them, trading extra compute for a smaller memory footprint (also known as gradient/activation checkpointing).
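
A sketch using PyTorch's built-in activation checkpointing (torch.utils.checkpoint, in recent PyTorch releases) as one concrete rematerialization mechanism: the block's intermediate activations are not saved during the forward pass and are recomputed during backward.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
x = torch.randn(64, 1024, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # forward without saving internals
y.sum().backward()                             # block re-runs here to get grads
print(x.grad.shape)                            # torch.Size([64, 1024])
```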

5.2 Zero Redundancy

Idea: in plain data parallelism every worker keeps a full replica of the parameters, gradients, and optimizer states; ZeRO partitions these states across the data-parallel workers and communicates them on demand, eliminating the redundancy at the cost of extra communication.

Balancing Computation/Memory/Communication Cost in DNN Training
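
A sketch of stage-1-style sharding with PyTorch's ZeroRedundancyOptimizer, assuming an initialized process group (e.g. via torchrun) and one GPU per rank: gradients are still averaged as in ordinary data parallelism, but each rank stores optimizer state only for its own shard of the parameters.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer

model = DDP(nn.Linear(1024, 1024).cuda())      # ordinary data parallelism
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,          # Adam state (m, v) is sharded:
    lr=1e-3,                                   # each rank keeps only its slice
)

x = torch.randn(32, 1024).cuda()
model(x).sum().backward()                      # gradients all-reduced by DDP
optimizer.step()                               # each rank updates its own shard,
                                               # then the shards are synchronized
```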


Last update: February 13, 2022
Authors: Co1lin