
Basis


1. General Guidelines


1.1 Issues of model complexity and optimization

How can we tell whether a large training loss is due to an optimization issue rather than insufficient model capacity? One practical check is to compare against shallower (or otherwise simpler) models: if a deeper network cannot even match the training loss of a shallower one, optimization is the culprit, since the deeper network could in principle represent the shallower one.


1.2 Tackle overfitting

  • Fewer parameters, parameter sharing

  • Fewer features

  • Early stopping

  • Regularization (a NumPy sketch of L2 regularization and dropout follows this list)

  • Dropout

  • Validation set
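
A minimal NumPy sketch of the regularization and dropout items above (the single ReLU layer, the keep probability `p`, and the strength `lam` are illustrative values, not values from these notes):

```python
import numpy as np

p = 0.5       # dropout keep probability (illustrative)
lam = 1e-4    # L2 regularization strength (illustrative)

def forward_train(X, W):
    h = np.maximum(0, X.dot(W))                # a hidden layer with ReLU
    mask = (np.random.rand(*h.shape) < p) / p  # inverted dropout: drop units and rescale at train time
    return h * mask                            # at test time, simply skip the mask

def loss_with_l2(data_loss, W):
    return data_loss + lam * np.sum(W * W)     # add the L2 penalty on the weights to the data loss
```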


1.3 Validation set

The correct way to set hyperparameters is to split your training data into two parts: a training set and a fake test set, which we call the validation set.

1.3.1 Cross-validation

If the amount of training data (and hence the size of a held-out validation set) is a concern, use cross-validation.


e.g. 5-fold cross-validation: each of the 5 folds is used as the validation set in turn, and the 5 validation results are averaged.

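A minimal sketch of k-fold cross-validation (`train_fn` and `eval_fn` are hypothetical training and evaluation routines):

```python
import numpy as np

def cross_validate(X, y, train_fn, eval_fn, k=5):
    idx = np.random.permutation(len(X))      # shuffle the training examples
    folds = np.array_split(idx, k)           # split the indices into k folds
    scores = []
    for i in range(k):
        val_idx = folds[i]                                                   # one fold for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])   # the rest for training
        model = train_fn(X[train_idx], y[train_idx])                         # hypothetical training routine
        scores.append(eval_fn(model, X[val_idx], y[val_idx]))                # hypothetical evaluation routine
    return np.mean(scores)                   # average validation score over the k folds
```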

2. Data Preprocessing

Note:

It is very important to zero-center the data, and it is common to see normalization of every pixel as well.

Common pitfall. An important point to make about the preprocessing is that any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation / test data. E.g. computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake. Instead, the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test).

2.1 Mean subtraction

X -= np.mean(X, axis = 0)

2.2 Normalization

After zero-centering, we can normalize each dimension: X /= np.std(X, axis = 0)

2.3 PCA and Whitening
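
A short NumPy sketch of PCA and whitening on a data matrix `X` of shape (N, D) (assumed already loaded; the `1e-5` term only prevents division by zero):

```python
import numpy as np

X = X - np.mean(X, axis=0)           # zero-center the data (N x D)
cov = np.dot(X.T, X) / X.shape[0]    # data covariance matrix
U, S, V = np.linalg.svd(cov)         # eigenvectors (U) and eigenvalues (S) of the covariance
Xrot = np.dot(X, U)                  # decorrelate: project the data onto the eigenbasis (PCA)
Xwhite = Xrot / np.sqrt(S + 1e-5)    # whiten: divide every dimension by its standard deviation
```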

2.4 Data Augmentation

Cropping, rotating, ...
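
A minimal sketch of two common augmentations, random horizontal flip and random crop (the crop size is an illustrative value; the image is assumed to be at least that large):

```python
import numpy as np

def augment(img, crop=28):
    if np.random.rand() < 0.5:
        img = img[:, ::-1]                       # random horizontal flip (flip the width axis)
    h, w = img.shape[:2]
    top = np.random.randint(0, h - crop + 1)     # random crop position
    left = np.random.randint(0, w - crop + 1)
    return img[top:top + crop, left:left + crop]
```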

3. Weight Initialization

Pitfall: all zero initialization.

This turns out to be a mistake, because if every neuron in the network computes the same output, then they will also all compute the same gradients during backpropagation and undergo the exact same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same.

3.1 Small random numbers

W = 0.01 * np.random.randn(D, H), where randn samples from a zero-mean, unit-standard-deviation Gaussian.

Warning: small is not always good!

For example, a Neural Network layer that has very small weights will during backpropagation compute very small gradients on its data (since this gradient is proportional to the value of the weights). This could greatly diminish the “gradient signal” flowing backward through a network, and could become a concern for deep networks.

3.2 Xavier Initialization - Calibrating the variances

Common: w = np.random.randn(n) / sqrt(n), where n is the number of inputs to the neuron.

w = np.random.randn(n) * sqrt(2.0/n) (He initialization) is the current recommendation for use in practice in the specific case of neural networks with ReLU neurons.


3.2.1 For ReLU

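A small sanity check of the two initializations above (layer sizes are illustrative): with He initialization, the scale of the ReLU activations stays roughly equal to the input scale instead of shrinking layer after layer.

```python
import numpy as np

n_in, n_out = 512, 512
x = np.random.randn(1000, n_in)                       # fake zero-mean, unit-variance inputs

# Xavier: W = np.random.randn(n_in, n_out) / np.sqrt(n_in)    (Var(w) = 1/n_in)
# He (used below), recommended with ReLU:                      Var(w) = 2/n_in
W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

h = np.maximum(0, x.dot(W))                           # one ReLU layer
print(x.std(), np.sqrt(np.mean(h ** 2)))              # both are roughly 1: the activation scale is preserved
```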

4. Loss function

4.1 Multiclass SVM Loss (Hinge Loss)

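For reference, the multiclass SVM (hinge) loss for an example with class scores \(s\) and correct class \(y_i\) is

\[
L_i = \sum_{j \neq y_i} \max(0,\; s_j - s_{y_i} + 1).
\]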

"1" can be replaced by other values.

The essence of the SVM loss is that the score of the correct class needs to exceed every other class's score by at least the margin (here, 1).

4.2 Softmax and Cross-entropy

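For reference, the softmax cross-entropy loss for an example with class scores \(s\) and correct class \(y_i\) is

\[
L_i = -\log \frac{e^{s_{y_i}}}{\sum_j e^{s_j}}.
\]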

4.3 Regularization
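
A common choice is L2 regularization: a penalty on the weights is added to the data loss,

\[
L = \frac{1}{N} \sum_i L_i + \lambda \sum_k \sum_l W_{k,l}^2 ,
\]

where \(\lambda\) controls the regularization strength.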

5. Optimization

5.1 SGD

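For reference, the SGD update with learning rate \(\eta\) is

\[
\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta),
\]

where the gradient is estimated on a (mini-)batch rather than on the full training set.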

5.2 Local minima and Saddle point


A naive way to escape a saddle point; seldom used in practice!


5.3 Minibatch

Epoch: one pass in which all the batches (i.e. the whole training set) have been seen once.

Shuffle the data before every epoch (see the sketch below).

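A minimal sketch of the minibatch loop (`X`, `y`, `num_epochs`, and the batch size are assumed/illustrative):

```python
import numpy as np

batch_size = 64
N = X.shape[0]
for epoch in range(num_epochs):                 # one epoch = one pass over all the batches
    idx = np.random.permutation(N)              # reshuffle the data at every epoch
    for start in range(0, N, batch_size):
        batch = idx[start:start + batch_size]
        X_batch, y_batch = X[batch], y[batch]
        # ... compute the loss and gradients on (X_batch, y_batch), then update the parameters
```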

5.4 Momentum

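A common formulation of the momentum update, as a sketch (`mu` is the momentum coefficient, e.g. 0.9; the velocity `v` starts at zero and `dw` is the current gradient):

```python
v = mu * v - learning_rate * dw   # integrate the "velocity"
w += v                            # integrate the position
```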

5.5 Learning rate

The learning rate cannot be one-size-fits-all!

5.5.1 Adam Optimizer: RMSProp + Momentum

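A simplified sketch of the Adam update for one parameter array `w` with gradient `dw`, with NumPy as np (the full algorithm also bias-corrects `m` and `v`; typical values are beta1 = 0.9, beta2 = 0.999, eps = 1e-8, and `m`, `v` start at zero):

```python
m = beta1 * m + (1 - beta1) * dw          # momentum-like running mean of the gradients
v = beta2 * v + (1 - beta2) * (dw ** 2)   # RMSProp-like running mean of the squared gradients
w += -learning_rate * m / (np.sqrt(v) + eps)
```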

5.5.2 Learning rate scheduling

5.5.2.1 Learning rate decay

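Common decay schedules include step decay (drop the learning rate by a constant factor every few epochs), exponential decay \(\alpha = \alpha_0 e^{-kt}\), and \(1/t\) decay \(\alpha = \alpha_0 / (1 + kt)\), where \(t\) is the epoch (or iteration) index.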

5.5.2.2 Warm up

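With warm-up, the learning rate starts small and is increased, often linearly, to its peak value over the first iterations before the usual decay takes over, e.g. \(\mathrm{lr}_t = \mathrm{lr}_{\max} \cdot \min(1,\; t / t_{\text{warmup}})\) for step \(t\).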

6. Activation

6.1 ReLU

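For reference, ReLU is \(f(x) = \max(0, x)\).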

So we want input data with mean 0!

6.2 Sigmoid

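For reference, the sigmoid is \(\sigma(x) = \frac{1}{1 + e^{-x}}\); it squashes its input into \((0, 1)\), saturates for large \(|x|\), and its outputs are not zero-centered.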

6.3 Leaky ReLU

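For reference, Leaky ReLU is \(f(x) = \max(\alpha x, x)\) with a small fixed slope such as \(\alpha = 0.01\).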

6.4 PReLU

In PReLU, the negative-part slope \(\alpha\) is not hard-coded; it is learned during training!

6.5 ELU

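For reference, ELU is \(f(x) = x\) for \(x > 0\) and \(f(x) = \alpha (e^{x} - 1)\) for \(x \le 0\).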

6.6 SELU

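For reference, SELU is the ELU scaled by a constant: \(f(x) = \lambda x\) for \(x > 0\) and \(f(x) = \lambda \alpha (e^{x} - 1)\) for \(x \le 0\), with fixed constants \(\lambda \approx 1.0507\) and \(\alpha \approx 1.6733\) chosen so that activations stay approximately zero-mean and unit-variance (self-normalizing).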

6.7 Maxout

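For reference, Maxout computes \(f(\mathbf{x}) = \max(\mathbf{w}_1^\top \mathbf{x} + b_1,\; \mathbf{w}_2^\top \mathbf{x} + b_2)\), which generalizes both ReLU and Leaky ReLU at the cost of doubling the parameters per neuron.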

6.8 Swish

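For reference, Swish is \(f(x) = x \cdot \sigma(\beta x)\); with \(\beta = 1\) this is simply \(x \cdot \mathrm{sigmoid}(x)\).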

7. Batch Normalization

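For reference, for a (mini-)batch with mean \(\mu_B\) and variance \(\sigma_B^2\), batch normalization computes

\[
\hat{\mathbf{x}} = \frac{\mathbf{x} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} .
\]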

7.1 Understanding the batch

With batch normalization, the input data \(\mathbf{x}\)s are no longer processed independently: the \(\mathbf{x}\)s in a (mini-)batch are normalized with statistics computed from the whole batch, so we need to treat the whole (mini-)batch as one large network!

7.2 Recovery

Sometimes we use a learnable linear (scale-and-shift) step to recover representational capacity:
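
\[
\mathbf{y} = \gamma \odot \hat{\mathbf{x}} + \beta ,
\]

where \(\hat{\mathbf{x}}\) is the normalized input and \(\gamma\), \(\beta\) are learned along with the other parameters.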

(Initialize \(\gamma\) with \(\mathbf{1}\) and \(\beta\) with \(\mathbf{0}\); after training for some time, the loss reaches a good region of the error surface, and the network can then gradually move away from the strict normalization constraint.)


7.3 Pros and Cons


7.4 Test-Time

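A minimal sketch of how test time is usually handled (`momentum`, `eps`, `gamma`, `beta`, and the running statistics are assumed/learned values, with NumPy as np): during training, keep exponential moving averages of the batch statistics; at test time, normalize with those running statistics instead of the current batch's.

```python
# Training: update the running estimates after computing batch_mean / batch_var on the current batch
running_mean = momentum * running_mean + (1 - momentum) * batch_mean
running_var  = momentum * running_var  + (1 - momentum) * batch_var

# Test time: use the fixed running statistics, not the batch statistics
x_hat = (x - running_mean) / np.sqrt(running_var + eps)
y = gamma * x_hat + beta
```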

8. Transfer Learning

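A minimal PyTorch sketch of the common recipe, freezing the pretrained layers and training a new head on the target task (`pretrained_backbone`, `feature_dim`, and `num_classes` are hypothetical; the backbone is assumed to output flat feature vectors of size `feature_dim`, and fine-tuning could later unfreeze part of it):

```python
import torch
import torch.nn as nn

for param in pretrained_backbone.parameters():
    param.requires_grad = False                    # freeze the pretrained feature extractor

head = nn.Linear(feature_dim, num_classes)         # new task-specific classifier
model = nn.Sequential(pretrained_backbone, head)

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)   # optimize only the new head
```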


Last update: June 16, 2023
Authors: Colin