Dense-Sparse-Dense Training

for more accurate CNN, RNN and LSTM models

Slav Ivanov

In this post, I’m going to share my notes on a fascinating technique for training neural networks. It describes a method that increases accuracy on already state-of-the-art models.

(Image: "Not the training method I was going for")

Dense-Sparse-Dense Training

The research was published more than a year ago by researchers from Stanford, Nvidia, Baidu, and Facebook. It is rare to see people from so many institutions working together on a single paper. I must say the results we will discuss below speak for themselves.

It’s also worth noting that Song Han, one of the leading DSD authors, maintains a repository of pretrained DSD Caffe models.

Paper: “Dense-Sparse-Dense Training for Deep Neural Networks”

Video: Song Han explaining the method

Skinny Deep Neural Networks

At the same time, independent researchers from Singapore and China discovered and published the same findings as in the DSD paper. This goes to show how quickly the field of Deep Learning is moving nowadays. I’d recommend you read both papers if you are interested in the details.

Paper: “Training Skinny Deep Neural Networks with Iterative Hard Thresholding Methods”

In this article, I’m going to refer to both Dense-Sparse-Dense training and Skinny Deep Neural Networks as DSD.

The Algorithm

Applying DSD takes 3 sequential steps:

Dense

This is normal training of the neural net, business as usual. It’s notable that even though DSD acts as a regularizer, the usual regularization methods such as Dropout and weight regularization can be applied as well. The authors don’t mention Batch Normalization, but presumably it would work too.

Sparse

We regularize the network by removing connections with small weights. From each layer, a fixed percentage of the weights closest to 0 in absolute value is selected to be pruned, meaning they are set to 0 after each training iteration. It’s worth noting that the set of pruned weights is selected only once, not at every SGD iteration.

The authors recommend keeping the percentage of pruned weights between 25% and 50%. However, in one of the experiments (see below) they used 80%.

We train this sparse net until convergence; eventually, the network recovers the pruned weights’ knowledge and condenses it into the remaining connections.
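To make the Sparse step concrete, here is a minimal PyTorch-style sketch of magnitude pruning as described above (my own illustration, not the authors’ code): the masks are computed once from the weights’ absolute values, and the pruned weights are re-zeroed after every optimizer step. The function names, the `skip` argument, and the default 30% sparsity are my assumptions.

```python
# A minimal sketch of the Sparse step (illustrative, not the authors' code).
import torch

def make_prune_masks(model, sparsity=0.3, skip=()):
    """Compute, once, a 0/1 mask per weight tensor that zeroes the `sparsity`
    fraction of weights closest to 0 in absolute value."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2 or any(name.startswith(s) for s in skip):
            continue  # leave biases, norm layers, and skipped layers dense
        k = int(sparsity * param.numel())
        if k < 1:
            continue
        threshold = param.detach().abs().flatten().kthvalue(k).values
        masks[name] = (param.detach().abs() > threshold).float()
    return masks

def apply_prune_masks(model, masks):
    """Re-zero the pruned weights; call after every optimizer step so they stay at 0."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])
```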

Re-Dense

First, we re-enable the pruned weights from the previous step. The net is then trained normally until convergence. This step restores the full capacity of the model, which it can use to store new knowledge.

The authors note that the learning rate should be 1/10th of the original. Since the model is already performing well, the lower LR helps preserve the knowledge gained in the previous step.

Rinse and repeat

DSD can be applied multiple times for further gains. That is, after training has converged in the “Re-Dense” step, do another “Sparse” step.
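Putting the three steps together, the whole procedure looks roughly like the sketch below. It assumes a generic `train(model, lr, masks)` helper that trains to convergence and applies the masks after every optimizer step (using `make_prune_masks`/`apply_prune_masks` from the earlier sketch); the names and hyperparameters are illustrative, not taken from the paper’s code.

```python
# Rough DSD schedule; `train` is an assumed helper that trains to convergence
# and calls apply_prune_masks after every optimizer step when masks are given.
def dsd(model, base_lr=0.1, sparsity=0.3, rounds=1):
    train(model, lr=base_lr, masks=None)           # Dense: business as usual
    for _ in range(rounds):
        masks = make_prune_masks(model, sparsity)  # Sparse: pick weights once...
        train(model, lr=base_lr, masks=masks)      # ...and keep them at zero
        train(model, lr=base_lr / 10, masks=None)  # Re-Dense: free weights, 1/10th LR
    return model
```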

Intuition behind DSD

The current generation of Neural Networks (NNs) are powerful models with large expressive power. This enables them to solve many tasks that defy traditional programming approaches. However, this expressiveness also leads NNs to capture the noise in a dataset, i.e. to overfit.

“Our intuition is that … relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.”

Saddle points are regions of the model’s high-dimensional parameter space that are not good solutions but are hard to escape from. Lifting the pruning constraint in the Re-Dense step allows training to escape such saddle points and eventually reach a better minimum, which shows up as improved training and validation metrics.

The authors hypothesize that the better minimum is reached because the sparsity constraint moves the optimization problem to a lower-dimensional space, one that is more robust to noise in the training data.

It’s unclear how well DSD would work with other methods designed to escape saddle points, such as Stochastic Gradient Descent with Restarts. I plan on doing some experiments in the area.

Results

The authors tested DSD on image classification (CNN), caption generation (RNN) and speech recognition (LSTM). The proposed method improved accuracy across all three tasks. It’s quite remarkable that DSD works across domains.

Convolutional Neural Nets

DSD improved all CNN models tested: ResNet-50, VGG, and GoogLeNet. The absolute improvement in top-1 accuracy was 1.12%, 4.31%, and 1.12% respectively, corresponding to relative improvements of 4.66%, 13.7%, and 3.6%. These results are remarkable for such finely tuned models!

Notes: 30% of each layer’s weights were pruned in the Sparse step. The very first layer of the convolutional networks is not pruned, as it is very sensitive to regularization.
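As a hypothetical illustration of that setup (my own mapping to code, not the authors’), using the earlier sketch with a torchvision ResNet-50, whose first convolution happens to be named `conv1`:

```python
from torchvision.models import resnet50

model = resnet50()
# Prune 30% of each weight tensor, but leave the very first conv layer dense.
masks = make_prune_masks(model, sparsity=0.3, skip=("conv1.",))
```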

Long Short-Term Memory

DSD was applied to NeuralTalk, an amazing model that generates a description of an image. It was developed by Andrej Karpathy and Fei-Fei Li at Stanford.

To verify that the Dense-Sparse-Dense method works on LSTMs, the CNN part of NeuralTalk is frozen and only the LSTM layers are trained with DSD. Very aggressive pruning (80%, chosen using the validation set) was applied in the Sparse step. Still, this gives the NeuralTalk BLEU scores an average relative improvement of 6.7%. It is fascinating that such a modest numerical improvement produces noticeably better results (see the image below).

(Image: results from NeuralTalk)

Recurrent Neural Networks

Applying DSD to speech recognition (Deep Speech 1) achieves an average relative improvement in Word Error Rate of 3.95%.

On the similar but more advanced Deep Speech 2 model, Dense-Sparse-Dense is applied iteratively two times: 50% of the weights are pruned in the first iteration, then 25% in the second. After these two DSD iterations, the average relative improvement is 6.5%.

Related works

In the paper, it’s pointed out that DSD is similar to several other approaches:

  • Dropout: Disabling random neurons on each Gradient Descent step, effectively creating a different variation of the model. Dropout is most often used in Fully Connected layers, although there is some evidence of it working with convolutions. Paper: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”. DSD works alongside Dropout.
  • Model Compression: Describes a technique for reducing the size of a trained network, so that NNs can fit and run fast enough on a wider range of devices. The method applies three stages, the first of which is called “pruning”. Pruning is essentially the “Sparse” step of DSD. A 9x to 13x reduction in model size is reported while keeping the accuracy the same. The lead author of DSD and of the Model Compression paper is the same person, and it’s nice to see these ideas build on top of each other to produce state-of-the-art results. Paper: “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”

Summary

Dense-Sparse-Dense and Skinny Deep NNs describe an easy-to-implement training tweak. It regularizes and consistently improves the accuracy of a variety of Neural Network architectures.

If you liked this article, please help others find it by holding that clap icon for a while. Thank you!
