Stochastic Gradient Descent


Introduction

Stochastic Gradient Descent (SGD) is a popular optimization algorithm used in many machine learning applications. It is a variant of gradient descent, a general-purpose optimization algorithm whose objective is to find the minimum of a function. Unlike standard (batch) gradient descent, which computes the gradient from the entire data set at every step, SGD approximates the overall gradient using a single randomly chosen data point (or, in the common mini-batch variant, a small random subset of points). Each update is therefore far cheaper to compute, which makes SGD attractive for large-scale machine learning problems.

Mathematical Formulation

The general idea of SGD is to minimize an objective function that has the form of a sum:

<math>Q(w) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w)</math>

Here, <math>w</math> represents the parameters of the model, <math>n</math> is the number of observations in the data set, and <math>Q_i(w)</math> is the cost associated with the i-th observation. The goal is to find the parameter values that minimize the cost function.

In each iteration of SGD, a single training example is chosen at random, and the gradient of the cost function with respect to the parameters is computed from that example alone. The parameters are then updated in the direction of the negative gradient. This process is repeated, typically over many passes through the data, until the parameters converge (approximately) to a minimum of the objective.

The update rule for SGD is given by:

<math>w = w - \eta \nabla Q_i(w)</math>

Here, <math>\eta</math> is the learning rate, a hyperparameter that determines the step size at each iteration while moving toward a minimum of a loss function, and <math>\nabla Q_i(w)</math> is the gradient of the cost function at the current parameter value, calculated using the i-th training example.
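
As a concrete illustration, the following sketch applies this update rule to an ordinary least-squares linear regression problem using NumPy; the synthetic data, the learning rate of 0.01, and the number of updates are illustrative assumptions rather than values prescribed in this article.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem (illustrative): y = X w_true + noise,
# with per-example cost Q_i(w) = 0.5 * (x_i . w - y_i)^2.
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad_Qi(w, i):
    # Gradient of Q_i(w) for the i-th example: (x_i . w - y_i) * x_i
    return (X[i] @ w - y[i]) * X[i]

eta = 0.01           # learning rate
w = np.zeros(d)      # initial parameters
for _ in range(20000):
    i = rng.integers(n)            # pick one example uniformly at random
    w = w - eta * grad_Qi(w, i)    # w <- w - eta * grad Q_i(w)

print("error in recovered parameters:", np.linalg.norm(w - w_true))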

Variants of Stochastic Gradient Descent

There are several variants of SGD that modify the basic algorithm to improve its performance in different ways.

Momentum

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction of the update vector of the past time step to the current update vector.
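
A minimal sketch of the momentum update, assuming the common choice gamma = 0.9 for the momentum coefficient and a toy quadratic objective for the gradient:

import numpy as np

def momentum_step(w, v, grad, eta=0.01, gamma=0.9):
    # v <- gamma * v + eta * grad ;  w <- w - v
    v = gamma * v + eta * grad
    return w - v, v

# Toy usage: grad = w is the gradient of Q(w) = 0.5 * ||w||^2.
# In SGD, grad would be the gradient of a randomly chosen Q_i instead.
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, v = momentum_step(w, v, grad=w)
print(w)   # approaches the minimizer at the origin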

Nesterov Accelerated Gradient

The Nesterov Accelerated Gradient (NAG) adds a form of look-ahead to the momentum method. Because the momentum term already indicates roughly where the parameters will be after the next step, NAG computes the gradient not with respect to the current parameters but with respect to this approximate future position of the parameters.
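
A sketch of a single NAG update under the same illustrative assumptions; grad_fn stands in for the gradient of a randomly chosen Q_i in the stochastic setting:

import numpy as np

def nag_step(w, v, grad_fn, eta=0.01, gamma=0.9):
    # Evaluate the gradient at the look-ahead point w - gamma * v,
    # then apply the momentum-style update.
    v = gamma * v + eta * grad_fn(w - gamma * v)
    return w - v, v

# Toy usage: grad_fn is the gradient of Q(w) = 0.5 * ||w||^2.
# In SGD, grad_fn would be the gradient of a randomly chosen Q_i.
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, v = nag_step(w, v, grad_fn=lambda u: u)
print(w)   # approaches the minimizer at the origin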

Adagrad

Adagrad is an algorithm for gradient-based optimization that adapts the learning rate to the parameters, performing smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with infrequent features.
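
A sketch of the Adagrad update, where the accumulator G holds the per-parameter sum of squared gradients; the small epsilon term and the toy gradient are illustrative assumptions:

import numpy as np

def adagrad_step(w, G, grad, eta=0.1, eps=1e-8):
    # Accumulate squared gradients, then scale each parameter's step by
    # 1 / (sqrt(G) + eps): frequently updated parameters get smaller steps.
    G = G + grad ** 2
    return w - eta * grad / (np.sqrt(G) + eps), G

# Toy usage: grad = w is the gradient of Q(w) = 0.5 * ||w||^2.
# In SGD, grad would come from a randomly chosen example.
w, G = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, G = adagrad_step(w, G, grad=w)
print(w)   # moves toward the minimizer at the origin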

RMSprop

RMSprop (Root Mean Square Propagation) is an unpublished adaptive learning rate method proposed by Geoffrey Hinton in his Coursera course. It divides the learning rate by an exponentially decaying average of squared gradients, computed separately for each parameter.
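
A sketch of the RMSprop update using the commonly cited settings (decay rate 0.9, learning rate 0.001); these constants are conventions rather than requirements:

import numpy as np

def rmsprop_step(w, avg_sq, grad, eta=0.001, rho=0.9, eps=1e-8):
    # Keep a decaying average of squared gradients and divide the step by its root.
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    return w - eta * grad / (np.sqrt(avg_sq) + eps), avg_sq

# Toy usage: grad = w is the gradient of Q(w) = 0.5 * ||w||^2.
# In SGD, grad would come from a randomly chosen example.
w, avg_sq = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(1000):
    w, avg_sq = rmsprop_step(w, avg_sq, grad=w)
print(w)   # moves toward the minimizer at the origin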

Adam

Adam (Adaptive Moment Estimation) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients, like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients, similar to momentum.
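
A sketch of the Adam update with the default constants from the original paper (beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8); the toy gradient is again illustrative:

import numpy as np

def adam_step(w, m, v, t, grad, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First and second moment estimates (decaying averages), with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)        # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)        # bias-corrected second moment
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage: grad = w is the gradient of Q(w) = 0.5 * ||w||^2.
# In SGD, grad would come from a randomly chosen example.
w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    w, m, v = adam_step(w, m, v, t, grad=w)
print(w)   # moves toward the minimizer at the origin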

Applications

SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Because the data in these settings is sparse and high-dimensional, the classifiers are typically sparse linear models, and the incremental, example-at-a-time nature of SGD makes it a particularly good fit.
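
As one illustration, scikit-learn's SGDClassifier fits a linear classifier with SGD on sparse text features; the tiny corpus and the hashing vectorizer settings below are illustrative choices:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Tiny corpus for illustration; real applications use large, sparse text collections.
texts = ["good film", "great acting", "boring plot", "terrible film"]
labels = [1, 1, 0, 0]

vectorizer = HashingVectorizer(n_features=2 ** 10)   # sparse feature vectors
X = vectorizer.transform(texts)

clf = SGDClassifier(max_iter=1000, tol=1e-3)         # linear classifier trained with SGD
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["great film"])))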

SGD has also been applied to similar large-scale problems in image recognition and computational biology, and stochastic gradient methods are the standard approach for training deep neural networks.

Advantages and Disadvantages

Like any other algorithm, SGD has its advantages and disadvantages.

Advantages

  • Efficiency: because each update uses only a single training example (or a small mini-batch), the per-iteration cost is low and does not grow with the size of the data set. This is a big advantage when there is a large amount of data and/or a large number of hyperparameter settings to try.
  • Ease of implementation: the algorithm is simple to implement, and essentially the same update loop can be used to train a wide variety of models and loss functions.

Disadvantages

  • SGD requires a number of hyperparameters, such as the learning rate, the regularization parameter, and the number of iterations.
  • SGD is sensitive to feature scaling, so the input features are usually standardized before training (see the sketch after this list).
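
One common remedy, sketched below with scikit-learn as an illustration, is to standardize each feature to zero mean and unit variance before training; the two-feature data set here is an invented example with deliberately mismatched scales.

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data: two features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = (X[:, 0] + X[:, 1] / 1000 > 0).astype(int)

# Standardize each feature to zero mean and unit variance before the SGD step.
model = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000, tol=1e-3))
model.fit(X, y)
print(model.score(X, y))   # training accuracy on the scaled features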
