Imagine hiking in the mountains past dusk and finding that you can’t see much beyond your feet. And your phone’s battery died so you can’t use a GPS app to find your way home. You might find the quickest path down via gradient descent. Just be careful not to walk off a cliff.
Suns and rugs: French mathematician Augustin-Louis Cauchy invented the algorithm in 1847 to approximate the orbits of stars. Sixty years later, his compatriot Jacques Hadamard independently developed it to describe deformations of thin, flexible objects like throw rugs that might make a downward hike easier on the knees. In machine learning, though, its most common use is to find the lowest point in the landscape of a learning algorithm’s loss function.
Downward climb: A trained neural network provides a function that, given an input, computes a desired output. One way to train the network is to minimize the loss, or error in its output, by iteratively computing the difference between the actual and desired output and then changing the network’s parameter values to narrow the difference. Gradient descent accomplishes this by minimizing the function that computes the loss.
- The network’s parameter values are tantamount to a position on the landscape, and the loss is the current altitude. As you descend, you improve the network’s ability to compute outputs close to the desired one. Visibility is limited because, in a typical supervised learning situation, the algorithm relies solely on the network’s parameter values (your position on the hill) and the gradient (the slope immediately beneath your feet).
- The basic method is to move in the direction where the terrain descends most steeply. The trick is to calibrate your stride. Too small, and it takes ages to make any progress. Too large, and you leap into the unknown, possibly heading uphill instead of downward.
- Given the current position, the algorithm estimates the direction of steepest descent by computing the gradient of the loss function. The gradient points uphill, so the algorithm steps in the opposite direction by subtracting a fraction of the gradient. The fraction α, which is called the learning rate, determines the size of the step before measuring the gradient again.
- Apply this iteratively, and hopefully you’ll arrive at a valley.
Stuck in the valley: Too bad your phone is out of juice, because the algorithm may not have propelled you to the bottom of a convex mountain. Instead, you may be stuck in a nonconvex landscape of multiple valleys (local minima), peaks (local maxima), saddles (saddle points), and plateaus. In fact, tasks like image recognition, text generation, and speech recognition are nonconvex, and many variations on gradient descent have emerged to handle such situations. For example, the algorithm may have momentum that helps it zoom over small rises and dips, giving it a better chance at arriving at the bottom. Researchers have devised so many variants that it may seem as though there are as many optimizers as there are local minima. Luckily, local and global minima tend to be roughly equivalent.
Optimal optimizer: Gradient descent is the clear choice for finding the minimum of any function. In cases where an exact solution can be computed directly — say, a linear regression task with lots of variables — it can approximate one, often faster and more cheaply. But it really comes into its own in complex, nonlinear tasks. Armed with gradient descent and an adventurous spirit, you might just make it out of the mountains in time for dinner.