Subgradient Descent Explained, Step by Step
Gradient descent is one of the most popular algorithms for training machine learning models. However, many popular machine learning models, like lasso regression or support vector machines, have loss functions that are not differentiable everywhere. Because of this, regular gradient descent cannot be used. One of the most common techniques to circumvent this issue is to use subgradients instead of regular gradients. In this article, you will learn how it's done.

Outline
You’ve probably heard about gradient descent. Maybe someone even told you that this or that machine learning model can be trained by using gradient descent. In many cases, what is actually used under the hood is not regular gradient descent, but subgradient descent. In this article, we will explore what subgradients are, how and why we can use them in practice, and what the potential dangers of subgradients are.
Prerequisites
Before reading this article, you should already be familiar with gradients and gradient descent. If you’re unfamiliar with gradient descent, don’t worry! Gradient descent is a lot less scary than it sounds, and it can be used for a variety of machine learning models, not just neural networks. If you’re interested in an introduction to gradient descent, check out the article Gradient Descent for Linear Regression Explained, Step by Step.
In this article, we will apply gradient descent to lasso regression, which is a slight variation of the popular linear regression algorithm. So if you want to understand that example fully, I recommend you take a look at Lasso Regression Explained, Step by Step.
With that being said, let’s get started!
The Problem
So why do we even need these subgradients? What's the issue with regular gradients? For this, let's look at a concrete example: lasso regression. Imagine we have a dataset of figure prices, where each entry contains the age of a figure as well as its price at that age in € (or any other currency). We now want to predict the price of a figure given its age using lasso regression, to see how much the figures depreciate over time. The dataset looks like this:
We can now write down our lasso loss function like this:

$$\text{lassoMSE}(\theta) = \underbrace{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}_{\text{MSE}} + \underbrace{\alpha\sum_{j=1}^{m}\left|\theta_j\right|}_{\text{L1-penalty}}$$

where $\theta$ is the vector containing all of our model parameters. The second term is an L1-penalty, and it can be rewritten like this:

$$\alpha\sum_{j=1}^{m}\left|\theta_j\right| = \alpha\lVert\theta\rVert_1$$

$\left|\theta_j\right|$ is an application of the absolute function, which looks like this:

$$|x| = \begin{cases} x & \text{if } x \geq 0 \\ -x & \text{if } x < 0 \end{cases}$$

To find the optimal values for our model parameters, we have to find the minimum of the lassoMSE. To find the minimum of a function, we usually have to take the first derivative of that function. In our case we have a multivariate function, so we will have to take its gradient. But here's the problem: the L1-penalty inside of our lassoMSE is not differentiable at 0!
The slope of the absolute function is -1 for all of the negative values, and it is 1 for all of the positive values. But the slope at 0 is not defined!
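We can sanity-check this numerically. Below is a tiny sketch (plain Python, nothing from the article's codebase) that computes the one-sided difference quotients of $|x|$ at 0 and shows that they disagree:

```python
# one-sided difference quotients of |x| at x = 0
h = 1e-8
left = (abs(0 - h) - abs(0)) / (-h)   # slope approaching from the left  -> -1.0
right = (abs(0 + h) - abs(0)) / h     # slope approaching from the right -> +1.0
print(left, right)  # -1.0 1.0 — the two limits differ, so the derivative at 0 is undefined
```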
Because of this, we cannot differentiate our lassoMSE, and thus we cannot find the optimal model parameters for our lasso regression. Or can we?
Part 1: Subgradients
One key insight we can make is that the absolute function inside of our L1-penalty is only non-differentiable at zero. So maybe we can still differentiate our function at every point except 0, and then we'll just find a clever way to handle the case when we get to the point at $x = 0$.
This is the first key insight into subgradients. What is a subgradient of a function $f$ at a point $x_0$? Well, if $f$ is differentiable at $x_0$, then the subgradient is just the regular gradient of $f$ at that point $x_0$. Ok, so far so good. But now comes the trickier part. What do we do when $f$ is not differentiable at $x_0$?
Let’s take a step back and visualize our absolute function again.
This time, I'll also include the gradients for $x < 0$ and $x > 0$. Since we're dealing with a univariate function, the gradient will just be a scalar. So instead of somehow adding this scalar into the plot, I will display each gradient as a linear function whose slope is exactly the value of the gradient and whose intercept is 0. This may sound a bit confusing, so let's jump to the visuals to make things more clear. The plot below contains two gradients, so feel free to toggle their visibility by clicking on the labels in the legend at the right part of the plot.
One thing you might notice is that the linear function representing either of the two gradients is always either below or exactly on our main function $f(x) = |x|$. What's even better is that this property holds true for all convex functions. What is a convex function? In a nutshell, a convex function is a bowl-shaped function: it has no local minima apart from its global minimum.
Ok, so we know that gradients (or rather the linear functions we define from them) are so-called global underestimators. Now if we're going to define a subgradient of $f(x) = |x|$ at 0, then it should probably look and behave similarly to what a real gradient would look like at that point, if it existed in the first place. With this in mind, what could a potential subgradient of $f$ look like at $x_0 = 0$?
Since we can't truly know what a real gradient would look like at that point, one idea is to look at the gradients in the neighborhood around 0, and then define the subgradient as an intermediate vector between the gradients around our non-differentiable point.
Right now this still sounds pretty vague, so below is a visualization showing many possibilities for an intermediate gradient at $x_0 = 0$. At the bottom of the plot, there's a slider which controls the value of the subgradient.
Notice how the line representing our subgradient turns orange once it crosses the original function, making the subgradient invalid because it would be steeper than the neighboring gradients, which is something we don't want. Conveniently enough, this requirement also ensures that our subgradient is a global underestimator of our function, just like a regular gradient would be. So any number between -1 and 1 (including -1 and 1 themselves) is a valid subgradient for our function at $x_0 = 0$.
That is also why subgradients are called subgradients. The prefix sub comes from Latin and means "below", which makes sense now that we know that subgradients are global underestimators, or in other words, lie below our function.
This is the second crucial insight into subgradients. They can be considered intermediate vectors between the neighboring gradients, and they are global underestimators of our function. This also means that there can be an infinite number of subgradients for a function at a non-differentiable point $x_0$, in contrast to regular gradients, which are unique for a given function and point.
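We can verify this numerically. Here's a small sketch checking that several values of $g$ in $[-1, 1]$, endpoints included, all define lines that stay below $|x|$ (the 1e-12 is just a float-comparison guard):

```python
import numpy as np

xs = np.linspace(-3, 3, 1001)
for g in (-1.0, -0.5, 0.0, 0.2, 1.0):
    # each candidate subgradient g defines the line g*x through the origin;
    # a valid subgradient's line must never rise above |x|
    assert np.all(g * xs <= np.abs(xs) + 1e-12)
print("every tested g in [-1, 1] is a valid subgradient of |x| at 0")
```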
And that’s it! That is how subgradients are defined.
Formalizing Subgradients
Now let's try to come up with a mathematical definition that encapsulates this behavior. Our definition might start like this:

A vector $g$ is a subgradient of a function $f$ at the point $x_0$ if …
So far so good! We’ve seen that subgradients are global underrestimators, so let’s try to come up with a mathematical way of writing this property down. At this point, you can pause for a second and think about how you would define this property.
An initial idea might look similar to this:

$$g \cdot x \leq f(x) \quad \text{for all } x$$
This says that the value of our subgradient line at any point $x$ has to be at most the value of our function at that same point $x$. This definition seems intuitive, but let's see if it really accurately describes what subgradients are. For this, we will look at the absolute function once again. Let's create a plot where we will see the following things:
- the absolute function $f(x) = |x|$
- our subgradient $g$ of $f$ at the point $x_0 = 0$
- $f(x)$ for specific values of $x$
- $g \cdot x$ for specific values of $x$
Since $x$ is a variable, we'll add in a slider to control the value of $x$. We'll also consider just one fixed subgradient (namely 0.2), since, as we have seen before, the subgradient of the absolute function at $x_0 = 0$ can take on any value between -1 and 1. Here's how that looks:
So far so good! We can see that $f(x)$ is always greater than or equal to $g \cdot x$.
Refining our Definition
Now let's try a more complicated example. Consider the following function:

$$f(x) = \begin{cases} 1 & \text{if } x < 2 \\ 2x - 3 & \text{if } x \geq 2 \end{cases}$$
This function looks like this:
$f$ is not differentiable at $x_0 = 2$, so let's add in a subgradient at that particular point. The slope of $f$ is 0 for every point less than 2, and 2 for every point after that. Since we can think of a subgradient as the intermediate vector between the gradients around it, a valid subgradient in this particular case would be 1. So let's add the subgradient (once again represented as a linear function, we'll call it $g$) to our plot and see how it looks:
Well, that sure does not look as expected… our subgradient is in the wrong position! We want our subgradient to go through the point $(x_0, f(x_0)) = (2, 1)$, but right now this isn't the case. So what can we do? We can move our entire subgradient by $x_0 = 2$ to the right. Maybe you also remember from calculus that to move a function to the right by some amount, we counterintuitively have to subtract that amount from the function's input. So instead of calculating $g \cdot x$ we will now calculate $g \cdot (x - x_0)$, which in our case will be $1 \cdot (x - 2)$. Now we have successfully shifted our function on the x-axis, but we still have to shift it on the y-axis. To do this we simply add $f(x_0)$ to the right-hand side of our subgradient condition. In our particular case that would be $f(2) = 1$.
Our subgradient definition now looks like this:

$$f(x) \geq g \cdot (x - x_0) + f(x_0) \quad \text{for all } x$$
Ok, that seems reasonable. Let’s also visualize it to see if it actually does what we expect. Below is an interactive visualization where you can visualize the two steps we have just performed by pressing the button at the top of the plot.
Much better! Now let’s look at the actual definition of subgradients and see how it compares to our own. Here goes:
A vector $g$ is a subgradient of a function $f$ at the point $x$ if for all points $z$ the following holds:

$$f(z) \geq f(x) + g^T(z - x).$$
Personally, I find it a lot more intuitive to call $z$ "$x$" and to call $x$ "$x_0$", since this (at least for me) indicates that $x$ and $x_0$ lie on the same scale, and that $x$ is really a variable whereas $x_0$ is just a single, fixed point. That is why I chose that version for this article, but I still wanted to show you this variant as well, since it is the most common one in the literature.
Well, if that doesn't look familiar! The definition is pretty much exactly the same as ours. The only difference is that the subgradient is transposed. This is necessary because our subgradient might live in more than one dimension, unlike in our example, and transposing makes the multiplication work out. Apart from that, our definition is exact. Nice!
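To convince ourselves that the definition holds for our example, here is a small numerical check using the piecewise function from above and our chosen subgradient $g = 1$ (the 1e-12 is just a float-comparison guard):

```python
import numpy as np

def f(x):
    # the kinked example function: slope 0 to the left of x = 2, slope 2 to the right
    return np.where(x < 2, 1.0, 2 * x - 3)

x0, g = 2.0, 1.0  # the non-differentiable point and our chosen subgradient
zs = np.linspace(-5, 8, 1001)
# subgradient inequality: f(z) >= f(x0) + g * (z - x0) for all z
assert np.all(f(zs) >= f(x0) + g * (zs - x0) - 1e-12)
print("g = 1 is a valid subgradient of f at x0 = 2")
```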
Now we understand where subgradients come from, why they’re useful and how they are defined. But you’re also probably interested in when they are actually used in practice. To answer this question, let’s look at subgradient descent!
Part 2: Subgradient Descent
Million-dollar question: what's the difference between subgradient descent and gradient descent? That's right, subgradient descent uses a subgradient instead of a regular gradient! That is the only difference between the two. However, subgradient descent might not work completely as expected in some cases, so it's definitely worthwhile to implement subgradient descent and take a look at some potential weaknesses. For the implementation, we'll reuse the code that we wrote in the article Gradient Descent for Linear Regression Explained, Step by Step. Here's how the code looks:
```python
def gradient_descent(X, y, theta, criterion, gradient_function, number_of_iterations, learning_rate):
    X_b = add_intercept_ones(X)
    for i in range(number_of_iterations):
        # predict and calculate loss
        f = create_function(theta)  # create the current function
        y_predicted = f(X_b)  # predict our entire X
        loss = criterion(y, y_predicted)  # calculate the error
        # perform optimization
        gradient = np.array(gradient_function(...))  # calculate gradient
        theta = theta - learning_rate * gradient  # adjust m and b
    return theta
```
add_intercept_ones is a little helper function that adds an additional column of ones to our X to make calculations with the bias (or intercept) easier; you can find its definition here. create_function is another small helper; you can find its definition here.
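If you don't want to jump to the linked definitions, minimal versions of these two helpers might look like the following sketch (plausible reconstructions, not necessarily the exact originals):

```python
import numpy as np

def add_intercept_ones(X):
    # prepend a column of ones so the intercept is learned like any other weight
    return np.c_[np.ones(X.shape[0]), X]

def create_function(theta):
    # build the current linear model: f(X_b) = X_b @ theta
    return lambda X_b: X_b.dot(theta)
```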
Now all we need to do is rename our gradient to subgradient, like this:
```python
def subgradient_descent(X, y, theta, criterion, subgradient_function, number_of_iterations, learning_rate):
    X_b = add_intercept_ones(X)
    for i in range(number_of_iterations):
        # predict and calculate loss
        f = create_function(theta)  # create the current function
        y_predicted = f(X_b)  # predict our entire X
        loss = criterion(y, y_predicted)  # calculate the error
        # perform optimization
        subgradient = np.array(subgradient_function(...))  # calculate subgradient
        theta = theta - learning_rate * subgradient  # adjust m and b
    return theta
```
Technically, you could even keep the old implementation as it is, since we've only changed variable names. However, this way I think it's a bit easier to see what exactly we're doing.
Now that we have our general subgradient_descent function defined, let's look at a concrete example: lasso regression.
Example Implementation: Lasso Regression
To recap, here's how the lasso loss function looks:

$$\text{lassoMSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \alpha\sum_{j=1}^{m}\left|\theta_j\right|$$
From our observations above, we figured out that the subgradient of the absolute function at $x = 0$ can be any number between -1 and 1, including -1 and 1 themselves. And for every other $x$, our subgradient is just the regular gradient, i.e. $-1$ for $x < 0$ and $1$ for $x > 0$. We can write this down like so:

$$g(x) = \begin{cases} -1 & \text{if } x < 0 \\ \text{any value in } [-1, 1] & \text{if } x = 0 \\ 1 & \text{if } x > 0 \end{cases}$$
Does this function look familiar to you? It looks pretty similar to the sign of the number $x$. Any strictly positive number has sign 1, any strictly negative number has sign -1, and for 0 we can just define the sign to be 0, which, conveniently, is a number between -1 and 1, and, even more conveniently, is exactly what np.sign(0) outputs.
With this, we can now use the sign of $x$ as our subgradient, like so:

$$g(x) = \operatorname{sign}(x)$$
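As a quick sanity check, here's np.sign applied to a small, made-up parameter vector:

```python
import numpy as np

theta = np.array([-3.2, 0.0, 1.7])
print(np.sign(theta))  # [-1.  0.  1.] — a valid subgradient of |x| at every entry
```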
And that's it! We have "fixed" our lassoMSE, and now it's time to perform subgradient descent!
Implementing Subgradient Descent for Lasso
The only thing we have to implement now are our loss and (sub)gradient functions. In the article Ridge Regression Explained, Step by Step we’ve implemented these functions for ridge regression:
```python
def get_ridge_mse_function(alpha=0.0001):
    def ridge_mse(y, y_predicted, theta):
        mse_loss = mse(y, y_predicted)
        ridge_loss = mse_loss + alpha * np.dot(theta, theta)
        return ridge_loss
    return ridge_mse

def get_ridge_gradient_function(alpha=0.0001):
    def ridge_gradient(X_b, y, y_pred, theta):
        return -(2 / y.size) * X_b.T.dot(y - y_pred) + alpha * theta
    return ridge_gradient
```
We can slightly modify this code to get our functions for lasso. Instead of writing two new functions that return our gradient and loss functions, we'll modify our existing ones to accept the parameters alpha1 and alpha2 (or a1 and a2 for short), which control the lasso and ridge penalties respectively. With this, we technically implement functions not for lasso, but for elastic-net regression. However, since elastic-net is just a linear combination of ridge and lasso, this works out just fine.
With this we have:
```python
def get_elastic_mse_function(a2=1, a1=0):
    def elastic_mse(y, y_predicted, theta):
        error = y - y_predicted
        return (
            1 / y.size * np.dot(error.T, error)  # mse
            + a2 * np.dot(theta, theta)  # l2-penalty
            + a1 * np.sum(np.abs(theta))  # l1-penalty
        )
    return elastic_mse

def get_elastic_gradient_function(a2=1, a1=0):
    def elastic_gradient(X, y, y_pred, theta):
        return (
            -(2 / y.size) * X.T.dot(y - y_pred)  # gradient of mse
            + 2 * a2 * theta  # gradient of l2-penalty
            + a1 * np.sign(theta)  # subgradient of l1-penalty
        )
    return elastic_gradient
```
By calling get_elastic_mse_function with a1=0 and a2=0, we get the regular, unregularized MSE function. With a2=0 (and a nonzero a1) we get the lassoMSE, and with a1=0 (and a nonzero a2) we get the ridgeMSE. The same goes for our gradient function, where we can just combine the penalties to avoid having to write three separate functions.
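In code, all three variants come from the same factory (a quick illustration with example penalty strengths):

```python
plain_mse = get_elastic_mse_function(a1=0, a2=0)  # unregularized MSE
lasso_mse = get_elastic_mse_function(a1=1, a2=0)  # lassoMSE (L1 only)
ridge_mse = get_elastic_mse_function(a1=0, a2=1)  # ridgeMSE (L2 only)
```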
With the loss and gradient now defined, we can use our subgradient_descent function like this:
```python
subgradient_descent(X_train, y_train, theta,
                    get_elastic_mse_function(a2=0, a1=1),
                    get_elastic_gradient_function(a2=0, a1=1),
                    number_of_iterations, learning_rate)
```
Here it is important to note that our X has to be standardized, since we are using a regularized model. If you're unsure why this is necessary or what standardization really is, I recommend you read Standardization Explained, Step by Step. For now, we will assume that our X has already been standardized, to keep the code readable.
Also note that we are only using a portion of the dataset for our subgradient descent, a so-called training subset. You can read more about how to split your dataset correctly in the article How to Split Your Dataset the Right Way. For this article, we’ll already assume that our dataset has been split properly.
So let’s run lasso regression and see how it goes!
```python
theta_lasso_subgd = subgradient_descent(
    X_train_s,  # standardized!
    y_train,
    np.random.rand(2),  # initialize theta as a random vector
    get_elastic_mse_function(a2=0, a1=1),
    get_elastic_gradient_function(a2=0, a1=1),
    1200,
    0.01,
)
print(theta_lasso_subgd)
# output: [40.75 -2.74229857]
```
Alright, that looks good! If we train a regular OLS regression model, our model parameters will look more like this: [41.25 -3.242]. So lasso did make our parameters a bit smaller, nice!
We can plot the two models and we get:
The lasso function is pretty similar to the regular linear regression function, just with slightly smaller coefficients. If we were to increase the L1-penalty coefficient, we would get an even more strongly regularized result.
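As a sketch of what that would look like in code (the value a1=10 is a hypothetical choice; the exact resulting parameters depend on the data):

```python
# same call as before, but with a ten times stronger L1-penalty
theta_lasso_strong = subgradient_descent(
    X_train_s,
    y_train,
    np.random.rand(2),
    get_elastic_mse_function(a2=0, a1=10),
    get_elastic_gradient_function(a2=0, a1=10),
    1200,
    0.01,
)
```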
The Fatal Flaw
The main property of lasso is that it can produce sparse model weights. In other words, lasso is able to set the weights of unhelpful features all the way to zero, unlike ridge regression, for example. Since we've just implemented lasso regression, we should check whether our implementation zeroes out unnecessary weights as we expect it to. We can do this by creating a random dataset (a random X and a random y) and then running lasso regression on it. Since the dataset is completely random, there should be no correlation between X and y, and thus lasso should set all of our model parameters to exactly zero. Let's give it a try:
```python
theta_rand = np.random.rand(51)
X_rand = np.random.rand(100, 50)
y_rand = np.random.rand(100)

theta_rand_lasso_subgd = subgradient_descent(
    X_rand,
    y_rand,
    theta_rand,
    get_elastic_mse_function(a2=0, a1=1),
    get_elastic_gradient_function(a2=0, a1=1),
    3000,
    0.01,
)
print(theta_rand_lasso_subgd)
# output similar to:
# [ 0.01036144  0.00646848  0.00116589  0.00647136 -0.00524321  0.00290512
#   0.00664798  0.00808131  0.00937785  0.00165406 -0.00257609  0.00490855
#  -0.00133889 -0.00582042  0.00832424  0.00403818  0.00799567  0.00062623
#   0.01316767 -0.0043727   0.01226832 -0.0052027   0.00092482  0.00231925
#   0.01305558  0.00906722  0.00154304  0.00505857  0.0087923   0.00346944
#   0.00158308  0.0112326  -0.00631912 -0.0001088  -0.00211522  0.01203972
#   0.00760454  0.01046682  0.00350387 -0.00540258  0.00655244  0.00841913
#   0.00930862  0.00478768 -0.00475235  0.01260794  0.01329395 -0.00388777
#   0.00920832  0.00485297  0.00296131]
```
Now I don't know about you, but those values are not exactly zero. I mean, sure, they're close. But lasso should be able to set them to exactly zero, not just close to it. That's why we're using it in the first place, right? Even if we were to increase a1 and run subgradient descent again, the parameters would still not be exactly zero. Why is that? Did we do something wrong? Surely there has to be a mistake in our code. Surprisingly, no. Everything is working as intended.
When I first encountered this behavior, I did not know what caused the issue. So naturally, I started digging deeper and also performed some experiments. The most common answer that I found online was that the problem lies in the subgradient: since the subgradient is only an approximation of the real gradient at the non-differentiable point $x = 0$, so the argument goes, the updates there are imprecise, which supposedly keeps the weights from reaching exactly zero. But this explanation only matters if the point $x = 0$ is ever actually reached.
Now the question is: how often is this point at $x = 0$ actually hit? If you want to find this out for yourself, I recommend you take a look at the second programming exercise of the Reinforcement-section, because that is exactly the purpose of that exercise. If you're still reading, then you probably want to know the answer. The answer is that the point at $x = 0$ is pretty much never hit. Why? For the subgradient at 0 to even come into play, an entry of $\theta$ would have to land on exactly 0 at some iteration. The L1-penalty of $\theta$ is the sum of the absolute entries of $\theta$, so for this term to be 0, every entry in $\theta$ has to be 0 as well. Or, in other terms:

$$\alpha\sum_{j=1}^{m}\left|\theta_j\right| = 0 \iff \theta_j = 0 \;\text{ for all } j$$
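If you'd rather see this empirically without doing the exercise, here's a toy experiment (a made-up update rule, purely for illustration) that counts how often any parameter lands on exactly 0.0 over ten thousand updates:

```python
import numpy as np

rng = np.random.default_rng(42)
theta = rng.standard_normal(50)

exact_zero_hits = 0
for _ in range(10_000):
    # toy update: a shrinking mse-like pull plus a constant L1 subgradient term
    theta = theta - 0.01 * (0.1 * theta + 0.5 * np.sign(theta))
    exact_zero_hits += int(np.count_nonzero(theta == 0.0))

print(exact_zero_hits)  # almost certainly 0 — the iterates jump across 0, never onto it
```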
The Fatal Flaw of…?
So if the subgradient is not the problem, then what is? The problem actually lies in gradient descent itself. The thing is, gradient descent in itself only approximates the optimal solution. In some cases, that approximation happens to be exactly the optimal solution. However, in our case, it is not. Why? Since there is no correlation between our randomized features and targets, the optimal solution would contain only zeros, right? This means that as we get closer and closer to our optimal solution, our model parameters get closer and closer to zero, but this also means that our gradient shrinks more and more! And with a smaller gradient, gradient descent performs smaller updates. So we are asymptotically approaching the optimal solution, taking smaller and smaller steps at every iteration, and to actually reach the exact optimal solution we would have to perform an infinite number of gradient descent iterations.
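You can already see this asymptotic behavior in one dimension. Minimizing $f(\theta) = \theta^2$ with plain gradient descent shrinks $\theta$ by a constant factor each step, so it gets ever closer to 0 without ever reaching it:

```python
theta = 1.0
for _ in range(1000):
    theta -= 0.1 * (2 * theta)  # gradient of theta**2 is 2*theta, so each step: theta *= 0.8
print(theta)  # ~1.2e-97 — astronomically small, but still not exactly zero
```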
We can see this even better when we visualize the values of $\theta$ over time. To do this, let's generate a random dataset with only two entries, because visualizing 50-dimensional data is a bit tricky. We could do this in 3D, where the first two dimensions would be the first and second value of $\theta$, and the third dimension would be the lassoMSE for our randomized dataset. We will do this later when we visualize subgradient descent on our actual dataset, but for now a filled contour plot (or contourf plot) is a better choice. A filled contour plot is a 2D plot where the background color represents the third dimension. So in our example, the lassoMSE is represented by the background color of the plot. If we track the value of $\theta$ at every iteration and plot it (in red) on top of our contourf plot, we get this:

The dark region in the middle is where our loss is minimal. As we can see the point (0,0) lies exactly in that region. Now it might look like we pretty quickly reach our optimal point (0,0), but if we zoom in we will be able to better understand what really is going on:

Our $\theta$ jumps around our optimal point, but it never reaches it exactly. Gradient descent is often explained as a traveler climbing down a hill, but here I think it makes sense to think of gradient descent as a game of golf, where $\theta$ is our golf ball and the hole is our optimal solution. Putting the ball exactly into the hole is difficult, as it requires good aim and just the right amount of force. But here, instead of a hole, the optimal point is just a marking on the ground. If putting a golf ball into a hole is difficult, then stopping it on a specific point on the ground is nearly impossible. That is why our (sub)gradient descent keeps hopping around the optimal point. It's trying its best to get as close as possible to the optimum, but the ball keeps rolling just a bit further than it should.
This is important because gradient descent (not stochastic gradient descent, but regular gradient descent) is often presented as what I like to call an optimal solution finder, even if a costly one. But as we see here, this is simply not true. Usually, gradient descent finds a solution so close to the optimal one that the difference is completely negligible. But if you are looking for some very specific values, as in the case of lasso, then this is an important fact to keep in mind, because it might explain some struggles similar to the ones presented in this article.
So now that we know about this issue, how can we solve it?
Fixing (Sub)gradient Descent for Lasso
This issue can actually be fixed rather easily. The crucial insight is that we know our weights will get very close to zero, even though they will not go all the way to zero. So what we can do is add another parameter to our subgradient_descent function called epsilon, or eps for short. Then, at the end of each iteration of subgradient descent, we set all weights whose absolute value is smaller than eps to zero. So all we have to do is add this one line to our subgradient_descent algorithm:
```python
def subgradient_descent(
    X,
    y,
    theta,
    criterion,
    subgradient_function,
    number_of_iterations,
    learning_rate,
    eps,
):
    X_b = add_intercept_ones(X)
    for i in range(number_of_iterations):
        # predict and calculate loss
        f = create_function(theta)  # create the current function
        y_predicted = f(X_b)  # predict our entire X
        loss = criterion(y, y_predicted)  # calculate the error
        # perform optimization
        subgradient = np.array(subgradient_function(...))  # calculate subgradient
        theta = theta - learning_rate * subgradient  # adjust m and b
        theta[np.abs(theta) < eps] = 0  # zero out weights whose magnitude is below eps
    return theta
```
If we now run our subgradient descent on our random dataset again we get the following output:
```python
theta_rand = np.random.rand(51)
X_rand = np.random.rand(100, 50)
y_rand = np.random.rand(100)

theta_rand_lasso_subgd = subgradient_descent(
    X_rand,
    y_rand,
    theta_rand,
    get_elastic_mse_function(a2=0, a1=1),
    get_elastic_gradient_function(a2=0, a1=1),
    3000,
    0.01,
    0.01,
)
print(theta_rand_lasso_subgd)
# [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
#  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
#  0. 0. 0.]
```
Awesome!
Visualizing Subgradient Descent
Now that we’ve fully implemented subgradient descent for lasso, let’s also visualize it! Before we used a filled contour plot to visualize how subgradient descent converges for a randomized dataset. Now we will do the same for our actual dataset of figure prices, but this time, we will use a 3D plot instead of a contour plot. The visualization will be similar to the one we created for ridge regression. At the top of the plot, there are buttons to control the current iteration of subgradient descent. Here’s how our algorithm performs:
Nothing too special here, it looks mostly like regular gradient descent, and it also looks quite similar to the visualization we looked at earlier. But if we were to zoom in reeally deeply (deeper than this visualization allows us to), we would still see some jittering and imprecision, especially towards the end.
Here we can also intuitively see that the lassoMSE is convex, because the surface plot representing the lassoMSE only has one minimum. This isn't a mathematically rigorous proof, but for intuition, I think it conveys the message quite nicely.
Further Reading
Congrats on finishing this article! Hopefully, this article helped you understand what subgradients are, when and why they are used, as well as their strengths and weaknesses.
Subgradient descent is only one way to solve lasso regression, and there is in fact an even better way to solve lasso. It’s called coordinate descent and is extremely fast (convergence can often occur after just 2-3 epochs!). If you want to learn more about this fascinating algorithm and why it is so blazingly fast, I recommend you take a look at the article Coordinate Descent Explained, Step by Step, where you will learn everything you need to know about coordinate descent.
Maybe you have noticed that the L1-penalty can be written both as a sum as well as a vector operation. This is true not only for the L1-penalty but for most functions in machine learning! However, one of the two is clearly better than the other when it comes to performance. If you want to know which of the two variants you should prefer when implementing machine learning models, as well as why you should do so, check out the article Vectorization Explained, Step by Step.