Introduction to Gradient Descent

In a previous post, I presented (generalized) linear regression from a linear algebra point of view (namely, the “normal equations”) and in this post, I discuss yet another take on the problem, this time using gradient descent.

Gradient descent isn’t particularly fascinating for this task (as we know closed, analytical expressions for obtaining the parameters), but linear regression is the simplest example of gradient descent I could come up with that isn’t completely trivial.

So we have a bunch of n points \{(x_i,y_i)\}_{i=1}^n, and we want to fit a line of equation a+bx through them, minimizing some error function. We typically use the sum of squared errors (which corresponds to “vertical offsets”) because it yields easier analytical solutions. Let us define

\hat{y}_i=\hat{y}(x_i)=a+bx_i,

and the error function

E(a,b)=\displaystyle\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2.

Since E(a,b) is convex in both a and b, we can solve

\displaystyle\frac{\partial E(a,b)}{\partial a}=0 and \displaystyle\frac{\partial E(a,b)}{\partial b}=0

for a and b. After quite a bit of algebra, we find

\displaystyle a=\frac{\left(\sum_i y_i\right)\left(\sum_i x_i^2\right)-\left(\sum_i x_i\right)\left(\sum_i x_iy_i\right)}{n\left(\sum_ix_i^2\right)-\left(\sum_i x_i\right)^2}

and

\displaystyle b=\frac{n\left(\sum_i x_iy_i\right)-\left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n\left(\sum_ix_i^2\right)-\left(\sum_i x_i\right)^2}.

Noticing that several terms repeat, and possibly using special cases when \{x_i\}_{i=1}^n=\{0,1,2,\ldots,n-1\}, we can arrive at an efficient, linear-time implementation, for instance:
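As a rough sketch (the function name, types, and one-pass structure are my own choices for illustration, not a definitive implementation), the closed-form solution translates directly:

#include <cstddef>
#include <vector>

// Sketch: one-pass, linear-time closed-form least-squares fit. Accumulates
// the four sums appearing in the formulas above, then applies them.
void closed_form_fit( const std::vector<double> & x,
                      const std::vector<double> & y,
                      double & a,
                      double & b )
 {
  const std::size_t n=x.size();
  double sx=0, sy=0, sx2=0, sxy=0;

  for (std::size_t i=0;i<n;i++)
   {
    sx+=x[i];
    sy+=y[i];
    sx2+=x[i]*x[i];
    sxy+=x[i]*y[i];
   }

  const double d=n*sx2-sx*sx; // the common denominator

  a=(sy*sx2-sx*sxy)/d;
  b=(n*sxy-sx*sy)/d;
 }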

*
* *

While the relatively simple case of linear regression affords an analytical solution, we may (in real life) end up with another function 1) for which we can compute partial derivatives but 2) which we cannot solve analytically for all its parameters, because, maybe, it is non-elementary, or it poses some other difficulty. The technique to use then is gradient descent.

Gradient descent is conceptually a relatively simple procedure, although in practice it does have its share of gotchas. Let’s say we have some function f(x;\theta) with parameter(s) \theta, which we want to minimize over \theta, the xs being fixed (or “observed”, depending on how you look at it). The variable(s) to adjust is \theta: we must find the \theta^* that minimizes f(x;\theta) given the xs. If we can compute the partial derivative of f(x;\theta) with respect to \theta, we know in which direction to change \theta so that f(x;\theta) decreases. We then have the update rule

\displaystyle\theta^\prime=\theta-\eta\frac{\partial f(x;\theta)}{\partial \theta},

which changes \theta in the direction in which f(x;\theta) decreases.

However, while we know in which direction we must change \theta, we do not necessarily know by how much exactly, and so a function-dependent scaling factor \eta is introduced. This scaling, the gradient step, will prevent us from moving \theta too far and overshooting the minimum. If \eta is too small, it will take too long to find the minimum; if it is too large, the procedure may diverge, essentially crash.

(The underlying assumption for gradient descent to work properly is that the function one wants to minimize is at least approximately convex, so that following the slope leads to a (good local) minimum. If the function were highly non-convex, it would probably not work very well.)

If we have more than one parameter, then \theta is a vector of parameters, and the derivative is replaced by a gradient, which is the vector formed by the partial derivatives in all the parameters. That is,

\displaystyle \nabla f=\left(\frac{\partial f(x;\theta)}{\partial \theta_1}, \frac{\partial f(x;\theta)}{\partial \theta_2},\cdots, \frac{\partial f(x;\theta)}{\partial \theta_k}\right),

if we have k parameters. It is therefore the general case where \theta is a vector that gives the method its name, gradient descent.
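In code, the generic update rule is only a few lines. Here is a minimal, illustrative sketch (the callback-based interface, the fixed iteration count standing in for a real convergence test, and the names are my choices, not a standard API):

#include <cstddef>
#include <functional>
#include <vector>

// Sketch of the generic rule theta' = theta - eta * grad f(theta).
// The gradient is supplied as a callback.
std::vector<double>
minimize( const std::function<std::vector<double>(const std::vector<double> &)> & grad,
          std::vector<double> theta,   // initial guess
          double eta,                  // gradient step
          std::size_t iterations )
 {
  for (std::size_t t=0; t<iterations; t++)
   {
    const std::vector<double> g=grad(theta);
    for (std::size_t j=0; j<theta.size(); j++)
     theta[j]-=eta*g[j];
   }
  return theta;
 }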

*
* *

Back to the linear regression problem. Using E(a,b), we find that the partial derivatives with respect to the two parameters a and b are

\displaystyle\frac{\partial E(a,b)}{\partial a}=-2\sum_{i=1}^n(y_i-a-bx_i)

and

\displaystyle\frac{\partial E(a,b)}{\partial b}=-2\sum_{i=1}^n(y_i-a-bx_i)x_i

and therefore

\nabla E(a,b)=\left(\displaystyle\frac{\partial E(a,b)}{\partial a},\displaystyle\frac{\partial E(a,b)}{\partial b}\right)

and the update rule is

(a,b)^\prime=(a,b)-\eta\nabla E(a,b).

However, for faster convergence, it is sometimes useful to use a different \eta for each parameter. We then have

\displaystyle (a,b)^\prime=(a,b)-\left(\eta_a \frac{\partial E(a,b)}{\partial a},\eta_b \frac{\partial E(a,b)}{\partial b}\right)

*
* *

Let us have a look at a C++ implementation. First, as with the general regression, we see that the expressions for the gradient contain quite a few invariants we can factor out. Using the identities

\sum_{i=1}^n i=\frac{1}{2}n(n+1)

and

\sum_{i=1}^n i^2=\frac{1}{6}n(n+1)(2n+1),

we arrive at a constant-time computation of the gradient in the iteration step (but it does take linear time, once, to initialize the invariants). Here’s what the actual code looks like:

#include <cmath>
#include <cstddef>
#include <vector>

// float_x: the floating-point type used throughout (the original typedef
// is not shown; double is an assumption)
typedef double float_x;

// 0+1+...+m = m(m+1)/2
inline float_x sum_of_int(size_t m) { float_x f=(float_x)m; return f*(f+1)/2; }

// 0^2+1^2+...+m^2 = m(m+1)(2m+1)/6
inline float_x sum_of_sqr(size_t m) { float_x f=(float_x)m; return f*(f+1)*(2*f+1)/6; }

void gradient_descent( const std::vector<float_x> & y,
                       float_x & a,
                       float_x & b )
 {
  const size_t n=y.size();
  const float_x sx=sum_of_int(n-1);  // invariant: sum of the x_i
  const float_x sx2=sum_of_sqr(n-1); // invariant: sum of the x_i^2

  float_x sy=0;
  float_x sxy=0;

  for (size_t i=0;i<n;i++)
   {
    sy+=y[i];
    sxy+=i*y[i];
   }

  // initial guesses
  b=0;
  a=sy/n;

  // iterations
  float_x last_a,last_b;
  float_x sa=1e-3; // eta_a, for n=100
  float_x sb=1e-6; // eta_b, for n=100

  do
   {
    last_a=a;
    last_b=b;
    float_x db= -(sxy - a*sx - b*sx2);
    float_x da= -(sy - n*a - b*sx );

    a-=sa*da;
    b-=sb*db;
   }
  // stop when the relative change in both parameters is small
  while ( (std::fabs((a-last_a)/a) > 0.0001)
          ||
          (std::fabs((b-last_b)/b) > 0.0001));
 }

*
* *

Where are the factors of 2 in front of the partial derivatives? Why are they gone? Think about it for a while; I’ll come back to it in a moment.

*
* *

Let us try it out!

Let a=5 and b=3 (why not), and let us add some uniform random noise \varepsilon \sim \mathcal{U}(-1,1). With n=100, and not particularly good initial values for a and b, we have the following convergence:

[Figure: convergence of the estimates of a and b over the iterations]

So we see the parameters adjusting smoothly (which will not always be the case; more complex functions may yield surprises), and we arrive at very good estimates after \approx 250 iterations (which, in this case, are inexpensive, each being constant-time). After \approx 475 iterations, the algorithm bails out, having detected convergence.
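For the record, a test harness for this experiment could look like the following sketch (the seed and the output are my own additions; it assumes the gradient_descent function and float_x type above are in scope):

#include <iostream>
#include <random>
#include <vector>

int main()
 {
  const std::size_t n=100;
  std::mt19937 rng(1234);                              // arbitrary seed
  std::uniform_real_distribution<float_x> noise(-1,1); // eps ~ U(-1,1)

  // y_i = 5 + 3 x_i + eps, with x_i = 0, 1, ..., n-1
  std::vector<float_x> y(n);
  for (std::size_t i=0;i<n;i++)
   y[i]=5+3*(float_x)i+noise(rng);

  float_x a,b;
  gradient_descent(y,a,b);
  std::cout << "a=" << a << " b=" << b << std::endl;

  return 0;
 }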

*
* *

We said earlier that \eta might be function-specific. In the case of linear regression by gradient descent, the following \etas seem to work fine:

n        \eta_a   \eta_b
100      1e-3     1e-6
1000     1e-4     1e-9
10000    1e-5     1e-12
100000   1e-6     1e-15

So \eta_a and \eta_b are the reason why we can drop the 2s in front of the derivatives: the constant factor is absorbed into \eta_a and \eta_b, and so can be ignored.

*
* *

While this seems quite laborious for linear regression, gradient descent proves most useful in all kinds of optimization problems, including machine learning, where gradient descent (combined with back-propagation) is used to train neural networks of all kinds. Newton’s method for finding square roots (last week’s post) can even be thought of as a gradient-descent-type algorithm with an expression for \eta that depends explicitly on x.
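To make that last connection concrete (one possible reading, sketched here): Newton’s iteration for \sqrt{k} can be seen as descent on E(x)=(x^2-k)^2, whose derivative is E^\prime(x)=4x(x^2-k), with the x-dependent step \eta(x)=\frac{1}{8x^2}:

\displaystyle x^\prime=x-\frac{1}{8x^2}\cdot 4x(x^2-k)=x-\frac{x^2-k}{2x}=\frac{1}{2}\left(x+\frac{k}{x}\right).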

We will certainly discuss gradient-descent methods again in future posts.

Responses to Introduction to Gradient Descent

  1. Manish says:

    Can you please explain the same thing with a small example?

    • Ok. Let f(x)=x^2. If you want to find the minimum of f(x), you can either solve the equation analytically (and find trivially that x=0) or use gradient descent: start at some reasonable guess x_0 and iterate toward x_{min}. You will find

      \displaystyle \frac{\partial f(x)}{\partial x}=2x

      So you would iterate

      x_t = x_{t-1}-\eta\displaystyle \frac{\partial f(x_{t-1})}{\partial x} = x_{t-1}-2\eta x_{t-1},

      starting from a reasonable guess x_0. Of course, you can fold the 2\eta into an \eta^\prime, and get x_t=x_{t-1}-\eta^\prime x_{t-1}.

      Normally, you don’t get functions that are trivially solvable analytically. Usually, you get monsters with lots of internal parameters and non-separable expressions, and you have to use gradient descent.

      Let’s redo the x^2 example with a “more realistic” function: g(x)=(ax+b)^2+c. That’s still a basic quadratic, but now the minimum is not zero, it’s c, found at x=-\frac{b}{a}. Applying gradient descent:

      \displaystyle \frac{\partial g(x)}{\partial x}=2a(ax+b).

      We start at some x_0 (possibly x_0=0 if you have a priori assumptions that b should be small-ish), and we iterate:

      x_t=x_{t-1}-\eta\displaystyle \frac{\partial g(x_{t-1})}{\partial x}=x_{t-1}-2\eta a(ax_{t-1}+b).

      Note that, formally, we do not observe a, b, and c. We can, however, estimate

      \displaystyle \frac{\partial g(x)}{\partial x} \approx \frac{g(x+\varepsilon)-g(x)}{\varepsilon}

      for a small-ish \varepsilon>0. The choice of \eta is also very important and related to the (local) curvature of the function. If you pick it too large, you oscillate around the solution (and if the function is not very well behaved, it can even take you further away from the solution); if you pick it too small, it takes forever to get to the solution.
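      To illustrate with code (a quick sketch; \eta and \varepsilon are my own picks, and g hard-codes the a, b, c values used in the example below):

      #include <cstdio>

      // g(x) = (a x + b)^2 + c with a=2, b=1.2, c=3 (the values used below)
      double g(double x) { return (2*x+1.2)*(2*x+1.2)+3; }

      int main()
       {
        const double eps=1e-5; // finite-difference step
        const double eta=0.05; // gradient step: larger oscillates, smaller crawls
        double x=2;            // initial guess x_0

        for (int t=0; t<100; t++)
         {
          const double dg=(g(x+eps)-g(x))/eps; // approximate g'(x)
          x-=eta*dg;
         }

        std::printf("x = %f (exact minimum at -b/a = -0.6)\n",x);
        return 0;
       }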

      With the preceding example, with a=2, b=1.2, and c=3, and initial guess x_0=2, gradient descent produces the following path (in magenta):

      [Figure: example of gradient descent on g(x); the descent path is shown in magenta]

      I hope that helped a bit.
