Tuan Anh Le

Convergence of Gradient Descent

14 March 2018

The following draws from this paper and David Martinez Rubio’s master’s thesis.

Given a function $f: \mathbb R^d \to \mathbb R$, we want to solve the optimization problem \begin{align} \argmin_{x \in \mathbb R^d} f(x) \label{eq:argmin} \end{align} by the gradient descent procedure \begin{align} x_{t + 1} = x_t - \alpha \nabla f(x_t), \end{align} for some learning rate $\alpha > 0$.
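
Below is a minimal sketch of this procedure in Python (assuming NumPy; the quadratic test function and its constants are illustrative choices, not part of the note):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, num_steps):
    """Iterate the update x_{t+1} = x_t - alpha * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - alpha * grad_f(x)
    return x

# Illustrative test problem (assumed): f(x) = 0.5 * x^T A x with A positive definite,
# whose unique minimizer is x* = 0.
A = np.array([[3.0, 0.0],
              [0.0, 1.0]])
grad_f = lambda x: A @ x                      # gradient of the quadratic
L = np.max(np.linalg.eigvalsh(A))             # smoothness constant (largest eigenvalue)
x_T = gradient_descent(grad_f, x0=[1.0, -2.0], alpha=1.0 / L, num_steps=50)
print(x_T)                                    # close to the minimizer x* = 0
```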

We assume that the optimization problem \eqref{eq:argmin} has a non-empty solution set and we use $f^*$ to denote the corresponding optimal function value.

Definition 1 (Smoothness). A function $f: \mathbb R^d \to \mathbb R$ is $L$-smooth if its gradient $\nabla f$ is $L$-Lipschitz continuous for some finite and positive $L$, which implies that \begin{align} f(y) \leq f(x) + (\nabla f(x))^T (y - x) + \frac{L}{2} \| y - x \|^2 \label{eq:smoothness} \end{align} for all $x, y \in \mathbb R^d$.

Definition 2 (Strong Convexity). A function $f: \mathbb R^d \to \mathbb R$ is $\mu$-strongly convex if there exists a finite and positive $\mu$ such that \begin{align} f(y) \geq f(x) + (\nabla f(x))^T (y - x) + \frac{\mu}{2} \| y - x \|^2, \label{eq:strong-convexity} \end{align} for all $x, y \in \mathbb R^d$.

Definition 3 (Polyak-Łojasiewicz). A function $f: \mathbb R^d \to \mathbb R$ satisfies the Polyak-Łojasiewicz (PL) inequality if the following holds for some $\mu > 0$: \begin{align} \frac{1}{2} \| \nabla f(x) \|^2 \geq \mu (f(x) - f^*) \label{eq:pl-inequality} \end{align} for all $x \in \mathbb R^d$.
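
As a concrete illustration of these three definitions, consider the standard (assumed, for illustration) example $f(x) = \frac{1}{2} x^T A x$ with $A$ symmetric positive definite. Since \begin{align} f(y) = f(x) + (\nabla f(x))^T (y - x) + \frac{1}{2} (y - x)^T A (y - x) \end{align} and $\lambda_{\min}(A) \| y - x \|^2 \leq (y - x)^T A (y - x) \leq \lambda_{\max}(A) \| y - x \|^2$, this $f$ is $L$-smooth with $L = \lambda_{\max}(A)$ and $\mu$-strongly convex with $\mu = \lambda_{\min}(A)$. It also satisfies the PL inequality directly: here $x^* = 0$ and $f^* = 0$, so \begin{align} \frac{1}{2} \| \nabla f(x) \|^2 = \frac{1}{2} x^T A^2 x \geq \frac{\lambda_{\min}(A)}{2} x^T A x = \lambda_{\min}(A) (f(x) - f^*). \end{align}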

Proposition 1. Let $f: \mathbb R^d \to \mathbb R$ be an $L$-smooth and $\mu$-strongly convex function. Then $f$ satisfies the PL inequality with constant $\frac{\mu^2}{4L}$.

Proof. For $x^*$ in the non-empty solution set, we use the strong convexity property in \eqref{eq:strong-convexity} to write \begin{align} f^* \geq f(x) + (\nabla f(x))^T (x^* - x) + \frac{\mu}{2} \| x^* - x \|^2. \end{align}

Now, since $f^* \leq f(x)$, we can write \begin{align} 0 \geq f^* - f(x) \geq (\nabla f(x))^T (x^* - x) + \frac{\mu}{2} \| x^* - x \|^2, \end{align} from which we obtain \begin{align} (\nabla f(x))^T (x - x^*) \geq \frac{\mu}{2} \| x^* - x \|^2. \end{align} Using the Cauchy-Schwarz inequality, we have $\| \nabla f(x) \| \cdot \| x - x^* \| \geq (\nabla f(x))^T (x - x^*)$ and hence \begin{align} \| \nabla f(x) \| \cdot \| x - x^* \| \geq \frac{\mu}{2} \| x^* - x \|^2. \end{align} Simplifying further, we get \begin{align} \| x^* - x \|^2 \leq \frac{4}{\mu^2} \| \nabla f(x) \|^2. \label{eq:temp} \end{align}

Now, using the smoothness property in \eqref{eq:smoothness}, we obtain \begin{align} f(x) &\leq f^* + (\nabla f(x^*))^T (x - x^*) + \frac{L}{2} \| x - x^* \|^2 \\ &= f^* + \frac{L}{2} \| x - x^* \|^2 && \text{(since } \nabla f(x^*) = 0 \text{)} \\ &\leq f^* + \frac{2L}{\mu^2} \| \nabla f(x) \|^2. && \text{(using \eqref{eq:temp})} \end{align} Rearranging, we obtain \begin{align} \frac{1}{2} \| \nabla f(x) \|^2 \geq \frac{\mu^2}{4L} (f(x) - f^*), \end{align} which means that $f$ satisfies the PL inequality with the constant $\frac{\mu^2}{4L}$.
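
A quick numerical sanity check of this constant, assuming a randomly generated strongly convex quadratic (an illustrative sketch, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed test problem: f(x) = 0.5 * x^T A x with A symmetric positive definite,
# so mu = lambda_min(A), L = lambda_max(A), and f* = 0 at x* = 0.
Q = rng.normal(size=(5, 5))
A = Q.T @ Q + np.eye(5)
eigs = np.linalg.eigvalsh(A)
mu, L = eigs[0], eigs[-1]

f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x
f_star = 0.0

# Check the PL inequality with the constant mu^2 / (4L) from Proposition 1
# at randomly sampled points.
for _ in range(1000):
    x = rng.normal(size=5)
    lhs = 0.5 * np.linalg.norm(grad_f(x)) ** 2
    rhs = (mu ** 2 / (4 * L)) * (f(x) - f_star)
    assert lhs >= rhs
print("PL inequality with constant mu^2 / (4L) holds at all sampled points")
```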

Theorem 1. Given an $L$-smooth function $f$ that satisfies the PL inequality with constant $\mu$, if the learning rate is set to $\alpha = \frac{1}{L}$, gradient descent has the following convergence rate: \begin{align} f(x_t) - f^* \leq \left(1 - \frac{\mu}{L}\right)^t (f(x_0) - f^*). \label{eq:convergence} \end{align}

Proof. For any given $t$, we have \begin{align} x_{t + 1} = x_t - \frac{1}{L} \nabla f(x_t). \label{eq:update} \end{align}

Using the smoothness property in \eqref{eq:smoothness}, we obtain \begin{align} f(x_{t + 1}) &\leq f(x_t) + (\nabla f(x_t))^T (x_{t + 1} - x_t) + \frac{L}{2} \| x_{t + 1} - x_t \|^2 \\ &= f(x_t) + (\nabla f(x_t))^T \left(-\frac{1}{L} \nabla f(x_t) \right) + \frac{L}{2} \left\| -\frac{1}{L} \nabla f(x_t) \right\|^2 && \text{(using \eqref{eq:update})} \\ &= f(x_t) - \frac{1}{2L} \left\| \nabla f(x_t) \right\|^2 && \text{(simplifying)} \\ &\leq f(x_t) - \frac{\mu}{L} (f(x_t) - f^*). && \text{(using the PL inequality \eqref{eq:pl-inequality})} \end{align} We rearrange to obtain \begin{align} f(x_{t + 1}) - f^* \leq \left(1 - \frac{\mu}{L}\right) (f(x_t) - f^*). \end{align} Applying this inequality successively results in the desired convergence rate \eqref{eq:convergence}.
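
The rate \eqref{eq:convergence} can also be checked numerically; the following sketch assumes the same kind of quadratic test problem as above, for which $f^* = 0$:

```python
import numpy as np

# Assumed test problem: f(x) = 0.5 * x^T A x, which is L-smooth with
# L = lambda_max(A) and satisfies the PL inequality with mu = lambda_min(A).
A = np.diag([10.0, 2.0, 0.5])
eigs = np.linalg.eigvalsh(A)
mu, L = eigs[0], eigs[-1]

f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

x = np.array([1.0, 1.0, 1.0])
gap_0 = f(x)                               # f(x_0) - f*, with f* = 0
for t in range(1, 51):
    x = x - (1.0 / L) * grad_f(x)          # gradient descent with alpha = 1/L
    bound = (1.0 - mu / L) ** t * gap_0    # right-hand side of the convergence rate
    assert f(x) <= bound + 1e-12
print("f(x_t) - f* stays below (1 - mu/L)^t * (f(x_0) - f*) for all checked t")
```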

Corollary 1. Given an $L$-smooth and $\mu$-strongly convex function $f$, if the learning rate is set to $\alpha = \frac{1}{L}$, gradient descent has the following convergence rate: \begin{align} f(x_t) - f^* \leq \left(1 - \frac{\mu^2}{4L^2}\right)^t (f(x_0) - f^*). \label{eq:convergence-2} \end{align}

Proof. By Proposition 1, $f$ satisfies the PL inequality with constant $\frac{\mu^2}{4L}$; substituting this constant for $\mu$ in Theorem 1 gives \eqref{eq:convergence-2}.
