# Convergence of Gradient Descent

14 March 2018

The following draws from this paper and David Martinez Rubio’s master’s thesis.

Given a function $$f: \mathbb R^d \to \mathbb R$$, we want to solve the optimization problem \begin{align} \argmin_{x \in \mathbb R^d} f(x) \label{eq:argmin} \end{align} by the gradient descent procedure:

• Initialize $$x_0$$
• For $$t = 0, 1, 2, \dotsc$$:
• Update $$x_{t + 1} = x_t - \alpha \nabla f(x_t)$$

for some learning rate $$\alpha$$.
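The procedure above can be sketched in a few lines of Python. The quadratic objective below is an illustrative choice of mine, not one from these notes:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, num_steps):
    """Run gradient descent from x0 with a fixed learning rate alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - alpha * grad_f(x)  # x_{t+1} = x_t - alpha * grad f(x_t)
    return x

# Example: minimize f(x) = ||x||^2 / 2, whose gradient is simply x.
x_min = gradient_descent(lambda x: x, x0=[3.0, -4.0], alpha=0.1, num_steps=200)
```

Here each iterate is contracted by a factor $$1 - \alpha$$, so `x_min` ends up very close to the minimizer at the origin.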

We assume that the optimization problem has a non-empty solution set $$\mathcal X^*$$ and we use $$f^*$$ to denote the corresponding optimal function value.

Definition 1 (Smoothness). A function $$f$$ is $$L$$-smooth if its gradient is $$L$$-Lipschitz continuous for some finite and positive $$L$$; in particular, this implies the quadratic upper bound \begin{align} f(y) \leq f(x) + (\nabla f(x))^T (y - x) + \frac{L}{2} \| y - x \|^2 \label{eq:smoothness} \end{align} for all $$x, y \in \mathbb R^d$$.


Definition 2 (Strong Convexity). A function $$f$$ is $$\mu$$-strongly convex if there exists a finite and positive $$\mu$$ such that \begin{align} f(y) \geq f(x) + (\nabla f(x))^T (y - x) + \frac{\mu}{2} \| y - x \|^2, \label{eq:strong-convexity} \end{align} for all $$x, y \in \mathbb R^d$$.

Definition 3 (Polyak-Łojasiewicz). A function $$f$$ satisfies the Polyak-Łojasiewicz (PL) inequality if the following holds for some $$\mu > 0$$: \begin{align} \frac{1}{2} \| \nabla f(x) \|^2 \geq \mu (f(x) - f^*) \label{eq:pl-inequality} \end{align} for all $$x \in \mathbb R^d$$.
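Note that the PL inequality does not require convexity. A quick numerical sanity check on a standard non-convex example, $$f(x) = x^2 + 3 \sin^2(x)$$, whose minimum value is $$f^* = 0$$ at $$x = 0$$ (the grid range and tolerance below are my choices):

```python
import numpy as np

# f(x) = x^2 + 3 sin^2(x) is non-convex but satisfies the PL inequality.
f = lambda x: x**2 + 3 * np.sin(x)**2
grad_f = lambda x: 2 * x + 3 * np.sin(2 * x)

xs = np.linspace(-10, 10, 100001)
xs = xs[f(xs) > 1e-12]                # avoid dividing by ~0 at the minimizer
ratios = 0.5 * grad_f(xs)**2 / f(xs)  # (1/2)||grad f||^2 / (f - f*)
mu_hat = ratios.min()                 # empirical PL constant on this grid
```

A strictly positive `mu_hat` is numerical evidence that the PL inequality holds on this interval even though $$f$$ is not convex.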

Proposition 1. Let $$f$$ be an $$L$$-smooth and $$\mu$$-strongly convex function. Then $$f$$ satisfies the PL inequality with constant $$\frac{\mu^2}{4L}$$.

Proof. For $$x^*$$ in the non-empty solution set $$\mathcal X^*$$ we use the strong convexity property in \eqref{eq:strong-convexity} to write \begin{align} f^* \geq f(x) + (\nabla f(x))^T (x^* - x) + \frac{\mu}{2} \| x^* - x \|^2. \end{align}

Now, since $$f^* - f(x) \leq 0$$, we can write \begin{align} 0 \geq f^* - f(x) \geq (\nabla f(x))^T (x^* - x) + \frac{\mu}{2} \| x^* - x \|^2, \end{align} from which we obtain \begin{align} (\nabla f(x))^T (x - x^*) \geq \frac{\mu}{2} \| x^* - x \|^2. \end{align} Using the Cauchy-Schwarz inequality, we have $$\| \nabla f(x) \| \cdot \|x - x^* \| \geq (\nabla f(x))^T (x - x^*)$$ and hence \begin{align} \| \nabla f(x) \| \cdot \|x - x^* \| \geq \frac{\mu}{2} \| x^* - x \|^2. \end{align} Dividing both sides by $$\| x^* - x \|$$ (the case $$x = x^*$$ is trivial) and squaring, we get \begin{align} \| x^* - x \|^2 \leq \frac{4}{\mu^2} \| \nabla f(x) \|^2. \label{eq:temp} \end{align}

Now, using the smoothness property in \eqref{eq:smoothness}, we obtain \begin{align} f(x) &\leq f^* + (\nabla f(x^*))^T (x - x^*) + \frac{L}{2} \| x - x^* \|^2 \\
&= f^* + \frac{L}{2} \| x - x^* \|^2 && \text{(since } \nabla f(x^*) = 0 \text{)} \\
&\leq f^* + \frac{2L}{\mu^2} \| \nabla f(x) \|^2. && \text{(using \eqref{eq:temp})} \end{align} Rearranging, we obtain \begin{align} \frac{1}{2} \| \nabla f(x) \|^2 \geq \frac{\mu^2}{4L} (f(x) - f^*), \end{align} which means that $$f$$ satisfies the PL inequality with the constant $$\frac{\mu^2}{4L}$$. $$\square$$
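Proposition 1 is easy to check numerically. For a quadratic $$f(x) = \frac{1}{2} x^T A x$$ with $$A$$ symmetric positive definite, $$\mu$$ and $$L$$ are the extreme eigenvalues of $$A$$ and $$f^* = 0$$; the random matrix and sample size below are my choices for the check:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T + np.eye(5)            # symmetric positive definite
eigs = np.linalg.eigvalsh(A)       # ascending eigenvalues
mu, L = eigs[0], eigs[-1]          # strong convexity / smoothness constants

margins = []
for _ in range(1000):
    x = rng.standard_normal(5)
    lhs = 0.5 * np.linalg.norm(A @ x)**2          # (1/2) ||grad f(x)||^2
    rhs = (mu**2 / (4 * L)) * (0.5 * x @ A @ x)   # PL bound from Prop. 1
    margins.append(lhs - rhs)
min_margin = min(margins)          # non-negative if the PL bound holds
```

For quadratics the PL inequality in fact holds with the larger constant $$\mu$$, so the $$\frac{\mu^2}{4L}$$ bound holds with room to spare.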

Theorem 1. Given an $$L$$-smooth function $$f: \mathbb R^d \to \mathbb R$$ that satisfies the PL inequality with constant $$\mu$$, if the learning rate is set to $$\alpha = 1 / L$$, gradient descent has the following convergence rate: \begin{align} f(x_t) - f^* \leq \left(1 - \frac{\mu}{L}\right)^t (f(x_0) - f^*). \label{eq:convergence} \end{align}

Proof. For any given $$t \geq 0$$, we have \begin{align} x_{t + 1} = x_t - \frac{1}{L} \nabla f(x_t). \label{eq:update} \end{align}

Using the smoothness property in \eqref{eq:smoothness}, we obtain \begin{align} f(x_{t + 1}) &\leq f(x_t) + (\nabla f(x_t))^T (x_{t + 1} - x_t) + \frac{L}{2} \| x_{t + 1} - x_t \|^2 \\
&= f(x_t) + (\nabla f(x_t))^T \left(-\frac{1}{L} \nabla f(x_t) \right) + \frac{L}{2} \left\| -\frac{1}{L} \nabla f(x_t) \right\|^2 && \text{(using \eqref{eq:update})} \\
&= f(x_t) - \frac{1}{2L} \left\| \nabla f(x_t) \right\|^2 && \text{(simplifying)} \\
&\leq f(x_t) - \frac{\mu}{L} (f(x_t) - f^*). && \text{(using the PL inequality \eqref{eq:pl-inequality})} \end{align} We rearrange to obtain \begin{align} f(x_{t + 1}) - f^* \leq \left(1 - \frac{\mu}{L}\right) (f(x_t) - f^*). \end{align} Applying this equation successively results in the desired convergence rate \eqref{eq:convergence}. $$\square$$
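The rate \eqref{eq:convergence} can also be verified empirically. The sketch below runs gradient descent with $$\alpha = 1/L$$ on a random strongly convex quadratic (my choice of test function; for quadratics the PL constant equals $$\mu$$, so the bound applies directly):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)             # SPD, so f is mu-strongly convex
eigs = np.linalg.eigvalsh(A)
mu, L = eigs[0], eigs[-1]

f = lambda x: 0.5 * x @ A @ x       # f* = 0, attained at x = 0
x = rng.standard_normal(4)
f0 = f(x)
ok = True
for t in range(1, 101):
    x = x - (1.0 / L) * (A @ x)     # gradient step with alpha = 1/L
    ok = ok and f(x) <= (1 - mu / L)**t * f0 + 1e-12
```

After the loop, `ok` remains true: every iterate respects the bound $$f(x_t) - f^* \leq (1 - \mu/L)^t (f(x_0) - f^*)$$.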

Corollary 1. Given an $$L$$-smooth and $$\mu$$-strongly convex function $$f: \mathbb R^d \to \mathbb R$$, if the learning rate is set to $$\alpha = 1 / L$$, gradient descent has the following convergence rate: \begin{align} f(x_t) - f^* \leq \left(1 - \frac{\mu^2}{4L^2}\right)^t (f(x_0) - f^*). \label{eq:convergence-2} \end{align}

Proof. By Proposition 1, $$f$$ satisfies the PL inequality with constant $$\frac{\mu^2}{4L}$$. Substituting this constant for $$\mu$$ in Theorem 1 yields \eqref{eq:convergence-2}. $$\square$$
