### Tuan Anh Le

14 March 2018

The following draws from this paper and David Martinez Rubio’s master’s thesis.

Given a function $f: \mathbb R^d \to \mathbb R$, we want to solve the optimization problem \begin{align} \argmin_{x \in \mathbb R^d} f(x) \label{eq:argmin} \end{align} by the gradient descent procedure:

• Initialize $x_0$
• For $t = 0, 1, 2, \dotsc$:
• Update $x_{t + 1} = x_t - \alpha \nabla f(x_t)$

for some learning rate $\alpha$.
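The procedure above can be sketched in a few lines of Python. The quadratic objective and the gradient-norm stopping rule below are illustrative choices, not part of the text:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, num_steps=1000, tol=1e-10):
    """Run the update x_{t+1} = x_t - alpha * grad_f(x_t),
    stopping early once the gradient norm falls below tol."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - alpha * g
    return x

# Illustrative objective: f(x) = 0.5 * x^T A x with A positive definite,
# so the unique minimizer is x* = 0.
A = np.diag([1.0, 10.0])
grad_f = lambda x: A @ x
x_min = gradient_descent(grad_f, x0=[5.0, -3.0], alpha=1.0 / 10.0)
```

Here $\alpha = 1/10$ is $1/L$ for this objective, anticipating the step size used in Theorem 1 below.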

We assume that the optimization problem has a non-empty solution set $\mathcal X^*$ and we use $f^*$ to denote the corresponding optimal function value.

Definition 1 (Smoothness). A differentiable function $f$ is $L$-smooth if its gradient is $L$-Lipschitz continuous, i.e. there exists a finite $L > 0$ such that $\| \nabla f(y) - \nabla f(x) \| \leq L \| y - x \|$ for all $x, y \in \mathbb R^d$. This implies the quadratic upper bound \begin{align} f(y) \leq f(x) + (\nabla f(x))^T (y - x) + \frac{L}{2} \| y - x \|^2 \label{eq:smoothness} \end{align} for all $x, y \in \mathbb R^d$, which is the form we use below.

Definition 2 (Strong Convexity). A function $f$ is $\mu$-strongly convex if there exists a finite and positive $\mu$ such that \begin{align} f(y) \geq f(x) + (\nabla f(x))^T (y - x) + \frac{\mu}{2} \| y - x \|^2, \label{eq:strong-convexity} \end{align} for all $x, y \in \mathbb R^d$.

Definition 3 (Polyak-Łojasiewicz). A function $f$ satisfies the Polyak-Łojasiewicz (PL) inequality if the following holds for some $\mu > 0$: \begin{align} \frac{1}{2} \| \nabla f(x) \|^2 \geq \mu (f(x) - f^*) \label{eq:pl-inequality} \end{align} for all $x \in \mathbb R^d$.
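Note that the PL inequality does not require convexity. A commonly cited example is $f(x) = x^2 + 3 \sin^2 x$, which is non-convex but is reported in the literature to satisfy \eqref{eq:pl-inequality} with $\mu = 1/32$. A quick numerical check in Python (the grid is an illustrative choice):

```python
import numpy as np

# f(x) = x^2 + 3 sin^2(x): non-convex, global minimum f* = 0 at x = 0.
f = lambda x: x**2 + 3.0 * np.sin(x)**2
grad_f = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)  # uses sin(2x) = 2 sin x cos x
hess = lambda x: 2.0 + 6.0 * np.cos(2.0 * x)        # f''(x), negative e.g. at x = pi/2

x = np.linspace(-10.0, 10.0, 100001)
x = x[np.abs(x) > 1e-6]             # avoid 0/0 at the minimizer itself

# At each x, 0.5 * f'(x)^2 / (f(x) - f*) bounds the admissible PL constants.
ratio = 0.5 * grad_f(x)**2 / f(x)
mu_empirical = ratio.min()
```

On this grid `mu_empirical` stays well above $1/32$, while `hess` being negative at $x = \pi/2$ confirms the function is not convex.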

Proposition 1. Let $f$ be an $L$-smooth and $\mu$-strongly convex function. Then $f$ satisfies the PL inequality with constant $\frac{\mu^2}{4L}$.

Proof. For $x^*$ in the non-empty solution set $\mathcal X^*$ we use the strong convexity property in \eqref{eq:strong-convexity} to write \begin{align} f^* \geq f(x) + (\nabla f(x))^T (x^* - x) + \frac{\mu}{2} \| x^* - x \|^2. \end{align}

Now, since $f^* - f(x) \leq 0$, we can write \begin{align} 0 \geq f^* - f(x) \geq (\nabla f(x))^T (x^* - x) + \frac{\mu}{2} \| x^* - x \|^2, \end{align} from which we obtain \begin{align} (\nabla f(x))^T (x - x^*) \geq \frac{\mu}{2} \| x^* - x \|^2. \end{align} Using the Cauchy-Schwarz inequality, we have $\| \nabla f(x) \| \cdot \| x - x^* \| \geq (\nabla f(x))^T (x - x^*)$ and hence \begin{align} \| \nabla f(x) \| \cdot \| x - x^* \| \geq \frac{\mu}{2} \| x^* - x \|^2. \end{align} Dividing both sides by $\| x^* - x \|$ (the case $x = x^*$ is trivial, since both sides of \eqref{eq:temp} are then zero) and squaring, we get \begin{align} \| x^* - x \|^2 \leq \frac{4}{\mu^2} \| \nabla f(x) \|^2. \label{eq:temp} \end{align}

Now, using the smoothness property in \eqref{eq:smoothness}, we obtain \begin{align} f(x) &\leq f^* + (\nabla f(x^*))^T (x - x^*) + \frac{L}{2} \| x - x^* \|^2 \\ &= f^* + \frac{L}{2} \| x - x^* \|^2 && \text{(since } \nabla f(x^*) = 0 \text{)} \\ &\leq f^* + \frac{2L}{\mu^2} \| \nabla f(x) \|^2. && \text{(using \eqref{eq:temp})} \end{align} Rearranging, we obtain \begin{align} \frac{1}{2} \| \nabla f(x) \|^2 \geq \frac{\mu^2}{4L} (f(x) - f^*), \end{align} which means that $f$ satisfies the PL inequality with the constant $\frac{\mu^2}{4L}$. $\square$
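As a sanity check on Proposition 1, we can verify the derived inequality numerically on a strongly convex quadratic (the specific matrix and sample points below are illustrative choices): for $f(x) = \frac{1}{2} x^T A x$ with eigenvalues of $A$ in $[\mu, L]$, $f$ is $\mu$-strongly convex and $L$-smooth, so $\frac{1}{2} \| \nabla f(x) \|^2 \geq \frac{\mu^2}{4L} (f(x) - f^*)$ should hold everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic f(x) = 0.5 * x^T A x with eigenvalues mu = 1 and L = 10,
# hence 1-strongly convex, 10-smooth, with f* = 0 at x* = 0.
mu, L = 1.0, 10.0
A = np.diag([mu, L])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

# Check the PL inequality with the constant mu^2 / (4L) from Proposition 1
# at many random points.
pl_const = mu**2 / (4.0 * L)
points = rng.normal(size=(1000, 2)) * 10.0
pl_holds = all(0.5 * np.linalg.norm(grad_f(x))**2 >= pl_const * f(x)
               for x in points)
```

For this quadratic the inequality in fact holds with the larger constant $\mu$, so the check passes with room to spare; Proposition 1's constant $\frac{\mu^2}{4L}$ is not claimed to be tight.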

Theorem 1. Given an $L$-smooth function $f: \mathbb R^d \to \mathbb R$ that satisfies the PL inequality with constant $\mu$, if the learning rate is set to $\alpha = 1 / L$, gradient descent has the following convergence rate: \begin{align} f(x_t) - f^* \leq \left(1 - \frac{\mu}{L}\right)^t (f(x_0) - f^*). \label{eq:convergence} \end{align}

Proof. For any given $t \geq 0$, we have \begin{align} x_{t + 1} = x_t - \frac{1}{L} \nabla f(x_t). \label{eq:update} \end{align}

Using the smoothness property in \eqref{eq:smoothness}, we obtain \begin{align} f(x_{t + 1}) &\leq f(x_t) + (\nabla f(x_t))^T (x_{t + 1} - x_t) + \frac{L}{2} \| x_{t + 1} - x_t \|^2 \\ &= f(x_t) + (\nabla f(x_t))^T \left(-\frac{1}{L} \nabla f(x_t) \right) + \frac{L}{2} \left\| -\frac{1}{L} \nabla f(x_t) \right\|^2 && \text{(using \eqref{eq:update})} \\ &= f(x_t) - \frac{1}{2L} \left\| \nabla f(x_t) \right\|^2 && \text{(simplifying)} \\ &\leq f(x_t) - \frac{\mu}{L} (f(x_t) - f^*). && \text{(using the PL inequality \eqref{eq:pl-inequality})} \end{align} We rearrange to obtain \begin{align} f(x_{t + 1}) - f^* \leq \left(1 - \frac{\mu}{L}\right) (f(x_t) - f^*). \end{align} Applying this equation successively results in the desired convergence rate \eqref{eq:convergence}. $\square$
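The rate \eqref{eq:convergence} can also be checked numerically. The quadratic below is an illustrative test function: $f(x) = \frac{1}{2} x^T A x$ is $L$-smooth with $L = \lambda_{\max}(A)$ and satisfies the PL inequality with $\mu = \lambda_{\min}(A)$, since $A^2 \succeq \mu A$.

```python
import numpy as np

# Test function: f(x) = 0.5 * x^T A x, with f* = 0 at x* = 0.
mu, L = 1.0, 10.0
A = np.diag([mu, L])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

x = np.array([5.0, -3.0])
gap0 = f(x)                        # f(x_0) - f*, since f* = 0
bound_holds = True
for t in range(1, 101):
    x = x - grad_f(x) / L          # gradient descent with alpha = 1/L
    bound = (1.0 - mu / L)**t * gap0
    bound_holds &= f(x) <= bound + 1e-12
```

After the loop, `f(x)` has decayed to near zero and the geometric bound $(1 - \mu/L)^t (f(x_0) - f^*)$ holds at every iteration.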

Corollary 1. Given an $L$-smooth and $\mu$-strongly convex function $f: \mathbb R^d \to \mathbb R$, if the learning rate is set to $\alpha = 1 / L$, gradient descent has the following convergence rate: \begin{align} f(x_t) - f^* \leq \left(1 - \frac{\mu^2}{4L^2}\right)^t (f(x_0) - f^*). \label{eq:convergence-2} \end{align}

Proof. By Proposition 1, $f$ satisfies the PL inequality with constant $\frac{\mu^2}{4L}$; substituting this constant for $\mu$ in Theorem 1 yields \eqref{eq:convergence-2}. $\square$
