Proximal Gradient Descent

A proximal algorithm is an algorithm for solving a convex optimization problem that uses the proximal operators of the objective terms. Proximal gradient descent gets its name because each iteration consists of a gradient step followed by a proximal mapping. There are three main benefits to applying proximal algorithms:

    1. They work under extremely general conditions, including cases where the functions are nonsmooth and extended real-valued.
    2. They can be fast, since there can be simple proximal operators for functions that are otherwise challenging to handle in an optimization problem.
    3. They are amenable to distributed optimization, so they can be used to solve very large-scale problems.

The proximal operator of a function $f$ is defined as $$ \operatorname{prox}_f(x) = \operatorname*{argmin}\left \{ f(u) + \frac{1}{2}\left \| u-x \right \|^2: u \in \mathbb{R}^n \right \}, \quad \forall x \in \mathbb{R}^n, $$ with the goal being to $$\text{minimize}\left \{ f(u) + h(u): u \in \mathbb{R}^n \right \},$$ where $h$ is a proper, lower semicontinuous convex function and $f$ is a smooth convex function on $\operatorname{dom}(h)$.
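To make the definition concrete, here is a small numerical sketch (an illustrative addition, not part of the original derivation). For $f(u) = \tau\left \| u \right \|_1$ the prox has a closed form, soft-thresholding, and since the problem separates across coordinates we can check a single coordinate against a brute-force minimization of the defining objective; the helper name `prox_l1` and the constants are my own choices.

import numpy as np

# Soft-thresholding: closed-form prox of f(u) = tau * ||u||_1 (a standard result).
def prox_l1(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

# Check one coordinate against the definition prox_f(x) = argmin_u { f(u) + 0.5*(u - x)^2 }.
tau, x0 = 0.5, 1.3
grid = np.linspace(-5, 5, 200001)
brute_force = grid[np.argmin(tau * np.abs(grid) + 0.5 * (grid - x0) ** 2)]
print(prox_l1(np.array([x0]), tau)[0], brute_force)   # both approximately 0.8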

Some important assumptions before we begin:

We assume that $f$ has an $L$-Lipschitz continuous gradient, i.e., $$\left \| \nabla f(x) - \nabla f(y) \right \| \leq L\left \| x-y \right \|, \quad \forall x, y \in \operatorname{dom}(h),$$ and hence for every $x, y \in \operatorname{dom}(h)$, $$ f(x) \leq \ell_f(x; y) + \frac{L}{2}\left \| x-y \right \|^2,$$ where $\ell_f(x; y) := f(y) + \left \langle \nabla f(y), x-y \right \rangle$.
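As a quick numerical sanity check of this quadratic upper bound, here is a small sketch I am adding for illustration; the quadratic $f(x) = \frac{1}{2}x^\top A x$ and all names below are my own assumptions, and for such an $f$ the Lipschitz constant of $\nabla f$ is the spectral norm of $A$.

import numpy as np

# Verify f(x) <= l_f(x; y) + (L/2)*||x - y||^2 for f(x) = 0.5 * x' A x.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M.T @ M                       # positive semidefinite, so f is convex
L = np.linalg.norm(A, 2)          # Lipschitz constant of grad f(x) = A x

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x, y = rng.standard_normal(5), rng.standard_normal(5)
lhs = f(x)
rhs = f(y) + grad(y) @ (x - y) + 0.5 * L * np.linalg.norm(x - y) ** 2
print(lhs <= rhs + 1e-12)         # True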

Recall that PGM with a constant prox stepsize $\lambda \in (0, 1/L]$ is recursive in nature and iterates according to: $$x_{k+1}=\operatorname{prox}_{\lambda h}(x_k-\lambda\nabla f(x_k)).$$
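In code, one PGM iteration is simply a gradient step on $f$ followed by the prox of $\lambda h$. Below is a minimal sketch of this loop; the function names (`proximal_gradient`, `grad_f`, `prox_h`) and the lasso example are my own choices, not taken from any particular library.

import numpy as np

def proximal_gradient(grad_f, prox_h, x0, lam, num_iters=100):
    # Iterate x_{k+1} = prox_{lam*h}(x_k - lam * grad_f(x_k)).
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = prox_h(x - lam * grad_f(x), lam)
    return x

# Example: lasso, with f(x) = 0.5*||Ax - b||^2 and h(x) = tau*||x||_1.
rng = np.random.default_rng(1)
A, b, tau = rng.standard_normal((20, 5)), rng.standard_normal(20), 0.1
grad_f = lambda x: A.T @ (A @ x - b)
prox_h = lambda v, lam: np.sign(v) * np.maximum(np.abs(v) - lam * tau, 0.0)
lam = 1.0 / np.linalg.norm(A, 2) ** 2      # lam <= 1/L with L = ||A||_2^2
x_hat = proximal_gradient(grad_f, prox_h, np.zeros(5), lam)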

Let's get started!

First, we will derive a single iteration of PGM and show that the objective of its prox subproblem is strongly convex.

$$ x_{k+1} = \operatorname*{argmin}_u\left \{ \lambda h(u) +\frac{1}{2}\left \| u-(x_k-\lambda\nabla f(x_k)) \right \|^2 \right \}$$

Expanding the squared norm, dividing by $\lambda$, and dropping terms that do not depend on $u$:

$$ x_{k+1} = \operatorname*{argmin}_u\left \{ f(x_k) + \left \langle \nabla f(x_k), u-x_k \right \rangle +h(u) + \frac{1}{2\lambda}\left \| u-x_k \right \|^2 \right \}$$

$$x_{k+1}= \operatorname*{argmin}_u \left\{\ell_f(u;x_k)+h(u) + \frac{1}{2\lambda}\|u-x_k\|^2 \right\}. $$

The subproblem objective is $\frac{1}{\lambda}$-strongly convex, since it adds the $\frac{1}{\lambda}$-strongly convex quadratic $\frac{1}{2\lambda}\left \| u-x_k \right \|^2$ to the convex function $\ell_f(u;x_k)+h(u)$. A closely related fact is that $\operatorname{prox}_{\lambda h}$ is firmly nonexpansive: writing $p=\operatorname{prox}_{\lambda h}(u)$ and $q=\operatorname{prox}_{\lambda h}(x)$, the optimality conditions give $\frac{1}{\lambda}(u-p)\in\partial h(p)$ and $\frac{1}{\lambda}(x-q)\in\partial h(q)$, and monotonicity of the subdifferential, $\left \langle s_p - s_q, p-q \right \rangle \geq 0$, yields

$$ \left \langle \operatorname{prox}_{\lambda h}(u) - \operatorname{prox}_{\lambda h}(x),\; u-x \right \rangle \geq \left \| \operatorname{prox}_{\lambda h}(u)-\operatorname{prox}_{\lambda h}(x) \right \|^{2}. $$
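This inequality is easy to spot-check numerically. The sketch below is an illustrative addition of my own; it uses the $\ell_1$ prox (soft-thresholding) and random points.

import numpy as np

# Check <prox(u) - prox(x), u - x> >= ||prox(u) - prox(x)||^2
# for prox_{lam*h} with h = ||.||_1, i.e. soft-thresholding.
def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(2)
lam = 0.3
for _ in range(1000):
    u, x = rng.standard_normal(10), rng.standard_normal(10)
    pu, px = prox_l1(u, lam), prox_l1(x, lam)
    assert (pu - px) @ (u - x) >= (pu - px) @ (pu - px) - 1e-12
print("firm nonexpansiveness held on all random pairs")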

Using the definition of $x_{k+1}$ as the minimizer of the prox subproblem and the strong convexity above, we obtain upon rearranging terms that for every $x \in \operatorname{dom}(h)$:

$$ h(x_{k+1}) \leq h(x) + \left \langle -\nabla f(x_k), x_{k+1}-x \right \rangle + \frac{1}{2\lambda}\left \| x_k-x \right \|^2 - \frac{1}{2\lambda}\left \| x_k-x_{k+1} \right \|^2 - \frac{1}{2\lambda}\left \| x_{k+1}-x \right \|^2 $$

Due to the Lipschitz continuity of $\nabla f$ (and $\lambda \leq 1/L$): $$ f(x_{k+1}) \leq f(x_k) + \left \langle \nabla f(x_k), x_{k+1}-x_k \right \rangle + \frac{1}{2\lambda}\left \| x_{k+1}-x_k \right \|^2 $$

Adding the two: $$ f(x_{k+1}) + h(x_{k+1}) \leq f(x_k) + h(x) + \left \langle \nabla f(x_k), x-x_k \right \rangle + \frac{1}{2\lambda}\left \| x_k-x \right \|^2 - \frac{1}{2\lambda}\left \| x_{k+1}-x \right \|^2 $$

Written in terms of $\ell_f$, the strong convexity of the subproblem (which underlies the first inequality above) reads, for every $u \in \operatorname{dom}(h)$: $$\ell_f(u;x_k)+h(u) + \frac{1}{2\lambda}\|u-x_k\|^2 \geq \ell_f(x_{k+1};x_k)+h(x_{k+1})+\frac{1}{2\lambda}\|x_{k+1}-x_k\|^2 + \frac{1}{2\lambda}\|u-x_{k+1}\|^2$$

Now let $x_*$ be a minimizer of $f+h$, and consider the majorizing function built at $x_k$: $$ g(u) = f(x_k)+\left \langle \nabla f(x_k), u-x_k \right \rangle + h(u)+\frac{1}{2\lambda}\left \| u-x_k \right \|^2, $$ whose minimizer, by the derivation above, is exactly $x_{k+1}$.

Since $g$ is $\frac{1}{2\lambda}$-strongly convex with modulus $\frac{1}{\lambda}$ and $x_{k+1} = \operatorname*{argmin}_u g(u)$, for every $x \in \operatorname{dom}(h)$ $$ g(x)-g(x_{k+1}) \geq \frac{1}{2\lambda}\left \| x-x_{k+1} \right \|^2 $$

By the descent lemma (Lipschitz continuity of $\nabla f$, with $\lambda \leq 1/L$), $$ g(x_{k+1}) = f(x_k) + \left \langle \nabla f(x_k),x_{k+1}-x_k \right \rangle + h(x_{k+1}) + \frac{1}{2\lambda}\left \| x_{k+1}-x_k \right \|^2 $$ $$ \geq f(x_{k+1})+h(x_{k+1}) = (f+h)(x_{k+1}) $$

On the other hand, convexity of $f$ gives $\ell_f(x;x_k) \leq f(x)$, so $$ g(x) \leq (f+h)(x)+\frac{1}{2\lambda}\left \| x-x_k \right \|^2 $$

Plugging these two bounds into the strong convexity inequality for $g$:

$$ (f+h)(x) + \frac{1}{2\lambda}\left \| x-x_k \right \|^2 -(f+h)(x_*) \geq (f+h)(x_{k+1}) - (f+h)(x_*) + \frac{1}{2\lambda}\left \| x-x_{k+1} \right \|^2, $$ i.e.

$$ (f+h)(x) + \frac{1}{2\lambda}\left \| x-x_k \right \|^2 - (f+h)(x_{k+1}) \geq \frac{1}{2\lambda}\left \| x-x_{k+1} \right \|^2 $$
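This per-iteration bound can also be verified numerically. The sketch below is a toy lasso instance of my own, not from the original derivation; it checks the inequality at every step of a PGM run for an arbitrary fixed reference point $x$.

import numpy as np

# Check (f+h)(x) + ||x - x_k||^2/(2*lam) - (f+h)(x_{k+1}) >= ||x - x_{k+1}||^2/(2*lam)
# along a PGM run for f(x) = 0.5*||Ax - b||^2, h(x) = tau*||x||_1.
rng = np.random.default_rng(3)
A, b, tau = rng.standard_normal((30, 8)), rng.standard_normal(30), 0.2
F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + tau * np.sum(np.abs(x))
grad_f = lambda x: A.T @ (A @ x - b)
prox_l1 = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

lam = 1.0 / np.linalg.norm(A, 2) ** 2     # stepsize lam <= 1/L
x_ref = rng.standard_normal(8)            # any fixed reference point x
xk = np.zeros(8)
for _ in range(50):
    xk1 = prox_l1(xk - lam * grad_f(xk), lam * tau)
    lhs = F(x_ref) + np.sum((x_ref - xk) ** 2) / (2 * lam) - F(xk1)
    rhs = np.sum((x_ref - xk1) ** 2) / (2 * lam)
    assert lhs >= rhs - 1e-6
    xk = xk1
print("per-iteration inequality held at every step")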

Setting $x = x_*$: $$(f+h)(x_*)+\frac{1}{2\lambda}||x_k-x_*||^2 \geq (f+h)(x_{k+1})+ \frac{1}{2\lambda}||x_{k+1}-x_*||^2$$

Rearranging gives $$ (f+h)(x_*)-(f+h)(x_{k+1}) \geq \frac{1}{2\lambda}\left(\left \| x_{k+1}-x_* \right \|^2 -\left \| x_k-x_* \right \|^2\right) $$

Summing this inequality for $n = 0, \dots, k-1$ gives: $$ \sum_{n=0}^{k-1}\left((f+h)(x_*)-(f+h)(x_{n+1})\right) \geq \frac{1}{2\lambda}\left(\left \| x_k-x_* \right \|^2-\left \| x_0-x_* \right \|^2\right) $$

Thus $$ \sum_{n=0}^{k-1}\left((f+h)(x_{n+1})-(f+h)(x_*)\right) \leq \frac{1}{2\lambda}\left \| x_0-x_* \right \|^2-\frac{1}{2\lambda}\left \| x_k-x_* \right \|^2 \leq \frac{1}{2\lambda}\left \| x_0-x_* \right \|^2 $$

Given the monotonicity of $(f+h)(x_n)$ in $n$ (the descent property, proved below), each term in the sum is at least $(f+h)(x_k)-(f+h)(x_*)$, so $$ k\left((f+h)(x_k)-(f+h)(x_*)\right) \leq \sum_{n=0}^{k-1}\left((f+h)(x_{n+1})-(f+h)(x_*)\right) \leq \frac{1}{2\lambda}\left \| x_0-x_* \right \|^2 $$

Equivalently, $$\sum_{i=1}^k (f+h)(x_i)-k(f+h)(x_*) \leq \frac{||x_0-x_*||^2}{2\lambda} $$

Proving PGM has the descent property:

$$(f+h)(x_k) \geq (f+h)(x_{k+1}), \forall k \geq 0 $$

Taking $x = x_k$ in the per-iteration inequality above gives

$$ (f+h)(x_k)-(f+h)(x_{k+1}) \geq \frac{1}{2\lambda}\left \| x_{k+1}-x_k \right \|^2 \geq 0 $$

Thus for all $k \geq 0$ $$(f+h)(x_{k+1}) \leq (f+h)(x_{k}) $$

Taking $x = x_*$ instead, and using $(f+h)(x_{k+1}) \geq (f+h)(x_*)$, also yields the relationship: $$ \left \| x_{k+1} -x_*\right \| \leq \left \| x_k-x_* \right \|$$

Finally, combining the descent property with the summed bound: $$ k\left((f+h)(x_k)-(f+h)(x_*)\right) \leq \sum_{n=0}^{k-1}\left((f+h)(x_{n+1})-(f+h)(x_*)\right) \leq \frac{1}{2\lambda}\left \| x_0-x_* \right \|^2 $$ Consequently $$ (f+h)(x_k)-(f+h)(x_*) \leq \frac{||x_0-x_*||^2}{2\lambda k} $$

Hence we obtain the $O(\frac{1}{k})$ convergence rate of the proximal gradient method.
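To see the rate empirically, the following sketch (again a toy lasso problem of my own, with $x_*$ approximated by a long PGM run) compares $(f+h)(x_k)-(f+h)(x_*)$ against the bound $\frac{||x_0-x_*||^2}{2\lambda k}$:

import numpy as np

# Empirical check of (f+h)(x_k) - (f+h)(x_*) <= ||x_0 - x_*||^2 / (2*lam*k) on a toy lasso;
# x_* is approximated by running PGM for many iterations.
rng = np.random.default_rng(4)
A, b, tau = rng.standard_normal((40, 10)), rng.standard_normal(40), 0.1
F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + tau * np.sum(np.abs(x))
grad_f = lambda x: A.T @ (A @ x - b)
prox_l1 = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

lam = 1.0 / np.linalg.norm(A, 2) ** 2
step = lambda x: prox_l1(x - lam * grad_f(x), lam * tau)

x0 = np.zeros(10)
x_star = x0.copy()
for _ in range(20000):                    # long run to approximate x_*
    x_star = step(x_star)

ratios = []
xk = x0.copy()
for k in range(1, 201):
    xk = step(xk)
    gap = F(xk) - F(x_star)
    bound = np.sum((x0 - x_star) ** 2) / (2 * lam * k)
    ratios.append(gap / bound)
print(max(ratios))                        # stays <= 1, consistent with the O(1/k) bound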


Code Example

Here we will employ proximal gradient descent with stochastic schemes. In general, when the loss function we are trying to minimize can be written in the form $\sum_{i=1}^{m}g_i(\theta)$, where each $g_i(\theta)$ is the loss at sample $i$, and training time is long, stochastic schemes should be considered. We will solve $$\min_{\theta \in \mathbb{R}^d} f(\theta), \qquad f(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left [ \log(1+\exp(x_i\theta)) -y_ix_i\theta \right ] + \lambda\left \| \theta \right \|_1$$ We decompose $f(\theta)$ into a convex and differentiable function $g$ and a convex but not differentiable function $h$: $$ g(\theta) = \frac{1}{m}\sum_{i=1}^{m}\log(1+\exp(x_i\theta)) $$ $$ h(\theta) = \frac{1}{m}\sum_{i=1}^{m} -y_ix_i\theta + \lambda\left \| \theta \right \|_1$$
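Since the code below relies on soft-thresholding, it is worth recording the closed form of the $\ell_1$ prox (a standard derivation, stated here for convenience rather than taken from the original post). For a threshold $t > 0$, the problem $\min_u\, t\left \| u \right \|_1 + \frac{1}{2}\left \| u-v \right \|^2$ separates across coordinates, and each scalar subproblem $\min_{u_j}\, t|u_j| + \frac{1}{2}(u_j-v_j)^2$ is solved by $$ \operatorname{prox}_{t\left \| \cdot \right \|_1}(v)_j = \operatorname{sign}(v_j)\max(|v_j|-t,\,0) = \max(v_j-t,\,0) + \min(v_j+t,\,0), $$ which is exactly the formula the `soft_threshold` function below implements.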

The data we are using is from the classic MNIST machine learning dataset. There are two classes, 0 and 1, and we have a total of 14,780 images; a training set of 12,665 and a test set of 2,115. Each image is 28x28. Each image is vectorized and stacked to form a training and test matrix, with the label appended to the last column of each matrix. Thus, our classifier will learn $\theta$ on the train set to predict the labels for the test set.
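The post does not show how the `train` and `test` matrices are built. One possible construction, an assumption on my part using scikit-learn's `fetch_openml` copy of MNIST and an arbitrary 85/15 split (so the exact counts and pixel scaling may differ from those quoted above), is:

import numpy as np
from sklearn.datasets import fetch_openml

# Build `train`/`test` matrices of 0/1 MNIST digits with the label in the last column.
# The split and the [0, 1] pixel scaling are choices made here, not from the original post.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data, mnist.target.astype(int)
mask = (y == 0) | (y == 1)
X01, y01 = X[mask] / 255.0, y[mask]

rng = np.random.default_rng(0)
idx = rng.permutation(len(X01))
split = int(0.85 * len(X01))
data = np.hstack([X01, y01.reshape(-1, 1)])
train, test = data[idx[:split]], data[idx[split:]]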

import numpy as np
from sklearn.metrics import accuracy_score

# `train` and `test` are matrices of vectorized images with the label in the last column.
x_train = train[:, :-1]
y_train = train[:, -1:]
x_test = test[:, :-1]
y_test = test[:, -1:]

x = x_train
y = y_train

def predict_labels(X, weights):
    # Sigmoid of the linear scores; returns probabilities in (0, 1).
    return 1 / (1 + np.exp(-X.dot(weights)))

def soft_threshold(x, t):
    # Elementwise soft-thresholding: sign(x) * max(|x| - t, 0).
    pos = np.maximum(x - t, 0)
    neg = np.minimum(x + t, 0)
    return pos + neg

def log_loss(X, theta):
    # g(theta) = (1/m) * sum_i log(1 + exp(x_i theta))
    return np.sum(np.log(1 + np.exp(X.dot(theta)))) / X.shape[0]

def h(X, y, lam=10, lr=0.01):
    # Threshold vector used in the prox step: -(1/m) * y^T X plus lam * lr.
    return (1 / len(X)) * (-y.T.dot(X)) + lam * lr

def evaluate_gradient(X, theta, y=None):
    # Gradient of g(theta): (1/m) * X^T sigma(X theta).
    return np.sum((X * np.exp(X.dot(theta))) / (1 + np.exp(X.dot(theta))), axis=0) / X.shape[0]
n = 100
lam = 10
lr = 0.01
max_iters = 1000
tol = 1e-3
N, D = x.shape
theta_current = np.zeros(shape=(D, 1))
losses = [log_loss(x, theta_current)]
thetas = [theta_current]

iterations = 1
while (losses[-1] > tol) and (iterations <= max_iters):
    theta_current = thetas[-1]

    # Stochastic: sample a mini-batch of n rows without replacement
    number_of_rows = x.shape[0]
    random_indices = np.random.choice(number_of_rows, size=n, replace=False)
    x_temp, y_temp = x[random_indices, :], y[random_indices, :]

    for it in range(n):
        # Proximal GD: gradient step on g, then soft-threshold (prox of the l1 part)
        grad = evaluate_gradient(x_temp, theta_current).reshape(-1, 1)
        theta_new_grad = theta_current - (lr * grad)
        theta_new = soft_threshold(theta_new_grad, h(x_temp, y_temp).T)
        theta_current = theta_new

    loss = log_loss(x, theta_current)
    losses.append(loss)
    thetas.append(theta_current)
    iterations += 1
# Full-batch variant for comparison
lam = 10
lr = 0.01
N, D = x.shape
theta_current = np.zeros(shape=(D, 1))
losses = [log_loss(x, theta_current)]
thetas = [theta_current]

for i in range(200):
    theta_current = thetas[-1]
    # Proximal GD on the full dataset
    grad = evaluate_gradient(x, theta_current).reshape(-1, 1)
    theta_new_grad = theta_current - (lr * grad)
    theta_new = soft_threshold(theta_new_grad, h(x, y).T)
    theta_current = theta_new

    loss = log_loss(x, theta_current)
    losses.append(loss)
    thetas.append(theta_current)
# Threshold the predicted probabilities at 0.5 before computing accuracy
train_acc = accuracy_score(y_train, (predict_labels(x, thetas[-1]) > 0.5).astype(int))
test_acc = accuracy_score(y_test, (predict_labels(x_test, thetas[-1]) > 0.5).astype(int))
train_acc, test_acc

Overall, this stochastic implementation achieves an accuracy of 93.76% on the training set.