A proximal algorithm is an algorithm for solving a convex optimization problem that uses the proximal operators of the objective terms. The proximal gradient method (PGM) studied here is the canonical example: each iteration takes a gradient step on the smooth term followed by a proximal mapping of the nonsmooth term. There are three main benefits to the application of proximal algorithms:
- They work under extremely general conditions, including cases where the functions are nonsmooth and extended real-valued
- They can be fast, since there can be simple proximal operators for functions that are otherwise challenging to handle in an optimization problem
- They are amenable to distributed optimization, so they can be used to solve very large scale problems
The proximal operator of $f$ is defined as $$ \operatorname{prox}_f(x) = \operatorname{argmin}\left \{ f(u) + \frac{1}{2}\left \| u-x \right \|^2: u \in \mathbb{R}^n \right \}, \quad \forall x \in \mathbb{R}^n, $$
with the overall goal being to solve $$\min\left \{ f(u) + h(u): u \in \mathbb{R}^n \right \},$$
where $h$ is a proper, lower semicontinuous convex function and $f$ is a smooth convex function on $\operatorname{dom}(h)$.
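For concreteness, before stating the assumptions, here is a minimal sketch (not part of the derivation below) of the proximal operator of the scaled $\ell_1$ norm $f(u) = \tau\left\| u \right\|_1$, whose closed form is componentwise soft-thresholding, $\operatorname{prox}_f(x)_i = \operatorname{sign}(x_i)\max(|x_i|-\tau, 0)$; the test point and the value of `tau` are arbitrary choices, and the direct numerical minimization is only there as a cross-check.

import numpy as np
from scipy.optimize import minimize

def prox_l1(x, tau):
    # Closed-form prox of tau*||.||_1: componentwise soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

x = np.array([1.5, -0.2, 0.7, -3.0])
tau = 0.5

# Direct numerical evaluation of argmin_u { tau*||u||_1 + 0.5*||u - x||^2 }.
obj = lambda u: tau * np.sum(np.abs(u)) + 0.5 * np.sum((u - x) ** 2)
numerical = minimize(obj, x0=np.zeros_like(x), method="Nelder-Mead").x

print(prox_l1(x, tau))   # [ 1.  -0.   0.2 -2.5]
print(numerical)         # should be close to the closed form above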
Some important assumptions before we begin:
We assume that $f$ has an $L$-Lipschitz continuous gradient, i.e., $$\left \| \nabla f(x) - \nabla f(y) \right \| \leq L\left \| x-y \right \|, \quad \forall x, y \in \operatorname{dom}(h),$$ and hence for every $x, y \in \operatorname{dom}(h)$, $$ f(x) \leq \ell_f(x; y) + \frac{L}{2}\left \| x-y \right \|^2, $$ where $\ell_f(x; y) := f(y) + \left \langle \nabla f(y), x-y \right \rangle$.
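For completeness, the second inequality (the descent lemma) follows from the first by writing the gap as an integral along the segment from $y$ to $x$ and applying the Cauchy-Schwarz and Lipschitz bounds:
$$ f(x) - \ell_f(x; y) = \int_0^1 \left \langle \nabla f(y+t(x-y)) - \nabla f(y),\, x-y \right \rangle \, dt \leq \int_0^1 tL\left \| x-y \right \|^2 \, dt = \frac{L}{2}\left \| x-y \right \|^2. $$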
Recall that PGM with a constant prox stepsize $\lambda \in (0, 1/L]$ is recursive in nature and iterates according to $$x_{k+1}=\operatorname{prox}_{\lambda h}(x_k-\lambda\nabla f(x_k)).$$
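In code, each iteration is just a gradient step on $f$ followed by the proximal map of $\lambda h$. A minimal sketch of the recursion, assuming the user supplies a gradient oracle `grad_f` and a prox oracle `prox_h` (both names are illustrative):

import numpy as np

def proximal_gradient(grad_f, prox_h, x0, step, n_iters=100):
    # Iterates x_{k+1} = prox_{step*h}(x_k - step * grad_f(x_k)).
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = prox_h(x - step * grad_f(x), step)
    return x

For instance, with `grad_f = lambda v: A.T @ (A @ v - b)` and `prox_h = lambda v, t: prox_l1(v, t * tau)` (using `prox_l1` from the sketch above), this recursion is exactly ISTA for the lasso with $\ell_1$ weight `tau`.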
First, we will expand a single iteration of PGM and show that the objective it minimizes is strongly convex.
$$ x_{k+1} = \operatorname{argmin}_u\left \{ h(u) +\frac{1}{2\lambda}\left \| u-(x_k-\lambda\nabla f(x_k)) \right \|^2 \right \} $$
$$ x_{k+1} = \operatorname{argmin}_u\left \{ f(x_k) + \left \langle \nabla f(x_k), u-x_k \right \rangle +h(u) + \frac{1}{2\lambda}\left \| u-x_k \right \|^2 \right \} $$
$$ x_{k+1}= \operatorname{argmin}_u \left\{\ell_f(u;x_k)+h(u) + \frac{1}{2\lambda}\left\|u-x_k\right\|^2 \right\} $$
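These three formulations have the same minimizer: expanding the square in the first one gives
$$ \frac{1}{2\lambda}\left \| u-(x_k-\lambda\nabla f(x_k)) \right \|^2 = \frac{1}{2\lambda}\left \| u-x_k \right \|^2 + \left \langle \nabla f(x_k), u-x_k \right \rangle + \frac{\lambda}{2}\left \| \nabla f(x_k) \right \|^2, $$
and the last term, like $f(x_k)$, does not depend on $u$, so all three argmins coincide.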
We now verify the strong convexity claim. A function $\psi$ is $\mu$-strongly convex when $\psi - \frac{\mu}{2}\left \| \cdot \right \|^{2}$ is convex; for differentiable $\psi$ this is the monotonicity condition $\left \langle \nabla \psi(u) - \nabla \psi(x), u - x \right \rangle \geq \mu \left \| u-x \right \|^{2}$. The objective defining $x_{k+1}$ is the sum of the convex function $\ell_f(\cdot\,;x_k)+h(\cdot)$ and the quadratic $\frac{1}{2\lambda}\left \| \cdot -x_k \right \|^{2}$, so it is $\frac{1}{\lambda}$-strongly convex. In particular, since $x_{k+1}$ is its minimizer, its value at any point $u$ exceeds its value at $x_{k+1}$ by at least $\frac{1}{2\lambda}\left \| u-x_{k+1} \right \|^{2}$.
Using the definition of $x_{k+1}$ and this strong convexity, we obtain upon rearranging terms (with the arbitrary point denoted $x$) that:
$$ h(x_{k+1}) \leq h(x) + \left \langle -\nabla f(x_k), x_{k+1}-x \right \rangle + \frac{1}{2\lambda}\left \| x_k-x \right \|^2 - \frac{1}{2\lambda}\left \| x_k-x_{k+1} \right \|^2 - \frac{1}{2\lambda}\left \| x_{k+1}-x \right \|^2 $$
Due to the Lipschitz continuity of $\nabla f$ (the descent lemma, together with $\lambda \leq 1/L$): $$ f(x_{k+1}) \leq f(x_k) + \left \langle -\nabla f(x_k), x_k-x_{k+1} \right \rangle + \frac{1}{2\lambda}\left \| x_k-x_{k+1} \right \|^2 $$
Adding the two: $$ f(x_{k+1}) + h(x_{k+1}) \leq f(x_k) + h(x) + \left \langle \nabla f(x_k), x-x_k \right \rangle - \frac{1}{2\lambda}\left \| x_{k+1}-x \right \|^2 + \frac{1}{2\lambda}\left \| x_k-x \right \|^2 $$
Written out using the definition of $\ell_f(u;x_k)$, the strong-convexity inequality we rearranged above reads $$\ell_f(u;x_k)+h(u) + \frac{1}{2\lambda}\left\|u-x_k\right\|^2 \geq \ell_f(x_{k+1};x_k)+h(x_{k+1})+\frac{1}{2\lambda}\left\|x_{k+1}-x_k\right\|^2 + \frac{1}{2\lambda}\left\|u-x_{k+1}\right\|^2$$
Similarly, since $\lambda \leq 1/L$, for any $x, y \in \operatorname{int}(\operatorname{dom}(f))$ the descent lemma can be written as $$ f(x) \leq f(y) + \left \langle \nabla f(y), x-y \right \rangle + \frac{1}{2\lambda}\left \| x-y \right \|^2. $$ Using it, we now show that for every $x$ $$ (f+h)(x)-(f+h)(x_{k+1}) \geq \frac{1}{2\lambda}\left \| x-x_{k+1} \right \|^2 - \frac{1}{2\lambda}\left \| x-x_k \right \|^2. $$
Consider $$ g(u) = f(x_k)+\left \langle \nabla f(x_k), u-x_k \right \rangle + h(u)+\frac{1}{2\lambda}\left \| u-x_k \right \|^2, $$ the ($\frac{1}{\lambda}$-strongly convex) objective minimized by the PGM subproblem at $x_k$.
Since $x_{k+1} = \operatorname{argmin}_u g(u)$, strong convexity gives $$ g(x)-g(x_{k+1}) \geq \frac{1}{2\lambda}\left \| x-x_{k+1} \right \|^2 $$
Since, by the descent lemma above, $$ g(x_{k+1}) = f(x_k) + \left \langle \nabla f(x_k),x_{k+1}-x_k \right \rangle + \frac{1}{2\lambda}\left \| x_{k+1}-x_k \right \|^2 + h(x_{k+1}) $$ $$ \geq f(x_{k+1})+h(x_{k+1}) = (f+h)(x_{k+1}), $$
this implies that $$ g(x)-(f+h)(x_{k+1}) \geq \frac{1}{2\lambda}\left \| x-x_{k+1}\right \|^2 $$
Plugging the definition of $g(x)$ into the above inequality:
$$ f(x_k) + \left \langle \nabla f(x_k), x-x_k \right \rangle + h(x)+\frac{1}{2\lambda}\left \| x-x_k \right \|^2 -(f+h)(x_{k+1}) \geq \frac{1}{2\lambda}\left \| x-x_{k+1} \right \|^2$$
Since $f$ is convex, $f(x) \geq \ell_f(x;x_k) = f(x_k) + \left \langle \nabla f(x_k), x-x_k \right \rangle$, and therefore
$$ (f+h)(x)-(f+h)(x_{k+1}) \geq \frac{1}{2\lambda}\left \| x-x_{k+1} \right \|^2 -\frac{1}{2\lambda}\left \| x-x_k \right \|^2 $$
Setting $x = x_*$, a minimizer of $f+h$, this becomes $$(f+h)(x_*)+\frac{1}{2\lambda}\left\|x_k-x_*\right\|^2 \geq (f+h)(x_{k+1})+ \frac{1}{2\lambda}\left\|x_{k+1}-x_*\right\|^2,$$
or equivalently, for every $k \geq 0$,
$$ (f+h)(x_*)-(f+h)(x_{k+1}) \geq \frac{1}{2\lambda}\left \| x_*-x_{k+1} \right \|^2 -\frac{1}{2\lambda}\left \| x_*-x_k \right \|^2 $$
Summing this over $n = 0, \dots, k-1$ (with the iteration index $k$ replaced by $n$) and telescoping, we obtain: $$ \sum_{n=0}^{k-1}\left((f+h)(x_*)-(f+h)(x_{n+1})\right) \geq \frac{1}{2\lambda}\left \| x_*-x_k \right \|^2-\frac{1}{2\lambda}\left \| x_*-x_0 \right \|^2 $$
Thus $$ \sum_{n=0}^{k-1}\left((f+h)(x_{n+1})-(f+h)(x_*)\right) \leq \frac{1}{2\lambda}\left \| x_*-x_0 \right \|^2-\frac{1}{2\lambda}\left \| x_*-x_k \right \|^2 \leq \frac{1}{2\lambda}\left \| x_*-x_0 \right \|^2 $$
Given the monotonicity of $(f+h)(x_n)$ in $n$ (proved below), $$ k\left((f+h)(x_k)-(f+h)(x_*)\right) \leq \sum_{n=0}^{k-1}\left((f+h)(x_{n+1})-(f+h)(x_*)\right) \leq \frac{1}{2\lambda}\left \| x_*-x_0 \right \|^2 $$
Thus $$\sum_{i=1}^k (f+h)(x_i)-k(f+h)(x_*) \leq \frac{||x_0-x_*||^2}{2\lambda} $$
Proving PGM has the descent property:
$$(f+h)(x_k) \geq (f+h)(x_{k+1}), \forall k \geq 0 $$
Applying the strong-convexity inequality with $u = x_k$ (so that $\ell_f(x_k;x_k)=f(x_k)$ and the $\frac{1}{2\lambda}\left\| u-x_k \right\|^2$ term vanishes) gives $$ f(x_k)+h(x_k) \geq \ell_f(x_{k+1};x_k)+h(x_{k+1})+\frac{1}{\lambda}\left\| x_{k+1}-x_k \right\|^2. $$
Combining this with the descent lemma $f(x_{k+1}) \leq \ell_f(x_{k+1};x_k)+\frac{1}{2\lambda}\left\| x_{k+1}-x_k \right\|^2$ (valid since $\lambda \leq 1/L$), it follows that: $$ (f+h)(x_k)-(f+h)(x_{k+1}) \geq \frac{1}{2\lambda}\left\| x_{k+1}-x_k \right\|^2 \geq 0$$
Thus for all k $\geq 0$ $$(f+h)(x_{k+1}) \leq (f+h)(x_{k}) $$
Along the way we also obtain the relationship $$ \left \| x_{k+1} -x_*\right \| \leq \left \| x_k-x_* \right \|, $$ since $(f+h)(x_*)-(f+h)(x_{k+1}) \leq 0$ forces the right-hand side of the per-iteration inequality above (with $x = x_*$) to be nonpositive.
Finally, given the above: $$ k\left((f+h)(x_k)-(f+h)(x_*)\right) \leq \sum_{n=0}^{k-1}\left((f+h)(x_{n+1})-(f+h)(x_*)\right) \leq \frac{1}{2\lambda}\left \| x_0-x_* \right \|^2 $$ Consequently $$ (f+h)(x_k)-(f+h)(x_*) \leq \frac{1}{2\lambda k}\left\|x_0-x_*\right\|^2 $$
Hence we obtain the $O(\frac{1}{k})$ convergence rate of PGM in function value.
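As a quick numerical sanity check of this rate (not part of the analysis above), the following sketch runs PGM on a small synthetic lasso problem, $f(x)=\frac{1}{2}\|Ax-b\|^2$ and $h(x)=\tau\|x\|_1$, with stepsize $\lambda = 1/L$; the problem dimensions, random seed, and $\tau$ are arbitrary choices, and the final iterate is used as a stand-in for $x_*$.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
b = rng.standard_normal(40)
tau = 0.1                                          # l1 weight (chosen arbitrarily)

F = lambda v: 0.5 * np.sum((A @ v - b) ** 2) + tau * np.sum(np.abs(v))
grad_f = lambda v: A.T @ (A @ v - b)
L = np.linalg.eigvalsh(A.T @ A).max()              # Lipschitz constant of grad f
lam = 1.0 / L                                      # prox stepsize lambda = 1/L

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x0 = np.zeros(20)
x = x0.copy()
history = []
for _ in range(5000):
    x = soft(x - lam * grad_f(x), lam * tau)       # PGM step: prox_{lam*h}(x - lam*grad f(x))
    history.append(F(x))

F_star = history[-1]                               # last iterate used as a proxy for F(x_*)
for k in (10, 100, 1000):
    bound = np.sum((x0 - x) ** 2) / (2 * lam * k)  # ||x_0 - x_*||^2 / (2*lam*k), with x as the x_* proxy
    print(k, history[k - 1] - F_star, "<=", bound)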
Here we will employ proximal gradient descent with a stochastic scheme. In general, when the loss function we are trying to minimize can be written in the form $\sum_{i=1}^{m}g_i(\theta)$, where each $g_i(\theta)$ is the loss on sample $i$, and training time is long, stochastic schemes should be considered. We will solve $$\underset{\theta \in \mathbb{R}^d}{\min}\; f(\theta), \qquad f(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left [ \log(1+\exp(x_i\theta)) -y_ix_i\theta \right ] + \lambda\left \| \theta \right \|_1.$$ We decompose $f(\theta)$ into a convex and differentiable function $g$ and a convex but not (everywhere) differentiable function $h$: $$ g(\theta) = \frac{1}{m}\sum_{i=1}^{m}\log(1+\exp(x_i\theta)), $$ $$ h(\theta) = \frac{1}{m}\sum_{i=1}^{m} -y_ix_i\theta + \lambda\left \| \theta \right \|_1.$$
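For reference, writing $\sigma(z) = 1/(1+e^{-z})$ for the sigmoid, the gradient of the smooth part that the code below evaluates is $$ \nabla g(\theta) = \frac{1}{m}\sum_{i=1}^{m} \frac{\exp(x_i\theta)}{1+\exp(x_i\theta)}\, x_i^{T} = \frac{1}{m}X^{T}\sigma(X\theta), $$ where $X \in \mathbb{R}^{m \times d}$ is the matrix whose rows are the $x_i$.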
The data we are using is from the classic MNIST machine learning dataset. There are two classes, 0 and 1, and we have a total of 14,780 images; a training set of 12,665 and a test set of 2,115. Each image is 28x28. Each image is vectorized and stacked to form a training and test matrix, with the label appended to the last column of each matrix. Thus, our classifier will learn $\theta$ on the train set to predict the labels for the test set.
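The code below does not show how the `train` and `test` matrices are constructed; one possible way to build them is sketched here, assuming the OpenML `mnist_784` download via scikit-learn and its usual ordering in which the first 60,000 rows are the official training split (the exact counts may differ from the numbers quoted above if a different source is used).

import numpy as np
from sklearn.datasets import fetch_openml

# Hypothetical data preparation: download MNIST, keep digits 0 and 1,
# scale pixels to [0, 1], and append the label as the last column.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X_all = mnist.data / 255.0
y_all = mnist.target.astype(int)

keep = (y_all == 0) | (y_all == 1)
is_train = np.zeros(len(y_all), dtype=bool)
is_train[:60000] = True                      # conventional MNIST train/test ordering

train = np.hstack([X_all[keep & is_train], y_all[keep & is_train].reshape(-1, 1)])
test = np.hstack([X_all[keep & ~is_train], y_all[keep & ~is_train].reshape(-1, 1)])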
import numpy as np
from sklearn.metrics import accuracy_score

# Split the stacked matrices into features and labels (the label is the last column).
x_train = train[:, :-1]
y_train = train[:, -1:]
x_test = test[:, :-1]
y_test = test[:, -1:]

x = x_train
y = y_train

def predict_labels(X, weights):
    # Sigmoid of the linear scores: predicted probabilities of class 1.
    return 1 / (1 + np.exp(-X.dot(weights)))

def soft_threshold(x, t):
    # Componentwise soft-thresholding: shrink x towards zero by t.
    pos = np.maximum(x - t, 0)
    neg = np.minimum(x + t, 0)
    return pos + neg

def log_loss(X, theta):
    # Smooth part g(theta) of the objective: mean of log(1 + exp(x_i theta)).
    return np.sum(np.log(1 + np.exp(X.dot(theta)))) / X.shape[0]

def h(X, y, lam=10, lr=0.01):
    # Threshold vector used in the soft-thresholding step: gradient of the linear
    # term of h plus the scaled l1 weight (defaults mirror the hyperparameters below).
    return (1 / len(X)) * (-y.T.dot(X)) + lam * lr

def evaluate_gradient(X, theta, y=None):
    # Gradient of g(theta): (1/m) * X^T sigmoid(X theta), written row-wise.
    return np.sum((X * np.exp(X.dot(theta))) / (1 + np.exp(X.dot(theta))), axis=0) / X.shape[0]
# Stochastic proximal gradient descent
n = 100            # minibatch size
lam = 10           # l1 regularization weight
lr = 0.01          # stepsize
max_iters = 1000
tol = 1e-3

N, D = x.shape
theta_current = np.zeros(shape=(D, 1))
losses = [log_loss(x, theta_current)]
thetas = [theta_current]
iterations = 1
while (losses[-1] > tol) and (iterations < max_iters):
    theta_current = thetas[-1]
    # Stochastic: sample a minibatch of n rows without replacement
    number_of_rows = x.shape[0]
    random_indices = np.random.choice(number_of_rows, size=n, replace=False)
    x_temp, y_temp = x[random_indices, :], y[random_indices, :]
    for it in range(n):
        # Proximal GD step on the minibatch: gradient step on g, then soft-thresholding
        grad = evaluate_gradient(x_temp, theta_current).reshape(-1, 1)
        theta_new_grad = theta_current - (lr * grad)
        theta_new = soft_threshold(theta_new_grad, h(x_temp, y_temp).T)
        theta_current = theta_new
    loss = log_loss(x, theta_current)
    losses.append(loss)
    thetas.append(theta_current)
    iterations += 1
# Full-batch proximal gradient descent (for comparison)
n = 100
lam = 10
lr = 0.01
max_iters = 1000
tol = 1e-5

N, D = x.shape
theta_current = np.zeros(shape=(D, 1))
loss_1 = log_loss(x, theta_current)
losses = [loss_1]
thetas = [theta_current]
iterations = 1
# while losses[-1] > tol:
for i in range(200):
    theta_current = thetas[-1]
    # Proximal GD step on the full dataset
    grad = evaluate_gradient(x, theta_current).reshape(-1, 1)
    theta_new_grad = theta_current - (lr * grad)
    theta_new = soft_threshold(theta_new_grad, h(x, y).T)
    theta_current = theta_new
    loss = log_loss(x, theta_current)
    losses.append(loss)
    thetas.append(theta_current)
    # iterations += 1
predict_labels(x, thetas[-1])  # predicted probabilities on the training set
# Threshold the predicted probabilities at 0.5 to get class labels before scoring.
accuracy_score(y_test.ravel(), (predict_labels(x_test, thetas[-1]) > 0.5).astype(int).ravel())
Overall, this implementation achieves an accuracy of 93.76% on the test set.