It is frequently the case when dealing with high dimensional datasets that there are more variables than observations, yet we expect only a small fraction of the variables to be truly meaningful. To handle such situations and improve the generalization of the resulting model, it is common to employ regularization techniques, which are also used to reduce overfitting in other settings. The most common examples are L1 (lasso) and L2 (ridge) regularization. The lasso, in particular, produces sparse solutions: it sends some of the β coefficients (those least related to the response variable) exactly to zero. The strength of this penalization is controlled by the λ parameter. A larger λ gives the penalty greater weight, and thus produces more zeros among the β coefficients.
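To see the effect of λ concretely, here is a small hypothetical sketch (not part of the original analysis) using scikit-learn's Lasso on synthetic data; note that scikit-learn calls the penalty weight alpha rather than λ:

import numpy as np
from sklearn.linear_model import Lasso

# Toy data: 50 observations, 20 features, only 3 of which drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.normal(size=50)

# A larger penalty leaves fewer nonzero coefficients.
for alpha in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    print(alpha, np.sum(model.coef_ != 0), "nonzero coefficients")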

However, many regression problems with high dimensional data involve covariates that have a natural group structure, and it is desirable to have all coefficients within a group become nonzero (or zero) simultaneously. For example, in biostatistics it is common to deal with genetic datasets in which predictors are grouped into genetic pathways. In stock market analysis one can group companies from the same business segment. In climate data one can group different regions, and so on. The lasso, however, produces sparsity at the level of individual coefficients, not groups. A more general leading example is when we have qualitative factors among our predictors. We typically code their levels using a set of dummy variables or contrasts, and would want to include or exclude this group of variables together. For such scenarios we have a technique called the group lasso, which induces sparsity in the weights while respecting the natural grouped structure of the predictors.
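As a toy illustration of the qualitative-factor case (made-up data, not from the dataset used later), a single categorical predictor expands into a set of dummy columns that naturally form one group:

import pandas as pd

# Hypothetical 4-level factor; under the group lasso its dummy columns are
# treated as a single group that enters or leaves the model together.
region = pd.Series(["north", "south", "east", "west", "north"], name="region")
dummies = pd.get_dummies(region, prefix="region")
print(list(dummies.columns))
# ['region_east', 'region_north', 'region_south', 'region_west']  -> one group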

In the image above, we can see a visual depiction of the contours of the L1, group lasso, and L2 penalties respectively (source: Ming Yuan and Yi Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(1), 2006). Specifically, we can see how the group lasso (middle image) incorporates elements of both the L1 and L2 penalties.

Let's get into the math of the group lasso. Consider a linear regression model involving $J$ groups of covariates, where for $j = 1, \dots, J$ the vector $Z_j \in \mathbb{R}^{p_j}$ represents the covariates in group $j$. Our goal is to predict a real valued response $Y \in \mathbb{R}$ based on the collection of covariates $(Z_1, \dots, Z_J)$. A linear model for the regression function $\mathbb{E}(Y|Z)$ is $\sum_{j=1}^{J}Z_{j}^{T}\theta_j$, where $\theta_j \in \mathbb{R}^{p_j}$ represents a group of $p_j$ regression coefficients.

Given a collection of $N$ samples $\left \{ (y_i, z_{i1}, z_{i2}, \dots, z_{iJ}) \right \}_{i=1}^{N}$, the group lasso solves the convex problem: $$ \underset{\theta_j\in\mathbb{R}^{p_j}}{\min}\;\frac{1}{2}\left \| y-\sum_{j=1}^{J}Z_{j}\theta_j \right \|_{2}^{2} + \lambda\sum_{j=1}^{J}\left \| \theta_j \right \|_2 $$ where $Z_j$ is now the $N \times p_j$ matrix of group-$j$ covariates and $\left \| \theta_j \right \|_2$ is the Euclidean norm of the vector $\theta_j$.
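As a minimal sketch of what this objective computes, assuming hypothetical names Z_list and theta_list for the group matrices and coefficient blocks:

import numpy as np

def group_lasso_objective(y, Z_list, theta_list, lam):
    # 0.5 * squared error of the grouped linear model plus lambda * sum of group norms.
    fitted = sum(Z @ theta for Z, theta in zip(Z_list, theta_list))
    return 0.5 * np.sum((y - fitted) ** 2) + lam * sum(np.linalg.norm(t) for t in theta_list)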

This is a group generalization of the lasso, with the properties:

  • depending on the value of $\lambda \geq 0$, either the entire vector $\theta_j$ will be zero, or (in general) all of its elements will be nonzero
  • when $p_j=1$ for every group (each group is a single variable), we have $\left \| \theta_j \right \|_2 = \left | \theta_j \right |$, so if all of the groups are singletons the optimization problem reduces to the ordinary lasso (a quick numerical check follows below).
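A quick numerical check of the singleton-group case, with made-up numbers:

import numpy as np

theta = [np.array([1.7]), np.array([-0.3]), np.array([0.0])]  # three singleton groups
group_penalty = sum(np.linalg.norm(t) for t in theta)
l1_penalty = sum(abs(t[0]) for t in theta)
print(group_penalty, l1_penalty)  # both equal 2.0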

We can solve the group lasso problem using block coordinate descent. Here is a derivation showing that we can solve it iteratively, one group at a time, for $j = 1, \dots, J$:

We start with the subproblem in the $j$-th block of coefficients, holding all other blocks fixed: $$ \underset{\theta_j\in\mathbb{R}^{p_j}}{\min}\;\frac{1}{2}\left \| r_j-Z_j\theta_j \right \|_2^2 + \lambda\left \| \theta_j \right \|_2 $$ where $r_j = y - \sum_{k\neq j}Z_k\theta_k$ is the partial residual.

Setting the subgradient of the objective with respect to $\theta_j$ to zero gives $$ -Z_j^T\Big(y-\sum_{k=1}^{J}Z_k\theta_k\Big)+\lambda\,\hat{s}_j = 0 $$ where $\hat{s}_j = \theta_j/\left \| \theta_j \right \|_{2}$ when $\theta_j \neq 0$, and otherwise $\hat{s}_j$ is any vector with $\left \| \hat{s}_j \right \|_2 \leq 1$.

Using the definition of $r_j$, when $\theta_j \neq 0$ we can solve for $\theta_j$: $$ \theta_j = \Big(Z_j^TZ_j + \frac{\lambda}{\left \| \theta_j \right \|_2}I\Big)^{-1}Z_j^Tr_j $$

In the special case where $Z_j$ has orthonormal columns ($Z_j^TZ_j = I$), this implicit equation has the closed-form blockwise soft-thresholding solution $$ \theta_j = \Big(1 - \frac{\lambda}{\left \| Z_{j}^{T}r_j \right \|_2}\Big)_{+} Z_{j}^{T}r_j $$ which sets the entire block to zero whenever $\left \| Z_j^T r_j \right \|_2 \leq \lambda$.
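Here is a small self-contained sketch (with made-up numbers) that applies this blockwise soft-thresholding update to an orthonormal $Z_j$ and verifies the stationarity condition from the derivation:

import numpy as np

def block_soft_threshold(a, lam):
    # (1 - lam / ||a||)_+ * a : zero the whole block when its norm is <= lam
    norm = np.linalg.norm(a)
    return np.zeros_like(a) if norm <= lam else (1 - lam / norm) * a

rng = np.random.default_rng(1)
Z, _ = np.linalg.qr(rng.normal(size=(20, 3)))   # Z has orthonormal columns
r = Z @ np.array([2.0, -1.0, 0.5])              # so Z.T @ r = [2, -1, 0.5]
lam = 0.5

theta = block_soft_threshold(Z.T @ r, lam)
# Verify the stationarity condition -Z^T(r - Z theta) + lam * theta/||theta|| = 0
grad = -Z.T @ (r - Z @ theta) + lam * theta / np.linalg.norm(theta)
print(np.allclose(grad, 0))  # True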


We will now implement block coordinate proximal gradient descent to solve the group lasso problem presented above. We will be using a Kaggle Boston housing dataset, where our response (dependent) variable is price and our predictors are a combination of categorical (number of bedrooms, number of bathrooms, sale status) and continuous (price per square foot) features. Our first step is to one-hot encode the categorical variables, so that each factor becomes a group of dummy columns that the penalty treats as a unit. We then scale the continuous columns to $[0, 1]$ and use a $\lambda$ value of 0.012.

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

bh = pd.read_csv("boston_housing.csv")

Create dummy variables

rooms = pd.get_dummies(bh.Bedrooms, prefix='Bedrooms')
baths = pd.get_dummies(bh.Bathrooms, prefix='Bath')
status = pd.get_dummies(bh.Status, prefix='Status')
df = pd.concat([bh, rooms, baths, status], axis=1)
df.drop(['Bathrooms', 'Bedrooms', 'Status'], axis=1, inplace=True)

Normalize data

scaler = MinMaxScaler()
df['Price'] = scaler.fit_transform(df['Price'].values.reshape(-1,1))
df['PriceSF'] = scaler.fit_transform(df['PriceSF'].values.reshape(-1,1))
df.head()
Price PriceSF Bedrooms_0 Bedrooms_1 Bedrooms_2 Bedrooms_3 Bedrooms_4 Bath_1 Bath_2 Bath_3 Bath_4 Status_1 Status_2 Status_3
0 0.389410 0.280785 0 0 0 1 0 0 0 1 0 1 0 0
1 0.188751 0.108646 0 0 0 0 1 0 0 1 0 1 0 0
2 0.262731 0.142556 0 0 0 0 1 0 0 1 0 1 0 0
3 0.447175 0.211009 0 0 0 0 1 0 0 0 1 1 0 0
4 0.042260 0.061014 0 0 0 1 0 1 0 0 0 1 0 0
y = df['Price'].values
features = df.drop('Price', axis=1)
# Group labels for reference: one continuous variable and three dummy-coded factors
groups = ['PriceSF', 'Bedrooms', 'Bath', 'Status']

Next we define the group soft-thresholding function used in the proximal gradient updates, the group lasso loss, and a helper that collects the coefficients of all groups except the one currently being updated.

def soft_threshold(x, gamma):
    # Group (block) soft-thresholding: the proximal operator of gamma * ||x||_2.
    # The whole block shrinks toward zero, and becomes exactly zero when its
    # Euclidean norm is no larger than gamma.
    norm = np.linalg.norm(x)
    if norm <= gamma:
        return np.zeros_like(x)
    return (1 - gamma / norm) * x
        
def loss(b, l=0.012):
    # Group lasso objective: 0.5 * squared error plus lambda times the sum of
    # the Euclidean norms of each coefficient group (groups given by `slices`).
    f_x = np.sum(0.5 * (y - np.dot(features.values, b)) ** 2)
    penalty = l * sum(np.linalg.norm(b[start:stop]) for start, stop in slices)
    return f_x + penalty

def create_b_not_vector(b, start, stop):
    # Coefficients of every group except the one occupying columns [start, stop).
    return np.concatenate([b[:start], b[stop:]])
# Column index ranges for the four groups: PriceSF, Bedrooms, Bath, Status
slices = [(0, 1), (1, 6), (6, 10), (10, 13)]
b = np.random.rand(len(features.columns))
lamda = 0.012
losses = [loss(b)]
for iteration in range(200):
    for start, stop in slices:
        Z = features[features.columns[start:stop]]
        Z_not = features.loc[:, [feat for feat in features.columns if feat not in Z.columns]]
        b_not = create_b_not_vector(b, start, stop)
        # Partial residual: the part of y not explained by the other groups
        r = y - np.dot(Z_not, b_not)
        # Proximal gradient step on this block with step size t = 1 / L, where
        # L is the largest eigenvalue of Z_j^T Z_j (a Lipschitz constant for the gradient)
        t = 1.0 / np.linalg.norm(Z.values, 2) ** 2
        grad = -np.dot(Z.values.T, r - np.dot(Z.values, b[start:stop]))
        b[start:stop] = soft_threshold(b[start:stop] - t * grad, t * lamda)

    losses.append(loss(b))
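As a quick sanity check on the run, one can plot the objective values recorded in losses; a minimal sketch, assuming matplotlib is available (it is not used elsewhere in this post):

import matplotlib.pyplot as plt

plt.plot(losses)
plt.xlabel("sweep over the groups")
plt.ylabel("group lasso objective")
plt.title("Block coordinate descent on the Boston housing data")
plt.show()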

And there you have it: a from-scratch block coordinate (proximal gradient) descent solver for the group lasso.