There are two fundamental goals in statistical learning: achieving high prediction accuracy and identifying the relevant predictive variables. Variable selection is particularly important when the true underlying model has a sparse representation. Note that 'sparse' here should not be confused with techniques for sparse data, i.e. data matrices containing many zero entries; sparsity refers to the estimated parameter vector, which is forced to contain many zeros. A sparse representation typically arises from two common situations. First, the number of predictors may exceed the number of observations; such high-dimensional settings are now commonplace in operational research. Second, some data points may behave differently from the majority of the data; such atypical points are called outliers in statistics and anomalies in machine learning. Traditional methods for linear regression such as the ordinary least squares (OLS) estimator fail when these problems arise: the OLS solution cannot be computed when predictors outnumber observations, and it becomes unreliable in the presence of outliers.

A regression vector is sparse if only some of its components are nonzero while the remaining ones are set exactly to zero; the variables with zero coefficients drop out of the model, thereby inducing variable selection.


Here we compare several penalized regression techniques: Ridge Regression, Lasso Regression, Adaptive Lasso Regression, and Elastic Net Regression. All of them shrink the coefficients, and all but ridge can set coefficients exactly to zero, thereby inducing feature sparsity. For each method we tune the penalty parameter(s), fit the model with the optimal parameters to obtain the coefficients, and report the mean squared prediction error (MSE) on the test dataset.
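For reference, the four estimators minimize the residual sum of squares plus a penalty on the coefficient vector β. These are the standard textbook formulations (ignoring scikit-learn's sample-size scaling conventions); λ plays the role of alpha, α the role of l1_ratio, and the adaptive-lasso weights are built from an initial estimate such as OLS:

$\hat\beta^{\text{ridge}} = \arg\min_\beta \; \|y - X\beta\|_2^2 + \lambda \sum_j \beta_j^2$

$\hat\beta^{\text{lasso}} = \arg\min_\beta \; \|y - X\beta\|_2^2 + \lambda \sum_j |\beta_j|$

$\hat\beta^{\text{alasso}} = \arg\min_\beta \; \|y - X\beta\|_2^2 + \lambda \sum_j \hat w_j |\beta_j|, \qquad \hat w_j = |\hat\beta_j^{\text{init}}|^{-\gamma}$

$\hat\beta^{\text{EN}} = \arg\min_\beta \; \|y - X\beta\|_2^2 + \lambda \Big( \alpha \sum_j |\beta_j| + \tfrac{1-\alpha}{2} \sum_j \beta_j^2 \Big)$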

In this demonstration our goal is to predict the concentration of carbon monoxide (CO) in mg/m³. For this purpose, we have the following information provided by air quality sensors:

  • Benzene (C6H6) concentration in μg/m³
  • Non-Methane Hydrocarbons (NMHC) concentration in μg/m³
  • Nitrogen Oxides (NOx) concentration in ppb
  • Nitrogen Dioxide (NO2) concentration in μg/m³
  • Ozone (O3) concentration in μg/m³
  • Temperature (T) in degrees Celsius
  • Relative Humidity (RH)
  • Absolute Humidity (AH)
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from scipy.stats import uniform
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, lasso_path, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from numpy import arange

# Load the train/test splits
train_data = pd.read_csv('train.air.csv')
test_data = pd.read_csv('test.air.csv')

# Standardize: fit the scaler on the training data only and reuse it for the test data
scaler = StandardScaler()
standardized_train = scaler.fit_transform(train_data)
standardized_test = scaler.transform(test_data)
train = pd.DataFrame(standardized_train, columns=train_data.columns)
test = pd.DataFrame(standardized_test, columns=test_data.columns)

# Separate the response (CO) from the predictors
y_train = train['CO']
x_train = train.drop('CO', axis=1)
y_test = test['CO']
x_test = test.drop('CO', axis=1)

Ridge

# Sample the ridge penalty parameter alpha uniformly from [0, 1)
param_grid = {'alpha': uniform()}

model = Ridge()
# Randomized search over 100 draws with 5-fold CV (scikit-learn default);
# best_score_ is the mean cross-validated R^2 of the best candidate
ridge_search = RandomizedSearchCV(estimator=model, 
                                 param_distributions=param_grid,
                                 n_iter=100)

ridge_search.fit(x_train, y_train)

print("Optimal lasso penality parameter:", round(ridge_search.best_estimator_.alpha, 3))
print("Best parameter score:", round(ridge_search.best_score_, 3))
print("Coefficients:", ridge_search.best_estimator_.coef_)
ridge_pred = ridge_search.predict(x_test)
print("Ridge MSE for test data:", round(mean_squared_error(y_test, ridge_pred),2))

Lasso

param_grid = {'alpha': uniform()}

model = Lasso()
lasso_search = RandomizedSearchCV(estimator=model, 
                                 param_distributions=param_grid,
                                 n_iter=100)

lasso_search.fit(x_train, y_train)

print("Optimal lasso penality parameter:", round(lasso_search.best_estimator_.alpha, 3))
print("Best parameter score:", round(lasso_search.best_score_, 3))
print("Coefficients:", lasso_search.best_estimator_.coef_)
lasso_pred = lasso_search.predict(x_test)
print("Lasso MSE for test data:", round(mean_squared_error(y_test, lasso_pred), 2))

Adaptive Lasso

# Adaptive lasso via reweighting: build weights from the OLS coefficients,
# then run an ordinary lasso on the rescaled design matrix X_j / w_j
coefficients = LinearRegression(fit_intercept=False).fit(x_train, y_train).coef_
gamma = 2
weights = abs(coefficients)**-gamma   # w_j = |beta_j^OLS|^(-gamma)
X = x_train/weights                   # rescale each column by 1/w_j
lambdas, lasso_betas, _ = lasso_path(X, y_train)
lassoCV = LassoCV(alphas=lambdas, fit_intercept=False, cv=10)
lassoCV.fit(X, y_train)
print("Optimal adaptive lasso penalty parameter:", lassoCV.alpha_)
print("Coefficients (on the reweighted scale):", lassoCV.coef_)
adaptive_pred = lassoCV.predict(x_test/weights)
print("Adaptive Lasso MSE for test data:", round(mean_squared_error(y_test, adaptive_pred), 2))

Elastic Net

# Search jointly over the overall penalty alpha and the lasso/ridge mixing parameter l1_ratio
param_grid = {'alpha': uniform(), 'l1_ratio': arange(0, 1, 0.01)}

model = ElasticNet()
EN_search = RandomizedSearchCV(estimator=model, 
                                 param_distributions=param_grid,
                                 n_iter=100)

EN_search.fit(x_train, y_train)

print("Optimal parameters:", EN_search.best_params_)
print("Best parameter score:", round(EN_search.best_score_, 3))
print("Coefficients:", EN_search.best_estimator_.coef_)
EN_pred = EN_search.predict(x_test)
print("Elastic Net MSE for test data:", round(mean_squared_error(y_test, EN_pred), 2))

Conclusion

Elastic net can be recommended when little is known about the size of the dataset or the number of predictors, as it handles a range of sparsity patterns as well as correlated groups of regressors. Lasso outperforms ridge when there is a small to moderate number of moderate-sized effects; in these cases ridge will not provide a sparse, easily interpretable model, which favours lasso methods. Ridge regression, on the other hand, performs best when there is a large number of small effects. Collinearity also matters: the ridge penalty prefers to weight collinear variables roughly equally, while the lasso penalty struggles to choose among them (as the small sketch below illustrates). This is one reason ridge, or more generally elastic net, which combines the lasso and ridge penalties, tends to work better with collinear predictors: if the data give little reason to prefer one linear combination of collinear predictors over another, lasso will arbitrarily prioritize one of them, while ridge spreads the weight evenly. Given our dataset and the number of predictors here, I would recommend the lasso.
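As a small illustration of the collinearity argument, here is a self-contained sketch on simulated data (independent of the air-quality example above): with two nearly identical predictors that truly carry equal weight, ridge typically splits the coefficient between them, while the lasso often concentrates it on one of them.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # nearly identical copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)   # true model puts equal weight on both

# Ridge tends to return two similar coefficients; lasso often drives one toward zero
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_)
print("Lasso coefficients:", Lasso(alpha=0.1).fit(X, y).coef_)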