Covariance estimation is a problem of great interest in many disciplines, including machine learning, signal processing, economics, and bioinformatics. In many applications the number of variables is very large, e.g., in the tens or hundreds of thousands, so the number of covariance parameters greatly exceeds the number of observations. To address this problem, constraints are frequently imposed on the covariance to reduce the number of parameters in the model. For example, the Glasso model of Yuan and Lin and Banerjee et al. [1] imposes sparsity constraints on the covariance. The Kronecker product model of Dutilleul and Werner et al. [2] assumes that the covariance can be represented as the Kronecker product of two lower-dimensional covariance matrices. Here we will implement a combination of these two approaches.
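
As a minimal sketch of why the Kronecker structure saves parameters (the two factor matrices below are invented purely for illustration):

import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # 2 x 2 factor covariance
B = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.5, 0.2],
              [0.0, 0.2, 0.8]])     # 3 x 3 factor covariance
sigma = np.kron(A, B)               # 6 x 6 structured covariance
# An unconstrained 6 x 6 covariance has 21 free parameters;
# the Kronecker model is described by only 3 + 6 of them.
print(sigma.shape)                  # (6, 6)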

Here is our problem setting:

A combustion engine produces exhaust gas containing polluting substances such as nitrogen oxides (NOx). Gas emission control regulations have been set up to protect the environment. The NOx Storage Catalyst (NSC) is an emission control system in which the exhaust gas is treated after the combustion process in two phases: adsorption and regeneration. During the regeneration phase, the engine control unit is programmed to keep the combustion process in a rich air-to-fuel state. The average relative air/fuel ratio is the indicator of a correct regeneration phase. Our goal is to predict this value using the information from the ten on-board sensors listed below; to do so, we are going to use group lasso regression.

List of on-board sensors:

  • air aspirated per cylinder

  • engine rotational speed
  • total quantity of fuel injected
  • low pressure EGR valve
  • inner torque
  • accelerator pedal position
  • aperture ratio of inlet valve
  • downstream intercooler pressure
  • fuel in the 2nd pre-injection
  • vehicle velocity

First we will write the problem that we want to solve in mathematical notation.

$$ \underset{\beta}{\operatorname{argmin}} \ \left\| \sum_{g \in G} X_g \beta_g - y \right\|_2^2 + \lambda_1 \left\| \beta \right\|_1 + \lambda_2 \sum_{g \in G} \sqrt{d_g} \left\| \beta_g \right\|_2 $$

where

  • $X_g \in \mathbb{R}^{n \times d_g}$ is the data matrix of the covariates that compose group $g$,
  • $\beta_g \in \mathbb{R}^{d_g}$ is the vector of B-spline coefficients for group $g$ (and $\beta$ is their concatenation),
  • $y \in \mathbb{R}^{n}$ is the air/fuel ratio target,
  • $n$ is the number of measurements,
  • $d_g$ is the dimensionality of group $g$,
  • $\lambda_1$ is the parameter-wise regularization penalty,
  • $\lambda_2$ is the group-wise regularization penalty,
  • $G$ is the set of all groups, one per sensor.
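
To make the objective concrete, here is a small sketch that evaluates it for candidate coefficients; the names X_blocks, beta_blocks, lam1 and lam2 are illustrative placeholders, not part of the dataset:

import numpy as np

def objective(X_blocks, beta_blocks, y, lam1, lam2):
    # Residual of the additive model: sum_g X_g beta_g - y.
    residual = sum(X @ b for X, b in zip(X_blocks, beta_blocks)) - y
    beta = np.concatenate(beta_blocks)
    l1 = lam1 * np.abs(beta).sum()              # parameter-wise penalty
    l2 = lam2 * sum(np.sqrt(b.size) * np.linalg.norm(b)
                    for b in beta_blocks)       # group-wise penalty
    return residual @ residual + l1 + l2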

Now on to the code. We will use group lasso to learn the B-spline coefficients, and we will use B-splines with 8 interior knots to reduce the dimensionality of the problem. Ultimately, we want to determine which sensors are correlated with the air/fuel ratio, and to predict the air/fuel ratio for the observations in the test dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import interpolate
from scipy.io import loadmat
import group_lasso
import sklearn.linear_model as lm
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler

# Load the training and test splits of the NSC dataset.
x_train = loadmat('NSC.mat')['x']
y_train = loadmat('NSC.mat')['y']
x_test = loadmat('NSC.test.mat')['x_test']
y_test = loadmat('NSC.test.mat')['y_test']
# Plot each sensor's raw traces.
for i in range(len(x_train[0])):
    pd.DataFrame(x_train[0][i]).plot(legend=False, title=f"Sensor {i}", figsize=(15, 8))
def transformation(data):
    """Fit a quadratic B-spline to each series and return its coefficients."""
    coefficients = []
    x = np.linspace(0, 203, 203)
    knots = np.linspace(0, 203, 10)[1:-1]  # 8 interior knots
    for d in data:
        # task=-1: least-squares spline fit with the given interior knots.
        t, c, k = interpolate.splrep(x, d, task=-1, t=knots, k=2)
        # Drop splrep's zero padding and the final boundary coefficient,
        # keeping 10 coefficients per series.
        coefficients.append(np.trim_zeros(c, trim='b')[:-1])
    return np.array(coefficients)
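
As a quick sanity check (synthetic series, shapes only), each 203-sample series should compress to 10 spline coefficients:

demo = np.random.default_rng(0).normal(size=(5, 203))  # 5 synthetic series
print(transformation(demo).shape)                      # expect (5, 10)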

scaler = StandardScaler()

def standardize(data):
    """Standardize each sensor's window-by-time matrix column-wise."""
    return [scaler.fit_transform(i) for i in data]

# Standardize the targets, compress them to spline coefficients, and stack.
Y_train = transformation(scaler.fit_transform(y_train)).ravel()
Y_test = transformation(scaler.fit_transform(y_test)).ravel()

# One row per window, 10 spline coefficients per sensor: (150, 100).
X_train = np.hstack([transformation(i) for i in standardize(x_train[0])])
X_test = np.hstack([transformation(i) for i in standardize(x_test[0])])
identity = np.identity(10)

Kronecker Products

final_train = np.kron(X_train, identity)
final_test = np.kron(X_test, identity)
# Each sensor's 10 spline columns expand to 100 stacked columns,
# all sharing that sensor's group label g in {1, ..., 10}.
groups = np.repeat(np.arange(1, 11), 100)
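
The Kronecker product with the identity is pure bookkeeping: it turns the 10-output regression (10 spline coefficients of y per window) into one stacked single-output regression, so that every original feature receives a 10-vector of coefficients. A quick check of that equivalence, with beta_demo a made-up coefficient matrix:

rng = np.random.default_rng(0)
beta_demo = rng.normal(size=(100, 10))      # one 10-vector per feature
stacked = final_train @ beta_demo.ravel()   # stacked single-output view
multi = (X_train @ beta_demo).ravel()       # multi-output view
print(np.allclose(stacked, multi))          # True: the two views agree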
gl = group_lasso.GroupLasso(
    groups=groups,
    group_reg=0,           # tuned below via randomized search
    l1_reg=0,
    fit_intercept=True,
    scale_reg="none",
    supress_warning=True,  # (sic) the library's own spelling
    tol=1e-5,
)
lambdas, _, _ = lm.lasso_path(final_train, Y_train)  # data-driven penalty grid
CV = RandomizedSearchCV(estimator=gl, param_distributions={'group_reg': lambdas[::5]},
                        scoring='neg_mean_squared_error', n_iter=len(lambdas[::5]), verbose=2)
CV.fit(final_train, Y_train)
coef = CV.best_estimator_.coef_.ravel().reshape(100, 10)  # refit model from the search
coef_base = X_train @ coef
coef_df = pd.DataFrame(coef_base)
print("Best lambda:", CV.best_params_['group_reg'])
print("Coefficients correlated to target:")
print(coef_df.corrwith(pd.DataFrame(Y_train.reshape(150, 10))))

It appears that sensors 2 and 7 have the greatest correlation with the air/fuel ratio.

_y = pd.DataFrame(Y_train.reshape(150, 10))
for sensor in [2, 7]:
    plt.figure(figsize=(15, 8))
    plt.scatter(coef_df[sensor], _y[sensor])
    plt.title(f"Correlation of sensor {sensor} and air/fuel ratio")
    plt.xlabel(f"Sensor {sensor}")
    plt.ylabel("Air/fuel ratio")
plt.figure(figsize=(15, 8))
coef_df[2].plot(title='Coefficients for sensor 2')
plt.figure(figsize=(15, 8))
coef_df[7].plot(title='Coefficients for sensor 7')
predicted = CV.predict(final_test)
print("Mean squared prediction error:", np.mean((Y_test - predicted)**2))

[1] Yuan, M. and Lin, Y. "Model Selection and Estimation in Regression with Grouped Variables," Journal of the Royal Statistical Society, Series B 68(1) (2006): 49-67.

[2] Tsiligkaridis, T. et al. "Convergence Properties of Kronecker Graphical Lasso Algorithms," IEEE Transactions on Signal Processing (2013).