Motivation

Feature selection has become a pressing challenge in multi-label classification, as high-dimensional data have proliferated in recent years. Datasets with dimensionalities in the tens of thousands appear routinely in applications such as medical imaging, text retrieval, and genetics. High dimensionality has a significant impact on learning algorithms: their performance degrades when many irrelevant and redundant features are present. This phenomenon is known as the curse of dimensionality, because unnecessary features enlarge the search space and make generalization more difficult. To overcome this obstacle, researchers usually employ dimensionality reduction techniques, which shrink the set of features required to describe the problem, often improving model performance at the same time.

Feature selection is arguably the best-known dimensionality reduction technique. It consists of detecting the relevant features and discarding the irrelevant ones. Its goal is to obtain a subset of features that properly describes the given problem with minimal degradation in performance, with the added benefits of improving data and model understanding and reducing storage requirements. With this technique, the original features are preserved, in contrast to techniques such as feature extraction, where the resulting dataset is represented by a newly generated set of features different from the original ones.

Feature selection methods can be divided into wrappers, filters, and embedded methods. Wrapper models optimize a predictor as part of the selection process, while filter models rely on the general characteristics of the training data to select features independently of any predictor. Embedded methods typically train a machine learning model for classification and derive an optimal subset or ranking of features from the trained classifier. Wrappers and embedded methods tend to achieve better performance, but at the expense of being very time consuming and carrying a risk of overfitting when the sample size is small. Filters, on the other hand, are faster and therefore more suitable for large datasets; they are also easier to implement and scale better than wrapper and embedded methods. In fact, filters can be used as a pre-processing step before applying other, more complex feature selection methods.

However, these existing approaches assume that all features have the same cost. This assumption can be inappropriate when acquiring feature values is expensive. In medical diagnosis, for example, each value obtained from a clinical test carries its own cost. In such cases it may be preferable to choose a model with acceptable classification performance but a much lower cost.

Our goal here is therefore to strike a trade-off between a filter metric and the cost associated with the selected features, so that relevant features with a low associated cost are chosen. We introduce four different approaches that can be considered filter methods. In this manner, any filter metric can be modified to take into account the cost associated with the input features.


Cost-sensitive Methods

Approach #1- Feature-cost-sensitive Random Forest (FCS-RF)

Random forest (RF) is an ensemble of decision trees. RF has a wide range of applications because of its good stability and generalization. A typical RF is constructed as follows. First, bagging is applied to the training dataset to produce many (different) subsets. Each subset is then used to grow a decision tree. During tree growth, the split at each node is chosen from a group of candidate features drawn at random from all features. Without pruning (i.e., all leaf nodes are pure), every tree is grown fully and acts as a base classifier. Finally, all of these tree classifiers are aggregated. Two sources of randomness are important when growing a random forest: the random sampling of the training data and the random generation of the node-splitting candidates. The diversity among trees produced by this randomness is crucial to the performance of the forest.

We can extend the random forest to incorporate feature cost by introducing a probability vector into the tree construction process. The probability vector is generated from the cost vector so that the probability of a feature being selected is inversely proportional to its cost. It can easily be shown that the expected cost of a feature selected according to these probabilities is smaller than that of a feature selected uniformly at random. Some randomness among the trees still remains, which is indispensable for preserving accuracy. Note that if all feature costs are equal, FCS-RF degenerates to the ordinary RF.

By injecting feature cost into the construction of the random forest, high-cost features become less likely to be selected and low-cost features more likely. In this way, the importance score of a feature is explicitly influenced both by its cost and by its discriminative ability. As a result, the top-ranked features in the ranking produced by FCS-RF have a larger chance of being both low-cost and informative.
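To make the idea concrete, here is a minimal sketch of the cost-biased candidate sampling; `cost_biased_candidates` and the example costs are assumptions made for illustration, and a full FCS-RF implementation would call such a routine at every node while growing each tree.

```python
import numpy as np

def cost_biased_candidates(costs, mtry, rng=None):
    """Draw `mtry` candidate features for a node split, with selection
    probability inversely proportional to feature cost."""
    rng = np.random.default_rng() if rng is None else rng
    costs = np.asarray(costs, dtype=float)
    weights = 1.0 / costs            # cheaper features get larger weights
    probs = weights / weights.sum()  # probability vector built from the costs
    # Drawing without replacement keeps some randomness among the trees.
    return rng.choice(len(costs), size=mtry, replace=False, p=probs)

# Example: four features with unequal acquisition costs.
print(cost_biased_candidates([1.0, 1.0, 5.0, 10.0], mtry=2))
```

If all costs are equal, the probability vector is uniform and the sampling reduces to the ordinary RF candidate draw, as noted above.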


Approach #2- Non-convex Constrained Optimization

Given a loss function (the negative log-likelihood) $\mathcal{L}_k(\theta_k)$, the most natural way to take cost into account is to minimize $\mathcal{L}_k(\theta_k)$ subject to the constraint $$ \sum_{j=1}^{p}c_j\,\mathbb{1}\left(\left| \theta_{k,j} \right|>0\right)\leq t $$ where $\mathbb{1}(\cdot)$ denotes the indicator function.

This is equivalent to finding the best model (in terms of maximizing the log-likelihood) subject to a limited budget. The parameter $t$ can be interpreted as an upper bound on the budget of the $k$-th model. When $c_1 = \cdots = c_p$, the problem reduces to best-subset selection, in which the log-likelihood is maximized subject to a limit on the number of features. Note that in best-subset selection $t$ is the maximal number of features that may be used in the model, whereas in the constraint above $t$ is the maximal cost that may be incurred to build the model. In other words, the constraint controls the budget rather than the number of features.

Minimizing the log-likelihood function subject to this constraint is equivalent to finding the optimal parameters $\theta_k$ by solving: $$ \hat{\theta}_k=\underset{\theta_k\in \mathbb{R}^p}{\arg\min}\left\{ \mathcal{L}_k(\theta_k) + \lambda\sum_{j=1}^{p}c_j\,\mathbb{1}\left(\left| \theta_{k,j} \right|>0\right) \right\} $$

More precisely, for a given $t > 0$ there exists $\lambda > 0$ such that the two problems share the same solution, and vice versa. The second term is a penalty for the cost. The parameter $\lambda > 0$ controls the balance between the goodness of fit of the model and its cost: the larger the value of $\lambda$, the cheaper the model. Although this approach seems appealing, the objective above is non-convex, which makes it difficult to analyze theoretically and, especially, to solve computationally. The difficulty stems from the $\ell_0$-type penalty, which renders the approach computationally infeasible for a large number of features. The popular remedy is to use an $\ell_1$ (lasso) penalty instead of $\ell_0$, i.e., to optimize:

$$ \hat{\theta}_k=\underset{\theta_k\in \mathbb{R}^p}{\arg\min}\left\{ \mathcal{L}_k(\theta_k) + \lambda\sum_{j=1}^{p}c_j\left| \theta_{k,j} \right| \right\} $$

It is well known that $\ell_1$ regularization (the lasso) encourages sparsity in the parameter vector and thus selects relevant features. We can take this a step further: instead of the plain $\ell_1$ penalty we can use the elastic-net penalty, a linear combination of the $\ell_1$ and $\ell_2$ norms; including the $\ell_2$ (ridge) term stabilizes the solution. The standard elastic net does not take feature costs into account. The key idea of our approach is to modify the elastic-net penalty so that features with higher costs are assigned larger weights: $$ \hat{\theta}_k=\underset{\theta_k\in \mathbb{R}^{p+k-1}}{\arg\min}\left\{ \mathcal{L}_k(\theta_k) + \lambda\sum_{j=1}^{p}c_jP_w(\theta_{k,j}) + \lambda\sum_{j=p+1}^{p+k-1}P_w(\theta_{k,j}) \right\} $$

where $P_w(\theta_{k,j}) = (1-w)\theta_{k,j}^2 + w\left| \theta_{k,j}\right|$ is the elastic-net penalty and $\lambda$ is a regularization parameter. The second term depends on the costs: the larger the cost, the larger the penalty, so it is harder for expensive features to enter the final model. The costs are normalized so that they sum to $p$; such normalization is necessary because otherwise their scale could be absorbed into the value of $\lambda$.

The total cost depends on $\lambda$: a large enough value of $\lambda$ sets all coefficients exactly to zero, giving a total cost of zero, whereas a small value of $\lambda$ leaves many non-zero coefficients and yields a large total cost. A practical challenge is how to select $\lambda$ so as to obtain a certain number of features with non-zero coefficients, or a certain total cost. Unfortunately, neither the number of selected features nor the total cost can be controlled directly. It is necessary to try different regularization parameters and observe the total cost of the selected features, much like tuning a hyperparameter. Fortunately, it is relatively easy to obtain solutions for many values of $\lambda$, which allows one to pick the value of $\lambda$ that best matches the desired total cost.
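As an illustration, the sketch below shows the idea for the squared-error case, using an ordinary lasso in place of the penalized log-likelihood and elastic net above; `cost_weighted_lasso`, the toy data, and the grid of $\lambda$ values are all assumptions made for the example. It relies on the standard reformulation that penalizing $c_j\left|\theta_j\right|$ is equivalent to fitting an unweighted lasso on a design matrix whose $j$-th column has been divided by $c_j$ and then rescaling the coefficients back.

```python
import numpy as np
from sklearn.linear_model import Lasso

def cost_weighted_lasso(X, y, costs, lam):
    """Weighted l1 penalty lam * sum_j c_j * |theta_j| via column rescaling:
    fit an ordinary lasso on X with column j divided by c_j, then rescale
    the coefficients back."""
    costs = np.asarray(costs, dtype=float)
    model = Lasso(alpha=lam).fit(X / costs, y)
    return model.coef_ / costs

# Toy data: features 0 and 2 are informative, feature 2 is expensive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

costs = np.array([1.0, 1.0, 5.0, 10.0])
costs = costs / costs.sum() * len(costs)   # normalise costs to sum to p

# Sweep lambda and observe the total cost of the selected features.
for lam in (0.01, 0.05, 0.2, 1.0):
    theta = cost_weighted_lasso(X, y, costs, lam)
    selected = np.flatnonzero(theta)
    print(f"lambda={lam}: selected={selected.tolist()}, "
          f"total cost={costs[selected].sum():.2f}")
```

Sweeping $\lambda$ in this way and recording the resulting total cost is exactly the tuning procedure described above.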


Approach #3- Cost-based Correlation-based Feature Selection (CCFS)

CFS (Correlation-based Feature Selection) is a multivariate subset filter algorithm. It uses a search algorithm combined with an evaluation function to estimate the merit of feature subsets. The evaluation function takes into account the usefulness of individual features for predicting the class label as well as the level of correlation among them. It is assumed that good feature subsets contain features highly correlated with the class and uncorrelated with each other. The evaluation function can be seen in the following equation: $$ M_s = \frac{k\overline{r_{ci}}}{\sqrt{k+k(k-1)\overline{r_{ii}}}} $$

where $M_s$ is the merit of a feature subset $S$ that contains $k$ features, $\overline{r_{ci}}$ is the average correlation between the features of $S$ and the class, and $\overline{r_{ii}}$ is the average intercorrelation among the features of $S$. In fact, this function is Pearson's correlation with all variables standardized. The numerator estimates how predictive of the class the subset $S$ is, and the denominator quantifies the redundancy among the features in $S$. We can modify CFS by adding a term to the evaluation function that accounts for the cost of the features, as shown in the following equation: $$ MC_s = \frac{k\overline{r_{ci}}}{\sqrt{k+k(k-1)\overline{r_{ii}}}} - \lambda\frac{\sum_{i=1}^{k}C_i}{k} $$

where $MC_s$ is the merit of the subset $S$ adjusted by the cost of its features, $C_i$ is the cost of feature $i$, and $\lambda$ is a parameter introduced to weight the influence of the cost in the evaluation function. If $\lambda = 0$, the cost is ignored and the method behaves as regular CFS. If $\lambda$ is between 0 and 1, the cost has less influence than the correlation term; if $\lambda = 1$, both terms have the same influence; and if $\lambda > 1$, the cost term dominates.
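The modified merit is simple to compute once the average correlations are available. The following is a small sketch; `ccfs_merit` and the toy numbers are hypothetical, and a full CCFS would evaluate this merit inside a subset search (e.g., best-first search), just as regular CFS does.

```python
import numpy as np

def ccfs_merit(r_ci, r_ii, costs, lam):
    """Cost-modified CFS merit of a candidate subset S.
    r_ci  : feature-class correlations of the features in S
    r_ii  : average pairwise inter-correlation within S
    costs : acquisition costs of the features in S
    lam   : weight of the cost term (lam = 0 recovers plain CFS)
    """
    k = len(r_ci)
    merit = k * np.mean(r_ci) / np.sqrt(k + k * (k - 1) * r_ii)
    return merit - lam * np.mean(costs)

# Two subsets with identical correlations but very different costs.
print(ccfs_merit([0.6, 0.5], r_ii=0.2, costs=[1.0, 1.0], lam=0.5))
print(ccfs_merit([0.6, 0.5], r_ii=0.2, costs=[5.0, 10.0], lam=0.5))
```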


Approach #4- Cost-based Minimal-Redundancy-Maximal-Relevance (mRMR)

mRMR (Minimal-Redundancy-Maximal-Relevance) is a multivariate ranker filter algorithm. Since mRMR is a ranker, its search procedure is simpler than that of CFS. The evaluation function combines two constraints (as the name of the method indicates): maximal relevance and minimal redundancy. The former, denoted $D$, is the mean of the mutual information values between each feature $x_i$ and the class $c$: $$ D(S,c) = \frac{1}{\left| S \right|}\sum_{x_i \in S}I(x_i;c) $$

where $S$ is a set of features and $I(x_i;c)$ is the mutual information between feature $x_i$ and class $c$. The minimal-redundancy constraint, denoted $R$, is given by: $$ R(S) = \frac{1}{\left| S \right|^2}\sum_{x_i,x_j \in S}I(x_i;x_j) $$

The evaluation function to be maximized combines the two constraints as $D(S,c) - R(S)$. In practice the search is incremental: given the set $S_{m-1}$ of $m-1$ already selected features, the $m$-th feature is chosen from the remaining ones. We can modify mRMR by adding a term to this condition so as to take into account the cost of the feature being selected, as shown in the following equation: $$ \underset{x_j \in X-S_{m-1}}{\max}\left[ \left( I(x_j;c)-\frac{1}{m-1} \sum_{x_i \in S_{m-1}}I(x_j;x_i)\right) -\lambda C_j\right] $$

where $C_j$ is the cost of feature $x_j$ and $\lambda$ is a parameter that weights the influence of the cost in the evaluation function, as explained in the previous subsection.
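A greedy implementation of this cost-aware ranking might look like the sketch below; `cost_mrmr` is a hypothetical helper, and it assumes discrete features so that scikit-learn's `mutual_info_classif` and `mutual_info_score` can be used to estimate relevance and redundancy.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def cost_mrmr(X, y, costs, n_select, lam):
    """Greedy cost-aware mRMR: at each step pick the feature maximizing
    relevance - redundancy - lam * cost. Assumes discrete features."""
    relevance = mutual_info_classif(X, y, discrete_features=True)
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_select:
        scores = []
        for j in remaining:
            # Average redundancy with the already selected features
            # (defined as 0 for the first pick, when S is empty).
            redundancy = (np.mean([mutual_info_score(X[:, j], X[:, i])
                                   for i in selected]) if selected else 0.0)
            scores.append(relevance[j] - redundancy - lam * costs[j])
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

With $\lambda = 0$ this reduces to the usual incremental mRMR ranking; increasing $\lambda$ pushes the ranking toward cheaper features.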


In Practice

When the above approaches are used in real machine learning projects, cost-sensitive feature selection methods typically achieve higher accuracy when the budget is low. For higher budgets, traditional methods tend to perform better: with a large budget they can include all relevant features, which gives the model high predictive power, whereas with a limited budget cost-sensitive methods select features that act as cheaper substitutes for the relevant expensive ones. The figure below shows a typical trade-off between cost-sensitive and non-cost-sensitive feature selection approaches.

[Figure: classification accuracy versus the fraction of the total feature cost available as budget, for cost-sensitive and traditional feature selection.]

Source: Teisseyre, P., Klonecki, T. (2021). Controlling Costs in Feature Selection: Information Theoretic Approach. In: Paszynski, M., Kranzlmüller, D., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2021. ICCS 2021.

We can see that, up to about 60% of the total cost, the cost-sensitive method performs better. This is because, within the assumed budget, traditional methods can only use a fraction of the original features (say 1 or 2 out of 4), which deteriorates the predictive performance of the corresponding classification model, whereas the cost-sensitive method replaces the original features with cheaper counterparts and thus achieves higher accuracy. When the budget exceeds 60% of the total cost, the traditional feature selection method tends to perform better: in that regime it can include all of the original features (i.e., those constituting the minimal set that allows accurate prediction of the target variable), which gives the corresponding model large predictive power. For such larger budgets, cost-sensitive methods include noisy features alongside the original ones; the noisy features become redundant when considered together with the originals, which results in slightly lower prediction accuracy. As expected, cost-sensitive methods are worth considering when the assumed budget is limited.