The information about the target class is inherent in the variables.
Naive view:
More features
⇒ More information
⇒ Better discrimination power
In practice:
- there are many reasons why this is not the case!
Curse of Dimensionality
- the number of training examples is fixed
- the classifier's performance usually degrades for a large number of features!
Feature Selection
Given a set of features F = {𝓍1, ..., 𝓍n},
the Feature Selection problem is to find a subset F' ⊆ F that maximizes the learner's ability to classify patterns.
Formally, F' should maximize some scoring function. Feature selection thus maps the full feature vector to a reduced one:
(𝓍1, 𝓍2, ..., 𝓍n) → (𝓍i1, 𝓍i2, ..., 𝓍im), with m ≤ n
Feature Selection Steps
Feature selection is an optimization problem
Step 1: Search the space of possible feature subsets.
Step 2: Pick the subset that is optimal or near-optimal with respect to some objective function.
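As a concrete (if brute-force) illustration of these two steps, here is a minimal Python sketch that enumerates every subset and keeps the best one under a caller-supplied objective; the `score` callable is a placeholder, not a prescribed function. Exhaustive search is only feasible for small n, which is why the heuristic and randomized strategies below exist.

```python
from itertools import combinations

def exhaustive_selection(features, score):
    """Step 1: enumerate every non-empty subset of `features`.
    Step 2: keep the subset that maximizes the objective `score`.
    Only feasible for small n: there are 2**n - 1 candidate subsets."""
    best_subset, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            s = score(subset)  # caller-supplied objective function
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score
```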
Search strategies
- Optimum
- Heuristic
- Randomized
Evaluation strategies
- Filter methods
- Wrapper methods
Evaluating feature subsets
Supervised (Wrapper method)
- Train using selected subset
- Estimate error on a validation dataset (sketched after this list)
Unsupervised (Filter method)
- Look at input only
- Select the subset that has the most information
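A minimal sketch of the wrapper evaluation above, assuming scikit-learn; the choice of LogisticRegression and the 70/30 split are illustrative placeholders, not part of the method.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def wrapper_score(X, y, subset):
    """Wrapper evaluation: train a classifier on the selected columns
    only, then estimate the error on a held-out validation set."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X[:, list(subset)], y, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return 1.0 - clf.score(X_val, y_val)  # validation error rate
```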
Forward Selection
- Start with empty feature set
- Try each remaining feature
- Estimate classification/regression error for adding each feature
- Select the feature that gives the maximum improvement
- Stop when there is no significant improvement
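A greedy forward-selection sketch along these lines; `eval_error` is any subset-error estimator (for instance the wrapper_score above), and the `tol` threshold for "no significant improvement" is an assumed parameter.

```python
def forward_selection(X, y, eval_error, tol=1e-3):
    """Greedy forward selection: start with an empty set and repeatedly
    add the feature that reduces the estimated error the most."""
    remaining = set(range(X.shape[1]))
    selected, best_err = [], float("inf")
    while remaining:
        # estimate the error of adding each remaining feature
        errs = {f: eval_error(X, y, selected + [f]) for f in remaining}
        f_best = min(errs, key=errs.get)
        if best_err - errs[f_best] < tol:  # no significant improvement
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_err = errs[f_best]
    return selected
```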
Backward Search
- Start with full feature set
- Try removing each remaining feature
- Drop the feature with the smallest impact on error
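The mirror-image backward sketch; `min_features` is an assumed stopping parameter, since the notes do not specify when to stop dropping.

```python
def backward_search(X, y, eval_error, min_features=1):
    """Greedy backward search: start with the full feature set and
    repeatedly drop the feature whose removal hurts the error least."""
    selected = list(range(X.shape[1]))
    while len(selected) > min_features:
        # error after removing each candidate feature
        errs = {f: eval_error(X, y, [g for g in selected if g != f])
                for f in selected}
        f_drop = min(errs, key=errs.get)  # smallest impact on error
        selected.remove(f_drop)
    return selected
```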
Univariate (looks at each feature independently of others)
- Pearson correlation coefficient
- F-score
- Chi-square
- Signal-to-noise ratio
- Mutual information
- Etc.
Rank features by importance
Ranking cut-off is determined by the user
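A generic sketch of this rank-and-cut-off scheme, assuming NumPy; `score_fn` stands in for any of the univariate scores listed above, and `top_k` is the user-chosen cut-off.

```python
import numpy as np

def rank_features(X, y, score_fn, top_k):
    """Univariate filter: score every feature independently, rank by
    importance, and keep the top_k chosen by the user."""
    scores = np.array([score_fn(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    return order[:top_k], scores[order[:top_k]]
```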
Pearson correlation coefficient
- Measures the linear correlation between two variables
- Formula: r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² )
- The correlation r lies between −1 and +1.
- +1 means perfect positive correlation
- −1 means perfect negative correlation
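A direct NumPy sketch of the formula above (equivalent to `np.corrcoef(x, y)[0, 1]`):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between a feature x and the target y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()  # center both variables
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))
```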
Signal to noise ratio
- Difference in class means divided by the sum of the class standard deviations
- S2N(X, Y) = (μx − μy) / (σx + σy)
- Large absolute values indicate a feature that separates the two classes well
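A sketch of the S2N score, assuming a binary target coded 0/1 (that coding is an assumption, not stated in the notes):

```python
import numpy as np

def s2n(x, y):
    """Signal-to-noise ratio of feature x for a binary target y:
    difference of class means over the sum of class standard deviations."""
    x, y = np.asarray(x, float), np.asarray(y)
    a, b = x[y == 0], x[y == 1]  # split the feature by class
    return (a.mean() - b.mean()) / (a.std() + b.std())
```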
Multivariate feature selection
- Multivariate (consider all features simultaneously)
- Consider the vector w for any linear classifier.
- Classification of a point x is given by wᵀx + w₀.
- Small entries of w will have little effect on the dot product and therefore those features are less relevant
- For example, if w = (10, 0.1, -9), then features 0 and 2 contribute more to the dot product than feature 1.
- The ranking of features given by this w is 0, 2, 1.
- The w can be obtained from any linear classifier.
- A variant of this approach is called recursive feature elimination (RFE):
  1. Compute w on all features
  2. Remove the feature with the smallest |wᵢ|
  3. Recompute w on the reduced data
  4. If the stopping criterion is not met, go to step 2
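A sketch of recursive feature elimination following steps 1-4, assuming scikit-learn and binary labels; LogisticRegression and the `n_keep` stopping criterion are illustrative choices, not part of the recipe above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def recursive_feature_elimination(X, y, n_keep=5):
    """RFE: fit a linear classifier, drop the feature with the smallest
    |w_i|, refit on the reduced data, and repeat until n_keep remain."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:  # stopping criterion (assumed)
        clf = LogisticRegression(max_iter=1000).fit(X[:, active], y)
        w = clf.coef_.ravel()    # weight vector of the linear model
        active.pop(int(np.argmin(np.abs(w))))  # remove smallest |w_i|
    return active
```

scikit-learn provides a ready-made version of this loop as sklearn.feature_selection.RFE.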