The information about the target class is inherent in the variables.
Naive view:
More features
⇒ More information
⇒ Better discrimination power
In practice:
- there are many reasons why this is not the case!
Curse of Dimensionality
- the number of training examples is fixed
- the classifier's performance usually degrades for a large number of features!
Feature Selection
Given a set of features F = {𝓍1, ..., 𝓍n},
the Feature Selection problem is to find a subset F' ⊆ F that maximizes the learner's ability to classify patterns.
Formally, F' should maximize some scoring function. Feature selection thus maps the full feature vector to a reduced one:
(𝓍1, 𝓍2, ..., 𝓍n) → (𝓍i1, 𝓍i2, ..., 𝓍im), with m ≤ n
Feature Selection Steps
Feature selection is an optimization problem
Step 1: Search the space of possible feature subsets.
Step 2: Pick the subset that is optimal or near-optimal with respect to some objective function.
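As a concrete (if brute-force) illustration of these two steps, here is a minimal Python sketch that enumerates every subset and keeps the best one under a caller-supplied objective; the `score` callable is a placeholder, not a prescribed function. Exhaustive search is only feasible for small n, which is why the heuristic and randomized strategies below exist.

```python
from itertools import combinations

def exhaustive_selection(features, score):
    """Step 1: enumerate every non-empty subset of `features`.
    Step 2: keep the subset that maximizes the objective `score`.
    Only feasible for small n: there are 2**n - 1 candidate subsets."""
    best_subset, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            s = score(subset)  # caller-supplied objective function
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score
```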
Search strategies
- Optimum
- Heuristic
- Randomized
Evaluation strategies
- Filter methods
- Wrapper methods
Evaluating feature subsets
Supervised (Wrapper method)
- Train using selected subset
- Estimate error on a validation dataset (sketched after this list)
Unsupervised (Filter method)
- Look at input only
- Select the subset that has the most information
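A minimal sketch of the wrapper evaluation above, assuming scikit-learn; the choice of LogisticRegression and the 70/30 split are illustrative placeholders, not part of the method.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def wrapper_score(X, y, subset):
    """Wrapper evaluation: train a classifier on the selected columns
    only, then estimate the error on a held-out validation set."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X[:, list(subset)], y, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return 1.0 - clf.score(X_val, y_val)  # validation error rate
```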
Forward Selection
- Start with empty feature set
- Try each remaining feature
- Estimate classification/regression error for adding each feature
- Select the feature that gives the maximum improvement
- Stop when there is no significant improvement
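A greedy forward-selection sketch along these lines; `eval_error` is any subset-error estimator (for instance the wrapper_score above), and the `tol` threshold for "no significant improvement" is an assumed parameter.

```python
def forward_selection(X, y, eval_error, tol=1e-3):
    """Greedy forward selection: start with an empty set and repeatedly
    add the feature that reduces the estimated error the most."""
    remaining = set(range(X.shape[1]))
    selected, best_err = [], float("inf")
    while remaining:
        # estimate the error of adding each remaining feature
        errs = {f: eval_error(X, y, selected + [f]) for f in remaining}
        f_best = min(errs, key=errs.get)
        if best_err - errs[f_best] < tol:  # no significant improvement
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_err = errs[f_best]
    return selected
```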
Backward Search
- Start with full feature set
- Try removing each remaining feature
- Drop the feature with the smallest impact on error
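The mirror-image backward sketch; `min_features` is an assumed stopping parameter, since the notes do not specify when to stop dropping.

```python
def backward_search(X, y, eval_error, min_features=1):
    """Greedy backward search: start with the full feature set and
    repeatedly drop the feature whose removal hurts the error least."""
    selected = list(range(X.shape[1]))
    while len(selected) > min_features:
        # error after removing each candidate feature
        errs = {f: eval_error(X, y, [g for g in selected if g != f])
                for f in selected}
        f_drop = min(errs, key=errs.get)  # smallest impact on error
        selected.remove(f_drop)
    return selected
```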
Univariate (looks at each feature independently of others)
- Pearson correlation coefficient
- F-score
- Chi-square
- Signal-to-noise ratio
- Mutual information
- Etc.
Rank features by importance
Ranking cut-off is determined by the user
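A generic sketch of this rank-and-cut-off scheme, assuming NumPy; `score_fn` stands in for any of the univariate scores listed above, and `top_k` is the user-chosen cut-off.

```python
import numpy as np

def rank_features(X, y, score_fn, top_k):
    """Univariate filter: score every feature independently, rank by
    importance, and keep the top_k chosen by the user."""
    scores = np.array([score_fn(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    return order[:top_k], scores[order[:top_k]]
```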
Pearson correlation coefficient
- Measures the linear correlation between two variables
- Formula: r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² )
- The correlation r lies between −1 and +1.
- +1 means perfect positive correlation
- −1 means perfect negative correlation
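A direct NumPy sketch of the formula above (equivalent to `np.corrcoef(x, y)[0, 1]`):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between a feature x and the target y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()  # center both variables
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))
```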
Signal to noise ratio
- Difference in class means divided by the sum of the class standard deviations
- S2N(X, Y) = (μx − μy) / (σx + σy)
- Large absolute values indicate a feature that separates the two classes well
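A sketch of the S2N score, assuming a binary target coded 0/1 (that coding is an assumption, not stated in the notes):

```python
import numpy as np

def s2n(x, y):
    """Signal-to-noise ratio of feature x for a binary target y:
    difference of class means over the sum of class standard deviations."""
    x, y = np.asarray(x, float), np.asarray(y)
    a, b = x[y == 0], x[y == 1]  # split the feature by class
    return (a.mean() - b.mean()) / (a.std() + b.std())
```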
Multivariate feature selection
- Multivariate (consider all features simultaneously)
- Consider the vector w for any linear classifier.
- Classification of a point x is given by wᵀx + w₀.
- Small entries of w will have little effect on the dot product and therefore those features are less relevant
- For example, if w = (10, 0.1, -9), then features 0 and 2 contribute more to the dot product than feature 1.
- The ranking of features given by this w is 0, 2, 1.
- The w can be obtained from any linear classifier.
- A variant of this approach is called recursive feature elimination (RFE):
  1. Compute w on all features
  2. Remove the feature with the smallest |wᵢ|
  3. Recompute w on the reduced data
  4. If the stopping criterion is not met, go to step 2
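A sketch of recursive feature elimination following steps 1-4, assuming scikit-learn and binary labels; LogisticRegression and the `n_keep` stopping criterion are illustrative choices, not part of the recipe above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def recursive_feature_elimination(X, y, n_keep=5):
    """RFE: fit a linear classifier, drop the feature with the smallest
    |w_i|, refit on the reduced data, and repeat until n_keep remain."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:  # stopping criterion (assumed)
        clf = LogisticRegression(max_iter=1000).fit(X[:, active], y)
        w = clf.coef_.ravel()    # weight vector of the linear model
        active.pop(int(np.argmin(np.abs(w))))  # remove smallest |w_i|
    return active
```

scikit-learn provides a ready-made version of this loop as sklearn.feature_selection.RFE.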