Nonlinear SVMs: Feature Space
- General idea: map the original input space to some higher-dimensional feature space, x → Φ(x), in which the training data become linearly separable.
Nonlinear SVMs: The Kernel Trick
- With this mapping, the discriminant function becomes:
g(x) = w·Φ(x) + b = Σi∈SV αi yi Φ(xi)·Φ(x) + b
- Notice that both training and testing use the feature vectors only through dot products Φ(xi)·Φ(xj); the mapping Φ never has to be computed explicitly.
- A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:
K(xa, xb) = Φ(xa)·Φ(xb)
- Often K(xa, xb) is very inexpensive to compute even though Φ(xa) may be extremely high-dimensional.
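To make this concrete, here is a minimal sketch of evaluating the kernelized discriminant; the support vectors, multipliers (alphas), labels, and bias b are assumed to come from training elsewhere, and the RBF kernel shown is just one possible choice:

```python
import numpy as np

def rbf_kernel(xa, xb, sigma=1.0):
    # K(xa, xb) = exp(-||xa - xb||^2 / (2 * sigma^2))
    return np.exp(-np.sum((xa - xb) ** 2) / (2.0 * sigma ** 2))

def discriminant(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    # g(x) = sum_i alpha_i * y_i * K(x_i, x) + b
    # Only kernel evaluations are needed; Phi(x) is never formed explicitly.
    return sum(a * y * kernel(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b
```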
Kernel Example
2-dimensional vector x = [x1, x2]
Let K(xi, xj) = (1 + xi·xj)²
We need to show that K(xi, xj) = Φ(xi)·Φ(xj):
K(xi, xj) = (1 + xi·xj)² = 1 + xi1²xj1² + 2·xi1xj1xi2xj2 + xi2²xj2² + 2·xi1xj1 + 2·xi2xj2
= [1, xi1², √2·xi1xi2, xi2², √2·xi1, √2·xi2] · [1, xj1², √2·xj1xj2, xj2², √2·xj1, √2·xj2]
= Φ(xi)·Φ(xj), where Φ(x) = [1, x1², √2·x1x2, x2², √2·x1, √2·x2]
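A quick numerical check of this identity (a minimal sketch; the feature map phi below is the one derived above, and the test points are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map for K(xi, xj) = (1 + xi . xj)^2 in 2 dimensions
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi, xj = np.array([0.5, -1.2]), np.array([2.0, 0.3])
lhs = (1 + xi @ xj) ** 2      # kernel evaluated directly in the input space
rhs = phi(xi) @ phi(xj)       # dot product in the expanded feature space
print(np.isclose(lhs, rhs))   # True
```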
Commonly-used kernel functions
- Linear kernel: K(xi, xj) = xi·xj
- Polynomial of power p: K(xi, xj) = (1 + xi·xj)^p
- Gaussian (radial-basis function): K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))
- Sigmoid: K(xi, xj) = tanh(β0·xi·xj + β1)
In general, functions that satisfy Mercer's condition can be kernel functions.
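For illustration, a minimal sketch of these kernels as Gram-matrix computations over a data matrix X (rows are examples); the parameter names p, sigma, beta0, and beta1 mirror the formulas above and are placeholders:

```python
import numpy as np

def linear_kernel(X):
    return X @ X.T

def polynomial_kernel(X, p=2):
    return (1 + X @ X.T) ** p

def gaussian_kernel(X, sigma=1.0):
    # ||xi - xj||^2 = ||xi||^2 + ||xj||^2 - 2 * xi.xj
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma ** 2))

def sigmoid_kernel(X, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (X @ X.T) + beta1)
```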
Kernel Functions
- A kernel function can be thought of as a similarity measure between the input objects.
- Not all similarity measures can be used as kernel functions.
- Mercer's condition states that any positive semi-definite kernel K(x, y), i.e. any K for which
Σi Σj K(xi, xj) ci cj ≥ 0 for every finite set of points {xi} and all real coefficients {ci},
- can be expressed as a dot product in a high-dimensional space.
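One practical sanity check of this condition on a finite sample is to verify that the Gram matrix has no negative eigenvalues; a minimal sketch, here using a Gaussian Gram matrix on random data purely for illustration:

```python
import numpy as np

def is_psd(gram, tol=1e-8):
    # A symmetric matrix is positive semi-definite iff all eigenvalues are >= 0.
    eigvals = np.linalg.eigvalsh(gram)
    return bool(np.all(eigvals >= -tol))

X = np.random.randn(50, 3)
sq_norms = np.sum(X ** 2, axis=1)
gram = np.exp(-(sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T) / 2.0)
print(is_psd(gram))  # expected: True for the Gaussian kernel
```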
- The user must choose the kernel function and its parameters.
- SVMs can be expensive in time and space for big datasets:
- The computation of the maximum-margin hyperplane depends on the square of the number of training cases.
- All of the support vectors need to be stored.
- The kernel trick can also be used to do PCA in a much higher-dimensional space, thus giving a non-linear version of PCA in the original space.
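For concreteness, a minimal sketch of that idea (kernel PCA), assuming a helper kernel(X) that returns the n×n Gram matrix for the rows of X (e.g. the gaussian_kernel sketched above); symbols and names here are illustrative:

```python
import numpy as np

def kernel_pca(X, kernel, n_components=2):
    # Kernel PCA: eigendecompose the centered Gram matrix instead of the
    # covariance matrix, so PCA happens implicitly in the expanded feature space.
    n = X.shape[0]
    K = kernel(X)                                   # n x n Gram matrix
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    eigvals, eigvecs = np.linalg.eigh(K_centered)   # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]  # keep the largest components
    alphas, lambdas = eigvecs[:, idx], eigvals[idx]
    # Projections of the training points onto the nonlinear principal components.
    return alphas * np.sqrt(np.maximum(lambdas, 0.0))
```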
Multi-class classification
- SVMs can only handle two-class outputs
- Learn N SVMs:
- SVM 1 learns Class 1 vs REST
- SVM 2 learns Class 2 vs REST
- ...
- SVM N learns Class N vs REST
- Then, to predict the output for a new input, predict with each SVM and find the one that puts the prediction furthest into the positive region.
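A minimal sketch of this one-vs-rest scheme, here using scikit-learn's SVC for the binary SVMs (the function names train_one_vs_rest and predict_one_vs_rest are just for illustration):

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes, **svm_params):
    # One binary SVM per class: class k vs. the rest.
    models = {}
    for k in classes:
        binary_labels = np.where(y == k, 1, -1)
        models[k] = SVC(kernel="rbf", **svm_params).fit(X, binary_labels)
    return models

def predict_one_vs_rest(models, X):
    # Pick the class whose SVM pushes the point furthest into its positive region.
    scores = np.column_stack([m.decision_function(X) for m in models.values()])
    class_list = list(models.keys())
    return np.array([class_list[i] for i in np.argmax(scores, axis=1)])
```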