
An alternative is to materialize the higher-order features and to train a linear SVM. (Materializing the features refers to directly calculating higher-order and interaction terms and then putting them into a linear model.)
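For concreteness, here is a minimal sketch of this feature-materialization route (not from the book; it assumes scikit-learn, and the degree and C values are arbitrary):

```python
# Sketch: materialize higher-order (here quadratic and interaction) features
# explicitly, then train an ordinary linear SVM on the expanded representation.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # adds squared and x_i * x_j terms
    LinearSVC(C=1.0),                                   # plain linear SVM on the expanded features
)
# model.fit(X_train, y_train); model.predict(X_test)   # X_train etc. are placeholders
```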

15.2.2 Multiclass SVMs

SVMs are inherently two-class classifiers. The traditional way to do multiclass classification with SVMs is to use one of the methods discussed in Section 14.5 (page 306). In particular, the most common technique in practice has been to build $|C|$ one-versus-rest classifiers (commonly referred to as "one-versus-all" or OVA classification), and to choose the class which classifies the test datum with greatest margin. Another strategy is to build a set of one-versus-one classifiers, and to choose the class that is selected by the most classifiers. While this involves building $|C|(|C|-1)/2$ classifiers, the time for training classifiers may actually decrease, since the training data set for each classifier is much smaller.

However, these are not very elegant approaches to solving multiclass problems. A better alternative is provided by the construction of multiclass SVMs, where we build a two-class classifier over a feature vector $\Phi(\vec{x}, y)$ derived from the pair consisting of the input features and the class of the datum. At test time, the classifier chooses the class $y = \arg\max_{y'} \vec{w}^{\,T}\Phi(\vec{x}, y')$. The margin during training is the gap between this value for the correct class and for the nearest other class, and so the quadratic program formulation will require that $\forall i\ \forall y \neq y_i\colon\ \vec{w}^{\,T}\Phi(\vec{x}_i, y_i) - \vec{w}^{\,T}\Phi(\vec{x}_i, y) \ge 1 - \xi_i$. This general method can be extended to give a multiclass formulation of various kinds of linear classifiers. It is also a simple instance of a generalization of classification where the classes are not just a set of independent, categorical labels, but may be arbitrary structured objects with relationships defined between them. In the SVM world, such work comes under the label of structural SVMs. We mention them again in Section 15.4.2.
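To make the one-versus-rest recipe concrete, here is a minimal sketch (not the book's code) that assumes scikit-learn and NumPy; train_ova and predict_ova are hypothetical helper names.

```python
# A minimal sketch of one-versus-rest ("one-versus-all") multiclass
# classification with binary linear SVMs. X is an (n_samples, n_features)
# matrix, y an array of class labels.
import numpy as np
from sklearn.svm import LinearSVC

def train_ova(X, y):
    """Train one binary SVM per class: class c versus all other classes."""
    classifiers = {}
    for c in np.unique(y):
        clf = LinearSVC(C=1.0)
        clf.fit(X, (y == c).astype(int))   # 1 for class c, 0 for the rest
        classifiers[c] = clf
    return classifiers

def predict_ova(classifiers, X):
    """Assign each test datum to the class whose classifier gives the greatest margin."""
    classes = list(classifiers)
    # decision_function returns the signed functional margin w^T x + b
    scores = np.column_stack([classifiers[c].decision_function(X) for c in classes])
    return np.asarray(classes)[np.argmax(scores, axis=1)]
```

Note that scikit-learn's LinearSVC already applies this one-versus-rest scheme internally by default; the explicit loop above just makes the construction visible.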

15.2.3 Nonlinear SVMs

With what we have presented so far, data sets that are linearly separable (perhaps with a few exceptions or some noise) are well-handled. But what are we going to do if the data set just doesn't allow classification by a linear classifier? Let us look at a one-dimensional case. The top data set in Figure 15.6 is straightforwardly classified by a linear classifier but the middle data set is not. We instead need to be able to pick out an interval. One way to solve this problem is to map the data on to a higher dimensional space and then to use a linear classifier in the higher dimensional space. For example, the bottom part of the figure shows that a linear separator can easily classify the data if we use a quadratic function to map the data into two dimensions (a polar coordinates projection would be another possibility).

◮ Figure 15.6  Projecting data that is not linearly separable into a higher dimensional space can make it linearly separable.
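A tiny numerical sketch of this idea (toy data made up, not from the book; it assumes NumPy and scikit-learn): 1-D points that require an "interval" classifier become linearly separable after the quadratic map x ↦ (x, x²).

```python
# Sketch: an "interval" pattern in one dimension becomes linearly separable
# after mapping each point x to the two-dimensional vector (x, x^2).
import numpy as np
from sklearn.svm import LinearSVC

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.7, 2.4, 3.1])
y = np.array([0, 0, 1, 1, 1, 0, 0])        # positive class occupies a middle interval

phi_x = np.column_stack([x, x ** 2])       # quadratic map into two dimensions
clf = LinearSVC(C=10.0).fit(phi_x, y)
print(clf.predict(phi_x))                  # a straight line in (x, x^2) space now separates the classes
```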


The general idea is to map the original feature space to some higher-dimensional feature space where the training set is separable. Of course, we would want to do so in ways that preserve relevant dimensions of relatedness between data points, so that the resultant classifier should still generalize well.

SVMs, and also a number of other linear classifiers, provide an easy and efficient way of doing this mapping to a higher dimensional space, which is referred to as "the kernel trick". It's not really a trick: it just exploits the math that we have seen. The SVM linear classifier relies on a dot product between data point vectors. Let $K(\vec{x}_i, \vec{x}_j) = \vec{x}_i^{\,T}\vec{x}_j$. Then the classifier we have seen so far is:

(15.13)   $f(\vec{x}) = \operatorname{sign}\bigl(\textstyle\sum_i \alpha_i y_i K(\vec{x}_i, \vec{x}) + b\bigr)$

Now suppose we decide to map every data point into a higher dimensional space via some transformation $\Phi\colon \vec{x} \mapsto \phi(\vec{x})$. Then the dot product becomes $\phi(\vec{x}_i)^{T}\phi(\vec{x}_j)$. If it turned out that this dot product (which is just a real number) could be computed simply and efficiently in terms of the original data points, then we wouldn't have to actually map from $\vec{x} \mapsto \phi(\vec{x})$. Rather, we could simply compute the quantity $K(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i)^{T}\phi(\vec{x}_j)$, and then use the function's value in Equation (15.13). A kernel function $K$ is such a function that corresponds to a dot product in some expanded feature space.

✎ Example 15.2: The quadratic kernel in two dimensions. For 2-dimensional vectors $\vec{u} = (u_1\; u_2)$, $\vec{v} = (v_1\; v_2)$, consider $K(\vec{u}, \vec{v}) = (1 + \vec{u}^{\,T}\vec{v})^2$. We wish to show that this is a kernel, i.e., that $K(\vec{u}, \vec{v}) = \phi(\vec{u})^{T}\phi(\vec{v})$ for some $\phi$. Consider $\phi(\vec{u}) = (1\;\; u_1^2\;\; \sqrt{2}u_1u_2\;\; u_2^2\;\; \sqrt{2}u_1\;\; \sqrt{2}u_2)$. Then:

(15.14)   $K(\vec{u}, \vec{v}) = (1 + \vec{u}^{\,T}\vec{v})^2$
          $= 1 + u_1^2v_1^2 + 2u_1v_1u_2v_2 + u_2^2v_2^2 + 2u_1v_1 + 2u_2v_2$
          $= (1\;\; u_1^2\;\; \sqrt{2}u_1u_2\;\; u_2^2\;\; \sqrt{2}u_1\;\; \sqrt{2}u_2)^{T}\,(1\;\; v_1^2\;\; \sqrt{2}v_1v_2\;\; v_2^2\;\; \sqrt{2}v_1\;\; \sqrt{2}v_2)$
          $= \phi(\vec{u})^{T}\phi(\vec{v})$

In the language of functional analysis, what kinds of functions are valid kernel functions? Kernel functions are sometimes more precisely referred to as Mercer kernels, because they must satisfy Mercer's condition: for any $g(\vec{x})$ such that $\int g(\vec{x})^2\, d\vec{x}$ is finite, we must have that:

(15.15)   $\int K(\vec{x}, \vec{z})\, g(\vec{x})\, g(\vec{z})\, d\vec{x}\, d\vec{z} \ge 0.$

A kernel function $K$ must be continuous, symmetric, and have a positive definite Gram matrix. Such a $K$ means that there exists a mapping to a reproducing kernel Hilbert space (a Hilbert space is a vector space closed under dot products) such that the dot product there gives the same value as the function $K$. If a kernel does not satisfy Mercer's condition, then the corresponding QP may have no solution. If you would like to better understand these issues, you should consult the books on SVMs mentioned in Section 15.5. Otherwise, you can content yourself with knowing that 90% of work with kernels uses one of two straightforward families of functions of two vectors, which we define below, and which define valid kernels.
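To connect Example 15.2 back to computation, here is a quick numerical check (a sketch, not from the book; it assumes NumPy, and phi is just an illustrative name) that the quadratic kernel value equals an explicit dot product in the six-dimensional expanded space:

```python
# Numerical check of Example 15.2: K(u, v) = (1 + u.v)^2 equals the dot
# product of the explicit feature maps phi(u) and phi(v).
import numpy as np

def phi(u):
    u1, u2 = u
    return np.array([1.0, u1**2, np.sqrt(2)*u1*u2, u2**2, np.sqrt(2)*u1, np.sqrt(2)*u2])

rng = np.random.default_rng(0)
u, v = rng.normal(size=2), rng.normal(size=2)
K = (1.0 + u @ v) ** 2
assert np.isclose(K, phi(u) @ phi(v))   # same number, computed without materializing phi for K
```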
The two commonly used families of kernels are polynomial kernels and radial basis functions. Polynomial kernels are of the form $K(\vec{x}, \vec{z}) = (1 + \vec{x}^{\,T}\vec{z})^d$. The case of $d = 1$ is a linear kernel, which is what we had before the start of this section (the constant 1 just changing the threshold). The case of $d = 2$ gives a quadratic kernel, and is very commonly used. We illustrated the quadratic kernel in Example 15.2.

The most common form of radial basis function is a Gaussian distribution, calculated as:

(15.16)   $K(\vec{x}, \vec{z}) = e^{-(\vec{x} - \vec{z})^2 / (2\sigma^2)}$

A radial basis function (rbf) is equivalent to mapping the data into an infinite dimensional Hilbert space, and so we cannot illustrate the radial basis function concretely, as we did the quadratic kernel. Beyond these two families, there has been interesting work developing other kernels, some of which is promising for text applications. In particular, there has been investigation of string kernels (see Section 15.5).

The world of SVMs comes with its own language, which is rather different from the language otherwise used in machine learning. The terminology does have deep roots in mathematics, but it's important not to be too awed by that terminology. Really, we are talking about some quite simple things. A polynomial kernel allows us to model feature conjunctions (up to the order of the polynomial). That is, if we want to be able to model occurrences of pairs of words, which give distinctive information about topic classification, not given by the individual words alone, like perhaps operating AND system or ethnic AND cleansing, then we need to use a quadratic kernel. If occurrences of triples of words give distinctive information, then we need to use a cubic kernel. Simultaneously you also get the powers of the basic features; for most text applications, that probably isn't useful, but it just comes along with the math and hopefully doesn't do harm. A radial basis function allows you to have features that pick out circles (hyperspheres), although the decision boundaries become much more complex as multiple such features interact. A string kernel lets you have features that are character subsequences of terms. All of these are straightforward notions which have also been used in many other places under different names.
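As a practical aside (not from the book), kernelized SVM implementations expose both families directly; for example, scikit-learn's SVC computes $(\gamma\, \vec{x}^{\,T}\vec{z} + c_0)^d$ for kernel='poly' and $e^{-\gamma \|\vec{x} - \vec{z}\|^2}$ for kernel='rbf', so gamma plays the role of $1/(2\sigma^2)$. A sketch:

```python
# Sketch: the two kernel families of this section via scikit-learn's SVC.
from sklearn.svm import SVC

# (1 + x.z)^2: gamma=1 and coef0=1 match the polynomial form used in the text
quadratic_svm = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0)

# exp(-||x - z||^2 / (2 sigma^2)): gamma = 1 / (2 * sigma^2), here with sigma = 1
rbf_svm = SVC(kernel="rbf", gamma=0.5, C=1.0)

# quadratic_svm.fit(X_train, y_train)   # X_train, y_train are placeholders
# rbf_svm.fit(X_train, y_train)
```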

15.2.4 Experimental results