Preliminary draft © 2008 Cambridge UP
15 Support vector machines and machine learning on documents
the higher-order features and to train a linear SVM.⁴
15.2.2 Multiclass SVMs
SVMs are inherently two-class classifiers. The traditional way to do multiclass classification with SVMs is to use one of the methods discussed in Section 14.5 (page 306). In particular, the most common technique in practice has been to build |C| one-versus-rest classifiers (commonly referred to as "one-versus-all" or OVA classification), and to choose the class which classifies the test datum with greatest margin. Another strategy is to build a set of one-versus-one classifiers, and to choose the class that is selected by the most classifiers. While this involves building |C|(|C| − 1)/2 classifiers, the time for training classifiers may actually decrease, since the training data set for each classifier is much smaller.
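The one-versus-all scheme just described can be sketched in a few lines. The weight vectors, biases, and class labels below are hypothetical stand-ins for |C| separately trained binary linear SVMs; this is an illustrative sketch, not the book's code:

```python
def ova_predict(x, classifiers):
    """One-versus-all: score x under each binary classifier (w, b, label)
    and return the label whose classifier gives the greatest margin."""
    best_label, best_score = None, float("-inf")
    for w, b, label in classifiers:
        # Signed distance to the hyperplane (up to |w| scaling) acts as the margin.
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical trained classifiers for three classes over 2-d inputs.
classifiers = [
    ((1.0, 0.0), 0.0, "A"),
    ((0.0, 1.0), 0.0, "B"),
    ((-1.0, -1.0), 0.5, "C"),
]
print(ova_predict((2.0, 1.0), classifiers))  # "A": score 2.0 beats 1.0 and -2.5
```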
However, these are not very elegant approaches to solving multiclass problems. A better alternative is provided by the construction of multiclass SVMs, where we build a two-class classifier over a feature vector Φ(x⃗, y) derived from the pair consisting of the input features and the class of the datum. At test time, the classifier chooses the class y = arg maxy′ w⃗ᵀΦ(x⃗, y′). The margin during training is the gap between this value for the correct class and for the nearest other class, and so the quadratic program formulation will require that ∀i ∀y ≠ yᵢ: w⃗ᵀΦ(x⃗ᵢ, yᵢ) − w⃗ᵀΦ(x⃗ᵢ, y) ≥ 1 − ξᵢ. This general method can be extended to give a multiclass formulation of various kinds of linear classifiers. It is also a simple instance of a generalization of classification where the classes are not just a set of independent, categorical labels, but may be arbitrary structured objects with relationships defined between them. In the SVM world, such work comes under the label of structural SVMs. We mention them again in Section 15.4.2.
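Prediction in this multiclass formulation can be sketched as follows. The book does not fix a particular Φ; the joint feature map used here, which places the input features in a block reserved for class y, is a simple and common choice but an assumption of this sketch, as are the weights:

```python
def joint_features(x, y, num_classes):
    """Joint feature map Phi(x, y): copy x into the block reserved for class y,
    leaving all other blocks zero."""
    dim = len(x)
    phi = [0.0] * (dim * num_classes)
    phi[y * dim:(y + 1) * dim] = x
    return phi

def predict(w, x, num_classes):
    """Choose y = argmax_{y'} w . Phi(x, y')."""
    scores = []
    for y in range(num_classes):
        phi = joint_features(x, y, num_classes)
        scores.append(sum(wi * pi for wi, pi in zip(w, phi)))
    return max(range(num_classes), key=lambda y: scores[y])

# Hypothetical weight vector: 3 classes, 2 features per class.
w = [1.0, -1.0,   0.0, 2.0,   -1.0, 0.0]
print(predict(w, [1.0, 1.0], 3))  # class 1: score 2.0 beats 0.0 and -1.0
```

With this block-structured Φ, the multiclass SVM reduces to one weight sub-vector per class, which makes the connection to one-versus-rest classifiers easy to see.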
15.2.3 Nonlinear SVMs
With what we have presented so far, data sets that are linearly separable (perhaps with a few exceptions or some noise) are well-handled. But what are we going to do if the data set just doesn't allow classification by a linear classifier? Let us look at a one-dimensional case. The top data set in Figure 15.6 is straightforwardly classified by a linear classifier but the middle data set is not. We instead need to be able to pick out an interval. One way to solve this problem is to map the data on to a higher dimensional space and then to use a linear classifier in the higher dimensional space. For example, the bottom part of the figure shows that a linear separator can easily classify the data
4. Materializing the features refers to directly calculating the higher-order and interaction terms and then putting them into a linear model.
15.2 Extensions to the SVM model
◮ Figure 15.6  Projecting data that is not linearly separable into a higher dimensional space can make it linearly separable.
if we use a quadratic function to map the data into two dimensions (a polar coordinates projection would be another possibility). The general idea is to map the original feature space to some higher-dimensional feature space where the training set is separable. Of course, we would want to do so in ways that preserve relevant dimensions of relatedness between data points, so that the resultant classifier should still generalize well.
SVMs, and also a number of other linear classifiers, provide an easy and efficient way of doing this mapping to a higher dimensional space, which is referred to as "the kernel trick". It's not really a trick: it just exploits the math that we have seen. The SVM linear classifier relies on a dot product between data point vectors. Let K(x⃗ᵢ, x⃗ⱼ) = x⃗ᵢᵀx⃗ⱼ. Then the classifier we have seen so far is:

(15.13)   f(x⃗) = sign(∑ᵢ αᵢ yᵢ K(x⃗ᵢ, x⃗) + b)
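Equation (15.13) can be computed directly from the support vectors. In this sketch, the support vectors, the αᵢ, the labels yᵢ, and b are assumed to come from some already-trained dual SVM; the particular values below are hypothetical:

```python
def svm_decision(x, support_vectors, alphas, labels, b, K):
    """Evaluate f(x) = sign(sum_i alpha_i y_i K(x_i, x) + b), Equation (15.13)."""
    s = sum(a * y * K(sv, x)
            for a, y, sv in zip(alphas, labels, support_vectors))
    s += b
    return 1 if s >= 0 else -1

def linear_K(u, v):
    """Plain dot product: the linear kernel."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical trained parameters: one support vector per class.
svs = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [0.5, 0.5]
labels = [1, -1]
print(svm_decision((2.0, 0.5), svs, alphas, labels, 0.0, linear_K))  # 1
```

Because only K appears in the decision rule, swapping `linear_K` for any other kernel function changes the classifier without touching the rest of the code.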
Now suppose we decide to map every data point into a higher dimensional space via some transformation Φ: x⃗ ↦ φ(x⃗). Then the dot product becomes φ(x⃗ᵢ)ᵀφ(x⃗ⱼ). If it turned out that this dot product (which is just a real number) could be computed simply and efficiently in terms of the original data points, then we wouldn't have to actually map from x⃗ ↦ φ(x⃗). Rather, we could simply compute the quantity K(x⃗ᵢ, x⃗ⱼ) = φ(x⃗ᵢ)ᵀφ(x⃗ⱼ), and then use the function's value in Equation (15.13). A kernel function K is such a function that corresponds to a dot product in some expanded feature space.
✎ Example 15.2: The quadratic kernel in two dimensions. For 2-dimensional vectors u⃗ = (u₁, u₂), v⃗ = (v₁, v₂), consider K(u⃗, v⃗) = (1 + u⃗ᵀv⃗)². We wish to show that this is a kernel, i.e., that K(u⃗, v⃗) = φ(u⃗)ᵀφ(v⃗) for some φ. Consider φ(u⃗) = (1, u₁², √2u₁u₂, u₂², √2u₁, √2u₂). Then:

(15.14)   K(u⃗, v⃗) = (1 + u⃗ᵀv⃗)²
                   = 1 + u₁²v₁² + 2u₁v₁u₂v₂ + u₂²v₂² + 2u₁v₁ + 2u₂v₂
                   = (1, u₁², √2u₁u₂, u₂², √2u₁, √2u₂)ᵀ (1, v₁², √2v₁v₂, v₂², √2v₁, √2v₂)
                   = φ(u⃗)ᵀφ(v⃗)
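The algebra in Example 15.2 can be checked numerically. This small sketch compares the kernel value (1 + u⃗ᵀv⃗)² against the explicit dot product φ(u⃗)ᵀφ(v⃗); the test vectors are arbitrary:

```python
import math

def quad_kernel(u, v):
    """Quadratic kernel K(u, v) = (1 + u.v)^2 for 2-d vectors."""
    uv = u[0] * v[0] + u[1] * v[1]
    return (1 + uv) ** 2

def phi(u):
    """Explicit 6-dimensional feature map from Example 15.2."""
    r2 = math.sqrt(2)
    return [1.0, u[0] ** 2, r2 * u[0] * u[1], u[1] ** 2, r2 * u[0], r2 * u[1]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

u, v = (1.0, 2.0), (3.0, -1.0)
# The kernel value and the explicit dot product agree.
print(quad_kernel(u, v))    # (1 + 3 - 2)^2 = 4.0
print(dot(phi(u), phi(v)))  # 4.0 (up to floating-point rounding)
```

The point of the kernel trick is visible here: `quad_kernel` touches 2 numbers per vector, while the explicit route first builds 6-dimensional feature vectors.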
In the language of functional analysis, what kinds of functions are valid kernel functions? Kernel functions are sometimes more precisely referred to as Mercer kernels, because they must satisfy Mercer's condition: for any g(x⃗) such that ∫ g(x⃗)² dx⃗ is finite, we must have that:

(15.15)   ∫∫ K(x⃗, z⃗) g(x⃗) g(z⃗) dx⃗ dz⃗ ≥ 0.

A kernel function K must be continuous, symmetric, and have a positive definite Gram matrix. Such a K means that there exists a mapping to a reproducing kernel Hilbert space (a Hilbert space is a vector space closed under dot products) such that the dot product there gives the same value as the function K. If a kernel does not satisfy Mercer's condition, then the corresponding QP may have no solution. If you would like to better understand these issues, you should consult the books on SVMs mentioned in Section 15.5. Otherwise, you can content yourself with knowing that 90% of work with kernels uses one of two straightforward families of functions of two vectors, which we define below, and which define valid kernels.
The two commonly used families of kernels are polynomial kernels and radial basis functions. Polynomial kernels are of the form K(x⃗, z⃗) = (1 + x⃗ᵀz⃗)ᵈ. The case of d = 1 is a linear kernel, which is what we had before the start of this section (the constant 1 just changing the threshold). The case of d = 2 gives a quadratic kernel, and is very commonly used. We illustrated the quadratic kernel in Example 15.2.
The most common form of radial basis function is a Gaussian distribution, calculated as:

(15.16)   K(x⃗, z⃗) = e^(−|x⃗ − z⃗|² / (2σ²))

A radial basis function (rbf) is equivalent to mapping the data into an infinite-dimensional Hilbert space, and so we cannot illustrate the radial basis function concretely, as we did for a quadratic kernel. Beyond these two families, there has been interesting work developing other kernels, some of which is promising for text applications. In particular, there has been investigation of string kernels (see Section 15.5).
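Both kernel families can be written down directly from their definitions. In this sketch the degree d and width σ are free parameters that would normally be tuned on held-out data:

```python
import math

def polynomial_kernel(x, z, d):
    """K(x, z) = (1 + x.z)^d; d=1 is the linear kernel, d=2 the quadratic."""
    return (1 + sum(a * b for a, b in zip(x, z))) ** d

def rbf_kernel(x, z, sigma):
    """Gaussian rbf K(x, z) = exp(-|x - z|^2 / (2 sigma^2)), Equation (15.16)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))

print(polynomial_kernel((1.0, 2.0), (3.0, -1.0), 2))  # 4.0, as in Example 15.2
print(rbf_kernel((0.0, 0.0), (0.0, 0.0), 1.0))        # 1.0: identical points
```

Note how the rbf value is 1 for identical points and decays toward 0 as the points move apart, which is what makes it behave like a similarity measure.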
The world of SVMs comes with its own language, which is rather different from the language otherwise used in machine learning. The terminology does have deep roots in mathematics, but it's important not to be too awed by that terminology. Really, we are talking about some quite simple things. A polynomial kernel allows us to model feature conjunctions (up to the order of the polynomial). That is, if we want to be able to model occurrences of pairs of words, which give distinctive information about topic classification not given by the individual words alone, like perhaps operating AND system or ethnic AND cleansing, then we need to use a quadratic kernel. If occurrences of triples of words give distinctive information, then we need to use a cubic kernel. Simultaneously you also get the powers of the basic features; for most text applications, that probably isn't useful, but it just comes along with the math and hopefully doesn't do harm. A radial basis function allows you to have features that pick out circles (hyperspheres), although the decision boundaries become much more complex as multiple such features interact. A string kernel lets you have features that are character subsequences of terms. All of these are straightforward notions which have also been used in many other places under different names.
15.2.4 Experimental results