Majorization-Minimization for Manifold Embedding
(Supplemental Document)

Zhirong Yang^{1,2,3}, Jaakko Peltonen^{1,3,4} and Samuel Kaski^{1,2,3}
1 Helsinki Institute for Information Technology HIIT
2 University of Helsinki
3 Aalto University
4 University of Tampere

Abstract

This supplemental document provides additional information that does not fit in the paper due to the space limit. Section 1 gives related mathematical basics, including majorizations of convex and concave functions, as well as the definitions of information divergences. Section 2 presents more examples of developing MM algorithms for manifold embedding, in addition to the t-SNE example included in the paper. Section 3 illustrates that many existing manifold embedding methods can be unified into the Neighbor Embedding framework given in Section 5 of the paper. Section 4 provides proofs of the theorems in the paper. Section 5 gives an example of QL beyond manifold embedding. Section 6 gives the statistics and sources of the experimented datasets. Section 7 provides supplemental experiment results.

Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors.
1 Preliminaries

Here we provide two lemmas related to majorization of concave/convex functions, and the definitions of information divergences.

1.1 Majorization of concave/convex functions

1) A concave function can be upper-bounded by its tangent.

Lemma 1. If f(z) is concave in z, then

    f(\tilde{x}) \le f(x) + \left\langle \frac{\partial f(\tilde{x})}{\partial \tilde{x}}\Big|_{\tilde{x}=x},\ \tilde{x} - x \right\rangle
                = \left\langle \frac{\partial f(\tilde{x})}{\partial \tilde{x}}\Big|_{\tilde{x}=x},\ \tilde{x} \right\rangle + \mathrm{constant}
                \overset{\mathrm{def}}{=} G(\tilde{x}, x).

For a scalar concave function f(\cdot), this simplifies to f(\tilde{x}) \le G(\tilde{x}, x) = \tilde{x} f'(x) + \mathrm{constant}. Obviously, G(x, x) = f(x) and

    \frac{\partial f(\tilde{x})}{\partial \tilde{x}}\Big|_{\tilde{x}=x} = \frac{\partial G(\tilde{x}, x)}{\partial \tilde{x}}\Big|_{\tilde{x}=x}.

2) A convex function can be upper-bounded by using Jensen's inequality.

Lemma 2. If f(z) is convex in z, and \tilde{x} = [\tilde{x}_1, \dots, \tilde{x}_n] as well as x = [x_1, \dots, x_n] are nonnegative, then

    f\left(\sum_{i=1}^{n} \tilde{x}_i\right) \le \sum_{i=1}^{n} \frac{x_i}{\sum_j x_j}\, f\left(\frac{\tilde{x}_i}{x_i}\sum_j x_j\right) \overset{\mathrm{def}}{=} G(\tilde{x}, x).

Obviously G(x, x) = f\left(\sum_{i=1}^{n} x_i\right). Their first derivatives also match:

    \frac{\partial G(\tilde{x}, x)}{\partial \tilde{x}_i}\Big|_{\tilde{x}=x}
    = \frac{x_i}{\sum_j x_j}\, f'\left(\frac{\tilde{x}_i}{x_i}\sum_j x_j\right)\frac{\sum_j x_j}{x_i}\Big|_{\tilde{x}=x}
    = f'\left(\sum_j x_j\right)
    = \frac{\partial f\left(\sum_{i=1}^{n} \tilde{x}_i\right)}{\partial \tilde{x}_i}\Big|_{\tilde{x}=x}.
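As an illustration (not part of the original derivation), the two bounds can be checked numerically. The following Python sketch builds both majorization functions at a current point x and verifies that they upper-bound the original functions and are tight at \tilde{x} = x; the example functions f_concave and f_convex are arbitrary choices made here for the test.

import numpy as np

def tangent_majorizer(f, grad_f, x):
    """Lemma 1: tangent upper bound of a concave f, built at the current point x."""
    g = grad_f(x)
    return lambda x_new: f(x) + g @ (x_new - x)

def jensen_majorizer(f, x):
    """Lemma 2: Jensen upper bound of a convex f applied to a sum of
    nonnegative variables, built at the current point x."""
    s = x.sum()
    w = x / s                                   # weights x_i / sum_j x_j
    return lambda x_new: np.sum(w * f(x_new / x * s))

rng = np.random.default_rng(0)
x, x_new = rng.random(5) + 0.1, rng.random(5) + 0.1

# Concave example: f(z) = sum(log z); its gradient is 1/z.
f_concave = lambda z: np.sum(np.log(z))
G1 = tangent_majorizer(f_concave, lambda z: 1.0 / z, x)
assert f_concave(x_new) <= G1(x_new) + 1e-12    # upper bound
assert abs(f_concave(x) - G1(x)) < 1e-12        # tight at x

# Convex example: f(t) = -log(t), applied to the sum of the entries.
f_convex = lambda t: -np.log(t)
G2 = jensen_majorizer(f_convex, x)
assert f_convex(x_new.sum()) <= G2(x_new) + 1e-12
assert abs(f_convex(x.sum()) - G2(x)) < 1e-12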
1.2 Information divergences

Information divergences, denoted by D(p||q), were originally defined for probabilities and later extended to measure the difference between two (usually nonnegative) tensors p and q, where D(p||q) \ge 0, and D(p||q) = 0 iff p = q. To avoid notational clutter we only give vectorial definitions; it is straightforward to extend the formulae to matrices and higher-order tensors.

Let D_\alpha, D_\beta, D_\gamma, and D_r respectively denote the \alpha-, \beta-, \gamma-, and Rényi divergences. Their definitions are (see e.g. [3])

    D_\alpha(p||q) = \frac{1}{\alpha(\alpha-1)} \sum_i \left[ p_i^{\alpha} q_i^{1-\alpha} - \alpha p_i + (\alpha-1) q_i \right],

    D_\beta(p||q) = \frac{1}{\beta(\beta-1)} \sum_i \left[ p_i^{\beta} + (\beta-1) q_i^{\beta} - \beta p_i q_i^{\beta-1} \right],

    D_\gamma(p||q) = \frac{\ln\left(\sum_i p_i^{\gamma}\right)}{\gamma(\gamma-1)} + \frac{\ln\left(\sum_i q_i^{\gamma}\right)}{\gamma} - \frac{\ln\left(\sum_i p_i q_i^{\gamma-1}\right)}{\gamma-1},

    D_r(p||q) = \frac{1}{r-1} \ln \sum_i \tilde{p}_i^{r} \tilde{q}_i^{1-r},

where p_i and q_i are the entries of p and q respectively, \tilde{p}_i = p_i / \sum_j p_j, and \tilde{q}_i = q_i / \sum_j q_j. To handle p's containing zero entries, we only consider nonnegative \alpha, \beta, \gamma and r. These families are rich, as they cover the most commonly used divergences in machine learning, such as the normalized Kullback-Leibler divergence (obtained from D_r with r \to 1 or D_\gamma with \gamma \to 1), the non-normalized KL-divergence (\alpha \to 1 or \beta \to 1), the Itakura-Saito divergence (\beta \to 0), the squared Euclidean distance (\beta = 2), the Hellinger distance (\alpha = 0.5), and the Chi-square divergence (\alpha = 2). Different divergences have become widespread in different domains. For example, D_KL is widely used for text documents (e.g. [7]) and D_IS is popular for audio signals (e.g. [4]). In general, estimation using \alpha-divergence is more exclusive with larger \alpha, and more inclusive with smaller \alpha (e.g. [10]). For \beta-divergence, the estimation becomes more robust but less efficient with larger \beta.
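As a concrete reference (our illustrative addition, not part of the paper), the four families above can be implemented for strictly positive vectors and generic parameter values as follows; the limit cases (e.g. \alpha \to 1 or \beta \to 0) would need to be handled separately.

import numpy as np

def alpha_div(p, q, a):
    """Alpha-divergence for a not in {0, 1}."""
    return np.sum(p**a * q**(1 - a) - a * p + (a - 1) * q) / (a * (a - 1))

def beta_div(p, q, b):
    """Beta-divergence for b not in {0, 1}."""
    return np.sum(p**b + (b - 1) * q**b - b * p * q**(b - 1)) / (b * (b - 1))

def gamma_div(p, q, g):
    """Gamma-divergence for g not in {0, 1}."""
    return (np.log(np.sum(p**g)) / (g * (g - 1))
            + np.log(np.sum(q**g)) / g
            - np.log(np.sum(p * q**(g - 1))) / (g - 1))

def renyi_div(p, q, r):
    """Renyi divergence (on the normalized vectors) for r != 1."""
    pn, qn = p / p.sum(), q / q.sum()
    return np.log(np.sum(pn**r * qn**(1 - r))) / (r - 1)

rng = np.random.default_rng(1)
p, q = rng.random(10) + 0.1, rng.random(10) + 0.1
# beta = 2 recovers half of the squared Euclidean distance:
assert np.isclose(beta_div(p, q, 2.0), 0.5 * np.sum((p - q)**2))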
2 More Development Examples

The paper provides an example of developing an MM algorithm for t-Distributed Stochastic Neighbor Embedding (t-SNE); here we provide additional examples of developing MM algorithms for other manifold embedding methods.

2.1 Elastic Embedding (EE)

Given a symmetric and nonnegative matrix P and \lambda > 0, Elastic Embedding [2] minimizes

    J_{EE}(\tilde{Y}) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 + \lambda \sum_{ij} \exp\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right)
                      = -\sum_{ij} P_{ij} \ln \tilde{Q}_{ij} + \lambda \sum_{ij} \tilde{Q}_{ij},

where \tilde{Q}_{ij} = \exp\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right).

The EE objective is naturally decomposed into

    A(P, \tilde{Q}) = -\sum_{ij} P_{ij} \ln \tilde{Q}_{ij},
    B(\tilde{Q}) = \lambda \sum_{ij} \tilde{Q}_{ij}.

The quadratification of A(P, \tilde{Q}) is simply the identity, with W = P. The final majorization function is

    G(\tilde{Y}, Y) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 + \langle \Psi, \tilde{Y} \rangle + \frac{\rho}{2} \|\tilde{Y} - Y\|^2 + \mathrm{constant},

where

    \Psi = \frac{\partial B}{\partial \tilde{Y}}\Big|_{\tilde{Y}=Y} = -4\lambda L_Q Y.

Thus the MM update rule of EE is

    Y^{new} = \left(L_P + \frac{\rho}{4} I\right)^{-1} \left(\lambda L_Q Y + \frac{\rho}{4} Y\right).
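To make the update concrete, here is a small Python sketch of one EE MM iteration (our own illustration under the notation above; P, Y, lam, and rho are assumed inputs, and L_W denotes the unnormalized graph Laplacian). A full algorithm would wrap this step in the backtracking loop on \rho described in the paper.

import numpy as np

def laplacian(W):
    """Unnormalized graph Laplacian L_W = diag(W 1) - W."""
    return np.diag(W.sum(axis=1)) - W

def ee_mm_step(Y, P, lam, rho):
    """One MM update for Elastic Embedding:
    Y_new = (L_P + rho/4 I)^{-1} (lam * L_Q Y + rho/4 Y)."""
    n = Y.shape[0]
    sqdist = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=2)
    Q = np.exp(-sqdist)
    np.fill_diagonal(Q, 0.0)                    # self-pairs do not affect the Laplacian
    A = laplacian(P) + (rho / 4.0) * np.eye(n)
    b = lam * laplacian(Q) @ Y + (rho / 4.0) * Y
    return np.linalg.solve(A, b)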
2.2 Stochastic Neighbor Embedding (SNE)

Suppose the input matrix is row-stochastic, i.e. P_{ij} \ge 0 and \sum_j P_{ij} = 1. Denote \tilde{q}_{ij} = \exp\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right) and \tilde{Q}_{ij} = \tilde{q}_{ij} / \sum_b \tilde{q}_{ib}. Stochastic Neighbor Embedding [6] minimizes the total Kullback-Leibler (KL) divergence between the rows of P and \tilde{Q}:

    J_{SNE}(\tilde{Y}) = \sum_i \sum_j P_{ij} \ln \frac{P_{ij}}{\tilde{Q}_{ij}}
                       = -\sum_{ij} P_{ij} \ln \tilde{q}_{ij} + \sum_i \ln \sum_j \tilde{q}_{ij} + \mathrm{constant}
                       = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 + \sum_i \ln \sum_j \tilde{q}_{ij} + \mathrm{constant}.

Thus we can decompose the SNE objective into A(P, \tilde{q}) + B(\tilde{q}) + \mathrm{constant}, where

    A(P, \tilde{q}) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2,
    B(\tilde{q}) = \sum_i \ln \sum_j \tilde{q}_{ij}.

Again the quadratification of A(P, \tilde{q}) is simply the identity, with W = P. The final majorization function is

    G(\tilde{Y}, Y) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 + \langle \Psi, \tilde{Y} \rangle + \frac{\rho}{2} \|\tilde{Y} - Y\|^2 + \mathrm{constant},

where

    \Psi = \frac{\partial B}{\partial \tilde{Y}}\Big|_{\tilde{Y}=Y} = -2 L_{Q+Q^T} Y.

Thus the MM update rule of SNE is

    Y^{new} = \left(L_{P+P^T} + \frac{\rho}{2} I\right)^{-1} \left(L_{Q+Q^T} Y + \frac{\rho}{2} Y\right).

Similarly, we can develop the rule for symmetric SNE (s-SNE) [12]:

    J_{s\text{-}SNE}(Y) = \sum_{ij} P_{ij} \ln \frac{P_{ij}}{Q_{ij}},
    Y^{new} = \left(L_P + \frac{\rho}{4} I\right)^{-1} \left(L_Q Y + \frac{\rho}{4} Y\right),

where \sum_{ij} P_{ij} = 1, Q_{ab} = \tilde{q}_{ab} / \sum_{ij} \tilde{q}_{ij}, and \tilde{q}_{ab} = \exp\left(-\|y_a - y_b\|^2\right); and for the Neighbor Retrieval Visualizer (NeRV) [13], whose objective is

    J_{NeRV}(Y) = \lambda \sum_i \sum_j P_{ij} \ln \frac{P_{ij}}{Q_{ij}} + (1-\lambda) \sum_i \sum_j Q_{ij} \ln \frac{Q_{ij}}{P_{ij}},

where P and Q are defined in the same manner as in SNE. Quadratifying the term \lambda \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 and Lipschitzing the remainder B = J_{NeRV} + \lambda \sum_i \sum_j P_{ij} \ln \tilde{q}_{ij} gives the update rule

    Y^{new} = \left(2\lambda L_{P+P^T} + \rho I\right)^{-1} \left(-\Psi + \rho Y\right),

where \Psi = \partial B / \partial Y is evaluated at the current estimate.
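Analogously to the EE sketch above, one SNE MM iteration can be written as follows (again our illustration, not code from the paper; P is assumed row-stochastic).

import numpy as np

def sne_mm_step(Y, P, rho):
    """One MM update for SNE:
    Y_new = (L_{P+P^T} + rho/2 I)^{-1} (L_{Q+Q^T} Y + rho/2 Y)."""
    n = Y.shape[0]
    sqdist = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=2)
    q = np.exp(-sqdist)
    np.fill_diagonal(q, 0.0)
    Q = q / q.sum(axis=1, keepdims=True)         # row-normalized kernel
    def lap(W):
        return np.diag(W.sum(axis=1)) - W
    A = lap(P + P.T) + (rho / 2.0) * np.eye(n)
    b = lap(Q + Q.T) @ Y + (rho / 2.0) * Y
    return np.linalg.solve(A, b)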
2.3 LinLog

Suppose a weighted undirected graph is encoded in a symmetric and nonnegative matrix P (i.e. the weighted adjacency matrix). The node-repulsive LinLog graph layout method [11] minimizes the following energy function:

    J_{LinLog}(\tilde{Y}) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\| - \lambda \sum_{ij} \ln \|\tilde{y}_i - \tilde{y}_j\|.

We write \tilde{Q}_{ij} = \|\tilde{y}_i - \tilde{y}_j\|^{-1} and then decompose the LinLog objective function into A(P, \tilde{Q}) + B(P, \tilde{Q}), where

    A(P, \tilde{Q}) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|,
    B(P, \tilde{Q}) = -\lambda \sum_{ij} \ln \|\tilde{y}_i - \tilde{y}_j\|.

Since the square root function is concave, by Lemma 1 we can upper bound A(P, \tilde{Q}) by

    A(P, \tilde{Q}) \le \sum_{ij} \frac{P_{ij}}{2\|y_i - y_j\|} \|\tilde{y}_i - \tilde{y}_j\|^2 + \mathrm{constant}.

That is, W_{ij} = \frac{P_{ij}}{2\|y_i - y_j\|} in the quadratification phase. The final majorization function of the LinLog objective is

    G(\tilde{Y}, Y) = \sum_{ij} \frac{P_{ij}}{2\|y_i - y_j\|} \|\tilde{y}_i - \tilde{y}_j\|^2 + \langle \Psi, \tilde{Y} \rangle + \frac{\rho}{2} \|\tilde{Y} - Y\|^2 + \mathrm{constant},

where

    \Psi = \frac{\partial B}{\partial \tilde{Y}}\Big|_{\tilde{Y}=Y} = -2\lambda L_{Q \circ Q} Y,

where \circ is the elementwise product. Then the MM update rule of LinLog is

    Y^{new} = \left(L_{P \circ Q} + \frac{\rho}{2} I\right)^{-1} \left(\lambda L_{Q \circ Q} Y + \frac{\rho}{2} Y\right).

2.4 Multidimensional Scaling with Kernel Strain (MDS-KS)

Strain-based multidimensional scaling maximizes the cosine between the inner products in the input and output spaces (see e.g. [1]):

    J_{MDS\text{-}S}(\tilde{Y}) = \frac{\sum_{ij} P_{ij} \langle \tilde{y}_i, \tilde{y}_j \rangle}{\sqrt{\sum_{ij} P_{ij}^2} \cdot \sqrt{\sum_{ij} \langle \tilde{y}_i, \tilde{y}_j \rangle^2}},

where P_{ij} = \langle x_i, x_j \rangle. The inner products \langle \tilde{y}_i, \tilde{y}_j \rangle above can be seen as a linear kernel \tilde{Q}_{ij} = \langle \tilde{y}_i, \tilde{y}_j \rangle and simply lead to kernel PCA of P. In this example we instead consider a nonlinear embedding kernel \tilde{Q}_{ij} = \exp\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right), where the corresponding objective function is

    \frac{\sum_{ij} P_{ij} \tilde{Q}_{ij}}{\sqrt{\sum_{ij} P_{ij}^2} \cdot \sqrt{\sum_{ij} \tilde{Q}_{ij}^2}},

and maximizing it is equivalent to minimizing its negative logarithm

    J_{MDS\text{-}KS}(\tilde{Y}) = -\ln \sum_{ij} P_{ij} \tilde{Q}_{ij} + \frac{1}{2} \ln \sum_{ij} \tilde{Q}_{ij}^2 + \frac{1}{2} \ln \sum_{ij} P_{ij}^2.

We decompose the MDS-KS objective function into A(P, \tilde{Q}) + B(P, \tilde{Q}) + \mathrm{constant}, where

    A(P, \tilde{Q}) = -\ln \sum_{ij} P_{ij} \tilde{Q}_{ij},
    B(P, \tilde{Q}) = \frac{1}{2} \ln \sum_{ij} \tilde{Q}_{ij}^2.

We can upper bound A(P, \tilde{Q}) by Jensen's inequality (Lemma 2):

    A(P, \tilde{Q}) \le -\sum_{ij} \frac{P_{ij} Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}} \ln \left[ \frac{P_{ij} \tilde{Q}_{ij}}{P_{ij} Q_{ij}} \sum_{ab} P_{ab} Q_{ab} \right]
                     = \sum_{ij} \frac{P_{ij} Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}} \|\tilde{y}_i - \tilde{y}_j\|^2 + \sum_{ij} \frac{P_{ij} Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}} \ln \frac{Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}}.

That is, W_{ij} = \frac{P_{ij} Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}} in the quadratification phase. The final majorization function of the MDS-KS objective is

    G(\tilde{Y}, Y) = \sum_{ij} \frac{P_{ij} Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}} \|\tilde{y}_i - \tilde{y}_j\|^2 + \langle \Psi, \tilde{Y} \rangle + \frac{\rho}{2} \|\tilde{Y} - Y\|^2 + \mathrm{constant},

where

    \Psi = \frac{\partial B}{\partial \tilde{Y}}\Big|_{\tilde{Y}=Y} = -L_U Y,  with  U_{ij} = \frac{4 Q_{ij}^2}{\sum_{ab} Q_{ab}^2}.

Then the MM update rule of MDS-KS is

    Y^{new} = \left(L_W + \frac{\rho}{4} I\right)^{-1} \left(\frac{1}{4} L_U Y + \frac{\rho}{4} Y\right).
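The LinLog case illustrates a non-trivial quadratification: the weights W are recomputed from the current distances at every iteration. A minimal Python sketch of one such step (our illustration; P is assumed symmetric and nonnegative, and self-pairs are skipped):

import numpy as np

def linlog_mm_step(Y, P, lam, rho, eps=1e-12):
    """One MM update for node-repulsive LinLog:
    Y_new = (L_{P o Q} + rho/2 I)^{-1} (lam * L_{Q o Q} Y + rho/2 Y),
    with Q_ij = 1 / ||y_i - y_j||."""
    n = Y.shape[0]
    dist = np.sqrt(np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=2))
    Q = 1.0 / (dist + np.eye(n) + eps)        # inverse distances; dummy value on the diagonal
    np.fill_diagonal(Q, 0.0)
    def lap(W):
        return np.diag(W.sum(axis=1)) - W
    A = lap(P * Q) + (rho / 2.0) * np.eye(n)  # elementwise product P o Q
    b = lam * lap(Q * Q) @ Y + (rho / 2.0) * Y
    return np.linalg.solve(A, b)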
3 Neighbor Embedding

In Section 5 of the paper, we review a framework for manifold embedding which is called Neighbor Embedding (NE) [16, 17]. Here we demonstrate that many manifold embedding objectives, including the above examples, can be equivalently formulated as an NE problem.

NE minimizes D(P||\tilde{Q}), where D is a divergence from the \alpha-, \beta-, \gamma-, or Rényi-divergence families, and the embedding kernel in the paper is parameterized as

    \tilde{Q}_{ij} = \left(c + a\|\tilde{y}_i - \tilde{y}_j\|^2\right)^{-b/a}

for a \ge 0, b > 0, and c \ge 0 (adapted from [15]).

First we show that the parameterized form of \tilde{Q} includes the most popularly used embedding kernels. When a = 1, b = 1, and c = 1, \tilde{Q}_{ij} = \left(1 + \|\tilde{y}_i - \tilde{y}_j\|^2\right)^{-1} is the Cauchy kernel (i.e. the Student-t kernel with a single degree of freedom); when a \to 0 and b = 1, \tilde{Q}_{ij} = \exp\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right) is the Gaussian kernel; when a = 1, b = 1/2, and c = 0, \tilde{Q}_{ij} = \|\tilde{y}_i - \tilde{y}_j\|^{-1} is the inverse of the Euclidean distance.

Next we show that some other existing manifold embedding objectives can be equivalently expressed as NE. Obviously SNE and its variants belong to NE. Moreover, we have:

- arg min_{\tilde{Y}} J_{EE}(\tilde{Y}) = arg min_{\tilde{Y}} D_{\beta \to 1}(P||\lambda\tilde{Q}), with \tilde{Q}_{ij} = \exp\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right) [17].

Proof.

    J_{EE}(\tilde{Y}) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 + \lambda \sum_{ij} \exp\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right)
                      = -\sum_{ij} P_{ij} \ln \tilde{Q}_{ij} + \lambda \sum_{ij} \tilde{Q}_{ij}
                      = \sum_{ij} P_{ij} \ln \frac{P_{ij}}{\lambda \tilde{Q}_{ij}} + \lambda \sum_{ij} \tilde{Q}_{ij} - \sum_{ij} P_{ij} - \sum_{ij} P_{ij} \ln P_{ij} + (\ln \lambda + 1) \sum_{ij} P_{ij}
                      = D_{\beta \to 1}(P||\lambda\tilde{Q}) + \mathrm{constant}.

- arg min_{\tilde{Y}} J_{LinLog}(\tilde{Y}) = arg min_{\tilde{Y}} D_{\beta \to 0}(P||\lambda\tilde{Q}), with \tilde{Q}_{ij} = \|\tilde{y}_i - \tilde{y}_j\|^{-1} [17].

Proof.

    \frac{1}{\lambda} J_{LinLog}(\tilde{Y}) = \frac{1}{\lambda} \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\| - \sum_{ij} \ln \|\tilde{y}_i - \tilde{y}_j\|
                                            = \sum_{ij} \left[ \frac{P_{ij}}{\lambda \tilde{Q}_{ij}} + \ln \tilde{Q}_{ij} \right]
                                            = \sum_{ij} \left[ \frac{P_{ij}}{\lambda \tilde{Q}_{ij}} - \ln \frac{P_{ij}}{\lambda \tilde{Q}_{ij}} - 1 \right] + \sum_{ij} \left[ \ln \frac{P_{ij}}{\lambda} + 1 \right]
                                            = D_{\beta \to 0}(P||\lambda\tilde{Q}) + \mathrm{constant}.

- arg min_{\tilde{Y}} J_{MDS\text{-}KS}(\tilde{Y}) = arg min_{\tilde{Y}} D_{\gamma=2}(P||\tilde{Q}), by their definitions.
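For reference, the parameterized kernel and its limiting cases above can be written compactly as follows (our illustrative Python, not from the paper; the a -> 0 limit is handled explicitly and assumes c = 1).

import numpy as np

def embedding_kernel(sqd, a=1.0, b=1.0, c=1.0):
    """Parameterized kernel Q = (c + a*sqd)^(-b/a) on squared distances sqd.
    For a -> 0 (with c = 1) the limit is the Gaussian kernel exp(-b*sqd)."""
    if a < 1e-12:
        return np.exp(-b * sqd)
    return (c + a * sqd) ** (-b / a)

sqd = np.linspace(0.5, 4.0, 5)
cauchy   = embedding_kernel(sqd, a=1.0, b=1.0, c=1.0)   # Student-t with one degree of freedom
gaussian = embedding_kernel(sqd, a=0.0, b=1.0, c=1.0)   # Gaussian kernel
inv_dist = embedding_kernel(sqd, a=1.0, b=0.5, c=0.0)   # inverse Euclidean distance
assert np.allclose(cauchy, 1.0 / (1.0 + sqd))
assert np.allclose(gaussian, np.exp(-sqd))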
4 Proofs

Here we provide proofs of the five theorems in the paper.

4.1 Proof of Theorem 1

Proof. Since B is upper-bounded by its Lipschitz surrogate

    B(\tilde{Y}) \le B(Y) + \langle \Psi, \tilde{Y} - Y \rangle + \frac{\rho}{2} \|\tilde{Y} - Y\|_F^2,

we have H(\tilde{Y}, Y) \le G(\tilde{Y}, Y) and H(Y, Y) = G(Y, Y). Therefore

    J(Y) = H(Y, Y) = G(Y, Y) \ge G(Y^{new}, Y) \ge J(Y^{new}),

where the first inequality comes from the minimization and the second is ensured by the backtracking.

4.2 Proof of Theorem 2

The proposed MM updates share many convergence properties with the Expectation-Maximization (EM) algorithm. Below we show that the MM updates converge to a stationary point of J. A stationary point can be a local optimum or a saddle point.

The convergence result is a special case of the Global Convergence Theorem (GCT; [19]), which is quoted below. A map A from points of X to subsets of X is called a point-to-set map on X. It is said to be closed at x if x_k \in X and x_k \to x, y_k \in A(x_k) and y_k \to y, imply y \in A(x). For a point-to-point map, continuity implies closedness.

Global Convergence Theorem (GCT; from [14]). Let the sequence \{x_k\}_{k=0}^{\infty} be generated by x_{k+1} \in M(x_k), where M is a point-to-set map on X. Let a solution set \Gamma \subset X be given, and suppose that:

i) all points x_k are contained in a compact set S \subset X;
ii) M is closed over the complement of \Gamma;
iii) there is a continuous function \alpha on X such that (a) if x \notin \Gamma, \alpha(x) > \alpha(y) for all y \in M(x), and (b) if x \in \Gamma, \alpha(x) \ge \alpha(y) for all y \in M(x).

Then all the limit points of \{x_k\} are in the solution set \Gamma and \alpha(x_k) converges monotonically to \alpha(x) for some x \in \Gamma.

The proof can be found in [19].

Before showing the convergence of MM, we need the following lemmas. For brevity, denote by M : Y \to Y^{new} the map given by the MM update Eq. 4 in the paper. Let S and F be the sets of stationary points of J(Y) and fixed points of M, respectively.

Lemma 3. S = F.

Proof. The fixed points of the MM update rule appear when Y = (2L_{W+W^T} + \rho I)^{-1}(-\Psi + \rho Y), i.e. (2L_{W+W^T} + \rho I) Y = -\Psi + \rho Y, which is recognized as

    2L_{W+W^T} Y + \Psi = 0.    (1)

Since we require that the majorization function H shares the tangent with J, i.e. \partial H / \partial \tilde{Y}|_{\tilde{Y}=Y} = \partial J / \partial \tilde{Y}|_{\tilde{Y}=Y}, Eq. 1 is equivalent to \partial J / \partial \tilde{Y}|_{\tilde{Y}=Y} = 0, the condition of stationary points of J. Therefore F \subseteq S. On the other hand, because QL requires that G and J share the same tangent at Y, \partial J / \partial \tilde{Y}|_{\tilde{Y}=Y} = 0 implies \partial G / \partial \tilde{Y}|_{\tilde{Y}=Y} = 0, i.e. Y = M(Y). Therefore S \subseteq F.

Lemma 4. J(Y^{new}) < J(Y) if Y \notin F.

Proof. Because G is convexly quadratic, it has a unique minimum Y^{new} = M(Y). If Y \notin F, i.e. Y \ne M(Y), we have Y \ne Y^{new}, which implies J(Y^{new}) \le G(Y^{new}, Y) < G(Y, Y) = J(Y).

Now we are ready to prove the convergence to stationary points (Theorem 2).

Proof. Consider S as the solution set \Gamma, and J as the continuous function \alpha in the GCT theorem. Lemma 3 shows that this is equivalent to considering F as the solution set. Next we show that the QL-majorization and its resulting map M fulfill the conditions of the GCT theorem: J(Y) \ge J(Y^{new}) and the boundedness assumption of J imply Condition i; M is a point-to-point map, and thus the continuity of G over both \tilde{Y} and Y implies the closedness Condition ii; Lemma 4 implies iii(a); Theorem 1 in the paper implies iii(b). Therefore, the proposed MM updates are a special case of GCT and thus converge to a stationary point of J.

4.3 Proof of Theorem 3

Proof. Since A_{ij} is concave with respect to \|\tilde{y}_i - \tilde{y}_j\|^2, it can be upper-bounded by its tangent:

    A_{ij}(P_{ij}, \tilde{Q}_{ij}) \le A_{ij}(P_{ij}, Q_{ij}) + \left\langle \frac{\partial A_{ij}}{\partial \|\tilde{y}_i - \tilde{y}_j\|^2}\Big|_{\tilde{Y}=Y},\ \|\tilde{y}_i - \tilde{y}_j\|^2 - \|y_i - y_j\|^2 \right\rangle
                                   = \left\langle \frac{\partial A_{ij}}{\partial \|\tilde{y}_i - \tilde{y}_j\|^2}\Big|_{\tilde{Y}=Y},\ \|\tilde{y}_i - \tilde{y}_j\|^2 \right\rangle + \mathrm{constant}.

That is, W_{ij} = \frac{\partial A_{ij}}{\partial \|\tilde{y}_i - \tilde{y}_j\|^2}\Big|_{\tilde{Y}=Y}.

For the tangent sharing, first we have H(Y, Y) = J(Y) because obviously A_{ij}(P_{ij}, \tilde{Q}_{ij}) = A_{ij}(P_{ij}, Q_{ij}) when \tilde{Y} = Y. Next, because

    \frac{\partial \sum_{ij} A_{ij}}{\partial \tilde{Y}_{st}}\Big|_{\tilde{Y}=Y} = \sum_{ij} \frac{\partial A_{ij}}{\partial \|\tilde{y}_i - \tilde{y}_j\|^2}\Big|_{\tilde{Y}=Y} \frac{\partial \|\tilde{y}_i - \tilde{y}_j\|^2}{\partial \tilde{Y}_{st}}\Big|_{\tilde{Y}=Y} = 2\left(L_{W+W^T} Y\right)_{st}

is the same as

    \frac{\partial \sum_{ij} W_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2}{\partial \tilde{Y}_{st}}\Big|_{\tilde{Y}=Y} = 2\left(L_{W+W^T} Y\right)_{st},

we have \partial H / \partial \tilde{Y}|_{\tilde{Y}=Y} = \partial J / \partial \tilde{Y}|_{\tilde{Y}=Y}.
4.4 Proof of Theorem 4

Proof. Obviously all \alpha- and \beta-divergences are additively separable. Next we show the concavity in the given range. Denote \xi_{ij} = \|\tilde{y}_i - \tilde{y}_j\|^2 for brevity.

\alpha-divergence. We decompose an \alpha-divergence into D_\alpha(P||\tilde{Q}) = A(P, \tilde{Q}) + B(P, \tilde{Q}) + \mathrm{constant}, where

    A(P, \tilde{Q}) = \sum_{ij} A_{ij}(P_{ij}, \tilde{Q}_{ij}) = \sum_{ij} \frac{1}{\alpha(\alpha-1)} P_{ij}^{\alpha} \tilde{Q}_{ij}^{1-\alpha} = \sum_{ij} \frac{1}{\alpha(\alpha-1)} P_{ij}^{\alpha} (c + a\xi_{ij})^{-b(1-\alpha)/a},

    B(P, \tilde{Q}) = \frac{1}{\alpha} \sum_{ij} \tilde{Q}_{ij} = \frac{1}{\alpha} \sum_{ij} (c + a\xi_{ij})^{-b/a}.

The second derivative of A_{ij} = \frac{1}{\alpha(\alpha-1)} P_{ij}^{\alpha} (c + a\xi_{ij})^{-b(1-\alpha)/a} with respect to \xi_{ij} is

    \frac{\partial^2 A_{ij}}{\partial \xi_{ij}^2} = \frac{b P_{ij}^{\alpha} \left((\alpha-1)b - a\right)}{\alpha} (a\xi_{ij} + c)^{-b(1-\alpha)/a - 2}.

Since a, b, c, and P_{ij} are nonnegative, we have \partial^2 A_{ij} / \partial \xi_{ij}^2 \le 0 iff \alpha((\alpha-1)b - a) \le 0 and \alpha \ne 0. That is, A_{ij} is concave in \xi_{ij} iff \alpha \in (0, 1 + a/b].

\beta-divergence. We decompose a \beta-divergence into D_\beta(P||\tilde{Q}) = A(P, \tilde{Q}) + B(P, \tilde{Q}) + \mathrm{constant}, where

    A(P, \tilde{Q}) = \sum_{ij} A_{ij}(P_{ij}, \tilde{Q}_{ij}) = \sum_{ij} \frac{1}{1-\beta} P_{ij} \tilde{Q}_{ij}^{\beta-1} = \sum_{ij} \frac{1}{1-\beta} P_{ij} (c + a\xi_{ij})^{-b(\beta-1)/a},

    B(P, \tilde{Q}) = \frac{1}{\beta} \sum_{ij} \tilde{Q}_{ij}^{\beta}.

The second derivative of A_{ij} = \frac{1}{1-\beta} P_{ij} (c + a\xi_{ij})^{-b(\beta-1)/a} with respect to \xi_{ij} is

    \frac{\partial^2 A_{ij}}{\partial \xi_{ij}^2} = -b P_{ij} \left(a + b(\beta-1)\right) (a\xi_{ij} + c)^{-b(\beta-1)/a - 2}.    (2)

Since a, b, c, and P_{ij} are nonnegative, we have \partial^2 A_{ij} / \partial \xi_{ij}^2 \le 0 iff -(a + b(\beta-1)) \le 0. That is, A_{ij} is concave in \xi_{ij} iff \beta \in [1 - a/b, \infty).

Special case 1 (\alpha \to 1 or \beta \to 1): D_I(P||\tilde{Q}) = \sum_{ij} P_{ij} \ln \frac{P_{ij}}{\tilde{Q}_{ij}} - \sum_{ij} P_{ij} + \sum_{ij} \tilde{Q}_{ij}. We write

    A(P, \tilde{Q}) = \sum_{ij} A_{ij}(P_{ij}, \tilde{Q}_{ij}) = -\sum_{ij} P_{ij} \ln \tilde{Q}_{ij} = \frac{b}{a} \sum_{ij} P_{ij} \ln(c + a\xi_{ij}),
    B(P, \tilde{Q}) = \sum_{ij} \tilde{Q}_{ij}.

The second derivative of A_{ij} = \frac{b}{a} P_{ij} \ln(c + a\xi_{ij}) with respect to \xi_{ij} is -\frac{a b P_{ij}}{(a\xi_{ij} + c)^2}, which is always non-positive.

Special case 2 (\alpha \to 0, for P_{ij} > 0): D_{dual\text{-}I}(P||\tilde{Q}) = \sum_{ij} \tilde{Q}_{ij} \ln P_{ij} + \sum_{ij} \tilde{Q}_{ij} \ln \tilde{Q}_{ij} - \sum_{ij} \tilde{Q}_{ij}. We write

    A(P, \tilde{Q}) = \sum_{ij} A_{ij}(P_{ij}, \tilde{Q}_{ij}) = \sum_{ij} \ln P_{ij} \, (c + a\xi_{ij})^{-b/a},
    B(P, \tilde{Q}) = \sum_{ij} \tilde{Q}_{ij} \ln \tilde{Q}_{ij} - \sum_{ij} \tilde{Q}_{ij}.

The second derivative of A_{ij} = \ln P_{ij} \, (c + a\xi_{ij})^{-b/a} with respect to \xi_{ij} is b(a+b) \ln P_{ij} \, (a\xi_{ij} + c)^{-b/a - 2}, which is non-positive iff P_{ij} \in (0, 1].

Special case 3 (\beta \to 0): D_{IS}(P||\tilde{Q}) = \sum_{ij} \left[ P_{ij} \tilde{Q}_{ij}^{-1} - \ln\left(P_{ij} \tilde{Q}_{ij}^{-1}\right) - 1 \right]. We write

    A(P, \tilde{Q}) = \sum_{ij} A_{ij}(P_{ij}, \tilde{Q}_{ij}) = \sum_{ij} P_{ij} \tilde{Q}_{ij}^{-1} = \sum_{ij} P_{ij} (c + a\xi_{ij})^{b/a},
    B(P, \tilde{Q}) = \sum_{ij} \ln \tilde{Q}_{ij}.

The second derivative of A_{ij} = P_{ij} (c + a\xi_{ij})^{b/a} with respect to \xi_{ij} is (b - a) b P_{ij} (a\xi_{ij} + c)^{b/a - 2}, which is non-positive iff a \ge b, or equivalently 0 \in [1 - a/b, \infty).

4.5 Proof of Theorem 5

The proofs are done by zeroing the derivative of the right-hand side with respect to \lambda (see [17]). The closed-form solutions of \lambda at the current estimate for \tau \ge 0 are

    \lambda^* = \arg\min_\lambda D_{\alpha \to \tau}(P||\lambda Q) = \left( \frac{\sum_{ij} P_{ij}^{\tau} Q_{ij}^{1-\tau}}{\sum_{ij} Q_{ij}} \right)^{1/\tau},

with the special case

    \lambda^* = \exp\left( -\frac{\sum_{ij} Q_{ij} \ln(Q_{ij}/P_{ij})}{\sum_{ij} Q_{ij}} \right)    for \alpha \to 0,

and

    \lambda^* = \arg\min_\lambda D_{\beta \to \tau}(P||\lambda Q) = \frac{\sum_{ij} P_{ij} Q_{ij}^{\tau-1}}{\sum_{ij} Q_{ij}^{\tau}}.
5 Examples of QL beyond Manifold Embedding

To avoid vagueness we defined the scope to be manifold embedding, which already is a broad field and one of the listed AISTATS research areas. Within this scope the work is general.

It is also naturally applicable to any other optimization problems amenable to QL majorization. QL is applicable to any cost function J = A + B, where A can be tightly and quadratically upper bounded by Eq. 2 in the paper, and B is smooth. Next we give an example beyond visualization and discuss its potential extensions.

Consider a semi-supervised problem: given a training set comprising a supervised subset \{(x_i, y_i)\}_{i=1}^{n} and an unsupervised subset \{x_i\}_{i=n+1}^{N}, where the x_i are vectorial primary data and y_i \in \{-1, 1\} are supervised labels, the task is to learn a linear function z = \langle w, x \rangle + b for predicting y.

In this example a composite cost function is used: J(w, b) = A(w, b) + B(w, b), where A(w, b) is a locality preserving regularizer (see e.g. [5])

    A(w, b) = \lambda \sum_{i=1}^{N} \sum_{j=1}^{N} S_{ij} (z_i - z_j)^2

and B(w, b) is an empirical loss function [9]

    B(w, b) = (1 - \lambda) \sum_{i=1}^{n} \left[ 1 - \frac{1}{1 + \exp(-y_i z_i)} \right],

with \lambda \in [0, 1] the tradeoff parameter and S_{ij} the local similarity between x_i and x_j (e.g. a Gaussian kernel).

Because B is non-convex and non-concave, conventional majorization techniques that require convexity/concavity, such as CCCP [18], are not applicable. However, we can apply QL here (Lipschitzation to B and quadratification to A), which gives the update rule for w (denote X = [x_1 \dots x_N]):

    w^{new} = \left( \lambda X L_{\frac{S+S^T}{2}} X^T + \rho I \right)^{-1} \left( -\frac{\partial B}{\partial w} + \rho w \right),

where the gradient of B is evaluated at the current estimate.

This example can be further extended in the same spirit, where B is replaced by another smooth loss function which is non-convex and non-concave, e.g. [8]

    B(w, b) = (1 - \lambda) \sum_{i=1}^{n} \left[ 1 - \tanh(y_i z_i) \right],

and X L_{\frac{S+S^T}{2}} X^T can be replaced by other positive semi-definite matrices.
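A minimal Python sketch of this update under the notation above (our illustration; the sigmoid loss, a fixed bias b, labelled samples stored in the first n columns of X, and our own derivation of the loss gradient are assumed):

import numpy as np

def ql_w_step(w, b, X, y_sup, S, lam, rho):
    """One QL update for the semi-supervised linear model:
    w_new = (lam * X L_S X^T + rho I)^{-1} (-dB/dw + rho w),
    with B(w, b) = (1-lam) * sum_i [1 - sigmoid(y_i z_i)] over labelled points."""
    d, N = X.shape
    n = y_sup.shape[0]                         # labelled samples come first
    Ssym = (S + S.T) / 2.0
    L = np.diag(Ssym.sum(axis=1)) - Ssym       # Laplacian of the symmetrized similarity
    z = X[:, :n].T @ w + b                     # predictions on labelled points
    sig = 1.0 / (1.0 + np.exp(-y_sup * z))
    # gradient of (1-lam) * sum_i [1 - sigmoid(y_i z_i)] with respect to w
    dB = (1.0 - lam) * (X[:, :n] @ (-sig * (1.0 - sig) * y_sup))
    A = lam * X @ L @ X.T + rho * np.eye(d)
    return np.linalg.solve(A, -dB + rho * w)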
6 Datasets

Twelve datasets have been used in our experiments. Their statistics and sources are given in Table 1. Below are brief descriptions of the datasets.
Table 1: Dataset statistics and sources

    Dataset      #samples   #classes   Domain      Source
    SCOTLAND          108          8   network     PAJEK
    COIL20           1440         20   image       COIL
    7SECTORS         4556          7   text        CMUTE
    RCV1             9625          4   text        RCV1
    PENDIGITS       10992         10   image       UCI
    MAGIC           19020          2   telescope   UCI
    20NEWS          19938         20   text        20NEWS
    LETTERS         20000         26   image       UCI
    SHUTTLE         58000          7   aerospace   UCI
    MNIST           70000         10   image       MNIST
    SEISMIC         98528          3   sensor      LIBSVM
    MINIBOONE      130064          2   physics     UCI

    PAJEK   http://vlado.fmf.uni-lj.si/pub/networks/data/
    UCI     http://archive.ics.uci.edu/ml/
    COIL    http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
    CMUTE   http://www.cs.cmu.edu/~TextLearning/datasets.html
    RCV1    http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/
    20NEWS  http://people.csail.mit.edu/jrennie/20Newsgroups/
    MNIST   http://yann.lecun.com/exdb/mnist/
    LIBSVM  http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

- SCOTLAND: It lists the (136) multiple directors of the 108 largest joint stock companies in Scotland in 1904-5: 64 non-financial firms, 8 banks, 14 insurance companies, and 22 investment and property companies (Scotland.net).

- COIL20: the COIL-20 dataset from the Columbia University Image Library, images of toys from different angles, each image of size 128 × 128.

- 7SECTORS: the 4 Universities dataset from the CMU Text Learning group, text documents classified into 7 sectors; 10,000 words with maximum information gain are preserved.

- RCV1: text documents from four classes, with 29992 words.

- PENDIGITS: the UCI pen-based recognition of handwritten digits dataset, originally with 16 dimensions.

- MAGIC: the UCI MAGIC Gamma Telescope Data Set, 11 numerical features.

- 20NEWS: text documents from 20 newsgroups; 10,000 words with maximum information gain are preserved.

- LETTERS: the UCI Letter Recognition Data Set, 16 numerical features.

- SHUTTLE: the UCI Statlog (Shuttle) Data Set, 9 numerical features. The classes are imbalanced: approximately 80% of the data belongs to class 1.

- MNIST: handwritten digit images, each of size 28 × 28.

- SEISMIC: the LIBSVM SensIT Vehicle (seismic) data, with 50 numerical features (the seismic signals) from the sensors on vehicles.

- MINIBOONE: the UCI MiniBooNE particle identification Data Set, 50 numerical features. This dataset is taken from the MiniBooNE experiment and is used to distinguish electron neutrinos (signal) from muon neutrinos (background).

7 Supplemental experiment results

7.1 Evolution curves: objective vs. iteration

Figure 1 shows the evolution curves of the t-SNE objective (cost function value) as a function of iteration for the compared algorithms. This supplements the results in the paper of objective vs. running time.

7.2 Average number of MM trials

The proposed backtracking algorithm for MM involves an inner loop for searching for \rho. We have recorded the numbers of trials in each outer-loop iteration. The average number of trials is then calculated. The means and standard deviations across multiple runs are reported in Table 2.

We can see that for all datasets, the average number of trials is around two. The number generally decreases slightly with larger datasets. Two trials in an iteration means that \rho remains unchanged after the inner loop. This indicates that potential speedups may be achieved in future work by keeping \rho constant in most iterations.

We have also recorded the average number of func-
References

[1] A. Buja, D. F. Swayne, M. L. Littman, N. Dean, H. Hofmann, and L. Chen. Data visualization with multidimensional scaling. Journal of Computational and Graphical Statistics, pages 444–472, 2008.

[2] M. Carreira-Perpiñán. The elastic embedding algorithm for dimensionality reduction. In ICML, pages 167–174, 2010.

[3] A. Cichocki, S. Cruces, and S.-I. Amari. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy, 13:134–170, 2011.