Majorization-Minimization for Manifold Embedding
(Supplemental Document)

Zhirong Yang^{1,2,3}, Jaakko Peltonen^{1,3,4} and Samuel Kaski^{1,2,3}

^1 Helsinki Institute for Information Technology HIIT, ^2 University of Helsinki, ^3 Aalto University, ^4 University of Tampere

Abstract

This supplemental document provides additional information that does not fit in the paper due to the space limit. Section 1 gives related mathematical basics, including majorizations of convex and concave functions, as well as the definitions of information divergences. Section 2 presents more examples of developing MM algorithms for manifold embedding, in addition to the t-SNE example included in the paper. Section 3 illustrates that many existing manifold embedding methods can be unified into the Neighbor Embedding framework given in Section 5 of the paper. Section 4 provides proofs of the theorems in the paper. Section 5 gives an example of QL beyond manifold embedding. Section 6 gives the statistics and sources of the datasets used in the experiments. Section 7 provides supplemental experiment results.

Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors.

1 Preliminaries

Here we provide two lemmas related to majorization of concave/convex functions and the definitions of information divergences.

1.1 Majorization of concave/convex functions

1) A concave function can be upper-bounded by its tangent.

Lemma 1. If $f(z)$ is concave in $z$, then
\[
f(\tilde{x}) \le f(x) + \left\langle \frac{\partial f(\tilde{x})}{\partial \tilde{x}}\Big|_{\tilde{x}=x},\ \tilde{x} - x \right\rangle
= \left\langle \frac{\partial f(\tilde{x})}{\partial \tilde{x}}\Big|_{\tilde{x}=x},\ \tilde{x} \right\rangle + \text{constant}
\stackrel{\text{def}}{=} G(\tilde{x}, x).
\]

For a scalar concave function $f(\cdot)$, this simplifies to $f(\tilde{x}) \le G(\tilde{x}, x) = \tilde{x} f'(x) + \text{constant}$. Obviously, $G(x, x) = f(x)$ and $\frac{\partial f(\tilde{x})}{\partial \tilde{x}}\big|_{\tilde{x}=x} = \frac{\partial G(\tilde{x}, x)}{\partial \tilde{x}}\big|_{\tilde{x}=x}$.

2) A convex function can be upper-bounded by using Jensen's inequality.

Lemma 2. If $f(z)$ is convex in $z$, and $\tilde{x} = [\tilde{x}_1, \dots, \tilde{x}_n]$ as well as $x = [x_1, \dots, x_n]$ are nonnegative, then
\[
f\!\left( \sum_{i=1}^n \tilde{x}_i \right) \le \sum_{i=1}^n \frac{x_i}{\sum_j x_j}\, f\!\left( \frac{\tilde{x}_i}{x_i} \sum_j x_j \right)
\stackrel{\text{def}}{=} G(\tilde{x}, x).
\]

Obviously $G(x, x) = f\!\left( \sum_{i=1}^n x_i \right)$. Their first derivatives also match:
\[
\frac{\partial G(\tilde{x}, x)}{\partial \tilde{x}_i}\Big|_{\tilde{x}=x}
= \frac{x_i}{\sum_j x_j}\, f'\!\left( \frac{x_i}{x_i} \sum_j x_j \right) \frac{\sum_j x_j}{x_i}
= f'\!\left( \sum_j x_j \right)
= \frac{\partial f\!\left( \sum_{i=1}^n \tilde{x}_i \right)}{\partial \tilde{x}_i}\Big|_{\tilde{x}=x}.
\]
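The two bounds can be sanity-checked numerically. The short script below is ours (not part of the original document); it simply evaluates both majorization surrogates on a concrete concave/convex pair chosen for illustration.

```python
import numpy as np

# Lemma 1: a concave f is bounded above by its tangent at x.
f = np.log                               # concave on positive arguments
x, x_tilde = 2.0, np.linspace(0.5, 5.0, 10)
G1 = f(x) + (1.0 / x) * (x_tilde - x)    # tangent surrogate G(x~, x)
print(np.all(f(x_tilde) <= G1 + 1e-12))  # True

# Lemma 2: a convex f of a sum is bounded above by the Jensen-type surrogate.
g = np.exp                               # convex
x_vec  = np.array([1.0, 2.0, 3.0])
xt_vec = np.array([1.5, 1.0, 4.0])
s = x_vec.sum()
G2 = np.sum((x_vec / s) * g(xt_vec / x_vec * s))
print(g(xt_vec.sum()) <= G2 + 1e-12)     # True
```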

  1.2 Information divergences

  

Information divergences, denoted by $D(p\|q)$, were originally defined for probabilities and later extended to measure the difference between two (usually nonnegative) tensors $p$ and $q$, where $D(p\|q) \ge 0$, and $D(p\|q) = 0$ iff $p = q$. To avoid notational clutter we only give vectorial definitions; it is straightforward to extend the formulae to matrices and higher-order tensors.

Let $D_\alpha$, $D_\beta$, $D_\gamma$, and $D_r$ respectively denote the $\alpha$-, $\beta$-, $\gamma$-, and Rényi-divergences. Their definitions are (see e.g. [3])
\[
D_\alpha(p\|q) = \frac{1}{\alpha(\alpha-1)} \sum_i \left( p_i^\alpha q_i^{1-\alpha} - \alpha p_i + (\alpha-1) q_i \right),
\]
\[
D_\beta(p\|q) = \frac{1}{\beta(\beta-1)} \sum_i \left( p_i^\beta + (\beta-1) q_i^\beta - \beta p_i q_i^{\beta-1} \right),
\]
\[
D_\gamma(p\|q) = \frac{\ln\left(\sum_i p_i^\gamma\right)}{\gamma(\gamma-1)} + \frac{\ln\left(\sum_i q_i^\gamma\right)}{\gamma} - \frac{\ln\left(\sum_i p_i q_i^{\gamma-1}\right)}{\gamma-1},
\]
\[
D_r(p\|q) = \frac{1}{r-1} \ln \sum_i \tilde{p}_i^{\,r} \tilde{q}_i^{\,1-r},
\]
where $p_i$ and $q_i$ are the entries in $p$ and $q$ respectively, $\tilde{p}_i = p_i / \sum_j p_j$, and $\tilde{q}_i = q_i / \sum_j q_j$. To handle $p$'s containing zero entries, we only consider nonnegative $\alpha$, $\beta$, $\gamma$ and $r$. These families are rich as they cover most commonly used divergences in machine learning, such as the normalized Kullback-Leibler divergence (obtained from $D_r$ with $r \to 1$ or $D_\gamma$ with $\gamma \to 1$), the non-normalized KL-divergence ($\alpha \to 1$ or $\beta \to 1$), the Itakura-Saito divergence ($\beta \to 0$), the squared Euclidean distance ($\beta = 2$), the Hellinger distance ($\alpha = 0.5$), and the Chi-square divergence ($\alpha = 2$).

Different divergences have become widespread in different domains. For example, $D_{\mathrm{KL}}$ is widely used for text documents (e.g. [7]) and $D_{\mathrm{IS}}$ is popular for audio signals (e.g. [4]). In general, estimation using $\alpha$-divergence is more exclusive with larger $\alpha$'s, and more inclusive with smaller $\alpha$'s (e.g. [10]). For $\beta$-divergence, the estimation becomes more robust but less efficient with larger $\beta$'s.
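The following NumPy sketch (ours, not part of the original document) spells out the four divergence families exactly as defined above; the function names and the test vectors are our own choices for illustration.

```python
import numpy as np

def alpha_div(p, q, a):
    # D_alpha as defined above; a must not be 0 or 1 (those are limit cases).
    return np.sum(p**a * q**(1 - a) - a * p + (a - 1) * q) / (a * (a - 1))

def beta_div(p, q, b):
    # D_beta as defined above; b must not be 0 or 1 (limit cases).
    return np.sum(p**b + (b - 1) * q**b - b * p * q**(b - 1)) / (b * (b - 1))

def gamma_div(p, q, g):
    # D_gamma: three log-of-sum terms with denominators g(g-1), g and g-1.
    return (np.log(np.sum(p**g)) / (g * (g - 1))
            + np.log(np.sum(q**g)) / g
            - np.log(np.sum(p * q**(g - 1))) / (g - 1))

def renyi_div(p, q, r):
    # D_r is defined on the normalized vectors p~ and q~.
    pn, qn = p / p.sum(), q / q.sum()
    return np.log(np.sum(pn**r * qn**(1 - r))) / (r - 1)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
for name, d in [("alpha(2)", alpha_div(p, q, 2.0)), ("beta(2)", beta_div(p, q, 2.0)),
                ("gamma(2)", gamma_div(p, q, 2.0)), ("renyi(2)", renyi_div(p, q, 2.0))]:
    print(name, d)
```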

2 More Development Examples

The paper provides an example of developing an MM algorithm for t-Distributed Stochastic Neighbor Embedding (t-SNE); here we provide additional examples of developing MM algorithms for other manifold embedding methods.

2.1 Elastic Embedding (EE)

Given a symmetric and nonnegative matrix $P$ and $\lambda > 0$, Elastic Embedding [2] minimizes
\[
J_{\mathrm{EE}}(\widetilde{Y}) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 + \lambda \sum_{ij} \exp\!\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right)
= -\sum_{ij} P_{ij} \ln \widetilde{Q}_{ij} + \lambda \sum_{ij} \widetilde{Q}_{ij},
\]
where $\widetilde{Q}_{ij} = \exp\!\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right)$.

The EE objective is naturally decomposed into
\[
A(P, \widetilde{Q}) = -\sum_{ij} P_{ij} \ln \widetilde{Q}_{ij}, \qquad
B(\widetilde{Q}) = \lambda \sum_{ij} \widetilde{Q}_{ij}.
\]
The quadratification of $A(P, \widetilde{Q})$ is simply identical, with $W = P$. The final majorization function is
\[
G(\widetilde{Y}, Y) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 + \left\langle \Psi, \widetilde{Y} \right\rangle + \frac{\rho}{2} \|\widetilde{Y} - Y\|^2 + \text{constant},
\]
where
\[
\Psi = \frac{\partial B}{\partial \widetilde{Y}}\Big|_{\widetilde{Y}=Y} = -4\lambda L_Q Y.
\]
Thus the MM update rule of EE is
\[
Y^{\text{new}} = \left( L_P + \frac{\rho}{4} I \right)^{-1} \left( \lambda L_Q Y + \frac{\rho}{4} Y \right).
\]

2.2 Stochastic Neighbor Embedding (SNE)

Suppose the input matrix is row-stochastic, i.e. $P_{ij} \ge 0$ and $\sum_j P_{ij} = 1$. Denote $\tilde{q}_{ij} = \exp\!\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right)$ and $\widetilde{Q}_{ij} = \tilde{q}_{ij} / \sum_b \tilde{q}_{ib}$. Stochastic Neighbor Embedding [6] minimizes the total Kullback-Leibler (KL) divergence between the rows of $P$ and $\widetilde{Q}$:
\[
J_{\mathrm{SNE}}(\widetilde{Y}) = \sum_i \sum_j P_{ij} \ln \frac{P_{ij}}{\widetilde{Q}_{ij}}
= -\sum_{ij} P_{ij} \ln \tilde{q}_{ij} + \sum_i \ln \sum_j \tilde{q}_{ij} + \text{constant}
= \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 + \sum_i \ln \sum_j \tilde{q}_{ij} + \text{constant}.
\]
Thus we can decompose the SNE objective into $A(P, \tilde{q}) + B(\tilde{q}) + \text{constant}$, where
\[
A(P, \tilde{q}) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2, \qquad
B(\tilde{q}) = \sum_i \ln \sum_j \tilde{q}_{ij}.
\]
Again the quadratification of $A(P, \tilde{q})$ is simply identical, with $W = P$. The final majorization function is
\[
G(\widetilde{Y}, Y) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 + \left\langle \Psi, \widetilde{Y} \right\rangle + \frac{\rho}{2} \|\widetilde{Y} - Y\|^2 + \text{constant},
\]
where
\[
\Psi = \frac{\partial B}{\partial \widetilde{Y}}\Big|_{\widetilde{Y}=Y} = -2 L_{Q+Q^T} Y.
\]
Thus the MM update rule of SNE is
\[
Y^{\text{new}} = \left( L_{P+P^T} + \frac{\rho}{2} I \right)^{-1} \left( L_{Q+Q^T} Y + \frac{\rho}{2} Y \right).
\]
Similarly, we can develop the rule for symmetric SNE (s-SNE) [12], which minimizes
\[
J_{\text{s-SNE}}(\widetilde{Y}) = \sum_{ij} P_{ij} \ln \frac{P_{ij}}{\widetilde{Q}_{ij}},
\]
where $P_{ij} \ge 0$, $\sum_{ij} P_{ij} = 1$, $\tilde{q}_{ij} = \exp\!\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right)$ and $\widetilde{Q}_{ij} = \tilde{q}_{ij} / \sum_{ab} \tilde{q}_{ab}$; the resulting MM update rule is
\[
Y^{\text{new}} = \left( L_P + \frac{\rho}{4} I \right)^{-1} \left( L_Q Y + \frac{\rho}{4} Y \right);
\]
and for the Neighbor Retrieval Visualizer (NeRV) [13], which minimizes
\[
J_{\mathrm{NeRV}}(\widetilde{Y}) = \lambda \sum_i \sum_j P_{ij} \ln \frac{P_{ij}}{\widetilde{Q}_{ij}} + (1-\lambda) \sum_i \sum_j \widetilde{Q}_{ij} \ln \frac{\widetilde{Q}_{ij}}{P_{ij}},
\]
where $P$ and $\widetilde{Q}$ are defined in the same manner as in SNE. For NeRV the quadratic part is $\lambda \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2$, i.e. $W = \lambda P$, and the remaining smooth terms are Lipschitzized, which gives
\[
Y^{\text{new}} = \left( 2\lambda L_{P+P^T} + \rho I \right)^{-1} \left( \rho Y - \Psi \right),
\qquad \Psi = \frac{\partial B_{\mathrm{NeRV}}}{\partial \widetilde{Y}}\Big|_{\widetilde{Y}=Y},
\]
with $B_{\mathrm{NeRV}}$ collecting the terms of $J_{\mathrm{NeRV}}$ beyond the quadratic part.
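As a concrete illustration of the rules above, here is a minimal NumPy sketch of the Elastic Embedding iteration from Section 2.1. It is not from the original document; the helper `laplacian`, the toy data, and the fixed $\rho$ are our own assumptions (the paper instead tunes $\rho$ by backtracking), using the convention $L_W = \mathrm{diag}(W\mathbf{1}) - W$.

```python
import numpy as np

def laplacian(W):
    # Graph Laplacian L_W = diag(W 1) - W (assumed convention).
    return np.diag(W.sum(axis=1)) - W

def ee_mm_step(Y, P, lam=1.0, rho=1.0):
    # One MM update for Elastic Embedding:
    # Y_new = (L_P + rho/4 I)^{-1} (lam * L_Q Y + rho/4 Y),
    # where Q_ij = exp(-||y_i - y_j||^2) at the current estimate Y.
    sqd = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    Q = np.exp(-sqd)
    np.fill_diagonal(Q, 0.0)          # ignore self-pairs
    n = Y.shape[0]
    A = laplacian(P) + 0.25 * rho * np.eye(n)
    b = lam * laplacian(Q) @ Y + 0.25 * rho * Y
    return np.linalg.solve(A, b)

def ee_objective(Y, P, lam=1.0):
    sqd = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    Q = np.exp(-sqd)
    np.fill_diagonal(Q, 0.0)
    return np.sum(P * sqd) + lam * np.sum(Q)

rng = np.random.default_rng(0)
n = 30
P = rng.random((n, n)); P = (P + P.T) / 2; np.fill_diagonal(P, 0.0)
Y = rng.standard_normal((n, 2))
for _ in range(20):
    Y = ee_mm_step(Y, P, lam=1.0, rho=1.0)
    print(ee_objective(Y, P))   # non-increasing once rho is large enough
```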

2.3 LinLog

Suppose a weighted undirected graph is encoded in a symmetric and nonnegative matrix $P$ (i.e. the weighted adjacency matrix). The node-repulsive LinLog graph layout method [11] minimizes the following energy function:
\[
J_{\mathrm{LinLog}}(\widetilde{Y}) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\| - \lambda \sum_{ij} \ln \|\tilde{y}_i - \tilde{y}_j\|.
\]
We write $\widetilde{Q}_{ij} = \|\tilde{y}_i - \tilde{y}_j\|^{-1}$ and then decompose the LinLog objective function into $A(P, \widetilde{Q}) + B(P, \widetilde{Q})$, where
\[
A(P, \widetilde{Q}) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|, \qquad
B(P, \widetilde{Q}) = -\lambda \sum_{ij} \ln \|\tilde{y}_i - \tilde{y}_j\|.
\]
Since the square root function is concave, by Lemma 1 we can upper bound $A(P, \widetilde{Q})$ by
\[
A(P, \widetilde{Q}) \le \sum_{ij} \frac{P_{ij}}{2\|y_i - y_j\|} \|\tilde{y}_i - \tilde{y}_j\|^2 + \text{constant}.
\]
That is, $W_{ij} = \frac{P_{ij}}{2\|y_i - y_j\|} = \frac{1}{2} P_{ij} Q_{ij}$ in the quadratification phase. The final majorization function of the LinLog objective is
\[
G(\widetilde{Y}, Y) = \sum_{ij} \frac{P_{ij}}{2\|y_i - y_j\|} \|\tilde{y}_i - \tilde{y}_j\|^2 + \left\langle \Psi, \widetilde{Y} \right\rangle + \frac{\rho}{2} \|\widetilde{Y} - Y\|^2 + \text{constant},
\]
where
\[
\Psi = \frac{\partial B}{\partial \widetilde{Y}}\Big|_{\widetilde{Y}=Y} = -2\lambda L_{Q \circ Q} Y,
\]
where $\circ$ is the elementwise product. Then the MM update rule of LinLog is
\[
Y^{\text{new}} = \left( L_{P \circ Q} + \frac{\rho}{2} I \right)^{-1} \left( \lambda L_{Q \circ Q} Y + \frac{\rho}{2} Y \right).
\]

2.4 Multidimensional Scaling with Kernel Strain (MDS-KS)

Strain-based multidimensional scaling maximizes the cosine between the inner products in the input and output spaces (see e.g. [1]):
\[
J_{\mathrm{MDS\text{-}S}}(\widetilde{Y}) = \frac{\sum_{ij} P_{ij} \langle \tilde{y}_i, \tilde{y}_j \rangle}{\sqrt{\sum_{ij} P_{ij}^2} \cdot \sqrt{\sum_{ij} \langle \tilde{y}_i, \tilde{y}_j \rangle^2}},
\]
where $P_{ij} = \langle x_i, x_j \rangle$. The inner products $\langle \tilde{y}_i, \tilde{y}_j \rangle$ above can be seen as a linear kernel $\widetilde{Q}_{ij} = \langle \tilde{y}_i, \tilde{y}_j \rangle$ and simply lead to kernel PCA of $P$. In this example we instead consider a nonlinear embedding kernel $\widetilde{Q}_{ij} = \exp(-\|\tilde{y}_i - \tilde{y}_j\|^2)$, where the corresponding objective function is
\[
\frac{\sum_{ij} P_{ij} \widetilde{Q}_{ij}}{\sqrt{\sum_{ij} P_{ij}^2} \cdot \sqrt{\sum_{ij} \widetilde{Q}_{ij}^2}},
\]
and maximizing it is equivalent to minimizing its negative logarithm
\[
J_{\mathrm{MDS\text{-}KS}}(\widetilde{Y}) = -\ln \sum_{ij} P_{ij} \widetilde{Q}_{ij} + \frac{1}{2} \ln \sum_{ij} \widetilde{Q}_{ij}^2 + \frac{1}{2} \ln \sum_{ij} P_{ij}^2.
\]
We decompose the MDS-KS objective function into $A(P, \widetilde{Q}) + B(P, \widetilde{Q}) + \text{constant}$, where
\[
A(P, \widetilde{Q}) = -\ln \sum_{ij} P_{ij} \widetilde{Q}_{ij}, \qquad
B(P, \widetilde{Q}) = \frac{1}{2} \ln \sum_{ij} \widetilde{Q}_{ij}^2.
\]
We can upper bound $A(P, \widetilde{Q})$ by Jensen's inequality (Lemma 2):
\[
A(P, \widetilde{Q}) \le -\sum_{ij} \frac{P_{ij} Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}} \ln\!\left( \frac{P_{ij} \widetilde{Q}_{ij}}{P_{ij} Q_{ij}} \sum_{ab} P_{ab} Q_{ab} \right)
= \sum_{ij} \frac{P_{ij} Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}} \|\tilde{y}_i - \tilde{y}_j\|^2 + \sum_{ij} \frac{P_{ij} Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}} \ln \frac{Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}}.
\]
That is, $W_{ij} = \frac{P_{ij} Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}}$ in the quadratification phase. The final majorization function of the MDS-KS objective is
\[
G(\widetilde{Y}, Y) = \sum_{ij} \frac{P_{ij} Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}} \|\tilde{y}_i - \tilde{y}_j\|^2 + \left\langle \Psi, \widetilde{Y} \right\rangle + \frac{\rho}{2} \|\widetilde{Y} - Y\|^2 + \text{constant},
\]
where
\[
\Psi = \frac{\partial B}{\partial \widetilde{Y}}\Big|_{\widetilde{Y}=Y} = -4 L_U Y, \qquad U_{ij} = \frac{Q_{ij}^2}{\sum_{ab} Q_{ab}^2}.
\]
Then the MM update rule of MDS-KS is
\[
Y^{\text{new}} = \left( L_W + \frac{\rho}{4} I \right)^{-1} \left( L_U Y + \frac{\rho}{4} Y \right).
\]
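All of the update rules in this section are instances of the generic QL step $Y^{\text{new}} = (2 L_{W+W^T} + \rho I)^{-1} (\rho Y - \Psi)$ used in the paper, combined with an inner search on $\rho$. The sketch below is ours, not from the original document: the function names, the doubling schedule for $\rho$, and the acceptance test are illustrative assumptions rather than the paper's exact backtracking procedure.

```python
import numpy as np

def laplacian(W):
    # L_W = diag(W 1) - W (assumed convention).
    return np.diag(W.sum(axis=1)) - W

def ql_mm_step(Y, W, Psi, rho):
    # Minimizer of sum_ij W_ij ||y~_i - y~_j||^2 + <Psi, Y~> + rho/2 ||Y~ - Y||^2:
    # Y_new = (2 L_{W+W^T} + rho I)^{-1} (rho Y - Psi).
    n = Y.shape[0]
    return np.linalg.solve(2.0 * laplacian(W + W.T) + rho * np.eye(n), rho * Y - Psi)

def mm_with_backtracking(Y, objective, make_W_Psi, rho=1.0, iters=50):
    # Outer MM loop; the inner loop increases rho until the new objective
    # does not exceed the old one (illustrative doubling schedule).
    for _ in range(iters):
        W, Psi = make_W_Psi(Y)
        J_old = objective(Y)
        while True:
            Y_new = ql_mm_step(Y, W, Psi, rho)
            if objective(Y_new) <= J_old:
                break
            rho *= 2.0
        Y = Y_new
        rho = max(rho / 2.0, 1e-8)   # allow rho to shrink again (our choice)
    return Y
```

For example, EE corresponds to `make_W_Psi` returning `(P, -4 * lam * laplacian(Q) @ Y)` with `Q = exp(-squared distances)`, which reproduces the closed-form EE rule above.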

3 Neighbor Embedding

In Section 5 of the paper, we review a framework for manifold embedding which is called Neighbor Embedding (NE) [16, 17]. Here we demonstrate that many manifold embedding objectives, including the above examples, can be equivalently formulated as an NE problem.

NE minimizes $D(P\|\widetilde{Q})$, where $D$ is a divergence from the $\alpha$-, $\beta$-, $\gamma$-, or Rényi-divergence families, and the embedding kernel in the paper is parameterized as
\[
\widetilde{Q}_{ij} = \left( c + a \|\tilde{y}_i - \tilde{y}_j\|^2 \right)^{-b/a}
\]
for $a \ge 0$, $b > 0$, and $c \ge 0$ (adapted from [15]).

First we show the parameterized form of $\widetilde{Q}$ includes the most popularly used embedding kernels. When $a = 1$, $b = 1$, and $c = 1$, $\widetilde{Q}_{ij} = \left(1 + \|\tilde{y}_i - \tilde{y}_j\|^2\right)^{-1}$ is the Cauchy kernel (i.e. the Student-t kernel with a single degree of freedom); when $a \to 0$ and $b = 1$, $\widetilde{Q}_{ij} = \exp\!\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right)$ is the Gaussian kernel; when $a = 1$, $b = 1/2$, $c = 0$, $\widetilde{Q}_{ij} = \|\tilde{y}_i - \tilde{y}_j\|^{-1}$ is the inverse of the Euclidean distance.

Next we show some other existing manifold embedding objectives can be equivalently expressed as NE. Obviously SNE and its variants belong to NE. Moreover, we have

• $\arg\min_{\widetilde{Y}} J_{\mathrm{EE}}(\widetilde{Y}) = \arg\min_{\widetilde{Y}} D_{\beta\to1}(P\|\lambda\widetilde{Q})$, with $\widetilde{Q}_{ij} = \exp\!\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right)$ [17].

Proof.
\[
J_{\mathrm{EE}}(\widetilde{Y}) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 + \lambda \sum_{ij} \exp\!\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right)
= -\sum_{ij} P_{ij} \ln \widetilde{Q}_{ij} + \lambda \sum_{ij} \widetilde{Q}_{ij}
\]
\[
= \sum_{ij} \left( P_{ij} \ln \frac{P_{ij}}{\lambda \widetilde{Q}_{ij}} + \lambda \widetilde{Q}_{ij} - P_{ij} \right) - \sum_{ij} P_{ij} \ln P_{ij} + (\ln\lambda + 1) \sum_{ij} P_{ij}
= D_{\beta\to1}(P\|\lambda\widetilde{Q}) + \text{constant}.
\]

• $\arg\min_{\widetilde{Y}} J_{\mathrm{LinLog}}(\widetilde{Y}) = \arg\min_{\widetilde{Y}} D_{\beta\to0}(P\|\lambda\widetilde{Q})$, with $\widetilde{Q}_{ij} = \|\tilde{y}_i - \tilde{y}_j\|^{-1}$ [17].

Proof.
\[
\frac{1}{\lambda} J_{\mathrm{LinLog}}(\widetilde{Y}) = \frac{1}{\lambda} \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\| - \sum_{ij} \ln \|\tilde{y}_i - \tilde{y}_j\|
= \sum_{ij} \left( \frac{P_{ij}}{\lambda \widetilde{Q}_{ij}} - \ln \frac{1}{\widetilde{Q}_{ij}} \right)
\]
\[
= \sum_{ij} \left( \frac{P_{ij}}{\lambda \widetilde{Q}_{ij}} - \ln \frac{P_{ij}}{\lambda \widetilde{Q}_{ij}} - 1 \right) + \sum_{ij} \left( \ln \frac{P_{ij}}{\lambda} + 1 \right)
= D_{\beta\to0}(P\|\lambda\widetilde{Q}) + \text{constant}.
\]

• $\arg\min_{\widetilde{Y}} J_{\mathrm{MDS\text{-}KS}}(\widetilde{Y}) = \arg\min_{\widetilde{Y}} D_{\gamma=2}(P\|\widetilde{Q})$ by their definitions.
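A small NumPy illustration (ours, not part of the original document) of the kernel family $\widetilde{Q}_{ij} = (c + a d^2)^{-b/a}$ and the three special cases listed above; the $a \to 0$ case is also checked numerically with a small $a$.

```python
import numpy as np

def ne_kernel(d2, a, b, c):
    # Q~ = (c + a d^2)^(-b/a); for a -> 0 this tends to exp(-b d^2).
    if a == 0.0:
        return np.exp(-b * d2)
    return (c + a * d2) ** (-b / a)

d2 = np.linspace(0.1, 4.0, 5)                        # squared distances
cauchy   = ne_kernel(d2, a=1.0, b=1.0, c=1.0)        # 1 / (1 + d^2)
gaussian = ne_kernel(d2, a=0.0, b=1.0, c=1.0)        # exp(-d^2)
inv_dist = ne_kernel(d2, a=1.0, b=0.5, c=0.0)        # 1 / d

print(np.allclose(cauchy, 1.0 / (1.0 + d2)))
print(np.allclose(gaussian, np.exp(-d2)))
print(np.allclose(inv_dist, 1.0 / np.sqrt(d2)))
# the a -> 0 limit, approached with a small positive a:
print(np.allclose(ne_kernel(d2, a=1e-8, b=1.0, c=1.0), np.exp(-d2), atol=1e-6))
```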

  4 Proofs

Here we provide proofs of the five theorems in the paper.

4.1 Proof of Theorem 1

Proof. Since $B$ is upper-bounded by its Lipschitz surrogate $B(\widetilde{Y}) \le B(Y) + \langle \Psi, \widetilde{Y} - Y \rangle + \frac{\rho}{2}\|\widetilde{Y} - Y\|_F^2$, which is tight at $\widetilde{Y} = Y$, we have $H(Y, Y) = G(Y, Y)$. Therefore
\[
J(Y) = H(Y, Y) = G(Y, Y) \ge G(Y^{\text{new}}, Y) \ge J(Y^{\text{new}}),
\]
where the first inequality comes from the minimization and the second is ensured by the backtracking.

4.2 Proof of Theorem 2

The proposed MM updates share many convergence properties with the Expectation-Maximization (EM) algorithm. Here we show that the MM updates converge to a stationary point of $J$. A stationary point can be a local optimum or a saddle point.

The convergence result is a special case of the Global Convergence Theorem (GCT; [19]), which is quoted below. A map $\mathcal{A}$ from points of $X$ to subsets of $X$ is called a point-to-set map on $X$. It is said to be closed at $x$ if $x_k \to x$, $x_k \in X$ and $y_k \to y$, $y_k \in \mathcal{A}(x_k)$, imply $y \in \mathcal{A}(x)$. For a point-to-point map, continuity implies closedness.

Global Convergence Theorem (GCT; from [14]). Let the sequence $\{x_k\}_{k=0}^{\infty}$ be generated by $x_{k+1} \in \mathcal{M}(x_k)$, where $\mathcal{M}$ is a point-to-set map on $X$. Let a solution set $\Gamma \subset X$ be given, and suppose that:
i) all points $x_k$ are contained in a compact set $S \subset X$;
ii) $\mathcal{M}$ is closed over the complement of $\Gamma$;
iii) there is a continuous function $\alpha$ on $X$ such that (a) if $x \notin \Gamma$, $\alpha(x) > \alpha(y)$ for all $y \in \mathcal{M}(x)$, and (b) if $x \in \Gamma$, $\alpha(x) \ge \alpha(y)$ for all $y \in \mathcal{M}(x)$.
Then all the limit points of $\{x_k\}$ are in the solution set $\Gamma$ and $\alpha(x_k)$ converges monotonically to $\alpha(x)$ for some $x \in \Gamma$.

The proof can be found in [19]. Before showing the convergence of MM, we need the following lemmas. For brevity, denote by $\mathcal{M} : Y \to Y^{\text{new}}$ the map given by the MM update Eq. 4 in the paper. Let $S$ and $F$ be the sets of stationary points of $J(Y)$ and fixed points of $\mathcal{M}$, respectively.

Lemma 3. $S = F$.

Proof. The fixed points of the MM update rule appear when $Y = \left(2 L_{W+W^T} + \rho I\right)^{-1}(-\Psi + \rho Y)$, i.e. $\left(2 L_{W+W^T} + \rho I\right) Y = -\Psi + \rho Y$, which is equivalent to
\[
2 L_{W+W^T} Y + \Psi = 0, \tag{1}
\]
i.e. $\frac{\partial G}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y} = 0$. Since we require that the majorization function $H$ shares the tangent with $J$, i.e. $\frac{\partial H}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y} = \frac{\partial J}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y}$, and $\frac{\partial G}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y} = \frac{\partial H}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y}$, Eq. 1 is equivalent to $\frac{\partial J}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y} = 0$, the condition of stationary points of $J$. Therefore $F \subseteq S$. On the other hand, because QL requires that $G$ and $J$ share the same tangent at $Y$, $\frac{\partial J}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y} = 0$ implies $\frac{\partial G}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y} = 0$, i.e. $Y = \mathcal{M}(Y)$. Therefore $S \subseteq F$.

Lemma 4. $J(Y^{\text{new}}) < J(Y)$ if $Y \notin F$.

Proof. Because $G$ is convexly quadratic, it has a unique minimum $Y^{\text{new}} = \mathcal{M}(Y)$. If $Y \notin F$, i.e. $Y \ne \mathcal{M}(Y)$, we have $Y \ne Y^{\text{new}}$, which implies $J(Y^{\text{new}}) \le G(Y^{\text{new}}, Y) < G(Y, Y) = J(Y)$.

Now we are ready to prove the convergence to stationary points (Theorem 2).

Proof. Consider $S$ as the solution set $\Gamma$, and $J$ as the continuous function $\alpha$ in the GCT theorem. Lemma 3 shows that this is equivalent to considering $F$ as the solution set. Next we show that the QL-majorization and its resulting map $\mathcal{M}$ fulfill the conditions of the GCT theorem: $J(Y) \ge J(Y^{\text{new}})$ and the boundedness assumption of $J$ imply Condition i; $\mathcal{M}$ is a point-to-point map and thus the continuity of $G$ over both $\widetilde{Y}$ and $Y$ implies the closedness Condition ii; Lemma 4 implies iii(a); Theorem 1 in the paper implies iii(b). Therefore, the proposed MM updates are a special case of GCT and thus converge to a stationary point of $J$.

4.3 Proof of Theorem 3

Proof. Since $A_{ij}$ is concave with respect to $\|\tilde{y}_i - \tilde{y}_j\|^2$, it can be upper-bounded by its tangent:
\[
A_{ij}(P_{ij}, \widetilde{Q}_{ij}) \le A_{ij}(P_{ij}, Q_{ij}) + \left\langle \frac{\partial A_{ij}}{\partial \|\tilde{y}_i - \tilde{y}_j\|^2}\Big|_{\widetilde{Y}=Y},\ \|\tilde{y}_i - \tilde{y}_j\|^2 - \|y_i - y_j\|^2 \right\rangle
= \left\langle \frac{\partial A_{ij}}{\partial \|\tilde{y}_i - \tilde{y}_j\|^2}\Big|_{\widetilde{Y}=Y},\ \|\tilde{y}_i - \tilde{y}_j\|^2 \right\rangle + \text{constant}.
\]
That is, $W_{ij} = \frac{\partial A_{ij}}{\partial \|\tilde{y}_i - \tilde{y}_j\|^2}\Big|_{\widetilde{Y}=Y}$.

For the tangent sharing, first we have $H(Y, Y) = J(Y)$ because obviously $A_{ij}(P_{ij}, \widetilde{Q}_{ij}) = A_{ij}(P_{ij}, Q_{ij})$ when $\widetilde{Y} = Y$. Next, because
\[
\frac{\partial A}{\partial \widetilde{Y}_{st}}\Big|_{\widetilde{Y}=Y}
= \sum_{ij} \frac{\partial A_{ij}}{\partial \|\tilde{y}_i - \tilde{y}_j\|^2}\Big|_{\widetilde{Y}=Y} \cdot \frac{\partial \|\tilde{y}_i - \tilde{y}_j\|^2}{\partial \widetilde{Y}_{st}}\Big|_{\widetilde{Y}=Y}
= 2 \left( L_{W+W^T} Y \right)_{st}
\]
is the same as
\[
\frac{\partial}{\partial \widetilde{Y}_{st}} \sum_{ij} W_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 \Big|_{\widetilde{Y}=Y} = 2 \left( L_{W+W^T} Y \right)_{st},
\]
we have $\frac{\partial H}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y} = \frac{\partial J}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y}$.

4.4 Proof of Theorem 4

Proof. Obviously all $\alpha$- and $\beta$-divergences are additively separable. Next we show the concavity in the given range. Denote $\xi_{ij} = \|\tilde{y}_i - \tilde{y}_j\|^2$ for brevity.

$\alpha$-divergence. We decompose an $\alpha$-divergence into $D_\alpha(P\|\widetilde{Q}) = A(P, \widetilde{Q}) + B(P, \widetilde{Q}) + \text{constant}$, where
\[
A(P, \widetilde{Q}) = \sum_{ij} A_{ij}(P_{ij}, \widetilde{Q}_{ij}) = \sum_{ij} \frac{1}{\alpha(\alpha-1)} P_{ij}^\alpha \widetilde{Q}_{ij}^{1-\alpha} = \sum_{ij} \frac{1}{\alpha(\alpha-1)} P_{ij}^\alpha (c + a\xi_{ij})^{-b(1-\alpha)/a},
\]
\[
B(P, \widetilde{Q}) = \frac{1}{\alpha} \sum_{ij} \widetilde{Q}_{ij} = \frac{1}{\alpha} \sum_{ij} (c + a\xi_{ij})^{-b/a}.
\]
The second derivative of $A_{ij} = \frac{1}{\alpha(\alpha-1)} P_{ij}^\alpha (c + a\xi_{ij})^{-b(1-\alpha)/a}$ with respect to $\xi_{ij}$ is
\[
\frac{\partial^2 A_{ij}}{\partial \xi_{ij}^2} = \frac{b P_{ij}^\alpha \left( (\alpha-1)b - a \right)}{\alpha} (a\xi_{ij} + c)^{-b(1-\alpha)/a - 2}.
\]
Since $a$, $b$, $c$, and $P_{ij}$ are nonnegative, we have $\frac{\partial^2 A_{ij}}{\partial \xi_{ij}^2} \le 0$ iff $\alpha\left((\alpha-1)b - a\right) \le 0$ and $\alpha \ne 0$. That is, $A_{ij}$ is concave in $\xi_{ij}$ iff $\alpha \in (0, 1 + a/b]$.

$\beta$-divergence. We decompose a $\beta$-divergence into $D_\beta(P\|\widetilde{Q}) = A(P, \widetilde{Q}) + B(P, \widetilde{Q}) + \text{constant}$, where
\[
A(P, \widetilde{Q}) = \sum_{ij} A_{ij}(P_{ij}, \widetilde{Q}_{ij}) = \sum_{ij} \frac{1}{1-\beta} P_{ij} \widetilde{Q}_{ij}^{\beta-1} = \sum_{ij} \frac{1}{1-\beta} P_{ij} (c + a\xi_{ij})^{-b(\beta-1)/a},
\]
\[
B(P, \widetilde{Q}) = \frac{1}{\beta} \sum_{ij} \widetilde{Q}_{ij}^\beta.
\]
The second derivative of $A_{ij} = \frac{1}{1-\beta} P_{ij} (c + a\xi_{ij})^{-b(\beta-1)/a}$ with respect to $\xi_{ij}$ is
\[
\frac{\partial^2 A_{ij}}{\partial \xi_{ij}^2} = -b P_{ij} \left( a + b(\beta-1) \right) (a\xi_{ij} + c)^{-b(\beta-1)/a - 2}. \tag{2}
\]
Since $a$, $b$, $c$, and $P_{ij}$ are nonnegative, we have $\frac{\partial^2 A_{ij}}{\partial \xi_{ij}^2} \le 0$ iff $-(a + b(\beta-1)) \le 0$. That is, $A_{ij}$ is concave in $\xi_{ij}$ iff $\beta \in [1 - a/b, \infty)$.

Special case 1 ($\alpha \to 1$ or $\beta \to 1$): $D_{\mathrm{I}}(P\|\widetilde{Q}) = \sum_{ij} \left( P_{ij} \ln \frac{P_{ij}}{\widetilde{Q}_{ij}} - P_{ij} + \widetilde{Q}_{ij} \right)$. We write
\[
A(P, \widetilde{Q}) = -\sum_{ij} P_{ij} \ln \widetilde{Q}_{ij} = \sum_{ij} \frac{b}{a} P_{ij} \ln(c + a\xi_{ij}), \qquad
B(P, \widetilde{Q}) = \sum_{ij} \widetilde{Q}_{ij}.
\]
The second derivative of $A_{ij} = \frac{b}{a} P_{ij} \ln(c + a\xi_{ij})$ with respect to $\xi_{ij}$ is $-\frac{a b P_{ij}}{(a\xi_{ij} + c)^2}$, which is always non-positive.

Special case 2 ($\alpha \to 0$, for $P_{ij} > 0$): $D_{\text{dual-I}}(P\|\widetilde{Q}) = \sum_{ij} \widetilde{Q}_{ij} \ln \widetilde{Q}_{ij} - \sum_{ij} \widetilde{Q}_{ij} \ln P_{ij} - \sum_{ij} \widetilde{Q}_{ij} + \text{constant}$. We write
\[
A(P, \widetilde{Q}) = \sum_{ij} A_{ij}(P_{ij}, \widetilde{Q}_{ij}), \qquad A_{ij} = \widetilde{Q}_{ij} \ln P_{ij} = (c + a\xi_{ij})^{-b/a} \ln P_{ij}.
\]
The second derivative of $A_{ij} = (c + a\xi_{ij})^{-b/a} \ln P_{ij}$ with respect to $\xi_{ij}$ is $b(a+b) \ln P_{ij}\, (a\xi_{ij} + c)^{-b/a - 2}$, which is non-positive iff $P_{ij} \in (0, 1]$.

Special case 3 ($\beta \to 0$): $D_{\mathrm{IS}}(P\|\widetilde{Q}) = \sum_{ij} \left( \frac{P_{ij}}{\widetilde{Q}_{ij}} - \ln \frac{P_{ij}}{\widetilde{Q}_{ij}} - 1 \right)$. We write
\[
A(P, \widetilde{Q}) = \sum_{ij} A_{ij}(P_{ij}, \widetilde{Q}_{ij}) = \sum_{ij} P_{ij} \widetilde{Q}_{ij}^{-1} = \sum_{ij} P_{ij} (c + a\xi_{ij})^{b/a}, \qquad
B(P, \widetilde{Q}) = \sum_{ij} \ln \widetilde{Q}_{ij}.
\]
The second derivative of $A_{ij} = P_{ij} (c + a\xi_{ij})^{b/a}$ with respect to $\xi_{ij}$ is $(b - a) b P_{ij} (a\xi_{ij} + c)^{b/a - 2}$, which is non-positive iff $a \ge b$, or equivalently $0 \in [1 - a/b, \infty)$.
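The concavity ranges derived above can be checked numerically. The following small script is ours (not part of the original document); it approximates $\partial^2 A_{ij}/\partial\xi^2$ by finite differences for the $\alpha$-divergence term and confirms the sign pattern for $\alpha$ inside and outside $(0, 1 + a/b]$ on a toy grid.

```python
import numpy as np

def A_ij(xi, P, alpha, a, b, c):
    # The alpha-divergence term quadratified in Theorem 4.
    return (P**alpha) * (c + a * xi) ** (-b * (1 - alpha) / a) / (alpha * (alpha - 1))

def second_derivative(f, xi, h=1e-4):
    return (f(xi + h) - 2.0 * f(xi) + f(xi - h)) / h**2

P, a, b, c = 0.3, 1.0, 1.0, 1.0           # concavity range is alpha in (0, 2]
xis = np.linspace(0.5, 5.0, 10)
for alpha in [0.5, 1.5, 2.0, 2.5, 3.0]:   # 2.5 and 3.0 lie outside (0, 1 + a/b]
    d2 = [second_derivative(lambda x: A_ij(x, P, alpha, a, b, c), xi) for xi in xis]
    concave = all(v <= 1e-8 for v in d2)
    print(f"alpha={alpha}: concave in xi over the grid: {concave}")
```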

  

4.5 Proof of Theorem 5

The proofs are done by zeroing the derivative of the right-hand side with respect to $\lambda$ (see [17]). The closed-form solutions of $\lambda$ at the current estimate for $\tau \ge 0$ are
\[
\lambda^* = \arg\min_\lambda D_{\alpha\to\tau}(P\|\lambda Q) = \left( \frac{\sum_{ij} P_{ij}^\tau Q_{ij}^{1-\tau}}{\sum_{ij} Q_{ij}} \right)^{1/\tau},
\]
with the special case
\[
\lambda^* = \exp\!\left( -\frac{\sum_{ij} Q_{ij} \ln (Q_{ij}/P_{ij})}{\sum_{ij} Q_{ij}} \right) \quad \text{for } \alpha \to 0,
\]
and
\[
\lambda^* = \arg\min_\lambda D_{\beta\to\tau}(P\|\lambda Q) = \frac{\sum_{ij} P_{ij} Q_{ij}^{\tau-1}}{\sum_{ij} Q_{ij}^\tau}.
\]
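The closed forms above can be transcribed directly; the snippet below is ours (not from the original document) and simply evaluates them for a given $(P, Q)$ pair, with `tau` playing the role of the divergence parameter.

```python
import numpy as np

def lambda_star_alpha(P, Q, tau):
    # argmin_lambda D_{alpha->tau}(P || lambda Q); tau = 0 uses the limit formula.
    if tau == 0.0:
        return np.exp(-np.sum(Q * np.log(Q / P)) / np.sum(Q))
    return (np.sum(P**tau * Q**(1 - tau)) / np.sum(Q)) ** (1.0 / tau)

def lambda_star_beta(P, Q, tau):
    # argmin_lambda D_{beta->tau}(P || lambda Q).
    return np.sum(P * Q**(tau - 1)) / np.sum(Q**tau)

rng = np.random.default_rng(0)
P = rng.random((5, 5)) + 0.1
Q = rng.random((5, 5)) + 0.1
print(lambda_star_alpha(P, Q, 2.0), lambda_star_beta(P, Q, 2.0))
# sanity check: for tau = 1 both reduce to sum(P) / sum(Q)
print(np.isclose(lambda_star_alpha(P, Q, 1.0), P.sum() / Q.sum()),
      np.isclose(lambda_star_beta(P, Q, 1.0), P.sum() / Q.sum()))
```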

  5 Examples of QL beyond Manifold Embedding

To avoid vagueness we defined the scope to be manifold embedding, which already is a broad field and one of the listed AISTATS research areas. Within this scope the work is general.

It is also naturally applicable to any other optimization problem amenable to QL majorization. QL is applicable to any cost function $J = A + B$, where $A$ can be tightly and quadratically upper bounded by Eq. 2 in the paper, and $B$ is smooth. Next we give an example beyond visualization and discuss its potential extensions.

Consider a semi-supervised problem: given a training set comprising a supervised subset $\{(x_i, y_i)\}_{i=1}^n$ and an unsupervised subset $\{x_i\}_{i=n+1}^N$, where the $x_i$ are vectorial primary data and $y_i \in \{-1, 1\}$ are supervised labels, the task is to learn a linear function $z = \langle w, x \rangle + b$ for predicting $y$.

In this example a composite cost function is used: $J(w, b) = A(w, b) + B(w, b)$, where $A(w, b)$ is a locality preserving regularizer (see e.g. [5])
\[
A(w, b) = \lambda \sum_{i=1}^N \sum_{j=1}^N S_{ij} (z_i - z_j)^2
\]
and $B(w, b)$ is an empirical loss function [9]
\[
B(w, b) = (1 - \lambda) \sum_{i=1}^n \left[ 1 - \tanh(y_i z_i) \right],
\]
with $\lambda \in [0, 1]$ the tradeoff parameter and $S_{ij}$ the local similarity between $x_i$ and $x_j$ (e.g. a Gaussian kernel).

Because $B$ is non-convex and non-concave, conventional majorization techniques that require convexity/concavity, such as CCCP [18], are not applicable. However, we can apply QL here (Lipschitzation to $B$ and quadratification to $A$), which gives the update rule for $w$ (denote $X = [x_1 \dots x_N]$):
\[
w^{\text{new}} = \left( \lambda X L_{\frac{S+S^T}{2}} X^T + \rho I \right)^{-1} \left( -\frac{\partial B}{\partial w} + \rho w \right).
\]
This example can be further extended in the same spirit, where $B$ is replaced by another smooth loss function which is non-convex and non-concave, e.g. [8]
\[
\sum_{i=1}^n \left( 1 - \frac{1}{1 + \exp(-y_i z_i)} \right)^2,
\]
and $X L_{\frac{S+S^T}{2}} X^T$ can be replaced by other positive semi-definite matrices.
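A minimal sketch of this semi-supervised QL update follows; it is ours, not part of the original document, and the Gaussian similarity, the fixed $\rho$, and keeping the bias $b$ fixed at zero are simplifying assumptions made only for illustration.

```python
import numpy as np

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W

def ql_step_w(w, X, S, y_sup, lam=0.5, rho=1.0):
    # X: d x N data matrix, S: N x N similarity, y_sup: labels of the first n columns.
    # Update: w_new = (lam * X L_{(S+S^T)/2} X^T + rho I)^{-1} (-dB/dw + rho w).
    n = len(y_sup)
    z = X[:, :n].T @ w                     # predictions on the supervised subset
    # B(w) = (1 - lam) * sum_i [1 - tanh(y_i z_i)]; gradient w.r.t. w:
    dB = -(1 - lam) * X[:, :n] @ (y_sup * (1.0 - np.tanh(y_sup * z) ** 2))
    L = laplacian((S + S.T) / 2.0)
    A = lam * X @ L @ X.T + rho * np.eye(X.shape[0])
    return np.linalg.solve(A, -dB + rho * w)

rng = np.random.default_rng(0)
d, N, n = 5, 40, 10
X = rng.standard_normal((d, N))
S = np.exp(-((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0))   # Gaussian similarity
y_sup = rng.choice([-1.0, 1.0], size=n)
w = np.zeros(d)
for _ in range(10):
    w = ql_step_w(w, X, S, y_sup)
print(w)
```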

6 Datasets

Twelve datasets have been used in our experiments. Their statistics and sources are given in Table 1. Below are brief descriptions of the datasets.

• SCOTLAND: lists the (136) multiple directors of the 108 largest joint stock companies in Scotland in 1904-5: 64 non-financial firms, 8 banks, 14 insurance companies, and 22 investment and property companies (Scotland.net).
• COIL20: the COIL-20 dataset from the Columbia University Image Library, images of toys from different angles, each image of size 128 × 128.
• 7SECTORS: the 4 Universities dataset from the CMU Text Learning group, text documents classified into 7 sectors; 10,000 words with maximum information gain are preserved.
• RCV1: text documents from four classes, with 29992 words.
• PENDIGITS: the UCI pen-based recognition of handwritten digits dataset, originally with 16 dimensions.
• MAGIC: the UCI MAGIC Gamma Telescope Data Set, 11 numerical features.
• 20NEWS: text documents from 20 newsgroups; 10,000 words with maximum information gain are preserved.
• LETTERS: the UCI Letter Recognition Data Set, 16 numerical features.
• SHUTTLE: the UCI Statlog (Shuttle) Data Set, 9 numerical features. The classes are imbalanced: approximately 80% of the data belongs to class 1.
• MNIST: handwritten digit images, each of size 28 × 28.
• SEISMIC: the LIBSVM SensIT Vehicle (seismic) data, with 50 numerical features (the seismic signals) from the sensors on vehicles.
• MINIBOONE: the UCI MiniBooNE particle identification Data Set, 50 numerical features. This dataset is taken from the MiniBooNE experiment and is used to distinguish electron neutrinos (signal) from muon neutrinos (background).

Table 1: Dataset statistics and sources

Dataset     #samples  #classes  Domain     Source
SCOTLAND         108         8  network    PAJEK
COIL20          1440        20  image      COIL
7SECTORS        4556         7  text       CMUTE
RCV1            9625         4  text       RCV1
PENDIGITS      10992        10  image      UCI
MAGIC          19020         2  telescope  UCI
20NEWS         19938        20  text       20NEWS
LETTERS        20000        26  image      UCI
SHUTTLE        58000         7  aerospace  UCI
MNIST          70000        10  image      MNIST
SEISMIC        98528         3  sensor     LIBSVM
MINIBOONE     130064         2  physics    UCI

PAJEK: http://vlado.fmf.uni-lj.si/pub/networks/data/
UCI: http://archive.ics.uci.edu/ml/
COIL: http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
CMUTE: http://www.cs.cmu.edu/~TextLearning/datasets.html
RCV1: http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/
20NEWS: http://people.csail.mit.edu/jrennie/20Newsgroups/
MNIST: http://yann.lecun.com/exdb/mnist/
LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

7 Supplemental experiment results

7.1 Evolution curves: objective vs. iteration

Figure 1 shows the evolution curves of the t-SNE objective (cost function value) as a function of iteration for the compared algorithms. This supplements the results in the paper of objective vs. running time.

7.2 Average number of MM trials

The proposed backtracking algorithm for MM involves an inner loop for searching for $\rho$. We have recorded the number of trials in each outer-loop iteration; the average number of trials is then calculated. The means and standard deviations across multiple runs are reported in Table 2.

We can see that for all datasets, the average number of trials is around two. The number generally decreases slightly with larger datasets. Two trials in an iteration means that $\rho$ remains unchanged after the inner loop. This indicates that potential speedups may be achieved in future work by keeping $\rho$ constant in most iterations.

We have also recorded the average number of func-
  


References

[1] A. Buja, D. F. Swayne, M. L. Littman, N. Dean, H. Hofmann, and L. Chen. Data visualization with multidimensional scaling. Journal of Computational and Graphical Statistics, pages 444–472, 2008.

[2] M. Carreira-Perpiñán. The elastic embedding algorithm for dimensionality reduction. In ICML, pages 167–174, 2010.

[3] A. Cichocki, S. Cruces, and S.-I. Amari. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy, 13:134–170, 2011.