Majorization-Minimization for Manifold Embedding
(Supplemental Document)

Zhirong Yang^{1,2,3}, Jaakko Peltonen^{1,3,4} and Samuel Kaski^{1,2,3}

^1 Helsinki Institute for Information Technology HIIT, ^2 University of Helsinki, ^3 Aalto University, ^4 University of Tampere

Abstract

This supplemental document provides additional information that does not fit in the paper due to the space limit. Section 1 gives related mathematical basics, including majorizations of convex and concave functions, as well as the definitions of information divergences. Section 2 presents more examples of developing MM algorithms for manifold embedding, in addition to the t-SNE example included in the paper. Section 3 illustrates that many existing manifold embedding methods can be unified into the Neighbor Embedding framework given in Section 5 of the paper. Section 4 provides proofs of the theorems in the paper. Section 5 gives an example of QL beyond manifold embedding. Section 6 gives the statistics and sources of the datasets used in the experiments. Section 7 provides supplemental experiment results.

Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors.

1 Preliminaries

Here we provide two lemmas related to majorization of concave/convex functions and the definitions of information divergences.

1.1 Majorization of concave/convex functions

1) A concave function can be upper-bounded by its tangent.

Lemma 1. If $f(z)$ is concave in $z$, then
\[
f(\tilde{x}) \le f(x) + \left\langle \frac{\partial f(\tilde{x})}{\partial \tilde{x}}\Big|_{\tilde{x}=x},\ \tilde{x} - x \right\rangle
= \left\langle \frac{\partial f(\tilde{x})}{\partial \tilde{x}}\Big|_{\tilde{x}=x},\ \tilde{x} \right\rangle + \text{constant}
\stackrel{\text{def}}{=} G(\tilde{x}, x).
\]

For a scalar concave function $f(\cdot)$, this simplifies to $f(\tilde{x}) \le G(\tilde{x}, x) = \tilde{x} f'(x) + \text{constant}$. Obviously, $G(x, x) = f(x)$ and $\frac{\partial f(\tilde{x})}{\partial \tilde{x}}\big|_{\tilde{x}=x} = \frac{\partial G(\tilde{x}, x)}{\partial \tilde{x}}\big|_{\tilde{x}=x}$.

2) A convex function can be upper-bounded by using Jensen's inequality.

Lemma 2. If $f(z)$ is convex in $z$, and $\tilde{x} = [\tilde{x}_1, \dots, \tilde{x}_n]$ as well as $x = [x_1, \dots, x_n]$ are nonnegative, then
\[
f\!\left( \sum_{i=1}^n \tilde{x}_i \right) \le \sum_{i=1}^n \frac{x_i}{\sum_j x_j}\, f\!\left( \frac{\tilde{x}_i}{x_i} \sum_j x_j \right)
\stackrel{\text{def}}{=} G(\tilde{x}, x).
\]

Obviously $G(x, x) = f\!\left( \sum_{i=1}^n x_i \right)$. Their first derivatives also match:
\[
\frac{\partial G(\tilde{x}, x)}{\partial \tilde{x}_i}\Big|_{\tilde{x}=x}
= \frac{x_i}{\sum_j x_j}\, f'\!\left( \frac{x_i}{x_i} \sum_j x_j \right) \frac{\sum_j x_j}{x_i}
= f'\!\left( \sum_j x_j \right)
= \frac{\partial f\!\left( \sum_{i=1}^n \tilde{x}_i \right)}{\partial \tilde{x}_i}\Big|_{\tilde{x}=x}.
\]
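The two bounds can be sanity-checked numerically. The short script below is ours (not part of the original document); it simply evaluates both majorization surrogates on a concrete concave/convex pair chosen for illustration.

```python
import numpy as np

# Lemma 1: a concave f is bounded above by its tangent at x.
f = np.log                               # concave on positive arguments
x, x_tilde = 2.0, np.linspace(0.5, 5.0, 10)
G1 = f(x) + (1.0 / x) * (x_tilde - x)    # tangent surrogate G(x~, x)
print(np.all(f(x_tilde) <= G1 + 1e-12))  # True

# Lemma 2: a convex f of a sum is bounded above by the Jensen-type surrogate.
g = np.exp                               # convex
x_vec  = np.array([1.0, 2.0, 3.0])
xt_vec = np.array([1.5, 1.0, 4.0])
s = x_vec.sum()
G2 = np.sum((x_vec / s) * g(xt_vec / x_vec * s))
print(g(xt_vec.sum()) <= G2 + 1e-12)     # True
```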

  1.2 Information divergences

  

Information divergences, denoted by $D(p\|q)$, were originally defined for probabilities and later extended to measure the difference between two (usually nonnegative) tensors $p$ and $q$, where $D(p\|q) \ge 0$, and $D(p\|q) = 0$ iff $p = q$. To avoid notational clutter we only give vectorial definitions; it is straightforward to extend the formulae to matrices and higher-order tensors.

Let $D_\alpha$, $D_\beta$, $D_\gamma$, and $D_r$ respectively denote the $\alpha$-, $\beta$-, $\gamma$-, and Rényi-divergences. Their definitions are (see e.g. [3])
\[
D_\alpha(p\|q) = \frac{1}{\alpha(\alpha-1)} \sum_i \left( p_i^\alpha q_i^{1-\alpha} - \alpha p_i + (\alpha-1) q_i \right),
\]
\[
D_\beta(p\|q) = \frac{1}{\beta(\beta-1)} \sum_i \left( p_i^\beta + (\beta-1) q_i^\beta - \beta p_i q_i^{\beta-1} \right),
\]
\[
D_\gamma(p\|q) = \frac{\ln\left(\sum_i p_i^\gamma\right)}{\gamma(\gamma-1)} + \frac{\ln\left(\sum_i q_i^\gamma\right)}{\gamma} - \frac{\ln\left(\sum_i p_i q_i^{\gamma-1}\right)}{\gamma-1},
\]
\[
D_r(p\|q) = \frac{1}{r-1} \ln \sum_i \tilde{p}_i^{\,r} \tilde{q}_i^{\,1-r},
\]
where $p_i$ and $q_i$ are the entries in $p$ and $q$ respectively, $\tilde{p}_i = p_i / \sum_j p_j$, and $\tilde{q}_i = q_i / \sum_j q_j$. To handle $p$'s containing zero entries, we only consider nonnegative $\alpha$, $\beta$, $\gamma$ and $r$. These families are rich as they cover most commonly used divergences in machine learning, such as the normalized Kullback-Leibler divergence (obtained from $D_r$ with $r \to 1$ or $D_\gamma$ with $\gamma \to 1$), the non-normalized KL-divergence ($\alpha \to 1$ or $\beta \to 1$), the Itakura-Saito divergence ($\beta \to 0$), the squared Euclidean distance ($\beta = 2$), the Hellinger distance ($\alpha = 0.5$), and the Chi-square divergence ($\alpha = 2$).

Different divergences have become widespread in different domains. For example, $D_{\mathrm{KL}}$ is widely used for text documents (e.g. [7]) and $D_{\mathrm{IS}}$ is popular for audio signals (e.g. [4]). In general, estimation using $\alpha$-divergence is more exclusive with larger $\alpha$'s, and more inclusive with smaller $\alpha$'s (e.g. [10]). For $\beta$-divergence, the estimation becomes more robust but less efficient with larger $\beta$'s.
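The following NumPy sketch (ours, not part of the original document) spells out the four divergence families exactly as defined above; the function names and the test vectors are our own choices for illustration.

```python
import numpy as np

def alpha_div(p, q, a):
    # D_alpha as defined above; a must not be 0 or 1 (those are limit cases).
    return np.sum(p**a * q**(1 - a) - a * p + (a - 1) * q) / (a * (a - 1))

def beta_div(p, q, b):
    # D_beta as defined above; b must not be 0 or 1 (limit cases).
    return np.sum(p**b + (b - 1) * q**b - b * p * q**(b - 1)) / (b * (b - 1))

def gamma_div(p, q, g):
    # D_gamma: three log-of-sum terms with denominators g(g-1), g and g-1.
    return (np.log(np.sum(p**g)) / (g * (g - 1))
            + np.log(np.sum(q**g)) / g
            - np.log(np.sum(p * q**(g - 1))) / (g - 1))

def renyi_div(p, q, r):
    # D_r is defined on the normalized vectors p~ and q~.
    pn, qn = p / p.sum(), q / q.sum()
    return np.log(np.sum(pn**r * qn**(1 - r))) / (r - 1)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
for name, d in [("alpha(2)", alpha_div(p, q, 2.0)), ("beta(2)", beta_div(p, q, 2.0)),
                ("gamma(2)", gamma_div(p, q, 2.0)), ("renyi(2)", renyi_div(p, q, 2.0))]:
    print(name, d)
```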

2 More Development Examples

The paper provides an example of developing an MM algorithm for t-Distributed Stochastic Neighbor Embedding (t-SNE); here we provide additional examples of developing MM algorithms for other manifold embedding methods.

2.1 Elastic Embedding (EE)

Given a symmetric and nonnegative matrix $P$ and $\lambda > 0$, Elastic Embedding [2] minimizes
\[
J_{\mathrm{EE}}(\widetilde{Y}) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 + \lambda \sum_{ij} \exp\!\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right)
= -\sum_{ij} P_{ij} \ln \widetilde{Q}_{ij} + \lambda \sum_{ij} \widetilde{Q}_{ij},
\]
where $\widetilde{Q}_{ij} = \exp\!\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right)$.

The EE objective is naturally decomposed into
\[
A(P, \widetilde{Q}) = -\sum_{ij} P_{ij} \ln \widetilde{Q}_{ij}, \qquad
B(\widetilde{Q}) = \lambda \sum_{ij} \widetilde{Q}_{ij}.
\]
The quadratification of $A(P, \widetilde{Q})$ is simply identical, with $W = P$. The final majorization function is
\[
G(\widetilde{Y}, Y) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 + \left\langle \Psi, \widetilde{Y} \right\rangle + \frac{\rho}{2} \|\widetilde{Y} - Y\|^2 + \text{constant},
\]
where
\[
\Psi = \frac{\partial B}{\partial \widetilde{Y}}\Big|_{\widetilde{Y}=Y} = -4\lambda L_Q Y.
\]
Thus the MM update rule of EE is
\[
Y^{\text{new}} = \left( L_P + \frac{\rho}{4} I \right)^{-1} \left( \lambda L_Q Y + \frac{\rho}{4} Y \right).
\]

2.2 Stochastic Neighbor Embedding (SNE)

Suppose the input matrix is row-stochastic, i.e. $P_{ij} \ge 0$ and $\sum_j P_{ij} = 1$. Denote $\tilde{q}_{ij} = \exp\!\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right)$ and $\widetilde{Q}_{ij} = \tilde{q}_{ij} / \sum_b \tilde{q}_{ib}$. Stochastic Neighbor Embedding [6] minimizes the total Kullback-Leibler (KL) divergence between the rows of $P$ and $\widetilde{Q}$:
\[
J_{\mathrm{SNE}}(\widetilde{Y}) = \sum_i \sum_j P_{ij} \ln \frac{P_{ij}}{\widetilde{Q}_{ij}}
= -\sum_{ij} P_{ij} \ln \tilde{q}_{ij} + \sum_i \ln \sum_j \tilde{q}_{ij} + \text{constant}
= \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 + \sum_i \ln \sum_j \tilde{q}_{ij} + \text{constant}.
\]
Thus we can decompose the SNE objective into $A(P, \tilde{q}) + B(\tilde{q}) + \text{constant}$, where
\[
A(P, \tilde{q}) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2, \qquad
B(\tilde{q}) = \sum_i \ln \sum_j \tilde{q}_{ij}.
\]
Again the quadratification of $A(P, \tilde{q})$ is simply identical, with $W = P$. The final majorization function is
\[
G(\widetilde{Y}, Y) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 + \left\langle \Psi, \widetilde{Y} \right\rangle + \frac{\rho}{2} \|\widetilde{Y} - Y\|^2 + \text{constant},
\]
where
\[
\Psi = \frac{\partial B}{\partial \widetilde{Y}}\Big|_{\widetilde{Y}=Y} = -2 L_{Q+Q^T} Y.
\]
Thus the MM update rule of SNE is
\[
Y^{\text{new}} = \left( L_{P+P^T} + \frac{\rho}{2} I \right)^{-1} \left( L_{Q+Q^T} Y + \frac{\rho}{2} Y \right).
\]
Similarly, we can develop the rule for symmetric SNE (s-SNE) [12], which minimizes
\[
J_{\text{s-SNE}}(\widetilde{Y}) = \sum_{ij} P_{ij} \ln \frac{P_{ij}}{\widetilde{Q}_{ij}},
\]
where $P_{ij} \ge 0$, $\sum_{ij} P_{ij} = 1$, $\tilde{q}_{ij} = \exp\!\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right)$ and $\widetilde{Q}_{ij} = \tilde{q}_{ij} / \sum_{ab} \tilde{q}_{ab}$; the resulting MM update rule is
\[
Y^{\text{new}} = \left( L_P + \frac{\rho}{4} I \right)^{-1} \left( L_Q Y + \frac{\rho}{4} Y \right);
\]
and for the Neighbor Retrieval Visualizer (NeRV) [13], which minimizes
\[
J_{\mathrm{NeRV}}(\widetilde{Y}) = \lambda \sum_i \sum_j P_{ij} \ln \frac{P_{ij}}{\widetilde{Q}_{ij}} + (1-\lambda) \sum_i \sum_j \widetilde{Q}_{ij} \ln \frac{\widetilde{Q}_{ij}}{P_{ij}},
\]
where $P$ and $\widetilde{Q}$ are defined in the same manner as in SNE. For NeRV the quadratic part is $\lambda \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2$, i.e. $W = \lambda P$, and the remaining smooth terms are Lipschitzized, which gives
\[
Y^{\text{new}} = \left( 2\lambda L_{P+P^T} + \rho I \right)^{-1} \left( \rho Y - \Psi \right),
\qquad \Psi = \frac{\partial B_{\mathrm{NeRV}}}{\partial \widetilde{Y}}\Big|_{\widetilde{Y}=Y},
\]
with $B_{\mathrm{NeRV}}$ collecting the terms of $J_{\mathrm{NeRV}}$ beyond the quadratic part.
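As a concrete illustration of the rules above, here is a minimal NumPy sketch of the Elastic Embedding iteration from Section 2.1. It is not from the original document; the helper `laplacian`, the toy data, and the fixed $\rho$ are our own assumptions (the paper instead tunes $\rho$ by backtracking), using the convention $L_W = \mathrm{diag}(W\mathbf{1}) - W$.

```python
import numpy as np

def laplacian(W):
    # Graph Laplacian L_W = diag(W 1) - W (assumed convention).
    return np.diag(W.sum(axis=1)) - W

def ee_mm_step(Y, P, lam=1.0, rho=1.0):
    # One MM update for Elastic Embedding:
    # Y_new = (L_P + rho/4 I)^{-1} (lam * L_Q Y + rho/4 Y),
    # where Q_ij = exp(-||y_i - y_j||^2) at the current estimate Y.
    sqd = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    Q = np.exp(-sqd)
    np.fill_diagonal(Q, 0.0)          # ignore self-pairs
    n = Y.shape[0]
    A = laplacian(P) + 0.25 * rho * np.eye(n)
    b = lam * laplacian(Q) @ Y + 0.25 * rho * Y
    return np.linalg.solve(A, b)

def ee_objective(Y, P, lam=1.0):
    sqd = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    Q = np.exp(-sqd)
    np.fill_diagonal(Q, 0.0)
    return np.sum(P * sqd) + lam * np.sum(Q)

rng = np.random.default_rng(0)
n = 30
P = rng.random((n, n)); P = (P + P.T) / 2; np.fill_diagonal(P, 0.0)
Y = rng.standard_normal((n, 2))
for _ in range(20):
    Y = ee_mm_step(Y, P, lam=1.0, rho=1.0)
    print(ee_objective(Y, P))   # non-increasing once rho is large enough
```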

2.3 LinLog

Suppose a weighted undirected graph is encoded in a symmetric and nonnegative matrix $P$ (i.e. the weighted adjacency matrix). The node-repulsive LinLog graph layout method [11] minimizes the following energy function:
\[
J_{\mathrm{LinLog}}(\widetilde{Y}) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\| - \lambda \sum_{ij} \ln \|\tilde{y}_i - \tilde{y}_j\|.
\]
We write $\widetilde{Q}_{ij} = \|\tilde{y}_i - \tilde{y}_j\|^{-1}$ and then decompose the LinLog objective function into $A(P, \widetilde{Q}) + B(P, \widetilde{Q})$, where
\[
A(P, \widetilde{Q}) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|, \qquad
B(P, \widetilde{Q}) = -\lambda \sum_{ij} \ln \|\tilde{y}_i - \tilde{y}_j\|.
\]
Since the square root function is concave, by Lemma 1 we can upper bound $A(P, \widetilde{Q})$ by
\[
A(P, \widetilde{Q}) \le \sum_{ij} \frac{P_{ij}}{2\|y_i - y_j\|} \|\tilde{y}_i - \tilde{y}_j\|^2 + \text{constant}.
\]
That is, $W_{ij} = \frac{P_{ij}}{2\|y_i - y_j\|} = \frac{1}{2} P_{ij} Q_{ij}$ in the quadratification phase. The final majorization function of the LinLog objective is
\[
G(\widetilde{Y}, Y) = \sum_{ij} \frac{P_{ij}}{2\|y_i - y_j\|} \|\tilde{y}_i - \tilde{y}_j\|^2 + \left\langle \Psi, \widetilde{Y} \right\rangle + \frac{\rho}{2} \|\widetilde{Y} - Y\|^2 + \text{constant},
\]
where
\[
\Psi = \frac{\partial B}{\partial \widetilde{Y}}\Big|_{\widetilde{Y}=Y} = -2\lambda L_{Q \circ Q} Y,
\]
where $\circ$ is the elementwise product. Then the MM update rule of LinLog is
\[
Y^{\text{new}} = \left( L_{P \circ Q} + \frac{\rho}{2} I \right)^{-1} \left( \lambda L_{Q \circ Q} Y + \frac{\rho}{2} Y \right).
\]

2.4 Multidimensional Scaling with Kernel Strain (MDS-KS)

Strain-based multidimensional scaling maximizes the cosine between the inner products in the input and output spaces (see e.g. [1]):
\[
J_{\mathrm{MDS\text{-}S}}(\widetilde{Y}) = \frac{\sum_{ij} P_{ij} \langle \tilde{y}_i, \tilde{y}_j \rangle}{\sqrt{\sum_{ij} P_{ij}^2} \cdot \sqrt{\sum_{ij} \langle \tilde{y}_i, \tilde{y}_j \rangle^2}},
\]
where $P_{ij} = \langle x_i, x_j \rangle$. The inner products $\langle \tilde{y}_i, \tilde{y}_j \rangle$ above can be seen as a linear kernel $\widetilde{Q}_{ij} = \langle \tilde{y}_i, \tilde{y}_j \rangle$ and simply lead to kernel PCA of $P$. In this example we instead consider a nonlinear embedding kernel $\widetilde{Q}_{ij} = \exp(-\|\tilde{y}_i - \tilde{y}_j\|^2)$, where the corresponding objective function is
\[
\frac{\sum_{ij} P_{ij} \widetilde{Q}_{ij}}{\sqrt{\sum_{ij} P_{ij}^2} \cdot \sqrt{\sum_{ij} \widetilde{Q}_{ij}^2}},
\]
and maximizing it is equivalent to minimizing its negative logarithm
\[
J_{\mathrm{MDS\text{-}KS}}(\widetilde{Y}) = -\ln \sum_{ij} P_{ij} \widetilde{Q}_{ij} + \frac{1}{2} \ln \sum_{ij} \widetilde{Q}_{ij}^2 + \frac{1}{2} \ln \sum_{ij} P_{ij}^2.
\]
We decompose the MDS-KS objective function into $A(P, \widetilde{Q}) + B(P, \widetilde{Q}) + \text{constant}$, where
\[
A(P, \widetilde{Q}) = -\ln \sum_{ij} P_{ij} \widetilde{Q}_{ij}, \qquad
B(P, \widetilde{Q}) = \frac{1}{2} \ln \sum_{ij} \widetilde{Q}_{ij}^2.
\]
We can upper bound $A(P, \widetilde{Q})$ by Jensen's inequality (Lemma 2):
\[
A(P, \widetilde{Q}) \le -\sum_{ij} \frac{P_{ij} Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}} \ln\!\left( \frac{P_{ij} \widetilde{Q}_{ij}}{P_{ij} Q_{ij}} \sum_{ab} P_{ab} Q_{ab} \right)
= \sum_{ij} \frac{P_{ij} Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}} \|\tilde{y}_i - \tilde{y}_j\|^2 + \sum_{ij} \frac{P_{ij} Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}} \ln \frac{Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}}.
\]
That is, $W_{ij} = \frac{P_{ij} Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}}$ in the quadratification phase. The final majorization function of the MDS-KS objective is
\[
G(\widetilde{Y}, Y) = \sum_{ij} \frac{P_{ij} Q_{ij}}{\sum_{ab} P_{ab} Q_{ab}} \|\tilde{y}_i - \tilde{y}_j\|^2 + \left\langle \Psi, \widetilde{Y} \right\rangle + \frac{\rho}{2} \|\widetilde{Y} - Y\|^2 + \text{constant},
\]
where
\[
\Psi = \frac{\partial B}{\partial \widetilde{Y}}\Big|_{\widetilde{Y}=Y} = -4 L_U Y, \qquad U_{ij} = \frac{Q_{ij}^2}{\sum_{ab} Q_{ab}^2}.
\]
Then the MM update rule of MDS-KS is
\[
Y^{\text{new}} = \left( L_W + \frac{\rho}{4} I \right)^{-1} \left( L_U Y + \frac{\rho}{4} Y \right).
\]
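All of the update rules in this section are instances of the generic QL step $Y^{\text{new}} = (2 L_{W+W^T} + \rho I)^{-1} (\rho Y - \Psi)$ used in the paper, combined with an inner search on $\rho$. The sketch below is ours, not from the original document: the function names, the doubling schedule for $\rho$, and the acceptance test are illustrative assumptions rather than the paper's exact backtracking procedure.

```python
import numpy as np

def laplacian(W):
    # L_W = diag(W 1) - W (assumed convention).
    return np.diag(W.sum(axis=1)) - W

def ql_mm_step(Y, W, Psi, rho):
    # Minimizer of sum_ij W_ij ||y~_i - y~_j||^2 + <Psi, Y~> + rho/2 ||Y~ - Y||^2:
    # Y_new = (2 L_{W+W^T} + rho I)^{-1} (rho Y - Psi).
    n = Y.shape[0]
    return np.linalg.solve(2.0 * laplacian(W + W.T) + rho * np.eye(n), rho * Y - Psi)

def mm_with_backtracking(Y, objective, make_W_Psi, rho=1.0, iters=50):
    # Outer MM loop; the inner loop increases rho until the new objective
    # does not exceed the old one (illustrative doubling schedule).
    for _ in range(iters):
        W, Psi = make_W_Psi(Y)
        J_old = objective(Y)
        while True:
            Y_new = ql_mm_step(Y, W, Psi, rho)
            if objective(Y_new) <= J_old:
                break
            rho *= 2.0
        Y = Y_new
        rho = max(rho / 2.0, 1e-8)   # allow rho to shrink again (our choice)
    return Y
```

For example, EE corresponds to `make_W_Psi` returning `(P, -4 * lam * laplacian(Q) @ Y)` with `Q = exp(-squared distances)`, which reproduces the closed-form EE rule above.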

3 Neighbor Embedding

In Section 5 of the paper, we review a framework for manifold embedding which is called Neighbor Embedding (NE) [16, 17]. Here we demonstrate that many manifold embedding objectives, including the above examples, can be equivalently formulated as an NE problem.

NE minimizes $D(P\|\widetilde{Q})$, where $D$ is a divergence from the $\alpha$-, $\beta$-, $\gamma$-, or Rényi-divergence families, and the embedding kernel in the paper is parameterized as
\[
\widetilde{Q}_{ij} = \left( c + a \|\tilde{y}_i - \tilde{y}_j\|^2 \right)^{-b/a}
\]
for $a \ge 0$, $b > 0$, and $c \ge 0$ (adapted from [15]).

First we show the parameterized form of $\widetilde{Q}$ includes the most popularly used embedding kernels. When $a = 1$, $b = 1$, and $c = 1$, $\widetilde{Q}_{ij} = \left(1 + \|\tilde{y}_i - \tilde{y}_j\|^2\right)^{-1}$ is the Cauchy kernel (i.e. the Student-t kernel with a single degree of freedom); when $a \to 0$ and $b = 1$, $\widetilde{Q}_{ij} = \exp\!\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right)$ is the Gaussian kernel; when $a = 1$, $b = 1/2$, $c = 0$, $\widetilde{Q}_{ij} = \|\tilde{y}_i - \tilde{y}_j\|^{-1}$ is the inverse of the Euclidean distance.

Next we show some other existing manifold embedding objectives can be equivalently expressed as NE. Obviously SNE and its variants belong to NE. Moreover, we have

• $\arg\min_{\widetilde{Y}} J_{\mathrm{EE}}(\widetilde{Y}) = \arg\min_{\widetilde{Y}} D_{\beta\to1}(P\|\lambda\widetilde{Q})$, with $\widetilde{Q}_{ij} = \exp\!\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right)$ [17].

Proof.
\[
J_{\mathrm{EE}}(\widetilde{Y}) = \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 + \lambda \sum_{ij} \exp\!\left(-\|\tilde{y}_i - \tilde{y}_j\|^2\right)
= -\sum_{ij} P_{ij} \ln \widetilde{Q}_{ij} + \lambda \sum_{ij} \widetilde{Q}_{ij}
\]
\[
= \sum_{ij} \left( P_{ij} \ln \frac{P_{ij}}{\lambda \widetilde{Q}_{ij}} + \lambda \widetilde{Q}_{ij} - P_{ij} \right) - \sum_{ij} P_{ij} \ln P_{ij} + (\ln\lambda + 1) \sum_{ij} P_{ij}
= D_{\beta\to1}(P\|\lambda\widetilde{Q}) + \text{constant}.
\]

• $\arg\min_{\widetilde{Y}} J_{\mathrm{LinLog}}(\widetilde{Y}) = \arg\min_{\widetilde{Y}} D_{\beta\to0}(P\|\lambda\widetilde{Q})$, with $\widetilde{Q}_{ij} = \|\tilde{y}_i - \tilde{y}_j\|^{-1}$ [17].

Proof.
\[
\frac{1}{\lambda} J_{\mathrm{LinLog}}(\widetilde{Y}) = \frac{1}{\lambda} \sum_{ij} P_{ij} \|\tilde{y}_i - \tilde{y}_j\| - \sum_{ij} \ln \|\tilde{y}_i - \tilde{y}_j\|
= \sum_{ij} \left( \frac{P_{ij}}{\lambda \widetilde{Q}_{ij}} - \ln \frac{1}{\widetilde{Q}_{ij}} \right)
\]
\[
= \sum_{ij} \left( \frac{P_{ij}}{\lambda \widetilde{Q}_{ij}} - \ln \frac{P_{ij}}{\lambda \widetilde{Q}_{ij}} - 1 \right) + \sum_{ij} \left( \ln \frac{P_{ij}}{\lambda} + 1 \right)
= D_{\beta\to0}(P\|\lambda\widetilde{Q}) + \text{constant}.
\]

• $\arg\min_{\widetilde{Y}} J_{\mathrm{MDS\text{-}KS}}(\widetilde{Y}) = \arg\min_{\widetilde{Y}} D_{\gamma=2}(P\|\widetilde{Q})$ by their definitions.
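A small NumPy illustration (ours, not part of the original document) of the kernel family $\widetilde{Q}_{ij} = (c + a d^2)^{-b/a}$ and the three special cases listed above; the $a \to 0$ case is also checked numerically with a small $a$.

```python
import numpy as np

def ne_kernel(d2, a, b, c):
    # Q~ = (c + a d^2)^(-b/a); for a -> 0 this tends to exp(-b d^2).
    if a == 0.0:
        return np.exp(-b * d2)
    return (c + a * d2) ** (-b / a)

d2 = np.linspace(0.1, 4.0, 5)                        # squared distances
cauchy   = ne_kernel(d2, a=1.0, b=1.0, c=1.0)        # 1 / (1 + d^2)
gaussian = ne_kernel(d2, a=0.0, b=1.0, c=1.0)        # exp(-d^2)
inv_dist = ne_kernel(d2, a=1.0, b=0.5, c=0.0)        # 1 / d

print(np.allclose(cauchy, 1.0 / (1.0 + d2)))
print(np.allclose(gaussian, np.exp(-d2)))
print(np.allclose(inv_dist, 1.0 / np.sqrt(d2)))
# the a -> 0 limit, approached with a small positive a:
print(np.allclose(ne_kernel(d2, a=1e-8, b=1.0, c=1.0), np.exp(-d2), atol=1e-6))
```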

  4 Proofs

Here we provide proofs of the five theorems in the paper.

4.1 Proof of Theorem 1

Proof. Since $B$ is upper-bounded by its Lipschitz surrogate $B(\widetilde{Y}) \le B(Y) + \langle \Psi, \widetilde{Y} - Y \rangle + \frac{\rho}{2}\|\widetilde{Y} - Y\|_F^2$, which is tight at $\widetilde{Y} = Y$, we have $H(Y, Y) = G(Y, Y)$. Therefore
\[
J(Y) = H(Y, Y) = G(Y, Y) \ge G(Y^{\text{new}}, Y) \ge J(Y^{\text{new}}),
\]
where the first inequality comes from the minimization and the second is ensured by the backtracking.

4.2 Proof of Theorem 2

The proposed MM updates share many convergence properties with the Expectation-Maximization (EM) algorithm. Here we show that the MM updates converge to a stationary point of $J$. A stationary point can be a local optimum or a saddle point.

The convergence result is a special case of the Global Convergence Theorem (GCT; [19]), which is quoted below. A map $\mathcal{A}$ from points of $X$ to subsets of $X$ is called a point-to-set map on $X$. It is said to be closed at $x$ if $x_k \to x$, $x_k \in X$ and $y_k \to y$, $y_k \in \mathcal{A}(x_k)$, imply $y \in \mathcal{A}(x)$. For a point-to-point map, continuity implies closedness.

Global Convergence Theorem (GCT; from [14]). Let the sequence $\{x_k\}_{k=0}^{\infty}$ be generated by $x_{k+1} \in \mathcal{M}(x_k)$, where $\mathcal{M}$ is a point-to-set map on $X$. Let a solution set $\Gamma \subset X$ be given, and suppose that:
i) all points $x_k$ are contained in a compact set $S \subset X$;
ii) $\mathcal{M}$ is closed over the complement of $\Gamma$;
iii) there is a continuous function $\alpha$ on $X$ such that (a) if $x \notin \Gamma$, $\alpha(x) > \alpha(y)$ for all $y \in \mathcal{M}(x)$, and (b) if $x \in \Gamma$, $\alpha(x) \ge \alpha(y)$ for all $y \in \mathcal{M}(x)$.
Then all the limit points of $\{x_k\}$ are in the solution set $\Gamma$ and $\alpha(x_k)$ converges monotonically to $\alpha(x)$ for some $x \in \Gamma$.

The proof can be found in [19]. Before showing the convergence of MM, we need the following lemmas. For brevity, denote by $\mathcal{M} : Y \to Y^{\text{new}}$ the map given by the MM update Eq. 4 in the paper. Let $S$ and $F$ be the sets of stationary points of $J(Y)$ and fixed points of $\mathcal{M}$, respectively.

Lemma 3. $S = F$.

Proof. The fixed points of the MM update rule appear when $Y = \left(2 L_{W+W^T} + \rho I\right)^{-1}(-\Psi + \rho Y)$, i.e. $\left(2 L_{W+W^T} + \rho I\right) Y = -\Psi + \rho Y$, which is equivalent to
\[
2 L_{W+W^T} Y + \Psi = 0, \tag{1}
\]
i.e. $\frac{\partial G}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y} = 0$. Since we require that the majorization function $H$ shares the tangent with $J$, i.e. $\frac{\partial H}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y} = \frac{\partial J}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y}$, and $\frac{\partial G}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y} = \frac{\partial H}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y}$, Eq. 1 is equivalent to $\frac{\partial J}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y} = 0$, the condition of stationary points of $J$. Therefore $F \subseteq S$. On the other hand, because QL requires that $G$ and $J$ share the same tangent at $Y$, $\frac{\partial J}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y} = 0$ implies $\frac{\partial G}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y} = 0$, i.e. $Y = \mathcal{M}(Y)$. Therefore $S \subseteq F$.

Lemma 4. $J(Y^{\text{new}}) < J(Y)$ if $Y \notin F$.

Proof. Because $G$ is convexly quadratic, it has a unique minimum $Y^{\text{new}} = \mathcal{M}(Y)$. If $Y \notin F$, i.e. $Y \ne \mathcal{M}(Y)$, we have $Y \ne Y^{\text{new}}$, which implies $J(Y^{\text{new}}) \le G(Y^{\text{new}}, Y) < G(Y, Y) = J(Y)$.

Now we are ready to prove the convergence to stationary points (Theorem 2).

Proof. Consider $S$ as the solution set $\Gamma$, and $J$ as the continuous function $\alpha$ in the GCT theorem. Lemma 3 shows that this is equivalent to considering $F$ as the solution set. Next we show that the QL-majorization and its resulting map $\mathcal{M}$ fulfill the conditions of the GCT theorem: $J(Y) \ge J(Y^{\text{new}})$ and the boundedness assumption of $J$ imply Condition i; $\mathcal{M}$ is a point-to-point map and thus the continuity of $G$ over both $\widetilde{Y}$ and $Y$ implies the closedness Condition ii; Lemma 4 implies iii(a); Theorem 1 in the paper implies iii(b). Therefore, the proposed MM updates are a special case of GCT and thus converge to a stationary point of $J$.

4.3 Proof of Theorem 3

Proof. Since $A_{ij}$ is concave with respect to $\|\tilde{y}_i - \tilde{y}_j\|^2$, it can be upper-bounded by its tangent:
\[
A_{ij}(P_{ij}, \widetilde{Q}_{ij}) \le A_{ij}(P_{ij}, Q_{ij}) + \left\langle \frac{\partial A_{ij}}{\partial \|\tilde{y}_i - \tilde{y}_j\|^2}\Big|_{\widetilde{Y}=Y},\ \|\tilde{y}_i - \tilde{y}_j\|^2 - \|y_i - y_j\|^2 \right\rangle
= \left\langle \frac{\partial A_{ij}}{\partial \|\tilde{y}_i - \tilde{y}_j\|^2}\Big|_{\widetilde{Y}=Y},\ \|\tilde{y}_i - \tilde{y}_j\|^2 \right\rangle + \text{constant}.
\]
That is, $W_{ij} = \frac{\partial A_{ij}}{\partial \|\tilde{y}_i - \tilde{y}_j\|^2}\Big|_{\widetilde{Y}=Y}$.

For the tangent sharing, first we have $H(Y, Y) = J(Y)$ because obviously $A_{ij}(P_{ij}, \widetilde{Q}_{ij}) = A_{ij}(P_{ij}, Q_{ij})$ when $\widetilde{Y} = Y$. Next, because
\[
\frac{\partial A}{\partial \widetilde{Y}_{st}}\Big|_{\widetilde{Y}=Y}
= \sum_{ij} \frac{\partial A_{ij}}{\partial \|\tilde{y}_i - \tilde{y}_j\|^2}\Big|_{\widetilde{Y}=Y} \cdot \frac{\partial \|\tilde{y}_i - \tilde{y}_j\|^2}{\partial \widetilde{Y}_{st}}\Big|_{\widetilde{Y}=Y}
= 2 \left( L_{W+W^T} Y \right)_{st}
\]
is the same as
\[
\frac{\partial}{\partial \widetilde{Y}_{st}} \sum_{ij} W_{ij} \|\tilde{y}_i - \tilde{y}_j\|^2 \Big|_{\widetilde{Y}=Y} = 2 \left( L_{W+W^T} Y \right)_{st},
\]
we have $\frac{\partial H}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y} = \frac{\partial J}{\partial \widetilde{Y}}\big|_{\widetilde{Y}=Y}$.

4.4 Proof of Theorem 4

Proof. Obviously all $\alpha$- and $\beta$-divergences are additively separable. Next we show the concavity in the given range. Denote $\xi_{ij} = \|\tilde{y}_i - \tilde{y}_j\|^2$ for brevity.

$\alpha$-divergence. We decompose an $\alpha$-divergence into $D_\alpha(P\|\widetilde{Q}) = A(P, \widetilde{Q}) + B(P, \widetilde{Q}) + \text{constant}$, where
\[
A(P, \widetilde{Q}) = \sum_{ij} A_{ij}(P_{ij}, \widetilde{Q}_{ij}) = \sum_{ij} \frac{1}{\alpha(\alpha-1)} P_{ij}^\alpha \widetilde{Q}_{ij}^{1-\alpha} = \sum_{ij} \frac{1}{\alpha(\alpha-1)} P_{ij}^\alpha (c + a\xi_{ij})^{-b(1-\alpha)/a},
\]
\[
B(P, \widetilde{Q}) = \frac{1}{\alpha} \sum_{ij} \widetilde{Q}_{ij} = \frac{1}{\alpha} \sum_{ij} (c + a\xi_{ij})^{-b/a}.
\]
The second derivative of $A_{ij} = \frac{1}{\alpha(\alpha-1)} P_{ij}^\alpha (c + a\xi_{ij})^{-b(1-\alpha)/a}$ with respect to $\xi_{ij}$ is
\[
\frac{\partial^2 A_{ij}}{\partial \xi_{ij}^2} = \frac{b P_{ij}^\alpha \left( (\alpha-1)b - a \right)}{\alpha} (a\xi_{ij} + c)^{-b(1-\alpha)/a - 2}.
\]
Since $a$, $b$, $c$, and $P_{ij}$ are nonnegative, we have $\frac{\partial^2 A_{ij}}{\partial \xi_{ij}^2} \le 0$ iff $\alpha\left((\alpha-1)b - a\right) \le 0$ and $\alpha \ne 0$. That is, $A_{ij}$ is concave in $\xi_{ij}$ iff $\alpha \in (0, 1 + a/b]$.

$\beta$-divergence. We decompose a $\beta$-divergence into $D_\beta(P\|\widetilde{Q}) = A(P, \widetilde{Q}) + B(P, \widetilde{Q}) + \text{constant}$, where
\[
A(P, \widetilde{Q}) = \sum_{ij} A_{ij}(P_{ij}, \widetilde{Q}_{ij}) = \sum_{ij} \frac{1}{1-\beta} P_{ij} \widetilde{Q}_{ij}^{\beta-1} = \sum_{ij} \frac{1}{1-\beta} P_{ij} (c + a\xi_{ij})^{-b(\beta-1)/a},
\]
\[
B(P, \widetilde{Q}) = \frac{1}{\beta} \sum_{ij} \widetilde{Q}_{ij}^\beta.
\]
The second derivative of $A_{ij} = \frac{1}{1-\beta} P_{ij} (c + a\xi_{ij})^{-b(\beta-1)/a}$ with respect to $\xi_{ij}$ is
\[
\frac{\partial^2 A_{ij}}{\partial \xi_{ij}^2} = -b P_{ij} \left( a + b(\beta-1) \right) (a\xi_{ij} + c)^{-b(\beta-1)/a - 2}. \tag{2}
\]
Since $a$, $b$, $c$, and $P_{ij}$ are nonnegative, we have $\frac{\partial^2 A_{ij}}{\partial \xi_{ij}^2} \le 0$ iff $-(a + b(\beta-1)) \le 0$. That is, $A_{ij}$ is concave in $\xi_{ij}$ iff $\beta \in [1 - a/b, \infty)$.

Special case 1 ($\alpha \to 1$ or $\beta \to 1$): $D_{\mathrm{I}}(P\|\widetilde{Q}) = \sum_{ij} \left( P_{ij} \ln \frac{P_{ij}}{\widetilde{Q}_{ij}} - P_{ij} + \widetilde{Q}_{ij} \right)$. We write
\[
A(P, \widetilde{Q}) = -\sum_{ij} P_{ij} \ln \widetilde{Q}_{ij} = \sum_{ij} \frac{b}{a} P_{ij} \ln(c + a\xi_{ij}), \qquad
B(P, \widetilde{Q}) = \sum_{ij} \widetilde{Q}_{ij}.
\]
The second derivative of $A_{ij} = \frac{b}{a} P_{ij} \ln(c + a\xi_{ij})$ with respect to $\xi_{ij}$ is $-\frac{a b P_{ij}}{(a\xi_{ij} + c)^2}$, which is always non-positive.

Special case 2 ($\alpha \to 0$, for $P_{ij} > 0$): $D_{\text{dual-I}}(P\|\widetilde{Q}) = \sum_{ij} \widetilde{Q}_{ij} \ln \widetilde{Q}_{ij} - \sum_{ij} \widetilde{Q}_{ij} \ln P_{ij} - \sum_{ij} \widetilde{Q}_{ij} + \text{constant}$. We write
\[
A(P, \widetilde{Q}) = \sum_{ij} A_{ij}(P_{ij}, \widetilde{Q}_{ij}), \qquad A_{ij} = \widetilde{Q}_{ij} \ln P_{ij} = (c + a\xi_{ij})^{-b/a} \ln P_{ij}.
\]
The second derivative of $A_{ij} = (c + a\xi_{ij})^{-b/a} \ln P_{ij}$ with respect to $\xi_{ij}$ is $b(a+b) \ln P_{ij}\, (a\xi_{ij} + c)^{-b/a - 2}$, which is non-positive iff $P_{ij} \in (0, 1]$.

Special case 3 ($\beta \to 0$): $D_{\mathrm{IS}}(P\|\widetilde{Q}) = \sum_{ij} \left( \frac{P_{ij}}{\widetilde{Q}_{ij}} - \ln \frac{P_{ij}}{\widetilde{Q}_{ij}} - 1 \right)$. We write
\[
A(P, \widetilde{Q}) = \sum_{ij} A_{ij}(P_{ij}, \widetilde{Q}_{ij}) = \sum_{ij} P_{ij} \widetilde{Q}_{ij}^{-1} = \sum_{ij} P_{ij} (c + a\xi_{ij})^{b/a}, \qquad
B(P, \widetilde{Q}) = \sum_{ij} \ln \widetilde{Q}_{ij}.
\]
The second derivative of $A_{ij} = P_{ij} (c + a\xi_{ij})^{b/a}$ with respect to $\xi_{ij}$ is $(b - a) b P_{ij} (a\xi_{ij} + c)^{b/a - 2}$, which is non-positive iff $a \ge b$, or equivalently $0 \in [1 - a/b, \infty)$.
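The concavity ranges derived above can be checked numerically. The following small script is ours (not part of the original document); it approximates $\partial^2 A_{ij}/\partial\xi^2$ by finite differences for the $\alpha$-divergence term and confirms the sign pattern for $\alpha$ inside and outside $(0, 1 + a/b]$ on a toy grid.

```python
import numpy as np

def A_ij(xi, P, alpha, a, b, c):
    # The alpha-divergence term quadratified in Theorem 4.
    return (P**alpha) * (c + a * xi) ** (-b * (1 - alpha) / a) / (alpha * (alpha - 1))

def second_derivative(f, xi, h=1e-4):
    return (f(xi + h) - 2.0 * f(xi) + f(xi - h)) / h**2

P, a, b, c = 0.3, 1.0, 1.0, 1.0           # concavity range is alpha in (0, 2]
xis = np.linspace(0.5, 5.0, 10)
for alpha in [0.5, 1.5, 2.0, 2.5, 3.0]:   # 2.5 and 3.0 lie outside (0, 1 + a/b]
    d2 = [second_derivative(lambda x: A_ij(x, P, alpha, a, b, c), xi) for xi in xis]
    concave = all(v <= 1e-8 for v in d2)
    print(f"alpha={alpha}: concave in xi over the grid: {concave}")
```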

  

4.5 Proof of Theorem 5

The proofs are done by zeroing the derivative of the right-hand side with respect to $\lambda$ (see [17]). The closed-form solutions of $\lambda$ at the current estimate for $\tau \ge 0$ are
\[
\lambda^* = \arg\min_\lambda D_{\alpha\to\tau}(P\|\lambda Q) = \left( \frac{\sum_{ij} P_{ij}^\tau Q_{ij}^{1-\tau}}{\sum_{ij} Q_{ij}} \right)^{1/\tau},
\]
with the special case
\[
\lambda^* = \exp\!\left( -\frac{\sum_{ij} Q_{ij} \ln (Q_{ij}/P_{ij})}{\sum_{ij} Q_{ij}} \right) \quad \text{for } \alpha \to 0,
\]
and
\[
\lambda^* = \arg\min_\lambda D_{\beta\to\tau}(P\|\lambda Q) = \frac{\sum_{ij} P_{ij} Q_{ij}^{\tau-1}}{\sum_{ij} Q_{ij}^\tau}.
\]
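The closed forms above can be transcribed directly; the snippet below is ours (not from the original document) and simply evaluates them for a given $(P, Q)$ pair, with `tau` playing the role of the divergence parameter.

```python
import numpy as np

def lambda_star_alpha(P, Q, tau):
    # argmin_lambda D_{alpha->tau}(P || lambda Q); tau = 0 uses the limit formula.
    if tau == 0.0:
        return np.exp(-np.sum(Q * np.log(Q / P)) / np.sum(Q))
    return (np.sum(P**tau * Q**(1 - tau)) / np.sum(Q)) ** (1.0 / tau)

def lambda_star_beta(P, Q, tau):
    # argmin_lambda D_{beta->tau}(P || lambda Q).
    return np.sum(P * Q**(tau - 1)) / np.sum(Q**tau)

rng = np.random.default_rng(0)
P = rng.random((5, 5)) + 0.1
Q = rng.random((5, 5)) + 0.1
print(lambda_star_alpha(P, Q, 2.0), lambda_star_beta(P, Q, 2.0))
# sanity check: for tau = 1 both reduce to sum(P) / sum(Q)
print(np.isclose(lambda_star_alpha(P, Q, 1.0), P.sum() / Q.sum()),
      np.isclose(lambda_star_beta(P, Q, 1.0), P.sum() / Q.sum()))
```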

  5 Examples of QL beyond Manifold Embedding

To avoid vagueness we defined the scope to be manifold embedding, which already is a broad field and one of the listed AISTATS research areas. Within this scope the work is general.

It is also naturally applicable to any other optimization problem amenable to QL majorization. QL is applicable to any cost function $J = A + B$, where $A$ can be tightly and quadratically upper bounded by Eq. 2 in the paper, and $B$ is smooth. Next we give an example beyond visualization and discuss its potential extensions.

Consider a semi-supervised problem: given a training set comprising a supervised subset $\{(x_i, y_i)\}_{i=1}^n$ and an unsupervised subset $\{x_i\}_{i=n+1}^N$, where the $x_i$ are vectorial primary data and $y_i \in \{-1, 1\}$ are supervised labels, the task is to learn a linear function $z = \langle w, x \rangle + b$ for predicting $y$.

In this example a composite cost function is used: $J(w, b) = A(w, b) + B(w, b)$, where $A(w, b)$ is a locality preserving regularizer (see e.g. [5])
\[
A(w, b) = \lambda \sum_{i=1}^N \sum_{j=1}^N S_{ij} (z_i - z_j)^2
\]
and $B(w, b)$ is an empirical loss function [9]
\[
B(w, b) = (1 - \lambda) \sum_{i=1}^n \left[ 1 - \tanh(y_i z_i) \right],
\]
with $\lambda \in [0, 1]$ the tradeoff parameter and $S_{ij}$ the local similarity between $x_i$ and $x_j$ (e.g. a Gaussian kernel).

Because $B$ is non-convex and non-concave, conventional majorization techniques that require convexity/concavity, such as CCCP [18], are not applicable. However, we can apply QL here (Lipschitzation to $B$ and quadratification to $A$), which gives the update rule for $w$ (denote $X = [x_1 \dots x_N]$):
\[
w^{\text{new}} = \left( \lambda X L_{\frac{S+S^T}{2}} X^T + \rho I \right)^{-1} \left( -\frac{\partial B}{\partial w} + \rho w \right).
\]
This example can be further extended in the same spirit, where $B$ is replaced by another smooth loss function which is non-convex and non-concave, e.g. [8]
\[
\sum_{i=1}^n \left( 1 - \frac{1}{1 + \exp(-y_i z_i)} \right)^2,
\]
and $X L_{\frac{S+S^T}{2}} X^T$ can be replaced by other positive semi-definite matrices.
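A minimal sketch of this semi-supervised QL update follows; it is ours, not part of the original document, and the Gaussian similarity, the fixed $\rho$, and keeping the bias $b$ fixed at zero are simplifying assumptions made only for illustration.

```python
import numpy as np

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W

def ql_step_w(w, X, S, y_sup, lam=0.5, rho=1.0):
    # X: d x N data matrix, S: N x N similarity, y_sup: labels of the first n columns.
    # Update: w_new = (lam * X L_{(S+S^T)/2} X^T + rho I)^{-1} (-dB/dw + rho w).
    n = len(y_sup)
    z = X[:, :n].T @ w                     # predictions on the supervised subset
    # B(w) = (1 - lam) * sum_i [1 - tanh(y_i z_i)]; gradient w.r.t. w:
    dB = -(1 - lam) * X[:, :n] @ (y_sup * (1.0 - np.tanh(y_sup * z) ** 2))
    L = laplacian((S + S.T) / 2.0)
    A = lam * X @ L @ X.T + rho * np.eye(X.shape[0])
    return np.linalg.solve(A, -dB + rho * w)

rng = np.random.default_rng(0)
d, N, n = 5, 40, 10
X = rng.standard_normal((d, N))
S = np.exp(-((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0))   # Gaussian similarity
y_sup = rng.choice([-1.0, 1.0], size=n)
w = np.zeros(d)
for _ in range(10):
    w = ql_step_w(w, X, S, y_sup)
print(w)
```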

6 Datasets

Twelve datasets have been used in our experiments. Their statistics and sources are given in Table 1. Below are brief descriptions of the datasets.

• SCOTLAND: lists the (136) multiple directors of the 108 largest joint stock companies in Scotland in 1904-5: 64 non-financial firms, 8 banks, 14 insurance companies, and 22 investment and property companies (Scotland.net).
• COIL20: the COIL-20 dataset from the Columbia University Image Library, images of toys from different angles, each image of size 128 × 128.
• 7SECTORS: the 4 Universities dataset from the CMU Text Learning group, text documents classified into 7 sectors; 10,000 words with maximum information gain are preserved.
• RCV1: text documents from four classes, with 29992 words.
• PENDIGITS: the UCI pen-based recognition of handwritten digits dataset, originally with 16 dimensions.
• MAGIC: the UCI MAGIC Gamma Telescope Data Set, 11 numerical features.
• 20NEWS: text documents from 20 newsgroups; 10,000 words with maximum information gain are preserved.
• LETTERS: the UCI Letter Recognition Data Set, 16 numerical features.
• SHUTTLE: the UCI Statlog (Shuttle) Data Set, 9 numerical features. The classes are imbalanced: approximately 80% of the data belongs to class 1.
• MNIST: handwritten digit images, each of size 28 × 28.
• SEISMIC: the LIBSVM SensIT Vehicle (seismic) data, with 50 numerical features (the seismic signals) from the sensors on vehicles.
• MINIBOONE: the UCI MiniBooNE particle identification Data Set, 50 numerical features. This dataset is taken from the MiniBooNE experiment and is used to distinguish electron neutrinos (signal) from muon neutrinos (background).

Table 1: Dataset statistics and sources

Dataset     #samples  #classes  Domain     Source
SCOTLAND         108         8  network    PAJEK
COIL20          1440        20  image      COIL
7SECTORS        4556         7  text       CMUTE
RCV1            9625         4  text       RCV1
PENDIGITS      10992        10  image      UCI
MAGIC          19020         2  telescope  UCI
20NEWS         19938        20  text       20NEWS
LETTERS        20000        26  image      UCI
SHUTTLE        58000         7  aerospace  UCI
MNIST          70000        10  image      MNIST
SEISMIC        98528         3  sensor     LIBSVM
MINIBOONE     130064         2  physics    UCI

PAJEK: http://vlado.fmf.uni-lj.si/pub/networks/data/
UCI: http://archive.ics.uci.edu/ml/
COIL: http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
CMUTE: http://www.cs.cmu.edu/~TextLearning/datasets.html
RCV1: http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/
20NEWS: http://people.csail.mit.edu/jrennie/20Newsgroups/
MNIST: http://yann.lecun.com/exdb/mnist/
LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

7 Supplemental experiment results

7.1 Evolution curves: objective vs. iteration

Figure 1 shows the evolution curves of the t-SNE objective (cost function value) as a function of iteration for the compared algorithms. This supplements the results in the paper of objective vs. running time.

7.2 Average number of MM trials

The proposed backtracking algorithm for MM involves an inner loop for searching for $\rho$. We have recorded the number of trials in each outer-loop iteration; the average number of trials is then calculated. The means and standard deviations across multiple runs are reported in Table 2.

We can see that for all datasets, the average number of trials is around two. The number generally decreases slightly with larger datasets. Two trials in an iteration means that $\rho$ remains unchanged after the inner loop. This indicates that potential speedups may be achieved in future work by keeping $\rho$ constant in most iterations.

We have also recorded the average number of func-
  


References

[1] A. Buja, D. F. Swayne, M. L. Littman, N. Dean, H. Hofmann, and L. Chen. Data visualization with multidimensional scaling. Journal of Computational and Graphical Statistics, pages 444–472, 2008.

[2] M. Carreira-Perpiñán. The elastic embedding algorithm for dimensionality reduction. In ICML, pages 167–174, 2010.

[3] A. Cichocki, S. Cruces, and S.-I. Amari. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy, 13:134–170, 2011.