Optimization Equivalence of Divergences Improves Neighbor Embedding
(Supplemental Document)

Zhirong Yang^2 (zhirong.yang@aalto.fi), Jaakko Peltonen^{1,4} (jaakko.peltonen@aalto.fi), Samuel Kaski^{1,3} (samuel.kaski@aalto.fi)
^1 Helsinki Institute for Information Technology HIIT, ^2 Department of Information and Computer Science, Aalto University, Finland, ^3 Department of Computer Science, University of Helsinki, ^4 University of Tampere

This supplemental document is organized as follows. Section 1 gives the source and brief descriptions of the datasets used in the paper. Section 2 gives a detailed information retrieval interpretation of ws-SNE. Section 3 provides additional visualizations that do not fit within the page limit of the main paper.

1. Datasets

We present results of six datasets in the paper:

  • worldtrade: the data is from Pajek. It is a weighted graph whose edges are trading amounts between 80 countries in the world.

  • usair97: the data is from the LinLog layout package. It is a binary graph where the nodes are 332 airports in the United States and the edges indicate whether there is a direct flight between the airports.

  • mirex07: the data is from the Third Music Information Retrieval Evaluation eXchange (MIREX). We used the version from the collection by Chen et al. (2009). It is a similarity graph of 3090 songs. The songs are evenly divided among 10 classes that roughly correspond to different music genres. The weighted edges are human judgments of how similar two songs are.

  • MNIST: the data is from the MNIST database. It contains 70000 handwritten digit images from 10 classes. We preprocessed the images by the scattering operator (Mallat, 2012) and PCA, which yields 256 features for each sample.

  • luxembourg: the data is from the University of Florida Sparse Matrix Collection. It contains 114599 nodes and 239332 edges, where each edge is a street in Luxembourg.

  • shuttle: the data is from the UCI Machine Learning Repository, Statlog (Shuttle) Data Set. It contains 58000 samples from 7 classes, where each sample has 9 numerical shuttle attributes.

To maintain the space limit in the paper, we present results of two other datasets in this supplemental document (see Section 3):

  • jazz: the data is from Arenas's network dataset collection. It is a social network of 198 musicians. The musicians are mainly in New York and Chicago, with a few in both places or in other places.

  • ca-GrQc: the data is from the Stanford Network Analysis Project. It is a coauthor network among 5242 researchers working on general relativity. We extracted the largest connected component with 4158 authors.

2. Retrieval interpretation of ws-SNE

Consider the objective of ws-SNE, $J_{\text{ws-SNE}}(Y) = D_{\mathrm{KL}}(p\,\|\,M\circ q)$, where $M_{ij} = d_i d_j$ and $d_i$ represents an importance of node $i$ which is known or computable from the data, such as the number of neighbors of the node (below, nodes with high importance are also referred to as 'high-degree' nodes). We will show that $J_{\text{ws-SNE}}(Y)$ has an information retrieval interpretation as maximizing recall of retrieved neighbors, weighted by the severity of the left-out misses.

Notation. As in the main paper, let
\[
\tilde p_{ij} = \frac{p_{ij}}{\sum_{kl} p_{kl}}
\]
be a symmetric input distribution jointly over nodes and their neighbors in the original space. Here the $p_{ij}$ are symmetric but not necessarily normalized, and the normalization in $\tilde p_{ij}$ ensures $\sum_{ij}\tilde p_{ij}=1$. Let
\[
\tilde q_{ij} = \frac{M_{ij}\,q_{ij}}{\sum_{kl} M_{kl}\,q_{kl}} = \frac{d_i d_j\, q_{ij}}{\sum_{kl} d_k d_l\, q_{kl}} \tag{1}
\]
be a distribution over nodes and their neighbors on the display, where $q_{ij}$ is Gaussian, $q_{ij}=\exp(-\|y_i-y_j\|^2)$, or Cauchy, $q_{ij}=(1+\|y_i-y_j\|^2)^{-1}$. Note that again the $q_{ij}$ are not normalized, and the normalization in $\tilde q_{ij}$ ensures $\sum_{ij}\tilde q_{ij}=1$.

The joint distribution $\tilde q_{ij}$ depends on the locations of nodes on the display and on their importances; intuitively, this distribution represents the retrieval behavior of an analyst who looks at the display in a quick manner, noticing high-importance nodes and close-by other nodes, and the task of the visualization is to ensure that such retrieval of nodes and neighbors corresponds to the true neighbors represented in $\tilde p_{ij}$.

Each joint distribution over nodes can be written as a product of a marginal and a conditional distribution, so that $\tilde p_{ij}=\tilde p_i\,\tilde p_{j|i}$ where $\tilde p_i=\sum_k \tilde p_{ik}=\sum_k p_{ik}/\sum_{kl}p_{kl}$ and $\tilde p_{j|i}=\tilde p_{ij}/\tilde p_i=p_{ij}/\sum_k p_{ik}$, and similarly $\tilde q_{ij}=\tilde q_i\,\tilde q_{j|i}$ where
\[
\tilde q_i=\sum_k \tilde q_{ik}=\frac{d_i\sum_k d_k q_{ik}}{\sum_{kl} d_k d_l q_{kl}}
\qquad\text{and}\qquad
\tilde q_{j|i}=\frac{\tilde q_{ij}}{\tilde q_i}=\frac{d_j q_{ij}}{\sum_k d_k q_{ik}}.
\]
Note that $\sum_i \tilde p_i=1$ and $\sum_j \tilde p_{j|i}=1$ for all $i$, and similarly $\sum_i \tilde q_i=1$ and $\sum_j \tilde q_{j|i}=1$ for all $i$.

Inserting the above into the cost function of ws-SNE, the Kullback-Leibler divergence becomes a sum of two terms, a divergence between the marginals and a weighted mean of divergences between the conditionals:
\begin{align}
J_{\text{ws-SNE}}(Y)
&= D_{\mathrm{KL}}(p\,\|\,M\circ q) \tag{2--3}\\
&= \sum_{ij}\tilde p_{ij}\log\frac{\tilde p_{ij}}{\tilde q_{ij}} \tag{4}\\
&= \sum_{ij}\tilde p_i\,\tilde p_{j|i}\log\frac{\tilde p_i\,\tilde p_{j|i}}{\tilde q_i\,\tilde q_{j|i}} \tag{5}\\
&= \sum_{ij}\tilde p_i\,\tilde p_{j|i}\log\frac{\tilde p_i}{\tilde q_i} + \sum_{ij}\tilde p_i\,\tilde p_{j|i}\log\frac{\tilde p_{j|i}}{\tilde q_{j|i}} \tag{6}\\
&= \sum_{i}\tilde p_i\log\frac{\tilde p_i}{\tilde q_i} + \sum_{i}\tilde p_i\sum_j\tilde p_{j|i}\log\frac{\tilde p_{j|i}}{\tilde q_{j|i}} \tag{7}\\
&= D_{\mathrm{KL}}(\{\tilde p_i\}\,\|\,\{\tilde q_i\}) + \sum_i \tilde p_i\, D_{\mathrm{KL}}(\{\tilde p_{j|i}\}\,\|\,\{\tilde q_{j|i}\}), \tag{8--10}
\end{align}
where $\{\tilde p_i\}$ is the distribution formed by all values $\tilde p_i$, and similarly for $\{\tilde q_i\}$; and $\{\tilde p_{j|i}\}$ is the distribution formed by all values $\tilde p_{j|i}$ for a fixed $i$, and similarly for $\{\tilde q_{j|i}\}$.

We next analyze the conditional divergences and provide an information retrieval interpretation for them, and then analyze the marginal divergence and provide an information retrieval interpretation for it.
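Before doing so, the decomposition in (10) can be checked numerically on a toy example. The following NumPy sketch is our own illustration, not code from the paper; all variable names are made up. It builds a small symmetric affinity matrix, a random two-dimensional layout with the Cauchy kernel, and degree-based importances $d_i$, and confirms that the joint divergence equals the marginal divergence plus the weighted mean of conditional divergences.

```python
# Sanity check of Eq. (10): D_KL(p~ || q~) = marginal KL + weighted mean of conditional KLs.
# Illustrative toy example only; names are ours, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Symmetric, unnormalized input affinities p_ij with zero diagonal.
p = rng.random((n, n))
p = (p + p.T) / 2
np.fill_diagonal(p, 0.0)

# Random 2-D layout Y and Cauchy output kernel q_ij = (1 + ||y_i - y_j||^2)^(-1).
Y = rng.standard_normal((n, 2))
sqd = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
q = 1.0 / (1.0 + sqd)
np.fill_diagonal(q, 0.0)

# Importances d_i (here simply weighted degrees) and the weighting matrix M_ij = d_i d_j.
d = p.sum(axis=1)
M = np.outer(d, d)

# Normalized joint distributions p~_ij and q~_ij, Eq. (1).
P = p / p.sum()
Q = (M * q) / (M * q).sum()

off = ~np.eye(n, dtype=bool)                  # exclude the zero diagonal from the KL sums

# Left-hand side: joint divergence D_KL(p~ || q~).
kl_joint = np.sum(P[off] * np.log(P[off] / Q[off]))

# Right-hand side: marginal KL plus weighted mean of conditional KLs, Eq. (10).
Pi, Qi = P.sum(axis=1), Q.sum(axis=1)         # marginals p~_i and q~_i
Pc, Qc = P / Pi[:, None], Q / Qi[:, None]     # conditionals p~_{j|i} and q~_{j|i}
kl_marg = np.sum(Pi * np.log(Pi / Qi))
kl_cond = sum(Pi[i] * np.sum(Pc[i, off[i]] * np.log(Pc[i, off[i]] / Qc[i, off[i]]))
              for i in range(n))

print(np.isclose(kl_joint, kl_marg + kl_cond))   # True
```

The check only uses the factorization of a joint distribution into its marginal and conditional parts, so it holds for any choice of affinities, layout, and importances.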

2.1. Analysis of the conditional divergences

We first analyze the second term, which is a weighted average of Kullback-Leibler divergences between conditional distributions of neighbors. Each Kullback-Leibler divergence in the second term compares the true neighborhood of node $i$ to an on-screen neighborhood where users are likely to retrieve close-by nodes, or nodes with high degree, and counts the cost of misses in such retrieval. We now show this has an information retrieval interpretation.

First, note that the weighted conditional output probabilities $\tilde q_{j|i}$ obtained from (1) can be written as
\[
\tilde q_{j|i}
= \frac{d_j\, q_{ij}}{\sum_k d_k\, q_{ik}}
= \frac{d_j\,\frac{q_{ij}}{\sum_l q_{il}}}{\sum_k d_k\,\frac{q_{ik}}{\sum_l q_{il}}}
= \frac{d_j\,\frac{q_{ij}}{\sum_l q_{il}}}{G_i}, \tag{11}
\]
where the $q_{ij}/\sum_l q_{il}$ are unweighted conditional output probabilities, which are based only on how close points are to point $i$ on the display but not on the importances of the points. Here we denoted the denominator by
\[
G_i = \sum_k d_k\,\frac{q_{ik}}{\sum_l q_{il}}, \tag{12}
\]
which is the weighted average importance of the nodes near $i$ on the display, where the importance of each node $k$ is weighted by $q_{ik}/\sum_l q_{il}$.
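As a concrete illustration of (11) and (12), the short sketch below (hypothetical variable names, not code from the paper) computes the weighted conditional probabilities for one node of a toy layout both directly from their definition and via the rescaling by $d_j$ and $G_i$, and checks that the two agree.

```python
# Illustration of Eq. (11): q~_{j|i} equals the unweighted conditional probabilities
# rescaled by the importances d_j and normalized by G_i. Toy example of our own.
import numpy as np

rng = np.random.default_rng(1)
n = 6
Y = rng.standard_normal((n, 2))
sqd = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
q = 1.0 / (1.0 + sqd)                          # Cauchy kernel
np.fill_diagonal(q, 0.0)
d = rng.integers(1, 5, size=n).astype(float)   # node importances, d_k >= 1

i = 0
cond_unweighted = q[i] / q[i].sum()            # q_ij / sum_l q_il
G_i = np.sum(d * cond_unweighted)              # Eq. (12): weighted average importance near i

lhs = d * q[i] / np.sum(d * q[i])              # q~_{j|i} = d_j q_ij / sum_k d_k q_ik
rhs = d * cond_unweighted / G_i                # right-hand side of Eq. (11)
print(np.allclose(lhs, rhs))                   # True
```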

Analysis in a simplified situation. Consider a simplified situation where the input probability $\tilde p_{j|i}$ takes a high value $A_i = \frac{1-\delta}{R_i}$ for $R_i$ points and a low value $B_i = \frac{\delta}{N-R_i-1}$ for the other points, where $\delta$ is a very small positive number, and the unweighted output probability $q_{ij}/\sum_k q_{ik}$ similarly takes a high value $C_i = \frac{1-\delta}{K_i}$ for $K_i$ points and a low value $D_i = \frac{\delta}{N-K_i-1}$ for the other points. Denote the set where $\tilde p_{j|i}=A_i$ and $q_{ij}/\sum_k q_{ik}=C_i$ by $S_{TP,i}$, the set where $\tilde p_{j|i}=A_i$ and $q_{ij}/\sum_k q_{ik}=D_i$ by $S_{MISS,i}$, the set where $\tilde p_{j|i}=B_i$ and $q_{ij}/\sum_k q_{ik}=C_i$ by $S_{FP,i}$, and the set where $\tilde p_{j|i}=B_i$ and $q_{ij}/\sum_k q_{ik}=D_i$ by $S_{TN,i}$.

The Kullback-Leibler divergence between the conditional input distribution $\{\tilde p_{j|i}\}$ and the weighted conditional output distribution $\{\tilde q_{j|i}\}$ can then be written as

\begin{align}
D_{\mathrm{KL}}(\{\tilde p_{j|i}\}\,\|\,\{\tilde q_{j|i}\})
&= -\sum_{j\in S_{TP,i}} A_i \log\frac{d_j C_i}{G_i}
   -\sum_{j\in S_{MISS,i}} A_i \log\frac{d_j D_i}{G_i} \notag\\
&\quad -\sum_{j\in S_{FP,i}} B_i \log\frac{d_j C_i}{G_i}
   -\sum_{j\in S_{TN,i}} B_i \log\frac{d_j D_i}{G_i} + \text{constant} \tag{13--14}\\
&= -\sum_{j\in S_{TP,i}} \frac{1-\delta}{R_i}\log\frac{d_j(1-\delta)}{K_i G_i} \tag{15}\\
&\quad -\sum_{j\in S_{MISS,i}} \frac{1-\delta}{R_i}\log\frac{d_j\,\delta}{(N-K_i-1)\,G_i} \tag{16}\\
&\quad -\sum_{j\in S_{FP,i}} \frac{\delta}{N-R_i-1}\log\frac{d_j(1-\delta)}{K_i G_i} \tag{17}\\
&\quad -\sum_{j\in S_{TN,i}} \frac{\delta}{N-R_i-1}\log\frac{d_j\,\delta}{(N-K_i-1)\,G_i} + \text{constant}, \tag{18}
\end{align}
where in this simplified situation
\[
G_i = \sum_k d_k\,\frac{q_{ik}}{\sum_l q_{il}}
= \sum_{k\in S_{TP,i}} d_k C_i + \sum_{k\in S_{MISS,i}} d_k D_i + \sum_{k\in S_{FP,i}} d_k C_i + \sum_{k\in S_{TN,i}} d_k D_i. \tag{19--23}
\]

Analysis of dominating terms, and information retrieval interpretation. Assuming that each $d_j \ge 1$ (and thus also that $G_i \ge 1$), the dominating terms are the terms $\frac{1-\delta}{R_i}\log(d_j\delta) \approx \frac{1}{R_i}\log(d_j\delta)$, and the divergence simplifies to
\begin{align}
D_{\mathrm{KL}}(\{\tilde p_{j|i}\}\,\|\,\{\tilde q_{j|i}\})
&\approx -\sum_{j\in S_{MISS,i}} \frac{1}{R_i}\log(d_j\,\delta) + \text{constant} \tag{24--25}\\
&= \frac{N_{MISS,i}}{R_i}\left[\log\frac{1}{\delta} - \frac{\sum_{j\in S_{MISS,i}}\log d_j}{N_{MISS,i}}\right] + \text{constant} \tag{26}\\
&= (1-\mathrm{recall}(i))\cdot\left[\log\frac{1}{\delta} - \frac{\sum_{j\in S_{MISS,i}}\log d_j}{N_{MISS,i}}\right] + \text{constant}, \tag{27}
\end{align}
where $N_{MISS,i} = |S_{MISS,i}|$ is the number of missed true neighbors which were not close to $i$ on the display, and $\mathrm{recall}(i) = 1 - \frac{N_{MISS,i}}{R_i}$ is the corresponding standard definition of recall.

The last line, (27), provides an information retrieval interpretation of the divergence between the conditional distributions. The term at left measures recall of neighbors. The term at right is a weighting term interpreted as the average cost of the misses. In the weighting term, the cost of missing high-degree neighbors is discounted, in order to allow high-degree nodes to be kept farther from each other than in the unweighted setting; this prevents high-degree nodes from crowding and allows the method to distribute the high- and low-degree nodes more evenly.

Thus, when ws-SNE minimizes the weighted average of the Kullback-Leibler divergences in (10), it optimizes recall of true neighbors from the display, weighted by the severity of the missed neighbors.
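The dominating-term analysis can be illustrated numerically. The sketch below is a toy construction of our own (the values of $N$, $R_i$, $K_i$, the importances, and the sets are chosen arbitrarily); it builds the simplified conditional distributions and shows that, as $\delta$ shrinks, the conditional divergence grows essentially like $(1-\mathrm{recall}(i))\log(1/\delta)$, in line with (27).

```python
# Toy illustration of Eqs. (13)-(27): for small delta, the conditional KL divergence
# is dominated by the misses and scales like (1 - recall(i)) * log(1/delta).
# Here N counts the candidate neighbors of node i (node i itself is excluded).
import numpy as np

rng = np.random.default_rng(2)
N = 200                                     # candidate neighbors of node i
R, K = 20, 25                               # true neighbors / retrieved on-screen neighbors
d = rng.integers(1, 6, size=N).astype(float)    # importances d_j >= 1

true_nb = np.zeros(N, dtype=bool); true_nb[:R] = True            # S_TP,i together with S_MISS,i
shown_nb = np.zeros(N, dtype=bool); shown_nb[10:10 + K] = True   # S_TP,i together with S_FP,i
recall_i = np.mean(shown_nb[true_nb])       # |S_TP,i| / R_i

for delta in [1e-4, 1e-8, 1e-16]:
    # Input conditional p~_{j|i}: high value on true neighbors, low value elsewhere.
    p = np.where(true_nb, (1 - delta) / R, delta / (N - R))
    # Unweighted output conditional: high value on shown neighbors, low value elsewhere.
    q_unw = np.where(shown_nb, (1 - delta) / K, delta / (N - K))
    # Weighted output conditional q~_{j|i} = d_j q_unw_j / G_i, Eq. (11).
    q_w = d * q_unw / np.sum(d * q_unw)
    kl = np.sum(p * np.log(p / q_w))
    print(delta, kl / np.log(1 / delta), 1 - recall_i)
# The printed ratio approaches 1 - recall(i) as delta shrinks, matching Eq. (27).
```

The convergence is slow (the neglected terms are of constant order while the dominating term grows only like $\log(1/\delta)$), but the trend toward $1-\mathrm{recall}(i)$ is already visible for moderate $\delta$.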

2.2. Analysis of the marginal divergence

First, denote the unweighted marginal probability by $q_i = \sum_j q_{ij} / \sum_{kl} q_{kl}$, and note that the weighted marginal output distribution can then be written as
\begin{align}
\tilde q_i
&= \frac{d_i \sum_j d_j q_{ij}}{\sum_{kl} d_k d_l q_{kl}} \tag{28--29}\\
&= \frac{d_i\left(\sum_j d_j q_{ij}\right)\left(\sum_j q_{ij}\right)^{-1}\left(\sum_j q_{ij}\right)\left(\sum_{mn} q_{mn}\right)^{-1}}
        {\sum_k d_k\left(\sum_l d_l q_{kl}\right)\left(\sum_l q_{kl}\right)^{-1}\left(\sum_l q_{kl}\right)\left(\sum_{mn} q_{mn}\right)^{-1}} \tag{30}\\
&= \frac{d_i\, q_i\, G_i}{\sum_k d_k\, q_k\, G_k}. \tag{31}
\end{align}
The weighted marginal probability attains high values for nodes $i$ that have high degree and that are near many other nodes of high degree. Nodes that are far from the other nodes (low $q_i$) have a low weighted marginal probability.
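Identity (31) is easy to verify numerically. The following sketch (illustrative names, not code from the paper) computes the weighted marginal both from its definition and from $d_i q_i G_i$, and checks that the two agree.

```python
# Check of Eq. (31): q~_i equals d_i q_i G_i normalized over nodes, where q_i is the
# unweighted marginal and G_i the weighted average importance near i. Toy example.
import numpy as np

rng = np.random.default_rng(3)
n = 7
Y = rng.standard_normal((n, 2))
sqd = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
q = 1.0 / (1.0 + sqd)                       # Cauchy kernel
np.fill_diagonal(q, 0.0)
d = rng.integers(1, 5, size=n).astype(float)

# Direct definition: q~_i = d_i * sum_j d_j q_ij / sum_kl d_k d_l q_kl.
M = np.outer(d, d)
q_tilde = (M * q).sum(axis=1) / (M * q).sum()

# Right-hand side of Eq. (31).
q_marg = q.sum(axis=1) / q.sum()                    # unweighted marginal q_i
G = (d[None, :] * q).sum(axis=1) / q.sum(axis=1)    # G_i = sum_k d_k q_ik / sum_l q_il
rhs = d * q_marg * G / np.sum(d * q_marg * G)
print(np.allclose(q_tilde, rhs))                    # True
```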

Now consider a simplified situation where for some nodes the marginal probability $\tilde p_i$ takes a high value $A = \frac{1-\delta}{R}$ for $R$ points and a low value $B = \frac{\delta}{N-R}$ for the others, where $\delta$ is a very small positive number, and on the display the unweighted marginal probability $q_i$ takes a high value $C = \frac{1-\delta}{K}$ for $K$ points and a low value $D = \frac{\delta}{N-K}$ for the others. Denote the set where $\tilde p_i = A$ and $q_i = C$ by $S'_{TP}$, the set where $\tilde p_i = A$ and $q_i = D$ by $S'_{MISS}$, the set where $\tilde p_i = B$ and $q_i = C$ by $S'_{FP}$, and the set where $\tilde p_i = B$ and $q_i = D$ by $S'_{TN}$.

The Kullback-Leibler divergence of the marginal distributions can then be written as

\begin{align}
D_{\mathrm{KL}}(\{\tilde p_i\}\,\|\,\{\tilde q_i\})
&= -\sum_{i\in S'_{TP}} A\log\tilde q_i - \sum_{i\in S'_{MISS}} A\log\tilde q_i \tag{32--33}\\
&\quad -\sum_{i\in S'_{FP}} B\log\tilde q_i - \sum_{i\in S'_{TN}} B\log\tilde q_i + \text{constant}. \tag{34}
\end{align}

Analysis of dominating terms, and information retrieval interpretation. Assuming again that each $d_i \ge 1$ and thus also each $G_i \ge 1$, the dominating terms are the terms where $i\in S'_{MISS}$, since there the term under the logarithm is $\tilde q_i \propto q_i$, which is close to zero. The divergence then simplifies to
\begin{align}
D_{\mathrm{KL}}(\{\tilde p_i\}\,\|\,\{\tilde q_i\})
&\approx -\sum_{i\in S'_{MISS}} A\log\tilde q_i + \text{constant} \tag{35--36}\\
&= -\sum_{i\in S'_{MISS}} \frac{1-\delta}{R}\log\left(\frac{\delta}{N-K}\cdot\frac{d_i G_i}{\sum_k d_k q_k G_k}\right) + \text{constant} \tag{37}\\
&\approx -\sum_{i\in S'_{MISS}} \frac{1}{R}\log\left(\delta\cdot\frac{d_i G_i}{\sum_k d_k q_k G_k}\right) + \text{constant} \tag{38--39}\\
&= \frac{N_{MISS}}{R}\left[\log\frac{1}{\delta} - \frac{1}{N_{MISS}}\sum_{i\in S'_{MISS}}\log\frac{d_i G_i}{\sum_k d_k q_k G_k}\right] + \text{constant} \tag{40}\\
&= (1-\mathrm{recall})\left[\log\frac{1}{\delta} - \frac{1}{N_{MISS}}\sum_{i\in S'_{MISS}}\log\frac{d_i G_i}{\sum_k d_k q_k G_k}\right] + \text{constant}, \tag{41}
\end{align}
where on the second line we inserted the form of $\tilde q_i$ from (31), $N_{MISS} = |S'_{MISS}|$ is the number of initial nodes not chosen based on the display that were chosen based on the original data, and $\mathrm{recall} = 1 - \frac{N_{MISS}}{R}$ is the corresponding standard definition of recall.

The last line, (41), provides an information retrieval interpretation of the divergence between the marginal distributions. The term at left evaluates recall of the initial nodes. The term on the right is again a weighting term evaluating the cost of initial nodes left out (missed). The misses here denote points with low $q_i$ that are far from other nodes and would not be chosen as initial nodes based on the visual arrangement on the display. The weighting term discounts missing high-degree nodes (more precisely, nodes with high $d_i G_i$, that is, high degree and high degree of nearby neighbors), and thus allows them to be placed in sparser areas of the display.
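As with the conditional divergences, the dominating-term behavior of the marginal divergence can be illustrated on a toy example. The sketch below is our own construction (the surrogate $G_i$ values are arbitrary numbers $\ge 1$); it shows that the marginal divergence grows essentially like $(1-\mathrm{recall})\log(1/\delta)$, in line with (41).

```python
# Toy illustration of Eqs. (32)-(41): for small delta, the marginal KL divergence is
# dominated by the missed "initial nodes" and scales like (1 - recall) * log(1/delta).
import numpy as np

rng = np.random.default_rng(4)
N = 200                                      # number of nodes
R, K = 20, 25                                # initial nodes in the data / on the display
d = rng.integers(1, 6, size=N).astype(float)     # importances d_i >= 1
G = 1.0 + 4.0 * rng.random(N)                # surrogate G_i >= 1 values

data_nodes = np.zeros(N, dtype=bool); data_nodes[:R] = True
disp_nodes = np.zeros(N, dtype=bool); disp_nodes[10:10 + K] = True
recall = np.mean(disp_nodes[data_nodes])     # fraction of initial nodes also chosen on display

for delta in [1e-4, 1e-8, 1e-16]:
    p_marg = np.where(data_nodes, (1 - delta) / R, delta / (N - R))   # p~_i
    q_marg = np.where(disp_nodes, (1 - delta) / K, delta / (N - K))   # unweighted q_i
    q_tilde = d * q_marg * G / np.sum(d * q_marg * G)                 # Eq. (31)
    kl = np.sum(p_marg * np.log(p_marg / q_tilde))
    print(delta, kl / np.log(1 / delta), 1 - recall)
# The printed ratio tends to 1 - recall as delta shrinks, in line with Eq. (41).
```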

Conclusion. Based on the separation of the ws-SNE cost function into two kinds of divergences, and the information retrieval interpretation of both kinds of divergences, we conclude that the ws-SNE cost function corresponds to a two-stage information retrieval task of first choosing initial nodes and then choosing neighbors for them. For both retrieval tasks, the method minimizes the cost of errors in such retrieval, which are dominated by the costs of misses, and the degree (importance) of nodes is taken into account by discounting the costs of high-degree nodes in order to avoid crowding of the high-degree nodes on the display.

3. Additional visualizations

Figure 1 in the paper is high resolution, and readers can find more details by zooming in. The visualizations in this supplemental document provide extra information that cannot fit into the paper due to the space limit. Moreover, we also provide the results of two other datasets.

  • Figures 1 to 4 show large visualizations of the worldtrade dataset (using graphviz, LinLog, t-SNE, and ws-SNE, respectively). They also display the country names and use different node sizes to facilitate comparison with our common knowledge.

  • Figures 5 to 8 show large visualizations of the usair97 dataset (using graphviz, LinLog, t-SNE, and ws-SNE, respectively). They also display the airport names and use different node sizes to facilitate comparison with our common knowledge.

  • Figures 9 to 12 show large visualizations of the mirex07 dataset (using graphviz, LinLog, t-SNE, and ws-SNE, respectively). They also display the class names and use different node sizes to facilitate comparison with our common knowledge.

  • Figure 13 shows the geographical layout (ground truth) of the luxembourg dataset, which can be used to compare with the sixth row in Figure 1 in the paper.

  • Figure 14 shows the visualizations of the jazz dataset using graphviz, LinLog, t-SNE, and ws-SNE.

  • Figure 15 shows the visualizations of the ca-GrQc dataset using graphviz, LinLog, t-SNE, and ws-SNE.

  • Figure 16 shows the full visualization of the shuttle dataset using EE with λ = 1.

  • Figure 17 shows the full visualization of the shuttle dataset using ws-SNE.

      

    Figure 1. Visualizations with text labels for the worldtrade dataset using graphviz. The node size is proportional to the square root

      

    Figure 2. Visualizations with text labels for the worldtrade dataset using LinLog. The node size is proportional to the square root of

      

    Figure 3. Visualizations with text labels for the worldtrade dataset using t-SNE. The node size is proportional to the square root of

      

    Figure 4. Visualizations with text labels for the worldtrade dataset using ws-SNE (Cauchy kernel). The node size is proportional to

      

    Figure 5. Visualizations with text labels for the usair97 dataset using graphviz. The node size is proportional to the square root of its

      

    Figure 6. Visualizations with text labels for the usair97 dataset using LinLog. The node size is proportional to the square root of its

      

    Figure 7. Visualizations with text labels for the usair97 dataset using t-SNE. The node size is proportional to the square root of its

      

    Figure 8. Visualizations with text labels for the usair97 dataset using ws-SNE (Cauchy kernel). The node size is proportional to the

      

    Figure 9. Visualizations with class names (shown by legend) for the mirex07 dataset using graphviz. The node size is proportional to

      

    Figure 10. Visualizations with class names (shown by legend) for the mirex07 dataset using LinLog. The node size is proportional to

      

    Figure 11. Visualizations with class names (shown by legend) for the mirex07 dataset using t-SNE. The node size is proportional to

      

    Figure 12. Visualizations with class names (shown by legend) for the mirex07 dataset using ws-SNE (Cauchy kernel). The node size

      

    Figure 13. Geographical layout (ground truth) of the luxembourg dataset.

      

Figure 14. Visualizations of the jazz dataset using A) graphviz, B) LinLog, C) t-SNE, and D) ws-SNE. The Cauchy kernel was used in ws-SNE. A desired layout should clearly reveal the two musician groups; LinLog and ws-SNE are able to do so.

      

Figure 15. Visualizations of the ca-GrQc dataset using A) graphviz, B) LinLog, C) t-SNE, and D) ws-SNE. The Cauchy kernel was used in ws-SNE. Leskovec et al. (2009) have shown that for large coauthor networks there is a big central community, as well as one or more small communities in the periphery. Thus we expect that a good layout will here as well show both a central community and smaller peripheral communities, as ws-SNE does.

      

    Figure 16. Full visualization of shuttle using EE with λ = 1.

      

    Figure 17. Full visualization of shuttle using ws-SNE with the Cauchy kernel.

      

    Figure 18. Full visualization of MNIST using EE with λ = 1.

    References

Chen, Y., Garcia, E., Gupta, M., Rahimi, A., and Cazzanti, L. Similarity-based classification: Concepts and algorithms. Journal of Machine Learning Research, 10:747–776, 2009.

Leskovec, J., Lang, K., Dasgupta, A., and Mahoney, M. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29–123, 2009.

Mallat, S. Group invariant scattering. Communications in Pure and Applied Mathematics, 2012.