Optimization Equivalence of Divergences Improves Neighbor Embedding
(Supplemental Document)

Zhirong Yang^2 (zhirong.yang@aalto.fi), Jaakko Peltonen^{1,4} (jaakko.peltonen@aalto.fi), Samuel Kaski^{1,3} (samuel.kaski@aalto.fi)
^1 Helsinki Institute for Information Technology HIIT, ^2 Department of Information and Computer Science, Aalto University, Finland, ^3 Department of Computer Science, University of Helsinki, ^4 University of Tampere

This supplemental document is organized as follows. Section 1 gives the source and brief descriptions of the datasets used in the paper. Section 2 gives a detailed information retrieval interpretation of ws-SNE. Section 3 provides additional visualizations that do not fit within the page limit of the main paper.

1. Datasets

We present results of six datasets in the paper:

  • worldtrade: the data is from Pajek. It is a weighted graph whose edges are trading amounts between 80 countries in the world.

  • usair97: the data is from the LinLog layout package. It is a binary graph where the nodes are 332 airports in the United States and the edges indicate whether there is a direct flight between the airports.

  • mirex07: the data is from the Third Music Information Retrieval Evaluation eXchange (MIREX). We used the version from the collection by Chen et al. (2009). It is a similarity graph of 3090 songs. The songs are evenly divided among 10 classes that roughly correspond to different music genres. The weighted edges are human judgments of how similar two songs are.

  • MNIST: the data is from the MNIST database. It contains 70000 handwritten digit images from 10 classes. We preprocessed the images by the scattering operator (Mallat, 2012) and PCA, which yields 256 features for each sample.

  • luxembourg: the data is from the University of Florida Sparse Matrix Collection. It contains 114599 nodes and 239332 edges, where each edge is a street in Luxembourg.

  • shuttle: the data is from the UCI Machine Learning Repository, Statlog (Shuttle) Data Set. It contains 58000 samples from 7 classes, where each sample has 9 numerical shuttle attributes.

To maintain the space limit in the paper, we present results of two other datasets in this supplemental document (see Section 3):

  • jazz: the data is from Arenas's network dataset collection. It is a social network of 198 musicians. The musicians are mainly in New York and Chicago, with a few in both places or in other places.

  • ca-GrQc: the data is from the Stanford Network Analysis Project. It is a coauthor network among 5242 researchers working on general relativity. We extracted the largest connected component with 4158 authors.

2. Retrieval interpretation of ws-SNE

Consider the objective of ws-SNE, $J_{\text{ws-SNE}}(Y) = D_{\mathrm{KL}}(p\,\|\,M\circ q)$, where $M_{ij} = d_i d_j$ and $d_i$ represents an importance of node $i$ which is known or computable from the data, such as the number of neighbors of the node (below, nodes with high importance are also referred to as 'high-degree' nodes). We will show that $J_{\text{ws-SNE}}(Y)$ has an information retrieval interpretation as maximizing recall of retrieved neighbors, weighted by the severity of the left-out misses.

Notation. As in the main paper, let
\[
\tilde p_{ij} = \frac{p_{ij}}{\sum_{kl} p_{kl}}
\]
be a symmetric input distribution jointly over nodes and their neighbors in the original space. Here the $p_{ij}$ are symmetric but not necessarily normalized, and the normalization in $\tilde p_{ij}$ ensures $\sum_{ij}\tilde p_{ij}=1$. Let
\[
\tilde q_{ij} = \frac{M_{ij}\,q_{ij}}{\sum_{kl} M_{kl}\,q_{kl}} = \frac{d_i d_j\, q_{ij}}{\sum_{kl} d_k d_l\, q_{kl}} \tag{1}
\]
be a distribution over nodes and their neighbors on the display, where $q_{ij}$ is Gaussian, $q_{ij}=\exp(-\|y_i-y_j\|^2)$, or Cauchy, $q_{ij}=(1+\|y_i-y_j\|^2)^{-1}$. Note that again the $q_{ij}$ are not normalized, and the normalization in $\tilde q_{ij}$ ensures $\sum_{ij}\tilde q_{ij}=1$.

The joint distribution $\tilde q_{ij}$ depends on the locations of nodes on the display and on their importances; intuitively, this distribution represents the retrieval behavior of an analyst who looks at the display in a quick manner, noticing high-importance nodes and close-by other nodes, and the task of the visualization is to ensure that such retrieval of nodes and neighbors corresponds to the true neighbors represented in $\tilde p_{ij}$.

Each joint distribution over nodes can be written as a product of a marginal and a conditional distribution, so that $\tilde p_{ij}=\tilde p_i\,\tilde p_{j|i}$ where $\tilde p_i=\sum_k \tilde p_{ik}=\sum_k p_{ik}/\sum_{kl}p_{kl}$ and $\tilde p_{j|i}=\tilde p_{ij}/\tilde p_i=p_{ij}/\sum_k p_{ik}$, and similarly $\tilde q_{ij}=\tilde q_i\,\tilde q_{j|i}$ where
\[
\tilde q_i=\sum_k \tilde q_{ik}=\frac{d_i\sum_k d_k q_{ik}}{\sum_{kl} d_k d_l q_{kl}}
\qquad\text{and}\qquad
\tilde q_{j|i}=\frac{\tilde q_{ij}}{\tilde q_i}=\frac{d_j q_{ij}}{\sum_k d_k q_{ik}}.
\]
Note that $\sum_i \tilde p_i=1$ and $\sum_j \tilde p_{j|i}=1$ for all $i$, and similarly $\sum_i \tilde q_i=1$ and $\sum_j \tilde q_{j|i}=1$ for all $i$.

Inserting the above into the cost function of ws-SNE, the Kullback-Leibler divergence becomes a sum of two terms, a divergence between the marginals and a weighted mean of divergences between the conditionals:
\begin{align}
J_{\text{ws-SNE}}(Y)
&= D_{\mathrm{KL}}(p\,\|\,M\circ q) \tag{2--3}\\
&= \sum_{ij}\tilde p_{ij}\log\frac{\tilde p_{ij}}{\tilde q_{ij}} \tag{4}\\
&= \sum_{ij}\tilde p_i\,\tilde p_{j|i}\log\frac{\tilde p_i\,\tilde p_{j|i}}{\tilde q_i\,\tilde q_{j|i}} \tag{5}\\
&= \sum_{ij}\tilde p_i\,\tilde p_{j|i}\log\frac{\tilde p_i}{\tilde q_i} + \sum_{ij}\tilde p_i\,\tilde p_{j|i}\log\frac{\tilde p_{j|i}}{\tilde q_{j|i}} \tag{6}\\
&= \sum_{i}\tilde p_i\log\frac{\tilde p_i}{\tilde q_i} + \sum_{i}\tilde p_i\sum_j\tilde p_{j|i}\log\frac{\tilde p_{j|i}}{\tilde q_{j|i}} \tag{7}\\
&= D_{\mathrm{KL}}(\{\tilde p_i\}\,\|\,\{\tilde q_i\}) + \sum_i \tilde p_i\, D_{\mathrm{KL}}(\{\tilde p_{j|i}\}\,\|\,\{\tilde q_{j|i}\}), \tag{8--10}
\end{align}
where $\{\tilde p_i\}$ is the distribution formed by all values $\tilde p_i$, and similarly for $\{\tilde q_i\}$; and $\{\tilde p_{j|i}\}$ is the distribution formed by all values $\tilde p_{j|i}$ for a fixed $i$, and similarly for $\{\tilde q_{j|i}\}$.

We next analyze the conditional divergences and provide an information retrieval interpretation for them, and then analyze the marginal divergence and provide an information retrieval interpretation for it.
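Before doing so, the decomposition in (10) can be checked numerically on a toy example. The following NumPy sketch is our own illustration, not code from the paper; all variable names are made up. It builds a small symmetric affinity matrix, a random two-dimensional layout with the Cauchy kernel, and degree-based importances $d_i$, and confirms that the joint divergence equals the marginal divergence plus the weighted mean of conditional divergences.

```python
# Sanity check of Eq. (10): D_KL(p~ || q~) = marginal KL + weighted mean of conditional KLs.
# Illustrative toy example only; names are ours, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Symmetric, unnormalized input affinities p_ij with zero diagonal.
p = rng.random((n, n))
p = (p + p.T) / 2
np.fill_diagonal(p, 0.0)

# Random 2-D layout Y and Cauchy output kernel q_ij = (1 + ||y_i - y_j||^2)^(-1).
Y = rng.standard_normal((n, 2))
sqd = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
q = 1.0 / (1.0 + sqd)
np.fill_diagonal(q, 0.0)

# Importances d_i (here simply weighted degrees) and the weighting matrix M_ij = d_i d_j.
d = p.sum(axis=1)
M = np.outer(d, d)

# Normalized joint distributions p~_ij and q~_ij, Eq. (1).
P = p / p.sum()
Q = (M * q) / (M * q).sum()

off = ~np.eye(n, dtype=bool)                  # exclude the zero diagonal from the KL sums

# Left-hand side: joint divergence D_KL(p~ || q~).
kl_joint = np.sum(P[off] * np.log(P[off] / Q[off]))

# Right-hand side: marginal KL plus weighted mean of conditional KLs, Eq. (10).
Pi, Qi = P.sum(axis=1), Q.sum(axis=1)         # marginals p~_i and q~_i
Pc, Qc = P / Pi[:, None], Q / Qi[:, None]     # conditionals p~_{j|i} and q~_{j|i}
kl_marg = np.sum(Pi * np.log(Pi / Qi))
kl_cond = sum(Pi[i] * np.sum(Pc[i, off[i]] * np.log(Pc[i, off[i]] / Qc[i, off[i]]))
              for i in range(n))

print(np.isclose(kl_joint, kl_marg + kl_cond))   # True
```

The check only uses the factorization of a joint distribution into its marginal and conditional parts, so it holds for any choice of affinities, layout, and importances.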

2.1. Analysis of the conditional divergences

We first analyze the second term, which is a weighted average of Kullback-Leibler divergences between conditional distributions of neighbors. Each Kullback-Leibler divergence in the second term compares the true neighborhood of node $i$ to an on-screen neighborhood where users are likely to retrieve close-by nodes, or nodes with high degree, and counts the cost of misses in such retrieval. We now show this has an information retrieval interpretation.

First, note that the weighted conditional output probabilities $\tilde q_{j|i}$ obtained from (1) can be written as
\[
\tilde q_{j|i}
= \frac{d_j\, q_{ij}}{\sum_k d_k\, q_{ik}}
= \frac{d_j\,\frac{q_{ij}}{\sum_l q_{il}}}{\sum_k d_k\,\frac{q_{ik}}{\sum_l q_{il}}}
= \frac{d_j\,\frac{q_{ij}}{\sum_l q_{il}}}{G_i}, \tag{11}
\]
where the $q_{ij}/\sum_l q_{il}$ are unweighted conditional output probabilities, which are based only on how close points are to point $i$ on the display but not on the importances of the points. Here we denoted the denominator by
\[
G_i = \sum_k d_k\,\frac{q_{ik}}{\sum_l q_{il}}, \tag{12}
\]
which is the weighted average importance of the nodes near $i$ on the display, where the importance of each node $k$ is weighted by $q_{ik}/\sum_l q_{il}$.
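As a concrete illustration of (11) and (12), the short sketch below (hypothetical variable names, not code from the paper) computes the weighted conditional probabilities for one node of a toy layout both directly from their definition and via the rescaling by $d_j$ and $G_i$, and checks that the two agree.

```python
# Illustration of Eq. (11): q~_{j|i} equals the unweighted conditional probabilities
# rescaled by the importances d_j and normalized by G_i. Toy example of our own.
import numpy as np

rng = np.random.default_rng(1)
n = 6
Y = rng.standard_normal((n, 2))
sqd = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
q = 1.0 / (1.0 + sqd)                          # Cauchy kernel
np.fill_diagonal(q, 0.0)
d = rng.integers(1, 5, size=n).astype(float)   # node importances, d_k >= 1

i = 0
cond_unweighted = q[i] / q[i].sum()            # q_ij / sum_l q_il
G_i = np.sum(d * cond_unweighted)              # Eq. (12): weighted average importance near i

lhs = d * q[i] / np.sum(d * q[i])              # q~_{j|i} = d_j q_ij / sum_k d_k q_ik
rhs = d * cond_unweighted / G_i                # right-hand side of Eq. (11)
print(np.allclose(lhs, rhs))                   # True
```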

Analysis in a simplified situation. Consider a simplified situation where the input probability $\tilde p_{j|i}$ takes a high value $A_i = \frac{1-\delta}{R_i}$ for $R_i$ points and a low value $B_i = \frac{\delta}{N-R_i-1}$ for the other points, where $\delta$ is a very small positive number, and the unweighted output probability $q_{ij}/\sum_k q_{ik}$ similarly takes a high value $C_i = \frac{1-\delta}{K_i}$ for $K_i$ points and a low value $D_i = \frac{\delta}{N-K_i-1}$ for the other points. Denote the set where $\tilde p_{j|i}=A_i$ and $q_{ij}/\sum_k q_{ik}=C_i$ by $S_{TP,i}$, the set where $\tilde p_{j|i}=A_i$ and $q_{ij}/\sum_k q_{ik}=D_i$ by $S_{MISS,i}$, the set where $\tilde p_{j|i}=B_i$ and $q_{ij}/\sum_k q_{ik}=C_i$ by $S_{FP,i}$, and the set where $\tilde p_{j|i}=B_i$ and $q_{ij}/\sum_k q_{ik}=D_i$ by $S_{TN,i}$.

The Kullback-Leibler divergence between the conditional input distribution $\{\tilde p_{j|i}\}$ and the weighted conditional output distribution $\{\tilde q_{j|i}\}$ can then be written as

\begin{align}
D_{\mathrm{KL}}(\{\tilde p_{j|i}\}\,\|\,\{\tilde q_{j|i}\})
&= -\sum_{j\in S_{TP,i}} A_i \log\frac{d_j C_i}{G_i}
   -\sum_{j\in S_{MISS,i}} A_i \log\frac{d_j D_i}{G_i} \notag\\
&\quad -\sum_{j\in S_{FP,i}} B_i \log\frac{d_j C_i}{G_i}
   -\sum_{j\in S_{TN,i}} B_i \log\frac{d_j D_i}{G_i} + \text{constant} \tag{13--14}\\
&= -\sum_{j\in S_{TP,i}} \frac{1-\delta}{R_i}\log\frac{d_j(1-\delta)}{K_i G_i} \tag{15}\\
&\quad -\sum_{j\in S_{MISS,i}} \frac{1-\delta}{R_i}\log\frac{d_j\,\delta}{(N-K_i-1)\,G_i} \tag{16}\\
&\quad -\sum_{j\in S_{FP,i}} \frac{\delta}{N-R_i-1}\log\frac{d_j(1-\delta)}{K_i G_i} \tag{17}\\
&\quad -\sum_{j\in S_{TN,i}} \frac{\delta}{N-R_i-1}\log\frac{d_j\,\delta}{(N-K_i-1)\,G_i} + \text{constant}, \tag{18}
\end{align}
where in this simplified situation
\[
G_i = \sum_k d_k\,\frac{q_{ik}}{\sum_l q_{il}}
= \sum_{k\in S_{TP,i}} d_k C_i + \sum_{k\in S_{MISS,i}} d_k D_i + \sum_{k\in S_{FP,i}} d_k C_i + \sum_{k\in S_{TN,i}} d_k D_i. \tag{19--23}
\]

Analysis of dominating terms, and information retrieval interpretation. Assuming that each $d_j \ge 1$ (and thus also that $G_i \ge 1$), the dominating terms are the terms $\frac{1-\delta}{R_i}\log(d_j\delta) \approx \frac{1}{R_i}\log(d_j\delta)$, and the divergence simplifies to
\begin{align}
D_{\mathrm{KL}}(\{\tilde p_{j|i}\}\,\|\,\{\tilde q_{j|i}\})
&\approx -\sum_{j\in S_{MISS,i}} \frac{1}{R_i}\log(d_j\,\delta) + \text{constant} \tag{24--25}\\
&= \frac{N_{MISS,i}}{R_i}\left[\log\frac{1}{\delta} - \frac{\sum_{j\in S_{MISS,i}}\log d_j}{N_{MISS,i}}\right] + \text{constant} \tag{26}\\
&= (1-\mathrm{recall}(i))\cdot\left[\log\frac{1}{\delta} - \frac{\sum_{j\in S_{MISS,i}}\log d_j}{N_{MISS,i}}\right] + \text{constant}, \tag{27}
\end{align}
where $N_{MISS,i} = |S_{MISS,i}|$ is the number of missed true neighbors which were not close to $i$ on the display, and $\mathrm{recall}(i) = 1 - \frac{N_{MISS,i}}{R_i}$ is the corresponding standard definition of recall.

The last line, (27), provides an information retrieval interpretation of the divergence between the conditional distributions. The term at left measures recall of neighbors. The term at right is a weighting term interpreted as the average cost of the misses. In the weighting term, the cost of missing high-degree neighbors is discounted, in order to allow high-degree nodes to be kept farther from each other than in the unweighted setting; this prevents high-degree nodes from crowding and allows the method to distribute the high- and low-degree nodes more evenly.

Thus, when ws-SNE minimizes the weighted average of the Kullback-Leibler divergences in (10), it optimizes recall of true neighbors from the display, weighted by the severity of the missed neighbors.
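The dominating-term analysis can be illustrated numerically. The sketch below is a toy construction of our own (the values of $N$, $R_i$, $K_i$, the importances, and the sets are chosen arbitrarily); it builds the simplified conditional distributions and shows that, as $\delta$ shrinks, the conditional divergence grows essentially like $(1-\mathrm{recall}(i))\log(1/\delta)$, in line with (27).

```python
# Toy illustration of Eqs. (13)-(27): for small delta, the conditional KL divergence
# is dominated by the misses and scales like (1 - recall(i)) * log(1/delta).
# Here N counts the candidate neighbors of node i (node i itself is excluded).
import numpy as np

rng = np.random.default_rng(2)
N = 200                                     # candidate neighbors of node i
R, K = 20, 25                               # true neighbors / retrieved on-screen neighbors
d = rng.integers(1, 6, size=N).astype(float)    # importances d_j >= 1

true_nb = np.zeros(N, dtype=bool); true_nb[:R] = True            # S_TP,i together with S_MISS,i
shown_nb = np.zeros(N, dtype=bool); shown_nb[10:10 + K] = True   # S_TP,i together with S_FP,i
recall_i = np.mean(shown_nb[true_nb])       # |S_TP,i| / R_i

for delta in [1e-4, 1e-8, 1e-16]:
    # Input conditional p~_{j|i}: high value on true neighbors, low value elsewhere.
    p = np.where(true_nb, (1 - delta) / R, delta / (N - R))
    # Unweighted output conditional: high value on shown neighbors, low value elsewhere.
    q_unw = np.where(shown_nb, (1 - delta) / K, delta / (N - K))
    # Weighted output conditional q~_{j|i} = d_j q_unw_j / G_i, Eq. (11).
    q_w = d * q_unw / np.sum(d * q_unw)
    kl = np.sum(p * np.log(p / q_w))
    print(delta, kl / np.log(1 / delta), 1 - recall_i)
# The printed ratio approaches 1 - recall(i) as delta shrinks, matching Eq. (27).
```

The convergence is slow (the neglected terms are of constant order while the dominating term grows only like $\log(1/\delta)$), but the trend toward $1-\mathrm{recall}(i)$ is already visible for moderate $\delta$.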

2.2. Analysis of the marginal divergence

First, denote the unweighted marginal probability by $q_i = \sum_j q_{ij} / \sum_{kl} q_{kl}$, and note that the weighted marginal output distribution can then be written as
\begin{align}
\tilde q_i
&= \frac{d_i \sum_j d_j q_{ij}}{\sum_{kl} d_k d_l q_{kl}} \tag{28--29}\\
&= \frac{d_i\left(\sum_j d_j q_{ij}\right)\left(\sum_j q_{ij}\right)^{-1}\left(\sum_j q_{ij}\right)\left(\sum_{mn} q_{mn}\right)^{-1}}
        {\sum_k d_k\left(\sum_l d_l q_{kl}\right)\left(\sum_l q_{kl}\right)^{-1}\left(\sum_l q_{kl}\right)\left(\sum_{mn} q_{mn}\right)^{-1}} \tag{30}\\
&= \frac{d_i\, q_i\, G_i}{\sum_k d_k\, q_k\, G_k}. \tag{31}
\end{align}
The weighted marginal probability attains high values for nodes $i$ that have high degree and that are near many other nodes of high degree. Nodes that are far from the other nodes (low $q_i$) have a low weighted marginal probability.
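Identity (31) is easy to verify numerically. The following sketch (illustrative names, not code from the paper) computes the weighted marginal both from its definition and from $d_i q_i G_i$, and checks that the two agree.

```python
# Check of Eq. (31): q~_i equals d_i q_i G_i normalized over nodes, where q_i is the
# unweighted marginal and G_i the weighted average importance near i. Toy example.
import numpy as np

rng = np.random.default_rng(3)
n = 7
Y = rng.standard_normal((n, 2))
sqd = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
q = 1.0 / (1.0 + sqd)                       # Cauchy kernel
np.fill_diagonal(q, 0.0)
d = rng.integers(1, 5, size=n).astype(float)

# Direct definition: q~_i = d_i * sum_j d_j q_ij / sum_kl d_k d_l q_kl.
M = np.outer(d, d)
q_tilde = (M * q).sum(axis=1) / (M * q).sum()

# Right-hand side of Eq. (31).
q_marg = q.sum(axis=1) / q.sum()                    # unweighted marginal q_i
G = (d[None, :] * q).sum(axis=1) / q.sum(axis=1)    # G_i = sum_k d_k q_ik / sum_l q_il
rhs = d * q_marg * G / np.sum(d * q_marg * G)
print(np.allclose(q_tilde, rhs))                    # True
```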

Now consider a simplified situation where for some nodes the marginal probability $\tilde p_i$ takes a high value $A = \frac{1-\delta}{R}$ for $R$ points and a low value $B = \frac{\delta}{N-R}$ for the others, where $\delta$ is a very small positive number, and on the display the unweighted marginal probability $q_i$ takes a high value $C = \frac{1-\delta}{K}$ for $K$ points and a low value $D = \frac{\delta}{N-K}$ for the others. Denote the set where $\tilde p_i = A$ and $q_i = C$ by $S'_{TP}$, the set where $\tilde p_i = A$ and $q_i = D$ by $S'_{MISS}$, the set where $\tilde p_i = B$ and $q_i = C$ by $S'_{FP}$, and the set where $\tilde p_i = B$ and $q_i = D$ by $S'_{TN}$.

The Kullback-Leibler divergence of the marginal distributions can then be written as

\begin{align}
D_{\mathrm{KL}}(\{\tilde p_i\}\,\|\,\{\tilde q_i\})
&= -\sum_{i\in S'_{TP}} A\log\tilde q_i - \sum_{i\in S'_{MISS}} A\log\tilde q_i \tag{32--33}\\
&\quad -\sum_{i\in S'_{FP}} B\log\tilde q_i - \sum_{i\in S'_{TN}} B\log\tilde q_i + \text{constant}. \tag{34}
\end{align}

Analysis of dominating terms, and information retrieval interpretation. Assuming again that each $d_i \ge 1$ and thus also each $G_i \ge 1$, the dominating terms are the terms where $i\in S'_{MISS}$, since there the term under the logarithm is $\tilde q_i \propto q_i$, which is close to zero. The divergence then simplifies to
\begin{align}
D_{\mathrm{KL}}(\{\tilde p_i\}\,\|\,\{\tilde q_i\})
&\approx -\sum_{i\in S'_{MISS}} A\log\tilde q_i + \text{constant} \tag{35--36}\\
&= -\sum_{i\in S'_{MISS}} \frac{1-\delta}{R}\log\left(\frac{\delta}{N-K}\cdot\frac{d_i G_i}{\sum_k d_k q_k G_k}\right) + \text{constant} \tag{37}\\
&\approx -\sum_{i\in S'_{MISS}} \frac{1}{R}\log\left(\delta\cdot\frac{d_i G_i}{\sum_k d_k q_k G_k}\right) + \text{constant} \tag{38--39}\\
&= \frac{N_{MISS}}{R}\left[\log\frac{1}{\delta} - \frac{1}{N_{MISS}}\sum_{i\in S'_{MISS}}\log\frac{d_i G_i}{\sum_k d_k q_k G_k}\right] + \text{constant} \tag{40}\\
&= (1-\mathrm{recall})\left[\log\frac{1}{\delta} - \frac{1}{N_{MISS}}\sum_{i\in S'_{MISS}}\log\frac{d_i G_i}{\sum_k d_k q_k G_k}\right] + \text{constant}, \tag{41}
\end{align}
where on the second line we inserted the form of $\tilde q_i$ from (31), $N_{MISS} = |S'_{MISS}|$ is the number of initial nodes not chosen based on the display that were chosen based on the original data, and $\mathrm{recall} = 1 - \frac{N_{MISS}}{R}$ is the corresponding standard definition of recall.

The last line, (41), provides an information retrieval interpretation of the divergence between the marginal distributions. The term at left evaluates recall of the initial nodes. The term on the right is again a weighting term evaluating the cost of initial nodes left out (missed). The misses here denote points with low $q_i$ that are far from other nodes and would not be chosen as initial nodes based on the visual arrangement on the display. The weighting term discounts missing high-degree nodes (more precisely, nodes with high $d_i G_i$, that is, high degree and high degree of nearby neighbors), and thus allows them to be placed in sparser areas of the display.
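As with the conditional divergences, the dominating-term behavior of the marginal divergence can be illustrated on a toy example. The sketch below is our own construction (the surrogate $G_i$ values are arbitrary numbers $\ge 1$); it shows that the marginal divergence grows essentially like $(1-\mathrm{recall})\log(1/\delta)$, in line with (41).

```python
# Toy illustration of Eqs. (32)-(41): for small delta, the marginal KL divergence is
# dominated by the missed "initial nodes" and scales like (1 - recall) * log(1/delta).
import numpy as np

rng = np.random.default_rng(4)
N = 200                                      # number of nodes
R, K = 20, 25                                # initial nodes in the data / on the display
d = rng.integers(1, 6, size=N).astype(float)     # importances d_i >= 1
G = 1.0 + 4.0 * rng.random(N)                # surrogate G_i >= 1 values

data_nodes = np.zeros(N, dtype=bool); data_nodes[:R] = True
disp_nodes = np.zeros(N, dtype=bool); disp_nodes[10:10 + K] = True
recall = np.mean(disp_nodes[data_nodes])     # fraction of initial nodes also chosen on display

for delta in [1e-4, 1e-8, 1e-16]:
    p_marg = np.where(data_nodes, (1 - delta) / R, delta / (N - R))   # p~_i
    q_marg = np.where(disp_nodes, (1 - delta) / K, delta / (N - K))   # unweighted q_i
    q_tilde = d * q_marg * G / np.sum(d * q_marg * G)                 # Eq. (31)
    kl = np.sum(p_marg * np.log(p_marg / q_tilde))
    print(delta, kl / np.log(1 / delta), 1 - recall)
# The printed ratio tends to 1 - recall as delta shrinks, in line with Eq. (41).
```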

Conclusion. Based on the separation of the ws-SNE cost function into two kinds of divergences, and the information retrieval interpretation of both kinds of divergences, we conclude that the ws-SNE cost function corresponds to a two-stage information retrieval task of first choosing initial nodes and then choosing neighbors for them. For both retrieval tasks, the method minimizes the cost of errors in such retrieval, which are dominated by the costs of misses, and the degree (importance) of nodes is taken into account by discounting the costs of high-degree nodes in order to avoid crowding of the high-degree nodes on the display.

3. Additional visualizations

Figure 1 in the paper is high resolution, and readers can find more details by zooming in. The visualizations in this supplemental document provide extra information that cannot fit into the paper due to the space limit. Moreover, we also provide the results of two other datasets.

  • Figures 1 to 4 show large visualizations of the worldtrade dataset (using graphviz, LinLog, t-SNE, and ws-SNE, respectively). They also display the country names and use different node sizes to facilitate comparison with our common knowledge.

  • Figures 5 to 8 show large visualizations of the usair97 dataset (using graphviz, LinLog, t-SNE, and ws-SNE, respectively). They also display the airport names and use different node sizes to facilitate comparison with our common knowledge.

  • Figures 9 to 12 show large visualizations of the mirex07 dataset (using graphviz, LinLog, t-SNE, and ws-SNE, respectively). They also display the class names and use different node sizes to facilitate comparison with our common knowledge.

  • Figure 13 shows the geographical layout (ground truth) of the luxembourg dataset, which can be used to compare with the sixth row in Figure 1 in the paper.

  • Figure 14 shows the visualizations of the jazz dataset using graphviz, LinLog, t-SNE, and ws-SNE.

  • Figure 15 shows the visualizations of the ca-GrQc dataset using graphviz, LinLog, t-SNE, and ws-SNE.

  • Figure 16 shows the full visualization of the shuttle dataset using EE with λ = 1.

  • Figure 17 shows the full visualization of the shuttle dataset using ws-SNE.

      

    Figure 1. Visualizations with text labels for the worldtrade dataset using graphviz. The node size is proportional to the square root

      

    Figure 2. Visualizations with text labels for the worldtrade dataset using LinLog. The node size is proportional to the square root of

      

    Figure 3. Visualizations with text labels for the worldtrade dataset using t-SNE. The node size is proportional to the square root of

      

    Figure 4. Visualizations with text labels for the worldtrade dataset using ws-SNE (Cauchy kernel). The node size is proportional to

      

    Figure 5. Visualizations with text labels for the usair97 dataset using graphviz. The node size is proportional to the square root of its

      

    Figure 6. Visualizations with text labels for the usair97 dataset using LinLog. The node size is proportional to the square root of its

      

    Figure 7. Visualizations with text labels for the usair97 dataset using t-SNE. The node size is proportional to the square root of its

      

    Figure 8. Visualizations with text labels for the usair97 dataset using ws-SNE (Cauchy kernel). The node size is proportional to the

      

    Figure 9. Visualizations with class names (shown by legend) for the mirex07 dataset using graphviz. The node size is proportional to

      

    Figure 10. Visualizations with class names (shown by legend) for the mirex07 dataset using LinLog. The node size is proportional to

      

    Figure 11. Visualizations with class names (shown by legend) for the mirex07 dataset using t-SNE. The node size is proportional to

      

    Figure 12. Visualizations with class names (shown by legend) for the mirex07 dataset using ws-SNE (Cauchy kernel). The node size

      

    Figure 13. Geographical layout (ground truth) of the luxembourg dataset.

      

Figure 14. Visualizations of the jazz dataset using A) graphviz, B) LinLog, C) t-SNE, and D) ws-SNE. The Cauchy kernel was used in ws-SNE. A desired layout should clearly reveal the two musician groups; LinLog and ws-SNE are able to do so.

      

Figure 15. Visualizations of the ca-GrQc dataset using A) graphviz, B) LinLog, C) t-SNE, and D) ws-SNE. The Cauchy kernel was used in ws-SNE. Leskovec et al. (2009) have shown that for large coauthor networks there is a big central community, as well as one or more small communities in the periphery. Thus we expect that a good layout will here as well show both a central community and smaller peripheral communities, as ws-SNE does.

      

    Figure 16. Full visualization of shuttle using EE with λ = 1.

      

    Figure 17. Full visualization of shuttle using ws-SNE with the Cauchy kernel.

      

    Figure 18. Full visualization of MNIST using EE with λ = 1.

    References

Chen, Y., Garcia, E., Gupta, M., Rahimi, A., and Cazzanti, L. Similarity-based classification: Concepts and algorithms. Journal of Machine Learning Research, 10:747–776, 2009.

Leskovec, J., Lang, K., Dasgupta, A., and Mahoney, M. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29–123, 2009.

Mallat, S. Group invariant scattering. Communications in Pure and Applied Mathematics, 2012.