Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics

Karsten Borgwardt

February 25 to March 10

Bioinformatics Group MPIs Tübingen

Clustering in bioinformatics

Microarrays

Clustering is a widely used tool in microarray analysis.

Class discovery is an important problem in microarray studies for two reasons: either the classes are completely unknown beforehand, or it is unknown whether a known class contains interesting subclasses.

Examples

Classes unknown: Does a disease affect gene expression in a particular tissue? Does gene expression differ between two groups in a particular condition?

Subclasses unknown: Are there subtypes of a disease? Is there even a hierarchy of subclasses within one disease?

Popularity

Clustering tools are available in the large microarray database NCBI Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/

3002 hits for ‘microarray clustering’ in PubMed

Recent editorial of OUP Bioinformatics

Distance metrics

Euclidean distance

Euclidean distance of gene x and y of n samples or sample x and y of n genes:

$$d_{xy} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1)$$

Pearson’s Correlation

Pearson correlation of gene x and y of n samples or sample x and y of n genes, where $\bar{x}$ is the mean of x and $\bar{y}$ is the mean of y:

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (2)$$

Un-centered correlation coefficient

Un-centered correlation coefficient of gene x and y of n samples or sample x and y of n genes:

$$r^{u}_{xy} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}} \qquad (3)$$
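The three measures in equations (1) to (3) can be sketched in a few lines of plain Python. The function names and the toy expression profiles are assumptions made for illustration only.

```python
import math

def euclidean_distance(x, y):
    """Equation (1): square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def pearson_correlation(x, y):
    """Equation (2): correlation after centering each profile on its mean."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    den_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    den_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return num / (den_x * den_y)

def uncentered_correlation(x, y):
    """Equation (3): like (2), but the means are not subtracted."""
    num = sum(xi * yi for xi, yi in zip(x, y))
    den_x = math.sqrt(sum(xi ** 2 for xi in x))
    den_y = math.sqrt(sum(yi ** 2 for yi in y))
    return num / (den_x * den_y)

# Hypothetical profiles: gene_b is gene_a shifted by a constant offset.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [3.0, 4.0, 5.0, 6.0]

print(euclidean_distance(gene_a, gene_b))
print(pearson_correlation(gene_a, gene_b))
print(uncentered_correlation(gene_a, gene_b))
```

A constant offset leaves the Pearson correlation at 1 but lowers the un-centered coefficient, which is the practical difference between (2) and (3).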

Clustering algorithms

Hierarchical Clustering

Single linkage: The linking distance is the minimum distance between two clusters.

Complete linkage: The linking distance is the maximum distance between two clusters.

Average linkage/UPGMA: The linking distance is the average of all pair-wise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).
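A minimal sketch of agglomerative clustering under the three linkage rules, in plain Python. The use of Euclidean distance between profiles, the function names, and the toy data are assumptions for illustration; the naive search over all cluster pairs is only meant to make the linkage definitions concrete.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def linkage_distance(c1, c2, points, method):
    """Distance between two clusters given as lists of point indices."""
    pairwise = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if method == "single":      # minimum pairwise distance
        return min(pairwise)
    if method == "complete":    # maximum pairwise distance
        return max(pairwise)
    if method == "average":     # mean of all pairwise distances (UPGMA)
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {method}")

def agglomerative(points, n_clusters, method="average"):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], points, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical two-dimensional profiles forming two tight pairs and one outlier.
profiles = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.0, 0.0]]
for method in ("single", "complete", "average"):
    print(method, agglomerative(profiles, 3, method=method))
```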

‘Flat’ Clustering

k-means (k from 2 to 15, 3 runs)

k-median (k-medoid)
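A sketch of k-means (Lloyd's algorithm) with restarts, in plain Python. Reading "3 runs" as three random restarts per value of k is an assumption, as are the random initialization, the function names, and the toy data.

```python
import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def k_means(points, k, n_iter=100, seed=0):
    """Alternate between assigning points and recomputing centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: k random points as start
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    cost = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, cost

def best_of_runs(points, k, runs=3):
    """Run k-means with several seeds and keep the lowest-cost result."""
    results = [k_means(points, k, seed=s) for s in range(runs)]
    return min(results, key=lambda r: r[2])

# Hypothetical data: two well-separated groups of two points each.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids, cost = best_of_runs(points, k=2)
print(labels, cost)
```

Scanning k over a range, as on the slide, would mean calling `best_of_runs` once per k and comparing the resulting costs or cluster assignments.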

The two-sample problem

Interpretation of clusters

Clustering introduces ‘structure’ into microarray datasets.

But is there a statistical or biomedical meaning of these classes?

Biomedical meaning has to be established in experiments.

‘Statistical meaning’ can be measured using statistical tests, by a so-called two-sample test.

A two-sample test decides whether two samples were drawn from the same probability distribution or not.

Data diversity

Molecular biology produces a wealth of information.

The problem is that these data are generated on different platforms and by different protocols under different levels of noise.

Hence data from different labs show

different scales

different ranges

different distributions

Main problem: Joint data analysis may detect differences in distributions, not biological phenomena!

The two-sample problem

Given two samples X and Y: were they generated by the same distribution?

Previous approaches

two-sample tests exist for univariate and multivariate data

t-test

A test of the null hypothesis that the means of two normally distributed populations are equal.

Unpaired/independent (versus paired).

For equal sample sizes and equal variances, the t statistic to test whether the means are different can be calculated as follows:

$$t = \frac{\bar{x} - \bar{y}}{\sigma_{xy} \cdot \sqrt{2/n}} \qquad (4)$$

where $\sigma_{xy} = \sqrt{(\sigma_x^2 + \sigma_y^2)/2}$ is the pooled standard deviation and $n$ is the size of each sample. The statistic has $2n - 2$ degrees of freedom.
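The equal-sample-size statistic in equation (4) can be sketched as follows. The function name and the toy samples are assumptions, the variances are taken as unbiased sample variances (divisor n - 1), and looking up the p-value for the returned degrees of freedom is left out.

```python
import math

def t_statistic_equal_n(x, y):
    """Unpaired t statistic for two samples of the same size n.

    Returns the statistic and its degrees of freedom, 2n - 2.
    """
    n = len(x)
    if len(y) != n or n < 2:
        raise ValueError("both samples need the same size n >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    var_y = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
    pooled_sd = math.sqrt((var_x + var_y) / 2.0)
    t = (mean_x - mean_y) / (pooled_sd * math.sqrt(2.0 / n))
    return t, 2 * n - 2

# Hypothetical measurements: the second sample is shifted up by one unit.
sample_x = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_y = [2.0, 3.0, 4.0, 5.0, 6.0]
t, dof = t_statistic_equal_n(sample_x, sample_y)
print(t, dof)
```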

New challenges in bioinformatics

high-dimensional

structured (strings and graphs)

low sample size

Novel distribution test: Maximum Mean Discrepancy (MMD)

MMD key idea

Avoid density estimator, use means in feature spaces.

Maximum Mean Discrepancy (Fortet and Mourier, 1953):

$$D(p, q, \mathcal{F}) := \sup_{f \in \mathcal{F}} \left( \mathbf{E}_p[f(x)] - \mathbf{E}_q[f(y)] \right)$$

Theorem

$D(p, q, \mathcal{F}) = 0$ iff $p = q$, when $\mathcal{F} = C(\mathcal{X})$.

Follows directly, e.g. from Dudley, 1984.

Theorem

$D(p, q, \mathcal{F}) = 0$ iff $p = q$, when $\mathcal{F} = \{f \mid \|f\|_{\mathcal{H}} \le 1\}$, provided that $\mathcal{H}$ is a universal RKHS.

(follows via Steinwart, 2001, Smola et al., 2006).

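MMD statistic

Goal: Estimate $D(p, q, \mathcal{F})$

When $\mathcal{F}$ is the unit ball of an RKHS with kernel $k$, the squared discrepancy can be written in terms of kernel evaluations:

$$D^2(p, q, \mathcal{F}) = \mathbf{E}_{p,p}\, k(x, x') - 2\, \mathbf{E}_{p,q}\, k(x, y) + \mathbf{E}_{q,q}\, k(y, y')$$

U-Statistic: Empirical estimate $D(X, Y, \mathcal{F})$

$$D^2(X, Y, \mathcal{F}) = \frac{1}{m(m-1)} \sum_{i \neq j} \left[ k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j) + k(y_i, y_j) \right]$$

Theorem

$D^2(X, Y, \mathcal{F})$ is an unbiased estimator of $D^2(p, q, \mathcal{F})$.

Test

Estimate $\sigma^2$ from data.

Reject null hypothesis that $p = q$ if $D^2(X, Y, \mathcal{F})$ exceeds acceptance threshold.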
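The U-statistic estimate can be sketched directly from its definition as a sum over index pairs i ≠ j. The Gaussian (RBF) kernel, its bandwidth, the function names, and the toy samples are assumptions for illustration. Estimating the variance and the acceptance threshold is not shown, so this computes the statistic only, not the full test.

```python
import math

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel on vectors (an assumed kernel choice)."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))

def mmd2_u(X, Y, kernel=rbf_kernel):
    """U-statistic estimate of the squared MMD for two samples of size m."""
    m = len(X)
    if len(Y) != m or m < 2:
        raise ValueError("both samples need the same size m >= 2")
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                total += (kernel(X[i], X[j]) - kernel(X[i], Y[j])
                          - kernel(Y[i], X[j]) + kernel(Y[i], Y[j]))
    return total / (m * (m - 1))

# Hypothetical one-dimensional samples from clearly different locations.
X = [[0.0], [0.1], [0.2]]
Y = [[5.0], [5.1], [5.2]]
print(mmd2_u(X, X))  # identical samples: every summand cancels
print(mmd2_u(X, Y))  # well-separated samples: large positive value
```

The double loop makes the quadratic runtime in the number of data points visible. No optimization over functions is needed, only kernel evaluations.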

Attractive for bioinformatics

MMD: two-sample test in terms of kernels

Computationally attractive

search infinite space of functions by evaluating one expression

no optimization problem has to be solved

All thanks to kernels!

Wide applicability

For one- and higher-dimensional vectorial data, but also for structured data!

Two-sample problems can now be tackled on

strings: protein and DNA sequences

graphs: molecules, protein interaction networks

time series: time series of microarray data

and sets, trees, ...
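To illustrate the point about strings, the same U-statistic can be evaluated with a string kernel in place of a vector kernel. The k-mer spectrum kernel used here is an assumed choice, not one named on the slides, and the sequences are made up for the example.

```python
from collections import Counter

def spectrum_kernel(s, t, k=2):
    """Count k-mers shared by two strings (weighted by multiplicity)."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return float(sum(cs[w] * ct[w] for w in cs))

def mmd2_u(X, Y, kernel):
    """U-statistic estimate of the squared MMD for two samples of size m."""
    m = len(X)
    if len(Y) != m or m < 2:
        raise ValueError("both samples need the same size m >= 2")
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                total += (kernel(X[i], X[j]) - kernel(X[i], Y[j])
                          - kernel(Y[i], X[j]) + kernel(Y[i], Y[j]))
    return total / (m * (m - 1))

# Hypothetical DNA fragments: two samples with disjoint 2-mer content.
seqs_a = ["ACGT", "ACGA", "ACGG"]
seqs_b = ["TTTT", "TTTA", "TTTG"]
print(mmd2_u(seqs_a, seqs_a, spectrum_kernel))
print(mmd2_u(seqs_a, seqs_b, spectrum_kernel))
```

Only the kernel changes between the vector and the string case. The estimator itself never looks inside the data points.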

Cross-platform comparability

Data

Microarray data from two breast cancer studies:

one on cDNA platform (Gruvberger et al., 2001)

other on oligonucleotide microarray platform (West et al., 2001)

Task

Can MMD help to find out if two sets of observations were generated by

the same study (both from Gruvberger or both from West)?

different studies (one Gruvberger, one West)?

Experiment

sample size each: 25

dimension of each datapoint: 2,116

significance level: α = 0.05

100 times: 1 sample from Gruvberger, 1 from West

100 times: both from Gruvberger or both from West

report percentage of correct decisions

compare to t-test, Friedman-Rafsky Wald-Wolfowitz and Smirnov

Kernel-based statistical test

Novel statistical test for two-sample problem:

easy to implement

non-parametric

first for structured data

best on high-dimensional data

quadratic runtime w.r.t. the number of data points

impressive accuracy in our experiments

Kernel method for two-sample problem:

all kernels recently defined in molecular biology can be re-used for data integration

applicable to vectors, strings, sets, trees, graphs and time series

Biclustering

Clustering in two dimensions

Alternative names: co-clustering, two-mode clustering

A bicluster is a subset of genes that show similar activity patterns under a subset of conditions.

Clustering in 2 dimensions: cluster patients and conditions

Earliest work by Hartigan, 1972: Divide a matrix into submatrices with minimum variance.

Most interesting cases are NP-complete.

Many extensions in bioinformatics (e.g. Cheng and Church, 2002)
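As a small illustration of the minimum-variance idea, the quantity being minimized can be computed for one candidate submatrix. This is only the scoring step under assumed names and toy data. It does not implement Hartigan's partitioning procedure or any search over submatrices.

```python
def submatrix_variance(matrix, rows, cols):
    """Variance of the entries selected by a set of rows and columns."""
    values = [matrix[r][c] for r in rows for c in cols]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Hypothetical expression matrix: rows are genes, columns are conditions.
expression = [
    [1.0, 1.1, 9.0, 0.9],
    [1.0, 0.9, 2.0, 1.1],
    [7.0, 3.0, 5.0, 8.0],
]

all_rows, all_cols = range(3), range(4)
candidate = submatrix_variance(expression, rows=[0, 1], cols=[0, 1, 3])
whole = submatrix_variance(expression, all_rows, all_cols)
print(candidate, whole)
```

Genes 0 and 1 behave alike under conditions 0, 1 and 3, so that submatrix has a much lower variance than the full matrix. The hard part, and the source of the NP-completeness, is finding such row and column subsets.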

References and further reading

[1] Gretton, Borgwardt, Rasch, Schölkopf, Smola: A kernel method for the two-sample problem. NIPS 2006

The end

See you tomorrow! Next topic: Feature Selection in Bioinformatics
