Data Timings and Statistics

Fig. 13.5. The persistent Betti numbers of 1hck. (a) Topology map, with axes $l$ and $p$. (b) Graph of $\log_2(\beta_1^{l,p} + 1)$.

Fig. 13.6. Persistence histograms, plotting $\log_2(\text{number of cycles} + 1)$ against persistence. BOG's histogram (a) shows some grouping, but 1hck's (b) does not.

13.1.2 Knotting

We also wish to detect whether proteins are knotted or have linking in their structures. I have already described algorithms for detecting linking in Chapter 10. The linking number algorithms give us a signature function for a protein. We may also look for alternate signature functions for describing the topology of a protein. The approach here is to exploit the fast combinatorial representation to compute other knot and link invariants. Future directions include computing polynomial invariants, such as the Alexander polynomial, for detecting knots (Adams, 1994).
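As a rough illustration of what a linking-number signature measures, the following Python sketch counts the signed crossings of one closed polygon through a fan triangulation of the other. This is not the algorithm of Chapter 10; it is a simple stand-in that assumes both curves are closed polygons in general position (no segment of one curve touches an edge or vertex of the fan). All names (`signed_volume`, `crossing_sign`, `linking_number`) are hypothetical, and the sign of the result depends on the chosen orientations.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def signed_volume(a: Point, b: Point, c: Point, d: Point) -> float:
    """Six times the signed volume of tetrahedron abcd (a 3x3 determinant)."""
    ux, uy, uz = b[0] - a[0], b[1] - a[1], b[2] - a[2]
    vx, vy, vz = c[0] - a[0], c[1] - a[1], c[2] - a[2]
    wx, wy, wz = d[0] - a[0], d[1] - a[1], d[2] - a[2]
    return (ux * (vy * wz - vz * wy)
            - uy * (vx * wz - vz * wx)
            + uz * (vx * wy - vy * wx))


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def crossing_sign(p: Point, q: Point, a: Point, b: Point, c: Point) -> int:
    """+1 or -1 if segment pq pierces triangle abc transversally, else 0.

    Degenerate contacts are reported as 0 (general position is assumed).
    """
    sp = sign(signed_volume(a, b, c, p))
    sq = sign(signed_volume(a, b, c, q))
    if sp == 0 or sq == 0 or sp == sq:
        return 0  # both endpoints on one side of the plane, or touching it
    s1 = sign(signed_volume(p, q, a, b))
    s2 = sign(signed_volume(p, q, b, c))
    s3 = sign(signed_volume(p, q, c, a))
    if s1 != 0 and s1 == s2 == s3:
        return sq  # +1 when pq leaves on the side of the triangle normal
    return 0


def linking_number(curve_a: Sequence[Point], curve_b: Sequence[Point]) -> int:
    """Signed count of crossings of curve_b through a fan spanning curve_a."""
    total = 0
    n, m = len(curve_a), len(curve_b)
    for i in range(1, n - 1):  # fan triangles (a0, ai, ai+1)
        tri = (curve_a[0], curve_a[i], curve_a[i + 1])
        for j in range(m):  # closed polygon: last vertex joins the first
            total += crossing_sign(curve_b[j], curve_b[(j + 1) % m], *tri)
    return total


# Two squares arranged as a Hopf link, and the same pair pulled apart.
square_a = [(-1.0, -1.0, 0.0), (1.0, -1.0, 0.0), (1.0, 1.0, 0.0), (-1.0, 1.0, 0.0)]
square_b = [(0.2, 0.3, -1.0), (2.0, 0.3, -1.0), (2.0, 0.3, 1.0), (0.2, 0.3, 1.0)]
far_b = [(x + 5.0, y, z) for (x, y, z) in square_b]
print(abs(linking_number(square_a, square_b)), linking_number(square_a, far_b))
```

For a protein, the two polygons might be closed loops extracted from the backbone or from cycles of the complex; how those loops are chosen is left open here.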

13.1.3 Structure Determination

One method used for determining the architecture of a protein is X-ray crystallography (Rhodes, 2000). After forming a high-quality crystal of a protein, we analyze the diffraction pattern produced by X-irradiation to generate an electron density map. The sequence of amino acids in the protein must be known independently. We then fit the atoms of the residues into the computed electron density map via a series of refinements. The result is a set of Cartesian coordinates for every non-hydrogen atom in the molecule. Usually, we use these coordinates, augmented with van der Waals radii, to produce filtrations for proteins, the input to the algorithms in this book.

We also wish to use persistence as a tool for refining the resolved protein. We guide modifications to the structure of the protein and the radii of the atoms by using persistent complexes. We then produce a synthetic electron density map for the new coordinates and radii, and compare it to the original density map. We may also construct three-dimensional MS complexes of the electron-density data for denoising using persistence. I will discuss general denoising of density functions in Section 13.3.
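The comparison step can be sketched in a few lines of Python. The model below is an assumption made purely for illustration: each atom contributes an isotropic Gaussian whose width equals its radius, the map is sampled on a small cubic grid, and agreement is scored by a Pearson correlation coefficient. Real crystallographic refinement uses more careful scattering models and scores; the function names and the sample coordinates are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Atom = Tuple[float, float, float, float]  # x, y, z, radius


def synthetic_density(atoms: Sequence[Atom], n: int, spacing: float) -> List[float]:
    """Sample a sum of Gaussians (one per atom) on an n x n x n grid.

    Assumption: the Gaussian width of each atom equals its radius.
    """
    values = []
    for i in range(n):
        for j in range(n):
            for k in range(n):
                gx, gy, gz = i * spacing, j * spacing, k * spacing
                rho = 0.0
                for x, y, z, r in atoms:
                    d2 = (gx - x) ** 2 + (gy - y) ** 2 + (gz - z) ** 2
                    rho += math.exp(-d2 / (2.0 * r * r))
                values.append(rho)
    return values


def correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation between two maps sampled on the same grid."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical two-atom "structure" and a modified copy with one atom moved.
original_atoms = [(2.0, 2.0, 2.0, 1.5), (5.0, 4.0, 3.0, 1.7)]
modified_atoms = [(2.0, 2.0, 2.0, 1.5), (6.0, 4.0, 3.0, 1.7)]

reference_map = synthetic_density(original_atoms, n=8, spacing=1.0)
candidate_map = synthetic_density(modified_atoms, n=8, spacing=1.0)
score_same = correlation(reference_map, reference_map)
score_moved = correlation(reference_map, candidate_map)
print(score_same, score_moved)
```

A modification suggested by the persistent complexes would be accepted or rejected depending on whether such a score improves against the measured map; the threshold and the score itself are design choices not fixed by the text.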

13.2 Hierarchical Clustering

In Chapter 2, we looked at α-shapes as a method for describing the connectivity of a space. As we increase α, the centers of the balls in our data sets are connected via edges and triangles. We may view the connections as a hierarchical clustering mechanism. Persistence adds another dimension to α-shapes, giving us a two-parameter family of shapes for describing the clustering of point sets.

Edelsbrunner and Mücke (1994) first noted the possibility of using α-shapes as a method for studying the distribution of galaxies in our universe. Dyksterhouse (1992) took initial steps in this direction. Persistence gives us additional tools for examining the clustering of galaxies in the universe. Figure 13.7 displays a simulated data set due to Marc Dyksterhouse. Each of the 1,717 vertices represents a galaxy and is a component 0-cycle of the complex. The figure also displays the manifolds of the 0-cycles: the paths along which galaxies will be connected in the future. We may use this information to construct a hierarchical description of the galaxies. In addition, we can examine the persistent topological features of the filtration of the universe. Voids, for example, correspond to empty areas of space.

Fig. 13.7. A simulated universe, its 0-cycles, and manifolds.

Fig. 13.8. Graph of $\beta^{l,p}$ projected on the $(l, \beta)$ plane for new data set 1mct: Trypsin complexed with inhibitor from bitter.

Another instance of using persistence for hierarchical clustering is to classify proteins according to their hydrophobic surfaces. Here, we sample hydrophobic points along the surface of a protein. We then compute an α-complex filtration from these points and examine the persistent components. Figure 13.8 shows the graph of β for this data set. The graph is projected