1
Topological modeling of biomolecular data Kelin Xia 1 , Xin Feng 2 , Zhixiong Zhao 1 , Rundong Zhao 2 , Yiying Tong 2 and Guo-wei Wei 1* 1 Department of Mathematics, Michigan State University, East Lansing, MI, 48824 2 Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, 48824 Introduction Biomolecules have enormous amount of topological information, which tightly connects with their biological functions. In this work, we reveal the topology-function relationship of biomolecules using the persistent homology analysis (PHA). We use fullerene molecules as an example and find out that our PHA is able to quantitatively predict fullerene stability. We introduce PHA to extract molecular topological fingerprints (MTFs) based on the persistence of molecular topological invariants, and utilize MTFs for biomolecular data analysis, including characterization, identification and analysis (CIA). We construct both all-atom and coarse-grained representations of MTFs. We further employ MTFs to characterize protein topological evolution during protein folding and quantitatively predict the protein folding stability. An excellent consistence between our persistent homology prediction and molecular dynamics simulation is found. Finally, we propose a multiscale, multiresolution and multidimensional persistent homology to match the resolution with the scale of interest so as to create a topological microscopy for the underlying data. Figure: Enormous amount of topological information in biomolecular systems. Persistent homology theory Homology utilizes a topological space with an algebraic group representation to characterize topological features, such as isolated components, circles, holes and void. Persistent homology further embeds geometric information to topological invariants through a filtration process, in which a series of nested simplicial complexes is constructed and topological structures are characterized continuously over a range of scales. Figure: Illustration of boundary operators, and chain, cycle and boundary groups. Red dots stand for empty sets. I Fundamental theorem of finitely generated abelian groups: H k = Z ⊕···⊕ Z Z p 1 ⊕···⊕ Z p n = Z β k Z p 1 ⊕···⊕ Z p n . (1) Betti number can be simply calculated by β k = rank H k = rank Z k - rank B k . (2) I Filtration process: = K 0 K 1 ⊆···⊆ K m = K. (3) I Persistence: The p-persistent k-th homology group K i is H i,p k = Z i k /(B i+p k \ Z i k ). (4) Fullerene stability analysis Fullerene molecules have carbon-cage structures, which contain only pentagonal and hexagonal rings. Using the Vietoris-Rips complex, the persistence of Betti numbers (i.e., ranks of homology groups), including β 0 , β 1 and β 2 , provides the information regarding length of the bond, width of the pentagon and hexagon ring, and size of the central void. Molecular topological fingerprints Persistent homology is used to extract MTFs based on barcode representation. MTFs are utilized for protein CIA. Figure: The identification of MTFs from two all-atom representations, i.e., hydrogen-included and non-hydrogen, and coarse-grain representation. Figure: Various types of MTFs. Cryo-EM data analysis Cryo-electron microscopy (Cryo-EM) data is usually suered from low signal to noise ratio (SNR). A geometric flow based denoising process is used to reveal the intrinsic topological invariants. Figure: Persistent Betti number (PBN) representation of a noise reduction process. A consistent pattern can be observed after 100 iteration steps. Two types of microtubule models have similarly high correlation coecient but with dramatically dierent MTFs. Figure: MTFs for microtubule experimental data and two fitting models, i.e., monomer model and dimer model, indicating that the latter model is topologically correct. Protein folding analysis We use steered molecular dynamics to simulate the unfolding process of PDB 1UBQ. To analyze its topological evolution, PBNs are calculated for every configurations and then stacked together in sequence to deliver a multidimensional persistence diagram. Figure: Protein unfolding process and its PBN representation. The accumulated bar lengths of β 0 and β 1 are used to quantitatively predict bond energy and total energy during the unfolding process. The Pearson’s correlation coecients are 0.924 and 0.990, respectively. Multiscale, multiresolution and multidimensional topology To describe multiscale properties, we introduce a rigidity density function with a resolution parameter based on our FRI theory. When systematically changing the resolution, a series of geometric and topological representations of the underlying system will be produced. Our multiresolution PHA can match the resolution to the scale of interest so as to create a topological microscopy for the underlying data. Figure: Dierent scales within protein complex 1DYL. Conclusion We introduce PHA for quantitative modeling of the biomolecular systems, and establish biomolecular topology-function relationship through the study of protein flexibility, rigidity, folding and multiscale structure properties. Based on biomolecular data obtained from dierent sources, i.e., Protein Data Bank and Cryo-EM Data Bank, and in dierent representations, i.e., point-cloud and density data, we employ various ways to represent biomolecular structures and construct simplicial complexes for our PHA. Moreover, we introduce a new framework of multiresolution and multidimensional persistent topology. As a result, we can systematically analyze a dynamic process and identify related intrinsic topological properties. Finally, by varying the resolution, we are able to deliver a full geometric and topological “spectrum” for biomolecular analysis. References I 1. Kelin Xia and Guo-Wei Wei, “Persistent homology analysis of protein structure, flexibility and folding”, IJNMBE, 30, 814-844 (2014) I 2. Kelin Xia, Xin Feng, Yiying Tong and Guo-Wei Wei, “Persistent homology for the quantitative prediction of fullerene stability”, JCC, 36, 408-422 (2015) I 3. Kelin Xia and Guo-Wei Wei, “Multidimensional persistence in biomolecular data”, accepted, JCC (2015) I 4. Kelin Xia and Guo-Wei Wei, “Persistent topology for cryo-EM data analysis”, accepted, IJNMBE (2015) I 5. Kelin Xia, Zhixiong Zhao and Guo-Wei Wei, “Multiresolution topological simplification”, accepted, JCB (2015) Acknowledgement This work was partially supported by NSF grants DMS-1160352 and IIS-1302285, NIH Grant R01GM-090208 and MSU Center for Mathematical Molecular Biosciences initiative. http://www.math.msu.edu/˜wei/ http://www.cse.msu.edu/˜ytong/ http://www.msu.edu/˜xiakelin/ [email protected] [email protected] [email protected]

Topological modeling of biomolecular datausers.math.msu.edu/users/wei/PersistentHomology.pdfTopological modeling of biomolecular data Kelin Xia1, Xin Feng2, Zhixiong Zhao1, Rundong

Embed Size (px)

Citation preview

Page 1: Topological modeling of biomolecular datausers.math.msu.edu/users/wei/PersistentHomology.pdfTopological modeling of biomolecular data Kelin Xia1, Xin Feng2, Zhixiong Zhao1, Rundong

Topological modeling of biomolecular data

Kelin Xia1, Xin Feng2, Zhixiong Zhao1, Rundong Zhao2, Yiying Tong2 and Guo-wei Wei1∗

1Department of Mathematics, Michigan State University, East Lansing, MI, 488242Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, 48824

IntroductionBiomolecules have enormous amount of topological information, which tightly connects with theirbiological functions. In this work, we reveal the topology-function relationship of biomolecules using thepersistent homology analysis (PHA). We use fullerene molecules as an example and find out that our PHAis able to quantitatively predict fullerene stability. We introduce PHA to extract molecular topologicalfingerprints (MTFs) based on the persistence of molecular topological invariants, and utilize MTFs forbiomolecular data analysis, including characterization, identification and analysis (CIA). We constructboth all-atom and coarse-grained representations of MTFs. We further employ MTFs to characterizeprotein topological evolution during protein folding and quantitatively predict the protein foldingstability. An excellent consistence between our persistent homology prediction and molecular dynamicssimulation is found. Finally, we propose a multiscale, multiresolution and multidimensional persistenthomology to match the resolution with the scale of interest so as to create a topological microscopy for theunderlying data.

Figure: Enormous amount of topological information in biomolecular systems.

Persistent homology theoryHomology utilizes a topological space with an algebraic group representation to characterize topologicalfeatures, such as isolated components, circles, holes and void. Persistent homology further embedsgeometric information to topological invariants through a filtration process, in which a series of nestedsimplicial complexes is constructed and topological structures are characterized continuously over arange of scales.

Figure: Illustration of boundary operators, and chain, cycle and boundary groups. Red dots stand for empty sets.

I Fundamental theorem of finitely generated abelian groups:

Hk = Z ⊕ · · · ⊕ Z ⊕ Zp1 ⊕ · · · ⊕ Zpn = Zβk ⊕ Zp1 ⊕ · · · ⊕ Zpn. (1)

Betti number can be simply calculated by

βk = rank Hk = rank Zk − rank Bk. (2)

I Filtration process:

∅ = K0 ⊆ K1 ⊆ · · · ⊆ Km = K. (3)

I Persistence: The p-persistent k-th homology group Ki is

Hi,pk = Zi

k/(Bi+pk

⋂Zi

k). (4)

Fullerene stability analysisFullerene molecules have carbon-cage structures, which contain only pentagonal and hexagonal rings.Using the Vietoris-Rips complex, the persistence of Betti numbers (i.e., ranks of homology groups),including β0, β1 and β2, provides the information regarding length of the bond, width of the pentagonand hexagon ring, and size of the central void.

Molecular topological fingerprintsPersistent homology is used to extract MTFs based on barcode representation. MTFs are utilized forprotein CIA.

Figure: The identification of MTFs from two all-atom representations, i.e., hydrogen-included and non-hydrogen, andcoarse-grain representation.

Figure: Various types of MTFs.

Cryo-EM data analysisCryo-electron microscopy (Cryo-EM) data is usually suffered from low signal to noise ratio (SNR). Ageometric flow based denoising process is used to reveal the intrinsic topological invariants.

Figure: Persistent Betti number (PBN) representation of a noise reduction process. A consistent pattern can be observed after 100iteration steps.

Two types of microtubule models have similarly high correlation coefficient but with dramaticallydifferent MTFs.

Figure: MTFs for microtubule experimental data and two fitting models, i.e., monomer model and dimer model, indicating thatthe latter model is topologically correct.

Protein folding analysisWe use steered molecular dynamics to simulate the unfolding process of PDB 1UBQ. To analyze itstopological evolution, PBNs are calculated for every configurations and then stacked together in sequenceto deliver a multidimensional persistence diagram.

Figure: Protein unfolding process and its PBN representation.

The accumulated bar lengths of β0 and β1 are used to quantitatively predict bond energy and total energyduring the unfolding process. The Pearson’s correlation coefficients are 0.924 and 0.990, respectively.

Multiscale, multiresolution and multidimensional topology

To describe multiscale properties, we introduce a rigidity density function with a resolution parameterbased on our FRI theory. When systematically changing the resolution, a series of geometric andtopological representations of the underlying system will be produced. Our multiresolution PHA canmatch the resolution to the scale of interest so as to create a topological microscopy for the underlying data.

Figure: Different scales within protein complex 1DYL.

ConclusionWe introduce PHA for quantitative modeling of the biomolecular systems, and establish biomoleculartopology-function relationship through the study of protein flexibility, rigidity, folding and multiscalestructure properties. Based on biomolecular data obtained from different sources, i.e., Protein Data Bankand Cryo-EM Data Bank, and in different representations, i.e., point-cloud and density data, we employvarious ways to represent biomolecular structures and construct simplicial complexes for our PHA.Moreover, we introduce a new framework of multiresolution and multidimensional persistent topology.As a result, we can systematically analyze a dynamic process and identify related intrinsic topologicalproperties. Finally, by varying the resolution, we are able to deliver a full geometric and topological“spectrum” for biomolecular analysis.

References

I 1. Kelin Xia and Guo-Wei Wei, “Persistent homology analysis of protein structure, flexibility andfolding”, IJNMBE, 30, 814-844 (2014)

I 2. Kelin Xia, Xin Feng, Yiying Tong and Guo-Wei Wei, “Persistent homology for the quantitativeprediction of fullerene stability”, JCC, 36, 408-422 (2015)

I 3. Kelin Xia and Guo-Wei Wei, “Multidimensional persistence in biomolecular data”, accepted, JCC(2015)

I 4. Kelin Xia and Guo-Wei Wei, “Persistent topology for cryo-EM data analysis”, accepted, IJNMBE (2015)I 5. Kelin Xia, Zhixiong Zhao and Guo-Wei Wei, “Multiresolution topological simplification”, accepted,

JCB (2015)

Acknowledgement

This work was partially supported by NSF grants DMS-1160352 andIIS-1302285, NIH Grant R01GM-090208 and MSU Center for MathematicalMolecular Biosciences initiative.

http://www.math.msu.edu/˜wei/ http://www.cse.msu.edu/˜ytong/ http://www.msu.edu/˜xiakelin/ [email protected] [email protected] [email protected]