Computational Biology: Data, computation, and visualization. Dr. Craig A. Stewart & Dr. Eric Wernert. 7 August 2003


License terms

• Please cite as: Stewart, C.A. and E. Wernert. Computational Biology: Data, computation, and visualization. 2003. Presentation. Presented at: Visualization Workshop (Arctic Region Supercomputer Center, University of Alaska Fairbanks, 7 Aug 2003). Available from: http://hdl.handle.net/2022/15219

• Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.




Outline

• A bit about biomedical data

• Computation and visualization

• The revolution in biology & IU’s response – the Indiana Genomics Initiative

• Hardware

• Some thoughts about dealing with biological and biomedical researchers in general

The revolution in biology

• Automated, high-throughput sequencing has revolutionized biology.

• Computing has been a part of this revolution in three ways so far:

– Computing has been essential to the assembly of genomes

– There is now so much biological data available that it is impossible to utilize it effectively without the aid of computers

– Networking and the Web have made biological data generally and publicly available

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

FASTA format

>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
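A FASTA record is a header line starting with ">" followed by one or more lines of sequence. The following is a minimal sketch of reading such records in Python; the function name `parse_fasta` is an assumption made for illustration and is not part of any tool mentioned in this presentation.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = ""
        elif header is not None:
            # A sequence may wrap over many lines; strip internal whitespace.
            records[header] += "".join(line.split())
    return records


# First two sequence lines of the envelope protein record shown above.
example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
"""
records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

For production work an established library would normally be preferable to a hand-written parser; this sketch only shows how little structure the format has.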

Some of the issues about this exponential growth in data stores

• WO/RN

• Comparability/replicability problems with certain types of data

• HIPAA – how do you de-identify patient data?

Indiana Genomics Initiative (INGEN)

• Created by a $105M grant from the Lilly Endowment, Inc. and launched in December 2000

• Build on traditional strengths and add new areas of research for IU

• Perform the research that will generate new treatments for human disease in the post-genomic era

• Improve human health generally and in the State of Indiana particularly

• Enhance economic growth in Indiana

Challenges for UITS and the INGEN IT Core

• Assist traditional biomedical researchers in adopting use of advanced information technology (massive data storage, visualization, and high performance computing)

• Assist bioinformatics researchers in use of advanced computing facilities

• Questions we are asked:

– Why wouldn't it be better just to buy me a newer PC?

• Questions we asked:

– What do you do now with computers that you would like to do faster?

– What would you do if computer resources were not a constraint?

So, why is this better than just buying me a new PC?

• Unique facilities provided by IT Core

– Redundant data storage

– HPC – better uniprocessor performance; trivially parallel programming, parallel programming

– Visualization in the research laboratories

• Hardcopy document – INGEN's advanced IT facilities: The least you need to know

• Outreach efforts

• Demonstration projects

Example projects

• Data integration

• fastDNAml – maximum likelihood phylogenies (http://www.indiana.edu/~rac/hpc/fastDNAml/index.html)

• PViN – Software to visualize human family trees

• 3-DIVE (3D Interactive Volume Explorer). http://www.avl.iu.edu/projects/3DIVE/

• Protein Family Annotator – collaborative development with IBM, Inc.

Data Integration

• Goal set by IU School of Medicine: Any researcher within the IU School of Medicine should be able to transparently query all relevant public external data sources and all sources internal to the IU School of Medicine to which the researcher has read privileges

• IU has more than 1 TB of biomedical data stored in massive data storage system

• There are many public data sources

• Different labs were independently downloading, subsetting, and formatting data

• Solution: IBM DiscoveryLink, DB/2 Information Integrator

A life sciences data example - Centralized Life Science Database

• Based on use of IBM DiscoveryLink(TM) and DB/2 Information Integrator(TM)

• Public data is still downloaded, parsed, and put into a database, but now the process is automated and centralized.

• Lab data and programs like BLAST are included via DL’s wrappers.

• Implemented in partnership with IBM Life Sciences via IU-IBM strategic relationship in the life sciences

• IU contributed writing of data parsers

Dot Plots

• Simple way to get a feel for how sequences compare to each other.

• Used both with DNA and Protein sequences

• http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html/

• "A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis." Erik L.L. Sonnhammer and Richard Durbin. Gene 167(2):GC1-10 (1995)
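The idea behind a dot plot can be sketched in a few lines of Python: place one sequence along each axis and mark every cell where the two sequences match. This is a minimal illustration with a fixed window and threshold, not the dynamic-threshold method used by Dotter; the names `dot_plot` and `render` are assumptions made for this example.

```python
from typing import List


def dot_plot(seq_a: str, seq_b: str, window: int = 1,
             threshold: int = 1) -> List[List[bool]]:
    """Cell [i][j] is True when the windows starting at seq_a[i] and
    seq_b[j] agree in at least `threshold` positions."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append(matches >= threshold)
        matrix.append(row)
    return matrix


def render(matrix: List[List[bool]]) -> str:
    """Draw the matrix as text: '*' for a match, '.' otherwise."""
    return "\n".join("".join("*" if cell else "." for cell in row)
                     for row in matrix)


# Comparing a sequence with itself always produces the main diagonal;
# repeats show up as additional off-diagonal marks.
print(render(dot_plot("GATTACA", "GATTACA")))
```

Raising `window` and `threshold` suppresses chance single-residue matches, which is the usual way to make longer similar regions stand out as diagonal lines.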

http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html/


http://www.dkfz-heidelberg.de/tbi/bioinfo/Pairwise/DotPlots/index.html

Protein Family Annotator

• New project

• Designed to allow federation and searching of protein family data

• ‘Visualizing’ the effect of variation in proteins is a real challenge for the biologists

Phylogenetic Inference

• Determine likely evolutionary relationships among different taxa

• NP-hard

• Very large search space

• Heuristic search required

• Problems:

– Searches that are clearly going nowhere

– Comparison of different trees
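To give a sense of how large the search space is: a standard combinatorial result states that the number of distinct unrooted, strictly bifurcating tree topologies for n labeled taxa is (2n − 5)!!, the product of the odd numbers from 1 to 2n − 5. The sketch below evaluates that formula; the function name `unrooted_tree_count` is an assumption for illustration.

```python
def unrooted_tree_count(n_taxa: int) -> int:
    """Number of unrooted bifurcating topologies for n labeled taxa,
    i.e. (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5)."""
    if n_taxa < 3:
        raise ValueError("formula applies to 3 or more taxa")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n - 5 (none when n == 3).
    for factor in range(3, 2 * n_taxa - 4, 2):
        count *= factor
    return count


for n in (5, 10, 20, 50):
    print(n, unrooted_tree_count(n))
```

For 10 taxa this already gives 2,027,025 topologies, which is why exhaustive evaluation is impractical and heuristic search is used instead.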

PViN

Gamma Knife

• Used to treat inoperable tumors

• Treatment methods currently use a standardized head model

• UITS is working with IU School of Medicine to adapt Penelope code to work with detailed model of an individual patient’s head

Tomography

• Key issue is processing of images

• Weeks to days

• Days to minutes

• Visualization techniques applied do not need to be fancy to be useful

• Starting with some simple visualizations and then moving to some very sophisticated visualizations would be tremendous

Some information about the Indiana University high performance computing environment

Networking: I-light

• Network jointly owned by Indiana University and Purdue University

• 36 fibers between Bloomington and Indianapolis (IU’s main campuses)

• 24 fibers between Indianapolis and West Lafayette (Purdue’s main campus)

• Co-location with Abilene GigaPOP

• Expansion to other universities recently funded

Massive Data Storage System

• Based on HPSS (High Performance Storage System)

• First HPSS installation with distributed movers; STK 9310 Silos in Bloomington and Indianapolis

• Automatic replication of data between Indianapolis and Bloomington, via I-light, overnight. Critical for biomedical data, which is often irreplaceable.

• 180 TB capacity with existing tapes; total capacity of 480 TB. 100 TB currently in use; 1 TB for biomedical data.

• Common File System (CFS) – disk storage ‘for the masses’

Photo: Tyagan Miller. May be reused by IU for noncommercial purposes. To license for commercial use, contact the photographer.

AVIDD (Analysis and Visualization of Instrument-Driven Data)

• Hardware components:

– Distributed Linux cluster

  • Three locations: IU Northwest, Indiana University Purdue University Indianapolis, IU Bloomington

  • 2.164 TFLOPS, 0.5 TB RAM, 10 TB Disk

  • Tuned, configured, and optimized for handling real-time data streams

– A suite of distributed visualization environments

– Massive data storage

• Usage components:

– Research by application scientists

– Research by computer scientists

– Education

Goals for AVIDD

• Create a massive, distributed facility ideally suited to managing the complete data/experimental lifecycle (acquisition to insight to archiving)

• Focused on modern instruments that produce data in digital format at high rates. Example instruments:

– Advanced Photon Source, Advanced Light Source

– Atmospheric science instruments in forest

– Gene sequencers, expression chip readers

Goals for AVIDD, Con’t

• Performance goals:

– Two researchers should be able simultaneously to analyze 1 TB data sets (along with other smaller jobs running)

– The system should be able to give (nearly) immediate attention to real-time computing tasks, while still running at high rates of overall utilization

– It should be possible to move 1 TB of data from HPSS disk cache into the cluster in ~2 hours

• Science goals:

– The distribution of 3D visualization environments in scientists’ labs should enhance the ability of scientists to spontaneously interact with their data.

– Ability to manage large data sets should no longer be an obstacle to scientific research

– AVIDD should be an effective research platform for cluster engineering R&D as well as computer science research
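The performance goal of moving 1 TB from the HPSS disk cache into the cluster in about 2 hours implies a sustained transfer rate that is easy to work out. The sketch below does the arithmetic, assuming a decimal terabyte (10^12 bytes) and exactly 2 hours; both are assumptions, since the slide gives only approximate figures.

```python
def required_rate(total_bytes: float, seconds: float) -> float:
    """Sustained throughput in bytes per second needed to move
    `total_bytes` within `seconds`."""
    return total_bytes / seconds


TB = 10**12            # assumption: decimal terabyte
two_hours = 2 * 3600   # seconds

rate = required_rate(1 * TB, two_hours)
print(f"{rate / 1e6:.0f} MB/s, {rate * 8 / 1e9:.2f} Gbit/s")
```

Under these assumptions the goal works out to roughly 139 MB/s, or a little over 1.1 Gbit/s, sustained for the whole transfer.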

John-E-Box

Invented by John N. Huffman, John C. Huffman, and Eric Wernert

Thoughts about visualization and collaboration in bioinformatics

• Do you want a one-off, or sustained improvement in productivity of scientists?

• Collaboration tools can be highly sophisticated, or pretty darn ugly

• Sometimes they must be sophisticated

• The key for collaborative technology is that the collaboration has to solve a problem (other than ‘what are we going to do in the booth this year’) and has to feel natural to the application scientist

• Many problems are as much about the theory and practice of interacting with the information

• Placing facilities in the lab is tremendously beneficial

• We should encourage researchers not to be too cost-sensitive

• Grand challenge problems are great, but there have to be facilities that facilitate a learning curve and increases in sophistication over time for the application scientist. This creates a feeder system for the high end systems!

HPC Challenge

• “Arthropods evolving all over the world”

• (sort of) computational steering

• Big problem: how do you summarize the views of LOTS of different trees?
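One common way to summarize many trees, shown here only as an illustration and not as the approach used in the HPC Challenge work, is to count how often each clade (group of taxa descending from one internal node) appears across the set. The sketch assumes rooted trees over the same taxa, written as nested Python tuples; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple


def clades(tree) -> Tuple[FrozenSet[str], List[FrozenSet[str]]]:
    """Return (leaves under this node, clades found at or below it)."""
    if isinstance(tree, str):
        return frozenset([tree]), []  # a leaf defines no clade of its own
    leaves: FrozenSet[str] = frozenset()
    found: List[FrozenSet[str]] = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def clade_support(trees) -> Dict[FrozenSet[str], float]:
    """Fraction of trees in which each non-trivial clade appears."""
    counts: Counter = Counter()
    for tree in trees:
        all_leaves, found = clades(tree)
        # Ignore the root clade, which contains every taxon in every tree.
        counts.update(c for c in set(found) if c != all_leaves)
    return {clade: n / len(trees) for clade, n in counts.items()}


trees = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
]
support = clade_support(trees)
for clade, fraction in sorted(support.items(), key=lambda kv: -kv[1]):
    print(sorted(clade), round(fraction, 2))
```

Clades with high support can then be drawn as a single consensus tree; programs such as fastDNAml produce unrooted trees, so a real analysis would count bipartitions of the taxa rather than rooted clades.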

What are some really important challenges in visualization today?

• Expression chip data

• Trees

• Multi-scale problems

Thoughts about working with biologists

Bioinformatics and Biomedical Research

• Bioinformatics, Genomics, Proteomics, ____ics will radically change understanding of biological function and the way biomedical research is done.

• Traditional biomedical researchers must take advantage of new possibilities

• Computer-oriented researchers must take advantage of the knowledge held by traditional biomedical researchers

• So why do you want to interrupt the work of my paper mill?

Anopheles gambiae

From www.sciencemag.org/feature/data/mosquito/mtm/index.html. Source Library: Centers for Disease Control. Photo Credit: Jim Gathany

So how do you find biologists with whom to collaborate?

• Chicken and egg problem?

• Or more like fishing?

• Or bank robbery?

Bank robbery

• Willie Sutton, a famous American bank robber, was asked why he robbed banks, and reportedly said “because that's where the money is.”*

• Cultivating collaborations with biologists in the short run will require:

– Active outreach

– Different expectations than we might have when working with an aerospace design firm

– Patience

• There are lots of opportunities open for HPC centers willing to take the effort to cultivate relationships with biologists and biomedical researchers. To do this, we’ll all have to spend a bit of time “going where the biologists are.”

*Unfortunately, this is an urban legend; Sutton never said this.

Acknowledgments

• This research was supported in part by the Indiana Genomics Initiative (INGEN). The Indiana Genomics Initiative (INGEN) of Indiana University is supported in part by Lilly Endowment Inc.

• This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University.

• This material is based upon work supported by the National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

Acknowledgements con’t

• UITS Research and Academic Computing Division managers: Mary Papakhian, David Hart, Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar

• Indiana Genomics Initiative Staff: Andy Arenson, Chris Garrison, Huian Li, Jagan Lakshmipathy, David Hancock

• Assistance with this presentation: John Herrin, Malinda Lingwall

• Thanks to Dr. M. Resch, Director, HLRS, for inviting me to visit HLRS

• Thanks to Dr. H. Bungartz for his hospitality, help, and for including Einführung in die Bioinformatik as an elective

• Thanks to Dr. S. Zimmer for help throughout the semester

• Further information is available at

– ingen.iu.edu

– http://www.indiana.edu/~uits/rac/

– http://www.ncsc.org/casc/paper.html

– http://www.indiana.edu/~rac/staff_papers.html
