Computational Biology: Data, computation, and visualization Dr. Craig A. Stewart stewart@iu.edu & Dr. Eric Wernert ewernert@indiana.edu 7 August 2003


Page 1: Computational Biology: Data, computation, and visualization Dr. Craig A. Stewart stewart@iu.edu & Dr. Eric Wernert ewernert@indiana.edu 7 August 2003

Computational Biology: Data, computation, and visualization

Dr. Craig A. Stewart
stewart@iu.edu

&

Dr. Eric Wernert
ewernert@indiana.edu

7 August 2003


License terms

• Please cite as: Stewart, C.A. and E. Wernert. Computational Biology: Data, computation, and visualization. 2003. Presentation. Presented at: Visualization Workshop (Arctic Region Supercomputer Center, University of Alaska Fairbanks, 7 Aug 2003). Available from: http://hdl.handle.net/2022/15219

• Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.



Outline

• A bit about biomedical data

• Computation and visualization

• The revolution in biology & IU’s response – the Indiana Genomics Initiative

• Hardware

• Some thoughts about dealing with biological and biomedical researchers in general


The revolution in biology

• Automated, high-throughput sequencing has revolutionized biology.

• Computing has been a part of this revolution in three ways so far:
  – Computing has been essential to the assembly of genomes
  – There is now so much biological data available that it is impossible to utilize it effectively without the aid of computers
  – Networking and the Web have made biological data generally and publicly available

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html


FASTA format

>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
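A FASTA record is a header line starting with ">" followed by one or more lines of sequence. The sketch below shows one way such a file might be read in Python; the function name `parse_fasta` and the variable names are illustrative assumptions, not part of the presentation, and the example uses only the first two sequence lines of the record above.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are simply concatenated
    if header is not None:
        yield header, "".join(chunks)


# Shortened copy of the record on the slide (first two sequence lines only).
example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
"""

records = list(parse_fasta(example.splitlines()))
header, sequence = records[0]
fields = header.split("|")  # the header packs several identifiers with "|"
```

The header fields are separated by "|", so splitting on that character recovers the database tags and identifiers; the sequence itself carries no line structure once the lines are joined.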


Some of the issues about this exponential growth in data stores

• WO/RN (write once, read never)

• Comparability/replicability problems with certain types of data

• HIPAA – how do you de-identify patient data?


Indiana Genomics Initiative (INGEN)

• Created by a $105M grant from the Lilly Endowment, Inc. and launched December, 2000

• Build on traditional strengths and add new areas of research for IU

• Perform the research that will generate new treatments for human disease in the post-genomic era

• Improve human health generally and in the State of Indiana particularly

• Enhance economic growth in Indiana


Challenges for UITS and the INGEN IT Core

• Assist traditional biomedical researchers in adopting use of advanced information technology (massive data storage, visualization, and high performance computing)

• Assist bioinformatics researchers in use of advanced computing facilities

• Questions we are asked:
  – Why wouldn't it be better just to buy me a newer PC?

• Questions we asked:
  – What do you do now with computers that you would like to do faster?
  – What would you do if computer resources were not a constraint?


So, why is this better than just buying me a new PC?

• Unique facilities provided by IT Core
  – Redundant data storage
  – HPC – better uniprocessor performance; trivially parallel programming, parallel programming
  – Visualization in the research laboratories

• Hardcopy document – INGEN's advanced IT facilities: The least you need to know

• Outreach efforts

• Demonstration projects
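To make "trivially parallel programming" concrete: it covers work that splits into tasks with no communication between them, such as computing the same statistic for many sequences. The sketch below is a minimal illustration under assumed names (`gc_fraction`, `map_independent`); it uses Python threads only to show the pattern, whereas on an HPC system the same structure would more likely be run as separate processes or batch jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_fraction(seq: str) -> float:
    """Fraction of G or C bases in one DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def map_independent(func, items, workers=4):
    """Apply func to every item independently; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


# Toy sequences, invented for illustration only.
sequences = ["GGCC", "ATAT", "ATGC"]
fractions = map_independent(gc_fraction, sequences)
```

Because no task depends on another, the work scales by adding workers, which is what separates this case from parallel programs that must exchange data while they run.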


Example projects

• Data integration

• fastDNAml – maximum likelihood phylogenies (http://www.indiana.edu/~rac/hpc/fastDNAml/index.html)

• PViN – Software to visualize human family trees

• 3-DIVE (3D Interactive Volume Explorer). http://www.avl.iu.edu/projects/3DIVE/

• Protein Family Annotator – collaborative development with IBM, Inc.


Data Integration

• Goal set by IU School of Medicine: Any researcher within the IU School of Medicine should be able to transparently query all relevant public external data sources and all sources internal to the IU School of Medicine to which the researcher has read privileges

• IU has more than 1 TB of biomedical data stored in massive data storage system

• There are many public data sources

• Different labs were independently downloading, subsetting, and formatting data

• Solution: IBM DiscoveryLink, DB/2 Information Integrator


A life sciences data example - Centralized Life Science Database

• Based on use of IBM DiscoveryLink(TM) and DB/2 Information Integrator(TM)

• Public data is still downloaded, parsed, and put into a database, but now the process is automated and centralized.

• Lab data and programs like BLAST are included via DL’s wrappers.

• Implemented in partnership with IBM Life Sciences via IU-IBM strategic relationship in the life sciences

• IU contributed writing of data parsers
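The wrapper idea above can be illustrated with a small sketch: each data source is hidden behind a function with a common interface, and one query is fanned out to all of them. This is not the DiscoveryLink API; the names `make_table_wrapper` and `federated_query`, the record layout, and the toy rows are all assumptions made up for illustration.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Wrapper = Callable[[str], Iterable[Record]]


def make_table_wrapper(name: str, rows: List[Record]) -> Wrapper:
    """Wrap an in-memory table so it answers lookups by gene name."""
    def lookup(gene: str) -> Iterable[Record]:
        for row in rows:
            if row.get("gene") == gene:
                # Tag each hit with the source it came from.
                yield {"source": name, **row}
    return lookup


def federated_query(gene: str, wrappers: Iterable[Wrapper]) -> List[Record]:
    """Send one query to every wrapped source and pool the answers."""
    results: List[Record] = []
    for wrapper in wrappers:
        results.extend(wrapper(gene))
    return results


# Toy, made-up rows standing in for a public database and a lab database.
public_db = make_table_wrapper("public", [{"gene": "geneA", "note": "reference entry"}])
lab_db = make_table_wrapper("lab", [
    {"gene": "geneA", "note": "local result"},
    {"gene": "geneB", "note": "other result"},
])

hits = federated_query("geneA", [public_db, lab_db])
```

The point of the design is that the researcher writes one query; adding a new source means writing one more wrapper rather than changing every lab's download-and-parse scripts.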


Dot Plots

• Simple way to get a feel for how sequences compare to each other.

• Used both with DNA and Protein sequences

• http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html/

• "A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis" Erik L.L. Sonnhammer and Richard Durbin Gene 167(2):GC1-10 (1995)
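A dot plot places one sequence along each axis and marks every position where the two agree. The sketch below is a minimal fixed-threshold version with a sliding window; it is not Dotter's dynamic-threshold method, and the function names `dot_plot` and `render` and the `window` and `threshold` parameters are illustrative assumptions.

```python
def dot_plot(seq_a: str, seq_b: str, window: int = 1, threshold: int = 1):
    """Return matrix[i][j] == True where the windows starting at
    seq_a[i] and seq_b[j] share at least `threshold` matching positions."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append(matches >= threshold)
        matrix.append(row)
    return matrix


def render(matrix) -> str:
    """Draw the matrix as text: '*' for a dot, '.' for no dot."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


# Comparing a short toy sequence with itself gives a clean main diagonal.
plot = dot_plot("GATTACA", "GATTACA", window=2, threshold=2)
picture = render(plot)
```

Similar regions show up as diagonal runs of dots; raising the window and threshold suppresses the scattered single-character matches that otherwise clutter the picture.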


http://www.dkfz-heidelberg.de/tbi/bioinfo/Pairwise/DotPlots/index.html


Protein Family Annotator

• New project

• Designed to allow federation and searching of protein family data

• ‘Visualizing’ the effect of variation in proteins is a real challenge for the biologists


Phylogenetic Inference

• Determine likely evolutionary relationships among different taxa

• NP-hard

• Very large search space

• Heuristic search required

• Problems:
  – Searches that are clearly going nowhere
  – Comparison of different trees
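To put a number on "very large search space": the count of distinct unrooted, fully bifurcating tree topologies for n taxa is the double factorial (2n-5)!!, a standard result. The short sketch below computes it; the function name `unrooted_tree_count` is an illustrative assumption.

```python
def unrooted_tree_count(n_taxa: int) -> int:
    """Number of unrooted bifurcating tree topologies for n_taxa labeled taxa,
    computed as (2n-5)!! = 3 * 5 * ... * (2n-5)."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


counts = {n: unrooted_tree_count(n) for n in (4, 5, 10)}
```

Ten taxa already give about two million topologies, and the count grows faster than exponentially, which is why exhaustive search is out of reach and heuristic search is required.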


PViN


Gamma Knife

• Used to treat inoperable tumors

• Treatment methods currently use a standardized head model

• UITS is working with IU School of Medicine to adapt Penelope code to work with detailed model of an individual patient’s head


Tomography

• Key issue is processing of images

• Weeks to days

• Days to minutes

• Visualization techniques applied do not need to be fancy to be useful

• Starting with some simple visualizations and then moving to some very sophisticated visualizations would be tremendous


Some information about the Indiana University high performance computing environment


Networking: I-light

• Network jointly owned by Indiana University and Purdue University

• 36 fibers between Bloomington and Indianapolis (IU’s main campuses)

• 24 fibers between Indianapolis and West Lafayette (Purdue’s main campus)

• Co-location with Abilene GigaPOP

• Expansion to other universities recently funded


Massive Data Storage System

• Based on HPSS (High Performance Storage System)

• First HPSS installation with distributed movers; STK 9310 Silos in Bloomington and Indianapolis

• Automatic replication of data between Indianapolis and Bloomington, via I-light, overnight. Critical for biomedical data, which is often irreplaceable.

• 180 TB capacity with existing tapes; total capacity of 480 TB. 100 TB currently in use; 1 TB for biomedical data.

• Common File System (CFS) – disk storage ‘for the masses’

Photo: Tyagan Miller. May be reused by IU for noncommercial purposes. To license for commercial use, contact the photographer.


AVIDD (Analysis and Visualization of Instrument-Driven Data)

• Hardware components:
  – Distributed Linux cluster
    • Three locations: IU Northwest, Indiana University Purdue University Indianapolis, IU Bloomington
    • 2.164 TFLOPS, 0.5 TB RAM, 10 TB Disk
    • Tuned, configured, and optimized for handling real-time data streams
  – A suite of distributed visualization environments
  – Massive data storage

• Usage components:
  – Research by application scientists
  – Research by computer scientists
  – Education


Goals for AVIDD

• Create a massive, distributed facility ideally suited to managing the complete data/experimental lifecycle (acquisition to insight to archiving)

• Focused on modern instruments that produce data in digital format at high rates. Example instruments:
  – Advanced Photon Source, Advanced Light Source
  – Atmospheric science instruments in forest
  – Gene sequencers, expression chip readers


Goals for AVIDD, cont’d

• Performance goals:
  – Two researchers should be able simultaneously to analyze 1 TB data sets (along with other smaller jobs running)
  – The system should be able to give (nearly) immediate attention to real-time computing tasks, while still running at high rates of overall utilization
  – It should be possible to move 1 TB of data from HPSS disk cache into the cluster in ~2 hours

• Science goals:
  – The distribution of 3D visualization environments in scientists’ labs should enhance the ability of scientists to spontaneously interact with their data.
  – Ability to manage large data sets should no longer be an obstacle to scientific research
  – AVIDD should be an effective research platform for cluster engineering R&D as well as computer science research
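The performance goal of moving 1 TB from HPSS disk cache into the cluster in about 2 hours implies a sustained transfer rate, which is easy to work out. The sketch below does the arithmetic, assuming decimal units (1 TB = 10^12 bytes) and exactly 2 hours; the function name `required_rate` is an illustrative assumption.

```python
TB = 10**12  # assumption: decimal terabyte


def required_rate(total_bytes: float, seconds: float) -> float:
    """Sustained rate in bytes per second needed to move total_bytes in seconds."""
    return total_bytes / seconds


rate_bytes_per_s = required_rate(1 * TB, 2 * 3600)
rate_mb_per_s = rate_bytes_per_s / 10**6        # megabytes per second
rate_gbit_per_s = rate_bytes_per_s * 8 / 10**9  # gigabits per second
```

Under these assumptions the goal works out to roughly 139 MB/s, or a little over 1.1 Gbit/s, sustained for the full two hours.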


John-E-Box

Invented by John N. Huffman, John C. Huffman, and Eric Wernert


Thoughts about visualization and collaboration in bioinformatics

• Do you want a one-off, or sustained improvement in productivity of scientists?

• Collaboration tools can be highly sophisticated, or pretty darn ugly

• Sometimes they must be sophisticated

• The key for collaborative technology is that the collaboration has to solve a problem (other than ‘what are we going to do in the booth this year’) and has to feel natural to the application scientist

• Many problems are as much about the theory and practice of interacting with the information

• Placing facilities in the lab is tremendously beneficial

• We should encourage researchers not to be too cost-sensitive

• Grand challenge problems are great, but there have to be facilities that facilitate a learning curve and increases in sophistication over time for the application scientist. This creates a feeder system for the high end systems!


HPC Challenge

• “Arthropods evolving all over the world”

• (sort of) computational steering

• Big problem: how do you summarize the views of LOTS of different trees?
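One common way to summarize many trees is to count how often each group of taxa (clade) appears across the set, which is the idea behind consensus trees. The sketch below is a simplified illustration, not the method used in the HPC Challenge entry; it treats trees as rooted nested tuples of taxon names, and the function names `clades` and `clade_support` and the toy trees are assumptions.

```python
from collections import Counter


def clades(tree, found):
    """Return the set of leaves under `tree`, recording the leaf set of
    every internal node in `found`. A leaf is a string; an internal
    node is a tuple of subtrees."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(clades(child, found) for child in tree))
    found.add(leaves)
    return leaves


def clade_support(trees):
    """Map each clade to the fraction of trees that contain it."""
    counts = Counter()
    total = 0
    for tree in trees:
        found = set()
        clades(tree, found)
        counts.update(found)
        total += 1
    return {clade: n / total for clade, n in counts.items()}


# Three toy trees over taxa A-D, invented for illustration.
trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "D"), "C"),
    ((("A", "C"), "B"), "D"),
]
support = clade_support(trees)
```

Clades with high support are the features most trees agree on; a table or picture of these frequencies is far easier to inspect than the individual trees.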


What are some really important challenges in visualization today?

• Expression chip data

• Trees

• Multi-scale problems


Thoughts about working with biologists


Bioinformatics and Biomedical Research

• Bioinformatics, Genomics, Proteomics, ____ics will radically change understanding of biological function and the way biomedical research is done.

• Traditional biomedical researchers must take advantage of new possibilities

• Computer-oriented researchers must take advantage of the knowledge held by traditional biomedical researchers

• So why do you want to interrupt the work of my paper mill?

Anopheles gambiae

From www.sciencemag.org/feature/data/mosquito/mtm/index.html. Source Library: Centers for Disease Control. Photo Credit: Jim Gathany


So how do you find biologists with whom to collaborate?

• Chicken and egg problem?

• Or more like fishing?

• Or bank robbery?


Bank robbery

• Willie Sutton, a famous American bank robber, was asked why he robbed banks, and reportedly said “because that's where the money is.”*

• Cultivating collaborations with biologists in the short run will require:
  – Active outreach
  – Different expectations than we might have when working with an aerospace design firm
  – Patience

• There are lots of opportunities open for HPC centers willing to take the effort to cultivate relationships with biologists and biomedical researchers. To do this, we’ll all have to spend a bit of time “going where the biologists are.”

*Unfortunately this is an urban legend; Sutton never said this


Acknowledgments

• This research was supported in part by the Indiana Genomics Initiative (INGEN). The Indiana Genomics Initiative (INGEN) of Indiana University is supported in part by Lilly Endowment Inc.

• This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University.

• This material is based upon work supported by the National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).


Acknowledgments, cont’d

• UITS Research and Academic Computing Division managers: Mary Papakhian, David Hart, Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar

• Indiana Genomics Initiative Staff: Andy Arenson, Chris Garrison, Huian Li, Jagan Lakshmipathy, David Hancock

• Assistance with this presentation: John Herrin, Malinda Lingwall

• Thanks to Dr. M. Resch, Director, HLRS, for inviting me to visit HLRS

• Thanks to Dr. H. Bungartz for his hospitality, help, and for including Einführung in die Bioinformatik as an elective

• Thanks to Dr. S. Zimmer for help throughout the semester


• Further information is available at
  – ingen.iu.edu
  – http://www.indiana.edu/~uits/rac/
  – http://www.ncsc.org/casc/paper.html
  – http://www.indiana.edu/~rac/staff_papers.html