visit us at: www.ischool.drexel.edu Interactive Visualization Systems and Data Integration Methods for Supporting Discovery in Collections of Scientific Information A Thesis Submitted to the Faculty of Drexel University by Donald Anthony Pellegrino Jr. in partial fulfillment of the requirements for the degree of Doctor of Philosophy May 2011

Interactive Visualization Systems and Data Integration Methods for Supporting Discovery in Collections of Scientific Information

Slides from Don Pellegrino's Dissertation Defense.

Page 1: Interactive Visualization Systems and Data Integration Methods for Supporting Discovery in Collections of Scientific Information

Visit us at: www.ischool.drexel.edu

Interactive Visualization Systems and Data Integration Methods for Supporting Discovery in Collections of Scientific Information

A Thesis Submitted to the Faculty of Drexel University by Donald Anthony Pellegrino Jr. in partial fulfillment of the requirements for the degree of Doctor of Philosophy

May 2011

Page 2

Committee

Chaomei Chen (Chair)

Robert Allen (IST)

Xia Lin (IST)

Jean-Claude Bradley (Chemistry)

Longjian Liu (Epidemiology and Biostatistics)

Page 3

Problem
• Technological developments enable sharing and reuse of scientific information.
• Current indexing methods support query-based search and filtering; however, they do not support overviews and exploration.
• Due to these limitations of existing indexing methods, it is challenging to discover records and connections that relate information in new and potentially insightful ways.

Solution
• New Indexing Methods
• Instantiation of graph structures from real-world, real-scale scientific collections.
• Interactive visual exploration of structure.
• Quantitative and semantic guidance for exploration of the graph.
• Demonstrate feasibility of new methods for finding novel and significant connections and records in the collections.
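As a rough illustration of what "instantiation of graph structures" from a collection can look like, the sketch below links records that share a field value. The record identifiers, field names, and linking rule are all assumptions made for this example; the slides do not specify how the thesis builds its graphs.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(records, link_fields):
    """Return an undirected adjacency map linking records that share a field value."""
    # Group record ids by each (field name, value) pair they carry.
    by_value = defaultdict(set)
    for rec_id, fields in records.items():
        for name in link_fields:
            for value in fields.get(name, []):
                by_value[(name, value)].add(rec_id)

    # Every record becomes a node, even if it ends up with no edges.
    graph = {rec_id: set() for rec_id in records}
    for members in by_value.values():
        for a, b in combinations(sorted(members), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical records: identifiers and field values are placeholders.
records = {
    "rec1": {"author": ["A"], "compound": ["X"]},
    "rec2": {"author": ["B"], "compound": ["X"]},
    "rec3": {"author": ["B"], "compound": ["Y"]},
    "rec4": {"author": ["C"], "compound": ["Z"]},
}
graph = build_graph(records, link_fields=["author", "compound"])
```

Here rec1 and rec2 are linked through a shared compound, rec2 and rec3 through a shared author, and rec4 stays isolated, which is the kind of structure a query-based search would not surface directly.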

Page 4

“Another key is addressing the volume of information – a veritable tsunami – and the need for tools. In short, the totality of information far exceeds the ability of any organization to effectively and completely analyze and render judgments. And there are several aspects to this issue. One is that textual information must be captured and must be retrievable. Another is that the textual information or structured data quickly outstrips the working capability of the mind to retain and this analyze. Yet another is the necessity to integrate that unstructured text information with structured data. These issues present a critical requirement: analytical software (tools) to work on the problems of entity and relationship extraction from texts as well as the analysis of the resulting data (e.g., the discovery of trends or links that are quite simply not obvious to the human analyst)(Strickland, 2005, p.164, emphasis added).”

Strickland, L. S. (2005). Knowledge Transfer: Information Science Shapes Intelligence in the Cold War Era. In R. V. Williams & B.-A. Lipetz (Eds.), Covert and Overt: Recollecting and Connecting Intelligence Service and Information Science (pp. 147-166). Medford, NJ: Information Today Inc.

Page 5

Søndergaard, T. F., Andersen, J., & Hjørland, B. (2003). Documents and the communication of scientific and scholarly information: Revising and updating the UNISIST model. Journal of Documentation, 59(3), 278-320.

[Figure panels: 1971 Model, 2003 Model]

Page 6

Theme 1: Advancements in technology can lead to increases in the volume and/or type of artifacts that need to be discoverable.

“Technology has a profound effect on how scientists can communicate with each other. This affects how quickly science can progress and what kinds of collaboration are possible (Bradley, Lang, Koch, & Neylon, 2011, p.426).”

Recent Technological Advancements: Cloud Computing, Cyberinfrastructure, Big Data, eScience, Data Driven Science, Open Notebook Science – i.e., More Data

Bradley, J.-C., Lang, A. S. I. D., Koch, S., & Neylon, C. (2011). Collaboration Using Open Notebook Science in Academia. In S. Ekins, M. A. Z. Hupcey & A. J. Williams (Eds.), Collaborative Computational Technologies for Biomedical Research (pp. 425-452): John Wiley & Sons, Inc.

Gantz, J. F., Chute, C., Manfrediz, A., Minton, S., Reinsel, D., Schlichting, W., & Toncheva, A. (2008). The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth Through 2011: IDC.

IDC predicted, “… in 2011, the amount of digital information produced in the year should equal nearly 1,800 exabytes, or 10 times that produced in 2006. The compound annual growth rate between now [2008] and 2011 is expected to be almost 60% (Gantz et al., 2008).”
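The two figures in the IDC quote can be checked against each other with a short calculation. This is only a consistency sketch: it assumes uniform growth across the five years from 2006 to 2011, whereas the quoted rate of "almost 60%" refers to 2008 through 2011.

```python
def compound_annual_growth_rate(growth_factor, years):
    """Annual rate that compounds to growth_factor over the given number of years."""
    return growth_factor ** (1.0 / years) - 1.0


# Tenfold growth between 2006 and 2011, as stated in the quote.
rate = compound_annual_growth_rate(growth_factor=10.0, years=5)
print(f"{rate:.1%}")  # prints 58.5%
```

A tenfold increase over five years implies roughly 58.5% per year, which is in line with the quoted "almost 60%".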

Page 7

Theme 2: The introduction of new kinds of artifacts and increases in volume lead to advancements in the methods used for indexing.

“One of the most serious problems confronting science at the present time is the difficulty in keeping abreast of all the research that is being done and in bringing the published results into some workable order. If the results of research are buried or lost for some reason or other, the research, and the money spent on it, is entirely wasted. To prevent such a loss we need adequate guides to the vast amount of scientific literature and must make intelligent and effective use of them. … It is becoming increasingly difficult for our indexes and abstract journals to keep up with the growing number of medical publications and with articles of medical importance in other scientific journals. … The aspect of the problem which is our immediate concern today and which is particularly important to the Army Medical Library is that of the role of indexes in meeting the needs of the present and of the future (Larkey, 1949).”

Larkey, S. V. (1949). The Army Medical Library Research Project at the Welch Medical Library. Bulletin of the Medical Library Association, 37(2), 121-124.

Page 8

Preliminary Study

VAST Challenge 2008

Page 9

Custom Improvise visualization developed by Chris Weaver and the NEVAC team for analysis of the wiki collection.

Page 10

Custom Improvise visualization developed by Chris Weaver and the NEVAC team for analysis of the coast guard intercept collection.

Page 11

Custom Improvise visualization developed by Chris Weaver and the NEVAC team for analysis of the cell phone call collection.

Page 12

Custom Improvise visualization developed by Chris Weaver and the NEVAC team for analysis of the RFID movement collection.

Page 13

All of the mini-challenge data collections were loaded into a single Maple worksheet. (Pellegrino, Chen, et al., 2008, Figure 1)

Pellegrino, D., Chen, C., MacEachren, A., Mitra, P., Pan, C.-C., Robinson, A., . . . Weaver, C. (2008). North-East Visualization and Analytics Center (NEVAC) Team Entry VAST Challenge Portal: National Institute of Standards and Technology.

Page 14

“Modeling the evacuation mini-challenge hypotheses in an associative network (Pellegrino, Chen, et al., 2008, Figure 7).”


Page 15

Graph representation of data and hypotheses (Pellegrino, Chen, et al., 2008, Figure 8).


Page 16

“Path from RFID 21 to RFID 62 (Pellegrino, Chen, et al., 2008, Figure 10).”


Page 17

“k-Neighbors within 4 of RFID 56 (Pellegrino, Chen, et al., 2008, Figure 11).”
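The two graph queries named in these figure captions, a path between two nodes and the neighbors within k hops of a node, can be sketched with breadth-first search. The example is illustrative only: the toy graph and its node labels are placeholders rather than VAST Challenge data, and the original analysis was carried out in a Maple worksheet, not in this code.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Breadth-first search; returns a list of nodes, or None if target is unreachable."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the predecessor links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def k_neighbors(graph, source, k):
    """Nodes reachable from source in at most k hops, excluding source itself."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # do not expand beyond k hops
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return {n for n in distance if n != source}


# Toy undirected graph; labels are placeholders.
graph = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
    "f": set(),
}
path = shortest_path(graph, "a", "d")
near = k_neighbors(graph, "a", 2)
```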


Page 18

Limitations
• Synthetic data.
• Only tested in one domain.
• Significant manual effort required.

Page 19

Scale-Up and Scale-Out

Influenza Protein Sequence Mapping Study

Page 20

Study Objectives
• Real-world data.
• New domain.
• Reduce manual effort – create a tool.

Page 27

MOVIE: Temporal Analysis

Page 29

Lessons Learned
• Real-world data.
• Suitable domain.
• Prototype tool developed.
• Method provides an overview which cannot be achieved using other tools.
• Method provides insight into macroscopic temporal characteristics of the collection.
• Method provides means for exploring specific records.

Page 30

Scale-Out and Evaluate.

Open Notebook Science Study

Page 31

Study Objectives
• Real-world data.
• New domain.
• Find a hidden ground truth – variation of the VAST evaluation model.

Page 32

UsefulChem Experiment 262 Notebook Entry by Evan Curtin.

Curtin, E., “Exp262,” [Online]. Available: http://usefulchem.wikispaces.com/Exp262, Retrieved 20 April 2011.

Page 33

Inventory and model some of the core UsefulChem and Open Notebook Science data.

Page 34

Objective

To synthesize the precursor diamide to be used subsequently in the pictet spengler reaction affording praziquantel.

Conclusion

After two days of reaction time, it is not clear if a Ugi product is formed. Owing to the small scale on which this reaction was carried out (total volume <175uL), and the minuscule amount of precipitate obtained, further work-up seems impractical.

Experiment aborted.

Page 35

Overview Graph.

Page 36

A disconnected Khalid Mirza - Marshall Moritz cluster.

Page 37

A disconnected Dustin Sprouse cluster.

Page 38

A Sebastian Petrik cluster.

Page 39

David Bulger cluster.

Page 40

Khalid Mirza - Aneh cluster.

Page 41

Marshall Moritz cluster.

Page 42

James Giammarco - Jessica Colditz and David Bulger - Khalid Mirza connections group.

Page 43

Michael Wolfle cluster.
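The "disconnected" clusters shown on the preceding slides correspond to connected components of the graph. A minimal sketch of finding them follows. It assumes a bipartite graph of researchers and experiments, which is a guess at the structure, and the node names are hypothetical placeholders rather than UsefulChem data.

```python
def connected_components(graph):
    """Return the connected components of an undirected graph, largest first."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        stack = [start]
        seen.add(start)
        members = set()
        while stack:
            node = stack.pop()
            members.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(members)
    return sorted(components, key=len, reverse=True)


# Hypothetical researcher and experiment nodes; every edge is listed in both directions.
graph = {
    "researcher_1": {"exp_a", "exp_b"},
    "researcher_2": {"exp_b"},
    "researcher_3": {"exp_c"},
    "exp_a": {"researcher_1"},
    "exp_b": {"researcher_1", "researcher_2"},
    "exp_c": {"researcher_3"},
}
components = connected_components(graph)
```

In this toy data, researcher_3 and exp_c form a cluster with no link to the rest of the collection, analogous to the disconnected clusters in the overview graph.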

Page 45

“We just tried this exact reaction 2 weeks ago :) http://usefulchem.wikispaces.com/Exp258 [JCB]”

Page 46

Lessons Learned
• Real-world data.
• New domain.
• Find a hidden ground truth – variation of the VAST evaluation model.
• Extensive opportunity for future work.
• Social component is key.

Page 47

Systematize and Evaluate

Pfizer Drug Discovery Study

Page 48

Study Objectives
• Real-world data.
• New domain.
• Explore use of quantitative measures to guide exploration.

Page 49

Timeline view.

Page 50

Coordinated views of clusters and the timeline.

Page 51

Screenshot of in-degree view.

Page 52

Screenshot of out-degree view.

Page 53

Screenshot of betweenness view.

Page 54

Lessons Learned
• Real-world data.
• New domain.
• Explore use of quantitative measures to guide exploration.
• Indegree and outdegree can be useful for design meetings.
• Betweenness did not appear to add value.
• May be particularly useful for researchers who are not yet familiar with a collection.
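The three quantitative measures named above (indegree, outdegree, and betweenness) can be computed as in the following sketch. It assumes a directed, unweighted graph given as a list of unique edges, and it uses Brandes' algorithm for unnormalized betweenness. The slides do not say which implementation the study used or what the edges represented, and the edge list here is a placeholder.

```python
from collections import deque


def degree_counts(edges):
    """Return (indegree, outdegree) dictionaries for a list of directed edges."""
    indegree, outdegree = {}, {}
    for source, target in edges:
        for node in (source, target):
            indegree.setdefault(node, 0)
            outdegree.setdefault(node, 0)
        outdegree[source] += 1
        indegree[target] += 1
    return indegree, outdegree


def betweenness(edges):
    """Unnormalized betweenness centrality (Brandes) for a directed, unweighted graph."""
    successors = {}
    for source, target in edges:
        successors.setdefault(source, set()).add(target)
        successors.setdefault(target, set())
    score = dict.fromkeys(successors, 0.0)
    for s in successors:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in successors}
        sigma = dict.fromkeys(successors, 0)
        dist = dict.fromkeys(successors, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in successors[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back toward s.
        delta = dict.fromkeys(successors, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Placeholder directed edges; node names are hypothetical.
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
indegree, outdegree = degree_counts(edges)
scores = betweenness(edges)
```

In this toy graph, node c has the highest indegree (2) and also the highest betweenness (4.0), since every shortest path from a or b to d or e passes through it.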

Page 55

“These issues present a critical requirement: analytical software (tools) to work on the problems of entity and relationship extraction from texts as well as the analysis of the resulting data (e.g., the discovery of trends or links that are quite simply not obvious to the human analyst)(Strickland, 2005, p.164, emphasis added).”

Conclusions
• Influenza Study yielded the identification of both macroscopic trends and specific records that were not readily identifiable using a search and filter modality.
• Open Notebook Science Study yielded a structure which may have improved the likelihood that a critical link (Ugi reaction for Praziquantel intermediate) would be discovered.
• Pfizer Study demonstrated the potential utility of indegree for systematic identification of key compounds.
