DATA SCIENCE IN EDUCATION AND FOR DISCOVERY Kirk D. Borne School of Physics, Astronomy, & Computational Sciences George Mason University [email protected] http://classweb.gmu.edu/kborne/


DATA SCIENCE IN EDUCATION AND FOR DISCOVERY

Kirk D. Borne

School of Physics, Astronomy, & Computational Sciences

George Mason University

[email protected]

http://classweb.gmu.edu/kborne/

Abstract

I will discuss the rise of data science as a new academic and research discipline. Data-intensive opportunities are growing significantly across the spectrum of academic, government, and business enterprises. In order to respond to this data-driven digital transformation, it is imperative to train the next-generation workforce in the data-science skill areas. Among these skills are knowledge discovery and information extraction from massive data collections. I will describe some of the techniques that we are applying both in research (for scientific discovery) and in the classroom (to engage students in inquiry-driven evidence-based learning). Specific examples of surprise detection in big data will be presented.

Ever since humans began to explore the world…

… humans have asked questions and …

… have collected evidence (data) to help answer those questions.

Astronomy: the world’s second oldest profession!

Now, the Data Flood is everywhere

• Huge quantities of data are being generated, collected, and stored within all business, government, research, and personal domains.

• Two significant challenges of this Data Flood will be addressed:
  • Training the next-generation workforce to manage and expertly use these data
    • “The Rise of the Data Scientist”
  • Discovering the knowledge and surprises that are hidden within the data
    • Transforming our repositories from a data representation to a knowledge representation

• So how do we address these challenges?

• First, we must face it – i.e., the students that we train as well as knowledge workers (those who extract knowledge from data and information) must recognize the need and face the challenge …

Visualize This: A sea of Data (sea of CDs)

This is the CD Sea in Kilmington, England (600,000 CDs ~ 300 TB)

More Data is Different

• The message should be clear: “more data is not simply more data, but more data is different.”

• Numerous federal agencies (and others, of course) have addressed this, including the August 9, 2010 announcement from the OMB and White House OSTP:

• Big Data is a national challenge and a national priority, along with healthcare and national security.

• See http://www.aip.org/fyi (#87)

• International initiative by the CODATA organization to address this challenge: ADMIRE = Advanced Data Methods and Information technologies for Research and Education

• Many U.S. national study groups in the sciences have issued reports on the urgency of establishing both research and educational programs to face the Big Data challenges.

• Each of these reports has issued a call to action …

Data Science: A National Imperative

1. National Academies report: Bits of Power: Issues in Global Access to Scientific Data (1997), downloaded from http://www.nap.edu/catalog.php?record_id=5504
2. NSF (National Science Foundation) report: Knowledge Lost in Information: Research Directions for Digital Libraries (2003), downloaded from http://www.sis.pitt.edu/~dlwkshop/report.pdf
3. NSF report: Cyberinfrastructure for Environmental Research and Education (2003), downloaded from http://www.ncar.ucar.edu/cyber/cyberreport.pdf
4. NSB (National Science Board) report: Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century (2005), downloaded from http://www.nsf.gov/nsb/documents/2005/LLDDC_report.pdf
5. NSF report with the Computing Research Association: Cyberinfrastructure for Education and Learning for the Future: A Vision and Research Agenda (2005), downloaded from http://www.cra.org/reports/cyberinfrastructure.pdf
6. NSF Atkins Report: Revolutionizing Science & Engineering Through Cyberinfrastructure: Report of the NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure (2005), downloaded from http://www.nsf.gov/od/oci/reports/atkins.pdf
7. NSF report: The Role of Academic Libraries in the Digital Data Universe (2006), downloaded from http://www.arl.org/bm~doc/digdatarpt.pdf
8. National Research Council, National Academies Press report: Learning to Think Spatially (2006), downloaded from http://www.nap.edu/catalog.php?record_id=11019
9. NSF report: Cyberinfrastructure Vision for 21st Century Discovery (2007), downloaded from http://www.nsf.gov/od/oci/ci_v5.pdf
10. JISC/NSF Workshop report on Data-Driven Science & Repositories (2007), http://www.sis.pitt.edu/~repwkshop/NSF-JISC-report.pdf
11. DOE report: Visualization and Knowledge Discovery: Report from the DOE/ASCR Workshop on Visual Analysis and Data Exploration at Extreme Scale (2007), downloaded from http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/DOE-Visualization-Report-2007.pdf
12. DOE report: Mathematics for Analysis of Petascale Data Workshop Report (2008), downloaded from http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/PetascaleDataWorkshopReport.pdf
13. NSTC Interagency Working Group on Digital Data report: Harnessing the Power of Digital Data for Science and Society (2009), downloaded from http://www.nitrd.gov/about/Harnessing_Power_Web.pdf
14. National Academies report: Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age (2009), downloaded from http://www.nap.edu/catalog.php?record_id=12615
15. NSF report: Data-Enabled Science in the Mathematical and Physical Sciences (2010), http://www.cra.org/ccc/docs/reports/DES-report_final.pdf

Data Science Education: Two Perspectives

• Informatics in Education – working with data in all learning settings
  • Informatics (Data Science) enables transparent reuse and analysis of data in inquiry-based classroom learning.
  • Learning is enhanced when students work with real data and information (especially online data) that are related to the topic (any topic) being studied.
  • http://serc.carleton.edu/usingdata/ (“Using Data in the Classroom”)
  • Example: CSI The Cosmos

• An Education in Informatics – students are specifically trained:
  • … to access large distributed data repositories
  • … to conduct meaningful inquiries into the data
  • … to mine, visualize, and analyze the data
  • … to make objective data-driven inferences, discoveries, and decisions

• Numerous Data Science programs now exist at several universities (GMU, Caltech, RPI, Michigan, Cornell, U. Illinois, and more)

• http://cds.gmu.edu/ (Computational & Data Sciences @ GMU)

Data Science Education Goal

• Primary Goal: to increase students’ understanding of the role that data and information play across all disciplines, and to increase students’ ability to use the technologies and methodologies associated with data acquisition, management, search, mining, analysis, and visualization.

• Secondary goals:
  • To increase students’ abilities to use databases for inquiry
  • To increase students’ abilities to acquire, process, and explore data with the use of a computer
  • To increase students’ confidence and comfort in using data to address real-world problems (in their chosen scientific discipline, or in any endeavor)
  • To increase students’ awareness of ethical issues pertaining to data and information, including privacy, ownership, proper attribution, misuse and abuse of statistics and graphs, data falsification, and objective reasoning from data

• To demonstrate and to share the joy of discovery from data

Knowledge Discovery from Data: Many names

• Data Mining
• Machine Learning (ML)
• Exploratory Data Analysis (EDA)
• Intelligent Data Analysis (IDA)
• Data Analytics
• Predictive Analytics
• Discovery Informatics
• On-Line Analytical Processing (OLAP)
• Business Intelligence (BI)
• Business Analytics
• Customer Relationship Management (CRM)
• Target Marketing
• Cross-Selling
• Market Basket Analysis
• Credit Scoring
• Case-Based Reasoning (CBR)
• Connecting the Dots
• Intrusion Detection Systems (IDS)
• Recommendation / Personalization Systems!

Data-driven Discovery (Unsupervised Learning)

• Class Discovery – Clustering
  • Distinguish different classes of behavior or different types of objects
  • Find new classes of behavior or new types of objects
  • Describe a large data collection by a small number of condensed representations

• Principal Component Analysis – Dimension Reduction
  • Find the dominant features among all of the data attributes
  • Enables low-dimensional descriptions of events and behaviors, while revealing correlations and dependencies among parameters
  • Addresses the Curse of Dimensionality

• Outlier Detection – Surprise / Anomaly / Novelty Discovery
  • Find objects and events that are outside the bounds of our expectations
  • These could be garbage (erroneous measurements) or true discoveries
  • Used for data quality assurance and/or for discovery of new / rare / interesting data items

• Link Analysis – Association Analysis – Network Analysis
  • Identify connections between different events (or objects)
  • Find unusual (improbable) co-occurring combinations of data attribute values
  • Find data items that have much fewer than “6 degrees of separation”
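As one illustration of the outlier (surprise) detection idea on this slide, here is a minimal Python sketch. It is not the method used in the talk: it assumes a single numeric attribute and applies the common "modified z-score" convention (median and median absolute deviation, with the conventional 0.6745 scale factor and a 3.5 cutoff). The function name `mad_outliers` and the sample `brightness` values are hypothetical, chosen only for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    Assumption: one numeric attribute; 0.6745 and 3.5 are the conventional
    constants for the median/MAD-based modified z-score, not values taken
    from the slides.
    Returns a list of (index, value, score) tuples.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No measurable spread, so nothing can be scored as surprising.
        return []
    flagged = []
    for index, value in enumerate(values):
        score = 0.6745 * (value - center) / mad
        if abs(score) > threshold:
            flagged.append((index, value, score))
    return flagged


# Hypothetical measurements: nine ordinary readings and one surprise.
brightness = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1, 9.7, 10.0]
surprises = mad_outliers(brightness)

for index, value, score in surprises:
    print(f"item {index}: value={value}, modified z-score={score:.1f}")
```

A flagged item is only a candidate: as the slide notes, it may be an erroneous measurement or a true discovery, so each one still needs follow-up. The median-based score is used here because a single extreme value inflates the mean and standard deviation and can hide itself from an ordinary z-score.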

Addressing the D2K (Data-to-Knowledge) Challenge

• Complete end-to-end application of Informatics:
  • Data management, metadata management, data search, information extraction, data mining, knowledge discovery
  • All steps are necessary – skilled workforce needed to take data to knowledge
  • Applies to any discipline (not just science)

Characterize First, then Classify

• The Scientific Method does not begin with “hypothesis formulation.”
• Neither should any reasoning process jump to conclusions.
• We should teach by example: follow an evidence-based “forensics” approach.
• “Big Data” provide an excellent framework and environment for this.
• Including Data Science in our education programs, as well as in our own business practice, should lead to informed, objective, data-driven decision-making.

• Isn’t this what we expect from all of our citizens?

• Example from scientific method:
  • Step 1: Data Collection – observe, describe, characterize
  • Step 2: Hypothesis Formulation – classify, diagnose, predict

Summary

• All enterprises are being inundated with data.

• The knowledge discovery potential from these data is enormous.

• Now is the time to implement data-oriented methodologies (Informatics / Data Science) into the enterprise.

• This is especially important in training and degree programs – training the next-generation workforce to use data for knowledge discovery and decision support.

• We have before us a grand opportunity to establish dialogue and information-sharing across diverse data-intensive research and application communities.

• DATA SUMMIT 2011 has been a fantastic realization of that opportunity.