Big data and open access: a collision course for science

Big data and open access: on track for collision of cosmic proportions?

Beth Plale, PhD, MBA
Director, Data To Insight Center
School of Informatics and Computing
Indiana University

Keynote talk at 2nd Int’l LSDMA Symposium – The Challenge of Big Data in Science, Karlsruhe, Germany, Sept 2013

Open access, open cleaning, open data

yields greatest degree of science advancement on grand societal questions we face

Open Access

“Data is the New Gold” Title of Opening Remarks, Neelie Kroes, VP of EU Commission responsible for Digital Agenda, Press Conference on Open Data Strategy, Dec 2011

Applied Forces: Open access initiatives by federal governments

Big Data

Applied Force Distorts Object: Open access initiatives by federal governments

Big Data

Enables societal grand challenges to be addressed in:
→ Climate change
→ Food security
→ New economies

→ Raises concerns about privacy of personal data

Negative form of tension (tension I)

Social pressure for privacy overwhelms and spills over to non-personal data

Chilling effect on data sharing where social phenomena involved

Exponential Growth in Data Production

Similar growth in societal expectations that large societal problems will be solved by more data

Tension II: Rapid growth in data and expectations yields impossible-to-reach success

DRIVING APPLICATIONS: LIBRARY TEXTS; URBAN SCIENCE; WIND AND WATER

Technical barriers to easing tensions but first …

Hathi Trust Research Center

Text mining at scale

#HTRC #HathiTrust

→ HathiTrust is a large corpus providing opportunity for new forms of computational investigation.
→ The bigger the data, the less able we are to move it to a researcher’s desktop machine.
→ Future research on large collections will require that computation moves to the data, not vice versa.

HTRC Partners

  Indiana University School of Informatics and Computing
  Indiana University Libraries
  University of Illinois Graduate School of Library and Information Science
  University of Illinois Libraries
  Brandeis University Library
  University of Michigan

http://www.hathitrust.org/htrc


HTRC Non-Consumptive Research Paradigm

No action or set of actions on the part of users, either acting alone or in cooperation with other users over the duration of one or multiple sessions, can result in sufficient information being gathered from a collection of copyrighted works to reassemble pages from the collection.

The definition disallows collusion between users, or accumulation of material over time. It differentiates a human researcher from a proxy, which is not a user. Users are human beings.
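One way to picture the paradigm is an analysis that runs next to the data and hands back only derived aggregates. The sketch below is illustrative only: the in-memory `_VOLUMES` store, the volume ids, and the `word_counts` function are hypothetical stand-ins, not HTRC's API, and returning word counts does not by itself guarantee that pages cannot be reassembled.

```python
import re
from collections import Counter

# Hypothetical stand-in for a protected volume store (not HTRC's real store).
_VOLUMES = {
    "vol-1": "It was the best of times, it was the worst of times.",
    "vol-2": "The times were hard, and the best was yet to come.",
}


def word_counts(volume_ids, top_n=3):
    """Run beside the data; return aggregate counts only, never page text."""
    counts = Counter()
    for vid in volume_ids:
        # Tokenize inside the protected environment.
        counts.update(re.findall(r"[a-z]+", _VOLUMES[vid].lower()))
    # Only (word, count) pairs leave the function.
    return counts.most_common(top_n)


result = word_counts(["vol-1", "vol-2"])
print(result)
```

The researcher receives a short list of (word, count) pairs; the text of the volumes never crosses the function boundary.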


Topic modeling on author

Two topics with identical centralities but separate themes

Yearly values of a ratio between two wordlists in three different genres. 4,275 volumes. 1700-1899.

Underwood et al. Research

  Computation moves to data
  REST based Web services architecture and protocols
  Registry of services and algorithms
  Solr full text index
  NoSQL store as volume store
  OpenID authentication
  Portal front-end, programmatic access
  SEASR text mining algos


[Figure: HTRC architecture diagram. Labeled components: Portal and Blacklight front ends; programmatic access (e.g., HTRC Data API v0.1); agent framework with agent instances; WSO2 registry (services, collections, data capsule images); WSO2 Identity Server; Solr index; volume store (Cassandra); page/volume tree (file system); SEASR analytics service; Meandre orchestration; task deployment; non-consumptive data capsules; HathiTrust corpus rsync from University of Michigan; compute resources: NCSA local resources, Big Red II/IU Quarry, NSF XSEDE.]

HTRC: Open Data, Open Access, Open Cleaning?

  HathiTrust collection (69%) is not open data
  Constrained by authors who hold copyright to the books
  Computational analysis is by all accounts “fair use” under US copyright

HTRC: Open Data, Open Access, Open Cleaning?

  “Open cleaning” – enhancing OCR and MARC metadata
  HTRC is opening data and “cleaning” as fully as we can to make the collection useful to scholarly and scientific investigation

Wind and Water: the hydrologist’s (atmospheric) observational data dilemma

Thanks to Jerry Brotzge, PhD meteorology, Oklahoma University

* Credit/blame for title goes to Beth Plale

Atmospheric Observing Systems

Recent addition of plethora of new observing systems to national US atmosphere observing infrastructure

  Improves ability to analyze current state of atmosphere, thus allowing new applications in hydrology and biology

Challenges in:
  Data access; unique sensing requirements
  Data quality, calibrations, and errors
  Complex and non-uniform metadata

Use Case

1)  Use observational data from 3 different radars: FAA TDWR, WSR-88D, and local X-band (CASA).
2)  Feed data through OU-custom QA/calibration workflow.
3)  Feed into Vflow hydrological model.

Note that Vflow is able to operate on (ingest) the “raw” reflectivity data directly. That is, it does not require the data to be turned into gridded precipitation data. Vflow is unique among hydrology models because of this ability.

Done in real time, that is, continuously ingesting data over a fixed interval.

List of Issues for Flood Forecasting using Radar data

Problem                           | Cause                                              | Potential Solution
Hail contamination                | Assumes high rainfall rate                         | Use of dual-pol, QC
Bright band                       | Ice at mid-levels biases dBZ                       | Real-time QC, 2 radar beams
Ground clutter                    | Wind farms, blockage                               | Use of Neural Net, velocity
Radar attenuation                 | High-frequency radars                              | Real-time QC model, fix
Anomalous propagation             | High stable environment                            | Use of Level 1, velocity
Velocity de-aliasing              | High velocity returns                              | Real-time QC
Radar calibration                 | Poor maintenance                                   | Post QC
Over/under estimation below beam  | Radar too far from area of interest; undersampled  | Improved radar sampling; additional sfc input
Poor time sampling                | Radar 5-min volume sampling                        | Improved temporal sampling
ET under beam                     | Lack of surface information                        | Additional surface data
Spatial interpolation             | Polar to Cartesian coordinates                     | Interpolation algorithm
Use of Reflectivity               | Does not measure rain directly                     | Calibration against sfc data

Example Workflow

[Figure: radar data processing workflow. Inputs: WSR-88D data; other radar systems (TDWR, CASA). Quality Control boxes: radar calibration; clutter removal; anomalous propagation (AP) removed; velocity de-aliasing; clear-air echoes removed; hail contamination removal; melting layer contamination removal. Further boxes: interpolation from polar to a common Cartesian grid; radar merger (across same network and multiple networks); convert radar reflectivity dBZ to rainfall rate; integrate radar data with satellite, surface observations on grid. Issues noted: undersampling; representativeness.]

Examine hail contamination in more detail

  Level II radar data that is widely available (through the LDM tool of UCAR in the US) has not been “cleaned” of the effects of clear-air echoes, hail, undersampling, and melting layer contamination

  Hail produces high reflectivity readings, and these high readings can be misinterpreted as high rainfall

  Meteorologists can detect hail easily by eyeballing a visual plot of reflectivity intensities, so they can go back to Level II data and process it by removing hail contamination

  Meteorologists solve the problem through a trained eye and good in-house scripts. What does the poor hydrologist do?
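To see why hail inflates rainfall estimates, consider the standard power-law relation between reflectivity and rain rate, Z = a·R^b, here with the Marshall-Palmer coefficients a = 200 and b = 1.6. The sketch below is a minimal illustration, not the OU workflow: the function name is hypothetical, and the optional 53 dBZ cap is an assumed threshold of the kind used operationally to limit hail contamination.

```python
def dbz_to_rain_rate(dbz, a=200.0, b=1.6, hail_cap_dbz=None):
    """Convert reflectivity (dBZ) to rain rate (mm/h) using Z = a * R**b.

    hail_cap_dbz is an assumed, optional ceiling: readings above it are
    treated as hail-contaminated and clipped before conversion.
    """
    if hail_cap_dbz is not None:
        dbz = min(dbz, hail_cap_dbz)
    z = 10.0 ** (dbz / 10.0)        # linear reflectivity, mm^6 / m^3
    return (z / a) ** (1.0 / b)     # invert Z = a * R**b for R


r30 = dbz_to_rain_rate(30)                          # moderate rain
r60_raw = dbz_to_rain_rate(60)                      # hail-like echo, uncorrected
r60_capped = dbz_to_rain_rate(60, hail_cap_dbz=53)  # same echo, clipped

for label, rate in (("30 dBZ", r30), ("60 dBZ raw", r60_raw),
                    ("60 dBZ capped", r60_capped)):
    print(f"{label}: {rate:.1f} mm/h")
```

An uncorrected 60 dBZ echo converts to a rain rate of a couple of hundred millimetres per hour, which is the kind of value a hydrological model would take at face value unless the hail has been flagged or clipped upstream.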

Meteorology/Hydrology: Open Data, Open Access, Open Cleaning?

Data is open, but how to handle cleaning?

Option A: Force all Level II data through the workflow. Hydrologist uses only processed data (i.e., gridded precipitation data).

  Advantage: hides details from hydrologist
  Disadvantage: black box approach reduces trust

Option B: Make “raw” Level II data and QA workflow tasks available to hydrologist.

  Advantage: hydrologist can develop high level of trust in data

  Disadvantage: current metadata is not rich enough to capture the kinds of QA that have been applied

Urban Science


Tag cloud of related tweet topics #smartcityjam thanks to Jennifer Belissent, PhD

Urban Science

  Harness data from disparate sources with goal of improving city life.

  Fuses physical, biological, and informational sensing of the city

  In-situ sensors for environment: light, temperature, pollution
  Video: pedestrian and vehicular traffic
  Personal sensors: Fitbit and Up wristbands
  Internet sources: Twitter feeds, blogs, news articles, crowd-sourced sensing

  Two examples in US
  Center for Urban Science and Progress, New York University
  Urban Center for Computation and Data, University of Chicago

Urban Science

Thanks to Physics Today, Sept 2013


Graphic courtesy NYU Center for Urban Science and Progress

Urban science: open data, open access, open cleaning?

CUSP is cleaning its own data for integration. Is this being done in a way that Chicago can use? Likely not.

Temporal streams are relatively simple to understand, even with bad metadata. They are observational-physical and observational-social data sources, so they come with relatively well-known trust and attribution.

What happens when CUSP wants to integrate predictive weather forecasting model results? Weak metadata and attribution can significantly compromise accuracy of results.

Data Provenance

Work of Data To Insight Center at IU, its affiliated faculty and students

Provenance Core (W3C PROV)

Provenance for situational analysis of agent based model used in social ecological systems research

Village labor sharing for agriculture production in Africa

Provenance capture AMSR-E data processing pipeline


Advanced Microwave Scanning Radiometer (AMSR-E): sensor aboard Aqua satellite; passive microwave radiometer. Observes precipitation, sea surface temperatures, ice concentrations, snow water equivalent, surface wetness, wind speed, atmospheric cloud water, and water vapor.

NASA AMSR-E imagery ingest processing pipeline: provenance capture for anomaly detection

Dataset: D2I-AMSR-E-Provenance Dataset

Owner and Creator: Data to Insight Center
Size: 15MB

The University of Alabama in Huntsville processes data from the NASA AMSR-E instrument. The Karma project at Indiana University instrumented the ingest processing system and captured provenance for 3,890 runs for the period of September 2 - October 4 2011. The details of the runs are in Figure III-16 below; the largest provenance graph is the monthly rain graph, which is approximately 13MB when represented as XML.

Luo, Yuan, Plale, Beth, Jensen, Scott, Cheah, You-Wei, Conover, Helen. 2012. Provenance of AMSR-E Data from the National Snow and Ice Data Center (NSIDC). OPM XML Ver. 1.1., Sep 2 - Oct 4, 2011. Bloomington, Indiana: Data to Insight Center. http://dx.doi.org/10.5967/M0F47M2D


Provenance History Layout Algorithm

Provenance of 1 month of processing by the NASA satellite ingest processing pipeline. Can help trace an error back to its cause. Shows the relationship between daily products (each clover flower in the clover leaf chain) and the final monthly products at the left end.

Provenance of a seaIce daily workflow

Provenance graph compare: failed runs


Left: complete provenance of successful execution. Right: failed run, because final data product (green on left) cannot be matched.

Graph compare: dropped provenance


Left: successful execution. Right: although successful execution, shows dropped notifications in provenance capture, because all nodes except some edges in left graph cannot be matched.
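The two comparisons can be sketched as set differences against a known-good graph. This is a simplified illustration under stated assumptions, not the comparison method used in the talk: graphs are plain dicts of node ids and (source, target) edges, node ids are assumed stable across runs (real graph matching has to establish that correspondence), and the names `compare_provenance` and `classify` are hypothetical.

```python
def compare_provenance(reference, observed):
    """Return the nodes and edges of `reference` that `observed` lacks."""
    return {
        "missing_nodes": reference["nodes"] - observed["nodes"],
        "missing_edges": reference["edges"] - observed["edges"],
    }


def classify(diff, final_products):
    """Label a run from its diff against a known-good provenance graph."""
    if diff["missing_nodes"] & final_products:
        return "failed run"            # a final data product was never recorded
    if not diff["missing_nodes"] and diff["missing_edges"]:
        return "dropped provenance"    # every node matched, some edges lost
    if not diff["missing_nodes"] and not diff["missing_edges"]:
        return "match"
    return "partial"


# Toy graphs: two daily products feed one monthly product.
reference = {
    "nodes": {"ingest", "daily_1", "daily_2", "monthly"},
    "edges": {("ingest", "daily_1"), ("ingest", "daily_2"),
              ("daily_1", "monthly"), ("daily_2", "monthly")},
}
failed = {
    "nodes": {"ingest", "daily_1", "daily_2"},
    "edges": {("ingest", "daily_1"), ("ingest", "daily_2")},
}
dropped = {
    "nodes": set(reference["nodes"]),
    "edges": reference["edges"] - {("daily_2", "monthly")},
}

for name, run in (("failed", failed), ("dropped", dropped),
                  ("reference", reference)):
    print(name, "->", classify(compare_provenance(reference, run), {"monthly"}))
```

The distinction matters in practice: a missing final product points at the processing pipeline, while matched nodes with missing edges point at the provenance capture itself.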

Role of provenance in Open Data, Open Access, Open Cleaning

Key contribution of provenance is to data quality. We posit that quality of data provenance has 3 dimensions:

  Correctness
  Completeness
  Relevancy

Assumption: provenance collection process is automated.
Assessment is focused on correctness and completeness of captured provenance.

Steps:

1)  Detect ambiguities and conflicts in real and synthetic provenance traces
2)  Complete portions of missing provenance traces
3)  Validate provenance traces when possible
4)  Score the quality of provenance traces
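As a rough illustration of the scoring step, the sketch below computes two numbers for a trace. The metrics are assumptions chosen for simplicity, not the Center's scoring method: correctness is taken as the share of edges whose endpoints are both declared nodes, and completeness as the share of expected edges that were captured. The function name and the toy trace are hypothetical.

```python
def score_trace(trace, expected_edges):
    """Score a provenance trace on two assumed, simplified dimensions."""
    nodes, edges = trace["nodes"], trace["edges"]
    # Conflict check: an edge that refers to a node the trace never declared.
    dangling = {e for e in edges if e[0] not in nodes or e[1] not in nodes}
    correctness = (1.0 - len(dangling) / len(edges)) if edges else 1.0
    completeness = (len(edges & expected_edges) / len(expected_edges)
                    if expected_edges else 1.0)
    return {
        "correctness": correctness,
        "completeness": completeness,
        "dangling_edges": dangling,
    }


# Toy trace: the "monthly" node was never recorded, and one edge is absent.
trace = {
    "nodes": {"raw", "calibrated", "daily"},
    "edges": {("raw", "calibrated"), ("calibrated", "daily"),
              ("daily", "monthly")},
}
expected = {("raw", "calibrated"), ("calibrated", "daily"),
            ("daily", "monthly"), ("calibrated", "qc_log")}

scores = score_trace(trace, expected)
print(scores)
```

Relevancy, the third dimension, depends on what the consumer of the provenance needs and is left out of this sketch.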


Provenance Quality Analysis Overview

G: Graph level
M-G: Multi-Graph (multiple graphs) level
N/E: Node/Edge level


Wrapping Up: Open Data, Open Cleaning, Open Access

Open data

Open cleaning

Open interfaces

Personal privacy respected

How? e.g., Creative Commons license

Who’s working on: Research Data Alliance

Stimulating new business opportunity on stable interfaces to open data

Applied Forces Come Together to Distort Object into New Space

Open access initiatives

Big Data

Fundamental advances in → Climate change → Food security → New economies

Research Data Alliance

Maturity in provenance and metadata

Personal data privacy, social issues of sharing


Our hosts RDA Plenary 1 Chalmers Univ, Gothenburg, Sweden

Photo courtesy Leif Laaksonen

Technology
