Beth Plale, Keynote talk at 2nd Int’l LSDMA Symposium – The Challenge of Big Data in Science, Karlsruhe, Germany, Sept 2013
Big data and open access: on track for collision of cosmic proportions?
Beth Plale, PhD, MBA
Director, Data To Insight Center
School of Informatics and Computing
Indiana University
Keynote talk at 2nd Int’l LSDMA Symposium – The Challenge of Big Data in Science, Karlsruhe, Germany, Sept 2013
Open access, open cleaning, open data yields greatest degree of science advancement on grand societal questions we face
Open Access
“Data is the New Gold” Title of Opening Remarks, Neelie Kroes, VP of EU Commission responsible for Digital Agenda, Press Conference on Open Data Strategy, Dec 2011
Applied Forces
Open access initiatives by federal governments
Big Data
Applied Force Distorts Object
Open access initiatives by federal governments
Big Data
Enables societal grand challenges addressed in:
→ Climate change
→ Food security
→ New economies
→ Grows concerns about privacy of personal data
Negative form of tension (tension I)
Social pressure for privacy overwhelms and spills over to non-personal data
Chilling effect on data sharing where social phenomena involved
Exponential Growth in Data Production
Similar growth in societal expectations that large societal problems will be solved by more data
Tension II: Rapid growth in data and expectations yields impossible-to-reach success
DRIVING APPLICATIONS: LIBRARY TEXTS; URBAN SCIENCE; WIND AND WATER
Technical barriers to easing tensions but first …
HathiTrust Research Center
Text mining at scale
#HTRC #HathiTrust
→ HathiTrust is a large corpus providing opportunity for new forms of computational investigation.
→ The bigger the data, the less able we are to move it to a researcher’s desktop machine.
→ Future research on large collections will require that computation moves to the data, not vice versa.
HTRC Partners
Indiana University School of Informatics and Computing
Indiana University Libraries
University of Illinois Graduate School of Library and Information Science
University of Illinois Libraries
Brandeis University Library
University of Michigan
http://www.hathitrust.org/htrc
HTRC Non-Consumptive Research Paradigm
No action or set of actions on part of users, either acting alone or in cooperation with other users over duration of one or multiple sessions, can result in sufficient information gathered from collection of copyrighted works to reassemble pages from collection.
Definition disallows collusion between users, or accumulation of material over time.
Differentiates human researcher from proxy which is not a user. Users are human beings.
Topic modeling on author
Two topics with identical centralities but separate themes
Yearly values of a ratio between two wordlists in three different genres. 4,275 volumes. 1700-1899.
Underwood et al. Research
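As a rough illustration of the "ratio between two wordlists" plotted per year, the sketch below counts hits from two word lists in tokenized volumes and divides the totals for each year. The word lists, the tokenization, and the function name are assumptions made for this example; the slides do not describe the actual lists or method used in the research.

```python
from collections import Counter

def yearly_wordlist_ratio(volumes, list_a, list_b):
    """Return {year: hits(list_a) / hits(list_b)} over all volumes of that year.

    volumes: iterable of (year, tokens) pairs.
    Years in which list_b never occurs are skipped to avoid dividing by zero.
    """
    a, b = set(list_a), set(list_b)
    hits_a, hits_b = Counter(), Counter()
    for year, tokens in volumes:
        for tok in tokens:
            word = tok.lower()
            if word in a:
                hits_a[year] += 1
            if word in b:
                hits_b[year] += 1
    return {y: hits_a[y] / hits_b[y] for y in sorted(hits_b) if hits_b[y] > 0}

# Hypothetical toy data: two volumes from 1750, one from 1850.
volumes = [
    (1750, ["the", "heart", "felt", "number"]),
    (1750, ["measure", "heart"]),
    (1850, ["number", "measure", "felt"]),
]
ratios = yearly_wordlist_ratio(volumes, ["heart", "felt"], ["number", "measure"])
```

Because only aggregate counts leave the function, a computation of this shape fits the non-consumptive model: no page text is returned to the researcher.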
Computation moves to data
REST based Web services architecture and protocols
Registry of services and algorithms
Solr full text index
noSQL store as volume store
openID authentication
Portal front-end, programmatic access
SEASR text mining algos
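To make "computation moves to data" concrete, here is a minimal sketch of a client building a request against a Solr full-text index so that only volume identifiers, not page text, travel back to the researcher. The host, core name, and the `ocr` field are hypothetical; `q`, `fl`, `rows`, and `wt` are standard Solr query parameters. This is not the HTRC API itself.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical endpoint; the slides do not give the real service URL.
SOLR_SELECT_URL = "https://solr.example.org/solr/volumes/select"

def build_volume_query(terms, rows=10):
    """Build a Solr select URL that returns volume IDs only."""
    params = {
        # 'ocr' is an assumed name for the full-text field.
        "q": " AND ".join(f'ocr:"{t}"' for t in terms),
        "fl": "id",      # field list: identifiers only, no text
        "rows": rows,    # maximum number of hits to return
        "wt": "json",    # response format
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"

url = build_volume_query(["whale", "harpoon"], rows=5)
query = parse_qs(urlparse(url).query)
```

The returned identifiers could then be handed to an analysis service that runs next to the volume store.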
[Figure: HTRC architecture. Components shown: Portal and Blacklight front ends; programmatic access, e.g., HTRC Data API v0.1; agent framework with agent instances; WSO2 registry (services, collections, data capsule images); WSO2 Identity Server; Meandre orchestration; SEASR analytics service; task deployment; Solr index; volume store (Cassandra); page/volume tree (file system); non-consumptive data capsules; HathiTrust corpus rsync from University of Michigan; compute on NCSA local resources, Big Red II/IU Quarry, and NSF XSEDE.]
HTRC: Open Data, Open Access, Open Cleaning?
HathiTrust collection (69%) is not open data
Constrained by authors who hold copyright to the books
Computational analysis is by all accounts “fair use” under US copyright
HTRC: Open Data, Open Access, Open Cleaning?
“Open cleaning” – enhancing OCR and MARC metadata
HTRC is opening data and “cleaning” as fully as we can to make the collection useful to scholarly and scientific investigation
Wind and Water: the hydrologist’s (atmospheric) observational data dilemma
Thanks to Jerry Brotzge, PhD meteorology, Oklahoma University
* Credit/blame for title goes to Beth Plale
Atmospheric Observing Systems
Recent addition of plethora of new observing systems to national US atmosphere observing infrastructure
Improves ability to analyze current state of atmosphere, thus allowing new applications in hydrology and biology
Challenges in:
Data access; unique sensing requirements
Data quality, calibrations, and errors
Complex and non-uniform metadata
Use Case
Use observational data from 3 different radars: FAA TDWR, WSR-88D, and local X-band (CASA).
Feed data through OU-custom QA/calibration workflow.
Feed into Vflow hydrological model.
Note that Vflow is able to operate on (ingest) the “raw” reflectivity data directly. That is, it does not require the data to be turned into gridded precipitation data. Vflow is unique among hydrology models because of this ability.
Done in real time, that is, continuously ingesting data over fixed interval.
List of Issues for Flood Forecasting using Radar data

Problem | Cause | Potential Solution
Hail contamination | Assumes high rainfall rate | Use of dual-pol, QC
Bright band | Ice at mid-levels biases dBZ | Real-time QC, 2 radar beams
Ground clutter | Wind farms, blockage | Use of Neural Net, velocity
Radar attenuation | High-frequency radars | Real-time QC model, fix
Anomalous propagation | High stable environment | Use of Level 1, velocity
Velocity de-aliasing | High velocity returns | Real-time QC
Radar calibration | Poor maintenance | Post QC
Over/under estimation below beam | Radar too far from area of interest; undersampled | Improved radar sampling; additional sfc input
Poor time sampling | Radar 5-min volume sampling | Improved temporal sampling
ET under beam | Lack of surface information | Additional surface data
Spatial interpolation | Polar to Cartesian coordinates | Interpolation algorithm
Use of Reflectivity | Does not measure rain directly | Calibration against sfc data
Example Workflow
[Figure: radar processing workflow. Inputs: WSR-88D data; other radar systems (TDWR, CASA). Boxes shown: radar calibration; quality control; clutter removal; anomalous propagation (AP) removed; velocity de-aliasing; clear-air echoes removed; hail contamination removal; melting layer contamination removal; interpolation from polar to a common Cartesian grid; radar merger (across same network and multiple networks); integrate radar data with satellite, surface observations on grid; convert radar reflectivity dBZ to rainfall rate. Also labeled: undersampling; representativeness.]
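Two of the workflow steps, converting reflectivity to rainfall rate and moving from polar to Cartesian coordinates, can be sketched in a few lines. The Marshall-Palmer relation Z = 200 R^1.6 used here is a common default and an assumption of this example; the slides do not say which Z-R relation the OU workflow applies. The coordinate conversion ignores beam elevation and Earth curvature.

```python
import math

def dbz_to_rain_rate(dbz, a=200.0, b=1.6):
    """Convert reflectivity in dBZ to rain rate in mm/h using Z = a * R**b.

    The defaults a=200, b=1.6 are the Marshall-Palmer coefficients
    (an assumed choice, not taken from the slides).
    """
    z = 10.0 ** (dbz / 10.0)      # linear reflectivity factor, mm^6 m^-3
    return (z / a) ** (1.0 / b)

def polar_to_cartesian(range_km, azimuth_deg):
    """Map a radar gate (range, azimuth clockwise from north) to (x east, y north) in km.

    Flat-earth approximation at zero elevation angle.
    """
    az = math.radians(azimuth_deg)
    return range_km * math.sin(az), range_km * math.cos(az)

rate = dbz_to_rain_rate(10.0 * math.log10(200.0))   # Z = 200 gives R = 1 mm/h
x, y = polar_to_cartesian(10.0, 90.0)               # a gate 10 km due east
```

Since reflectivity does not measure rain directly, coefficients like `a` and `b` are exactly the kind of processing detail a downstream hydrologist would need recorded in metadata.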
Examine hail contamination in more detail
Level II radar data that is widely available (through LDM tool of UCAR in US) has not been “cleaned” of effects of clear-air echoes, hail, undersampling, and melting layer contamination
Hail has effect of high reflectivity readings and these high readings can be misinterpreted as high rainfall
Meteorologists can detect hail easily by eyeballing a visual plot of reflectivity intensities, so they can go back to the Level II data and remove the hail contamination
Meteorologists solve problem through trained eye, and good in-house scripts. What does poor hydrologist do?
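One simple stand-in for the meteorologist's trained eye is a reflectivity cap: values above a threshold are flagged as likely hail and clipped before rainfall is estimated. This is a minimal sketch, not the dual-pol QC named in the issues table, and the 53 dBZ threshold is an assumed value that would need tuning per radar and season.

```python
HAIL_CAP_DBZ = 53.0  # assumed threshold, not taken from the slides

def flag_and_cap_hail(dbz_values, cap=HAIL_CAP_DBZ):
    """Clip reflectivity at `cap` and report which gates were clipped.

    Returns (capped_values, hail_flags) as two lists of equal length.
    """
    capped = [min(v, cap) for v in dbz_values]
    flags = [v > cap for v in dbz_values]
    return capped, flags

capped, flags = flag_and_cap_hail([20.0, 53.0, 61.5])
```

Returning the flags alongside the capped values lets a hydrologist see where the data was altered instead of receiving a silently cleaned product.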
Meteorology/Hydrology: Open Data, Open Access, Open Cleaning?
Data is open, but how to handle cleaning?
A: Force all level II data through workflow. Hydrologist uses only processed data (i.e., gridded precipitation data).
Advantage: hides details from hydrologist
Disadvantage: black box approach reduces trust
A: Make “raw” level II data and QA workflow tasks available to hydrologist.
Advantage: hydrologist can develop high level of trust in data
Disadvantage: current metadata is not descriptive enough to capture the kinds of QA that have been applied
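The metadata gap named above can be pictured as a record that travels with each radar product and lists every quality-assurance step applied to it. The sketch below is one hypothetical shape for such a record; the class names, field names, and the example step are all assumptions, not an existing metadata standard.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class QAStep:
    name: str                       # e.g. "hail_contamination_removal"
    tool: str                       # script or service that performed the step
    parameters: dict = field(default_factory=dict)

@dataclass
class RadarProductMetadata:
    source_radar: str               # e.g. "WSR-88D"
    data_level: str                 # e.g. "II"
    qa_steps: list = field(default_factory=list)

    def add_step(self, name, tool, **parameters):
        """Append one applied QA step, keeping the order of application."""
        self.qa_steps.append(QAStep(name, tool, parameters))

    def to_json(self):
        """Serialize the record so it can be shipped with the data product."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

meta = RadarProductMetadata(source_radar="WSR-88D", data_level="II")
meta.add_step("hail_contamination_removal", "in_house_qc_script", cap_dbz=53.0)
record = json.loads(meta.to_json())
```

With a record like this, a hydrologist could receive processed data and still check which corrections were made and with what settings.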
Urban Science
* Credit/blame for title goes to Beth Plale
Tag cloud of related tweet topics #smartcityjam thanks to Jennifer Belissent, PhD
Urban Science
Harness data from disparate sources with goal of improving city life.
Fuses physical, biological, and informational sensing of the city
In-situ sensors for environment: light, temperature, pollution
Video: pedestrian and vehicular traffic
Personal sensors: Fitbit and Up wristbands
Internet sources: Twitter feeds, blogs, news articles, crowd-sourced sensing
Two examples in US
Center for Urban Science and Progress, New York University
Urban Center for Computation and Data, University of Chicago
Urban Science
Thanks to Physics Today, Sept 2013
* Credit/blame for title goes to Beth Plale
Graphic courtesy NYU Center for Urban Science and Progress
Urban science: open data, open access, open cleaning?
CUSP is cleaning its own data for integration. Is this being done in way that Chicago can use? Likely not.
Temporal streams are relatively simple to understand with even bad metadata. They are observational-physical and observational-social data sources so come with relatively known trust and attribution.
What happens when CUSP wants to integrate predictive weather forecasting model results? Weak metadata and attribution can significantly compromise accuracy of results.
Data Provenance
Work of Data To Insight Center at IU, its affiliated faculty and students
Provenance Core (W3C PROV)
Provenance for situational analysis of agent based model used in social ecological systems research
Village labor sharing for agriculture production in Africa
Provenance capture AMSR-E data processing pipeline
Advanced Microwave Scanning Radiometer (AMSR-E): sensor aboard Aqua satellite; passive microwave radiometer. Observes precipitation, sea surface temperatures, ice concentrations, snow water equivalent, surface wetness, wind speed, atmospheric cloud water, and water vapor.
NASA AMSR-E imagery ingest processing pipeline: provenance capture for anomaly detection
Dataset: D2I-AMSR-E-Provenance Dataset
Owner and Creator: Data to Insight Center
Size: 15MB
The University of Alabama in Huntsville processes data from the NASA AMSR-E instrument. The Karma project at Indiana University instrumented the ingest processing system and captured provenance for 3,890 runs for the period of September 2 - October 4, 2011. The details of the runs are in Figure III-16 below; the largest provenance graph is the monthly rain graph, which is approximately 13MB when represented as XML.
Luo, Yuan, Plale, Beth, Jensen, Scott, Cheah, You-Wei, Conover, Helen. 2012. Provenance of AMSR-E Data from the National Snow and Ice Data Center (NSIDC). OPM XML Ver. 1.1., Sep 2 - Oct 4, 2011. Bloomington, Indiana: Data to Insight Center. http://dx.doi.org/10.5967/M0F47M2D
Provenance History Layout Algorithm
Provenance of 1 month processing of NASA satellite ingest processing pipeline. Can help tracing error back to its cause. Shows relationship between daily products (each clover flower in clover leaf chain) and final monthly products at left-end.
Provenance of a seaIce daily workflow
Provenance graph compare: failed runs
Left: complete provenance of successful execution. Right: failed run, because final data product (green on left) cannot be matched.
Graph compare: dropped provenance
Left: successful execution. Right: although successful execution, shows dropped notifications in provenance capture, because all nodes except some edges in left graph cannot be matched.
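The idea behind comparing a run's provenance graph against that of a known successful execution can be sketched as a set difference over nodes and edges. Matching by exact label is an assumption of this example, as are the node names; the slides do not describe the matching algorithm actually used.

```python
def compare_provenance(reference, observed):
    """Return the parts of `reference` that have no match in `observed`.

    Each graph is a dict with "nodes" (a set of labels) and
    "edges" (a set of (source, relation, target) tuples).
    """
    return {
        "unmatched_nodes": reference["nodes"] - observed["nodes"],
        "unmatched_edges": reference["edges"] - observed["edges"],
    }

# Hypothetical traces: the second run never produced the final data product.
successful = {
    "nodes": {"raw_swath", "daily_process", "daily_seaice"},
    "edges": {
        ("daily_process", "used", "raw_swath"),
        ("daily_seaice", "wasGeneratedBy", "daily_process"),
    },
}
failed = {
    "nodes": {"raw_swath", "daily_process"},
    "edges": {("daily_process", "used", "raw_swath")},
}
diff = compare_provenance(successful, failed)
```

An unmatched final product points to a failed run, while unmatched edges between nodes that are all present would point instead to dropped capture notifications.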
Role of provenance in Open Data, Open Access, Open Cleaning
Key contribution of provenance is to data quality. We posit that quality of data provenance has 3 dimensions:
Correctness
Completeness
Relevancy
Assumption: provenance collection process is automated
Assessment is focused on correctness and completeness of captured provenance
Steps:
1) Detect ambiguities and conflicts in real and synthetic provenance traces
2) Complete portions of missing provenance traces
3) Validate provenance traces when possible
4) Score the quality of provenance traces
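A toy version of the detect-and-score steps might look like the sketch below. The two checks (edges pointing at undeclared nodes, and entities claimed by more than one generating activity) and the scoring formulas are assumptions chosen for illustration; they are not the center's actual quality model, and the relation names only borrow W3C PROV vocabulary.

```python
def score_trace(nodes, edges, inputs=frozenset()):
    """Score one provenance trace for correctness and completeness.

    nodes:  {node_id: "entity" | "activity"}
    edges:  list of (source, relation, target) tuples
    inputs: entity ids that legitimately have no generating activity
    """
    # Correctness check 1: edges that reference undeclared nodes.
    dangling = [e for e in edges if e[0] not in nodes or e[2] not in nodes]

    # Correctness check 2 (assumed rule): one generating activity per entity.
    generated_by = {}
    for src, rel, dst in edges:
        if rel == "wasGeneratedBy":
            generated_by.setdefault(src, set()).add(dst)
    conflicts = sorted(e for e, acts in generated_by.items() if len(acts) > 1)

    # Completeness: every non-input entity should have a generating activity.
    entities = sorted(n for n, kind in nodes.items() if kind == "entity")
    missing = [e for e in entities if e not in generated_by and e not in inputs]
    completeness = 1.0 - len(missing) / len(entities) if entities else 1.0

    problems = len(dangling) + len(conflicts)
    correctness = max(0.0, 1.0 - problems / max(len(edges), 1))
    return {
        "correctness": correctness,
        "completeness": completeness,
        "conflicts": conflicts,
        "missing_generation": missing,
    }

# Hypothetical trace: the monthly product has no recorded generating activity.
nodes = {"raw": "entity", "qc": "activity", "daily": "entity", "monthly": "entity"}
edges = [("daily", "wasGeneratedBy", "qc"), ("qc", "used", "raw")]
result = score_trace(nodes, edges, inputs={"raw"})
```

The `missing_generation` list marks where step 2, completing missing portions of the trace, would need to act.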
Provenance Quality Analysis Overview
G: Graph level
M-G: Multi-Graph (multiple graphs) level
N / E: Node/Edge level
Wrapping Up: Open Data, Open Cleaning, Open Access
Open data
Open cleaning
Open interfaces
Personal privacy respected
How? e.g., Creative Commons license
Who’s working on: Research Data Alliance
Stimulating new business opportunity on stable interfaces to open data
Applied Forces Come Together to Distort Object into New Space
Open access initiatives
Big Data
Fundamental advances in → Climate change, → Food security, → New economies
Research Data Alliance
Maturity in provenance and metadata
Personal data privacy, social issues of sharing
Our hosts RDA Plenary 1 Chalmers Univ, Gothenburg, Sweden
Photo courtesy Leif Laaksonen