Big data and open access: on track for collision of cosmic proportions? Beth Plale, PhD, MBA, Director, Data To Insight Center, School of Informatics and Computing, Indiana University. Keynote talk at 2nd Int’l LSDMA Symposium – The Challenge of Big Data in Science, Karlsruhe, Germany, Sept 2013

Big data and open access: a collision course for science


Beth Plale, Keynote talk at 2nd Int’l LSDMA Symposium – The Challenge of Big Data in Science, Karlsruhe, Germany, Sept 2013


Page 1: Big data and open access: a collision course for science

Big data and open access: on track for collision of cosmic proportions?

Beth Plale, PhD, MBA, Director, Data To Insight Center, School of Informatics and Computing, Indiana University

Keynote talk at 2nd Int’l LSDMA Symposium – The Challenge of Big Data in Science, Karlsruhe, Germany, Sept 2013

Page 2: Big data and open access: a collision course for science

Open access, open cleaning, open data

yields greatest degree of science advancement on grand societal questions we face

Page 3: Big data and open access: a collision course for science

Open Access

“Data is the New Gold”
Title of Opening Remarks, Neelie Kroes, VP of EU Commission responsible for Digital Agenda, Press Conference on Open Data Strategy, Dec 2011

Page 4: Big data and open access: a collision course for science

Applied Forces

Open access initiatives by federal governments

Big Data

Page 5: Big data and open access: a collision course for science

Applied Force Distorts Object

Open access initiatives by federal governments

Big Data

Enables societal grand challenges addressed in:
→ Climate change
→ Food security
→ New economies

→ Grows concerns about privacy of personal data

Page 6: Big data and open access: a collision course for science

Negative form of tension (tension I)

Social pressure for privacy overwhelms and spills over to non-personal data

Chilling effect on data sharing where social phenomena are involved

Page 7: Big data and open access: a collision course for science

Exponential Growth in Data Production

Page 8: Big data and open access: a collision course for science

Similar growth in societal expectations that large societal problems will be solved by more data

Page 9: Big data and open access: a collision course for science

Tension II: Rapid growth in data and expectations yields impossible-to-reach success

Page 10: Big data and open access: a collision course for science

DRIVING APPLICATIONS: LIBRARY TEXTS; URBAN SCIENCE; WIND AND WATER

Technical barriers to easing tensions but first …

Page 11: Big data and open access: a collision course for science

Hathi Trust Research Center

Text mining at scale

#HTRC #HathiTrust

Page 12: Big data and open access: a collision course for science

→ HathiTrust is a large corpus providing opportunity for new forms of computational investigation.
→ The bigger the data, the less able we are to move it to a researcher’s desktop machine.
→ Future research on large collections will require that computation moves to the data, not vice versa.

Page 13: Big data and open access: a collision course for science

HTRC Partners

• Indiana University School of Informatics and Computing
• Indiana University Libraries
• University of Illinois Graduate School of Library and Information Science
• University of Illinois Libraries
• Brandeis University Library
• University of Michigan

http://www.hathitrust.org/htrc


Page 14: Big data and open access: a collision course for science

HTRC Non-Consumptive Research Paradigm

No action or set of actions on part of users, either acting alone or in cooperation with other users over duration of one or multiple sessions can result in sufficient information gathered from collection of copyrighted works to reassemble pages from collection.

Definition disallows collusion between users, or accumulation of material over time. Differentiates human researcher from proxy which is not a user. Users are human beings.


Page 15: Big data and open access: a collision course for science

Topic modeling on author

Two topics with identical centralities but separate themes

Page 16: Big data and open access: a collision course for science

Yearly values of a ratio between two wordlists in three different genres. 4,275 volumes. 1700-1899.

Underwood et al. Research
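The yearly ratio between two wordlists can be sketched as follows. This is a minimal illustration only: the word lists, the toy volumes, and the function name `yearly_wordlist_ratio` are assumptions made for the example and are not the lists or code used in the research cited on this slide.

```python
from collections import Counter, defaultdict


def yearly_wordlist_ratio(volumes, list_a, list_b):
    """Return {year: occurrences of list_a words / occurrences of list_b words}.

    `volumes` is an iterable of (year, text) pairs. Years in which no
    list_b word occurs are skipped to avoid division by zero.
    """
    set_a = {w.lower() for w in list_a}
    set_b = {w.lower() for w in list_b}
    hits_a = defaultdict(int)
    hits_b = defaultdict(int)
    for year, text in volumes:
        counts = Counter(text.lower().split())
        hits_a[year] += sum(counts[w] for w in set_a)
        hits_b[year] += sum(counts[w] for w in set_b)
    return {y: hits_a[y] / hits_b[y] for y in sorted(hits_b) if hits_b[y] > 0}


# Toy data, invented for illustration only.
toy_volumes = [
    (1750, "the ancient house stood by the old road"),
    (1750, "a modern city with modern manners"),
    (1850, "the old mill and the old bridge"),
]
ratios = yearly_wordlist_ratio(toy_volumes, ["old", "ancient"], ["modern", "the"])
```

Because only aggregate counts leave the function, a computation of this shape is compatible in spirit with the non-consumptive paradigm: the text itself never has to reach the researcher.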

Page 17: Big data and open access: a collision course for science

• Computation moves to data
• REST based Web services architecture and protocols
• Registry of services and algorithms
• Solr full text index
• noSQL store as volume store
• openID authentication
• Portal front-end, programmatic access
• SEASR text mining algos

Page 18: Big data and open access: a collision course for science

[Architecture diagram. Components shown:]
• Portal; Blacklight; programmatic access, e.g., HTRC Data API v0.1
• WSO2 Identity Server
• Agent framework; agent instances
• WSO2 registry: services, collections, data capsule images
• Meandre Orchestration; SEASR analytics service; task deployment
• Solr index
• Volume store (Cassandra); page/volume tree (file system)
• Non-consumptive data capsules
• HathiTrust corpus rsync; University of Michigan
• NCSA local resources; Big Red II/IU Quarry; NSF XSEDE

Page 19: Big data and open access: a collision course for science

HTRC: Open Data, Open Access, Open Cleaning?

• HathiTrust collection (69%) is not open data
• Constrained by authors who hold copyright to the books
• Computational analysis is by all accounts “fair use” under US copyright

Page 20: Big data and open access: a collision course for science

HTRC: Open Data, Open Access, Open Cleaning?

• “Open cleaning” – enhancing OCR and MARC metadata
• HTRC is opening data and “cleaning” as fully as we can to make the collection useful to scholarly and scientific investigation

Page 21: Big data and open access: a collision course for science

Wind and Water: the hydrologist’s (atmospheric) observational data dilemma

Thanks to Jerry Brotzge, PhD meteorology, Oklahoma University

* Credit/blame for title goes to Beth Plale

Page 22: Big data and open access: a collision course for science

Atmospheric Observing Systems

Recent addition of plethora of new observing systems to national US atmosphere observing infrastructure

• Improves ability to analyze current state of atmosphere, thus allowing new applications in hydrology and biology

Challenges in:
• Data access; unique sensing requirements
• Data quality, calibrations, and errors
• Complex and non-uniform metadata

Page 23: Big data and open access: a collision course for science

Use Case

• Use observational data from 3 different radars: FAA TDWR, WSR-88D, and local X-band (CASA).
• Feed data through OU-custom QA/calibration workflow.
• Feed into Vflow hydrological model. Note that Vflow is able to operate on (ingest) the “raw” reflectivity data directly. That is, it does not require the data to be turned into gridded precipitation data. Vflow is unique among hydrology models because of this ability.
• Done in real time, that is, continuously ingesting data over a fixed interval.

Page 24: Big data and open access: a collision course for science

List of Issues for Flood Forecasting using Radar data

Problem | Cause | Potential Solution
Hail contamination | Assumes high rainfall rate | Use of dual-pol, QC
Bright band | Ice at mid-levels biases dBZ | Real-time QC, 2 radar beams
Ground clutter | Wind farms, blockage | Use of Neural Net, velocity
Radar attenuation | High-frequency radars | Real-time QC model, fix
Anomalous propagation | High stable environment | Use of Level 1, velocity
Velocity de-aliasing | High velocity returns | Real-time QC
Radar calibration | Poor maintenance | Post QC
Over/under estimation below beam | Radar too far from area of interest; undersampled | Improved radar sampling; additional sfc input
Poor time sampling | Radar 5-min volume sampling | Improved temporal sampling
ET under beam | Lack of surface information | Additional surface data
Spatial interpolation | Polar to Cartesian coordinates | Interpolation algorithm
Use of Reflectivity | Does not measure rain directly | Calibration against sfc data

Page 25: Big data and open access: a collision course for science

Example Workflow

[Workflow diagram. Boxes shown:]
• WSR-88D data
• Other radar systems (TDWR, CASA)
• Radar calibration
• Quality Control
• Clutter removal
• Anomalous propagation (AP) removed
• Velocity de-aliasing
• Clear-air echoes removed
• Hail contamination removal
• Melting layer contamination removal
• Interpolation from polar to a common Cartesian grid
• Radar merger (across same network and multiple networks)
• Convert radar reflectivity dBZ to rainfall rate
• Integrate radar data with satellite, surface observations on grid
• Undersampling
• Representativeness
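The step that interpolates from polar to a common Cartesian grid can be illustrated with a deliberately simple sketch. It converts each radar gate from (range, azimuth) to east/north offsets and bins gates into square cells, keeping the maximum reflectivity per cell. The function names and the max-per-cell rule are assumptions for illustration; the slides do not say which interpolation algorithm the workflow uses, and beam elevation and earth curvature are ignored here.

```python
import math


def polar_to_cartesian(range_km, azimuth_deg):
    """Convert a radar gate at (range, azimuth) to (x_east, y_north) in km.

    Azimuth is measured clockwise from north, the usual radar convention.
    """
    az = math.radians(azimuth_deg)
    return range_km * math.sin(az), range_km * math.cos(az)


def grid_max(gates, cell_km):
    """Bin (range_km, azimuth_deg, dbz) gates onto a square grid.

    Each cell keeps the maximum reflectivity of the gates that fall in it.
    """
    grid = {}
    for range_km, azimuth_deg, dbz in gates:
        x, y = polar_to_cartesian(range_km, azimuth_deg)
        cell = (math.floor(x / cell_km), math.floor(y / cell_km))
        grid[cell] = max(dbz, grid.get(cell, dbz))
    return grid
```

A production workflow would more likely weight several neighbouring gates per cell; the point of the sketch is only that the coordinate change itself is a choice that should be recorded alongside the data.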

Page 26: Big data and open access: a collision course for science

Examine hail contamination in more detail

• Level II radar data that is widely available (through LDM tool of UCAR in US) has not been “cleaned” of effects of clear-air echoes, hail, undersampling, and melting layer contamination

• Hail produces high reflectivity readings, and these high readings can be misinterpreted as high rainfall

• Meteorologists can detect hail easily by eyeballing a visual plot of reflectivity intensities, so they can go back to Level II data and process it by removing hail contamination

• Meteorologists solve the problem through a trained eye and good in-house scripts. What does the poor hydrologist do?
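Why hail reads as extreme rainfall can be made concrete with a reflectivity-to-rain-rate conversion. The sketch below assumes the widely used Marshall-Palmer relation Z = 200 R^1.6 and a simple reflectivity cap as the hail fix; the slides name neither a Z-R relation nor a cap value, so both the coefficients and the 53 dBZ default are illustrative assumptions, as are the function names.

```python
def dbz_to_rain_rate(dbz, a=200.0, b=1.6):
    """Rain rate in mm/h from reflectivity in dBZ, assuming Z = a * R**b."""
    z = 10.0 ** (dbz / 10.0)  # linear reflectivity factor, mm^6 / m^3
    return (z / a) ** (1.0 / b)


def rain_rate_with_hail_cap(dbz, cap_dbz=53.0):
    """Clamp reflectivity before conversion so hail cores do not
    translate into implausibly high rain rates (cap value is an assumption)."""
    return dbz_to_rain_rate(min(dbz, cap_dbz))
```

Under these assumptions a 60 dBZ hail core converts to roughly 200 mm/h uncapped, several times the capped value, which is the kind of overestimate a hydrology model would ingest if the Level II data were used without cleaning.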

Page 27: Big data and open access: a collision course for science

Meteorology/Hydrology: Open Data, Open Access, Open Cleaning?

Data is open, but how to handle cleaning?

A: Force all Level II data through workflow. Hydrologist uses only processed data (i.e., gridded precipitation data).
• Advantage: hides details from hydrologist
• Disadvantage: black box approach reduces trust

A: Make “raw” Level II data and QA workflow tasks available to hydrologist.
• Advantage: hydrologist can develop high level of trust in data
• Disadvantage: current metadata not sufficiently described to capture the kinds of QA that have been applied
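One way to picture metadata that does capture the cleaning applied is a record that travels with the data product and lists each quality-assurance step. The sketch below is hypothetical throughout: the class names, field names, and the tool name `hypothetical_qc.py` are invented for illustration and do not correspond to an existing metadata standard or to the OU workflow.

```python
from dataclasses import dataclass, field


@dataclass
class QAStep:
    name: str                      # e.g. "hail_contamination_removal"
    tool: str                      # script or service that performed the step
    parameters: dict = field(default_factory=dict)


@dataclass
class RadarProduct:
    source: str                    # e.g. "WSR-88D"
    level: str                     # e.g. "II"
    qa_steps: list = field(default_factory=list)

    def apply(self, step):
        """Record that a QA step has been applied to this product."""
        self.qa_steps.append(step)
        return self

    def has_been_cleaned_of(self, name):
        """True if a step with this name is recorded."""
        return any(s.name == name for s in self.qa_steps)


product = RadarProduct(source="WSR-88D", level="II")
product.apply(QAStep("hail_contamination_removal",
                     tool="hypothetical_qc.py",
                     parameters={"cap_dbz": 53.0}))
```

With such a record a hydrologist could ask `product.has_been_cleaned_of("clutter_removal")` before trusting the data, rather than relying on a black box.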

Page 28: Big data and open access: a collision course for science

Urban Science

* Credit/blame for title goes to Beth Plale

Tag cloud of related tweet topics #smartcityjam thanks to Jennifer Belissent, PhD

Page 29: Big data and open access: a collision course for science

Urban Science

• Harness data from disparate sources with goal of improving city life.
• Fuses physical, biological, and informational sensing of the city
  • In-situ sensors for environment: light, temperature, pollution
  • Video: pedestrian and vehicular traffic
  • Personal sensors: Fitbit and Up wristbands
  • Internet sources: Twitter feeds, blogs, news articles, crowd-sourced sensing
• Two examples in US
  • Center for Urban Science and Progress, New York University
  • Urban Center for Computation and Data, University of Chicago

Page 30: Big data and open access: a collision course for science

Urban Science

Thanks to Physics Today, Sept 2013

* Credit/blame for title goes to Beth Plale

Graphic courtesy NYU Center for Urban Science and Progress

Page 31: Big data and open access: a collision course for science

Urban science: open data, open access, open cleaning?

CUSP is cleaning its own data for integration. Is this being done in way that Chicago can use? Likely not. Temporal streams are relatively simple to understand with even bad metadata. They are observational-physical and observational-social data sources so come with relatively known trust and attribution. What happens when CUSP wants to integrate predictive weather forecasting model results? Weak metadata and attribution can significantly compromise accuracy of results.

Page 32: Big data and open access: a collision course for science

Data Provenance

Work of Data To Insight Center at IU, its affiliated faculty and students

Page 33: Big data and open access: a collision course for science

Provenance Core (W3C PROV)
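The core of W3C PROV is three kinds of things (entities, activities, agents) linked by relations such as `used`, `wasGeneratedBy`, and `wasAssociatedWith`. The sketch below stores those relations as plain tuples and walks them backwards to find what a data product depends on. It does not use a PROV library; the class `ProvGraph`, its methods, and the example identifiers are assumptions made for illustration.

```python
from collections import namedtuple

# kind is a PROV relation name; subject/object follow PROV argument order,
# e.g. wasGeneratedBy(entity, activity) and used(activity, entity).
Relation = namedtuple("Relation", ["kind", "subject", "object"])


class ProvGraph:
    def __init__(self):
        self.relations = []

    def add(self, kind, subject, obj):
        self.relations.append(Relation(kind, subject, obj))

    def upstream_entities(self, entity):
        """Entities that `entity` depends on, found by following
        wasGeneratedBy to an activity and then used to its inputs."""
        found, frontier = set(), [entity]
        while frontier:
            current = frontier.pop()
            activities = [r.object for r in self.relations
                          if r.kind == "wasGeneratedBy" and r.subject == current]
            for activity in activities:
                for r in self.relations:
                    if (r.kind == "used" and r.subject == activity
                            and r.object not in found):
                        found.add(r.object)
                        frontier.append(r.object)
        return found


graph = ProvGraph()
graph.add("used", "qc_run", "level2_scan")
graph.add("wasGeneratedBy", "cleaned_scan", "qc_run")
graph.add("wasAssociatedWith", "qc_run", "meteorologist")
graph.add("used", "gridding", "cleaned_scan")
graph.add("wasGeneratedBy", "precip_grid", "gridding")
```

Calling `graph.upstream_entities("precip_grid")` walks back through both activities to the raw scan, which is the kind of tracing that lets an error in a final product be followed to its cause.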

Page 34: Big data and open access: a collision course for science
Page 35: Big data and open access: a collision course for science

Provenance for situational analysis of agent based model used in social ecological systems research

Village labor sharing for agriculture production in Africa

Page 36: Big data and open access: a collision course for science

Provenance capture AMSR-E data processing pipeline

Advanced Microwave Scanning Radiometer (AMSR-E): sensor aboard Aqua satellite; passive microwave radiometer. Observes precipitation, sea surface temperatures, ice concentrations, snow water equivalent, surface wetness, wind speed, atmospheric cloud water, and water vapor.

Page 37: Big data and open access: a collision course for science

NASA AMSR-E imagery ingest processing pipeline: provenance capture for anomaly detection

Page 38: Big data and open access: a collision course for science

Dataset: D2I-AMSR-E-Provenance Dataset

Owner and Creator: Data to Insight Center
Size: 15MB

The University of Alabama in Huntsville processes data from the NASA AMSR-E instrument. The Karma project at Indiana University instrumented the ingest processing system and captured provenance for 3,890 runs for the period of September 2 - October 4, 2011. The details of the runs are in Figure III-16 below; the largest provenance graph is the monthly rain graph, which is approximately 13MB when represented as XML.

Luo, Yuan, Plale, Beth, Jensen, Scott, Cheah, You-Wei, Conover, Helen. 2012. Provenance of AMSR-E Data from the National Snow and Ice Data Center (NSIDC). OPM XML Ver. 1.1., Sep 2 - Oct 4, 2011. Bloomington, Indiana: Data to Insight Center. http://dx.doi.org/10.5967/M0F47M2D

Page 39: Big data and open access: a collision course for science


Provenance History Layout Algorithm

Provenance of 1 month processing of NASA satellite ingest processing pipeline. Can help tracing error back to its cause. Shows relationship between daily products (each clover flower in clover leaf chain) and final monthly products at left-end.

Provenance of a seaIce daily workflow

Page 40: Big data and open access: a collision course for science

Provenance graph compare: failed runs

Left: complete provenance of successful execution. Right: failed run, because final data product (green on left) cannot be matched.

Page 41: Big data and open access: a collision course for science

Graph compare: dropped provenance

Left: successful execution. Right: although successful execution, shows dropped notifications in provenance capture, because all nodes except some edges in left graph cannot be matched.
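Comparing a run's provenance graph against a known-good one can be sketched with set differences over nodes and edges. This is a toy illustration, not the layout or matching algorithm shown on the slides: the function `compare_provenance`, the dictionary keys, and the three example graphs are assumptions.

```python
def compare_provenance(reference, observed):
    """Return the nodes and edges of `reference` that `observed` lacks.

    Each graph is a (nodes, edges) pair: a set of node ids and a set of
    (source, target) tuples.
    """
    ref_nodes, ref_edges = reference
    obs_nodes, obs_edges = observed
    return {
        "unmatched_nodes": ref_nodes - obs_nodes,
        "unmatched_edges": ref_edges - obs_edges,
    }


# Toy graphs, invented for illustration.
successful = (
    {"ingest", "daily_product", "monthly_product"},
    {("ingest", "daily_product"), ("daily_product", "monthly_product")},
)
failed = (  # final product never produced
    {"ingest", "daily_product"},
    {("ingest", "daily_product")},
)
dropped = (  # run succeeded, but a notification (edge) was lost in capture
    {"ingest", "daily_product", "monthly_product"},
    {("ingest", "daily_product")},
)
failed_diff = compare_provenance(successful, failed)
dropped_diff = compare_provenance(successful, dropped)
```

In this toy setup an unmatched final-product node signals a failed run, while unmatched edges with all nodes present point at the capture system rather than the pipeline.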

Page 42: Big data and open access: a collision course for science

Role of provenance in Open Data, Open Access, Open Cleaning

Key contribution of provenance is to data quality. We posit that quality of data provenance has 3 dimensions:

• Correctness
• Completeness
• Relevancy

Assumption: provenance collection process is automated
Assessment is focused on correctness and completeness of captured provenance

Steps:
1) Detect ambiguities and conflicts in real and synthetic provenance traces
2) Complete portions of missing provenance traces
3) Validate provenance traces when possible
4) Score the quality of provenance traces
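Two of the steps above, detecting conflicts and scoring, can be sketched in a few lines. The definitions here are assumptions chosen for illustration, not the Center's scoring method: a conflict is taken to be an entity recorded with more than one generating activity, completeness is the fraction of expected edges that were captured, and the score is simply zeroed when a conflict is present.

```python
def find_conflicts(generated_by):
    """Entities recorded with more than one generating activity.

    `generated_by` is a list of (entity, activity) pairs.
    """
    seen = {}
    for entity, activity in generated_by:
        seen.setdefault(entity, set()).add(activity)
    return {e: acts for e, acts in seen.items() if len(acts) > 1}


def completeness(expected_edges, captured_edges):
    """Fraction of expected edges present in the captured trace."""
    if not expected_edges:
        return 1.0
    return len(expected_edges & captured_edges) / len(expected_edges)


def quality_score(expected_edges, captured_edges, generated_by):
    """Completeness, set to 0.0 when the trace contains a conflict."""
    if find_conflicts(generated_by):
        return 0.0
    return completeness(expected_edges, captured_edges)
```

A real assessment would also need the relevancy dimension and a way to obtain the expected edges, for example from a reference run of the same workflow.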


Page 43: Big data and open access: a collision course for science

Provenance Quality Analysis Overview

G: Graph level
M-G: Multi-Graph (multiple graphs) level
N / E: Node/Edge level

Page 44: Big data and open access: a collision course for science

Wrapping Up: Open Data, Open Cleaning, Open Access

Open data

Open cleaning

Open interfaces

Personal privacy respected

How? e.g., Creative Commons license

Who’s working on: Research Data Alliance

Stimulating new business opportunity on stable interfaces to open data

Page 45: Big data and open access: a collision course for science

Applied Forces Come Together to Distort Object into New Space

Open access initiatives

Big Data

Fundamental advances in:
→ Climate change
→ Food security
→ New economies

Research Data Alliance

Maturity in provenance and metadata

Personal data privacy, social issues of sharing

Page 46: Big data and open access: a collision course for science

[email protected]

Our hosts: RDA Plenary 1, Chalmers Univ, Gothenburg, Sweden

Photo courtesy Leif Laaksonen