29
Computational Training for Domain Scientists & Data Literacy Joshua Bloom Moore Founda1on DataDriven Inves1gator UC Berkeley, Astronomy @pro%sb AAAS, San Jose Sunday Feb 15, 2015

Computational Training for Domain Scientists & Data Literacy

Embed Size (px)

Citation preview

Page 1: Computational Training for Domain Scientists & Data Literacy

Computational Training for Domain Scientists & Data Literacy

Joshua  BloomMoore  Founda1on  Data-­‐Driven  Inves1gator    

UC  Berkeley,  Astronomy

@pro%sb

AAAS,  San  Jose    Sunday  Feb  15,  2015

Page 2: Computational Training for Domain Scientists & Data Literacy

“knowledge and understanding of scientific concepts and processes required for personal decision making, participation in civic and cultural affairs, and economic productivity"

Scientific Literacy

National Science Education Standards Report, National Academies of Science 1996

Data Literacy, as a broad new educational imperative

Data Literacy“knowledge and understanding of analytical concepts and processes required to draw fair conclusions from data that impact personal, professional, and civic lives”

now @ AAAS

Page 3: Computational Training for Domain Scientists & Data Literacy

http://boingboing.net/2013/01/01/correlation-between-autism-dia.html

via reddit/r/skeptic

Page 4: Computational Training for Domain Scientists & Data Literacy

Our  ML  framework  found  the  Nearest  Supernova  in  3  

Decades  ..‣ Built  &  Deployed  robust,  real-­‐?me  machine  learning  framework,  discovering  >10,000  events  in  >  10  TB  of  imaging                  →  50+  journal  ar?cles

‣ Built  Probabilis?c  Event  classifica?on  catalogs  with  innova?ve  ac?ve  learning  

hPp://?medomain.org hPps://www.nsf.gov/news/news_summ.jsp?cntn_id=122537

Page 5: Computational Training for Domain Scientists & Data Literacy

What is the toolbox of the modern (data-driven) scientist?

domain training

statistics

advanced computing

database

GUI

parallel

visualization

Bayesian

machine learning

Physics

laboratory techniques

MCMC

MapReduce

Page 6: Computational Training for Domain Scientists & Data Literacy

And...How do we teach this with what little time the students have?

What is the toolbox of the modern (data-driven) scientist?

Page 7: Computational Training for Domain Scientists & Data Literacy

Data-Centric Coursework, Bootcamps, Seminars, & Lecture Series

BDAS: Berkeley Data Analytics Stack [Spark, Shark, ...]

parallel programming bootcamp

...and entire degree programs

Taught by CS/Stats

Aimed at Engineers & Programmers Heading Toward Industry

Page 8: Computational Training for Domain Scientists & Data Literacy

2010: 85 campers 2012a: 135 campers

Python Bootcamps at Berkeley

Page 9: Computational Training for Domain Scientists & Data Literacy

2012b: 210 campers

Python Bootcamps at Berkeley

2013a: 253 campers

Page 10: Computational Training for Domain Scientists & Data Literacy

‣ 3 days of live/archive streamed lectures ‣ all open material in GitHub ‣ widely disseminated (e.g., @ NASA)

http://pythonbootcamp.info

Page 11: Computational Training for Domain Scientists & Data Literacy

Part of the Designated Emphasis in Computational Science & Engineering at Berkeley

visualization

machine learning

database interaction

user interface & web frameworks

timeseries & numerical computing

interfacing to other languagesBayesian inference & MCMC

hardware control

parallelism

Page 12: Computational Training for Domain Scientists & Data Literacy

Python for Data Science @ Berkeley [Sept 2013]

Page 13: Computational Training for Domain Scientists & Data Literacy

64%

36%female male

8%4%

8%

12%

4%12% 8%

16%

16%

12%PsychologyAstronomyNeuroscienceBiostatisticsPhysicsChemical EngineeringISchoolEarth and Planetary SciencesIndustrial EngineeringMechanical Engineering

“Parallel Image Reconstruction from Radio Interferometry Data”

“Graph Theory Analysis of Growing Graphs”

http://mb3152.github.io/Graph-Growth/

“Realtime Prediction of Activity Behavior from Smartphone”

“Bus Arrival Time Prediction in Spain”

Page 14: Computational Training for Domain Scientists & Data Literacy

Time domain preprocessing

- Start with raw photometry!

- Gaussian process detrending!

- Calibration!

- Petigura & Marcy 2012!

!Transit search

- Matched filter!

- Similar to BLS algorithm (Kovcas+ 2002)!

- Leverages Fast-Folding Algorithm O(N^2) → O(N log N) (Staelin+ 1968)!

!Data validation

- Significant peaks in periodogram, but inconsistent with exoplanet transit

TERRA – optimized for small planets

Detrended/calibrated photometry

TERRA

Raw

Flu

x (p

pt)

Cal

ibra

ted

Flux

Erik Petigura Berkeley Astro Grad Student

Petigura, Howard, & Marcy (2013)

Prevalence of Earth-size planets orbiting Sun-like starsErik A. Petiguraa,b,1, Andrew W. Howardb, and Geoffrey W. Marcya

aAstronomy Department, University of California, Berkeley, CA 94720; and bInstitute for Astronomy, University of Hawaii at Manoa, Honolulu, HI 96822

Contributed by Geoffrey W. Marcy, October 22, 2013 (sent for review October 18, 2013)

Determining whether Earth-like planets are common or rare loomsas a touchstone in the question of life in the universe. We searchedfor Earth-size planets that cross in front of their host stars byexamining the brightness measurements of 42,000 stars fromNational Aeronautics and Space Administration’s Kepler mission.We found 603 planets, including 10 that are Earth size (1−2 R⊕)and receive comparable levels of stellar energy to that of Earth(0:25− 4 F⊕). We account for Kepler’s imperfect detectability ofsuch planets by injecting synthetic planet–caused dimmings intothe Kepler brightness measurements and recording the fractiondetected. We find that 11 ± 4% of Sun-like stars harbor an Earth-size planet receiving between one and four times the stellar inten-sity as Earth. We also find that the occurrence of Earth-size planets isconstant with increasing orbital period (P), within equal intervals oflogP up to∼200 d. Extrapolating, one finds 5:7+1:7

−2:2% of Sun-like starsharbor an Earth-size planet with orbital periods of 200–400 d.

extrasolar planets | astrobiology

The National Aeronautics and Space Administration’s (NASA’s)Kepler mission was launched in 2009 to search for planets

that transit (cross in front of) their host stars (1–4). The resultingdimming of the host stars is detectable by measuring their bright-ness, and Kepler monitored the brightness of 150,000 stars every30 min for 4 y. To date, this exoplanet survey has detected morethan 3,000 planet candidates (4).The most easily detectable planets in the Kepler survey are

those that are relatively large and orbit close to their host stars,especially those stars having lower intrinsic brightness fluctua-tions (noise). These large, close-in worlds dominate the list ofknown exoplanets. However, the Kepler brightness measurementscan be analyzed and debiased to reveal the diversity of planets,including smaller ones, in our Milky Way Galaxy (5–7). Theseprevious studies showed that small planets approaching Earthsize are the most common, but only for planets orbiting close totheir host stars. Here, we extend the planet survey to Kepler’smost important domain: Earth-size planets orbiting far enoughfrom Sun-like stars to receive a similar intensity of light energyas Earth.

Planet SurveyWe performed an independent search of Kepler photometry fortransiting planets with the goal of measuring the underlying oc-currence distribution of planets as a function of orbital period,P, and planet radius, RP. We restricted our survey to a set of Sun-like stars (GK type) that are the most amenable to the detectionof Earth-size planets. We define GK-type stars as those with sur-face temperatures Teff = 4,100–6,100 K and gravities logg = 4.0–4.9(logg is the base 10 logarithm of a star’s surface gravity measured incm s−2) (8). Our search for planets was further restricted to thebrightest Sun-like stars observed by Kepler (Kp = 10–15 mag). These42,557 stars (Best42k) have the lowest photometric noise, makingthem amenable to the detection of Earth-size planets. Whena planet crosses in front of its star, it causes a fractional dimmingthat is proportional to the fraction of the stellar disk blocked,δF = ðRP=RpÞ2, where Rp is the radius of the star. As viewed bya distant observer, the Earth dims the Sun by ∼100 parts permillion (ppm) lasting 12 h every 365 d.

We searched for transiting planets in Kepler brightness mea-surements using our custom-built TERRA software packagedescribed in previous works (6, 9) and in SI Appendix. In brief,TERRA conditions Kepler photometry in the time domain, re-moving outliers, long timescale variability (>10 d), and systematicerrors common to a large number of stars. TERRA then searchesfor transit signals by evaluating the signal-to-noise ratio (SNR) ofprospective transits over a finely spaced 3D grid of orbital period,P, time of transit, t0, and transit duration, ΔT. This grid-basedsearch extends over the orbital period range of 0.5–400 d.TERRA produced a list of “threshold crossing events” (TCEs)

that meet the key criterion of a photometric dimming SNR ratioSNR > 12. Unfortunately, an unwieldy 16,227 TCEs met this cri-terion, many of which are inconsistent with the periodic dimmingprofile from a true transiting planet. Further vetting was performedby automatically assessing which light curves were consistent withtheoretical models of transiting planets (10). We also visuallyinspected each TCE light curve, retaining only those exhibiting aconsistent, periodic, box-shaped dimming, and rejecting thosecaused by single epoch outliers, correlated noise, and other dataanomalies. The vetting process was applied homogeneously to allTCEs and is described in further detail in SI Appendix.To assess our vetting accuracy, we evaluated the 235 Kepler

objects of interest (KOIs) among Best42k stars having P > 50 d,which had been found by the Kepler Project and identified as planetcandidates in the official Exoplanet Archive (exoplanetarchive.ipac.caltech.edu; accessed 19 September 2013). Among them, wefound four whose light curves are not consistent with beingplanets. These four KOIs (364.01, 2,224.02, 2,311.01, and 2,474.01)have long periods and small radii (SI Appendix). This exercisesuggests that our vetting process is robust and that careful scrutinyof the light curves of small planets in long period orbits is useful toidentify false positives.

Significance

A major question is whether planets suitable for biochemistryare common or rare in the universe. Small rocky planets withliquid water enjoy key ingredients for biology. We used theNational Aeronautics and Space Administration Kepler tele-scope to survey 42,000 Sun-like stars for periodic dimmingsthat occur when a planet crosses in front of its host star. Wefound 603 planets, 10 of which are Earth size and orbit in thehabitable zone, where conditions permit surface liquid water.We measured the detectability of these planets by injectingsynthetic planet-caused dimmings into Kepler brightness mea-surements. We find that 22% of Sun-like stars harbor Earth-sizeplanets orbiting in their habitable zones. The nearest such planetmay be within 12 light-years.

Author contributions: E.A.P., A.W.H., and G.W.M. designed research, performed research,analyzed data, and wrote the paper.

The authors declare no conflict of interest.

Freely available online through the PNAS open access option.

Data deposition: The Kepler photometry is available at the Milkulski Archive for SpaceTelescopes (archive.stsci.edu). All spectra are available to the public on the CommunityFollow-up Program website (cfop.ipac.caltech.edu).1To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1319909110/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1319909110 PNAS | November 26, 2013 | vol. 110 | no. 48 | 19273–19278

AST

RONOMY

Bootcamp/Seminar Alum

Python

DOE/NERSC computation

PNAS [2014]

Page 15: Computational Training for Domain Scientists & Data Literacy

Dangers  of  a  Cursory  Training/Teaching  Curriculum

Page 16: Computational Training for Domain Scientists & Data Literacy

34 Fitting a straight line to data

0 50 100 150 200 250 300x

0

100

200

300

400

500

600

700

y

y = 1.33 x + 164

0 50 100 150 200 250 300x

0

100

200

300

400

500

600

700

y

0.0

1.0

1.0

1.0

0.0

0.0

0.00.00.0

0.00.0

0.0

0.00.0

0.0

0.0

0.0

0.0

0.0

0.0

Figure 10.— Partial solution to Exercise 14: On the left, the same as Figure 9but including the outlier points. On the right, the same as in Figure 4but applying the outlier (mixture) model to the case of two-dimensionaluncertainties.

0 50 100 150 200 250 300x

0

100

200

300

400

500

600

700

y

forward � � � y = (2.24 ± 0.11) x + (34 ± 18)reverse � · � y = (2.64 ± 0.12) x + (�50 ± 21)

Figure 11.— Partial solution to Exercise 15: Results of “forward and reverse”fitting. Don’t ever do this.

Data analysis recipes:

Fitting a model to data

David W. HoggCenter for Cosmology and Particle Physics, Department of Physics, New York University

Max-Planck-Institut fur Astronomie, Heidelberg

Jo BovyCenter for Cosmology and Particle Physics, Department of Physics, New York University

Dustin LangDepartment of Computer Science, University of Toronto

Princeton University Observatory

Abstract

We go through the many considerations involved in fitting a modelto data, using as an example the fit of a straight line to a set of pointsin a two-dimensional plane. Standard weighted least-squares fittingis only appropriate when there is a dimension along which the datapoints have negligible uncertainties, and another along which all theuncertainties can be described by Gaussians of known variance; theseconditions are rarely met in practice. We consider cases of general,heterogeneous, and arbitrarily covariant two-dimensional uncertain-ties, and situations in which there are bad data (large outliers), un-known uncertainties, and unknown but expected intrinsic scatter inthe linear relationship being fit. Above all we emphasize the impor-tance of having a “generative model” for the data, even an approx-imate one. Once there is a generative model, the subsequent fittingis non-arbitrary because the model permits direct computation of thelikelihood of the parameters or the posterior probability distribution.Construction of a posterior probability distribution is indispensible ifthere are “nuisance parameters” to marginalize away.

It is conventional to begin any scientific document with an introductionthat explains why the subject matter is important. Let us break with tra-dition and observe that in almost all cases in which scientists fit a straightline to their data, they are doing something that is simultaneously wrong

and unnecessary. It is wrong because circumstances in which a set of two

⇤The notes begin on page 39, including the license1 and the acknowledgements2.

1

arX

iv:1

008.

4686

v1 [

astro

-ph.

IM]

27 A

ug 2

010

Dataanalysisrecipes:

Fittingamodeltodata

Dav

idW

.H

oggCenter

forCosm

ologyan

dParticle

Physics,

Dep

artmentof

Physics,

New

York

University

Max

-Plan

ck-In

stitutfurAstron

omie,

Heid

elberg

JoB

ovy

Center

forCosm

ologyan

dParticle

Physics,

Dep

artmentof

Physics,

New

York

University

Dustin

Lan

gDep

artmentof

Com

puter

Scien

ce,University

ofToron

to

Prin

cetonUniversity

Observatory

Abstra

ct

We

go

thro

ugh

the

many

consid

eratio

ns

involv

edin

fittin

ga

model

todata

,usin

gas

an

exam

ple

the

fit

ofa

straig

ht

line

toa

setofpoin

tsin

atw

o-d

imen

sional

pla

ne.

Sta

ndard

weig

hted

least-sq

uares

fittin

gis

only

appro

pria

tew

hen

there

isa

dim

ensio

nalo

ng

which

the

data

poin

tshav

eneg

ligib

leuncerta

inties,

and

anoth

eralo

ng

which

all

the

uncerta

inties

can

be

describ

edby

Gaussia

ns

ofknow

nva

riance;

these

conditio

ns

are

rarely

met

inpra

ctice.W

eco

nsid

erca

sesof

gen

eral,

hetero

gen

eous,

and

arb

itrarily

covaria

nt

two-d

imen

sional

uncerta

in-

ties,and

situatio

ns

inw

hich

there

are

bad

data

(larg

eoutliers),

un-

know

nuncerta

inties,

and

unknow

nbut

expected

intrin

sicsca

tterin

the

linea

rrela

tionsh

ipbein

gfit.

Abov

eall

we

emphasize

the

impor-

tance

of

hav

ing

a“gen

erativ

em

odel”

for

the

data

,ev

enan

approx

-im

ate

one.

Once

there

isa

gen

erativ

em

odel,

the

subseq

uen

tfittin

gis

non-a

rbitra

rybeca

use

the

model

perm

itsdirect

com

puta

tion

ofth

elik

elihood

ofth

epara

meters

or

the

posterio

rpro

bability

distrib

utio

n.

Constru

ction

ofa

posterio

rpro

bability

distrib

utio

nis

indisp

ensib

leif

there

are

“nuisa

nce

para

meters”

tom

arg

inalize

away.

Itis

conven

tional

tobegin

any

scientifi

cdocu

men

tw

ithan

intro

duction

that

explain

sw

hy

the

subject

matter

isim

portan

t.Let

us

break

with

tra-dition

and

observe

that

inalm

ostall

casesin

which

scientists

fit

astraigh

tlin

eto

their

data,

they

aredoin

gsom

ethin

gth

atis

simultan

eously

wrong

andunnecessary.

Itis

wron

gbecau

secircu

mstan

cesin

which

aset

oftw

o

⇤The

notes

beg

inon

page

39,in

cludin

gth

elicen

se1

and

the

ack

now

ledgem

ents

2.

1

arXiv:1008.4686v1 [astro-ph.IM] 27 Aug 2010

Sta7s7cal  Inference

Data  Literacy  in  the  Undergraduate  &  Graduate  Training  Mission  

Page 17: Computational Training for Domain Scientists & Data Literacy

Versioning  &  Reproducibility

“Recently, the scientific community was shaken by reports that a troubling proportion of peer-reviewed preclinical studies are not reproducible.” McNutt, 2014

http://www.sciencemag.org/content/343/6168/229.summary

-­‐  Git  has  emerged  as  the  de  facto  versioning  tool  -­‐  Berkeley  Common  Environment  (BCE)  So]ware  Stack  -­‐  “Reproducible  and  Collabora?ve  Sta?s?cal  Data  Science”  (Sta?s?cs  157:  P.  Stark)

Data  Literacy  in  the  Undergraduate  &  Graduate  Training  Mission  

Page 18: Computational Training for Domain Scientists & Data Literacy

Undergraduate Curriculum

• Data  Science  Task  Force  report  generated  at  Chancellor/Provost  behest  • Recommenda?ons:  

– A  founda?onal  lower-­‐division  course  accessible  to  everyone  (sta?s?cal  &  computa?onal  thinking),  with  broad  applicability  

– A  suite  of  courses  that  advance  the  use  of  data  sciences  within  the  broad  swath  of  disciplines  available  to  Berkeley  undergraduates    

– A  high-­‐quality  minor  for  students  who  wish  to  couple  data  sciences  with  their  major  discipline    

– A  data  sciences  major  for  those  students  who  want  this  to  be  their  primary  area  of  study

Page 19: Computational Training for Domain Scientists & Data Literacy

pg. 12; Announced Feb 11

Chancellor’s Report for 2015

Page 20: Computational Training for Domain Scientists & Data Literacy

Grand  challenges  at  the  intersec7on  of:  • Open  science  and  reproducible  data  analysis  • Responses  of  coupled  human-­‐natural  systems  to  rapid  environmental  change  (water  resources,  land  use  planning,  agriculture,  etc.)  

• Development  of  evidence-­‐based  solu?ons  in  public  policy,  resource  management  and  environmental  design

• Five  schools  and  colleges:  LeFers  &  Science,  Natural  Resources,  Engineering,  Public  Policy  and  Environmental  Design  

• Two  year  graduate  training  program;  Master’s  and  PhD  students  • Four  cohorts  of  up  to  20  students  each,  first  cohort  in  fall  2015

Data Sciences for the 21st Century (DS421): Environment and Society

via Prof. David Ackerly (Integrative Biology)

Page 21: Computational Training for Domain Scientists & Data Literacy

Data Sciences for the 21st Century (DS421): Environment and Society

via Prof. David Ackerly (Integrative Biology)

Page 22: Computational Training for Domain Scientists & Data Literacy

http://bitly.com/bundles/fperezorg/1

“Bold new partnership launches to harness potential of data scientists and big data”

Founded  in  December  2013  as  a  result  of  a  year+  long  na?onal  selec?on  process  $37.8M  over  5  years,  along  with  University  of  Washington  &  NYU  

‣ An  accelerator  for  data-­‐driven  discovery  ‣ An  agent  of  change  in  the  modern  university  as  Data  Science  takes  hold  ‣ An  incubator  for  the  next  genera?on  of  Data  Science  technology  and  prac?ce

Page 23: Computational Training for Domain Scientists & Data Literacy

Leadership  from  across  the  spectrum

Joshua  Bloom,  Professor,  Astronomy;  Director,  Center  for  Time  Domain  Informa?cs          Henry  Brady,  Dean,  Goldman  School  of  Public  Policy                

Cathryn  Carson,  Associate  Dean,  Social  Sciences;  Ac?ng  Director  of  Social  Sciences  Data  Laboratory  "D-­‐Lab”              David  Culler,  Chair,  EECS                

Michael  Franklin,  Professor;  EECS,  Co-­‐Director,  AMP  Lab                    

Erik  Mitchell,  Associate  University  Librarian                  

Faculty  Lead/PI:  Saul  PerlmuFer,  Physics,  Berkeley  Center  for  Cosmological  Physics

Fernando  Perez,  Researcher,  Henry  H.  Wheeler  Jr.  Brain  Imaging  Center

Jasjeet  Sekhon,  Professor,  Poli?cal  Science  and  Sta?s?cs;  Center  for  Causal  Inference  and  Program  Evalua?on              Jamie  Sethian,  Professor,  Mathema?cs              

Kimmen  Sjölander,  Professor,  Bioengineering,  Plant  and  Microbial  Biology              

Philip  Stark,  Chair,  Sta?s?cs            

Ion  Stoica,  Professor,  EECS;  Co-­‐Director,  AMP  Lab

Page 24: Computational Training for Domain Scientists & Data Literacy

“I  love  working  with  astronomers,  since  their  data  is  worthless.”

-­‐  Jim  Gray,  Microso'

Page 25: Computational Training for Domain Scientists & Data Literacy

Created by Natalia Bilenko.Data source: PubMed Central. http://sciencereview.berkeley.edu/bsr_design/Issue26/datascience/cover/

Page 26: Computational Training for Domain Scientists & Data Literacy

Established  CS/Stats/Math  in  Service  of  novelty  in  domain  science  

vs.  

Novelty  in  domain  science  driving  &  informing  novelty  in  CS/Stats/Math

“novelty2  problem”  an  extra  Burden  for  Forefront  Scien?sts

hPps://medium.com/tech-­‐talk/dd88857f662

Page 27: Computational Training for Domain Scientists & Data Literacy

ỉπ vs.

21st Century Educational Mission: Produce Gamma-shaped people

deep domain skill/knowledge/training deep methodological knowledge/skill

deep domain or methodological skill/knowledge/training strong methodological or domain knowledge/skill

Goal: empower teams of gamma’s to excel

Page 28: Computational Training for Domain Scientists & Data Literacy

SummaryData  science  at  Berkeley  is  thriving  and  BIDS  is  its  intellectual  home        -­‐  incuba?ng  novel  science  &  methodologies        -­‐  teaching  &  training        -­‐  innovate  environments,  interac?ons,  &  networks

Teaching  impera7ves  emerging  around  data  literacy        -­‐  new  campus-­‐wide  undergraduate  curriculum  (star?ng  with  a  founda?on  for  all  1st  years)        -­‐  new  mul?-­‐year,  mul?-­‐disciplinary  graduate  program

Page 29: Computational Training for Domain Scientists & Data Literacy

@pro%sb

Thank  you.

PyData,  May  4.  2014