Data Science @ UW

Data Science and Urban Science @ UW


Page 1: Data Science and Urban Science @ UW

Data Science @ UW

Page 2: Data Science and Urban Science @ UW


“It’s a great time to be a data geek.” -- Roger Barga, Microsoft Research

“The greatest minds of my generation are trying to figure out how to make people click on ads”

-- Jeff Hammerbacher, co-founder, Cloudera

Page 3: Data Science and Urban Science @ UW

The Fourth Paradigm

1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive

Jim Gray

05/03/2023 Bill Howe, UW 3

Page 4: Data Science and Urban Science @ UW

“All across our campus, the process of discovery will increasingly rely on researchers’ ability to extract knowledge from vast amounts of data… In order to remain at the forefront, UW must be a leader in advancing these techniques and technologies, and in making [them] accessible to researchers in the broadest imaginable range of fields.”

2005-2008

In other words:
• Data-driven discovery will be ubiquitous
• UW must be a leader in inventing the capabilities
• UW must be a leader in translational activities – in putting these capabilities to work
• It’s about intellectual infrastructure (human capital) and software infrastructure (shared tools and services – digital capital)

Page 5: Data Science and Urban Science @ UW

A 5-year, US$37.8 million cross-institutional collaboration to create a data science environment


2014

Page 6: Data Science and Urban Science @ UW

$9.3 million from Washington Research Foundation to amplify the Moore/Sloan effort

• 6 × 5-year faculty lines in Data Science
• 6 × startup packages
• 15 × 3-year postdoctoral fellows
• Funds to remodel and furnish a WRF Data Science Studio
• Also $7.1 million to the closely related Institute for Neuroengineering, $8.0 million to the Institute for Protein Design, $6.7 million to the Clean Energy Institute

Page 7: Data Science and Urban Science @ UW


Data Science Kickoff Session: 137 posters from 30+ departments and units

Page 8: Data Science and Urban Science @ UW


PIs on Moore/Sloan effort

+ eScience Institute Steering Committee

+ UW participants in February 7 Data Science poster session

Broad collaborations

Page 9: Data Science and Urban Science @ UW

Establish a virtuous cycle

• 6 working groups, each with 3-6 faculty from each institution

Page 10: Data Science and Urban Science @ UW

Key Activity: Promote interdisciplinary careers

• Interdisciplinary graduate students
  – New, interdisciplinary “Data Science” Ph.D. tracks and program
• Interdisciplinary postdocs (“Data Science Fellows”)
  – Dual-mentored postdocs with interests in both methods and a domain science
• Interdisciplinary research scientists (“Data Scientists”)
  – Work across disciplines to solve people’s data science challenges
• Interdisciplinary faculty
  – Supported with special hiring and funding initiatives
• “Senior Research Fellows”
  – Short-term and long-term visitors
• A diverse faculty steering committee

Page 11: Data Science and Urban Science @ UW

UW Data Science Education Efforts

Audiences:
• Students, CS/Informatics: undergrads, grads
• Students, Non-Major: undergrads, grads
• Non-Students: professionals, researchers

Programs:
• UWEO Data Science Certificate
• MOOC Intro to Data Science
• IGERT: Big Data PhD Track
• New CS Courses
• Bootcamps and workshops
• Intro to Data Programming
• Data Science Masters (planned)
• Incubator: hands-on training

Page 12: Data Science and Urban Science @ UW


Educational transformation

Big Data access and management

Big Data modeling

Big Data analytics

Collaborative Big Data science

Data

Key Activity: Foster Interdisciplinary Education
• Ultimate goal: A new PhD program
  – Initial goal: A new certificate based on Big Data tracks in all departments
  – Education highlights: data science courses, co-advising, and internships
• End-to-End Research Agenda
  – Big Data mgmt, analytics, modeling, & collaboration
• Cyberinfrastructure Development
  – Big Data analysis service

Page 13: Data Science and Urban Science @ UW

• Additional data science educational activities
  – Coursera MOOCs
    • Introduction to Data Science (Bill Howe)
    • Computational Methods of Data Analysis (Nathan Kutz)
    • High Performance Scientific Computing (Randy LeVeque)
  – Traditional courses
    • Many! Example: Biochemistry for Computer Scientists (Joe Hellerstein)
    • We try to list relevant courses on the eScience Institute website
  – UW Educational Outreach
    • 3-course Certificate in Data Science
    • 3-course Certificate in Cloud Data Management & Analytics
    • 3-course Certificate in Cloud Application Development on Amazon Web Services
    • 3-course Certificate in Data Visualization
  – Workshops and bootcamps
    • Software Carpentry (Winter & Spring 2013; Winter, Spring, & Summer 2014)
    • Cosmology and Machine Learning (Autumn 2014)

Page 14: Data Science and Urban Science @ UW

Key Activity: “Re-establish the watercooler”

• An open shared R&D space where researchers from across the campus will come to collaborate
• A resident data science team
  – Permanent staff of ~5 Data Scientists – applied research and development
  – ~15-20 Data Science Fellows (research scientists, visitors, postdocs, students)
  – Entrepreneurial mentorship
• Modes of engagement
  – Drop-in open workspace
  – Studio “Office Hours”
  – Incubation Program
  – Plus seminars, sponsored lunches, workshops, bootcamps, joint proposals …

Page 15: Data Science and Urban Science @ UW

Key Activity: Create scalable impact through a Data Science Incubation Program

• Scale and concentrate our efforts
  – Move from “accidental” encounters to engineered partnerships
  – Identify emerging opportunities around campus
  – Provide a shared environment where researchers can learn from an in-house team, external mentors, and each other
• A startup environment!
  – “Seed grant” program
    • Lightweight – 1-page proposals
  – Significant potential for technology spinout – new markets for existing technology and new technology for existing markets

Page 16: Data Science and Urban Science @ UW

Key Activity: Democratize Access to Big Data and Big Data Infrastructure

• SQLShare: Database-as-a-Service for scientists and engineers

• Myria: Easy, Scalable Analytics-as-a-Service

Page 17: Data Science and Urban Science @ UW

Open Data sharing platforms

• Database-as-a-service for open data analytics
• Interoperable with external tools and languages
• Local or cloud deployments
• Interoperable with existing database platforms
• Built-in data integration, profiling, analytics

Google Fusion Tables


Entrepreneurship

1) “Data once guarded for assumed but untested reasons is now open, and we're seeing benefits.”

-- Nigel Shadbolt, Open Data Institute

2) Need to help “non-specialists within an organization use data that had been the realm of programmers and DB admins”

-- Benjamin Romano, Xconomy

“Businesses are now using data the way scientists always have” -- Jeff Hammerbacher, Cloudera

Page 18: Data Science and Urban Science @ UW

Halperin, Howe, et al. SSDBM 2013

Page 19: Data Science and Urban Science @ UW


Scalable Analytics as a Service

Page 20: Data Science and Urban Science @ UW


Page 21: Data Science and Urban Science @ UW

Kenya Health Information System Data

Grégoire Lurton
June 12, 2014

Abie Flaxman, Dan Halperin, Grégoire Lurton

Page 22: Data Science and Urban Science @ UW

In the beginning

Page 23: Data Science and Urban Science @ UW

In the beginning

Page 24: Data Science and Urban Science @ UW

“Much of the material remains unprocessed, or, if processed, unanalyzed, or, if analyzed, not read, or, if read, not used or acted upon”

Objectives

Design generalizable method to process HIS-like data

Make important dataset available for analysis

Explore actionable data analysis of HIS data

Why do we care?

Page 25: Data Science and Urban Science @ UW

Metadata Trace - saving

Reports of year n saved in January of year n+1

Years were not recorded for the first year of use…
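One possible reading of this slide is that the save date in the metadata can stand in for a missing report year. The sketch below illustrates that idea only; the function name `infer_report_year` and the rule that a January save date points to the previous calendar year are assumptions made for illustration, not the project's actual method.

```python
from datetime import date


def infer_report_year(saved_on, recorded_year=None):
    """Return the recorded year if present; otherwise guess it from the save date.

    Assumption (hypothetical rule): a report saved in January describes the
    previous calendar year; any other save month is taken at face value.
    """
    if recorded_year is not None:
        return recorded_year
    if saved_on.month == 1:
        return saved_on.year - 1
    return saved_on.year


# Made-up example: a report with no recorded year, saved in mid-January 2012.
example_year = infer_report_year(date(2012, 1, 15))
```

A real pipeline would need to validate this rule against reports whose year is known before applying it to the unlabeled first year.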

Page 26: Data Science and Urban Science @ UW
Page 27: Data Science and Urban Science @ UW
Page 28: Data Science and Urban Science @ UW

REDPy
Repeating Earthquake Detector (Python)

An eScience Incubator Project

Project Lead: Alicia Hotovec-Ellis
Data Scientist: Jake Vanderplas

John Vidale, Alicia Hotovec-Ellis, Jake Vanderplas

Page 29: Data Science and Urban Science @ UW

What is a “repeating” earthquake?

[Figure: axis “Event #”, events 1-7]

Page 30: Data Science and Urban Science @ UW

Why do we study repeating earthquakes?

Page 31: Data Science and Urban Science @ UW

The problem(s)…

[Figure: axes “Time (minutes)” and “Time (HH:MM:SS)”]

Page 32: Data Science and Urban Science @ UW

Clustering

[Figure, two panels: “Ordered in time” and “Ordered with OPTICS”; both axes “Event #”]

Page 33: Data Science and Urban Science @ UW

I talked with Alicia a bit yesterday, and she showed me that her earthquake-repeater-searching implementation is more general, and more powerful than I had thought, and closer to trial by others (and I have a particular use in mind in the ongoing iMUSH experiment on Mount St Helens)<snip>

So I'm encouraging her to continue to work on it a day per week or so for the foreseeable future, assuming you have the facilities to continue the incubation.

The project outlives the incubator……

Publications in the works on both the software and the science – from three months of half-time work

Page 34: Data Science and Urban Science @ UW

Using Twitter data to identify geographic clustering of anti-vaccination sentiments

Ben Brooks
June 12, 2014

Benjamin Brooks, Andrew Whitaker, Abie Flaxman

Page 35: Data Science and Urban Science @ UW

Initial approach

• Sentiment regarding vaccination can be discerned from Twitter.

• Can we find city- or county-level pockets of anti-vaccination sentiment?

• Do these locales correlate with outbreak and vaccination rate data (beyond H1N1)?

Page 36: Data Science and Urban Science @ UW

Training data issues

• Training data from PSU study labeled tweets as positive, negative, neutral, or irrelevant.

• Many tweet categorizations seemed suspect.

• Produced new training dataset; switched approach to negative tweets vs. all others.

• Of tweets we labeled as negative, PSU training data agreed with 36%.

• Sample non-negative tweets in training dataset from PSU study:

• “RT @Lyn_Sue Lyn_Sue18 Reasons Why u Should NOT Vaccinate Your Children Against The Flu This Season”

• “1882 -3 O RT @alexHroz Citizens From All Walks Intend To Refuse Swine Flu "Vaccine,”

• “Eighteen Reasons Why You Should NOT Vaccinate Your Children Against The Flu This Season by Bill Sard”

• “Swine Flu Vaccine not necessary and not healthy:”

Page 37: Data Science and Urban Science @ UW

Background: Previous work

• “For our sentiment classification, we used an ensemble method combining the Naive Bayes and the Maximum Entropy classifiers…The accuracy of this ensemble classifier was 84.29%.”

Page 38: Data Science and Urban Science @ UW

Other sentiment approaches

• Precision: Of all tweets labeled negative by the algorithm, what percentage are “true negatives”?

• Recall: Of all “true negative” tweets, what percentage are labeled negative by the algorithm?

Approach                       Precision   Recall
Vaccine-specific keywords      19%         59%
Modified general sentiment     25%         41%
Naïve Bayes                    79%         19%
Logistic regression            70%         28%
Labeled data from PSU study    41%         36%
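The two definitions above can be written out as a short calculation. This is a minimal sketch, assuming "true negative" on the slide means a tweet whose hand-assigned label is negative; the function name `precision_recall` and the toy labels are hypothetical and are not data from the project.

```python
def precision_recall(predicted, actual, target="negative"):
    """Return (precision, recall) for the `target` label.

    precision: of all items the algorithm labeled `target`, the share that truly are.
    recall:    of all items that truly are `target`, the share the algorithm found.
    """
    hits = sum(1 for p, a in zip(predicted, actual) if p == target and a == target)
    labeled_target = sum(1 for p in predicted if p == target)
    truly_target = sum(1 for a in actual if a == target)
    precision = hits / labeled_target if labeled_target else 0.0
    recall = hits / truly_target if truly_target else 0.0
    return precision, recall


# Made-up labels for eight tweets (not project data).
actual = ["negative", "negative", "negative", "other",
          "other", "other", "negative", "other"]
predicted = ["negative", "other", "other", "other",
             "negative", "other", "negative", "other"]

precision, recall = precision_recall(predicted, actual)
```

In this toy case the algorithm labels three tweets negative and two of them are correct, so precision is 2/3. Four tweets are truly negative and two are found, so recall is 1/2.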

Page 39: Data Science and Urban Science @ UW

Other sentiment approaches

• Data labeled by human beings does not perform dramatically better than other classifiers!

Approach                       Precision   Recall
Vaccine-specific keywords      19%         59%
Modified general sentiment     25%         41%
Naïve Bayes                    79%         19%
Logistic regression            70%         28%
Labeled data from PSU study    41%         36%

Page 40: Data Science and Urban Science @ UW

Scalable Analytics over Call Record Data in Developing Nations

Project Lead
Ian Kelley

Information School
University of Washington
E-mail: [email protected]

eScience Data Incubator - 12 June 2014
Andrew Whitaker, Ian Kelley, Josh Blumenstock

Page 41: Data Science and Urban Science @ UW

Research

Map migration patterns of workers during labor market shortages (Rwanda)
• Measure and categorize mobility patterns
• Determine people’s geographic center of gravity

Discover the effects of violent events on internal population mobility (Afghanistan)
• Track activity patterns over time; identify changes
• Map connected areas of country

Page 42: Data Science and Urban Science @ UW

Center of Gravity (COG)


Average position during a time period (e.g., day, week)
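The center of gravity defined above can be sketched as a per-subscriber, per-day average of observed positions. This is an illustrative sketch only: the record layout (subscriber id, ISO timestamp, latitude, longitude), the function name `center_of_gravity`, and the sample coordinates are assumptions, and a plain arithmetic mean of latitude and longitude is only a reasonable approximation over small areas.

```python
from collections import defaultdict
from datetime import datetime


def center_of_gravity(records):
    """Average position per (subscriber, day).

    records: iterable of (subscriber_id, iso_timestamp, lat, lon) tuples
             (hypothetical layout, e.g. one row per call at a tower location).
    Returns a dict mapping (subscriber_id, date) to (mean_lat, mean_lon).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # lat total, lon total, count
    for subscriber, timestamp, lat, lon in records:
        day = datetime.fromisoformat(timestamp).date()
        acc = sums[(subscriber, day)]
        acc[0] += lat
        acc[1] += lon
        acc[2] += 1
    return {key: (s[0] / s[2], s[1] / s[2]) for key, s in sums.items()}


# Made-up call records for one subscriber over two days.
records = [
    ("A", "2014-06-12T08:00:00", -1.90, 30.00),
    ("A", "2014-06-12T18:00:00", -2.00, 30.10),
    ("A", "2014-06-13T09:00:00", -1.50, 29.70),
]
cog = center_of_gravity(records)
```

Grouping by week instead of day would only change how the `day` key is derived, for example by using the ISO calendar week of the timestamp.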

Page 43: Data Science and Urban Science @ UW

Comprehensive Bake-Off


Page 44: Data Science and Urban Science @ UW

Towards An Urban Science Incubation Cohort


OneBusAway: Transit Traveler Information Systems

Foreclosure Rates and changes in poverty concentration

PNW Seismic Network
Early Warning System
Ocean Observatories Initiative

Education CRPE

Page 45: Data Science and Urban Science @ UW

Seattle: the tech and innovation hub
• “most innovative state” (Bloomberg 12/13)
• “smartest city” (Fast Company, 11/13)
• only US city on “ten best Internet cities” (UBM’s Future Cities blog, 8/13)
• ranked 2nd for women entrepreneurs (GeekWire, 2/13)
• ranked 4th as global startup hub, > NYC (GeekWire, 11/12)
• “the top tech city” (GeekWire, 6/12)
• …and so on

Page 46: Data Science and Urban Science @ UW

eScience Institute + Urban Science
• Better public engagement than in physical and earth sciences
• Leverages our core interest in open data and open science
• Acute need relative to traditionally data-intensive fields
  – relative newcomers in DS techniques and technologies
  – We prefer collaborations with smaller labs and individuals as opposed to “Big Science” projects
• Seattle offers a unique testbed as an urbanizing region
  – Brookings “metro”: Interconnected urban, suburban, rural, environment
  – Engaged, active communities
  – Strong local interest in open data, open government
  – Global hub for technology and innovation (next slide)
• Connections with King County Executive’s office, State CIO’s office, Seattle CTO’s office, local gov data companies (Socrata)

Page 47: Data Science and Urban Science @ UW

Data Science @ UW

We are at the dawn of a revolutionary new era of discovery and learning