Data science for everyone Taivo Pungas 29.03.2016


Page 1: Data science for everyone

Data science for everyone

Taivo Pungas

29.03.2016

Page 2: Data science for everyone
Page 3: Data science for everyone

For everyone?

● Manual/one-off analyses
○ Not production-level code
● Personal interest
● No deep learning

Survey:

● PhD [student]?
● >1 stats/ML class?
● Can code?

Page 4: Data science for everyone

Finding interesting data #1

Page 5: Data science for everyone

select * from users

Job Experience

Page 6: Data science for everyone

1. Existing datasets

int’l organisations

Kaggle

opendata.riik.ee → R library (WIP)

Page 7: Data science for everyone

1. Existing datasets
2. Scrape

kv.ee/2724720 kv.ee/2724721

Page 8: Data science for everyone

1. Existing datasets
2. Scrape

check ToS
don’t DoS
KV, Postimees, Auto24, Osta, …

1M Estonian real estate ads
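
The scraping advice above (ads at consecutive IDs such as kv.ee/2724720 and kv.ee/2724721, check the ToS, don't DoS) could look roughly like this in Python, one of the two languages the talk recommends. This is a minimal sketch, not the scraper used for the talk: the names `ad_urls`, `crawl` and `http_fetch`, the one-second default delay and the `https://` scheme in the usage comment are all assumptions made for illustration. Checking a site's terms of service stays a manual step.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, Iterator, Optional


def ad_urls(base: str, first_id: int, last_id: int) -> Iterator[str]:
    """Yield one URL per consecutive ad ID, e.g. base/2724720, base/2724721."""
    for ad_id in range(first_id, last_id + 1):
        yield f"{base}/{ad_id}"


def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    delay_seconds: float = 1.0,  # assumed pause; pick one the site can tolerate
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch URLs one at a time, pausing between requests ("don't DoS")."""
    pages: Dict[str, str] = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # never send requests back to back
        body = fetch(url)
        if body is not None:  # skip IDs that have no page behind them
            pages[url] = body
    return pages


def http_fetch(url: str, timeout: float = 10.0) -> Optional[str]:
    """Download one page; return None on an HTTP or network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, TimeoutError):
        return None


# Hypothetical usage, only after checking the site's ToS:
# pages = crawl(ad_urls("https://kv.ee", 2724720, 2724721), http_fetch)
```

Passing `fetch` and `sleep` in as arguments keeps the crawl loop testable without touching the network; a sequential loop with a fixed delay is the simplest way to stay well below anything resembling a denial of service.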

Page 10: Data science for everyone

QS: how did I feel today?

[Chart: count of answers by rating, on a scale from very bad through OK to very good]

Page 11: Data science for everyone

Process & tools for analysis #2

Page 12: Data science for everyone

Get data

Clean

Explore / visualise

Publish

My process

Sleep on it

feedback

Page 13: Data science for everyone

Python vs R

Python: pandas, matplotlib, scikit-learn

R: Hadleyverse

general purpose data processing

Page 14: Data science for everyone

R libraries: Hadleyverse

Step                 Libraries
Get data             rvest, xml2, readxl
Clean                dplyr, tidyr, stringr
Explore / visualise  ggplot2
Publish

... and many others from Hadley Wickham.

Page 15: Data science for everyone

Alternatives

Excel / Google Sheets

● External data sources
● Google Apps Script

○ Google Translate API

○ Sending e-mail

○ …

● Not easy to reproduce analyses

Tons of other software

R & Python worth the learning curve

Page 16: Data science for everyone

R is easy: reading data

# Read data
apartments <- read.csv2("data/apartment_rent_tartu.csv",
                        sep=";", header=TRUE)

Page 17: Data science for everyone

R is easy: dplyr

library(dplyr)

# Find average price by part of city
# (Linnaosa = city district, HindKohandatud = adjusted price,
#  KeskmineHind = average price)
apartments %>%
  group_by(Linnaosa) %>%
  summarise(KeskmineHind=mean(HindKohandatud)) %>%
  arrange(desc(KeskmineHind))

Page 18: Data science for everyone

R is easy: lin. regression

# Build linear model: adjusted price (HindKohandatud) explained by rooms (Tube)
fit <- lm(HindKohandatud ~ Tube, data=apartments)

# Print coefficients and fit statistics
summary(fit)

Page 19: Data science for everyone

Presenting your results #3

Page 21: Data science for everyone

D3.js: powerful web visualisations

Page 22: Data science for everyone

[Chart: tools placed on two axes, easy to use vs hard to use and limited vs powerful. Tools shown: ggplot2, D3.js, D3 derivatives, Excel, GSheets, AI]

Page 23: Data science for everyone

How to reach an audience

● Social media
● Start a blog
○ stat24.ee
○ pungas.ee
● Offer free content
○ Newspapers (tip lines)
○ Guest posts on blogs
● Push to Estonian data science community
○ TODO: FB group? Community blog?

Page 24: Data science for everyone

Putting it together ##

Page 25: Data science for everyone

Examples

Apartment prices: R + D3.js 18k hits

Salaries of public servants: R + D3.js 38k hits

Study data: R + D3.js 3k hits

Election promise calculator: D3.js 42k hits

Bondora: R

Alcohol deaths: Illustrator

Page 26: Data science for everyone

News & inspiration

Mailing lists:

Information is Beautiful, Data Science Weekly, Data Elixir

Blogs:

FiveThirtyEight, R-bloggers, Stat24, Mike Bostock

Page 28: Data science for everyone

pungas.ee

taivo@

tpungas