Data science for everyone
Taivo Pungas
29.03.2016
For everyone?
● Manual/one-off analyses
  ○ Not production-level code
● Personal interest
● No deep learning
Survey:
● PhD [student]?
● >1 stats/ML class?
● Can code?
Finding interesting data #1
select * from users
Job Experience
1. Existing datasets
int’l organisations
Kaggle
opendata.riik.ee → R library (WIP)
1. Existing datasets
2. Scrape
kv.ee/2724720 kv.ee/2724721
check ToS
don’t DoS
KV, Postimees, Auto24, Osta, …
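The sequential ad IDs above suggest that pages can be enumerated by number. A minimal sketch of that idea in R, assuming the site's terms of service permit it: the helper names, the `https://` base URL, the one-second delay and the `"h1"` selector are all illustrative assumptions, not details from the talk.

```r
# Hypothetical helper: build ad URLs from a range of numeric IDs.
# The base URL scheme is an assumption for illustration.
ad_urls <- function(ids, base = "https://kv.ee/") {
  paste0(base, ids)
}

# Hypothetical helper: fetch one page and pull out its main heading.
# The "h1" selector is a guess; inspect the real page to find the right one.
fetch_title <- function(url, delay = 1) {
  Sys.sleep(delay)  # pause before each request so the site is not overloaded
  page <- rvest::read_html(url)
  rvest::html_text2(rvest::html_element(page, "h1"))
}

urls <- ad_urls(2724720:2724721)
# Run the line below only after checking the ToS:
# titles <- vapply(urls, fetch_title, character(1))
```

Keeping URL construction separate from fetching makes it easy to check the ID range before any request is sent.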
1M Estonian real estate ads
1. Existing datasets
2. Scrape
3. Quantified Self
QS: how did I feel today?
[Chart: count of days by mood rating, on a scale from very bad through OK to very good]
Process & tools for analysis #2
Get data
Clean
Explore / visualise
Publish
My process
Sleep on it
feedback
R vs Python
R: data processing, Hadleyverse
Python: general purpose; pandas, matplotlib, scikit-learn
R libraries: Hadleyverse
Step                  Libraries
Get data              rvest, xml2, readxl
Clean                 dplyr, tidyr, stringr
Explore / visualise   ggplot2
Publish
... and many others from Hadley Wickham.
Alternatives
Excel / Google Sheets
● External data sources
● Google Apps Script
  ○ Google Translate API
  ○ Sending e-mail
  ○ …
● Not easy to reproduce analyses
Tons of other software
R & Python worth the learning curve
R is easy: reading data
# Read data
apartments <- read.csv2("data/apartment_rent_tartu.csv",
                        sep=";", header=TRUE)
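To try the reading step without the original file, a small sketch like the following may help. It writes a temporary semicolon-separated file with made-up values and reads it back. The column names are borrowed from the later slides; the rows, and the assumption that prices use decimal commas, are illustrative only.

```r
# Write a tiny semicolon-separated file with decimal commas (made-up values).
path <- tempfile(fileext = ".csv")
writeLines(c("Linnaosa;Tube;HindKohandatud",
             "Kesklinn;2;450,5",
             "Annelinn;1;300,0"), path)

# read.csv2 defaults to sep=";", dec="," and header=TRUE,
# so the price column is parsed as numeric without extra arguments.
toy <- read.csv2(path)
str(toy)
```

Since `sep=";"` and `header=TRUE` are already the defaults of `read.csv2`, spelling them out as on the slide is harmless and only makes the intent explicit.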
R is easy: dplyr
library(dplyr)
# Find average price by part of city
# (Linnaosa = city district, HindKohandatud = adjusted price, KeskmineHind = average price)
apartments %>%
  group_by(Linnaosa) %>%
  summarise(KeskmineHind=mean(HindKohandatud)) %>%
  arrange(desc(KeskmineHind))
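For comparison, the same group-and-average step can be sketched in base R, which shows what the dplyr pipeline computes. The data frame below is made up for illustration; only the column names come from the slide.

```r
# Made-up rows standing in for the real apartment data.
toy <- data.frame(
  Linnaosa       = c("Kesklinn", "Kesklinn", "Annelinn", "Annelinn"),
  HindKohandatud = c(400, 500, 300, 320)
)

# Mean price per district (the group_by + summarise step).
avg <- aggregate(HindKohandatud ~ Linnaosa, data = toy, FUN = mean)
names(avg)[2] <- "KeskmineHind"

# Sort districts from most to least expensive (the arrange(desc(...)) step).
avg <- avg[order(-avg$KeskmineHind), ]
avg
```

The dplyr version reads top to bottom in the order the steps happen, which is much of its appeal for one-off analyses.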
R is easy: lin. regression
# Build linear model: adjusted price (HindKohandatud) explained by number of rooms (Tube)
fit <- lm(HindKohandatud ~ Tube, data=apartments)
summary(fit)
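Once a model is fitted, its coefficients can be read off and used for prediction. A minimal sketch on made-up numbers (four invented apartments, not the talk's data):

```r
# Made-up data: number of rooms and adjusted monthly price.
toy <- data.frame(
  Tube           = c(1, 2, 3, 4),
  HindKohandatud = c(210, 290, 410, 490)
)

# Same model formula as on the slide.
fit <- lm(HindKohandatud ~ Tube, data = toy)

# Intercept and slope: the slope is the estimated price change per extra room.
coefs <- coef(fit)

# Predicted price for a hypothetical two-room apartment.
pred <- predict(fit, newdata = data.frame(Tube = 2))
```

With so few points this only demonstrates the mechanics; `summary(fit)` on real data additionally reports standard errors and R-squared.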
Presenting your results #3
interactive > static
D3.js: powerful web visualisations
[Chart: tools placed on two axes, easy to use vs hard to use and limited vs powerful: ggplot2, D3.js, D3 derivatives, Excel, GSheets, AI]
How to reach an audience
● Social media
● Start a blog
  ○ stat24.ee
  ○ pungas.ee
● Offer free content
  ○ Newspapers (tip lines)
  ○ Guest posts on blogs
● Push to Estonian data science community
  ○ TODO: FB group? Community blog?
Putting it together ##
Examples
Apartment prices: R + D3.js (18k hits)
Salaries of public servants: R + D3.js (38k hits)
Study data: R + D3.js (3k hits)
Election promise calculator: D3.js (42k hits)
Bondora: R
Alcohol deaths: Illustrator
News & inspiration
Mailing lists:
Information is Beautiful, Data Science Weekly, Data Elixir
Blogs:
FiveThirtyEight, R-bloggers, Stat24, Mike Bostock
Long-term motivation
Flickr: ucirvine / CC BY-NC-ND
taivo@pungas.ee
tpungas