Fairy tale from the land of data


A fairy tale about falling into the trap of misinterpreting results. It shows the importance of building models and understanding them.


Fairy tales in the land of data

Or: do I know what I’m doing?

By @przemur from

http://about.me/przemek.maciolek

A story

http://yamao.deviantart.com/art/Cleric-comm-343786321

https://www.flickr.com/photos/jsjgeology/8359854092/

Suspense


“The hammers from the new provider are no good, sayr.”

What would you do?

New hammers since this month

install.packages('ggplot2')  # only needed once
library('ggplot2')
setwd("/Users/pmm/Desktop/hammer")
all <- read.csv(file="all.csv")

# Number of dwarfs working in the mine, by month
qplot(all$month_sequence, all$dwarfs) + geom_smooth()
# Total production of gold, by month
qplot(all$month_sequence, all$production) + geom_smooth()

# Per-dwarf average production, by month
all$prod_per_dwarf <- all$production / all$dwarfs
qplot(all$month_sequence, all$prod_per_dwarf) + geom_smooth()

Number of dwarfs working in the mine

The hammers from the new provider started being distributed to the new miners.

Total production of gold

Per-dwarf average production

Does anyone see a problem?

Let’s look at the production of each dwarf, relative to the time each one applied…
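One way to get production relative to the time each dwarf applied is to subtract the dwarf's first month from the calendar month. The slides do not show how new_relative.csv and old_relative.csv were built, so this is only a minimal sketch in base R: the table, the column names dwarf_id, month_sequence and production, the numbers, and the choice to start counting at 0 are all assumptions made for illustration.

```r
# Hypothetical per-dwarf monthly records; names and values are illustrative only.
records <- data.frame(
  dwarf_id       = c(1, 1, 1, 2, 2, 3),
  month_sequence = c(1, 2, 3, 2, 3, 3),
  production     = c(10, 9, 8, 10, 9, 10)
)

# First month in which each dwarf appears, aligned row by row with `records`.
first_month <- ave(records$month_sequence, records$dwarf_id, FUN = min)

# Months since the dwarf started: 0 in the first month, 1 in the second, and so on.
records$relative_month <- records$month_sequence - first_month

print(records)
```

After this step every dwarf's history starts at the same point on the x axis, so dwarfs who joined in different months can be compared directly.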

Dwarfs which are using the OLD hammer design

Dwarfs which are using the NEW hammer design

new <- read.csv(file="new_relative.csv")
old <- read.csv(file="old_relative.csv")

qplot(new$relative_month, new$production)
ggplot(new, aes(x=relative_month, y=production)) +
  geom_point(shape=19, position=position_jitter(width=.5, height=0), alpha=.2)

# This will look much better!
old$type <- 'old'
new$type <- 'new'
old_and_new <- rbind(old, new)
ggplot(old_and_new, aes(x=relative_month, y=production, color=type)) +
  geom_point(shape=19, position=position_jitter(width=.5, height=0), alpha=.2)

Scatterplot showing relative production done using old and new hammers

What now?

ggplot(old_and_new, aes(x=relative_month, y=production, color=type)) +
  geom_point(shape=19, position=position_jitter(width=.5, height=0), alpha=.1) +
  geom_smooth(method=lm)

The new hammers wear out much faster!
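The two regression lines can also be compared numerically by fitting a single linear model with an interaction between relative_month and type, which gives each hammer type its own slope. The sketch below runs on simulated data, since the CSV files from the slides are not available here: the data frame sim, its sample size, the slopes -0.5 and -0.8, and the noise level are all made up, and only the column names follow old_and_new.

```r
set.seed(42)  # reproducible synthetic data, NOT the data from the slides

# Simulated stand-in for old_and_new: production falls linearly with
# relative_month, with a steeper (made-up) slope for the new hammers.
n <- 200
sim <- data.frame(
  relative_month = rep(0:9, times = 2 * n / 10),
  type = factor(rep(c("old", "new"), each = n), levels = c("old", "new"))
)
true_slope <- ifelse(sim$type == "old", -0.5, -0.8)
sim$production <- 100 + true_slope * sim$relative_month +
  rnorm(nrow(sim), sd = 2)

# Interaction model: a separate intercept and slope for each hammer type.
m <- lm(production ~ relative_month * type, data = sim)

# "old" is the reference level, so its slope is the relative_month coefficient;
# the interaction term is the extra slope for "new".
old_slope <- coef(m)[["relative_month"]]
new_slope <- old_slope + coef(m)[["relative_month:typenew"]]

print(c(old = old_slope, new = new_slope))
```

On the real data, summary(m) would additionally report whether the difference between the slopes is larger than the noise would explain.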

How much did the dwarfs lose?

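# Fit production against relative month for the old hammers
old_m <- lm(production ~ relative_month, old)
# Predict what the new-hammer dwarfs would have produced with old hammers
new$possible_production <- predict(old_m, new)
# Absolute loss
sum(new$possible_production) - sum(new$production)
# Loss relative to actual production
(sum(new$possible_production) - sum(new$production)) / sum(new$production)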

0.5%
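To make the arithmetic behind that percentage concrete, here is the same calculation on a handful of made-up numbers. The vectors actual and possible are hypothetical stand-ins for new$production and new$possible_production; they are not taken from the slides and will not reproduce the 0.5% figure.

```r
# Made-up numbers for illustration; they are not taken from the slides.
actual   <- c(98, 95, 93)  # what dwarfs with new hammers produced
possible <- c(99, 97, 96)  # what the old-hammer model predicts for the same months

# Gold that was not mined because of the new hammers.
absolute_loss <- sum(possible) - sum(actual)

# The same loss as a fraction of what was actually produced.
relative_loss <- absolute_loss / sum(actual)

cat(sprintf("lost %g units, %.1f%% of actual production\n",
            absolute_loss, 100 * relative_loss))
```

Note that the denominator is the actual production, so the result reads as "the dwarfs could have mined this much more than they did".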

Now, taking into account the price of a hammer, one can select the optimal strategy… but that’s another story…

Lessons learned …?

• Don’t trust the data blindly, ask questions

• Try to understand the underlying rules of the system

• Don’t be shy about trying various models

• If using R, go for ggplot2