Welcome (back) to IST 380 !
Today: the old and the new
the most traditional approach to modeling data
modeling trends from Twitter data
This picture may soon become part of
the OLD, if trends continue…
Assignments…
Homework #1 is complete! (2/5)
Getting started with R (tutorial + "quiz" + text)
Homework #2 is due tomorrow (2/12)
Pr #1: text, Chapters 6-9
Pr #2: Monty Hall challenge
Pr #3: writing a predictive model by hand…
Homework #3 is due next Tuesday (2/20)
Pr #1: text, Chapter 10
Pr #2: the envelope, please!
Pr #3: linear models for prediction
Things are heating up here!
Make sure you can submit to our submission site!
Zac & Suleng
The age of data?
I prefer my data well-aged!
R path!
Programming Skills
Subject Expertise
2
… R's toolset and its capabilities…
data collection
descriptive vs. generative vs. predictive statistics
predictions using linear regression
I predict we'll get here, but not necessarily in a straight line!…
3
1
Tweet "diffs" for a certain hashtag…
Chapter 10 introduces access to Twitter data and statistical descriptions using these data
Descriptive statistics: Twitter data
packages, library, lapply, order, diff
Some R: library
Once you have installed these packages
packages: bitops, RCurl, RJSONIO, twitteR
later: UsingR
You can ensure they're present with
library(bitops)
and so on…
Chapter 10 will have you write a function to automate this process…
Caution! Some of these may have to be installed by hand…
What if I don't have hands?!
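One way such an automating function might look is sketched below. The name ensure_package and its install_missing flag are assumptions made for this illustration, not the textbook's function; only base R calls are used.

```r
# Hypothetical helper (name and arguments are illustrative, not from the text):
# attach a package, optionally installing it first if it is missing.
ensure_package <- function(pkg, install_missing = FALSE) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    if (!install_missing) {
      return(FALSE)                 # report failure rather than install silently
    }
    install.packages(pkg)           # may still need to be done by hand
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
  TRUE
}

# Check the packages named on this slide without installing anything
wanted <- c("bitops", "RCurl", "RJSONIO", "twitteR")
present <- vapply(wanted, ensure_package, FUN.VALUE = logical(1))
print(present)
```

With install_missing left at FALSE the function only reports what is missing, which is the safer default on a shared machine.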
Some R: style…
I have NO COMMENT about this function!
better, but not ideal
use variables to hold intermediate values!
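The functions shown on these slides were images and are not recoverable, so here is a made-up example of the same advice: comments that say what each step does, and one named variable per intermediate value. The function name and arguments are invented for this sketch.

```r
# Convert a temperature anomaly, given in hundredths of a degree Celsius
# relative to a baseline, into an absolute temperature in Fahrenheit.
# (The function and its argument names are made up for this illustration.)
anomaly_to_fahrenheit <- function(anomaly_hundredths, baseline_c = 14) {
  # step 1: hundredths of a degree -> degrees Celsius
  anomaly_c <- anomaly_hundredths / 100
  # step 2: add the baseline to get an absolute temperature
  temperature_c <- baseline_c + anomaly_c
  # step 3: Celsius -> Fahrenheit
  temperature_f <- temperature_c * 9 / 5 + 32
  return(temperature_f)
}

print(anomaly_to_fahrenheit(0))   # the baseline itself, 14 C
```

The same computation fits on one line, but the named steps make each unit conversion easy to check on its own.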
Some R: lapply and vapply
lapply(X, FUN, ...)
vapply(X, FUN, FUN.VALUE, ...)
These allow you to apply a function to every element of a list or a vector:
> L <- list(8,9,10)
> lapply( L, add1 )
[[1]]
[1] 9

[[2]]
[1] 10

[[3]]
[1] 11

> V <- 8:10
> vapply( V, add1, FUN.VALUE=42 )
[1] 9 10 11
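The session above can be reproduced with the sketch below. The slide does not show how add1 is defined, so the one-line definition here is an assumption.

```r
# add1 is not defined on the slide; this one-line definition is an assumption
add1 <- function(x) {
  x + 1
}

L <- list(8, 9, 10)
as_list <- lapply(L, add1)        # always returns a list
str(as_list)

V <- 8:10
# FUN.VALUE is a template: each result must be a single number like 42
as_vector <- vapply(V, add1, FUN.VALUE = 42)
print(as_vector)

# sapply also simplifies to a vector, but without the type check
print(sapply(V, add1))
```

The value 42 itself is never used; only its type and length matter, so FUN.VALUE = numeric(1) would say the same thing.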
UTC?
coordinated universal time
Clock in Bristol, UK
since before the railroads…
red minute hand: Bristol
black minute hand: London (Greenwich)
Looking at the data…
UTC?
can be plotted as-is
take differences via as.numeric
- so that "2013-02-11 20:55:03 UTC"
becomes 1360616103
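A small sketch of that conversion, using only base R. The first timestamp is the one from the slide; the other two are made up so that there is something to take differences of.

```r
# timestamps as strings; the first one is the example from the slide,
# the other two are made up to have something to difference
stamps <- c("2013-02-11 20:55:03",
            "2013-02-11 20:55:10",
            "2013-02-11 20:56:03")

# parse as date-times in UTC
times <- as.POSIXct(stamps, tz = "UTC")

# seconds since 1970-01-01 00:00:00 UTC
seconds <- as.numeric(times)
print(seconds[1])            # 1360616103

# gaps between consecutive timestamps, in seconds
gaps <- diff(seconds)
print(gaps)                  # 7 53
```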
Some R: order and diff
order returns a permutation of its input…
order(..., na.last = TRUE, decreasing = FALSE)
> V <- c(3,4,2,1)
> V
[1] 3 4 2 1
> order(V)
[1] 4 3 1 2
> V[order(V)]
[1] 1 2 3 4
What do these numbers mean?
Why not just use sort?
You can, but this lets you order anything in the same way!
diff ?
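To make the two questions concrete, here is a short sketch. The tags vector is made up for the illustration; order and diff are base R.

```r
V <- c(3, 4, 2, 1)
idx <- order(V)       # idx[1] is the position of the smallest element of V
print(idx)            # 4 3 1 2
print(V[idx])         # 1 2 3 4, the same as sort(V)

# order can rearrange a *different* vector in the same way;
# these tags are made up for the illustration
tags <- c("c", "d", "b", "a")
print(tags[idx])      # "a" "b" "c" "d"

# diff returns the differences between consecutive elements
print(diff(c(1, 4, 9, 16)))   # 3 5 7
```

So the numbers returned by order are positions in the original vector, which is why the same index can sort any companion vector of equal length.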
Comparing tags?
#losangeles #sanfrancisco
Which is which?
Next week: we will
quantify these differences
more carefully…
Generative statistics: rgeom, runif, rnorm … sample, replicate
Chapter 7 reviews repeated sampling and the resulting distribution of means
distribution of samples of state populations
Monte Carlo method: run
a process many times to
gain insights into it…
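A minimal Monte Carlo sketch of repeated sampling follows. It assumes R's built-in state.x77 matrix (1975 state populations, in thousands) as a stand-in for the textbook's state population data; the sample size of 16 and the 1000 repetitions are arbitrary choices.

```r
set.seed(380)   # arbitrary seed so the run is repeatable

# 1975 state populations (in thousands) from R's built-in state.x77 matrix;
# a stand-in for the textbook's data set
pops <- state.x77[, "Population"]

# one trial: the mean of 16 states drawn at random (with replacement)
one_sample_mean <- function(n = 16) {
  mean(sample(pops, size = n, replace = TRUE))
}

# Monte Carlo: repeat the trial many times
sample_means <- replicate(1000, one_sample_mean())

print(mean(pops))            # population mean
print(mean(sample_means))    # mean of the sample means, should be close
print(summary(sample_means)) # spread of the sample means
```

Calling hist(sample_means) afterwards shows the distribution of means directly.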
Hw3 pr2: A second Monte Carlo example:
Both envelopes hold some positive amount of money (in a check or IOU), but one of these two envelopes holds twice as much money as the other.
Should you switch or stay?
Switch! But, then, should you switch back?
This week ~ write a
function to model this
process…
Hw3 pr2
Write a Mystery Envelope function:
ME_once <- function( amount_found=1.0, sors="switch", verbose=TRUE)
… that runs one envelope trial
… and returns the amount of $ "earned"
Another to run it N times:
ME_ntimes <- function( n=100 )
sample_ME <- function( run_me=100 )
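One possible shape for these functions is sketched below. The modeling choice is an assumption of this sketch, not something the assignment states: the unopened envelope is taken to hold half or double the amount found, each with probability 1/2. Returning both strategies' averages from ME_ntimes is also a choice made here.

```r
# One trial. Assumption (ours, not the assignment's): the unopened envelope
# holds half or double amount_found, each with probability 1/2.
ME_once <- function(amount_found = 1.0, sors = "switch", verbose = TRUE) {
  other_amount <- amount_found * sample(c(0.5, 2), size = 1)
  if (sors == "switch") {
    earned <- other_amount
  } else {
    earned <- amount_found
  }
  if (verbose) {
    cat("found", amount_found, "| other", other_amount,
        "|", sors, "| earned", earned, "\n")
  }
  return(earned)
}

# N trials of each strategy; returns the average amount earned by each
ME_ntimes <- function(n = 100) {
  switch_earned <- replicate(n, ME_once(sors = "switch", verbose = FALSE))
  stay_earned   <- replicate(n, ME_once(sors = "stay",   verbose = FALSE))
  c(switch = mean(switch_earned), stay = mean(stay_earned))
}

set.seed(380)
print(ME_ntimes(1000))
```

Under this particular model, switching tends to average more than staying, which is exactly what makes "should you switch back?" puzzling. Fixing the pair of amounts first and then picking an envelope at random is a different model and may behave differently.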
Big Ideas:
Predictive modeling
Linear regression
The human role… !
So, what is Machine Learning?
The goal of machine learning, also known as predictive statistics/analytics, is to find a function that yields outputs for previously-unseen inputs…
passenger details
function
prediction: did the passenger survive?
For Hw2, you are building this function by hand.
R is for Regression!
The oldest and (still) most popular technique for
automatically generating a model from data.
problem 3 this week…
Regression
What is it?
Regression ~ predictive modeling
this week: making an assumption of linear dependence on the inputs
But why is it called regression?
1877: "reversion" (peas)
1885: "regression" (people)
make this sum of squared errors (residuals) as
small as possible
Let's look at lm1
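The lm1 from the slides is not reproduced in this transcript, so the sketch below fits a model of the same name to R's built-in cars data as a stand-in. It checks that the fitted line has a smaller sum of squared residuals than nearby lines.

```r
# cars is a data set built into R: stopping distance (dist) vs. speed.
# It is a stand-in here; the slides' lm1 was fit to different data.
lm1 <- lm(dist ~ speed, data = cars)
print(coef(lm1))

# the quantity lm minimizes: the sum of squared residuals
sse <- function(intercept, slope) {
  predicted <- intercept + slope * cars$speed
  sum((cars$dist - predicted)^2)
}

b <- unname(coef(lm1))
best_sse <- sse(b[1], b[2])
print(best_sse)

# nudging the fitted line makes the sum larger
print(sse(b[1] + 1, b[2]) > best_sse)
print(sse(b[1], b[2] + 0.1) > best_sse)

# for one input, the slope has a closed form: cov(x, y) / var(x)
print(cov(cars$speed, cars$dist) / var(cars$speed))
```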
pr3 this week: temperatures…
Temperature anomalies
The data…
deviations from the 1950-1980 global average of 14°C ~ 57.2°F
averaged (worldwide) and presented in units of 0.01°C
Your task…
• follow an analysis plan similar to the Galton data in the previous slides
• fit a linear model to the yearly average data and to each month's average data
• use your model to predict what the average temperature will be for 2012 and 2013
• is the linear model a reasonable one?
• we'll check (or you can…) the prediction for 2012 (but not 2013, yet)
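The mechanics of the task might look like the sketch below. The data here are SIMULATED (a made-up trend plus noise), not the real temperature record, and the column names year and anomaly are assumptions; swap in the assignment's data to get meaningful predictions.

```r
set.seed(380)

# SIMULATED stand-in data, not the real temperature record:
# a made-up upward trend plus noise, in units of 0.01 degrees C
years   <- 1950:2011
anomaly <- 0.9 * (years - 1950) + rnorm(length(years), sd = 10)
temps   <- data.frame(year = years, anomaly = anomaly)

# fit anomaly as a linear function of year
fit <- lm(anomaly ~ year, data = temps)

# predict the anomaly for the two years asked about
new_years <- data.frame(year = c(2012, 2013))
predicted_anomaly <- predict(fit, newdata = new_years)

# convert to an absolute temperature: 14 C baseline, anomalies in 0.01 C
predicted_celsius <- 14 + predicted_anomaly / 100
print(data.frame(year = new_years$year, celsius = predicted_celsius))

# one check on "is the linear model reasonable?": look at the residuals
print(summary(resid(fit)))
```

Residuals that curve systematically over time, rather than scattering around zero, would suggest that a straight line is not a reasonable model.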
Try it!
Help is available either with hw#2 (Monty Hall and Titanic using R's functions)
or hw#3 (Twitter, envelopes, and temperatures)
this evening during lab time…
Good luck with everything this week!
Lab !
The Titanic
April 15, 1912
1502 out of the 2224 passengers and crew
died in the sinking
What characteristics did the survivors share?
The Data
There are 742 rows and 11 columns in the training data.
here are the 11 columns
Our goal
… is to write a function that takes in a row of new data and outputs whether that passenger would survive (1) or not (0).
A first predictor
A second predictor
Does the data match the famous emergency cry?
Testing our functions…
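The predictors on these slides are not in the transcript, so the two below are guesses at the idea, not the course's versions. The column names (sex, age, survived), the age cutoff of 16, and the four toy rows are all made up for illustration.

```r
# Hypothetical column names (sex, age); the real training data may differ.
# First predictor: predict that nobody survives (the majority outcome).
predict_none <- function(passenger) {
  0
}

# Second predictor: "women and children first" (age cutoff is an assumption)
predict_wcf <- function(passenger) {
  is_woman <- passenger$sex == "female"
  is_child <- !is.na(passenger$age) && passenger$age < 16
  if (is_woman || is_child) 1 else 0
}

# Score a predictor: the fraction of rows it gets right
accuracy <- function(predictor, data) {
  guesses <- vapply(seq_len(nrow(data)),
                    function(i) predictor(data[i, ]),
                    FUN.VALUE = numeric(1))
  mean(guesses == data$survived)
}

# Made-up rows for illustration only; these are not Titanic records
toy <- data.frame(sex      = c("male", "female", "male", "female"),
                  age      = c(40, 30, 8, NA),
                  survived = c(0, 1, 1, 0))
print(accuracy(predict_none, toy))
print(accuracy(predict_wcf, toy))
```

Scoring each predictor on rows it was not built from gives a fairer estimate of how it handles previously-unseen passengers.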
CS vs. IS and IT ?
www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf
[Figure axes: "smaller details, machine specifics" to "greater integration, system-wide issues"]
Where will IS go?
Where will IT go?
The bigger picture
Weeks 10-12: Objects
Week 10: classes vs. objects
Week 11: methods and data
Week 12: inheritance
Weeks 13-15: Final Projects
Week 13: final projects
Week 14: final projects
Week 15: final exam
Data?!
• Neighbor's name
• A place they consider home
• Are they working at a company now? Where?
• How many U.S. states have they visited?
• Their favorite unhealthy food… ?
• Do they have any "Data Science" (statistics, machine learning, CS) background?
state reminders…
Data!
• Neighbor's name: Zachary Dodds
• A place they consider home: Pittsburgh, PA
• Are they working at a company now? Where? Harvey Mudd
• How many U.S. states have they visited? 44
• Their favorite unhealthy food… ? M&Ms
• Do they have any "Data Science" (statistics, machine learning, CS) background? mostly CS for me…
be sure to set up your login + profile for the submission site…
This class is truly seminar-style:
we're developing expertise in this field together.