[Figure: datasets arranged by Number of Variables (A Few, Handful, Many) and by time dimension (One Time Snapshot, Many Time Stamps / Time Series). Datasets shown: Mobile Phone, Galton Height, Census, Titanic Survivors, Stock Market, Boston Housing, Old Faithful, Gene Array, Climate, MA Schools, Project Management.]
Our datasets have been relatively small, with no time variable.
Comparison of Datasets
                              Heights           Housing       Titanic
# of Variables                8                 15            11
# Variables Used              4                 10            11
# Numerical                   3                 8             6
# Categorical                 1                 2             5
# of Observations             934               506           891
Univariate                    Yes               Yes           Yes
Bivariate – Correlations      Yes               Yes           Yes
Missing Data                  No                No            Yes - Age
Variable Transformations      Yes - Galton      Potentially   Yes - Age
Data Relationships            Linear            Linear        Non-Linear, Partitioned
Regression?                   Yes               Yes           No
Factor Analysis Relevant?     No                Yes           No
Decision Trees Relevant?      No                Potentially   Yes
Cluster Analysis Relevant?    For 1 Variable    Yes           Potentially
Merge Analytic Models?        No                Potentially   Yes
We can apply similar basic stats to each of these datasets.
Depending on the type of data relationships, the newer techniques may (or may not) be applicable.
This slide deck walks through the Heights dataset.
Hypothesis Brainstorming
• Genetics
• Nutrition
• Immigration / Origins
• Disease
Hypotheses:
• Sons will be similar in height to their dads
• Daughters will be similar in height to their moms
Correlation Matrices
With Categorical Variable (Gender):
library(car)
scatterplotMatrix(heights)
Only Numerical Variables:
library(PerformanceAnalytics)
chart.Correlation(heights.num)
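The slides do not show how heights.num is derived from heights. A minimal base-R sketch, assuming heights.num is simply the numeric columns of heights. The data frame below is a synthetic stand-in (not the real Heights data), and its column names (father, mother, gender, childNum, child) are hypothetical.

```r
# Synthetic stand-in for the Heights data; NOT the real dataset.
# Column names are assumptions made for this illustration.
set.seed(1)
n <- 934
heights <- data.frame(
  father   = rnorm(n, 69, 2.5),
  mother   = rnorm(n, 64, 2.3),
  gender   = factor(sample(c("female", "male"), n, replace = TRUE)),
  childNum = sample(1:8, n, replace = TRUE)
)
heights$child <- 16.5 + 0.39 * heights$father + 0.32 * heights$mother +
  5.21 * (heights$gender == "male") + rnorm(n, sd = 2.2)

# Keep only the numeric columns, dropping the categorical gender column
heights.num <- heights[sapply(heights, is.numeric)]

# Correlation matrix and a base-R scatterplot matrix of the numeric columns
cor_mat <- cor(heights.num)
round(cor_mat, 2)
pairs(heights.num)
```

The same heights.num object could then be passed to chart.Correlation(), as on the slide.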
Comparing Regression Models
Variable      1      2      3      4      5      6      7
Father        0.39   -      -      -      0.36   0.39   0.39
Mom           -      0.31   -      -      0.29   0.32   0.32
Gender        -      -      5.13   -      -      5.21   5.20
ChildNum      -      -      -      -0.16  -      -      -0.04
Intercept     40.1   46.6   64.1   67.7   22.6   16.5   17.4
R-squared     0.07   0.04   0.51   0.02   0.10   0.635  0.636
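A comparison like this can be produced by fitting each model with lm() and collecting the R-squared values. The sketch below assumes the seven models are: each variable alone (1 to 4), Father + Mom (5), Father + Mom + Gender (6), and all four variables (7). It runs on a synthetic stand-in with hypothetical column names, so its numbers will not match the table.

```r
# Synthetic stand-in for the Heights data; NOT the real dataset.
set.seed(1)
n <- 934
heights <- data.frame(
  father   = rnorm(n, 69, 2.5),
  mother   = rnorm(n, 64, 2.3),
  gender   = factor(sample(c("female", "male"), n, replace = TRUE)),
  childNum = sample(1:8, n, replace = TRUE)
)
heights$child <- 16.5 + 0.39 * heights$father + 0.32 * heights$mother +
  5.21 * (heights$gender == "male") + rnorm(n, sd = 2.2)

# The seven candidate models, in the assumed column order of the table
formulas <- list(
  child ~ father,
  child ~ mother,
  child ~ gender,
  child ~ childNum,
  child ~ father + mother,
  child ~ father + mother + gender,
  child ~ father + mother + gender + childNum
)

# Fit each model and extract its R-squared
r2 <- sapply(formulas, function(f) summary(lm(f, data = heights))$r.squared)
data.frame(model = 1:7, r.squared = round(r2, 3))
```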
With 4 variables there are 15 possible combinations (2^4 − 1, excluding the empty model).
Fortunately there is an R package, leaps, that can help.
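The number of candidate models can be checked by enumerating them: each of the 4 variables is either in or out of a model, which gives 2^4 − 1 = 15 non-empty subsets. A short base-R sketch:

```r
vars <- c("Father", "Mom", "Gender", "ChildNum")

# All subsets of size 1, 2, 3 and 4, flattened into one list
subsets <- unlist(
  lapply(seq_along(vars), function(k) combn(vars, k, simplify = FALSE)),
  recursive = FALSE
)

length(subsets)          # total number of candidate models
table(lengths(subsets))  # number of models of each size
```

There are 4 one-variable models, 6 two-variable models, 4 three-variable models, and 1 four-variable model.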
LEAPS Package
Goes through different combinations of variables to find the best ones. Starts with 1 variable, then 2, and so on.
[Figure: leaps output grid, with R-Square on one axis and Variable on the other. If a cell is highlighted, the variable is in the model; if not highlighted, it is not in the model.]
The three-variable model appears to be the best trade-off between explanation and simplicity.
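The search described here can be sketched in base R, without the leaps package, as a brute-force loop: for each model size, fit every combination and keep the one with the highest R-squared. This is an illustration of the idea, not the algorithm leaps itself uses. The data is a synthetic stand-in with hypothetical column names.

```r
# Synthetic stand-in for the Heights data; NOT the real dataset.
set.seed(1)
n <- 934
heights <- data.frame(
  father   = rnorm(n, 69, 2.5),
  mother   = rnorm(n, 64, 2.3),
  gender   = factor(sample(c("female", "male"), n, replace = TRUE)),
  childNum = sample(1:8, n, replace = TRUE)
)
heights$child <- 16.5 + 0.39 * heights$father + 0.32 * heights$mother +
  5.21 * (heights$gender == "male") + rnorm(n, sd = 2.2)

predictors <- c("father", "mother", "gender", "childNum")

# For each model size k, fit all combinations and keep the best by R-squared
best <- do.call(rbind, lapply(seq_along(predictors), function(k) {
  combos <- combn(predictors, k, simplify = FALSE)
  r2 <- sapply(combos, function(v) {
    summary(lm(reformulate(v, response = "child"), data = heights))$r.squared
  })
  data.frame(
    size      = k,
    variables = paste(combos[[which.max(r2)]], collapse = " + "),
    r.squared = max(r2)
  )
}))
best
```

With leaps installed, leaps::regsubsets(child ~ ., data = heights) performs a comparable search more efficiently, and calling plot() on the result with scale = "r2" draws a highlighted grid like the one described on this slide.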
Height Dataset Summary
Question: What determines height?
Brainstorm & Lit Review: Genetics, Nutrition, Gender, Heels
Get & Prepare Data: Child's Height, Parents' Heights. Not able to get data for all our variables!
Statistics & DataViz: Child Height = 16.5 + 0.39 × Father's Height + 0.32 × Mother's Height; if male, add 5.21 inches.
Answer: Gender has the biggest effect. Parents' heights also influence a child's height.
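As a worked example, the fitted equation can be wrapped in a small function. The coefficients are those of the three-variable model in the Comparing Regression Models table (intercept 16.5, Father 0.39, Mom 0.32, Gender 5.21). The function name and arguments are hypothetical, and heights are in inches.

```r
# Predicted child height (inches) from the three-variable regression model.
# male: TRUE for a son, FALSE for a daughter.
predict_child_height <- function(father, mother, male) {
  16.5 + 0.39 * father + 0.32 * mother + 5.21 * as.numeric(male)
}

son      <- predict_child_height(father = 70, mother = 64, male = TRUE)
daughter <- predict_child_height(father = 70, mother = 64, male = FALSE)
c(son = son, daughter = daughter)
```

For the same parents, the two predictions differ by exactly the gender coefficient, 5.21 inches.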
[Figure: Types of Analysis plotted against Number of Variables Analyzed (1, 2, 3, 4, 5, 6+). Techniques shown: Histograms, Pivot Tables, Correlation Matrices, Regression, Factor Analysis, Cluster Analysis, Decision Trees. Labels: Applied Stats Class, Predictive Modeling Class, Additional Techniques.]
Factor Analysis on Height Dataset?
In this case, FA would not be of any help
There is little correlation among predictors
Linear Models vs. Decision Trees
Height variable relationships appear linear
Decision Trees Would Not Appear to Help
Decision Tree Output
• This model appears less accurate than the regression model because its output takes only a few discrete values (one per leaf)
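The discrete-output point can be seen by comparing how many distinct predictions each model produces. The sketch below uses the rpart package (an assumption; the slides do not say which tree package was used) on a synthetic stand-in with hypothetical column names. Because the synthetic data is generated from a linear equation, the comparison favors the linear model by construction, so the error figures only illustrate the mechanics.

```r
library(rpart)

# Synthetic stand-in for the Heights data; NOT the real dataset.
set.seed(1)
n <- 934
heights <- data.frame(
  father = rnorm(n, 69, 2.5),
  mother = rnorm(n, 64, 2.3),
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)
heights$child <- 16.5 + 0.39 * heights$father + 0.32 * heights$mother +
  5.21 * (heights$gender == "male") + rnorm(n, sd = 2.2)

# Fit a regression tree and a linear model on the same predictors
tree_fit <- rpart(child ~ father + mother + gender, data = heights)
lm_fit   <- lm(child ~ father + mother + gender, data = heights)

tree_pred <- predict(tree_fit, heights)
lm_pred   <- predict(lm_fit, heights)

# The tree predicts one constant per leaf; the linear model varies continuously
n_leaves <- sum(tree_fit$frame$var == "<leaf>")
c(leaves = n_leaves,
  distinct_tree = length(unique(tree_pred)),
  distinct_lm   = length(unique(lm_pred)))

# In-sample root mean squared error of each model
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
c(tree = rmse(heights$child, tree_pred), lm = rmse(heights$child, lm_pred))
```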