[Figure: datasets arranged by Number of Variables (A Few, Handful, Many) and Time Stamps (One Time Snapshot, Many / Time Series). Datasets shown: Mobile Phone, Galton Height, Census, Titanic Survivors, Stock Market, Boston Housing, Old Faithful, Gene Array, Climate, MA Schools, Project Management.]

Our datasets have been relatively small, with no time variable.

Comparison of Datasets

                              Heights          Housing        Titanic
# of Variables                8                15             11
# Variables Used              4                10             11
# Numerical                   3                8              6
# Categorical                 1                2              5
# of Observations             934              506            891
Univariate                    Yes              Yes            Yes
Bivariate – Correlations      Yes              Yes            Yes
Missing Data                  No               No             Yes - Age
Variable Transformations      Yes - Galton     Potentially    Yes - Age
Data Relationships            Linear           Linear         Non-Linear, Partitioned
Regression?                   Yes              Yes            No
Factor Analysis Relevant?     No               Yes            No
Decision Trees Relevant?      No               Potentially    Yes
Cluster Analysis Relevant?    For 1 Variable   Yes            Potentially
Merge Analytic Models?        No               Potentially    Yes

We can apply similar basic stats to each of these datasets.

Depending on the type of data relationships, the newer techniques may (or may not) be applicable.

This slide deck walks through the Heights dataset.

Research Question

What determines a person’s height?

• Genetics
• Nutrition
• Immigration / Origins
• Disease

Hypothesis Brainstorming

Hypotheses:

• Sons will be similar to their dad’s height

• Daughters will be similar to their mom’s height

Height Dataset Variables

heights <- read.csv("GaltonFamilies.csv")

Observations: 934

Variables: 8

We only need a subset of the data

Dataset Variables and Selection
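A minimal sketch of the selection step. The column names (`father`, `mother`, `gender`, `childHeight`, `family`) are assumptions, since the slides do not list them. The data frame is a small simulated stand-in with made-up values so the block runs without the CSV; with the real file, start from the `read.csv` line above instead.

```r
# Simulated stand-in for the heights data (made-up values, assumed column names).
set.seed(42)
n <- 50
heights <- data.frame(
  family      = factor(sample(1:20, n, replace = TRUE)),
  father      = round(rnorm(n, mean = 69, sd = 2.5), 1),
  mother      = round(rnorm(n, mean = 64, sd = 2.3), 1),
  gender      = factor(sample(c("female", "male"), n, replace = TRUE)),
  childHeight = round(rnorm(n, mean = 67, sd = 3.5), 1)
)

# Keep only the variables used in the analysis.
keep <- c("father", "mother", "gender", "childHeight")
heights.sub <- heights[, keep]

# Numeric-only subset, as needed for correlation charts.
heights.num <- heights.sub[, sapply(heights.sub, is.numeric)]

str(heights.sub)
```

Selecting columns by name rather than by position keeps the step readable and robust to column reordering in the file.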

Histograms

Heights of Father, Mother, and Child Appear Normal

Scatterplots

Child Height somewhat correlated to Father, Mother Heights

Correlation Matrices

With Categorical Variable (Gender):

library(car)
scatterplotMatrix(heights)

Only Numerical Variables:

library(PerformanceAnalytics)
# heights.num: the numerical columns of heights only
chart.Correlation(heights.num)

Children’s Height by Gender

Noticeable difference in heights between genders

Categorical: Box Plot
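The comparison behind the box plot can be sketched as follows. The data are simulated (made-up group means and spread) and the column names `gender` and `childHeight` are assumptions; the point is only to show how the per-gender summaries are obtained.

```r
# Simulated stand-in: heights in inches with a made-up gap between genders.
set.seed(1)
n <- 200
gender <- factor(sample(c("female", "male"), n, replace = TRUE))
childHeight <- ifelse(gender == "male", 69, 64) + rnorm(n, sd = 2.5)
heights <- data.frame(gender, childHeight)

# Mean height per gender and the gap between them.
means <- tapply(heights$childHeight, heights$gender, mean)
gap <- means[["male"]] - means[["female"]]

# Five-number summaries behind the box plot; drop plot = FALSE to draw it.
bp <- boxplot(childHeight ~ gender, data = heights, plot = FALSE)

print(round(means, 1))
print(bp$stats)
```

Each column of `bp$stats` holds the lower whisker, lower hinge, median, upper hinge, and upper whisker for one gender.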

Linear Regression Modeling

[Diagram: independent variables X1, X2, X3, X4 (the X’s) and the dependent variable Y, with two "delta" labels.]

Comparing Regression Models

Variable     1      2      3      4      5      6      7
Father       0.39                        0.36   0.39   0.39
Mom                 0.31                 0.29   0.32   0.32
Gender                     5.13                 5.21   5.20
ChildNum                          -0.16                -0.04
Intercept    40.1   46.6   64.1   67.7   22.6   16.5   17.4
R-squared    0.07   0.04   0.51   0.02   0.10   0.635  0.636
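A comparison like this one can be sketched with `lm()`. The data below are simulated with made-up coefficients, and the column names are assumptions, so the printed numbers will not match the slide; the sketch only shows how coefficients and R-squared are read off each fitted model.

```r
# Simulated stand-in for the heights data (made-up coefficients and noise).
set.seed(7)
n <- 300
heights <- data.frame(
  father = rnorm(n, 69, 2.5),
  mother = rnorm(n, 64, 2.3),
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)
# Simulated response: linear in the parents' heights plus a gender offset.
heights$childHeight <- 16 + 0.4 * heights$father + 0.3 * heights$mother +
  5 * (heights$gender == "male") + rnorm(n, sd = 2)

# Three nested models: one, two, and three predictors.
m.father  <- lm(childHeight ~ father, data = heights)
m.parents <- lm(childHeight ~ father + mother, data = heights)
m.three   <- lm(childHeight ~ father + mother + gender, data = heights)

# R-squared never decreases as predictors are added to a nested model.
r2 <- sapply(list(father = m.father, parents = m.parents, three = m.three),
             function(m) summary(m)$r.squared)

print(round(r2, 3))
print(round(coef(m.three), 2))
```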

With 4 variables there are 2^4 = 16 different combinations.

Fortunately there is an R library, LEAPS, that can help.

LEAPS Package

Goes through different combinations of variables to find the best ones. Starts with 1 variable, then 2, and so on.

[Plot: R-Square by Variable. If highlighted, then in model. If not highlighted, then not in model.]

The three-variable model appears to be the best trade-off between explanation and simplicity.
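The search can be illustrated without the package. The sketch below is a base-R stand-in, not the LEAPS implementation: it fits every non-empty subset of four predictors (15 models) and keeps the best R-squared for each model size. The data are simulated and the column names are assumptions. The `leaps` package's `regsubsets()` performs this kind of search far more efficiently.

```r
# Simulated stand-in for the heights data (made-up coefficients and noise).
set.seed(7)
n <- 300
heights <- data.frame(
  father   = rnorm(n, 69, 2.5),
  mother   = rnorm(n, 64, 2.3),
  gender   = factor(sample(c("female", "male"), n, replace = TRUE)),
  childNum = sample(1:8, n, replace = TRUE)
)
heights$childHeight <- 16 + 0.4 * heights$father + 0.3 * heights$mother +
  5 * (heights$gender == "male") + rnorm(n, sd = 2)

predictors <- c("father", "mother", "gender", "childNum")

# For each model size k, fit every k-variable model and keep the best R-squared.
best <- do.call(rbind, lapply(seq_along(predictors), function(k) {
  combos <- combn(predictors, k, simplify = FALSE)
  r2 <- sapply(combos, function(vars) {
    fit <- lm(reformulate(vars, response = "childHeight"), data = heights)
    summary(fit)$r.squared
  })
  i <- which.max(r2)
  data.frame(size = k,
             variables = paste(combos[[i]], collapse = " + "),
             r.squared = round(r2[i], 3))
}))

print(best)
```

Reading down the `r.squared` column shows where adding another variable stops paying off, which is the explanation-versus-simplicity trade-off.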

Height Dataset Summary

Process: Question, Brainstorm & Lit Review, Get & Prepare Data, Statistics & DataViz, Answer

Question: What determines height?

Brainstorm & Lit Review: Genetics, Nutrition, Gender, Heels

Get & Prepare Data: Child’s Height, Parents’ Heights. Not able to get data for all our variables!

Answer: Gender has the biggest effect. Parents’ heights also influence a child’s height.

Child Height = 16.5 + Father’s Height * 0.39 + Mother’s Height * 0.32; if male, add 5.21 inches.

[Figure: Types of Analysis and Additional Techniques, by Number of Variables Analyzed (1, 2, 3, 4, 5, 6+). Techniques shown: Histograms, Pivot Tables, Correlation Matrices, Regression, Factor Analysis, Cluster Analysis, Decision Trees. Groups shown: Applied Stats Class, Predictive Modeling Class.]

Factor Analysis on Height Dataset?

In this case, FA would not be of any help

There is little correlation among predictors
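One way to check this is to look at the pairwise correlations among the numeric predictors. The sketch below uses simulated, independently drawn predictors with assumed column names, so it only demonstrates the procedure, not the actual values for the heights data.

```r
# Simulated stand-in for the numeric predictors (made-up, independent draws).
set.seed(3)
n <- 300
predictors <- data.frame(
  father   = rnorm(n, 69, 2.5),
  mother   = rnorm(n, 64, 2.3),
  childNum = sample(1:8, n, replace = TRUE)
)

# Pairwise correlations among the predictors.
cors <- cor(predictors)
print(round(cors, 2))

# Largest absolute off-diagonal correlation; values near 0 leave little
# shared variance for factor analysis to combine.
max.offdiag <- max(abs(cors[upper.tri(cors)]))
print(round(max.offdiag, 2))
```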

• For Decision Tree Analysis what would be the First Variable?

Linear or Partitioned Data?

Linear Models vs. Decision Trees

Height variable relationships appear linear

Decision Trees Would Not Appear to Help

Decision Tree Output

• This model appears less accurate than the regression model, as its output is in discrete values

Would Cluster Analysis Be Helpful?

Cluster Analysis on Three Variables?

Results are not surprising, though it is unclear how to leverage them

Convert Continuous to Categorical

Height (continuous) -> S, M, L, XL (categorical)
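The conversion can be sketched with `cut()`. The heights and the size boundaries below are hypothetical values chosen for illustration; the slides do not give the cut points.

```r
# Hypothetical child heights in inches and hypothetical size boundaries.
child.height <- c(58.5, 62.0, 64.5, 66.0, 68.2, 70.0, 72.5, 75.0)
breaks <- c(-Inf, 63, 67, 71, Inf)

# cut() maps each continuous value to the interval (lower, upper] it falls in.
size <- cut(child.height, breaks = breaks, labels = c("S", "M", "L", "XL"))

print(data.frame(child.height, size))
print(table(size))
```

Fixed boundaries like these are chosen by hand; cluster analysis is one way to let the data suggest the groups instead.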

Cluster Analysis of Child Heights: 5 Clusters (S, M, L, XL, XXL)

[Figure: children’s heights for each of the cluster centers]
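A one-variable clustering like this can be sketched with `kmeans()` from base R. The heights below are simulated (made-up mean and spread), and the slides do not say which clustering method was used, so k-means is an assumption here.

```r
# Simulated child heights in inches (made-up mean and spread).
set.seed(11)
child.height <- rnorm(500, mean = 67, sd = 3.5)

# One-dimensional k-means with 5 clusters; nstart reruns from several random starts.
km <- kmeans(child.height, centers = 5, nstart = 20)

# Order the cluster centers from shortest to tallest and name them.
ord <- order(km$centers[, 1])
sizes <- c("S", "M", "L", "XL", "XXL")
centers <- setNames(km$centers[ord, 1], sizes)

# Relabel each observation with the size of its cluster.
size <- factor(sizes[match(km$cluster, ord)], levels = sizes)

print(round(centers, 1))
print(table(size))
```

k-means numbers its clusters arbitrarily, so sorting the centers before attaching the S to XXL labels is what makes the labels meaningful.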