Data Preparation and Preliminary Analysis

Page 1: Data Preparation and  Preliminary Analysis
Page 2: Data Preparation and  Preliminary Analysis

Data

• Once the data starts to flow, our attention turns to data analysis
  – Data preparation – includes editing, coding and data entry
  – Exploring, displaying and examining data – the search for meaningful patterns
  – Data mining – used to extract patterns and predictive trends from databases

Page 3: Data Preparation and  Preliminary Analysis

Data Editing

• Checking entries for correctness, consistency

• Coding – assigning numbers or other symbols to responses so they can be grouped into categories

• Data entry – spreadsheet, data editor of a statistical program or database
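
As a minimal sketch of coding, the snippet below assigns numeric codes to category labels before data entry. The labels and codes follow the market-sector frequency table later in this deck; the `code_responses` helper and the sample responses are hypothetical, and unrecognized answers are mapped to `None` so they can be flagged during editing.

```python
# Codebook: category label -> numeric code (labels and codes follow the
# market-sector frequency table in this deck).
SECTOR_CODES = {
    "Chemicals": 1, "Consumer Products": 2, "Durables": 3, "Energy": 4,
    "Financial": 5, "Health": 6, "High-Tech": 7, "Insurance": 8,
    "Retailing": 9, "Other": 10,
}


def code_responses(responses):
    """Return the numeric code for each response, or None if unrecognized."""
    return [SECTOR_CODES.get(r.strip()) for r in responses]


# Hypothetical raw entries, including one misspelling that needs editing.
raw = ["Energy", " Financial", "Helth", "Other"]
coded = code_responses(raw)
print(coded)
```

Entries that come back as `None` are candidates for the editing step rather than silent recoding.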

Page 4: Data Preparation and  Preliminary Analysis

Exploring Data

• You could move directly into the statistical analysis …

• When the study’s purpose is not the production of causal inferences, confirmatory data analysis is not required

• When it is, you should discover as much as possible about the data before selecting the appropriate means of confirmation

Page 5: Data Preparation and  Preliminary Analysis

Exploratory Data Analysis

• Set of techniques

• The flexibility to respond to the patterns revealed by successive iterations in the discovery process is an important attribute

• EDA can be compared to the role of police detectives and other investigators

• Confirmatory analysis can be compared to the role of the judge

• The former are involved in the search for clues; the latter is preoccupied with evaluating the strength of the evidence

Page 6: Data Preparation and  Preliminary Analysis

EDA

• Free to take many paths in revealing mysteries in the data

• Emphasizes visual representations and graphical techniques over summary statistics

• Summary statistics may obscure or conceal the underlying structure of the data

• When numerical summaries are used exclusively and accepted without visual inspection, the selection of confirmatory models may be based on flawed assumptions and may produce erroneous conclusions
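
To illustrate how summary statistics can hide structure, here is a small sketch with two made-up samples (not from the slides) that share the same mean and standard deviation but have very different shapes. The `text_plot` helper is a hypothetical name for a simple character-based display.

```python
from collections import Counter
from statistics import mean, pstdev

# Two illustrative (made-up) samples with identical mean and spread.
single_hump = [2, 4, 4, 4, 5, 5, 7, 9]
two_humps = [3, 3, 3, 3, 7, 7, 7, 7]


def text_plot(values):
    """Return one line per integer from min to max, with a '*' per observation."""
    counts = Counter(values)
    return [f"{v:>2} | {'*' * counts[v]}"
            for v in range(min(values), max(values) + 1)]


for name, sample in [("single_hump", single_hump), ("two_humps", two_humps)]:
    # The numerical summaries are the same for both samples...
    print(name, "mean =", mean(sample), "sd =", pstdev(sample))
    # ...but the display shows one hump versus two separated clusters.
    print("\n".join(text_plot(sample)))
```

Both samples report a mean of 5 and a population standard deviation of 2; only the display reveals that the second one has two separated clusters.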

Page 7: Data Preparation and  Preliminary Analysis

Techniques for Displaying Data

• Frequency Tables

• Bar Charts

• Pie Charts

Page 8: Data Preparation and  Preliminary Analysis

Frequency Tables

• Information
  – Displays the data from the lowest value to the highest
  – Columns for percent
  – Percent adjusted for missing values
  – Cumulative percent

Page 9: Data Preparation and  Preliminary Analysis

A Frequency Table for Market Sector

Value Label         Value   Frequency      %   Valid %   Cum. %
Chemicals               1          10   10.0      10.0     10.0
Consumer Products       2           8    8.0       8.0     18.0
Durables                3           7    7.0       7.0     25.0
Energy                  4          13   13.0      13.0     38.0
Financial               5          24   24.0      24.0     62.0
Health                  6           4    4.0       4.0     66.0
High-Tech               7          11   11.0      11.0     77.0
Insurance               8           6    6.0       6.0     83.0
Retailing               9           7    7.0       7.0     90.0
Other                  10          10   10.0      10.0    100.0
Total                             100  100.0     100.0

Valid Cases 100     Missing Cases 0
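
The columns of the table above can be reproduced with a short sketch. The counts are copied from the table; the `frequency_table` function and its `missing` parameter are hypothetical names, and "valid percent" is taken here to mean the percent computed after excluding missing cases.

```python
from itertools import accumulate

# Frequencies by market sector, copied from the table above.
sector_counts = {
    "Chemicals": 10, "Consumer Products": 8, "Durables": 7, "Energy": 13,
    "Financial": 24, "Health": 4, "High-Tech": 11, "Insurance": 6,
    "Retailing": 7, "Other": 10,
}
missing_cases = 0  # the table reports no missing cases


def frequency_table(counts, missing=0):
    """Return rows of (label, freq, percent, valid percent, cumulative percent)."""
    valid_total = sum(counts.values())
    total = valid_total + missing
    running = accumulate(counts.values())
    return [
        (label, freq,
         100 * freq / total,        # percent of all cases
         100 * freq / valid_total,  # percent adjusted for missing values
         100 * cum / valid_total)   # cumulative valid percent
        for (label, freq), cum in zip(counts.items(), running)
    ]


table = frequency_table(sector_counts, missing_cases)
for label, freq, pct, valid_pct, cum_pct in table:
    print(f"{label:<18}{freq:>4}{pct:>7.1f}{valid_pct:>7.1f}{cum_pct:>7.1f}")
```

With missing cases present, the percent and valid percent columns would diverge, which is why the table carries both.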

Page 10: Data Preparation and  Preliminary Analysis

Sector Bar Chart Display

[Figure: bar chart of frequency by market sector. Vertical axis: 0 to 30 in steps of 5. Horizontal axis: Chemicals, Consumer, Durables, Energy, Financial, Health, High-tech, Insurance, Retailing, Other.]

Page 11: Data Preparation and  Preliminary Analysis

Sector Pie Chart Display

[Figure: pie chart of market sectors. Slices: Chemicals, Consumer, Durables, Energy, Financial, Health, High-tech, Insurance, Retailing, Other.]

Page 12: Data Preparation and  Preliminary Analysis

Analysis

• The values and percentages are more readily understood in graphic format.

• The relative sizes of the sectors can be visualized with the bar and pie charts
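
As a rough sketch of why the graphic form reads more easily, the snippet below draws a character-based bar chart from the market-sector frequencies. The `bar_chart` helper is a hypothetical name; a plotting library would normally be used instead.

```python
# Sector frequencies from the market-sector frequency table.
sector_counts = {
    "Chemicals": 10, "Consumer Products": 8, "Durables": 7, "Energy": 13,
    "Financial": 24, "Health": 4, "High-Tech": 11, "Insurance": 6,
    "Retailing": 7, "Other": 10,
}


def bar_chart(counts, symbol="#"):
    """Return one text line per category, with bar length equal to the count."""
    width = max(len(label) for label in counts)
    return [f"{label:<{width}} | {symbol * n} {n}" for label, n in counts.items()]


chart = bar_chart(sector_counts)
print("\n".join(chart))
```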

Page 13: Data Preparation and  Preliminary Analysis

Another Frequency Table (Ratio-Interval Data)

Row   Value   Freq.    %   Cum. %
  1    54.9       1    2        2
  2    55.4       1    2        4
  3    55.6       1    2        6
  4    56.4       1    2        8
  5    56.8       1    2       10
  6    56.9       1    2       12
  7    57.8       1    2       14
  8    58.1       1    2       16
  9    58.2       1    2       18
 10    58.3       1    2       20
 11    58.5       1    2       22
 12    59.2       2    4       26
 13    61.5       1    2       28
 14    62.6       1    2       30
 15    64.8       1    2       32
 16    66.0       2    4       36
 17    66.3       1    2       38
 18    67.6       1    2       40
 19    69.1       1    2       42
 20    69.2       1    2       44
 21    70.5       1    2       46
 22    72.7       1    2       48
 23    72.9       1    2       50
 24    73.5       1    2       52
  …

Page 14: Data Preparation and  Preliminary Analysis

Interval-Ratio Data

• The last table was not informative

• Its primary contribution was an ordered list of values

• If converted to a bar chart, it would have 48 bars of equal length and two bars with two occurrences

• A pie chart would also be pointless

• Notice that when the variable of interest is measured on an interval-ratio scale and can take many potential values, these techniques are not particularly informative

Page 15: Data Preparation and  Preliminary Analysis

Histogram

• Conventional solution for display of interval-ratio data

• Group the variable’s values into intervals

• Useful for
  – Displaying all intervals in a distribution, even those without observed values
  – Examining the shape of the distribution for skewness, kurtosis and the modal pattern
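
The grouping step can be sketched as follows. This is a minimal illustration, assuming equal-width intervals of the form [lo, lo + width); the `histogram` function and the sample values are hypothetical. Empty intervals are kept so that gaps remain visible.

```python
import math


def histogram(values, width, start=None):
    """Count values in consecutive intervals [lo, lo + width), keeping empty ones."""
    if start is None:
        # Align the first interval to a multiple of the width.
        start = math.floor(min(values) / width) * width
    n_bins = int((max(values) - start) // width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // width)] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]


# Hypothetical sample with a gap and one straggling value.
sample = [54.9, 58.3, 61.5, 66.0, 72.7, 73.5, 88.2, 92.4, 101.8, 166.3]
bins = histogram(sample, width=20)
for lo, hi, count in bins:
    print(f"{lo:>4}-{hi:<4} {'*' * count}")
```

The two intervals with zero counts separate the straggling value from the central concentration, which a display that drops empty intervals would hide.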

Page 16: Data Preparation and  Preliminary Analysis

Histogram

• Questions to ask
  – Is there a single hump?
  – Are subgroups identifiable when multiple modes are present?
  – Are straggling data values detached from the central concentration?

Page 17: Data Preparation and  Preliminary Analysis

Histogram when grouping in increments of 20

[Figure: histogram with nine intervals, labeled 1 to 9 on the horizontal axis. Vertical axis: frequency, 0 to 25 in steps of 5.]

Page 18: Data Preparation and  Preliminary Analysis

Observations

• Intervals with 0 counts show gaps in the data and alert the analyst to look for problems with spread

• There are two extreme values

• Along with the peaked midpoint and reduced number of observations in the upper tail, this histogram warns us of irregularities in the data.

Page 19: Data Preparation and  Preliminary Analysis

Stem and Leaf Displays

• Closely related to the histogram

• Shares features but offers unique advantages

• Easy to construct by hand for small samples

• In contrast to histograms, which lose information by grouping values into intervals, actual data can be inspected directly

• Range of data is apparent at a glance

• Impressions of shape and spread are also immediate

Page 20: Data Preparation and  Preliminary Analysis

Stem and Leaf Displays

• To develop one, the first digits of each data item are arranged to the left of a vertical line.

• Each row is referred to as a stem, and each piece of information on it is called a leaf

Page 21: Data Preparation and  Preliminary Analysis

Example of a Stem and Leaf Display

 5 | 0 2 2 3 5 6 7 8
 6 | 4 5 5 6 6 6 7 8 8 8 8 9 9
 7 | 0 2 2 6 8
 8 |
 9 | 2 4
10 | 0 1 8
11 | 3
12 | 1
13 | 1
14 | 0 6
15 | 3
16 | 3 6
17 |
18 | 3
19 |
20 | 6
21 | 8
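
A minimal sketch of the construction, using the 26 observations listed in rows 1 to 24 of the ratio-interval frequency table earlier in the deck. The `stem_and_leaf` function is a hypothetical helper; it assumes the tens form the stem and the units digit forms the leaf, with decimals truncated.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return display lines with the tens as stem and the units digit as leaf.

    Decimals are truncated, so 54.9 becomes stem 5, leaf 4.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(int(v), 10)
        rows[stem].append(leaf)
    lines = []
    for stem in range(min(rows), max(rows) + 1):  # keep stems with no leaves
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines


# Observations from rows 1-24 of the ratio-interval frequency table
# (59.2 and 66.0 each occur twice).
values = [54.9, 55.4, 55.6, 56.4, 56.8, 56.9, 57.8, 58.1, 58.2, 58.3, 58.5,
          59.2, 59.2, 61.5, 62.6, 64.8, 66.0, 66.0, 66.3, 67.6, 69.1, 69.2,
          70.5, 72.7, 72.9, 73.5]
display = stem_and_leaf(values)
print("\n".join(display))
```

Unlike a histogram, each leaf still identifies an individual observation (to the truncated digit), so the row lengths give the shape while the digits preserve the data.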

Page 22: Data Preparation and  Preliminary Analysis

Boxplots

• Another technique for exploratory data analysis

• Boxplot reduces the detail of the stem-and-leaf display and provides a different visual image of the distribution’s location, spread, shape, tail length, and outliers

• Summary consists of the median, upper and lower quartiles, and the largest and smallest observations.

• The median and quartiles are used because they are particularly resistant statistics.

Page 23: Data Preparation and  Preliminary Analysis

Resistant Statistics

• Example: data set = [5,6,6,7,7,7,8,8,9]

• The mean is 7 and the sample standard deviation is 1.22

• Replace the 9 with 90 and the mean becomes 16 and the standard deviation 27.77.

• Changing only one of the nine values has disturbed the location and spread summaries to the point where they no longer represent the other eight values. Both mean and standard deviation are considered nonresistant statistics

• The median remained at 7 and the lower and upper quartiles stayed at 6 and 8, respectively.
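
The example can be checked with a short sketch using Python's standard `statistics` module. The `summarize` helper is a hypothetical name; the standard deviation is the sample version (n - 1 divisor), rounded to two decimals, and the quartiles use the module's default "exclusive" method, which is an assumption since the slides do not say how the quartiles were computed.

```python
from statistics import mean, median, quantiles, stdev

original = [5, 6, 6, 7, 7, 7, 8, 8, 9]
with_outlier = [5, 6, 6, 7, 7, 7, 8, 8, 90]  # the 9 replaced by 90


def summarize(values):
    """Return mean, sample standard deviation, median, and lower/upper quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # default "exclusive" method
    return {
        "mean": mean(values),
        "sd": round(stdev(values), 2),
        "median": median(values),
        "quartiles": (q1, q3),
    }


before = summarize(original)
after = summarize(with_outlier)
print(before)
print(after)
```

The mean and standard deviation shift sharply after the single replacement, while the median and quartiles are unchanged.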

Page 24: Data Preparation and  Preliminary Analysis

Boxplots

• Rectangular plot that encompasses 50 percent of the data values

• A center line (or other notation) marking the median and going through the width of the box

• The edges of the box are called hinges

• Whiskers extend from the right and left hinges to the largest and smallest values

Page 25: Data Preparation and  Preliminary Analysis

Boxplot Components

[Figure: annotated boxplot with the following labels]

• Median: line inside the box

• Box: 50% of observed values are within the box; its width is the IQR

• Whiskers: extend to the smallest observed value within 1.5 IQR of the lower hinge and the largest observed value within 1.5 IQR of the upper hinge

• Inner fences: lower hinge minus 1.5(IQR) and upper hinge plus 1.5(IQR)

• Outer fences: lower hinge minus 3(IQR) and upper hinge plus 3(IQR)

• Outside value or outlier: beyond an inner fence

• Extreme or far outside value: beyond an outer fence

Page 26: Data Preparation and  Preliminary Analysis

Example

Minimum = 54.9
Lower hinge = 60.3
Median = 74.55
Upper hinge = 111.52
Maximum = 218.2
IQR = 111.52 – 60.3 = 51.22
0.5(IQR) = 25.61
Lower inner fence = lower hinge – 1.5(IQR) = 60.3 – (51.22 + 25.61) = -16.53
Upper inner fence = upper hinge + 1.5(IQR) = 111.52 + (51.22 + 25.61) = 188.35

The smallest and largest values from the distribution within the fences are used to determine the whisker length
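
The fence arithmetic can be sketched in a few lines. The hinge values come from the example; the `fences` and `classify` functions are hypothetical helpers, and the outer fences use the 3(IQR) rule from the boxplot components slide.

```python
def fences(lower_hinge, upper_hinge):
    """Return the IQR plus inner (1.5 IQR) and outer (3 IQR) fence pairs."""
    iqr = upper_hinge - lower_hinge
    return {
        "iqr": iqr,
        "inner": (lower_hinge - 1.5 * iqr, upper_hinge + 1.5 * iqr),
        "outer": (lower_hinge - 3 * iqr, upper_hinge + 3 * iqr),
    }


def classify(value, f):
    """Label a value as inside the inner fences, outside, or extreme."""
    if not f["outer"][0] <= value <= f["outer"][1]:
        return "extreme"
    if not f["inner"][0] <= value <= f["inner"][1]:
        return "outside"
    return "inside"


f = fences(lower_hinge=60.3, upper_hinge=111.52)
print("IQR:", round(f["iqr"], 2))
print("Inner fences:", tuple(round(x, 2) for x in f["inner"]))
print("Outer fences:", tuple(round(x, 2) for x in f["outer"]))
print("Minimum 54.9 is", classify(54.9, f))
print("Maximum 218.2 is", classify(218.2, f))
```

Under these rules the maximum of 218.2 lies beyond the upper inner fence but within the upper outer fence, so it would be plotted as an outside value rather than an extreme one.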

Page 27: Data Preparation and  Preliminary Analysis

Observations

• In preliminary analysis, it is important to separate legitimate outliers from errors in measurement, editing, coding and data entry

• Outliers that are mistakes should be corrected or removed

Page 28: Data Preparation and  Preliminary Analysis

Other Observations

[Figure: example panels labeled Symmetric, Right Skewed, Left Skewed, and Small Spread.]

Page 29: Data Preparation and  Preliminary Analysis

Visual Techniques of EDA

• Gain insight into the data

• More common ways of summarizing location, spread, and shape

• Used resistant statistics

• From these we could make decisions on test selection and whether the data should be transformed or reexpressed before further analysis