Data Analysis

Dr. Steward doss

Data Analysis
Frequency tables: one-way, multiple tables, cross-tabulations
Various types of charts, diagrams, models
Graphical visualization of the data
Relationship analysis
Causal analysis
Hypothesis testing
Multivariate analysis

Prerequisites before the analysis:
Adequate sample size, including subgroups
Knowledge about the data and its characteristics
Skepticism and openness
Look at the visual representation of the data: shape and spread
Avoid extreme values or outliers
Missing value analysis
Validate with the study objectives

Types of Data
Data: measurement of one or more variables of a sample.
Nominal Data
Ordinal Data
Interval Data
Ratio Data

NOMINAL DATA :
Classifies the units of the sample into categories.

Labels or names that identify the categories.

They do not have any mathematical properties.

Mutually exclusive & labeled categories

Demographical or Qualitative Data

Ex: Names of Policies, Depts, Sex, Occupation, Religion, Race, etc.

ORDINAL DATA:
Measurements that enable the units to be ordered with respect to the variable of interest
These data can be quantified, but the values do not have mathematical properties
Values can be ordered from low to high, small to big (A > B)
Ex: Economic Status (Low/Medium/High), Ranks, Evaluative data, Comparative data, etc.

INTERVAL DATA:
It possesses the properties of magnitude and equal intervals between values.
It has the properties of nominal + ordinal + extra (equal intervals), but it does not have all the properties of ratio data (no true 0 value)
Most arithmetic & statistical functions (addition, subtraction, mean, standard deviation) can be performed; ratios are not meaningful without a true zero
Ex: Rating Scale, Agreement Scale, Attitude Scale, etc.

RATIO DATA:
Absolute Numbers, Quantitative Variables.
Highest level of measurement
It has the properties of nominal + ordinal + interval + extra (true 0 value)
Ex: Age, Income, Values, Numbers, etc.

Presentation of Data
Frequency distribution: tabular presentation of data
Steps:
Calculate the range
Determine the number of class intervals
Count the frequency of data in each class interval
Find the true limits of the class intervals: 0.5 below the lower limit and 0.5 above the upper limit
Compute relative and cumulative frequencies
Draw frequency curves: histogram, frequency polygons, etc.
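The steps above can be sketched in Python. This is only an illustration under stated assumptions: the data are integer-valued, the class width is rounded up so that every value falls in a class, and the function name `frequency_table` is hypothetical. The scores are borrowed from the mode example in this deck.

```python
import math

def frequency_table(values, num_classes):
    """Build a grouped frequency table for integer-valued data (illustrative)."""
    lo, hi = min(values), max(values)
    data_range = hi - lo
    # Assumption: width is (range + 1) / number of classes, rounded up,
    # so that the last class still contains the maximum value.
    width = math.ceil((data_range + 1) / num_classes)
    rows = []
    cumulative = 0
    for k in range(num_classes):
        lower = lo + k * width
        upper = lower + width - 1
        freq = sum(1 for v in values if lower <= v <= upper)
        cumulative += freq
        rows.append({
            "class": (lower, upper),
            "true_limits": (lower - 0.5, upper + 0.5),  # 0.5 beyond each limit
            "frequency": freq,
            "relative": freq / len(values),
            "cumulative": cumulative,
        })
    return rows

scores = [6, 11, 24, 43, 56, 60, 60, 60, 64, 75, 80, 80, 80, 80, 90, 95]
table = frequency_table(scores, num_classes=5)
for row in table:
    print(row)
```

The frequency column of such a table is what a histogram or frequency polygon plots.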

Review
Histogram: summarizes the whole data pictorially
Frequency distribution
Relative frequency distribution
Range
Width of a class interval = range / number of classes
Cumulative percentage

Measures of Central Tendency
Measures of average / representative score / center score around which other scores tend to cluster.
Mean, Median, Mode.
Selection of the appropriate measure depends on i) the type of data, and ii) the shape of the distribution.

MEAN
Arithmetic Mean = Simple Average

Advantages:
Best description of data
Easy & simple to calculate
Better indicator for comparison
Useful for further statistical calculations
Disadvantages:
It is affected by extreme values
It cannot be calculated for Nominal & Ordinal Data

Median:
The middle observation in a set of data when arranged in order.
It is the middle number if N is odd
It is the average of the two middle terms if N is even
For grouped data, the median is estimated from the class interval that contains the middle observation.

Advantages:
Better measure for skewed distributions
It is not affected by extreme values
It is a useful measure for Ordinal & Qualitative data
Disadvantages:
It is complex and time consuming
Less stable
Makes use of fewer data values
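For grouped data, the usual textbook estimate is Median = L + ((N/2 − F) / f) × W. Here L is the lower true limit of the median class, F the cumulative frequency below it, f its frequency and W its width. The sketch below assumes that formula; the function name `grouped_median` and the frequency table are hypothetical.

```python
def grouped_median(classes):
    """Estimate the median from (lower_true_limit, width, frequency) rows.

    Rows must be in ascending order of class interval.
    """
    n = sum(freq for _, _, freq in classes)
    if n == 0:
        raise ValueError("frequency table has no observations")
    half = n / 2
    cumulative = 0  # F: cumulative frequency below the current class
    for lower, width, freq in classes:
        if cumulative + freq >= half:
            # This is the median class: interpolate within it.
            return lower + (half - cumulative) / freq * width
        cumulative += freq

# Hypothetical table: (lower true limit, class width, frequency)
grouped = [(9.5, 10, 2), (19.5, 10, 5), (29.5, 10, 8), (39.5, 10, 5)]
median_estimate = grouped_median(grouped)
print(median_estimate)
```

With these numbers the middle observation (N/2 = 10) falls in the third class, so the estimate is interpolated inside that interval.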

MODE
It describes the value that occurs most frequently
Ex: 6, 11, 24, 43, 56, 60, 60, 60, 64, 75, 80, 80, 80, 80, 90, 95. Here the mode is 80.
It is a useful measure for Nominal Data
It is not affected by extreme values
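As a quick check of the three measures, Python's standard `statistics` module can be applied to the example scores. The added value of 1000 is a hypothetical outlier, included only to show how the mean reacts to extreme values while the median barely moves.

```python
import statistics

scores = [6, 11, 24, 43, 56, 60, 60, 60, 64, 75, 80, 80, 80, 80, 90, 95]

mean_score = statistics.mean(scores)      # sum of values / number of values
median_score = statistics.median(scores)  # average of the two middle values (N is even)
mode_score = statistics.mode(scores)      # most frequent value
print(mean_score, median_score, mode_score)

# Hypothetical extreme value: the mean shifts a lot, the median only slightly.
with_outlier = scores + [1000]
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```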

Dispersion
Range = Max − Min values
Quartiles: divide the distribution into 4 equal groups
Percentiles: divide the distribution into 100 equal groups (deciles divide it into 10)
Variance: average of the squared deviations from the mean
Standard Deviation: square root of the average of the squared deviations from the mean
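These measures can be computed with the standard `statistics` module. This is a minimal sketch that assumes the population definitions (dividing by N), which match the "average of squared deviations" wording above. The quartile cut points depend on the interpolation method; the module's default is used here.

```python
import statistics

scores = [6, 11, 24, 43, 56, 60, 60, 60, 64, 75, 80, 80, 80, 80, 90, 95]

value_range = max(scores) - min(scores)

# Three cut points that split the ordered data into 4 groups.
q1, q2, q3 = statistics.quantiles(scores, n=4)

# Population variance: average squared deviation from the mean.
variance = statistics.pvariance(scores)
# Standard deviation: square root of that variance.
std_dev = statistics.pstdev(scores)

print(value_range, (q1, q2, q3), variance, std_dev)
```

For a sample used to estimate a population, `statistics.variance` and `statistics.stdev` divide by N − 1 instead.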

Review
If the median of a distribution is 21 and the last value in the distribution is increased by 3, what would be the median value?
If the Mean is 24 and the Variance is 0, what does it mean?
If the mean is less than the mode, what kind of distribution does that describe?
What does the SD indicate?
If a person earns a score higher than 35 students in a total of 50 students, what percentile value does he get?

Estimation: Confidence Interval
Confidence interval
Sample size and confidence interval
Sampling error
Standard error
Estimation of confidence intervals: mean, proportion, variance, etc.
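One way to see how the standard error and sample size enter a confidence interval is the sketch below. It assumes a large-sample normal approximation (small samples would normally use the t distribution); the function name `mean_confidence_interval` and the sample values are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for a mean (normal approximation)."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean: sample SD / sqrt(n); shrinks as n grows.
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # z value that leaves (1 - confidence) / 2 in each tail, about 1.96 for 95%.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * std_err
    return mean - margin, mean + margin

# Hypothetical sample of 12 measurements.
sample = [52, 48, 55, 60, 47, 51, 58, 49, 53, 56, 50, 54]
low, high = mean_confidence_interval(sample)
print(low, high)
```

Raising the confidence level widens the interval; increasing the sample size narrows it.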

Correlation
It studies the relationship between any two variables
It is used to measure the strength of the relationship
Better indicator of the nature of the relationship
It is a useful tool for decision making

Various Types of Correlations
Positive Correlation
Ex: Education & Income, MediClaim & Age, Cattle Insurance & Bank Loan, etc
Negative Correlation
Ex: Supply & Price, Size of article vs. Theft, Tonnage of Ship vs. Cargo risk
Curvilinear Relationship
Ex: Motivation & Performance, Accident Claim & Age, etc

Scatter Plot
It indicates:
Linearity of the relationship
Existence of the relationship
Strength of the relationship
Nature of the relationship
Outliers in the distribution

Pearson's Correlation
It is used to quantify the relationship
It is a coefficient
It is symmetric
One of the most popular & reliable measures
Most appropriate measure for Interval / Ratio data
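The coefficient can be computed directly from its definition: the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The sketch below assumes that standard formula; the function name `pearson_r` and the education/income figures are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical data: years of education and income (in thousands).
education = [8, 10, 12, 14, 16, 18]
income = [12, 15, 21, 24, 30, 33]

r = pearson_r(education, income)
print(r, r ** 2)  # r squared: share of variation accounted for
```

Swapping the two arguments gives the same value, which is what "symmetric" means here.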

Review
What do the following values indicate?
r = −.80: the relationship is negative but strong
Is −.70 < .60? Higher absolute values indicate a stronger relationship irrespective of the sign, so −.70 is the stronger of the two
r = +1 or −1 indicates a perfect relationship
R² explains the proportion of variation accounted for by the relationship

Chi Square Test
Good measure for nominal variables
It tests only whether a relationship exists, not its strength
Assumptions:
Sample should be random and independent

Mutually exclusive and exhaustive categories

Expected frequencies must be larger than 5

Sample size should be large (n > 30)

Steps for Chi-square test:
Setting up of the Null Hypothesis
Calculation of expected frequencies: determining the cell frequencies expected if the null hypothesis is true
Calculation of the chi-square value from the observed and expected frequencies
Comparing the calculated value with the table value
Acceptance or rejection of the null hypothesis
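The steps can be traced on a small cross-tabulation. This sketch assumes the standard test of independence, where each expected frequency is (row total × column total) / N and the statistic is the sum of (O − E)² / E. The function name `chi_square` and the 2×2 table are hypothetical; 3.841 is the table value for 1 degree of freedom at the .05 level.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected frequency if the null hypothesis (no relationship) is true.
            e = row_totals[i] * col_totals[j] / n
            statistic += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

# Hypothetical 2 x 2 table of observed frequencies.
observed = [[30, 20],
            [20, 30]]
statistic, df = chi_square(observed)

critical_value = 3.841  # chi-square table value for df = 1, alpha = .05
reject_null = statistic > critical_value
print(statistic, df, reject_null)
```

Every expected frequency in this table is 25, so the "at least 5 per cell" style of assumption is comfortably met.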

REGRESSION
It is used to measure the linearity of the relationship
Useful for prediction
Useful for knowing the cause and effect
It gives the degree of impact of the independent variable on the dependent variable
It is robust against violations of certain assumptions

Assumptions:
No specification error:
The relationship should be linear
No relevant variables are excluded
No irrelevant variables are included

No measurement error:
The appropriate level of measurement
Reliable data

Error term:
Zero mean: E(e_i) = 0
The variance of the error term is constant
No autocorrelation of the error term
No correlation of the independent variables with the error term
The error term is normally distributed

Steps for calculation:
There must be some relationship between the variables
Define the dependent and independent variables
Calculate the Y intercept and the b coefficient using the regression equation Y = a + bx + e
Test for statistical significance

Regression Equation:
Y = a + bx + e
Where a = Y intercept: the average value of Y when x = 0

b = Slope or the expected change in Y with one unit change in x

e = error term: error made in predicting the value of Y from the given value of x
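A minimal sketch of estimating a and b by ordinary least squares follows. It assumes the usual estimates b = Sxy / Sxx and a = mean(Y) − b × mean(x); the function name `fit_line` and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for Y = a + b*x + e."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx            # slope: expected change in Y per unit change in x
    a = mean_y - b * mean_x    # intercept: predicted Y when x = 0
    return a, b

# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
# e: the error made in predicting each Y from its x.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(a, b, residuals)
```

With an intercept in the model, the residuals of a least-squares fit sum to zero, which corresponds to the zero-mean assumption on the error term.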

Multiple Regression
Used to measure the impact of more than one independent variable on the dependent variable
Helps to measure the effect of many variables
Used to develop a model
It can be used with both interval and qualitative variables

Analysis of Variance (ANOVA)
Useful measure to study the differences among more than two groups (Age, Sex, Policies, Offices, etc.)
Measures the degree of difference between the group means
Measures the influence of independent variables on the dependent variable
It identifies the most influential group among the subgroups
Used to test the significance of a model

Assumptions:
Observations (x's) in each of the J populations are:
(i) Independent
(ii) Normally distributed
(iii) Of equal variance

Formula:
F = MSB / MSW

Where MSB means Mean Square Between Groups

MSW relates to Mean Square Within Groups (Sample Error)

F is the ratio of MSB to MSW

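If F exceeds the critical (table) value for the chosen significance level, the Null Hypothesis is rejected; F > 1 alone is not sufficient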

P value indicates significance of F ratio

Steps for ANOVA:

State the hypothesis to be tested

Ex: H0: μ1 = μ2 = ... = μj

Specify the significance level (α = .05)

Compute the F ratio and find its p-value

Conclude that H0 is false if p ≤ α; and

if p > α, accept H0 (that is, fail to reject it)

ANOVA (continued)
Within-group variance: based on the squares of deviations of each score from its sample mean
Between-group variance: based on the squares of deviations of the k sample means from the overall mean
Total variance: based on the squares of deviations of each score from the overall mean
d/f for treatment (MST, between groups) = K − 1; d/f for error (MSE, within groups) = N − K
F = 0 if the sample group means are identical; when H0 is true, F is expected to be close to 1
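The decomposition can be followed step by step in a short sketch. It assumes a one-way design; the function name `one_way_anova` and the three groups are hypothetical. The p-value lookup is left out, so the resulting F would still have to be compared with a table value for (K − 1, N − K) degrees of freedom.

```python
def one_way_anova(groups):
    """Return (F ratio, df between, df within) for a list of groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Between groups: deviations of each group mean from the overall mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within groups: deviations of each score from its own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within, df_between, df_within

# Hypothetical scores for three groups.
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
f_ratio, df_between, df_within = one_way_anova(groups)
print(f_ratio, df_between, df_within)
```

Here the group means are 5, 7 and 9 while the scores within each group vary little, so the between-group mean square is large relative to the within-group mean square.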

Time Series
Time series: data gathered at regular intervals over a period of time
Components:
Secular trend
Seasonal variation
Cyclical variation
Irregular variation

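Arithmetic Mean = Sum of the Observed Values / Total No. of Observations

$$\bar{X} = \frac{\sum X}{N}$$

Median (grouped data):

$$\text{Median} = L_{md} + \left(\frac{\frac{N}{2} - F}{F_{md}}\right) W$$

where, in the usual notation, $L_{md}$ is the lower true limit of the median class, $F$ the cumulative frequency below the median class, $F_{md}$ the frequency of the median class, and $W$ the class width.

F ratio (ANOVA):

$$F = \frac{MS_B}{MS_W}$$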

Sampling Methods

PROBABILITY
Simple Random Sampling
Stratified Random Sampling
Cluster Sampling
Systematic Area Sampling

NON PROBABILITY
Quota Sampling
Judgemental Sampling
Convenience Sampling

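Expected frequencies (chi-square test):

$$E = \frac{R_t \times C_t}{N}$$

where $R_t$ is the row total, $C_t$ the column total, and $N$ the total number of observations.

Chi-square:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where $O$ is the observed frequency and $E$ the expected frequency of a cell.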
