DATA ANALYSIS
Dr. Steward doss
Data Analysis
Frequency tables: one-way, multiple tables, cross-tabulations
Various types of charts, diagrams, models
Graphical visualization of the data
Relationship analysis
Causal analysis
Hypothesis testing
Multivariate analysis

Prerequisites before the analysis:
Adequate sample size, including subgroups
Knowledge about the data and its characteristics
Skepticism and openness
Look at the visual representation of the data: shape and spread
Avoid extreme values or outliers
Missing value analysis
Validate with the study objectives

Types of Data
Data: measurement of one or more variables of a sample.
Nominal Data
Ordinal Data
Interval Data
Ratio Data

Nominal Data:
Classify the units of the sample into categories
Labels or names that identify the categories
They do not have any mathematical properties
Mutually exclusive and labeled categories
Demographic or qualitative data
Ex: Names of policies, departments, sex, occupation, religion, race, etc.

Ordinal Data:
Measurements that enable the units to be ordered with respect to the variable of interest
These data can be quantified, but the values do not have mathematical properties
Values can be ordered from low to high, small to big (A > B)
Ex: Economic status (Low/Medium/High), ranks, evaluative data, comparative data, etc.

Interval Data:
Possess the property of magnitude and equal intervals among the categories
Have the properties of nominal + ordinal data plus equal intervals, but not all the properties of ratio data (no true zero value)
Most arithmetic and statistical functions can be performed (ratios of values are not meaningful without a true zero)
Ex: Rating scale, agreement scale, attitude scale, etc.

Ratio Data:
Absolute numbers, quantitative variables
Highest level of measurement
Have the properties of nominal + ordinal + interval data plus a true zero value
Ex: Age, income, values, numbers, etc.

Presentation of Data
Frequency Distribution: tabular presentation of data
Steps:
Calculate the range
Determine the number of class intervals
Count the frequency of data in each class interval
True limits of class intervals: 0.5 below the lower limit and 0.5 above the upper limit of the class interval
Relative and cumulative frequencies
Frequency curves: histogram, frequency polygons, etc.

Review
Histogram: summarizes the whole data pictorially
Frequency distribution
Relative frequency distribution
Range
Width of a class interval = range / number of class intervals
Cumulative percentage

Measures of Central Tendency
Measures of the average / representative score / center score around which other scores tend to cluster.
Mean, Median, Mode.
Selection of the appropriate measure depends on i) the type of data, and ii) the shape of the distribution.

MEAN
Arithmetic Mean = Simple Average
Advantages:
Best description of data
Easy and simple to calculate
Better indicator for comparison
Useful for further statistical calculations
Disadvantages:
It is affected by extreme values
It cannot be calculated for nominal and ordinal data

Median:
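The middle observation in a set of data when arranged in order.
It is the middle number if N is odd
It is the average of the two middle terms if N is even
The formula for grouped data:
$$\text{Median} = L_{md} + \left(\frac{N/2 - F}{F_{md}}\right) W$$
where L_md is the true lower limit of the median class, N the total number of observations, F the cumulative frequency below the median class, F_md the frequency of the median class, and W the class width.
Disadvantages:
It is complex and time consuming
Less stable
Makes use of fewer data values

MODE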
It describes the value that occurs most frequently
Ex: 6, 11, 24, 43, 56, 60, 60, 60, 64, 75, 80, 80, 80, 80, 90, 95 (mode = 80)
It is a useful measure for nominal data
It is not affected by extreme values

Dispersion
Range = Max − Min values
Quartiles divide the distribution into 4 equal groups
Percentiles divide the distribution into 100 equal groups (deciles divide it into 10)
Variance: average of the squared deviations from the mean
Standard Deviation: square root of the average of the squared deviations from the mean

Review
If the median of a distribution is 21 and the last value in the distribution is increased by 3, what would be the median value?
If the mean is 24 and the variance is 0, what does it mean?
If the mean is less than the mode, what kind of distribution does it describe?
What does the SD indicate?
If a person earns a score higher than 35 students in a total of 50 students, what percentile value does he get?

Estimation: Confidence Interval
Confidence interval
Sample size and confidence interval
Sampling error
Standard error
Estimation of confidence intervals: mean, proportion, variance, etc.

Correlation
It studies the relationship between any two variables
It is used to measure the strength of the relationship
Better indicator of the nature of the relationship
It is a useful tool for decision making

Various Types of Correlations
Positive Correlation
Ex: Education & Income, MediClaim & Age, Cattle Insurance & Bank Loan, etc.
Negative Correlation
Ex: Supply & Price, Size of article vs. Theft, Tonnage of Ship vs. Cargo risk
Curvilinear Relationship
Ex: Motivation & Performance, Accident Claim & Age, etc.
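The three types of relationship can be told apart numerically with Pearson's correlation coefficient. The following is a minimal sketch: the function name `pearson_r` and all the numbers are made up for illustration and are not taken from the slides.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (assumed values, for illustration only)
education = [8, 10, 12, 14, 16]
income = [20, 24, 31, 38, 45]        # rises with education: positive
supply = [10, 20, 30, 40, 50]
price = [50, 42, 33, 27, 18]         # falls as supply rises: negative
motivation = [1, 2, 3, 4, 5]
performance = [1, 4, 5, 4, 1]        # inverted U: curvilinear

r_pos = pearson_r(education, income)
r_neg = pearson_r(supply, price)
r_curv = pearson_r(motivation, performance)
```

In this sketch `r_pos` is close to +1 and `r_neg` is close to −1, while `r_curv` is 0 even though the two variables are clearly related. A linear coefficient can miss a curvilinear relationship, which is one reason to inspect a scatter plot first.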
Scatter Plot
It indicates:
Linearity of the relationship
Existence of the relationship
Strength of the relationship
Nature of the relationship
Outliers in the distribution

Pearson's Correlation
It is used to quantify the relationship
It is a coefficient
It is symmetric
One of the most popular and reliable measures
Most appropriate measure for interval / ratio data

Review
What do the following values indicate?
r = −.80: the relationship is negative but strong
Is −.70 < .60? Higher absolute values indicate a stronger relationship irrespective of the sign, so −.70 is the stronger relationship
r = +1 or −1 indicates a perfect relationship
R² gives the proportion of the variation explained by the relationship

Chi Square Test
Good measure for nominal variables
It tests only whether a relationship exists, not its strength
Assumptions:
Sample should be random and independent
Mutually exclusive and exhaustive categories
Expected frequencies must be larger than 5
Sample size should be large (n > 30)
Steps for Chi-square test:
Setting up of the null hypothesis
Calculation of expected frequencies: determining the cell frequencies expected if the null hypothesis is true
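The steps above can be sketched in a few lines. This is an illustrative sketch only: the 2 x 2 table is invented, the function name `chi_square` is hypothetical, and the critical value 3.841 is the standard table value for 1 degree of freedom at the .05 level.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected cell frequency under H0: (row total x column total) / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

# Hypothetical observed counts (assumed for illustration)
observed = [[30, 20],
            [20, 30]]
chi2, df, expected = chi_square(observed)

CRITICAL_05_DF1 = 3.841  # table value: alpha = .05, 1 degree of freedom
reject_h0 = chi2 > CRITICAL_05_DF1
```

Every expected frequency here is 25, so the "larger than 5" assumption holds. The statistic works out to 4.0, which exceeds 3.841, so the null hypothesis of no relationship would be rejected at the .05 level.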
REGRESSION
It is used to measure the linearity of the relationship
Useful for prediction
Useful for studying cause and effect
It gives the degree of impact of the independent variable on the dependent variable
It is robust against violations of certain assumptions

Assumptions:
No specification error:
The relationship should be linear
No relevant variables are excluded
No irrelevant variables are included
No measurement error:
The appropriate level of measurement
Reliable data
Error term:
Zero mean: E(e_i) = 0
The variance of the error term is constant
No autocorrelation of the error term
No correlation of the independent variables with the error term
The error term is normally distributed

Steps for calculation:
Regression Equation:
Y = a + bx + e
Where a = Y intercept: the average value of Y when x = 0
b = Slope: the expected change in Y for a one-unit change in x
e = Error term: the error made in predicting the value of Y from the given value of x
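A least-squares fit of this equation can be sketched as follows. The slides do not state how a and b are estimated, so ordinary least squares is an assumption here, and the data and the function name `fit_line` are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares estimates of a (intercept) and b (slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # expected change in Y per one-unit change in x
    a = mean_y - b * mean_x    # value of Y when x = 0
    return a, b

# Hypothetical data (assumed for illustration)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
# e: the error made in predicting each Y from its x
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
```

With these numbers the fitted line is approximately Y = 0.05 + 1.99x, and the residuals sum to zero, which matches the zero-mean assumption on the error term.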
Multiple Regression
Used to measure the impact of more than one independent variable on the dependent variable
Helps to measure the effect of many variables
Used to develop a model
It can be used with both interval and qualitative variables

Analysis of Variance (ANOVA)
Useful measure to study the differences among groups (> 2), e.g., age, sex, policies, offices, etc.
Measures the degree of difference among group means
Measures the influence of independent variables on the dependent variable
It identifies the most influential group among the subgroups
Used to test the significance of a model
Assumptions:
Observations (x's) in each of the J populations are:
(i) Independent
(ii) Normally distributed
(iii) Have equal variances
F = MSB / MSW
Where MSB means Mean Square Between Groups
MSW relates to Mean Square Within Groups (Sample Error)
F is the ratio of MSB to MSW
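If F exceeds the critical value for the chosen significance level (equivalently, p ≤ α), the null hypothesis is rejected; F > 1 alone is not sufficient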
P value indicates significance of F ratio
Steps for ANOVA:
State the hypothesis to be tested
Ex: H0: μ1 = μ2 = ... = μj
Specify the significance level (α = .05)
Compute the F ratio and find its p-value
Conclude that H0 is false if p ≤ α; and
if p > α, accept (fail to reject) H0
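The F ratio in these steps can be computed directly from the group data. The sketch below uses three invented groups and a hypothetical function name `one_way_anova`. The Python standard library has no F distribution, so instead of a p-value the sketch compares F with the table critical value 5.14 for (2, 6) degrees of freedom at α = .05; a statistics package such as SciPy (`scipy.stats.f_oneway`) would report the p-value itself.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between groups: deviations of the group means from the overall mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within groups: deviations of each score from its own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    ms_between = ss_between / df_between   # MSB
    ms_within = ss_within / df_within      # MSW
    return ms_between / ms_within, df_between, df_within

# Hypothetical scores for three groups (assumed for illustration)
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio, df_b, df_w = one_way_anova(groups)

F_CRITICAL = 5.14  # table value: alpha = .05, df = (2, 6)
reject_h0 = f_ratio > F_CRITICAL
```

Here MSB = 13 and MSW = 1, so F = 13, which is well above 5.14, and H0 would be rejected.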
ANOVA (continued):
Within-group variance: squares of deviations of each score from its sample mean
Between-group variance: squares of deviations of the k sample means from the overall mean
Total variance: squares of deviations of each score from the overall mean
d/f for treatment (MST) = K − 1, d/f for error (MSE) = N − K
F = 0 if the sample group means are identical (when H0 is true, F is expected to be close to 1)

Time Series
Time series: data gathered at regular intervals over a period of time
Secular trend
Seasonal variation
Cyclical variation
Irregular variation
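One common way to separate the secular trend from seasonal variation is a moving average. The slides do not name a method, so this is only an assumed illustration: the quarterly figures are invented and the function name `moving_average` is hypothetical.

```python
def moving_average(series, window):
    """Simple (uncentered) moving average over consecutive windows."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical quarterly values over three years (assumed for illustration):
# a repeating within-year pattern on top of a steady upward trend
quarterly = [10, 14, 8, 12,
             12, 16, 10, 14,
             14, 18, 12, 16]

# A 4-quarter window spans one full seasonal cycle, so the seasonal
# ups and downs average out and the trend remains
trend = moving_average(quarterly, 4)
```

With this data the smoothed series rises steadily from 11.0 to 15.0 in steps of 0.5, while the raw values swing up and down within each year.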
Arithmetic Mean = Sum of the observed values / Total number of observations
$$\bar{X} = \frac{\sum X}{N}$$
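As a quick check, the mean and the other summary measures can be computed with Python's standard `statistics` module. This sketch uses the data set given in the mode example; the variable names are illustrative, and the variance is the population variance (average squared deviation from the mean).

```python
import statistics

# Data set from the mode example in the slides
scores = [6, 11, 24, 43, 56, 60, 60, 60, 64, 75, 80, 80, 80, 80, 90, 95]

mean = statistics.mean(scores)            # sum of values / number of values
median = statistics.median(scores)        # average of the two middle values (N is even)
mode = statistics.mode(scores)            # most frequent value
data_range = max(scores) - min(scores)    # Max - Min
variance = statistics.pvariance(scores)   # average squared deviation from the mean
std_dev = statistics.pstdev(scores)       # square root of the variance
```

For this data the mean is 60.25, the median is 62, the mode is 80 and the range is 89. The mean lies below the mode because the few low values (6, 11, 24) pull it down, which illustrates its sensitivity to extreme values.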
Median (grouped data):
$$\text{Median} = L_{md} + \left(\frac{N/2 - F}{F_{md}}\right) W$$
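The grouped-data median can be sketched as below. In the formula, L_md is the true lower limit of the median class, N the total number of observations, F the cumulative frequency below the median class, F_md the frequency of the median class, and W the class width. The frequency table and the function name `grouped_median` are hypothetical.

```python
def grouped_median(lower_limits, frequencies, width):
    """Median of grouped data from true lower class limits and frequencies."""
    n = sum(frequencies)
    cumulative = 0  # F: cumulative frequency below the current class
    for lower, freq in zip(lower_limits, frequencies):
        # The median class is the first class whose cumulative
        # frequency reaches N/2
        if freq > 0 and cumulative + freq >= n / 2:
            return lower + ((n / 2 - cumulative) / freq) * width
        cumulative += freq
    raise ValueError("no observations")

# Hypothetical classes 10-19, 20-29, 30-39, 40-49, 50-59 (assumed data),
# given by their true lower limits (0.5 below each stated lower limit)
true_lower_limits = [9.5, 19.5, 29.5, 39.5, 49.5]
frequencies = [5, 8, 12, 9, 6]

median = grouped_median(true_lower_limits, frequencies, width=10)
```

Here N = 40, so N/2 = 20 falls in the 30-39 class (F = 13, F_md = 12), giving 29.5 + (7/12) × 10, or about 35.33.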
F ratio:
$$F = \frac{MS_B}{MS_W}$$
Sampling Methods
PROBABILITY
Simple Random Sampling
Stratified Random Sampling
Cluster Sampling
Systematic Area Sampling
NON PROBABILITY
Quota Sampling
Judgemental Sampling
Convenience Sampling
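Three of the probability methods can be sketched with Python's standard `random` module. Everything here is assumed for illustration: a sampling frame of 100 unit IDs, a sample size of 10, and two made-up strata named "urban" and "rural".

```python
import random

rng = random.Random(42)            # fixed seed so the sketch is repeatable
population = list(range(1, 101))   # hypothetical sampling frame of 100 unit IDs
sample_size = 10

# Simple random sampling: every unit has the same chance of selection
simple = rng.sample(population, sample_size)

# Systematic sampling: random start, then every k-th unit
k = len(population) // sample_size
start = rng.randrange(k)
systematic = population[start::k]

# Stratified random sampling: sample each stratum in proportion to its size
strata = {"urban": population[:60], "rural": population[60:]}
stratified = {
    name: rng.sample(units, round(sample_size * len(units) / len(population)))
    for name, units in strata.items()
}
```

With proportional allocation the 60-unit stratum contributes 6 units and the 40-unit stratum contributes 4, so the sample mirrors the composition of the population.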
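Chi-square test:
$$E = \frac{R_t \times C_t}{N}$$
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where O is the observed frequency, E the expected frequency, R_t the row total, C_t the column total, and N the total number of observations.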