DATA ANALYSIS Dr. Steward doss

  • DATA ANALYSIS

    Dr. Steward doss

  • Data Analysis

    Frequency tables: one-way tables, multiple tables, cross-tabulations

    Various types of charts, diagrams, and models

    Graphical visualization of the data

    Relationship analysis

    Causal analysis

    Hypothesis testing

    Multivariate analysis

  • Prerequisites before the analysis:

    Adequate sample size, including subgroups

    Knowledge about the data and its characteristics

    Skepticism and openness

    Look at the visual representation of the data: shape and spread

    Watch for extreme values or outliers

    Missing value analysis

    Validate against the study objectives

  • Types of Data

    Data: measurement of one or more variables of a sample

    Nominal data

    Ordinal data

    Interval data

    Ratio data

  • NOMINAL DATA :

    Classify the units of the sample into categories,

    Labels or Names that identify the categories.

    They do not have any mathematical properties

    Mutually exclusive & labeled categories

    Demographic or qualitative data

    Ex: Names of Policies, Depts, Sex, Occupation, Religion, Race, etc.

  • ORDINAL DATA:

    Measurements that enable the units to be ordered with respect to the variable of interest

    These data can be quantified, but the values do not have mathematical properties

    Values can be ordered from low to high, small to big (A > B)

    Ex: Economic Status (Low/Medium/High), Ranks, Evaluative data, Comparative data, etc.

  • INTERVAL DATA:

    It possesses the properties of magnitude and equal intervals between the categories.

    It has the properties of nominal + ordinal data, plus equal intervals, but it does not have the true zero value of ratio data

    Addition, subtraction, and most statistical functions (mean, standard deviation) can be performed; ratios of values are not meaningful without a true zero

    Ex: Rating Scale, Agreement Scale, Attitude Scale, etc.

  • RATIO DATA:

    Absolute Numbers, Quantitative Variables.

    Highest level of measurement

    It has the properties of nominal + ordinal + interval data, plus a true zero value

    Ex: Age, Income, Values, Numbers, etc.

  • Presentation of Data

    Frequency distribution: tabular presentation of data

    Steps:

    Calculate the range

    Determine the number of class intervals

    Count the frequency of data in each class interval

    True limits of class intervals: 0.5 below the lower limit and 0.5 above the upper limit of the class interval

    Relative and cumulative frequencies

    Frequency curves: histogram, frequency polygons, etc.

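
The steps above can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name `frequency_table` is hypothetical, the scores are reused from the Mode slide, and the width is computed from the inclusive range (range + 1) so that the highest score falls inside the last class.

```python
import math

def frequency_table(values, n_classes):
    """Group integer scores into equal-width class intervals."""
    lo, hi = min(values), max(values)
    data_range = hi - lo
    # Assumption: use the inclusive range so the top score fits in the last class.
    width = math.ceil((data_range + 1) / n_classes)
    rows, cumulative = [], 0
    for i in range(n_classes):
        lower = lo + i * width
        upper = lower + width - 1
        freq = sum(lower <= v <= upper for v in values)
        cumulative += freq
        rows.append({
            "limits": (lower, upper),
            # True limits extend 0.5 beyond the stated limits on each side.
            "true_limits": (lower - 0.5, upper + 0.5),
            "freq": freq,
            "rel_freq": freq / len(values),
            "cum_freq": cumulative,
        })
    return rows

scores = [6, 11, 24, 43, 56, 60, 60, 60, 64, 75, 80, 80, 80, 80, 90, 95]
table = frequency_table(scores, n_classes=5)
for row in table:
    print(row)
```

Plotting the `freq` column against the class intervals gives the histogram; plotting `cum_freq` gives the cumulative frequency curve.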
  • Review

    Histogram: summarizes the whole data pictorially

    Frequency distribution

    Relative frequency distribution

    Range

    Width of a class interval = range / number of class intervals

    Cumulative percentage

  • Measures of Central Tendency

    Measures of the average or representative score: the center score around which other scores tend to cluster

    Mean, Median, Mode

    Selection of the appropriate measure depends on (i) the type of data and (ii) the shape of the distribution

  • MEAN

    Arithmetic Mean = Simple Average
  • Advantages:

    Best description of the data

    Easy and simple to calculate

    Better indicator for comparison

    Useful for further statistical calculations

    Disadvantages:

    It is affected by extreme values

    It cannot be calculated for nominal and ordinal data

  • Median:

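    The middle observation in a set of data when arranged in order

    It is the middle number if N is odd

    It is the average of the two middle terms if N is even

    Formula for grouped data: $\text{Median} = L_{md} + \left(\dfrac{N/2 - F}{F_{md}}\right) W$, where $L_{md}$ is the lower true limit of the median class, $N$ the total number of observations, $F$ the cumulative frequency below the median class, $F_{md}$ the frequency of the median class, and $W$ the class width

  • Advantages:

    Better measure for skewed distributions

    It is not affected by extreme values

    It is a useful measure for ordinal & qualitative data

    Disadvantages:

    It is complex and time-consuming

    Less stable

    Makes use of fewer data values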
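
As an illustration of the grouped-data median, here is a short sketch. It assumes the standard interpolation formula (lower true limit of the median class, plus the share of the class needed to reach N/2, times the class width). The function name `grouped_median` and the class frequencies are hypothetical.

```python
def grouped_median(classes):
    """classes: ordered list of (lower_true_limit, upper_true_limit, frequency)."""
    n = sum(freq for _, _, freq in classes)
    cumulative = 0  # cumulative frequency below the current class
    for lower, upper, freq in classes:
        # The median class is the first class whose cumulative frequency reaches N/2.
        if freq > 0 and cumulative + freq >= n / 2:
            width = upper - lower
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("empty distribution")

# Hypothetical grouped data: true limits and frequencies
classes = [(9.5, 19.5, 3), (19.5, 29.5, 5), (29.5, 39.5, 8), (39.5, 49.5, 4)]
median = grouped_median(classes)
print(median)
```

Here N = 20, so N/2 = 10 falls in the third class (cumulative frequency 8 below it), and the median is interpolated a quarter of the way into that class.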
  • MODE

    It describes the value that occurs most frequently

    Ex: 6, 11, 24, 43, 56, 60, 60, 60, 64, 75, 80, 80, 80, 80, 90, 95 (mode = 80)

    It is a useful measure for nominal data

    It is not affected by extreme values

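
The three measures can be compared on the example scores above using Python's standard `statistics` module. This is a small sketch; the added extreme value of 1000 is a made-up illustration of how the mean, unlike the median, reacts to outliers.

```python
import statistics

scores = [6, 11, 24, 43, 56, 60, 60, 60, 64, 75, 80, 80, 80, 80, 90, 95]

mean = statistics.mean(scores)      # sum of values / number of values
median = statistics.median(scores)  # average of the two middle values (N is even)
mode = statistics.mode(scores)      # most frequent value

print(mean, median, mode)

# Hypothetical outlier: the mean shifts strongly, the median barely moves.
with_outlier = scores + [1000]
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```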
  • Dispersion

    Range = maximum value − minimum value

    Quartiles: divide the distribution into 4 equal groups

    Percentiles: divide the distribution into 100 equal groups (deciles divide it into 10)

    Variance: average of the squared deviations from the mean

    Standard deviation: square root of the average of the squared deviations from the mean

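
A brief sketch of these dispersion measures with the standard `statistics` module, reusing the scores from the Mode slide. The population versions (`pvariance`, `pstdev`) are used because they divide by N, matching the "average squared deviation" definition; sample versions that divide by N − 1 also exist.

```python
import statistics

scores = [6, 11, 24, 43, 56, 60, 60, 60, 64, 75, 80, 80, 80, 80, 90, 95]

data_range = max(scores) - min(scores)

# Three cut points that split the ordered data into 4 equal groups.
q1, q2, q3 = statistics.quantiles(scores, n=4)

variance = statistics.pvariance(scores)  # average squared deviation from the mean
sd = statistics.pstdev(scores)           # square root of the variance

print(data_range, (q1, q2, q3), variance, sd)
```

The second quartile is the median, so `q2` agrees with the median of the same scores.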
  • Review

    If the median of a distribution is 21 and the last value in the distribution is increased by 3, what would the median be?

    If the mean is 24 and the variance is 0, what does that tell you?

    If the mean is less than the mode, what kind of distribution does that describe?

    What does the SD indicate?

    If a person earns a score higher than 35 students in a total of 50 students, what percentile value does he get?

  • Estimation: Confidence Interval

    Confidence interval

    Sample size and confidence interval

    Sampling error

    Standard error

    Estimation of confidence intervals: mean, proportion, variance, etc.

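
A confidence interval for a mean can be sketched as follows. This is an illustration under stated assumptions: it uses the normal (z) approximation, which suits large samples; for small samples a t-based interval is more appropriate. The function name `mean_confidence_interval` is hypothetical and the data are reused from the Mode slide.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)   # sample SD / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    centre = mean(sample)
    return centre - z * standard_error, centre + z * standard_error

scores = [6, 11, 24, 43, 56, 60, 60, 60, 64, 75, 80, 80, 80, 80, 90, 95]
low, high = mean_confidence_interval(scores)
print(low, high)
```

Because the standard error shrinks with the square root of the sample size, a larger sample gives a narrower interval at the same confidence level.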
  • Correlation

    It studies the relationship between any two variables

    It is used to measure the strength of the relationship

    Better indicator of the nature of the relationship

    It is a useful tool for decision making

  • Various Types of Correlations

    Positive Correlation

    Ex: Education & Income, MediClaim & Age, Cattle Insurance & Bank Loan, etc

    Negative Correlation

    Ex: Supply & Price, Size of article vs. Theft, Tonnage of Ship vs. Cargo risk

    Curvilinear Relationship

    Ex: Motivation & Performance, Accident Claim & Age, etc

  • Scatter Plot

    It indicates:

    Linearity of the relationship

    Existence of the relationship

    Strength of the relationship

    Nature of the relationship

    Outliers in the distribution

  • Pearson's Correlation

    It is used to quantify the relationship

    It is a coefficient

    It is symmetric

    One of the most popular & reliable measures

    Most appropriate measure for interval / ratio data

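
A minimal sketch of Pearson's coefficient computed from its definition (the sum of cross-products of deviations, divided by the square root of the product of the two sums of squared deviations). The function name `pearson_r` and the paired data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises exactly with x
y_down = [10, 8, 6, 4, 2]  # falls exactly with x

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(r_up, r_down)
```

Swapping the two arguments gives the same value, which is what "symmetric" means here.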
  • Review

    What do the following values indicate?

    r = -.80: the relationship is negative but strong

    -.70 vs. .60: the higher absolute value indicates the stronger relationship, irrespective of sign, so -.70 is the stronger of the two

    r = +1 or -1 indicates a perfect relationship

    R² gives the proportion of variation explained by the relationship

  • Chi Square Test

    Good measure for nominal variables

    It tests only whether a relationship exists, not its strength

    Assumptions:

    Sample should be random and independent

    Mutually exclusive and exhaustive categories

    Expected frequencies must be larger than 5

    Sample size should be large (n > 30)

  • Steps for Chi-square test:

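    Setting up the null hypothesis

    Calculation of expected frequencies: determining the cell frequencies expected if the null hypothesis is true, $E = \dfrac{R_t \times C_t}{N}$ (row total × column total / grand total)

  • Formula: $\chi^2 = \sum \dfrac{(O - E)^2}{E}$, where O is the observed and E the expected frequency of each cell

    Comparing the calculated value with the table value

    Acceptance or rejection of the null hypothesis
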
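
The chi-square steps can be illustrated on a small cross-tabulation. This is a sketch, not the author's worked example: the function name `chi_square` and the 2 × 2 counts are hypothetical. Expected frequencies are computed as row total × column total / grand total.

```python
def chi_square(observed):
    """Chi-square statistic, degrees of freedom and expected counts for a table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2, expected = 0.0, []
    for i, row in enumerate(observed):
        expected_row = []
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under H0
            expected_row.append(e)
            chi2 += (o - e) ** 2 / e
        expected.append(expected_row)
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df, expected

# Hypothetical counts: rows = two groups, columns = two response categories
observed = [[30, 20],
            [20, 30]]
chi2, df, expected = chi_square(observed)
print(chi2, df, expected)
```

With 1 degree of freedom the table value at the .05 level is 3.84, so a calculated value of 4.0 would lead to rejecting the null hypothesis of no relationship.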
  • REGRESSION

    It is used to measure the linearity of the relationship

    Useful for prediction

    Useful for studying cause and effect

    It gives the degree of impact of the independent variable on the dependent variable

    It is robust against violations of certain assumptions

  • Assumptions:

    No specification error: the relationship is linear, no relevant variables are excluded, and no irrelevant variables are included

    No measurement error: the appropriate level of measurement and reliable data

    Error term:

    Zero mean: $E(e_i) = 0$

    The variance of the error term is constant

    No autocorrelation of the error term

    No correlation of the independent variables with the error term

    The error term is normally distributed

  • Steps for calculation:

    There must be some relationship between the variables

    Define the dependent and independent variables

    Calculate the Y intercept (a) and the b coefficient using the regression equation Y = a + bx + e

    Test for statistical significance

  • Regression Equation:

    Y = a + bx + e

    where a = Y intercept, the average value of Y when x = 0

    b = Slope or the expected change in Y with one unit change in x

    e = error term: error made in predicting the value of Y from the given value of x
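
The coefficients a and b can be estimated by ordinary least squares. The sketch below is illustrative only: the function name `fit_line` and the data are hypothetical, and the slope is computed as the sum of cross-products of deviations divided by the sum of squared deviations of x.

```python
def fit_line(xs, ys):
    """Least-squares estimates of intercept a and slope b in Y = a + bx + e."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x  # the fitted line passes through (mean_x, mean_y)
    return a, b

# Hypothetical data lying exactly on Y = 1 + 2x
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
a, b = fit_line(x, y)

# e: the error made in predicting each Y from its x
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(a, b, residuals)
```

With real data the residuals are not all zero; their size and pattern are what the assumptions about the error term refer to.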

  • Multiple Regression

    Used to measure the impact of more than one independent variable on the dependent variable

    Helps to measure the effect of many variables

    Used to develop a model

    It can be used with both interval and qualitative variables

  • Analysis of Variance (ANOVA)

    Useful measure to study the differences among more than two groups (Age, Sex, Policies, Offices, etc.)

    Measures the degree of difference between group means

    Measures the influence of independent variables on the dependent variable

    It identifies the most influential group among the subgroups

    Used to test the significance of a model

  • Assumptions:

    Observations (x's) in each of the J populations are:

    (i) Independent

    (ii) Normally distributed

    (iii) Of equal variance

  • Formula:

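    F = MSB / MSW

    where MSB means Mean Square Between Groups

    MSW means Mean Square Within Groups (sampling error)

    F is the ratio of MSB to MSW

    If F exceeds the critical (table) value, the null hypothesis is rejected; an F close to 1 suggests no difference among the group means

    The p-value indicates the significance of the F ratio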

  • Steps for ANOVA:

    State the hypothesis to be tested

    Ex: $H_0: \mu_1 = \mu_2 = \dots = \mu_J$

    Specify the significance level ($\alpha = .05$)

    Compute the F ratio and find its p-value

    Conclude that $H_0$ is false if $p \le \alpha$; and

    if $p > \alpha$, accept (fail to reject) $H_0$

  • Steps for ANOVA (continued)

    Within-group variance: squares of deviations of each score from its sample mean

    Between-group variance: squares of deviations of the K sample means from the overall mean

    Total variance: squares of deviations of each score from the overall mean

    d.f. for treatment (MST, the between-groups mean square) = K − 1; d.f. for error (MSE, the within-groups mean square) = N − K

    F = 0 if there are no differences among the sample group means

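
The F ratio can be computed directly from the between-group and within-group sums of squares. This is a minimal sketch with a hypothetical function name (`one_way_anova`) and made-up scores for three groups. It returns F and the two degrees of freedom; the p-value would come from an F table or a statistics package.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    k = len(groups)                  # number of groups
    n = sum(len(g) for g in groups)  # total number of scores
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Between groups: deviations of group means from the overall mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within groups: deviations of each score from its own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, k - 1, n - k

# Hypothetical scores for three groups
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio, df_between, df_within = one_way_anova(groups)
print(f_ratio, df_between, df_within)
```

Here the group means (2, 3 and 6) differ much more than the scores vary inside each group, so the F ratio is large.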
  • Time Series

    Time series: data gathered at regular intervals over a period of time

    Secular trend

    Seasonal variation

    Cyclical variation

    Irregular variation

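
One common way to expose the secular trend is to smooth out the seasonal and irregular variation with a moving average. The slides do not name a method, so the technique, the function name `moving_average` and the quarterly figures below are all assumptions made for illustration.

```python
def moving_average(series, window):
    """Simple moving average: the mean of each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical quarterly figures with a seasonal pattern and an upward trend
quarterly = [10, 14, 8, 12, 12, 16, 10, 14]

# A window of 4 covers one full year, so the seasonal swings average out.
trend = moving_average(quarterly, window=4)
print(trend)
```

The smoothed values rise steadily even though the raw quarterly figures move up and down within each year.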
  • Arithmetic Mean

    Arithmetic Mean = Sum of the Observed Values / Total No. of Observations: $\bar{X} = \dfrac{\sum X}{N}$

  • Median

    $\text{Median} = L_{md} + \left(\dfrac{N/2 - F}{F_{md}}\right) W$

  • F

    $F = \dfrac{MS_B}{MS_W}$

  • Sampling Methods

    PROBABILITY: Simple Random Sampling, Stratified Random Sampling, Cluster Sampling, Systematic Area Sampling

    NON PROBABILITY: Quota Sampling, Judgemental Sampling, Convenience Sampling

  • Expected frequencies

    $E = \dfrac{R_t \times C_t}{N}$

  • Chi-square

    $\chi^2 = \sum \dfrac{(O - E)^2}{E}$