Sociology

Name of Paper: Methodology of Research in Sociology

Name of Module: Statistical procedures- Parametric and Non-parametric


Module Detail and its Structure

Subject Name: Sociology
Paper Name: Methodology of Research in Sociology
Module Name/Title: Statistical Procedure: Parametric and Non-parametric
Module Id: RMS 22
Pre-requisites: Some knowledge of descriptive and inferential statistics; knowledge of statistical monographs and numerical symbols; social scientific application of quantitative presentation.
Objectives: To introduce the learner to the use of the dual statistical method in sociological research. This includes an introduction to the basic concepts and strategies of the non-parametric and parametric methods, and their scope and limitations.
Keywords: Statistics; Parametric and Non-parametric Processes

Role in Content Development | Name | Affiliation
Principal Investigator | Prof. Sujata Patel | Dept. of Sociology, University of Hyderabad
Paper Co-ordinator | Prof. Biswajit Ghosh | Professor, Department of Sociology, The University of Burdwan, Burdwan 713104; Email: [email protected]; Ph. M +91 9002769014
Content Writer | Subhasis Bandyopadhyay | Assistant Professor, IIEST-S; Email: [email protected]; M: 9836945013
Content Writer | Biplab Nayak | Statistical Consultant and Freelance Analyst; Email: [email protected]; M: +91 9836364035
Content Reviewer (CR) & Language Editor (LE) | Prof. Biswajit Ghosh | Professor, Department of Sociology, The University of Burdwan, Burdwan

Contents

1. Introduction
2. Learning Outcome
3. Utility of Statistics in Social Research
4. Parametric tests and analogous Non-parametric procedures
5. Fundamentals of parametric tests
6. Standard Deviation, t-test of significance and ANOVA
   Self-Check Exercise 1
7. Fundamentals of Non-parametric Tests
   i) Goodness-of-Fit test based on Empirical Process: Two-sample Kolmogorov-Smirnov Tests (Non-parametric tests)
   ii) Chi-Square Test
   iii) The Two Sample Rank-Sum Test
   iv) The Kruskal-Wallis Test
   v) Spearman’s Rank Correlation
8. Why don’t we always use Non-parametric tests?
   Self-Check Exercise 2
9. Summary
   Notes
10. References

1. Introduction

Parametric and non-parametric are the two broad classes of statistical procedures. Statistics plays an important role in the social sciences because it is usually impossible to collect data from all individuals of interest (the population) in a given space and time. The only solution available to the social researcher is to collect data from a subset (a sample) of the individuals of interest. The real objective of the research, however, is to know the “truth” about the population.

Quantities such as means, standard deviations and proportions are all meaningful values and are generally called “parameters” when social scientists are talking about a population. Since researchers usually cannot get data from the whole population, they cannot know the exact values of the parameters for that population. Researchers can, however, calculate estimates of these quantities from their sample. When calculated from sample data, these quantities are called “statistics.”

When applying a statistical method, it is common to differentiate between quantitative and qualitative features and variables. Nominal and ordinal variables and data are usually considered qualitative (attributive), while interval and ratio variables are considered quantitative (Ferguson 1966, Krneta 1987). It is also common to apply non-parametric statistical methods to nominal and ordinal data, while parametric methods are used for interval and ratio data (Ferguson 1966).

A statistic estimates a parameter. Parametric statistical procedures rely on assumptions about the shape of the distribution in the underlying population (e.g., that it is normal) and about the form or parameters (e.g., means and standard deviations) of the assumed distribution. These tests generally assume that the data are normally distributed, are measured at the interval level or above, and have equal variances across groups. Inferential tests such as the t-test, z-test and ANOVA are parametric procedures. Parametric distributions are used as arguments to higher-level functions that compute probabilities, expectations, random variates, or parameter estimates from data. Distributions with undetermined parameters can be used throughout, and the parameters can later be solved for or optimized over (see Note 1).

Parametric statistics is a branch of statistics which assumes that the data have come from a type of probability distribution and makes inferences about the parameters of the distribution (Geisser and Johnson 2006).

There is no simple answer to the question of whether a given association between two variables is significant. In everyday terms, the question is whether the association between the two variables is strong, important and worth reporting. It is in this context that parametric statistics become relevant. As the name suggests, parametric statistics are those that make certain assumptions about the parameters describing the population from which the sample is selected. Here, however, the term “significance” does not imply “importance” in the everyday sense of the word. It refers only to the likelihood that the relationship observed in a sample could be attributed to sampling error alone (Babbie 2013: 470).

Non-parametric statistical procedures rely on no or few assumptions about the shape or parameters of the population distribution from which the sample was drawn. Non-parametric statistics does not assume normality of the data or any particular functional form for the distribution of the parent population to which a sample belongs. Because of this flexibility, a larger number of social conditions are amenable to non-parametric statistical treatment. Hence, in the numerical section, we will find various types of non-parametric treatment of behavioural and social situations.

2. Learning Outcome

This module will help the learner understand the use of the dual statistical methods in sociological research. This includes an introduction to the basic concepts and strategies of the non-parametric and parametric methods, and their scope and limitations.

3. Utility of Statistics in Social Research

Sociologists seek the help of statistical tools to study cultural change in society, family patterns and industrial systems, to name a few. They also study statistically the relation between income and education, occupational mobility and migration, and similar questions. Thus, statistics is of immense use in various sociological studies. In fact, research in most social science disciplines requires a large amount of quantitative data. Handling, managing and interpreting the quantitative data collected in the course of research very often necessitates statistical work. Hence, there is a strong case that sociology needs statistics.

Statistics and statistical methods have highly significant applications in sociology. The functions of statistics are numerous: the methods of descriptive statistics have an important application in describing natural phenomena; inferential statistics is used for inductive reasoning about the unknown properties of a larger group using the known indicators of a sample; hypothesis testing most frequently refers to the results of one, two or more samples, on the basis of which it is possible to draw conclusions on the research problem by accepting or rejecting an initial hypothesis; regression and correlation analysis, in the simplest case, examines the influence and dependence between two variables. If the relationship among a greater number of variables is examined, it is multiple regression and correlation.

Yet there are many limitations to the use of statistics in social science. First, statistical laws are true only on average. Statistics are aggregates of facts, so a single observation is not a statistic; statistics deals with groups and aggregates only. Second, statistical methods are best applied to quantitative data. Third, statistics cannot be applied to heterogeneous data. Fourth, if sufficient care is not exercised in collecting, analyzing and interpreting the data, statistical results might be misleading. Finally, some errors are possible in statistical decisions. Inferential statistics in particular involves a risk of error, and in any given decision we do not know whether an error has been committed or not.

4. Parametric tests and analogous Non-parametric procedures

Table 1: Difference between Parametric and Non-parametric procedures

Analysis Type | Example | Parametric Procedure | Non-parametric Procedure
Compare means between two distinct/independent groups | Is the mean value of the variable in a control group different from the mean for the same variable in an experimental group? | Two-sample t-test | Wilcoxon rank-sum test
Compare two quantitative measurements taken from the same individual | Is there a significant change in one parameter between baseline and the six-month follow-up measurement? | Paired t-test | Wilcoxon signed-rank test
Compare means between three or more distinct/independent groups | If the experiment has three groups, did the mean value of one important variable at baseline differ among the three groups? | Analysis of Variance (ANOVA) | Kruskal-Wallis test
Estimate the degree of association between two quantitative variables | Is one measured variable associated with some other variable within the group? | Pearson coefficient of correlation | Spearman’s rank correlation

5. Fundamentals of Parametric Tests

Generally we assume that, in a parametric condition, the population data have a normal distribution. Here is an example of a normal distribution curve, often called a bell curve.

[Figure: a normal distribution curve (bell curve)]

A normal distribution is a distribution that occurs naturally in many situations. For example, such a distribution is seen in tests like the CBSE and CAT. The bulk of students will score the average (C), while smaller numbers of students will score a B or D. An even smaller percentage of students will score an F or an A. Half of the data will fall to the left of the mean; the other half will fall to the right.

Many groups follow a normal distribution pattern. That is why it is widely used in business, statistics and government bodies like the Planning Commission. The following are some quantities that are often treated as following this pattern:

• Heights of people.
• Measurement errors.
• Blood pressure.
• Points on a test.
• IQ scores.
• Salaries.

The empirical rule (three-sigma rule) tells you what percentage of normally distributed data falls within a certain number of standard deviations of the mean:

• About 68% of the data falls within one standard deviation of the mean.
• About 95% of the data falls within two standard deviations of the mean.
• About 99.7% of the data falls within three standard deviations of the mean.
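The three percentages of the empirical rule can be checked numerically. The sketch below is only an illustration: it assumes an exactly normal distribution, and the function name `normal_coverage` is a hypothetical helper introduced here, not part of the module.

```python
import math

def normal_coverage(k: float) -> float:
    """Probability that a normally distributed value lies within
    k standard deviations of its mean (assumes exact normality)."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    # Prints 0.6827, 0.9545 and 0.9973 for k = 1, 2, 3.
    print(f"within {k} SD: {normal_coverage(k):.4f}")
```

Real social data are rarely exactly normal, so these figures should be read as approximations.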

6. Standard Deviation, t-test and ANOVA

i) The standard deviation (SD)

The SD controls the spread of the normal distribution. Standard deviation is a measure of dispersion in statistics. It shows how much the data are spread out around the mean or average. A smaller standard deviation means that the data are tightly clustered around the mean; the normal distribution will be taller and narrower. A larger standard deviation means that the data are more spread out around the mean; the normal distribution will be flatter and wider. For a population of N values x_1, ..., x_N with mean μ, the expression for the SD is

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
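As a minimal sketch of the population SD (divisor N), the following compares a tightly clustered data set with a widely spread one. The function name `population_sd` and both data sets are hypothetical and chosen only for illustration.

```python
import math

def population_sd(values):
    """Population standard deviation: square root of the mean
    squared deviation from the mean (divisor N, not N - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

tight = [58, 59, 60, 61, 62]   # hypothetical scores clustered around 60
wide = [40, 50, 60, 70, 80]    # hypothetical scores spread around 60

print(round(population_sd(tight), 2))  # 1.41
print(round(population_sd(wide), 2))   # 14.14
```

Both sets have the same mean (60), but the second has a standard deviation ten times larger, which corresponds to a flatter and wider curve.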

A normal distribution has the following characteristics:

• The mean, mode and median are all equal.
• The curve is symmetric about the centre.
• Exactly half of the values are to the left of the centre and exactly half are to the right.

ii) t-Test of significance:

Most t-test statistics have the form t = Z/s, where Z and s are functions of the data. Typically, Z is designed to be sensitive to the alternative hypothesis (i.e., its magnitude tends to be larger when the alternative hypothesis is true), whereas s is a scaling parameter that allows the distribution of t to be determined. As an example, in the one-sample t-test,

$$t = \frac{Z}{s} = \frac{(\bar{X} - \mu)/(\sigma/\sqrt{n})}{S/\sigma} = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

where $\bar{X}$ is the sample mean from a sample $X_1, X_2, \ldots, X_n$ of size $n$, $s = S/\sigma$ is the ratio of the sample standard deviation $S$ to the population standard deviation, $\sigma$ is the population standard deviation of the data, and $\mu$ is the population mean. The assumptions underlying a t-test are that

• $\bar{X}$ follows a normal distribution with mean $\mu$ and variance $\sigma^2/n$;
• $p\,s^2$ follows a χ² distribution with p degrees of freedom under the null hypothesis, where p is a positive constant (p = n − 1 in the one-sample test);
• Z and s are independent.
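The one-sample statistic can be sketched in a few lines. This is an illustration under stated assumptions: the function name `one_sample_t`, the sample of scores and the hypothesised mean of 50 are all hypothetical, and S is taken to be the sample standard deviation with divisor n − 1.

```python
import math

def one_sample_t(sample, mu0):
    """Return (t, degrees of freedom) for a one-sample t-test of
    the null hypothesis that the population mean equals mu0."""
    n = len(sample)
    mean = sum(sample) / n
    # Sample standard deviation S uses the n - 1 divisor.
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    t = (mean - mu0) / (s / math.sqrt(n))
    return t, n - 1

scores = [52, 55, 49, 58, 61, 53, 57, 50]  # hypothetical data
t, df = one_sample_t(scores, mu0=50)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

The resulting t value would then be compared with the critical value of the t distribution for the reported degrees of freedom.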

iii) Analysis of Variance (ANOVA):

ANOVA is a statistical test which analyzes variance. It is used to compare two or more means, which enables a researcher to draw conclusions and make predictions about two or more sets of data. The test includes one-way ANOVA, two-way ANOVA or multiple ANOVA, depending upon the type and arrangement of the data.

Like any other test, ANOVA has its own test statistic. The statistic for ANOVA is called the F statistic, which we get from the F test. The F statistic takes into consideration:

• the number of samples taken (I)
• the sample size of each sample (n1, n2, …, nI)
• the means of the samples (x̄1, x̄2, …, x̄I)
• the standard deviations of each sample (s1, s2, …, sI)

Example (see Note 2): The following data are given about the cricket teams of three countries:

Countries | Number of Players | Average Runs | Standard Deviation
India | 11 | 60 | 15
New Zealand | 11 | 50 | 10
South Africa | 11 | 70 | 12

Find the ANOVA F statistic.

Solution: Construct the following table:

Cricket Teams | n | x̄i | S | S²
India | 11 | 60 | 15 | 225
New Zealand | 11 | 50 | 10 | 100
South Africa | 11 | 70 | 12 | 144

n = 11 (players per team)
p = 3 (number of teams)
N = 33 (total number of players)
Grand mean x̄ = (60 + 50 + 70)/3 = 60

SST = ∑ n(x̄i − x̄)²
SST = 11(60 − 60)² + 11(50 − 60)² + 11(70 − 60)² = 2200

MST = SST/(p − 1)
MST = 2200/(3 − 1) = 1100

SSE = ∑ (n − 1)S²
SSE = 10 × 225 + 10 × 100 + 10 × 144 = 4690

MSE = SSE/(N − p)
MSE = 4690/(33 − 3) = 156.33

F = MST/MSE
F = 1100/156.33 = 7.036
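The same calculation can be reproduced from the summary figures alone. The sketch below assumes a one-way design in which only group sizes, means and standard deviations are known; the function name `anova_f_from_summary` is a hypothetical helper.

```python
def anova_f_from_summary(sizes, means, sds):
    """One-way ANOVA F statistic computed from group sizes,
    group means and group standard deviations."""
    p = len(sizes)                       # number of groups
    n_total = sum(sizes)                 # total observations N
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_total
    # Between-group (treatment) sum of squares.
    sst = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    # Within-group (error) sum of squares.
    sse = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds))
    mst = sst / (p - 1)
    mse = sse / (n_total - p)
    return mst / mse

# India, New Zealand, South Africa (figures from the example above).
f = anova_f_from_summary([11, 11, 11], [60, 50, 70], [15, 10, 12])
print(round(f, 3))  # 7.036
```

To reach a decision, this F value would be compared with the critical value of the F distribution with p − 1 = 2 and N − p = 30 degrees of freedom.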

Self-Check Exercise 1

Q 1. What is standard deviation?

Standard deviation is the amount of dispersion or variation in a set of data values, that is, the spread of the numbers in a data set around their mean. The symbol is s for a sample and σ for a population.

Q 2. What is ANOVA?

ANOVA stands for Analysis of Variance, a statistical test for differences in means among two or more groups on some variable.

Q 3. Define inferential statistics.

Inferential statistics is used for inductive reasoning about the unknown properties of a larger group using the known indicators of a sample.

7. Fundamentals of Non-parametric Tests

Some examples of non-parametric tests and their application to sociological cases are given below:

i) Goodness-of-Fit test based on Empirical Process: Two-sample Kolmogorov-Smirnov Tests (Non-parametric tests)

The Kolmogorov-Smirnov test (K-S test) is used to decide whether a sample comes from a population with a specific distribution (Chakravarty et al. 1967). The K-S test is based on the empirical cumulative distribution function (ECDF). Given N ordered data points Y1, Y2, ..., YN, the ECDF is defined as:

$$E_N = n(i)/N$$

where n(i) is the number of points less than Yi and the Yi are ordered from smallest to largest value. This is a step function that increases by 1/N at the value of each ordered data point. The graph below plots the empirical distribution function together with a normal cumulative distribution function for 100 normal random numbers. The K-S test is based on the maximum distance between these two curves.

[Figure: empirical distribution function plotted against the normal cumulative distribution function for 100 normal random numbers]

An attractive feature of this test is that the distribution of the K-S test statistic itself does not depend on the underlying cumulative distribution function being tested. Another advantage is that it is an exact test (the chi-square goodness-of-fit test depends on an adequate sample size for the approximations to be valid). Despite these advantages, the K-S test has several important limitations:

1. It only applies to continuous distributions.

2. It tends to be more sensitive near the centre of the distribution than at the tails.

3. Perhaps the most serious limitation is that the distribution must be fully specified. That

is, if location, scale, and shape parameters are estimated from the data, the critical region

of the K-S test is no longer valid. It typically must be determined by simulation.
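The maximum distance between the ECDF and a reference curve can be sketched as follows. This is an illustration only: the function names `normal_cdf` and `ks_statistic` and the sample of scores are hypothetical, and, in line with the third limitation above, the mean and standard deviation of the reference distribution are assumed to be fixed in advance rather than estimated from the sample.

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """Largest vertical distance between the ECDF of the sample
    and a fully specified reference CDF."""
    ys = sorted(sample)
    n = len(ys)
    d = 0.0
    for i, y in enumerate(ys):
        f = cdf(y)
        # The ECDF steps from i/n up to (i + 1)/n at y,
        # so both sides of the step must be checked.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

scores = [38, 44, 47, 50, 51, 53, 56, 59, 63, 70]  # hypothetical data
d = ks_statistic(scores, lambda x: normal_cdf(x, mean=50, sd=10))
print(f"D = {d:.3f}")
```

The statistic D would then be compared with a critical value from a K-S table for the given sample size.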

ii) Chi-Square Test:

An important non-parametric test often used in sociological analysis is the chi-square test. The chi-square distribution and the chi-square test are important in cases with multiple qualitative variables for which it is known or assumed that the variables are interrelated. The chi-square statistic is the sum, over all cells, of the squared difference between the observed and expected frequencies divided by the expected frequency.

The primary use of the chi-square test is to examine whether two variables are independent or not. What does it mean to be independent, in this sense? It means that the two factors are not related. Typically in social science research, we are interested in finding factors that are dependent upon each other: education and income, occupation and prestige, age and voting behaviour. By ruling out independence of the two variables, the chi-square test can be used to assess whether two variables are, in fact, dependent. More generally, we say that one variable is "not correlated with" or "independent of" the other if a change in one variable is not associated with a change in the other. If two variables are correlated, their values tend to move together, either in the same or in the opposite direction. Chi-square examines a special kind of association: that between two nominal variables.

In the following example, we'll use a chi-square test to determine whether there is a relationship between gender and getting in trouble at school (both nominal variables). Below is the table documenting the raw counts of boys and girls and their respective behaviour issues (or lack thereof):

Gender and Getting in Trouble at School

 | Got in Trouble | Did Not Get in Trouble | Total
Boys | 46 | 71 | 117
Girls | 37 | 83 | 120
Total | 83 | 154 | 237

To examine statistically whether boys got in trouble in school more often, we need to frame the question in terms of hypotheses. The null hypothesis is that the two variables are independent (i.e. no relationship or correlation) and the research hypothesis is that the two variables are related. In this case, the specific hypotheses are:

H0: There is no relationship between gender and getting in trouble at school.

H1: There is a relationship between gender and getting in trouble at school.

As is customary in the social sciences, we'll set our alpha level at 0.05.

Next we need to calculate the expected frequency for each cell. These values represent what we would expect to see if there really were no relationship between the two variables. We calculate the expected frequency for each cell by multiplying the row total by the column total and dividing by the total number of observations. To get the expected count for the upper left cell, we multiply the row total (117) by the column total (83) and divide by the total number of observations (237): (83 x 117)/237 = 40.97. If the two variables were independent, we would expect 40.97 boys to get in trouble. Or, to put it another way, if there were no relationship between the two variables, we would expect the students who got in trouble to be distributed across both genders in proportion to the number of boys and girls.

We do the same thing for the other three cells and end up with the following expected counts (in parentheses next to each raw count):

Gender and Getting in Trouble at School

 | Got in Trouble | Did Not Get in Trouble | Total
Boys | 46 (40.97) | 71 (76.03) | 117
Girls | 37 (42.03) | 83 (77.97) | 120
Total | 83 | 154 | 237

With these sets of figures, we calculate the chi-square statistic as follows:

For each cell, we square the difference between the observed frequency and the expected frequency, (observed frequency − expected frequency)², and divide that number by the expected frequency. Then we add all of the terms (there will be four, one for each cell) together, like so:

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(46 - 40.97)^2}{40.97} + \frac{(71 - 76.03)^2}{76.03} + \frac{(37 - 42.03)^2}{42.03} + \frac{(83 - 77.97)^2}{77.97}$$

After we have crunched all those numbers, we end up with an obtained statistic of 1.87. Please note that a chi-square statistic cannot be negative, because it is a sum of squared differences divided by positive expected frequencies. If your obtained statistic turns out to be negative, you might want to check your math. But before we can come to a conclusion, we need to find our critical statistic, which entails finding our degrees of freedom. In this case, the number of degrees of freedom is equal to the number of rows in the table minus one multiplied by the number of columns in the table minus one, or (r-1)(c-1).

In our case, we have (2-1)(2-1), or one degree of freedom.

We also need to reference our alpha, which we set at 0.05. From a chi-square table, the critical statistic for an alpha level of 0.05 and one degree of freedom is 3.841, which is larger than our obtained statistic of 1.87. Because the critical statistic is greater than our obtained statistic, we cannot reject our null hypothesis.
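The whole calculation for the gender and trouble table can be sketched in code. The function name `chi_square_independence` is a hypothetical helper; no continuity correction is applied, and the critical value 3.841 is the one quoted in the text for alpha = 0.05 and one degree of freedom.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a table of
    observed counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Rows: boys, girls. Columns: got in trouble, did not get in trouble.
stat, df = chi_square_independence([[46, 71], [37, 83]])
print(round(stat, 2), df)  # 1.87 1

CRITICAL_05_DF1 = 3.841    # critical value quoted in the text
print("reject H0" if stat > CRITICAL_05_DF1 else "cannot reject H0")
```

Because the function works from row and column totals, the same sketch applies to larger tables, provided the expected counts are not too small.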

iii) The Two Sample Rank-Sum Test

The two-sample rank-sum test is a non-parametric alternative to the two-sample t-test which is based solely on the order in which the observations from the two samples fall. We will use the following as a running example.

The logic underlying the two-sample rank-sum test is straightforward. The data consist of two independent samples which, under the null hypothesis, are drawn from identically distributed populations. Let x1, x2, . . . , xn denote the first random sample of size n and let y1, y2, . . . , ym denote the second random sample of size m. Assign the ranks 1 to n + m to the combined observations from smallest to largest without regard to sample membership and let Rk denote the rank assigned to the k-th observation, for k = 1, . . . , n + m. Let Tx and Ty denote the sums of the ranks from the first and second samples, respectively, and let T = Tx. Finally, note that Tx + Ty = (n + m)(n + m + 1)/2. The null hypothesis simply states that each of the possible assignments of the n + m observations to the two samples, with n values in the first sample and m values in the second sample, occurs with equal probability. The exact lower (upper) one-sided probability value of an observed value of T, T0, is the proportion of all possible T values less (greater) than or equal to T0.
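For small samples the exact probability values described above can be obtained by listing every possible assignment of ranks. The sketch below assumes there are no tied observations and that both samples are given as lists; the function name `rank_sum_exact` and the data are hypothetical.

```python
from itertools import combinations

def rank_sum_exact(x, y):
    """Observed rank sum T0 of sample x and its exact lower and
    upper one-sided probability values (assumes no ties)."""
    combined = sorted(x + y)
    ranks = {value: rank for rank, value in enumerate(combined, start=1)}
    n = len(x)
    t0 = sum(ranks[v] for v in x)
    # Under H0 every choice of n ranks out of n + m is equally likely.
    all_t = [sum(c) for c in combinations(range(1, len(combined) + 1), n)]
    lower = sum(t <= t0 for t in all_t) / len(all_t)
    upper = sum(t >= t0 for t in all_t) / len(all_t)
    return t0, lower, upper

# Hypothetical scores: every value in x is below every value in y.
t0, lower, upper = rank_sum_exact([12, 15, 17], [21, 24, 30])
print(t0, lower, upper)  # 6 0.05 1.0
```

With three observations in each sample there are 20 possible assignments, so the smallest attainable one-sided probability value is 1/20 = 0.05. Enumeration becomes impractical for large samples, where a normal approximation is commonly used instead.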

iv) The Kruskal-Wallis Test

This test was developed jointly by Kruskal and Wallis (1952) and is named after them. The Kruskal-Wallis test is a non-parametric (distribution-free) test, used when the assumptions of ANOVA are not met. Both tests assess significant differences on a continuous dependent variable across the levels of a grouping independent variable (with three or more groups). In ANOVA, we assume that the scores in each group are normally distributed and that the variance of the scores is approximately equal across groups. The Kruskal-Wallis test makes neither of these assumptions. Like most non-parametric tests, the Kruskal-Wallis test is less powerful than ANOVA when the assumptions of ANOVA hold. The following account has been taken from www.statisticssolutions.com.

Null hypothesis: The samples are from identical populations.

Alternative hypothesis: The samples come from different populations.

Questions like the following are answered:

• How do test scores differ between the different grade levels in elementary school?
• Do marketing scores differ between the different grade levels in elementary school?

Procedure:

1. Arrange the data of all samples in a single series in ascending order.
2. Assign ranks to them in ascending order. In the case of a repeated value, or a tie, assign ranks by averaging their rank positions.
3. Then sum up the ranks for each of the different groups, e.g. T1, T2, T3, ....
4. To calculate the value, apply the following formula:

$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{T_i^2}{n_i} - 3(N+1)$$

Where,

H = Kruskal-Wallis test statistic
N = total number of observations in all samples
k = number of samples
n_i = number of observations in sample i
T_i = sum of the ranks assigned to sample i

The Kruskal-Wallis test statistic approximately follows a chi-square distribution with k-1 degrees of freedom, where each n_i should be greater than 5. If the calculated value of the Kruskal-Wallis test is less than the critical chi-square value, then the null hypothesis cannot be rejected. If the calculated value of the Kruskal-Wallis test is greater than the critical chi-square value, then we can reject the null hypothesis and say that at least one sample comes from a different population.

Assumptions:

1. We assume that the samples drawn from the population are random.

2. We also assume that the cases of each group are independent.

3. The measurement scale should be at least ordinal.
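The four steps of the procedure can be sketched as follows. This is a minimal illustration: the function names `average_ranks` and `kruskal_wallis_h` are hypothetical, ties receive averaged ranks as in step 2, no further tie correction is applied, and the tiny data set is below the recommended group size of more than 5 and serves only to show the arithmetic.

```python
def average_ranks(values):
    """Ranks in ascending order; tied values share the average
    of the rank positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of samples."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total = 0.0
    start = 0
    for g in groups:
        t_i = sum(ranks[start:start + len(g)])  # rank sum of this group
        total += t_i ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)

# Hypothetical scores for three groups.
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h, 2))  # 7.2
```

The value of H would then be compared with the critical chi-square value for k − 1 degrees of freedom.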

v) Spearman’s Rank Correlation

The following description has been taken largely from www.statisticssolutions.com. The Spearman Correlation Coefficient is also referred to as Spearman Rank Correlation or Spearman's rho. It is typically denoted either with the Greek letter rho (ρ) or rs. It is one of the few cases where a Greek letter denotes a value of a sample and not the characteristic of the general population. Like all correlation coefficients, Spearman's rho measures the strength of association between two variables. As such, the Spearman Correlation Coefficient is a close sibling of Pearson's Bivariate Correlation Coefficient, the Point-Biserial Correlation and the Canonical Correlation.

All correlation analyses express the strength of linkage or co-occurrence between two variables in a single value between -1 and +1. This value is called the correlation coefficient. A positive correlation coefficient indicates a positive relationship between the two variables (the larger A, the larger B), while a negative correlation coefficient expresses a negative relationship (the larger A, the smaller B). A correlation coefficient of 0 indicates that no linear relationship between the variables exists. Correlations are, however, limited to linear relationships between variables: even if the correlation coefficient is zero, a non-linear relationship might exist.

Compared to Pearson's bivariate correlation coefficient, the Spearman Correlation does not require continuous-level data (interval or ratio), because it uses ranks instead of assumptions about the distributions of the two variables. This allows us to analyze the association between variables of ordinal measurement levels. Moreover, the Spearman Correlation is a non-parametric test, which does not assume that the variables approximate a multivariate normal distribution. Spearman Correlation Analysis can therefore be used in many cases where the assumptions of Pearson's Bivariate Correlation (continuous-level variables, linearity, and multivariate normal distribution of the variables to test for significance) are not met.

Typical questions the Spearman Correlation Analysis answers are as follows:

Sociology: Do people with a higher level of education have a stronger opinion of whether

or not tax reforms are needed?

Medicine: Does the number of symptoms a patient has indicate a higher severity of

illness?

Biology: Is mating choice influenced by body size in bird species A?

Business: Are consumers more satisfied with products that are higher ranked in quality?

Theoretically, the Spearman correlation calculates the Pearson correlation for variables that are converted to ranks. Similar to Pearson's bivariate correlation, the Spearman correlation also tests the null hypothesis of independence between two variables. However, this can lead to difficult interpretations. Kendall's Tau-b rank correlation improves on this by reflecting the strength of the dependence between the variables being compared.

Since both variables need to be of ordinal scale or ranked data, Spearman's correlation

requires converting interval or ratio scales into ranks before it can be calculated.

Mathematically, Spearman correlation and Pearson correlation are very similar in the way that

they use difference measurements to calculate the strength of association. Pearson correlation

uses standard deviations while Spearman correlation uses differences in ranks. However, this leads

to an issue with the Spearman correlation when tied ranks exist in the sample. An example of

this is when a sample of marathon results awards two silver medals but no bronze medal. A

statistician is even crueller to these runners because a rank is defined as the average position in

the ascending order of values. For a statistician, the marathon result would have one first

place, two places with a rank of 2.5, and the next runner ranked 4. If tied ranks occur, a more

complicated formula has to be used to calculate rho, but SPSS automatically and correctly

handles tied ranks.3
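
The ranking rule and the calculation of rho can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the procedure SPSS uses internally: the function names (average_ranks, pearson_r, spearman_rho, rho_without_ties) and all data values are hypothetical. Tied values receive the average of the positions they occupy, and rho is obtained as the Pearson correlation of the two rank series, which remains valid when ties are present. The well-known shortcut formula 1 - 6 * (sum of squared rank differences) / (n * (n^2 - 1)) is included for comparison; it holds only when there are no ties.

```python
from math import sqrt

def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def spearman_rho(xs, ys):
    """Spearman's rho as the Pearson correlation of average ranks (works with ties)."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

def rho_without_ties(xs, ys):
    """Shortcut formula based on squared rank differences; valid only without ties."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(average_ranks(xs), average_ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical marathon finishing times in minutes: two runners tie for second place.
marathon_ranks = average_ranks([130, 135, 135, 140])

# Hypothetical ordinal scores without ties (education level, strength of opinion).
education = [1, 2, 3, 4, 5]
opinion = [2, 1, 4, 3, 5]
rho = spearman_rho(education, opinion)
rho_shortcut = rho_without_ties(education, opinion)
```

For the marathon times the ranks come out as 1, 2.5, 2.5 and 4, matching the description above. For the untied scores both routes give the same rho, which is why the shortcut formula is commonly taught for hand calculation.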

8. Why don’t we always use Non-parametric tests?

Although non-parametric tests have the very desirable property of making fewer assumptions

about the distribution of measurements in the population from which we drew our sample, they

have two main drawbacks. The first is that they generally are less statistically powerful than the

analogous parametric procedure when the data truly are approximately normal. “Less powerful”

means that there is a smaller probability that the procedure will tell us that two variables are

associated with each other when in fact they are truly associated. If you are planning a study and

trying to determine how many patients to include, a non-parametric test will require a slightly

larger sample size to have the same power as the corresponding parametric test. The second

drawback associated with non-parametric tests is that their results are often less easy to interpret

than the results of parametric tests. Many non-parametric tests use rankings of the values in the

data rather than using the actual data. Knowing that the difference in mean ranks between two

groups is five does not really help our intuitive understanding of the data. On the other hand,

knowing that the mean systolic blood pressure of patients taking the new drug was five mmHg

lower than the mean systolic blood pressure of patients on the standard treatment is both intuitive

and useful. In short, non-parametric procedures are useful in many cases and necessary in some,

but they are not a perfect solution.
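
The interpretability point can be illustrated with a small calculation. The sketch below uses invented systolic blood pressure readings (in mmHg) for two groups of five patients; the numbers are assumptions made for illustration and contain no tied values, so each reading can simply be ranked by its position in the pooled, sorted list.

```python
from statistics import mean

# Hypothetical systolic blood pressure readings in mmHg (no tied values).
new_drug = [118, 122, 125, 130, 135]
standard = [124, 128, 131, 134, 138]

# Parametric summary: difference between the two group means, in mmHg.
mean_difference = mean(new_drug) - mean(standard)

# Rank-based summary: rank all readings together, then compare mean ranks.
pooled = sorted(new_drug + standard)
rank_of = {value: position for position, value in enumerate(pooled, start=1)}
mean_rank_difference = (
    mean(rank_of[v] for v in new_drug) - mean(rank_of[v] for v in standard)
)

print(mean_difference, round(mean_rank_difference, 2))
```

With these readings the difference in means is -5 mmHg, a statement in clinical units, whereas the difference in mean ranks is about -2.2, a number that has no unit and depends on how many observations were pooled.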

Self Check Exercise 2

Q: What is the major advantage of non-parametric statistics?

It makes fewer assumptions about the distribution of measurement in the population.

Q: What level of measurement is used in Spearman’s correlation?

The ordinal level of measurement, in most cases.

Q: Which test is carried out when the preconditions for ANOVA are not met?

The Kruskal-Wallis test, which does not require ANOVA's assumption of normally distributed data.

9. SUMMARY

Here is a summary of the major points and how they might affect statistical analyses you

perform:

Parametric and non-parametric are two broad classifications of statistical procedures.

Parametric tests are based on assumptions about the distribution of the underlying population from which the sample was taken. The most common parametric assumption is that the data are approximately normally distributed.

Non-parametric tests do not rely on assumptions about the shape or parameters of the underlying population distribution. If the data deviate strongly from the assumptions of a parametric procedure, using the parametric procedure could lead to incorrect conclusions.

Researchers should be aware of the assumptions associated with a parametric procedure and should learn methods to evaluate the validity of those assumptions. If you determine that the assumptions of the parametric procedure are not valid, use an analogous non-parametric procedure instead.

The parametric assumption of normality is particularly worrisome for small sample sizes (n < 30). Non-parametric tests are often a good option for these data.

It can be difficult to decide whether to use a parametric or non-parametric procedure in some cases. Non-parametric procedures generally have less power for the same sample size than the corresponding parametric procedure if the data truly are normal. Interpretation of non-parametric procedures can also be more difficult than for parametric procedures.

Consult a statistician if you are in doubt about whether parametric or non-parametric procedures are more appropriate for your data. The book Practical Non-parametric Statistics (Conover, 1971) is an excellent resource for anyone.

Notes

1. Parametric Statistical Distribution in Wolfram Language and System Documentation Center. Available at: https://reference.wolfram.com/.../ParametricStatisticalDistributions.html

2. This example is taken from the following website: formulas.tutorvista.com/math/anova-formula.html

3. Sub-sections ii – v on Non-parametric Tests are taken essentially from www.statisticssolutions.com

10. References

Bagdonavicius, Vilijandas, Kruopis, Julius and Nikulin, Mikhail S. Nonparametric Tests for Complete Data. New York: ISTE & John Wiley & Sons, 2011.

Chakravarti, I.M., Laha, R.G. and Roy, J. Handbook of Methods of Applied Statistics, Volume I. Wiley and Sons, 1967.

Conover, W.J. Practical Non-parametric Statistics. New York: John Wiley, 1971.

Fisher, R.A. Contributions to Mathematical Statistics. New York: John Wiley, 1950.

Gibbons, J.D. and Chakraborti, S. Non-parametric Statistical Inference. Boca Raton, FL: CRC Press, 2009.

Govindarajulu, Z. Non-parametric Inference. Singapore: World Scientific, 2007.

Maritz, J.S. Distribution-Free Statistical Methods. New York: Chapman & Hall / CRC, 1995.

Rao, C.R. Linear Statistical Inference and its Applications. New York: John Wiley & Sons, 2002.

Van der Vaart, A.W. Asymptotic Statistics. Cambridge: Cambridge University Press, 2000.

Website: www.statisticssolutions.com