Advanced statistics

Prof. JOY V. LORIN-PICAR

DAVAO DEL NORTE STATE COLLEGE

NEW VISAYAS, PANABO CITY

TOPIC OUTLINE

PART 1
Role of Statistics in Research
Descriptive Statistics
Hands-On Statistical Software
Sample and Population
Sampling Procedures
Sample Size
Hands-On Statistical Software
Inferential Statistics
Hypothesis Testing
Hands-On Statistical Software

PART 2
Choice of Statistical Tests
Defining Independent and Dependent Variables
Hands-On Statistical Software
Scales of Measurements
How many Samples / Groups are in the Design

PART 3
Parametric Tests
Hands-On Statistical Software

PART 4
Non-Parametric Tests
Hands-On Statistical Software

PART 5
Goodness of Fit
Hands-On Statistical Software

PART 6
Choosing the Correct Statistical Tests
Hands-On Statistical Software
Introduction to Multiple and Non-Linear Regression
Hands-On Statistical Software

Role of Statistics in Research

Normally used to analyze data

To organize and make sense of large amounts of data

This is basic to the intelligent reading of research articles

Has significant contributions in the social sciences, applied sciences, and even business and economics

Statistical research makes inferences about population characteristics on the basis of one or more samples that have been studied.

How is Statistics looked into?

1. Descriptive – this gives us information about, or simply describes, the sample we are studying.

2. Correlational – this enables us to relate variables and establish relationships between and among variables, which are useful in making predictions.

3. Inferential – this goes beyond the sample and makes inferences about the population.

Descriptive Statistics

N - total population/sample size from any given population

Example

Minutes Spent on the Phone

102 124 108 86 103 82 71 104 112 118 87 95
103 116 85 122 87 100 105 97 107 67 78 125
109 99 105 99 101 92

Example 2

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

Range, Mean, Median and Mode

The terms mean, median, mode, and range describe properties of statistical distributions. In statistics, a distribution is the set of all possible values for terms that represent defined events. The value of a term, when expressed as a variable, is called a random variable. There are two major types of statistical distributions. The first type has a discrete random variable. This means that every term has a precise, isolated numerical value. An example of a distribution with a discrete random variable is the set of results for a test taken by a class in school. The second major type of distribution has a continuous random variable. In this situation, a term can acquire any value within an unbroken interval or span. Such a distribution is described by a probability density function. This is the sort of function that might, for example, be used by a computer in an attempt to forecast the path of a weather system.

Mean

The most common expression for the mean of a statistical distribution with a discrete random variable is the mathematical average of all the terms. To calculate it, add up the values of all the terms and then divide by the number of terms. This expression is also called the arithmetic mean. There are other expressions for the mean of a finite set of terms, but these forms are rarely used in statistics. The mean of a statistical distribution with a continuous random variable, also called the expected value, is obtained by integrating the product of the variable with its probability density as defined by the distribution. The expected value is denoted by the lowercase Greek letter mu (µ).

Median

The median of a distribution with a discrete random variable depends on whether the number of terms in the distribution is even or odd. If the number of terms is odd, then the median is the value of the term in the middle. This is the value such that the number of terms having values greater than or equal to it is the same as the number of terms having values less than or equal to it. If the number of terms is even, then the median is the average of the two terms in the middle, such that the number of terms having values greater than or equal to it is the same as the number of terms having values less than or equal to it. The median of a distribution with a continuous random variable is the value m such that the probability is at least 1/2 (50%) that a randomly chosen point on the function will be less than or equal to m, and the probability is at least 1/2 that a randomly chosen point on the function will be greater than or equal to m.

Mode

The mode of a distribution with a discrete random variable is the value of the term that occurs the most often. It is not uncommon for a distribution with a discrete random variable to have more than one mode, especially if there are not many terms. This happens when two or more terms occur with equal frequency, and more often than any of the others. A distribution with two modes is called bimodal. A distribution with three modes is called trimodal. The mode of a distribution with a continuous random variable is the value at which the function reaches its maximum. As with discrete distributions, there may be more than one mode.

Range

The range of a distribution with a discrete random variable is the difference between the maximum value and the minimum value. For a distribution with a continuous random variable, the range is the difference between the two extreme points on the distribution curve, where the value of the function falls to zero. For any value outside the range of a distribution, the value of the function is equal to 0.

The range is the least reliable of the measures of variability and is used only when one is in a hurry to get a measure of variability.
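As a minimal sketch of the four measures described above, the snippet below applies Python's standard `statistics` module to the "Minutes Spent on the Phone" example. The variable names are only illustrative; the values are copied from the example data.

```python
import statistics

# Minutes Spent on the Phone (30 observations from the example)
minutes = [
    102, 124, 108, 86, 103, 82, 71, 104, 112, 118, 87, 95,
    103, 116, 85, 122, 87, 100, 105, 97, 107, 67, 78, 125,
    109, 99, 105, 99, 101, 92,
]

n = len(minutes)                          # N, the number of observations
mean = statistics.mean(minutes)           # sum of the values divided by N
median = statistics.median(minutes)       # average of the two middle values (N is even)
modes = statistics.multimode(minutes)     # every value tied for most frequent
data_range = max(minutes) - min(minutes)  # maximum minus minimum

print(f"N = {n}")
print(f"mean = {mean:.2f}")
print(f"median = {median}")
print(f"modes = {sorted(modes)}")
print(f"range = {data_range}")
```

For these data the mean is about 99.63, the median is 101.5, the range is 58, and four values (87, 99, 103, 105) tie as modes, which illustrates the remark that small data sets often have more than one mode.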

Variance

Standard Deviation

The standard deviation formula is very simple: it is the square root of the variance. It is the most commonly used measure of spread.

An important attribute of the standard deviation as a measure of spread is that if the mean and standard deviation of a normal distribution are known, it is possible to compute the percentile rank associated with any given score.

In a normal distribution, about 68% of the scores are within one standard deviation of the mean and about 95% of the scores are within two standard deviations of the mean.

The standard deviation has proven to be an extremely useful measure of spread in part because it is mathematically tractable. Many formulas in inferential statistics use the standard deviation.
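The relationships above can be checked with a short sketch. It assumes the usual definitions (population variance divides by N, sample variance by N - 1) and uses a small hypothetical set of scores. The mean of 100 and standard deviation of 15 in the percentile-rank line are also assumed values, chosen only for illustration.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical scores, chosen so the arithmetic is easy to follow
scores = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = statistics.pvariance(scores)   # population variance (divides by N)
samp_var = statistics.variance(scores)   # sample variance (divides by N - 1)
pop_sd = math.sqrt(pop_var)              # standard deviation = square root of variance

# Share of a normal distribution within 1 and 2 standard deviations of the mean
z = NormalDist()                         # standard normal: mean 0, sd 1
within_1sd = z.cdf(1) - z.cdf(-1)
within_2sd = z.cdf(2) - z.cdf(-2)

# Percentile rank of a score of 115 when mean = 100 and sd = 15 (assumed values)
percentile_rank = NormalDist(mu=100, sigma=15).cdf(115) * 100

print(f"population variance = {pop_var}, sample variance = {samp_var:.4f}")
print(f"population sd = {pop_sd}")
print(f"within 1 sd = {within_1sd:.4f}, within 2 sd = {within_2sd:.4f}")
print(f"percentile rank of 115 = {percentile_rank:.1f}")
```

The two normal-curve proportions come out near 0.68 and 0.95, matching the rule of thumb stated above.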

Coefficient of Variation
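Only the heading survives for this topic. Assuming the usual definition, the coefficient of variation expresses the standard deviation as a percentage of the mean, which allows spread to be compared across variables measured in different units. The sketch below uses hypothetical height and weight values and the sample standard deviation.

```python
import statistics

def coefficient_of_variation(values):
    """Return the sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical measurements on two different scales
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

print(f"CV of heights = {cv_height:.2f}%")
print(f"CV of weights = {cv_weight:.2f}%")
```

Both lists have the same standard deviation, but the weights vary more relative to their mean, so their coefficient of variation is larger.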

Kurtosis

KURTOSIS - refers to how sharply peaked a distribution is. A value for kurtosis is included with the graphical summary:

· Values close to 0 indicate normally peaked data.

· Negative values indicate a distribution that is flatter than normal.

· Positive values indicate a distribution with a sharper than normal peak.
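A minimal sketch of the interpretation above, assuming the simple moment-based definition of excess kurtosis (fourth central moment divided by the squared variance, minus 3). Statistical packages often apply a small-sample correction, so their reported values may differ slightly. Both data sets are hypothetical.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (about 0 for normal data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5]                  # evenly spread values
peaked = [1, 5, 5, 5, 5, 5, 5, 5, 9]    # most values piled up at the centre

print(f"flat:   {excess_kurtosis(flat):.2f}")    # negative: flatter than normal
print(f"peaked: {excess_kurtosis(peaked):.2f}")  # positive: sharper than normal
```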

Skewness
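Only the heading survives for this topic. Assuming the usual meaning, skewness measures the asymmetry of a distribution: values near 0 indicate symmetry, positive values a longer right tail, and negative values a longer left tail. The sketch below uses the simple moment-based definition and hypothetical data. Software output may differ slightly because of small-sample corrections.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [2, 4, 4, 4, 5, 5, 7, 9]
left_skewed = [-x for x in right_skewed]   # mirror image of the list above

print(f"symmetric:    {skewness(symmetric):.3f}")
print(f"right-skewed: {skewness(right_skewed):.3f}")
print(f"left-skewed:  {skewness(left_skewed):.3f}")
```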

Samples and Population

Population – as used in research, refers to all the members of a particular group.

It is the group of interest to the researcher.

This is the group to whom the researcher would like to generalize the results of a study.

A target population is the actual population to whom the researcher would like to generalize.

An accessible population is the population to whom the researcher is entitled to generalize.

SAMPLING

This is the process of selecting the individuals who will participate in a research study.

A sample is any part of the population of individuals from whom information is obtained.

A representative sample is a sample that is similar to the population to whom the researcher is entitled to generalize.

PROBABILITY AND NON-PROBABILITY SAMPLING

A sampling procedure that gives every element of the population a (known) nonzero chance of being selected in the sample is called probability sampling. Otherwise, the sampling procedure is called non-probability sampling.

Whenever possible, probability sampling is used because there is no objective way of assessing the reliability of inferences under non-probability sampling.

METHODS OF PROBABILITY SAMPLING

1. simple random sampling
2. systematic sampling
3. stratified sampling
4. cluster sampling
5. two-stage random sampling

Simple Random Sampling

This is a sample selected from a population in such a manner that all members of the population have an equal chance of being selected.

Stratified Random Sampling

A sample selected so that certain characteristics are represented in the sample in the same proportion as they occur in the population.

Cluster Random Sample

This is obtained by using groups as the sampling unit rather than individuals.

Two-Stage Random Sample

Selects groups randomly and then chooses individuals randomly from these groups.
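The four designs just defined can be sketched with Python's standard `random` module. The population below (100 people, 60 female and 40 male, spread over 5 barangays) and all function names are hypothetical, made up only to show how the selection step differs from one design to the next.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 people, 60 female and 40 male, in 5 barangays
population = [
    {"id": i, "sex": "F" if i % 10 < 6 else "M", "barangay": i % 5}
    for i in range(100)
]

def simple_random_sample(units, n):
    """Every unit has the same chance of being selected."""
    return rng.sample(units, n)

def stratified_sample(units, key, n):
    """Sample each stratum in proportion to its share of the population."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[key], []).append(unit)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(units))  # rounding may shift the total slightly
        sample.extend(rng.sample(members, k))
    return sample

def cluster_sample(units, key, n_clusters):
    """Randomly pick whole groups and keep every unit in them."""
    clusters = sorted({unit[key] for unit in units})
    chosen = set(rng.sample(clusters, n_clusters))
    return [unit for unit in units if unit[key] in chosen]

def two_stage_sample(units, key, n_clusters, n_per_cluster):
    """Randomly pick groups, then randomly pick units within each group."""
    clusters = sorted({unit[key] for unit in units})
    sample = []
    for c in rng.sample(clusters, n_clusters):
        members = [unit for unit in units if unit[key] == c]
        sample.extend(rng.sample(members, n_per_cluster))
    return sample

srs = simple_random_sample(population, 10)
strat = stratified_sample(population, "sex", 10)
clus = cluster_sample(population, "barangay", 2)
two = two_stage_sample(population, "barangay", 2, 5)

print(len(srs), len(strat), len(clus), len(two))
```

With a sample of 10, the stratified design always returns 6 females and 4 males, mirroring the 60/40 split in the population, while the simple random sample only does so on average.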

Non-Probability Sampling

1. accidental or convenience sampling
2. purposive sampling
3. quota sampling
4. snowball or referral sampling
5. systematic sampling

Systematic Sample

This is obtained by selecting every nth name in a population.
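A minimal sketch of selecting every nth name, assuming the common variant in which the starting point is chosen at random and the interval is the population size divided by the desired sample size. The list of names and the function name are hypothetical.

```python
import random

def systematic_sample(names, n, rng=random):
    """Pick every k-th name after a random start, where k = N // n."""
    k = len(names) // n          # sampling interval
    start = rng.randrange(k)     # random starting position within the first interval
    return names[start::k][:n]

# Hypothetical sampling frame of 100 names
names = [f"person_{i:03d}" for i in range(100)]
sample = systematic_sample(names, 10, random.Random(7))

print(sample)
```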

Convenience Sampling

Any group of individuals that is conveniently available to be studied.

Purposive Sampling

Consists of individuals who have special qualifications of some sort or are deemed representative on the basis of prior evidence.

Quota Sampling

In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as in stratified sampling. Then judgment is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60. This means that individuals can put a demand on who they want to sample (targeting).

Snowball Sampling

Snowball sampling is a technique for developing a research sample where existing study subjects recruit future subjects from among their acquaintances. Thus the sample group appears to grow like a rolling snowball. As the sample builds up, enough data is gathered to be useful for research. This sampling technique is often used in hidden populations which are difficult for researchers to access; example populations would be drug users or prostitutes. As sample members are not selected from a sampling frame, snowball samples are subject to numerous biases.

General Classification of Collecting Data

1. Census or complete enumeration – the process of gathering information from every unit in the population.

- not always possible to get timely, accurate and economical data

- costly, if the number of units in the population is too large

2. Survey sampling – the process of obtaining information from the units in the selected sample.

Advantages: reduced cost, greater speed, greater scope, and greater accuracy

Sample size

Samples should be as large as a researcher can obtain with a reasonable expenditure of time and energy.

As suggested, a minimum number of subjects is 100 for a descriptive study, 50 for a correlational study, and 30 in each group for experimental and causal-comparative designs.

According to Padua, for p parameters, the minimum n could be computed as n >= (p + 3) p / 2, where p = number of parameters. For example, if p = 4, the minimum n = 14.
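The rule attributed to Padua can be written as a one-line function. This is only a sketch of the inequality quoted above; the function name is made up for illustration.

```python
import math

def padua_min_n(p):
    """Smallest n satisfying n >= (p + 3) * p / 2 for p parameters."""
    return math.ceil((p + 3) * p / 2)

for p in (2, 4, 6):
    print(f"p = {p}: minimum n = {padua_min_n(p)}")
```

For p = 4 this gives (4 + 3) × 4 / 2 = 14, the same value as in the worked example.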

Inferential Statistics

These are formalized techniques used to draw conclusions about populations based on samples taken from those populations.

Hypothesis

A hypothesis is defined as a tentative theory or supposition provisionally adopted to explain certain facts and to guide in the investigation of others.

A statistical hypothesis is an assertion or statement, which may or may not be true, concerning one or more populations.

Example 1. A leading drug in the treatment of hypertension has an advertised therapeutic success rate of 83%. A medical researcher believes he has found a new drug for treating hypertensive patients that has a higher therapeutic success rate than the leading drug, with fewer side effects.

The Statistical Hypothesis:

H0: The new drug is no better than the old one (p = 0.83)

H1: The new drug is better than the old one (p > 0.83)
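One possible way to test these hypotheses is a one-tailed z test for a single proportion, sketched below. The choice of test, the trial size (200 patients), the number of successes (176), and the 0.05 level of significance are all assumptions made for illustration. Only the 83% success rate comes from the example.

```python
import math
from statistics import NormalDist

p0 = 0.83          # advertised success rate of the leading drug (from the example)
n = 200            # hypothetical number of patients given the new drug
successes = 176    # hypothetical number of patients treated successfully
alpha = 0.05       # assumed level of significance

p_hat = successes / n                          # observed success rate
standard_error = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
z = (p_hat - p0) / standard_error              # test statistic
p_value = 1 - NormalDist().cdf(z)              # one-tailed, because H1 is p > 0.83

reject_h0 = p_value < alpha

print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

With these made-up counts the observed rate is 88%, z is about 1.88, and the one-tailed p-value is about 0.03, so H0 would be rejected at the 0.05 level but not at the 0.01 level.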

Example 2. A social researcher is conducting a study to determine if the level of women’s participation in community extension programs of the barangay can be affected by their educational attainment , occupation, income, civil status, and age.

H0: The level of women’s participation in community extension programs is not affected by their educational attainment, occupation, income, civil status, and age.

H1: The level of women’s participation in community extension programs is affected by their educational attainment, occupation, income, civil status, and age.

Example 3: A community organizer wants to compare the three community organizing strategies applied to cultural minorities in terms of effectiveness.

A. Hypothesis Testing

Steps in Hypothesis Testing

1. Formulate the null hypothesis and the alternative hypothesis

- these are the statistical hypotheses, which are assumptions or guesses about the population involved. In short, these are statements about the probability distributions of the populations.

Null Hypothesis

This is a hypothesis of “no effect”. It is usually formulated for the express purpose of being rejected, that is, it is the negation of the point one is trying to make.

This is the hypothesis that two or more variables are not related or that two or more statistics are not significantly different.

Alternative Hypothesis

This is the operational statement of the researcher’s hypothesis.

It is the hypothesis derived from the theory of the investigator and generally states a specified relationship between two or more variables, or that two or more statistics significantly differ.

Two Ways of Stating the Alternative Hypothesis

1. Predictive – specifies the type of relationship existing between two or more variables (direct or indirect) or specifies the direction of the difference between two or more statistics

2. Non-Predictive – does not specify the type of relationship or the direction of the difference

C. LEVEL OF SIGNIFICANCE (α)

α is the maximum probability with which we would be willing to risk a Type I error, that is, the error of rejecting a null hypothesis when it is actually true (the hypothesis is inappropriately rejected). Plainly speaking, it occurs when we observe a difference when in truth there is none, thus indicating a test of poor specificity. An example of this would be a test showing that a woman is pregnant when in reality she is not.

In other words, the level of significance determines the risk a researcher would be willing to take in his test.

The choice of alpha is primarily dependent on the practical application of the result of the study.

Examples of α

.05 (95% confident of the claim)

.01 (99% confident of the claim)

But take note, α is not always .05 or .01. This could be computed mathematically from the formula:

where the variance, number of samples, and its difference are predetermined – Chebychev’s sample size formula.

D. Defining a Region of Rejection

The region of rejection is a region of the null sampling distribution. It consists of a set of possible values which are so extreme that, when the null hypothesis is true, the probability is small (i.e. equal to alpha) that the sample we observe will yield a value which is among them.
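As an illustration of the definition above, the sketch below finds where the region of rejection begins when the null sampling distribution is the standard normal (a z statistic). That choice of distribution and the function names are assumptions. Other tests use other sampling distributions, but the logic is the same.

```python
from statistics import NormalDist

def critical_z(alpha, two_tailed=False):
    """Boundary of the region of rejection for a z statistic."""
    tail = alpha / 2 if two_tailed else alpha   # alpha is split across both tails
    return NormalDist().inv_cdf(1 - tail)

def in_rejection_region(z, alpha, two_tailed=False):
    """True if z is extreme enough to reject H0 (upper tail when one-tailed)."""
    cutoff = critical_z(alpha, two_tailed)
    return abs(z) > cutoff if two_tailed else z > cutoff

print(f"alpha = .05, one-tailed: z > {critical_z(0.05):.3f}")
print(f"alpha = .05, two-tailed: |z| > {critical_z(0.05, two_tailed=True):.3f}")
print(f"alpha = .01, one-tailed: z > {critical_z(0.01):.3f}")
```

A predictive (directional) alternative hypothesis places the whole of α in one tail, while a non-predictive alternative splits it across both tails, which is why the two-tailed cutoff is larger for the same α.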

E. Collect the data and compute the value of the test statistic.

F. State your decision.

G. State your conclusion.

B. Choose an Appropriate Statistical Test for Testing the Null Hypothesis

The choice of a statistical test for the analysis of your data requires careful and deliberate judgment.

PRIMARY CONSIDERATIONS:

The choice of a statistical test is dictated by the questions for which the research is designed

The level, the distribution, and dispersion of data also suggest the type of statistical test to be used

SECONDARY CONSIDERATIONS

The extent of your knowledge in statistics

Availability of resources in connection with the computation and interpretation of data

Choice of Statistical Tests

This is designed to help you develop a framework for choosing the correct statistic to test your hypothesis.

It begins with a set of questions you should ask when selecting your test.

It is followed by demonstrations of the factors that are important to consider when choosing your statistic.

Presented below are four questions you should ask and answer when trying to determine which statistical procedure is most appropriate to test your hypothesis.

What are the independent and dependent variables?

What is the scale of measurement of the study variables?

How many samples/groups are in the design?

Have I met the assumptions of the statistical test selected?

To determine which test should be used in any given circumstance, we need to consider the hypothesis that is being tested, the independent and dependent variables and their scale of measurement, the study design, and the assumptions of the test.

Defining Independent and Dependent Variables

Before we can begin to choose our statistical test, we must determine which is the independent and which is the dependent variable in our hypothesis.

Our dependent variable is always the phenomenon or behavior that we want to explain or predict.

The independent variable represents a predictor or causal variable in the study.

In any antecedent-consequent relationship, the antecedent is the independent variable and the consequent is the dependent variable.

With single samples and one dependent variable, the one-sample Z test, the one-sample t test, and the chi-square goodness-of-fit test are the only statistics that can be used.

Students sometimes ask, "but don't you have population data too, so you have two sets of data?" Yes and no.

Data have to exist or else the population parameters could not be defined. But the researcher does not collect these data; they already exist.

So, if you are collecting data on one sample and comparing those data to information that has already been gathered and is published, then you are conducting a one-sample test using the one sample/set of data collected in this study.

For the chi-square goodness-of-fit test, you can also compare the sample against chance probabilities.

When we have a single sample and independent and dependent variables measured on all subjects, we typically are testing a hypothesis about the association between two variables. The statistics that we have learned to test hypotheses about association include:

chi-square test of independence
Spearman's rs
Pearson's r
bivariate regression and multiple regression

Multiple Sample Tests

Studies that refer to repeated measurements or pairs of subjects typically collect at least two sets of scores. Studies that refer to specific subgroups in the population also collect two or more samples of data. Once you have determined that the design uses two or more samples or "groups", then you must determine how many samples or groups are in the design. Studies that are limited to two groups use either the chi-square statistic, Mann-Whitney U, Wilcoxon test, independent means t test, or the dependent means t test.

If you have three or more groups in the design, you would use the chi-square statistic, Kruskal-Wallis H Test, Friedman ANOVA for ranks, One-way Between-Groups ANOVA, or Factorial ANOVA, depending on the nature of the relationship between groups. Some of these tests are designed for dependent or correlated samples/groups and some are designed for samples/groups that are completely independent.

Dependent Means

Dependent groups refer to some type of association or link in the research design between sets of scores. This usually occurs in one of three conditions -- repeated measures, linked selection, or matching. Repeated measures designs collect data on subjects using the same measure on at least two occasions. This often occurs before and after a treatment or when the same research subjects are exposed to two different experimental conditions.

When subjects are selected into the study because of natural "links or associations", we want to analyze the data together. This would occur in studies of parent-infant interaction, romantic partners, siblings, or best friends. In a study of parents and their children, a parent’s data should be associated with his son's, not some other child's. Subject matching also produces dependent data. Suppose that an investigator wanted to control for socioeconomic differences in research subjects. She might measure socioeconomic status and then match on that variable. The scores on the dependent variable would then be treated as a pair in the statistical test.

All statistical procedures for dependent or correlated groups treat the data as linked, therefore it is very important that you correctly identify dependent groups designs. The statistics that can be used for correlated groups are the McNemar Test (two samples or times of measurement), Wilcoxon t Test (two samples), Dependent Means t Test (two samples), Friedman ANOVA for Ranks (three or more samples), Simple Repeated Measures ANOVA (three or more samples).

Independent Means

When there is no subject overlap across groups, we define the groups as independent. Tests of gender differences are a good example of independent groups. We cannot be both male and female at the same time; the groups are completely independent. If you want to determine whether samples are independent or not, ask yourself, "Can a person be in one group at the same time he or she is in another?" If the answer is no (can't be in a remedial education program and a regular classroom at the same time; can't be a freshman in high school and a sophomore in high school at the same time), then the groups are independent.

The statistics that can be used for independent groups include the chi-square test of independence (two or more groups), Mann-Whitney U Test (two groups), Independent Means t test (two groups), One-Way Between-Groups ANOVA (three or more groups), and Factorial ANOVA (two or more independent variables).

Scales of Measurements

Once we have identified the independent and dependent variables, our next step in choosing a statistical test is to identify the scale of measurement of the variables.

All of the parametric tests that we have learned to date require an interval or ratio scale of measurement for the dependent variable.

If you are working with a dependent variable that has a nominal or ordinal scale of measurement, then you must choose a nonparametric statistic to test your hypothesis

How many Samples / Groups are in the Design

Once you have identified the scale of measurement of the dependent variable, you want to determine how many samples or "groups" are in the study design.

Designs for which one-sample tests (e.g., Z test; t test; Pearson and Spearman correlations; chi-square goodness-of-fit) are appropriate collect only one set or "sample" of data.

There must be at least two sets of scores or two "samples" for any statistic that examines differences between groups (e.g., t test for dependent means; t test for independent means; one-way ANOVA; Friedman ANOVA; chi-square test of independence).
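The questions about scale of measurement, number of groups, and dependence between groups can be sketched as a simple lookup. The table below is one reading of the tests named in this outline, not a complete decision rule. The dictionary layout, the function name, and the way each test is assigned to a scale are assumptions made for illustration, and test assumptions still have to be checked separately.

```python
# Keys: (scale of the dependent variable, number of groups, groups dependent?)
# "interval" also stands for ratio data; 3 stands for "three or more" groups.
TESTS = {
    ("nominal", 1, False): ["chi-square goodness-of-fit test"],
    ("interval", 1, False): ["one-sample Z test", "one-sample t test"],
    ("nominal", 2, False): ["chi-square test of independence"],
    ("ordinal", 2, False): ["Mann-Whitney U test"],
    ("interval", 2, False): ["independent means t test"],
    ("nominal", 3, False): ["chi-square test of independence"],
    ("ordinal", 3, False): ["Kruskal-Wallis H test"],
    ("interval", 3, False): ["one-way between-groups ANOVA"],
    ("nominal", 2, True): ["McNemar test"],
    ("ordinal", 2, True): ["Wilcoxon test"],
    ("interval", 2, True): ["dependent means t test"],
    ("ordinal", 3, True): ["Friedman ANOVA for ranks"],
    ("interval", 3, True): ["simple repeated measures ANOVA"],
}

def candidate_tests(scale, groups, dependent=False):
    """Return the tests listed for this design, or an empty list if none is listed."""
    if scale == "ratio":
        scale = "interval"               # interval and ratio are treated alike
    key = (scale, min(groups, 3), dependent)
    return TESTS.get(key, [])

print(candidate_tests("interval", 2))                  # two independent groups
print(candidate_tests("ordinal", 4, dependent=True))   # four repeated measurements
```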

Parametric Tests

Parametric statistics are used when our data are measured on interval or ratio scales of measurement.

Tend to need larger samples

Data should fit a particular distribution; otherwise, the data are transformed to fit that distribution

Samples are normally drawn randomly from the population

Follows the assumption of normality, meaning the data are normally distributed.

Parametric Assumptions

Listed below are the most frequently encountered assumptions for parametric tests.

Statistical procedures are available for testing these assumptions.

The Kolmogorov-Smirnov Test is used to determine how likely it is that a sample came from a population that is normally distributed.

The Levene test is used to test the assumption of equal variances.

If we violate test assumptions, the statistic chosen cannot be applied. In this circumstance we have two options:

We can use a data transformation

We can choose a nonparametric statistic

If data transformations are selected, the transformation must correct the violated assumption. If successful, the transformation is applied and the parametric statistic is used for data analysis.

Types of Parametric Tests

Z test
One-Sample t test
t test for dependent means
t test for independent means
One-way ANOVA
Factorial ANOVA
Pearson’s r
Bivariate/Multiple regression

Non-Parametric Tests

Inference procedures which are largely distribution free.

Nonparametric statistics are used when our data are measured on a nominal or ordinal scale of measurement.

Apart from the chi-square statistics, which are used for nominal data, all other nonparametric statistics are appropriate when data are measured on an ordinal scale of measurement.

An example of this is the sign test. Sign tests are designed to draw inferences about medians.
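A minimal sketch of a sign test about a median, using an exact two-sided binomial p-value. The data are the "Minutes Spent on the Phone" values from the Descriptive Statistics example. The hypothesized median of 90 minutes and the function name are assumptions chosen only for illustration.

```python
from math import comb

def sign_test(values, hypothesized_median):
    """Exact two-sided sign test for H0: population median = hypothesized_median."""
    plus = sum(1 for x in values if x > hypothesized_median)
    minus = sum(1 for x in values if x < hypothesized_median)
    n = plus + minus                       # values equal to the median are dropped
    k = min(plus, minus)
    # Under H0 each sign is + or - with probability 1/2 (binomial with p = 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return plus, minus, p_value

minutes = [
    102, 124, 108, 86, 103, 82, 71, 104, 112, 118, 87, 95,
    103, 116, 85, 122, 87, 100, 105, 97, 107, 67, 78, 125,
    109, 99, 105, 99, 101, 92,
]

plus, minus, p_value = sign_test(minutes, 90)   # 90 minutes is a hypothetical value
print(f"above: {plus}, below: {minus}, p-value = {p_value:.4f}")
```

Here 22 observations fall above 90 and 8 fall below, giving a p-value of about 0.016, so the hypothesis that the median is 90 minutes would be rejected at α = .05 but not at α = .01.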

Types of Non-parametric Tests

Sign Tests

Chi-square statistics and their modifications (e.g., McNemar Test) are used for nominal data.

Wilcoxon Test – alternative to the t test in the parametric tests

Kruskal-Wallis Test – alternative to ANOVA

Friedman Test – alternative to ANOVA

Goodness of Fit Test

Choosing the Correct Statistical Tests

Summary

Five issues must be considered when choosing statistical tests.

Scale of measurement
Number of samples/groups
Nature of the relationship between groups
Number of variables
Assumptions of statistical tests

Introduction to Multiple and Non-Linear Regression

Hands –On Statistical Software

Thank you very much!

Hope you are now ready to conduct your study.
