
1 Intro to Statistics


Introduction to Statistics STATISTICAL ANALYSIS

YHJ@SOM, USM Page - 1

Introduction to Statistics

Objectives

• To define statistics
• To discuss the wide range of applications of statistics
• To discuss key statistical concepts
• To understand the branches of statistics
• To describe the levels of measurement of data

What is Statistics?

• The science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making more effective decisions
• A branch of mathematics
• Facts and figures
• A subject or discipline
• Collections of data


Applications of Statistics

Statistical techniques are used in a wide range of scientific and social research, including biostatistics, computational biology, computational sociology, network biology, social science, sociology, and social research.

Some fields of inquiry use applied statistics so extensively that they have specialized terminology. These disciplines include:

Actuarial science

Applied information economics

Biostatistics

Business statistics

Chemometrics (for analysis of data from chemistry)

Data mining

Demography

Econometrics

Energy statistics

Engineering statistics

Epidemiology

Geography and Geographic Information Systems, specifically in Spatial analysis

Image processing

Psychological statistics

Reliability engineering

Social statistics


Engineering statistics is a branch of statistics with several subtopics that are particular to engineering:

• Design of Experiments (DOE) uses statistical techniques to test and construct models of engineering components and systems.
• Quality control and process control use statistics as a tool to manage conformance to specifications of manufacturing processes and their products.
• Time and methods engineering uses statistics to study repetitive operations in manufacturing in order to set standards and find optimum (in some sense) manufacturing procedures.
• Reliability engineering measures the ability of a system to perform its intended function (for its intended time) and has tools for improving performance.
• Probabilistic design involves the use of probability in product and system design.


Applications of Statistics in Business

• Accounting – auditing and cost estimation
• Finance – investments and portfolio management
• Human resources – compensation, job satisfaction, performance measurement
• Operations – quality management, forecasting, MIS, capacity planning, materials control
• Marketing – market analysis, consumer research, pricing
• Economics – regional, national, and international economic performance
• International business – market and demographic analysis

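Key Statistical Concepts…

• Population: the entire group of items or individuals about which conclusions are to be drawn.
• Sample: a subset of the population, selected for study.
• Parameter: a descriptive measure of a population.
• Statistic: a descriptive measure of a sample.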


Key Statistical Concepts…

• Populations have parameters.
• Samples have statistics.

[Diagram: the sample is a subset of the population; a parameter describes the population and a statistic describes the sample.]

Branches of Statistics

Statistics
• Descriptive Statistics
• Inferential Statistics
  • Parametric Statistics
  • Non-Parametric Statistics

Descriptive Statistics…

• …are methods of organizing, summarizing, and presenting data in a convenient and informative way. These methods include:
  • Graphical techniques
  • Numerical techniques
• The actual method used depends on what information we would like to extract. Are we interested in…
  • measure(s) of central location? and/or
  • measure(s) of variability (dispersion)?
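The numerical techniques mentioned above can be sketched in a few lines of Python. This is a minimal illustration only: the exam scores are made-up data, and the variable names are chosen here for the example rather than taken from the slides.

```python
import statistics

# Hypothetical sample: exam scores for ten students (made-up data).
scores = [62, 71, 71, 75, 78, 80, 84, 88, 90, 95]

# Measures of central location
mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequent value

# Measures of variability (dispersion)
score_range = max(scores) - min(scores)        # largest minus smallest value
sample_variance = statistics.variance(scores)  # sample variance (divides by n - 1)
sample_std_dev = statistics.stdev(scores)      # square root of the sample variance

print("mean:", mean_score, "median:", median_score, "mode:", mode_score)
print("range:", score_range, "variance:", sample_variance, "std dev:", sample_std_dev)
```

Which of these summaries is reported depends on the question being asked: the first group describes where the data are centred, the second how widely they are spread.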


Inferential Statistics…

• Inferential statistics is also a set of methods, but it is used to draw conclusions or inferences about characteristics of populations based on data from a sample.

Statistical Inference…

Statistical inference is the process of making an estimate, prediction, or decision about a population based on a sample.

[Diagram: a sample is drawn from the population; inference runs from the sample's statistic back to the population's parameter.]

What can we infer about a population's parameters based on a sample's statistics?

Statistical Inference…

We use statistics to make inferences about parameters.

Therefore, we can make an estimate, prediction, or decision about a population based on sample data.

Thus, we can apply what we know about a sample to the larger population from which it was drawn!


Statistical Inference…

• Rationale:
  • Large populations make investigating each member impractical and expensive.
  • It is easier and cheaper to take a sample and make estimates about the population from the sample.
• However:
  Such conclusions and estimates are not always going to be correct.
  For this reason, we build "measures of reliability" into the statistical inference, namely the confidence level and the significance level.

Confidence & Significance Levels…

The confidence level is the proportion of times that an estimating procedure will be correct in the long run.

E.g. a confidence level of 95% means that estimates based on this form of statistical inference will be correct 95% of the time.

When the purpose of the statistical inference is to draw a conclusion about a population, the significance level measures how frequently the conclusion will be wrong in the long run.

E.g. a 5% significance level means that, in the long run, this type of conclusion will be wrong 5% of the time.
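The long-run reading of a 95% confidence level can be illustrated with a small simulation. The sketch below is not from the slides and rests on stated assumptions: a normally distributed population with a known standard deviation, a z-based interval for the mean, and hypothetical values for `MU`, `SIGMA`, `N`, and `TRIALS`.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

MU, SIGMA = 50.0, 10.0  # hypothetical population parameters (unknown in practice)
N = 30                  # size of each sample
TRIALS = 2000           # number of repeated samples
Z_95 = 1.96             # standard normal critical value for 95% confidence


def interval_covers_mu():
    """Draw one sample and report whether its 95% interval contains MU."""
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = statistics.mean(sample)
    margin = Z_95 * SIGMA / N ** 0.5  # half-width of the interval (sigma assumed known)
    return x_bar - margin <= MU <= x_bar + margin


hits = sum(interval_covers_mu() for _ in range(TRIALS))
coverage = hits / TRIALS  # share of intervals that contained MU

print(f"Share of intervals containing mu: {coverage:.1%}")
```

Under these assumptions the share of intervals that contain the population mean should settle near 95%, and the remaining share of roughly 5% corresponds to the estimates that turn out to be wrong.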

Process of Inferential Statistics

[Diagram: select a random sample from the population (parameter µ), then calculate the sample statistic x̄ to estimate µ.]
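The two steps in this process, selecting a random sample and calculating the sample mean to estimate µ, might look as follows in Python. The population here is synthetic (hypothetical household incomes generated for the example), so µ can be computed and compared with the estimate; in a real study µ would be unknown.

```python
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical population: monthly incomes of 10,000 households (synthetic data).
population = [random.gauss(3000, 800) for _ in range(10_000)]
mu = statistics.mean(population)  # parameter: describes the whole population

# Step 1: select a simple random sample (without replacement).
sample = random.sample(population, 100)

# Step 2: calculate the sample mean to estimate mu.
x_bar = statistics.mean(sample)  # statistic: describes the sample

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
```

The two numbers will generally differ a little, which is why the measures of reliability discussed earlier accompany any estimate.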


Branches of Statistics

Statistics
• Descriptive Statistics
• Inferential Statistics
  • Parametric Statistics
  • Non-Parametric Statistics

Parametric Statistics

Parametric statistics is a branch of statistics that assumes data come from a type of probability distribution and makes inferences about the parameters of the distribution. Most well-known elementary statistical methods are parametric.

Non-parametric statistics

• Distribution-free methods, which do not rely on the assumption that the data are drawn from a given probability distribution.
• A non-parametric statistic can refer to a statistic (a function of a sample) whose interpretation does not depend on the population fitting any parametrized distribution.
• Non-parametric models differ from parametric models in that the model structure is not specified a priori but is instead determined from the data.


The most frequently used non-parametric tests include:

• Anderson–Darling test
• Cliff's delta
• Cochran's Q
• Cohen's kappa
• Efron–Petrosian test
• Friedman two-way analysis of variance by ranks
• Kendall's tau
• Kendall's W
• Kolmogorov–Smirnov test
• Kruskal–Wallis one-way analysis of variance by ranks
• Kuiper's test
• Mann–Whitney U or Wilcoxon rank sum test
• Median test
• Pitman's permutation test
• Rank products
• Siegel–Tukey test
• Spearman's rank correlation coefficient
• Van Elteren stratified Wilcoxon rank sum test
• Wald–Wolfowitz runs test
• Wilcoxon signed-rank test
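To give a feel for how a rank-based method avoids distributional assumptions, the sketch below computes the Mann–Whitney U statistic directly from pairwise comparisons. It is an illustration only: the function name `mann_whitney_u` and the ratings are hypothetical, and the code stops at the statistic without deriving a p-value.

```python
def mann_whitney_u(xs, ys):
    """Return (U_x, U_y) for two independent samples.

    U_x counts the (x, y) pairs in which x exceeds y; a tie counts one half.
    Only the ordering of the values matters, not their distribution.
    """
    u_x = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u_x += 1.0
            elif x == y:
                u_x += 0.5
    u_y = len(xs) * len(ys) - u_x  # the two statistics always sum to n_x * n_y
    return u_x, u_y


# Hypothetical satisfaction ratings (1 to 5) from two branches; made-up data.
branch_a = [3, 4, 2, 5, 4]
branch_b = [2, 1, 3, 2, 4]

u_a, u_b = mann_whitney_u(branch_a, branch_b)
print("U for branch A:", u_a, "U for branch B:", u_b)
```

A U value far from half of the total number of pairs suggests that one group tends to rank higher than the other; judging whether the difference is significant requires the test's reference distribution, which is outside this sketch.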

Types of Data & Information

Data (at least for purposes of Statistics) fall into three main groups:

• Interval Data
• Nominal Data
• Ordinal Data

Interval Data…

Interval data
• Real numbers, e.g. heights, weights, prices, etc.
• Also referred to as quantitative or numerical.

Arithmetic operations can be performed on interval data, thus it's meaningful to talk about 2*Height, or Price + $1, and so on.


Nominal Data…

Nominal Data
• The values of nominal data are categories.

E.g. responses to questions about marital status, coded as:

Single = 1, Married = 2, Divorced = 3, Widowed = 4

Because the numbers are arbitrary, arithmetic operations don’t make any sense (e.g. does Widowed ÷ 2 = Married?!)

Nominal data are also called qualitative or categorical.

Ordinal Data…

Ordinal data appear to be categorical in nature, but their values have an order; a ranking to them:

E.g. College course rating system:

poor = 1, fair = 2, good = 3, very good = 4, excellent = 5

While it's still not meaningful to do arithmetic on this data (e.g. does 2*fair = very good?!), we can say things like:

excellent > poor or fair < very good

That is, order is maintained no matter what numeric values are assigned to each category.
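The point that order survives any order-preserving choice of codes can be checked with a short sketch. The responses and the second coding scheme (`coding_b`) are hypothetical, invented for the example; only the 1 to 5 scheme comes from the slide.

```python
import statistics

# Hypothetical course ratings from seven students (made-up data).
responses = ["good", "poor", "excellent", "fair", "good", "very good", "good"]

# The coding from the slide, and a hypothetical alternative with the same order.
coding_a = {"poor": 1, "fair": 2, "good": 3, "very good": 4, "excellent": 5}
coding_b = {"poor": 10, "fair": 20, "good": 35, "very good": 70, "excellent": 100}


def median_category(labels, coding):
    """Return the category at the median position, which depends only on rank."""
    codes = [coding[label] for label in labels]
    label_for_code = {code: label for label, code in coding.items()}
    return label_for_code[statistics.median_low(codes)]


def mean_code(labels, coding):
    """Return the mean of the numeric codes, which depends on the arbitrary coding."""
    return statistics.mean([coding[label] for label in labels])


median_a = median_category(responses, coding_a)
median_b = median_category(responses, coding_b)
mean_a = mean_code(responses, coding_a)
mean_b = mean_code(responses, coding_b)

print("median category:", median_a, "and", median_b)
print("mean code:", mean_a, "and", mean_b)
```

The median category is the same under both codings, while the mean lands on "good" under one scheme and between "good" and "very good" under the other, which is why arithmetic on ordinal codes is not meaningful.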

Calculations for Types of Data

As mentioned above,

• All calculations are permitted on interval data.
• Only calculations involving a ranking process are allowed for ordinal data.
• No calculations are allowed for nominal data; only counting the number of observations in each category is possible.

This lends itself to the following “hierarchy of data”…


Hierarchy of Data…

Interval
• Values are real numbers.
• All calculations are valid.
• Data may be treated as ordinal or nominal.

Ordinal
• Values must represent the ranked order of the data.
• Calculations based on an ordering process are valid.
• Data may be treated as nominal but not as interval.

Nominal
• Values are the arbitrary numbers that represent categories.
• Only calculations based on the frequencies of occurrence are valid.
• Data may not be treated as ordinal or interval.
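As a rough illustration of the hierarchy, the sketch below applies to each data type only the kind of calculation the slides allow for it. All three data sets are made up for the example, and the variable names are assumptions rather than anything defined in the slides.

```python
import statistics
from collections import Counter

# Made-up survey data, one variable per level of measurement.
marital_status = ["Single", "Married", "Married", "Divorced", "Married", "Widowed"]  # nominal
course_rating = [3, 5, 4, 2, 4, 4, 1]              # ordinal codes: poor = 1 ... excellent = 5
height_cm = [158.0, 162.5, 171.0, 175.5, 180.0]    # interval

# Nominal: only frequencies of occurrence are valid.
status_counts = Counter(marital_status)
most_common_status = status_counts.most_common(1)[0][0]

# Ordinal: calculations based on ordering, such as the median, are valid.
median_rating = statistics.median(course_rating)

# Interval: all calculations are valid, including the mean.
mean_height = statistics.mean(height_cm)

print("counts per category:", dict(status_counts))
print("most common status:", most_common_status)
print("median rating code:", median_rating)
print("mean height (cm):", mean_height)
```

Moving down the hierarchy is always possible (heights could be ranked or grouped into categories), but moving up is not: no mean is computed for the ratings or the marital status codes.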


End of discussion