Fundamentals of Data Analysis Lecture 3 Basics of statistics

Page 1: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Fundamentals of Data Analysis

Lecture 3

Basics of statistics

Page 2: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Program for today

o Basic terms and definitions
o Discrete distributions
o Continuous distributions
o Normal distribution

Page 3: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Topics for discussion

o What are the applications of statistics in modern physics?
o How important is drawing conclusions based on statistical analysis?

Page 4: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

What is statistics?

Definition of statistics:

1. A collection of quantitative data pertaining to a subject or group, for example blood pressure statistics.

2. The science that deals with the collection, tabulation, analysis, interpretation, and presentation of quantitative data.

Page 5: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

What is statistics?

Two phases of statistics:

Descriptive statistics:
o Describes the characteristics of a product or process using information collected on it.

Inferential (inductive) statistics:
o Draws conclusions about unknown process parameters based on information contained in a sample.
o Uses probability.

Page 6: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Probability

When we cannot rely on the assumption that all sample points are equally likely, we have to determine the probability of an event experimentally. We perform a large number of experiments N and count how often each of the sample points is obtained. The ratio of the number of occurrences of a certain sample point to the total number of experiments is called the relative frequency.

Page 7: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Probability

The probability is then assigned the relative frequency with which a sample point occurs in this long series of repetitions of the experiment. This is justified by the law of large numbers, a theorem stating that the relative frequency converges to the true (theoretical) probability of the outcome as the experiment is repeated more and more times.
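The convergence of the relative frequency can be illustrated with a small simulation. The sketch below is only an illustration: the fair six-sided die, the function names, and the fixed seed are assumptions made for the example, not part of the lecture.

```python
import random

def relative_frequency(event, trial, n_trials, seed=0):
    """Estimate P(event) as n(E)/N over n_trials repetitions of trial()."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = sum(1 for _ in range(n_trials) if event(trial(rng)))
    return hits / n_trials

def roll_die(rng):
    """One experiment: roll a fair six-sided die (assumed example)."""
    return rng.randint(1, 6)

def is_six(outcome):
    """The event E: the die shows a six."""
    return outcome == 6

# The relative frequency should settle near 1/6 as N grows.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(is_six, roll_die, n))
```

For small N the estimate can be noticeably off; only for large N does it stay close to the theoretical value 1/6.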

Page 8: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Probability

$$P(E) = \lim_{N \to \infty} \frac{n(E)}{N}$$

where n(E) is the number of times the event E took place out of a total of N experiments. From this definition we can see that the probability is a number between 0 and 1. When the probability is 1, we know that the outcome is certain.

Page 9: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Probability

For a discrete random variable the definition of probability is intuitive:

$$P(x) = \lim_{N \to \infty} \frac{n(x)}{N}$$

where n(x) is the number of occurrences of the desired value of the random variable x (successes) in N samples (N → ∞).

Page 10: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Probability

For a continuous random variable, this definition requires specifying a small interval Δx (Δx → 0) for which the probability is determined:

$$P(x_0 \le x \le x_0 + \Delta x) = \lim_{N \to \infty} \frac{n(x_0 \le x \le x_0 + \Delta x)}{N}$$

For a continuous random variable it is preferable to use the probability density function:

$$f(x_0) = \lim_{\Delta x \to 0} \frac{P(x_0 \le x \le x_0 + \Delta x)}{\Delta x}$$
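The density can be approximated from data by counting samples in a small interval [x0, x0 + Δx] and dividing the relative frequency by Δx. The following is a minimal sketch under assumed conditions: the samples are drawn from a standard normal distribution purely for illustration, and `estimate_density` is a hypothetical helper name.

```python
import math
import random

def estimate_density(samples, x0, dx):
    """Approximate f(x0) as n(x0 <= x <= x0 + dx) / (N * dx)."""
    count = sum(1 for x in samples if x0 <= x <= x0 + dx)
    return count / (len(samples) * dx)

rng = random.Random(1)  # fixed seed for reproducibility
samples = [rng.gauss(0.0, 1.0) for _ in range(200_000)]

approx = estimate_density(samples, x0=0.0, dx=0.05)
exact = 1.0 / math.sqrt(2.0 * math.pi)  # standard normal density at x = 0
print(f"estimated f(0) = {approx:.3f}, exact = {exact:.3f}")
```

There is a trade-off in the choice of Δx: a smaller interval reduces the bias of the approximation but leaves fewer samples in the interval, so the estimate becomes noisier unless N grows as well.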

Page 11: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Histogram

The histogram is the most important graphical tool for exploring the shape of data distributions, and a good way to visualize trends in population data. The more often a particular value occurs, the taller the corresponding bar on the histogram.

Page 12: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Histogram

Constructing a histogram

Step 1: Find the range of the distribution: largest value minus smallest value.

Step 2: Choose the number of classes, typically 5 to 20.

Step 3: Determine the class width: class width = range / number of classes, quoted to one decimal place more than the data.

Step 4: Determine the class boundaries.

Step 5: Draw the frequency histogram.
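The five steps can be sketched in a few lines of Python. This is an illustrative sketch, not the lecturer's procedure: the data values are made up, the classes are assumed to be of equal width starting at the smallest value, and the "drawing" is a simple text rendering.

```python
def build_histogram(data, n_classes):
    """Return (boundaries, counts) for equal-width classes covering the data."""
    lo, hi = min(data), max(data)
    data_range = hi - lo                       # Step 1: range
    if data_range == 0:
        raise ValueError("all values are identical; no range to divide")
    width = data_range / n_classes             # Steps 2-3: class width
    boundaries = [lo + i * width for i in range(n_classes + 1)]  # Step 4
    counts = [0] * n_classes
    for x in data:
        # The maximum value is placed in the last class.
        idx = min(int((x - lo) / width), n_classes - 1)
        counts[idx] += 1
    return boundaries, counts

def draw_histogram(boundaries, counts):
    """Step 5: render one text bar per class."""
    rows = []
    for i, c in enumerate(counts):
        rows.append(f"{boundaries[i]:6.2f} - {boundaries[i + 1]:6.2f} | {'#' * c}")
    return "\n".join(rows)

# Made-up measurements, used only to exercise the functions.
data = [2.1, 2.4, 2.5, 3.0, 3.2, 3.3, 3.3, 3.9, 4.1, 4.8, 5.0, 5.6, 6.0, 6.1, 7.1]
boundaries, counts = build_histogram(data, n_classes=5)
print(draw_histogram(boundaries, counts))
```

Values that fall exactly on a class boundary are sensitive to floating-point rounding, which is one practical reason for quoting boundaries to one more decimal place than the data.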

Page 13: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Histogram

Number of groups or cells

Fewer than 100 observations: 5 to 9 cells

100 to 500 observations: 8 to 17 cells

More than 500 observations: 15 to 20 cells
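This rule of thumb is easy to encode as a lookup. The function name below is hypothetical, and the treatment of exactly 100 and exactly 500 observations is an assumption, since the rule does not say which cell range the endpoints belong to.

```python
def suggested_cells(n_observations):
    """Return the (min, max) number of cells suggested by the rule of thumb."""
    if n_observations < 100:
        return (5, 9)
    if n_observations <= 500:   # assumption: 100 and 500 fall in the middle band
        return (8, 17)
    return (15, 20)

print(suggested_cells(320))  # (8, 17)
```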

Page 14: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Analysis of histogram

Page 15: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Analysis of histogram

Calculating the average for ungrouped data:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

and for grouped data:

$$\bar{X} = \frac{\sum_{i=1}^{h} f_i X_i}{n} = \frac{f_1 X_1 + f_2 X_2 + \dots + f_h X_h}{f_1 + f_2 + \dots + f_h}$$

Page 16: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Analysis of histogram

Boundaries    Midpoint X_i    Frequency f_i    f_i X_i
23.6-26.5     25.0              4                 100
26.6-29.5     28.0             36                1008
29.6-32.5     31.0             51                1581
32.6-35.5     34.0             63                2142
35.6-38.5     37.0             58                2146
38.6-41.5     40.0             52                2080
41.6-44.5     43.0             34                1462
44.6-47.5     46.0             16                 736
47.6-50.5     49.0              6                 294
Total                         320               11549
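The grouped-data average for this table can be checked directly. The sketch below uses the midpoints and frequencies from the table; the function name `grouped_mean` is a hypothetical choice for the example.

```python
# Midpoints and frequencies copied from the table above.
midpoints = [25.0, 28.0, 31.0, 34.0, 37.0, 40.0, 43.0, 46.0, 49.0]
frequencies = [4, 36, 51, 63, 58, 52, 34, 16, 6]

def grouped_mean(midpoints, frequencies):
    """Average for grouped data: sum(f_i * X_i) / sum(f_i)."""
    n = sum(frequencies)
    total = sum(f * x for f, x in zip(frequencies, midpoints))
    return total / n

n = sum(frequencies)
total = sum(f * x for f, x in zip(frequencies, midpoints))
mean = grouped_mean(midpoints, frequencies)
print(n, total, round(mean, 2))  # 320 11549.0 36.09
```

The result, 11549 / 320 ≈ 36.09, is an approximation of the true average, because every observation in a cell is treated as if it sat at the cell midpoint.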

Page 17: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Measures of dispersion

o Range
o Standard deviation
o Variance

Page 18: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Measures of dispersion

The range is the simplest measure of dispersion and the easiest to calculate:

$$R = X_{\max} - X_{\min}$$

Page 19: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Measures of dispersion

Standard deviation of a sample:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
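Both the range and the sample standard deviation (with the n - 1 divisor) take only a few lines to compute. The sketch below is a plain-Python illustration; the data values and function names are made up for the example.

```python
import math

def data_range(values):
    """Range R = X_max - X_min."""
    return max(values) - min(values)

def sample_std(values):
    """Sample standard deviation with the n - 1 divisor."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))

# Made-up sample: mean is 5, sum of squared deviations is 32.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(data_range(values))            # 7.0
print(round(sample_std(values), 3))  # 2.138
```

The standard library function `statistics.stdev` uses the same n - 1 divisor and can serve as a cross-check.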

Page 20: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Measures of dispersion

For a discrete random variable the variance is defined as:

$$V(x) = \sum_i \left(x_i - E(x)\right)^2 P(x_i)$$

and for a continuous random variable as:

$$V(x) = \int_a^b \left(x - E(x)\right)^2 f(x)\,dx$$
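As a worked instance of the discrete definition, the sketch below computes the variance of a fair six-sided die, where every face has probability 1/6. The die and the function names are assumptions chosen for the example; exact fractions are used so the result is not blurred by rounding.

```python
from fractions import Fraction

def expected_value(values, probs):
    """E(x) = sum of x_i * P(x_i)."""
    return sum(x * p for x, p in zip(values, probs))

def variance(values, probs):
    """V(x) = sum of (x_i - E(x))^2 * P(x_i)."""
    mu = expected_value(values, probs)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probs))

faces = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6   # fair die: all faces equally likely

print(expected_value(faces, probs))  # 7/2
print(variance(faces, probs))        # 35/12
```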

Page 21: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Parameters of a distribution

A parameter is a characteristic of a population; in other words, it describes the population.

A statistic is a characteristic of a sample. It is used to make inferences about population parameters, which are typically unknown; a statistic used this way is called an estimator.

Page 22: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Parameters of a distribution

Population - Set of all items that possess a characteristic of interest

Sample - Subset of a population

Page 23: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Parameters of a distribution

Expected value (EV) of a discrete random variable:

$$E(x) = \sum_{i=1}^{k} x_i P(x_i)$$

and of a continuous random variable:

$$E(x) = \int_a^b x f(x)\,dx$$
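For a continuous random variable the integral can be approximated numerically. The sketch below uses a simple midpoint rule and, as an assumed example, a uniform density on [a, b] = [2, 6], for which the expected value is known to be (a + b) / 2 = 4. The function names are hypothetical.

```python
def integrate(g, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

def expected_value_continuous(f, a, b):
    """E(x) = integral of x * f(x) over [a, b]."""
    return integrate(lambda x: x * f(x), a, b)

a, b = 2.0, 6.0

def uniform_pdf(x):
    """Density of the uniform distribution on [a, b] (assumed example)."""
    return 1.0 / (b - a)

print(integrate(uniform_pdf, a, b))                  # close to 1.0 (normalization)
print(expected_value_continuous(uniform_pdf, a, b))  # close to 4.0
```

Checking that the density integrates to 1 before using it is a cheap safeguard against a wrongly specified f(x).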

Page 24: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Random numbers

1 2 3 4 5 6 7 8 9 10

1534 7106 2836 7873 5574 7545 7590 5574 1202 7712

6128 8993 4102 2551 0330 2358 6427 7067 9325 2454

6047 8566 8644 9343 9297 6751 3500 8754 2913 1258

0806 5201 5705 7355 1448 9562 7514 9205 0402 2427

9915 8274 4525 5695 5752 9630 7172 6988 0227 4264

2882 7158 4341 3463 1178 5789 1173 0670 0820 5067

9213 1223 4388 9760 6691 6861 8214 8813 0611 3131

8410 9836 3899 3883 1253 1683 6988 9978 8026 6751

9974 2362 2103 4326 3825 9079 6187 2721 1489 4216

3402 8162 8226 0782 3364 7871 4500 5598 9424 3816

8188 6569 1492 2139 8823 6878 0613 7161 0241 3834

3825 7020 1124 7483 9155 4919 3209 5959 2364 2555

9801 8788 6338 5899 3309 0807 0968 0539 4205 8257

Page 25: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Normal distribution

Characteristics of the normal curve:

It is symmetrical -- Half the cases are to one side of the center; the other half is on the other side.

The distribution is single peaked, not bimodal or multi-modal

Also known as the Gaussian distribution

Page 27: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Normal distribution

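Probability density function of N(μ, σ):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

where μ is the mean and σ is the standard deviation.

N(0, 1), the standard normal distribution, is a normal distribution with a mean of 0 and a standard deviation of 1.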
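The normal density f(x) = exp(-(x - μ)² / (2σ²)) / (σ√(2π)) and its cumulative probability can be evaluated with the standard library alone. The sketch below is illustrative; the function names are assumptions, and the cumulative probability is expressed through the error function `math.erf`.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma) at x; sigma is the standard deviation."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

print(round(normal_pdf(0.0), 4))                      # 0.3989, the peak of N(0, 1)
print(math.isclose(normal_pdf(-1.3), normal_pdf(1.3)))  # symmetry about the mean
print(round(normal_cdf(1.0) - normal_cdf(-1.0), 4))   # 0.6827, within one sigma
```

Any N(μ, σ) can be reduced to N(0, 1) by the substitution z = (x - μ) / σ, which is what both functions do internally.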

Page 28: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Normal distribution

Page 29: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

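Exponential distribution

Probability density function:

$$f(x) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0$$

where λ > 0 is the rate parameter.

Cumulative distribution function:

The cumulative distribution function is defined as F(x) = P(-∞ < X ≤ x). For the exponential distribution:

$$F(x) = 1 - e^{-\lambda x} \quad \text{for } x \ge 0$$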
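The exponential density and cumulative distribution function can be sketched as follows. The rate parameterization f(x) = λ·exp(-λx) for x ≥ 0 is an assumption here, since the slide's own formula did not survive; some texts use the mean 1/λ as the parameter instead. The function names are hypothetical.

```python
import math

def exponential_pdf(x, lam):
    """Density lam * exp(-lam * x) for x >= 0, and 0 for x < 0."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def exponential_cdf(x, lam):
    """F(x) = P(X <= x) = 1 - exp(-lam * x) for x >= 0, and 0 for x < 0."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

lam = 2.0
median = math.log(2.0) / lam          # the point where F(x) = 0.5
print(exponential_cdf(median, lam))   # close to 0.5
```

The density is the derivative of the cumulative distribution function, which can be verified numerically with a small finite difference.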

Page 30: Fundamentals of Data Analysis  Lecture 3  Basics of statistics

Thank you for your attention!