Introduction to Statistics for Bio Medical Engineers - Kristina M. Ropella


Introduction to Statistics for Biomedical Engineers



Copyright 2007 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopy, recording, or any other) except for brief quotations in printed reviews, without the prior permission of the publisher.

Introduction to Statistics for Biomedical Engineers

Kristina M. Ropella

www.morganclaypool.com

ISBN: 1598291963 paperback

ISBN: 9781598291964 paperback

ISBN: 1598291971 ebook

ISBN: 9781598291971 ebook

DOI: 10.2200/S00095ED1V01Y200708BME014

A Publication in the Morgan & Claypool Publishers series

SYNTHESIS LECTURES ON BIOMEDICAL ENGINEERING #14

Lecture #14

Series Editor: John D. Enderle, University of Connecticut

Series ISSN

ISSN 1930-0328 print

ISSN 1930-0336 electronic



Introduction to Statistics for Biomedical Engineers

Kristina M. Ropella

Department of Biomedical Engineering

Marquette University

SYNTHESIS LECTURES ON BIOMEDICAL ENGINEERING #14



This text is dedicated to all the students who have completed my BIEN 084 statistics course for biomedical engineers and have taught me how to be more effective in communicating the subject matter and making statistics come alive for them. I also thank J. Claypool for his patience and for encouraging me to finally put this text together.

Finally, I thank my family for tolerating my time at home on the laptop.



ABSTRACT

There are many books written about statistics, some brief, some detailed, some humorous, some colorful, and some quite dry. Each of these texts is designed for a specific audience. Too often, texts about statistics have been rather theoretical and intimidating for those not practicing statistical analysis on a routine basis. Thus, many engineers and scientists, who need to use statistics much more frequently than calculus or differential equations, lack sufficient knowledge of the use of statistics. The audience that is addressed in this text is the university-level biomedical engineering student who needs a bare-bones coverage of the most basic statistical analysis frequently used in biomedical engineering practice. The text introduces students to the essential vocabulary and basic concepts of probability and statistics that are required to perform the numerical summary and statistical analysis used in the biomedical field. This text is considered a starting point for important issues to consider when designing experiments, summarizing data, assuming a probability model for the data, testing hypotheses, and drawing conclusions from sampled data.

A student who has completed this text should have sufficient vocabulary to read more advanced texts on statistics and further their knowledge about additional numerical analyses that are used in the biomedical engineering field but are beyond the scope of this text. This book is designed to supplement an undergraduate-level course in applied statistics, specifically in biomedical engineering. Practicing engineers who have not had formal instruction in statistics may also use this text as a simple, brief introduction to statistics used in biomedical engineering. The emphasis is on the application of statistics, the assumptions made in applying the statistical tests, the limitations of these elementary statistical methods, and the errors often committed in using statistical analysis. A number of examples from biomedical engineering research and industry practice are provided to assist the reader in understanding concepts and application. It is beneficial for the reader to have some background in the life sciences and physiology and to be familiar with basic biomedical instrumentation used in the clinical environment.

KEYWORDS

probability model, hypothesis testing, physiology, ANOVA, normal distribution, confidence interval, power test



Contents

1. Introduction

2. Collecting Data and Experimental Design

3. Data Summary and Descriptive Statistics
    3.1 Why Do We Collect Data?
    3.2 Why Do We Need Statistics?
    3.3 What Questions Do We Hope to Address With Our Statistical Analysis?
    3.4 How Do We Graphically Summarize Data?
        3.4.1 Scatterplots
        3.4.2 Time Series
        3.4.3 Box-and-Whisker Plots
        3.4.4 Histogram
    3.5 General Approach to Statistical Analysis
    3.6 Descriptive Statistics
        3.6.1 Measures of Central Tendency
        3.6.2 Measures of Variability

4. Assuming a Probability Model From the Sample Data
    4.1 The Standard Normal Distribution
    4.2 The Normal Distribution and Sample Mean
    4.3 Confidence Interval for the Sample Mean
    4.4 The t Distribution
    4.5 Confidence Interval Using t Distribution

5. Statistical Inference
    5.1 Comparison of Population Means
        5.1.1 The t Test
            5.1.1.1 Hypothesis Testing
            5.1.1.2 Applying the t Test
            5.1.1.3 Unpaired t Test
            5.1.1.4 Paired t Test
            5.1.1.5 Example of a Biomedical Engineering Challenge
    5.2 Comparison of Two Variances
    5.3 Comparison of Three or More Population Means
        5.3.1 One-Factor Experiments
            5.3.1.1 Example of Biomedical Engineering Challenge
        5.3.2 Two-Factor Experiments
        5.3.3 Tukey's Multiple Comparison Procedure

6. Linear Regression and Correlation Analysis

7. Power Analysis and Sample Size
    7.1 Power of a Test
    7.2 Power Tests to Determine Sample Size

8. Just the Beginning

Bibliography

Author Biography


C H A P T E R 1

Introduction

Biomedical engineers typically collect all sorts of data, from patients, animals, cell counters, microassays, imaging systems, pressure transducers, bedside monitors, manufacturing processes, material testing systems, and other measurement systems that support a broad spectrum of research, design, and manufacturing environments. Ultimately, the reason for collecting data is to make a decision. That decision may concern differentiating biological characteristics among different populations of people, determining whether a pharmacological treatment is effective, determining whether it is cost-effective to invest in multimillion-dollar medical imaging technology, determining whether a manufacturing process is under control, or selecting the best rehabilitative therapy for an individual patient.

The challenge in making such decisions often lies in the fact that all real-world data contain some element of uncertainty because of random processes that underlie most physical phenomena. These random elements prevent us from predicting the exact value of any physical quantity at any moment of time. In other words, when we collect a sample or data point, we usually cannot predict the exact value of that sample or experimental outcome. For example, although the average resting heart rate of normal adults is about 70 beats per minute, we cannot predict the exact arrival time of our next heartbeat. However, we can approximate the likelihood that the arrival time of the next heartbeat will fall in a specific time interval if we have a good probability model to describe the random phenomenon contributing to the time interval between heartbeats. The timing of heartbeats is influenced by a number of physiological variables [1], including the refractory period of the individual cells that make up the heart muscle, the leakiness of the cell membranes in the sinus node (the heart's natural pacemaker), and the activity of the autonomic nervous system, which may speed up or slow down the heart rate in response to the body's need for increased blood flow, oxygen, and nutrients. The sum of these biological processes produces a pattern of heartbeats that we may measure by counting the pulse rate from our wrist or carotid artery or by searching for specific QRS waveforms in the ECG [2]. Although this sum of events makes it difficult for us to predict exactly when the next heartbeat will arrive, we can guess, with a certain amount of confidence, when the next beat will arrive. In other words, we can assign a probability to the likelihood that the next heartbeat will arrive in a specified time interval. If we were to consider all possible arrival times and assign a probability to each of those arrival times, we would have a probability model for the heartbeat intervals. If we can find a probability model to describe the likelihood of occurrence of a certain event or experimental outcome, we can use statistical methods to make decisions. The probability models describe characteristics of the population or phenomenon being studied. Statistical analysis then makes use of these models to help us make decisions about the population(s) or processes.
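The idea of assigning a probability to a time interval can be sketched in a few lines of Python. This is only an illustration: the choice of a normal model, the standard deviation of 0.05 s, and the names `rr_model` and `prob_next_beat_between` are assumptions made here, not values or methods taken from the text. Only the mean heart rate of 70 beats per minute comes from the example above.

```python
from statistics import NormalDist

# Hypothetical model parameters (assumptions for illustration only):
# a mean heart rate of 70 beats per minute gives a mean R-R interval of 60/70 s.
MEAN_RR_S = 60.0 / 70.0
SD_RR_S = 0.05  # assumed beat-to-beat standard deviation, in seconds

# Assumed probability model for the interval between heartbeats.
rr_model = NormalDist(mu=MEAN_RR_S, sigma=SD_RR_S)

def prob_next_beat_between(lower_s, upper_s, model=rr_model):
    """Probability that the next R-R interval falls in [lower_s, upper_s]."""
    return model.cdf(upper_s) - model.cdf(lower_s)

p_window = prob_next_beat_between(0.80, 0.90)
print(f"P(0.80 s <= R-R <= 0.90 s) = {p_window:.3f}")
```

If the assumed model is a poor description of the real heartbeat intervals, the probability it returns is equally poor, which is the point made in the next paragraph.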

The conclusions that one may draw from using statistical analysis are only as good as the underlying model that is used to describe the real-world phenomenon, such as the time interval between heartbeats. For example, a normally functioning heart exhibits considerable variability in beat-to-beat intervals (Figure 1.1). This variability reflects the body's continual effort to maintain homeostasis so that the body may continue to perform its most essential functions and supply the body with the oxygen and nutrients required to function normally. It has been demonstrated through biomedical research that there is a loss of heart rate variability associated with some diseases, such as diabetes and ischemic heart disease. Researchers seek to determine if this difference in variability between normal subjects and subjects with heart disease is significant (meaning, it is due to some underlying change in biology and not simply a result of chance) and whether it might be used to predict the progression of the disease [1]. One will note that the probability model changes as a consequence of changes in the underlying biological function or process. In the case of manufacturing, the probability model used to describe the output of the manufacturing process may change as a function of machine operation or changes in the surrounding manufacturing environment, such as temperature, humidity, or human operator.

FIGURE 1.1: Example of an ECG recording, where R-R interval is defined as the time interval between successive R waves of the QRS complex, the most prominent waveform of the ECG.

Besides helping us to describe the probability model associated with real-world phenomena, statistics help us to make decisions by giving us quantitative tools for testing hypotheses. We call this inferential statistics, whereby the outcome of a statistical test allows us to draw conclusions or make inferences about one or more populations from which samples are drawn. Most often, scientists and engineers are interested in comparing data from two or more different populations or from two or more different processes. Typically, the default hypothesis is that there is no difference in the distributions of two or more populations or processes, and we use statistical analysis to determine whether there are true differences in the distributions of the underlying populations that warrant assigning different probability models to the individual processes.

In summary, biomedical engineers typically collect data or samples from various phenomena, which contain some element of randomness or unpredictable variability, for the purposes of making decisions. To make sound decisions in the context of the uncertainty with some level of confidence, we need to assume some probability model for the populations from which the samples have been collected. Once we have assumed an underlying model, we can select the appropriate statistical tests for comparing two or more populations and then use these tests to draw conclusions about our hypotheses for which we collected the data in the first place. Figure 1.2 outlines the steps for performing statistical analysis of data.

FIGURE 1.2: Steps in statistical analysis.

In the following chapters, we will describe methods for graphically and numerically summarizing collected data. We will then talk about fitting a probability model to the collected data by briefly describing a number of well-known probability models that are used to describe biological phenomena. Finally, once we have assumed a model for the populations from which we have collected our sample data, we will discuss the types of statistical tests that may be used to compare data from multiple populations and allow us to test hypotheses about the underlying populations.



C H A P T E R 2

Collecting Data and Experimental Design

Before we discuss any type of data summary and statistical analysis, it is important to recognize that the value of any statistical analysis is only as good as the data collected. Because we are using data or samples to draw conclusions about entire populations or processes, it is critical that the data collected (or samples collected) are representative of the larger, underlying population. In other words, if we are trying to determine whether men between the ages of 20 and 50 years respond positively to a drug that reduces cholesterol level, we need to carefully select the population of subjects to whom we administer the drug and from whom we take measurements. We have to have enough samples to represent the variability of the underlying population. There is a great deal of variety in the weight, height, genetic makeup, diet, exercise habits, and drug use in all men ages 20 to 50 years who may also have high cholesterol. If we are to test the effectiveness of a new drug in lowering cholesterol, we must collect enough data or samples to capture the variability of biological makeup and environment of the population that we are interested in treating with the new drug. Capturing this variability is often the greatest challenge that biomedical engineers face in collecting data and using statistics to draw meaningful conclusions. The experimentalist must ask questions such as the following:

• What type of person, object, or phenomenon do I sample?
• What variables that impact the measure or data can I control?
• How many samples do I require to capture the population variability to apply the appropriate statistics and draw meaningful conclusions?
• How do I avoid biasing the data with the experimental design?

Experimental design, although not the primary focus of this book, is the most critical step to support the statistical analysis that will lead to meaningful conclusions and hence sound decisions.

One of the most fundamental questions asked by biomedical researchers is, "What size sample do I need?" or "How many subjects will I need to make decisions with any level of confidence?" We will address these important questions at the end of this book when concepts such as variability, probability models, and hypothesis testing have already been covered. For example, power tests will be described as a means for predicting the sample size required to detect significant differences in two population means using a t test.

Two elements of experimental design that are critical to prevent biasing the data or selecting samples that do not fairly represent the underlying population are randomization and blocking.

Randomization refers to the process by which we randomly select samples or experimental units from the larger underlying population such that we maximize our chance of capturing the variability in the underlying population. In other words, we do not limit our samples such that only a fraction of the characteristics or behaviors of the underlying population are captured in the samples. More importantly, we do not bias the results by artificially limiting the variability in the samples such that we alter the probability model of the sample population with respect to the probability model of the underlying population.

In addition to randomizing our selection of experimental units from which to take samples, we might also randomize our assignment of treatments to our experimental units. Or, we may randomize the order in which we take data from the experimental units. For example, if we are testing the effectiveness of two different medical imaging methods in detecting brain tumor, we will randomly assign all subjects suspected of having brain tumor to one of the two imaging methods. Thus, if we have a mix of sex, age, and type of brain tumor participating in the study, we reduce the chance of having all one sex or one age group assigned to one imaging method and a very different type of population assigned to the second imaging method. If a difference is noted in the outcome of the two imaging methods, we will not have artificially introduced sex or age as a factor influencing the imaging results.

As another example, if one is testing the strength of three different materials for use in hip implants using several strength measures from a materials testing machine, one might randomize the order in which samples of the three different test materials are submitted to the machine. Machine performance can vary with time because of wear, temperature, humidity, deformation, stress, and user characteristics. If the biomedical engineer were asked to find the strongest material for an artificial hip using specific strength criteria, he or she may conduct an experiment. Let us assume that the engineer is given three boxes, with each box containing five artificial hip implants made from one of three materials: titanium, steel, and plastic. For any one box, all five implant samples are made from the same material. To test the 15 different implants for material strength, the engineer might randomize the order in which each of the 15 implants is tested in the materials testing machine so that time-dependent changes in machine performance, machine-material interactions, or time-varying environmental conditions do not bias the results for one or more of the materials. Thus, to fully randomize the implant testing, an engineer may literally place the numbers 1 to 15 in a hat and also assign the numbers 1 to 15 to the implants to be tested. The engineer will then blindly draw one of the 15 numbers from the hat and test the implant that corresponds to that number. This way the engineer is not testing all of one material in any particular order, and we avoid introducing order effects into the data.
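Drawing numbers from a hat can be mimicked with a pseudorandom shuffle. The following is a minimal sketch of that idea; the function name `randomized_test_order` and the optional `seed` argument are illustrative choices introduced here and are not part of the text.

```python
import random

# Three boxes of five implants each, as in the example above.
MATERIALS = ["titanium", "steel", "plastic"]
IMPLANTS_PER_BOX = 5

def randomized_test_order(seed=None):
    """Return the 15 implants as (number, material) pairs in random test order."""
    materials_in_boxes = [m for m in MATERIALS for _ in range(IMPLANTS_PER_BOX)]
    # Assign the numbers 1 to 15 to the implants.
    implants = list(enumerate(materials_in_boxes, start=1))
    rng = random.Random(seed)  # the seed is optional; it only makes the draw repeatable
    rng.shuffle(implants)      # plays the role of drawing numbers blindly from a hat
    return implants

test_order = randomized_test_order(seed=2007)
for position, (number, material) in enumerate(test_order, start=1):
    print(f"test {position:2d}: implant #{number:2d} ({material})")
```

Recording the seed, or simply the resulting order, lets the test sequence be documented alongside the strength measurements.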

The second aspect of experimental design is blocking. In many experiments, we are interested in one or two specific factors or variables that may impact our measure or sample. However, there may be other factors that also influence our measure and confound our statistics. In good experimental design, we try to collect samples such that different treatments within the factor of interest are not biased by the differing values of the confounding factors. In other words, we should be certain that every treatment within our factor of interest is tested within each value of the confounding factor. We refer to this design as blocking by the confounding factor. For example, we may want to study weight loss as a function of three different diet pills. One confounding factor may be a person's starting weight. Thus, in testing the effectiveness of the three pills in reducing weight, we may want to block the subjects by starting weight. Thus, we may first group the subjects by their starting weight and then test each of the diet pills within each group of starting weights.
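One way to picture blocking by starting weight is sketched below. The subject IDs, weights, pill labels, and the rule of forming blocks from consecutive groups of three subjects ranked by weight are all assumptions made for illustration; the text does not prescribe how the weight groups are formed.

```python
import random

PILLS = ["pill A", "pill B", "pill C"]

def blocked_assignment(starting_weights_kg, seed=None):
    """Assign each pill once within every block of similar starting weights.

    starting_weights_kg maps a subject ID to a starting weight. Subjects are
    ranked by weight and cut into consecutive blocks of len(PILLS) subjects.
    Subjects left over when the count is not a multiple of len(PILLS) are
    not assigned.
    """
    rng = random.Random(seed)
    ranked = sorted(starting_weights_kg, key=starting_weights_kg.get)
    assignment = {}
    for start in range(0, len(ranked) - len(PILLS) + 1, len(PILLS)):
        block = ranked[start:start + len(PILLS)]
        pills = PILLS[:]
        rng.shuffle(pills)  # randomize which subject in the block gets which pill
        for subject, pill in zip(block, pills):
            assignment[subject] = (start // len(PILLS) + 1, pill)
    return assignment

# Hypothetical subjects and starting weights (kg), for illustration only.
weights = {"S1": 82, "S2": 118, "S3": 95, "S4": 79, "S5": 121,
           "S6": 99, "S7": 85, "S8": 101, "S9": 125}
plan = blocked_assignment(weights, seed=1)
for subject, (block, pill) in sorted(plan.items(), key=lambda kv: kv[1][0]):
    print(f"block {block}: {subject} ({weights[subject]} kg) -> {pill}")
```

Every pill appears exactly once in every weight block, so a difference between pills cannot be explained by one pill having been given mostly to heavier or lighter subjects.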

In biomedical research, we often block by experimental unit. When this type of blocking is part of the experimental design, the experimentalist collects multiple samples of data, with each sample representing different experimental conditions, from each of the experimental units. Figure 2.1 provides a diagram of an experiment in which data are collected before and after patients receive therapy, and the experimental design uses blocking (left) or no blocking (right) by experimental unit. In the case of blocking, data are collected before and after therapy from the same set of human subjects. Thus, within an individual, the same biological factors that influence the biological response to the therapy are present before and after therapy. Each subject serves as his or her own control for factors that may randomly vary from subject to subject both before and after therapy. In essence, with blocking, we are eliminating biases in the differences between the two populations (before and after) that may result because we are using two different sets of experimental units. For example, if we used one set of subjects before therapy and then an entirely different set of subjects after therapy (Figure 2.1, right), there is a chance that the two sets of subjects may vary enough in sex, age, weight, race, or genetic makeup to produce a difference in response to the therapy that has little to do with the underlying therapy. In other words, there may be confounding factors that contribute to the difference in the experimental outcome before and after therapy, so that the difference is not only a result of the therapy but also an artifact of differences in the distributions of the two different groups of subjects from which the two sample sets were chosen. Blocking will help to eliminate the effect of intersubject variability.

Block (repeated measures)             No block (no repeated measures)

Subject  Measure    Measure           Subject  Measure    Subject  Measure
         before     after                      before              after
         treatment  treatment                  treatment           treatment
1        M11        M12               1        M1         K+1      M(K+1)
2        M21        M22               2        M2         K+2      M(K+2)
3        M31        M32               3        M3         K+3      M(K+3)
...      ...        ...               ...      ...        ...      ...
K        MK1        MK2               K        MK         K+K      M(K+K)

FIGURE 2.1: Samples are drawn from two populations (before and after treatment), and the experimental design uses block (left) or no block (right). In this case, the block is the experimental unit (subject) from which the measures are made.

However, blocking is not always possible, given the nature of some biomedical research studies. For example, if one wanted to study the effectiveness of two different chemotherapy drugs in reducing tumor size, it is impractical to test both drugs on the same tumor mass. Thus, the two drugs are tested on different groups of individuals. The same type of design would be necessary for testing the effectiveness of weight-loss regimens.

Thus, some important concepts and definitions to keep in mind when designing experiments include the following:

• experimental unit: the item, object, or subject to which we apply the treatment and from which we take sample measurements;
• randomization: allocating the treatments randomly to the experimental units;
• blocking: assigning all treatments within a factor to every level of the blocking factor. Often, the blocking factor is the experimental unit. Note that in using blocking, we still randomize the order in which treatments are applied to each experimental unit to avoid ordering bias.

Finally, the experimentalist must always think about how representative the sample population is with respect to the greater underlying population. Because it is virtually impossible to test every member of a population or every product rolling down an assembly line, especially when destructive testing methods are used, the biomedical engineer must often collect data from a much smaller sample drawn from the larger population. It is important, if the statistics are going to lead to useful conclusions, that the sample population captures the variability of the underlying population. What is even more challenging is that we often do not have a good grasp of the variability of the underlying population, and because of expense and respect for life, we are typically limited in the number of samples we may collect in biomedical research and manufacturing. These limitations are not easy to address and require that the engineer always consider how fair the sample and data analysis is and how well it represents the underlying population(s) from which the samples are drawn.


C H A P T E R 3

Data Summary and Descriptive Statistics

We assume now that we have collected our data through the use of good experimental design. We now have a collection of numbers, observations, or descriptions to describe our data, and we would like to summarize the data to make decisions, test a hypothesis, or draw a conclusion.

3.1 WHY DO WE COLLECT DATA?

The world is full of uncertainty, in the sense that there are random or unpredictable factors that influence every experimental measure we make. The unpredictable aspects of the experimental outcomes also arise from the variability in biological systems (due to genetic and environmental factors) and manufacturing processes, human error in making measurements, and other underlying processes that influence the measures being made.

Despite the uncertainty regarding the exact outcome of an experiment or occurrence of a future event, we collect data to try to better understand the processes or populations that influence an experimental outcome so that we can make some predictions. Data provide information to reduce uncertainty and allow for decision making. When properly collected and analyzed, data help us solve problems. It cannot be stressed enough that the data must be properly collected and analyzed if the data analysis and subsequent conclusions are to have any value.

3.2 WHY DO WE NEED STATISTICS?

We have three major reasons for using statistical data summary and analysis:

1. The real world is full of random events that cannot be described by exact mathematical expressions.
2. Variability is a natural and normal characteristic of the natural world.
3. We like to make decisions with some confidence. This means that we need to find trends within the variability.


3.3 WHAT QUESTIONS DO WE HOPE TO ADDRESS WITH OUR STATISTICAL ANALYSIS?

There are several basic questions we hope to address when using numerical and graphical summary of data:

1. Can we differentiate between groups or populations?
2. Are there correlations between variables or populations?
3. Are processes under control?

Finding physiological differences between populations is probably the most frequent aim of biomedical research. For example, researchers may want to know if there is a difference in life expectancy between overweight and underweight people. Or, a pharmaceutical company may want to determine if one type of antibiotic is more effective in combating bacteria than another. Or, a physician wonders if diastolic blood pressure is reduced in a group of hypertensive subjects after the consumption of a pressure-reducing drug. Most often, biomedical researchers are comparing populations of people or animals that have been exposed to two or more different treatments or diagnostic tests, and they want to know if there is a difference between the responses of the populations that have received different treatments or tests. Sometimes, we are drawing multiple samples from the same group of subjects or experimental units. A common example is when the physiological data are taken before and after some treatment, such as drug intake or electronic therapy, from one group of patients. We call this type of data collection blocking in the experimental design. This concept of blocking is discussed more fully in Chapter 2.

Another question that is frequently the target of biomedical research is whether there is a correlation between two physiological variables. For example, is there a correlation between body build and mortality? Or, is there a correlation between fat intake and the occurrence of cancerous tumors? Or, is there a correlation between the size of the ventricular muscle of the heart and the frequency of abnormal heart rhythms? These types of questions involve collecting two sets of data and performing a correlation analysis to determine how well one set of data may be predicted from another. When we speak of correlation analysis, we are referring to the linear relation between two variables and the ability to predict one set of data by modeling the data as a linear function of the second set of data. Because correlation analysis only quantifies the linear relation between two processes or data sets, nonlinear relations between the two processes may not be evident. A more detailed description of correlation analysis may be found in Chapter 6.
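The limitation to linear relations can be seen in a small numerical sketch. The code below computes the Pearson correlation coefficient, a standard measure of linear correlation, for two made-up data sets; the helper name `pearson_r` and the data values are illustrative assumptions and do not come from the text.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]          # made-up values, for illustration only
y_linear = [2 * v + 1 for v in x]     # exactly linear in x
y_parabola = [v ** 2 for v in x]      # fully determined by x, but not linearly

r_linear = pearson_r(x, y_linear)
r_parabola = pearson_r(x, y_parabola)
print(f"linear relation:    r = {r_linear:.3f}")
print(f"parabolic relation: r = {r_parabola:.3f}")
```

The second data set is perfectly predictable from x, yet its correlation coefficient is essentially zero, which is why a scatterplot of the raw data is worth examining before relying on a correlation value.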

Finally, a biomedical engineer, particularly the engineer involved in manufacturing, may be interested in knowing whether a manufacturing process is under control. Such a question may arise if there are tight controls on the manufacturing specifications for a medical device. For example, if the engineer is trying to ensure quality in producing intravascular catheters that must have diameters between 1 and 2 cm, the engineer may randomly collect samples of catheters from the assembly line at random intervals during the day, measure their diameters, determine how many of the catheters meet specifications, and determine whether there is a sudden change in the number of catheters that fail to meet specifications. If there is such a change, the engineers may look for elements of the manufacturing process that change over time, changes in environmental factors, or user errors. The engineer can use control charts to assess whether the processes are under control. These methods of statistical analysis are not covered in this text, but may be found in a number of references, including [3].
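The counting step in the catheter example might look like the sketch below. It is not a control chart, which this text does not cover; it only tallies how many sampled catheters fall outside the 1 to 2 cm specification at each sampling time. The sampling times and diameters are hypothetical, as is the function name `count_out_of_spec`.

```python
SPEC_MIN_CM = 1.0
SPEC_MAX_CM = 2.0

def count_out_of_spec(diameters_cm):
    """Count catheters whose diameter lies outside [SPEC_MIN_CM, SPEC_MAX_CM]."""
    return sum(1 for d in diameters_cm if not SPEC_MIN_CM <= d <= SPEC_MAX_CM)

# Hypothetical diameters (cm) from four sampling times during one day.
batches = {
    "08:00": [1.4, 1.5, 1.6, 1.5, 1.3],
    "11:00": [1.5, 1.7, 1.4, 1.6, 1.5],
    "14:00": [1.6, 1.9, 2.1, 1.8, 2.2],
    "17:00": [2.3, 1.9, 2.1, 2.4, 2.0],
}
failures = {time: count_out_of_spec(d) for time, d in batches.items()}
for time, n_fail in failures.items():
    print(f"{time}: {n_fail} of {len(batches[time])} out of specification")
```

In this made-up data the failure count rises in the afternoon, the kind of change that would prompt a look at machine wear, environmental factors, or operator error.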

3.4 HOW DO WE GRAPHICALLY SUMMARIZE DATA?

We can summarize data in graphical or numerical form. The numerical form is what we refer to as statistics. Before blindly applying the statistical analysis, it is always good to look at the raw data, usually in a graphical form, and then use graphical methods to summarize the data in an easy-to-interpret format.

The types of graphical displays that are most frequently used by biomedical engineers include the following: scatterplots, time series, box-and-whisker plots, and histograms.

Details for creating these graphical summaries are described in [3–6], but we will briefly describe them here.

3.4.1 Scatterplots

The scatterplot simply graphs the occurrence of one variable with respect to another. In most cases, one of the variables may be considered the independent variable (such as time or subject number), and the second variable is considered the dependent variable. Figure 3.1 illustrates an example of a scatterplot for two sets of data. In general, we are interested in whether there is a predictable relationship that maps our independent variable (such as respiratory rate) into our dependent variable (such as heart rate). If there is a linear relationship between the two variables, the data points should fall close to a straight line.

3.4.2 Time Series

A time series is used to plot the changes in a variable as a function of time. The variable is usually a physiological measure, such as electrical activation in the brain or hormone concentration in the blood stream, that changes with time. Figure 3.2 illustrates an example of a time series plot. In this figure, we are looking at a simple sinusoid function as it changes with time.




3.4.3 Box-and-Whisker Plots
These plots illustrate the first, second, and third quartiles as well as the minimum and maximum values of the data collected. The second quartile (Q2) is also known as the median of the data. This quantity, as defined later in this text, is the middle data point or sample value when the samples are listed in descending order. The first quartile (Q1) can be thought of as the median value of the samples that fall below the second quartile. Similarly, the third quartile (Q3) can be thought of as the median value of the samples that fall above the second quartile. Box-and-whisker plots are useful in that they highlight whether there is skew to the data or any unusual outliers in the samples (Figure 3.3).

FIGURE 3.1: Example of a scatterplot. (Axes: Independent Variable, Dependent Variable.)

FIGURE 3.2: Example of a time series plot. The amplitude of the samples is plotted as a function of time. (Axes: Time (msec), Amplitude.)

FIGURE 3.3: Illustration of a box-and-whisker plot for the data set listed. The first (Q1), second (Q2), and third (Q3) quartiles are shown. In addition, the whiskers extend to the minimum and maximum values of the sample set. (Axes: Category, Dependent Variable.)

3.4.4 Histogram
The histogram is defined as a frequency distribution. Given N samples or measurements, xi, which range from Xmin to Xmax, the samples are grouped into nonoverlapping intervals (bins), usually of equal width (Figure 3.4). Typically, the number of bins is on the order of 7–14, depending on the nature of the data. In addition, we typically expect to have at least three samples per bin [7]. Sturges' rule [6] may also be used to estimate the number of bins and is given by

k = 1 + 3.3 log10(n),

where k is the number of bins and n is the number of samples.
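As a quick illustration of Sturges' rule, the following minimal Python sketch evaluates the formula for a few sample sizes. The function name `sturges_bins` and the choice to round the result up to a whole number of bins are assumptions made for this example; the rule itself only gives an approximate guide.

```python
import math


def sturges_bins(n: int) -> int:
    """Suggested number of histogram bins, k = 1 + 3.3 * log10(n).

    Rounding up to the next whole bin is an assumption of this sketch.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return math.ceil(1 + 3.3 * math.log10(n))


# Sample sizes chosen only for illustration.
for n in (50, 1000, 2000):
    print(f"n = {n:5d} -> k = {sturges_bins(n)} bins")
```

For sample sizes from about 50 to 2000, the rule suggests between 7 and 12 bins, which is consistent with the range of 7–14 quoted above.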

Each bin of the histogram has a lower boundary, upper boundary, and midpoint. The histogram is constructed by plotting the number of samples in each bin. Figure 3.5 illustrates a histogram for 1000 samples drawn from a normal distribution with mean (μ) = 0 and standard deviation (σ) = 1.0. On the horizontal axis, we have the sample value, and on the vertical axis, we have the number of occurrences of samples that fall within a bin.

Two measures that we find useful in describing a histogram are the absolute frequency and relative frequency in one or more bins. These quantities are defined as

a) fi = absolute frequency in the ith bin;
b) fi/n = relative frequency in the ith bin, where n is the total number of samples being summarized in the histogram.
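The absolute and relative frequencies defined above can be computed directly from the raw samples. The sketch below is one possible way to do so using only the Python standard library; the function name `histogram`, the choice of 11 equal-width bins, and the random seed are assumptions for illustration.

```python
import random


def histogram(samples, num_bins):
    """Return (bin_edges, absolute_freq, relative_freq) for equal-width bins."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        raise ValueError("samples must not all be identical")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in samples:
        # The largest sample is assigned to the last bin.
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    n = len(samples)
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts, [c / n for c in counts]


random.seed(0)  # fixed seed so the illustration is repeatable
samples = [random.gauss(0.0, 1.0) for _ in range(1000)]
edges, f_abs, f_rel = histogram(samples, 11)

for lower, upper, fa, fr in zip(edges[:-1], edges[1:], f_abs, f_rel):
    midpoint = (lower + upper) / 2
    print(f"[{lower:6.2f}, {upper:6.2f}) midpoint {midpoint:6.2f}: "
          f"absolute {fa:4d}, relative {fr:.3f}")
```

The absolute frequencies sum to n, and the relative frequencies sum to 1, which is what allows the histogram to serve as a rough estimate of a probability distribution.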




A number of algorithms used by biomedical instruments for diagnosing or detecting abnormalities in biological function make use of the histogram of collected data and the associated relative frequencies of selected bins [8]. Oftentimes, normal and abnormal physiological function (breath sounds, heart rate variability, frequency content of electrophysiological signals) may be differentiated by comparing the relative frequencies in targeted bins of the histograms of data representing these biological processes.

FIGURE 3.4: One bin of a histogram plot. The bin is defined by a lower bound, a midpoint, and an upper bound.

FIGURE 3.5: Example of a histogram plot. The value of the measure or sample is plotted on the horizontal axis, whereas the frequency of occurrence of that measure or sample is plotted along the vertical axis. (Axes: Normalized Value, Frequency.)


The histogram can exhibit several shapes. The shapes, illustrated in Figure 3.6, are referred to as symmetric, skewed, or bimodal.

A skewed histogram may be attributed to the following [9]:

1. mechanisms of interest that generate the data (e.g., the physiological mechanisms that determine the beat-to-beat intervals in the heart);
2. an artifact of the measurement process or a shift in the underlying mechanism over time (e.g., there may be time-varying changes in a manufacturing process that lead to a change in the statistics of the manufacturing process over time);
3. a mixing of populations from which samples are drawn (this is typically the source of a bimodal histogram).

The histogram is important because it serves as a rough estimate of the true probability density function or probability distribution of the underlying random process from which the samples are being collected.

The probability density function or probability distribution is a function that quantifies the probability of a random event, x, occurring. When the underlying random event is discrete in nature, we refer to the probability density function as the probability mass function [10]. In either case, the function describes the probabilistic nature of the underlying random variable or event and allows us to predict the probability of observing a specific outcome, x (represented by the random variable), of an experiment. The cumulative distribution function is simply the sum of the probabilities for a group of outcomes, where the outcome is less than or equal to some value, x.

Let us consider a random variable for which the probability density function is well defined (for most real-world phenomena, such a probability model is not known). The random variable is the outcome of a single toss of a die. Given a single fair die with six sides, the probability of rolling a six on the throw of the die is 1 of 6. In fact, the probability of throwing a one is also 1 of 6. If we consider all possible outcomes of the toss of a die and plot the probability of observing any one of those six outcomes in a single toss, we would have a plot such as that shown in Figure 3.7.

This plot shows the probability density or probability mass function for the toss of a die. This type of probability model is known as a uniform distribution because each outcome has the exact same probability of occurring (1/6 in this case).

FIGURE 3.6: Examples of a symmetric (top), skewed (middle), and bimodal (bottom) histogram. In each case, 2000 samples were drawn from the underlying populations. (Axes in each panel: Measure, Frequency.)

For the toss of a die, we know the true probability distribution. However, for most real-world random processes, especially biological processes, we do not know what the true probability density or mass function looks like. As a consequence, we have to use the histogram, created from a small sample, to try to estimate the best probability distribution or probability model to describe the real-world phenomenon. If we return to the example of the toss of a die, we can actually toss the die a number of times and see how closely the histogram, obtained from experimental data, matches the true probability mass function for the ideal six-sided die. Figure 3.8 illustrates the histograms for the outcomes of 50 and 1000 tosses of a single die. Note that with only 50 tosses or samples, it is difficult to determine what the true probability distribution might look like. However, as we approach 1000 samples, the histogram approaches the true probability mass function (the uniform distribution) for the toss of a die. Even then, there is still some variability from bin to bin, so the histogram does not look as uniform as the ideal probability distribution illustrated in Figure 3.7. The message to take away from this illustration is that most biomedical research reports the outcomes of a small number of samples. It is clear from the dice example that the statistics of the underlying random process are very difficult to discern from a small sample, yet most biomedical research relies on data from small samples.
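The dice experiment is easy to repeat numerically. The following sketch simulates tosses of a fair six-sided die and reports how far the relative frequencies fall from the ideal value of 1/6. The helper name `relative_frequencies`, the numbers of tosses, and the seed are illustrative assumptions, and the results will vary from run to run if the seed is changed.

```python
import random
from collections import Counter


def relative_frequencies(num_tosses, rng):
    """Relative frequency of each face (1-6) over num_tosses simulated tosses."""
    counts = Counter(rng.randint(1, 6) for _ in range(num_tosses))
    return {face: counts[face] / num_tosses for face in range(1, 7)}


rng = random.Random(1)  # fixed seed, chosen arbitrarily
for num_tosses in (50, 1000):
    freqs = relative_frequencies(num_tosses, rng)
    worst = max(abs(p - 1 / 6) for p in freqs.values())
    summary = ", ".join(f"{face}: {p:.3f}" for face, p in freqs.items())
    print(f"{num_tosses:5d} tosses -> {summary}")
    print(f"      largest deviation from 1/6: {worst:.3f}")
```

With a small number of tosses, the deviations from 1/6 are typically large enough to obscure the uniform shape; with more tosses, they tend to shrink.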

3.5 GENERAL APPROACH TO STATISTICAL ANALYSIS
We have now collected our data and looked at some graphical summaries of the data. Now we will use numerical summary, also known as statistics, to try to describe the nature of the underlying population or process from which we have taken our samples. From these descriptive statistics, we assume a probability model or probability distribution for the underlying population or process and then select the appropriate statistical tests to test hypotheses or make decisions. It is important to note that the conclusions one may draw from a statistical test depend on how well the assumed probability model fits the underlying population or process.

FIGURE 3.7: The probability density function for a discrete random variable (probability mass function). In this case, the random variable is the value of a toss of a single die. Note that each of the six possible outcomes has a probability of occurrence of 1 of 6. This probability density function is also known as a uniform probability distribution. (Axes: Result of Toss of Single Dice, Relative Frequency.)

FIGURE 3.8: Histograms representing the outcomes of experiments in which a single die is tossed 50 (top) and 2000 times (lower), respectively. Note that as the sample size increases, the histogram approaches the true probability distribution illustrated in Figure 3.7. (Axes in each panel: Value of Dice Toss, Relative Frequency.)

As stated in the Introduction, biomedical engineers are trying to make decisions about populations or processes to which they have limited access. Thus, they design experiments and collect samples that they think will fairly represent the underlying population or process. Regardless of what type of statistical analysis will result from the investigation or study, all statistical analysis should follow the same general approach:

1. Measure a limited number of representative samples from a larger population.
2. Estimate the true statistics of the larger population from the sample statistics.

Some important concepts need to be addressed here. The first concept is somewhat obvious. It is often impossible or impractical to take measurements or observations from an entire population. Thus, the biomedical engineer will typically select a smaller, more practical sample that represents the underlying population and the extent of variability in the larger population. For example, we cannot possibly measure the resting body temperature of every person on earth to get an estimate of normal body temperature and normal range. We are interested in knowing what the normal body temperature is, on average, of a healthy human being and the normal range of resting temperatures as well as the likelihood or probability of measuring a specific body temperature under healthy, resting conditions. In trying to determine the characteristics or underlying probability model for body temperature for healthy, resting individuals, the researcher will select, at random, a sample of healthy, resting individuals and measure their individual resting body temperatures with a thermometer. The researcher will have to consider the composition and size of the sample population to adequately represent the variability in the overall population. The researcher will also have to define what characterizes a normal, healthy individual, such as age, size, race, sex, and other traits. If a researcher were to collect body temperature data from such a sample of 3000 individuals, he or she may plot a histogram of temperatures measured from the 3000 subjects and end up with the following histogram (Figure 3.9). The researcher may also calculate some basic descriptive statistics for the 3000 samples, such as sample average (mean), median, and standard deviation.

FIGURE 3.9: Histogram for 2000 internal body temperatures collected from a normally distributed population. (Plot title: Body Temperature; axes: Temperature (F), Density.)


Once the researcher has estimated the sample statistics from the sample population, he or she will try to draw conclusions about the larger (true) population. The most important question to ask when reviewing the statistics and conclusions drawn from the sample population is how well the sample population represents the larger, underlying population.

Once the data have been collected, we use some basic descriptive statistics to summarize the data. These basic descriptive statistics include the following general measures: central tendency, variability, and correlation.

3.6 DESCRIPTIVE STATISTICS
There are a number of descriptive statistics that help us to picture the distribution of the underlying population. In other words, our ultimate goal is to assume an underlying probability model for the population and then select the statistical analyses that are appropriate for that probability model.

When we try to draw conclusions about the larger underlying population or process from our smaller sample of data, we assume that the underlying model for any sample, event, or measure (the outcome of the experiment) is as follows:

X = μ ± individual differences ± situational factors ± unknown variables,

where X is our measure or sample value and is influenced by μ, which is the true population mean; individual differences such as genetics, training, motivation, and physical condition; situational factors such as environmental factors; and unknown variables such as unidentified/nonquantified factors that behave in an unpredictable fashion from moment to moment.

In other words, when we make a measurement or observation, the measured value represents or is influenced by not only the statistics of the underlying population, such as the population mean, but also factors such as biological variability from individual to individual, environmental factors (time, temperature, humidity, lighting, drugs, etc.), and random factors that cannot be predicted exactly from moment to moment. All of these factors will give rise to a histogram for the sample data, which may or may not reflect the true probability density function of the underlying population. If we have done a good job with our experimental design and collected a sufficient number of samples, the histogram and descriptive statistics for the sample population should closely reflect the true probability density function and descriptive statistics for the true or underlying population. If this is the case, then we can draw conclusions about the larger population from the smaller sample population. If the sample population does not reflect the variability of the true population, then the conclusions we draw from statistical analysis of the sample data may be of little value.




There are a number of probability models that are useful for describing biological and manufacturing processes. These include the normal, Poisson, exponential, and gamma distributions [10]. In this book, we will focus on populations that follow a normal distribution because this is the most frequently encountered probability distribution used in describing populations. Moreover, the most frequently used methods of statistical analysis assume that the data are well modeled by a normal ("bell-curve") distribution. It is important to note that many biological processes are not well modeled by a normal distribution (such as heart rate variability), and the statistics associated with the normal distribution are not appropriate for such processes. In such cases, nonparametric statistics, which do not assume a specific type of distribution for the data, may serve the researcher better in understanding processes and making decisions. However, using the normal distribution and its associated statistics is often adequate given the central limit theorem, which, loosely stated, says that the sum of a large number of independent random processes with arbitrary distributions will tend toward a random variable with a normal distribution. One can assume that most biological phenomena result from a sum of random processes.
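The central limit theorem can be illustrated with a small simulation. In the sketch below, each sample is the sum of 12 independent values drawn from a uniform distribution on [0, 1], which is clearly not bell shaped. The number of terms (12), the number of sums (20000), and the seed are assumptions chosen for convenience; with 12 terms, the sum has a theoretical mean of 6 and a variance of 1.

```python
import random
import statistics

rng = random.Random(42)  # arbitrary fixed seed


def sum_of_uniforms(k):
    """Sum of k independent uniform(0, 1) random values."""
    return sum(rng.random() for _ in range(k))


sums = [sum_of_uniforms(12) for _ in range(20000)]
mean = statistics.fmean(sums)
sd = statistics.stdev(sums)

# For a normal distribution, roughly 68% of samples lie within 1 standard
# deviation of the mean; the sums should behave similarly.
within_1sd = sum(abs(s - mean) <= sd for s in sums) / len(sums)

print(f"mean = {mean:.3f}, standard deviation = {sd:.3f}")
print(f"fraction within 1 standard deviation = {within_1sd:.3f}")
```

Although each individual term is uniformly distributed, the sums cluster around the mean in a way that closely resembles a normal distribution.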

3.6.1 Measures of Central Tendency
There are several measures that reflect the central tendency or concentration of a sample population: sample mean (arithmetic average), sample median, and sample mode.

The sample mean may be estimated from a group of samples, xi, where i is the sample number, using the formula below.

Given n data points, x1, x2, …, xn:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i . \]

In practice, we typically do not know the true mean, μ, of the underlying population; instead, we try to estimate the true mean, μ, of the larger population. As the sample size becomes large, the sample mean, x̄, should approach the true mean, μ, assuming that the statistics of the underlying population or process do not change over time or space.

One of the problems with using the sample mean to represent the central tendency of a population is that the sample mean is susceptible to outliers. This can be problematic and often deceiving when reporting the average of a population that is heavily skewed. For example, when reporting income for a group of new college graduates, one of whom is an NBA player who has just signed a multimillion-dollar contract, the estimated mean income will be much greater than what most graduates earn. The same misrepresentation is often evident when reporting mean value for homes in a specific geographic region, where a few homes valued on the order of a million dollars can hide the fact that several hundred other homes are valued at less than $200,000.
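A small numerical example makes the effect of a single outlier concrete. The incomes below are entirely hypothetical values invented for this sketch: six typical starting salaries and one multimillion-dollar contract.

```python
import statistics

# Hypothetical annual incomes (dollars) for seven new graduates;
# the last value plays the role of the professional athlete.
incomes = [42_000, 45_000, 48_000, 50_000, 52_000, 55_000, 5_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(f"mean income   = ${mean_income:,.0f}")
print(f"median income = ${median_income:,.0f}")
```

For these made-up numbers, the mean is $756,000, far above what any of the six typical graduates earns, whereas the middle value of the ordered list, $50,000, is much more representative.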




Another useful measure for summarizing the central tendency of a population is the sample median. The median value of a group of observations or samples, xi, is the middle observation when the samples, xi, are listed in descending order.

For example, suppose we have the following values for tidal volume of the lung:

2, 1.5, 1.3, 1.8, 2.2, 2.5, 1.4, 1.3.

We can find the median value by first ordering the data in descending order:

2.5, 2.2, 2.0, 1.8, 1.5, 1.4, 1.3, 1.3,

and then crossing off values on each end until we reach the middle:

2.5, 2.2, 2.0, 1.8, 1.5, 1.4, 1.3, 1.3.

In this case, there are two middle values (1.8 and 1.5); thus, the median is the average of those two values, which is 1.65.

Note that if the number of samples, n, is odd, the median will be the middle observation. If the sample size, n, is even, then the median equals the average of the two middle observations. Compared with the sample mean, the sample median is less susceptible to outliers. It is largely insensitive to the skew in a group of samples or in the probability density function of the underlying population. In general, to fairly represent the central tendency of a collection of samples or the underlying population, we use the following rule of thumb:

1. If the sample histogram or probability density function of the underlying population is symmetric, use the mean as a central measure. For such populations, the mean and median are about equal, and the mean estimate makes use of all the data.
2. If the sample histogram or probability density function of the underlying population is skewed, the median is a fairer measure of the center of the distribution.

Another measure of central tendency is the mode, which is simply the most frequent observation in a collection of samples. In the tidal volume example given above, 1.3 is the most frequently occurring sample value. The mode is not used as frequently as the mean or median in representing central tendency.
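The three measures of central tendency can be checked for the tidal volume example with Python's standard `statistics` module. This is only a sketch of one way to do the calculation; the variable names are our own.

```python
import statistics

# Tidal volume samples from the example in the text.
tidal_volumes = [2, 1.5, 1.3, 1.8, 2.2, 2.5, 1.4, 1.3]

ordered = sorted(tidal_volumes, reverse=True)   # descending order
mean = statistics.fmean(tidal_volumes)          # arithmetic average
median = statistics.median(tidal_volumes)       # average of the two middle values
mode = statistics.mode(tidal_volumes)           # most frequent value

print("ordered:", ordered)
print(f"mean = {mean:.3f}, median = {median:.2f}, mode = {mode}")
```

Because n = 8 is even, `statistics.median` averages the two middle values, 1.8 and 1.5, which reproduces the median of 1.65 found by hand; the mode is 1.3.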

3.6.2 Measures of Variability
Measures of central tendency alone are insufficient for representing the statistics of a population or process. In fact, it is usually the variability in the population that makes things interesting and leads to uncertainty in decision making. The variability from subject to subject, especially in physiological function, is what often makes finding foolproof diagnosis and treatment so difficult. What works for one person often fails for another, and it is not the mean or median that picks up on those subject-to-subject differences, but rather the variability, which is reflected in differences in the probability models underlying those different populations.

When summarizing the variability of a population or process, we typically ask, "How far from the center (sample mean) do the samples (data) lie?" To answer this question, we typically use the following estimates that represent the spread of the sample data: interquartile ranges, sample variance, and sample standard deviation.

The interquartile range is the difference between the first and third quartiles of the sample data. For sampled data, the median is also known as the second quartile, Q2. Given Q2, we can find the first quartile, Q1, by simply taking the median value of those samples that lie below the second quartile. We can find the third quartile, Q3, by taking the median value of those samples that lie above the second quartile. As an illustration, we have the following samples:

1, 3, 3, 2, 5, 1, 1, 4, 3, 2.

If we list these samples in descending order,

5, 4, 3, 3, 3, 2, 2, 1, 1, 1,

the median value and second quartile for these samples is 2.5. The first quartile, Q1, can be found by taking the median of the following samples,

2.5, 2, 2, 1, 1, 1,

which is 1.5. In addition, the third quartile, Q3, may be found by taking the median value of the following samples:

5, 4, 3, 3, 3, 2.5,

which is 3. Thus, the interquartile range is Q3 − Q1 = 3 − 1.5 = 1.5.
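The quartile procedure described above can be written out as a short function. The sketch below follows the worked example literally: Q2 is the median, and Q1 (Q3) is the median of Q2 together with the samples below (above) it. The function name and the handling of samples exactly equal to Q2 are assumptions of this sketch. Other quartile conventions exist (for example, those used by `statistics.quantiles`) and may give slightly different values.

```python
import statistics


def quartiles_by_halves(samples):
    """Return (Q1, Q2, Q3) using the median-of-halves procedure in the text.

    Assumption: samples exactly equal to Q2 are represented once, by Q2 itself.
    """
    q2 = statistics.median(samples)
    lower = [x for x in samples if x < q2] + [q2]
    upper = [x for x in samples if x > q2] + [q2]
    return statistics.median(lower), q2, statistics.median(upper)


samples = [1, 3, 3, 2, 5, 1, 1, 4, 3, 2]
q1, q2, q3 = quartiles_by_halves(samples)
iqr = q3 - q1
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, interquartile range = {iqr}")
```

For the ten samples in the example, this gives Q1 = 1.5, Q2 = 2.5, and Q3 = 3, so the interquartile range is 3 − 1.5 = 1.5.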

Sample variance, s², is defined as the average squared distance of the data from the mean, and the formula for estimating s² from a collection of samples, xi, is

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 . \]

Sample standard deviation, s, which is more commonly referred to in describing the variability of the data, is

\[ s = \sqrt{s^2} \]

(same units as the original samples).

It is important to note that for normal distributions (symmetrical histograms), sample mean and sample standard deviation are the only parameters needed to describe the statistics of the underlying phenomenon. Thus, if one were to compare two or more normally distributed populations, one need only test the equivalence of the means and variances of those populations.
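To make the variance and standard deviation formulas concrete, the following sketch computes both for the ten samples used in the interquartile range illustration and cross-checks the result against Python's `statistics` module. The function name `sample_variance` is our own.

```python
import math
import statistics


def sample_variance(samples):
    """Sample variance s^2 with the (n - 1) denominator."""
    n = len(samples)
    mean = sum(samples) / n
    return sum((x - mean) ** 2 for x in samples) / (n - 1)


samples = [1, 3, 3, 2, 5, 1, 1, 4, 3, 2]
s2 = sample_variance(samples)
s = math.sqrt(s2)  # standard deviation, same units as the samples

print(f"s^2 = {s2:.4f}, s = {s:.4f}")
print(f"statistics.variance = {statistics.variance(samples):.4f}")
```

Here the sample mean is 2.5 and the squared deviations sum to 16.5, so s² = 16.5/9 ≈ 1.83 and s ≈ 1.35.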


CHAPTER 4

Assuming a Probability Model From the Sample Data

Now that we have collected the data, graphed the histogram, and estimated measures of central tendency and variability, such as mean, median, and standard deviation, we are ready to assume a probability model for the underlying population or process from which we have obtained samples. At this point, we will make a rough assumption using simple measures of mean, median, standard deviation, and the histogram. But it is important to note that there are more rigorous tests, such as the χ² test for normality [7], to determine whether a particular probability model is appropriate to assume from a collection of sample data.

Once we have assumed an appropriate probability model, we may select the appropriate statistical tests that will allow us to test hypotheses and draw conclusions with some level of confidence. The probability model will dictate what level of confidence we have when accepting or rejecting a hypothesis.

There are two fundamental questions that we are trying to address when assuming a probability model for our underlying population:

1. How confident are we that the sample statistics are representative of the entire population?
2. Are the differences in the statistics between two populations significant, resulting from factors other than chance alone?

To declare any level of confidence in making statistical inference, we need a mathematical model that describes the probability that any data value might occur. These models are called probability distributions.

There are a number of probability models that are frequently assumed to describe biological processes. For example, when describing heart rate variability, the probability of observing a specific time interval between consecutive heartbeats might be described by an exponential distribution [1, 8]. Figure 3.6 in Chapter 3 illustrates a histogram for samples drawn from an exponential distribution.


Note that this distribution is highly skewed to the right. For R-R intervals, such a probability function makes sense physiologically because the individual heart cells have a refractory period that prevents them from contracting in less than a minimum time interval. Yet, a very prolonged time interval may occur between beats, giving rise to some long time intervals that occur infrequently.

The most frequently assumed probability model for most scientific and engineering applications is the normal or Gaussian distribution. This distribution is illustrated by the solid black line in Figure 4.1 and is often referred to as the bell curve because it looks like a musical bell.

The equation that gives the probability density, f(x), of observing a specific value of x from the underlying normal population is

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \qquad -\infty < x < \infty, \]

where μ is the true mean of the underlying population or process and σ is the standard deviation of the same population or process. A graph of this equation is illustrated by the solid, smooth curve in Figure 4.1. The area under the curve equals one.
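The normal density can be evaluated directly from its equation. The sketch below implements the formula, compares it with `statistics.NormalDist` from the Python standard library, and checks numerically that the area under the curve is close to one. The function name `normal_pdf`, the integration limits (−8 to 8), and the step size are assumptions made for this illustration.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal probability density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


# Cross-check one value against the standard library implementation.
print(normal_pdf(1.3), NormalDist(0.0, 1.0).pdf(1.3))

# Approximate the area under the standard normal curve with a Riemann sum.
step = 0.001
area = sum(normal_pdf(-8 + i * step) * step for i in range(16000))
print(f"area under the curve is approximately {area:.6f}")
```

The peak of the standard normal curve (μ = 0, σ = 1) occurs at x = 0, where the density is 1/√(2π), or about 0.399.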

Note the following about the normal distribution:

1. It is a symmetric, bell-shaped curve completely described by its mean, μ, and standard deviation, σ.
2. By changing μ and σ, we stretch and slide the distribution.

FIGURE 4.1: A histogram of 1000 samples drawn from a normal distribution is illustrated. Superimposed on the histogram is the ideal normal curve representing the normal probability distribution function. (Plot title: Histogram of Measure, with Normal Curve; axes: Normalized Measure, Relative Frequency.)

Figure 4.1 also illustrates a histogram that is obtained when we randomly select 1000 samples from a population that is normally distributed and has a mean of 0 and a variance of 1. It is important to recognize that as we increase the sample size n, the histogram approaches the ideal normal distribution shown with the solid, smooth line. But, at small sample sizes, the histogram may look very different from the normal curve. Thus, from small sample sizes, it may be difficult to determine whether the assumed model is appropriate for the underlying population or process, and any statistical tests that we perform may not allow us to test hypotheses and draw conclusions with any real level of confidence.

We can perform linear operations on our normally distributed random variable, x, to produce another normally distributed random variable, y. These operations include multiplication of x by a constant and addition of a constant (offset) to x. Figure 4.2 illustrates histograms for samples drawn from each of populations x and y. We note that the distribution for y is shifted (the mean is now equal to 5) and the variance has increased with respect to x.

FIGURE 4.2: Histograms are shown for samples drawn from populations x and y, where y is simply a linear function of x (y = 2x + 5). Note that the mean and variance of y differ from x, yet both are normal distributions. (Axes: x or y value, Frequency.)

One test that we may use to determine how well a normal probability model fits our data is to count how many samples fall within ±1 and ±2 standard deviations of the mean. If the data and underlying population or process are well modeled by a normal distribution, 68% of the samples should lie within ±1 standard deviation from the mean and 95% of the samples should lie within ±2 standard deviations from the mean. These percentages are illustrated in Figure 4.3. It is important to remember these few numbers, because we will frequently use this 95% interval when drawing conclusions from our statistical analysis.
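This counting test is straightforward to automate. The sketch below draws simulated samples from a normal population and reports the fraction that falls within 1 and 2 sample standard deviations of the sample mean. The function name `fraction_within`, the sample size, and the seed are assumptions; with real data, one would pass the measured samples instead of simulated ones.

```python
import random
import statistics


def fraction_within(samples, k):
    """Fraction of samples within k sample standard deviations of the sample mean."""
    mean = statistics.fmean(samples)
    s = statistics.stdev(samples)
    return sum(abs(x - mean) <= k * s for x in samples) / len(samples)


rng = random.Random(7)  # arbitrary fixed seed
samples = [rng.gauss(0.0, 1.0) for _ in range(10000)]

within_1 = fraction_within(samples, 1)
within_2 = fraction_within(samples, 2)
print(f"within 1 standard deviation:  {within_1:.3f} (expect about 0.68)")
print(f"within 2 standard deviations: {within_2:.3f} (expect about 0.95)")
```

Fractions that differ markedly from 68% and 95% suggest that a normal model may not describe the data well, although with small samples some departure is expected by chance alone.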

Another means for determining how well our sampled data, x, represent a normal distribution is to estimate Pearson's coefficient of skew (PCS) [5]. The coefficient of skew is given by

\[ \mathrm{PCS} = \frac{3\,(\bar{x} - x_{\text{median}})}{s} . \]

If the PCS > 0.5, we assume that our samples were not drawn from a normally distributed population.
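Pearson's coefficient of skew is simple to compute from the sample mean, median, and standard deviation. The sketch below uses two small made-up data sets, one symmetric and one skewed to the right. Comparing the absolute value of the PCS with 0.5, so that skew in either direction is flagged, is an assumption of this sketch; the text states the criterion only as PCS > 0.5.

```python
import statistics


def pearson_skew(samples):
    """Pearson's coefficient of skew: 3 * (mean - median) / s."""
    mean = statistics.fmean(samples)
    median = statistics.median(samples)
    s = statistics.stdev(samples)
    return 3 * (mean - median) / s


symmetric = [1, 2, 3, 4, 5]   # hypothetical symmetric data
skewed = [1, 1, 1, 2, 10]     # hypothetical right-skewed data

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    pcs = pearson_skew(data)
    # Using abs() here is an assumption, so that left skew is also flagged.
    verdict = "normal model questionable" if abs(pcs) > 0.5 else "no strong skew"
    print(f"{name}: PCS = {pcs:.2f} -> {verdict}")
```

For the symmetric data, the mean equals the median, so the PCS is zero. For the skewed data, the mean (3) lies well above the median (1), and the PCS is well above 0.5.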

When we collect data, the data are typically collected in many different types of physical units (volts, degrees Celsius, newtons, centimeters, grams, etc.). For us to use tables that have been developed for probability models, we need to normalize the data so that the normalized data will have a mean of 0 and a standard deviation of 1. Such a normal distribution is called a standard normal distribution and is illustrated in Figure 4.1.

FIGURE 4.3: Histogram for samples drawn from a normally distributed population. For a normal distribution, 68% of the samples should lie within ±1 standard deviation from the mean (0 in this case) and 95% of the samples should lie within ±2 standard deviations (1.96 to be precise) of the mean. (Axes: Normalized value (Z score), Frequency.)


The standard normal distribution has a bell-shaped, symmetric distribution with μ = 0 and σ = 1.

To convert normally distributed data to the standard normal value, we use the following formulas,

z = (x − μ)/σ or z = (x − x̄)/s,

depending on whether we know the true mean, μ, and standard deviation, σ, or we only have the sample estimates, x̄ and s.

For any individual sample or data point, xi, from a sample with mean, x̄, and standard deviation, s, we can determine its z score from the following formula:

\[ z_i = \frac{x_i - \bar{x}}{s} . \]

For an individual sample, the z score is a normalized or standardized value. We can use this value with our equations for the probability density function or our standardized probability tables [3] to determine the probability of observing such a sample value from the underlying population.

The z score can also be thought of as a measure of the distance of the individual sample, xi, from the sample average, x̄, in units of standard deviation. For example, if a sample point, xi, has a z score of zi = 2, it means that the data point, xi, is 2 standard deviations from the sample mean.

We use normalized z scores instead of the original data when performing statistical analysis because the tables for the normalized data are already worked out and available in most statistics texts or statistical software packages. In addition, by using normalized values, we need not worry about the absolute amplitude of the data or the units used to measure the data.
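The conversion to z scores can be sketched in a few lines of Python. The body temperature values below are hypothetical numbers invented for this example, and the function name `z_scores` is our own.

```python
import statistics


def z_scores(samples):
    """Convert samples to z scores using the sample mean and standard deviation."""
    mean = statistics.fmean(samples)
    s = statistics.stdev(samples)  # sample standard deviation (n - 1 denominator)
    return [(x - mean) / s for x in samples]


# Hypothetical resting body temperatures in degrees Fahrenheit.
temps_f = [97.9, 98.2, 98.4, 98.6, 98.7, 99.0, 99.3]

for temp, z in zip(temps_f, z_scores(temps_f)):
    print(f"{temp:5.1f} F -> z = {z:+.2f}")
```

Whatever the original units, the resulting z scores are dimensionless, with a sample mean of 0 and a sample standard deviation of 1.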

4.1 THE STANDARD NORMAL DISTRIBUTION
The standard normal distribution is illustrated in Table 4.1.

The z table associated with this figure provides table entries that give the probability that z ≤ a, which equals the area under the normal curve to the left of z = a. If our data come from a normal distribution, the table tells us the probability or chance of our sample value or experimental outcomes having a value less than or equal to a.

Thus, we can take any sample and compute its z score as described above and then use the z table to find the probability of observing a z value that is less than or equal to some normalized value, a. For example, the probability of observing a z value that is less than or equal to 1.96 is 97.5%. Thus, the probability of observing a z value greater than 1.96 is 2.5%. In addition, because of symmetry in the distribution, we know that the probability of observing a z value greater than −1.96 is also 97.5%, and the probability of observing a z value less than or equal to −1.96 is 2.5%. Finally, the probability of observing a z value between −1.96 and 1.96 is 95%. The reader should study the z table and associated graph of the z distribution to verify that the probabilities (or areas under the probability density function) described above are correct.

TABLE 4.1: Standard z distribution function: areas under standardized normal density function

(Plot: Z distribution; axes: Measure, Frequency.) The area to the left of zα equals Pr(z < zα) = 1 − α; thus, the area in the tail to the right of zα equals α.

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1   0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2   0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
...
1.7   0.9554  0.9564  0.9573  0.9582  0.9591  0.9599  0.9608  0.9616  0.9625  0.9633
1.8   0.9641  0.9649  0.9656  0.9664  0.9671  0.9678  0.9686  0.9693  0.9699  0.9706
1.9   0.9713  0.9719  0.9726  0.9732  0.9738  0.9744  0.9750  0.9756  0.9761  0.9767
2.0   0.9772  0.9778  0.9783  0.9788  0.9793  0.9798  0.9803  0.9808  0.9812  0.9817
...
2.4   0.9918  0.9920  0.9922  0.9925  0.9927  0.9929  0.9931  0.9932  0.9934  0.9936
2.5   0.9938  0.9940  0.9941  0.9943  0.9945  0.9946  0.9948  0.9949  0.9951  0.9952
2.6   0.9953  0.9955  0.9956  0.9957  0.9959  0.9960  0.9961  0.9962  0.9963  0.9964
...
3.0   0.9987  0.9987  0.9987  0.9988  0.9988  0.9989  0.9989  0.9989  0.9990  0.9990
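The table lookups described in this section can also be cross-checked in software. The sketch below uses `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later), whose `cdf` method returns the area to the left of a given value and whose `inv_cdf` method performs the reverse lookup. The variable names are our own.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution: mean 0, standard deviation 1

p_le_196 = z.cdf(1.96)                   # Pr(z <= 1.96)
p_gt_196 = 1 - z.cdf(1.96)               # Pr(z > 1.96)
p_le_neg196 = z.cdf(-1.96)               # Pr(z <= -1.96)
p_between = z.cdf(1.96) - z.cdf(-1.96)   # Pr(-1.96 <= z <= 1.96)

print(f"Pr(z <= 1.96)          = {p_le_196:.4f}")
print(f"Pr(z > 1.96)           = {p_gt_196:.4f}")
print(f"Pr(z <= -1.96)         = {p_le_neg196:.4f}")
print(f"Pr(-1.96 <= z <= 1.96) = {p_between:.4f}")

# Reverse lookup: the z value that leaves a given area in the right tail.
for alpha in (0.10, 0.05, 0.025, 0.010, 0.005):
    print(f"right-tail area {alpha:.3f} -> z = {z.inv_cdf(1 - alpha):.3f}")
```

To four decimal places, these areas agree with the values of 97.5%, 2.5%, and 95% quoted above.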

Often, we need to determine the probability that an experimental outcome falls between two values, or that the outcome is greater than some value a or less than some value b. To find these areas, we can use the following important formulas, where Pr is the probability:

Pr(a ≤ z ≤ b) = Pr(z ≤ b) − Pr(z ≤ a)
= area between z = a and z = b.

Pr(z ≥ a) = 1 − Pr(z < a)
= area to the right of z = a
= area in the right tail.

Thus, for any observation or measurement, x, from any normal distribution:

\[ \Pr(a \le x \le b) = \Pr\!\left(\frac{a-\mu}{\sigma} \le z \le \frac{b-\mu}{\sigma}\right), \]

where μ is the mean of the normal distribution and σ is the standard deviation of the normal distribution.

In other words, we need to normalize or find the z values for each of our parameters, a and b, to find the area under the standard normal curve (z distribution) that represents the expression on the left side of the above equation.

Eample 4.1 The mean ntake o at or males 6 to 9 years old s 28 g, wth a standard devaton

o 13.2 g. Assume that the ntake s normally dstrbuted. Steves ntake s 42 g and Bens ntake s

25 g.

Commonly used z values:

AREA IN RIGHT TAIL, $\alpha$    $z_\alpha$
0.10                            1.282
0.05                            1.645
0.025                           1.96
0.010                           2.326
0.005                           2.576




1. What is the proportion of area between Steve's daily intake and Ben's daily intake?

2. If we were to randomly select a male between the ages of 6 and 9 years, what is the probability that his fat intake would be 50 g or more?

Solution: x = fat intake.

1. The problem may be stated as: what is $\Pr(25 \le x \le 42)$? Assuming a normal distribution, we convert to z scores:

$$\Pr\left(\frac{25 - 28}{13.2} \le z \le \frac{42 - 28}{13.2}\right) = \Pr(-0.227 \le z \le 1.06) = \Pr(z \le 1.06) - \Pr(z \le -0.227)$$

(using the formula for $\Pr(a \le z \le b)$)

$$= \Pr(z \le 1.06) - [1 - \Pr(z \le 0.227)] = 0.8554 - [1 - 0.5910] = 0.4464,$$

or 44.6% of the area under the z curve.

2. The problem may be stated as, What is $\Pr(x > 50)$? Normalizing to a z score, what is $\Pr(z > (50 - 28)/13.2)$?

$$\Pr(z > 1.67) = 1 - \Pr(z \le 1.67) = 1 - 0.9525 = 0.0475,$$

or 4.75% of the area under the z curve.
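
As a quick check of Example 4.1, the two probabilities can also be computed with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch for verification only and is not part of the original solution; the variable names are illustrative.

```python
from statistics import NormalDist

# Fat intake model from Example 4.1: mean 28 g, standard deviation 13.2 g.
intake = NormalDist(mu=28.0, sigma=13.2)

# Part 1: Pr(25 <= x <= 42), the area between Ben's and Steve's intakes.
p_between = intake.cdf(42.0) - intake.cdf(25.0)

# Part 2: Pr(x > 50), the area in the right tail beyond 50 g.
p_above_50 = 1.0 - intake.cdf(50.0)

print(f"Pr(25 <= x <= 42) = {p_between:.4f}")
print(f"Pr(x > 50)        = {p_above_50:.4f}")
```

The computed values agree with the table-based answers to about two decimal places; the small differences arise because the z table requires rounding each z score to two decimals.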

Example 4.2 Suppose that the specifications on the range of displacement for a limb indentor are 0.5 ± 0.001 mm. If these displacements are normally distributed, with mean = 0.47 and standard deviation = 0.002, what percentage of indentors are within specifications?

Solution: x = displacement.

The problem may be stated as, What is $\Pr(0.499 \le x \le 0.501)$?

Using z scores,

$$\Pr(0.499 \le x \le 0.501) = \Pr\left(\frac{0.499 - 0.47}{0.002} \le z \le \frac{0.501 - 0.47}{0.002}\right) = \Pr(14.5 \le z \le 15.5) = \Pr(z \le 15.5) - \Pr(z \le 14.5) = 1 - 1 = 0.$$

In other words, essentially none of the indentors are within specifications.

It is useful to note that if the distribution of the underlying population and the associated sample data are not normal (e.g., skewed), transformations may often be used to make the data approximately normal, and the statistics covered in this text may then be used to perform statistical analysis on the transformed data. These transformations on the raw data include logs, square root, and reciprocal.
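
To illustrate the effect of such a transformation, the sketch below generates hypothetical right-skewed data (log-normal by construction, not taken from the text) and compares a simple moment-based skewness estimate before and after taking logs. The helper `sample_skewness` is an assumption made for this illustration.

```python
import math
import random


def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed measurements (log-normal by construction).
raw = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]
logged = [math.log(v) for v in raw]

skew_raw = sample_skewness(raw)
skew_log = sample_skewness(logged)
print(f"skewness of raw data:        {skew_raw:.2f}")
print(f"skewness of log-transformed: {skew_log:.2f}")
```

A skewness near zero after the transformation suggests that the transformed data are roughly symmetric, which is a necessary (though not sufficient) condition for treating them as normal.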

4.2 THE NORMAL DISTRIBUTION AND SAMPLE MEAN

All statistical analysis follows the same general procedure:

1. Assume an underlying distribution for the data and associated parameters (e.g., the sample mean).

2. Scale the data or parameter to a standard distribution.

3. Estimate confidence intervals using a standard table for the assumed distribution. (The question we ask is, What is the probability of observing the experimental outcome by chance alone?)

4. Perform hypothesis test (e.g., Student's t test).

We are beginning with and focusing most of this text on the normal distribution or probability model because of its prevalence in the biomedical field and something called the central limit theorem. One of the most basic statistical tests we perform is a comparison of the means from two or more populations. The sample mean is itself an estimate made from a finite number of samples. Thus, the sample mean, $\bar{x}$, is itself a random variable that is modeled with a normal distribution [4].

Is this model for $\bar{x}$ legitimate? The answer is yes, for large samples, because of the central limit theorem, which states [4, 10]:

If the individual data points or samples (each sample is a random variable), x, come from any arbitrary probability distribution, the sum (and hence, average) of those data points is normally distributed as the sample size, n, becomes large.

Thus, even if each sample comes from a nonnormal distribution (e.g., a uniform distribution, such as the toss of a die), the sum of those individual samples (such as the sum we use to estimate the sample mean, $\bar{x}$) will have an approximately normal distribution when n is large. One can easily assume that many of the biological or physiological processes that we measure are the sum of a number of random processes with various probability distributions; thus, the assumption that our samples come from a normal distribution is not unreasonable.
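
The central limit theorem can be illustrated with a small simulation of the die example. The sketch below is an assumption-laden illustration (the group size of 30, the number of groups, and the random seed are arbitrary choices): it averages n tosses of a fair die many times and checks that the resulting sample means behave like a normal variable.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

n = 30           # number of die tosses averaged in each group (arbitrary choice)
groups = 10000   # number of groups, i.e., sample means to simulate

# Each toss is uniform on 1..6; each group yields one sample mean.
sample_means = [
    sum(random.randint(1, 6) for _ in range(n)) / n
    for _ in range(groups)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)

# For a fair die, the mean is 3.5 and the variance is 35/12.
expected_sd = math.sqrt(35.0 / 12.0) / math.sqrt(n)

# Fraction of sample means within 1.96 standard deviations of their mean;
# for a normal distribution this fraction is about 0.95.
within = sum(
    abs(m - mean_of_means) <= 1.96 * sd_of_means for m in sample_means
) / groups

print(f"mean of sample means: {mean_of_means:.3f} (die mean is 3.5)")
print(f"SD of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {expected_sd:.3f})")
print(f"fraction within 1.96 SD: {within:.3f}")
```

Although a single toss is uniformly distributed, the averages cluster around 3.5 with a spread close to the die's standard deviation divided by the square root of n, and roughly 95% of them fall within 1.96 standard deviations of their mean.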

4.3 CONFIDENCE INTERVAL FOR THE SAMPLE MEAN

Every sample statistic is in itself a random variable with some sort of probability distribution. Thus, when we use samples to estimate the true statistics of a population (which in practice are usually not known and not obtainable), we want to have some level of confidence that our sample estimates are close to the true population statistics or are representative of the underlying population or process.

In estimating a confidence interval for the sample mean, we are asking the question: How close is our sample mean (estimated from a finite number of samples) to the true mean of the population?

To assign a level of confidence to our statistical estimates or statistical conclusions, we need to first assume a probability distribution or model for our samples and underlying population, and then we need to estimate a confidence interval using the assumed distribution.



The simplest confidence interval that we can begin with regarding our descriptive statistics is a confidence interval for the sample mean, $\bar{x}$. Our question is how close is the sample mean to the true population or process mean, $\mu$?

Before we can answer this, we need to assume an underlying probability model for the sample mean, $\bar{x}$, and true mean, $\mu$. As stated earlier, it may be shown that for a large sample size, the sample mean, $\bar{x}$, is well modeled by a normal distribution or probability model. Thus, we will use this model when estimating a confidence interval for our sample mean, $\bar{x}$.

Thus, $\bar{x}$ is estimated from the sample. We then ask, how close is $\bar{x}$ (sample mean) to the true population mean, $\mu$?

It may be shown that if we took many groups of n samples and estimated $\bar{x}$ for each group,

1. the average or sample mean of $\bar{x}$ = $\mu$, and

2. the standard deviation of $\bar{x}$ = $\sigma/\sqrt{n}$.

Thus, as our sample size, n, gets large, the distribution for $\bar{x}$ approaches a normal distribution. For large n, $\bar{x}$ follows a normal distribution, and the z score for $\bar{x}$ may be used to estimate the following:

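$$\Pr(a \le \bar{x} \le b) = \Pr\left(\frac{a - \mu}{\sigma/\sqrt{n}} \le z \le \frac{b - \mu}{\sigma/\sqrt{n}}\right).$$

This expression assumes a large n and that we know $\sigma$.

Now we look at the case where we might have a large n, but we do not know $\sigma$. In such cases, we replace $\sigma$ with s to get the following expression:

$$\Pr(a \le \bar{x} \le b) = \Pr\left(\frac{a - \mu}{s/\sqrt{n}} \le z \le \frac{b - \mu}{s/\sqrt{n}}\right),$$

where $s/\sqrt{n}$ is called the sample standard error and represents the standard deviation for $\bar{x}$.

Let us assume now that, for large n, we want to estimate the 95% confidence interval for $\bar{x}$. We first scale the sample mean, $\bar{x}$, to a z value (because the central limit theorem says that $\bar{x}$ is normally distributed):

$$z = \frac{\bar{x} - \mu}{s/\sqrt{n}}.$$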



We recall that 95% of z values fall within $\pm 1.96$ (approximately $\pm 2$) of the mean, and for the z distribution,

$$\Pr(-1.96 \le z \le 1.96) = 0.95.$$

Substituting for z,

$$z = \frac{\bar{x} - \mu}{s/\sqrt{n}},$$

we get

$$0.95 = \Pr\left(-1.96 \le \frac{\bar{x} - \mu}{s/\sqrt{n}} \le 1.96\right).$$

We use the following notation in terms of the sample standard error:

$$\mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}}.$$

Rearranging terms for the expression above, we note that the probability that $\mu$ lies within $\pm 1.96$ (or $\pm 2$) standard deviations of $\bar{x}$ is 95%:

$$0.95 = \Pr\left(\bar{x} - 1.96\,\mathrm{SE}(\bar{x}) \le \mu \le \bar{x} + 1.96\,\mathrm{SE}(\bar{x})\right).$$

Note that 1.96 is referred to as $z_{\alpha/2}$. This z value is the value of z for which the area in the right tail of the normal distribution is $\alpha/2$. If we were to estimate the 99% confidence interval, we would substitute $z_{0.01/2}$, which is 2.576, into the 1.96 position above.

Thus, for large n and any confidence level, $1 - \alpha$, the $1 - \alpha$ confidence interval for the true population mean, $\mu$, is given by:

$$\mu = \bar{x} \pm z_{\alpha/2}\,\mathrm{SE}(\bar{x}).$$

This means that there is a $100(1 - \alpha)$ percent probability that the true mean lies within the above interval centered about $\bar{x}$.

Example 4.3 Estimate of confidence intervals

Problem: Given a collection of data with $\bar{x}$ = 505 and s = 100. If the number of samples was 1000, what is the 95% confidence interval for the population mean, $\mu$?




Solution: If we assume a large sample size, we may use the z distribution to estimate the confidence interval for the sample mean using the following equation:

$$\mu = \bar{x} \pm z_{\alpha/2}\,\mathrm{SE}(\bar{x}).$$

We plug in the following values:

$$\bar{x} = 505; \quad \mathrm{SE}(\bar{x}) = s/\sqrt{n} = 100/\sqrt{1000}.$$

For a 95% confidence interval, $\alpha$ = 0.05.

Using a z table to locate z(0.05/2), we find that the value of z that gives an area of 0.025 in the right tail is 1.96.

Plugging $\bar{x}$, SE($\bar{x}$), and z($\alpha$/2) into the estimate for the confidence interval above, we find that the 95% confidence interval for $\mu$ = [498.80, 511.20].

Note that if we wanted to estimate the 99% confidence interval, we would simply use a different z value, z(0.01/2), in the same equation. The z value associated with an area of 0.005 in the right tail is 2.576. If we use this z value, we estimate a confidence interval for $\mu$ of [496.85, 513.15]. We note that the confidence interval has widened as we increased our confidence level.
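
The arithmetic of Example 4.3 can be reproduced with a short Python sketch. The function name `z_confidence_interval` is hypothetical, and the z values are supplied by the caller (here, 1.96 and 2.576 from the z table) rather than computed.

```python
import math


def z_confidence_interval(sample_mean, s, n, z_half_alpha):
    """Large-sample interval: sample_mean +/- z(alpha/2) * s / sqrt(n)."""
    standard_error = s / math.sqrt(n)
    margin = z_half_alpha * standard_error
    return sample_mean - margin, sample_mean + margin


# Values from Example 4.3; the z values are read from the z table.
ci_95 = z_confidence_interval(505.0, 100.0, 1000, z_half_alpha=1.96)
ci_99 = z_confidence_interval(505.0, 100.0, 1000, z_half_alpha=2.576)

print("95% CI: [{:.2f}, {:.2f}]".format(*ci_95))
print("99% CI: [{:.2f}, {:.2f}]".format(*ci_99))
```

With these inputs the 95% bounds come to roughly 498.80 and 511.20, and the 99% bounds to roughly 496.85 and 513.15, so the interval is symmetric about 505 and widens with the confidence level.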

4.4 THE t DISTRIBUTION

For small samples, $\bar{x}$ is no longer normally distributed. Therefore, we use Student's t distribution to estimate the true statistics of the population. The t distribution, as illustrated in Table 4.2, looks like a z distribution but with slower taper at the tails and flatter central region.

[Figure: t distribution, plotted as frequency versus t; the curve changes with df.]

Table entry = t($\alpha$; df), where $\alpha$ is the area in the tail to the right of t($\alpha$; df) and df is degrees of freedom.

TABLE 4.2: Percentage points for Student's t distribution

                      $\alpha$ = area to right of t($\alpha$; df)
df                    0.10     0.05     0.025    0.01     0.005
1                     3.078    6.314    12.706   31.821   63.657
2                     1.886    2.920    4.303    6.965    9.925
3                     1.638    2.353    3.182    4.541    5.841
4                     1.533    2.132    2.776    3.747    4.604
5                     1.476    2.015    2.571    3.365    4.032
10                    1.372    1.812    2.228    2.764    3.169
11                    1.363    1.796    2.201    2.718    3.106
12                    1.356    1.782    2.179    2.681    3.055
13                    1.350    1.771    2.160    2.650    3.012
14                    1.345    1.761    2.145    2.624    2.977
15                    1.341    1.753    2.131    2.602    2.947
30                    1.310    1.697    2.042    2.457    2.750
40                    1.303    1.684    2.021    2.423    2.704
60                    1.296    1.671    2.000    2.390    2.660
120                   1.289    1.658    1.980    2.358    2.617
∞                     1.282    1.645    1.960    2.326    2.576
$z_\alpha$ (large sample)    1.282    1.645    1.960    2.326    2.576




We use the t distribution, with slower tapered tails, because with so few samples, we have less certainty about our underlying distribution (probability model).

Now our normalized value for $\bar{x}$ is given by

$$\frac{\bar{x} - \mu}{s/\sqrt{n}},$$

which is known to have a t distribution rather than the z distribution that we have discussed thus far. The t distribution was first invented by W.S. Gosset [4, 11], a chemist who worked for a brewery. Gosset decided to publish his t distribution under the alias of Student. Hence, we often refer to this distribution as Student's t distribution.

The t distribution is symmetric like the z distribution and generally has a bell shape. But the amount of spread of the distribution to the tails, or the width of the bell, depends on the sample size, n. Unlike the z distribution, which assumes an infinite sample size, the t distribution changes shape with sample size. The result is that the confidence intervals estimated with t values are more spread out than for the z distribution, especially for small sample sizes, because with such sample sizes, we are penalized for not having sufficient samples to represent the extent of variability of the underlying population or process. Thus, when we are estimating confidence intervals for the sample mean, $\bar{x}$, we do not have as much confidence in our estimate. Thus, the interval widens to reflect this decreased certainty with smaller sample sizes. In the next section, we estimate the confidence interval for the same example given previously, but using the t distribution instead of the z distribution.

4.5 CONFIDENCE INTERVAL USING t DISTRIBUTION

Like the z tables, there are t tables where the values of t that are associated with different areas under the probability curve are already calculated and may be used for statistical analysis without the need to recalculate the t values. The difference between the z table and t table is that now the t values are a function of the sample size or degrees of freedom. Table 4.2 gives a few lines of the t table from [3].

To use the table, one simply looks for the intersection of degrees of freedom, df (related to sample size), and the $\alpha$ value that one desires in the right-hand tail. The intersection provides the t value for which the area under the t curve in the right tail is $\alpha$. In other words, the probability that t will be less than or equal to a specific entry in the table is $1 - \alpha$. For a specific sample size, n, the degrees of freedom, df = n − 1.
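
The table lookup described above can be sketched as a small dictionary lookup in Python. Only a few rows of Table 4.2 are copied here, and the names `T_TABLE` and `t_value` are hypothetical; the sample size of 11 in the usage lines is an arbitrary illustration, not a case from the text.

```python
# A few rows copied from Table 4.2: T_TABLE[df][alpha] = t(alpha; df),
# where alpha is the area in the right tail.
T_TABLE = {
    5: {0.10: 1.476, 0.05: 2.015, 0.025: 2.571, 0.01: 3.365, 0.005: 4.032},
    10: {0.10: 1.372, 0.05: 1.812, 0.025: 2.228, 0.01: 2.764, 0.005: 3.169},
    15: {0.10: 1.341, 0.05: 1.753, 0.025: 2.131, 0.01: 2.602, 0.005: 2.947},
    30: {0.10: 1.310, 0.05: 1.697, 0.025: 2.042, 0.01: 2.457, 0.005: 2.750},
}


def t_value(n, right_tail_area):
    """Look up t(alpha; df) with df = n - 1.

    Raises KeyError if the row (df) or column (alpha) is not tabulated here.
    """
    df = n - 1
    return T_TABLE[df][right_tail_area]


# Hypothetical case: n = 11 samples and a right-tail area of 0.025, so df = 10.
t_crit = t_value(11, 0.025)
print(t_crit)
```

A table-based lookup like this only covers the tabulated rows and columns; for other degrees of freedom one would consult the full t table or a statistics library.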

Now we simply substitute t for z to find our confidence intervals. So, the confidence interval for the sample mean, $\bar{x}$, using the t distribution now becomes

$$\mu = \bar{x} \pm t\left(\frac{\alpha}{2};\, n - 1\right)\mathrm{SE}(\bar{x}),$$

where $\mu$ is the true mean of the underlying population or process from which we are drawing samples, SE($\bar{x}$) = $s/\sqrt{n}$ is the standard error of the sample mean, t is the t value for which there is an area of $\alpha/2$ in the right tail, and n is the sample size.

Example 4.4 Confidence interval using t distribution

Problem: We consider the same example used previously for estimating confidence intervals using z values. In this case, the sample size is small (n = 20), so we now use a t distribution.

Solution: Now, our estimate for the confidence interval for $\mu$ is

$$\mu = \bar{x} \pm t\left(\frac{\alpha}{2};\, n - 1\right)\mathrm{SE}(\bar{x}).$$

Again, we plug in the following values:

$$\bar{x} = 505; \quad \mathrm{SE}(\bar{x}) = s/\sqrt{n} = 100/\sqrt{20}.$$

For a 95% confidence interval, $\alpha$ = 0.05.

Using a t table to locate t(0.05/2; 20 − 1), we find that for 19 df, the value of t that gives an area of 0.025 in the right tail is 2.093.

Plugging $\bar{x}$, SE($\bar{x}$), and t($\alpha$/2; n − 1) into the estimate for the confidence interval above, we find that the 95% confidence interval for $\mu$ = [458.20, 551.80]. We note that this confidence interval is widened compared with that previously estimated using the z values. This is expected because the t distribution is wider than the z distribution at small sample sizes, reflecting the fact that we have less confidence in our estimate of $\bar{x}$ and, hence, $\mu$ when the sample size is small.
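
To see how much of the widening in Example 4.4 comes from the t value itself, the sketch below computes the interval twice for n = 20: once with t(0.025; 19) = 2.093, as quoted in the example, and once with the large-sample z value 1.96. The function name is hypothetical, and the z-based interval is shown for comparison only; it is not appropriate for so small a sample.

```python
import math


def t_confidence_interval(sample_mean, s, n, t_half_alpha):
    """Small-sample interval: sample_mean +/- t(alpha/2; n - 1) * s / sqrt(n)."""
    standard_error = s / math.sqrt(n)
    margin = t_half_alpha * standard_error
    return sample_mean - margin, sample_mean + margin


# Example 4.4: mean 505, s = 100, n = 20; t(0.025; 19) = 2.093 from the t table.
low_t, high_t = t_confidence_interval(505.0, 100.0, 20, t_half_alpha=2.093)

# Same data with the large-sample z value 1.96, for comparison only.
low_z, high_z = t_confidence_interval(505.0, 100.0, 20, t_half_alpha=1.96)

print(f"t-based 95% CI: [{low_t:.2f}, {high_t:.2f}]")
print(f"z-based 95% CI: [{low_z:.2f}, {high_z:.2f}]")
```

The t-based interval matches the [458.20, 551.80] found in the example and is wider than the z-based interval for the same n. Most of the widening relative to Example 4.3, however, comes from the larger standard error at n = 20.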

Confidence intervals may be estimated for most descriptive statistics, such as the sample mean, sample variance, and even the line of best fit determined through linear regression [3]. As noted above, the confidence interval reflects the variability in the parameter estimate and is influenced by the sample size, the variability of the population, and the level of confidence that we desire. The greater the desired confidence, the wider the interval. Likewise, the greater the variability in the samples, the wider the confidence interval.

Example 4.5 Review of probability concepts

Problem: Assembly times were measured for a sample of 15 glucose infusion pumps. The mean time

to assem
