Data analysis: 1. Describing data Ana Jerončić, PhD Department for Research in Biomedicine and Health


Page 1: Data analysis: 1. Describing data

Data analysis: 1. Describing data

Ana Jerončić, PhD

Department for Research in Biomedicine and Health

Page 2: Data analysis: 1. Describing data

Contact

E-mail: [email protected]

Location: main building, 5th floor, room 512

Phone: 557-862

Page 3: Data analysis: 1. Describing data

Contents of the 2nd week

1. Describing data - Central tendency and variability

2. Estimation - Accuracy, precision, standard error, confidence intervals

3. Hypothesis testing - Test statistics, P-value, choice of a statistical test

4. Interpretation of data - Causality and association, odds ratio, risk, correlation, linear regression

5. Sources of error - Type 1 and type 2 errors, power, bias, confounding

Page 4: Data analysis: 1. Describing data

Learning outcome

Critical appraisal of scientific papers

NOT: implementation of data analysis

Page 5: Data analysis: 1. Describing data

Why?

To identify the best available treatment

To prevent “medical zombies”

To perform your own research

Page 6: Data analysis: 1. Describing data

Data analysis: Describing data

1. How the data should be organized prior to data analysis

2. Data types

3. Graphical & tabular techniques for description, summary statistics: Qualitative Data, Quantitative Data

Page 7: Data analysis: 1. Describing data

Height measurements among 1st year medical students

157 204 184 186 197 155 169
150 193 205 150 161 169 147
167 159 187 173 146 179 201
159 147 144 204 184 192 165
146 169 198 164 182 165 173
147 166 167 180 169 174 201
146 151 203 171 186 179 152
189 204 189 200 202 147 181
145 161 173 155 203 190 164
141 163 179 195 155 197 151
197 141 146 202 149 197 203
172 143 151 200 197 192 192
160 173 187 172 177 179 188

Page 8: Data analysis: 1. Describing data

What is the unit of measurement?

How many observations per subject?

Page 9: Data analysis: 1. Describing data

Variables and their values, Units of observation

Entity     Height (cm)   Weight (kg)   Age (years)   Sex (category)
Person 1   176           70            33            Male
Person 2   171           60            38            Female
Person 3   182           75            62            Male
...        ...           ...           ...           ...

Rows: OBSERVATIONS

Columns: VARIABLES

Cells: Measurement/Observation
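The same layout can be sketched in Python. This is only an illustration: the `Person` record type and its field names are hypothetical, chosen to mirror the table above, with one record per unit of observation and one field per variable.

```python
from dataclasses import dataclass, fields


@dataclass
class Person:
    """One unit of observation; each field is one variable."""
    height_cm: int
    weight_kg: int
    age_years: int
    sex: str


# The three rows shown in the table above.
observations = [
    Person(176, 70, 33, "Male"),
    Person(171, 60, 38, "Female"),
    Person(182, 75, 62, "Male"),
]

# Column names = variables.
variables = [f.name for f in fields(Person)]

# A single variable is one column read across all observations.
heights = [p.height_cm for p in observations]

print(variables)
print(heights)
```

Keeping one row per subject and one column per variable is what makes the later tables and charts straightforward to build.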

Page 10: Data analysis: 1. Describing data

Types of data (variables)

Variable               Features of variables                      Example                Descriptive statistics   Informativeness level
Categorical, nominal   Unordered/unarranged categories            Gender, urbanization   Number, proportion       Low
Ordinal                Ordered/arranged categories                Grades, scales         Median                   Medium
Numerical              Arranged categories with equal intervals   Height, weight         Mean or median           High
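A minimal sketch of how the data type drives the choice of descriptive statistic, following the table. The function name `describe` and the sample values are assumptions made for illustration only.

```python
from collections import Counter
from statistics import mean, median


def describe(values, data_type):
    """Return the summary the table suggests for each data type."""
    if data_type == "nominal":
        # Number and proportion per category.
        n = len(values)
        return {cat: (k, k / n) for cat, k in Counter(values).items()}
    if data_type == "ordinal":
        return median(values)
    if data_type == "numerical":
        return mean(values), median(values)
    raise ValueError(f"unknown data type: {data_type}")


sex = ["Male", "Female", "Male", "Male"]  # nominal (made-up values)
grades = [2, 3, 3, 4, 5]                  # ordinal (made-up values)
heights = [176, 171, 182]                 # numerical, in cm

print(describe(sex, "nominal"))
print(describe(grades, "ordinal"))
print(describe(heights, "numerical"))
```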

Page 11: Data analysis: 1. Describing data

Types of data

Categorical (qualitative): nominal, ordinal

Numerical (quantitative)

Page 12: Data analysis: 1. Describing data

Type of data?

Height, grades, age in years, weight, insulin concentration, blood glucose

Page 13: Data analysis: 1. Describing data

Type of data?

How many cigarettes do you smoke a day?

1-5 6-10 11-15 16-20 21 and more

Page 14: Data analysis: 1. Describing data

Type of data?

Have you ever had a heart attack? Yes No

Do you suffer from hypertension? Yes No ?

Page 15: Data analysis: 1. Describing data

Type of data?

Gender: Male Female

Page 16: Data analysis: 1. Describing data

Type of data?

Marital status: married divorced widowed single lives alone ?

Page 17: Data analysis: 1. Describing data

Type of data?

Education: elementary school high school two-year college four-year college ?

Page 18: Data analysis: 1. Describing data

Type of data?

Likert scale 

Claim: Violence among the youth is becoming an increasing problem in Croatia.

1 = I agree completely

2 = I agree

3 = Undecided

4 = I disagree

5 = I strongly disagree

Page 19: Data analysis: 1. Describing data

Type of data?

Visual analogue scale

E.g., the pain level that the examinee experiences

I don’t feel pain

I feel intolerable pain

Page 20: Data analysis: 1. Describing data

Variables – Transformation of variables

Numerical: distance is meaningful

Ordinal: attributes can be ordered

Nominal: attributes are only named; weakest
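A transformation down this hierarchy can be sketched as follows. The cut-offs reuse the cigarette categories from Page 13; the extra "0" category for non-smokers, the function names, and the sample counts are assumptions made for illustration.

```python
def cigarettes_to_category(n_per_day):
    """Numerical -> ordinal: bin a daily cigarette count."""
    if n_per_day < 0:
        raise ValueError("count cannot be negative")
    if n_per_day == 0:
        return "0"  # assumed extra category for non-smokers
    if n_per_day <= 5:
        return "1-5"
    if n_per_day <= 10:
        return "6-10"
    if n_per_day <= 15:
        return "11-15"
    if n_per_day <= 20:
        return "16-20"
    return "21 and more"


def category_to_smoker(category):
    """Ordinal -> nominal: keep only a name, drop the ordering."""
    return "non-smoker" if category == "0" else "smoker"


counts = [0, 3, 12, 20, 40]  # made-up numerical data
categories = [cigarettes_to_category(n) for n in counts]
labels = [category_to_smoker(c) for c in categories]

print(categories)
print(labels)
```

Each step discards information and cannot be reversed: the exact count cannot be recovered from the category, nor the category from the smoker label.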

Page 21: Data analysis: 1. Describing data

Watch out for… literature inconsistency in data type classification

Page 22: Data analysis: 1. Describing data

Person No.   Height [cm]
Person 1     148
Person 2     142
Person 3     154
Person 4     153
Person 5     160
Person 6     177
Person 7     204
Person 8     192
Person 9     191
Person 10    203
Person 11    197
Person 12    202
Person 13    177

Observe and Describe

Page 23: Data analysis: 1. Describing data

Organized data are input for graphical & tabular data representations

Person No.   Height [cm]
Person 1     148
Person 2     142
Person 3     154
Person 4     153
Person 5     160
Person 6     177
Person 7     204
Person 8     192
Person 9     191
Person 10    203
Person 11    197
Person 12    202
Person 13    177

Observe and Describe

Page 24: Data analysis: 1. Describing data

QUALITATIVE DATA

Page 25: Data analysis: 1. Describing data

Tabular Techniques for Qualitative Variable(s) – YPEL5 example – Contingency Table

In one study, researchers investigated the genotype of the YPEL5 gene in a population sample from Split. They obtained the following results for 10 examinees:

Individual   YPEL5 genotype
1            AA
2            BB
3            BB
4            BB
5            AB
6            AB
7            BB
8            AA
9            AB
10           BB

Table: Frequency distribution of YPEL5 genotypes

Genotype   Frequency   Relative frequency (proportion)   Relative frequency [%] (percentage)
AA         2           0.2                               20%
AB         3           0.3                               30%
BB         5           0.5                               50%
Total      10          1.00                              100%
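The frequency distribution can be reproduced from the ten observed genotypes. This is a minimal Python sketch; the variable names are illustrative and not part of the slides.

```python
from collections import Counter

# Genotypes of individuals 1..10, as listed on the slide.
genotypes = ["AA", "BB", "BB", "BB", "AB", "AB", "BB", "AA", "AB", "BB"]

counts = Counter(genotypes)
n = len(genotypes)

# genotype -> (frequency, proportion, percentage)
table = {g: (counts[g], counts[g] / n, 100 * counts[g] / n) for g in sorted(counts)}

for g, (freq, prop, pct) in table.items():
    print(f"{g:<6}{freq:>4}{prop:>6.1f}{pct:>6.0f}%")
print(f"{'Total':<6}{n:>4}{1.0:>6.2f}{100:>6.0f}%")
```

Counting is the only calculation involved; the proportion and percentage columns are the same counts divided by the sample size.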

Page 26: Data analysis: 1. Describing data

Graphical Techniques for Qualitative Variable(s) – YPEL5 example – Bar Chart

Bar charts are often used to display frequencies…

[Figure: bar chart. x-axis: YPEL5 genotype (category names: BB, AA, AB); y-axis: frequency (counts or percentages), 0 to 5.]

Page 27: Data analysis: 1. Describing data

Is there an association between the medicine taken and the length of cold?

[Table residue: only the percentages survive. One group: 84% and 16% (total 100%); the other group: 19% and 81% (total 100%).]

Page 28: Data analysis: 1. Describing data

Graphical & Tabular Techniques

The only allowable calculation is to count the frequency of each category.

We can summarize the data in a table that presents the categories and their counts; this is called a frequency distribution.

A relative frequency distribution lists the categories and the proportion with which each occurs.

Page 29: Data analysis: 1. Describing data

Graphical Techniques for Qualitative Variable(s) – Bar Chart – Pareto chart

Nominal data have no inherent order. However, it is sometimes useful to arrange the categories from the most frequently occurring to the least frequently occurring. A bar chart ordered this way is called a “Pareto chart”.

[Figure: Pareto chart. x-axis: category names; y-axis: counts.]
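The ordering step can be sketched with the YPEL5 genotypes from Page 25, drawn here as a plain-text bar chart. The use of `#` characters as bars is just an illustrative choice.

```python
from collections import Counter

genotypes = ["AA", "BB", "BB", "BB", "AB", "AB", "BB", "AA", "AB", "BB"]

# most_common() returns (category, count) pairs sorted by descending count.
ordered = Counter(genotypes).most_common()

for category, count in ordered:
    print(f"{category:<3}{'#' * count} {count}")
```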

Page 30: Data analysis: 1. Describing data

Graphical Techniques for Qualitative Variable(s) – Bar Chart – Pareto chart

A chart with relative frequencies is more informative.

[Figure: Pareto chart. x-axis: category names; y-axis: percentages.]

Page 31: Data analysis: 1. Describing data

Graphical Techniques for Qualitative Variable(s) – YPEL5 example – Pie Chart

Pie charts show relative frequencies…

[Figure: pie chart. BB 50%, AA 20%, AB 30%.]

Page 32: Data analysis: 1. Describing data

Watch out for . . .

Authors can use percentages to hide the true size of the data. Saying that 50% of a sample has a certain condition when there are only four people in the sample clearly does not provide the same level of information as 50% of a sample of 400 people.

So percentages should be used as an additional aid for the reader rather than as a replacement for the actual data.
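One way to follow this advice is to always print the raw counts next to the percentage. A small sketch, with a hypothetical helper named `report`:

```python
def report(count, n):
    """Format a proportion so that the sample size stays visible."""
    return f"{count}/{n} ({100 * count / n:.0f}%)"


print(report(2, 4))      # 2/4 (50%)
print(report(200, 400))  # 200/400 (50%)
```

Both lines show 50%, but the reader can see at once how much data stands behind each figure.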

Page 33: Data analysis: 1. Describing data

The chart that changed medicine

Page 34: Data analysis: 1. Describing data

QUANTITATIVE DATA

Page 35: Data analysis: 1. Describing data

Graphical Technique for Quantitative Data

Height measurements among 1st year medical students

Individual   Height (cm)
1            186
2            144
3            175
4            199
5            149
6            157
7            150
8            176
9            179
10           165
11           151
12           164
13           167
14           175
15           191
16           163
17           187
18           176
19           184
20           191
21           172
22           151
23           179

Frequency distribution for quantitative data: Building a Histogram

Page 36: Data analysis: 1. Describing data

Building a Histogram…

Frequency distribution of height

Category limits [cm]   Freq.   Relative Freq.   Percent Relative Freq.
>140; <=150            3       0.13             13%
150-160                3       0.13             13%
160-170                4       0.17             17%
170-180                7       0.30             30%
180-190                5       0.22             22%
190-200                1       0.04             4%
Total                  23      1.00             100%

[Figure: histogram. x-axis: Height [cm], bin midpoints 145 to 195; y-axis: Percent Relative Frequency, 0% to 35%.]
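The binning step behind a histogram can be sketched as follows, using the 23 heights listed on Page 35 and classes of the form (lower, upper], as the first category limit ">140; <=150" suggests. The function name `bin_counts` and the variable names are illustrative.

```python
heights = [186, 144, 175, 199, 149, 157, 150, 176, 179, 165, 151, 164,
           167, 175, 191, 163, 187, 176, 184, 191, 172, 151, 179]

edges = [140, 150, 160, 170, 180, 190, 200]


def bin_counts(values, edges):
    """Count values per class; each class is (lower, upper]."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] < v <= edges[i + 1]:
                counts[i] += 1
                break
    return counts


counts = bin_counts(heights, edges)
n = len(heights)

for i, k in enumerate(counts):
    print(f"{edges[i]}-{edges[i + 1]}  {k:>2}  {k / n:.2f}  {100 * k / n:.0f}%")
```

Run on the listed heights, this gives counts of 3, 3, 4, 7, 3, 3. The two upper classes therefore differ from the 5 and 1 in the slide's table, which may have been built from slightly different values.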

Page 37: Data analysis: 1. Describing data

Graphical Techniques for Quantitative Data

Several graphical methods are used when the data are quantitative (numeric).

The most important of these graphical methods is the histogram.

The histogram is not only a powerful graphical technique for summarizing interval data; it is also used to help explain probabilities.

Page 38: Data analysis: 1. Describing data

Histogram simulations http://www.shodor.org/interactivate/activities/Histogram/

Page 39: Data analysis: 1. Describing data
Page 40: Data analysis: 1. Describing data

Summary: Graphs/Tables for Describing Data

Qualitative: Frequency Distribution (tabular summary of data), Bar Chart, Pie Chart

Quantitative: Frequency Distribution (tabular summary of data), Histogram, Line Chart (Time-Series Plot), Stem and Leaf Display
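The stem and leaf display is not shown elsewhere in these slides, so here is a minimal sketch using the 23 heights from Page 35. Taking the tens as the stem and the units as the leaf is an assumption that suits three-digit heights; the function name is illustrative.

```python
from collections import defaultdict

heights = [186, 144, 175, 199, 149, 157, 150, 176, 179, 165, 151, 164,
           167, 175, 191, 163, 187, 176, 184, 191, 172, 151, 179]


def stem_and_leaf(values):
    """Group each value by its tens (stem) and collect its units (leaves)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)


display = stem_and_leaf(heights)

for stem in range(min(display), max(display) + 1):
    leaves = "".join(str(d) for d in display.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Unlike a histogram, this display keeps every individual value readable while still showing the shape of the distribution.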

Page 41: Data analysis: 1. Describing data

Relationship between two variables

Page 42: Data analysis: 1. Describing data

Relationship between two variables

To compare two variables we use:

Scatter plot/diagram (quantitative)

Cross table (qualitative)

Page 43: Data analysis: 1. Describing data

Scatter plot – for two quantitative variables

Scatter plot, showing the strong association between enzyme activity at pH 5.5 and the 5α-reductase 2-specific mRNA expression, as expressed on the basis of β-actin (n = 30; rs = 0.81; 95% confidence interval, 0.64–0.91; P < 0.0001).

Page 44: Data analysis: 1. Describing data

Patterns of Scatter Diagrams…

Linearity and direction are the two concepts we are interested in.

Positive Linear Relationship

Negative Linear Relationship

Weak or Non-Linear Relationship
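The three patterns can be told apart numerically as well as visually. The sketch below uses Pearson's correlation coefficient as a simple measure of linear association; this is a swapped-in illustration (the figure on Page 43 reports a rank correlation, rs), and all data values are made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: sign gives direction, size gives linearity."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]   # made-up data, positive linear trend
falling = [6, 4, 5, 4, 2]  # made-up data, negative linear trend
curved = [4, 1, 0, 1, 4]   # made-up data, non-linear: (x - 3) squared

print(round(pearson_r(x, rising), 2))
print(round(pearson_r(x, falling), 2))
print(round(pearson_r(x, curved), 2))
```

A coefficient near zero does not mean "no relationship": the curved example is perfectly determined by x, yet shows no linear association, which is why the scatter plot should be inspected first.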

Page 45: Data analysis: 1. Describing data

Scatter plot

Analysis of expression level from microarray data

Squamous cell carcinoma tumor and perilesional tissue display distinctly different scatter plots from normal tissue. Expression levels for gene subset 1 in patient 1

Page 46: Data analysis: 1. Describing data

Cross Table - for two qualitative variables

Used to compare two qualitative variables

If the first variable has r categories and the second variable has c categories, then we have an r×c cross table.

Page 47: Data analysis: 1. Describing data

Association of two qualitative variables

                 Disease X
YPEL5 genotype   YES   NO   TOTAL
AA               2     0    2
AB               1     3    4
BB               0     4    4
TOTAL            3     7    10

[Figure: grouped bar chart. x-axis: YPEL5 genotype (AA, AB, BB); y-axis: frequency, 0 to 4; series: Disease, Healthy.]

Based on the data presented, do you think that YPEL5 could be associated with disease X?
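One descriptive way to approach the question is to compare the share of diseased subjects within each genotype (row proportions). A minimal sketch; the dictionary layout and variable names are illustrative, and with only 10 subjects the result describes this sample rather than demonstrating an association.

```python
# Cross table from the slide: rows = YPEL5 genotype, columns = Disease X.
cross_table = {
    "AA": {"YES": 2, "NO": 0},
    "AB": {"YES": 1, "NO": 3},
    "BB": {"YES": 0, "NO": 4},
}

# Row and column totals (the margins of the table).
row_totals = {g: sum(row.values()) for g, row in cross_table.items()}
col_totals = {
    col: sum(row[col] for row in cross_table.values())
    for col in ("YES", "NO")
}

# Proportion with disease X within each genotype (row proportions).
disease_share = {g: cross_table[g]["YES"] / row_totals[g] for g in cross_table}

for g, share in disease_share.items():
    print(f"{g}: {cross_table[g]['YES']}/{row_totals[g]} with disease ({share:.0%})")
```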

Page 48: Data analysis: 1. Describing data

Questions?

Room 512 (5th floor)

E-mail: [email protected]

Page 49: Data analysis: 1. Describing data

The projected clinical cost for breast cancer detection program in 2011-12, broken down by service category.

Page 50: Data analysis: 1. Describing data

The results of measuring the height among med. students

Individual   Height (cm)
1            186
2            144
3            175
4            199
5            149
6            157
7            150
8            176
9            179
10           165
11           151
12           164
13           167
14           175
15           191
16           163
17           187
18           176
19           184
20           191
21           172
22           151
23           179

Histogram – common mistake!

[Figure: two bar charts with one bar per subject instead of a histogram. x-axis: subjects (1 to 23); y-axis: Height [cm], 0 to 250. The second panel shows the same data with reshuffled subjects.]
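Why the per-subject bar chart is a mistake can be sketched in Python: its appearance depends on the arbitrary order of the subjects, whereas a histogram of the same values does not. The class limits follow Page 36; the function name and the random seed are arbitrary choices made for illustration.

```python
import random

heights = [186, 144, 175, 199, 149, 157, 150, 176, 179, 165, 151, 164,
           167, 175, 191, 163, 187, 176, 184, 191, 172, 151, 179]


def bin_counts(values, edges=(140, 150, 160, 170, 180, 190, 200)):
    """Histogram counts; each class is (lower, upper]."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] < v <= edges[i + 1]:
                counts[i] += 1
                break
    return counts


# Reshuffle the subjects, as in the second panel of the slide.
shuffled = heights[:]
random.Random(0).shuffle(shuffled)  # fixed seed, arbitrary choice

# The histogram counts ignore subject order, so they are unchanged.
print(bin_counts(heights) == bin_counts(shuffled))  # True
```

A chart whose shape changes when the rows are reordered says nothing about the distribution of the variable; the histogram summarizes the values themselves.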