Research Methodology PART 8 Statistical Techniques for Processing & Analysis of Data M S Sridhar Head, Library & Documentation ISRO Satellite Centre Bangalore 560017 E-mail: [email protected] & [email protected]


Research Methodology 8 M S Sridhar, ISRO 2

Statistical techniques for processing & analysis of data

1. Introduction
   Statistics: what, why and characteristics
2. Statistic types
   Quantitative & qualitative (variable & attribute) data
   Descriptive & inferential statistics
3. Processing & analysis of data
   Processing: 1. Editing 2. Coding 3. Classification 4. Tabulation
   Analysis: 1. Descriptive & inferential 2. Correlational, causal & multivariate
…contd.

Synopsis
1. Introduction to Research & Research methodology
2. Selection and formulation of research problem
3. Research design and plan
4. Experimental designs
5. Sampling and sampling strategy or plan
6. Measurement and scaling techniques
7. Data collection methods and techniques
8. Testing of hypotheses
9. Statistical techniques for processing & analysis of data
10. Analysis, interpretation and drawing inferences
11. Report writing

Statistical Techniques for Processing & Analysis of Data: contd.

4. Some processing techniques
   Tally sheet / chart
   Presentation of data: textual or descriptive, tabular, diagrammatic / graphical
5. Univariate analysis / measures
   Central tendency
   Dispersion
   Asymmetry (skewness)
6. Bivariate & multivariate analysis / measures

Statistics
• The science of statistics cannot be ignored by a researcher
• "Statistics" is both singular and plural. As a plural it means numerical facts systematically collected; as a singular it is the science of collecting, classifying and using statistics
• It is a tool for designing research, processing & analysing data and drawing inferences / conclusions
• It is also a double-edged tool that easily lends itself to abuse and misuse
  Abuse ⇒ Poor data + Sophisticated techniques = Unreliable result
  Misuse ⇒ Honest facts (hard data) + Poor techniques = Impressions
  Examples:
  - Percentage for a very small sample
  - Using the wrong average
  - Playing with probability
  - Scale & origin, and proportion between ordinate & abscissa
  - Funny correlation
  - One-dimensional figure
  - Unmentioned base

Characteristics of Statistics

1. Aggregates of facts
2. Affected by multiple causes
3. Numerically expressed
4. Collected in a systematic manner
5. Collected for a predetermined purpose
6. Enumerated or estimated according to a reasonable standard of accuracy
7. Statistics must be placed in relation to each other (context)

What does statistics do?
1. Enables facts to be presented in a precise, definite form that helps proper comprehension of what is stated. Exact facts are more convincing than vague statements
2. Helps to condense a mass of data into a few numerical measures, i.e., it summarises data and presents meaningful overall information about a mass of data
3. Helps in finding relationships between different factors and in testing the validity of assumed relationships
4. Helps in predicting the changes in one factor due to changes in another
5. Helps in the formulation of plans and policies, which require knowledge of future trends; hence statistics plays a vital role in decision making

Statistic types

• Deductive statistics describe a complete set of data
• Inductive statistics deal with a limited amount of data, like a sample
• Descriptive statistics (& causal analysis) is concerned with the development of certain indices from the raw data and with causal analysis. Measures of central tendency and measures of dispersion are typical descriptive statistical measures
• Inferential (sampling / statistical) analysis: inferential statistics is used for (a) estimation of parameter values (point and interval estimates), (b) testing of hypotheses (using parametric / standard tests and non-parametric / distribution-free tests) and (c) drawing inferences

Descriptive Statistics (Techniques)
1. Uni-dimension analysis (mostly one variable)
   (i) Central tendency - mean, median, mode, GM & HM
   (ii) Dispersion - variance, standard deviation, mean deviation & range
   (iii) Asymmetry (skewness) & kurtosis
   (iv) Relationship - Pearson's product moment correlation, Spearman's rank order correlation, Yule's coefficient of association
   (v) Others - one-way ANOVA, index numbers, time series analysis, simple correlation & regression analysis

Descriptive Statistics (Techniques) …contd.
2. Bivariate analysis
   (i) Simple regression & correlation
   (ii) Association of attributes
   (iii) Two-way ANOVA
3. Multivariate analysis
   (i) Multiple regression & correlation / partial correlation
   (ii) Multiple discriminant analysis: predicting an entity's possibility of belonging to a particular group based on several predictors
   (iii) Multi-ANOVA: extension of two-way ANOVA; ratio of among-group variance to within-group variance
   (iv) Canonical analysis: simultaneously predicting a set of dependent variables (both measurable & non-measurable)
   (v) Factor analysis, cluster analysis, etc.

Quantitative and Qualitative (Variable and Attribute) Data
• Quantitative (or numerical) data
  - an expression of a property or quality in numerical terms
  - data measured and expressed in quantity
  - enables (i) precise measurement, (ii) knowing trends or changes over time, and (iii) comparison of trends or individual units
• Qualitative (or categorical) data, on the other hand,
  - involves quality or kind, with subjectivity

Variables data are quality characteristics that are measurable values, i.e., they are measurable, normally continuous and may take on any value.
Attribute data are quality characteristics that are observed to be either present or absent, conforming or not conforming, i.e., they are countable, normally discrete and integer.

Processing and Analysis of Qualitative Data
When the feel & flavour of the situation become important, researchers resort to qualitative data (sometimes called attribute data).

Qualitative data describe attributes of a single person or a group of persons that are important to record as accurately as possible even though they cannot be measured in quantitative terms.

More time & effort are needed to collect & process qualitative data. Such data are not amenable to statistical rules & manipulations. However, scaling techniques help in converting qualitative data into quantitative data. The usual data reduction, synthesis and plotting of trends are required, but they differ substantially, and extrapolation of findings is difficult. It calls for sensitive interpretation & creative presentation.

Examples: quotations from interviews, open remarks in questionnaires, case histories bringing evidence, content analysis of verbatim material, etc. …contd.

Processing and Analysis of Qualitative Data …contd.
Note: Identifying & coding recurring answers to open-ended questions helps categorise key concepts & behaviour. One may even count & cross-analyse (this requires pattern-discerning skill); even unstructured depth interviews can be coded to summarise key concepts & present them in the form of master charts.

Qualitative coding involves classifying data which (i) were not originally created for research purposes and (ii) have very little order.

STEPS:
1. Initial formalisation with issues arising (build themes & issues)
2. Systematically describing the contents (compiling a list of key themes)
3. Indexing the data (note reflections for patterns, links, etc.) in descriptions; interpreting in relation to objective; checking the interpretation
4. Charting the data themes
5. Refining the charted material
6. Describing & discussing the emerging story

Processing and Analysis of Quantitative Data
Quantitative data are numbers representing counts, ordering or measurements that can be described, summarised (data reduction), aggregated, compared and manipulated arithmetically & statistically.
Levels of measurement (i.e., nominal, ordinal, interval & ratio) determine the kind of statistical techniques to be used.
Use of a computer is necessary in many situations.

1. Organisation and classification of data
2. Presentation of data
3. Analysis of data
   Inferential (sampling / statistical) analysis is concerned with the process of generalisation through estimation of parameter values and testing of hypotheses
4. Interpretation of data
   Inference: data processing, analysis, presentation (presenting in a table, chart or graph) & interpretation (to interpret is to expound the meaning) should lead to drawing inferences, i.e., (i) validation of hypotheses and (ii) realisation of objectives with respect to (a) relationship between variables, (b) discovering a fact, (c) establishing a general or universal law

Processing and Analysis of Quantitative Data …contd.

STEPS:
1. Data reduction: reduce large batches & data sets (a) to numerical summaries, tabular & graphical form, (b) to enable questions to be asked about observed patterns
2. Data presentation
3. Exploratory data analysis
4. Looking for relationships & trends
5. Graphical presentation

PROCESSING (aggregation & compression):
1. Editing: (i) field editing (ii) central editing
2. Coding: assigning data to a limited number of mutually exclusive but exhaustive categories or classes
3. Classification: arranging data in groups or classes on the basis of common characteristics
   (i) By attributes (statistics of attributes)
   (ii) By class intervals (statistics of variables)

Note: Class limits, class intervals, magnitude, determination of frequencies & number of classes are discussed later. The number of classes is normally 5-15; the size of the class interval is i = R / (1 + 3.3 log N), where R = range & N = no. of items to be grouped.
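The class-interval formula in the note can be tried out with a few lines of Python. This is a minimal sketch: the function name `class_interval_size` is hypothetical, the logarithm is assumed to be base 10, and the range and item count are illustrative figures (prices from Rs. 10 to Rs. 100 for 50 items).

```python
import math


def class_interval_size(value_range, n_items):
    """Size of class interval, i = R / (1 + 3.3 log N), with log taken as base 10 (assumption)."""
    return value_range / (1 + 3.3 * math.log10(n_items))


# Illustrative figures: values spanning Rs. 10 to Rs. 100, 50 items in all
value_range = 100 - 10
n_items = 50

interval = class_interval_size(value_range, n_items)
# Number of classes needed to cover the range at that interval size
n_classes = math.ceil(value_range / interval)

print(f"class interval ~ {interval:.2f}, classes needed = {n_classes}")
```

In practice the computed interval is usually rounded to a convenient size (for example 10 or 15) so that class limits are easy to read.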

Processing and Analysis of Quantitative Data …contd.
4. Tabulation / tabular presentation: to make voluminous data readily usable and easily comprehensible, three forms of presentation are possible.

A. Textual (descriptive) presentation: when the quantity of data is not too large and there is no difficulty in comprehending it while reading, textual presentation helps to emphasise certain points. E.g., there are 30 students in the class, of which 10 (one-third) are female students.

B. Tabular presentation: summarising and displaying data in a concise / compact and logical order for further analysis is the purpose of tabulation. It is a statistical representation, presented as a simple or complex table, for summarising and comparing frequencies, determining bases for and computing percentages, etc.
Note: While tabulating responses to a questionnaire, problems concerning responses like 'Don't know' and not-answered responses, computation of percentages, etc. have to be handled carefully.

C. Diagrammatic presentation

Tabular Presentation of Data
A table organises data in rows and columns, with cells containing data for further statistical treatment and decision making. Four kinds of classification used in tabulation are:

i) Qualitative classification: based on qualitative characteristics like status, nationality and gender
ii) Quantitative classification: based on characteristics measured quantitatively like age, height and income (assigning class limits for the values forms classes)
iii) Temporal classification: categorised according to time (with time as the classifying variable). E.g., hours, days, weeks, months, years
iv) Spatial classification: place as the classifying variable. E.g., village, town, block, district, state, country

Parts of Table
A table is conceptualised as data presented in rows and columns along with some explanatory notes. Tabulation can be one-way, two-way or three-way classification, depending upon the number of characteristics involved.

i) Table number: for identification, placed at the top or at the beginning of the title of the table; whole numbers are used in ascending order; subscripted numbers are used if there are many tables
ii) Title: usually placed at the head, narrates the contents of the table; clearly, briefly and carefully worded so as to make interpretations from the table clear and free from ambiguity
iii) Captions or column headings: column designations that explain the figures of the column

Parts of Table …contd.

iv) Stubs or row headings (stub column): designations of the rows
v) Body of the table: contains the actual data
vi) Unit of measurement: stated along with the title; does not change throughout the table unless stated, as when different units are used for rows and columns; if the figures are large, they are rounded and the rounding is indicated
vii) Source note: placed at the bottom of the table to indicate the source of the data presented
viii) Footnote: the last part of the table; explains a specific feature of the data content which is not self-explanatory and has not been explained earlier

Preparation of Frequency Distribution Table
1. Deciding the number of classes: the rule of thumb is to have 5 to 15 classes. Know the range and variations in the variable's value. Range is the difference between the largest and the smallest value of the variable (i.e., it is the sum of all class intervals, or the number of classes multiplied by the class interval). (Class intervals are the various intervals of the variable chosen for classifying data.)
2. Deciding the size of each class: 1 and 2 are inter-linked
3. Determining the class limits: choose a value less than the minimum value of the variable as the lower limit of the first class, and a value greater than the maximum value of the variable as the upper class limit of the last class.

Note: It is important to choose class limits in such a way that the mid-point or class mark of each class coincides, as far as possible, with any value around which the data tend to be concentrated, i.e., class limits are chosen in such a way that the mid-point is close to the average.

Class Intervals in Frequency Tables

[Figure: two class intervals marked on a number line.
A. Even class-interval & its mid-point: lower limit 10, mid-point 15, upper limit 20 (5 units on either side of the mid-point; values 11 to 19 shown).
B. Odd class-interval & its mid-point: lower limit 10, mid-point 12.5, upper limit 15 (2.5 units on either side of the mid-point; values 11 to 14 shown).]

Preparation of Frequency Distribution …contd.

Two methods for class limits: exclusive & inclusive type class intervals for determination of the frequency of each class (see the tally sheet examples given later).

(i) Exclusive method: the upper class limit of one class equals the lower class limit of the next class. Suitable for data of a continuous variable; here the upper class limit is excluded from, but the lower class limit is included in, the interval.

(ii) Inclusive method: both class limits are part of the class interval. An adjustment of the class interval is made if there is a 'gap' or discontinuity between the upper limit of a class and the lower limit of the next class: divide the difference between the upper limit of the first class and the lower limit of the second class by 2, subtract it from all lower limits and add it to all upper limits.
Adjusted class mark = (Adjusted upper limit + Adjusted lower limit) / 2
This adjustment restores continuity of data in the frequency distribution.
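The inclusive-method adjustment can be sketched in Python as follows. The function name `make_continuous` and the three sample classes are hypothetical, chosen only to illustrate the rule of splitting the gap equally between adjacent limits.

```python
def make_continuous(classes):
    """Close the gap between inclusive classes given as (lower, upper) pairs in ascending order."""
    # Gap between the upper limit of the first class and the lower limit of the second
    gap = classes[1][0] - classes[0][1]
    half_gap = gap / 2
    # Subtract half the gap from every lower limit, add it to every upper limit
    return [(lower - half_gap, upper + half_gap) for lower, upper in classes]


# Hypothetical inclusive classes with a gap of 1 between neighbours
inclusive_classes = [(11, 20), (21, 30), (31, 40)]

adjusted_classes = make_continuous(inclusive_classes)
adjusted_class_marks = [(lower + upper) / 2 for lower, upper in adjusted_classes]

for (lower, upper), mark in zip(adjusted_classes, adjusted_class_marks):
    print(f"{lower} - {upper}  class mark {mark}")
```

After the adjustment the upper limit of each class equals the lower limit of the next, so the classes can be read as exclusive-type intervals.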

Preparation of Frequency Distribution …contd.
4. Find the frequency of each class (i.e., how many times that observation occurs in the raw data) by tally marking. The frequency of an observation is the number of times that observation occurs. A frequency table gives the class intervals and the frequencies associated with them.

Loss of information: a frequency distribution summarises raw data to make it concise and comprehensible, but does not show the details that are found in the raw data.

Bivariate frequency distribution: a frequency distribution of two variables (e.g., no. of books in stock and budget of 10 libraries).

Frequency distribution with unequal classes: when some classes are either densely or sparsely populated, the observations in them deviate more from their respective class marks than those in other classes. In such cases unequal classes are more appropriate. They are formed in such a way that the class marks coincide, as far as possible, with a value around which the observations in a class tend to concentrate.

Frequency array: for a discrete variable, the classification of its data is known as a frequency array (e.g., no. of books in 10 libraries).

Analysis of Data
Computation of certain indices or measures, searching for patterns of relationships, estimating values of unknown parameters, & testing of hypotheses for inferences.

1. Descriptive analysis: largely the study of distributions of one variable (uni-dimension)
   Univariate analysis → one variable
   Bivariate analysis → two variables
   Multivariate analysis → more than two variables
2. Inferential or statistical analysis:
   • Correlation & causal analysis:
     - Joint variation of two or more variables is correlation analysis
     - How one or more variables affect another variable is causal analysis
     - Functional relation existing between two or more variables is regression analysis
   • Multivariate analysis: simultaneously analysing more than two variables
   • Multiple regression analysis: predicting the dependent variable based on its covariance with all concerned independent variables

Tally (tabular) sheets / charts for frequency distribution of qualitative, quantitative and grouped / interval data

I. Single variable (univariate measures)
   1. Quantitative
      (i) Simple data
      (ii) Frequency distribution of grouped / interval data
   2. Qualitative (attributes)
II. Two or more variables (bivariate & multivariate measures)
   1. Quantitative / quantitative
      (i) Simple (ii) Frequency distribution
   2. Quantitative / qualitative (attributes)
      (i) Simple (ii) Frequency distribution
   3. Qualitative / qualitative (attributes)

Examples of tabulation and tabular presentation follow.

Table 8.1 (Quantitative data): Frequency distribution of citations in technical reports

No. of citations   Tally        Frequency (No. of tech. reports)
0                  ||           2
1                  ||||         4
2                  |||||        5
3                  ||||         4
4                  ||||| ||     7
5                  ||||| |||    8
Total                           30

Table 8.2 (Qualitative data): Frequency distribution of qualification (educational level) of users

Qualification      Tally        Frequency (No. of users)
Undergraduates     ||||| |      6
Graduates          ||||| ||||   9
Postgraduates      ||||| ||     7
Doctorates         |||          3
Total                           25

Table 8.3: Frequency distribution of age of 66 users who used a branch public library during an hour (grouped / interval data of a single variable)
(Note that the raw data of age of individual users is already grouped here)

Age in years (groups / classes)   Frequency (No. of users)
< 11                              11
11 - 20                           14
21 - 30                           16
31 - 40                           12
41 - 50                           6
51 - 60                           3
> 60                              4
Total                             66

Table 8.4: No. of books acquired by a library over the last six years

Year (Qualitative)   No. of books acquired (Quantitative)
2000                 772
2001                 910
2002                 873
2003                 747
2004                 832
2005                 891
Total                5025

Table 8.5: The daily visits of users to a library during a week are recorded and summarised

Day (Qualitative)    Number of users (Quantitative)
Monday               391
Tuesday              247
Wednesday            219
Thursday             278
Friday               362
Saturday             96
Total                1593

Table 8.6: The frequency distribution of number of authors per paper of 224 sample papers

No. of authors   No. of papers
1                43
2                51
3                53
4                30
5                19
6                15
7                6
8                4
9                2
10               1
Total            224

Table 8.7: Total books (B), journals (J) and reports (R) issued out from a library counter in one hour are recorded as below:

B B B J B B
B B J B B B
B B B B B B
B B B B B J
B R B B B J

A frequency table can be worked out for the above data as shown below:

Document type   Frequency (Number)   Relative frequency   Cumulative frequency   Cumulative relative frequency
Books           20                   0.8                  20                     0.8
Journal         4                    0.16                 24                     0.96
Reports         1                    0.04                 25                     1.0
Total           25                   1.0
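The relative and cumulative columns of a frequency table like this one follow mechanically from the frequencies. A minimal Python sketch, taking the three frequencies from the table as given (the variable names are hypothetical):

```python
from itertools import accumulate

# Frequencies as listed in the frequency table for Table 8.7
frequencies = {"Books": 20, "Journal": 4, "Reports": 1}

total = sum(frequencies.values())
cumulative = list(accumulate(frequencies.values()))  # running totals: 20, 24, 25

rows = []
for (doc_type, freq), cum_freq in zip(frequencies.items(), cumulative):
    # (type, frequency, relative frequency, cumulative frequency, cumulative relative frequency)
    rows.append((doc_type, freq, freq / total, cum_freq, cum_freq / total))

for doc_type, freq, rel, cum_freq, cum_rel in rows:
    print(f"{doc_type:<8} {freq:>3} {rel:>5.2f} {cum_freq:>3} {cum_rel:>5.2f}")
```

The last cumulative relative frequency should always come to 1.0, which is a quick check that no category has been dropped.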

Table 8.7 contd.
Note: If the proportions of each type of document (category) are of interest rather than the actual numbers, the same can be expressed in percentages or as proportions, as shown below:

The proportion of books, journals and reports issued from a library in one hour is 20:4:1

OR

Type of document   Proportion of each type of document (%)
Books              80
Journal            16
Reports            4
Total              100

Table 8.8: Given below is a summarised table of the relevant records retrieved from a database in response to six queries

Search No.   Total documents retrieved   Relevant documents retrieved   % of relevant records retrieved
1            79                          21                             26.6
2            18                          10                             55.6
3            20                          11                             55.0
4            123                         48                             39.0
5            6                           8                              50.0
6            109                         48                             44.0
Total        375                         146

Note: The percentage of relevant records retrieved for each query gives a better picture of which query is more efficient than observing just the frequencies.

Table 8.9: Frequency distribution of borrowed use of books of a library over four years

No. of times borrowed (Quantitative)   No. of books (Quantitative)   Percentage   Cumulative percentage
0                                      19887                         57.12        57.12
1                                      4477                          12.56        69.68
2                                      4047                          11.93        81.61
3                                      1328                          3.81         85.42
4                                      897                           2.57         87.99
5                                      726                           2.02         90.01
6                                      557                           1.58         91.68
7                                      447                           1.28         92.96
8                                      348                           1.00         93.96
9                                      286                           0.92         94.78
10                                     290                           0.84         95.62
>10                                    1524                          4.38         100.00

Table 8.10: The raw data of self-citations in a sample of 25 technical reports are given below:

5 0 1 4 0
3 8 2 3 0
4 2 1 0 7
3 1 2 6 0
2 2 5 7 2

Frequency distribution of self-citations of technical reports:

No. of self-citations   Frequency (No. of reports)   Less than (or equal) cumulative frequency   More than (or equal) cumulative frequency
                        No.     %                    %                                           %
0                       5       20                   20                                          100
1                       3       12                   32                                          80
2                       6       24                   56                                          68
3                       3       12                   68                                          44
4                       2       8                    76                                          32
5                       2       8                    84                                          24
6                       1       4                    88                                          16
7                       2       8                    96                                          12
8                       1       4                    100                                         4
Total                   25
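The less-than and more-than cumulative columns can be derived directly from the raw self-citation counts. A minimal Python sketch using the raw values of Table 8.10 (variable names are hypothetical):

```python
from collections import Counter

# Raw self-citation counts from Table 8.10, row by row
raw = [5, 0, 1, 4, 0,
       3, 8, 2, 3, 0,
       4, 2, 1, 0, 7,
       3, 1, 2, 6, 0,
       2, 2, 5, 7, 2]

counts = Counter(raw)
n = len(raw)

table = []
seen_so_far = 0  # reports with fewer self-citations than the current value
for value in sorted(counts):
    freq = counts[value]
    more_than_or_equal = n - seen_so_far   # reports with at least `value` self-citations
    seen_so_far += freq
    less_than_or_equal = seen_so_far       # reports with at most `value` self-citations
    table.append((value, freq,
                  100 * freq / n,
                  100 * less_than_or_equal / n,
                  100 * more_than_or_equal / n))

for value, freq, pct, less_pct, more_pct in table:
    print(f"{value:>2} {freq:>2} {pct:>5.0f} {less_pct:>5.0f} {more_pct:>5.0f}")
```

The two cumulative columns run in opposite directions: the less-than column ends at 100% and the more-than column starts at 100%.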

Table 8.11 (Qualitative data): Responses in the form of True (T) or False (F) to a questionnaire (opinionnaire) are tabulated and given along with the qualitative raw data

Response      Frequency
True          17
False         8
No response   5
Total         30

Raw data:
T T T F F
F T T T T
T T F F T
F T F T T
T T T T F

Grouped or Interval Data
So far (except in Table 8.3) only discrete data have been presented, and the number of cases / items has also been limited.

As against discrete data, continuous data, like heights of people, have to be collected in groups or intervals, like height between 5' and 5'5", for a meaningful analysis.

Even a large quantity of discrete data requires compression and reduction for meaningful observation, analysis and inferences.

Table 8.12 in the next slide presents 50 observations. If we create a frequency table of these discrete data, it will have 22 lines, as there are 22 different values (ranging from Rs. 10/- to Rs. 100/-). Such large tables are undesirable: they take more time, and the resulting frequency table is less appealing. In such situations, we transform discrete data into grouped or interval data by creating a manageable number of classes or groups. Such data compression and reduction are inevitable and worthwhile despite some loss of accuracy (or data).

Table 8.12 (Grouped or interval data): Raw data of prices (in Rs.) of a set of 50 popular science books in Kannada

30 80 100 12 40
50 60 40 30 45
40 30 70 43 40
25 50 10 30 35
18 35 60 35 25
27 25 25 30 30
35 35 14 32 35
25 30 40 15 30
20 16 13 30 60
20 65 60 40 10

Frequency distribution of the individual prices in Table 8.12 (before grouping)

Price in Rs.   No. of books
10             2
12             1
13             1
14             1
15             1
16             1
18             1
20             2
25             5
27             1
30             9
32             1
35             6
40             6
43             1
45             1
50             2
60             4
65             1
70             1
80             1
100            1

Mean = Rs. 35.9
Median = Rs. 31
Mode = Rs. 30

Frequency distribution of grouped or interval data of price (in Rs.) of popular science books in Kannada (Table 8.12):

Price (in Rs.) (class)   Frequency (f) (No. of books)
1 - 10                   2
11 - 20                  8
21 - 30                  15
31 - 40                  13
41 - 50                  4
51 - 60                  4
61 - 70                  2
71 - 80                  1
81 - 90                  0
91 - 100                 1
Total                    50
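Grouping raw values into inclusive classes of equal width can be sketched as below, using the 50 raw prices of Table 8.12 and a class width of 10. The helper `class_of` and the other names are hypothetical.

```python
from collections import Counter

# Raw prices (in Rs.) from Table 8.12, row by row
prices = [30, 80, 100, 12, 40,
          50, 60, 40, 30, 45,
          40, 30, 70, 43, 40,
          25, 50, 10, 30, 35,
          18, 35, 60, 35, 25,
          27, 25, 25, 30, 30,
          35, 35, 14, 32, 35,
          25, 30, 40, 15, 30,
          20, 16, 13, 30, 60,
          20, 65, 60, 40, 10]

WIDTH = 10  # inclusive classes 1-10, 11-20, ..., 91-100


def class_of(price):
    """Return the (lower, upper) inclusive class limits that contain the price."""
    lower = ((price - 1) // WIDTH) * WIDTH + 1
    return (lower, lower + WIDTH - 1)


counts = Counter(class_of(p) for p in prices)

# List every class in order, including empty ones such as 81-90
classes = [(lower, lower + WIDTH - 1) for lower in range(1, 100, WIDTH)]
grouped = [(lower, upper, counts.get((lower, upper), 0)) for lower, upper in classes]

for lower, upper, freq in grouped:
    print(f"{lower:>3} - {upper:<3} {freq:>3}")
print("Total", sum(freq for _, _, freq in grouped))
```

Listing the classes explicitly, rather than only those that occur, keeps empty classes visible in the table.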

Home work

Work out a frequency table with less than cumulative and more than cumulative frequencies for the raw data of number of words per line in a book given below:

12 10 12 09 11 10 13 13
07 11 10 10 09 10 12 11
01 10 13 10 15 13 11 12
08 13 11 10 08 12 13 11
09 11 14 12 07 12 11 10

Diagrammatic / Graphical Presentation
- gives the quickest understanding of the actual situation to be explained by data, compared to textual or tabular presentation
- translates quite effectively the highly abstract ideas contained in numbers into a more concrete and easily comprehensible form
- may be less accurate but is more effective than a table
- tables and diagrams may be suitable to illustrate discrete data, while continuous data are better represented by graphs

Note: Sample charts are constructed and presented using data from previously presented tables. Different types of data may require different modes of diagrammatic representation.

Three important kinds of diagrams:
i) Geometric diagram
   (a) Bar (column) chart: simple, multiple, and component
   (b) Pie
ii) Frequency diagram
   (a) Histogram
   (b) Frequency polygon
   (c) Frequency curve
   (d) Ogive or cumulative frequency curve
iii) Arithmetic line graph

Simple column chart for data in Table 8.2: Qualification of users

[Figure: column chart. Horizontal axis: qualification (Undergraduates, Graduates, Postgraduates, Doctorates). Vertical axis: No. of users (0 to 10). Column values: 6, 9, 7, 3.]

Bar chart for data from Table 8.9: Frequency distribution of borrowed use of books of a library over four years

[Figure: column chart. Horizontal axis: No. of times borrowed (0 to 10, >10). Vertical axis: No. of books (0 to 25000). Column values as in Table 8.9.]

Bar chart for data in Table 8.1: Frequency distribution of citations in technical reports

[Figure: horizontal bar chart. Vertical axis: No. of citations (0 to 5). Horizontal axis: No. of reports (0 to 10). Bar values: 2, 4, 5, 4, 7, 8.]

[Figure: Component bar chart]

[Figure: 100% component column chart]

[Figure: Grouped column chart]

[Figure: Comparative 100% columnar chart]

[Figure: Chart with figures / symbols]

Histogram (frequency polygon) for data in Table 8.6: No. of authors per paper

[Figure: histogram with frequency polygon overlaid. Horizontal axis: No. of authors (1 to 10). Vertical axis: No. of papers (0 to 60). Values: 43, 51, 53, 30, 19, 15, 6, 4, 2, 1.]

Line graphs

Line graph for data in Table 8.6: No. of authors per paper

[Figure: line graph. Horizontal axis: No. of authors (1 to 10). Vertical axis: No. of papers (0 to 60). Values: 43, 51, 53, 30, 19, 15, 6, 4, 2, 1.]

Frequency Distribution of No. of Words per Line of a Book (Home work)

[Figure: line graph of less than (or equal) cumulative frequency in %. Vertical axis: 0 to 120. Plotted values: 2.5, 7.5, 12.5, 20, 42.5, 62.5, 80, 95, 97.5, 100.]

Cumulative frequency graph of reduction in no. of journals subscribed and no. of reports added over the years

[Figure: line graph with two series, Reports (annual intake) and Journals (subscribed). Horizontal axis: year (1980 to 2002). Vertical axis: 0 to 1200.]

Year   Reports (annual intake)   Journals (subscribed)
1980   1063                      533
1985   936                       519
1990   523                       444
1995   288                       416
2000   67                        326
2002   29                        300

Line graph of less than or equal cumulative frequency of self-citations in technical reports (Table 8.10)

[Figure: line graph. Horizontal axis: No. of self-citations. Vertical axis: cumulative % of reports (0 to 120). Plotted values: 20, 32, 56, 68, 76, 84, 88, 96, 100.]

Line graph of more than or equal cumulative frequency of self-citations in reports (Table 8.10)

[Figure: line graph. Horizontal axis: No. of self-citations. Vertical axis: cumulative % of reports (0 to 100). Plotted values include 80, 68, 44, 32, 24, 16, 12, 4.]

Pie diagram / chart for Table 8.7: No. of books, journals and reports issued per hour

[Figure: pie chart. Books 80%, Journals 16%, Reports 4%.]

Univariate Measures: A. Central Tendency
Measures of central tendency, or averages, are used to summarise data. Each specifies a single most representative value to describe the data set.

For the arithmetic mean:
1. The sum of the deviations of individual values of x from the mean will always add up to zero
2. The positive deviations must balance the negative deviations
3. It is very sensitive to extreme values
4. The sum of squares of the deviations about the mean is minimum

A good measure of central tendency should meet the following requisites:
- easy to calculate and understand
- rigidly defined
- representative of data
- should have sampling stability
- should not be affected by extreme values

Page 57: Data Analysis Techq

Univariate Measures: A. Central Tendency

1. MEAN: The arithmetic mean (also called the statistical or arithmetic average) is the most commonly used measure. It is obtained by dividing the sum of the values of the items by the number of items.

Characteristics of Mean
- most representative figure for the entire mass of data
- tells the point about which items have a tendency to cluster
- unduly affected by extreme items (very sensitive to extreme values)
- the positive deviations must balance the negative deviations (the sum of the deviations of individual values of x from the mean always adds up to zero)
- the sum of squares of the deviations about the mean is minimum
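The two algebraic characteristics listed above can be checked numerically. The following is a minimal sketch, assuming the 11-value example series used elsewhere in these slides; the helper name `sum_of_squares` is only illustrative.

```python
# Example series used throughout these slides.
data = [4, 6, 7, 8, 9, 10, 11, 11, 11, 12, 13]

mean = sum(data) / len(data)


def sum_of_squares(values, centre):
    """Sum of squared deviations of the values about a chosen centre."""
    return sum((x - centre) ** 2 for x in values)


# Property: deviations about the mean cancel out (up to rounding error).
deviation_total = sum(x - mean for x in data)

# Property: the sum of squared deviations is smallest about the mean,
# so any other centre (here the median 10 and the mode 11) gives more.
ss_mean = sum_of_squares(data, mean)
ss_median = sum_of_squares(data, 10)
ss_mode = sum_of_squares(data, 11)

print(round(mean, 2), round(deviation_total, 10))
print(round(ss_mean, 2), round(ss_median, 2), round(ss_mode, 2))
```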

Page 58: Data Analysis Techq

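Univariate Measures: A. Central Tendency

1. MEAN

$$\bar{X} = \frac{\sum x_i}{n} = \frac{X_1 + X_2 + \dots + X_n}{n}$$

EX: 4 6 7 8 9 10 11 11 11 12 13

$$\bar{X} = \frac{102}{11} = 9.27$$

For grouped or interval data

$$\bar{X} = \frac{\sum f_i x_i}{\sum f_i} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_n x_n}{f_1 + f_2 + \dots + f_n}, \qquad \sum f_i = n$$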

Page 59: Data Analysis Techq

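Mean for grouped or interval data

$$\bar{X} = \frac{\sum f_i X_i}{n} = \frac{f_1 X_1 + f_2 X_2 + \dots + f_n X_n}{f_1 + f_2 + \dots + f_n}, \quad \text{where } n = \sum f_i$$

Formula for Weighted Mean:
$$\bar{X}_w = \frac{\sum W_i X_i}{\sum W_i}$$

Formula for Mean of Combined Sample:
$$\bar{Z} = \frac{n\bar{X} + m\bar{Y}}{n + m}$$

Formula for Shortcut or Assumed Average Method (A = assumed average):
$$\bar{X} = A + \frac{\sum f_i (X_i - A)}{n}, \quad \text{where } n = \sum f_i$$

NOTE: Step deviation method takes the common factor out to enable simple working and uses the formula
$$\bar{X} = g + \left[\frac{\sum f d}{n}\right](i)$$
where g is the value taken for the assumed average class, d is the distance of a class from the assumed average class (in class widths) and i is the class width.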

Page 60: Data Analysis Techq

Calculation of the mean (X̄) from a frequency distribution of grouped or interval data of price (in Rs.) of popular science books in Kannada (Table 8.12) using the step deviation method is shown below:

Price (in Rs.)     f     cf      d     fd     d2    fd2
(class)
 1 -- 10           2      2     -4     -8     16     32
11 -- 20           8     10     -3    -24      9     72
21 -- 30          15     25     -2    -30      4     60
31 -- 40          13     38     -1    -13      1     13
41 -- 50           4     42      0      0      0      0
51 -- 60           4     46      1      4      1      4
61 -- 70           2     48      2      4      4      8
71 -- 80           1     49      3      3      9      9
81 -- 90           1     50      4      4     16     16
Total             50                  -60           214

f = frequency (no. of books); cf = cumulative less than or equal frequency; d = distance of class from the assumed average class

g = 46 ; ∑fd = -60 ; n = 50 ; i = 10
X̄ = g + [∑fd / n] (i) = 46 + [-60 / 50] (10) = 34.0
Note: Compare answer with mean calculated as discrete data in Table 8.12
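The step deviation working can be cross-checked in a few lines. This is a sketch only: the function name `step_deviation_mean` is illustrative, g = 46 is the slide's value for the class 41 -- 50, and the last class (81 -- 90) is assumed to hold one book so that the frequencies total the stated n = 50.

```python
def step_deviation_mean(freqs, d_values, g, width):
    """Mean of grouped data: g + (sum(f*d) / n) * width."""
    n = sum(freqs)
    sum_fd = sum(f * d for f, d in zip(freqs, d_values))
    return g + (sum_fd / n) * width


# Price classes 1-10, 11-20, ..., 81-90 (assumption: last class has f = 1).
freqs = [2, 8, 15, 13, 4, 4, 2, 1, 1]
d_values = [-4, -3, -2, -1, 0, 1, 2, 3, 4]  # steps from the class 41-50

mean_step = step_deviation_mean(freqs, d_values, g=46, width=10)

# Cross-check with the direct formula sum(f*x) / n, where each class is
# represented by the point x = g + d * width.
points = [46 + d * 10 for d in d_values]
mean_direct = sum(f * x for f, x in zip(freqs, points)) / sum(freqs)

print(mean_step, mean_direct)
```

Both routes give the same value because the step deviation method only rescales and shifts the class values before averaging.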

Page 61: Data Analysis Techq

Assumed average (shortcut) method & step deviation method

Table: Calculation of the mean (X̄) from a frequency distribution. Data represent weights of 265 male freshman students at the University of Washington

Class-interval (weight)     f      d      fd
 90 --  99                  1     -5      -5
100 -- 109                  1     -4      -4
110 -- 119                  9     -3     -27
120 -- 129                 30     -2     -60
130 -- 139                 42     -1     -42
140 -- 149                 66      0       0
150 -- 159                 47      1      47
160 -- 169                 39      2      78
170 -- 179                 15      3      45
180 -- 189                 11      4      44
190 -- 199                  1      5       5
200 -- 209                  3      6      18

N = 265 ; ∑fd = 237 - 138 = 99

$$\bar{X} = g + \frac{\sum f d}{N}(i) = 145 + \frac{99}{265}(10) = 145 + (0.3736)(10) = 145 + 3.74 = 148.74$$

$$\bar{X} = A + \frac{\sum f_i (X_i - A)}{\sum f_i}$$

Page 62: Data Analysis Techq

Univariate Measures: A. Central Tendency contd..

WEIGHTED MEAN
$$\bar{X}_w = \frac{\sum W_i X_i}{\sum W_i}$$

MEAN OF COMBINED SAMPLE
$$\bar{Z} = \frac{N\bar{X} + M\bar{Y}}{N + M}$$

MOVING AVERAGE

SHORTCUT OR ASSUMED AVERAGE METHOD
$$\bar{X} = A + \frac{\sum (X_i - A)}{n} \qquad \bar{X} = A + \frac{\sum f_i (X_i - A)}{\sum f_i}$$

NOTE: Step deviation method takes the common factor out to enable simple working
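The weighted mean, combined sample mean and assumed average formulas can be tried on the 11-value example series from the earlier slides. This is an illustrative sketch; the function names, the split of the series into two sub-samples, and the assumed average A = 10 are choices made here, not part of the slides.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def combined_mean(mean_x, n, mean_y, m):
    """Mean of two samples pooled together: (n*X + m*Y) / (n + m)."""
    return (n * mean_x + m * mean_y) / (n + m)


def assumed_average_mean(values, assumed):
    """Shortcut method: A + sum(x - A) / n."""
    return assumed + sum(x - assumed for x in values) / len(values)


series = [4, 6, 7, 8, 9, 10, 11, 11, 11, 12, 13]
plain_mean = sum(series) / len(series)

# Weighted mean: distinct values weighted by how often they occur.
distinct = [4, 6, 7, 8, 9, 10, 11, 12, 13]
counts = [1, 1, 1, 1, 1, 1, 3, 1, 1]
w_mean = weighted_mean(distinct, counts)

# Combined mean: split the series into two sub-samples and pool them again.
first, second = series[:5], series[5:]
c_mean = combined_mean(sum(first) / len(first), len(first),
                       sum(second) / len(second), len(second))

# Assumed average method with A = 10.
a_mean = assumed_average_mean(series, 10)

print(round(plain_mean, 2), round(w_mean, 2), round(c_mean, 2), round(a_mean, 2))
```

All three routes reproduce the ordinary arithmetic mean of the series, which is the point of the shortcut formulas.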

Page 63: Data Analysis Techq

Univariate Measures: A. Central Tendency

2. Median: Middle item of a series when arranged in ascending or descending order of magnitude

M = value of the ((N + 1) / 2)th item

EX: 11 7 13 4 11 9 6 11 10 12 8

Arranged in ascending order:
Value       4   6   7   8   9  10  11  11  11  12  13
Position    1   2   3   4   5   6   7   8   9  10  11

N = 11, so the median is the (11 + 1) / 2 = 6th item: M = 10

FOR FREQUENCY DISTRIBUTION
$$M = L + \frac{N/2 - C_f}{f} \times i$$

L = lower limit of the median class
Cf = cum. freq. of the class preceding the median class
f = simple freq. of the median class
i = width of the class interval of the median class

Note: As a positional average, the median does not involve the values of all items and is more useful for qualitative phenomena
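For an ungrouped series the positional rule can be written directly. A minimal sketch, with the function name `median_by_position` chosen here for illustration; for an even number of items it averages the two middle items, a common convention that the slide does not spell out.

```python
def median_by_position(values):
    """Median as the value of the ((N + 1) / 2)th item of the sorted series."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Convert the 1-based position (N + 1) / 2 to a 0-based index.
        return ordered[(n + 1) // 2 - 1]
    # Even N: the position falls between two items, so average them.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


series = [11, 7, 13, 4, 11, 9, 6, 11, 10, 12, 8]
print(median_by_position(series))
```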

Page 64: Data Analysis Techq

Median

- In layman's language, the median is a divider, like the ‘divider’ on a road that divides the road into two halves.
- It is a positional value of the variable that divides the distribution into two equal parts, i.e., the median of a set of observations is a value that divides the set into two halves so that one half of the observations are less than or equal to the median value and the other half are greater than or equal to the median value.
- Extreme items do not affect the median, i.e., the median is a useful measure as it is not unduly affected by extreme values, and it is especially useful in open-ended frequency distributions.
- For discrete data, mean and median do not change if all the measurements are multiplied by the same positive number and the result is divided later by the same constant.
- As a positional average, the median does not involve the values of all items and it is more useful for qualitative phenomena.
- For moderately skewed unimodal distributions, the median usually lies between the arithmetic mean and the mode.

Page 65: Data Analysis Techq

Median of grouped or interval data

$$M = L + \frac{W}{F}(i)$$

Where,
W = [n/2] – Cf (no. of observations to be added to the cumulative total in the previous class in order to reach the middle observation in the array)
L = Lower limit of the median class (the class in which the middle observation of the array lies)
Cf = Cumulative frequency of the class preceding the median class
i = Width of the class interval of the median class
F = Frequency of the median class

Page 66: Data Analysis Techq

Calculation of the median (M) from a frequency distribution of grouped or interval data of price (in Rs.) of popular science books in Kannada (Table 8.12) is given below:

Price (in Rs.)    Frequency (f)     Cumulative less than or
(class)           (No. of books)    equal frequency (cf)
 1 -- 10           2                  2
11 -- 20           8                 10
21 -- 30          15                 25
31 -- 40          13                 38
41 -- 50           3                 41
51 -- 60           3                 44
61 -- 70           2                 46
71 -- 80           1                 47
81 -- 90           0                 47
91 -- 100          1                 48
> 100              2                 50
Total             50

L = 21 ; Cf = 10 ; i = 10 ; F = 15 ; W = [n/2] – Cf = [50/2] – 10 = 15
M = L + (W/F) (i) = 21 + (15/15) (10) = 31
Note: Compare answer with median calculated as discrete data in Table 8.12
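The grouped-data formula M = L + (W/F)(i) can be sketched as a small function. The name `grouped_median` and the list layout are assumptions made for illustration; the open-ended last class is given a nominal lower limit of 101, which does not affect the result because the median falls in an earlier class.

```python
def grouped_median(lower_limits, freqs, width):
    """Median of grouped data: L + ((n/2 - Cf) / F) * width."""
    half = sum(freqs) / 2
    cumulative = 0  # Cf: cumulative frequency of the classes already passed
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cumulative + f >= half:
            # This is the median class; interpolate inside it.
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("the distribution has no observations")


# Price (in Rs.) classes 1-10, 11-20, ..., 91-100 and the open class above 100.
lower_limits = [1, 11, 21, 31, 41, 51, 61, 71, 81, 91, 101]
freqs = [2, 8, 15, 13, 3, 3, 2, 1, 0, 1, 2]

median_price = grouped_median(lower_limits, freqs, width=10)
print(median_price)
```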

Page 67: Data Analysis Techq

Median for grouped or interval data

TABLE: Calculation of the median (M). Data represent weights of 265 male freshman students at the University of Washington

Class-interval (weight)     f      Cumulative f ("less than")
 90 --  99                  1        1
100 -- 109                  1        2
110 -- 119                  9       11
120 -- 129                 30       41
130 -- 139                 42       83
140 -- 149                 66      149
150 -- 159                 47      196
160 -- 169                 39      235
170 -- 179                 15      250
180 -- 189                 11      261
190 -- 199                  1      262
200 -- 209                  3      265

N = 265 ; N/2 = 265/2 = 132.5 ; W = N/2 – Cf

$$M = L + \frac{W}{f}(i) = 140 + \frac{132.5 - 83}{66}(10) = 140 + \frac{49.5}{66}(10) = 140 + (0.750)(10) = 140 + 7.50 = 147.5$$

Page 68: Data Analysis Techq

Univariate Measures: A. Central Tendency

3. Mode

MODE is the most commonly or frequently occurring value in a series

EX.: 4 6 7 8 9 10 11 11 11 12 13 (mode Z = 11, which occurs three times)

For Frequency Distribution
$$Z = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times i \quad \text{OR} \quad Z = L + \frac{f_2}{f_2 + f_1} \times i$$

L = Lower limit of the modal class
Δ1 = Difference in freq. between the modal class and the preceding class
Δ2 = Difference in freq. between the modal class and the succeeding class
i = Width of the class interval of the modal class
f1 = Freq. of the class preceding the modal class
f2 = Freq. of the class succeeding the modal class

Page 69: Data Analysis Techq

Mode

- Mode is the most commonly or frequently occurring value/observed data, or the most typical value of a series, or the value around which the maximum concentration of items occurs. In other words, the mode of a categorical or a discrete numerical variable is that value of the variable which occurs the maximum number of times
- The mode is not affected by extreme values in the data and can easily be obtained from an ordered set of data
- The mode does not necessarily describe ‘most’ (for example, more than 50%) of the cases
- Like the median, the mode is also a positional average and is not affected by the values of extreme items. Hence the mode is useful in eliminating the effect of extreme variations and in studying the most popular (highest occurring) case (used in qualitative data)
- The mode is usually a good indicator of the centre of the data only if there is one dominating frequency. However, it does not give relative importance and, like the median, is not amenable to algebraic treatment
- Median lies between mean & mode
- For a normal distribution, mean, median and mode are equal (one and the same)

Page 70: Data Analysis Techq

Mode for grouped or interval data

For a frequency distribution with grouped (or interval) quantitative data, the modal class is the class interval with the highest frequency. This is more useful when we measure a continuous variable, where individual observed values rarely repeat. Modal class in Table 8.3 is age group 21-30. Please note that since the notion of location or central tendency requires order, the mode of nominal data identifies only the most frequent category and not a central location.

$$Z = L + \frac{\Delta_1}{\Delta_1 + \Delta_2}(i) \quad \text{OR} \quad Z = L + \frac{f_2}{f_2 + f_1}(i)$$

Where,
L = Lower limit of the modal class
Δ1 = Difference in frequency between the modal class and the preceding class
Δ2 = Difference in frequency between the modal class and the succeeding class
i = Width of the class interval of the modal class
f1 = Frequency of the class preceding the modal class
f2 = Frequency of the class succeeding the modal class

Page 71: Data Analysis Techq

Calculation of the mode (Z) from a frequency distribution of grouped or interval data of price (in Rs.) of popular science books in Kannada (Table 8.12) is shown below:

Price (in Rs.)    Frequency (f)     Cumulative less than or
(class)           (No. of books)    equal frequency (cf)
 1 -- 10           2                  2
11 -- 20           8                 10
21 -- 30          15                 25
31 -- 40          13                 38
41 -- 50           4                 42
51 -- 60           4                 46
61 -- 70           2                 48
71 -- 80           1                 49
81 -- 90           1                 50
Total             50

The modal class is 21 -- 30, the class with the highest frequency (15).
L = 21 ; i = 10 ; f1 = 8 ; f2 = 13 ; Δ1 = 15 – 8 = 7 ; Δ2 = 15 – 13 = 2

Z = L + [f2 / (f1 + f2)] (i) OR L + [Δ1 / (Δ1 + Δ2)] (i)

Z = 21 + [13 / (8 + 13)] (10) = 27.19 OR Z = 21 + [7 / (7 + 2)] (10) = 28.78
Both values lie in the class 21 -- 30, the modal class of the grouped data.
Note: Compare answer with mode calculated as discrete data in Table 8.12
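Both grouped-mode formulas can be sketched together, taking the modal class as the class with the highest frequency. The function name `grouped_mode` is illustrative, and the last price class is assumed to hold one book so that the frequencies total 50. The sketch also assumes the modal class has at least one non-empty neighbour and is not tied with both neighbours (otherwise a denominator would be zero).

```python
def grouped_mode(lower_limits, freqs, width):
    """Return two mode estimates for grouped data.

    First value:  L + d1 / (d1 + d2) * width
    Second value: L + f2 / (f1 + f2) * width
    where f1 and f2 are the frequencies of the classes just before and
    just after the modal class, and d1, d2 are the modal frequency
    minus f1 and f2 respectively.
    """
    k = max(range(len(freqs)), key=lambda j: freqs[j])  # modal class index
    f1 = freqs[k - 1] if k > 0 else 0
    f2 = freqs[k + 1] if k + 1 < len(freqs) else 0
    d1, d2 = freqs[k] - f1, freqs[k] - f2
    lower = lower_limits[k]
    by_differences = lower + d1 / (d1 + d2) * width
    by_neighbours = lower + f2 / (f1 + f2) * width
    return by_differences, by_neighbours


# Price (in Rs.) classes 1-10, ..., 81-90 (assumption: last class has f = 1).
price_limits = [1, 11, 21, 31, 41, 51, 61, 71, 81]
price_freqs = [2, 8, 15, 13, 4, 4, 2, 1, 1]

z_differences, z_neighbours = grouped_mode(price_limits, price_freqs, width=10)
print(round(z_differences, 2), round(z_neighbours, 2))
```

The two formulas weight the neighbouring classes differently, so they generally give slightly different estimates within the same modal class.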

Page 72: Data Analysis Techq

Mode for grouped or interval data

Table: Calculation of the mode (Z). Data represent weights of 265 freshman students at the University of Washington

Class-interval (weight)     f
 90 --  99                  1
100 -- 109                  1
110 -- 119                  9
120 -- 129                 30
130 -- 139                 42
140 -- 149                 66
150 -- 159                 47
160 -- 169                 39
170 -- 179                 15
180 -- 189                 11
190 -- 199                  1
200 -- 209                  3

$$Z = L + \frac{f_2}{f_1 + f_2}(i) = 140 + \frac{47}{47 + 42}(10) = 140 + \frac{47}{89}(10) = 140 + 5.3 = 145.3$$

$$Z = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times i = 140 + \frac{24}{43} \times 10 = 145.6$$

Page 73: Data Analysis Techq

Univariate Measures: A. Central Tendency

4. GM & 5. HM

4. Geometric Mean
nth root of the product of the values of n items
$$G.M. = \sqrt[n]{\prod x_i} = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$$
Ex.: 4 6 9
$$GM = \sqrt[3]{4 \times 6 \times 9} = \sqrt[3]{216} = 6$$
NOTE: 1. Logarithms are used to simplify the calculation 2. GM is used in the preparation of indexes (i.e., determining average percent of change) and in dealing with ratios

5. Harmonic Mean
Reciprocal of the average of reciprocals of the values of items in a series
$$H.M. = \frac{n}{1/x_1 + 1/x_2 + \dots + 1/x_n}$$
For frequency distribution
$$H.M. = \frac{\sum f_i}{\sum f_i / x_i}$$
Ex.: 4 5 10
$$HM = \frac{3}{1/4 + 1/5 + 1/10} = \frac{60}{11} = 5.45$$

Harmonic Mean: 1. Has limited application as it gives the largest weight to the smallest item and the smallest weight to the largest item 2. Used in cases where time and rate are involved (ex: time and motion study)

Note: 1. Median and mode could also be used in qualitative data 2. Median lies between mean & mode 3. For normal distribution mean = median = mode
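The two worked examples can be reproduced with a short sketch. The function names are chosen here for illustration; the geometric mean is computed through logarithms, as the note on the slide suggests, and assumes all values are positive.

```python
import math


def geometric_mean(values):
    """nth root of the product, computed as exp(mean of the logs)."""
    return math.exp(sum(math.log(x) for x in values) / len(values))


def harmonic_mean(values):
    """Reciprocal of the average of the reciprocals."""
    return len(values) / sum(1 / x for x in values)


gm = geometric_mean([4, 6, 9])
hm = harmonic_mean([4, 5, 10])
print(round(gm, 2), round(hm, 2))
```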

Page 74: Data Analysis Techq

Univariate Measures: B. Dispersion

Central tendency measures do not reveal the variability present in the data. To understand the data better, we need to know the spread of the values and quantify the variability of the data.

Dispersion is the scatter of the values of items in the series around the true value of average. Dispersion is the extent to which values in a distribution differ from the average of the distribution.

1. Range: The difference between the values of the extreme items of a series, i.e., the difference between the smallest and largest observations
Example: 4 6 7 8 9 10 11 11 11 12 13
Range = 13 - 4 = 9
• Simplest and most crude measure of dispersion
• As it is not based on all the values, it is greatly/unduly affected by the two extreme values and fluctuations of sampling. The range may increase with the size of the set of observations, but it cannot decrease
• Gives an idea of the variability very quickly

Page 75: Data Analysis Techq

Univariate Measures: B. Dispersion

2. Mean Deviation: The average of the differences of the values of items from some average of the series (ignoring negative signs), i.e., the arithmetic mean of the absolute differences of the values from their average

Note:
1. MD is based on all values and hence cannot be calculated for open-ended distributions. It uses an average but ignores signs and hence appears unmethodical.
2. MD is calculated from the mean as well as from the median, for ungrouped data using the direct method and for continuous distributions using the assumed mean method and the short-cut method.
3. The average used is either the arithmetic mean or the median.

$$\delta_x = \frac{\sum |x_i - \bar{x}|}{n}$$

Example: 4 6 7 8 9 10 11 11 11 12 13
$$\delta_x = \frac{|4 - 9.27| + |6 - 9.27| + \dots + |13 - 9.27|}{11} = \frac{24.73}{11} = 2.25$$

Coefficient of mean deviation: Mean deviation divided by the average. It is a relative measure of dispersion and is comparable to the similar measure of other series, i.e., Coeff. of MD = δx / x̄ (Ex: 2.25 / 9.27 = 0.24). M.D. & its coefficient are used to judge the variability and they are better measures than the range

For grouped data
$$\delta_x = \frac{\sum f_i |x_i - \bar{x}|}{n}$$
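The mean deviation and its coefficient for the example series can be computed as follows. A minimal sketch; the function name `mean_deviation` and its optional `centre` argument (to take deviations about the median instead of the mean) are illustrative.

```python
def mean_deviation(values, centre=None):
    """Average absolute deviation about `centre` (the mean by default)."""
    if centre is None:
        centre = sum(values) / len(values)
    return sum(abs(x - centre) for x in values) / len(values)


series = [4, 6, 7, 8, 9, 10, 11, 11, 11, 12, 13]
mean = sum(series) / len(series)

md = mean_deviation(series)                 # deviations taken from the mean
coefficient_md = md / mean                  # relative measure
md_about_median = mean_deviation(series, centre=10)

print(round(md, 2), round(coefficient_md, 2), round(md_about_median, 2))
```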

Page 76: Data Analysis Techq

Univariate Measures: B. Dispersion

3. Standard Deviation: The square root of the average of squares of deviations (based on mean), i.e., the positive square root of the mean of squared deviations from the mean

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} \qquad \text{For grouped data: } \sigma = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}$$

Example: 4 6 7 8 9 10 11 11 11 12 13
$$\sigma = \sqrt{\frac{(4 - 9.27)^2 + (6 - 9.27)^2 + \dots + (13 - 9.27)^2}{11}} = 2.63$$

Coefficient of SD is SD divided by mean. Example: 2.63 / 9.27 = 0.28

Variance: Square of SD, i.e., VAR = Σ (xi – x̄)² / n. Example: 76.18 / 11 = 6.93

Coefficient of variation is Coefficient of SD multiplied by 100. Example: 0.28 x 100 = 28

Note: Coefficient of SD is a relative measure and is often used for comparing with the similar measure of other series
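The standard deviation and the measures derived from it can be computed with Python's standard `statistics` module. This sketch uses the 11-value example series and the population forms `pstdev` and `pvariance`, which divide by n as in the formula on this slide (`stdev` and `variance` would divide by n - 1 instead).

```python
import statistics

series = [4, 6, 7, 8, 9, 10, 11, 11, 11, 12, 13]

mean = statistics.mean(series)
sd = statistics.pstdev(series)            # square root of sum((x - mean)^2) / n
variance = statistics.pvariance(series)   # square of the SD
coefficient_sd = sd / mean                # relative measure
coefficient_variation = coefficient_sd * 100

print(round(sd, 2), round(variance, 2),
      round(coefficient_sd, 2), round(coefficient_variation))
```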

Page 77: Data Analysis Techq

Univariate Measures: B. Dispersion

3. Standard Deviation

- SD is a very satisfactory and the most widely used measure of dispersion
- amenable to mathematical manipulation
- it is independent of origin, but not of scale
- if SD is small, there is a high probability of getting a value close to the mean; if it is large, the values lie farther away from the mean
- does not ignore the algebraic signs and is less affected by fluctuations of sampling

SD is calculated using (i) Actual mean method, (ii) Assumed mean method, (iii) Direct method, (iv) Step deviation method

For a frequency distribution of grouped or interval data
$$\sigma = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}$$

Indirect method uses the assumed average formula
$$\sigma = \sqrt{\frac{\sum f d^2}{n} - \left(\frac{\sum f d}{n}\right)^2}$$
where d = distance of class from the assumed average class and n = ∑ fi (when d is measured in class-width steps, the result is multiplied by the class width i), i.e.,
$$\sigma = \sqrt{\frac{\sum f_i (x_i - A)^2}{\sum f_i} - \left(\frac{\sum f_i (x_i - A)}{\sum f_i}\right)^2}$$

For discrete data the assumed average formula is
$$\sigma = \sqrt{\frac{\sum (x_i - A)^2}{n} - \left(\frac{\sum (x_i - A)}{n}\right)^2}$$

Page 78: Data Analysis Techq

Calculation of the SD (σ) from a frequency distribution of grouped or interval data of price (in Rs.) of popular science books in Kannada (Table 8.12) using the assumed average method is shown below:

Price (in Rs.)     f     cf      d     fd     d2    fd2
(class)
 1 -- 10           2      2     -4     -8     16     32
11 -- 20           8     10     -3    -24      9     72
21 -- 30          15     25     -2    -30      4     60
31 -- 40          13     38     -1    -13      1     13
41 -- 50           4     42      0      0      0      0
51 -- 60           4     46      1      4      1      4
61 -- 70           2     48      2      4      4      8
71 -- 80           1     49      3      3      9      9
81 -- 90           1     50      4      4     16     16
Total             50                  -60           214

f = frequency (no. of books); cf = cumulative less than or equal frequency; d = distance of class from the assumed average class

n = 50 ; ∑fd = -60 ; ∑fd2 = 214 ; i = 10
σ = {√ [(∑fd2 / n) - (∑fd)2 / n2]} (i) = {√ [(214 / 50) - (-60)2 / 502]} (10)
= {√ [4.28 - 1.44]} (10) = {√ 2.84} (10)
= {1.6852} (10) = 16.85

Page 79: Data Analysis Techq

SD for grouped data (indirect method using assumed average)

TABLE: Calculation of the standard deviation (σ). Data represent weights of 265 male freshman students at the University of Washington

Class-interval (weight)     f      d      fd     fd2
 90 --  99                  1     -5      -5      25
100 -- 109                  1     -4      -4      16
110 -- 119                  9     -3     -27      81
120 -- 129                 30     -2     -60     120
130 -- 139                 42     -1     -42      42
140 -- 149                 66      0       0       0
150 -- 159                 47      1      47      47
160 -- 169                 39      2      78     156
170 -- 179                 15      3      45     135
180 -- 189                 11      4      44     176
190 -- 199                  1      5       5      25
200 -- 209                  3      6      18     108

N = ∑ fi = 265 ; ∑fd = 99 ; ∑fd2 = 931 ; d = (Xi – A) / i

$$\sigma = \sqrt{\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{N}\right)^2}\,(i) = \sqrt{\frac{931}{265} - \left(\frac{99}{265}\right)^2}\,(10) = \sqrt{3.5132 - 0.1396}\,(10) = (1.8367)(10) = 18.37 \text{ or } 18.4$$
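The indirect (assumed average) calculation for the weight data can be sketched as below. The function name `grouped_sd` is illustrative; d is measured in class-width steps from the class 140 -- 149, so the result is multiplied by the class width.

```python
import math


def grouped_sd(freqs, d_values, width):
    """SD of grouped data: sqrt(sum(f*d^2)/n - (sum(f*d)/n)^2) * width."""
    n = sum(freqs)
    sum_fd = sum(f * d for f, d in zip(freqs, d_values))
    sum_fd2 = sum(f * d * d for f, d in zip(freqs, d_values))
    return math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2) * width


# Weight classes 90-99, 100-109, ..., 200-209.
freqs = [1, 1, 9, 30, 42, 66, 47, 39, 15, 11, 1, 3]
d_values = list(range(-5, 7))  # -5, -4, ..., 6 steps from the class 140-149

sd_weights = grouped_sd(freqs, d_values, width=10)
print(round(sd_weights, 2))
```

Because the SD is independent of origin, the choice of assumed average class does not change the result; only the class width rescales it.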

Page 80: Data Analysis Techq

SD for grouped data (indirect method using assumed average) …contd.

TABLE: Means, standard deviations, and coefficients of variation of the age distributions of four groups of mothers who gave birth to one or more children in the city of Minneapolis: 1931 to 1935.

CLASSIFICATION               X̄       σ      C V
Resident married             28.2    6.0    21.3
Non-resident married         29.5    6.0    20.3
Resident unmarried           23.4    5.8    24.8
Non-resident unmarried       21.7    3.7    17.1

Page 81: Data Analysis Techq

Absolute and relative measures of dispersion

- The absolute measures give the answers in the units in which the original values are expressed. They may give misleading ideas about the extent of variation, especially when the averages differ significantly
- The relative measures (usually expressed in percentages) overcome the above drawbacks. Some of them are:
  i) Coefficient of range = (L – S) / (L + S) (L is the largest value and S is the smallest value)
  ii) Coefficient of quartile deviation
  iii) Coefficient of MD
  iv) Coefficient of variation
  Note: Relative measures are free from the units in which the values have been expressed. They can be compared even across different groups having different units of measurement
- Lorenz curve is a graphical measure of dispersion. It uses the information expressed in a cumulative manner to indicate the degree of variability. It is especially useful in comparing the variability of two or more distributions

Page 82: Data Analysis Techq

Univariate Measures: B. Dispersion

4. Quartiles

There are some positional measures of non-central location where it is necessary to divide the data into equal parts. They are quartiles, deciles and percentiles (the quartiles, together with the median, divide the array into four equal parts, deciles into ten equal groups, and percentiles into one hundred equal groups).

- Quartiles: Measure dispersion when the median is used as the average
- Lower quartile: Value in the array below which there are one quarter of the observations
- Upper quartile: Value in the array below which there are three quarters of the observations
- Interquartile range: Difference between the quartiles; it can be called a positional measure of variability
- While the range is overly sensitive to the number of observations, the interquartile range can either decrease or increase when further observations are added to the sample
- Useful as a measure of dispersion to study special collections of data like salaries of employees

Example: 4 6 7 8 9 10 11 11 11 12 13
Lower quartile is 7; Upper quartile is 11; Interquartile range is 4
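The quartiles of the 11-value example series used on the other slides can be obtained with `statistics.quantiles` from the Python standard library. This sketch assumes the positional convention in which the kth quartile is the k(n + 1)/4 th item, which corresponds to the module's "exclusive" method; other conventions can give slightly different quartiles.

```python
import statistics

series = [4, 6, 7, 8, 9, 10, 11, 11, 11, 12, 13]

# n=4 asks for the three cut points that split the data into four parts.
lower_q, median, upper_q = statistics.quantiles(series, n=4, method="exclusive")
iqr = upper_q - lower_q

print(lower_q, median, upper_q, iqr)
```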

Page 83: Data Analysis Techq

Normal Distribution

To understand skewness (asymmetry), testing of hypotheses (Part 9) and interpretation of data (Part 10), it is necessary to know about the normal distribution.

The normal frequency distribution is developed from a frequency histogram with a large sample size and small cell intervals. The normal curve being a perfectly symmetrical curve (symmetrical about µ), the mean, median and mode of the distribution are one and the same (µ = M = Z). The curve is unimodal and bell-shaped, and the data values concentrate around the mean. Sampling distributions based on a parent normal distribution are manageable analytically.

The normal curve is not just one curve but a family of curves which differ only with regard to the values of µ and σ, but have the same characteristics in all other respects.

Height is maximum at the mean value and declines as we go in either direction from the mean, and the tails extend indefinitely on both sides. The first and the third quartiles are equidistant from the mean. The height is given by the equation

$$Y = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^2}$$

Page 84: Data Analysis Techq

Normal Distribution …contd.

It is a special continuous distribution. A great many techniques used in applied statistics are based on it. Many populations encountered in the course of research in many fields seem to have a normal distribution to a good degree of approximation (in other words, nearly normal distributions are encountered quite frequently). Sampling distributions based on a parent normal distribution are manageable analytically.

Definition: The random variable x is said to be normally distributed if its density function is given by

$$f(x) \text{ or } n(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-(x - \mu)^2 / 2\sigma^2}$$

where $\int_{-\infty}^{\infty} n(x)\,dx = 1$ and $-\infty < x < \infty$

(Since n(x) is given to be a density function, it is implied that ∫ n(x) dx = 1.) When the function is plotted for several values of σ (standard deviation), a bell-shaped curve as shown below can be seen. Changing µ (mean) merely shifts the curves to the right or left without changing their shapes. The function given actually represents a two-parameter family of distributions, the parameters being µ and σ² (mean and variance).
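The density function and the unit-area property can be checked numerically. A minimal sketch: the function names, the integration range of ten standard deviations on either side of the mean, and the number of trapezoids are choices made here for illustration.

```python
import math


def normal_density(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x for mean mu and SD sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def area_under_curve(mu, sigma, half_width=10.0, steps=20000):
    """Trapezoidal estimate of the area between mu -/+ half_width * sigma."""
    a = mu - half_width * sigma
    b = mu + half_width * sigma
    h = (b - a) / steps
    total = 0.5 * (normal_density(a, mu, sigma) + normal_density(b, mu, sigma))
    total += sum(normal_density(a + k * h, mu, sigma) for k in range(1, steps))
    return total * h


# Changing mu shifts the curve; changing sigma stretches it.
# The total area stays (numerically) equal to 1 in every case.
for mu, sigma in [(0, 1), (5, 2), (-3, 0.5)]:
    print(mu, sigma, round(area_under_curve(mu, sigma), 6))
```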

Page 85: Data Analysis Techq

Normal Distribution …contd.

The experimenter must know, at least approximately, the general form of the distribution function which the data follow. If it is normal, the methods may be used directly; if it is not, the data may be transformed so that the transformed observations follow a normal distribution. When the experimenter does not know the form of the population distribution, other more general but usually less powerful methods of analysis, called non-parametric methods, must be used.

An important property of the normal distribution for researchers is that if x follows a normal distribution and the area under the normal curve is taken as 1, then the probability that x is within

1 standard deviation of the mean is 68%
2 standard deviations of the mean is 95%
3 standard deviations of the mean is 99.7%
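These probabilities follow from the normal curve itself and can be computed with the error function in Python's `math` module, using the identity P(|x - µ| ≤ kσ) = erf(k / √2). The function name `prob_within` is illustrative.

```python
import math


def prob_within(k):
    """P(|x - mu| <= k * sigma) for a normally distributed x."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(prob_within(k) * 100, 1))
```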

Page 86: Data Analysis Techq

Normal Curves

Page 87: Data Analysis Techq

Z-score or standardised normal deviation

The area under the normal curve bounded by the class interval for any given class represents the relative frequency of that class. The area under the curve lying between any two vertical lines at points A and B along the X-axis represents the probability that the random variable x takes on a value in the interval bounded by A and B. By finding the area under the curve between any two points along the X-axis, we can find the percentage of data occurring between these two points.

The computed value Z = (x – µ) / σ is also known as the Z-score or standardised normal deviation. The value of Z follows a normal probability distribution with a mean of zero and a standard deviation of one. This probability distribution is known as the standard normal probability distribution. It allows us to use only one table of areas for all types of normal distributions.

The standard table of Z scores gives the areas under the curve between the standardised mean zero and the points to the right of the mean, for all points that are at a distance from the mean in multiples of 0.01σ. It should be noted that only the areas are to be subtracted or added. Do not add or subtract the Z scores and then find the area for the resulting value.

Page 88: Data Analysis Techq

Z-score or standardised normal deviation …contd.

TABLE – Normal Distribution

  Z     Prob.       Z     Prob.       Z     Prob.
 3.0    .999       0.8    .788      -1.4    .081
 2.8    .997       0.6    .726      -1.6    .055
 2.6    .995       0.4    .655      -1.8    .036
 2.4    .992       0.2    .579      -2.0    .023
 2.2    .986       0.0    .500      -2.2    .014
 2.0    .977      -0.2    .421      -2.4    .008
 1.8    .964      -0.4    .345      -2.6    .005
 1.6    .945      -0.6    .274      -2.8    .003
 1.4    .919      -0.8    .212      -3.0    .001
 1.2    .885      -1.0    .159
 1.0    .841      -1.2    .115

The standardised normal often used is obtained by assuming the mean as zero (µ = 0) and the SD as one (σ = 1). Then,

x scale    µ-3σ    µ-2σ    µ-σ    µ    µ+σ    µ+2σ    µ+3σ
z scale     -3      -2      -1    0     +1     +2      +3

$$z = \frac{x_i - \mu}{\sigma}$$
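The table entries are areas to the left of Z under the standard normal curve, and they can be reproduced from the error function. A sketch with illustrative function names; the weight figures in the usage lines (mean 148.74, SD 18.37) are taken from the grouped-data examples on the earlier slides, and the value 167.11 is simply the mean plus one SD.

```python
import math


def z_score(x, mu, sigma):
    """Standardised normal deviate: distance from the mean in SD units."""
    return (x - mu) / sigma


def cumulative_prob(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def area_between(z_low, z_high):
    """Area between two z values: subtract areas, not z scores."""
    return cumulative_prob(z_high) - cumulative_prob(z_low)


z = z_score(167.11, mu=148.74, sigma=18.37)
print(round(z, 2), round(cumulative_prob(z), 3))
print(round(area_between(-1, 1), 3))
```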

Page 89: Data Analysis Techq

Univariate Measures: C. Measure of Asymmetry (Skewness)

The normal distribution of items in a series is perfectly symmetrical. The curve drawn from a normal distribution, which is bell-shaped, shows no asymmetry (skewness), i.e., X̄ = M = Z for a normal curve.

An asymmetrical distribution with skewness to the right, i.e., the curve distorted on the right, has positive skewness (X̄ > M > Z), and one with the curve distorted to the left has negative skewness (Z > M > X̄) (see figure).

Skewness: The difference between the mean and the mode or median, i.e., Skewness = X̄ – Z OR X̄ – M

Coeff. of skewness (J) = (X̄ – Z) / σ OR 3 (X̄ – M) / σ

Skewness shows the manner in which the items are clustered around the average. It is useful in the study of the formation of a series and gives an idea about the shape of the curve.

Example: 4 6 7 8 9 10 11 11 11 12 13
Skewness = 9.27 – 11 = –1.73 or 9.27 – 10 = –0.73
J = –1.73 / 2.63 = –0.66 or (–0.73) × 3 / 2.63 = –0.83
Hence negatively skewed.
Check the following for positive skewness: 7, 8, 8, 9, 9, 10, 12, 14, 15, 16, 18

Kurtosis is a measure of the flat-toppedness (humpedness) of a curve. It indicates the nature of the distribution of items in the middle of a series (Mesokurtic: kurtic in the centre, i.e., the normal curve; Leptokurtic: more peaked than the normal curve; Platykurtic: flatter than the normal curve).
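The two coefficients of skewness can be computed for the example series and for the second series suggested for checking. A sketch using the standard `statistics` module, with the population SD (dividing by n); the function names are illustrative. The mode-based coefficient is only applied to the first series because the second has two values tied for most frequent.

```python
import statistics


def skewness_wrt_mode(values):
    """(mean - mode) / SD; meaningful when there is a single clear mode."""
    return (statistics.mean(values) - statistics.mode(values)) / statistics.pstdev(values)


def skewness_wrt_median(values):
    """3 * (mean - median) / SD."""
    return 3 * (statistics.mean(values) - statistics.median(values)) / statistics.pstdev(values)


example = [4, 6, 7, 8, 9, 10, 11, 11, 11, 12, 13]
check = [7, 8, 8, 9, 9, 10, 12, 14, 15, 16, 18]

j_mode = skewness_wrt_mode(example)
j_median = skewness_wrt_median(example)
j_check = skewness_wrt_median(check)

print(round(j_mode, 2), round(j_median, 2), j_check > 0)
```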

Page 90: Data Analysis Techq

Normal Curve and Skewness

Page 91: Data Analysis Techq

Relationship Between Measures of Variability (M D, S D and Semi-interquartile Range)

Page 92: Data Analysis Techq

Summary of Examples

Summary of examples: 4 6 7 8 9 10 11 11 11 12 13

Univariate Measures:

A. Central Tendency
1. Mean X̄ 9.27
2. Median M 10
3. Mode Z 11
4. G.M.
5. H.M.

B. Dispersion
1. Range 9
2. Mean deviation 2.25
3. Coefficient of MD 0.24
4. Standard deviation 2.63
5. Coefficient of SD 0.28
6. Coefficient of variation 28
7. Variance 6.93
8. Lower quartile 7
9. Upper quartile 11
10. Inter quartile range 4

C. Asymmetry
1. Skewness
   w.r.t. Mode -1.73
   w.r.t. Median -0.73
2. Coefficient of Skewness
   w.r.t. Mode -0.66
   w.r.t. Median -0.83

Home work: 7, 8, 8, 9, 9, 10, 12, 14, 15, 16, 18

Page 93: Data Analysis Techq

Bivariate & Multivariate Measures

A. Relationship
- To find the relation between 2 or more variables
- If related, whether directly or inversely, & the degree of relation
- Is it a cause and effect relationship?
- If so, its degree and direction

1. Association (Attributes)
(i) Cross tabulation
(ii) Yule’s co-efficient of association
(iii) Chi-square test
(iv) Co-efficient of mean square contingency

2. Correlation (Quantitative)
(i) Spearman’s (rank) coefficient of correlation (ordinal)
(ii) Pearson’s coefficient of correlation
(iii) Cross tabulation and scatter diagram

3. Cause and Effect (Quantitative)
(i) Simple (linear) correlation & regression
(ii) Multiple (complex) correlation & regression
(iii) Partial correlation

B. Other Measures / Techniques
1. Index number
2. Time series analysis
3. ANOVA
4. ANOCOVA
5. Discriminant analysis
6. Factor analysis
7. Cluster analysis
8. Model building

Page 94: Data Analysis Techq

Common Measures of Relationship

1. Pearson product moment
   Nature of variables: Two continuous variables; interval or ratio scale
   Comment: Relationship linear

2. Rank order or Kendall’s tau
   Nature of variables: Two continuous variables; ordinal scale

3. Correlation ratio (eta)
   Nature of variables: One variable continuous, other either continuous or discrete
   Comment: Relationship nonlinear

4. Intraclass
   Nature of variables: One variable continuous, other discrete; interval or ratio scale
   Comment: Purpose: to determine within-group similarity

5. Biserial, point biserial
   Nature of variables: One variable continuous, other a) continuous but dichotomised, or b) true dichotomy
   Comment: Index of item discrimination (used in item analysis)

6. Phi coefficient
   Nature of variables: Two true dichotomies; nominal or ordinal series

7. Partial correlation
   Nature of variables: Three or more continuous variables
   Comment: Purpose: to determine relationship between two variables, with the effect of the third held constant

8. Multiple correlation
   Nature of variables: Three or more continuous variables
   Comment: Purpose: to predict one variable from a linear weighted combination of two or more independent variables

9. Kendall’s coefficient of concordance
   Nature of variables: Three or more continuous variables; ordinal series
   Comment: Purpose: to determine the degree of (say, interrater) agreement

Page 95: Data Analysis Techq

Measures / Tests of Association

1. Cross Tabulation
- Useful in finding relationships in nominal data, but not a powerful form of measure / test
- Classify each variable into two or more categories
- Begin with a two-way table to see whether there is an interrelationship between the variables
- Then cross-classify the variables in subcategories to look for interaction between them:
  (i) Symmetrical relationship: Two variables vary together, but neither is due to the other (assumed)
  (ii) Reciprocal relationship: Two variables mutually influence or reinforce each other
  (iii) Asymmetrical relationship: One (independent) variable is responsible for change in the other (dependent) variable
- An attempt can also be made to find conditional relationships by introducing a third factor and cross-classifying the three variables, i.e., to see whether X affects Y only when Z is held constant
- Cross tabulate a dependent variable (of importance) against one or more independent variables
- Show the percentages in the cells of the cross tabulation
- Look for valid (not spurious) explanations
- Ask whether the differences are statistically significant

Page 96: Data Analysis Techq

Example: Given below is the data regarding reference queries received by a library. Is there a significant association between gender of user and type of query?

                L R query   S R query   Total
Male users         17          18         35
Female users        3          12         15
Total              20          30         50

Expected frequencies are worked out like E11 = 20 × 35 / 50 = 14
Expected frequencies are:

          L     S    Total
M        14    21     35
W         6     9     15
Total    20    30     50

Cell   Oij   Eij   (Oij - Eij)   (Oij - Eij)² / Eij
1,1    17    14        3          9/14 = 0.64
1,2    18    21       -3          9/21 = 0.43
2,1     3     6       -3          9/6  = 1.50
2,2    12     9        3          9/9  = 1.00
Total (∑)                         χ² = 3.57

df = (c-1)(r-1) = (2-1)(2-1) = 1
Table value of χ² for 1 df at 5% significance is 3.841. Since 3.57 < 3.841, the association is not significant.
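The worked example above can be reproduced with a short Python sketch. The function name `chi_square_independence` is an illustrative choice, not part of the original slides; the critical value 3.841 is the table value quoted in the example.

```python
def chi_square_independence(observed):
    """Return (chi2, df, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected frequency of each cell = row total x column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# Rows: male, female users; columns: L R query, S R query
observed = [[17, 18], [3, 12]]
chi2, df, expected = chi_square_independence(observed)

CRITICAL_5PCT_DF1 = 3.841  # table value quoted in the example
significant = chi2 > CRITICAL_5PCT_DF1
print(round(chi2, 2), df, significant)
```

The same function accepts larger tables (for example 2 × 3), but the critical value then has to be looked up for the corresponding degrees of freedom.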

Page 97: Data Analysis Techq

2. Association : Yule’s Coefficient of association

$$Q_{AB} = \frac{(AB)(ab) - (Ab)(aB)}{(AB)(ab) + (Ab)(aB)}$$

(AB) = Freq. of class AB, in which both A and B are present
(Ab) = Freq. of class Ab, in which A is present but B is absent
(aB) = Freq. of class aB, in which A is absent but B is present
(ab) = Freq. of class ab, in which both A and B are absent
Q_AB takes values between +1 and -1 and indicates the degree of association.
If (AB) > (A)(B)/N (the expected freq.), then A and B are positively associated.
If (AB) < (A)(B)/N, then A and B are negatively associated.
If (AB) = (A)(B)/N, then A and B are independent, i.e., Q_AB = 0.

Ex:
                        B. IMMUNITY
A. INOCULATION     PRESENT     ABSENT     Total
PRESENT            5 (AB)      2 (Ab)       7
ABSENT             1 (aB)      4 (ab)       5
Total              6           6           12

$$Q_{AB} = \frac{5 \times 4 - 2 \times 1}{5 \times 4 + 2 \times 1} = \frac{18}{22} = 0.82$$

$$(AB) = 5 > \frac{(A)(B)}{N} = \frac{7 \times 6}{12} = 3.5$$

The association of A and B in the population may be due to attribute C. In such a case partial association (as against total association) between A and B is determined by

$$Q_{AB \cdot C} = \frac{(ABC)(abC) - (AbC)(aBC)}{(ABC)(abC) + (AbC)(aBC)}$$

Illusory association: there is no real association between A and B, but both are associated with a third attribute. Reasons: (i) A and B are not properly defined (ii) A and B are not properly / correctly recorded
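A minimal Python sketch of Yule's coefficient, applied to the inoculation and immunity example, is given below. The function and argument names are illustrative assumptions; only the 2 × 2 cell frequencies come from the slide.

```python
def yules_q(ab, a_not_b, not_a_b, not_a_not_b):
    """Yule's coefficient of association from the four cells of a 2 x 2 table.

    ab          -> (AB): A and B both present
    a_not_b     -> (Ab): A present, B absent
    not_a_b     -> (aB): A absent, B present
    not_a_not_b -> (ab): both absent
    """
    numerator = ab * not_a_not_b - a_not_b * not_a_b
    denominator = ab * not_a_not_b + a_not_b * not_a_b
    return numerator / denominator


q = yules_q(5, 2, 1, 4)

# Compare the observed (AB) with its expected frequency (A)(B)/N
total_a, total_b, n = 5 + 2, 5 + 1, 12
expected_ab = total_a * total_b / n
positively_associated = 5 > expected_ab
print(round(q, 2), expected_ab, positively_associated)
```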

Page 98: Data Analysis Techq

2. Association : Yule’s Coefficient of association contd.

4 X 4 Contingency Table

                              Attribute A
Attribute B      A1        A2        A3        A4       Total
B1            (A1B1)    (A2B1)    (A3B1)    (A4B1)     (B1)
B2            (A1B2)    (A2B2)    (A3B2)    (A4B2)     (B2)
B3            (A1B3)    (A2B3)    (A3B3)    (A4B3)     (B3)
B4            (A1B4)    (A2B4)    (A3B4)    (A4B4)     (B4)
Total          (A1)      (A2)      (A3)      (A4)       N

Reduced to 2x2 Table

                  Attribute A
Attribute B      A        a       Total
B              (AB)     (aB)      (B)
b              (Ab)     (ab)      (b)
Total           (A)      (a)       N

Note: Tables larger than 2X2 have to be reduced to 2X2 by combining some classes in order to use this method

Page 99: Data Analysis Techq

Yule’s Coefficient of Association contd.

Example 1 : The number of books issued on a random sample of days in 2005 and 2006 are as follows

     2005             2006
 36      37       34      78
 28      97       89      89
 32      37       22      34
 39      33       44      22
 27     114       49      33
114      35       33      17

Example 2 : Data on the number of books issued from a library during the course of a week (both actual and expected)

Day      Actual    Expected
Mon        39       42.17
Tue        14       42.17
Wed        21       42.17
Thu        47       42.17
Fri        36       42.17
Sat        96       42.17
Total     253      253(.02)

Page 100: Data Analysis Techq

Yule’s Coefficient of Association contd.

Example 3 : In 1984-5, a library authority spent Rs.550 000 on staff, Rs.230 000 on books and Rs.140 000 on other items. In 1987-8, the authority spent Rs.810 000 on staff, Rs.330 000 on books and Rs.210 000 on other items. Did the pattern of expenditure change significantly between 1984-5 and 1987-8?

The observed data can be compiled into a contingency table as shown:

Contingency table of observed frequencies
Expenditure (‘000s)

Year       Staff     Books     Other     Total
1984-5      550       230       140       920
1987-8      810       330       210      1350
Total      1360       560       350      2270

A table of expected frequencies can be deduced as shown:

Contingency table of expected frequencies
Expenditure (‘000s)

Year       Staff      Books      Other     Total
1984-5     551.19     226.96     141.85      920
1987-8     808.81     333.04     208.15     1350
Total     1360        560        350        2270

Page 101: Data Analysis Techq

2. Correlation : i. Cross Tabulation, Correlation Table & Scatter Diagram

Frequency of use of a number of documents of different ages

Doc. No.    Age of doc. (years)    Frequency of use (times / year)
1                  1                       40
2                  3                       18
3                  2                       30
4                  4                       21
5                  3                       26
6                  5                       10
7                  4                       13
8                  3                       35

Correlation table of age and frequency of use of documents

Freq. of use            Age of doc. (yrs)
(times / year)      1     2     3     4     5     Total
1-10                                        1       1
11-20                           1     1             2
21-30                     1     1     1             3
31-40               1           1                   2
Total               1     1     3     2     1       8

Monthly totals of books, journals and reports issued from a library

Month    Reps    Bks     Jls    Total
Jan       465    3216    713    4394
Feb       513    3215    686    4414
Mar       425    3126    996    4547
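The correlation table of age and frequency of use can be built mechanically from the raw document list. The sketch below is one possible way of doing so in Python; the helper `use_class` and the class width of 10 are assumptions chosen to match the class intervals used on the slide.

```python
from collections import Counter

# (age of document in years, frequency of use per year) for documents 1 to 8
docs = [(1, 40), (3, 18), (2, 30), (4, 21), (3, 26), (5, 10), (4, 13), (3, 35)]


def use_class(freq, width=10):
    """Map a frequency to its class interval label, e.g. 18 -> '11-20'."""
    lower = ((freq - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"


# Count documents falling in each (frequency class, age) cell
table = Counter((use_class(freq), age) for age, freq in docs)

ages = sorted({age for age, _ in docs})
classes = ["1-10", "11-20", "21-30", "31-40"]
row_totals = {}
for cls in classes:
    row = [table[(cls, age)] for age in ages]  # missing cells count as 0
    row_totals[cls] = sum(row)
    print(cls, row, row_totals[cls])

col_totals = [sum(table[(cls, age)] for cls in classes) for age in ages]
print("Total", col_totals, sum(col_totals))
```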

Page 102: Data Analysis Techq

2. Correlation : i. Cross Tabulation, Correlation Table & Scatter Diagram contd.

Page 103: Data Analysis Techq

Correlation Scatter Diagrams

Page 104: Data Analysis Techq

ii. Spearman’s Coefficient of (Rank Order) Correlation

Used only between two variables which are ordinal in nature; helps to decide whether two sets of ranks differ and the extent to which they differ

$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

d_i = difference between the ranks of the i-th pair of the two variables
n = no. of pairs of observations

Example: Boys and girls were questioned about their reading interest and asked to put various types of novel into their order of preference, with the following results:

Type of novel         Rank orders        d_i     d_i²
                      Boys    Girls
Animal stories          4       2          2       4
Historical novels       3       3          0       0
Romances                5       1          4      16
War stories             1       5         -4      16
Westerns                2       4         -2       4
                                        ∑ d_i² = 40

$$r_s = 1 - \frac{6 \times 40}{5 \times 24} = 1 - 2 = -1$$

(that means perfect negative correlation)
r_s varies from +1 to -1
r_s = 0 indicates that the two sets of rankings are dissimilar / independent
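The rank-order calculation above can be checked with a few lines of Python. This is only a sketch of the formula as given on the slide (it assumes there are no tied ranks); the function name is illustrative.

```python
def spearman_rank_correlation(ranks_x, ranks_y):
    """Spearman's r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), assuming no tied ranks."""
    n = len(ranks_x)
    sum_d2 = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Rank orders for animal stories, historical novels, romances, war stories, westerns
boys = [4, 3, 5, 1, 2]
girls = [2, 3, 1, 5, 4]

rs = spearman_rank_correlation(boys, girls)
print(rs)
```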

Page 105: Data Analysis Techq

ii. Spearman’s Coefficient of (Rank Order) Correlation contd.

Homework: Given below are the mean scores on a 5 point scale about the nature & type of information required by a group of physicists and another group of mechanical engineers. Find the correlation of their rankings. (Carry out a t-test for 5% significance level.)

                                                   Physicists    Mech. Engrs.
A. State of the art                                   2.60           1.17
B. Theoretical background                             2.98           2.71
C. Experimental results                               2.67           2.34
D. Methods, processes & procedures                    2.62           2.07
E. Product, material & equipment information          2.45           2.23
F. Computer programs & model building info.           2.00           0.85
G. Standard & patent spec.                            0.93           2.15
H. Physical, technical & design data                  3.05           2.65
I. S & T news                                         2.29           2.53
J. General information                                1.21           0.92

Page 106: Data Analysis Techq

iii. Pearson’s (Product Moment) Coefficient of Correlation (Simple Correlation)

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n \, \sigma_x \, \sigma_y}$$

Assuming zero as mean:

$$r = \frac{\sum X_i Y_i - n \bar{X} \bar{Y}}{\sqrt{\sum X_i^2 - n \bar{X}^2} \, \sqrt{\sum Y_i^2 - n \bar{Y}^2}}$$

With assumed averages A_x and A_y:

$$r = \frac{\dfrac{\sum d_{xi} d_{yi}}{n} - \dfrac{\sum d_{xi}}{n} \cdot \dfrac{\sum d_{yi}}{n}}{\sqrt{\dfrac{\sum d_{xi}^2}{n} - \left(\dfrac{\sum d_{xi}}{n}\right)^2} \, \sqrt{\dfrac{\sum d_{yi}^2}{n} - \left(\dfrac{\sum d_{yi}}{n}\right)^2}}$$

where
∑ d_xi = ∑ (X_i - A_x)
∑ d_yi = ∑ (Y_i - A_y)
∑ d_xi² = ∑ (X_i - A_x)²
∑ d_yi² = ∑ (Y_i - A_y)²
∑ d_xi d_yi = ∑ (X_i - A_x)(Y_i - A_y)

Page 107: Data Analysis Techq

iii. Pearson’s (Product Moment) Coefficient of Correlation (Simple Correlation) contd.

Most widely used method; assumes (i) a linear relationship (ii) variables are causally related (iii) a normal distribution of observations of both variables

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2} \, \sqrt{\sum (Y_i - \bar{Y})^2}}$$

EXAMPLE : The following table gives the approximate number of abstracts (in thousands) in a selection of volumes of abstracts together with the cost of each volume:

Number of abstracts (X)    Cost (Y)
(thousands)                (£)
36.7                       115
 8.5                        52
12.5                        75
 3.9                        31
 0.5                         9
 1.3                        12
 4.1                        20
19.4                        56
 4.3                        24
Total 91.2                 394

Mean of X = 91.2 / 9 = 10.1    Mean of Y = 394 / 9 = 43.8

Page 108: Data Analysis Techq

iii. Pearson’s Coefficient of Correlation (Simple Correlation) Example contd.

Given below are the average no. of years of experience (X) and the average no. of books borrowed per month (Y). Find the product moment correlation between the two.

         X     Y     XY     X²     Y²
         1     2      2      1      4
         2     4      8      4     16
         4     5     20     16     25
         5     7     35     25     49
         5     8     40     25     64
Total   17    26    105     71    158

Mean of X = 17/5 = 3.4    Mean of Y = 26/5 = 5.2    ∑ X² = 71    ∑ Y² = 158    ∑ XY = 105

$$r = \frac{105 - 5 \times 3.4 \times 5.2}{\sqrt{71 - 5(3.4)^2} \, \sqrt{158 - 5(5.2)^2}} = \frac{105 - 88.4}{3.63 \times 4.77} = \frac{16.6}{17.315} = +0.96$$

NOTE :
1. Most widely used method
2. It assumes (i) linear relationship (ii) normal distribution (iii) variables are causally related, but r itself does not indicate a cause and effect relationship (iv) it is neutral to change in scale and origin
3. Value of r varies from +1 (perfect positive correlation) to -1 (perfect negative correlation). Zero indicates the absence of association / relationship.
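As a cross-check of the worked example (years of experience against books borrowed), the product moment formula can be sketched in Python as follows. The function name `pearson_r` is an illustrative choice.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product moment correlation using sums of products and squares about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    sxx = sum(x * x for x in xs) - n * mean_x ** 2
    syy = sum(y * y for y in ys) - n * mean_y ** 2
    return sxy / (sqrt(sxx) * sqrt(syy))


experience = [1, 2, 4, 5, 5]   # X
borrowed = [2, 4, 5, 7, 8]     # Y

r = pearson_r(experience, borrowed)
print(round(r, 2))
```

Carried to more decimal places the value is about 0.957, which rounds to the +0.96 reported on the slide.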

Page 109: Data Analysis Techq

iii. Pearson’s Coefficient of Correlation For Grouped Data contd.

FOR GROUPED DATA

$$r = \frac{\sum \sum f_{ij} (X_i - \bar{X})(Y_j - \bar{Y})}{\sqrt{\sum f_i (X_i - \bar{X})^2} \, \sqrt{\sum f_j (Y_j - \bar{Y})^2}}$$

SHORT CUT APPROACH WITH ASSUMED MEANS

$$r = \frac{\dfrac{\sum f_{ij} d_{xi} d_{yj}}{n} - \dfrac{\sum f_i d_{xi}}{n} \cdot \dfrac{\sum f_j d_{yj}}{n}}{\sqrt{\dfrac{\sum f_i d_{xi}^2}{n} - \left(\dfrac{\sum f_i d_{xi}}{n}\right)^2} \, \sqrt{\dfrac{\sum f_j d_{yj}^2}{n} - \left(\dfrac{\sum f_j d_{yj}}{n}\right)^2}}$$

HOMEWORK : The following are the average no. of ref. queries asked (X) and the average no. of books indented (Y) during a study of library users for three months. Find the correlation coefficient between the two. Carry out a t-test for 5% significance level.

X    Y
5    4
6    3
1    2
4    6
2    3
(ANS r = + 0.41)

t-TEST FOR SIGNIFICANCE

$$t = r \sqrt{\frac{n - 2}{1 - r^2}} \qquad t = r_s \sqrt{\frac{n - 2}{1 - r_s^2}}$$

EX:

$$t = 0.96 \sqrt{\frac{5 - 2}{1 - (0.96)^2}} = 5.939$$

df = n - 2 = 5 - 2 = 3; the tabulated value of t for 3 d.f. for a two-tailed test at 5% significance level is 3.182. Since 5.939 > 3.182, r is significant at the 5% significance level.
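The t statistic for testing the significance of a correlation coefficient can be computed as in the sketch below. The function name is illustrative, and the critical value is deliberately left as an input to be read from a t table rather than computed.

```python
from math import sqrt


def correlation_t(r, n):
    """t statistic for a correlation coefficient r based on n pairs of observations."""
    return r * sqrt((n - 2) / (1 - r ** 2))


def is_significant(r, n, critical_value):
    """Compare |t| with a critical value looked up in a t table by the reader."""
    return abs(correlation_t(r, n)) > critical_value


t = correlation_t(0.96, 5)
print(round(t, 3))
```

The same function applies to Spearman's r_s, since the slide gives the identical formula for both coefficients.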

Page 110: Data Analysis Techq

[Figure: The Correlation Between a Preference for Violent Television and Peer-rated Aggression for 211 Boys Over a 10-Year Lag. Variables shown: preference for violent TV in the 3rd grade and in the 13th grade; aggression in the 3rd grade and in the 13th grade. Correlation values shown: 0.05, 0.01, 0.21, -0.05, 0.38, 0.31]

Page 111: Data Analysis Techq

3. Cause & Effect Relationships

• Visual inspection of the correlation table and scatter diagram indicates the existence and direction of a relation
• Correlation coefficient shows the magnitude as well as the direction of the relationship
• Regression analysis shows the cause and effect relationship, i.e., the independent variable (X) is the cause and the dependent variable (Y) is the effect

i. Regression Analysis (Simple / Linear)

Describes in quantitative terms the underlying (cause & effect) relationship or correlation between two sets of data (two variables)
Helps in predicting the value of the dependent variable for a given / known value of the independent variable

Regression equation of Y on X (simple / linear):
Ŷ = a + bX
Ŷ = Estimated value of Y for a given value of X (a and b are constants)
a = Parameter which tells at what value the straight line cuts the Y axis
b = Slope or gradient of the regression line, i.e., a unit change in X produces a change of b in Y

Note :
1. The relationship between X & Y may take any form, but here it is assumed to be linear, i.e., a straight line
2. Practical data may fit near or closer to a straight line
3. The objective is to fit a regression line with minimum error values (difference between the observed values and expected values)
4. To find the ‘best’ fit the least square method is used

Page 112: Data Analysis Techq

3. Cause & Effect Relationships : I. Regression analysis contd

To find the ‘best’ fit the least square method is used.

[Figure: three scatter diagrams of Y against X, labelled NO RELATIONSHIP, ACTUAL VALUES and BEST FIT]

The least square method provides two normal equations to determine the constants a and b:

$$\sum Y = na + b \sum X$$
$$\sum XY = a \sum X + b \sum X^2$$

The basic equation is Y = a + bX + E

$$TSS = \sum (Y - \bar{Y})^2 \qquad RSS = \sum (\hat{Y} - \bar{Y})^2 \qquad ESS = \sum (Y - \hat{Y})^2$$
$$TSS = RSS + ESS$$
$$F = \frac{RSS / 1}{ESS / (n - 2)}$$

EXAMPLE : Given below are the estimated use of a library (Y) for a corresponding expenditure on promotion and user orientation (X). Fit the best regression line. Estimate the use (i.e., predict) for an expenditure of Rs.8000. If the library would like to reach a level of use of 70,000, what should be the expenditure on promotion and user orientation? …contd.

Page 113: Data Analysis Techq

X (in thousands     Y (in ten        XY     X²     Ŷ (expected)    Error (Y - Ŷ)
of Rs.)             thousands)
5                   4                20     25        4.02           -0.02
6                   3                18     36        4.32           -1.32
1                   2                 2      1        2.82           -0.82
4                   6                24     16        3.72           +2.28
2                   3                 6      4        3.12           -0.12
Total 18            18               70     82       18                0
(∑ X)               (∑ Y)            (∑ XY) (∑ X²)

NOTE: If r has been calculated, ∑ X, ∑ Y, ∑ XY & ∑ X² are readily available

Normal equations (n = 5):
18 = 5a + 18b
70 = 18a + 82b
⇒ a = 2.52, b = 0.3 (rounded) ⇒ Ŷ = 2.52 + 0.3X

(i) X = 8 (Rs.8000) ⇒ Ŷ = 2.52 + 0.3 × 8 = 4.92, i.e., an estimated use of 49,200
(ii) Y = 7 (use of 70,000) ⇒ 7 = 2.52 + 0.3X ⇒ X = 14.93, i.e., Rs.14,930
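The normal equations of this example can be solved directly, as in the following Python sketch (the function name `fit_line` is an illustrative assumption). Without intermediate rounding the constants come out as a ≈ 2.51 and b ≈ 0.302, so the two predictions differ slightly from the slide's figures, which round b to 0.3.

```python
def fit_line(xs, ys):
    """Least squares constants (a, b) of Y = a + bX from the two normal equations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


expenditure = [5, 6, 1, 4, 2]   # X, in thousands of Rs.
use = [4, 3, 2, 6, 3]           # Y, in ten thousands

a, b = fit_line(expenditure, use)

use_at_8 = a + b * 8            # predicted use for an expenditure of Rs.8000
spend_for_7 = (7 - a) / b       # expenditure needed for a use of 70,000
print(round(a, 2), round(b, 2), round(use_at_8, 2), round(spend_for_7, 2))
```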

REGRESSION COEFFICIENT & COEFFICIENT OF DETERMINATION

The constant b is called the regression coefficient.
Regression of X on Y is X = α + βY, and b × β = r².

Coefficient of determination:

$$r^2 = \frac{RSS}{TSS}$$

r² varies between zero and +1. As r² tends closer to +1, the independent variable explains more of the movements in the dependent variable.
r is the correlation coefficient and varies from -1 to +1.
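The decomposition TSS = RSS + ESS, the coefficient of determination and the F ratio can be illustrated with the same library data. The sketch below follows the slide's naming (RSS is the regression, i.e. explained, sum of squares and ESS the error sum of squares); the function name is an assumption.

```python
def regression_sums_of_squares(xs, ys):
    """Return TSS, RSS, ESS, r^2 and F for a simple linear regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    fitted = [a + b * x for x in xs]

    tss = sum((y - mean_y) ** 2 for y in ys)              # total
    rss = sum((f - mean_y) ** 2 for f in fitted)          # explained by regression
    ess = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # unexplained (error)
    r_squared = rss / tss
    f_ratio = (rss / 1) / (ess / (n - 2))
    return tss, rss, ess, r_squared, f_ratio


expenditure = [5, 6, 1, 4, 2]
use = [4, 3, 2, 6, 3]
tss, rss, ess, r_squared, f_ratio = regression_sums_of_squares(expenditure, use)
print(round(tss, 2), round(rss, 2), round(ess, 2), round(r_squared, 3), round(f_ratio, 3))
```

For this data r² is about 0.17, the square of the correlation coefficient of about 0.41 between the same two series.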

Homework :

No. of Books (X)    Age (Y)
 5                   23
 6                   35
 8                   41
12                   58
15                   75

3. Cause & Effect Relationships : i. Regression Analysis contd..

Page 114: Data Analysis Techq

ii. Multiple Correlations and (Non-Linear or Complex) Regressions

Involves two or more independent variables:
Ŷ = a + b1 X1 + b2 X2

Normal equations are:

$$\sum Y_i = na + b_1 \sum X_{1i} + b_2 \sum X_{2i}$$
$$\sum X_{1i} Y_i = a \sum X_{1i} + b_1 \sum X_{1i}^2 + b_2 \sum X_{1i} X_{2i}$$
$$\sum X_{2i} Y_i = a \sum X_{2i} + b_1 \sum X_{1i} X_{2i} + b_2 \sum X_{2i}^2$$

Problem of multicollinearity: regression coefficients b1 and b2 become less reliable if there is a high degree of correlation between the independent variables X1 and X2.

The collective effect of the independent variables X1 and X2 is given by the coefficient of multiple correlation:

$$R_{y \cdot x_1 x_2} = \sqrt{\frac{b_1 \left( \sum Y_i X_{1i} - n \bar{Y} \bar{X}_1 \right) + b_2 \left( \sum Y_i X_{2i} - n \bar{Y} \bar{X}_2 \right)}{\sum Y_i^2 - n \bar{Y}^2}}$$

or

$$R_{y \cdot x_1 x_2} = \sqrt{\frac{b_1 \sum x_{1i} y_i + b_2 \sum x_{2i} y_i}{\sum y_i^2}}$$

where x_1i = (X_1i - mean of X1), x_2i = (X_2i - mean of X2) and y_i = (Y_i - mean of Y)
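The three normal equations form a small linear system in a, b1 and b2. The sketch below solves it in plain Python; the data are hypothetical, constructed so that Y = 1 + 2·X1 + 3·X2 exactly, and the helper names are assumptions rather than anything given in the slides.

```python
def solve_linear(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_val = aug[col][col]
        aug[col] = [v / pivot_val for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def dot(u, v):
    """Sum of products of two equal-length sequences."""
    return sum(p * q for p, q in zip(u, v))


def fit_two_predictors(x1, x2, y):
    """Return (a, b1, b2) of Y = a + b1*X1 + b2*X2 from the three normal equations."""
    n = len(y)
    matrix = [
        [n, sum(x1), sum(x2)],
        [sum(x1), dot(x1, x1), dot(x1, x2)],
        [sum(x2), dot(x1, x2), dot(x2, x2)],
    ]
    rhs = [sum(y), dot(x1, y), dot(x2, y)]
    return solve_linear(matrix, rhs)


# Hypothetical data generated from Y = 1 + 2*X1 + 3*X2 (no error term)
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

a, b1, b2 = fit_two_predictors(x1, x2, y)
print(round(a, 6), round(b1, 6), round(b2, 6))
```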

Page 115: Data Analysis Techq

iii. Partial Correlation

Partial correlation measures, separately, the relationship between two variables (i.e., the dependent and a particular independent variable) by holding all other variables constant.

First, the simple coefficients of correlation between each pair of variables have to be calculated.

For example, the first order coefficient (of partial correlation) measuring the effect of X1 on Y is given by

$$r^2_{yx_1 \cdot x_2} = \frac{R^2_{y \cdot x_1 x_2} - r^2_{yx_2}}{1 - r^2_{yx_2}}$$

or

$$r_{yx_1 \cdot x_2} = \frac{r_{yx_1} - r_{yx_2} \, r_{x_1 x_2}}{\sqrt{1 - r^2_{yx_2}} \, \sqrt{1 - r^2_{x_1 x_2}}}$$
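The second form of the partial correlation formula needs only the three simple correlation coefficients. A small Python sketch follows; the input values 0.8, 0.6 and 0.5 are hypothetical and serve only to show the call.

```python
from math import sqrt


def partial_correlation(r_yx1, r_yx2, r_x1x2):
    """First order partial correlation of Y and X1, holding X2 constant."""
    numerator = r_yx1 - r_yx2 * r_x1x2
    denominator = sqrt(1 - r_yx2 ** 2) * sqrt(1 - r_x1x2 ** 2)
    return numerator / denominator


# Hypothetical simple correlations between each pair of variables
r_partial = partial_correlation(r_yx1=0.8, r_yx2=0.6, r_x1x2=0.5)
print(round(r_partial, 3))
```

When X2 is uncorrelated with both Y and X1, the partial coefficient reduces to the simple coefficient between Y and X1.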

Page 116: Data Analysis Techq

4. Other Measures : A. Index numbers

An index number is a device to measure the magnitude of
(i) Change in the price, quantity or value of an item or more, usually a group of items, over time, or
(ii) Difference between two similarly measured quantities

Example
Total number of issues of volumes of non-fiction by a library in a number of years

Year                 1960    1961    1962    1963    1964
Number of issues     8094    9288    8416    9271    8233

i. Fixed Base Index
Simple indexes for issues of volumes of non-fiction by a library in a number of years (1960 = 100)

Year      1960    1961      1962      1963      1964
Index     100     114.75    103.98    114.54    101.72

Page 117: Data Analysis Techq

4. Other Measures A. Index numbers contd.

ii. Chain Base Index
Chain base index for issues of volumes of non-fiction by a library in a number of years

Year      1961      1962     1963      1964
Index     114.75    90.61    110.16    88.80

The change or difference is expressed as a ratio or % of a stated base or starting date, period or quantity, which is given a value of 100 points.

Fixed base index = (Value in given year / Value in base year) X 100

Chain base index = (Value in given year / Value in previous year) X 100

In a weighted index, an item in the index is given its due weight in accordance with its importance in the whole index.

Base year weighted index = [∑ (Price in given year X Qty in base year) / ∑ (Price in base year X Qty in base year)] X 100

Given year weighted index = [∑ (Price in given year X Qty in given year) / ∑ (Price in base year X Qty in given year)] X 100
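The fixed base and chain base indexes for the non-fiction issues can be recomputed with the short Python sketch below. The function names are illustrative; the issue figures are those of the example.

```python
issues = {1960: 8094, 1961: 9288, 1962: 8416, 1963: 9271, 1964: 8233}


def fixed_base_index(series, base_year):
    """Each year's value as a percentage of the base year value."""
    base = series[base_year]
    return {year: round(value / base * 100, 2) for year, value in series.items()}


def chain_base_index(series):
    """Each year's value as a percentage of the previous year's value."""
    years = sorted(series)
    return {
        current: round(series[current] / series[previous] * 100, 2)
        for previous, current in zip(years, years[1:])
    }


fixed = fixed_base_index(issues, base_year=1960)
chain = chain_base_index(issues)
print(fixed)
print(chain)
```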

Page 118: Data Analysis Techq

4. Other Measures A. Index numbers contd.

• An index number is a special type of average used to measure the level of a given phenomenon as compared to the level of the same phenomenon at some standard date; in other words, the figures are reduced to a common base (e.g., converting the series into a series of index numbers) to study the changes in the effect of such factors which are incapable of being measured directly
• They are approximate indicators & give only a fair idea of changes
• Index numbers prepared for one purpose cannot be used for other purposes, or for the same purpose at other places. Chances of error also remain in them.

Examples:
1. Library use index = 1/100 of the no. of pages of xerox copies of reading material taken during a year + 2 times the no. of documents borrowed through ILL + 5 times the no. of visits to the library during a 3-month sample seat occupancy study + mean no. of documents borrowed during the year (both circulation sample and collection sample)
2. Library interaction index = no. of documents suggested + no. of documents indented + no. of documents reserved + 2 times the no. of literature search services availed + no. of short range ref. queries placed

Converting a chain base index into a fixed base index:

YEAR          1      2                  3                    4                     5
Chain base    100    103                107                  110                   115
Fixed base    100    100 x 103 / 100    103 x 107 / 100      110.2 x 110 / 100     121.2 x 115 / 100
                     = 103              = 110.2              = 121.2               = 139.4
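The conversion from a chain base series to a fixed base series is a running product, which a brief Python sketch makes explicit. The function name is an assumption; the chain values are those in the table.

```python
def chain_to_fixed(chain_indexes):
    """Convert chain base indexes (first value = base, 100) to fixed base indexes."""
    fixed = [float(chain_indexes[0])]
    for link in chain_indexes[1:]:
        # Each fixed base value is the previous fixed value scaled by the current link
        fixed.append(fixed[-1] * link / 100)
    return fixed


fixed = chain_to_fixed([100, 103, 107, 110, 115])
rounded = [round(value, 1) for value in fixed]
print(rounded)
```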

Page 119: Data Analysis Techq

4. Other Measures B. Time series analysis

Time series: a series of successive observations of a phenomenon over a period of time
– When the independent variable is time in a cause and effect relationship of the regression analysis type, it is time series analysis
– It helps to estimate / predict the future

Components of time series
1. Secular or long term trend (T)
2. Short term oscillations: (i) Cyclical variations (C) (usually more than a year) (ii) Seasonal variations (S) (usually within a year)
3. Irregular or erratic variations (I): random fluctuations, completely unpredictable, like riots, natural calamities, etc.

Page 120: Data Analysis Techq

4. Other Measures B. Time series analysis contd.

Methods of isolating and measuring trend
1. Free hand method
2. Semi-average method
3. Method of moving average
4. Method of least squares

Method of moving averages
By smoothing out fluctuations, it helps to detect the trend
By choosing an appropriate period, the method can also help to find out short term variations (i.e., cyclical & seasonal) as well
In addition, use of a seasonal index helps to account for seasonal variations
Moving average helps to reduce seasonal variations while finding the trend

Example: 1. The following are daily issues of junior non-fiction from a library (public library)

Day    Week 1    Week 2    Week 3
Mon      36        46        66
Tue      31        55        76
Wed      25        37        40
Thu      55        80        74
Fri      45        66        90
Sat      90       115       150

Method of least squares
Taking ‘t’ as the independent variable, the equation for secular trend is Ŷ = a + bt
Normal equations are:
∑ Y = na + b ∑ t
∑ tY = a ∑ t + b ∑ t²
(n = no. of years)
Enables forecasting future values of Y from Ŷ = a + bt

Page 121: Data Analysis Techq

4. Other Measures B. Time series analysis contd.

Time series analysis of data of example 1

Week   Day   Number of issues   Moving average   Cyclical variation
1      M          36
       T          31
       W          25
       T          55                47.9              +7.1
       F          45                50.7              -5.7
       S          90                53.7             +36.3
2      M          46                56.8             -10.8
       T          55                60.6              -5.6
       W          37                64.4             -27.4
       T          80                68.2             +11.8
       F          66                71.6              -5.6
       S         115                73.6             +41.4
3      M          66                73.3              -7.3
       T          76                74.8              +1.2
       W          40                79.8             -39.8
       T          74
       F          90
       S         150
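The moving averages in this table appear to be centred six-day averages (the mean of two adjacent six-day means, placed against the fourth day of the first window). Under that assumption, the following Python sketch reproduces the column; values may differ from the table in the last digit because the table rounds intermediate results.

```python
def centred_moving_average(values, period):
    """Centred moving average for an even period.

    Element k of the result aligns with values[k + period // 2].
    """
    window_means = [
        sum(values[i:i + period]) / period
        for i in range(len(values) - period + 1)
    ]
    return [(m1 + m2) / 2 for m1, m2 in zip(window_means, window_means[1:])]


# Daily issues, Monday to Saturday, for weeks 1 to 3
issues = [36, 31, 25, 55, 45, 90,
          46, 55, 37, 80, 66, 115,
          66, 76, 40, 74, 90, 150]

period = 6
averages = centred_moving_average(issues, period)
offset = period // 2
# Variation = actual figure minus the corresponding moving average
variations = [issues[k + offset] - avg for k, avg in enumerate(averages)]

for avg, var in zip(averages, variations):
    print(round(avg, 1), round(var, 1))
```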

Page 122: Data Analysis Techq

4. Other Measures B. Time series analysis contd.

Homework
Daily visitors to a public library

Day    Week 1    Week 2
Sun      900       800
Mon      400       500
Tue      500       300
Wed      600       300
Thu      300       400
Fri      700       600
Sat     1100       900

Solution (example 1):
Trend: Upward, i.e., increasing daily issues
Cyclic variation: Differences between the moving average (expected) and the corresponding actual figure of issues are markedly high on Saturday and very low on Wednesday

Page 123: Data Analysis Techq

Monthly statistics of no. of searches executed on a CD-ROM database by PG students is as follows

              Year 1                  Year 2                  Year 3
Month    No. of     12 month     No. of     12 month     No. of     12 month
         searches   moving Av    searches   moving Av    searches   moving Av
JAN        50                      60         68.3         50         71.7
FEB        50                      50         68.3         40         71.7
MAR        50                      60         68.3         70         72.5
APR        60                      70         69.2         80         71.7
MAY        70                      80         70.0         90         70.8
JUN        80         64.2         90         69.2        100         71.7
JUL        90         65.0         90         68.3        100
AUG        90         65.0         90         67.5         90
SEP        60         65.8         60         68.3         70
OCT        60         66.7         70         69.2         60
NOV        50         67.5         60         70.0         50
DEC        60         68.3         50         70.8         60

Secular trend: Increasing
Seasonal variation: Maximum during Mar-May (may be exams? A seasonal index is more useful in accounting for seasonal variations)

Page 124: Data Analysis Techq

Quarterly statistics of user visits to a special library

               Year 1                 Year 2                 Year 3
Quarter    No. of    % of         No. of    % of         No. of    % of         Average
           user      quarterly    user      quarterly    user      quarterly    % or
           visits    average      visits    average      visits    average      index
1           2000       80          3000      100          4000      100           93.3
2           3000      120          3500      116.7        5000      125          120.6
3           2000       80          2000       66.7        3000       75           73.9
4           3000      120          3500      116.7        4000      100          112.2
Total      10000                  12000                  16000
Q Average   2500                   3000                   4000

If the first quarter of the 4th year records 6000 user visits, estimate the average quarterly visits for that year.

Ans. 6000 / 93.3 X 100 = 6431
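The seasonal index in the last column is the average, across years, of each quarter's visits expressed as a percentage of that year's quarterly average. A Python sketch of that calculation follows (names are illustrative). Using the unrounded index 93.33 gives an estimate of about 6429 instead of the slide's 6431, which uses 93.3.

```python
visits = {
    "Year 1": [2000, 3000, 2000, 3000],
    "Year 2": [3000, 3500, 2000, 3500],
    "Year 3": [4000, 5000, 3000, 4000],
}


def seasonal_indices(yearly_quarters):
    """Average, over the years, of each quarter as a % of its year's quarterly average."""
    percentages = []
    for quarters in yearly_quarters.values():
        quarterly_average = sum(quarters) / len(quarters)
        percentages.append([q / quarterly_average * 100 for q in quarters])
    return [sum(column) / len(column) for column in zip(*percentages)]


indices = seasonal_indices(visits)
print([round(i, 1) for i in indices])

# First quarter of year 4 records 6000 visits: estimate that year's quarterly average
estimated_quarterly_average = 6000 / indices[0] * 100
print(round(estimated_quarterly_average))
```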

Page 125: Data Analysis Techq

Method of least squares

Assumes a linear relation and that the past behaviour continues to persist in future
Trend line: Y = a + bt

Normal equations
∑Y = na + b ∑t
∑tY = a ∑t + b ∑t²

Example: ∑t = 0, ∑Y = 177, ∑tY = 171, ∑t² = 280, n = 15
⇒ 177 = 15a + b x 0 ⇒ a = 11.8
  171 = a x 0 + b x 280 ⇒ b = 0.61

Hence the trend regression line is Y = 11.8 + 0.61t
To estimate the value for t = 9 (i.e., say sales for 1986):
Y = 11.8 + 0.61 x 9 = 17.29

Note: For variable ‘t’ the midpoint of time is taken as origin, so that ∑t = 0.
Ex: -2, -1, 0, 1, 2 (odd no. of periods)
-3, -2, -1, 1, 2, 3 (even no. of periods)
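Because the origin is chosen so that ∑t = 0, the two normal equations separate and the constants follow directly from the sums. The Python sketch below uses the sums quoted in the example; the function name is an assumption.

```python
def trend_from_sums(n, sum_y, sum_ty, sum_t2):
    """Constants of Y = a + b*t when t is measured from the midpoint (sum of t = 0)."""
    a = sum_y / n          # from  sum(Y)  = n*a + b*0
    b = sum_ty / sum_t2    # from  sum(tY) = a*0 + b*sum(t^2)
    return a, b


a, b = trend_from_sums(n=15, sum_y=177, sum_ty=171, sum_t2=280)
forecast = a + b * 9
print(round(a, 2), round(b, 2), round(forecast, 2))
```

With b kept unrounded the forecast is about 17.30; the slide's 17.29 results from rounding b to 0.61 first.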

Page 126: Data Analysis Techq

4. Other Measures B. Time series analysis contd.

Measurement of seasonal variations:
1. Ratio to trend method
2. Ratio to moving averages method
3. Link relative method

Measurement of cyclic variations:
1. Harmonic analysis
2. Spectrum analysis

Note: Residuals remaining after elimination of the seasonal and trend components should be recorded & plotted graphically for visual comparison of the residual variations, which are attributed to cyclic and irregular / erratic components

Page 127: Data Analysis Techq

ANOVA

• Testing the difference among different groups of data for homogeneity
• Useful to investigate
(i) Any number of factors which are hypothesised or said to influence the dependent variable
(ii) The differences amongst various categories within each of these factors, which may have a large number of values

Example (one way ANOVA)
The number of books stored per shelf in a library may be of interest. If a random sample of shelves is selected and the number of books on each shelf is counted, the quantitative data collected can be presented in a frequency table as shown in the figure …contd.

Page 128: Data Analysis Techq

Frequency table showing variation of no. of books stored per shelf according to subject category (example of one way ANOVA) contd.

Books per shelf    Number of shelves by subject (Geography X1, Law X2, Production X3)    Total
16                 1                                                                     1
17                                                                                       0
18                                                                                       0
19                                                                                       0
20                                                                                       0
21                 3                                                                     3
22                                                                                       0
23                 1, 1                                                                  4
24                                                                                       0
25                 4                                                                     4
26                 3                                                                     3
27                                                                                       0
28                 1, 1                                                                  2
29                 1                                                                     1
30                 2, 2                                                                  4
31                                                                                       0
32                 1                                                                     1
33                 1, 1                                                                  3
34                 1                                                                     1
35                                                                                       0
36                 2                                                                     2
37                                                                                       0
38                                                                                       0
39                                                                                       0
40                                                                                       0
41                                                                                       0
42                                                                                       0
43                 1                                                                     1

Page 129: Data Analysis Techq

ANOVA (Example of one way ANOVA) contd.

Frequency table showing variation of number of books stored per shelf in a random sample of shelves

Books per shelf    Number of shelves    Books per shelf    Number of shelves
16                 1                    30                 4
17                 0                    31                 0
18                 0                    32                 1
19                 0                    33                 3
20                 0                    34                 1
21                 3                    35                 0
22                 0                    36                 2
23                 4                    37                 0
24                 0                    38                 0
25                 4                    39                 0
26                 3                    40                 0
27                 0                    41                 0
28                 2                    42                 0
29                 1                    43                 1

One-way ANOVA considers one factor, i.e., no. of books per shelf

Page 130: Data Analysis Techq

Steps (example of one way ANOVA) contd.

1. Obtain the mean of each sample, i.e., $\bar{X}_1, \bar{X}_2, \bar{X}_3$

2. Find the mean of the sample means (k = number of samples, here 3), i.e.,

$$\bar{\bar{X}} = \frac{\bar{X}_1 + \bar{X}_2 + \bar{X}_3}{3}$$

3. Find the sum of squares for variance between the samples, i.e.,

$$SS_{between} = n_1 (\bar{X}_1 - \bar{\bar{X}})^2 + n_2 (\bar{X}_2 - \bar{\bar{X}})^2 + n_3 (\bar{X}_3 - \bar{\bar{X}})^2$$

4. Calculate the variance or mean square between samples, i.e.,

$$MS_{between} = \frac{SS_{between}}{k - 1} \qquad df = k - 1 = 3 - 1 = 2$$

5. Find the sum of squares for variance within samples

$$SS_{within} = \sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2 + \sum (X_{3i} - \bar{X}_3)^2$$

6. Calculate the variance or mean square within samples

$$MS_{within} = \frac{SS_{within}}{n - k} \qquad df = n - k, \quad n = n_1 + n_2 + n_3 + \dots$$

Page 131: Data Analysis Techq

Frequency table showing variation of number of books stored per shelf according to subject category (example of one way ANOVA) …contd.

7. Check the SS for total variation

$$SS_{total} = \sum (X_{ij} - \bar{\bar{X}})^2 = SS_{between} + SS_{within}$$

and (n - 1) = (k - 1) + (n - k)

8. Calculate the F ratio

$$F = \frac{MS_{between}}{MS_{within}}$$

Note: Compare with the table value of F. If it is equal to or more than the table value, the difference is significant and hence
1. the samples could not have come from the same universe, or
2. the independent variable has a significant effect on the dependent variable. The greater the value of the F ratio, the more definite and sure the conclusions
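Steps 1 to 8 can be sketched in Python as below. The three samples are hypothetical books-per-shelf counts (they are not the data of the frequency table), chosen with equal sample sizes so that the grand mean equals the mean of the sample means, as step 2 assumes.

```python
def one_way_anova(samples):
    """Return SS between, SS within, SS total and the F ratio for a list of samples."""
    k = len(samples)
    n = sum(len(s) for s in samples)
    means = [sum(s) / len(s) for s in samples]
    # Grand mean of all observations; equals the mean of sample means for equal sizes
    grand_mean = sum(sum(s) for s in samples) / n

    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    ss_total = sum((x - grand_mean) ** 2 for s in samples for x in s)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ss_between, ss_within, ss_total, ms_between / ms_within


# Hypothetical books-per-shelf counts for three subject categories
geography = [21, 22, 23]
law = [22, 23, 24]
production = [26, 27, 28]

ss_between, ss_within, ss_total, f_ratio = one_way_anova([geography, law, production])
print(ss_between, ss_within, ss_total, f_ratio)
```

The resulting F ratio would then be compared with the table value of F for (k - 1, n - k) degrees of freedom, here (2, 6).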

Page 134: Data Analysis Techq

About the Author

Dr. M. S. Sridhar is a postgraduate in Mathematics and Business Management and holds a Doctorate in Library and Information Science. He has been in the profession for the last 36 years. Since 1978, he has headed the Library and Documentation Division of ISRO Satellite Centre, Bangalore. Earlier he worked in the libraries of National Aeronautical Laboratory (Bangalore), Indian Institute of Management (Bangalore) and University of Mysore. Dr. Sridhar has published 4 books, 81 research articles and 22 conference papers, written 19 course materials for BLIS and MLIS, made over 25 seminar presentations and contributed 5 chapters to books. E-mail: [email protected], [email protected], [email protected]; Phone: 91-80-25084451; Fax: 91-80-25084476.