QUANTITATIVE METHODS FOR ECONOMIC ANALYSIS – 1
III SEMESTER
B A ECONOMICS
(2013 Admission)
UNIVERSITY OF CALICUT
SCHOOL OF DISTANCE EDUCATION
Calicut University P.O., Malappuram, Kerala, India 673 635
263 A
School of Distance Education
Quantitative Methods for Economic Analysis - I Page 2
UNIVERSITY OF CALICUT
SCHOOL OF DISTANCE EDUCATION
B.A. ECONOMICS (2013 ADMISSION)
III SEMESTER
QUANTITATIVE METHODS FOR ECONOMIC ANALYSIS – 1
Prepared by:
Module Materials Prepared by
Full Module
Chacko Jose P, PhD, Associate Professor of Economics, Sacred Heart College, Chalakudy, Thrissur, Kerala (Formerly Reader, UGC-Academic Staff College, University of Calicut)
Editor
Dr. C. Krishnan, Associate Professor, PG Department of Economics, Govt. College Kodanchery, Kozhikode – 673580, Email: [email protected]
Layout & Settings: Computer Section, SDE
© Reserved
CONTENTS PAGES
MODULE - I 5-79
MODULE - II 80-100
MODULE - III 101-150
MODULE - IV 151-169
Quantitative Methods for Economic Analysis – 1
Syllabus
Module I. Description of Data and Sampling
Statistics: Meaning and limitations - Data: Elements, Variables, Observations - Scale of Measurement - Types of Data: Qualitative and Quantitative; Cross-section, Time series and Pooled Data - Frequency Distributions: Absolute and relative - Graphs: Bar chart, Histogram etc. - Summary Measures of Distributions: Measures of Central Tendency, Variability and Shape - Sampling: Population and Sample, Methods of Sampling.
Module II. Correlation and Regression Analysis
Correlation: Meaning, Types and Degrees of Correlation - Methods of Measuring Correlation - Graphical Methods: Scatter Diagram and Correlation Graph; Algebraic Methods: Karl Pearson’s Coefficient of Correlation and Rank Correlation Coefficient - Properties and Interpretation of the Correlation Coefficient.
Module III. Index Numbers and Time Series Analysis
Index Numbers: Meaning and Uses - Laspeyre’s, Paasche’s, Fisher’s, Dorbish-Bowley, Marshall-Edgeworth and Kelley’s Methods - Tests of Index Numbers: Time Reversal and Factor Reversal tests - Base Shifting, Splicing and Deflating - Special Purpose Indices: Wholesale Price Index, Consumer Price Index and Stock Price Indices (BSE SENSEX and NSE-NIFTY) - Time Series Analysis: Components of Time Series, Measurement of Trend by Moving Averages and the Method of Least Squares.
Module IV. Nature and Scope of Econometrics
Econometrics: Meaning, Scope, and Limitations - Methodology of Econometrics - Modern Interpretation - Stochastic Disturbance Term - Population Regression Function and Sample Regression Function - Assumptions of the Classical Linear Regression Model.
Module I
Description of Data and Sampling
1. STATISTICS-MEANING
Statistics is as old as the human race, and its utility has grown through the ages. In the
olden days it was used in the administrative departments of states, and its scope was limited.
It was used by governments to keep records of births, deaths, population and so on for
administrative purposes. In the 17th century, John Graunt was the first to make a systematic study of birth and
death statistics and to calculate the expectation of life at different ages,
which led to the idea of life insurance.
The word ‘Statistics’ seems to have been derived from the Latin word ‘status’, the Italian word ‘statista’ or the German word ‘Statistik’, each of which means a political state. Fields like agriculture, economics, sociology and business management are now using statistical methods
for different purposes.
Statistics has been defined differently by different writers. According to Webster, "Statistics are
the classified facts representing the conditions of the people in a state, especially those facts
which can be stated in numbers or in any tabular or classified arrangement."
Bowley gave several definitions: statistics is ‘the science of counting’, the ‘science of averages’, and ‘numerical statements of facts in any department of enquiry placed in relation to each other’. According to Yule and Kendall, statistics means quantitative data affected to a marked extent by a
multiplicity of causes.
A broader definition of statistics was given by Horace Secrist. According to him, statistics
means an aggregate of facts affected to a marked extent by a multiplicity of causes, numerically
expressed, enumerated or estimated according to a reasonable standard of accuracy, collected in
a systematic manner for a predetermined purpose and placed in relation to each other.
This definition points out some essential characteristics that numerical facts must possess so that
they may be called statistics. These characteristics are:
1. They are enumerated or estimated according to a reasonable standard of accuracy
2. They are affected by a multiplicity of factors
3. They must be numerically expressed
4. They must be aggregates of facts
W.I. King defined statistics thus: “the science of statistics is the method of judging collective, natural or social phenomena from the results obtained from the analysis or enumeration or collection of
estimates”. Prof. Boddington defined statistics as the “science of estimates and probabilities”. Let us also see some other definitions of statistics.
Statistics as a discipline is the development and application of methods to collect, analyse and
interpret data.
Statistics is the science of learning from data, and of measuring, controlling, and communicating
uncertainty; and it thereby provides the navigation essential for controlling the course of
scientific and societal advances.
Statistics is a collection of mathematical techniques that help to analyse and present data.
Statistics is also used in associated tasks such as designing experiments and surveys and planning
the collection and analysis of data from these.
Statistics is the study of numerical information, called data. Statisticians acquire, organize, and
analyse data. Each part of this process is also scrutinized. The techniques of statistics are applied
to a multitude of other areas of knowledge.
Thus, to sum up: “statistics are the numerical statements of facts capable of analysis and
interpretation, and the science of statistics is the study of the principles and methods applied
in collecting, presenting, analysing and interpreting numerical data in any field of inquiry.”
Characteristics of Statistics
1. Statistics are aggregates of facts: A single age of 20 or 30 years is not statistics; a series of ages
is. Similarly, a single figure relating to production, sales, births or deaths would not be
statistics, although aggregates of such figures would be, because of their comparability
and relationship.
2. Statistics are affected to a marked extent by a multiplicity of causes: A number of causes
affect statistics in a particular field of enquiry; for example, production statistics are affected by
climate, soil fertility, availability of raw materials and methods of quick transport.
3. Statistics are numerically expressed, enumerated or estimated: The subject of statistics is
concerned essentially with facts expressed in numerical form - with their quantitative details and
not qualitative descriptions. Therefore, facts indicated by terms such as ‘good’ or ‘poor’ are not statistics unless a numerical equivalent is assigned to each expression. The figures may either be
enumerated or, where actual enumeration is not possible or is very difficult, estimated.
4. Statistics are enumerated or estimated according to a reasonable standard of accuracy: Personal
bias and the prejudices of the enumerator should not enter into the counting or estimation of
figures; otherwise conclusions from the figures would not be accurate. The figures should be
counted or estimated according to reasonable standards of accuracy. Absolute accuracy is neither
necessary nor always possible in the social sciences. But whatever standard of accuracy is once
adopted should be used throughout the process of collection or estimation.
5. Statistics should be collected in a systematic manner for a predetermined purpose: The
statistical methods to be applied depend on the purpose of the enquiry, since figures are always collected
with some purpose. If there is no predetermined purpose, all the effort in collecting the figures
may prove wasteful. The purpose of a series of ages of husbands and wives may be to find
whether young husbands have young wives and old husbands have old wives.
6. Statistics should be capable of being placed in relation to each other: The collected figures
should be comparable and well-connected within the same department of inquiry. Ages of husbands
are to be compared only with the corresponding ages of wives, and not with, say, heights of
trees.
Functions of Statistics
The functions of statistics may be enumerated as follows:
(i) To present facts in a definite form: Without a statistical study our ideas are likely to be vague,
indefinite and hazy, but figures help us to see things in their true perspective. For
example, the statement that some students out of 400 who had appeared for a certain
examination were declared successful would not give as much information as the statement that 300
students out of 400 who took the examination were declared successful.
(ii) To simplify unwieldy and complex data: It is not easy to grasp large masses of figures, and hence they
are simplified either by taking a few figures to serve as a representative sample or by taking
averages to give a bird’s-eye view of the large mass. For example, complex data may be simplified by presenting them in the form of a table, graph or diagram, or by representing them through
an average.
(iii) To use it as a technique for making comparisons: The significance of certain figures can be
better appreciated when they are compared with others of the same type. The comparison
between two different groups is best made with certain statistical measures, such as averages,
coefficients, rates and ratios.
Uses of Statistics
Statistics is primarily used either to make predictions based on the data available or to make
conclusions about a population of interest when only sample data is available.
In both cases statistics tries to make sense of the uncertainty in the available data.
Statisticians apply statistical thinking and methods to a wide variety of scientific, social and
business endeavours, in areas such as astronomy, biology, education, economics, engineering,
genetics, marketing, medicine, psychology, public health and sports, among many others. Many economic,
social, political and military decisions cannot be made without statistical techniques, such as the
design of experiments to gain federal approval of a newly manufactured drug.
Statistics is of two types: (a) Descriptive statistics involves methods of organizing, picturing and
summarizing information from data. (b) Inferential statistics involves methods of using
information from a sample to draw conclusions about the population.
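The distinction can be illustrated with a short sketch in Python. The exam scores and the confidence-interval formula below are illustrative assumptions, not part of the text:

```python
import math
import statistics

# Hypothetical exam scores for a sample of ten students
scores = [62, 71, 55, 80, 67, 73, 59, 88, 64, 70]

# Descriptive statistics: organize and summarize the sample itself
mean_score = statistics.mean(scores)      # 68.9
median_score = statistics.median(scores)  # 68.5
stdev_score = statistics.stdev(scores)    # sample standard deviation

print(f"mean={mean_score}, median={median_score}, sd={stdev_score:.2f}")

# Inferential statistics: use the sample to say something about the
# population, e.g. a rough 95% confidence interval for the population
# mean under a normal approximation
margin = 1.96 * stdev_score / math.sqrt(len(scores))
print(f"approx. 95% CI: ({mean_score - margin:.1f}, {mean_score + margin:.1f})")
```

The first half only describes the ten scores in hand; the second half makes a claim about the wider population from which they were drawn.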
These days statistical methods are applicable everywhere; there is hardly any field of work in which
statistical methods are not applied. According to A. L. Bowley, ‘A knowledge of statistics is like a knowledge of foreign languages or of algebra; it may prove of use at any time under any
circumstances’. The importance of statistical science is increasing in almost all spheres of knowledge, e.g., astronomy, biology, meteorology, demography, economics and mathematics.
Economic planning without statistics is bound to be baseless. Statistics serve in administration
and facilitate the formulation of new policies. Financial institutions and investors utilise
statistical data to summarise past experience. Statistics are also helpful to an auditor when he
uses sampling techniques or test checking to audit the accounts of his client.
(a) Statistics and Economics: In the year 1890 Prof. Alfred Marshall, the renowned economist,
observed that “statistics are the straw out of which I, like every other economist, have to make
bricks”. This shows the significance of statistics in economics. Economics is concerned with the production and distribution of wealth as well as with the complex institutional set-up connected
with the consumption, saving and investment of income. Statistical data and statistical methods
are of immense help in the proper understanding of economic problems and in the
formulation of economic policies. In fact, they are the tools and appliances of an economist’s
laboratory. In the field of economics it is almost impossible to find a problem which does not
require an extensive use of statistical data. As economic theory advances, the use of statistical
methods also increases. The laws of economics, like the law of demand and the law of supply, can be
verified and established with the help of statistical methods. Statistics of consumption
tell us about the relative strength of the desires of a section of people. Statistics of production
describe the wealth of a nation. Exchange statistics throw light on the commercial development of
a nation. Distribution statistics disclose the economic conditions of various classes of people.
Therefore, statistical methods are necessary for economics.
(b) Statistics and business: Statistics is an aid to business and commerce. When a person enters
business, he enters the profession of forecasting. Modern statistical devices have made
business forecasting more precise and accurate. A businessman needs statistics right from the
time he proposes to start a business. He should have the relevant facts and figures to prepare the
financial plan of the proposed business. Statistical methods are necessary for these purposes. In
industrial concerns, statistical devices are used not only to determine and control the
quality of the products manufactured but also to reduce wastage to a minimum. The technique of
statistical quality control is used to maintain the quality of products.
(c) Statistics and Research: Statistics is an indispensable tool of research. Most of the
advancement in knowledge has taken place because of experiments conducted with the help of
statistical methods. For example, experiments on crop yields under different types of fertilizers
and soils, or on the growth of animals under different diets and environments, are
frequently designed and analysed according to statistical methods. Statistical methods are also
useful for research in medicine and public health. In fact, there is hardly any research work
today that is complete without statistical data and statistical methods.
Other uses of statistics are as follows.
(1) Statistics helps in providing a better understanding and exact description of a phenomenon of
nature.
(2) Statistics helps in the proper and efficient planning of a statistical inquiry in any field of study.
(3) Statistics helps in collecting appropriate quantitative data.
(4) Statistics helps in presenting complex data in a suitable tabular, diagrammatic or graphic
form for an easy and clear comprehension of the data.
(5) Statistics helps in understanding the nature and pattern of variability of a phenomenon
through quantitative observations.
(6) Statistics helps in drawing valid inferences, along with a measure of their reliability, about
population parameters from sample data.
Limitations of Statistics
Statistics is indispensable to almost all sciences - social, physical and natural. It is very often
used in most spheres of human activity. In spite of the wide scope of the subject, it has
certain limitations. Some important limitations of statistics are the following:
1. Statistics does not study qualitative phenomena: Statistics deals with facts and figures. So
the qualitative aspect of a variable, or a subjective phenomenon, falls outside the scope of statistics.
For example, qualities like beauty, honesty and intelligence cannot be numerically expressed, so
these characteristics cannot be examined statistically. This limits the scope of the subject.
2. Statistical laws are not exact: Statistical laws are not as exact as those of the natural sciences.
These laws are true only on average and hold good only under certain conditions. They cannot be
universally applied, which limits their practical utility.
3. Statistics does not study individuals: Statistics deals with aggregate of facts. Single or
isolated figures are not statistics. This is considered to be a major handicap of statistics.
4. Statistics can be misused: Statistics is mostly a tool of analysis. Statistical techniques are
used to analyze and interpret the collected information in an enquiry. As it is, statistics does not
prove or disprove anything. It is just a means to an end. Statements supported by statistics are
more appealing and are commonly believed. For this, statistics is often misused. Statistical
methods rightly used are beneficial but if misused these become harmful. Statistical methods
used by less expert hands will lead to inaccurate results. Here the fault does not lie with the
subject of statistics but with the person who makes wrong use of it.
Other limitations are as follows.
(1) Statistical laws are true only on average. Statistics are aggregates of facts, so a single observation is
not statistics; statistics deals with groups and aggregates only.
(2) Statistical methods are best applicable to quantitative data.
(3) Statistics cannot be applied to heterogeneous data.
(4) If sufficient care is not exercised in collecting, analysing and interpreting the data,
statistical results might be misleading.
(5) Only a person who has expert knowledge of statistics can handle statistical data
efficiently.
(6) Some errors are possible in statistical decisions. In particular, inferential statistics involves
certain errors, and we do not know whether an error has been committed or not.
2. DATA: ELEMENTS, VARIABLES, OBSERVATIONS, SCALE OF MEASUREMENT
Data may be defined as facts, observations, and information that come from investigations. Data
can be defined as groups of information that represent the qualitative or quantitative attributes of
a variable or set of variables, which is the same as saying that data can be any set of information
that describes a given entity. Data in statistics can be classified into grouped data and ungrouped
data.
1. Elements: A data element is a unit of data for which the definition, identification,
representation, and permissible values are specified by means of a set of attributes. It is the
smallest named item of data that conveys meaningful information or condenses a lengthy
description into a short code, called a data field, in the structure of a database.
2. Variable - property of an object or event that can take on different values. A variable is any
measurable characteristic or attribute that can have different values for different subjects. Height,
age, amount of income, country of birth, grades obtained at school and type of housing are
examples of variables. For example, college major is a variable that takes on values like
mathematics, computer science, English, psychology, etc.
Discrete Variable - a variable with a limited number of values (e.g., gender (male/female) or
college class (freshman/sophomore/junior/senior)).
Continuous Variable - a variable that can take on many different values, in theory, any value
between the lowest and highest points on the measurement scale.
Independent Variable - a variable that is manipulated, measured, or selected by the researcher as
an antecedent condition to an observed behavior. In a hypothesized cause-and-effect
relationship, the independent variable is the cause and the dependent variable is the outcome or
effect.
Dependent Variable - a variable that is not under the experimenter's control -- the data. It is the
variable that is observed and measured in response to the independent variable.
Qualitative Variable - a variable based on categorical data.
Quantitative Variable - a variable based on quantitative data.
Qualitative vs. Quantitative Variables
Variables can be classified as qualitative (aka, categorical) or quantitative (aka, numeric).
Qualitative. Qualitative variables take on values that are names or labels. The color of a
ball (e.g., red, green, blue) or the breed of a dog (e.g., collie, shepherd, terrier) would be
examples of qualitative or categorical variables.
Quantitative. Quantitative variables are numeric. They represent a measurable quantity.
For example, when we speak of the population of a city, we are talking about the number
of people in the city - a measurable attribute of the city. Therefore, population would be a
quantitative variable.
In algebraic equations, quantitative variables are represented by symbols (e.g., x, y, or z).
Discrete vs. Continuous Variables
Quantitative variables can be further classified as discrete or continuous. If a variable can take on
any value between its minimum value and its maximum value, it is called a continuous variable;
otherwise, it is called a discrete variable.
Some examples will clarify the difference between discrete and continuous variables.
Suppose the fire department mandates that all fire fighters must weigh between 150 and
250 pounds. The weight of a fire fighter would be an example of a continuous variable,
since a fire fighter's weight could take on any value between 150 and 250 pounds.
Suppose we flip a coin repeatedly and count the number of heads. The number of heads could be any
integer value between 0 and plus infinity. However, it could not take just any value between
0 and plus infinity: we could not, for example, get 2.3 heads. Therefore, the number of
heads must be a discrete variable.
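A small Python sketch (with made-up numbers) makes the contrast concrete:

```python
import random

random.seed(1)  # fix the random sequence so the illustration is reproducible

# Discrete: count the heads in 100 simulated coin flips.
# The count is always a whole number -- never 2.3 heads.
heads = sum(1 for _ in range(100) if random.random() < 0.5)
print(f"heads in 100 flips: {heads}")

# Continuous: a fire fighter's weight can take any value in a range,
# e.g. 173.25 pounds (the 150-250 lb bounds follow the example above)
weight = random.uniform(150, 250)
print(f"simulated weight: {weight:.2f} lb")
```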
Univariate vs. Bivariate Data
Statistical data are often classified according to the number of variables being studied.
Univariate data. When we conduct a study that looks at only one variable, we say that we
are working with univariate data. Suppose, for example, that we conducted a survey to
estimate the average weight of high school students. Since we are only working with one
variable (weight), we would be working with univariate data.
Bivariate data. When we conduct a study that examines the relationship between two
variables, we are working with bivariate data. Suppose we conducted a study to see if
there were a relationship between the height and weight of high school students. Since we
are working with two variables (height and weight), we would be working with bivariate
data.
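The two cases can be sketched in Python. The heights and weights below are invented for illustration, and the correlation coefficient (a Module II topic) is computed from first principles:

```python
# Hypothetical heights (cm) and weights (kg) of five students
heights = [160, 165, 170, 175, 180]
weights = [52, 58, 63, 70, 76]

# Univariate: summarize one variable on its own
mean_weight = sum(weights) / len(weights)   # 63.8 kg

# Bivariate: examine the relationship between two variables,
# here via Pearson's correlation coefficient
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print(f"mean weight: {mean_weight} kg")
print(f"height-weight correlation: {pearson(heights, weights):.3f}")
```

For this invented sample the correlation is close to 1, reflecting that taller students are also heavier.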
3. Observations
An observation is the value, at a particular period, of a particular variable, such as the individual
price of an item at a given outlet. Observation is thus a method of data collection in which the
situation of interest is watched and the relevant facts, actions and behaviours are recorded.
Observation units vary according to the specific survey or data collection: for statistical data
collected on persons the observation unit is usually one individual or a household.
4. Scale of Measurement
Normally, when one hears the term measurement, one may think in terms of measuring the
length of something (e.g., the length of a piece of wood) or measuring a quantity of something
(e.g., a cup of flour). This represents a limited use of the term. In statistics, the term
measurement is used more broadly and is more appropriately termed scales of measurement.
Scales of measurement refer to the ways in which variables/numbers are defined and categorized.
Each scale of measurement has certain properties which in turn determine the appropriateness
of certain statistical analyses. The four scales of measurement are nominal, ordinal,
interval, and ratio.
Properties of Measurement Scales
Each scale of measurement satisfies one or more of the following properties of measurement.
Identity: Each value on the measurement scale has a unique meaning. It is not equal to any other
value on the scale.
Magnitude: All values on the measurement scale have an ordered relationship to one another.
That is, some values are larger and some are smaller.
Equal intervals: Scale units along the scale are equal to one another. This means, for example,
that the difference between 1 and 2 would be equal to the difference between 19 and 20.
A minimum value of zero: The scale has a true zero point; that is, no values exist below zero.
Measurement scales are of four types, namely, Nominal Scale of Measurement, Ordinal Scale of
Measurement, Interval Scale of Measurement and Ratio Scale of Measurement
(a) Nominal Scale of Measurement
The nominal scale of measurement only satisfies the identity property of measurement. Values
assigned to variables represent a descriptive category, but have no inherent numerical value with
respect to magnitude.
Gender is an example of a variable that is measured on a nominal scale. Individuals may be
classified as "male" or "female", but neither value represents more or less "gender" than the
other. Religion and political affiliation are other examples of variables that are normally
measured on a nominal scale.
(b) Ordinal Scale of Measurement
The ordinal scale has the property of both identity and magnitude. Each value on the ordinal
scale has a unique meaning, and it has an ordered relationship to every other value on the scale.
An example of an ordinal scale in action would be the results of a horse race, reported as "win",
"place", and "show". We know the rank order in which horses finished the race. The horse that
won finished ahead of the horse that placed, and the horse that placed finished ahead of the horse
that showed. However, we cannot tell from this ordinal scale whether it was a close race or
whether the winning horse won by a mile.
(c) Interval Scale of Measurement
The interval scale of measurement has the properties of identity, magnitude, and equal intervals.
A perfect example of an interval scale is the Fahrenheit scale to measure temperature. The scale
is made up of equal temperature units, so that the difference between 40 and 50 degrees
Fahrenheit is equal to the difference between 50 and 60 degrees Fahrenheit.
With an interval scale, you know not only whether different values are bigger or smaller, you also know how much bigger or smaller they are. For example, suppose it is 60 degrees Fahrenheit on Monday and 70 degrees on Tuesday. You know not only that it was hotter on Tuesday, you also know that it was 10 degrees hotter.
(d) Ratio Scale of Measurement
The ratio scale of measurement satisfies all four of the properties of measurement: identity, magnitude, equal intervals, and a minimum value of zero.
The weight of an object would be an example of a ratio scale. Each value on the weight scale has a unique meaning, weights can be rank ordered, units along the weight scale are equal to one another, and the scale has a minimum value of zero.
Weight scales have a minimum value of zero because objects at rest can be weightless, but they cannot have negative weight.
The table below will help clarify the fundamental differences between the four scales of measurement:
Scale      Indicates     Indicates Direction   Indicates Amount   Absolute
           Difference    of Difference         of Difference      Zero
Nominal    X
Ordinal    X             X
Interval   X             X                     X
Ratio      X             X                     X                  X
You will notice in the above table that only the ratio scale meets the criteria for all four
properties of scales of measurement.
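The table can be mirrored as a small lookup in Python (a sketch; the property names are labels chosen here for illustration):

```python
# Properties satisfied by each scale, mirroring the table above
SCALE_PROPERTIES = {
    "nominal":  {"identity"},
    "ordinal":  {"identity", "magnitude"},
    "interval": {"identity", "magnitude", "equal intervals"},
    "ratio":    {"identity", "magnitude", "equal intervals", "absolute zero"},
}

def supports(scale, prop):
    """Check whether a measurement scale satisfies a given property."""
    return prop in SCALE_PROPERTIES[scale]

# Only the ratio scale satisfies all four properties
print(supports("interval", "absolute zero"))  # False
print(len(SCALE_PROPERTIES["ratio"]))         # 4
```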
Interval and Ratio data are sometimes referred to as parametric and Nominal and Ordinal data
are referred to as nonparametric. Parametric means that it meets certain requirements with
respect to parameters of the population (for example, the data will be normal--the distribution
parallels the normal or bell curve). In addition, it means that numbers can be added, subtracted,
multiplied, and divided. Parametric data are analyzed using statistical techniques identified as
Parametric Statistics. As a rule, there are more statistical technique options for the analysis of
parametric data and parametric statistics are considered more powerful than nonparametric
statistics. Nonparametric data are lacking those same parameters and cannot be added,
subtracted, multiplied, and divided. For example, it does not make sense to add Social Security
numbers to get a third person. Nonparametric data are analyzed by using Nonparametric
Statistics.
3. TYPES OF DATA: Qualitative and Quantitative; Cross-section, Time series and Pooled Data
3.1 Qualitative and Quantitative
Data is a collection of facts, such as values or measurements. It can be numbers, words,
measurements, observations or even just descriptions of things. Some methods provide data
which are quantitative, and some provide data which are qualitative.
Quantitative data are anything that can be expressed as a number, or quantified. Examples of
quantitative data are scores on achievement tests, number of hours of study, or the weight of a
subject. These data may be represented on ordinal, interval or ratio scales and lend themselves to
most statistical manipulation. Qualitative data, by contrast, approximate or characterize but
do not measure the attributes, characteristics, properties, etc., of a thing or phenomenon.
Qualitative data describe, whereas quantitative data define.
Qualitative data cannot be expressed as a number. Data that represent nominal scales, such as
gender, socio-economic status or religious preference, are usually considered to be qualitative data.
Quantitative data, then, are data that can be quantified and verified, and are amenable to statistical
manipulation.
Both types of data are valid types of measurement. But only quantitative data can be analysed
statistically, and thus more rigorous assessments of the data are possible.
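For instance, arithmetic summaries apply only to quantitative data, while qualitative data can only be counted. A Python sketch with invented survey responses:

```python
from collections import Counter
from statistics import mean

# Hypothetical survey responses from five people
incomes = [30000, 45000, 52000, 38000, 45000]                       # quantitative
occupations = ["teacher", "banker", "teacher", "farmer", "banker"]  # qualitative

# Quantitative data support arithmetic summaries such as the mean
print("mean income:", mean(incomes))

# Qualitative data support only counting / frequency tables;
# a "mean occupation" would be meaningless
print("occupation counts:", Counter(occupations))
```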
Quantitative and qualitative data provide different outcomes, and are often used together to get a
full picture of a population. For example, if data are collected on annual income (quantitative),
occupation data (qualitative) could also be gathered to get more detail on the average annual
income for each type of occupation.
Quantitative and qualitative data can be gathered from the same data unit depending on whether
the variable of interest is numerical or categorical. For example:
Example 1: An Oil Painting

Qualitative data:
- blue/green colour, gold frame
- smells old and musty
- texture shows brush strokes of oil paint
- peaceful scene of the country
- masterful brush strokes

Quantitative data:
- picture is 10" by 14"
- with frame, 14" by 18"
- weighs 8.5 pounds
- surface area of painting is 140 sq. in.
- cost Rs 5,000
Example 2

Data unit    Numeric variable                               Quantitative data    Categorical variable                           Qualitative data
A person     "How many children do you have?"               4 children           "In which country were your children born?"    India
             "How much do you earn?"                        Rs. 50,000 p.a.      "What is your occupation?"                     Banker
             "How many hours do you work?"                  45 hours per week    "Do you work full-time or part-time?"          Full-time
A house      "Plinth area of your house?"                   1000 square metres   "In which city or town is the house located?"  Thrissur
A business   "How many workers are currently employed?"     110 employees        "What is the industry of the business?"        Textile retail
A farm       "How many milk cows are located on the farm?"  36 cows              "What is the main activity of the farm?"       Dairy
Quantitative data can also be Discrete data or Continuous data.
Discrete data can only take certain values (like whole numbers).
Continuous data can take any value (within a range).
Put simply: Discrete data is counted, Continuous data is measured.
See the following example.

Example: What do we know about Blacky, your pet dog?

Qualitative:
He is brown and black. He has long hair. He has lots of energy.

Quantitative:
Discrete: He has 4 legs. He has 2 brothers.
Continuous: He weighs 25.5 kg. He is 565 mm tall.
3.2 Cross Section and Time Series Data
Time series data is data measured as a sequence of observations at particular points in time.
The BSE SENSEX is an example of time series data, as the data collected is listed at a certain
time on each day. Line charts are used to plot time series data and
these enable the viewer of the data to analyze the data with ease, and to compare and contrast the
differences between one set of data at a particular time and another set of data at a particular
time.
Other examples of time-series would be staff numbers at a particular institution taken on a
monthly basis in order to assess staff turnover rates, weekly sales figures of ice-cream sold
during a holiday period at a seaside resort and the number of students registered for a particular
course on a yearly basis. All of the above would be used to forecast likely data patterns in the
future.
Cross-section data is data that is collected by analyzing different sets of data from different
sources at a particular time. This type of statistical information is useful when observing habits
within a country, such as eating habits, voting habits, and drinking habits. Applying a certain set
of questions to a certain number of people in different areas, and collating the information to
achieve a realistic picture that is relevant to a nation or an area as a whole makes this data useful.
Another example of cross-section data is business data collected to see the popularity of certain
products at a particular time, and this is known as market research.
Other examples: if one considered the closing prices of a group of 20 different tech stocks of
BSE on September 15, 2014 this would be an example of cross-sectional data. Note that the
underlying population should consist of members with similar characteristics. For example,
suppose you are interested in how much companies spend on research and development
expenses. Firms in some industries such as retail spend little on research and development
(R&D), while firms in industries such as technology spend heavily on R&D. Therefore, it's
inappropriate to summarize R&D data across all companies. Rather, analysts should summarize
R&D data by industry, and then analyze the data in each industry group. Other examples of
cross-sectional data would be: an inventory of all ice creams in stock at a particular supermarket,
a list of grades obtained by a class of students for a specific test.
The major difference between time series data and cross-section data is that the former focuses
on results gained over an extended period of time, often within a small area, whilst the latter
focuses on the information received from surveys and opinions at a particular time, in various
locations, depending on the information sought.
4. FREQUENCY DISTRIBUTIONS: ABSOLUTE AND RELATIVE

Frequency distribution is a specification of the way in which the frequencies of members of a
population are distributed according to the values of the variates which they exhibit. For
observed data the distribution is usually specified in tabular form, with some grouping for
continuous variates.
The frequency distribution or frequency table is a tabular organization of statistical data,
assigning to each piece of data its corresponding frequency.
Types of Frequencies
(a) Absolute Frequency
The absolute frequency is the number of times that a certain value appears in a statistical study.
It is denoted by fi.
The sum of the absolute frequencies is equal to the total number of data, which is denoted by N:

f1 + f2 + f3 + ⋯ + fn = N

This sum is commonly denoted using the Greek letter Σ (capital sigma), which represents 'sum':

Σfi = N

(b) Relative Frequency

The relative frequency is the quotient between the absolute frequency of a certain value and the
total number of data. It can be expressed as a percentage and is denoted by ni:

ni = fi / N

The sum of the relative frequencies is equal to 1.
(c) Cumulative Frequency
The cumulative frequency is the sum of the absolute frequencies of all values less than or equal
to the value considered.
It is denoted by F i.
(d) Relative Cumulative Frequency
The relative cumulative frequency is the quotient between the cumulative
frequency of a particular value and the total number of data. It can be expressed as
a percentage.
Example
A city has recorded the following daily maximum temperatures during a month:
32, 31, 28, 29, 33, 32, 31, 30, 31, 31, 27, 28, 29, 30, 32, 31, 31, 30, 30, 29, 29, 30, 30, 31, 30, 31,
34, 33, 33, 29, 29.
Let us form a table based on this information. In the first column of the table the values are
ordered from lowest to highest, in the second column is the tally count of the number of times
each value has occurred, and in the remaining columns are the absolute frequency (fi), the
cumulative frequency (Fi), the relative frequency (ni) and the relative cumulative frequency (Ni).

xi   Count      fi   Fi   ni      Ni
27   I           1    1   0.032   0.032
28   II          2    3   0.065   0.097
29   IIII I      6    9   0.194   0.290
30   IIII II     7   16   0.226   0.516
31   IIII III    8   24   0.258   0.774
32   III         3   27   0.097   0.871
33   III         3   30   0.097   0.968
34   I           1   31   0.032   1
Total           31                1
Discrete variables are used for this type of frequency table.
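The computation behind this table can be sketched in Python. This is a minimal sketch; the helper name `frequency_table` is our own, not from the text:

```python
from collections import Counter

# Daily maximum temperatures recorded during the month (31 observations)
temps = [32, 31, 28, 29, 33, 32, 31, 30, 31, 31, 27, 28, 29, 30, 32, 31,
         31, 30, 30, 29, 29, 30, 30, 31, 30, 31, 34, 33, 33, 29, 29]

def frequency_table(data):
    """Return rows (xi, fi, Fi, ni, Ni): the value, its absolute,
    cumulative, relative and relative cumulative frequencies."""
    counts = Counter(data)
    N = len(data)
    rows, cum = [], 0
    for x in sorted(counts):
        fi = counts[x]
        cum += fi                                  # cumulative frequency Fi
        rows.append((x, fi, cum, round(fi / N, 3), round(cum / N, 3)))
    return rows

table = frequency_table(temps)
for row in table:
    print(row)
```

Running this reproduces each row of the table above, e.g. (27, 1, 1, 0.032, 0.032) for the first value.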
5. GRAPHS OF FREQUENCY DISTRIBUTION

A frequency distribution can be represented graphically in any of the following ways.
The most commonly used graphs and curves for representing a frequency distribution are:
Bar Charts
Histogram
Frequency Polygon
Smoothened frequency curve
Ogives or cumulative frequency curves

(a) Bar Charts

A bar chart is used to present categorical, quantitative or discrete data.
The information is presented on a coordinate axis. The values of the variable are represented on
the horizontal axis and the absolute, relative or cumulative frequencies are represented on the
vertical axis.
The data is represented by bars whose height is proportional to the frequency.
Example
A study has been conducted to determine the blood group of a class of 20 students. The results
are as follows:

Blood Group   fi
A              6
B              4
AB             1
O              9
Total         20
Based on this we can draw a bar chart as follows.
Step 1: Number the Y-axis with the dependent variable. The dependent variable is the one being
tested in an experiment. In this sample question, the study wanted to know how many students
belonged to each blood group. So the number of students is the dependent variable. So it is
marked on the Y-axis.
Step 2: Label the X-axis with what the bars represent. For this problem, label the X-axis "Blood
Group" and then label the Y-axis with what it represents: "number of students".
Step 3: Draw your bars. The height of each bar should be even with the correct number on the Y-
axis. Don't forget to label each bar under the X-axis.
Finally, give your graph a name. For this problem, call the graph 'Blood group of students'.
[Bar chart: 'Blood group of students' - number of students (0 to 10) on the Y-axis against blood groups A, B, AB and O on the X-axis]
Histogram: A histogram is a set of vertical bars whose areas are proportional to the frequencies
represented. While constructing a histogram, the variable is always taken on the X-axis and the
frequencies on the Y-axis. The width of the bars in the histogram will be proportional to the
class interval. The bars are drawn without leaving space between them. A histogram generally
represents a continuous frequency distribution. If the class intervals are uniform for a frequency
distribution, then the width of all the bars will be equal.
Example:

[Histogram: marks on the X-axis (0 to 30, in class intervals of 5) and frequencies on the Y-axis (0 to 50), drawn as adjacent bars whose heights correspond to the class frequencies]
Frequency Polygon (or line graphs)
Frequency Polygon is a graph of frequency distribution. Frequency polygons are a
graphical device for understanding the shapes of distributions. They serve the same purpose as
histograms, but are especially helpful for comparing sets of data.
To create a frequency polygon, start just as for histograms, by choosing a class interval. Then
draw an X-axis representing the values of the scores in your data. Mark the middle of each class
interval with a tick mark, and label it with the middle value represented by the class. Draw the Y-
axis to indicate the frequency of each class. Place a point in the middle of each class interval at
the height corresponding to its frequency. Finally, connect the points. You should include one
class interval below the lowest value in your data and one above the highest value. The graph
will then touch the X-axis on both sides.
Another method of constructing a frequency polygon is to take the mid points of the various class
intervals, plot the frequency corresponding to each mid point, and join all these points by
straight lines. Here we need not construct a histogram. For example:

Marks    No. of students
10-15     5
15-20    20
20-25    47
25-30    38
30-35    10
Example:
Draw a frequency polygon to the following frequency distribution
Marks:            10-20  20-30  30-40  40-50  50-60  60-70  70-80
No. of Students:    8     13     19     28     19     11      9

[Frequency polygon: points plotted at the mid point of each class against its frequency and joined by straight lines, with marks on the X-axis and number of students on the Y-axis]
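The plotting points of the polygon are the class mid points paired with their frequencies. A minimal sketch of that step, using the distribution from this example:

```python
# Class intervals and frequencies from the frequency polygon example
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
freqs   = [8, 13, 19, 28, 19, 11, 9]

# Each point of the polygon is plotted at the mid point of its class interval
midpoints = [(lo + hi) / 2 for lo, hi in classes]
points = list(zip(midpoints, freqs))
print(points)   # the modal class 40-50, for instance, gives the point (45.0, 28)
```

Joining these points by straight lines, with one empty class added at each end so the graph touches the X-axis, produces the polygon.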
Frequency Curves
Frequency curves are derived from frequency polygons. A frequency curve is obtained by
joining the points of the frequency polygon by a freehand smoothed curve. Unlike the frequency
polygon, where the points were joined by straight lines, we make use of freehand joining of those
points in order to get a smoothed frequency curve. It is used to remove the ruggedness of the
polygon and to present it in a good form or shape. We smoothen the angularities of the polygon
only without making any basic change in the shape of the curve. In this case also the curve
begins and ends at base line, as is in case of polygon. Area under the curve must remain almost
the same as in the case of polygon.
Example:

Marks:            10-20  20-30  30-40  40-50  50-60  60-70
No. of Students:    5      8     15     20     12      7

[Frequency curve: the plotted points joined by a smooth freehand curve, with marks on the X-axis and number of students on the Y-axis]
Difference between frequency polygon and frequency curve

A frequency polygon is drawn for a frequency distribution of discrete or continuous nature.
Frequency curves are drawn for continuous frequency distributions. A frequency polygon is
obtained by joining the plotted points by straight lines. Frequency curves are smooth; they are
obtained by joining the plotted points by a smooth curve.
Ogives (Cumulative frequency curve)

When a frequency distribution is cumulated, we get a cumulative frequency distribution. A
series can be cumulated in two ways. One method is to add the frequencies of all the preceding
classes to the frequency of a class; this series is called the less than cumulative series. The other
method is to add the frequencies of the succeeding classes to the frequency of a class; this is
called the more than cumulative series. Smoothed frequency curves drawn for these two
cumulative series are called cumulative frequency curves or ogives. Thus, corresponding to the
two cumulative series, we get two ogive curves, known as the less than ogive and the more than
ogive.

The less than ogive is obtained by plotting cumulated frequencies against the upper limits of the
class intervals. The more than ogive is obtained by plotting cumulated frequencies against the
lower limits of the class intervals. The less than ogive is an increasing curve, sloping upwards
from left to right. The more than ogive is a decreasing curve, sloping downwards from left to right.
Example:
Form the less than and more than cumulative frequency distributions for the following frequency
distribution.

Marks    No. of Students
10-20     4
20-30     6
30-40    10
40-50    20
50-60    18
60-70     2

Cumulative frequency distribution:
Marks less than   No. of Students      Marks more than   No. of Students
10                 0                   10                 60
20                 4                   20                 56
30                10                   30                 50
40                20                   40                 40
50                40                   50                 20
60                58                   60                  2
70                60                   70                  0

[Ogive graph: the rising less than ogive and the falling more than ogive plotted for the above distribution, with marks on the X-axis and cumulated number of students on the Y-axis]
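The two cumulative series can be built by cumulating the frequencies from opposite ends. A minimal sketch, assuming the frequency distribution of the example above:

```python
from itertools import accumulate

# Frequency distribution underlying the ogive example
lower_limits = [10, 20, 30, 40, 50, 60]
upper_limits = [20, 30, 40, 50, 60, 70]
freqs = [4, 6, 10, 20, 18, 2]

# Less than series: cumulate from the first class; plotted against upper limits
less_than = list(accumulate(freqs))               # [4, 10, 20, 40, 58, 60]

# More than series: cumulate from the last class; plotted against lower limits
more_than = list(accumulate(freqs[::-1]))[::-1]   # [60, 56, 50, 40, 20, 2]

print(list(zip(upper_limits, less_than)))
print(list(zip(lower_limits, more_than)))
```

These values match the cumulative frequency distribution tabulated above.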
Pie Diagrams

One of the most common ways to represent data graphically is called a pie chart. It gets its name
by how it looks, just like a circular pie that has been cut into several slices. This kind of graph is
helpful when graphing qualitative data, where the information describes a trait or attribute and is
not numerical. Each trait corresponds to a different slice of the pie. By looking at all of the pie
pieces, you can compare how much of the data fits in each category.
Pie charts are a form of an area chart that are easy to understand with a quick look. They show
the part of the total (percentage) in an easy-to-understand way. Pie charts are useful tools that
help you figure out and understand polls, statistics, complex data, and income or spending. They
are so wonderful because everybody can see what is going on.
Pie diagrams are used when the aggregate and their division are to be shown together.
The aggregate is shown by means of a circle and the divisions by the sectors of the circle. For
example, the total expenditure of a government distributed over different departments like
agriculture, irrigation, industry, transport etc. can be shown in a pie diagram. In constructing
a pie diagram the various components are first expressed as percentages and then each
percentage is multiplied by 3.6, giving the angle for each component. The circle is then divided
into sectors such that the angles of the components and the angles of the sectors are equal;
therefore one sector represents one component. Usually the components are shown with their
angles in descending order.
Example:
You conducted a survey as part of a project work. You had taken a sample of 20 individuals and
you want to represent their occupation using a pie chart.
First, put your data into a table, then add up all the values to get a total:
Farmer Business Teacher Bank Driver TOTAL
4 5 6 1 4 20
Calculate the angle of each sector using the formula: angle = (value / total) × 360°.

First divide each value by the total and multiply by 100 to get a percent:

Farmer       Business     Teacher      Bank         Driver       TOTAL
4            5            6            1            4            20
4/20 = 20%   5/20 = 25%   6/20 = 30%   1/20 = 5%    4/20 = 20%   100%
Now you need to figure out how many degrees for each ‘pie slice’ (correctly
called a sector).
A Full Circle has 360 degrees, so we do this calculation:
Farmer        Business      Teacher       Bank          Driver        TOTAL
4             5             6             1             4             20
4/20 = 20%    5/20 = 25%    6/20 = 30%    1/20 = 5%     4/20 = 20%    100%
4/20 × 360°   5/20 × 360°   6/20 × 360°   1/20 × 360°   4/20 × 360°   360°
= 72°         = 90°         = 108°        = 18°         = 72°
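The percentage and angle calculations above can be sketched in a few lines of Python:

```python
# Occupation counts from the sample survey of 20 individuals
counts = {"Farmer": 4, "Business": 5, "Teacher": 6, "Bank": 1, "Driver": 4}
total = sum(counts.values())                 # 20 respondents

# percent = (value / total) * 100, angle = (value / total) * 360
percents = {k: v / total * 100 for k, v in counts.items()}
angles = {k: v / total * 360 for k, v in counts.items()}

print(percents)
print(angles)
```

The angles necessarily sum to 360°, which is a useful check on the arithmetic.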
Draw a circle using a pair of compasses.
Use a protractor to draw the angle for each sector.
Label the circle graph and all its sectors.

[Pie chart: 'SAMPLE POPULATION BY OCCUPATION' with sectors Farmer 20%, Business 25%, Teacher 30%, Bank 5%, Driver 20%]
Pie charts are to be used with qualitative data; however, there are some limitations in using them.
If there are too many categories, then there will be a multitude of pie pieces. Some of these are
likely to be very skinny, and can be difficult to compare to one another.
If we want to compare different categories that are close in size, a pie chart does not always help
us to do this. If one slice has central angle of 30 degrees, and another has a central angle of 29
degrees, then it would be very hard to tell at a glance which pie piece is larger than the other.
6. SUMMARY MEASURE OF DISTRIBUTIONS
We will discuss three sets of summary measures namely Measures of Central Tendency,
Variability and Shape. These are called summary measures because they summarise the data. For
example, one summary measure very familiar to you is the mean. (The mean comes under measures
of central tendency.) If we take the mean mark of students in a class for a subject, it gives you a
rough idea of what the marks are like. Thus, based on just one summary value, we get an idea of
the entire data.
6.1 Measures of Central Tendency

A measure of central tendency is a measure that tells us where the middle of a bunch of data lies.
A measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data. As such, measures of central tendency are
sometimes called measures of central location. They are also classed as summary statistics. The
mean (often called the average) is most likely the measure of central tendency that you are most
familiar with, but there are others, such as the median and the mode.
Mean: Mean is the most common measure of central tendency. It is simply the sum of the
numbers divided by the number of numbers in a set of data. This is also known as average.
Median: Median is the number present in the middle when the numbers in a set of data are
arranged in ascending or descending order. If the number of numbers in a data set is even, then
the median is the mean of the two middle numbers.
Mode: Mode is the value that occurs most frequently in a set of data.
The mean, median and mode are all valid measures of central tendency, but under different
conditions, some measures of central tendency become more appropriate to use than others. In
the following sections, we will look at the mean, mode and median, and learn how to calculate
them.
We will also discuss Geometric Mean and Harmonic Mean.
Requisites of a good averageSince an average is a single value representing a group of values, it is desired that such a
value satisfies the following properties.
1. Easy to understand: Since statistical methods are designed to simplify complexity, an average should be easy to understand.
2. Simple to compute: A good average should be easy to compute so that it can be used
widely. However, though ease of computation is desirable, it should not be sought at the
expense of accuracy; that is, if in the interest of greater accuracy the use of a more difficult
average is desirable, it should be preferred.
3. Based on all items:- The average should depend upon each and every item of the series,
so that if any of the items is dropped, the average itself is altered.
4. Not unduly affected by extreme observations: Although each and every item should
influence the value of the average, none of the items should influence it unduly. If one or two
very small or very large items unduly affect the average, i.e., either increase or reduce its
value, the average cannot be really typical of the entire series. In other words, extremes may
distort the average and reduce its usefulness.
5. Rigidly defined: An average should be properly defined so that it has only one
interpretation. It should preferably be defined by algebraic formula so that if different people
compute the average from the same figures, they all get the same answer. The average should not
depend upon the personal prejudice and bias of the investigator; otherwise the results can be
misleading.
6. Capable of further algebraic treatment: We should prefer to have an average that could be
used for further statistical computation so that its utility is enhanced. For example, if we are
given the data about the average income and number of employees of two or more factories, we
should be able to compute the combined average.
7. Sampling stability: Last, but not least, we should prefer to get a value which has what
statisticians call "sampling stability". This means that if we pick 10 different groups of college
students and compute the average of each group, we should expect to get approximately the
same value. It does not mean, however, that there can be no difference in the values from
different samples. There may be some differences, but those averages for which this difference
is less are considered better than those for which the difference is more.
(a) Mean (Arithmetic mean / average)

The mean (or average) is the most popular and well known measure of central tendency. It can
be used with both discrete and continuous data, although its use is most often with continuous
data (see our Types of Variable guide for data types). The mean is equal to the sum of all the
values in the data set divided by the number of values in the data set. So, if we have n values in a
data set and they have values x1, x2, ..., xn, the sample mean, usually denoted by x̄ (pronounced
"x bar"), is:

x̄ = (x1 + x2 + ⋯ + xn) / n

This formula is usually written in a slightly different manner using the Greek capital letter Σ,
pronounced "sigma", which means "sum of":

x̄ = Σx / n
Example
In a survey you collected information on monthly spending on mobile recharge by 20 students, of
which 10 are male and 10 female. We illustrate below how the data is used to find the mean.

          1    2    3    4    5    6    7    8    9   10   Total   Mean
Male    250  150  100  175  150  250  200  200  150  170   1795  179.50
Female  100  150  150  100  200  150  125  150  130  180   1435  143.50
Both    350  300  250  275  350  400  325  350  280  350   3230  161.50

First we find the mean for male students. Here Σx = 1795 and n = 10, so 1795/10 = 179.5.
Similarly, for female students, Σx = 1435 and n = 10, so 1435/10 = 143.5. We also find the mean
for male and female students taken together: Σx = 3230 and n = 20, so 3230/20 = 161.50.
Based on the above we can make certain observations. Male students spend Rs. 179.50 on an
average in a month for mobile recharge. Female students spend Rs. 143.50. We may conclude
that male students spend more on monthly mobile recharges. As a researcher, you may now use
this information to make further studies as to why this is so. What are the factors that make male
students to spend more on mobile recharges. We have also calculated the average for all students
taken together. It is Rs. 161.50. Thus we observe that male students spend more than the average
for 'all students' while female students spend less than the average for 'all students'.

Mean can also be calculated using another method, called the shortcut method, as explained below.
Short cut method: The arithmetic mean can also be calculated by short cut method. This method
reduces the amount of calculation. It involves the following steps
i. Assume any one value as an assumed mean, which is also known as working mean
or arbitrary average (A).
ii. Find out the difference of each value from the assumed mean
(d = X-A).
iii. Add all the deviations (∑d).
iv. Apply the formula

X̄ = A + ∑d / n

Where X̄ = mean, ∑d = sum of deviations from the assumed mean, A = assumed mean, and n = number of observations.
Example:
Calculate arithmetic mean
Roll No : 1   2   3   4   5   6
Marks   : 40  50  55  78  58  60
Roll Nos. Marks d = X - 55
1 40 -15
2 50 -5
3 55 0
4 78 23
5 58 3
6 60 5
∑d = 11
X̄ = A + ∑d / n = 55 + 11/6 = 55 + 1.83 = 56.83
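The short cut method can be sketched in Python as follows (an illustrative sketch, not part of the original text), using the assumed mean A = 55 from the example:

```python
# Short cut (assumed mean) method for the marks 40, 50, 55, 78, 58, 60.
marks = [40, 50, 55, 78, 58, 60]
A = 55                               # assumed mean (working mean)
deviations = [x - A for x in marks]  # d = X - A
x_bar = A + sum(deviations) / len(marks)
print(round(x_bar, 2))  # 56.83
```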
Calculation of arithmetic mean - Discrete series
To find the total of the items in a discrete series, the frequency of each value is multiplied by
the respective size. The values so obtained are totalled up. This total is then divided by the total
number of frequencies to obtain the arithmetic mean.
Steps
1. Multiply each size of the item by its frequency (fX).
2. Add all the fX values (∑fX).
3. Divide ∑fX by the total frequency (N).

The formula is X̄ = ∑fX / N

Example
X : 1   2   3   4   5
f : 10  12  8   7   11
Solution

X    f    fX
1 10 10
2 12 24
3 8 24
4 7 28
5 11 55
N = 48        ∑fX = 141

X̄ = ∑fX / N = 141/48 = 2.94
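A small Python sketch of the discrete-series calculation (illustrative only, not part of the original text):

```python
# Discrete series: mean = sum(f * X) / N, where N is the total frequency.
X = [1, 2, 3, 4, 5]
f = [10, 12, 8, 7, 11]
N = sum(f)                                 # 48
fx = sum(xi * fi for xi, fi in zip(X, f))  # 141
print(round(fx / N, 2))  # 2.94
```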
Short cut Method
Steps:
Take the value of assumed mean (A)
Find out the deviation of each value from A, i.e. d = X − A.
Multiply d with respective frequencies (fd)
Add up the products (∑fd). Apply the formula

X̄ = A ± ∑fd / N

Continuous series
In a continuous frequency distribution, the value of each individual observation is unknown. It is therefore assumed that the frequency of each class interval is concentrated at its centre, so the mid point of each class interval has to be found out. In a continuous frequency distribution, the mean can be calculated by any of the following methods.
a. Direct method
b. Short cut method
c. Step deviation method

a. Direct Method

Steps:
1. Find out the mid value of each group or class. The mid value is obtained by adding the lower and upper limits of the class and dividing the total by two (symbol: m).
2. Multiply the mid value of each class by the frequency of the class. In other words, m will be multiplied by f.
3. Add up all the products (∑fm).
4. Divide ∑fm by N.

The formula is X̄ = ∑fm / N
Example:
From the following, find out the mean profit.

Profit (Rs.)/Shop: 100-200  200-300  300-400  400-500  500-600  600-700  700-800
No. of shops:      10       18       20       26       30       28       18
Solution
Profit (Rs.)   Mid point (m)   No. of shops (f)   fm
100-200        150             10                 1500
200-300        250             18                 4500
300-400        350             20                 7000
400-500        450             26                 11700
500-600        550             30                 16500
600-700        650             28                 18200
700-800        750             18                 13500
                               ∑f = 150           ∑fm = 72900

X̄ = ∑fm / N = 72900/150 = 486
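The direct method for the shop-profit example can be sketched in Python (an illustrative sketch, not part of the original text):

```python
# Continuous series, direct method: mean = sum(f * m) / N,
# where m is the class mid point.
classes = [(100, 200), (200, 300), (300, 400), (400, 500),
           (500, 600), (600, 700), (700, 800)]
f = [10, 18, 20, 26, 30, 28, 18]
mids = [(lo + hi) / 2 for lo, hi in classes]
N = sum(f)                                    # 150
fm = sum(fi * mi for fi, mi in zip(f, mids))  # 72900.0
print(fm / N)  # 486.0
```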
b) Short cut method

Steps:
1. Find the mid value of each class or group (m).
2. Assume any one of the mid values as an average (A).
3. Find out the deviation of the mid value of each class from the assumed mean (d = m − A).
4. Multiply the deviation of each class by its frequency (fd).
5. Add up the products of step 4 (∑fd).
6. Apply the formula

X̄ = A + ∑fd / N
Example (solving the last problem): Calculation of Mean
Profit (Rs.)   m     d = m − 450   f    fd
100-200 150 -300 10 -3000
200-300 250 -200 18 -3600
300-400 350 -100 20 -2000
400-500 450 0 26 0
500-600 550 100 30 3000
600-700 650 200 28 5600
700-800        750   300           18   5400
                     ∑f = 150           ∑fd = 5400

X̄ = A + ∑fd / N = 450 + 5400/150 = 450 + 36 = 486

c) Step deviation method
The calculations of the short cut method discussed above can be reduced to a great extent by
adopting the step deviation method.
Steps:
1. Find out the mid value of each class or group (m)
2. Assume any one of the mid value as an average (A)
3. Find out the deviations of the mid value of each from the assumed mean (d)
4. Deviations are divided by a common factor (d')
5. Multiply the d' of each class by its frequency (f d')
6. Add up the products (∑fd′).
7. Then apply the formula

X̄ = A + (∑fd′ / N) × c, where c = common factor
Example:
Calculate mean for the last problem
Solution
Profit    m     f    d      d'   fd'
100-200 150 10 -300 -3 -30
200-300 250 18 -200 -2 -36
300-400 350 20 -100 -1 -20
400-500 450 26 0 0 0
500-600 550 30 100 1 30
600-700 650 28 200 2 56
700-800   750   18   300    3    54
                ∑f = 150         ∑fd′ = 54

X̄ = A + (∑fd′ / N) × c = 450 + (54/150) × 100 = 450 + (0.36 × 100) = 486
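The step deviation method can be sketched in Python (an illustrative sketch, not part of the original text), using A = 450 and the class width 100 as the common factor:

```python
# Step deviation method for the shop-profit data:
# mean = A + (sum(f * d') / N) * c, with d' = (m - A) / c.
mids = [150, 250, 350, 450, 550, 650, 750]
f = [10, 18, 20, 26, 30, 28, 18]
A, c = 450, 100                         # assumed mean and common factor
d_prime = [(m - A) // c for m in mids]  # -3, -2, -1, 0, 1, 2, 3
fd = sum(fi * di for fi, di in zip(f, d_prime))  # 54
x_bar = A + fd * c / sum(f)
print(x_bar)  # 486.0
```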
The mean is essentially a model of your data set. It is the value that is most common. You will
notice, however, that the mean is not often one of the actual values that you have observed in
your data set. However, one of its important properties is that it minimises error in the prediction
of any one value in your data set. That is, it is the value that produces the lowest amount of error
from all other values in the data set.
An important property of the mean is that it includes every value in your data set as part of the
calculation. In addition, the mean is the only measure of central tendency where the sum of the
deviations of each value from the mean is always zero.
We complete our discussion on arithmetic mean by listing the merits and demerits of it.
Merits:
It is rigidly defined.
It is easy to calculate and simple to follow.
It is based on all the observations.
It is determined for almost every kind of data.
It is finite and not indefinite.
It is readily put to algebraic treatment.
It is least affected by fluctuations of sampling.
Demerits:
The arithmetic mean is highly affected by extreme values.
It cannot average the ratios and percentages properly.
It is not an appropriate average for highly skewed distributions.
It cannot be computed accurately if any item is missing.
The mean sometimes does not coincide with any of the observed value.
We elaborate on only one of the demerits for your better understanding. The first demerit says
the arithmetic mean is highly affected by extreme values. What does this mean? See the
following example.
Consider the following table which gives information on the marks obtained by students in a test.
Student 1 2 3 4 5 6 7 8 9 10
Mark 15 18 16 14 15 15 12 17 90 95
The mean mark for these ten students is 30.7. However, inspecting the raw data suggests that this
mean value might not be the best way to accurately reflect the typical mark obtained by a
student, as most students have marks in the 12 to 18 range. Here we see that the mean is being
affected by the two large figures 90 and 95. This shows that arithmetic mean is highly affected
by extreme values.
Therefore, in this situation, we would like to have a better measure of central tendency. As we
will find out later, taking the median would be a better measure of central tendency in this
situation.
Weighted Mean

The simple arithmetic mean gives equal importance to all items. Sometimes the items in a
series may not have equal importance. So the simple arithmetic mean is not suitable for those
series, and a weighted average will be appropriate.
Weighted means are obtained by taking in to account these weights (or importance).
Each value is multiplied by its weight and sum of these products is divided by the total weight to
get weighted mean.
Weighted average often gives a fair measure of central tendency. In many cases it is
better to have weighted average than a simple average. It is invariably used in the following
circumstances:
1. When the importance of all items in a series is not equal, we associate weights with the items.
2. For comparing the average of one group with the average of another group, when the frequencies in the two groups are different, weighted averages are used.
3. When ratios, percentages and rates are to be averaged, the weighted average is used.
4. It is also used in the calculation of birth and death rates, index numbers, etc.
5. When the average of a number of series is to be found out together, the weighted average is used.
Formula: Let x1, x2, x3, ..., xn be n values with corresponding weights w1, w2, w3, ..., wn. Then the weighted mean is
X̄w = (w1x1 + w2x2 + ... + wnxn) / (w1 + w2 + ... + wn) = ∑wx / ∑w
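The weighted-mean formula can be sketched in Python as follows. The marks and weights below are hypothetical, chosen only to illustrate the computation; they do not come from the text:

```python
# Weighted mean: each value is multiplied by its weight, and the sum of
# these products is divided by the total weight.
values = [60, 75, 90]   # hypothetical marks
weights = [2, 3, 5]     # hypothetical weights (e.g. course credits)
w_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
print(w_mean)  # 79.5
```

Note that if all weights are equal, the weighted mean reduces to the simple arithmetic mean.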
(b) Median
The median is also a frequently used measure of central tendency. The median is the midpoint of
a distribution: the same number of data points are above the median as below it. The median is
the middle score for a set of data that has been arranged in order of magnitude.
The median is determined by sorting the data set from lowest to highest values and taking the
data point in the middle of the sequence. There is an equal number of points above and below the
median. For example, in the data 7,8,9,10,11, the median is 9; there are two data points greater
than this value and two data points less than this value. Thus to find the median, we arrange the
observations in order from smallest to largest value. If there is an odd number of observations,
the median is the middle value.
If there is an even number of observations, the median is the average of the two middle values.
Thus, the median of the numbers 2, 4, 7, 12 is (4+7)/2 = 5.5.
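The two cases (odd and even number of observations) can be sketched in Python (an illustrative sketch, not part of the original text):

```python
# Median: middle value of the sorted data; with an even number of
# observations, the average of the two middle values.
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]          # odd count: the middle value
    return (s[mid - 1] + s[mid]) / 2   # even count: average of the two middle values

print(median([7, 8, 9, 10, 11]))  # 9
print(median([2, 4, 7, 12]))      # 5.5
```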
In certain situations the mean and median of the distribution will be the same, and in some
situations it will be different. For example, in the data 1,2,3,4,5 the median is 3; there are two
data points greater than this value and two data points less than this value. In this case, the
median is equal to the mean. But consider the data 1,2,3,4,10. In this dataset, the median still is
three, but the mean is equal to 4.
The median can be determined for ordinal data as well as interval and ratio data. Unlike the
mean, the median is not influenced by outliers at the extremes of the data set. For this reason, the
median often is used when there are a few extreme values that could greatly influence the mean
and distort what might be considered typical. For data which is very skewed, the median often is
used instead of the mean.
Calculation of Median: Discrete series

Steps:
• Arrange the data in ascending or descending order.
• Find the cumulative frequencies.
• Apply the formula: Median = size of the (N + 1)/2 th item.

Example: Calculate the median from the following

Size of shoes: 5   5.5  6   6.5  7   7.5  8
Frequency    : 10  16   28  15   30  40   34
Solution
Size   f    Cumulative frequency (cf)
5      10   10
5.5    16   26
6      28   54
6.5    15   69
7      30   99
7.5    40   139
8      34   173
Median = size of the (N + 1)/2 th item

N = 173, so (N + 1)/2 = 174/2 = 87. The 87th item falls within the cumulative frequency 99, so the corresponding size is 7.

Median = 7

Calculation of median – Continuous frequency distribution

Steps:
• Find out N/2.
• Find out the class in which the median lies.
• Apply the formula

Median = L + (h/f)(N/2 − c)

Where L = lower limit of the median class
h = class interval of the median class
f = frequency of the median class
N = ∑f, the total frequency
c = cumulative frequency of the class preceding the median class
Example: Calculate median from the following data
Age in years : Below 10  Below 20  Below 30  Below 40  Below 50  Below 60  Below 70  70 and over
No. of persons:    2         5         9        12        14        15       15.5      15.6
Solution:
First we have to convert the distribution into a continuous frequency distribution, as in the following table, and then compute the median.
Age in years No. of persons (f) Cumulative frequency (cf) – less than
0-10 2 2
10-20 5-2=3 5
20-30 9-5=4 9
30-40 12-9=3 12
40-50 14-12=2 14
50-60 15-14=1 15
60-70 15.5-15=0.5 15.5
70 and above 15.6-15.5=0.1 15.6
Here N = ∑f = 15.6, so N/2 = 15.6/2 = 7.8.

The cumulative frequency (c.f.) just greater than 7.8 is 9, so the corresponding class 20-30 is the median class. Thus L = 20, h = 10, f = 4, N = 15.6 and c = 5.

Using the formula:

Median = L + (h/f)(N/2 − c) = 20 + (10/4)(7.8 − 5) = 20 + 2.5 × 2.8 = 20 + 7 = 27

So the median age is 27.
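The continuous-series median formula can be sketched in Python, plugging in the values from the age example (an illustrative sketch, not part of the original text):

```python
# Median of a continuous frequency distribution, M = L + (h/f)(N/2 - c),
# with the values from the age example (median class 20-30).
L, h, f_med, N, c = 20, 10, 4, 15.6, 5
med = L + (h / f_med) * (N / 2 - c)
print(round(med, 2))  # 27.0
```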
The Mean vs. the Median

As measures of central tendency, the mean and the median each have advantages and disadvantages. Some pros and cons of each measure are summarized below.

The median may be a better indicator of the most typical value if a set of scores has an outlier. An outlier is an extreme value that differs greatly from the other values.

However, when the sample size is large and does not include outliers, the mean score usually provides a better measure of central tendency.
(c) Mode
The mode of a data set is the value that occurs with the most frequency. This measurement is
crude, yet is very easy to calculate. Suppose that a history class of eleven students scored the
following (out of 100) on a test: 60, 64, 70, 70, 70, 75, 80, 90, 95, 95, 100. We see that 70 is in
the list three times, 95 occurs twice, and each of the other scores is listed only once. Since
70 appears in the list more than any other score, it is the mode. If there are two values that tie for
the most frequency, then the data is said to be bimodal.
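Finding the mode by counting frequencies can be sketched in Python for the test-score example (an illustrative sketch, not part of the original text):

```python
# Mode by frequency count for the test scores in the example.
from collections import Counter

scores = [60, 64, 70, 70, 70, 75, 80, 90, 95, 95, 100]
counts = Counter(scores)
top = max(counts.values())
# Collect every value tied for the highest frequency (handles multimodal data).
modes = [value for value, n in counts.items() if n == top]
print(modes)  # [70]
```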
The mode can be very useful for dealing with categorical data. For example, if a pizza shop sells 10 different types of pizza, the mode would represent the most popular type. The mode also can be used with ordinal, interval, and ratio data. However, in interval and ratio scales, the data may be spread thinly, with no data points having the same value. In such cases, the mode may not exist or may not be very meaningful.

In the case of a continuous frequency distribution, the mode is found using the formula

Mode = L + h(f1 − f0) / [(f1 − f0) + (f1 − f2)]

Rearranging, we get

Mode = L + h(f1 − f0) / (2f1 − f0 − f2)

Where
L is the lower limit of the modal class
f1 is the frequency of the modal class
f0 is the frequency of the class preceding the modal class
f2 is the frequency of the class succeeding the modal class
h is the class interval of the modal class
See the following example, where we compute the mode using the above formula (the mean and
median are also computed).
Example
Find the values of mean, mode and median from the following data.
Weight (kg)     93-97  98-102  103-107  108-112  113-117  118-122  123-127  128-132
No. of students   3      5       12       17       14        6        3        1
Solution: Since the formula for mode requires the distribution to be continuous
with ‘exclusive type’ classes, we first convert the classes into class boundaries.
Weight    Class boundaries   Mid value (X)   f    d = (X − 110)/5   fd    Less than c.f.
93-97     92.5-97.5          95              3    -3                -9     3
98-102    97.5-102.5         100             5    -2                -10    8
103-107   102.5-107.5        105             12   -1                -12    20
108-112   107.5-112.5        110             17    0                 0     37
113-117   112.5-117.5        115             14    1                 14    51
118-122   117.5-122.5        120             6     2                 12    57
123-127   122.5-127.5        125             3     3                 9     60
128-132   127.5-132.5        130             1     4                 4     61

N = ∑f = 61 and ∑fd = 8.

Mean = A + h × ∑fd / N = 110 + 5 × 8/61 = 110 + 0.66 = 110.66

Mean = 110.66 kgs.

Mode
Here the maximum frequency is 17, so the corresponding class 107.5-112.5 is the modal class. Using the formula

Mode = L + h(f1 − f0) / (2f1 − f0 − f2)

we get
Mode = 107.5 + 5(17 − 12) / (2(17) − 12 − 14) = 107.5 + 25/8 = 107.5 + 3.125 = 110.625

Hence the mode is 110.63 kgs.
Median

Use the formula Median = L + (h/f)(N/2 − c).

Here N = 61, so N/2 = 61/2 = 30.5. The cumulative frequency (c.f.) just greater than 30.5 is 37, so the corresponding class 107.5-112.5 is the median class.

Substituting the values in the median formula:

Median = 107.5 + (5/17)(30.5 − 20) = 107.5 + (5 × 10.5)/17 = 107.5 + 3.09 = 110.59

The median is 110.59 kgs.
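The grouped mode and median formulas of this worked example can be cross-checked in Python (an illustrative sketch, not part of the original text):

```python
# Cross-check of the worked example using the grouped mode and median
# formulas, with l = 107.5, h = 5, f1 = 17, f0 = 12, f2 = 14, N = 61, c = 20.
l, h = 107.5, 5
f1, f0, f2 = 17, 12, 14
mode = l + h * (f1 - f0) / (2 * f1 - f0 - f2)
print(mode)  # 110.625

N, f_med, c = 61, 17, 20
med = l + (h / f_med) * (N / 2 - c)
print(round(med, 2))  # 110.59
```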
When to use Mean, Median, and Mode

The following table summarizes the appropriate methods of determining the middle or typical value of a data set based on the measurement scale of the data.
Measurement Scale       Best Measure
Nominal (Categorical)   Mode
Ordinal                 Median
Interval                Symmetrical data: Mean; Skewed data: Median
Ratio                   Symmetrical data: Mean; Skewed data: Median
Merits and demerits of mean, median and mode
The merits and demerits of the arithmetic mean have already been discussed; please refer to that. Here we
discuss only median and mode.
Median:
The median is that value of the series which divides the group into two equal parts, one part
comprising all values greater than the median value and the other part comprising all the values
smaller than the median value.
Merits of median
(1) Simplicity: - It is a very simple measure of the central tendency of the series. In the case of
a simple statistical series, just a glance at the data is enough to locate the median value.
(2) Free from the effect of extreme values: - Unlike arithmetic mean, median value is not
destroyed by the extreme values of the series.
(3) Certainty: - Certainty is another merit of the median. The median is always a certain
specific value in the series.
(4) Real value: - The median is a real value and is a better representative value of the series
compared to the arithmetic mean, whose value may not exist in the series at all.
(5) Graphic presentation: - Besides algebraic approach, the median value can be estimated also
through the graphic presentation of data.
(6) Possible even when data is incomplete: - Median can be estimated even in the case of certain
incomplete series. It is enough if one knows the number of items and the middle item of the
series.
Demerits of median:
Following are the various demerits of median:
(1) Lack of representative character: - Median fails to be a representative measure in case of such
series the different values of which are wide apart from each other. Also, median is of limited
representative character as it is not based on all the items in the series.
(2) Unrealistic:- When the median is located somewhere between the two middle values, it
remains only an approximate measure, not a precise value.
(3) Lack of algebraic treatment: - Arithmetic mean is capable of further algebraic treatment, but
median is not. For example, multiplying the median with the number of items in the series will
not give us the sum total of the values of the series.
However, the median is quite a simple method of finding an average of a series. It is quite a
commonly used measure in the case of series related to qualitative observations, such as the
health of students.
Mode: The value of the variable which occurs most frequently in a distribution is called the
mode.
Merits of mode:
Following are the various merits of mode:
(1) Simple and popular: - The mode is a very simple measure of central tendency. Sometimes, just
a glance at the series is enough to locate the modal value. Because of its simplicity, it is a very
popular measure of central tendency.
(2) Less effect of marginal values: - Compared to the mean, the mode is less affected by marginal
values in the series. The mode is determined only by the value with the highest frequency.
(3) Graphic presentation: - The mode can be located graphically, with the help of a histogram.
(4) Best representative: - Mode is that value which occurs most frequently in the series.
Accordingly, mode is the best representative value of the series.
(5) No need of knowing all the items or frequencies: - The calculation of mode does not require
knowledge of all the items and frequencies of a distribution. In simple series, it is enough if one
knows the items with highest frequencies in the distribution.
Demerits of mode:
Following are the various demerits of mode:
(1) Uncertain and vague: - Mode is an uncertain and vague measure of the central tendency.
(2) Not capable of algebraic treatment: - Unlike mean, mode is not capable of further algebraic
treatment.
(3) Difficult: - When the frequencies of all items are identical, it is difficult to identify the modal
value.
(4) Complex procedure of grouping: - Calculation of the mode involves a cumbersome procedure of
grouping the data. If the extent of grouping changes, there will be a change in the modal value.
(5) Ignores extreme marginal frequencies: - It ignores extreme marginal frequencies. To that
extent the modal value is not a representative value of all the items in a series.

Besides, one can question the representative character of the modal value, as its calculation does
not involve all items of the series.
Exercises
1. Find the measures of central tendency for the data set 3, 7, 9, 4, 5, 4, 6, 7, and 9.
Mean = 6, median = 6 and the modes are 4, 7 and 9. Note that this data set is multimodal (it has three modes).
2. Four friends take an IQ test. Their scores are 96, 100, 106, 114. Which of the following
statements is true?
I. The mean is 103.
II. The mean is 104.
III. The median is 100.
IV. The median is 106.
(A) I only
(B) II only
(C) III only
(D) IV only
(E) None is true
The correct answer is (B). The mean score is computed from the equation:
Mean score = Σx / n = (96 + 100 + 106 + 114) / 4 = 104

Since there is an even number of scores (4 scores), the median is the average of the two middle
scores. Thus, the median is (100 + 106) / 2 = 103.
3. The owner of a shoe shop recorded the sizes of the feet of all the customers who bought shoes
in his shop in one morning. These sizes are listed below:
8 7 4 5 9 13 10 8 8 7 6 5 3 11 10 8 5 4 8 6
What is the mean of these values: 7.25
What is the median of these values: 7.5
What is the mode of these values: 8.
4. Eight people work in a shop. Their hourly wage rates of pay are:
Worker      1   2   3   4   5   6   7   8
Wage (Rs.)  4   14  6   5   4   5   4   4
Work out the mean, median and mode for the values above.
Mean = 5.75, Median = 4.50, Mode = 4.00.
Using the above findings, if the owner of the shop wants to argue that the staff are paid well,
which measure would he use? He will use the mean, because the mean shows the highest value.
Using the above findings, if the staff in the shop want to argue that they are badly paid, which
measure would they use? The staff will use the mode, as it is the lowest of the three measures of
central tendency.
5. The table below gives the number of accidents each year at a particular road junction:
Year:      1991  1992  1993  1994  1995  1996  1997  1998
Accidents:  4     5     4     2     10    5     3     5
Work out the mean, median and mode for the values above.
Mean = 4.75, Median = 4.5, Mode = 5
Using the above measures, a road safety group want to get the council to make this junction
safer.
Which measure will they use to argue for this? They will use mode as it is the figure which will
help them to justify their argument that the junction has a large number of accidents.
Using the same data, the council does not want to spend money on the road junction. Which
measure will it use to argue that safety work is not necessary? The council will use the median, as
this figure will help it to argue that the junction has a smaller number of accidents.
6. Mr Sasi grows two different types of tomato plant in his greenhouse.
One week he keeps a record of the number of tomatoes he picks from each type of plant.
Day     Mon  Tue  Wed  Thu  Fri  Sat  Sun
Type A  5    5    4    1    0    1    5
Type B  3    4    3    3    7    9    6
(a) Calculate the mean, median and mode for the Type A plants.
Mean =3, Median = 4, Mode = 5.
(b) Calculate the mean, median and mode for the Type B plants.
Mean =5, Median = 4, Mode = 3.
(c) Which measure would you use to argue that there is no difference between the types?
We will use median as it is the same for both plants.
(d) Which measure would you use to argue that Type A is the best plant?
We will use the mode, as the mode for Type A (5) is higher than that for Type B (3). Note that for
Type A the mean is lower than for Type B, and the median is the same for both types.
(e) Which measure would you use to argue that Type B is the best plant?
We will use the mean, as the mean for Type B (5) is higher than that for Type A (3).
Geometric Mean:
The geometric mean is a type of mean or average, which indicates the central tendency or typical
value of a set of numbers. It is similar to the arithmetic mean, which is what most people think of
with the word "average", except that the numbers are multiplied and then the nth root (where n is
the count of numbers in the set) of the resulting product is taken.
The geometric mean is defined as the nth root of the product of the N items of a series. If there are
two items, we take the square root; if there are three items, we take the cube root; and so on.
Symbolically;
GM = (X1 × X2 × ... × Xn)^(1/n)

Where X1, X2, ..., Xn refer to the various items of the series.
For instance, the geometric mean of two numbers, say 2 and 8, is just the square root of their product; that is, √(2 × 8) = √16 = 4. As another example, the geometric mean of the three numbers 1, 1/2 and 1/4 is the cube root of their product; that is, ∛(1 × 1/2 × 1/4) = ∛(1/8) = 1/2.
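Both of these examples can be checked with a short Python sketch (illustrative only, not part of the original text):

```python
# Geometric mean: the nth root of the product of n values.
def geometric_mean(values):
    product = 1.0
    for v in values:
        product *= v
    return product ** (1 / len(values))

print(geometric_mean([2, 8]))                    # 4.0
print(round(geometric_mean([1, 0.5, 0.25]), 6))  # 0.5
```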
When the number of items is three or more, the task of multiplying the numbers and of
extracting the root becomes excessively difficult. To simplify the calculations, logarithms are used.
GM is then calculated as follows.
log GM = (log X1 + log X2 + ... + log Xn) / N
log GM = ∑log X / N

GM = Antilog (∑log X / N)
In a discrete series, GM = Antilog (∑f log X / N)

In a continuous series, GM = Antilog (∑f log m / N)
Where f = frequency
m = mid point

Merits of G.M.
1. It is based on each and every item of the series.
2. It is rigidly defined.
3. It is useful in averaging ratios and percentages and in determining rates of increase and decrease.
4. It is capable of algebraic manipulation.

Limitations
1. It is difficult to understand.
2. It is difficult to compute and to interpret.
3. It cannot be computed when one or more values in the series are zero or negative.
4. G.M. has very limited applications.
Harmonic Mean

The harmonic mean is a kind of average: the mean of a set of positive values. It is calculated by
dividing the number of observations by the sum of the reciprocals of the values.

Thus, the harmonic mean of a set of n numbers a1, a2, a3, ..., an is given as

HM = n / (1/a1 + 1/a2 + 1/a3 + ... + 1/an)

Example: Find the harmonic mean of the numbers 3 and 4.

Take the reciprocals of the given numbers and sum them:

1/3 + 1/4 = (4 + 3)/12 = 7/12

Now apply the formula. Since the number of observations is two, n = 2:

HM = 2 / (7/12) = 2 × 12/7 = 24/7 = 3.43

In a discrete series, HM = N / ∑(f/X).

In a continuous series, HM = N / ∑(f/m).
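The harmonic-mean formula for ungrouped data can be sketched in Python (illustrative only, not part of the original text):

```python
# Harmonic mean: the number of observations divided by the sum of the
# reciprocals of the values.
def harmonic_mean(values):
    return len(values) / sum(1 / v for v in values)

print(round(harmonic_mean([3, 4]), 2))  # 3.43
```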
Merits of Harmonic mean:
1. Its value is based on every item of the series.
2. It lends itself to algebraic manipulation.
Limitations
1. It is not easily understood
2. It is difficult to compute
3. It gives the largest weight to the smallest item.
7. MEASURES OF VARIABILITY / DISPERSION

The terms variability, spread, and dispersion are synonyms, and refer to how spread out a
distribution is. Just as in the section on central tendency, where we discussed measures of the
centre of a distribution of scores, here we discuss measures of the variability of a distribution.
Measures of variability provide information about the degree to which individual scores are
clustered about, or deviate from, the average value in a distribution.
Quite often students find it difficult to understand what is meant by variability or dispersion, and
hence they find the measures of dispersion difficult. So we will discuss the meaning of the term in
detail. First, one should understand that dispersion or variability is a continuation of our
discussion of measures of central tendency. So any discussion of measures of dispersion should
use one of the measures of central tendency. We continue this discussion taking the mean as an
example. The mean or average measures the centre of the data. It is one aspect of the observations.
Another feature of the observations is how they are spread about the centre. The observations
may be close to the centre or they may be spread away from it. If the observations are close to
the centre (usually the arithmetic mean or median), we say that dispersion, scatter or variation is
small. If the observations are spread away from the centre, we say dispersion is large.
Let us make this clear with the help of an example. Suppose we have three groups of students
who have obtained the following marks in a test. The arithmetic means of the three groups are
also given below:
Group A: 46, 48, 50, 52, 54, for this the mean is 50.
Group B: 30, 40, 50, 60, 70, for this the mean is 50.
Group C: 40, 50, 60, 70, 80, for this the mean is 60.
In groups A and B the arithmetic means are equal, i.e. mean of Group A = mean of Group B = 50.
But in group A the observations are concentrated near the centre. All students of group A have
almost the same level of performance. We say that there is consistency in the observations in
group A. In group B the mean is 50 but the observations are not close to the centre. One
observation is as small as 30 and one observation is as large as 70. Thus there is greater
dispersion in group B. In group C the mean is 60 but the spread of the observations with respect
to the centre 60 is the same as the spread of the observations in group B with respect to their own
centre which is 50. Thus in group B and C the means are different but their dispersion is the
same. In group A and C the means are different and their dispersions are also different.
Dispersion is an important feature of the observations and it is measured with the help of the
measures of dispersion, scatter or variation. The word variability is also used for this idea of
dispersion.
The study of dispersion is very important in statistical data. If in a certain factory there is
consistency in the wages of workers, the workers will be satisfied. But if some workers have high
wages and some have low wages, there will be unrest among the low-paid workers, and they
might go on strikes and arrange demonstrations. If in a certain country some people are very
poor and some are very rich, we say there is economic disparity; it means that dispersion is
large. The idea of dispersion is important in the study of wages of workers, prices of
commodities, standard of living of different people, distribution of wealth, distribution of land
among farmers and various other fields of life. Some brief definitions of dispersion are:
The degree to which numerical data tend to spread about an average value is called the
dispersion or variation of the data.
Dispersion or variation may be defined as a statistic signifying the extent of the scatteredness of
items around a measure of central tendency.
Dispersion or variation is the measurement of the scatter of the size of the items of a series about
the average.
There are five frequently used measures of variability: the Range, Interquartile range or quartile
deviation, Mean deviation or average deviation, Standard deviation and Lorenz curve.
7.1 Range
The range is the simplest measure of variability to calculate, and one you have probably
encountered many times in your life. The range is simply the highest score minus the lowest score.
Range: R = maximum – minimum
Let's take a few examples. What is the range of the following group of numbers: 10, 2, 5, 6, 7, 3,
4? Well, the highest number is 10, and the lowest number is 2, so 10 − 2 = 8. The range is 8.

Let's take another example. Here's a dataset with 10 numbers: 99, 45, 23, 67, 45, 91, 82, 78, 62,
51. What is the range? The highest number is 99 and the lowest number is 23, so 99 − 23 = 76;
the range is 76.
Example 2: Ms. Kesavan listed 9 integers on the blackboard. What is the range of these integers?
14, -12, 7, 0, -5, -8, 17, -11, 19
Ordering the data from least to greatest, we get:
-12, -11, -8, -5, 0, 7, 14, 17, 19
Range: R = highest − lowest = 19 − (−12) = 19 + 12 = 31
The range of these integers is 31.
Example 3: A marathon race was completed by 5 participants. What is the range of the times,
given in hours below?
2.7 hr, 8.3 hr, 3.5 hr, 5.1 hr, 4.9 hr
Ordering the data from least to greatest, we get:
2.7, 3.5, 4.9, 5.1, 8.3
Range: R = highest – lowest = 8.3 hr - 2.7 hr = 5.6 hr
The range of the marathon times is 5.6 hr.
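The three examples above can be checked with a few lines of Python. This is an illustrative sketch; the function name `data_range` is our own, not from the text:

```python
def data_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

r1 = data_range([10, 2, 5, 6, 7, 3, 4])                # 8
r2 = data_range([14, -12, 7, 0, -5, -8, 17, -11, 19])  # 31
r3 = round(data_range([2.7, 8.3, 3.5, 5.1, 4.9]), 1)   # 5.6 hours
print(r1, r2, r3)
```

The result for the marathon data is rounded because the subtraction of decimal hours is done in floating point.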
Merits and Limitations
Merits
Amongst all the methods of studying dispersion, range is the simplest to understand and the
easiest to compute.
It takes minimum time to calculate the value of the range. Hence, if one is interested in getting
a quick rather than a very accurate picture of variability, one may compute the range.
Limitations
Range is not based on each and every item of the distribution.
It is subject to fluctuation of considerable magnitude from sample to sample.
Range cannot tell us anything about the character of the distribution within the two extreme values.
According to King, “Range is too indefinite to be used as a practical measure of dispersion.”
Uses of Range
Range is useful in studying the variations in the prices of stocks, shares and other
commodities that are sensitive to price changes from one period to another period.
The meteorological department uses the range for weather forecasts, since the public is
interested to know the limits within which the temperature is likely to vary on a particular
day.
7.2 Inter-Quartile Range or Quartile Deviation
So we have seen the Range, which is a measure of variability that concentrates on two extreme
values. If we concentrate on two extreme values, as in the case of the range, we do not get any idea
about the scatter of the data within the range (i.e. what happens within the two extreme values).
If we discard these two values, the limited range thus available might be more informative. For
this reason the concept of the interquartile range is developed. It is the range which includes the middle
50% of the distribution. Here 1/4 (one quarter) of the lower end and 1/4 (one quarter) of the
upper end of the observations are excluded.
School of Distance Education
Quantitative Methods for Economic Analysis - I Page 45
Now the lower quartile (Q1) is the 25th percentile and the upper quartile (Q3) is the 75th
percentile. It is interesting to note that the 50th percentile is the middle quartile (Q2), which is in
fact what you have studied under the title ‘Median’. Thus, symbolically,
Inter-quartile range = Q3 - Q1
If we divide (Q3 - Q1) by 2 we get what is known as the semi-interquartile range,
i.e. (Q3 - Q1)/2. It is known as the quartile deviation (Q.D. or SIQR).
Another look at the same issue is given here to make the concept clearer for the student.
In the same way that the median divides a dataset into two halves, it can be further divided into quarters by identifying the upper and lower quartiles. The lower quartile is found one quarter of the way along a dataset when the values have been arranged in order of magnitude; the upper quartile is found three quarters along the dataset. Therefore, the upper quartile lies half way between the median and the highest value in the dataset whilst the lower quartile lies half way between the median and the lowest value in the dataset. The inter-quartile range is found by subtracting the lower quartile from the upper quartile.
For example, the examination marks for 20 students following a particular module are arranged in order of magnitude.
The median lies at the mid-point between the two central values (10th and 11th)
= half-way between 60 and 62 = 61
The lower quartile lies at the mid-point between the 5th and 6th values
= half-way between 52 and 53 = 52.5
The upper quartile lies at the mid-point between the 15th and 16th values
= half-way between 70 and 71 = 70.5
The inter-quartile range for this dataset is therefore 70.5 - 52.5 = 18 whereas the range is: 80 - 43
= 37.
The inter-quartile range provides a clearer picture of the overall dataset by removing/ignoring the
outlying values.
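The mid-point rule described above can be sketched in Python. The original list of 20 marks is not reproduced in the text, so the values below are illustrative, chosen to match the positions quoted (5th = 52, 6th = 53, 10th = 60, 11th = 62, 15th = 70, 16th = 71, lowest = 43, highest = 80):

```python
marks = sorted([43, 45, 48, 50, 52, 53, 55, 57, 58, 60,
                62, 64, 66, 68, 70, 71, 73, 75, 78, 80])

# For n = 20 each cut point falls between two observations,
# so we average the neighbouring values (mid-point rule).
median = (marks[9] + marks[10]) / 2    # between 10th and 11th values
q1 = (marks[4] + marks[5]) / 2         # between 5th and 6th values
q3 = (marks[14] + marks[15]) / 2       # between 15th and 16th values

iqr = q3 - q1                          # inter-quartile range
qd = iqr / 2                           # quartile deviation (semi-IQR)
print(median, q1, q3, iqr, qd)         # 61.0 52.5 70.5 18.0 9.0
```

Note how the IQR (18) is far less sensitive to the extreme marks 43 and 80 than the range (37) is.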
School of Distance Education
Quantitative Methods for Economic Analysis - I Page 46
Like the range however, the inter-quartile range is a measure of dispersion that is based upon
only two values from the dataset. Statistically, the standard deviation is a more powerful measure
of dispersion because it takes into account every value in the dataset. The standard deviation is
explored in the next section.
Example 1
The wheat production (in Kg) of 20 acres is given as: 1120, 1240, 1320, 1040, 1080, 1200, 1440,
1360, 1680, 1730, 1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470, 1750, and 1885. Find the
quartile deviation and coefficient of quartile deviation.
After arranging the observations in ascending order, we get
1040, 1080, 1120, 1200, 1240, 1320, 1342, 1360, 1440, 1470, 1600, 1680, 1720, 1730, 1750,
1755, 1785, 1880, 1885, 1960.
Q1 = value of the (n + 1)/4 th item = (20 + 1)/4 th = 5.25th item
= 5th item + 0.25 (6th item - 5th item)
= 1240 + 0.25 (1320 - 1240)
= 1240 + 20 = 1260
Q3 = value of the 3(n + 1)/4 th item = 3(20 + 1)/4 th = 15.75th item
= 15th item + 0.75 (16th item - 15th item)
= 1750 + 0.75 (1755 - 1750)
= 1750 + 3.75 = 1753.75
Quartile deviation (Q.D.) = (Q3 - Q1)/2 = (1753.75 - 1260)/2 = 493.75/2 = 246.88
Coefficient of Q.D. = (Q3 - Q1)/(Q3 + Q1) = (1753.75 - 1260)/(1753.75 + 1260) = 0.164
Example 2
Calculate the range and Quartile deviation of wages.
Wages (X)    Labourers
30 - 32      12
32 - 34      18
34 - 36      16
36 - 38      14
38 - 40      12
40 - 42      8
42 - 44      6
Solution
Range: R = L - S = 44 - 30 = 14
Calculation of Quartiles:
X          f     c.f.
30 - 32    12    12
32 - 34    18    30
34 - 36    16    46
36 - 38    14    60
38 - 40    12    72
40 - 42    8     80
42 - 44    6     86
Q1 = size of the (N/4)th item = 86/4 = 21.5th item
i.e. Q1 lies in the group 32 - 34
Q1 = L + ((N/4 - c.f.)/f) × i
= 32 + ((21.5 - 12)/18) × 2
= 32 + 1.06 = 33.06
Q3 = size of the (3N/4)th item = 3 × 86/4 = 64.5th item
Q3 lies in the group 38 - 40
Q3 = L + ((3N/4 - c.f.)/f) × i
= 38 + ((64.5 - 60)/12) × 2
= 38 + 0.75 = 38.75
Q.D. = (Q3 - Q1)/2 = (38.75 - 33.06)/2 = 5.69/2 = 2.85
Coefficient of Q.D. = (Q3 - Q1)/(Q3 + Q1) = (38.75 - 33.06)/(38.75 + 33.06) = 5.69/71.81 = 0.08
Merits of Quartile Deviation
1. It is simple to understand and easy to calculate.
2. It is not influenced by extreme values.
3. It can be found out with open-end distributions.
Demerits
1. It ignores the first 25% of the items and the last 25% of the items.
2. It is a positional average; hence it is not amenable to further mathematical treatment.
3. The value is affected by sampling fluctuations.
7.3 Mean Deviation or Average Deviation
Average deviation (mean deviation) is the average amount of variation (scatter) of the items in a distribution from either the mean or the median or the mode, ignoring the signs of these deviations. In other words, the mean deviation or average deviation is the arithmetic mean of the absolute deviations.
Example 1: Find the Mean Deviation of 3, 6, 6, 7, 8, 11, 15, 16
Step 1: Find the mean: μ = (3 + 6 + 6 + 7 + 8 + 11 + 15 + 16)/8 = 72/8 = 9
Step 2: Find the distance of each value from that mean:
Step 3. Find the mean of those distances:
Mean deviation = (6 + 3 + 3 + 2 + 1 + 2 + 6 + 7)/8 = 30/8 = 3.75
So, the mean = 9, and the mean deviation = 3.75.
It tells us how far, on average, all values are from the middle.
In that example the values are, on average, 3.75 away from the middle.
The formula is: MD = Σ|x − μ| / N
Where
μ is the mean (in our example μ = 9)
x is each value (such as 3 or 16)
N is the number of values (in our example N = 8)
Each distance we calculated is called an Absolute Deviation, because it is the
Absolute Value of the deviation (how far from the mean). To show "Absolute
Value" we put “|” marks on either side, like this: |-3| = 3. Thus the absolute value is
one where we ignore the sign. That is, whether it is – or +, we consider it as +. E.g. -3 or +3
will be taken as just 3.
Value   Distance from 9
3       6
6       3
6       3
7       2
8       1
11      2
15      6
16      7
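The three steps above can be written directly in Python. This is a minimal sketch; the function name `mean_deviation` is our own:

```python
def mean_deviation(values):
    """Mean of the absolute deviations from the arithmetic mean."""
    mu = sum(values) / len(values)             # Step 1: the mean
    distances = [abs(x - mu) for x in values]  # Step 2: absolute deviations
    return sum(distances) / len(values)        # Step 3: their average

md = mean_deviation([3, 6, 6, 7, 8, 11, 15, 16])
print(md)  # 3.75
```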
Let us redo example 1 using the formula: Find the Mean Deviation of 3, 6, 6, 7, 8, 11, 15, 16.
Step 1: Find the mean:
μ = (3 + 6 + 6 + 7 + 8 + 11 + 15 + 16)/8 = 72/8 = 9
Step 2: Find the Absolute Deviations:
x x - μ |x - μ|
3 -6 6
6 -3 3
6 -3 3
7 -2 2
8 -1 1
11 2 2
15 6 6
16 7 7
Σ|x − μ| = 30
Step 3. Find the Mean Deviation: MD = Σ|x − μ| / N = 30/8 = 3.75
Example 2
Calculate the mean deviation using mean for the following data
Class       2-4   4-6   6-8   8-10
Frequency   3     4     2     1
Solution
Class   Mid Value (X)   Frequency (f)   d = X - 5   fd     |X - x̄| = |X - 5.2|   f |X - x̄|
2-4     3               3               -2          -6     2.2                    6.6
4-6     5               4               0           0      0.2                    0.8
6-8     7               2               2           4      1.8                    3.6
8-10    9               1               4           4      3.8                    3.8
        N = 10                          Σfd = 2            Σf |X - x̄| = 14.8
Mean x̄ = A + Σfd/N = 5 + 2/10 = 5.2
MD = (1/N) Σf |X - x̄| = 14.8/10 = 1.48
Example 3
Calculate mean deviation based on (a) Mean and (b) median
Class Interval   0-10   10-20   20-30   30-40   40-50   50-60   60-70
Frequency (f)    8      12      10      8       3       2       7
Solution
Let us first make the necessary computations.

Class      Mid value   Frequency   Less than   fX     |X - x̄|      f |X - x̄|   |X - Md|     f |X - Md|
interval   (X)         (f)         c.f.               = |X - 29|                = |X - 25|
0-10       5           8           8           40     24           192         20           160
10-20      15          12          20          180    14           168         10           120
20-30      25          10          30          250    4            40          0            0
30-40      35          8           38          280    6            48          10           80
40-50      45          3           41          135    16           48          20           60
50-60      55          2           43          110    26           52          30           60
60-70      65          7           50          455    36           252         40           280
           N = 50                  ΣfX = 1450         Σf|X - 29| = 800         Σf|X - 25| = 760

(a) M.D. from Mean
x̄ = (1/N) ΣfX = 1450/50 = 29
So mean = 29. Let us now find the mean deviation about the mean.
M.D. = (1/N) Σf |X - x̄| = 800/50 = 16
We see that the mean deviation based on the mean is 16.
Now let us compute M.D. about the median.
(b) M.D. from median
(N/2) = (50/2) = 25. The c.f. just greater than 25 is 30 in the table above. So the
corresponding class 20-30 is the median class.
So l = lower limit of the median class = 20, f = frequency of the median class = 10,
h = class interval of the median class = 10, c = cumulative frequency of the class
preceding the median class = 20.
Use the formula for the median and substitute the values.
Median = l + (h/f)(N/2 - c) = 20 + (10/10)(25 - 20) = 20 + 5 = 25
Median = 25. Let us now find the Mean Deviation about the median.
M.D. = (1/N) Σf |X - Md| = 760/50 = 15.2
Thus we have computed the Mean Deviation from the Mean and from the Median. Let us
compare the two results. M.D. from the Mean is 16 and M.D. from the median is 15.2.
So, M.D. from Median < M.D. from Mean. This implies that M.D. is least when
taken about the median.
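For grouped data such as Example 3, the same computation works on class mid-values weighted by frequencies. The sketch below cross-checks the mean and the mean deviation about the mean (the variable names are our own):

```python
mids  = [5, 15, 25, 35, 45, 55, 65]   # class mid-values
freqs = [8, 12, 10, 8, 3, 2, 7]       # class frequencies

n = sum(freqs)                                               # N = 50
mean = sum(f * x for x, f in zip(mids, freqs)) / n           # 1450 / 50 = 29
md_mean = sum(f * abs(x - mean) for x, f in zip(mids, freqs)) / n
print(mean, md_mean)  # 29.0 16.0
```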
Merits of M.D.
i. It is simple to understand and easy to compute.
ii. It is not much affected by the fluctuations of sampling.
iii. It is based on all items of the series and gives weight according to their size.
iv. It is less affected by extreme items.
v. It is rigidly defined.
vi. It is a better measure for comparison.
Demerits of M.D.
i. It is not capable of further algebraic treatment.
ii. Algebraic positive and negative signs are ignored, which is mathematically unsound
and illogical.
iii. It is not as popular as standard deviation.
Uses :
It will help to understand the standard deviation. It is useful in marketing problems. It is used in the statistical analysis of economic, business and social phenomena. It is useful in calculating the distribution of wealth in a community or nation.
7.4 Standard Deviation
The concept of standard deviation was introduced by Karl Pearson in 1893. It is the most
important measure of dispersion and is widely used. It is a measure of the dispersion of a set of
data from its mean. The standard deviation is kind of the “mean of the mean,” and often can help you find the story behind the data.
The standard deviation is a measure that summarises the amount by which every value within a
dataset varies from the mean. Effectively it indicates how tightly the values in the dataset are
bunched around the mean value. It is the most robust and widely used measure of dispersion
since, unlike the range and inter-quartile range, it takes into account every variable in the dataset.
When the values in a dataset are pretty tightly bunched together the standard deviation is small.
When the values are spread apart the standard deviation will be relatively large.
Standard deviation is defined as a statistical measure of dispersion in the value of an asset around
mean. The standard deviation calculation tells you how spread out the numbers are in your
sample. Standard Deviation is represented using the symbol σ (sigma, a Greek letter).
For example, if you want to measure the performance of a mutual fund, SD can be used. It gives an
idea of how volatile a fund's performance is likely to be. It is an important measure of a fund's
performance. It gives an idea of how much the return on the asset at a given time differs or
deviates from the average return. Generally, it gives an idea of a fund's volatility i.e. a higher
dispersion (indicated by a higher standard deviation) shows that the value of the asset has
fluctuated over a wide range.
The formula for finding SD in a sentence form is: it is the square root of the Variance. So now
you ask, ‘What is the Variance?’ Let us see what variance is.
The Variance is defined as: the average of the squared differences from the Mean.
We can calculate the variance by following these steps:
a. Work out the Mean (the simple average of the numbers)
b. Then for each number: subtract the Mean and square the result (the squared difference).
c. Then work out the average of those squared differences.
You may ask why we square the differences. If we just added up the differences from the mean,
the negatives would cancel the positives. So we take the square.
Example
You have the figures of the marks obtained by your five bench mates, which are as follows: 600, 470, 170, 430 and 300. Find out the Mean, the Variance, and the Standard Deviation.
Your first step is to find the Mean:
Mean = (600 + 470 + 170 + 430 + 300)/5 = 1970/5 = 394
So the mean (average) mark is 394.
x      x − x̄    (x − x̄)²
600    206      42436
470    76       5776
170    -224     50176
430    36       1296
300    -94      8836
                Σ(x − x̄)² = 108520
To calculate the Variance, take each difference, square it, find the sum (108520) and find the average:
Variance = σ² = 108520/5 = 21704
So, the Variance is 21,704.
The Standard Deviation is just the square root of the Variance, so:
SD = σ = √21704 = 147.32 ≈ 147
Now we can see which marks are within one Standard Deviation (147) of the Mean.
Please note that there is a slight difference when we find the variance from a population and from a sample. In the above example we found the variance for data collected from all your bench mates, so it may be considered a population. Suppose now you collect data only from some of your bench mates; now it may be considered a sample. If you are finding the variance for sample data, in the formula to find the variance, divide by N-1 instead of N.
For example, if we say that in our problem the marks are of some students in a class, it should be treated as a sample. In that case
Variance (or, to be precise, Sample Variance) = 108,520 / 4 = 27,130. Note that instead of N (i.e. 5) we divided by N-1 (5-1 = 4).
Standard Deviation (Sample Standard Deviation) = s = √27130 = 164.71 ≈ 165
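The population and sample calculations can be sketched as follows (the helper names are our own; note how only the divisor changes between the two versions):

```python
import math

marks = [600, 470, 170, 430, 300]
n = len(marks)
mean = sum(marks) / n                           # 394.0

sq_diffs = sum((x - mean) ** 2 for x in marks)  # 108520.0

pop_var = sq_diffs / n           # population variance: divide by N
pop_sd = math.sqrt(pop_var)      # about 147.32

samp_var = sq_diffs / (n - 1)    # sample variance: divide by N - 1
samp_sd = math.sqrt(samp_var)

print(pop_var, round(pop_sd, 2))    # 21704.0 147.32
print(samp_var, round(samp_sd, 2))
```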
Based on the above information, let us build the formula for finding SD. Since we use two different formulae for data which is a population and data which is a sample, we will have two different formulae for SD also.
The "Population Standard Deviation": σ = √[ Σ(x − μ)² / N ]
The "Sample Standard Deviation": s = √[ Σ(x − x̄)² / (N − 1) ]
Computation of Standard Deviation: There are different methods to compute SD. They are illustrated through examples below.
Example 1
Calculate SD for the following observations using different methods.
160, 160, 161, 162, 163, 163, 163, 164, 164, 170
(a) Direct method No.1
Formula: σ = √(Σx²/N), where x = X − X̄
X      x = X − X̄     x²
160 -3 9
160 -3 9
161 -2 4
162 -1 1
163 0 0
163 0 0
163 0 0
164 1 1
164 1 1
170 7 49= 1630 = 74ℎ = = ∑ = 163
Now compute SD = ∑= = √7.4 = 2.72
(b) Direct method No.2
Here the formula is
σ = √[ (ΣX² − (ΣX)²/N) / N ]
X      X²
160 25600
160 25600
161 25921
162 26244
163 26569
163 26569
163 26569
164 26896
164 26896
170 28900
ΣX = 1630,  ΣX² = 265764
σ = √[ (265764 − (1630)²/10) / 10 ] = √(74/10) = √7.4 = 2.72
(c) Method 3 (Short-Cut Method) – in this method, instead of finding the mean we assume a figure as the mean. Here we have assumed 162 as the mean arbitrarily. We use the formula
σ = √[ Σdx²/N − (Σdx/N)² ]
X      Deviation from assumed mean (162), dx      dx²
160    -2                                         4
160    -2                                         4
161    -1                                        1
162    0                                          0
163    1                                          1
163    1                                          1
163    1                                          1
164    2                                          4
164    2                                          4
170    8                                          64
ΣX = 1630    Σdx = +10    Σdx² = 84
σ = √[ 84/10 − (10/10)² ] = √(8.4 − 1) = √7.4 = 2.72
Another example where we find many of the concepts together.
Example:
Given the series: 3, 5, 2, 7, 6, 4, 9.
Calculate:
The (a) mode, (b) median and (c) mean; the (d) variance, (e) standard deviation and (f) average deviation.
(a) Mode: does not exist, because all the scores have the same frequency.
(b) Median
2, 3, 4, 5, 6, 7, 9.
Median = 5
(c) Mean = (2 + 3 + 4 + 5 + 6 + 7 + 9)/7 = 5.143
(d) Variance
σ² = Σx²/N − x̄² = (2² + 3² + 4² + 5² + 6² + 7² + 9²)/7 − (5.143)² = 220/7 − 26.45 = 4.978
(e) Standard Deviation = √4.978 = 2.231
(f) Average Deviation
x      |x − x̄| = |x − 5.143|
2      3.143
3      2.143
4      1.143
5      0.143
6      0.857
7      1.857
9      3.857
Σ|x − x̄| = 13.143
Average deviation = Σ|x − x̄| / N = 13.143/7 = 1.878
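All the measures in this example can be verified with a short script using Python's standard `statistics` module. The tiny differences from the hand computation (4.98 vs 4.978, 2.232 vs 2.231) come from the text rounding the mean to 5.143 before squaring:

```python
import statistics as st

data = [3, 5, 2, 7, 6, 4, 9]

median = st.median(data)        # 5
mean = st.mean(data)            # 36/7, about 5.143
variance = st.pvariance(data)   # population variance, about 4.98
sd = st.pstdev(data)            # population SD, about 2.23
avg_dev = sum(abs(x - mean) for x in data) / len(data)  # about 1.878

print(median, round(mean, 3), round(variance, 2), round(sd, 2), round(avg_dev, 3))
```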
Calculation of SD for continuous series
The step deviation method is easy to use to find SD for a continuous series.
σ = √[ Σfd²/N − (Σfd/N)² ] × h,  where d = (m − A)/h, A = assumed mean, m = mid-point, h = class width
Calculate Mean and SD for the following data
Class       0-10   10-20   20-30   30-40   40-50   50-60   60-70
Frequency   5      12      30      45      50      37      21
Make the necessary computations
x        Midpoint (m)   f     d = (m − 35)/10    fd     f × d²
0-10 5 5 -3 -15 45
10-20 15 12 -2 -24 48
20-30 25 30 -1 -30 30
30-40 35 45 0 0 0
40-50 45 50 1 50 50
50-60 55 37 2 74 148
60-70 65 21 3 63 189
         N = 200                Σfd = 118    Σfd² = 510
Mean = A + (Σfd/N) × h = 35 + (118/200) × 10 = 35 + 5.9 = 40.9
σ = √[ Σfd²/N − (Σfd/N)² ] × h = √[ 510/200 − (118/200)² ] × 10
= √(2.55 − 0.3481) × 10
= 1.4839 × 10 = 14.839
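The step-deviation computation can be verified in Python (a sketch; the variable names are our own):

```python
import math

mids  = [5, 15, 25, 35, 45, 55, 65]   # class mid-points (m)
freqs = [5, 12, 30, 45, 50, 37, 21]   # frequencies (f)
A, h = 35, 10                         # assumed mean and class width

n = sum(freqs)                        # 200
d = [(m - A) / h for m in mids]       # step deviations
sum_fd = sum(f * di for f, di in zip(freqs, d))         # 118
sum_fd2 = sum(f * di * di for f, di in zip(freqs, d))   # 510

mean = A + (sum_fd / n) * h                             # 40.9
sd = math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2) * h     # about 14.839
print(round(mean, 1), round(sd, 3))
```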
Merits of Standard Deviation
1. It is rigidly defined and its value is always definite and based on all observations.
2. As it is based on the arithmetic mean, it has all the merits of the arithmetic mean.
3. It is possible for further algebraic treatment.
4. It is less affected by sampling fluctuations.
Demerits
1. It is not easy to calculate.
2. It gives more weight to extreme values, because the values are squared up.
Coefficient of Variation
Standard deviation is an absolute measure of dispersion. It is expressed in terms of the units in which the original figures are collected and stated. The relative measure of standard deviation is known as the coefficient of variation.
Variance: the square of the standard deviation.
Symbolically: Variance = σ², and σ = √Variance
Coefficient of standard deviation = σ / X̄
Coefficient of variation (C.V.) = (σ / X̄) × 100
8. MEASURES OF VARIABILITY IN SHAPE- Graphic Method of Dispersion
Dispersion or variance can be represented using graphs also. We discuss here some of
the graphical methods which rely on the shape of the curve to represent the deviations.
We will see Lorenz Curve, Gini’s Coefficient, Skewness and Kurtosis
8.1 - LORENZ CURVE
The Lorenz Curve is a graphical representation of wealth distribution developed by the
American economist Dr. Max O. Lorenz, a popular economic statistician, in 1905. He
studied the distribution of Wealth and Income with its help. On the graph, a straight
diagonal line represents perfect equality of wealth distribution; the Lorenz curve lies
beneath it, showing the reality of wealth distribution. The difference between the
straight line and the curved line is the amount of inequality of wealth distribution, a
figure described by the Gini coefficient. One practical use of the Lorenz curve is that it
can be used to show what percentage of a nation's residents possess what percentage of
that nation's wealth. For example, it might show that the country's poorest 10% possess
2% of the country's wealth.
It is a graphic method to study dispersion. It helps in studying the variability in different
components of a distribution, especially economic ones. The basis of the Lorenz Curve is that we
take cumulative percentages along the X and Y axes. Joining these points, we get the Lorenz
Curve. The Lorenz Curve is of much importance in the comparison of two series
graphically. It gives us a clear-cut visual view of the series to be compared.
Steps to plot 'Lorenz Curve'
1. Cumulate both the values and their corresponding frequencies.
2. Find the percentage of each of the cumulated figures, taking the grand total of each corresponding column as 100.
3. Represent the percentages of the cumulated frequencies on the X axis and those of the values on the Y axis.
4. Draw a diagonal line designated as the line of equal distribution.
5. Plot the percentages of cumulated values against the percentages of the cumulated frequencies of the given distribution and join the points so plotted through a free-hand curve.
The greater the distance between the curve and the line of equal distribution, the greater the dispersion. If the Lorenz curve is nearer to the line of equal distribution, the dispersion or variation is smaller.
Based on data of the annual income of 8 individuals, we have drawn a Lorenz curve below using MS Excel.

Individual   Income    Cumulative %   % income    Cumulative %
                       population                 income
0            0         0              0           0
1            5000      12.5           1.204819    1.204819
2            12000     25             2.891566    4.096385
3            18000     37.5           4.337349    8.433735
4            30000     50             7.228916    15.66265
5            40000     62.5           9.638554    25.3012
6            60000     75             14.45783    39.75904
7            100000    87.5           24.09639    63.85542
8            150000    100            36.14458    100
Total        415000
Example
From the following table giving data regarding the income of workers in a factory, draw a Lorenz Curve to study the inequality of income.
The following method is used for constructing the Lorenz Curve:
1. The sizes of the items and their frequencies are to be cumulated.
2. Percentages must be calculated for each cumulated value of the size and frequency of items.
3. Plot the percentage of the cumulated values of the variable against the percentage of the corresponding cumulated frequencies. Join these points with a smooth free-hand curve. This curve is called the Lorenz curve.
4. Zero per cent on the X axis must be joined with 100% on the Y axis. This line is called the line of equal distribution.
Income      Mid value   Cumulative   % of         No. of        Cumulative    % of
                        income       cumulative   workers (f)   no. of        cumulative
                                     income                     workers       no. of workers
0-500       250         250          2.94         6000          6000          37.50
500-1000    750         1000         11.76        4250          10250         64.06
1000-2000   1500        2500         29.41        3600          13850         86.56
2000-3000   2500        5000         58.82        1500          15350         95.94
3000-4000   3500        8500         100.00       650           16000         100.00
Total       8500                                  16000
Uses of Lorenz Curve
1. To study the variability in a distribution.
2. To compare the variability relating to a phenomenon for two regions.
3. To study the changes in variability over a period.
8.2 - Gini index / Gini coefficient
A Lorenz curve plots the cumulative percentages of total income received against the cumulative
number of recipients, starting with the poorest individual or household. The Gini index measures
the area between the Lorenz curve and a hypothetical line of absolute equality, expressed as a
percentage of the maximum area under the line. This is the most commonly used measure of
inequality. The coefficient varies between 0, which reflects complete equality, and 1 (or 100), which
indicates complete inequality (one person has all the income or consumption, all others have
none). Gini coefficient is found by measuring the areas A and B as marked in the following
diagram and using the formula A/(A+B). If the Gini coefficient is to be presented as a ratio or
percentage, A/(A+B)×100.
The Gini coefficient (also known as the Gini index or Gini ratio) is a measure of statistical
dispersion intended to represent the income distribution of a nation's residents. It was
developed by the Italian statistician and sociologist Corrado Gini in 1912.
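A Gini coefficient can be computed from the same eight incomes used in the Lorenz-curve table above. The sketch below approximates area B under the Lorenz curve with trapezoids and then takes A/(A+B), where A + B = 0.5; this is our own implementation, not taken from the text:

```python
incomes = sorted([5000, 12000, 18000, 30000, 40000, 60000, 100000, 150000])
total = sum(incomes)
n = len(incomes)

# Cumulative income shares: the y-coordinates of the Lorenz curve,
# starting from the origin (0, 0).
shares, running = [0.0], 0
for inc in incomes:
    running += inc
    shares.append(running / total)

# Area B under the Lorenz curve via the trapezoidal rule;
# each population step along the x-axis has width 1/n.
area_b = sum((shares[i] + shares[i + 1]) / 2 / n for i in range(n))

gini = (0.5 - area_b) / 0.5   # A / (A + B), since A + B = 0.5
print(round(gini, 3))
```

For this small dataset the coefficient comes out close to 0.48, confirming visually what the curve shows: the richest individual alone holds about 36% of total income.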
8.3 - Skewness
We have discussed earlier techniques to calculate the deviations of a distribution from its measures of central tendency (mean, median, mode). Here we see another measure for that, named Skewness. Skewness characterizes the degree of asymmetry of a distribution around its mean. If there is only one mode (peak) in our data (unimodal), and if the other data are distributed evenly to the left and right of this value, then if we plot it in a graph we get what is called a normal curve (see figure below). Here we say that there is no skewness, or skewness = 0. If there is zero skewness (i.e., the distribution is symmetric) then the mean = median for this distribution.
However, data need not always be like this. Sometimes the bulk of the data is at the left and the right tail is longer; then we say that the distribution is skewed right or positively skewed. Positive skewness indicates a distribution with an asymmetric tail extending towards more positive values. On the other hand, sometimes the bulk of the data is at the right and the left tail is longer; then we say that the distribution is skewed left or negatively skewed. Negative skewness indicates a distribution with an asymmetric tail extending towards more negative values.
Skewed Left Symmetric Skewed Right
Tests of Skewness
There are certain tests to know whether skewness does or does not exist in a frequency distribution. They are:
1. In a skewed distribution, the values of mean, median and mode do not coincide. The values of mean and mode are pulled away, and the value of the median will be at the centre. In such a distribution, Median - Mode = 2/3 (Mean - Mode).
2. Quartiles will not be equidistant from the median.
3. When the asymmetrical distribution is drawn on graph paper, it will not give a bell-shaped curve.
4. The sum of the positive deviations from the median is not equal to the sum of the negative deviations.
5. Frequencies are not equal at points of equal deviations from the mode.
Nature of Skewness
Skewness can be positive or negative or zero.
1. When the values of mean, median and mode are equal, there is no skewness.
2. When mean > median > mode, skewness will be positive.
3. When mean < median < mode, skewness will be negative.
Characteristics of a good measure of skewness
1. It should be a pure number, in the sense that its value should be independent of the units of the series and also of the degree of variation in the series.
2. It should have zero value when the distribution is symmetrical.
3. It should have a meaningful scale of measurement so that we can easily interpret the measured value.
Measures of Skewness
Skewness can be studied graphically and mathematically. When we study skewness graphically, we can find out whether skewness is positive or negative or zero. This is what we have shown above.
Mathematically, skewness can be studied as:
(a) Absolute skewness
(b) Relative or coefficient of skewness
When the skewness is presented in absolute terms, i.e. in units, it is absolute skewness. If the value of skewness is obtained in ratios or percentages, it is called relative or coefficient of skewness. When skewness is measured in absolute terms, we can compare one distribution with another only if the units of measurement are the same. When it is presented in ratios or percentages, comparison becomes easy. Relative measures of skewness are also called coefficients of skewness.
(a) Absolute measure of Skewness:
Skewness can be measured in absolute terms by taking the difference between the mean and the mode.
Absolute Skewness = Mean - Mode
If the value of the mean is greater than the mode, the skewness is positive.
If the value of the mode is greater than the mean, the skewness is negative.
The greater the amount of skewness (negative or positive), the greater the tendency towards asymmetry. The absolute measure of skewness will not be a proper measure for comparison; hence, for each series a relative measure or coefficient of skewness has to be computed.
(b) Relative measure of skewness
There are three important measures of relative skewness:
1. Karl Pearson’s coefficient of skewness.
2. Bowley’s coefficient of skewness.
3. Kelly’s coefficient of skewness.
(b 1) Karl Pearson’s coefficient of Skewness
The mean, median and mode are not equal in a skewed distribution. Karl Pearson’s measure of skewness is based upon the divergence of the mean from the mode in a skewed distribution. Karl Pearson's measure of skewness is
Skp = (Mean − Mode) / σ
Properties of Karl Pearson coefficient of Skewness
(1)−1 ≤ Skp ≤ 1.
(2) Skp = 0 ⇒ distribution is symmetrical about mean.
(3)Skp> 0 ⇒ distribution is skewed to the right.
(4) Skp< 0 ⇒ distribution is skewed to the left.
Advantage of Karl Pearson coefficient of Skewness
Skp is independent of the scale. Because (mean-mode) and standard deviation have
same scale and it will be canceled out when taking the ratio.
Disadvantage of Karl Pearson coefficient of Skewness
Skp depends on the extreme values.
Example 1
Calculate the coefficient of skewness of the following data by using Karl Pearson's method: 2, 3, 3, 4, 4, 6, 6.
Step 1. Find the mean:
Mean = (2 + 3 + 3 + 4 + 4 + 6 + 6)/7 = 28/7 = 4
Step 2. Find the standard deviation:
σ = √(Σ(x − x̄)²/N) = √(14/7) = √2 = 1.414
Step 3. Find the coefficient of skewness:
Skp = (Mean − Mode) / σ
Here skewness is negative.
(b 2) Bowley’s coefficient of skewness
Bowley's formula for measuring skewness is based on quartiles. For a symmetrical
distribution, it is seen that Q1 and Q3 are equidistant from the median (Q2).
Thus [(Q3 − Q2) − (Q2 − Q1)] can be taken as an absolute measure of skewness.
Skq = [(Q3 − Q2) − (Q2 − Q1)] / [(Q3 − Q2) + (Q2 − Q1)] = (Q3 + Q1 − 2Q2) / (Q3 − Q1)
Note:
In the above equation, Q1 and Q3 denote the first and third quartiles. To find them, divide the ordered data into a low group and a high group of equal size at the median (if there is an odd number of data points, the median itself is counted with each group). The medians of the low and high groups are the first and third quartiles, Q1 and Q3. The interquartile range is then defined by IQR = Q3 − Q1.
Properties of Bowley's coefficient of skewness
1. −1 ≤ Skq ≤ 1.
2. Skq = 0 ⇒ distribution is symmetrical about the mean.
3. Skq > 0 ⇒ distribution is skewed to the right.
4. Skq < 0 ⇒ distribution is skewed to the left.

Advantage of Bowley's coefficient of skewness
Skq does not depend on extreme values.

Disadvantage of Bowley's coefficient of skewness
Skq does not utilize the data fully.

Example
The following table shows the distribution of 128 families according to the number of children.
No of children No of families
0 20
1 15
2 25
3 30
4 18
5 10
6 6
7 3
8 or more 1
Compute Bowley’s coefficient of skewness
We use the formula for Bowley's coefficient of skewness:

Skq = (Q3 + Q1 − 2Q2) / (Q3 − Q1)
Let us find the necessary values
No of children   No of families   Cumulative frequency
0                20               20
1                15               35
2                25               60
3                30               90
4                18               108
5                10               118
6                6                124
7                3                127
8 or more        1                128
Q1 = size of the ((128 + 1)/4)th = (32.25)th observation = 1

Q2 = size of the (2 × 32.25)th = (64.5)th observation = 3

Q3 = size of the (3 × 32.25)th = (96.75)th observation = 4

Skq = (Q3 + Q1 − 2Q2) / (Q3 − Q1) = (4 + 1 − 2(3)) / (4 − 1) = −1/3 = −0.333

Since Skq < 0, the distribution is skewed to the left.
(b 3) Kelly’s coefficient of skewness
Bowley’s measure of skewness is based on the middle 50% of the observations because
it leaves 25% of the observations on each extreme of the distribution. As an
improvement over Bowley's measure, Kelly suggested a measure based on P10 and P90, so that only 10% of the observations on each extreme are ignored:

Skk = [(P90 − P50) − (P50 − P10)] / [(P90 − P50) + (P50 − P10)]
    = (P90 − P50 − P50 + P10) / (P90 − P50 + P50 − P10)
    = (P90 + P10 − 2P50) / (P90 − P10)

8.4 - KURTOSIS

As we saw above, skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the centre point.

Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak; a uniform distribution would be the extreme case. The word kurtosis comes from a Greek word meaning 'bulginess'.

Distributions of data and probability distributions are not all the same shape. Some are asymmetric and skewed to the left or to the right. Other distributions are bimodal and have two peaks; in other words, two values dominate the distribution. Another feature to consider is not just the number of peaks but their shape. Kurtosis is the measure of the peak of a distribution, and indicates how high the distribution is around the mean. The kurtosis of a distribution falls into one of three categories of classification:
Mesokurtic
Leptokurtic
Platykurtic
We will consider each of these classifications in turn.

Mesokurtic
Kurtosis is typically measured with respect to the normal distribution. A distribution that is peaked in the same way as any normal distribution, not just the standard normal distribution, is said to be mesokurtic. The peak of a mesokurtic distribution is neither high nor low; rather, it is considered to be a baseline for the two other classifications. Besides normal distributions, binomial distributions for which p is close to 1/2 are considered to be mesokurtic.
Leptokurtic
A leptokurtic distribution is one that has kurtosis greater than a mesokurtic distribution. Leptokurtic distributions are identified by peaks that are thin and tall. The tails of these distributions, to both the right and the left, are thick and heavy. Leptokurtic distributions are named for the prefix "lepto", meaning "skinny".

There are many examples of leptokurtic distributions. One of the most well-known leptokurtic distributions is Student's t distribution.
Platykurtic
The third classification for kurtosis is platykurtic. Platykurtic distributions are those that have a peak lower than a mesokurtic distribution. Platykurtic distributions are characterized by a certain flatness to the peak, and have slender tails. The name of these types of distributions comes from the prefix "platy", meaning "broad".
All uniform distributions are platykurtic. In addition, the discrete probability distribution from a single flip of a coin is platykurtic.
Measures of Kurtosis

The moment coefficient and the percentile coefficient of kurtosis are used to measure kurtosis.

Moment Coefficient of Kurtosis

β2 = M4 / M2²

where M4 = 4th central moment and M2 = 2nd central moment (the variance).

If β2 = 3, the distribution is said to be normal (i.e. mesokurtic).
If β2 > 3, the distribution is more peaked and the curve is leptokurtic.
If β2 < 3, the distribution is said to be flat-topped and the curve is platykurtic.
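The moment coefficient can be computed directly from the central moments. Below is a minimal sketch (the function name is my own), using the definitions m2 = Σ(x − x̄)²/n and m4 = Σ(x − x̄)⁴/n:

```python
def moment_kurtosis(data):
    """Moment coefficient of kurtosis: beta2 = m4 / m2**2,
    where m2 and m4 are the 2nd and 4th central moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2

# A flat, evenly spread data set gives beta2 < 3: platykurtic
print(moment_kurtosis([1, 2, 3, 4, 5]))  # 1.7
```

Since 1.7 < 3, this evenly spread data set is platykurtic, consistent with the rule above.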
Percentile Coefficient of Kurtosis

κ = Q.D. / (P90 − P10)

where Q.D. = (Q3 − Q1)/2 is the semi-interquartile range. For a normal distribution this has the value 0.263.

A normal random variable has a kurtosis of 3 irrespective of its mean or standard deviation. If a random variable's kurtosis is greater than 3, it is said to be leptokurtic; if its kurtosis is less than 3, it is said to be platykurtic.

Thus we conclude our discussion by saying that kurtosis is a measure of the 'peakedness' of a distribution. The height and sharpness of the peak relative to the rest of the data are measured by a number called kurtosis. Higher values indicate a higher, sharper peak; lower values indicate a lower, less distinct peak. This occurs because higher kurtosis means more of the variability is due to a few extreme deviations from the mean, rather than many modest deviations from the mean. A normal distribution has kurtosis exactly 3, and any distribution with kurtosis = 3 is called mesokurtic. A distribution with kurtosis < 3 is called platykurtic: compared to a normal distribution, its central peak is lower and broader, and its tails are shorter and thinner. A distribution with kurtosis > 3 is called leptokurtic: compared to a normal distribution, its central peak is higher and sharper, and its tails are longer and fatter.

Comparison among dispersion, skewness and kurtosis
Dispersion, skewness and kurtosis are different characteristics of a frequency distribution. Dispersion studies the scatter of the items round a central value or among themselves. It does not show the extent to which deviations cluster below an average or above it. Skewness tells us about the cluster of the deviations above and below a measure of central tendency. Kurtosis studies the concentration of the items at the central part of a series. If items concentrate too much at the centre, the curve becomes 'leptokurtic', and if the concentration at the centre is comparatively less, the curve becomes 'platykurtic'.
POPULATION AND SAMPLE

The study of statistics revolves around the study of data sets. This section describes two important types of data sets – populations and samples.

Population

In statistics the term 'population' has a slightly different meaning from the one given to it in ordinary speech, and need not refer only to people or to animate creatures, as in the population of India. When we think of the term 'population', we usually think of people in our town, region, state or country and their respective characteristics such as gender, age, marital status, religion, caste and so on. In statistics, the 'population' includes all members of a defined group that we are studying or collecting information on for data-driven decisions.

A population is a group of phenomena that have something in common.
A population is any entire collection of people, animals, plants or things from which we may collect data. It is the entire group we are interested in, which we wish to describe or draw conclusions about.

A population is an entire set of individuals or objects, which may be finite or infinite. Examples of finite populations include the employees of a given company, the number of airplanes owned by an airline, or the potential consumers in a target market. Examples of infinite populations include the number of watches manufactured by a company that plans to be in business forever, the grains of sand on the beaches of the world, or the stars in the sky.
For a deeper understanding of a population, consider a market researcher for a fast food chain who might want to determine the flavour preferences of Indian customers between the ages of 15 and 25. The population in this example is finite and includes every Indian in the 15-25 age group.
Note that population does not refer to people only. Statisticians also speak of a population of objects, or events, or procedures, or observations, including such things as the quantity of haemoglobin in blood, the number of visits to the doctor by a patient, or the number of surgical operations by a doctor. A population is thus an aggregate of creatures, things, cases and so on.
Sample
A population commonly contains too many individuals to study conveniently, so gathering data
from every individual in this population would be nearly impossible and prohibitively expensive.
So an investigation is often restricted to a part drawn from it, which is called a sample. A part of
the population is called a sample. It is a proportion of the population, a slice of it, a part of it and
all its characteristics.
A sample is a group of units selected from a larger group (the population). By studying the
sample it is hoped to draw valid conclusions about the larger group.
A sample is a smaller group of members of a population selected to represent the population.
A sample is a subset of population.
A sample is a scientifically drawn group that actually possesses the same characteristics as the
population – if it is drawn randomly. Thus a well-chosen sample will contain most of the
information about a particular population parameter but the relation between the sample and the
population must be such as to allow true inferences to be made about a population from that
sample.
The best example of sampling is what housewives do in a kitchen to see whether rice has cooked
enough by tasting just one piece of grain.
If the sample is to be used to make inferences about the population the sample data must be
unbiased. In order for a sample to be unbiased, it must be
- representative of the population
- randomly selected
- sufficiently large
Representative of the population: A representative sample contains members from the
population of interest. In the case of the flavour preferences study we discussed above, the
sample would need to include Indians between the ages of 15 and 25. If people outside of the
target age range are included, the sample would not be representative.
Randomly selected: A random sample is one in which every member of a population has an
equal chance of being selected. In a random sample, each member of the population has an
equally likely chance of being selected for the sample. Suppose that the sample data for the
flavour preferences study discussed earlier came exclusively from students at one university in
India. This sample is not random due to the limited opportunity for the rest of the population
to be involved in the study. Data from this sample would not be representative of the entire
Indian population between ages 15 and 25, because the students attending this university may
have a different preference than other groups of young people. Drawing conclusions about the
overall population from this sample could lead to mistakes. The most commonly used sample is a
simple random sample. It requires that every possible sample of the selected size has an equal
chance of being used.
Sufficiently large: A sample must also be large enough in order for its data to reflect the
population. A sample that is too small may bias population estimates. When larger samples are
used, data collected from idiosyncratic individuals have less influence than when smaller
samples are used.
Imagine what would happen if the flavour preferences study collected data from a sample of
three students and, based on the results from this sample, concluded that Indians between the
ages of 15 and 25 favour a particular flavour – say masala flavour. A sample of three people is
too small to serve as the basis for drawing conclusions about the population in general.
How many people must be included in a sample in order for it to represent the population? The
optimal sample size depends on, among other things, the desired confidence level and the
precision of the confidence interval. A sample size of 30 or more is often desired to ensure that
the distribution of the sample mean is normal. In general, more is better.
Population vs Sample
The main difference between a population and sample has to do with how observations are
assigned to the data set.
A population includes each element from the set of observations that can be made.
A sample consists only of observations drawn from the population.
Depending on the sampling method, a sample can have fewer observations than the population,
the same number of observations, or more observations. More than one sample can be derived
from the same population.
Other differences are related to terms used. For example,
A measurable characteristic of a population, such as a mean or standard deviation, is
called a parameter; but a measurable characteristic of a sample is called a statistic.
The mean of a population is denoted by the symbol μ; but the mean of a sample is denoted by the symbol x̄ (x-bar).
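The parameter/statistic distinction can be illustrated with a short Python sketch (the toy population of 100 values is my own assumption): the population mean μ is computed from every element, while the sample mean x̄ is computed from a simple random sample and merely estimates μ.

```python
import random

population = list(range(1, 101))        # a finite population of 100 values
mu = sum(population) / len(population)  # parameter: the population mean

sample = random.sample(population, 30)  # a simple random sample of 30
x_bar = sum(sample) / len(sample)       # statistic: the sample mean, an estimate of mu

print(mu)  # 50.5
# x_bar varies from sample to sample, but should lie near 50.5
```

Each run draws a different sample, so x̄ changes while μ stays fixed; this is exactly why a statistic requires an inferential leap while a parameter does not.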
What is the difference between information based on a sample and information based on a
population: Information based on a sample is, by definition, incomplete; as such, a sample
demands that inferences be drawn regarding the population from which it came. Information
based on a population, however, is considered complete, and therefore requires no inferential
leap to be made.
What Characteristics are necessary before a sample can be considered random: The members of
the sample must be chosen based on chance from the population. Each member of the population
must have an equal likelihood of being chosen.
What is the consequence of failing to have a random sample from a population?: A sample is a
subset of a population. If a sample is randomly selected and sufficiently large, the information
obtained from the sample will be representative of the population. A small sample, or one that is
not drawn in a random fashion, may be biased. Making inferences from a biased sample to a
population is ill-advised and may lead to costly business mistakes.
Different methods of sampling
There are numerous sample selection methods for drawing the sample from the population,
broadly classified into random or probability-based sampling schemes or survey design methods,
and non-random or non-probability based sampling.
Probability Sampling
Probability samples are selected in such a way as to be representative of the population. They
provide the most valid or credible results because they reflect the characteristics of the
population from which they are selected.
The following sampling methods are types of probability sampling:
1. Simple Random Sampling (SRS)
2. Stratified Sampling
3. Cluster Sampling
4. Multistage Sampling
5. Random-Digit Dialing
6. Systematic Sampling
1. Simple Random Sampling
The most widely known type of a random sample is the simple random sample (SRS). This is
characterized by the fact that the probability of selection is the same for every case in the
population. All have an equal chance of being selected. Simple random sampling is a method of
selecting n units from a population of size N such that every unit of the population has equal
chance of being selected.
There are two methods by which we can select a random sample
(a) Lottery Method
An example may make this easier to understand. Imagine you want to carry out a survey of 100
voters in a small town with a population of 1,000 eligible voters. One method of SRS is that we
write the names of all voters on a piece of paper, put all pieces of paper into a box and draw 100
tickets at random. The draw is done in this manner - Shake the box, draw a piece of paper and set
it aside, shake again, draw another, set it aside, etc. until we had 100 slips of paper. These 100
form our sample. And this sample would be drawn through a simple random sampling procedure
- at each draw, every name in the box had the same probability of being chosen. This is called
the lottery method of random sampling.
(b) Table of random numbers:
The lottery method is a clumsy physical process for choosing random samples. Often it is
convenient to use a ready-made table of random numbers. A random number table is a table of
digits. The digit given in each position in the table was originally chosen randomly from the
digits 1,2,3,4,5,6,7,8,9,0 by a random process in which each digit is equally likely to be chosen.
Thus a random number table is a series of digits (0 to 9) arranged randomly through the rows and
columns. Table 1 gives part of a table of random numbers. The digits are often grouped in fives, as shown here.

Table 1: Table of Random Numbers
(the table itself is not reproduced in this extract)
The researcher can use the list of random numbers to draw a simple random sample from a
population.
Step 1: each element in the population from which the sample is to be drawn must be assigned a
unique number. This is usually done by numbering the elements in the population consecutively.
If there were 280 elements in the population, for example, they would be numbered 001, 002,
003 . . . 280. Here is one procedure for using Table 1 to select a simple random sample:
Step 2: determine a starting point in the table by closing your eyes and placing the point of your
pencil anywhere in the table.
Step 3:Using the starting point you have selected, begin reading the numbers in the table either
across the rows or down the columns. If your population consisted of 99 or fewer elements, read
the numbers in two-digit units; for 999 or fewer elements in the population, read the numbers in
three-digit units, and so forth. If a table number is larger than the number of elements in the
population (e.g., if the table number is 323 and your population is 286), skip that number and
read the next. If you come to a number equivalent to one you have already drawn, you can either
skip the number and read the next one or count the data for that unit of analysis twice. Continue
until you have selected as many valid numbers as there are elements in your desired sample.
The population elements that comprise the simple random sample are those whose numbers correspond to the numbers read from the table.
For example, suppose you have to select a sample of 5 students from a population of 75 students. First give numbers to all students from 1 to 75. Now, through the process in Step 2 above, place your pencil anywhere on the table. Suppose it lands on 62570 in the 2nd column and 4th row. Since Step 3 says that if your population consists of 99 or fewer elements you read the numbers in two-digit units, we read only the first two digits, 62. So the 62nd student is our 1st sample unit. (If you get a number larger than 75, the population size, skip it and take the next number from the table.) To get the next unit, move through the table in any direction from the number you have chosen. Suppose we decide to keep moving down the column. The next entry is 26440; taking the first two digits gives 26, so the 26th student is our 2nd sample unit. Going down the column we get 47174, so the 47th student is our 3rd unit. Moving down, 34378 gives 34, so the 34th student is our 4th unit. Next is 22466, so the 22nd student is our 5th unit.
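The table-reading procedure in this example can be mimicked in code: read the digit stream in two-digit groups, skipping numbers outside the population range and repeats. This is a minimal sketch (the function name and digit stream are illustrative; the stream below is just the leading two digits of each table entry used above):

```python
def srs_from_random_digits(digits, population_size, sample_size):
    """Mimic the random-number-table procedure: read two-digit groups,
    skipping numbers outside 1..population_size and repeats."""
    chosen = []
    for i in range(0, len(digits) - 1, 2):
        number = int(digits[i:i + 2])
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen

# Leading two digits of the table entries read in the example above
stream = "6226473422"
print(srs_from_random_digits(stream, 75, 5))  # [62, 26, 47, 34, 22]
```

In practice one would simply call `random.sample(range(1, 76), 5)`, which implements the same "every unit equally likely, no repeats" rule directly.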
Stratified Random Sampling
In this form of sampling, the population is first divided into two or more mutually exclusive
segments based on some categories of variables of interest in the research. It is designed to
organize the population into homogenous subsets before sampling, then drawing a random
sample within each subset. With stratified random sampling, the population of N units is divided into subpopulations of N1, N2, ..., NL units respectively. These subpopulations, called strata, are non-overlapping and together they comprise the whole of the population. When these have been determined, a sample is drawn from each, with a separate draw for each of the different strata. The sample sizes within the strata are denoted by n1, n2, ..., nL respectively. If a SRS is taken within each stratum, then the whole sampling procedure is described as stratified random sampling.
The primary benefit of this method is to ensure that cases from smaller strata of the population
are included in sufficient numbers to allow comparison.
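Stratified random sampling with proportional allocation can be sketched in a few lines of Python (names and the two-stratum population are illustrative): an SRS of the same fraction is drawn within each stratum, so the small stratum is guaranteed representation.

```python
import random

def stratified_sample(strata, fraction):
    """Draw an SRS of the given fraction within each stratum
    (proportional allocation; at least one unit per stratum)."""
    sample = []
    for units in strata.values():
        k = max(1, round(len(units) * fraction))
        sample.extend(random.sample(units, k))
    return sample

strata = {
    "urban": list(range(1, 81)),    # 80 units
    "rural": list(range(81, 101)),  # 20 units
}
print(len(stratified_sample(strata, 0.10)))  # 8 + 2 = 10
```

With a plain SRS of 10 from all 100 units, the rural stratum could easily be missed entirely; the stratified draw always includes it.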
Systematic Sampling
This method of sampling is at first glance very different from SRS. In practice, it is a variant of
simple random sampling that involves some listing of elements - every nth element of list is then
drawn for inclusion in the sample. Say you have a list of 10,000 people and you want a sample of
1,000.
Creating such a sample includes three steps:
1. Divide number of cases in the population by the desired sample size. In this example,
dividing 10,000 by 1,000 gives a value of 10.
2. Select a random number between one and the value attained in Step 1. In this example,
we choose a number between 1 and 10 - say we pick 7.
3. Starting with case number chosen in Step 2, take every tenth record (7, 17, 27, etc.).
More generally, suppose that the N units in the population are ranked 1 to N in some order (e.g., alphabetic). To select a sample of n units, we compute k = N/n, take a unit at random from the first k units, and take every k-th unit thereafter.
Cluster Sampling
In some instances the sampling unit consists of a group or cluster of smaller units that we call
elements or subunits (these are the units of analysis for your study). There are two main reasons
for the widespread application of cluster sampling. Although the first intention may be to use the
elements as sampling units, it is found in many surveys that no reliable list of elements in the
population is available and that it would be prohibitively expensive to construct such a list. In
many countries there are no complete and updated lists of the people, the houses or the farms in
any large geographical region.
Even when a list of individual houses is available, economic considerations may point to the
choice of a larger cluster unit. For a given size of sample, a small unit usually gives more precise
results than a large unit. For example a SRS of 600 houses covers a town more evenly than 20
city blocks containing an average of 30 houses each. But greater field costs are incurred in
locating 600 houses and in traveling between them than in covering 20 city blocks. When cost is
balanced against precision, the larger unit may prove superior.
Nonprobability Sampling
Social research is often conducted in situations where a researcher cannot select the kinds of
probability samples used in large-scale social surveys. For example, say you wanted to study
homelessness - there is no list of homeless individuals nor are you likely to create such a list.
However, you need to get some kind of a sample of respondents in order to conduct your
research. To gather such a sample, you would likely use some form of non-probability sampling.
To restate, the primary difference between probability methods of sampling and non-probability
methods is that in the latter you do not know the likelihood that any element of a population will
be selected for study.
There are four primary types of non-probability sampling methods:
Availability Sampling
Availability sampling is a method of choosing subjects who are available or easy to find. This
method is also sometimes referred to as haphazard, accidental, or convenience sampling. The
primary advantage of the method is that it is very easy to carry out, relative to other methods. For
example if you want to collect data from women alone, you may stand in a crowded market place
and distribute your schedule as you wish.
Quota Sampling
Quota sampling is designed to overcome the most obvious flaw of availability sampling. Rather
than taking just anyone, you set quotas to ensure that the sample you get represents certain
characteristics in proportion to their prevalence in the population. Note that for this method, you
have to know something about the characteristics of the population ahead of time. Say you want
to make sure you have a sample proportional to the population in terms of gender - you have to
know what percentage of the population is male and female, then collect sample until yours
matches. Marketing studies are particularly fond of this form of research design.
Purposive or judgmental Sampling
Purposive sampling is a sampling method in which elements are chosen based on purpose of the
study. Purposive sampling may involve studying the entire population of some limited group
(Economics BA students of Calicut University) or a subset of a population (Economics BA
students of Calicut University who are women). As with other non-probability sampling
methods, purposive sampling does not produce a sample that is representative of a larger
population, but it can be exactly what is needed in some cases - study of organization,
community, or some other clearly defined and relatively limited group.
Snowball Sampling
Snowball sampling is a method in which a researcher identifies one member of some population
of interest, speaks to him/her, then asks that person to identify others in the population that the
researcher might speak to. This person is then asked to refer the researcher to yet another person,
and so on. Snowball sampling is very good for cases where members of a special population are
difficult to locate.
The best sampling method is the sampling method that most effectively meets the particular
goals of the study in question. The effectiveness of a sampling method depends on many factors.
Because these factors interact in complex ways, the 'best' sampling method is seldom obvious. Good researchers use the following strategy to identify the best sampling method:
List the research goals (usually some combination of accuracy, precision, and/or cost).
Identify potential sampling methods that might effectively achieve those goals.
Test the ability of each method to achieve each goal.
Choose the method that does the best job of achieving the goals.
***********************************
Module II
CORRELATION AND REGRESSION ANALYSIS
Module II. Correlation and Regression Analysis
Correlation-Meaning, Types and Degrees of Correlation- Methods of Measuring Correlation-
Graphical Methods: Scatter Diagram and Correlation Graph; Algebraic Methods: Karl
Pearson’s Coefficient of Correlation and Rank Correlation Coefficient - Properties and
Interpretation of Correlation Coefficient
Introduction
Correlation is a statistical technique which tells us if two variables are related. For example, consider the variables family income and family expenditure. It is well known that income and expenditure increase or decrease together. Thus they are related in the sense that change in any one variable is accompanied by change in the other variable. Again, price and demand of a commodity are related variables; when price increases, demand will tend to decrease and vice versa. If the change in one variable is accompanied by a change in the other, then the variables are said to be correlated. We can therefore say that family income and family expenditure, and price and demand, are correlated.
Correlation can tell us something about the relationship between variables. It is used to understand: (a) whether the relationship is positive or negative, and (b) the strength of the relationship. Correlation is a powerful tool that provides these vital pieces of information.
In the case of family income and family expenditure, it is easy to see that they both rise or fall
together in the same direction. This is called positive correlation.
In case of price and demand, change occurs in the opposite direction so that increase in one is
accompanied by decrease in the other. This is called negative correlation.
Coefficient of Correlation
Correlation is measured by what is called coefficient of correlation (r). A correlation coefficient
is a statistical measure of the degree to which changes to the value of one variable predict change
to the value of another. Correlation coefficients are expressed as values between +1 and -1. Its
numerical value gives us an indication of the strength of relationship. In general, r > 0 indicates
positive relationship, r < 0 indicates negative relationship while r = 0 indicates no relationship
(or that the variables are independent and not related). Here r = +1.0 describes a perfect positive
correlation and r = −1.0 describes a perfect negative correlation. Closer the coefficients are to+1.0 and −1.0, greater is the strength of the relationship between the variables. As a rule ofthumb, the following guidelines on strength of relationship are often useful (though many experts
would somewhat disagree on the choice of boundaries).
Value of r Strength of relationship
−1.0 to −0.5 or 0.5 to 1.0 Strong
−0.5 to −0.3 or 0.3 to 0.5 Moderate
−0.3 to −0.1 or 0.1 to 0.3 Weak
−0.1 to 0.1 None or very weak
1 A perfect positive correlation
0 No Correlation (No relation between two variables)
− 1 A perfect negative correlation
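The rule of thumb above can be encoded as a small helper function. The sketch below is our own addition (the function name `strength_of_r` and the treatment of boundary values are our assumptions; the table's ranges overlap at the boundaries, so here a boundary value is assigned to the stronger category):

```python
def strength_of_r(r):
    """Classify a correlation coefficient r using the rule-of-thumb table."""
    a = abs(r)
    if a > 1:
        raise ValueError("r must lie between -1 and +1")
    if a >= 0.5:
        return "Strong"
    if a >= 0.3:
        return "Moderate"
    if a >= 0.1:
        return "Weak"
    return "None or very weak"

print(strength_of_r(0.7))    # Strong
print(strength_of_r(-0.4))   # Moderate
```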
Correlation is only appropriate for examining the relationship between meaningful quantifiable
data (e.g. air pressure, temperature) rather than categorical data such as gender, favourite colour
etc.
A key thing to remember when working with correlations is never to assume a correlation means
that a change in one variable causes a change in another. Sales of personal computers and
athletic shoes have both risen strongly in the last several years and there is a high correlation
between them, but you cannot assume that buying computers causes people to buy athletic shoes
(or vice versa).
The second caution is that the Pearson correlation technique (which we are about to see) works
best with linear relationships: as one variable gets larger (or smaller), the other gets larger (or
smaller) in direct proportion. It does not work well with curvilinear relationships (in which the
relationship does not follow a straight line). An example of a curvilinear relationship is age and
health care. They are related, but the relationship doesn't follow a straight line. Young children
and older people both tend to use much more health care than teenagers or young adults. (In such
cases, the technique of ‘multiple regression’ can be used to examine curvilinear relationships)
METHODS OF MEASURING CORRELATION
I. Graphical Method
(a) Scatter Diagram
(b) Correlation Graph
II. Algebraic Method (Coefficient of Correlation)
(a) Karl Pearson’s Coefficient of Correlation
(b) Spearman’s Rank Correlation Coefficient
I. (a) Scatter Diagram
Scatter Diagram (also called scatter plot, X–Y graph) is a graph that shows the relationship
between two quantitative variables measured on the same individual. Each individual in the data
set is represented by a point in the scatter diagram. The predictor variable is plotted on the
horizontal axis and the response variable is plotted on the vertical axis. Do not connect the points
when drawing a scatter diagram. The scatter diagram graphs pairs of numerical data, with one
variable on each axis, to look for a relationship between them. If the variables are correlated, the
points will fall along a line or curve. The better the correlation, the tighter the points will hug the
line. Scatter Diagram is a graphical measure of correlation.
Examples of Scatter Diagram. Given below each diagram is the value of correlation.
Note that the value shows how good the correlation is (not how steep the line is), and if it is
positive or negative.
Scatter Diagram Procedure
1. Collect pairs of data where a relationship is suspected.
2. Draw a graph with the independent variable on the horizontal axis and the dependent variable
on the vertical axis. For each pair of data, put a dot or a symbol where the x-axis value intersects
the y-axis value. (If two dots fall together, put them side by side, touching, so that you can see
both.)
3. Look at the pattern of points to see if a relationship is obvious. If the data clearly form a line
or a curve, you may stop. The variables are correlated.
The data set below represents a random sample of 5 workers in a particular industry. The
productivity of each worker was measured at one point in time, and the worker was asked the
number of years of job experience. The dependent variable is productivity, measured in number
of units produced per day, and the independent variable is experience, measured in years.
Worker   y = Productivity (output/day)   x = Experience (in years)
1        33                              10
2        19                               6
3        32                              12
4        26                               8
5        15                               4
[Figure: scatter chart of worker productivity vs experience]

This scatter diagram tells us that the two variables, productivity and experience, are
positively correlated.
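The visual impression from the scatter diagram can be checked numerically. The following sketch (our addition, not part of the original text) computes Pearson's correlation coefficient, which is introduced formally in the algebraic-methods section below, for the five workers:

```python
from math import sqrt

experience = [10, 6, 12, 8, 4]       # x values (years)
productivity = [33, 19, 32, 26, 15]  # y values (output/day)
n = len(experience)

mean_x = sum(experience) / n   # 8
mean_y = sum(productivity) / n # 25

# Sums of squared deviations and cross-products
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, productivity))
sxx = sum((x - mean_x) ** 2 for x in experience)
syy = sum((y - mean_y) ** 2 for y in productivity)

r = sxy / sqrt(sxx * syy)
print(r)  # 0.96, a strong positive correlation
```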
Merits of Scatter Diagram Method:
1. It is an easy way of finding the nature of correlation between two variables.
2. By drawing a line of best fit by free hand method through the plotted dots, the method
can be used for estimating the missing value of the dependent variable for a given value
of independent variable.
3. Scatter diagram can be used to find out the nature of linear as well as non-linear
correlation.
4. The values of extreme observations do not affect the method.
Demerits of Scatter Diagram Method:
It gives only a rough idea of how the two variables are related. It gives an idea about the direction of correlation and also whether it is high or low. But this method does not give any quantitative measure of the degree or extent of correlation.
I (b) Correlation Graph
Correlation graph is also used as a measure of correlation. When this method is used, the correlation graph is drawn and the direction of the curves is examined to understand the nature of correlation. Under this method, separate curves are drawn for the X variable and Y variable on the same graph paper. The values of the variables are taken as ordinates of the points plotted. From the direction and closeness of the two curves we can infer whether the variables are related. If both the curves move in the same direction (upward or downward), correlation is said
to be positive. If the curves are moving in opposite directions, correlation is said to be negative.

But correlation graphs are not capable of doing anything more than suggesting the fact of a possible relationship between two variables. We can neither establish any causal relationship between two variables nor obtain the exact degree of correlation through them. They only tell us whether the two variables are positively or negatively correlated. Example of a graph is given below.
II. Algebraic Method (Coefficient of Correlation)
II. (a) Karl Pearson’s Coefficient of Correlation (Pearson product-moment correlation coefficient)

Karl Pearson’s Product-Moment Correlation Coefficient, or simply Pearson’s Correlation Coefficient for short, is one of the important methods used in Statistics to measure correlation between two variables. Karl Pearson was a British mathematician, statistician, lawyer and eugenicist. He established the discipline of mathematical statistics. He founded the world’s first university statistics department in the University of London in the year 1911. Along with his colleagues Weldon and Galton, he founded the journal ‘Biometrika’, whose object was the development of statistical theory.
The Pearson product-moment correlation coefficient (r) is a common measure of thecorrelation between two variables X and Y. When measured in a population the Pearson
Product-Moment correlation is designated by the Greek letter rho (ρ). When computed in a sample, it is designated by the letter "r" and is sometimes called "Pearson's r." Pearson's correlation reflects the degree of linear relationship between two variables.

Mathematical Formula:
The quantity r, called the linear correlation coefficient, measures the strength and the direction of a linear relationship between two variables. (The linear correlation coefficient is a measure of the strength of linear relation between two quantitative variables. We use the Greek letter ρ (rho) to represent the population correlation coefficient and r to represent the sample correlation coefficient.)
Correlation coefficient for ungrouped data:

r = Σ(Xi − X̄)(Yi − Ȳ) / (n σx σy)

Where
Xi is the ith observation of the variable X
Yi is the ith observation of the variable Y
X̄ is the mean of the observations of the variable X
Ȳ is the mean of the observations of the variable Y
n is the number of pairs of observations of X and Y
σx is the standard deviation of the variable X
σy is the standard deviation of the variable Y
The above formula may be presented in the following form:

r = Σ(Xi − X̄)(Yi − Ȳ) / √[Σ(Xi − X̄)² × Σ(Yi − Ȳ)²]

The same may be computed using the Pearson product-moment correlation coefficient
formula as shown below.
r = [n ΣXiYi − ΣXi ΣYi] / [√(n ΣXi² − (ΣXi)²) × √(n ΣYi² − (ΣYi)²)]
Example: The table below gives the annual advertising expenditure and annual sales of a firm over 10 years.

Year (i)   Annual advertising expenditure (Xi)   Annual sales (Yi)
1 10 20
2 12 30
3 14 37
4 16 50
5 18 56
6 20 78
7 22 89
8 24 100
9 26 120
10 28 110
Compute the necessary values and substitute them in the formula; we will solve using both formulae. We get X̄ = ΣXi/n = 190/10 = 19 and Ȳ = ΣYi/n = 690/10 = 69.

Year (i)  Xi   Yi   (Xi − X̄)  (Yi − Ȳ)  (Xi − X̄)²  (Yi − Ȳ)²  (Xi − X̄)(Yi − Ȳ)
1         10    20      −9       −49        81       2401        441
2         12    30      −7       −39        49       1521        273
3         14    37      −5       −32        25       1024        160
4         16    50      −3       −19         9        361         57
5         18    56      −1       −13         1        169         13
6         20    78       1         9         1         81          9
7         22    89       3        20         9        400         60
8         24   100       5        31        25        961        155
9         26   120       7        51        49       2601        357
10        28   110       9        41        81       1681        369
Total    190   690       0         0       330      11200       1894

We make the additional computations for the Pearson product-moment correlation coefficient formula.

Year (i)   XiYi    Xi²     Yi²
1           200    100     400
2           360    144     900
3           518    196    1369
4           800    256    2500
5          1008    324    3136
6          1560    400    6084
7          1958    484    7921
8          2400    576   10000
9          3120    676   14400
10         3080    784   12100
Total     15004   3940   58810
Substitute the values in the respective formula.
Using the basic formula:

r = Σ(Xi − X̄)(Yi − Ȳ) / √[Σ(Xi − X̄)² × Σ(Yi − Ȳ)²] = 1894 / (√330 × √11200) = 0.985

Now let us redo the problem using the Pearson product-moment correlation coefficient
formula:

r = [n ΣXiYi − ΣXi ΣYi] / [√(n ΣXi² − (ΣXi)²) × √(n ΣYi² − (ΣYi)²)]
  = (10 × 15004 − 190 × 690) / [√(10 × 3940 − 190²) × √(10 × 58810 − 690²)]
  = 18940 / (√3300 × √112000) = 0.985
The correlation coefficient between annual advertising expenditure and annual sales revenue is
0.985. This is a positive value and is very close to 1. So it implies there is a very strong correlation between annual advertising expenditure and annual sales revenue.
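The hand computation above can be verified with a short script. This is a sketch (our addition) applying the product-moment formula to the same data:

```python
from math import sqrt

x = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]      # advertising expenditure
y = [20, 30, 37, 50, 56, 78, 89, 100, 120, 110]   # annual sales
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Pearson product-moment formula
r = (n * sum_xy - sum_x * sum_y) / (
    sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
)
print(round(r, 3))  # 0.985
```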
Properties of Correlation coefficient
1. The correlation coefficient lies between −1 and +1; symbolically, −1 ≤ r ≤ +1.
2. The correlation coefficient is independent of the change of origin & scale.
3. The coefficient of correlation is the geometric mean of the two regression coefficients: r = √(bxy × byx). If one regression coefficient is positive, the other regression coefficient is also positive, and the correlation coefficient is positive.
Assumptions of Pearson’s Correlation Coefficient
1. There is linear relationship between two variables, i.e. when the two variables are plotted on a
scatter diagram a straight line will be formed by the points.
2. Cause and effect relation exists between different forces operating on the item of the two
variable series.
Advantages of Pearson’s Coefficient
1. It summarizes in one value, the degree of correlation & direction of correlation also.
Disadvantages
While 'r' (correlation coefficient) is a powerful tool, it has to be handled with care.
1. The most used correlation coefficients only measure linear relationship. It is therefore
perfectly possible that while there is strong non-linear relationship between the variables,
r is close to 0 or even 0. In such a case, a scatter diagram can roughly indicate the
existence or otherwise of a non-linear relationship.
2. One has to be careful in interpreting the value of 'r'. For example, one could compute 'r'
between the size of shoe and intelligence of individuals, heights and income. Irrespective
of the value of 'r', it makes no sense and is hence termed chance or non-sense correlation.
3. 'r' should not be used to say anything about cause and effect relationship. Put differently,
by examining the value of 'r', we could conclude that variables X and Y are related.
However the same value of 'r' does not tell us if X influences Y or the other way round.
Statistical correlation should not be the primary tool used to study causation, because of
the problem with third variables.
Coefficient of Determination
A convenient way of interpreting the value of the correlation coefficient is to use the square of the coefficient of correlation, which is called the Coefficient of Determination.

The Coefficient of Determination = r².
Suppose r = 0.9; then r² = 0.81. This would mean that 81% of the variation in the dependent variable has been explained by the independent variable.
The maximum value of r2 is 1 because it is possible to explain all of the variation in y but it is
not possible to explain more than all of it.
Coefficient of Determination: An example
Suppose r = 0.60 in one case and r = 0.30 in another case. It does not mean that the first correlation is twice as strong as the second; the 'r' can be understood by computing the value of r².
When r = 0.60, r2 = 0.36 -----(1)
When r = 0.30, r2 = 0.09 -----(2)
This implies that in the first case 36% of the total variation is explained whereas in second case
9% of the total variation is explained.
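The comparison can be made concrete in a few lines (a sketch, our addition, not part of the original text):

```python
# Comparing r = 0.60 with r = 0.30 via the coefficient of determination
r_first, r_second = 0.60, 0.30
explained_first = r_first ** 2    # share of variation explained in the first case
explained_second = r_second ** 2  # share of variation explained in the second case

print(round(explained_first, 2), round(explained_second, 2))  # 0.36 0.09
print(round(explained_first / explained_second))  # 4: four times, not twice, as much
```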
II. (b) Spearman’s Rank Correlation Coefficient
The Spearman's rank-order correlation is the nonparametric version of the Pearson product-
moment correlation. Spearman's correlation coefficient (denoted by the Greek letter ρ (rho), or by rs) measures the strength of association between two ranked variables.
Data which are arranged in numerical order, usually from largest to smallest, and numbered 1, 2, 3, … are said to be in ranks or ranked data. These ranks prove useful at certain times when two or
more values of one variable are the same. The coefficient of correlation for such type of data is
given by Spearman rank difference correlation coefficient.
The Spearman Rank Correlation Coefficient uses ranks to calculate correlation; it is the analogue of Pearson's coefficient when the data are in terms of ranks. One can therefore also call it the correlation coefficient between the ranks.
The Spearman's rank-order correlation is used when there is a monotonic relationship between
our variables. A monotonic relationship is a relationship that does one of the following: (1) as the
value of one variable increases, so does the value of the other variable; or (2) as the value of one
variable increases, the other variable value decreases. A monotonic relationship is an important
underlying assumption of the Spearman rank-order correlation. It is also important to recognize
the assumption of a monotonic relationship is less restrictive than a linear relationship (an
assumption that has to be met by the Pearson product-moment correlation). The middle image
above illustrates this point well: A non-linear relationship exists, but the relationship is
monotonic and is suitable for analysis by Spearman's correlation, but not by Pearson's
correlation.
Let us illustrate the relevance of the Spearman Rank Correlation Coefficient with the aid of an example.
As an example, let us consider a musical talent contest where 10 competitors are evaluated by
two judges, A and B. Usually judges award numerical scores for each contestant after his/her
performance.
A product moment correlation coefficient of scores by the two judges hardly makes sense here as
we are not interested in examining the existence or otherwise of a linear relationship between the
scores.
What makes more sense is correlation between ranks of contestants as judged by the two judges.
Spearman Rank Correlation Coefficient can indicate if judges agree to each other's views as far
as talent of the contestants are concerned (though they might award different numerical scores) -
in other words if the judges are unanimous.
The numerical value of the correlation coefficient, rs, ranges between −1 and +1. The correlation coefficient is the number indicating how the scores are related.

In general,
 rs > 0 implies positive agreement among ranks
 rs < 0 implies negative agreement (or agreement in the reverse direction)
 rs = 0 implies no agreement
The closer rs is to 1, the better is the agreement, while rs closer to −1 indicates strong agreement in the reverse direction.

The formula for finding the Spearman Rank Correlation Coefficient is

rs = 1 − [6 ΣDi²] / [n(n² − 1)]

Where
Di = Xi − Yi is the difference between the ranks of the ith pair of observations
Xi is the rank of the ith observation of the variable X
Yi is the rank of the ith observation of the variable Y
n is the number of pairs of observations
Let us calculate Spearman Rank Correlation Coefficient for our example of the musical talent
contest, where 10 competitors are evaluated by two judges, A and B. The scores are given below.
Contestant Rating by judge 1 Rating by judge 2
1 1 2
2 2 4
3 3 5
4 4 1
5 5 3
6 6 6
7 7 7
8 8 9
9 9 10
10 10 8
Let us first make the necessary calculations.

Contestant  Rating by judge 1 (Xi)  Rating by judge 2 (Yi)  Di = Xi − Yi  Di²
1           1                       2                       −1            1
2 2 4 -2 4
3 3 5 -2 4
4 4 1 3 9
5 5 3 2 4
6 6 6 0 0
7 7 7 0 0
8 8 9 -1 1
9 9 10 -1 1
10 10 8 2 4
ΣDi² = 28
rs = 1 − [6 ΣDi²] / [n(n² − 1)] = 1 − (6 × 28) / (10 × (10² − 1)) = 1 − 168/990 = 0.8303

The Spearman Rank Correlation Coefficient tries to assess the relationship between ranks without making any assumptions about the nature of their relationship. Hence it is a non-parametric measure, a feature which has contributed to its popularity and widespread use.

Interpretation of Rank Correlation Coefficient (R)
1. The value of the rank correlation coefficient, R, ranges from −1 to +1.
2. If R = +1, then there is complete agreement in the order of the ranks and the ranks are in the same direction.
3. If R = −1, then there is complete agreement in the order of the ranks and the ranks are in the opposite direction.
4. If R = 0, then there is no correlation.

Advantages of Spearman’s Rank Correlation
1. This method is simpler to understand and easier to apply compared to Karl Pearson’s correlation method.
2. This method is useful where we can give the ranks and not the actual data (qualitative data).
3. This method is to be used where the initial data are in the form of ranks.
Disadvantages Spearman’s Rank Correlation
1. It cannot be used for finding out correlation in a grouped frequency distribution.
2. This method should not be applied where N exceeds 30, as ranking and calculation become tedious.
3. As Spearman's rank correlation only uses ranks, it is not affected by significant variations in readings. As long as the order remains the same, the coefficient will stay the same. As with any comparison, the possibility of chance will have to be evaluated to ensure that the two quantities are actually connected.
4. A significant correlation does not necessarily mean cause and effect.
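The judges example worked above can be reproduced with a short sketch (our addition, using the rank-difference formula):

```python
judge1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # ranks awarded by judge 1
judge2 = [2, 4, 5, 1, 3, 6, 7, 9, 10, 8]   # ranks awarded by judge 2
n = len(judge1)

# Spearman's formula: rs = 1 - 6*sum(Di^2) / (n*(n^2 - 1))
sum_d2 = sum((x - y) ** 2 for x, y in zip(judge1, judge2))
rs = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))
print(sum_d2, round(rs, 4))  # 28 0.8303
```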
Advantages of Correlation Studies
1. They show the amount (strength) of relationship present.
2. They can be used to make predictions about the variables under study.
3. They can be used in many places, including natural settings, libraries, etc.
4. Correlational data are easier to collect.
REGRESSION ANALYSIS*
* Note: In the syllabus for the III Semester BA Economics paper ‘Quantitative Methods for Economic Analysis – 1’, though the title of this Module II is given as “Correlation and Regression Analysis”, regression is not included in the contents. Hence we give here a brief discussion on regression.
If two variables are significantly correlated, and if there is some theoretical basis for doing so, it
is possible to predict values of one variable from the other. This observation leads to a very
important concept known as ‘Regression Analysis’.

Regression analysis, in the general sense, means the estimation or prediction of the unknown value
of one variable from the known value of the other variable. It is one of the most important
statistical tools which is extensively used in almost all sciences – Natural, Social and Physical. It
is specially used in business and economics to study the relationship between two or more
variables that are related causally and for the estimation of demand and supply graphs, cost
functions, production and consumption functions and so on.
Prediction or estimation is one of the major problems in almost all the spheres of human activity.
The estimation or prediction of future production, consumption, prices, investments, sales,
profits, income etc. are of very great importance to business professionals. Similarly, population
estimates and population projections, GNP, Revenue and Expenditure etc. are indispensable for
economists and efficient planning of an economy.
Regression analysis was explained by M. M. Blair as follows:
“Regression analysis is a mathematical measure of the average relationship between two or more variables in terms of the original units of the data.”
Regression Analysis is a very powerful tool in the field of statistical analysis in predicting the
value of one variable, given the value of another variable, when those variables are related to
each other. It is a mathematical measure of the average relationship between two or more variables, and a statistical tool used in predicting the value of an unknown variable from a known variable.
Advantages of Regression Analysis
1. Regression analysis provides estimates of values of the dependent variables from the values of
independent variables.
2. Regression analysis also helps to obtain a measure of the error involved in using the regression line as a basis for estimation.
3. Regression analysis helps in obtaining a measure of the degree of association or correlation
that exists between the two variables.
Assumptions in Regression Analysis
1. Existence of actual linear relationship.
2. The regression analysis is used to estimate the values within the range for which it is valid.
3. The relationship between the dependent and independent variables remains the same till the
regression equation is calculated.
4. The dependent variable takes any random value but the values of the independent variables are
fixed.
5. In regression, we have only one dependent variable in our estimating equation. However, we
can use more than one independent variable.
Regression line
A regression line summarizes the relationship between two variables in the setting when one of
the variables helps explain or predict the other.
A regression line is a straight line that describes how a response variable y changes as an
explanatory variable x changes. A regression line is used to predict the value of y for a given
value of x. Regression, unlike correlation, requires that we have an explanatory variable and a
response variable.
Regression line is the line which gives the best estimate of one variable from the value of any
other given variable. The regression line gives the average relationship between the two variables
in mathematical form.
For two variables X and Y, there are always two lines of regression.

Regression line of X on Y: gives the best estimate of the value of X for any specific given value of Y:
X = a + bY
Where
a = X – intercept
b = Slope of the line
X = Dependent variable
Y = Independent variable
Regression line of Y on X: gives the best estimate of the value of Y for any specific given value of X:
Y = a + bX
Where
a = Y – intercept
b = Slope of the line
Y = Dependent variable
X = Independent variable
Simple Linear Regression
Regression analysis is most often used for prediction. The goal in regression analysis is to create
a mathematical model that can be used to predict the values of a dependent variable based upon
the values of an independent variable. In other words, we use the model to predict the value of Y
when we know the value of X. (The dependent variable is the one to be predicted). Correlation
analysis is often used with regression analysis because correlation analysis is used to measure the
strength of association between the two variables X and Y.
In regression analysis involving one independent variable and one dependent variable the values
are frequently plotted in two dimensions as a scatter plot. The scatter plot allows us to visually
inspect the data prior to running a regression analysis. Often this step allows us to see if the
relationship between the two variables is increasing or decreasing and gives only a rough idea of
the relationship. The simplest relationship between two variables is a straight-line or linear
relationship. Of course the data may well be curvilinear and in that case we would have to use a
different model to describe the relationship. Simple linear regression analysis finds the straight
line that best fits the data.
Fitting a Line to Data
Fitting a line to data means drawing a line that comes as close as possible to the points. (Note
that, no straight line passes exactly through all of the points). The overall pattern can be
described by drawing a straight line through the points.
Example: The data in the table below were obtained by measuring the heights of 161 children from a village each month from 18 to 29 months of age.

Table: Mean height of children
Age in months (x)   Height in centimetres (y)
18 76.1
19 77
20 78.1
21 78.2
22 78.8
23 79.7
24 79.9
25 81.1
26 81.2
27 81.8
28 82.8
29 83.5
Figure below is a scatterplot of the data in the above table.
Age is the explanatory variable, which is plotted on the x axis. Mean height (in cm) is
the response variable.
We can see on the plot a strong positive linear association with no outliers. The correlation is
r=0.994, close to the r = 1 of points that lie exactly on a line.
If we draw a line through the points, it will describe these data very well. This line is called the
regression line and the process of doing so is called ‘Fitting a line’. This is done in figure below.
Let y is a response variable and x is an explanatory variable.
A straight line relating y to x has an equation of the form y = a + bx.
In this equation, b is the slope, the amount by which y changes when x increases by one unit.
The number a is the intercept, the value of y when x = 0
The straight line describing the data has the form
height = a + (b × age).
In Figure below the regression line has been drawn with the following equation
height = 64.93 + (0.635 × age).
[Figure: scatterplot of mean height (cm) against age in months, with the fitted regression line]
The figure above shows that this line fits the data well.

The slope b = 0.635 tells us that the height of children increases by about 0.6 cm for each month of age.

The slope b of a line y = a + bx is the rate of change in the response y as the explanatory variable x changes. The slope of a regression line is an important numerical description of the relationship between the two variables.
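The fitted values a = 64.93 and b = 0.635 can be recovered by least squares. A minimal sketch (our addition, not part of the original text):

```python
ages = list(range(18, 30))  # 18 to 29 months
heights = [76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5]
n = len(ages)

mean_x = sum(ages) / n
mean_y = sum(heights) / n

# Least-squares slope and intercept for y = a + b*x
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, heights)) / \
    sum((x - mean_x) ** 2 for x in ages)
a = mean_y - b * mean_x

print(round(a, 2), round(b, 3))  # 64.93 0.635
```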
Regression for prediction
We use the regression equation for prediction of the value of a variable. Suppose we have a sample of size n with two sets of measures, denoted by x and y. We can predict the values of y given the values of x by using the regression equation given below:
y* = a + bx
where the coefficients a and b are given by

b = [n Σxy − (Σx)(Σy)] / [n Σx² − (Σx)²]

a = [Σy − b Σx] / n
In the regression equation, the symbol y* refers to the predicted value of y from a given value of x.
Let us see with the aid of an example how regression is used for prediction.
Example:
Scores made by students in a statistics class in the mid-term and final examinations are given here. Develop a regression equation which may be used to predict final examination scores from the mid-term score.
STUDENT MID TERM FINAL
1 98 90
2 66 74
3 100 98
4 96 88
5 88 80
6 45 62
7 76 78
8 60 74
9 74 86
10 82 80
Solution: We want to predict the final exam scores from the mid-term scores. So let us designate ‘y’ for the final exam scores and ‘x’ for the mid-term exam scores. We open the following table for the calculations.
STUDENT   X     Y     X²      XY
1 98 90 9604 8820
2 66 74 4356 4884
3 100 98 10000 9800
4 96 88 9216 8448
5 88 80 7744 7040
6 45 62 2025 2790
7 76 78 5776 5928
8 60 74 3600 4440
9 74 86 5476 6364
10 82 80 6724 6560
Total    785   810   64521   65074
First find b and then find a and substitute in the equation.
b = [n Σxy − (Σx)(Σy)] / [n Σx² − (Σx)²] = [10(65074) − (785)(810)] / [10(64521) − (785)²]
  = (650740 − 635850) / (645210 − 616225) = 14890 / 28985 = 0.514
a = [Σy − b Σx] / n = [810 − (0.514)(785)] / 10 = (810 − 403.49) / 10 = 406.51 / 10 = 40.651

So a = 40.651 and b = 0.514.
Substitute in the equation for regression line y* = a + bx
y* = 40.651 + (0.514)x
Now we can use this for making predictions.
We can use this to find the projected or estimated final scores of the students.
For example, for the midterm score of 50 the projected final score is
y* = 40.651 + (0.514)(50) = 40.651 + 25.70 = 66.351, which is quite a good estimate.
To give another example, consider the midterm score of 70. Then the projected final score is
y* = 40.651 + (0.514)(70) = 40.651 + 35.98 = 76.631, again a very good estimate.
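The calculation above can be sketched in a few lines of Python. Following the text's procedure, b is rounded to three decimals before a is computed, so the output matches the worked example exactly.

```python
# Least-squares regression of final scores (y) on mid-term scores (x),
# using the data from the worked example above.
x = [98, 66, 100, 96, 88, 45, 76, 60, 74, 82]   # mid-term scores
y = [90, 74, 98, 88, 80, 62, 78, 74, 86, 80]    # final scores
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# b = (N*Sxy - Sx*Sy) / (N*Sx2 - Sx^2), rounded to 3 decimals as in the text
b = round((n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2), 3)
a = (sum_y - b * sum_x) / n

def predict(score):
    """Projected final score y* = a + b*x for a given mid-term score."""
    return a + b * score

print(round(a, 3), b)          # 40.651 0.514
print(round(predict(50), 3))   # 66.351
```

Note that computing a from the unrounded b would give a slightly different intercept (about 40.67); the sketch reproduces the textbook's rounding for comparability.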
Applications (uses) of regression analysis
1. Predicting the Future: The most common use of regression in business is to predict events that
have yet to occur. Demand analysis, for example, predicts how many units consumers will
purchase. Many other key parameters other than demand are dependent variables in regression
models, however. Predicting the number of shoppers who will pass in front of a particular
billboard or the number of viewers who will watch the Champions Trophy Cricket may help
management assess what to pay for an advertisement.
2. Insurance companies rely heavily on regression analysis to estimate, for example, how many policy holders will be involved in accidents or be victims of theft.
3. Optimization: Another key use of regression models is the optimization of business processes.
A factory manager might, for example, build a model to understand the relationship between
oven temperature and the shelf life of the cookies baked in those ovens. A company operating a
call center may wish to know the relationship between wait times of callers and number of
complaints.
4. A fundamental driver of enhanced productivity in business and rapid economic advancement
around the globe during the 20th century was the frequent use of statistical tools in
manufacturing as well as service industries. Today, managers consider regression an indispensable tool.
Limitations of Regression Analysis
There are three main limitations:
1. Parameter Instability - This is the tendency for relationships between variables to change over
time due to changes in the economy or the markets, among other uncertainties. If a mutual fund
produced a return history in a market where technology was a leadership sector, the model may
not work when foreign and small-cap markets are leaders.
2. Public Dissemination of the Relationship - In an efficient market, this can limit the
effectiveness of that relationship in future periods. For example, the discovery that low price-to-
book value stocks outperform high price-to-book value means that these stocks can be bid
higher, and value-based investment approaches will not retain the same relationship as in the
past.
3. Violation of Regression Relationships - Earlier we summarized the six classic assumptions of
a linear regression. In the real world these assumptions are often unrealistic - e.g. assuming the
independent variable X is not random.
Correlation or Regression
Correlation and regression analysis are related in the sense that both deal with relationships
among variables. Whether to use Correlation or Regression in an analysis is often confusing for
researchers.
In regression the emphasis is on predicting one variable from the other, in correlation the
emphasis is on the degree to which a linear model may describe the relationship between two
variables. In regression the interest is directional, one variable is predicted and the other is the
predictor; in correlation the interest is non-directional, the relationship is the critical aspect.
Correlation makes no a priori assumption as to whether one variable is dependent on the other(s)
and is not concerned with the dependence of one variable on the other; instead it gives an estimate of
the degree of association between the variables. In fact, correlation analysis tests for
interdependence of the variables.
As regression attempts to describe the dependence of a variable on one (or more) explanatory
variables; it implicitly assumes that there is a one-way causal effect from the explanatory
variable(s) to the response variable, regardless of whether the path of effect is direct or indirect.
There are advanced regression methods that allow a non-dependence based relationship to be described (e.g. Principal Components Analysis, or PCA); these will be touched on later.
The best way to appreciate this difference is by example.
Take for instance samples of the leg length and skull size from a population of elephants. It
would be reasonable to suggest that these two variables are associated in some way, as elephants
with short legs tend to have small heads and elephants with long legs tend to have big heads. We
may, therefore, formally demonstrate an association exists by performing a correlation analysis.
However, would regression be an appropriate tool to describe a relationship between head size
and leg length? Does an increase in skull size cause an increase in leg length? Does a decrease in
leg length cause the skull to shrink? As you can see, it is meaningless to apply a causal
regression analysis to these variables, as they are interdependent and one is not wholly dependent on the other; more likely some other factor (e.g. food supply, genetic makeup) affects them both.
Consider two variables: crop yield and temperature. These are measured independently, one by
the weather station thermometer and the other by Farmer Giles' scales. While correlation analysis would show a high degree of association between these two variables, regression analysis would
be able to demonstrate the dependence of crop yield on temperature. However, careless use of
regression analysis could also demonstrate that temperature is dependent on crop yield: this
would suggest that if you grow really big crops you will be guaranteed a hot summer.
Thus, neither regression nor correlation analyses can be interpreted as establishing cause-and-
effect relationships. They can indicate only how or to what extent variables are associated with
each other. The correlation coefficient measures only the degree of linear association between
two variables. Any conclusions about a cause-and-effect relationship must be based on the
judgment of the analyst.
Uses of Correlation and Regression
There are three main uses for correlation and regression.
1. One is to test hypotheses about cause-and-effect relationships. In this case, the experimenter
determines the values of the X-variable and sees whether variation in X causes variation in Y.
For example, giving people different amounts of a drug and measuring their blood pressure.
2. The second main use for correlation and regression is to see whether two variables are
associated, without necessarily inferring a cause-and-effect relationship. In this case, neither
variable is determined by the experimenter; both are naturally variable. If an association is found,
the inference is that variation in X may cause variation in Y, or variation in Y may cause
variation in X, or variation in some other factor may affect both X and Y.
3. The third common use of linear regression is estimating the value of one variable
corresponding to a particular value of the other variable.
*************************
MODULE III
INDEX NUMBERS AND TIME SERIES ANALYSIS
Index Numbers: Meaning and Uses- Laspeyre’s, Paasche’s, Fisher’s, Dorbish-Bowley,
Marshall-Edgeworth and Kelley’s Methods- Tests of Index Numbers: Time Reversal and Factor
Reversal tests - Base Shifting, Splicing and Deflating - Special Purpose Indices: Wholesale Price
Index, Consumer Price Index and Stock Price Indices: BSE SENSEX and NSE-NIFTY. Time
Series Analysis-Components of Time Series, Measurement of Trend by Moving Average and the
Method of Least Squares.
Introduction
Historically, the first index was constructed in 1764 to compare the Italian price index in
1750 with the price level in 1500. Though originally developed for measuring the effect of
change in prices, index numbers have today become one of the most widely used statistical
devices and there is hardly any field where they are not used. Newspapers headline the fact that
prices are going up or down, that industrial production is rising or falling, that imports are
increasing or decreasing, that crimes are rising in a particular period compared to the previous
period as disclosed by index numbers. They are used to feel the pulse of the economy and they
have come to be used as indicators of inflationary or deflationary tendencies, In fact, they are
described as ‘barometers of economic activity’, i.e., if one wants to get an idea as to what ishappening to an economy, he should look to important indices like the index number of
industrial production, agricultural production, business activity, etc.
Of the important statistical devices and techniques, Index Numbers have today become one of
the most widely used for judging the pulse of economy, although in the beginning they were
originally constructed to gauge the effect of changes in prices. Today we use index numbers for
cost of living, industrial production, agricultural production, imports and exports, etc.
Index numbers are the indicators which measure percentage changes in a variable (or a group of
variables) over a specified time. For example, if we say that the index of exports for the year 2013 is 125, taking 2010 as base year, it means that there is an increase of 25% in the country's exports as compared to the corresponding figure for the year 2010.
Definitions of Index number
According to
Spiegel: “An index number is a statistical measure, designed to measure changes in a variable, or a group of related variables with respect to time, geographical location or other characteristics such as income, profession, etc.”

Patterson: “In its simplest form, an index number is the ratio of two index numbers expressed as a percent. An index is a statistical measure, a measure designed to show changes in one variable
or a group of related variables over time, with respect to geographical location or other
characteristics.”
Bowley: “Index numbers are used to measure the changes in some quantity which we cannot observe directly.”
We can thus say that index numbers are economic barometers to judge the inflation (increase in
prices) or deflationary (decrease in prices) tendencies of the economy. They help the government
in adjusting its policies in case of inflationary situations.
TYPES OF INDEX NUMBERS
Index numbers are named after the activity they measure. Their types are as under:

Price Index: Measures changes in price over a specified period of time. It is basically the ratio of the prices of a certain number of commodities in the current year as against the base year.

Quantity Index: As the name suggests, these indices pertain to measuring changes in volumes of commodities, like goods produced or goods consumed, etc.

Value Index: These compare changes in the monetary value of imports, exports, production or consumption of commodities.
Purpose of Index Numbers
An index number designed with a specific objective in mind is a very powerful tool. For example, an index whose purpose is to measure the consumer price index should not include wholesale rates of items, and an index number meant for slum colonies should not consider luxury items like air-conditioners, cars, refrigerators, etc.
Index numbers are meant to study the change in the effects of such factors which cannot be
measured directly. For example, changes in business activity in a country are not capable of
direct measurement but it is possible to study relative changes in business activity by studying
the variations in the values of some such factors which affect business activity, and which are
capable of direct measurement.
CHARACTERISTICS OF INDEX NUMBERS
Following are some of the important characteristics of index numbers :
(a) Index numbers are expressed in terms of percentages to show the extent of relative change.
(b) Index numbers measure relative changes. They measure the relative change in the value of a variable or a group of related variables over a period of time or between places.
(c) Index numbers measure changes which are not directly measurable.
The cost of living, the price level or the business activity in a country are not directly measurable, but it is possible to study relative changes in these activities by measuring the changes in the values of the variables/factors which affect these activities.
PROBLEMS IN THE CONSTRUCTION OF INDEX NUMBERS
The decision regarding the following problems/aspect have to be taken before starting the actual
construction of any type of index numbers.
(i) Purpose of Index numbers under construction
(ii) Selection of items
(iii) Choice of an appropriate average
(iv) Assignment of weights (importance)
(v) Choice of base period
Let us discuss these one-by-one
(i) Purpose of Index Numbers
An index number designed with a specific objective in mind is a very powerful tool. For example, an index whose purpose is to measure the consumer price index should not include wholesale rates of items, and an index number meant for slum colonies should not consider luxury items like air-conditioners, cars, refrigerators, etc.
(ii) Selection of Items
After the objective of construction of index numbers is defined, only those items which are related and relevant to the purpose should be included.
(iii) Choice of Average
As index numbers are themselves specialised averages, it has to be decided first as to which
average should be used for their construction. The arithmetic mean, being easy to use and
calculate, is preferred over other averages (median, mode or geometric mean). In this lesson, we
will be using only arithmetic mean for construction of index numbers.
(iv) Assignment of weights
Proper importance has to be given to the items used for construction of index numbers. It is
universally agreed that wheat is the most important cereal as against other cereals, and hence
should be given due importance.
(v) Choice of Base year
The index number for a particular year is computed against a year in the near past, which is called the base year. It may be kept in mind that the base year should be a normal and economically stable year.
USES OF INDEX NUMBERS
Index numbers are a commonly used statistical device for measuring the combined fluctuations in a group of related variables. If we wish to compare the price level of consumer items today with
that prevalent ten years ago, we are not interested in comparing the prices of only one item, but
in comparing some sort of average price levels. We may wish to compare the present agricultural
production or industrial production with that at the time of independence. Here again, we have to
consider all items of production and each item may have undergone a different fractional
increase (or even a decrease). How do we obtain a composite measure? This composite measure
is provided by index numbers which may be defined as a device for combining the variations that
have come in a group of related variables over a period of time, with a view to obtaining a figure that represents the ‘net’ result of the change in the constituent variables.
Index numbers may be classified in terms of the variables that they are intended to measure. In
business, different groups of variables in the measurement of which index number techniques are
commonly used are (i) price, (ii) quantity, (iii) value and (iv) business activity. Thus, we have
index of wholesale prices, index of consumer prices, index of industrial output, index of value of
exports and index of business activity, etc. Here we shall be mainly interested in index numbers
of prices showing changes with respect to time, although methods described can be applied to
other cases. In general, the present level of prices is compared with the level of prices in the past.
The present period is called the current period and some period in the past is called the base
period.
1) Index numbers are used as economic barometers:
Index numbers are a special type of average which helps to measure economic fluctuations in the price level, the money market, and the economic cycle (inflation, deflation, etc.). G. Simpson and F. Kafka say that index numbers are today one of the most widely used statistical devices. They are used to take the pulse of the economy and as indicators of inflationary or deflationary tendencies. So index numbers are called economic barometers.
2) Index numbers help in formulating suitable economic policies, planning, etc.
Many economic and business policies are guided by index numbers. For example, while deciding an increase in the DA of employees, employers have to depend primarily on the cost of living index. If salaries or wages are not increased according to the cost of living, it leads to strikes, lock-outs, etc. Index numbers provide guidelines that one can use in making decisions.
3) They are used in studying trends and tendencies.
Since index numbers are most widely used for measuring changes over a period of
time, the time series so formed enable us to study the general trend of the phenomenon under
study. For example, looking at the last 8 to 10 years, we can say that imports are showing an upward tendency.
4) They are useful in forecasting future economic activity.
Index numbers are used not only in studying the past and present workings of our
economy but also important in forecasting future economic activity.
5) Index numbers measure the purchasing power of money.
The cost of living index numbers determine whether the real wages are rising or falling
or remain constant. The real wages can be obtained by dividing the money wages by the
corresponding price index and multiplying by 100. Real wages help us in determining the
purchasing power of money.
6) Index numbers are used in deflating.
Index numbers are highly useful in deflating i.e. they are used to adjust the wages for cost of
living changes and thus transform nominal wages into real wages, nominal income to real
income, nominal sales to real sales etc. through appropriate index numbers.
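The deflating idea in point 6 can be illustrated with a tiny Python sketch; the wage and index figures here are hypothetical, chosen only to show the arithmetic.

```python
# Deflating money wages into real wages with a price index (hypothetical data).
# Real wage = (money wage / price index) * 100
money_wages = [200, 240, 300]    # rupees, for three successive years
price_index = [100, 120, 200]    # price index, base year = first year (100)

real_wages = [round(w / i * 100, 2) for w, i in zip(money_wages, price_index)]
print(real_wages)   # [200.0, 200.0, 150.0]
```

Although money wages rose every year, the real wage in the third year fell, which is exactly what deflating by a cost of living index is meant to reveal.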
Methods of Constructing Index Numbers
Construction of index numbers can be divided into two types:

(a) Unweighted indices
(i) Simple Aggregative method
(ii) Simple Average of Price Relatives method

(b) Weighted indices
(i) Weighted Aggregative indices
1. Laspeyres’ Method
2. Paasche’s Method
3. Dorbish and Bowley’s Method
4. Fisher’s Ideal Method
5. Marshall-Edgeworth Method, and
6. Kelley’s Method
(ii) Weighted Average of Relatives
Let us see them in detail.
a (i) Simple Aggregative Method
This is a simple method for constructing index numbers. In this method, the total of the prices of commodities in a given (current) year is divided by the total of the prices of commodities in a base year and expressed as a percentage:

P01 = (∑p1 / ∑p0) × 100

where ∑p1 = total of current year prices for various commodities, and ∑p0 = total of base year prices for various commodities.

Example 1
Let us take an example to illustrate
Construct the price index number for 2013, taking the year 2010 as base year
Commodity   Price in 2010   Price in 2013
A           60              80
B           50              60
C           70              100
D           120             160
E           100             150
Solution:
Calculation of the simple aggregative index number for 2013 (against the year 2010), using the formula:

Commodity   Price in 2010 (p0)   Price in 2013 (p1)
A           60                   80
B           50                   60
C           70                   100
D           120                  160
E           100                  150
Total       ∑p0 = 400            ∑p1 = 550

Substituting in the formula, P01 = (∑p1 / ∑p0) × 100 = (550/400) × 100 = 137.50

This means that the price index for the year 2013, taking 2010 as base year, is 137.5, showing that there is an increase of 37.5% in prices in 2013 as against 2010.

Example 2
Compute the index numbers for the years 2011, 2012, 2013 and 2014, taking 2010 as base year, from the following data.
Year    2010   2011   2012   2013   2014
Price   120    144    168    204    216
Solution :
Price relatives for the different years are:

2010: (120/120) × 100 = 100
2011: (144/120) × 100 = 120
2012: (168/120) × 100 = 140
2013: (204/120) × 100 = 170
2014: (216/120) × 100 = 180
Price indices for the different years are as in the following table.
Year          2010   2011   2012   2013   2014
Price Index   100    120    140    170    180
There are two main limitations of this method:
(i) The units used in the price or quantity quotations can exert a big influence on the value of the index, and
(ii) No consideration is given to the relative importance of the commodities.
a (ii) Simple Average of Price Relatives Method
Price relative means the ratio of the price of a certain item in the current year to the price of that item in the base year, expressed as a percentage (i.e. price relative = (p1/p0) × 100). For example, if a TV cost Rs. 12000 in 2005 and Rs. 18000 in 2013, the price relative is
(18000/12000) × 100 = 150.
When this method is used to construct a price index, first of all price relatives are obtained for the various items included in the index, and then an average of these relatives is obtained using any one of the measures of central value, i.e., arithmetic mean, median, mode, geometric or harmonic mean. When the arithmetic mean is used for averaging the relatives, the formula for computing the index is:

P01 = ∑((p1/p0) × 100) / N

where P01 is the price index, N is the number of items, p0 is the price of a commodity in the base year and p1 is the price of the corresponding commodity in the current year (for which the index is to be calculated).
Example
Construct, by the simple average of price relatives method, the price index of 2013, taking 2010 as base year, from the following data:

Commodity       A    B    C    D    E      F
Price in 2010   60   50   60   50   25     20
Price in 2013   80   60   72   75   37.5   30
Solution
Find the price relative for each commodity, take the sum, and substitute in the formula.

Commodity            A            B            C            D            E               F
Price in 2010 (p0)   60           50           60           50           25              20
Price in 2013 (p1)   80           60           72           75           37.5            30
Price relative       (80/60)×100  (60/50)×100  (72/60)×100  (75/50)×100  (37.5/25)×100   (30/20)×100
(p1/p0) × 100        133.33       120.00       120.00       150.00       150.00          150.00

∑((p1/p0) × 100) = 823.33

Substituting, we get P01 = ∑((p1/p0) × 100) / N = 823.33 / 6 = 137.22
Price index for 2013, taking 2010 as base year = 137.22
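Both unweighted methods can be verified with a short Python sketch; it reproduces the two worked results above (137.50 for the simple aggregative index and 137.22 for the simple average of relatives).

```python
# Unweighted price index numbers for 2013 with 2010 as base year.

# Example 1: simple aggregative method (five commodities).
p0 = [60, 50, 70, 120, 100]    # base-year prices
p1 = [80, 60, 100, 160, 150]   # current-year prices
simple_aggregative = sum(p1) / sum(p0) * 100

# Simple average of price relatives (arithmetic mean), six-commodity example.
r0 = [60, 50, 60, 50, 25, 20]      # base-year prices
r1 = [80, 60, 72, 75, 37.5, 30]    # current-year prices
relatives = [c / b * 100 for b, c in zip(r0, r1)]
avg_of_relatives = sum(relatives) / len(relatives)

print(round(simple_aggregative, 2))  # 137.5
print(round(avg_of_relatives, 2))    # 137.22
```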
An un-weighted aggregate price index represents the changes in prices, over time, for an entire group of commodities. However, an un-weighted aggregate price index has two shortcomings. First, this index considers each commodity in the group as equally
important. Thus, the most expensive commodities per unit are overly influential. Second, not all the commodities are consumed at the same rate. In an un-weighted index, changes in the price of the least consumed commodities are overly influential.
(b) i. Weighted Aggregative Indices
Due to the shortcomings of un-weighted aggregate price indices, weighted aggregate price
indices are generally preferable. Weighted aggregate price indices account for differences in the
magnitude of prices per unit and differences in the consumption levels of the items in the market
basket.
When all commodities are not of equal importance, this method is used. Here we assign a weight to each commodity relative to its importance, and the index number computed from these weights is called a weighted index number.
b.i. (i) Laspeyres’ Method
In this index number the base year quantities are used as weights, so it is also called the base year weighted index:

P01 = (∑p1q0 / ∑p0q0) × 100

The primary disadvantage of the Laspeyres method is that it does not take into consideration the
consumption pattern. The Laspeyres Index has an upward bias. When the prices increase, there
is a tendency to reduce the consumption of higher priced items. Similarly when prices decline,
consumers shift their purchase to those items which decline the most.
b.i. (ii) Paasche’s Method
Under this method weights are determined by quantities in the given (current) year:

P01 = (∑p1q1 / ∑p0q1) × 100

The Paasche price index uses the consumption quantities in the year of interest instead of using
the initial quantities. Thus, the Paasche index is a more accurate reflection of total consumption
costs at that point in time. However, there are two major drawbacks of the Paasche index. First,
accurate consumption values for current purchases are often difficult to obtain. Thus, many
important indices, such as the consumer price index (CPI), use the Laspeyres method. Second, if
a particular product increases greatly in price compared to the other items in the market basket,
consumers will avoid the high-priced item out of necessity, not because of changes in what they
might prefer to purchase.
b.i. (iii) Dorbish and Bowley’s Method
Dorbish and Bowley have suggested the simple arithmetic mean of the two indices (Laspeyres and Paasche) mentioned above, so as to take into account the influence of both periods, i.e., current as well as base. The formula for constructing the index is:

P01 = (L + P) / 2, where L = Laspeyres index and P = Paasche index

or, written out in full,

P01 = ½ (∑p1q0/∑p0q0 + ∑p1q1/∑p0q1) × 100

b.i. (iv) Fisher’s Ideal Index
The geometric mean of Laspeyres’ and Paasche’s price indices is called Fisher’s price index.
Fisher price index uses both current year and base year quantities as weight. This index corrects
the positive bias inherent in the Laspeyres index and the negative bias inherent in the Paasche
index. Fisher’s price index is also a weighted aggregative price index because it is an average
(G.M.) of two weighted aggregative indices. The computational formula for the Fisher ideal price index is:

P01 = √[(∑p1q0/∑p0q0) × (∑p1q1/∑p0q1)] × 100, or P01 = √(L × P)

Fisher’s index is known as ‘ideal’ because (1) it is based on the geometric mean, which
is considered to be the best average for constructing index numbers. (2) It takes into account
both current as well as base year prices and quantities (3) It satisfies both time reversal as well
as the factor reversal tests (which we will study soon) and (4) it is free from bias.
It is not, however, a practical index to compute because it is excessively laborious.
The data, particularly for the Paasche segment of the index, are not readily available.
b.i. (v) Marshall-Edgeworth Method
If the weights are taken as the arithmetic mean of base and current year quantities, then the weighted aggregative index is called the Marshall-Edgeworth index. Like Fisher’s index, the Marshall-Edgeworth index also requires too much labour in the selection of commodities. In some cases the usage of this index is not suitable, for example comparing the price level of a large
country to a small country. The Marshall-Edgeworth index can be calculated by using the formula given below:

P01 = [∑p1(q0 + q1) / ∑p0(q0 + q1)] × 100
It is a simple, readily constructed measure, giving a very close approximation to the results obtained by the ideal formula.
The Marshall-Edgeworth formula uses the arithmetic mean of the quantities purchased in the
base and current periods as weights. Like the Fisher 'Ideal' index it is impracticable to use as a
timely indicator of price change because it requires the use of quantities purchased in the current
period. In practice, the Marshall-Edgeworth index and the Fisher Ideal index give similar
results.
b.i. (vi) Kelley’s Method
According to Truman L. Kelley, the formula for constructing index numbers is:

P01 = (∑p1q / ∑p0q) × 100

where q refers to the quantities of some period, not necessarily the base year or the current year.
Example 1
From the following data calculate price index numbers for 2013 with 2000 as base year by using
(i) Laspeyres’ Method (ii) Paasche’s Method (iii) Dorbish and Bowley’s Method (iv) Fisher’s Ideal Index and (v) Marshall-Edgeworth Method.

Commodity   Price (2000)   Quantity (2000)   Price (2013)   Quantity (2013)
A           20             8                 40             6
B           50             10                60             5
C           40             15                50             15
D           20             20                20             25
Solution
Let us first compute the necessary values.
(i) Laspeyres’ Method: P01 = (∑p1q0 / ∑p0q0) × 100
Commodity   p0 (2000)   q0 (2000)   p1 (2013)   q1 (2013)   p1q0   p0q0
A 20 8 40 6 320 160
B 50 10 60 5 600 500
C 40 15 50 15 750 600
D 20 20 20 25 400 400
Total                                                       2070   1660
P01 = (2070/1660) × 100 = 124.70
(ii) Paasche’s Method
P01 = (∑p1q1 / ∑p0q1) × 100

Commodity   p0 (2000)   q0 (2000)   p1 (2013)   q1 (2013)   p1q1   p0q1
A 20 8 40 6 240 120
B 50 10 60 5 300 250
C 40 15 50 15 750 600
D 20 20 20 25 500 500
Total                                                       1790   1470
P01 = (1790/1470) × 100 = 121.77

(iii) Dorbish and Bowley’s Method: P01 = (L + P)/2 = (124.70 + 121.77)/2 = 246.47/2 = 123.23

(iv) Fisher’s Ideal Index: P01 = √(L × P) = √(124.70 × 121.77) = √15184.72 = 123.23

(v) Marshall-Edgeworth Method: P01 = [(∑p1q0 + ∑p1q1) / (∑p0q0 + ∑p0q1)] × 100
Commodity   p0   q0   p1   q1   p1q1   p0q1   p1q0   p0q0
A 20 8 40 6 240 120 320 160
B 50 10 60 5 300 250 600 500
C 40 15 50 15 750 600 750 600
D 20 20 20 25 500 500 400 400
Total                           1790   1470   2070   1660
P01 = [(2070 + 1790) / (1660 + 1470)] × 100 = (3860/3130) × 100 = 123.32
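All five weighted index numbers computed in Example 1 can be verified with a short Python sketch:

```python
from math import sqrt

# Data from Example 1: prices and quantities in 2000 (base) and 2013 (current).
p0 = [20, 50, 40, 20]; q0 = [8, 10, 15, 20]
p1 = [40, 60, 50, 20]; q1 = [6, 5, 15, 25]

def s(a, b):
    """Sum of products, e.g. s(p1, q0) = sum of p1*q0 over all commodities."""
    return sum(x * y for x, y in zip(a, b))

laspeyres = s(p1, q0) / s(p0, q0) * 100    # base-year quantity weights
paasche   = s(p1, q1) / s(p0, q1) * 100    # current-year quantity weights
bowley    = (laspeyres + paasche) / 2      # arithmetic mean of L and P
fisher    = sqrt(laspeyres * paasche)      # geometric mean of L and P
marshall  = (s(p1, q0) + s(p1, q1)) / (s(p0, q0) + s(p0, q1)) * 100

print(round(laspeyres, 2), round(paasche, 2))                  # 124.7 121.77
print(round(bowley, 2), round(fisher, 2), round(marshall, 2))  # 123.23 123.23 123.32
```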
Example 2
Compute index number from the following data
Material   Unit       Quantity required   Price 2000   Price 2010
Cement     100 lb     500 lb              5.0          8.0
Timber     c.ft.      2000 c.ft.          9.5          14.2
Steel      cwt.       50 cwt.             34.0         42.0
Bricks     per '000   20000               12.0         24.0
Solution
Since the quantities (weights) required of different materials are fixed for both base year and
current year, we will use Kelley’s formula.
For materials we have to do certain conversions. For example, for cement unit is in 100 lbs, and
the quantity required is 500 lbs. Hence, the quantity consumed per unit for cement is 500/100 =
5. Similarly, the quantity consumed per unit for bricks is 20000/1000 = 20.
By Kelley’s Method, P01 = (∑p1q / ∑p0q) × 100. Let us make the necessary computations.
Material   Unit       Quantity q   p0 (2000)   p1 (2010)   p0q     p1q
Cement     100 lb     5            5.0         8.0         25      40
Timber     c.ft.      2000         9.5         14.2        19000   28400
Steel      cwt.       50           34.0        42.0        1700    2100
Bricks     per '000   20           12.0        24.0        240     480
Total                                                      20965   31020
Substituting,

P01 = (∑p1q / ∑p0q) × 100 = (31020/20965) × 100 = 1.4796 × 100 = 147.96

B. (II) WEIGHTED AVERAGE OF RELATIVES
I. Weighted Average of Price Relatives MethodIn this method, appropriate weights are assigned to the commodities according to the
relative importance of those commodities in the group. Thus the index for the whole group is
obtained on taking the weighted average of the price relatives. To find the average, Arithmetic
Mean or Geometric Mean can be used.
When the AM is used, the index is:

P01 = ∑PV / ∑V

where P = price relative = (p1/p0) × 100, and V = value weight, i.e. V = p0q0.
Example:
From the following data, compute the price index by applying the weighted average of price relatives method using the arithmetic mean.

Commodity     Sugar    Flour    Milk
Price p0      3.0      1.5      1.0
Quantity q0   20 Kg    40 Kg    10 Lit.
Price p1      4.0      1.6      1.5

By using the Arithmetic Mean:

Commodity   p0    q0        p1    V = p0q0   P = (p1/p0)×100   PV
Sugar       3.0   20 Kg     4.0   60         133.33            8000
Flour       1.5   40 Kg     1.6   60         106.67            6400
Milk        1.0   10 Lit.   1.5   10         150.00            1500
Total                             ∑V = 130                     ∑PV = 15900
P01 = ∑PV / ∑V = 15900/130 = 122.31

Instead of the arithmetic mean, we can use the geometric mean.
When the GM is used, the index is:

P01 = antilog(∑V log P / ∑V)

where P = (p1/p0) × 100 and V = value weight (p0q0).

The above example can be reworked using the GM as follows.
By using the Geometric Mean:

Commodity   p0    q0        p1    V = p0q0   P        log P    V log P
Sugar       3.0   20 Kg     4.0   60         133.33   2.1249   127.494
Flour       1.5   40 Kg     1.6   60         106.67   2.0282   121.692
Milk        1.0   10 Lit.   1.5   10         150.00   2.1761   21.761
Total                             ∑V = 130            ∑V log P = 270.947

P01 = antilog(∑V log P / ∑V) = antilog(270.947/130) = antilog(2.0842) = 121.4
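Both averages can be checked with a short Python sketch; the GM result computed in full precision differs only in the last decimal from the value obtained with four-figure log tables.

```python
from math import log10

# Weighted average of price relatives (data from the example above).
p0 = [3.0, 1.5, 1.0]   # base-year prices: sugar, flour, milk
q0 = [20, 40, 10]      # base-year quantities
p1 = [4.0, 1.6, 1.5]   # current-year prices

V = [a * b for a, b in zip(p0, q0)]         # value weights V = p0*q0
P = [b / a * 100 for a, b in zip(p0, p1)]   # price relatives (p1/p0)*100

# Arithmetic mean of relatives, weighted by V.
index_am = sum(v * p for v, p in zip(V, P)) / sum(V)
# Geometric mean: antilog of the weighted mean of log P.
index_gm = 10 ** (sum(v * log10(p) for v, p in zip(V, P)) / sum(V))

print(round(index_am, 2))   # 122.31
print(round(index_gm, 1))   # 121.4
```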
Merits of the Weighted Average of Relatives Indices
- When different index numbers are constructed by the average of relatives method, all of which have the same base, they can be combined to form a new index.
- When an index is computed by selecting one item from each of the many sub-groups of items, the values of each sub-group may be used as weights. Then only the method of weighted average of relatives is appropriate.
- When a new commodity is introduced to replace the one formerly used, the relative for the new item may be spliced to the relative for the old one, using the former value weights.
- The price or quantity relatives for each single item in the aggregate are, in effect, themselves a simple index that often yields valuable information for analysis.
TESTS OF INDEX NUMBERS
The following are the most important tests through which one can check the consistency of index numbers.
1. The Time Reversal Test
2. The Factor Reversal Test
1. The Time Reversal Test

P01 × P10 = 1

where P01 is the price index for year '1' with year '0' as base, and P10 is the price index for year '0' with year '1' as base.

This test is not satisfied by either Laspeyres' or Paasche's index numbers.

Laspeyres' Method: P01 × P10 = (Σp1q0 / Σp0q0) × (Σp0q1 / Σp1q1) ≠ 1

Paasche's Method: P01 × P10 = (Σp1q1 / Σp0q1) × (Σp0q0 / Σp1q0) ≠ 1

Fisher's formula satisfies this test:

Fisher's Method: P01 × P10 = √[(Σp1q0 / Σp0q0) × (Σp1q1 / Σp0q1)] × √[(Σp0q1 / Σp1q1) × (Σp0q0 / Σp1q0)] = 1

2. The Factor Reversal Test

P01 × Q01 = Σp1q1 / Σp0q0

where P01 stands for the price index for year '1' with base year '0' and Q01 stands for the quantity index for year '1' with base year '0'.

This test is not satisfied by either Laspeyres' or Paasche's index numbers.

Laspeyres' formula: P01 × Q01 = (Σp1q0 / Σp0q0) × (Σq1p0 / Σq0p0) ≠ Σp1q1 / Σp0q0

Paasche's formula: P01 × Q01 = (Σp1q1 / Σp0q1) × (Σq1p1 / Σq0p1) ≠ Σp1q1 / Σp0q0
Fisher’s formula satisfies this test
Fisher’s Formula = × =∑∑ × ∑∑ × ∑∑ × ∑∑
=∑∑ =
∑∑Fisher’s formula satisfies both time reversal and factor reversal test. This is why the
Fisher’s formula is often called Fisher’s Ideal Index Number.ExampleFor the following data prove that the Fisher’s Ideal Index satisfies both the Time Reversal Testand the Factor Reversal Test.
Commodity   Base Year Price   Base Year Quantity   Current Year Price   Current Year Quantity
A           6                 50                   10                   56
B           2                 100                  2                    120
C           4                 60                   6                    60
D           10                30                   12                   24
Solution

Commodity   p0   q0    p1   q1    p0q0          p0q1          p1q0          p1q1
A           6    50    10   56    300           336           500           560
B           2    100   2    120   200           240           200           240
C           4    60    6    60    240           240           360           360
D           10   30    12   24    300           240           360           288
Total                             Σp0q0 = 1040  Σp0q1 = 1056  Σp1q0 = 1420  Σp1q1 = 1448

Fisher's price index number is given by
P01 = √[(Σp1q0 / Σp0q0) × (Σp1q1 / Σp0q1)] × 100

Substituting the values we get
P01 = √[(1420/1040) × (1448/1056)] = 1.3683 (without the factor 100)

Time reversal test: P01 × P10 = 1
We have P01 = 1.3683 (without factor 100)
and P10 = √[(Σp0q1 / Σp1q1) × (Σp0q0 / Σp1q0)]
Substituting, P10 = √[(1056/1448) × (1040/1420)] = 0.7308
P01 × P10 = 1.3683 × 0.7308 = 0.9999 ≈ 1
Hence, Fisher's index satisfies the Time Reversal Test.
Factor Reversal Test: P01 × Q01 = Σp1q1 / Σp0q0

We have (without factor 100)
Q01 = √[(Σq1p0 / Σq0p0) × (Σq1p1 / Σq0p1)] = √[(1056/1040) × (1448/1420)] = 1.0175

P01 × Q01 = 1.3683 × 1.0175 = 1.3923 = 1448/1040 = Σp1q1 / Σp0q0

Hence, Fisher's index satisfies the Factor Reversal Test.

BASE SHIFTING, SPLICING AND DEFLATING THE INDEX NUMBERS
(a) Base shifting
Most index numbers are subject to revision from time to time for different reasons. In most cases it becomes necessary to change the base year because numerous changes take place with the passage of time. For example, changes may happen due to the disappearance of old items, the inclusion of new ones, changes in the weights of commodities, or changes in conditions, habits and standards of life.

One of the most frequent operations in the use of index numbers is changing the base of an index from one period to another without recompiling the entire series. Such a change is referred to as 'base shifting'. The reasons for shifting the base are:
- The previous base has become too old and is almost useless for purposes of comparison.
- The comparison is to be made with another series of index numbers having a different base.
The following formula is used in this method of base shifting:

Index number based on new base year = (current year's old index number / new base year's old index number) × 100
Shifting from one fixed base to another fixed base
To convert a fixed base series to a new fixed base, each old index is divided by the old index of the new base year sought and multiplied by 100. It can be illustrated with the help of the following problem.
Example:
The following series is given with base year 2000. Convert it into a new series with base year 2003.

Year:    2000   2001   2002   2003   2004   2005
Index:   100    130    145    155    205    255
Year   Fixed Base Index (Base = 2000)   Fixed Base Index (Base = 2003)
2000   100                              100/155 × 100 = 64.52
2001   130                              130/155 × 100 = 83.87
2002   145                              145/155 × 100 = 93.55
2003   155                              155/155 × 100 = 100.00
2004   205                              205/155 × 100 = 132.26
2005   255                              255/155 × 100 = 164.52
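The shift can be reproduced in a few lines (a Python sketch using the series above):

```python
# Fixed-base series (2000 = 100) shifted to base 2003:
# new index = old index / old index of the new base year * 100
old = {2000: 100, 2001: 130, 2002: 145, 2003: 155, 2004: 205, 2005: 255}

shifted = {year: round(idx / old[2003] * 100, 2) for year, idx in old.items()}
print(shifted)
# {2000: 64.52, 2001: 83.87, 2002: 93.55, 2003: 100.0, 2004: 132.26, 2005: 164.52}
```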
Shifting from chain base to fixed base
One of the disadvantages of chain base method is that the comparison between distant periods is
not immediately evident. Therefore it becomes necessary to convert chain base indices into fixed
base indices. This can be illustrated with the help of following example.
Example:
Convert the following chain indexes into the new series with base year 2005.
Year:    2005   2006   2007   2008   2009   2010
Index:   100    105    110    107    112    107

Year   Chain Base Index   Fixed Base Index (2005 = 100)
2005   100                100
2006   105                100 × 105/100 = 105
2007   110                105 × 110/100 = 115.5
2008   107                115.5 × 107/100 = 123.59
2009   112                123.59 × 112/100 = 138.42
2010   107                138.42 × 107/100 = 148.10
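The conversion chains each multiplier forward; a Python sketch:

```python
# Chain-base indices converted to a fixed base (2005 = 100):
# fixed[t] = fixed[t-1] * chain[t] / 100
chain = [100, 105, 110, 107, 112, 107]   # years 2005..2010

fixed = [float(chain[0])]
for c in chain[1:]:
    fixed.append(fixed[-1] * c / 100)

print([round(f, 2) for f in fixed])
```

The printed values match the table above (the hand computation rounds at each step, so the last decimals may differ slightly).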
Shifting from Fixed to chain base
As discussed earlier, conditions change over a period due to revised weightings system, inclusion
of new items and disappearance of old ones etc. Due to all these factors, sometimes it is
necessary to convert the indices from fixed base to chain base. This can be explained with the
help of the following problem.

Problem: Convert the following indexes with base 2005 to chain indexes.
Year:          2005   2006   2007   2008   2009   2010
Fixed Index:   100    105    115    130    150    175

Year   Fixed Index (Base = 2005)   Chain Base Index = (current year's index / previous year's index) × 100
2005   100                         (100/100) × 100 = 100
2006   105                         (105/100) × 100 = 105
2007   115                         (115/105) × 100 = 109.52
2008   130                         (130/115) × 100 = 113.04
2009   150                         (150/130) × 100 = 115.38
2010   175                         (175/150) × 100 = 116.67
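A Python sketch of the same conversion:

```python
# Fixed-base indices (2005 = 100) converted to chain-base indices:
# chain[t] = fixed[t] / fixed[t-1] * 100
fixed = [100, 105, 115, 130, 150, 175]   # years 2005..2010

chain = [100.0] + [round(cur / prev * 100, 2) for prev, cur in zip(fixed, fixed[1:])]
print(chain)  # [100.0, 105.0, 109.52, 113.04, 115.38, 116.67]
```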
Splicing of two series of index numbers
Splicing of index numbers mean combining two or more series of overlapping index numbers to
obtain a single index number on a common base. This is done by the same technique as used in
base shifting.
It is of two types:-
(i) Splicing of new index numbers to old index numbers
(ii) Splicing of old index numbers to new index number.
Splicing of Index numbers can be done only if the index numbers are constructed with the same
items, and have an overlapping year. Suppose we have an index number with a base year of 2001
and another index number (using the same item as the first one) with a base of 2011. Suppose
both index numbers are continuing. Then we can splice the first series of index number to the
second series and have a common index with base 2011. We can also splice index number series
two with series one and have a common index number with base 2001. Splicing is generally
done when an old index number with an old base is being discontinued and a new index with a
new base is being started.
The following formula is used in this method of splicing:

Index number after splicing = (index number to be spliced × old index number of the existing base) / 100
Example
Index Number A given below was started in 1981 and discontinued in 2001, when another index B was started which continues up to date. From the data given in the table below, splice index number B to index number A so that a continuous series of index numbers from 1981 up to date is available.
Splicing of Index B to Index A
Here we multiply index B by a common factor, which is the ratio of index A to index B in
the overlapping year 2001.
Year   Index A   Index B   Index B Spliced to A
1981   100       -         -
...    ...       -         -
2000   180       -         -
2001   200       100       (200/100) × 100 = 200
2002   -         120       (200/100) × 120 = 240
2003   -         140       (200/100) × 140 = 280
...    -         ...       ...
2013   -         250       (200/100) × 250 = 500
Thus we have a continuous series of index numbers with base 1981 which continues up to date.
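The splicing arithmetic above can be sketched as:

```python
# Splice index B (base 2001 = 100) onto index A (base 1981 = 100) using the
# overlap year 2001, where index A = 200 and index B = 100.
index_a_overlap = 200
index_b = {2001: 100, 2002: 120, 2003: 140, 2013: 250}

spliced = {year: b * index_a_overlap / 100 for year, b in index_b.items()}
print(spliced)  # {2001: 200.0, 2002: 240.0, 2003: 280.0, 2013: 500.0}
```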
DEFLATING THE INDEX NUMBERS
By deflating we mean making allowances for the effect of changing price levels. A rise in price
level means a reduction in the purchasing power of money. To take the case of a single
commodity, suppose the price of wheat rises from ₹ 500 per quintal in 1999 to ₹ 1,000 per
quintal in 2009. It means that in 2009 one can buy only half as much wheat if he spends the same
amount which he was spending on wheat in 1999. Thus the value (or purchasing power) of a
rupee is simply the reciprocal of an appropriate price index written as a proportion. If prices
increase by 60 per cent, the price index is 1.60 and what a rupee will buy is only 1/1.60 or 5/8 of
what it used to buy. In other words the purchasing power of rupee is 5/8 of what it was.
Similarly, if prices increase by 25 per cent the price index is 1.25 (125 per cent). And the
purchasing power of the rupee is 1/1.25 = 0.80.
Thus the purchasing power of money = 1 / price index
In times of rising prices, money wages should be deflated by the price index to get the
figure of real wages. The real wages alone tell whether a wage earner is in a better or worse
position.
For calculating real wage, the money wages or income is divided by the corresponding
price index and multiplied by 100.
i.e. Real wages = (Money wages / Price index) × 100

Thus Real Wage Index = (Real wage of current year / Real wage of base year) × 100
Example
The annual wages of workers (in ₹) are given below along with Consumer Price Indices.
Find (i) the real wages and (ii) the real wage indices.
Year:                     2010   2011   2012   2013
Wages:                    1800   2200   3400   3600
Consumer Price Indices:   100    170    300    320
Year   Wage   Price Index   Real Wage                    Real Wage Index (2010 = 100)
2010   1800   100           (1800/100) × 100 = 1800      100
2011   2200   170           (2200/170) × 100 = 1294.1    (1294.1/1800) × 100 = 71.90
2012   3400   300           (3400/300) × 100 = 1133.3    (1133.3/1800) × 100 = 62.96
2013   3600   320           (3600/320) × 100 = 1125      (1125/1800) × 100 = 62.50
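The deflation steps can be reproduced with a short Python sketch:

```python
# Real wage = money wage / price index * 100; real wage index rebased to 2010.
wages = [1800, 2200, 3400, 3600]   # years 2010..2013
cpi   = [100, 170, 300, 320]

real = [w / p * 100 for w, p in zip(wages, cpi)]
real_index = [r / real[0] * 100 for r in real]

print([round(r, 1) for r in real])        # [1800.0, 1294.1, 1133.3, 1125.0]
print([round(i, 2) for i in real_index])  # [100.0, 71.9, 62.96, 62.5]
```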
SPECIAL PURPOSE INDICES
Price Index: The price index is an indicator of the average price movement over time of a fixed
basket of goods and services. The constitution of the basket of goods and services is done
taking into consideration whether the changes are to be measured in retail, wholesale, or
producer prices etc. The basket will also vary for economy-wide, regional, or sector specific
series. At present, separate series of index numbers are compiled to capture the price movements
at retail and wholesale level in India. There are four main series of price indices compiled at the
national level. Out of these four, Consumer Price Index for Industrial Workers (CPI-IW),
Consumer Price Index for Agricultural Labourers / Rural Labourers (CPI -AL/RL), Consumer
Price Index for Urban Non-Manual Employees (CPI-UNME) are consumer price indices. The
Wholesale Price Index (WPI) number is a weekly measure of wholesale price movement for the
economy. Some states also compile variants of CPI and WPI indices at the state level.
1. Wholesale Price Index
The wholesale price index numbers indicate the general condition of the national economy. They
measure the change in prices of products produced by different sectors of an economy. The
wholesale prices of major items manufactured or produced are included in the construction of
these index numbers.
Wholesale Price Index (WPI) represents the price of goods at a wholesale stage i.e. goods that
are sold in bulk and traded between organizations instead of consumers. WPI is used as a
measure of inflation in some economies.
Uses
In a dynamic world, prices do not remain constant. Inflation rate calculated on the basis of the
movement of the Wholesale Price Index (WPI) is an important measure to monitor the dynamic
movement of prices. As WPI captures price movements in a most comprehensive way, it is
widely used by Government, banks, industry and business circles. Important monetary and fiscal
policy changes are often linked to WPI movements. Similarly, the movement of WPI serves as
an important determinant, in formulation of trade, fiscal and other economic policies by the
Government of India. The WPI indices are also used for the purpose of escalation clauses in the
supply of raw materials, machinery and construction work.
WPI is used as an important measure of inflation in India. Fiscal and monetary policy changes
are greatly influenced by changes in WPI.
WPI offers an easy and convenient method to calculate inflation: the percentage increase
in WPI over a year gives the rate of inflation for that year.
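As a sketch, with hypothetical WPI values (not official figures):

```python
# Inflation rate = percentage change in WPI over the year (illustrative numbers)
wpi_begin, wpi_end = 180.0, 192.6   # hypothetical index values
inflation_rate = (wpi_end - wpi_begin) / wpi_begin * 100
print(round(inflation_rate, 1))  # 7.0
```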
WPI computation in India
WPI is the most widely used inflation indicator in India. This is published by the Office of
Economic Adviser, Ministry of Commerce and Industry. WPI captures price movements in a
most comprehensive way. It is widely used by Government, banks, industry and business
circles. Important monetary and fiscal policy changes are linked to WPI movements. It is in use
since 1939 and is being published since 1947 regularly. We are well aware that with the
changing times, the economies too undergo structural changes. Thus, there is a need for
revisiting such indices from time to time and new set of articles / commodities are required to be
included based on current economic scenarios. Thus, since 1939, the base year of WPI has been
revised on number of occasions. The current series of Wholesale Price Index has 2004-05 as
the base year.
Wholesale price index comprises as far as possible all transactions at first point of bulk sale in
the domestic market. Provisional monthly WPI for All Commodities is released on 14th of every
month (next working day, if 14th is holiday). Detailed item level WPI is put on official website
(http://www.eaindustry.nic.in/) for public use. The provisional index is made final after a period
of eight weeks/ two months.
The Office of the Economic Adviser to the Government of India undertook to publish for the
first time, an index number of wholesale prices, with base week ended August 19, 1939 = 100,
from the week commencing January 10, 1942. The index was calculated as the geometric mean
of the price relatives of 23 commodities classified into four groups: (1) food & tobacco; (2)
agricultural commodities; (3) raw materials; and (4) manufactured articles. Each item was
assigned equal weight and for each item, there was a single price quotation. That was a modest
beginning to what became an important weekly activity for the monitoring and management of
the Indian economy and a benchmark for business transactions.
Steps in the compilation of WPI in India
Like most of the price indices, WPI is based on Laspeyres formula for reason of practical
convenience. Therefore, once the concept of wholesale price is defined and the base year is
finalized, the exercise of index compilation involves finalization of the item basket, allocation of
weights (W) at item, groups/ sub-groups level. Simultaneously, the exercise to collect base prices
(Po), current prices (P1), finalization of item specifications, price data sources, and data
collection machinery is undertaken. These steps are
1. Definition of the Concept of Wholesale Prices:
Wholesale price has divergent connotations adopted by the different departments using them.
There is no uniform definition for agricultural and non- agricultural commodities as all the
wholesale prices cannot be collected from the established markets. So proper definition has to be
made by the competent authority.
For example in the case of agricultural commodities, in practice, there are three types of
wholesale markets viz., primary, secondary and terminal in the agricultural sector. The price
movements and price levels in all three vary. Price movement in the terminal market may tend to
converge toward the retail prices. Option to collect the wholesale prices for these three different
stages of wholesale transactions exists for agricultural commodities, though the primary market is
preferred. So, the Ministry of Agriculture has defined wholesale price as the rate at which
relatively large transaction of purchase, usually for further sale, is effected.
Similarly, for non-agricultural commodities, which are predominantly manufacturing items, the
problem arises, as there are no established sources in markets. This is true of mining and fuel
items also. The issue of ex-factory vis-à-vis wholesale prices for non-agriculture items has been
discussed by the successive Working Groups set up for the revision of WPI and all have reached
the conclusion that in practice, it is not feasible to collect wholesale prices for most of the
manufacturing items. It has also been observed that the margin of wholesalers in case of non-
agricultural commodities remains unchanged for over a long period of time. As a result, it is felt
that the trends in the index compiled on the basis of ex-factory prices would not be much
different from the index if compiled on the basis of wholesale prices if it were feasible to get
these prices. The last Working Group has recommended collecting wholesale prices from the
markets as far as possible, because the economy is moving towards globalization and open
trade with inputs increasing in the commodities set.
2) Choice of Base Year
The second step is choice of base year. The well-known criteria for the selection of base year
are (i) a normal year i.e. a year in which there are no abnormalities in the level of production,
trade and in the price level and price variations, (ii) a year for which reliable production, price
and other required data are available and (iii) a year as recent as possible and comparable with
other data series at national and state level. The National Statistical Commission has
recommended that the base year should be revised every five years and not later than ten years.
3. Selection of Items, Varieties/ Grades, Markets:
To ensure that the items in the index basket are as representative as possible, efforts are
made to include all the important items transacted in the economy during the base year. The
importance of an item in the free market will depend on its traded value during the base year. At
wholesale level, bulk transactions of goods and services need to be captured. As the services are
not covered so far, the WPI basket mainly consists of items from goods sector. In the absence of
single source of data on traded value, the selection procedures followed for agricultural
commodities and non-agricultural commodities have also been different.
For example, in the case of agricultural commodities: as there is little scope for the emergence of
new commodities in agriculture, the selection of new items in the basket is done on the basis
of increased importance in wholesale markets. Varieties, which have declined in importance,
need to be dropped in the revised series. Final inclusion or exclusion of an item in the basket is
based on the process of consultation with the various departments. The exercise of adding
/deleting commodities, specifications and markets is completed once the consultation process is
over. In the existing WPI series, items, their specifications and markets have been finalized in
consultation with the Directorate of E&S (M/O Agriculture), National Horticulture Board,
Spices Board, Tea board, Coffee Board and Rubber Board, Silk Board, Directorate Of Tobacco,
Cotton Corporation of India etc.
4. Derivation of Weighting Diagram
Weights used in the WPI are value weights, not quantity weights, as it is difficult to assign quantity
weights. Distributing the appropriate weight to each item is the most important exercise for a
reliable index. Unlike consumer price indices, where weights are derived on the basis of results
of Expenditure Surveys, several sources of data are used for derivation of weights for WPI.
5) Collection of Prices
The pricing methodology used in WPI is specification pricing. Under this, in consultation with the
identified source agencies, precise specifications of all items in the basket are defined for repeat
pricing every week. All characteristics like make, model, features along with the unit of sale,
type of packaging, if applicable, etc are recorded and printed in the price collection schedule. At
the time of scrutiny of price data all these are kept in mind. This pricing to constant quality
technique is the cornerstone of Laspeyres formula. In case of changes in quality and
specifications, due adjustments are made as per the standard procedures.
The collection of base prices is done concurrently while the work on finalisation of index basket
is on. Therefore, price collection is normally done for a larger number of items pending
finalisation. Once the basket is ready, current prices are collected only as per the final basket
from the designated sources. Weekly prices need to be collected for pre-determined day of the
week. For the current series prices are quoted on the basis of the prevailing prices of every
Friday. Agricultural wholesale prices are for bulk transactions and include transport cost. Non-
agricultural prices are ex-mine or ex-factory inclusive of excise duty but exclusive of rebate if
any.
6) Treatment of prices collected from open market & administered prices:
There are some items which constitute part of index baskets but the prices for these items are
either totally administered by the Government or are under dual pricing policy. The issue of
using administered prices for index compilation is resolved by taking into account appropriate
ratio between the levy and non-levy portions. Where these ratios are not available, the issues can
be resolved through taking the appropriate number of price quotations of the administered prices
and the open market prices after periodic review.
Due to variation in quality and different price movements of the commodities belonging to
unorganized sector, separate quotations from organized and unorganized units have to be taken
and merged based on the turnover value of both the sectors at item level. For pricing from
unorganized sector, adequate number of price quotations has to be drawn out of the list of units
by criteria of share of production as far as possible.
7) Classification structure:
The Working Groups over the period have been suggesting to bring the classification of various
items under different groups and sub-groups as per the latest revised National Industrial
Classification (NIC) which in turn is comparable to International Standard Industrial
Classification (ISIC). The classification based on NIC renders the WPI data amenable to
comparison with the Index of Industrial Production (IIP) and National Income data.
Major Group/Groups: I. Primary Articles II. Fuel, Power, Light & Lubricants III. Manufactured
Products
8) Methodology of Index Calculation
Actual index compilation is done in stages.
In the first stage, once the price data are scrutinized, price relative for each price quote is
calculated. Price relative is calculated as the ratio of the current price to the base price multiplied
by 100 i.e. (P1/Po)×100.
In the next stage, commodity/item level index is arrived at as the simple arithmetic average of
the price relatives of all the varieties (each quote) included under that commodity. An average of
price ratio/ relative is used under implicit assumption that each price quotation collected for an
item/commodity index compilation has equal importance, i.e. the shares of production value are
equal.
Next, the indices for the sub groups/groups/ major groups are compiled and the aggregation
method is based on Laspeyres formula as below:
I = Σ(Ii × Wi) / ΣWi

Where,
I = index number of wholesale prices of a sub-group / group / major group / all commodities,
Σ = represents the summation operation,
Ii = index of the i-th item / sub-group / group / major group,
Wi = weight assigned to the i-th item / sub-group / group / major group.
The weights are value weights. Aggregation is first done at sub-group and group level. All
commodities index is compiled by aggregating Major group indices.
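The aggregation formula can be sketched in Python; the group indices and weights below are illustrative, not the official WPI basket:

```python
# I = sum(Ii * Wi) / sum(Wi), the Laspeyres-style weighted aggregation used
# at every level (item -> sub-group -> group -> all commodities).
groups = {
    "Primary Articles":      (210.4, 20.1),  # (index Ii, weight Wi) -- illustrative
    "Fuel & Power":          (195.2, 14.9),
    "Manufactured Products": (160.8, 65.0),
}

numerator = sum(i * w for i, w in groups.values())
total_weight = sum(w for _, w in groups.values())
all_commodities_index = numerator / total_weight
print(round(all_commodities_index, 2))  # 175.9
```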
9) Handling of the Seasonal Commodities :
There are a number of agricultural items, especially some fruits and vegetables, which are of a
seasonal nature. When a particular seasonal item disappears from the market and its prices are
not available because of its being out of season, the weight of such an item is imputed amongst the
other items on a pro rata basis within the sub-group of vegetables or fruits. The underlying
assumption is that if the items remained available, the prices of these items would have moved in
the same proportion as the prices of the other items in the sub-group, which did remain available.
This is equivalent to giving a greater weight to the remaining items. The seasonality problem can
be sorted by adopting other methods like, i) prices of unavailable items can also be extrapolated
forward from the period of availability, or ii) if such a seasonal item has insignificant weight, it can
be dropped permanently from the basket.
2. Consumer Price Index Number
The Consumer Price Index (CPI) is a measure of the average change over time in the prices of
consumer items -goods and services that people buy for day-to-day living. The CPI is a complex
construct that combines economic theory with sampling and other statistical techniques and uses
data from several surveys to produce a timely and precise measure of average price change for
the consumption sector.
The Consumer Price Index is a comprehensive measure used for the estimation of price changes in a
basket of goods and services representative of consumption expenditure. The calculation
involved in the estimation of CPI is quite rigorous. Various categories and
sub-categories have been made for classifying consumption items and on the basis of consumer
categories like urban or rural. Based on these indices and sub indices obtained, the final overall
index of price is calculated mostly by national statistical agencies. It is one of the most important
statistics for an economy and is generally based on the weighted average of the prices of
commodities. It gives an idea of the cost of living.
Inflation is measured using CPI. The percentage change in this index over a period of time gives
the amount of inflation over that specific period, i.e. the increase in prices of a representative
basket of goods consumed.
The CPI frequently is called a cost-of-living index, but it differs in important ways from a
complete cost-of-living measure. A cost-of-living index would measure changes over time in the
amount that consumers need to spend to reach a certain utility level or standard of living. Both
the CPI and a cost-of-living index would reflect changes in the prices of goods and services, such
as food and clothing that are directly purchased in the marketplace; but a complete cost-of-living
index would go beyond this role to also take into account changes in other governmental or
environmental factors that affect consumers' well-being. It is very difficult to determine the
proper treatment of public goods, such as safety and education, and other broad concerns, such as
health, water quality, and crime, that would constitute a complete cost-of-living framework.
How do we read or interpret an index?
An index is a tool that simplifies the measurement of movements in a numerical series. Most of
the specific CPI indexes have a 1982-84 reference base. That is, the agency computing the index
sets the average index level (representing the average price level)-for the 36-month period
covering the years 1982, 1983, and 1984-equal to 100. The agency then measures changes in
relation to that figure. An index of 110, for example, means there has been a 10-percent increase
in price since the reference period; similarly, an index of 90 means a 10-percent decrease.
Movements of the index from one date to another can be expressed as changes in index points
(simply, the difference between index levels), but it is more useful to express the movements as
percent changes. This is because index points are affected by the level of the index in relation to
its reference period, while percent changes are not.
                         Item A                     Item B                     Item C
Year I                   112.500                    225.000                    110.000
Year II                  121.500                    243.000                    128.000
Change in index points   9.000                      18.000                     18.000
Percent change           9.0/112.500 × 100 = 8.0    18.0/225.000 × 100 = 8.0   18.0/110.000 × 100 = 16.4
In the table above, Item A increased by half as many index points as Item B between Year I and
Year II. Yet, because of different starting indexes, both items had the same percent change; that
is, prices advanced at the same rate. By contrast, Items B and C show the same change in index
points, but the percent change is greater for Item C because of its lower starting index value.
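The distinction can be checked directly (a Python sketch using the table's figures):

```python
# Index-point change vs percent change for items A, B and C.
items = {"A": (112.5, 121.5), "B": (225.0, 243.0), "C": (110.0, 128.0)}

for name, (year1, year2) in items.items():
    points = year2 - year1
    percent = points / year1 * 100
    print(name, points, round(percent, 1))
# A 9.0 8.0
# B 18.0 8.0
# C 18.0 16.4
```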
Uses of cost of living index numbers:
1. Cost of living index numbers indicate whether the real wages are rising or falling. Inother words they are used for calculating the real wages and to determine the change inthe purchasing power of money.
Purchasing power of money = 1 / Cost of living index number

Real wages = (Money wages / Cost of living index number) × 100
2. Cost of living indices are used for the regulation of D.A. (dearness allowance) or the grant of bonus to the workers so as to enable them to meet the increased cost of living.
3. Cost of living index numbers are used widely in wage negotiations.
4. These index numbers are also used for analyzing markets for particular kinds of goods.
Main steps or problems in construction of cost of living index numbers
Production of the CPI requires the skills of many professionals, including economists, statisticians, computer scientists, data collectors, and others.
The cost of living index numbers measure the changes in the level of prices of commodities which directly affect the cost of living of a specified group of persons at a specified place. The general index numbers fail to give an idea of the cost of living of different classes of people at different places.

Different classes of people consume different types of commodities, and people's consumption habits also vary from man to man, place to place and class to class, i.e. the richer class, middle class and poor class. For example, the cost of living of rickshaw pullers at Bhubaneswar is different from that of rickshaw pullers at Kolkata. The consumer price index helps us in determining the effect of a rise and fall in prices on different classes of consumers living in different areas.
The following are the main steps in constructing a cost of living index number.
1. Decision about the class of people for whom the index is meant
It is absolutely essential to decide clearly the class of people for whom the index is meant, i.e. whether it relates to industrial workers, teachers, officers, labourers, etc. Along with the class of people, it is also necessary to decide the geographical area covered by the index, such as a city, an industrial area or a particular locality in a city.
2. Conducting family budget enquiry
Once the scope of the index is clearly defined the next step is to conduct a sample
family budget enquiry i.e. we select a sample of families from the class of people for
whom the index is intended and scrutinize their budgets in detail. The enquiry should be
conducted during a normal period i.e. a period free from economic booms or depressions.
The purpose of the enquiry is to determine the amount an average family spends on
different items. The family budget enquiry gives information about the nature and quality
of the commodities consumed by the people. The commodities are being classified under
following heads
i) Food  ii) Clothing  iii) Fuel and Lighting  iv) House rent  v) Miscellaneous
3. Collecting retail prices of different commodities
The collection of retail prices is a very important and at the same time very
difficult task, because such prices may vary from place to place, shop to shop and person
to person. Price quotations should be obtained from the local markets, where the class of
people reside or from super bazaars or departmental stores from which they usually make
their purchases.
Method of Constructing the Index
The index may be constructed by applying any of the following methods:
1) Aggregate Expenditure Method or Aggregation Method
2) Family Budget Method or the Method of Weighted Relatives.
1. Aggregate Expenditure Method.
When this method is applied, the quantities of commodities consumed by the particular group in the base year are estimated and these figures are used as weights. Then the total expenditure on each commodity for each year is calculated.

Consumer Price Index = (∑p₁q₀ / ∑p₀q₀) × 100

Where
p₁ and p₀ stand for the prices of the current year and the base year.
q₁ and q₀ stand for the quantities of the current year and the base year.

Steps:

i) The price of each commodity in the current year is multiplied by the quantity of the base year, and the aggregate expenditure of the current year, ∑p₁q₀, is obtained.

ii) Similarly obtain ∑p₀q₀.

iii) The aggregate expenditure of the current year is divided by the aggregate expenditure of the base year and the quotient is multiplied by 100. Symbolically:

Consumer Price Index = (∑p₁q₀ / ∑p₀q₀) × 100
2. Family Budget Method
When this method is applied, the family budgets of a large number of families are carefully studied and the aggregate expenditure of the average family on various items is estimated. These values are used as weights.
Consumer Price Index = ∑PV / ∑V

Where
P = (p₁/p₀) × 100, the price relative for each item
V = p₀q₀, the value in the base year
Example

Construct the consumer price index number of 2013 on the basis of 2009 from the following data, using 1) the aggregate expenditure method and 2) the family budget method.

Commodity | Quantity in units in 2009 | Price per unit in 2009 (₹) | Price per unit in 2013 (₹)
A | 100 | 8 | 12
B | 25 | 6 | 7.50
C | 10 | 5 | 5.25
D | 20 | 48 | 52
E | 25 | 15 | 16.50
F | 30 | 9 | 27
Solution
(1) Aggregate expenditure method

Formula for the aggregate expenditure method: Consumer Price Index = (∑p₁q₀ / ∑p₀q₀) × 100

Commodity | p₀: price per unit in 2009 (₹) | p₁: price per unit in 2013 (₹) | q₀: quantity in units in 2009 | p₀q₀ | p₁q₀
A | 8 | 12 | 100 | 800 | 1200
B | 6 | 7.5 | 25 | 150 | 187.5
C | 5 | 5.25 | 10 | 50 | 52.5
D | 48 | 52 | 20 | 960 | 1040
E | 15 | 16.5 | 25 | 375 | 412.5
F | 9 | 27 | 30 | 270 | 810
Total | | | | ∑p₀q₀ = 2605 | ∑p₁q₀ = 3702.50
Consumer Price Index = (3702.50 / 2605) × 100 = 142.13

2. The family budget method

Consumer Price Index = ∑PV / ∑V

Where P = (p₁/p₀) × 100 for each item, and V = p₀q₀, the value in the base year.
Commodity | p₀: price per unit in 2009 (₹) | p₁: price per unit in 2013 (₹) | q₀: quantity in units in 2009 | P = (p₁/p₀) × 100 | V = p₀q₀ | PV
A | 8 | 12 | 100 | 150 | 800 | 120000
B | 6 | 7.5 | 25 | 125 | 150 | 18750
C | 5 | 5.25 | 10 | 105 | 50 | 5250
D | 48 | 52 | 20 | 108.33 | 960 | 104000
E | 15 | 16.5 | 25 | 110 | 375 | 41250
F | 9 | 27 | 30 | 300 | 270 | 81000
Total | | | | 898.33 | 2605 | 370250

Consumer Price Index = ∑PV / ∑V = 370250 / 2605 = 142.13

Note: It should be noted that the answers obtained by applying the aggregate expenditure method and the family budget method are the same.
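The two methods can be checked with a short computation. The following Python sketch (variable names are ours) reproduces the example's data for commodities A to F and computes the index both ways:

```python
# Consumer price index of 2013 with base year 2009, computed two ways.
# Data reproduce the six commodities A-F from the worked example.
data = {
    # commodity: (base-year price p0, current-year price p1, base-year quantity q0)
    "A": (8, 12, 100),
    "B": (6, 7.5, 25),
    "C": (5, 5.25, 10),
    "D": (48, 52, 20),
    "E": (15, 16.5, 25),
    "F": (9, 27, 30),
}

# Aggregate expenditure method: (sum p1*q0 / sum p0*q0) * 100
sum_p1q0 = sum(p1 * q0 for (p0, p1, q0) in data.values())
sum_p0q0 = sum(p0 * q0 for (p0, p1, q0) in data.values())
index_aggregate = sum_p1q0 / sum_p0q0 * 100

# Family budget method: sum(P*V) / sum(V), with price relatives
# P = (p1/p0)*100 and value weights V = p0*q0.
sum_PV = sum((p1 / p0 * 100) * (p0 * q0) for (p0, p1, q0) in data.values())
index_family = sum_PV / sum_p0q0

print(round(index_aggregate, 2))  # 142.13
print(round(index_family, 2))     # 142.13
```

The agreement is no coincidence: algebraically, ∑PV/∑V = ∑[(p₁/p₀) × 100 × p₀q₀] / ∑p₀q₀ = (∑p₁q₀/∑p₀q₀) × 100.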
Given below is an example of Consumer Price Index for Kerala
Possible errors in construction of cost of living index numbers:
Cost of living index numbers, now more popularly called consumer price index numbers, are not accurate due to various reasons.
1. Errors may occur in the construction because of inaccurate specification of groups for
whom the index is meant.
2. Faulty selection of representative commodities resulting from unscientific family budget enquiries.

3. Inadequate and unrepresentative price quotations and the use of inaccurate weights.

4. Frequent changes in the demand for, and prices of, the commodities.
5. The average family might not be always a representative one.
Wholesale price index numbers (Vs) consumer price index numbers:
1. The wholesale price index number measures the change in the price level in a country as a whole; for example, the Economic Adviser's index numbers of wholesale prices. The cost of living index number, on the other hand, measures the change in the cost of living of a particular class of people at a particular place. In this index number we take the retail prices of the commodities.

2. The wholesale price index number and the consumer price index number are generally different because there is a lag between the movement of wholesale prices and retail prices.

3. The retail prices required for the construction of the consumer price index number may increase much faster than the wholesale prices, i.e. there might be erratic changes in the consumer price index number, unlike the wholesale price index number.

4. The method of constructing index numbers is in general the same for wholesale prices and the cost of living, but the wholesale price index number is based on a different weighting system, and the selection of commodities is also different as compared to the cost of living index number.
Limitations or demerits of index numbers:
Although index numbers are indispensable tools in economics, business, management, etc., they have their limitations, and proper care should be taken while interpreting them. Some of the limitations of index numbers are:
1. Since index numbers are generally based on a sample, it is not possible to take into
account each and every item in the construction of index.
2. At each stage of the construction of index numbers, starting from selection of
commodities to the choice of formulae there is a chance of the error being introduced.
3. Index numbers are also a special type of average; since the various averages like the mean, median and G.M. have their relative limitations, their use may also introduce some error.

4. None of the formulae for the construction of index numbers is exact, and each contains the so-called formula error. For example, Laspeyres' index number has an upward bias while Paasche's index has a downward bias.
5. An index number is used to measure the change for a particular purpose only. Its misuse
for other purpose would lead to unreliable conclusions.
6. In the construction of price or quantity index numbers it may not be possible to retain the
uniform quality of commodities during the period of investigation.
3. STOCK MARKET INDEX NUMBER
A stock market index is a measure of the relative value of a group of stocks in numerical terms. As the stocks within an index change value, the index value changes. An index is important because it allows the performance of investments to be measured against a relevant market benchmark.
An Index is used to give information about the price movements of products in the financial,
commodities or any other markets. Financial indexes are constructed to measure price
movements of stocks, bonds, T-bills and other forms of investments. Stock market indexes are
meant to capture the overall behaviour of equity markets. A stock market index is created by
selecting a group of stocks that are representative of the whole market or a specified sector or
segment of the market. An Index is calculated with reference to a base period and a base index
value.
Stock indexes are useful for benchmarking portfolios, for generalizing the experience of all
investors, and for determining the market return used in the Capital Asset Pricing Model
(CAPM).
A hypothetical portfolio encompassing all possible securities would be too broad to measure, so
proxies such as stock indexes have been developed to serve as indicators of the overall market's
performance. In addition, specialized indexes have been developed to measure the performance
of more specific parts of the market, such as small companies.
It is important to realize that a stock price index by itself does not represent an average return to
shareholders. By definition, a stock price index considers only the prices of the underlying stocks
and not the dividends paid. Dividends can account for a large percentage of the total investment
return.
A stock market index (or just "index") is a number that measures the relative value of a group of stocks. As the stocks in this group change value, the index also changes value. If an index goes up by 1%, that means the total value of the securities which make up the index has gone up by 1%.
A stock market index measures the change in the stock prices of the index's components.
How it works/Example:
Let's say we want to measure the performance of the Indian stock market. Assume there are currently four public companies that operate in India: Company A, Company B, Company C and Company D.
In the year 2000, the four companies' stock prices were as follows:
Company A ₹10
Company B ₹8
Company C ₹12
Company D ₹25
Total ₹55
To create an index, we simply set the total (₹55) in the year 2000 equal to 100 and measure anyfuture periods against that total. For example, let's assume that in 2001 the stock prices were:
Company A ₹4
Company B ₹38
Company C ₹12
Company D ₹24
Total ₹78
Because ₹78 is 41.82% higher than the 2000 base, the index is now at 141.82. Every day,month, year, or other period, the index can be recalculated based on current stock prices.
Note that this index is price-weighted (i.e., the larger the stock price, the more influence it has onthe index). Indexes can be weighted by any number of metrics, including shares outstanding,market capitalization, or stock price.
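The arithmetic of this price-weighted index takes only a few lines; the Python sketch below uses the made-up figures from the illustration above:

```python
# Price-weighted index: the base-period total is set equal to 100 and
# later totals are scaled against it. Prices are the hypothetical
# figures from the illustration (in rupees).
prices_2000 = {"A": 10, "B": 8, "C": 12, "D": 25}   # base period
prices_2001 = {"A": 4, "B": 38, "C": 12, "D": 24}

base_total = sum(prices_2000.values())       # 55
current_total = sum(prices_2001.values())    # 78

index_2001 = current_total / base_total * 100
print(round(index_2001, 2))  # 141.82
```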
Some Important Stock Market Indices
Symbol | Name
XAX | Amex Composite
VOLNDX | DWS NASDAQ-100 Volatility Target Index
FTSEQ500 | FTSE NASDAQ 500 Index
RCMP | NASDAQ Capital Market Composite Index
IXIC | NASDAQ Composite
NQGM | NASDAQ Global Market Composite
NQGS | NASDAQ Global Select Market Composite
QOMX | NASDAQ OMX 100 Index
ILTI | NASDAQ OMX AeA Illinois Tech Index
QMEA | NASDAQ OMX Middle East North Africa Index
IXNDX | NASDAQ-100
NYA | NYSE Composite
OMXB10 | OMX Baltic 10
OMXC20 | OMX Copenhagen 20
OMXH25 | OMX Helsinki 25
OMXN40 | OMX Nordic 40
OMXS30 | OMX Stockholm 30 Index
RUI | Russell 1000
RUT | Russell 2000
RUA | Russell 3000
OEX | S&P 100
SPX | S&P 500
MID | S&P MidCap
NDXE | The NASDAQ-100 Equal Weighted Index
VINX30 | VINX 30
WLX | Wilshire 5000
Types of Stock Market Indices (National Stock Exchange)

(a) Broad Market Indices

These are broad-market indices, consisting of the large, liquid stocks listed on the exchange. They serve as a benchmark for measuring the performance of stocks or portfolios such as mutual fund investments.
Examples
CNX Nifty (The CNX Nifty is a well-diversified 50-stock index accounting for 23 sectors of the economy. It is used for a variety of purposes such as benchmarking fund portfolios, index-based derivatives and index funds.)
CNX Nifty Junior
LIX15 Midcap
CNX 100
Nifty Midcap 50
CNX Midcap
CNX Smallcap Index
India VIX
(b) Sectoral Indices
Sector-based indices are designed to provide a single value for the aggregate performance of a number of companies representing a group of related industries or a sector of the economy.
Examples
CNX Auto Index (The CNX Auto Index is designed to reflect the behaviour and performance of the automobile sector, which includes manufacturers of cars & motorcycles, heavy vehicles, auto ancillaries, tyres, etc. The CNX Auto Index comprises 15 stocks that are listed on the National Stock Exchange.)
CNX Bank Index
CNX Energy Index
CNX Finance Index
CNX FMCG Index
CNX IT Index
CNX Media Index
CNX Metal Index
CNX Pharma Index
CNX PSU Bank Index
CNX Realty Index
IISL CNX Industry Indices
(c) Thematic Indices
Thematic indices are designed to provide a single value for the aggregate performance of anumber of companies representing a theme.
Examples
CNX Commodities Index (The CNX Commodities Index is designed to reflect the behaviour and performance of a diversified portfolio of companies representing the commodities segment, which includes sectors like oil, petroleum products, cement, power, chemicals, sugar, metals and mining. The CNX Commodities Index comprises 30 companies that are listed on the National Stock Exchange (NSE).)
CNX Consumption Index
CPSE Index
CNX Infrastructure Index
CNX MNC Index
CNX Service Sector Index
CNX Shariah25
CNX Nifty Shariah / CNX 500 Shariah
CNX PSE Index
(d) Strategy Indices
Strategy indices are designed on the basis of quantitative models / investment strategies to provide a single value for the aggregate performance of a number of companies.

Examples
CNX 100 Equal Weight (The CNX 100 Equal Weight Index comprises the same constituents as the CNX 100 Index, a free-float market capitalization based index. The CNX 100 tracks the behaviour of the combined portfolio of two indices, viz. the CNX Nifty and the CNX Nifty Junior. It is a diversified 100-stock index. The maintenance of the CNX Nifty and the CNX Nifty Junior is synchronized so that the two indices will always be disjoint sets, i.e. a stock will never appear in both indices at the same time.)
CNX Alpha Index
CNX Defty
CNX Dividend Opportunities Index
CNX High Beta Index
CNX Low Volatility Index
CNX Nifty Dividend
NV20 Index
NI15 Index
Nifty TR 2X Leverage
Nifty TR 1X Inverse
(e) Fixed Income Indices
A fixed income index is used to measure the performance of the bond market. Fixed income indices are a useful tool for investors to measure and compare the performance of bond portfolios. Fixed income indices are also used for the introduction of Exchange Traded Funds.
Examples
GSEC10 NSE Index (The GSEC10 NSE index is constructed using the prices of the top 5 (in terms of traded value) liquid GOI bonds with residual maturity between 8 and 13 years and outstanding issuance exceeding Rs. 5,000 crore. The individual bonds are assigned weights considering the traded value and outstanding issuance in the ratio of 40:60. The index measures the changes in the prices of the bond basket.)
GSECBM NSE Index
(f) Index Concepts
Indices and index-linked investment products provide considerable benefits, but it is equally important to know the associated risk that comes as part of such exposure. In the investment world, risk is inseparable from performance and, rather than being desirable or undesirable, is simply necessary; understanding risk is one of the most important parts of a financial education. Important concepts and terminologies are associated with index construction. For example, beta helps us to understand the concepts of passive and active risk, and impact cost represents the cost of executing a transaction in a given stock, for a specific predefined order size, at any given point of time. These concepts are important for investors to learn from the information that indices contain about investment opportunities.
(g) Index Funds
An Index Fund is a type of mutual fund with a portfolio constructed to match the constituents of
the market index, such as CNX Nifty. An index fund provides broad market exposure and lower
operating expenses for investors.
Index funds today are a source of investment for investors looking for a long-term, less risky form of investment. The success of index funds depends on their low volatility and therefore on the choice of the index.

Examples
1 Principal Index Fund
2 UTI Nifty Index Fund
3 Franklin India Index Fund
4 SBI Nifty Index Fund
5 ICICI Prudential Index Fund
6 HDFC Index Fund - Nifty Plan
7 Birla Sun Life Index Fund
8 LIC NOMURA MF Index Fund - Nifty Plan
Uses of Stock Market Indices
With any type of investment it is important to measure the performance of that investment; otherwise there is no way to distinguish a good return on your money from a bad one.

A relevant stock market index serves that purpose. If your investments consistently lag behind the index, then you know you have a poor performer, and it may be time to find a new investment.
Stock market indexes are useful for a variety of reasons. Some of them are:

• They provide a historical comparison of returns on money invested in the stock market against other forms of investment such as gold or debt.
• They can be used as a standard against which to compare the performance of an equity fund.
• An index is a lead indicator of the performance of the overall economy or a sector of the economy.
• Stock indexes reflect highly up-to-date information.
• Modern financial applications such as Index Funds, Index Futures and Index Options play an important role in financial investments and risk management.
BSE SENSEX (Bombay Stock Exchange Sensitive Index)
The Sensex is an "index". What is an index? An index is basically an indicator. It gives you ageneral idea about whether most of the stocks have gone up or most of the stocks have gonedown. The Sensex is an indicator of all the major companies of the BSE.
BSE SENSEX is considered the barometer of Indian capital markets. If the Sensex goes up, it means that the stock prices of most of the major companies on the BSE have gone up. If the Sensex goes down, it tells you that the stock prices of most of the major stocks on the BSE have gone down.
BSE SENSEX, first compiled in 1986, was calculated on a "Market Capitalization-Weighted"
methodology of 30 component stocks representing large, well-established and financially sound
companies across key sectors. The base year of S&P BSE SENSEX was taken as 1978-79. S&P
BSE SENSEX today is widely reported in both domestic and international markets through print
as well as electronic media. It is scientifically designed and is based on globally accepted
construction and review methodology. Since September 1, 2003, BSE SENSEX is being
calculated on a free-float market capitalization methodology. The "free-float market
capitalization-weighted" methodology is a widely followed index construction methodology on
which majority of global equity indices are based; all major index providers like MSCI, FTSE,
STOXX, and Dow Jones use the free-float methodology.
The BSE Sensex currently consists of the following 30 major Indian companies as of October 2014:

Axis Bank Ltd
Bajaj Auto Ltd
Bharat Heavy Electricals Ltd
Bharti Airtel Ltd
Cipla Ltd
Coal India Ltd
Dr. Reddy's Laboratories Ltd
GAIL (India) Ltd
HDFC Bank Ltd
Hero MotoCorp Ltd
Hindalco Industries Ltd
Hindustan Unilever Ltd
Housing Development Finance Corporation Ltd
ICICI Bank Ltd
Infosys Ltd
ITC Ltd
Larsen & Toubro Ltd
Mahindra and Mahindra Ltd
Maruti Suzuki India Ltd
NTPC Ltd
Oil and Natural Gas Corporation Ltd
Reliance Industries Ltd
Sesa Goa Ltd
State Bank of India
Sun Pharmaceutical Industries Ltd
Tata Consultancy Services Ltd
Tata Motors Ltd
Tata Power Company Ltd
Tata Steel Ltd
Wipro Ltd
Nifty (National Stock Exchange Index)
Just like the Sensex, which was introduced by the Bombay Stock Exchange, Nifty is a major stock index in India, introduced by the National Stock Exchange.

NIFTY was coined from the two words 'National' and 'FIFTY'. The word fifty is used because the index consists of 50 actively traded stocks from various sectors.

So the Nifty index is a bit broader than the Sensex, which is constructed using 30 actively traded stocks on the BSE.
Nifty is calculated using the same methodology adopted by the BSE in calculating the Sensex, but with a few differences. They are:

• The base year is taken as 1995.
• The base value is set to 1000.
• Nifty is calculated on 50 stocks actively traded on the NSE.
• The 50 top stocks are selected from 24 sectors.

The selection criteria for the 50 stocks are also similar to the methodology adopted by the Bombay Stock Exchange.

Nifty is a weighted average of 50 stocks, meaning some stocks hold more "value" than other stocks. For example, ITC has more weight than Lupin.
List of the 50 stocks that have been included in the Nifty as of October 2014:

Name | Sector
ACC Ltd. | CEMENT AND CEMENT PRODUCTS
Ambuja Cements Ltd. | CEMENT AND CEMENT PRODUCTS
Asian Paints Ltd. | PAINTS
Axis Bank Ltd. | BANKS
Bajaj Auto Ltd. | AUTOMOBILES - 2 AND 3 WHEELERS
Bank of Baroda | BANKS
Bharat Heavy Electricals Ltd. | ELECTRICAL EQUIPMENT
Bharat Petroleum Corporation Ltd. | REFINERIES
Bharti Airtel Ltd. | TELECOMMUNICATION - SERVICES
Cairn India Ltd. | OIL EXPLORATION/PRODUCTION
Cipla Ltd. | PHARMACEUTICALS
Coal India Ltd | MINING
DLF Ltd. | CONSTRUCTION
Dr. Reddy's Laboratories Ltd. | PHARMACEUTICALS
GAIL (India) Ltd. | GAS
Grasim Industries Ltd. | CEMENT AND CEMENT PRODUCTS
HCL Technologies Ltd. | COMPUTERS - SOFTWARE
HDFC Bank Ltd. | BANKS
Hero Honda Motors Ltd. | AUTOMOBILES - 2 AND 3 WHEELERS
Hindalco Industries Ltd. | ALUMINIUM
Hindustan Unilever Ltd. | PERSONAL CARE
Housing Development Finance Corporation Ltd. | FINANCE - HOUSING
I T C Ltd. | CIGARETTES
ICICI Bank Ltd. | BANKS
IndusInd Bank Ltd. | BANKS
Infosys Technologies Ltd. | COMPUTERS - SOFTWARE
Infrastructure Development Finance Co. Ltd. | FINANCIAL INSTITUTION
Jindal Steel & Power Ltd. | STEEL AND STEEL PRODUCTS
Kotak Mahindra Bank Ltd. | BANKS
Larsen & Toubro Ltd. | ENGINEERING
Lupin Ltd. | PHARMACEUTICALS
Mahindra & Mahindra Ltd. | AUTOMOBILES - 4 WHEELERS
Maruti Suzuki India Ltd. | AUTOMOBILES - 4 WHEELERS
NMDC Ltd. | MINING
NTPC Ltd. | POWER
Oil & Natural Gas Corporation Ltd. | OIL EXPLORATION/PRODUCTION
Power Grid Corporation of India Ltd. | POWER
Punjab National Bank | BANKS
Reliance Industries Ltd. | REFINERIES
Sesa Sterlite Ltd. | MINING
State Bank of India | BANKS
Sun Pharmaceutical Industries Ltd. | PHARMACEUTICALS
Tata Consultancy Services Ltd. | COMPUTERS - SOFTWARE
Tata Motors Ltd. | AUTOMOBILES - 4 WHEELERS
Tata Power Co. Ltd. | POWER
Tata Steel Ltd. | STEEL AND STEEL PRODUCTS
Tech Mahindra Ltd. | COMPUTERS - SOFTWARE
UltraTech Cement Ltd. | CEMENT AND CEMENT PRODUCTS
United Spirits Ltd. | BREW/DISTILLERIES
Wipro Ltd. | COMPUTERS - SOFTWARE
Nifty and the Sensex
The Sensex and Nifty are both Indices. The Sensex, also called the BSE 30, is a stock market
index of 30 well-established and financially sound companies listed on Bombay Stock Exchange
(BSE). The Nifty, similarly, is an indicator of the 50 top major companies on the National Stock
Exchange (NSE).
The Sensex and Nifty are both indicators of market movement. If the Sensex or Nifty goes up, it means that most of the stocks in India went up during the given period. If the Nifty goes down, it tells you that the stock prices of most of the major stocks on the NSE have gone down.
Just in case you are confused: the BSE is the Bombay Stock Exchange and the NSE is the National Stock Exchange, and both are situated in Mumbai. These are the major stock exchanges in the country. There are other stock exchanges, like the Calcutta Stock Exchange, but they are not as popular as the BSE and the NSE. Most of the stock trading in the country is done through the BSE and the NSE.
TIME SERIES ANALYSIS
In plain English, a time series is simply a sequence of numbers collected at regular intervals overa period of time. In statistics, a time series is a sequence of numerical data points in successiveorder, usually occurring in uniform intervals. This concerns the analysis of data collected overtime, such as weekly values, monthly values, quarterly values, yearly values, etc.
Many statistical methods relate to data which are independent, or at least uncorrelated. There are
many practical situations where data might be correlated. This is particularly so where repeated
observations on a given system are made sequentially in time. Data gathered sequentially in time
are called a time series.
Here are some examples in which time series arise:
• Economics and Finance
• Environmental Modelling
• Meteorology and Hydrology
• Demographics
• Medicine
• Engineering
• Quality Control
The simplest form of data is a longish series of continuous measurements at equally spaced time points. That is, observations are made at distinct points in time, these time points being equally spaced, and the observations may take values from a continuous distribution.
The above setup can easily be generalized: for example, the times of observation need not be equally spaced, and the observations may take values from a discrete distribution.
If we repeatedly observe a given system at regular time intervals, it is very likely that the
observations we make will be correlated. So we cannot assume that the data constitute a random
sample. The time-order in which the observations are made is vital.
Objectives of time series analysis:
• description - summary statistics, graphs
• analysis and interpretation - find a model to describe the time dependence in the data, can we
interpret the model
• forecasting or prediction - given a sample from the series, forecast the next value, or the next
few values
• control - adjust various control parameters to make the series fit closer to a target
• adjustment - in a linear model the errors could form a time series of correlated observations,
and we might want to adjust estimated variances to allow for this
Types of time Series
1. continuous
2. discrete
Discrete means that observations are recorded at discrete times; it says nothing about the nature of the observed variable. The time intervals can be annual, quarterly, monthly, weekly, daily, hourly, etc.

Continuous means that observations are recorded continuously, e.g. temperature and/or humidity in some laboratory. Again, a time series can be continuous regardless of the nature of the observed variable.

Discrete time series can result when continuous time series are sampled. Sometimes quantities that do not have an instantaneous value get aggregated, also resulting in a discrete time series, e.g. daily rainfall. We will mostly study discrete time series in this course. Note that discrete time series are often the result of discretization of continuous time series (e.g. monthly rainfall).
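As a small illustration of sampling, the Python sketch below records a hypothetical continuous signal (a smooth daily temperature cycle, invented for illustration, not real data) every six hours, producing a discrete time series:

```python
import math

# A continuous signal sampled at regular intervals yields a discrete
# time series. The temperature function here is purely illustrative.
def temperature(t_hours):
    # smooth daily cycle around 20 degrees with amplitude 5
    return 20 + 5 * math.sin(2 * math.pi * t_hours / 24)

# Record the signal every 6 hours over one day.
sampled = [round(temperature(h), 2) for h in range(0, 24, 6)]
print(sampled)  # [20.0, 25.0, 20.0, 15.0]
```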
Uses of time series
There are two main uses of time series analysis: (a) identifying the nature of the phenomenon
represented by the sequence of observations, and (b) forecasting (predicting future values of the
time series variable). Both of these goals require that the pattern of observed time series data is
identified and more or less formally described. Once the pattern is established, we can interpret
and integrate it with other data (i.e., use it in our theory of the investigated phenomenon, e.g.,
seasonal commodity prices). Regardless of the depth of our understanding and the validity of our
interpretation (theory) of the phenomenon, we can extrapolate the identified pattern to predict
future events.
The usage of time series models is twofold:

• to obtain an understanding of the underlying forces and structure that produced the observed data;
• to fit a model and proceed to forecasting, monitoring, or even feedback and feedforward control.
Time Series Analysis is used for many applications such as:
• Economic Forecasting
• Sales Forecasting
• Budgetary Analysis
• Stock Market Analysis
• Yield Projections
• Process and Quality Control
• Inventory Studies
• Workload Projections
• Utility Studies
• Census Analysis
Time series analysis can be useful to see how a given asset, security or economic variable
changes over time or how it changes compared to other variables over the same time period. For
example, in stock market investments, suppose you wanted to analyze a time series of daily
closing stock prices for a given stock over a period of one year. You would obtain a list of all the
closing prices for the stock over each day for the past year and list them in chronological order.
This would be a one-year, daily closing price time series for the stock. Delving a bit deeper, you
might be interested to know if a given stock's time series shows any seasonality, meaning it goes
through peaks and valleys at regular times each year. Or you might want to know how a stock’sshare price changes as an economic variable, such as the unemployment rate, changes.
The analysis of time series is of great significance not only to economists and businessmen but also to scientists, astronomers, geologists, etc., for the reasons given below.
1) It helps in understanding past behaviour. It helps us understand what changes have taken place in the past, and such analysis is helpful in predicting future behaviour.
2) It helps in planning future operations: Statistical techniques have been evolved which enable a time series to be analysed in such a way that the influences which have determined the form of that series may be ascertained. If the regularity of occurrence of any feature over a sufficiently long period could be clearly established, then, within limits, prediction of probable future variations would become possible.
3) It helps in evaluating current accomplishments: The actual performance can be compared with the expected performance and the causes of variation analysed. For example, if the expected sale for 2000-01 was 10,000 washing machines and the actual sale was only 9,000, one can investigate the cause of the shortfall in achievement.
4) It facilitates comparison. Different time series are often compared and important
conclusions drawn therefrom.
Components of Time Series
The fluctuations of a time series can be classified into four basic types of variations, often called the components or elements of a time series. They are:
(1) Secular Trend or Long Term Movements (T)
(2) Seasonal Variations (S)
(3) Cyclical Variations (C)
(4) Irregular Variations (I)
The value (y) of a phenomenon observed at any point of time (t) is the net effect of all the above
mentioned categories of components of a time series. We will see them in detail here.
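A common way to formalize this net effect is the additive model y(t) = T(t) + S(t) + C(t) + I(t); a multiplicative form y = T × S × C × I is also used. The Python sketch below builds a purely artificial series from made-up trend, seasonal and irregular terms (a cyclical term could be added the same way):

```python
import math
import random

# Artificial series following an additive model y(t) = T(t) + S(t) + I(t).
# All numbers are invented for illustration only.
random.seed(42)

def observation(t):
    trend = 100 + 2 * t                              # secular trend T(t)
    seasonal = 10 * math.sin(2 * math.pi * t / 12)   # 12-period seasonality S(t)
    irregular = random.gauss(0, 1)                   # random disturbance I(t)
    return trend + seasonal + irregular

series = [observation(t) for t in range(24)]  # two "years" of monthly values
print(len(series))  # 24
```

Because the trend term grows steadily while the seasonal term averages out over each 12-period cycle, the second year of this artificial series sits clearly above the first.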
(1) Secular Trend
The secular trend is the main component of a time series, resulting from the long-term effect of socio-economic and political factors. This trend may show growth or decline in a time series over a long period. It is the type of tendency which continues to persist for a very long period. Prices and export and import data, for example, reflect obviously increasing tendencies over time.
(2) Seasonal Variations (Seasonal Trend)
These are short-term movements occurring in the data due to seasonal factors. The short term is generally considered a period in which changes occur in a time series with variations in weather or festivities. For example, it is commonly observed that the consumption of ice-cream during summer is generally high, and hence the sales of an ice-cream dealer would be higher in some months of the year and relatively lower during the winter months. Employment, output, exports, etc. are subject to change due to variations in weather. Similarly, sales of garments, umbrellas, greeting cards and fireworks are subject to large variations during festivals like Onam, Eid, Christmas, New Year, etc. These types of variation in a time series can be isolated only when the series is recorded biannually, quarterly or monthly.
(3) Cyclical Variations

These are long-term oscillations occurring in a time series. These oscillations are mostly observed in economic data, and the periods of such oscillations generally extend from five to twelve years or more. These oscillations are associated with the well-known business cycles. These cyclic movements can be studied provided a long series of measurements, free from irregular fluctuations, is available.
(4) Irregular Variations (Irregular Fluctuations)
These are sudden changes occurring in a time series which are unlikely to be repeated. It is that component of a time series which cannot be explained by trend, seasonal or cyclical movements. It is because of this fact that these variations are sometimes called the residual or random component. These variations, though accidental in nature, can cause a continual change in the trend, seasonal and cyclical oscillations during the forthcoming period. Floods, fires, earthquakes, revolutions, epidemics, strikes, etc. are the root cause of such irregularities.
School of Distance Education
Quantitative Methods for Economic Analysis - I Page 144
Measurement of Trend : Moving Average and the Method of least squares :
A moving average is the mean of time series data (observations equally spaced in time) from several consecutive periods. It is called 'moving' because it is continually recomputed as new data becomes available: it progresses by dropping the earliest value and adding the latest value. For example, the moving average of six-month sales may be computed by taking the average of sales from January to June, then the average of sales from February to July, then of March to August, and so on.

Moving averages (1) reduce the effect of temporary variations in data, (2) improve the 'fit' of data to a line (a process called 'smoothing') to show the data's trend more clearly, and (3) highlight any value above or below the trend.
1. Method of Moving Averages
Let us explain the concept of Moving Average with the aid of an example.
Suppose that the demand for skilled labourers for a construction project is given for the last 7 months as shown in the following table:

Month  : 1    2    3    4    5    6    7
Demand : 120  110  90   115  125  117  121
The engineer who is in charge of this project needs to predict the demand for the next month (the
8th month) based on the available data. He decided to take the average of the data and predicted
the demand as follows.
Average = (120 + 110 + 90 + 115 + 125 + 117 + 121)/7 = 114
But this method has a disadvantage. The above method is known as the Simple Mean
Forecasting Method. The main problem with this method is the space limitation for storing all of
the past data. If the data contains several thousand items, each of which has several hundred data
records, you need a lot of memory space to store this data on your computer. In addition, this
method is not very sensitive to a shift in recent data if it contains a large number of data points.
A solution to these problems is the Moving Averages technique. Using this method, you need
to maintain only the N most recent periods of data points. At the end of each period, the oldest
period's data is discarded and the newest period's data is added to the data base. The sum of these N values is then divided by N and used as a forecast for the next period.
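The procedure just described can be sketched in a few lines of Python (an illustrative sketch using the demand figures from the example above; the function name is ours, not a standard library routine):

```python
# Moving-average forecast: keep only the N most recent observations,
# sum them, divide by N, and use the result as next period's forecast.
def moving_average_forecast(data, n):
    """Forecast the next period as the mean of the last n observations."""
    window = data[-n:]              # only the N most recent periods are kept
    return sum(window) / n

demand = [120, 110, 90, 115, 125, 117, 121]    # months 1 to 7

# Simple Mean Forecasting Method: average of all seven months
simple_mean = sum(demand) / len(demand)        # 798 / 7 = 114.0

# Three-period moving average: uses only months 5, 6 and 7
ma3 = moving_average_forecast(demand, 3)       # (125 + 117 + 121) / 3 = 121.0

print(simple_mean, ma3)
```

Note that only the last N observations ever need to be stored, which is exactly the memory advantage discussed above.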
The formula for a three period moving average is given below:

F(t+1) = MA(3) = [A(t) + A(t-1) + A(t-2)] / 3

where F(t+1) is the forecast for the next period and A(t), A(t-1), A(t-2) are the three most recent observations.
Now using the three period moving average, the forecast for the above problem can be calculated as follows:

F(8) = MA(3) = [A(7) + A(6) + A(5)] / 3 = (121 + 117 + 125) / 3 = 363 / 3 = 121

So from the above example we can summarize as follows.
When a trend is to be determined by the method of moving averages, the average value for a number of years is secured, and this average is taken as the normal or trend value for the unit of time falling at the middle of the period covered in the calculation of the average. While applying this method, it is necessary to select a period for the moving average, such as a 3 yearly, 5 yearly or 8 yearly moving average.
The 3 yearly moving averages shall be computed as follows:

(a+b+c)/3, (b+c+d)/3, (c+d+e)/3, (d+e+f)/3, ...

The 5 yearly moving averages shall be computed as follows:

(a+b+c+d+e)/5, (b+c+d+e+f)/5, (c+d+e+f+g)/5, ...

Example
Calculate the 3 yearly moving average and 5 yearly moving average of the production figures given below.
For computing the three yearly trend, first find the three yearly moving totals a+b+c, b+c+d, c+d+e, etc. (Column 3 in the following table). Then find the average of each: since each is the sum of three observations, divide each total by 3 to get the moving average. Repeat the same process for 5 years, taking 5 instead of 3.
Year   y     3 yearly         3 yearly moving       5 yearly         5 yearly moving
             moving totals    averages (trend)      moving totals    averages (trend)
(1)    (2)   (3)              (4) = (3) ÷ 3         (5)              (6) = (5) ÷ 5
1990   242    _                _                     _                _
1991   250   744              248.0                  _                _
1992   252   751              250.3                 1246             249.2
1993   249   754              251.3                 1259             251.8
1994   253   757              252.3                 1260             252.0
1995   255   759              253.0                 1265             253.0
1996   251   763              254.3                 1276             255.2
1997   257   768              256.0                 1288             257.6
1998   260   782              260.7                 1295             259.0
1999   265   787              262.3                  _                _
2000   262    _                _                     _                _
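The computation can be sketched in Python (an illustrative helper, not part of the original text; the centred convention places each trend value against the middle year of its period, so a 5 yearly average has no value for the first and last two years):

```python
# Centred k-period moving averages: each trend value is the mean of the
# k observations centred on that year.  The first and last (k-1)//2
# positions have no trend value, as noted under the limitations below.
def centred_moving_averages(values, k):
    half = (k - 1) // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        out[i] = sum(values[i - half:i + half + 1]) / k
    return out

y = [242, 250, 252, 249, 253, 255, 251, 257, 260, 265, 262]  # 1990..2000
ma3 = centred_moving_averages(y, 3)
ma5 = centred_moving_averages(y, 5)
print(round(ma3[1], 1))   # trend value for 1991: 744/3 = 248.0
print(round(ma5[2], 1))   # trend value for 1992: 1246/5 = 249.2
```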
[Chart: Three yearly moving averages (trend values) plotted against the years 1990-2000]

[Chart: Five yearly moving averages (trend values) plotted against the years 1990-1998]

[Chart: Actual values, forecast and 3 period moving average of the forecast]
Merits of Moving Average Method
It is simple as compared to the method of least squares.

It is flexible: if a few more figures are added to the data, the entire calculations are not changed.

It has the advantage that it follows the general movements of the data and that its shape is determined by the data rather than the statistician's choice of a mathematical function.

It is particularly effective if the trend of a series is very irregular.
Limitations :
Trend values cannot be computed for all the years. The moving averages for the first few years and last few years cannot be obtained. It is often these extreme years in which we may be interested.
Selection of the proper period is a great difficulty. If a wrong period is selected, there is every likelihood that the conclusions may be misleading.
Since the moving average is not represented by a mathematical function, this methodcannot be used for forecasting.
It can be applied only to those series which show regular periodicity.
2. METHOD OF LEAST SQUARES:
The Least Squares Method is a statistical technique to determine the line of best fit for a model. The least squares method fits an equation with certain parameters to observed data. This method is extensively used in regression analysis and estimation.
In the most common application - linear or ordinary least squares - a straight line is sought to be
fitted through a number of points to minimize the sum of the squares of the distances (hence the
name "least squares") from the points to this line of best fit.
In contrast to a linear problem, a non-linear least squares problem has no closed solution and is
generally solved by iteration. The earliest description of the least squares method was by Carl
Friedrich Gauss in 1795.
Field data is often accompanied by noise. Even though all control parameters (independent
variables) remain constant, the resultant outcomes (dependent variables) vary. A process of
quantitatively estimating the trend of the outcomes, also known as regression or curve fitting,
therefore becomes necessary.
The curve fitting process fits equations of approximating curves to the raw field data.
Nevertheless, for a given set of data, the fitting curves of a given type are generally NOT unique.
Thus, a curve with a minimal deviation from all data points is desired. This best-fitting curve can
be obtained by the method of least squares.

The principle of least squares provides us an analytical or mathematical device to obtain an
objective fit to the trend of the given time series. Most of the data relating to economic and
business time series conform to definite laws of growth or predictions. This technique can be
used to fit linear as well as nonlinear trends.
Fitting linear trend
A straight line can be fitted to the data by the method of curve fitting based on the most popular
principle called the principle of least squares. Such a straight line is also known as Line of Best
fit. Let the line of best fit be described by an equation of the type y = a+bx where y is the value
of dependent variable, a and b are two unknown constants whose values are to be determined.
To find a and b, we apply the method of least squares. Let E be the sum of the squares of the deviations of all the original values from their respective values derived from the equation, so that

E = Σ[y − (a + bx)]²

By the calculus method, for a minimum, ∂E/∂a = 0 and ∂E/∂b = 0. Thus we get the two equations known as the normal equations. They are:

Σy = na + bΣx
Σxy = aΣx + bΣx²

Solving these two normal equations, we get a and b. Substituting these values in the equation y = a + bx, we get the trend equation.
Example:
Fit a linear trend to the following data by the least square method.
Year 2000 2002 2004 2006 2008
Production 18 21 23 27 16
Solution
Let x = t − 2004 …. (I)

Let the trend line of y (production) on x be

y = a + bx, (origin: 2004) ….. (II)

Year (t)   y      x = t − 2004    x²      xy      Ye = 21 + 0.1x    y − Ye
2000      18          −4          16     −72          20.6           −2.6
2002      21          −2           4     −42          20.8            0.2
2004      23           0           0       0          21.0            2.0
2006      27           2           4      54          21.2            5.8
2008      16           4          16      64          21.4           −5.4
        Σy = 105    Σx = 0     Σx² = 40  Σxy = 4                  Σ(y − Ye) = 0

The normal equations for estimating a and b in (II) are

Σy = na + bΣx
Σxy = aΣx + bΣx²
105 = 5a + b × 0        4 = a × 0 + b × 40

a = 105/5 = 21          b = 4/40 = 1/10 = 0.1
Substituting in (II), the straight line trend equation is given by
Y = 21+0.1x, (Origin :2004) ……..(III)
[x unit = 1 year and y = production in '000 units]
Putting x = −4, −2, 0, 2 and 4 in (III), we obtain the trend values (Ye) for the years 2000, 2002, …, 2008 respectively, as given in the last but one column of the table above.

The difference (y − Ye) is calculated in the last column of the table.
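The whole fit can be reproduced in a few lines of Python (an illustrative sketch; because Σx = 0 with the shifted origin, the normal equations reduce to a = Σy/n and b = Σxy/Σx²):

```python
# Least-squares straight-line trend y = a + b*x with the shifted origin
# x = t - 2004, so that sum(x) = 0 and the normal equations decouple.
years = [2000, 2002, 2004, 2006, 2008]
y = [18, 21, 23, 27, 16]
x = [t - 2004 for t in years]

n = len(y)
a = sum(y) / n                                                        # 105/5 = 21.0
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)   # 4/40 = 0.1

trend = [a + b * xi for xi in x]          # Ye = 21 + 0.1x for each year
residuals = [yi - te for yi, te in zip(y, trend)]
print(a, b)                               # 21.0 0.1
print(sum(residuals))                     # ~0, up to floating-point error
```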
We have Σ(y − Ye) = −2.6 + 0.2 + 2.0 + 5.8 − 5.4 = 8 − 8 = 0, as it should be.

Uses of Method of Least Squares
The least squares method (LSM) is probably the most popular technique in statistics. This is due
to several factors.
First, most common estimators can be cast within this framework. For example, the mean of a
distribution is the value that minimizes the sum of squared deviations of the scores.
Second, using squares makes LSM mathematically very tractable because the Pythagorean
theorem indicates that, when the error is independent of an estimated quantity, one can add the
squared error and the squared estimated quantity.
Third, the mathematical tools and algorithms involved in LSM (for eg. derivatives) have been
well studied for a relatively long time.
The use of LSM in a modern statistical framework can be traced to Galton (1886) who used it in
his work on the heritability of size which laid down the foundations of correlation and (also gave
the name to) regression analysis. The two antagonistic giants of statistics Pearson and Fisher,
who did so much in the early development of statistics, used and developed it in different
contexts (factor analysis for Pearson and experimental design for Fisher).
Nowadays, the least square method is widely used to find or estimate the numerical values of the
parameters to fit a function to a set of data and to characterize the statistical properties of
estimates. It exists in several variations: its simplest version is called ordinary least squares (OLS); a more sophisticated version is called weighted least squares (WLS), which often
performs better than OLS because it can modulate the importance of each observation in the final
solution. Recent variations of the least square method are alternating least squares (ALS) and
partial least squares (PLS).
Problems with least squares
Despite its popularity and versatility, LSM has its problems. Probably, the most important
drawback of LSM is its high sensitivity to outliers (i.e., extreme observations). This is a
consequence of using squares because squaring exaggerates the magnitude of differences (e.g.,
the difference between 20 and 10 is equal to 10, but the difference between 20² and 10² is equal
to 300) and therefore gives a much stronger importance to extreme observations. This problem is
addressed by using robust techniques which are less sensitive to the effect of outliers. This field
is currently under development and is likely to become more important in the near future.
MODULE IV
NATURE AND SCOPE OF ECONOMETRICS
Econometrics: Meaning, Scope, and Limitations - Methodology of econometrics-Modern
interpretation-Stochastic Disturbance term- Population Regression Function and Sample
Regression Function-Assumptions of Classical Linear regression model.
Introduction
Between the world wars, advances in mathematical statistics and a cadre of
mathematically trained economists led to econometrics, which was the name proposed for the
discipline of advancing economics by using mathematics and statistics. The roots of modern
econometrics can be traced to the American economist Henry L. Moore. Moore studied
agricultural productivity and attempted to fit changing values of productivity for plots of corn
and other crops to a curve using different values of elasticity. Moore made several errors in his
work, some from his choice of models and some from limitations in his use of mathematics.
Ragnar Frisch coined the word “econometrics” and helped to found both the Econometric Society in 1930 and the journal Econometrica in 1933.
It may be described as a branch of economics in which economic theory and statistical
methods are fused in the analysis of numerical and institutional data. The term econometrics
means ‘economic measurement,’ which is synonymous with empirical research in economics.
Econometrics is concerned with the measurement of data or the application of statistical
procedures, which have been formulated in mathematical terms. It is therefore a branch of
mathematical economics. Statistical data and statistical procedures are employed to provide
numerical results, which may be used for verification of or to help in verification of economic
theorems. Econometrics provides the quantitative information that may be used to make a
qualitative analysis empirically truer and more meaningful.
The term econometrics is formed from two Greek words which means, economy and measure.
Econometrics is a rapidly developing branch of economics. Econometrics aims to give empirical
content to economic relations. The term econometrics was first used by Pawel Ciompa in 1910.
But the credit of coining the term econometrics should be given to Ragnar Frisch (1936), one of
the founders of the Econometric Society. He was the person who established the subject in the
sense in which it is known today. Econometrics can be defined generally as “the application of mathematics and statistical methods to the analysis of economic data”. In the words of Samuelson, Koopmans and Stone, econometrics is defined as the quantitative analysis of actual
economic phenomena based on the concurrent development of the theory and observation,
related by appropriate methods of inference (1954). Other definitions of econometrics are:
Every application of mathematics or of statistical methods to the study of economic phenomena
(Malinvaud 1966)
The production of quantitative economic statements that either explain the behaviour of variables
we have already seen, or forecast (ie. predict) behaviour that we have not yet seen, or both
(Christ 1966)
Econometrics is the art and science of using statistical methods for the measurement of economic
relations (Chow, 1983).
Need for econometrics
Economic theory makes statements or hypotheses that are mostly qualitative in nature. For example, microeconomic theory states that, other things remaining the same, a reduction in the price of a commodity is expected to increase the quantity demanded of that commodity. Thus economic theory postulates a negative or inverse relation between price and quantity. But the theory does not provide any numerical measure of the relationship between the two. It is the job of the econometrician to provide such numerical estimates. Econometrics gives empirical content to most of economic theory.
Scope of Econometrics
To make the meaning of econometrics more clear and detailed, it is appropriate to quote Frisch (1933) in full: “……econometrics is by no means the same as economic statistics. Nor is it identical with what we call general economic theory, although a considerable portion of this
theory has a definitely quantitative character. Nor should econometrics be taken as synonymous
with the application of mathematics to economics. Experience has shown that each of these
three view points, that of statistics, economic theory, and mathematics, is necessary, but not by
itself a sufficient, condition for a real understanding of the quantitative relations in modern
economic life. It is this unification of all three that is powerful. And it is this unification that
constitutes econometrics”.
Let us consider the following example to understand this unification more clearly. From +2
classes onwards we learn demand function which explains that demand is a function of price,
assuming ceteris paribus. When we relax the assumption of ceteris paribus, we argue that
demand is influenced by four factors namely, price, price of substitutes, income and taste of the
consumer. So when we consider these four factors together, it is a case of exact relation. This
exact relation can be expressed in the form of a regression model, where quantity demanded is
dependent variable and price, price of substitutes, income and taste are the independent variables.
So this mathematical representation is again an exact relation. But practical wisdom suggests
that there are many more factors which influence the quantity demanded. Some new factors are
expectation of a price rise, coming of a new product, government policy and so on. Because of
the influence of these factors, our price-quantity relation is no longer exact. Then, naturally, there should be a provision to incorporate the influence of “other factors”. The inclusion of a provision for other factors is the uniqueness of econometrics, and how it is done is explained
in later pages.
Goals of econometrics
There are three main goals
1. Analysis- the testing of economic theory
2. Policy making -supplying numerical estimates which can be used for decision making
3. Forecasting – using numerical estimates to forecast future values.
1. Analysis: Testing Economic theory
The earlier economic theories started from a set of observations concerning the behaviour
of individuals as consumers or producers. Some basic assumptions were set regarding the
motivations of individual economic units. From these assumptions the economists by pure
logical reasoning derive some general conclusion regarding the working process of the economic
system. Economic theories thus developed in an abstract level were not tested against economic
reality. No attempt was made to examine whether the theories explained adequately the actual
economic behaviour of individuals.
Econometrics aims primarily at the verifications of economic theories. That is obtaining
empirical evidence to test the explanatory power of economic theories. To decide how well they
explain the observed behaviour of the economic units.
2. Policy making
Various econometric techniques can be used to obtain reliable estimates of the individual coefficients of economic relationships. The knowledge of the numerical value of these
coefficients is very important for the decision of the firm as well as the formulation of the
economic policy of the government. It helps to compare the effects of alternative policy
decisions.
For eg. If the price elasticity of demand for a product is less than one (inelastic demand)
it will not benefit the manufacturer to decrease its price, because his revenue would be reduced.
Since econometrics can provide numerical estimate of the co-efficients of economic relationships
it becomes an essential tool for the formulation of sound economic policies.
3. Forecasting future values
In formulating policy decisions it is essential to be able to forecast the value of the
economic variables. Such forecasts will enable the policy makers to make efficient decision. In
formulating policy decisions, it is essential to be able to forecast the value of the economic
magnitudes. For example, what will be the demand for food grains in India by 2020? Estimates
about this are essential for formulating agriculture production policies. Similarly, what will be
the impact of a rise in deposit rate in share market and so on? It is known that if the bank deposit
rates go up, day to day demand for shares will come down. Econometric tools help in such
decision makings.
Methodology of Econometric model building
As mentioned earlier, the scope of econometrics is widening day by day. The development of
computers further promoted the use of econometric tools. Thus it is relevant and useful to have
an insight into the methodology of developing an econometric model. The development of an
econometric model undergoes the following important stages or phases.
1. Specification of the model
2. Estimation of the model
3. Evaluation of estimates
4. Forecasting power of the model
1 Specification of the model
In econometric analysis we have to identify the relevant variables, express the relationship in
appropriate mathematical form and make estimates. In order to complete this process, we have
to go step by step.
The first step is to identify the relation to be studied and express that relation in the form of a
hypothesis. For example, if we are interested in testing the relevance of law of demand, choose
law of demand and express it in the form of a hypothesis. The law of demand states that there is
an inverse relation between price and quantity demanded. This can be expressed in the form of a
null hypothesis and alternative hypothesis.
The null hypothesis is: quantity demanded and price are unrelated, or quantity demanded and price are independent.
When we formulate null hypothesis, automatically an alternative hypothesis is also formed.
In this example, the alternative hypothesis will be “quantity demanded and price are related”
If we consider another example, the validity of psychological law of Keynes which relates
consumption expenditure and income, the suitable null hypothesis is consumption expenditure
and income are unrelated and the alternative hypothesis will be consumption expenditure and
income are related. These hypotheses will be used for testing the validity of estimated
coefficients, which will be discussed later.
Now let us discuss how to develop econometric models to test these hypotheses. First let us
start with the law of demand. The first step is identifying the relevant variables.
(a) Identification of variables: The most important and difficult part in developing an
econometric model is identification of relevant variables. One source of identifying the variables
is theory. Based on the law of demand we know that the variables are quantity demanded, price,
price of substitutes, income and taste of the consumers. Conventionally we believe that demand
depends on these factors. Thus demand is the dependent variable or regressand and price, price
of substitutes, income and taste are independent variables or regressors. There are certain
practical difficulties at this stage (1) there may be a host of variables influencing a phenomenon.
Then is it possible to identify all those variables? Even if we could identify all those variables, is
it appropriate to include all those variables in the model? If we are omitting certain important
variables, it will lead to errors. Similarly, if we include a large number of variables or unnecessary variables, it will also lead to errors. When such errors are committed in the development of an econometric model, they are called specification bias or specification error. So
let us assume that we are considering only price as the variable influencing quantity demanded,
assuming other factors remain constant. So let us write,
D = f (P)
where D represents quantity demanded, P represents price.
(b) Sign and magnitude of parameters: Once the function is identified, next task is to
attribute signs to the coefficients. Based on the general theory, we know that price takes a
negative sign. Thus we can convert the demand function into a demand equation as follows
D = α + βP, where α represents the intercept of the demand equation and β represents the slope of the demand equation.
But we know that price is not the only factor influencing demand, but at the same time it is
difficult to add all the variables. Thus to accommodate the unexplained variables or variables
which are not included in the model, we add a stochastic term U into the model, called
disturbance term or error term. The inclusion of an error term makes an econometric model
unique and distinct from a mathematical model or exact model. When an error term is included,
our demand equation model will become,
D = α + βP + U. This is a unique econometric model.
Similarly, in the case of consumption function, the variables are consumption expenditure,
income, savings, and government policy and so on. Conventionally we assume that consumption
expenditure depends on income, assuming other factors remain constant. Thus our consumption
function model will be,
C = α + βY + U, where C is consumption expenditure, Y is income, α is the intercept and β is the slope of the consumption function.
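The role of the disturbance term can be illustrated with simulated data (a hypothetical sketch: the "true" α and β and the income figures below are invented for the demonstration; least squares is then used to recover them):

```python
import random

# Hypothetical consumption function C = alpha + beta*Y + U.  The noise U
# stands in for all the omitted factors, so the observed points scatter
# around the true line rather than lying exactly on it.
random.seed(1)
alpha_true, beta_true = 50.0, 0.8
income = [100, 150, 200, 250, 300, 350, 400, 450]
consumption = [alpha_true + beta_true * inc + random.gauss(0, 5) for inc in income]

# Least-squares estimates of the intercept and slope
mean_y = sum(income) / len(income)
mean_c = sum(consumption) / len(consumption)
beta_hat = (sum((inc - mean_y) * (c - mean_c) for inc, c in zip(income, consumption))
            / sum((inc - mean_y) ** 2 for inc in income))
alpha_hat = mean_c - beta_hat * mean_y
print(alpha_hat, beta_hat)   # close to 50 and 0.8, but not exact, because of U
```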
(c) Mathematical form of the model: There are two issues discussed here. First issue is
whether we should follow a single equation approach or simultaneous equation approach.
Second issue is whether we should follow a linear equation or non linear equation. Economic
theory does not explain whether the system follows single equation or simultaneous models. It is
true that demand is a function of price. But at the same time, demand is a function of supply
also. If we are considering the interrelationships among economic variables, the appropriate
method is simultaneous equation model. However, in the present discussions let us limit to
single equation models.
The second issue is also very relevant. If we use a linear equation, there is an implied
assumption that the growth rate remains constant or, more precisely, that the β coefficient remains constant. When we estimate a demand equation, we assume that the rate of change in quantity demanded for a change in price is constant. Similarly, in the case of the consumption function, we assume that the slope (β) remains constant; in other words, the marginal propensity to consume remains constant. If we apply a little numerical wisdom, we can realize that the marginal propensity to consume can never be constant.
equation? Thus we have to keep in mind that linear equations are suitable for class room
analysis but not for policy research. However, after this caution, for the time being let us assume
that we follow a linear equation for the purpose of simple understanding and explanation.
When we develop an econometric model, time specifications are also very important.
Conventionally, for all current values we give the suffix “t”, for previous values “t−1”, and for all future values “t+1” (t*). Thus our models can be written as,
Dt = α + βPt + Ut …………. Demand equation

Ct = α + βYt + Ut ……… Consumption equation
Normally, the dependent variable is denoted by Y and independent variable by X. Thus
general framework of an econometric model can be written as,
Yt = α + βXt + Ut
When we incorporate only one independent variable, it is only a narrow situation of the
reality. When we want to make our model more realistic, we have to incorporate more number
of independent variables. When we use two independent variables, the model can be written as,
Yt = α + β1X1t + β2X2t + Ut
This is the simplest multiple regression model. When we have two or more independent variables, the model becomes a multiple regression model. The general form of a multiple
regression model can be written as,
Yt = α + β1X1t + β2X2t + β3X3t + ………… + βnXnt + Ut,
this is also written as,
Yt = α + ∑βiXit + Ut
Just like incorporating current variables, it is easy to incorporate lagged variables or expected
variables in a model. See the following example.
Yt = α + β1Pt + β2Yt−1 + β3W* + Ut, where the new variables are Yt−1, the lagged value of the variable Y, and W*, the expected value of W (Wt+1).
Similarly, there are situations where we cannot measure variables directly. In such situations, we can define a proxy variable or an instrumental variable and incorporate it in the model as usual. See
the following example
Yt = α + β1X1t + β2Zt + Ut
where Z is an instrumental variable or proxy variable. A proxy variable is a variable used to represent a qualitative or non-measurable phenomenon.
Another important question in developing an econometric model is whether we should go for
linear models or non linear models. This is a highly debatable issue and beyond the scope of this
course. The following are the other forms available.
Lin-log model:    Yt = α + β log Xt + Ut

Log-lin model:    log Yt = α + βXt + Ut

Double log model: log Yt = α + β log Xt + Ut
The choice of the model depends on many factors, particularly the scatter diagram of the dependent and independent variables. Among these, the double log model is often preferred because its coefficients directly give elasticity values.
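The elasticity property of the double log model follows from simple differentiation:

```latex
\log Y_t = \alpha + \beta \log X_t + U_t
\;\;\Longrightarrow\;\;
\beta = \frac{d(\log Y)}{d(\log X)} = \frac{dY/Y}{dX/X}
```

so β measures the percentage change in Y for a one per cent change in X, which is precisely the elasticity of Y with respect to X.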
Thus in the model specification stage we consider mainly the variables to be included in the model and the mathematical form of the model. Any error committed in this stage will lead to errors termed “specification bias” or “specification error”, as mentioned earlier.
2 Estimation of the model
As mentioned above, one of the objectives of econometric models is to estimate the
coefficients. Estimations are possible only if data are gathered. Data can be collected either by
census method or sample method. Important sampling methods used are simple random sample,
stratified sample, systematic sample, multistage sampling, cluster sampling and quota sampling.
Similarly, data are classified into primary data, secondary data, time series data, cross section
data and pooled data.
In econometric models, the distinction between time series data and cross section data are
important. To make its distinction clear, let us consider the following example,
Year 1999 2000 2002 2003 2004 2005 2007 2008 2010
Sales 15 14 17 14 12 14 17 14 12
A casual look into the data set gives an impression that it belongs to time series, because it is
ordered in time. But the given set is neither time series nor cross section. Why?
For a data set to be a time series, there are two conditions: the data should be collected at equal intervals, and they should refer to a single entity. The given set of data does not obey the first condition (the years are not equally spaced) and hence is not a time series. But if we are provided with sales data for a few years at regular intervals (one year, six months, etc.), they definitely constitute time series data.
Now what is cross section data? When we gather information on multiple entities at a point
of time, it is called cross section data. For example, if we are gathering details of income,
savings, education, occupation etc of a group of 35 persons at a point of time, it is the best
example of cross section data. In other words, survey data are broadly cross section data.
In short, time series data are gathered over intervals of time, while cross section data are gathered at a point of time. The classification into time series and cross section data is important because the appropriate techniques to use depend on the nature of the data, whether time series or cross section.
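The equal-interval condition can be checked directly on the years in the sales table above; the gaps between successive years are not all equal, which is why the series fails the test.

```python
# Years from the sales table in the text; a true (annual) time series
# would have a constant gap of 1 between successive observations.
years = [1999, 2000, 2002, 2003, 2004, 2005, 2007, 2008, 2010]

gaps = [b - a for a, b in zip(years, years[1:])]
print(gaps)                 # [1, 2, 1, 1, 1, 2, 1, 2]
print(len(set(gaps)) == 1)  # False -> the intervals are not equal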
Another type of data used in econometric modelling is pooled data. Pooled data, put simply, are an integration or mixing of time series and cross section data. The treatment of a pooled data set, however, is a little more complicated.
Aggregation problem
Once the data are collected, another issue to be dealt with is the aggregation problem. The aggregation problem arises from the irrational pooling of data. Aggregation problems are classified into aggregation over individuals, over commodities, over space and over time.
Aggregation over individuals arises when we sum the incomes of a few individuals or of firms. When we do this exercise, we are likely to commit errors. For example, if the incomes of three persons X, Y and Z are Rs 100000, Rs 10000 and Rs 500 respectively, their aggregate income is easily computed as Rs 110500 and the average income as Rs 36833; but this computation, as well as comparisons based on it, is unscientific and leads to the aggregation problem over individuals. We may also aggregate over the quantities of various commodities using appropriate quantity indexes, or over the prices of a group of commodities using an appropriate price index; these aggregations may lead to errors known as aggregation over commodities.
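The arithmetic in the three-person example can be reproduced directly; it also shows why the average is misleading here, since one person accounts for roughly 90 percent of the total.

```python
# Incomes of the three persons X, Y and Z from the text (in rupees).
incomes = {"X": 100000, "Y": 10000, "Z": 500}

total = sum(incomes.values())
average = total / len(incomes)
print(total)           # 110500
print(round(average))  # 36833

# The mean hides the huge dispersion: X alone holds about 90% of the
# total, which is why naive aggregation over individuals can mislead.
share_of_X = incomes["X"] / total
print(round(share_of_X, 3))  # 0.905
```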
When we collect data for different purposes, periodicity is very important, but in many practical situations this periodicity is not maintained. For example, in India data are recorded at two levels: some series are recorded by calendar year, while others are recorded by financial year. Accountants admit that these differences create considerable difficulties when computing certain ratios or comparing different years. This problem is called aggregation over time.
Finally, aggregating the populations of different towns, regions or countries also creates problems; this is called aggregation over space. The above sources of aggregation
create various complications which may impart some aggregation bias in the estimates of the
coefficients.
Identification problem
While discussing econometric methodology, econometricians mention the problem of identification of the coefficients. This problem arises seriously only in the case of simultaneous equation models, but a brief mention is made here.
We know that demand is a function of price; similarly, supply is also a function of price. At the equilibrium point, demand equals supply, so at this point we do not know whether we are estimating the parameters of the demand function or of the supply function. The problem becomes more complex when we deal with a system of a large number of equations.
Choice of the appropriate econometric technique: The next issue is the selection of the appropriate method for estimating the coefficients of economic relationships. The kit of econometric tools provides different techniques, which can be split into single equation techniques and simultaneous equation techniques. The important single equation techniques are the Ordinary Least Squares method, the Indirect Least Squares (reduced form) technique, the Two Stage Least Squares method, the Limited Information Maximum Likelihood method and mixed estimation.
Simultaneous equation techniques are applied to all equations of a system at once and give estimates of the coefficients of all the functions simultaneously. The most important are the Three Stage Least Squares method and the Full Information Maximum Likelihood method. The selection of the method depends on the following:
1. The nature of the relation and its identification condition.
2. The properties of the estimates of the coefficients obtained from each technique
3. Simplicity of the method
4. Time and cost requirements of the method
5. The desirable properties expected for the coefficients.
3 Evaluation of estimates
After the estimation of the model, the econometrician must proceed with the evaluation of
the results of the computations. That is, we are testing the reliability of the results. The
evaluation consists of deciding whether the estimates of the parameters are theoretically
meaningful and statistically satisfactory. For this purpose, we use different criteria, namely
a priori criteria, statistical criteria and econometric criteria.
Economic or a priori criteria
These are decided by the principles of economic theory and refer to the sign and magnitude of the parameters of economic relationships. Consider the demand equation D = α + βP. Here the coefficient β should be negative in the case of a normal good, and there is a range within which the values of α and β can vary. Similarly, in the consumption function, α and β respectively represent autonomous consumption and the marginal propensity to consume; normally β is positive and lies in the range 0 to 1. If an estimated coefficient takes an unexpected sign or magnitude, the
reliability of the estimates is doubtful and the model needs a review.
Statistical criteria (first order tests): The estimated coefficients may be acceptable a priori but need not be statistically valid. The validity of the model must therefore be ascertained using statistical criteria. The frequently used tests are the standard error, the "t" test, the coefficient of determination (R²) and the F ratio. These tests are discussed later in detail.
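As a hedged illustration of these first order checks (using made-up data, not figures from the text), the coefficient of determination and the t statistic for the slope can be computed by hand for a simple regression:

```python
import math

# Hypothetical (income, consumption) data, purely for illustration.
x = [10, 20, 30, 40, 50]
y = [8, 15, 24, 30, 38]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# OLS estimates from the closed-form formulas.
sxx = sum((xi - mx) ** 2 for xi in x)
beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
alpha = my - beta * mx
residuals = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]

# Coefficient of determination: R^2 = 1 - SSR / SST.
ssr = sum(e ** 2 for e in residuals)
sst = sum((yi - my) ** 2 for yi in y)
r2 = 1 - ssr / sst
print(round(r2, 4))  # 0.9973

# t statistic for beta: beta / s.e.(beta), with s^2 = SSR / (n - 2).
se_beta = math.sqrt(ssr / (n - 2) / sxx)
t_stat = beta / se_beta
print(round(t_stat, 2))  # 33.54
```

A high R² and a large t statistic, as here, would indicate a statistically satisfactory fit; whether the estimates are also economically meaningful is judged by the a priori criteria.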
Econometric Criteria (Second order test):The validity of the model also depends on the
validity of the assumptions of the model or more specifically the stochastic assumptions. If the
assumptions of the econometric method applied by the investigators are not satisfied, either the
estimates of the parameters cease to possess some of their desirable properties or the statistical
criteria lose their validity and become unreliable for the determination of the significance of
these estimates.
When the model does not satisfy the economic, statistical or econometric criteria, it is appropriate to respecify the model. This process of respecification and re-estimation should continue until we obtain reliable estimates.
4 Evaluating the forecasting power of the estimated model
Forecasting is one of the prime aims of econometric analysis and research. The forecasting
power will be based on the stability of the estimates, their sensitivity to changes in the size of the
sample. We must establish whether the estimated function performs adequately outside the
sample of data whose average variation it represents. One way of establishing the forecasting
power of a model is to use the estimates of the model for a period not included in the sample.
The estimated value or forecast value is compared with the actual or realized magnitude of the
relevant dependent variable. Usually there will be a difference between the actual and the
forecast value of the variable, which is tested with the aim of establishing whether it is
statistically significant. If, after conducting the relevant test of significance, we find that the
difference between the realized value of the dependent variable and that estimated from the
model is statistically significant, we conclude that the forecasting power of the model is poor.
Another way of establishing the stability of the estimates and the performance of the model outside the sample of data from which it was estimated is to re-estimate the function with an expanded sample, that is, a sample including additional observations. The original estimates
will normally differ from the new estimates. The difference is tested for statistical significance
with appropriate methods.
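The first approach described above, forecasting a period not included in the sample, can be sketched as follows (all figures hypothetical):

```python
# Fit on an estimation sample, then forecast an observation that was
# held out of the sample and inspect the forecast error.
x_train = [10, 20, 30, 40]
y_train = [8, 15, 24, 30]
x_new, y_new = 50, 41  # out-of-sample observation (hypothetical)

n = len(x_train)
mx, my = sum(x_train) / n, sum(y_train) / n
beta = (sum((a - mx) * (b - my) for a, b in zip(x_train, y_train))
        / sum((a - mx) ** 2 for a in x_train))
alpha = my - beta * mx

# Forecast error = actual value minus forecast value.
forecast = alpha + beta * x_new
error = y_new - forecast
print(round(forecast, 2), round(error, 2))  # 38.0 3.0
```

In practice the error would then be subjected to a test of significance; only if the difference is statistically significant would we conclude that the forecasting power of the model is poor.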
Desirable properties of an Econometric model
1. Theoretical plausibility: The model should explain clearly the economic theory or
phenomena to which it relates.
2. Explanatory ability: The model should be able to explain the observations of the actual
world.
3. Accuracy of the estimates of the parameters: The estimates of the coefficients should be
accurate in the sense that they should possess the desirable properties of unbiasedness,
consistency and efficiency.
4. Forecasting ability: The model should produce satisfactory predictions of future values of the dependent variable.
5. Simplicity: The model should represent the economic relationships as simply as possible. The fewer the equations and the less complicated the mathematical form, the better the model, other things being equal.
Types of econometrics
Econometrics may be divided into two broad categories: theoretical econometrics and applied econometrics.
Theoretical econometrics is concerned with the development of appropriate methods for measuring economic relationships specified by econometric models. For example, one method used extensively is the principle of least squares.
In applied econometrics, the tools of theoretical econometrics are used to study particular areas of economics and business, such as production functions, investment functions, and demand and supply functions.
Uses of Econometrics
1. Econometrics is widely used in policy formulation.
For example, suppose the government wants to devalue its currency to correct a balance of payments problem. To estimate the consequences of devaluation, the price elasticities of imports and exports are needed. If import and export demand are inelastic, devaluation will not produce the necessary change; if they are elastic, the balance of payments will improve with devaluation. These price elasticities can be estimated with the help of import and export demand functions, and an econometric model can be built through which the relevant parameters are estimated.
2. Econometrics helps the producers in making rational calculations.
3. Econometrics is also useful in verifying theories.
4. Studies of econometrics mainly consist of testing of hypothesis, estimation of the parameters
and ascertaining the proper functional form of the economic relations.
Limitations of the Econometric Approach
Econometrics has come a long way over a relatively short period of time. Important advances
have been made in the compilation of data, development of concepts, theories and tools for the
construction and evaluation of a wide variety of econometric models. Applications of
econometrics can be found in almost every field of economics. Nowadays there is even a tendency to use econometric tools in other sciences such as sociology, political science, agriculture and management. Econometric models have been used frequently by government
departments, international organizations and commercial enterprises. At the same time,
experience has brought out a number of difficulties also in the use of econometric tools. The
important limitations are,
1. Quality of data: Econometric analysis and research depend on an intensive data base. One of the serious problems of Indian econometric research is the non-availability of accurate, timely and reliable data.
2. Imperfections in economic theory: Earlier it was felt that economic theory was sufficient to provide a base for model building. Later it was realized that many economic theories are illusory because they are based on the assumption of ceteris paribus, and hence models cannot fully accommodate the dynamic forces behind a phenomenon.
3. There are institutional features and accounting conventions that have to be allowed for in
econometric models but which are either ignored or are only partially dealt with at the
theoretical level.
4. Any economic phenomenon is influenced by social, cultural, political, physiological and even physical factors. These factors cannot be easily quantified. Even if quantified, they
may not be capable of explaining the phenomenon properly. For example, it is said that
the intelligentsia of Indian planners gave birth to very beautiful mathematical models, but
they forgot to feed the hungry masses.
Thus we may conclude our discussion on econometrics by restating the following.
Economists develop economic models to explain consistently recurring relationships. Their
models link one or more economic variables to other economic variables. For example,
economists connect the amount individuals spend on consumer goods to disposable income and
wealth, and expect consumption to increase as disposable income and wealth increase (that is,
the relationship is positive).
There are often competing models capable of explaining the same recurring relationship, called
an empirical regularity, but few models provide useful clues to the magnitude of the association.
Yet this is what matters most to policymakers. When setting monetary policy, for example,
central bankers need to know the likely impact of changes in official interest rates on inflation
and the growth rate of the economy. It is in cases like this that economists turn to econometrics.
Econometrics uses economic theory, mathematics, and statistical inference to quantify economic phenomena. In other words, it turns theoretical economic models into useful tools for economic policymaking. The objective of econometrics is to convert qualitative statements (such as "the relationship between two or more variables is positive") into quantitative statements (such as "consumption expenditure increases by 95 cents for every one dollar increase in disposable income"). Econometricians, the practitioners of econometrics, transform models developed by economic theorists into versions that can be estimated. As Stock and Watson put it, "econometric methods are used in many branches of economics, including finance, labor economics, macroeconomics, microeconomics, and economic policy." Economic policy decisions are rarely made without econometric analysis to assess their impact.
Econometrics can be divided into theoretical and applied components.
Theoretical econometricians investigate the properties of existing statistical tests and procedures
for estimating unknowns in the model. They also seek to develop new statistical procedures that
are valid (or robust) despite the peculiarities of economic data—such as their tendency to change
simultaneously. Theoretical econometrics relies heavily on mathematics, theoretical statistics,
and numerical methods to prove that the new procedures have the ability to draw correct
inferences.
Applied econometricians, by contrast, use econometric techniques developed by the theorists to
translate qualitative economic statements into quantitative ones. Because applied
econometricians are closer to the data, they often run into—and alert their theoretical
counterparts to—data attributes that lead to problems with existing estimation techniques. For
example, the econometrician might discover that the variance of the data (how much individual
values in a series differ from the overall average) is changing over time.
The main tool of econometrics is the linear multiple regression model, which provides a formal
approach to estimating how a change in one economic variable, the explanatory variable, affects
the variable being explained, the dependent variable—taking into account the impact of all the
other determinants of the dependent variable. This qualification is important because a regression
seeks to estimate the marginal impact of a particular explanatory variable after taking into
account the impact of the other explanatory variables in the model.
The methodology of econometrics is fairly straightforward. It involves four steps, as explained below.
The first step is to suggest a theory or hypothesis to explain the data being examined. The
explanatory variables in the model are specified, and the sign and/or magnitude of the
relationship between each explanatory variable and the dependent variable are clearly stated. At
this stage of the analysis, applied econometricians rely heavily on economic theory to formulate
the hypothesis. For example, a tenet of international economics is that prices across open borders
move together after allowing for nominal exchange rate movements (purchasing power parity).
The empirical relationship between domestic prices and foreign prices (adjusted for nominal
exchange rate movements) should be positive, and they should move together approximately one
for one.
The second step is the specification of a statistical model that captures the essence of the theory
the economist is testing. The model proposes a specific mathematical relationship between the
dependent variable and the explanatory variables—on which, unfortunately, economic theory is
usually silent. By far the most common approach is to assume linearity—meaning that any
change in an explanatory variable will always produce the same change in the dependent variable
(that is, a straight-line relationship).
Because it is impossible to account for every influence on the dependent variable, a catchall variable is added to the statistical model to complete its specification. The role of the catchall is to represent all the determinants of the dependent variable that cannot be accounted for, because of either the complexity of the data or their absence. Economists usually assume that this "error" term averages to zero and is unpredictable, simply to be consistent with the premise that the statistical model accounts for all the important explanatory variables.
The third step involves using an appropriate statistical procedure and an econometric software
package to estimate the unknown parameters (coefficients) of the model using economic data.
This is often the easiest part of the analysis, thanks to readily available economic data and excellent econometric software. But just because something can be computed does not mean it makes economic sense to do so.
The fourth step is by far the most important: administering the smell test. Does the estimated
model make economic sense—that is, yield meaningful economic predictions? For example, are
the signs of the estimated parameters that connect the dependent variable to the explanatory
variables consistent with the predictions of the underlying economic theory? (In the household
consumption example, for instance, the validity of the statistical model would be in question if it
predicted a decline in consumer spending when income increased). If the estimated parameters
do not make sense, how should the econometrician change the statistical model to yield sensible
estimates? And does a more sensible estimate imply an economically significant effect? This
step, in particular, calls on and tests the applied econometrician’s skill and experience.
REGRESSION ANALYSIS
The term regression was introduced by Francis Galton. Regression analysis is concerned
with the study of the dependence of one variable (the dependent variable) on one or more other variables (the explanatory variables), with a view to estimating the average (mean) value of the former in terms of known (fixed) values of the latter.
Galton found that, although there was a tendency for tall parents to have tall children and for short parents to have short children, the average height of children born of parents of a given height tended to move, or "regress", towards the average height of the population as a whole. In other words, the height of the children of unusually tall or unusually short parents tends to move towards the average height of the population. In the modern view of regression, the concern is with finding out how the average height of sons changes, given the father's height. Regression analysis is largely concerned with estimating and/or predicting the (population) mean value of the dependent variable on the basis of the known or fixed values of the explanatory variable.
Origin of the Linear Regression Model
There are different methods for estimating the coefficients of the parameters. Of these different
methods, the most popular and widely used is the regression technique using Ordinary Least
Square (OLS) method. This method is popular because of the desirable properties of the estimates it yields. But first, let us try to understand the rationale of the method. For this purpose, let us go back to the demand theory and the consumption function discussed in the earlier chapter. Demand theory says that there is a negative relation between price and quantity demanded, ceteris paribus. In the case of the consumption function, there is a positive relation between consumption expenditure and income. Three important questions arise here.
1. Which is the dependent variable and which is the independent variable?
2. Which is the appropriate mathematical form which explains the phenomenon?
3. What is the expected sign and magnitude of the coefficients?
In order to answer these questions, the theory will give the necessary support.
In the case of the demand equation, quantity demanded is the dependent variable and price is the independent variable. Economic theory does not discuss the choice between single equation models and simultaneous equation models for explaining the relationship, so we may naturally assume that the relation is explained with a single equation, and further assume that the relation is linear. As for the sign and magnitude of the coefficients in the equation D = α + βP + U: α can take any value but is preferably zero or positive, since it shows the quantity demanded at a price of zero; a negative quantity demanded is meaningless, so a negative estimate of α may be approximated to zero. The coefficient β can be positive or negative, but it will normally be negative, assuming that the commodity is a normal good. The elasticity of demand for the commodity also influences the magnitude of this value.
In the case of the consumption function, consumption is the dependent variable and income is the independent variable. Whether the relation is linear or non-linear is a debatable issue. For instance, the psychological law of Keynes suggests that when income increases, consumption also increases, but less than proportionately. So assuming that consumption and income are linearly related is, in a way, an oversimplification; but for the time being let us assume so for explanatory purposes. The parameters α and β have a clear interpretation: α represents consumption when income is zero, that is, according to theory, autonomous consumption, while β is the marginal propensity to consume, which is normally less than 1 and cannot be negative.
Based on the rationale discussed above, let us write the demand equation as D = α + βP + U, where D is the quantity demanded, P is price, and α and β are the parameters to be estimated. To estimate these parameters, we use the Ordinary Least Squares (OLS) method. Once we plot the data on a graph, we can see the deviations between the actual and estimated observations, popularly called errors. Naturally, a rational decision is to minimize these errors: from all possible lines, we choose the one for which the deviations of the points from the line are the smallest possible. The least squares criterion requires that the regression line be drawn so as to minimize the sum of the squares of the deviations of the observations from it. One might first think of drawing the line so that the sum of the simple deviations of the observations is zero. Some observations lie above the line and have a positive deviation; some lie below the line and have a negative deviation; and points lying on the line have a zero deviation. In summing these deviations the positive values offset the negative values, so that the final algebraic sum of the residuals equals zero. Mathematically, ∑e = 0. Since the sum of the deviations is always zero in this sense, it cannot be minimized as such. So we square the deviations and minimize the sum of squares, ∑e². This is why the method is called the least squares method.
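A minimal sketch of this idea, using hypothetical income and consumption figures: the OLS slope and intercept follow from closed-form formulas, and the resulting residuals sum to zero while their sum of squares is the smallest attainable by any straight line.

```python
# Hypothetical (income, consumption) data, purely for illustration.
x = [10, 20, 30, 40, 50]  # income
y = [8, 15, 24, 30, 38]   # consumption

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Closed-form OLS: beta = cov(x, y) / var(x), alpha = mean(y) - beta * mean(x)
beta = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        / sum((xi - mx) ** 2 for xi in x))
alpha = my - beta * mx
print(alpha, beta)  # 0.5 0.75

# Residuals of the fitted line: they sum to zero (the ∑e = 0 property),
# and among all straight lines this one minimizes the sum of squares ∑e².
residuals = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
print(sum(residuals))                  # 0.0
print(sum(e ** 2 for e in residuals))  # 1.5
```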
Population Regression Function (PRF)
Mathematically, a population regression function (PRF) or conditional expectation function (CEF) can be defined as the average value of the dependent variable for a given value of the explanatory or independent variable. In other words, the PRF describes how the average value of the dependent variable varies with the given value of the explanatory variable. On the other hand, when we estimate the average value of the dependent variable with the help of a sample, the result is called the (stochastic) sample regression function (SRF).
E(Y | Xi) = f (Xi)
where f (Xi) denotes some function of the explanatory variable X.
This relation is known as the conditional expectation function (CEF) or population regression function (PRF). It states merely that the expected value of the
distribution of Y given Xi is functionally related to Xi. In simple terms, it tells how the mean or average response of Y varies with X. For example, an economist might posit that consumption expenditure is linearly related to income. Therefore, as a first approximation or a working hypothesis, we may assume that the PRF E(Y | Xi) is a linear function of Xi:

E(Y | Xi) = β1 + β2Xi

where β1 and β2 are unknown but fixed parameters known as the regression coefficients; β1 and β2 are also known as the intercept and slope coefficients, respectively.

We can express the deviation of an individual Yi around its expected value as follows:

ui = Yi − E(Y | Xi), or
Yi = E(Y | Xi) + ui

where the deviation ui is an unobservable random variable taking positive or negative values. Technically, ui is known as the stochastic disturbance or stochastic error term.
We can say that the expenditure of an individual family, given its income level, can be
expressed as the sum of two components: (1) E(Y | Xi), which is simply the mean consumption
expenditure of all the families with the same level of income. This component is known as the
systematic, or deterministic, component, and (2) ui, which is the random, or nonsystematic,
component is a surrogate or proxy for all the omitted or neglected variables that may affect Y but
are not (or cannot be) included in the regression model.
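The idea of the PRF as a conditional mean can be sketched with a small hypothetical "population" of families: at each income level, the PRF value is simply the average consumption expenditure of the families with that income.

```python
# Hypothetical population of (income, consumption) pairs. The PRF is the
# conditional mean E(Y | X = x): average consumption at each income level.
data = [(80, 60), (80, 65), (80, 70), (100, 75), (100, 80), (100, 85)]

levels = sorted(set(x for x, _ in data))
cef = {
    x: sum(y for xi, y in data if xi == x) / sum(1 for xi, _ in data if xi == x)
    for x in levels
}
print(cef)  # {80: 65.0, 100: 80.0}
```

Each individual family's expenditure then deviates from its conditional mean by its own ui; for instance, the family with (80, 60) has a deviation of 60 − 65 = −5.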
If E(Y | Xi) is assumed to be linear in Xi, it may be written as
Yi = E(Y | Xi) + ui
= β1 + β2Xi+ ui
Sample regression function (SRF)
Since the entire population is not available, we have to estimate the PRF on the basis of sample information. From a given sample we can estimate the mean value of Y corresponding to chosen Xi values. The estimated PRF may not be accurate because of sampling fluctuations; only an approximation of the PRF can be obtained. In general, we would get N different sample regression functions (SRFs) for N different samples, and these SRFs are not likely to be the same.
We can develop the concept of the sample regression function (SRF) to represent the sample regression line:

Ŷi = β̂1 + β̂2Xi

where Ŷ is read as "Y-hat" or "Y-cap", and

Ŷi = estimator of E(Y | Xi)
β̂1 = estimator of β1
β̂2 = estimator of β2
Note that an estimator, also known as a (sample) statistic, is simply a rule or method that tells how to estimate the population parameter from the information provided by the sample at hand.
We can express the SRF in its stochastic form as follows:

Yi = β̂1 + β̂2Xi + ûi

where, in addition to the symbols already defined, ûi denotes the estimate of the error term (the residual).
Significance of the stochastic error term
The disturbance term ui is a surrogate for all those variables that are omitted from the model but that collectively affect Y.
1. Vagueness of theory
The theory determining the behavior of Y may be incomplete. We might know for certain that weekly income X influences weekly consumption expenditure Y, but we might be ignorant or unsure about the other variables affecting Y. Therefore ui may be used as a substitute for all the variables excluded or omitted from the model.
2. Unavailability of data
Even if we know what some of the excluded variables are, we may not have quantitative information about them. For example, in principle we could introduce family wealth as an explanatory variable, in addition to income, to explain family consumption expenditure. But unfortunately, information on family wealth generally is not available.
3. Core variables versus peripheral variables
Assume in our consumption-income example that besides income X1, the number of children per family X2, sex X3, religion X4, education X5, and geographical region X6 also affect consumption expenditure. But it is quite possible that the joint influence of all these variables is so small that they need not be introduced in the model individually. Their combined effect can be treated as the random variable ui.
4. Intrinsic randomness in human behavior
Even if all the relevant variables affecting Y are introduced into the model, there may be variations due to intrinsic randomness in individual behavior which cannot be explained. The disturbance term ui also includes this intrinsic randomness.
5. Poor proxy variables
Although the classical regression model assumes that the variables Y and X are measured accurately, in practice there may be errors of measurement. Variables used as proxies may not provide accurate measurements. The disturbance term u can also be used to capture such errors of measurement.
6. Principle of parsimony
A regression model should be formulated as simply as possible. If the behavior of Y can be explained with the help of two or three explanatory variables, then more variables need not be included in the model; we let ui represent all the other variables. This does not mean, however, that relevant and important variables should be excluded just to keep the regression model simple.
7. Wrong functional form
Even if we have the theoretically correct variables explaining a phenomenon, and even if it is possible to obtain data on these variables, very often the functional relationship between the dependent and independent variables is uncertain. In two-variable models the functional relation can be ascertained with the help of a scattergram, but in a multiple regression model it is not easy to
determine the appropriate functional form, since a scattergram cannot be visualised in multi-dimensional form. For all these reasons, the stochastic disturbance ui assumes an extremely critical role in regression analysis.
Assumptions of Classical Linear Regression Model
1. U is a random real variable. The value it assumes in any one period depends on chance; it may be positive, zero or negative, and each value has a certain probability of being assumed by U in any particular instance.
2. The mean value of U in any particular period is zero. If we consider all the possible values of U for any given value of X, they would have an average value equal to zero. With this assumption we may say that Y = α + βX + U gives the relationship between X and Y on the average; that is, when X assumes the value X1, the dependent variable will on average assume the corresponding value Y1, although the actual value of Y observed on any particular occasion may display some variation.
3. The variance of U is constant in each period. The variance of U about its mean is constant at all values of X; in other words, for all values of X, the U values show the same dispersion around their mean.
4. The variable U has a normal distribution.
5. The random terms of different observations are independent. This means that the covariance of any ui with any other uj is equal to zero.
6. U is independent of the explanatory variables.
The above mentioned assumptions are really classic to regression estimations and make the
method OLS efficient.
There are a few other assumptions also used in OLS estimation. They are:
(i) The explanatory variables are measured without error. In the case of the dependent variable, measurement error may or may not arise.
(ii) The explanatory variables are not perfectly linearly correlated. If there is more than one explanatory variable in the relationship, it is assumed that they are not perfectly correlated with each other. More specifically, we assume the absence of perfect multicollinearity.
(iii) There is no aggregation problem. In the previous chapter, we discussed aggregation over
individuals, time, space and commodities. So we assume the absence of all these problems.
(iv) The relationship being estimated is identified. This means that we are estimating a unique mathematical form; there is no confusion about the coefficients or the equation to which they belong.
(v) The relationship is correctly specified. It is assumed that we have not committed any
specification error in determining the explanatory variables, in deciding the mathematical form
etc.
*************