
CHAPTER 1

INTRODUCTION

Statistical methods are increasingly applied to socio-economic parameters and related problems. Socio-economic statistics thus form an important factor in the development of a country. They include a vast array of information on health, literacy status, the labour force, economic status such as poverty and unemployment, and population parameters like fertility, migration and gender effects.

Socio-economic parameters and related problems:

The socio-economic parameters we have chosen to study using the official

statistics are mainly rural poverty, literacy, migration, child-labour, unemployment

and related problems.

1. Literacy: Literacy is fundamental to many state-sponsored interventions in less developed countries because of its pervasive influence on economically relevant variables such as productivity, health and earnings, quite apart from its intrinsic value as a vitally important goal of development. Traditionally, a society's literacy is measured by the "literacy rate", that is, the percentage (or, equivalently, the fraction) of the adult population that is literate. The present proposal will try to maintain that the distribution of literates across households may also matter, due to the external effects of literacy – the benefits that illiterate members of a household derive from having a literate person in the family. This argument was introduced by Basu and Foster (1998), who suggested a new measure of literacy, the "effective literacy" rate, to which further amendments were proposed by Subramanian (1999). The measure of literacy most widely employed is the familiar literacy rate r, which is the proportion of the adult population that is literate: r = L/N, where L is the number of literates and N the number of persons.

2. Religion composition: The term religion is used to mean any shared set of beliefs, activities and institutions premised upon faith in supernatural forces. Abundant evidence affirms that religious belief affects a wide range of behavioural outcomes, and religious activity can also affect economic performance at the level of the individual, group, state or nation. Hindus, Muslims, Christians, Sikhs, Buddhists, Jains etc. are some of the main religious communities in Gujarat.

3. Rural poverty and Total Factor Productivity: In India, only 28% of the population lived in urban areas as per the 2001 census (Pranati Datta, 2006). There is still large population pressure in rural India, so the best way to study poverty in India is to study rural poverty. Poverty in rural India declined substantially during the period 1970-1993. With few exceptions (Bardhan, 1973; Griffin and Ghose, 1979), most of these studies have found an inverse relationship between growth in agricultural income and the incidence of rural poverty. A further purpose of our research is to investigate the contribution of agricultural growth and other socio-economic parameters to rural poverty reduction in India. The poverty measure we use is the head-count ratio (HCR), which is the proportion of the national population whose incomes are below the official threshold (or thresholds) set by the national government. The main explanatory factors chosen for our study are rural development expenditure by government (DEV), percentage of cropped area sown with high yielding varieties (HYV), percentage of cropped area irrigated (IRR), percentage of villages electrified (ELE), percentage of rural population that is literate (LITE), road density in rural India (ROAD), total factor productivity growth in Indian agriculture (TFP) and changes in rural wages (WAGES).

As a result of the rapid adoption of new technology and improved rural infrastructure, agricultural production and factor productivity have both grown rapidly in India. Five major crops (rice, wheat, sorghum, pearl millet and maize), 14 minor crops (barley, cotton, groundnut, other grains, other pulses, potato, rapeseed, mustard, sesame, sugar, tobacco, soybeans, jute and sunflower) and three major livestock products (milk, meat and chicken) are included in this measure of total production. Total factor productivity (TFP) is defined as aggregate output minus aggregate input. Five inputs (labor, land, fertilizer, tractors and buffalos) are included.

4. In-migration: We have applied cluster analysis techniques to examine the degree of migration integration in Gujarat state, focusing in particular on the in-migration process. Dalkhat Ediev (2003) and David Coleman (2006) have studied migration as a factor of population reproduction. Migration remains a major player in the population structure and hence plays a key role in the transmission of population related policies for the particular area (Gujarat). We have focused on a data set consisting mainly of reason variables for in-migration in Gujarat from distinct destinations.

5. Child labour: Our study also deals with the incidence of child labour in India for the period 1990-1991. According to an article by D. P. Chaudhri (1997), the incidence of total child labour in India is around 10 million during the decade of 2000, and he also noted that it showed a declining trend. Here we are interested in checking this assumption. We have also tried to focus on the impact of socio-economic parameters on the decline in child labour.

6. Unemployment: Provision of increasing employment opportunities both

in urban and rural areas to solve the problem of unemployment has been recognized

as an important objective of economic planning in India. Unemployment is measured

through labour force surveys which elicit the “activity” status of the respondent for a

given reference period. Our focus was to study the impact of socio-economic variables on

rural unemployment.

MULTIVARIATE TECHNIQUES:

In this research work, we have discussed some advanced multivariate techniques, which require a high level of statistical knowledge. Multivariate methods deal with the simultaneous treatment of several variables. They have a useful role in analyzing data from developing country surveys. We first discuss the effective use of such methods: (a) as an exploratory tool to investigate patterns in the data; (b) to


identify natural groupings of the population for further analysis; and (c) to reduce

dimensionality in the number of variables involved.

1. MULTIPLE REGRESSION ANALYSIS:

Regression analysis is an important statistical technique to analyze data in

the social, economic and engineering sciences. Regression models involve unknown parameters, denoted by β (which may be a scalar or a vector of length k), the independent variables x and the dependent variable y. The regression methods used in our research work are the following:

Step-wise regression method: The step-wise regression method is a widely used method of regression analysis. It gives a better idea of the independent contribution of each explanatory variable. Under this technique, the explanatory variables are added to the prediction equation one by one, computing the betas and R² at each step.
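As a small illustration of this procedure, the sketch below implements forward step-wise selection with statsmodels; the data frame and variable names (HCR as response, IRR, LITE, ROAD as candidates) are illustrative placeholders rather than the thesis data.

```python
# Minimal sketch of forward step-wise selection; `df` is assumed to be a pandas
# DataFrame containing the response and candidate regressors (names illustrative).
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(df, response, candidates):
    """Add one regressor at a time, keeping the variable that raises R^2 the most."""
    selected, remaining = [], list(candidates)
    while remaining:
        # R^2 obtained by adding each remaining variable to the current model
        scores = {}
        for var in remaining:
            X = sm.add_constant(df[selected + [var]])
            scores[var] = sm.OLS(df[response], X).fit().rsquared
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        fit = sm.OLS(df[response], sm.add_constant(df[selected])).fit()
        print(f"step {len(selected)}: added {best}, R^2 = {fit.rsquared:.3f}")
        print(fit.params)          # the betas at this step
    return selected

# usage (hypothetical column names): forward_stepwise(df, "HCR", ["IRR", "LITE", "ROAD"])
```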

Pooled Panel data analysis: Pooled Panel data analysis is a method of

studying a particular subject within multiple sites, periodically observed over a

defined time frame. The combination of time series with cross-sections can enhance

the quality and quantity of data in ways that would be impossible using only one of

these two dimensions (Gujarati, 2003). This facilitates the construction and testing of more realistic behavioral models that could not be identified using only a cross-section or a single time-series data set.
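A minimal sketch of such a pooled estimation is given below; it uses synthetic state-year data rather than the World Bank series analysed in the thesis, and the variable names are placeholders.

```python
# Hedged sketch of pooled (stacked) panel estimation: state-year observations are
# pooled and one OLS equation is fitted across both dimensions (synthetic data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
states, years = ["A", "B", "C", "D"], range(1970, 1994)
panel = pd.DataFrame([(s, t) for s in states for t in years],
                     columns=["state", "year"])
panel["IRR"] = rng.uniform(10, 60, len(panel))        # % cropped area irrigated
panel["LITE"] = rng.uniform(20, 70, len(panel))       # % rural literacy
panel["HCR"] = (60 - 0.3 * panel["IRR"] - 0.2 * panel["LITE"]
                + rng.normal(0, 3, len(panel)))       # head-count ratio

# Pooled OLS: the same coefficients for every state and year
pooled = smf.ols("HCR ~ IRR + LITE", data=panel).fit()
print(pooled.params)

# Cluster-robust standard errors by state acknowledge within-state correlation
groups = pd.factorize(panel["state"])[0]
print(pooled.get_robustcov_results(cov_type="cluster", groups=groups).bse)
```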


Robust regression: Robust regression is an important tool for analyzing

data that are contaminated with outliers. It can be used to detect outliers and to

provide resistant (stable) results in the presence of outliers. In order to achieve this

stability, robust regression limits the influence of outliers. In theory, when the assumption of normality is not met, the standard least squares estimates of the regression coefficients will be biased and/or inefficient; see, for example, Hampel et al. (1986). When the assumption of normality is not met in a linear regression problem, several alternatives to standard Least Squares (LS) regression have been proposed (Draper & Smith, 1998; Kutner et al., 2004; Ortiz et al., 2006). Among these, the estimation methods we have applied are robust M-estimation (Huber, 1964), the Least Absolute Deviations (LAD) method (Dielman & Pfaffenberger, 1982; Bloomfield & Steiger, 1983) and nonparametric (rank based) methods (Adichie, 1967; Jureckova, 1971; Jaeckel, 1972). Other alternatives include LTS estimation (Rousseeuw, 1984), S estimation (Rousseeuw and Yohai, 1984) and LMS estimation (Rousseeuw and Leroy, 1987).
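The short sketch below contrasts ordinary least squares with Huber's robust M-estimation on contaminated synthetic data, using statsmodels; it is meant only to illustrate how the influence of a single gross outlier is limited, not to reproduce the analyses of the thesis.

```python
# Illustrative sketch of robust M-estimation (Huber, 1964) next to ordinary least
# squares; x and y are synthetic data with one contaminated observation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=50)
y[10] += 8.0                                    # contaminate one observation

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()                    # pulled towards the outlier
rlm_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()   # Huber M-estimation

print("OLS   coefficients:", ols_fit.params)
print("Huber coefficients:", rlm_fit.params)
print("weight given to the outlier:", rlm_fit.weights[10])  # down-weighted
```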

Regression analysis with change-point: Quite often, in practice, the

regression coefficients are assumed constant. In many real-life problems, however,

theoretical or empirical deliberations suggest models in which one or more parameters change occasionally. The main parameter of interest in such regression analyses

is the shift point parameter, which indexes when or where the unknown change

occurred.


A variety of problems, such as switching straight lines (Ferreira,

1975), shifts of level or slope in linear time-series models (Smith, 1980), detection of

ovulation time in women (Carter and Blight, 1983), and many others, have been

studied during the last two decades. Holbert (1982), while reviewing Bayesian

developments in structural change from 1968 onward, gives a variety of interesting

examples from economics and biology.

Two-phase linear regression model: The two-phase linear regression (TPLR) model is one of the many models which exhibit structural change. Holbert (1982) used a Bayesian approach, based on the TPLR model, to reexamine the McGee and Carleton (1970) data for stock market sales volume and reached the same conclusion, that the abolition of split-ups did hurt the regional exchanges. Let the regression model be Yi = β1 Xi + εi, where the εi are i.i.d. N(0, σ²) random errors with variance σ² > 0, but suppose it is later found that there was a change in the system at some point of time m, reflected in the sequence after Xm by a change in the slope (regression) parameter. The problem of study is when and where this change started occurring; this is called the change-point inference problem.
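A simple least-squares illustration of the change-point idea, on simulated data with a known shift in slope, is sketched below; it profiles the residual sum of squares over candidate shift points and is not the Bayesian procedure developed later in the thesis.

```python
# Minimal sketch: locate the shift point m of a two-phase regression by fitting the
# two segments separately for every candidate m and comparing the residual sums of
# squares. Data are simulated with a true slope change after observation 60.
import numpy as np

rng = np.random.default_rng(1)
n, true_m = 100, 60
x = np.arange(1, n + 1, dtype=float)
y = np.where(x <= true_m, 1.0 * x, 1.8 * x) + rng.normal(scale=5.0, size=n)

def rss_no_intercept(xs, ys):
    """Residual sum of squares of y = beta * x fitted by least squares."""
    beta = (xs @ ys) / (xs @ xs)
    return np.sum((ys - beta * xs) ** 2)

# Profile the change point over m = 2, ..., n-2 so both phases contain data
candidates = list(range(2, n - 1))
rss = [rss_no_intercept(x[:m], y[:m]) + rss_no_intercept(x[m:], y[m:])
       for m in candidates]
m_hat = candidates[int(np.argmin(rss))]
print("estimated shift point:", m_hat)
```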

2. MULTIVARIATE SPATIAL ANALYSIS:

According to Anselin (1988), spatial econometrics addresses issues caused by space in the statistical analysis of regional science regressions. In other words, spatial analysis is the combination of statistical and econometric methods that deal with problems concerning spatial effects. We have focused on two issues that are often overlooked in technical treatments of the methods of spatial statistics and spatial econometrics.

There are two opposite approaches towards dealing with spatially referenced

data (Anselin 1986b; Haining 1986). In one, which we will call the data-driven

approach, information is derived from the data without a prior notion of what

theoretical framework should be. In other words, one lets the "data speak for

themselves" (Gould, 1981) and attempts to derive information on spatial pattern,

spatial structure and spatial interaction without the constraints of a pre-conceived

theoretical notion.

In the data-driven approach, errors are considered to provide information. The

focus of attention is on how the spatial pattern of the errors relates to data generation

processes. For example, in attempts to provide measures of uncertainty for spatial

information in a GIS, the spatial pattern of errors is related to data collection and

manipulation procedures. The spatial pattern of errors can often provide insight into

the form of the underlying substantive spatial process, e.g., as exploited in the model

identification stage of a spatial time series analysis. Important and still unresolved

research questions deal with the formulation of useful spatial error distributions, in

which error is related to location, distance to reference locations, area, etc.


The data-driven approach in spatial analysis is reflected in a wide range of

different techniques, such as point pattern analysis (Getis and Boots, 1978; Diggle,

1983), indices of spatial association (Hubert 1985; Wartenberg, 1985), kriging (Clark,

1979), spatial adaptive filtering (Foster and Gorr, 1986), and spatial time series

analysis (Bennett, 1979). All these techniques have two aspects in common. First,

they compare the observed pattern in the data (e.g., locations in point pattern analysis,

values at locations in spatial autocorrelation) to one in which space is irrelevant. In

point pattern analysis this is the familiar Poisson pattern, or "randomness," while in

many of the indices of spatial association it is the assumption that an observed data

value could occur equally likely at each location (i.e., the basis for the randomization

approach). The second common aspect is that the spatial pattern, spatial structure, or

forms for the spatial dependence are derived from the data only. For example, in

spatial time series analysis, the specification of the autoregressive and moving

average lag lengths is derived from autocorrelation indices or spatial spectra.
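As a small illustration of such an index of spatial association, the sketch below computes Moran's I directly with numpy for a handful of hypothetical areal units and an assumed contiguity weight matrix; neither the units nor the weights come from the thesis data.

```python
# Sketch of one familiar index of spatial association, Moran's I, for a small set
# of hypothetical areal units; W is a row-standardised contiguity matrix supplied
# by the analyst.
import numpy as np

def morans_i(x, W):
    """Moran's I = (n / S0) * (z'Wz) / (z'z), with z the deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    S0 = W.sum()
    n = len(x)
    return (n / S0) * (z @ W @ z) / (z @ z)

# Four regions on a line: 1-2, 2-3, 3-4 are neighbours (illustrative weights only)
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = W / W.sum(axis=1, keepdims=True)          # row-standardise the weights

print(morans_i([10.0, 12.0, 30.0, 33.0], W))  # positive: neighbours are similar
```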

The data-driven approach is attractive in many respects, but its application is

not always straightforward. Indeed, the characteristics of spatial data (dependence and

heterogeneity) often void the attractive properties of standard statistical techniques.

Since most exploratory data analysis (EDA) techniques are based on an assumption of independence, they cannot

be implemented uncritically for spatial data. In this respect, it is also important to note

that dependence in space is qualitatively more complex than dependence in the time

dimension, due to its two-directional nature. As a consequence, many results from the

analysis of time series data will not apply to spatial data. As discussed in detail in


Hooper and Hewings (1981), the extension of time series analysis into the spatial

domain is limited, and only applies to highly regular processes. It goes without saying

that most data in empirical spatial analysis for irregular areal units do not fit within

this restrictive framework.

The second approach, which is called model-driven, starts from a theoretical

specification, which is subsequently confronted with the data. The main conceptual

problem associated with this approach is how to formalize the role of "space." This is

reflected in three major methodological problems, which are still largely unresolved

to date: the choice of the spatial weights matrix (Anselin 1984, 1986a); the modifiable

areal unit problem (Openshaw and Taylor 1979, 1981); and the boundary value

problem (Griffith 1983, 1985). We have chosen the model-driven approach.

The analysis of spatial data has always played a central role in the quantitative

scientific tradition in geography. Recently, there have appeared a considerable

number of publications devoted to presenting research results and to assessing the

state of the art. For example, at an elementary level, Goodchild (1987a), Griffith

(1987a), and Odland (1988) introduce the concept of spatial autocorrelation, and

Boots and Getis (1988) review the analysis of point patterns. At more advanced

levels, Anselin (1988a) and Griffith (1988) deal with a wider range of methodological

issues in spatial econometrics and spatial statistics. Extensive reviews of the current

state of the art for different aspects of spatial data analysis are presented in Anselin

(1988b), Anselin and Griffith (1988), Getis (1988), Griffith (1987b), and Odland,

Golledge and Rogerson (1989). In addition, spatial data analysis has received


considerable attention as an essential element in the development of Geographic

Information Systems (GIS), e.g., in Goodchild (1987b) and Openshaw (1987), and as

an important factor in regional modeling, in Anselin (1989a).

The main problem is that the geometric and graphical representation of the

location of points, lines or areal boundaries (i.e., a map) gives an imperfect

impression of the uncertainty associated with errors in their measurement. Since these

locational features are important elements in the evaluation of distance and relative

position, and in the operations of areal aggregation and interpolation, the associated

measurement error will affect many of the "values" generated in a spatial information

system as well. Although similar errors occur in the time dimension, they are much

simpler to take into account since they only propagate in one direction.

SPATIAL ERRORS:

Basic to both the data-driven and the model-driven analysis of spatial data is an

understanding of the stochastic properties of the data. The use of "space" as the

organizing framework leads to a number of features that merit special attention, since

they are different from what holds for a-spatial or time series data. The most

important concept in this respect is that of error, or, more precisely for data observed

in space, spatial error. The distinguishing characteristics of spatial error have

important implications for description, explanation and prediction in spatial analysis.

In this section, we present a simple taxonomy of the nature of spatial errors and

outline some alternative perspectives on how error can be taken into account.


THE NATURE OF SPATIAL ERRORS:

Spatial errors can be due to measurement error or specification error, or to a

combination of both. Measurement error occurs when the location or the value of a

variable is observed with imperfect accuracy. The former is an old cartographic

problem and is still very relevant in modern geographic information systems (e.g.,

errors due to a lack of precision in digitizing).

Other spatial errors of measurement have to do with the imperfect way in

which data on socio-economic phenomena are recorded and grouped in spatial units

of observation (e.g., various types of administrative units). This interdependence of

location and value in spatial data leads to distinctively spatial characteristics of the

errors. These are the familiar spatial dependence and spatial heterogeneity.

Dependence is mostly due to the existence of spatial spillovers, as a result of a mismatch between the scale of the spatial unit of observation and the phenomenon of

interest (e.g., continuous processes represented as points, or processes extending

beyond the boundaries of administrative regions). Heterogeneity is due to structural

differences between locations and leads to different error distributions (e.g.,

differences in accuracy of census counts between low-income and high-income

neighborhoods).

Specification error is particular to the model-driven approach in spatial data

analysis. It pertains to the use of a wrong model (e.g., recursive vs. simultaneous), an

inappropriate functional form (e.g., linear vs. nonlinear), or a wrong set of variables.

In essence it is no different from misspecification in general, but it generates spatial


patterns of error due to the use of spatial data. These spatial aspects can occur as a

result of ignoring location-specific phenomena, spatial drift, regional effects or spatial

interaction. When a false assumption of homogeneity is forced onto a model in those

instances, spatially heterogeneous errors will result. Similarly, when the spatial scale or extent of a process does not correspond to the scale of observation, or when the nature of a process changes with different scales of observation, spatially dependent errors will

be generated.

The treatment of spatial errors in data analysis is fundamentally different

between the data-driven and the model-driven approaches. In the model-driven

approach to spatial data analysis, the errors are considered to be a nuisance. The main

focus is on how to identify the spatial distribution of the error process and to eliminate the effect of errors on the statistical inference. In other words, once errors are

identified, they are eliminated by means of transformations, corrections or filters.

Alternatively, robust estimation and test procedures can be applied that are no longer

sensitive to the effect of errors. A major research question in this respect is how

diagnostics can be developed that are powerful in detecting various types of errors,

and are able to distinguish between them (e.g., to distinguish between spatial

dependence and spatial heterogeneity, or “real" vs. "apparent" contagion).

3. CLUSTER ANALYSIS:

Clusters are not unique to economics; they also, and even more frequently, appear in statistics, music or the computer sciences, for example. In its literal and most general meaning a


“cluster” is simply defined as a “close group of things” (The Concise Oxford

Dictionary, 1982). Thus the synonymous notions of “density”, “relative nearness”

and “similarity” lie at the very heart of the cluster idea. A priori, “closeness” or

“similarity” is not restricted to any particular dimension (geography, technology, or

social characteristics, etc.) or limited by any specific scale. Consequently, both have

to be chosen exogenously to fit the question under investigation.

In Cluster analysis, we seek to identify the “natural” structure of groups based

on a multivariate profile, if it exists, which both minimizes the within-group variation

and maximizes the between-group variation. It is exploratory, descriptive and non-

inferential. The solutions in this approach are not unique since they are dependent on

the variables used and how cluster membership is being defined. There are no

essential assumptions required for its use except that there must be some regard to the theoretical/conceptual rationale upon which the variables are selected. The homogeneous clusters derived from the cluster analysis can then become a unit for analysis. For example, based on scores on psychological inventories, patients can be clustered into subgroups that have similar response patterns; this may help in targeting appropriate treatment and studying typologies of diseases.

An important step in any clustering is to select a distance measure, which will

determine how the similarity of two elements is calculated. This will influence the

shape of the clusters, as some elements may be close to one another according to one

distance and farther away according to another.
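The sketch below illustrates these steps on a tiny set of hypothetical units: a distance measure is chosen explicitly, a hierarchical (Ward) clustering is built, and cluster memberships are extracted; the data values are invented solely for illustration.

```python
# A small sketch of hierarchical cluster analysis with an explicit distance measure;
# the five "districts" and their two indicator values are purely illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

data = np.array([[0.72, 0.15],    # e.g. [literacy rate, in-migration rate]
                 [0.70, 0.14],
                 [0.45, 0.05],
                 [0.47, 0.06],
                 [0.90, 0.40]])

# Distance measure chosen here: Euclidean; a different metric (e.g. "cityblock")
# can place the same units closer together or farther apart.
d = pdist(data, metric="euclidean")
tree = linkage(d, method="ward")            # minimise within-group variation
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)                               # cluster membership of each unit
```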


4. FACTOR ANALYSIS:

Factor analysis is a statistical method used to describe variability among observed variables in terms of fewer unobserved variables called factors. The observed variables are modeled as linear combinations of the factors, plus error terms. The information gained about the interdependencies can be used later to reduce the set of variables in a dataset. Factor analysis originated in psychometrics and is used in the behavioural sciences, social sciences, marketing, product management, operations research and other applied sciences that deal with large quantities of data. Factor analysis estimates how much of the variability is due to common factors (communality). A high value of communality means that not much of the variable is left over after whatever the factors represent is taken into consideration. Since the factors are linear combinations of the data, the coordinates of each observation or variable on them are measured to obtain what are called factor loadings. Such factor loadings represent the correlation between the particular variable and the factor, and are usually placed in a matrix of correlations between the variables and the factors.

M.C. Garg and Anju Verma carried out an empirical factor analysis of the marketing mix in the life insurance industry in India. They created nine dimensions and converted them into four factors after applying factor analysis through principal component analysis. D. P. Chaudhri also applied factor analysis to child labour in India with Pervasive Gender and Urban Bias in School Education.
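A hedged sketch of the computation, using scikit-learn on random stand-in data and an assumed two-factor structure, is given below; loadings and communalities are read off as described above.

```python
# Exploratory factor analysis sketch; the data matrix (units x socio-economic
# indicators) is random placeholder data and two common factors are assumed.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))               # stand-in for the observed variables

Z = StandardScaler().fit_transform(X)       # work with standardised variables
fa = FactorAnalysis(n_components=2, random_state=0).fit(Z)

loadings = fa.components_.T                 # variables x factors
communality = (loadings ** 2).sum(axis=1)   # variance explained by the common factors
print("loadings:\n", loadings.round(2))
print("communalities:", communality.round(2))
```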


5. PRINCIPAL COMPONENT ANALYSIS:

The principal-components method (or simply the P.C. method), developed by H. Hotelling, seeks to maximize the sum of squared loadings of each factor extracted in turn. Principal component analysis has a special advantage over all other methods of data aggregation in the sense that it redefines a large set of variables in terms of a smaller set of orthogonal variables called principal components, and thus succeeds in reducing the dimensionality problem. Principal components are linear combinations of the variables with weights given by their eigenvectors. These vectors are derived from the correlation matrix of the variables.
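The following minimal sketch carries out exactly this computation with numpy: the eigenvectors of the correlation matrix supply the weights, and the components are the corresponding linear combinations of the standardised variables. The data are random placeholders.

```python
# Principal components from the correlation matrix, as described above.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))                      # observations x variables

R = np.corrcoef(X, rowvar=False)                   # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)               # eigen-decomposition (ascending)
order = np.argsort(eigvals)[::-1]                  # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = (X - X.mean(axis=0)) / X.std(axis=0)           # standardised variables
scores = Z @ eigvecs                               # principal component scores
explained = eigvals / eigvals.sum()
print("proportion of variance explained:", explained.round(3))
```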

6. DISCRIMINANT ANALYSIS:

In the discriminant analysis technique, individuals or objects are classified into one of two or more mutually exclusive and exhaustive groups on the basis of a set of independent variables. Discriminant analysis is considered an appropriate technique when the single dependent variable happens to be non-metric and is to be classified into two or more groups, depending upon its relationship with several independent variables which all happen to be metric.

The discriminant analysis provides a predictive equation, measures the

relative importance of each variable and is also a measure of the ability of the

equation to predict actual class-groups concerning the dependent variable. While

estimating the discriminant function coefficients, we need to choose the appropriate

discriminant analysis method. The direct method involves estimating the discriminant


function so that all the predictors are assessed simultaneously. The step-wise method

enters the predictors sequentially. The two-group method should be used when the

dependent variable has two categories or states.
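A brief sketch of a two-group discriminant analysis on synthetic data is given below; the grouping variable and the regressors are placeholders, not the NSSO unemployment data analysed later in the thesis.

```python
# Two-group discriminant analysis: a non-metric dependent variable (group 0/1) is
# predicted from metric regressors; the data are synthetic placeholders.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 3))                         # metric independent variables
# group membership made to depend on the first two variables
y = (0.8 * X[:, 0] - 0.6 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

lda = LinearDiscriminantAnalysis().fit(X, y)
print("discriminant coefficients:", lda.coef_)       # relative importance of variables
print("classification accuracy  :", lda.score(X, y)) # ability to predict the groups
```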

CHANGE-POINT PROBLEM:

A sequence of random variables X1,…, Xn is said to have a change-point

at r (1 ≤ r ≤ n) if Xi ~ F1(x|θ1) ( i = 1,….,r ) and Xi ~ F2(x|θ2) ( i = r + 1,….,n ), where

F1(x|θ1) ≠ F2(x|θ2). We shall consider the situation in which F1 and F2 have known

functional forms, but the change-point, r, is unknown. Given a sequence of

observations x1,….,xn, the problem we shall be concerned with is that of making

inferences about r. The parameters θ1 and θ2, which could be vector-valued, may be

known or unknown; in the latter case, it might be of interest to make inferences about

these also.

BAYESIAN INFERENCE

In Classical Inference, we consider a parameter θ in a model f(x, θ) as a fixed unknown quantity and make inferences about θ based only on the information contained in the observed sample. In Bayesian Inference, it is believed that we always have available, and make use of, subjective probabilities measuring degrees of belief about the value or values of the unknown parameter θ. These subjective probabilities are used to define what is called the prior distribution for the parameter θ, prior to sampling. In other words, the parameter θ may be treated as a random variable with a known prior distribution, say g(θ). In this case, the probability density (mass) function f(x, θ) is written in the conditional probability notation f(x | θ) to stress the fact that f can only be used to compute probabilities for a given value of θ, i.e. f is conditional on θ. After observing a random sample x = (x1, x2, ........, xn), the joint (unconditional) density for the sample x and the parameter θ is f(x | θ) g(θ). Using the well known Bayes theorem, the conditional density of θ, given the sample x, is

   f(θ | x) = f(x | θ) g(θ) / ∫ f(x | θ) g(θ) dθ,

where ∫ f(x | θ) g(θ) dθ is the marginal density of x. f(θ | x) is called the posterior density of θ. The foundation of Bayesian Inference is the posterior density f(θ | x) of θ; all phases of the inference problem depend on this distribution. For example, the posterior mean, the posterior median or the posterior mode are Bayesian point estimates of θ and may be used depending upon whether the posterior distribution is symmetric or asymmetric.

The ML method, as well as other classical approaches, is based only on the empirical information provided by the data. However, when some technical knowledge on the parameters of the distribution is available, Bayesian inference seems to be an attractive inferential method. The Bayes procedure is based on a posterior density, say g(θ1, θ2, m | x), which is proportional to the product of the likelihood function L(θ1, θ2, m | x) with a prior joint density, say g(θ1, θ2, m), representing the uncertainty about the parameter values:

   g(θ1, θ2, m | x) = L(θ1, θ2, m | x) g(θ1, θ2, m) / [ Σ (m=1 to n) ∫∫ L(θ1, θ2, m | x) g(θ1, θ2, m) dθ1 dθ2 ]
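As a small numerical illustration of this machinery, the sketch below evaluates a posterior on a grid of parameter values for an invented binomial example (a uniform prior and 8 successes in 20 trials) and reads off the posterior mean, median and mode.

```python
# Posterior = likelihood x prior, normalised numerically on a grid of values.
import numpy as np

theta = np.linspace(0.001, 0.999, 999)             # grid over the parameter
prior = np.ones_like(theta)                        # g(theta): uniform prior
likelihood = theta**8 * (1 - theta)**12            # f(x | theta); binomial coefficient
                                                   # omitted, it cancels on normalising
unnorm = likelihood * prior
posterior = unnorm / np.trapz(unnorm, theta)       # f(theta | x)

post_mean = np.trapz(theta * posterior, theta)
cdf = np.cumsum(posterior) * (theta[1] - theta[0])
post_median = theta[np.searchsorted(cdf, 0.5)]
post_mode = theta[np.argmax(posterior)]
print(post_mean, post_median, post_mode)           # candidate Bayes point estimates
```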

In Bayesian inference, the choice of the prior density is the most controversial step. Generally, priors are classified as follows:

(1) Proper and Improper Priors: A Proper prior satisfies the definition of a probability mass function or a probability density function (whichever the case may be); it is a weight function that allocates positive weights, totalling one, to the possible values of the parameter. An Improper prior is any weight function that sums (or integrates) over the possible values of the parameter to a value other than one, say k. Thus, if f(x) is uniform over the entire real line,

   f(x) = k,   −∞ < x < ∞,   k > 0,

then it is not a proper density, since the integral ∫ f(x) dx = ∫ k dx does not exist, no matter how small k is. These kinds of density functions are called Improper Prior distributions. An Improper prior can induce a proper prior by normalizing the function if k is finite; otherwise it remains improper and simply plays the role of a weighting function, or some ad hoc method is used to determine a Proper prior. These types of density functions are used to represent the local behavior of the prior distribution in the region where the likelihood is appreciable, but not over its entire admissible range.

(2) Non-Informative Priors: Because of the compelling reasons to perform a conditional analysis, and the attractiveness of using the Bayesian machinery to do so, there have been attempts to use the Bayesian approach even when no (or minimal) prior information is available. Non-informative priors refer to the case when relatively little (or limited) information is available a priori. They are often called priors of ignorance but, rather than expressing a state of complete ignorance, they express that the prior information about the parameter is not considered substantial relative to the information expected to be provided by the sample of empirical data. A non-informative prior also results when the experimenter is willing to accept that all parameter values are equally likely choices for the parameter. Here the experimenter is indifferent to the choice of parameter values, and this state of indifference may be expressed by taking the prior to be locally uniform. In general, such an approximate prior is taken proportional to the square root of the Fisher information; this is known as Jeffreys' rule and the corresponding prior as the Jeffreys prior.

(3) Conjugate Priors: Conjugate prior distributions are essentially families of distributions that ease the computational burden when used as prior distributions. A conjugate family is a set of distributions each of which can be combined with the given likelihood without too much difficulty. Three properties that have been put forth as desirable for conjugate families of distributions are mathematical tractability, richness and ease of interpretation. By richness we mean that the prior distribution should be able to reflect the statistician's prior information. By ease of interpretation we mean that it should be possible to specify the conjugate family in such a way that it is readily interpreted by the person whose prior information is of interest.

SYMMETRIC LOSS FUNCTIONS

The Bayes estimator of a generic parameter α (or a function thereof) under the "continuous" squared error loss (SEL) function

   L1(α, d) = (α − d)²,   −∞ < α, d < ∞,   (1.1)

is just the posterior mean when α is a real-valued parameter and d is a decision rule to estimate α.

As a consequence, the SEL function relative to an integer parameter m is

   L1(m, v) = (m − v)²,   m, v = 0, 1, 2, .......   (1.2)

Hence, the Bayesian estimate of an integer-valued parameter under the SEL function (1.2) is no longer the posterior mean and can be obtained by numerically minimizing the corresponding posterior loss. Generally, such a Bayesian estimate is equal to the nearest integer value to the posterior mean.

Other Bayes estimators of α are based on the loss functions

   L2(α, d) = |α − d|,

   L3(α, d) = 0 if |α − d| ≤ ε (ε > 0), and 1 otherwise,

whose Bayes estimates are the posterior median and the posterior mode, respectively.

ASYMMETRIC LOSS FUNCTIONS

The loss function L(α, d) provides a measure of the financial consequences arising from a wrong decision rule d to estimate an unknown quantity α (a generic parameter or a function thereof). The choice of the appropriate loss function depends on financial considerations only, and is independent of the estimation procedure used. The Bayes approach allows the economic considerations formulated through loss functions to be used in a rational manner. In fact, combining the selected loss function with the posterior density, the posterior expectation of the loss function is obtained, and the value of d that minimizes the expected loss is defined as the Bayes estimate. This point estimate is, of course, optimal relative to the loss function chosen. The use of symmetric loss functions has been found to be generally inappropriate since, for example, an overestimation of the reliability function is usually much more serious than an underestimation. Considerable attention has therefore recently been devoted to asymmetric loss functions.

A useful asymmetric loss, known as the LINEX loss function, was introduced by Varian (1975). This function rises approximately exponentially on one side of zero and approximately linearly on the other side. Under the assumption that the minimal loss occurs at d = α, the LINEX loss function can be expressed as

   L4(α, d) = exp[q1(d − α)] − q1(d − α) − 1,   q1 ≠ 0.   (1.3)

The sign of the shape parameter q1 reflects the direction of the asymmetry (q1 > 0 if overestimation is more serious than underestimation, and vice versa), and the magnitude of q1 reflects the degree of asymmetry.

The posterior expectation of the LINEX loss function (1.3) is

   Ei[L4(α, d)] = exp(q1 d) Ei[exp(−q1 α)] − q1(d − Ei[α]) − 1,   i = 1, 2.

The Bayes estimate α*L is the value of d that minimizes Ei[L4(α, d)], viz. the solution of the equation

   ∂Ei[L4(α, d)]/∂d = q1 exp(q1 d) Ei[exp(−q1 α)] − q1 = 0.

This gives exp(−q1 α*L) = Ei[exp(−q1 α)] and hence

   α*L = −(1/q1) ln { Ei[exp(−q1 α)] },   i = 1, 2,   (1.4)

provided that Ei[exp(−q1 α)] exists and is finite.
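The sketch below evaluates the LINEX estimate (1.4) numerically from a grid posterior; the posterior used is an arbitrary stand-in, chosen only to show how the sign of q1 moves the estimate relative to the posterior mean.

```python
# LINEX Bayes estimate (1.4): alpha_L = -(1/q1) * ln E[exp(-q1 * alpha)],
# computed from a stand-in grid posterior (not a posterior from the thesis).
import numpy as np

alpha = np.linspace(0.01, 10.0, 2000)              # parameter grid
posterior = alpha**2 * np.exp(-alpha)              # un-normalised stand-in posterior
posterior /= np.trapz(posterior, alpha)

def linex_estimate(q1):
    """Bayes estimate under LINEX loss with shape parameter q1 (q1 != 0)."""
    expectation = np.trapz(np.exp(-q1 * alpha) * posterior, alpha)
    return -np.log(expectation) / q1

print("posterior mean       :", np.trapz(alpha * posterior, alpha))
print("LINEX estimate q1=+1 :", linex_estimate(1.0))   # penalises over-estimation
print("LINEX estimate q1=-1 :", linex_estimate(-1.0))  # penalises under-estimation
```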

Despite the flexibility and popularity of the LINEX loss function for location parameter estimation, it appears not to be suitable for the estimation of the scale parameter and other quantities (Basu and Ebrahimi, 1991). For these reasons Basu and Ebrahimi (1991) defined a Modified LINEX loss function

   L5(α, d) = exp[q2(d/α − 1)] − q2(d/α − 1) − 1,   (1.5)

where the estimation error is expressed by (d/α − 1). Such modification does not change the characteristics of Varian's LINEX loss.

The posterior expectation of the Modified LINEX loss function L5(α, d) is

   Ei[L5(α, d)] = exp(−q2) Ei[exp(q2 d/α)] − q2 d Ei[1/α] + q2 − 1,   i = 1, 2,

and the value of d that minimizes Ei[L5(α, d)], say α*m, is the solution of the equation

   ∂Ei[L5(α, d)]/∂d = q2 exp(−q2) Ei[(1/α) exp(q2 d/α)] − q2 Ei[1/α] = 0,

provided that all the expectations are finite. Thus, α*m cannot be given in closed form and its evaluation involves iterative procedures.

A suitable alternative to the Modified LINEX loss is the General Entropy (GE) loss function proposed by Calabria and Pulcini (1994), given by

   L6(α, d) = (d/α)^q3 − q3 ln(d/α) − 1,   (1.6)

whose minimum occurs at d = α. This loss is a generalization of the entropy loss function used by several authors (see, for example, Dey et al. (1987) and Dey and Liu (1992)), in which the shape parameter q3 is equal to 1. The more general version (1.6) allows different shapes of the loss function to be considered. When q3 > 0, a positive error (d > α) causes more serious consequences than a negative error. The posterior expectation of the General Entropy loss function L6(α, d) is

   Ei[L6(α, d)] = d^q3 Ei[α^(−q3)] − q3 ( ln d − Ei[ln α] ) − 1,   i = 1, 2,

and the value of d that minimizes Ei[L6(α, d)], say α*E, is the solution of the equation

   ∂Ei[L6(α, d)]/∂d = q3 d^(q3 − 1) Ei[α^(−q3)] − q3/d = 0,

which gives d^q3 Ei[α^(−q3)] = 1 and hence

   α*E = { Ei[α^(−q3)] }^(−1/q3),   (1.7)

provided that Ei[α^(−q3)] exists and is finite.

General Entropy loss (GEL) function for an integer parameter:

   L6(m, v) = (v/m)^q3 − q3 ln(v/m) − 1,   m, v = 0, 1, 2, .....   (1.8)

Minimizing the posterior expectation Em[L6(m, v)] and using the posterior distribution, we obtain the estimate of m by means of the nearest integer value, say m*GE.
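A small sketch of the corresponding computation for an integer change point m is given below; the posterior probabilities are invented, and the continuous-form solution (1.7) is rounded to the nearest integer as described above.

```python
# General Entropy (GE) estimate of an integer change point m from a discrete
# posterior p(m | x); the posterior probabilities are invented for illustration.
import numpy as np

m_values = np.arange(1, 11)                          # possible change points
post = np.array([.01, .02, .05, .10, .30, .25, .15, .07, .03, .02])
post = post / post.sum()                             # p(m | x)

def ge_estimate(q3):
    """Continuous-form GE estimate {E[m^(-q3)]}^(-1/q3), rounded to nearest integer."""
    expectation = np.sum(m_values**(-q3) * post)
    return int(np.rint(expectation ** (-1.0 / q3)))

print("posterior mean (SEL, rounded):", int(np.rint(np.sum(m_values * post))))
print("GE estimate, q3 =  1         :", ge_estimate(1.0))
print("GE estimate, q3 = -1         :", ge_estimate(-1.0))
```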

The chapters in the thesis are organized in the following way: Chapter: 1

gives a detailed introduction of the study and its genesis.

Chapter: 2 deals with the application of a multivariate technique, different multiple regression models, to the rural poverty parameter for selected states of India. The official data source for the study is the World Bank data. The different regression models we


have discussed here are as under. Sections 2.1 and 2.2 give a detailed introduction and the data application for this chapter, respectively.

2.3 Step-wise regression model

2.4 Pooled panel regression model

2.5 Robust regression model

Some contents of section 2.4 have been published in the International Journal of Agricultural and Statistical Sciences (IJASS) (Vol. 6, No. 2, Dec 2010).

Chapter: 3

Here we have studied the application of multivariate spatial analysis techniques to the rural population below the poverty line in relation to socio-economic parameters such as rural literacy, irrigation, wages etc. The official statistics used here are also the World Bank data across selected states in India. We have developed two spatial error models; our focus was to apply models which minimize the spatial error. Spatial econometrics originated as an identifiable field in Europe in the early 1970s with Jean H.P. Paelinck, because regional econometric models needed to deal with sub-country data, and it developed and grew rapidly during the 1990s (Anselin, 1999).

Chapter: 4 discusses the application of a first order autoregressive process with exponential errors and one change point to the total population of India from the official CENSUS statistics. Bayesian inferences, particularly estimation of the change point and the autoregressive coefficient, are derived under two prior considerations and using the LINEX and General Entropy loss functions. In applications, we are frequently faced with time series data which, for a variety of different reasons, have characteristics not compatible with the usual assumptions of linearity and/or Gaussian errors; exponential autoregressive models, such as the one studied by Bell and Smith (1986), are, for instance, very powerful in the analysis of such time series. Bell and Smith (1986) were concerned with the autoregressive scheme Xi = ρXi−1 + εi, where 0 < ρ < 1 and the εi are i.i.d. and non-negative.

Chapter: 5

In this chapter some advanced and popular multivariate techniques have been

studied. They are as under:

5.1 Application of the multivariate cluster analysis technique to the Gujarat in-migration problem, with respect to the socio-economic reasons for migration, from the Gujarat Census.

5.2 Application of multivariate factor analysis to a burning economic problem for India and the world, child labour. We have studied child labour in rural India and the socio-economic reasons for the situation from the Census of India across selected states in India.

5.3 We have studied the principal component estimation method through an application to the official statistics of the Gujarat Census, to study religious as well as gender effects on the different working populations in Gujarat.

5.4 This section shows the application of multivariate discriminant analysis to the official data from the National Sample Survey Organisation (NSSO) on unemployment in rural labour households, with the affecting socio-economic factors, across selected states in India.

The results of section 5.1 have been published in the International Journal of Agricultural and Statistical Sciences (IJASS) (Vol. 6, No. 1, pp. 281-291, 2010).

Chapter: 6

The application of the two-phase linear regression model is made to rural poverty and irrigation data from the official statistics of the World Bank. The TPLR model is defined as

   yt = α1 + β1 xt + ut,   t = 1, 2, …, m,
   yt = α2 + β2 xt + ut,   t = m+1, …, n;  n ≥ 3,

where the ut are i.i.d. N(0, r) random errors with precision r > 0, xt is a nonstochastic explanatory variable, and the regression parameters satisfy (α1, β1) ≠ (α2, β2). The shift point m is such that if m = n there is no shift, while for m = 1, 2, …, n − 1 exactly one shift has occurred. The estimator of m is derived under symmetric and asymmetric loss functions, and a simulated numerical study is carried out. Bansal and Chakravarty (1996) had proposed to study the effect of an ESD prior for the changed slope parameter of the two-phase linear regression (TPLR) model on the Bayes estimates of the shift point in the simple regression model.


Chapter: 7

This chapter presents the estimation of the change point in the proposed first order autoregressive model.

We have proposed a first order autoregressive model with a single change point:

   Xt = β1 Xt−1 + εt,   t = 1, 2, …, m,
   Xt = β2 Xt−1 + εt,   t = m+1, …, n,

where β1 and β2 are the unknown autocorrelation coefficients, Xt is the t-th observation of the dependent variable, the error terms εt are independent random variables following a N(0, σ1²) distribution for t = 1, 2, …, m and a N(0, σ2²) distribution for t = m+1, …, n, m is the unknown change point and X0 is the initial quantity. Bayesian inferences, particularly estimation of the change point and the autoregressive coefficients, are derived under two prior considerations and using the LINEX and General Entropy loss functions. A simulated numerical study and the sensitivity of the estimators are also shown in the chapter. This model is also applied to the job-seekers registered with employment exchanges in India, from the official statistics of the Directorate General of Employment and Training, Ministry of Labour.

At the end we provide an extensive bibliography. Some of the outputs of this thesis have been communicated for publication.