Assignment on Statistics for Management

STATISTICS FOR MANAGEMENT

BY RAHUL GUPTA

Question 1: What do you mean by a sample survey? What are the different sampling methods? Briefly describe them.

Answer

Introduction:

In statistics, survey sampling describes the process of selecting a sample of elements from a target population in order to conduct a survey.

A survey may refer to many different types or techniques of observation, but in the context of survey sampling it most often refers to a questionnaire used to measure the characteristics and/or attitudes of people. The purpose of sampling is to reduce the cost and/or the amount of work that it would take to survey the entire target population. A survey that measures the entire target population is called a census.

RAHUL GUPTA, MBAHCS (1ST SEM), SUBJECT CODE-MB0024, SET-2 Page 1


Probability Sampling:

In a probability sample (also called "scientific" or "random" sample) each member of the target population has a known and non-zero probability of inclusion in the sample. A survey based on a probability sample can in theory produce statistical measurements of the target population that are:

Unbiased: the expected value of the sample mean is equal to the population mean, E(ȳ) = μ; and

Subject to a measurable sampling error, which can be expressed as a confidence interval or margin of error.

A probability-based survey sample is created by constructing a list of the target population, called the sample frame; a randomized process for selecting units from the sample frame, called a selection procedure; and a method of contacting selected units and enabling them to complete the survey, called a data collection method or mode. For some target populations this process may be easy, for example, sampling the employees of a company by using a payroll list. However, in large, disorganized populations simply constructing a suitable sample frame is often a complex and expensive task. Common methods of conducting a probability sample of the household population in the United States are Area Probability Sampling, Random Digit Dial telephone sampling, and more recently Address Based Sampling. Within probability sampling there are specialized techniques such as stratified sampling and cluster sampling that improve the precision or efficiency of the sampling process without altering the fundamental principles of probability sampling.

Bias in Probability Sampling:

Bias in surveys is undesirable, but often unavoidable. The major types of bias that may occur in the sampling process are:

Non-response bias: When individuals or households selected in the survey sample cannot or will not complete the survey, there is the potential for bias to result from this non-response. Non-response bias occurs when the observed value deviates from the population parameter due to differences between respondents and non-respondents.

Coverage bias: Coverage bias can occur when population members do not appear in the sample frame (undercoverage). Coverage bias occurs when the observed value deviates from the population parameter due to differences between covered and non-covered units. Telephone surveys suffer from a well-known source of coverage bias because they cannot include households without telephones.

Selection bias: Selection bias occurs when some units have a differing probability of selection that is unaccounted for by the researcher. For example, some households have multiple phone numbers, making them more likely to be selected in a telephone survey than households with only one phone number.

Non-Probability Sampling:

Many surveys are not based on a probability sample, but rather on finding a suitable collection of respondents to complete the survey. Some common examples of non-probability sampling are:

Judgment Samples: A researcher decides which population members to include in the sample based on his or her judgment. The researcher may provide some alternative justification for the representativeness of the sample.

Snowball Samples: Often used when a target population is rare, members of the target population recruit other members of the population for the survey.

Quota Samples: The sample is designed to include a designated number of people with certain specified characteristics. For example, 100 coffee drinkers. This type of sampling is common in non-probability market research surveys.

Convenience Samples: The sample is composed of whatever persons can be most easily accessed to fill out the survey.

In non-probability samples the relationship between the target population and the survey sample is immeasurable and potential bias is unknowable. Sophisticated users of non-probability survey samples tend to view the survey as an experimental condition, rather than a tool for population measurement, and examine the results for internally consistent relationships.

Sampling Methods:

Random sampling is the purest form of probability sampling. Each member of the population has an equal and known chance of being selected. When there are very large populations, it is often difficult or impossible to identify every member of the population, so the pool of available subjects becomes biased.

Systematic sampling is often used instead of random sampling. It is also called an Nth name selection technique. After the required sample size has been calculated, every Nth record is selected from a list of population members. As long as the list does not contain any hidden order, this sampling method is as good as the random sampling method. Its only advantage over the random sampling technique is simplicity.
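As an illustration of the two methods just described, here is a minimal Python sketch. The function names, the seed argument and the 100-name payroll list are assumptions made for the example, not part of the assignment.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n members without replacement; every member is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Pick a random start, then take every k-th record from the list."""
    k = len(population) // n      # sampling interval (the "N" in "every Nth record")
    rng = random.Random(seed)
    start = rng.randrange(k)      # random start inside the first interval
    return [population[start + i * k] for i in range(n)]

# Hypothetical payroll list of 100 employees
employees = [f"emp{i:03d}" for i in range(1, 101)]

srs = simple_random_sample(employees, 10, seed=1)
sys_sample = systematic_sample(employees, 10, seed=1)
print(srs)
print(sys_sample)
```

The systematic version needs only one random number, which is where its simplicity comes from. It is only as good as the list is free of hidden order.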

Stratified sampling is a commonly used probability method that can be superior to simple random sampling because it reduces sampling error. A stratum is a subset of the population that shares at least one common characteristic. Examples of strata might be males and females, or managers and non-managers. The researcher first identifies the relevant strata and their actual representation in the population. Random sampling is then used to select a sufficient number of subjects from each stratum. "Sufficient" refers to a sample size large enough for us to be reasonably confident that the sample from each stratum represents that stratum of the population.
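The stratified procedure can be sketched in Python as follows. This is a rough illustration under assumed inputs: the staff list, the 10% sampling fraction and the function name are hypothetical, and proportional allocation is only one of several ways to size each stratum.

```python
import random

def stratified_sample(population, stratum_of, fraction, seed=None):
    """Group members by stratum, then draw a simple random sample in each."""
    rng = random.Random(seed)
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    sample = {}
    for name, members in strata.items():
        # Proportional allocation, with at least one member per stratum
        size = max(1, round(len(members) * fraction))
        sample[name] = rng.sample(members, size)
    return sample

# Hypothetical population: 20 managers and 80 non-managers
staff = [("manager", i) for i in range(20)] + [("non-manager", i) for i in range(80)]

picked = stratified_sample(staff, stratum_of=lambda m: m[0], fraction=0.1, seed=7)
for stratum, members in picked.items():
    print(stratum, len(members))
```

With these inputs each stratum appears in the sample in the same 20:80 proportion as in the population, which a single simple random sample of 10 would not guarantee.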

Convenience sampling is used in exploratory research where the researcher is interested in getting an inexpensive approximation of the truth. As the name implies, the sample members are selected because they are convenient to reach. This non-probability method is often used during preliminary research efforts to get a rough estimate of the results, without incurring the cost or time required to select a random sample.

Judgment sampling is a common non-probability method. The researcher selects the sample based on judgment. This is usually an extension of convenience sampling. For example, a researcher may decide to draw the entire sample from one "representative" city, even though the population includes all cities. When using this method, the researcher must be confident that the chosen sample is truly representative of the entire population.

Quota sampling is the non-probability equivalent of stratified sampling. As in stratified sampling, the researcher first identifies the strata and their proportions as they are represented in the population. Then convenience or judgment sampling is used to select the required number of subjects from each stratum. This differs from stratified sampling, where the strata are filled by random sampling.

Snowball sampling is a special non-probability method used when the desired sample characteristic is rare. It may be extremely difficult or cost-prohibitive to locate respondents in these situations. Snowball sampling relies on referrals from initial subjects to generate additional subjects.

Question 2: What is the difference between correlation and regression? What do you understand by Rank Correlation? When do we use rank correlation and when do we use the Pearsonian Correlation Coefficient? Fit a linear regression line to the following data:

X    12   15   18   20   27   34   28   48
Y   123  150  158  170  180  184  176  130

Answer

Correlation:



Figure: Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0 but in that case the correlation coefficient is undefined because the variance of Y is zero.

In statistics, correlation (often measured as a correlation coefficient, ρ) indicates the strength and direction of a relationship between two random variables. The commonest use refers to a linear relationship, but the concept of nonlinear correlation is also used. In general statistical usage, correlation or co-relation refers to the departure of two random variables from independence. In this broad sense there are several coefficients, measuring the degree of correlation, adapted to the nature of the data.

Pearson's product-moment coefficient:

A number of different coefficients are used for different situations. The best known is the Pearson product-moment correlation coefficient, which is obtained by dividing the covariance of the two variables by the product of their standard deviations. Karl Pearson developed the coefficient from a similar but slightly different idea by Francis Galton.

Regression analysis:

In statistics, regression analysis includes any techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps us understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables, that is, the average value of the dependent variable when the independent variables are held fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.

Regression analysis is widely used for prediction (including forecasting of time-series data). Use of regression analysis for prediction has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables.

Mathematical properties:

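The correlation coefficient $\rho_{X,Y}$ between two random variables X and Y with expected values $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$ is defined as:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

where E is the expected value operator and cov means covariance. A widely used alternative notation is

$$\operatorname{corr}(X,Y) = \rho_{X,Y}$$

Since $\mu_X = E(X)$, $\sigma_X^2 = E[(X - E(X))^2] = E(X^2) - E^2(X)$ and likewise for Y, and since

$$E[(X - E(X))(Y - E(Y))] = E(XY) - E(X)E(Y)$$

we may also write

$$\rho_{X,Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)}\,\sqrt{E(Y^2) - E^2(Y)}}$$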

The correlation is defined only if both of the standard deviations are finite and both of them are nonzero. It is a corollary of the Cauchy–Schwarz inequality that the correlation cannot exceed 1 in absolute value.

The correlation is 1 in the case of an increasing linear relationship, −1 in the case of a decreasing linear relationship, and some value in between in all other cases, indicating the degree of linear dependence between the variables. The closer the coefficient is to either −1 or 1, the stronger the correlation between the variables.



If the variables are independent then the correlation is 0, but the converse is not true because the correlation coefficient detects only linear dependencies between two variables. Here is an example: Suppose the random variable X is uniformly distributed on the interval from −1 to 1, and Y = X^2. Then Y is completely determined by X, so that X and Y are dependent, but their correlation is zero; they are uncorrelated. However, in the special case when X and Y are jointly normal, uncorrelatedness is equivalent to independence.
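The Y = X^2 example can be checked numerically. The sketch below is only an approximation of the argument: it replaces the uniform distribution on [−1, 1] with an evenly spaced symmetric grid (an assumption made for illustration) and computes the covariance, which is the numerator of the correlation coefficient.

```python
# Evenly spaced, symmetric grid standing in for X ~ Uniform(-1, 1)
xs = [i / 10 for i in range(-10, 11)]
ys = [x * x for x in xs]          # Y is fully determined by X

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Covariance: average product of deviations from the means
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

print(f"cov(X, Y) = {cov_xy:.12f}")
```

The positive and negative halves of the grid cancel, so the covariance (and hence the correlation) is zero up to floating-point rounding, even though Y depends entirely on X.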

A correlation between two variables is diluted in the presence of measurement error around estimates of one or both variables, in which case disattenuation provides a more accurate coefficient.

Sample correlation:

If we have a series of n measurements of X and Y written as $x_i$ and $y_i$ where i = 1, 2, ..., n, then the Pearson product-moment correlation coefficient can be used to estimate the correlation of X and Y. The Pearson coefficient is also known as the "sample correlation coefficient". It is written:

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$$

where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y, $s_x$ and $s_y$ are the sample standard deviations of X and Y and the sum is from i = 1 to n. As with the population correlation, we may rewrite this as

$$r_{xy} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$$

Again, as is true with the population correlation, the absolute value of the sample correlation must be less than or equal to 1. The above formula conveniently suggests a single-pass algorithm for calculating sample correlations, but, depending on the numbers involved, it can sometimes be numerically unstable.
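To make the single-pass remark concrete, the following Python sketch compares a two-pass computation (means first, then deviations) with a single-pass computation based on running sums. The function names are illustrative, and the data are the X and Y values from Question 2.

```python
from math import sqrt

def pearson_two_pass(xs, ys):
    """Compute the means first, then sum products of deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def pearson_single_pass(xs, ys):
    """Accumulate five running sums in one loop over the data."""
    n = sx = sy = sxx = syy = sxy = 0
    for x, y in zip(xs, ys):
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx * sx) * sqrt(n * syy - sy * sy))

xs = [12, 15, 18, 20, 27, 34, 28, 48]
ys = [123, 150, 158, 170, 180, 184, 176, 130]

r_two = pearson_two_pass(xs, ys)
r_one = pearson_single_pass(xs, ys)
print(round(r_two, 4), round(r_one, 4))
```

For small integer data like this the two agree. The instability arises when the values are large relative to their spread, because the single-pass form subtracts two nearly equal large numbers.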

The square of the sample correlation coefficient, which is also known as the coefficient of determination, is the fraction of the variance in $y_i$ that is accounted for by a linear fit of $y_i$ on $x_i$. This is written

$$r_{xy}^2 = 1 - \frac{s_{y|x}^2}{s_y^2}$$

where $s_{y|x}^2$ is the square of the error of a linear regression of $y_i$ on $x_i$ by the equation y = a + bx:

$$s_{y|x}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - a - b x_i)^2$$

and $s_y^2$ is just the variance of y:

$$s_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$$

Note that since the sample correlation coefficient is symmetric in $x_i$ and $y_i$, we will get the same value when the roles are swapped and $x_i$ is regressed on $y_i$:

$$r_{xy}^2 = 1 - \frac{s_{x|y}^2}{s_x^2}$$

This equation also gives an intuitive idea of the correlation coefficient for higher dimensions. Just as the above described sample correlation coefficient is the fraction of variance accounted for by the fit of a 1-dimensional linear submanifold to a set of 2-dimensional vectors $(x_i, y_i)$, so we can define a correlation coefficient for a fit of an m-dimensional linear submanifold to a set of n-dimensional vectors. For example, if we fit a plane z = a + bx + cy to a set of data $(x_i, y_i, z_i)$ then the correlation coefficient of z to x and y is

$$r^2 = 1 - \frac{s_{z|xy}^2}{s_z^2}$$

The distribution of the correlation coefficient has been examined by R. A. Fisher and A. K. Gayen.

Geometric interpretation:

For centered data (i.e., data which have been shifted by the sample mean so as to have an average of zero), the correlation coefficient can also be viewed as the cosine of the angle between the two vectors of samples drawn from the two random variables.



Some practitioners prefer an uncentered (non-Pearson-compliant) correlation coefficient. See the example below for a comparison.

As an example, suppose five countries are found to have gross national products of 1, 2, 3, 5, and 8 billion dollars, respectively. Suppose these same five countries (in the same order) are found to have 11%, 12%, 13%, 15%, and 18% poverty. Then let x and y be ordered 5-element vectors containing the above data: x = (1, 2, 3, 5, 8) and y = (0.11, 0.12, 0.13, 0.15, 0.18).

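By the usual procedure for finding the angle between two vectors (see dot product), the uncentered correlation coefficient is:

$$\cos\theta = \frac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert} = \frac{2.93}{\sqrt{103}\,\sqrt{0.0983}} \approx 0.9208$$

Note that the above data were deliberately chosen to be perfectly correlated: y = 0.10 + 0.01 x. The Pearson correlation coefficient must therefore be exactly one. Centering the data (shifting x by E(x) = 3.8 and y by E(y) = 0.138) yields x = (−2.8, −1.8, −0.8, 1.2, 4.2) and y = (−0.028, −0.018, −0.008, 0.012, 0.042), from which

$$\cos\theta = \frac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert} = \frac{0.308}{\sqrt{30.8}\,\sqrt{0.00308}} = 1$$

as expected.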
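The comparison between the uncentered and centered (Pearson) coefficients for the five-country example can be reproduced with a short Python sketch. The helper names cosine and centered are chosen for illustration only.

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def centered(v):
    """Shift a vector so that its mean is zero."""
    mean = sum(v) / len(v)
    return [a - mean for a in v]

x = [1, 2, 3, 5, 8]                    # gross national product, billions of dollars
y = [0.11, 0.12, 0.13, 0.15, 0.18]     # poverty rate

uncentered = cosine(x, y)
pearson = cosine(centered(x), centered(y))

print(f"uncentered: {uncentered:.4f}")
print(f"centered:   {pearson:.4f}")
```

Only the centered version equals the Pearson coefficient. The uncentered value depends on where the origin of each variable lies.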

Motivation for the form of the coefficient of correlation:

Another motivation for correlation comes from inspecting the method of simple linear regression. As above, X is the vector of independent variables, $x_i$, and Y of the dependent variables, $y_i$, and a simple linear relationship between X and Y is sought, through a least-squares method on the estimate of Y:

$$Y = \alpha + \beta X + \varepsilon$$

Then, the equation of the least-squares line can be derived to be of the form:

$$(Y - \bar{Y}) = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}\,(X - \bar{X})$$

which can be rearranged in the form

$$(Y - \bar{Y}) = \frac{r\, s_y}{s_x}\,(X - \bar{X})$$

where r is the sample correlation coefficient:

$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$$

Rank correlation coefficients:

Rank correlation coefficients, such as Spearman's rank correlation coefficient and Kendall's rank correlation coefficient (τ), measure the extent to which, as one variable increases, the other variable tends to increase, without requiring that increase to be represented by a linear relationship. If, as the one variable increases, the other decreases, the rank correlation coefficients will be negative. It is common to regard these rank correlation coefficients as alternatives to Pearson's coefficient, used either to reduce the amount of calculation or to make the coefficient less sensitive to non-normality in distributions. However, this view has little mathematical basis, as rank correlation coefficients measure a different type of relationship than the product-moment correlation coefficient, and are best seen as measures of a different type of association, rather than as alternative measures of the population correlation coefficient. To illustrate the nature of rank correlation, and its difference from linear correlation, consider the following four pairs of numbers (x, y): (0, 1), (100, 10), (101, 500), (102, 2000).

As we go from each pair to the next pair x increases, and so does y. This relationship is perfect, in the sense that an increase in x is always accompanied by an increase in y. This means that we have a perfect rank correlation, and both Spearman's and Kendall's correlation coefficients are 1, whereas in this example Pearson's product-moment correlation coefficient is 0.456, indicating that the points are far from lying on a straight line. In the same way, if y always decreases when x increases, the rank correlation coefficients will be −1, while the product-moment correlation coefficient may or may not be close to −1, depending on how close the points are to a straight line. Although in the extreme cases of perfect rank correlation the two rank coefficients are equal (both +1 or both −1), this is not so in general, and values of the two coefficients cannot meaningfully be compared. For example, for the three pairs (1, 1), (2, 3), (3, 2) Spearman's coefficient is 1/2, while Kendall's coefficient is 1/3.
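The figures quoted for the four-pair and three-pair examples can be reproduced with the Python sketch below. It is a simplified illustration that assumes there are no tied values. Spearman's coefficient is computed as Pearson's coefficient applied to the ranks, and Kendall's as the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest) upward; assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Pearson's coefficient applied to the ranks of the data."""
    return pearson(ranks(xs), ranks(ys))

def kendall(xs, ys):
    """(concordant - discordant) / number of pairs; assumes no ties."""
    pairs = list(combinations(zip(xs, ys), 2))
    score = 0
    for (x1, y1), (x2, y2) in pairs:
        s = (x2 - x1) * (y2 - y1)
        score += (s > 0) - (s < 0)   # +1 concordant, -1 discordant
    return score / len(pairs)

x4, y4 = [0, 100, 101, 102], [1, 10, 500, 2000]
x3, y3 = [1, 2, 3], [1, 3, 2]

print(spearman(x4, y4), kendall(x4, y4), round(pearson(x4, y4), 3))
print(spearman(x3, y3), round(kendall(x3, y3), 3))
```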



Correlation and linearity

Figure: Four sets of data (Anscombe's quartet) with the same correlation of 0.816.

The Pearson correlation coefficient indicates the strength of a linear relationship between two variables, but its value generally does not completely characterize their relationship. In particular, if the conditional mean of Y given X, denoted E (Y|X), is not linear in X, the correlation coefficient will not fully determine the form of E (Y|X).

The figure shows scatter plots of Anscombe's quartet, a set of four different pairs of variables created by Francis Anscombe. The four y variables have the same mean (7.5), variance (4.12), correlation (0.816) and regression line (y = 3 + 0.5x). However, as can be seen on the plots, the distribution of the variables is very different. The first one (top left) seems to be distributed normally, and corresponds to what one would expect when considering two variables correlated and following the assumption of normality. The second one (top right) is not distributed normally; while an obvious relationship between the two variables can be observed, it is not linear, and the Pearson correlation coefficient is not relevant. In the third case (bottom left), the linear relationship is perfect, except for one outlier which exerts enough influence to lower the correlation coefficient from 1 to 0.816. Finally, the fourth example (bottom right) shows another case where one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear.

If a pair (X, Y) of random variables follows a bivariate normal distribution, the conditional mean E(X|Y) is a linear function of Y, and the conditional mean E(Y|X) is a linear function of X. The correlation coefficient r between X and Y, along with the marginal means and variances of X and Y, determines this linear relationship:

$$E(Y \mid X) = EY + r\,\sigma_y\,\frac{X - EX}{\sigma_x}$$

where EX and EY are the expected values of X and Y, respectively, and $\sigma_x$ and $\sigma_y$ are the standard deviations of X and Y, respectively.



a) Fit a linear regression line to the following data:

X    12   15   18   20   27   34   28   48
Y   123  150  158  170  180  184  176  130

Answer:

Assumed mean of X is 26.

Assumed mean of Y is 158.

X       dx = X-26   dx^2    Y       dy = Y-158   dy^2    dxdy
12      -14         196     123     -35          1225    490
15      -11         121     150     -8           64      88
18      -8          64      158     0            0       0
20      -6          36      170     12           144     -72
27      1           1       180     22           484     22
34      8           64      184     26           676     208
28      2           4       176     18           324     36
48      22          484     130     -28          784     -616
Total   202         -6      970     1271   7     3701    156

(Totals: ΣX = 202, Σdx = -6, Σdx^2 = 970, ΣY = 1271, Σdy = 7, Σdy^2 = 3701, Σdxdy = 156.)

Mean of X = 202/8 = 25.25, Mean of Y = 1271/8 = 158.875

Regression equation of Y on X

Y - 158.875 = byx (X - 25.25), where byx = (N*Σdxdy - Σdx*Σdy) / (N*Σdx^2 - (Σdx)^2)

byx = (8*156 - (-6)(7)) / (8*970 - (-6)^2)

byx = (1248 + 42) / (7760 - 36)

byx = 1290 / 7724

byx = 0.167

Y - 158.875 = 0.167 (X - 25.25)

Y - 158.875 = 0.167X - 4.217

Y = 0.167X + 154.658

Regression equation of X on Y

X - 25.25 = bxy (Y - 158.875), where bxy = (N*Σdxdy - Σdx*Σdy) / (N*Σdy^2 - (Σdy)^2)

bxy = (8*156 - (-6)(7)) / (8*3701 - (7)^2)

bxy = (1248 + 42) / (29608 - 49)

bxy = 1290 / 29559

bxy = 0.04364

X - 25.25 = 0.04364 (Y - 158.875)

X - 25.25 = 0.04364Y - 6.933

X = 0.04364Y + 18.317

Regression equation of Y on X:

Y = 0.167X + 154.658

Regression equation of X on Y:

X = 0.04364Y + 18.317
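A hand computation of this kind is easy to get wrong, so it is worth cross-checking against the raw data. The following Python sketch fits both regression lines directly from the sums of X, Y, XY and the squares, without using assumed means. The function name fit_line is illustrative.

```python
xs = [12, 15, 18, 20, 27, 34, 28, 48]
ys = [123, 150, 158, 170, 180, 184, 176, 130]

def fit_line(u, v):
    """Least-squares slope and intercept for regressing v on u."""
    n = len(u)
    su, sv = sum(u), sum(v)
    suv = sum(a * b for a, b in zip(u, v))
    suu = sum(a * a for a in u)
    slope = (n * suv - su * sv) / (n * suu - su * su)
    intercept = sv / n - slope * su / n
    return slope, intercept

b_yx, a_yx = fit_line(xs, ys)   # Y on X
b_xy, a_xy = fit_line(ys, xs)   # X on Y

print(f"Y on X: Y = {b_yx:.4f} X + {a_yx:.3f}")
print(f"X on Y: X = {b_xy:.5f} Y + {a_xy:.3f}")
```

The product of the two slopes equals the square of the correlation coefficient, which gives a further consistency check on the result.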



Question 3: What do you mean by business forecasting? What are the different methods of business forecasting? Describe the effectiveness of time-series analysis as a mode of business forecasting. Describe the method of moving averages.

Answer

Introduction:

Business forecasting has always been one component of running an enterprise. However, forecasting traditionally was based less on concrete and comprehensive data than on face-to-face meetings and common sense. In recent years, business forecasting has developed into a much more scientific endeavor, with a host of theories, methods, and techniques designed for forecasting certain types of data. The development of information technologies and the Internet propelled this development into overdrive, as companies not only adopted such technologies into their business practices, but into forecasting schemes as well. In the 2000s, projecting the optimal levels of goods to buy or products to produce involved sophisticated software and electronic networks that incorporate mounds of data and advanced mathematical algorithms tailored to a company's particular market conditions and line of business. Business forecasting involves a wide range of tools, including simple electronic spreadsheets; enterprise resource planning (ERP) and electronic data interchange (EDI) networks, advanced supply chain management systems, and other Web-enabled technologies. The practice attempts to pinpoint key factors in business production and extrapolate from given data sets to produce accurate projections for future costs, revenues, and opportunities. This normally is done with an eye toward adjusting current and near-future business practices to take maximum advantage of expectations.

In the Internet age, the field of business forecasting was propelled by three interrelated phenomena. First, the Internet provided a new series of tools to aid the science of business forecasting. Second, business forecasting had to take the Internet itself into account in trying to construct viable models and make predictions. Finally, the Internet fostered vastly accelerated transformations in all areas of business that made the job of business forecasters that much more exacting. By the 2000s, as the Internet and its myriad functions highlighted the central importance of information in economic activity, more and more companies came to recognize the value, and often the necessity, of business forecasting techniques and systems. Business forecasting is indeed big business, with companies investing tremendous resources in systems, time, and employees aimed at bringing useful projections into the planning process. According to a survey by the Hudson, Ohio-based Answer Think Consulting Group, which specializes in studies of business planning, the average U.S. company spends more than 25,000 person-days on business forecasting and related activities for every billion dollars of revenue.

Forecasting systems draw on several sources for their forecasting input, including databases, e-mails, documents, and Web sites. After processing data from various sources, sophisticated forecasting systems integrate all the necessary data into a single spreadsheet, which the company can then manipulate by entering in various projections—such as different estimates of future sales—that the system will incorporate into a new readout.

A flexible and sound architecture is crucial, particularly in the fast-paced, rapidly developing Internet economy. If a system's base is rigid or inadequate, it can be impossible to reconfigure to adjust to changing market conditions. Along the same lines, according to the Journal of Business Forecasting Methods & Systems, it's important to invest in systems that will remain useful over the long term, weathering alterations in the business climate.

One of the distinguishing characteristics of forecasting systems is the mathematical algorithms they use to take various factors into account. For example, most forecasting systems arrange relevant data into hierarchies, such as a consumer hierarchy, a supply hierarchy, a geography hierarchy, and so on. To return a useful forecast, the system can't simply allocate down each hierarchy separately, but must account for the ways in which those dimensions interact with each other. Moreover, the degree of this interaction varies according to the type of business in which a company is engaged. Thus, businesses need to fine-tune their allocation algorithms in order to receive useful forecasts.

The second forecasting model is cause-and-effect. In this model, one assumes a cause, or driver of activity, that determines an outcome. For instance, a company may assume that, for a particular data set, the cause is an investment in information technology, and the effect is sales. This model requires the historical data not only of the factor with which one is concerned (in this case, sales), but also of that factor's determined cause (here, information technology expenditures). It is assumed, of course, that the cause-and-effect relationship is relatively stable and easily quantifiable.

The third primary forecasting model is known as the judgmental model. In this case, one attempts to produce a forecast where there is no useful historical data. A company might choose to use the judgmental model when it attempts to project sales for a brand new product, or when market conditions have qualitatively changed, rendering previous data obsolete. In addition, according to the Journal of Business Forecasting Methods & Systems, this model is useful when the bulk of sales derive only from a relative handful of customers. To proceed in the absence of historical data, alternative data is collected by way of experts in the field, prospective customers, trade groups, business partners, or any other relevant source of information.

Business forecasting systems often work hand-in-hand with supply chain management systems. In such systems, all partners in the supply chain can electronically oversee all movement of components within that supply chain and gear the chain toward maximum efficiency.

The Internet has proven to be a panacea in this field, and business forecasting systems allow partners to project the optimal flow of components into the future so that companies can try to meet optimal levels rather than continually catch up to them.

Time series methods:

Time series methods use historical data as the basis of estimating future outcomes.

Rolling forecast is a projection into the future based on past performances, routinely updated on a regular schedule to incorporate data.[1]

Moving average, exponential smoothing, extrapolation, linear prediction, trend estimation, growth curve.
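Of the methods listed, the moving average is the simplest to illustrate. The sketch below shows one common form, the simple (unweighted) moving average, in which each smoothed value is the mean of the most recent observations in a fixed window. The sales figures and the window length of 3 are hypothetical values chosen for the example.

```python
def moving_average(series, window):
    """Simple moving average: the mean of each run of `window` consecutive values."""
    if not 1 <= window <= len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical monthly sales figures
sales = [120, 132, 128, 140, 151, 149, 160]

smoothed = moving_average(sales, window=3)
next_forecast = smoothed[-1]   # the latest average serves as the next-period forecast

print([round(v, 2) for v in smoothed])
print(round(next_forecast, 2))
```

A longer window smooths out more of the irregular fluctuation but reacts more slowly to a change in trend, so the window length is a judgment call for the forecaster.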

1. Causal / Econometric methods:

Some forecasting methods use the assumption that it is possible to identify the underlying factors that might influence the variable that is being forecast. For example, sales of umbrellas might be associated with weather conditions. If the causes are understood, projections of the influencing variables can be made and used in the forecast.

Regression analysis using linear regression or non-linear regression

Autoregressive moving average (ARMA)

Autoregressive integrated moving average (ARIMA), e.g. Box-Jenkins

Econometrics
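As a rough illustration of the causal approach, the sketch below fits a straight line to umbrella sales against rainfall by ordinary least squares and then uses a projected rainfall figure to forecast sales. The data and the names `fit_line`, `rainfall_mm` and `umbrella_sales` are hypothetical, invented only for this example.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # slope: change in sales per mm of rain
    a = mean_y - b * mean_x    # intercept
    return a, b

# Hypothetical monthly observations (not taken from the source).
rainfall_mm = [10, 20, 30, 40, 50]
umbrella_sales = [25, 45, 65, 85, 105]

a, b = fit_line(rainfall_mm, umbrella_sales)

# Forecast sales from a projected value of the influencing variable.
projected_rainfall = 60
forecast = a + b * projected_rainfall
print(f"sales = {a:.1f} + {b:.1f} * rainfall; forecast = {forecast:.1f}")
```

The forecast is only as good as the projection of the influencing variable, which is why the causal approach depends on the causes being understood.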

2. Judgmental methods:

Judgmental forecasting methods incorporate intuitive judgments, opinions and subjective probability estimates.

Composite forecasts

Surveys

Delphi method

Scenario building

Technology forecasting

Forecast by analogy




3. Other methods:

Simulation

Prediction market

Probabilistic forecasting and Ensemble forecasting

Reference class forecasting

4. Forecasting accuracy:

The forecast error is the difference between the actual value and the forecast value for the corresponding period:

E_t = Y_t - F_t

where E_t is the forecast error at period t, Y_t is the actual value at period t, and F_t is the forecast for period t.

Measures of aggregate error:

Mean Absolute Error (MAE)

Mean Absolute Percentage Error (MAPE)

Percent Mean Absolute Deviation (PMAD)

Mean squared error (MSE)

Root Mean squared error (RMSE)

Forecast skill (SS)
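The measures listed above are not spelled out in the text, so the following is a minimal sketch using their usual textbook definitions, with the per-period error taken as actual minus forecast. The function names and the sample figures are hypothetical. Forecast skill is left out because it needs a reference forecast to compare against.

```python
import math

def forecast_errors(actual, forecast):
    """Per-period errors: actual value minus forecast value."""
    return [y - f for y, f in zip(actual, forecast)]

def error_measures(actual, forecast):
    """Aggregate error measures under their common textbook definitions."""
    errors = forecast_errors(actual, forecast)
    n = len(errors)
    total_abs_error = sum(abs(e) for e in errors)
    mae = total_abs_error / n
    # MAPE is undefined if any actual value is zero.
    mape = 100 * sum(abs(e / y) for e, y in zip(errors, actual)) / n
    pmad = total_abs_error / sum(abs(y) for y in actual)
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    return {"MAE": mae, "MAPE": mape, "PMAD": pmad, "MSE": mse, "RMSE": rmse}

# Hypothetical actuals and forecasts for four periods.
actual = [100, 110, 120, 130]
forecast = [90, 115, 120, 140]

measures = error_measures(actual, forecast)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

MSE and RMSE penalize large errors more heavily than MAE, so the choice of measure depends on how costly occasional large misses are.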

Time-Critical Decision Modeling and Analysis:

The ability to model and perform decision modeling and analysis is an essential feature of many real-world applications ranging from emergency medical treatment in intensive care units to military command and control systems. Existing formalisms and methods of inference have not been effective in real-time applications where tradeoffs between decision quality and computational tractability are essential. In practice, an effective approach to time-critical dynamic decision modeling should provide explicit support for the modeling of temporal processes and for dealing with time-critical situations.

One of the most essential elements of being a high-performing manager is the ability to lead effectively one's own life, then to model those leadership skills for employees in the organization. This site comprehensively covers theory and practice of most topics in forecasting and economics. I believe such a comprehensive approach is necessary to fully understand the subject. A central objective of the site is to unify the various forms of business topics to link them closely to each other and to the supporting fields of statistics and economics. Nevertheless, the topics and coverage do reflect choices about what is important to understand for business decision making. Almost all managerial decisions are based on forecasts. Every decision becomes operational at some point in the future, so it should be based on forecasts of future conditions. Forecasts are needed throughout an organization -- and they should certainly not be produced by an isolated group of forecasters. Neither is forecasting ever "finished". Forecasts are needed continually, and as time moves on, the impact of the forecasts on actual performance is measured; original forecasts are updated; and decisions are modified, and so on.

For example, many inventory systems cater for uncertain demand. The inventory parameters in these systems require estimates of the demand and forecast error distributions. The two stages of these systems, forecasting and inventory control, are often examined independently. Most studies tend to look at demand forecasting as if this were an end in itself or at stock control models as if there were no preceding stages of computation. Nevertheless, it is important to understand the interaction between demand forecasting and inventory control since this influences the performance of the inventory system. This integrated process is shown in the following figure:

The decision-maker uses forecasting models to assist him or her in the decision-making process. The decision-maker often uses the modeling process to investigate the impact of different courses of action retrospectively; that is, "as if" the decision has already been made under a course of action. That is why the sequence of steps in the modeling process, in the above figure, must be considered in reverse order. For example, the output (which is the result of the action) must be considered first.

It is helpful to break the components of decision making into three groups: Uncontrollable, Controllable, and Resources (that defines the problem situation). As indicated in the above activity chart, the decision-making process has the following components:

1. Performance measure (or indicator, or objective): Measuring business performance is the top priority for managers. Management by objective works if you know the objectives. Unfortunately, most business managers do not know explicitly what they are. The development of effective performance measures is seen as increasingly important in almost all organizations. However, the challenges of achieving this in the public and non-profit sectors are arguably considerable. A performance measure provides the desirable level of outcome, i.e., the objective of your decision. The objective is important in identifying the forecasting activity. The following table provides a few examples of performance measures for different levels of management:

Level          Performance Measure

Strategic      Return on Investment, Growth, and Innovations

Tactical       Cost, Quantity, and Customer satisfaction

Operational    Target setting, and Conformance with standard

2. Clearly, if you are seeking to improve a system's performance, an operational view is really what you are after. Such a view gets at how a forecasting system really works; for example, by what correlation its past output behaviors have generated. It is essential to understand how a forecast system is currently working if you want to change how it will work in the future. Forecasting activity is an iterative process. It starts with effective and efficient planning and ends in compensation of other forecasts for their performance.

3. What is a System? Systems are formed with parts put together in a particular manner in order to pursue an objective. The relationship between the parts determines what the system does and how it functions as a whole. Therefore, the relationships in a system are often more important than the individual parts. In general, systems that are building blocks for other systems are called subsystems.

4. The Dynamics of a System: A system that does not change is a static system. Many business systems are dynamic systems, which means their states change over time. We refer to the way a system changes over time as the system's behavior. And when the system's development follows a typical pattern, we say the system has a behavior pattern. Whether a system is static or dynamic depends on which time horizon you choose and on which variables you concentrate. The time horizon is the time period within which you study the system. The variables are changeable values of the system.

5. Resources: Resources are the constant elements that do not change during the time horizon of the forecast. Resources are the factors that define the decision problem. Strategic decisions usually have longer time horizons than both the Tactical and the Operational decisions.

6. Forecasts: Forecast inputs come from the decision maker's environment. Uncontrollable inputs must be forecasted or predicted.

7. Decisions: Decision inputs are the known collection of all possible courses of action you might take.

8. Interaction: Interactions among the above decision components are the logical, mathematical functions representing the cause-and-effect relationships among inputs, resources, forecasts, and the outcome.

Interactions are the most important type of relationship involved in the decision-making process. When the outcome of a decision depends on the course of action, we change one or more aspects of the problematic situation with the intention of bringing about a desirable change in some other aspect of it. We succeed if we have knowledge about the interaction among the components of the problem.

There may also be sets of constraints which apply to each of these components. Therefore, they do not need to be treated separately.

9. Actions: Action is the ultimate decision and is the best course of strategy to achieve the desirable goal.

Simple Moving Averages:

The best-known forecasting method is the moving average: simply take a certain number of past periods, add their values together, and divide by the number of periods. The Simple Moving Average (MA) is an effective and efficient approach provided the time series is stationary in both mean and variance. The following formula is used to find the moving average of order n, MA(n), for period t+1:

MA_{t+1} = [D_t + D_{t-1} + ... + D_{t-n+1}] / n

where D_t is the observed value in period t and n is the number of observations used in the calculation.

The forecast for time period t + 1 is the forecast for all future time periods. However, this forecast is revised only when new data becomes available. You may like using Forecasting by Smoothing JavaScript, and then performing some numerical experimentation for a deeper understanding of these concepts.


http://home.ubalt.edu/ntsbarsh/Business-stat/otherapplets/ForecaSmo.htm
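For readers who prefer to experiment offline, here is a minimal sketch of the simple moving average forecast: the forecast for the next period is the mean of the last n observations, and it is revised whenever a new observation arrives. The function name and the demand figures are hypothetical.

```python
def moving_average_forecast(demand, n):
    """Forecast for the next period: mean of the n most recent observations."""
    if n <= 0 or len(demand) < n:
        raise ValueError("need n >= 1 and at least n observations")
    return sum(demand[-n:]) / n

# Hypothetical demand history, oldest first.
demand = [20, 22, 21, 25, 27]

forecast = moving_average_forecast(demand, 3)   # mean of 21, 25, 27
print(f"forecast for next period: {forecast:.2f}")

# When a new observation becomes available, the forecast is revised.
demand.append(30)
revised = moving_average_forecast(demand, 3)    # mean of 25, 27, 30
print(f"revised forecast: {revised:.2f}")
```

A larger n smooths out more noise but reacts more slowly to genuine changes in level.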


Weighted Moving Average:

Weighted moving averages are very powerful and economical. They are widely used where repeated forecasts are required, and they use methods such as sum-of-the-digits and trend adjustment. As an example, a weighted moving average of order three is:

Weighted MA(3) = w1 D_t + w2 D_{t-1} + w3 D_{t-2}

where the weights are any positive numbers such that w1 + w2 + w3 = 1. A typical set of weights for this example is w1 = 3/(1 + 2 + 3) = 3/6, w2 = 2/6, and w3 = 1/6.
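The weighting scheme can be sketched in a few lines. The example below builds sum-of-the-digits weights for a chosen order and applies them with the largest weight on the most recent observation. The function names and the demand figures are hypothetical.

```python
def sum_of_digits_weights(n):
    """Weights n/S, (n-1)/S, ..., 1/S with S = 1 + 2 + ... + n (most recent first)."""
    total = n * (n + 1) // 2
    return [k / total for k in range(n, 0, -1)]

def weighted_moving_average(demand, weights):
    """weights[0] applies to the most recent observation, weights[1] to the one before, etc."""
    n = len(weights)
    if len(demand) < n:
        raise ValueError("need at least as many observations as weights")
    recent_first = demand[::-1][:n]
    return sum(w * d for w, d in zip(weights, recent_first))

weights = sum_of_digits_weights(3)      # 3/6, 2/6, 1/6

# Hypothetical demand history, oldest first.
demand = [100, 110, 120]
forecast = weighted_moving_average(demand, weights)
print(f"weights: {weights}")
print(f"weighted MA(3) forecast: {forecast:.2f}")
```

Because recent periods carry more weight, this forecast responds faster to a change in level than a simple moving average of the same order.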


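An illustrative numerical example: The moving average and weighted moving average of order five are calculated in the following table. The WMA(5) column is consistent with sum-of-the-digits weights (5/15, 4/15, 3/15, 2/15, 1/15, with the largest weight on the most recent week); values are rounded to the nearest whole number.

Week   Sales ($1000)   MA(5)   WMA(5)

1      105             -       -

2      100             -       -

3      105             -       -

4      95              -       -

5      100             101     100

6      95              99      98

7      105             100     100

8      120             103     107

9      115             107     111

10     125             112     117

11     120             117     119

12     120             120     120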

Moving Averages with Trends: Any method of time series analysis involves a different degree of model complexity and presumes a different level of comprehension about the underlying trend of the time series. In many business time series, the trend in the smoothed series using the usual moving average method indicates evolving changes in the series level to be highly nonlinear.

In order to capture the trend, we may use the Moving-Average with Trend (MAT) method. The MAT method uses an adaptive linearization of the trend by means of incorporating a combination of the local slopes of both the original and the smoothed time series.

In making a forecast, it is also important to provide a measure of how accurate one can expect the forecast to be. The statistical analysis of the error terms, known as the residual time series, provides a measurement tool and a decision process for model selection. In applying the MAT method, sensitivity analysis is needed to determine the optimal value of the moving average parameter n, i.e., the optimal number of periods. The error time series allows us to study many of its statistical properties for a goodness-of-fit decision. Therefore it is important to evaluate the nature of the forecast error by using the appropriate statistical tests. Ideally, the forecast error should be a normally distributed random variable with mean close to zero and a constant variance across time.
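As an informal first look at the residuals, and not a substitute for the formal statistical tests mentioned above, one might compute the mean of the forecast errors and compare their variance in the first and second halves of the series. The function name and the figures below are hypothetical.

```python
import statistics

def residual_summary(actual, forecast):
    """Mean of the residuals and their variance in each half of the series."""
    residuals = [y - f for y, f in zip(actual, forecast)]
    half = len(residuals) // 2
    return {
        "mean": statistics.mean(residuals),
        "variance_first_half": statistics.pvariance(residuals[:half]),
        "variance_second_half": statistics.pvariance(residuals[half:]),
    }

# Hypothetical actual values and a constant forecast of 100.
actual = [102, 98, 101, 99, 103, 97, 100, 100]
forecast = [100] * len(actual)

summary = residual_summary(actual, forecast)
print(summary)
```

A mean far from zero suggests a biased forecast, and a large gap between the two variances suggests the error variance is not constant over time.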

For computer implementation of the Moving Average with Trend (MAT) method one may use the forecasting (FC) module of WinQSB, which is commercial grade stand-alone software. WinQSB's approach is to first select the model and then enter the parameters and the data. With the Help features in WinQSB the learning curve is short; one needs only a few minutes to master its useful features.

Exponential Smoothing Techniques: Exponential smoothing (ES) techniques are among the most successful forecasting methods. Moreover, they can be modified efficiently for effective use with time series that have seasonal patterns. They are also easy to adjust for past errors, make it easy to prepare follow-on forecasts, and are ideal for situations where many forecasts must be prepared. Several different forms are used, depending on the presence of trend or cyclical variations. In short, ES is an averaging technique that uses unequal weights; the weights applied to past observations decline in an exponential manner.
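The text does not give the smoothing formula, so the sketch below assumes the common single exponential smoothing recursion F_{t+1} = alpha * D_t + (1 - alpha) * F_t, started with F_1 = D_1. The smoothing constant `alpha`, the function names and the demand figures are hypothetical choices for illustration.

```python
def exponential_smoothing(demand, alpha):
    """Single exponential smoothing.

    Returns a list whose first entry is the starting forecast F_1 = D_1
    (an assumption) and whose last entry is the forecast for the period
    after the final observation.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    forecasts = [demand[0]]
    for d in demand:
        forecasts.append(alpha * d + (1 - alpha) * forecasts[-1])
    return forecasts

def implied_weights(alpha, periods):
    """Weight on the observation k periods back: alpha * (1 - alpha) ** k."""
    return [alpha * (1 - alpha) ** k for k in range(periods)]

# Hypothetical demand history, oldest first.
demand = [100, 110, 120]
forecasts = exponential_smoothing(demand, alpha=0.5)
weights = implied_weights(0.5, 3)
print(f"forecasts: {forecasts}")
print(f"weights on the three most recent observations: {weights}")
```

The second function makes the "unequal weights" point concrete: each step back in time multiplies the weight by (1 - alpha), so older observations fade out geometrically.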


http://home.ubalt.edu/ntsbarsh/Business-stat/opre/partX.htm


Question 4: What is definition of Statistics? What are the different characteristics of statistics? What are the different functions of Statistics? What are the limitations of Statistics?

Answer

Introduction:

Statistics is considered by some to be a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data, while others consider it to be a branch of mathematics concerned with collecting and interpreting data. Statisticians improve the quality of data with the design of experiments and survey sampling. Statistics also provides tools for prediction and forecasting using data and statistical models. Statistics is applicable to a wide variety of academic disciplines, including natural and social sciences, government, and business.

Statistical methods can be used to summarize or describe a collection of data; this is called descriptive statistics. This is useful in research, when communicating the results of experiments. In addition, patterns in the data may be modeled in a way that accounts for randomness and uncertainty in the observations, and are then used to draw inferences about the process or population being studied; this is called inferential statistics. Inference is a vital element of scientific advance, since it provides a prediction (based on data) for where a theory logically leads. To further prove the guiding theory, these predictions are tested as well, as part of the scientific method. If the inference holds true, then the descriptive statistics of the new data increase the soundness of that hypothesis. Descriptive statistics and inferential statistics (a.k.a., predictive statistics) together comprise applied statistics. There is also a discipline called mathematical statistics, which is concerned with the theoretical basis of the subject. The word statistics can either be singular or plural. In its singular form, statistics refers to the mathematical science discussed here. In its plural form, statistics is the plural of the word statistic, which refers to a quantity (such as a mean) calculated from a set of data.

Experimental and observational studies:



A common goal for a statistical research project is to investigate causality, and in particular to draw a conclusion on the effect of changes in the values of predictors or independent variables on dependent variables or responses. There are two major types of causal statistical studies: experimental studies and observational studies. In both types of studies, the effect of differences of an independent variable (or variables) on the behavior of the dependent variable is observed. The difference between the two types lies in how the study is actually conducted. Each can be very effective.

An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation. Instead, data are gathered and correlations between predictors and response are investigated. An example of an observational study is one that explores the correlation between smoking and lung cancer. This type of study typically uses a survey to collect observations about the area of interest and then performs statistical analysis. In this case, the researchers would collect observations of both smokers and non-smokers, perhaps through a case-control study, and then look for the number of cases of lung cancer in each group.

The basic steps of an experiment are:

1. Planning the research, including determining information sources, research subject selection, and ethical considerations for the proposed research and method.

2. Design of experiments, concentrating on the system model and the interaction of independent and dependent variables.

3. Summarizing a collection of observations to feature their commonality by suppressing details. (Descriptive statistics)

4. Reaching consensus about what the observations tell about the world being observed. (Statistical inference)

5. Documenting / presenting the results of the study.

Levels of measurement:

There are four types of measurements or levels of measurement or measurement scales used in statistics:

Nominal. Ordinal. Interval. Ratio.

Characteristics of Statistics:

Some of its important characteristics are given below:

Statistics are aggregates of facts.

Statistics are numerically expressed.

Statistics are affected to a marked extent by multiplicity of causes.

Statistics are enumerated or estimated according to a reasonable standard of accuracy.

Statistics are collected for a predetermined purpose.

Statistics are collected in a systematic manner.

Statistics must be comparable to each other.

Functions of Statistics:

(1) Statistics helps in providing a better understanding and exact description of a phenomenon of nature.

(2) Statistics helps in proper and efficient planning of a statistical inquiry in any field of study.

(3) Statistics helps in collecting appropriate quantitative data.

(4) Statistics helps in presenting complex data in a suitable tabular, diagrammatic and graphic form for an easy and clear comprehension of the data.

(5) Statistics helps in understanding the nature and pattern of variability of a phenomenon through quantitative observations.

(6) Statistics helps in drawing valid inferences, along with a measure of their reliability, about the population parameters from the sample data.

Limitations of Statistics:

The important limitations of statistics are:

(1) Statistical laws are true on average. Statistics are aggregates of facts, so a single observation is not a statistic; statistics deals with groups and aggregates only.

(2) Statistical methods are best applied to quantitative data.

(3) Statistics cannot be applied to heterogeneous data.

(4) If sufficient care is not exercised in collecting, analyzing and interpreting the data, statistical results might be misleading.

(5) Only a person who has an expert knowledge of statistics can handle statistical data efficiently.

(6) Some errors are possible in statistical decisions. In particular, inferential statistics involves certain errors, and we do not know whether an error has been committed or not.

Question 5: What are the different stages of planning a statistical survey? Describe the various methods for collecting data in a statistical survey?

Answer

Introduction:

Statistical surveys are used to collect quantitative information about items in a population. Surveys of human populations and institutions are common in political polling and government, health, social science and marketing research. A survey may focus on opinions or factual information depending on its purpose, and many surveys involve administering questions to individuals. When the questions are administered by a researcher, the survey is called a structured interview or a researcher-administered survey. When the questions are administered by the respondent, the survey is referred to as a questionnaire or a self-administered survey.

Structure and standardization:

The questions are usually structured and standardized. The structure is intended to reduce bias (see questionnaire construction). For example, questions should be ordered in such a way that a question does not influence the response to subsequent questions. Surveys are standardized to ensure reliability, generalizability, and validity (see quantitative marketing research). Every respondent should be presented with the same questions and in the same order as other respondents. In organizational development (OD), carefully constructed survey instruments are often used as the basis for data gathering, organizational diagnosis, and subsequent action planning. Some OD practitioners (e.g. Fred Nickols) even consider survey guided development as the sine qua non of OD.

Serial surveys:

Serial surveys are those which repeat the same questions at different points in time, producing time-series data. They typically fall into two types:




Cross-sectional surveys which draw a new sample each time. In a sense any one-off survey will also be cross-sectional.

Longitudinal surveys where the sample from the initial survey is re-contacted at a later date to be asked the same questions.
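The difference between the two designs can be sketched as a sampling procedure: a cross-sectional series draws a fresh sample for every wave, while a longitudinal series draws one sample and re-contacts the same respondents. The function names, the respondent IDs and the fixed random seed are hypothetical choices for illustration.

```python
import random

def cross_sectional_waves(population, sample_size, waves, seed=0):
    """Draw a new random sample from the population for every wave."""
    rng = random.Random(seed)
    return [rng.sample(population, sample_size) for _ in range(waves)]

def longitudinal_waves(population, sample_size, waves, seed=0):
    """Draw one sample, then re-contact the same respondents in every wave."""
    rng = random.Random(seed)
    panel = rng.sample(population, sample_size)
    return [list(panel) for _ in range(waves)]

# Hypothetical respondent IDs.
population = list(range(1, 101))

cross = cross_sectional_waves(population, sample_size=5, waves=3)
panel = longitudinal_waves(population, sample_size=5, waves=3)

print("cross-sectional:", cross)
print("longitudinal:   ", panel)
```

Following the same panel lets changes be tracked for individual respondents, whereas fresh samples only show change at the level of the whole population.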

Advantages:

It is an efficient way of collecting information from a large number of respondents. Very large samples are possible. Statistical techniques can be used to determine validity, reliability, and statistical significance.

Surveys are flexible in the sense that a wide range of information can be collected. They can be used to study attitudes, values, beliefs, and past behaviors.

Because they are standardized, they are relatively free from several types of errors.

They are relatively easy to administer.

There is an economy in data collection due to the focus provided by standardized questions. Only questions of interest to the researcher are asked, recorded, codified, and analyzed. Time and money are not spent on tangential questions.

They are comparatively cheap to run.

Disadvantages:

They depend on subjects’ motivation, honesty, memory, and ability to respond. Subjects may not be aware of their reasons for any given action. They may have forgotten their reasons. They may not be motivated to give accurate answers; in fact, they may be motivated to give answers that present themselves in a favorable light.

Structured surveys, particularly those with closed ended questions, may have low validity when researching affective variables.

Although the chosen survey individuals are often a random sample, errors due to non-response may exist. That is, people who choose to respond to the survey may be different from those who do not respond, thus biasing the estimates.

Survey question answer-choices could lead to vague data sets because at times they are relative only to a personal abstract notion concerning "strength of choice". For instance the choice "moderately agree" may mean different things to different subjects, and to anyone interpreting the data for correlation. Even yes or no answers are problematic because subjects may for instance put "no" if the choice "only once" is not available.

Stages of Planning a statistical survey:



1. The nature of the problem to be investigated should be clearly defined in an unambiguous manner.

2. The objectives of the investigation should be stated at the outset. Objectives could be to obtain certain estimates, to establish a theory, to verify an existing statement, or to find relationships between characteristics, etc.

3. The scope of the investigation has to be made clear. It refers to the area to be covered, identification of units to be studied, nature of characteristics to be observed, accuracy of measurements, analytical methods, time, cost and other resources required.

4. Whether to use data collected from a primary or a secondary source should be determined in advance.

5. The organization of the investigation is the final step in the process. It encompasses the determination of the number of investigators required, their training, the supervision work needed, and the funds required.

Modes of Data Collection:

There are several ways of administering a survey, including:

a. Telephone:

Use of interviewers encourages sample persons to respond, leading to higher response rates.

Interviewers can increase comprehension of questions by answering respondents' questions.

Fairly cost efficient, depending on local call charge structure.

Good for large national (or international) sampling frames.

Some potential for interviewer bias (e.g. some people may be more willing to discuss a sensitive issue with a female interviewer than with a male one).

Cannot be used for non-audio information (graphics, demonstrations, taste/smell samples).

Unreliable for consumer surveys in rural areas where telephone penetration is low.

Three types:

o traditional telephone interviews

o computer assisted telephone dialing

o computer assisted telephone interviewing (CATI)




b. Mail:

The questionnaire may be handed to the respondents or mailed to them, but in all cases they are returned to the researcher via mail.

Cost is very low, since bulk postage is cheap in most countries.

Long time delays, often several months, before the surveys are returned and statistical analysis can begin.

Not suitable for issues that may require clarification.

Respondents can answer at their own convenience (allowing them to break up long surveys; also useful if they need to check records to answer a question).

No interviewer bias introduced.

Large amount of information can be obtained: some mail surveys are as long as 50 pages.

Response rates can be improved by using mail panels:

o Members of the panel have agreed to participate.

o Panels can be used in longitudinal designs where the same respondents are surveyed several times.

c. Online surveys:

Can use web or e-mail.

Web is preferred over e-mail because interactive HTML forms can be used.

Often inexpensive to administer.

Very fast results.

Easy to modify.

Response rates can be improved by using online panels - members of the panel have agreed to participate.

If not password-protected, easy to manipulate by completing multiple times to skew results.

Data creation, manipulation and reporting can be automated and/or easily exported into a format which can be read by PSPP, DAP or other statistical analysis software.

Data sets created in real time.

Some are incentive based (such as Survey Vault or YouGov).

May skew sample towards a younger demographic compared with CATI.

Often difficult to determine/control selection probabilities, hindering quantitative analysis of data.

Used in large scale industries.

d. Personal in-home survey:

Respondents are interviewed in person, in their homes (or at the front door).

Very high cost.

Suitable when graphic representations, smells, or demonstrations are involved.

Often suitable for long surveys (but some respondents object to allowing strangers into their home for extended periods).

Suitable for locations where telephone or mail are not developed.

Skilled interviewers can persuade respondents to cooperate, improving response rates.

Potential for interviewer bias.

e. Personal mall intercept survey:

Shoppers at malls are intercepted - they are interviewed on the spot, taken to a room and interviewed, or taken to a room and given a self-administered questionnaire.

Socially acceptable - people feel that a mall is a more appropriate place to do research than their home.

Potential for interviewer bias.

Fast.

Easy to manipulate by completing multiple times to skew results.

Methods used to increase response rates:

Brevity - single page if possible.

Financial incentives

o Paid in advance.

o Paid at completion.

Non-monetary incentives

o Commodity giveaways (pens, notepads).

o Entry into a lottery, draw or contest.

o Discount coupons.

o Promise of contribution to charity.

Preliminary notification.

Foot-in-the-door techniques - start with a small inconsequential request.

Personalization of the request - address specific individuals.

Follow-up requests - multiple requests.

Claimed affiliation with universities, research institutions, or charities.

Emotional appeals.

Bids for sympathy.

Convince respondent that they can make a difference.

Guarantee anonymity.

Legal compulsion (certain government-run surveys).

Question 6: What are the functions of classification? What are the requisites of a good classification? What is a table? Describe the usefulness of a table as a mode of presentation of data.

Answer

Collected data in raw form would be voluminous and not comprehensible. Therefore, it should be condensed and simplified for better understanding and usefulness. Classification is the first stage in simplification. It can be defined as a systematic grouping of units according to their common characteristics. Each such group is called a class. For example, in a survey of industrial workers in a particular industry, workers can be classified as unskilled, semiskilled and skilled, each of which forms a class.

Types of classification: The most important types are:

1) Geographical classification: Data are classified according to region.
2) Chronological classification: Data are classified according to the time of their occurrence.
3) Conditional classification: Data are classified according to certain conditions.
4) Qualitative classification: Classification of data that are not measurable, e.g. sex of a person, marital status, color.
5) Quantitative classification: Classification of data that are measurable in either discrete or continuous form.
6) Statistical series: Data arranged logically according to size, time of occurrence, or some other measurable or non-measurable characteristic.

Methods of Classification:



Classification done according to a single attribute or variable is known as one-way classification.

Classification done according to two attributes or variables is known as two-way classification.

Classification done according to more than two attributes or variables is known as manifold classification.

Examples:

One-way classification: Number of students who secured more than 60 % in various sections of the same course.
Two-way classification: Classification of students who secured more than 60 % according to sex.
Manifold classification: Classification of employees according to skill, sex and education.
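The three methods above can be sketched in a few lines of Python. This is only an illustration: the employee records are invented for the example, and counting with `collections.Counter` is one convenient way to group units, not a prescribed method.

```python
from collections import Counter

# Hypothetical employee records: (skill, sex, education).
employees = [
    ("skilled", "F", "graduate"),
    ("skilled", "M", "graduate"),
    ("unskilled", "M", "school"),
    ("semiskilled", "F", "school"),
    ("skilled", "F", "school"),
    ("unskilled", "M", "school"),
]

# One-way classification: group by a single attribute (skill).
one_way = Counter(skill for skill, _, _ in employees)

# Two-way classification: group by two attributes (skill and sex).
two_way = Counter((skill, sex) for skill, sex, _ in employees)

# Manifold classification: group by more than two attributes.
manifold = Counter(employees)

print(one_way["skilled"])                      # 3
print(two_way[("skilled", "F")])               # 2
print(manifold[("unskilled", "M", "school")])  # 2
```

Each counter key is a class, and its value is the number of units falling in that class.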

Statistical classification is a supervised machine learning procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as traits, variables, characters, etc) and based on a training set of previously labeled items.

Formally, the problem can be stated as follows: given training data {(x1, y1), ..., (xn, yn)}, produce a classifier h: X → Y that maps any object x ∈ X to its true classification label y ∈ Y defined by some unknown mapping (ground truth). For example, if the problem is filtering spam, then xi is some representation of an email and y is either "Spam" or "Non-Spam".
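As a rough sketch of this formulation, the Python below builds a classifier from labelled training pairs and applies it to new emails. The training emails are invented, and the rule used (copy the label of the training email that shares the most words) is a simple nearest-neighbour choice made only for illustration; the text does not prescribe any particular algorithm.

```python
def tokenize(text):
    """Represent an email x as a set of lowercase words."""
    return set(text.lower().split())


def train(training_data):
    """Return a classifier h built from labelled pairs (x_i, y_i)."""
    examples = [(tokenize(x), y) for x, y in training_data]

    def h(x):
        words = tokenize(x)
        # Pick the training email that shares the most words with x
        # and return its label (assumed rule, for illustration only).
        best = max(examples, key=lambda ex: len(words & ex[0]))
        return best[1]

    return h


# Hypothetical training data.
training_data = [
    ("win a free prize now", "Spam"),
    ("free money claim your prize", "Spam"),
    ("meeting agenda for monday", "Non-Spam"),
    ("please review the attached report", "Non-Spam"),
]

h = train(training_data)
print(h("claim your free prize"))          # Spam
print(h("agenda for the monday meeting"))  # Non-Spam
```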

The second problem is to consider classification as an estimation problem, where the goal is to estimate a function of the form

P(class | x) = f(x; θ)

where the feature vector input is x, and the function f is typically parameterized by some parameters θ. In the Bayesian approach to this problem, instead of choosing a single parameter vector θ, the result is integrated over all possible thetas, with the thetas weighted by how likely they are given the training data D:

P(class | x) = ∫ f(x; θ) P(θ | D) dθ




The third problem is related to the second, but the problem is to estimate the class-conditional probabilities P(x | class) and then use Bayes' rule to produce the class probability P(class | x) as in the second problem.
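A small worked example may make this third formulation concrete. The Python sketch below assumes a single binary feature (whether an email contains the word "free") and invented training counts; it estimates the class-conditional probabilities from those counts and then applies Bayes' rule to obtain the class probabilities.

```python
# Hypothetical training counts: emails per class, and how many contain "free".
counts = {
    "Spam": {"total": 40, "with_free": 30},
    "Non-Spam": {"total": 60, "with_free": 6},
}

n_emails = sum(c["total"] for c in counts.values())

# Prior P(class) and class-conditional P(x | class), where x = "contains 'free'".
prior = {k: c["total"] / n_emails for k, c in counts.items()}
likelihood = {k: c["with_free"] / c["total"] for k, c in counts.items()}

# Bayes' rule: P(class | x) = P(x | class) P(class) / sum over all classes.
evidence = sum(likelihood[k] * prior[k] for k in counts)
posterior = {k: likelihood[k] * prior[k] / evidence for k in counts}

for label, p in posterior.items():
    print(label, round(p, 3))
# Spam 0.833
# Non-Spam 0.167
```

With these assumed counts, an email containing "free" is assigned to "Spam" with probability 0.30 / 0.36, or about 0.83.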

Table:

In relational databases and flat file databases, a table is a set of data elements (values) that is organized using a model of vertical columns (which are identified by their name) and horizontal rows. A table has a specified number of columns, but can have any number of rows. Each row is identified by the values appearing in a particular column subset which has been identified as a candidate key. Table is another term for relation, although there is a difference in that a table is usually a multiset (bag) of rows, whereas a relation is a set and does not allow duplicates. Besides the actual data rows, tables generally have some meta-information associated with them, such as constraints on the table or on the values within particular columns. The data in a table do not have to be physically stored in the database. Views are also relational tables, but their data are calculated at query time. Another example is nicknames, which represent a pointer to a table in another database.

Comparisons with other data structures

In non-relational systems, such as hierarchical databases, the distant counterpart of a table is a structured file, in which each record of the file represents a row of the table and each field of a record represents a column.

Unlike in a spreadsheet, the data type of a field is ordinarily defined by the schema describing the table. Some relational systems are less strict about field data type definitions.

Tabulation: Tabulation follows classification. It is a logical listing of related data in rows and columns. Objectives of tabulation are:

To simplify complex data.
To highlight important characteristics.




To present data in minimum space.
To facilitate comparison.
To bring out trends and tendencies.
To facilitate further analysis.
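To illustrate how tabulation follows classification, the Python sketch below arranges classified data in rows and columns with totals. The student records and the helper name `build_table` are hypothetical, chosen only for this example.

```python
from collections import Counter

# Hypothetical classified data: (section, sex) for students above 60 %.
records = [
    ("A", "F"), ("A", "M"), ("A", "F"),
    ("B", "M"), ("B", "M"),
    ("C", "F"),
]

counts = Counter(records)
sections = sorted({s for s, _ in records})
sexes = sorted({x for _, x in records})


def build_table(counts, rows, cols):
    """Return a list of rows: header, one row per class, and a totals row."""
    table = [["Section"] + cols + ["Total"]]
    for r in rows:
        cells = [counts[(r, c)] for c in cols]
        table.append([r] + cells + [sum(cells)])
    col_totals = [sum(counts[(r, c)] for r in rows) for c in cols]
    table.append(["Total"] + col_totals + [sum(col_totals)])
    return table


table = build_table(counts, sections, sexes)
for row in table:
    # Pad every cell to a fixed width so the columns line up.
    print("".join(str(cell).ljust(9) for cell in row))
```

The row and column totals make comparison across sections and between sexes immediate, which is the point of the objectives listed above.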

