[Figure: flowchart "Statistics Overview (Analytical approach)". Statistical analysis of numerical data proceeds from population definition, through deciding the type of data needed (primary or secondary; experimental, survey, observational or spatial data), the data collection type (census or sampling, with probability and non-probability sampling methods) and the data collection method, to data presentation (classification, tabulation, graphical presentation), data analysis (univariate, bivariate, multivariate) and interpretation.]

Stat Note MDS


Introduction

Statistics

Statistics Overview

(Analytical approach)

Statistics: Different authors have defined statistics in different ways over time. A definition should lay down the meaning and scope of the subject. The word statistics is used in two senses, namely singular and plural.

In the plural sense it denotes numerical facts (data), whereas in the singular sense it denotes statistical methods.

Among the many definitions, C.E. Croxton and D.J. Cowden give a precise definition of statistics, and Prof. Horace Secrist gives the best definition.

According to C.E. Croxton and D.J. Cowden,

The branch of mathematics that deals with the collection, classification, analysis and interpretation of numerical data is called statistics.

From this definition, the main divisions of statistics are,

i. Collection of Data,

ii. Classification of data,

iii. Analysis of Data,

iv. Interpretation of data.

According to Prof. Horace Secrist,

Aggregates of facts, affected to a marked extent by a multiplicity of causes, numerically expressed, enumerated or estimated according to a reasonable standard of accuracy, collected in a systematic manner for a predetermined purpose and placed in relation to each other, are called statistics.

This definition gives the characteristics of statistics. They are,

It is an aggregate of facts.

It is affected to a marked extent by a multiplicity of causes.

It is numerically expressed.

It should be enumerated or estimated.

It should be collected in a systematic manner for a predetermined purpose.

It should be collected with a reasonable standard of accuracy.

It should be placed in relation to each other.

Need for quantifying the data: As per the definition of statistics given above, the subject deals mainly with numerical data. Hence statistics can be applied only when numerical data are available. But in many situations the researcher cannot get purely numerical data; the data will be a mixture of numerical and qualitative characteristics.

So, to draw valid conclusions from qualitative characteristics, it is essential to convert the qualitative information into quantitative form by assigning ranks or scale values.

Collection of data: The first and foremost step of the research process is data collection. Before the statistical investigation, the researcher has to know the nature, objective and scope of the investigation, the time and type of investigation, and the desired depth of study.

The two types of investigation are

Census/complete enumeration method.

Sampling method.

Census: A data collection method that collects information from each and every unit of the population is called the census method. That is, in this method the data are collected from all the population units. For example, to study the average height of the students of a particular college, the investigator has to measure the height of every student in that college.

Population: The collection of individual items with which the investigation is concerned is called the population.

Merits:

The data are collected from all the items under study. Hence bias is minimized, and the data are more accurate and reliable.

The highest accuracy can be maintained.

Results drawn from the data collected through this method are more representative and true.

Demerits:

When the coverage area is wide, this method is not suitable, because it takes more money, time and energy.

The cost is high; hence only an organization that possesses large finances and manpower can adopt this method.

If the population size is infinite, this method is not suitable.

If the study involves a destructive type of product, this method is not suitable.

Destructive type product: A product that cannot be used after its initial use is called a destructive type product.

Types of population:

The two types of population are,

Hypothetical population.

Existent population.

The collection of concrete objects or persons under investigation constitutes the existent population.

The existent population may be finite or infinite.

An existent population that consists of a countable number of individuals or objects is called a finite population.

An existent population that consists of an uncountable number of individuals or objects is called an infinite population.

E.g., in a study of the economic level of the students of a particular college, all the students of that college constitute the population, and their number is finite. Hence it is a finite population.

E.g., in a study of the characteristic pattern of the stars in the sky, all the stars in the sky constitute the population. But they cannot be counted. Hence it is treated as an infinite population.

The collection of non-concrete objects that exist only in imagination and are uncountable constitutes the hypothetical or theoretical population. For example, in a study of the pattern of results of a coin tossing experiment, the researcher cannot get a concrete collection of results. He can only imagine the results as heads and tails.

Hence the results of the coin tossing experiment constitute a hypothetical population.

Sampling method: The method or technique adopted to select the sample from the population is called the sampling method.

Sample: A finite subset or small part of the population that reflects the characteristics of the population, and that is used to make valid inferences about the entire population, is called a sample.

Objectives:

To get more information about the population with minimum effort, time and cost.

To estimate the population parameters through the sample statistics.

To obtain the degree of precision of the result through the sample statistics.

To draw valid conclusions about the population.

To give the desired result with the required precision at minimum cost.

To identify the true representative of the population.

Merits:

It is more economical, i.e., it saves time, money and energy because of the limited number of investigation units.

It helps to achieve a high degree of accuracy.

It helps to get reliable results for the population.

It serves as an alternative to the census method.

It makes the survey easier to organize and administer.

If only an approximate result is required, this method can be used.

Demerits:

Careful planning must be followed; otherwise the result will be incorrect and biased.

The result depends on the investigator. The attitude of the personnel will affect the result.

There is a possibility of large errors.

Hence,

The sample must be a true representative of the population.

Experienced personnel have to be employed for the fieldwork.

The sample size must be adequate.

The coverage area should be small.

The two types of sampling methods are,

Probability sampling

Non-probability sampling.

Probability sampling: The sampling method that follows a standard procedure and selects the units with a pre-defined probability is called probability sampling.

The six types of probability sampling method are,

1. Simple (Equal) Random (chance) Sampling.

2. Stratified Random Sampling.

3. Systematic Random Sampling.

4. Cluster Sampling.

5. Multistage Sampling.

6. Multi-phase Sampling.

Simple random sampling: The sampling procedure that selects the sample from the population in such a way that each population unit has an equal and independent chance of being included in the sample is called simple random sampling.

This is the simplest method of selecting a sample. It is applicable when the population is homogeneous in nature. A simple random sample can be selected in two ways.

Lottery method:

In this method, all the population units are numbered or named. Then the numbers or names are written on different slips or cards of the same size and shape, so that no card can be distinguished from the others.

These cards are placed in a box and shuffled well so that no particular card gets any preference in selection. From that box the cards are drawn one by one until the desired number of units is selected.

The only drawback of this method is that it is not suitable when the population size is very large.
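The lottery method can be imitated on a computer. The following is only a minimal sketch, assuming the population units are held in a list; the function name lottery_sample and the student labels are made up for illustration. Python's random.sample draws distinct units, each with an equal chance, much like drawing slips from a well-shuffled box.

```python
import random

def lottery_sample(population, n, seed=None):
    """Draw n distinct units, each with an equal chance of selection."""
    rng = random.Random(seed)          # a seed makes the draw repeatable
    return rng.sample(population, n)   # selection without replacement

# Hypothetical population of 50 numbered students.
students = [f"student_{i}" for i in range(1, 51)]
print(lottery_sample(students, 5, seed=1))
```

Unlike physical slips, this approach stays practical when the population is very large.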

Random number table method:

In this method the sample is selected from the population by making use of a random number table. A table that contains random digits arranged in rows and columns is known as a random number table.

Selection process:

A random number table is an arrangement of random digits, grouped into numbers (for example, five-digit numbers), in rows and columns.

The selection may proceed row-wise or column-wise.

Assign numbers to the population units.

Decide the sample size.

Count the number of digits in the population size, say k.

Read a k-digit number from the random number table.

If the number read is greater than the population size, ignore it and read the next number.

If the number read is less than or equal to the population size, include the corresponding population unit in the sample.

Continue this process until the required number of sample units is selected.
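The selection process above can be sketched in code. This is an illustration under stated assumptions, not a standard routine: the function name select_from_table is hypothetical, the digit string is made up rather than taken from any published table, and repeated numbers are skipped on the assumption that sampling is without replacement.

```python
def select_from_table(table_digits, population_size, sample_size):
    """Read k-digit numbers row-wise and keep those from 1 to population_size."""
    k = len(str(population_size))          # number of digits in the population size
    chosen = []
    for start in range(0, len(table_digits) - k + 1, k):
        number = int(table_digits[start:start + k])
        # Ignore 0, numbers above the population size, and repeats.
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen

digits = "0347437386369647366146986371629774246762"  # illustrative digits only
print(select_from_table(digits, 50, 5))  # [3, 47, 43, 36, 46]
```

With a population of 50 units, k is 2, so the digits are read in pairs; 73, 86, 96 and 61 are ignored because they exceed 50, and the repeated 47 and 36 are skipped.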

Several standard random number tables are available. Some of them are,

L.H.C. Tippett's random number table:

10,400 four-digit numbers.

Fisher and Yates's random number table:

15,000 digits arranged in two-digit groups.

Kendall and Babington Smith's random number table:

25,000 four-digit numbers.

RAND Corporation's random number table:

2,00,000 five-digit numbers.

Merits:

There is less chance for personal bias.

As the sample size increases, the selected sample becomes more representative.

Sampling errors can be measured.

This method saves money, time and labour.

Demerits:

This method requires a complete list of the population. But in many enquiries such a list is not available.

As the sample size decreases, the sample may not represent the population.

If the population units are heterogeneous in nature, this method cannot be employed.

Stratified random sampling: A sampling method that selects the sample from a heterogeneous population by dividing the population into homogeneous sub-groups, called strata, is called stratified random sampling.

Since the population is heterogeneous in nature, it is divided into strata that are homogeneous in nature. From each stratum a number of units is selected at random, and together these units constitute the sample.

The two types of stratified random sampling method are,

Proportional method: If the sample is selected from each stratum in proportion to the stratum's size, then the sample is selected by the proportional allocation method.

Optimum method: If the sample is selected from each stratum by considering the cost of sampling (and, in the usual formulation, the variability within the stratum), then the sample is selected by the optimum allocation method.

That is, the sample size in each stratum is decided with the cost in view.
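Under proportional allocation, a stratum of size N_h in a population of size N receives about n × N_h / N of the n sample units. The sketch below illustrates only this arithmetic; the function name and the stratum sizes are hypothetical, and because the shares are rounded they may not add up exactly to n in every case.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate n sample units to strata in proportion to their sizes."""
    total = sum(stratum_sizes.values())
    return {name: round(n * size / total) for name, size in stratum_sizes.items()}

# Hypothetical college of 1,000 students divided into three strata.
sizes = {"first_year": 500, "second_year": 300, "third_year": 200}
print(proportional_allocation(sizes, 100))
# {'first_year': 50, 'second_year': 30, 'third_year': 20}
```

Within each stratum, the allotted number of units would then be drawn by simple random sampling.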

Merits:

The sample selected by this method is more representative of the population.

It ensures greater accuracy.

For a heterogeneous population this method is more reliable.

Demerits:

The process of dividing the population into strata requires more time, money and experience.

If the stratification is not proper, then sampling bias will prevail in the sample.

Systematic sampling: A probability sampling method that selects the sample at regular intervals from an up-to-date, complete list of the population units is called systematic sampling. In this method only the first sampling unit is selected at random, so it is also known as quasi-random sampling. Once the first unit is selected, the remaining units of the sample are determined automatically by the sampling interval.

This method can be used if a complete and up-to-date list of the population units is available.

Selection procedure:

Assume that we have to select n units from N population units.

Arrange the items in numerical, alphabetical, geographical or any other order.

Find the sampling interval k = N / n, so that nk = N.

Select the random start i such that 1 ≤ i ≤ k.

Select the i-th, (i+k)-th, (i+2k)-th, ..., (i+(n-1)k)-th units to constitute the systematic sample.

Hence the random start determines the whole sample.
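The selection procedure can be written out as a short sketch. It assumes, as the procedure does, that N is an exact multiple of n; the function name systematic_sample is made up for illustration.

```python
import random

def systematic_sample(units, n, seed=None):
    """Select every k-th unit after a random start i, where 1 <= i <= k."""
    N = len(units)
    if N % n != 0:
        raise ValueError("this sketch assumes N is a multiple of n")
    k = N // n                               # sampling interval
    i = random.Random(seed).randint(1, k)    # random start
    # Units i, i+k, i+2k, ..., i+(n-1)k, counting positions from 1.
    return [units[i - 1 + j * k] for j in range(n)]

population = list(range(1, 101))             # units numbered 1 to 100
print(systematic_sample(population, 10, seed=3))
```

Only the start is random; once it is fixed, every other unit in the sample follows from the interval k.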

Merit:

This method is simple and operationally more convenient.

The time and work involved in the selection procedure are less.

Demerit:

This sample may not represent the population.

If the population size is not a multiple of the sample size, one cannot get exactly the required number of sampling units.

Cluster sampling: A probability sampling method that groups the population units into groups called clusters, formed by the similarity of objects, and selects the sampling units by selecting whole clusters, is known as cluster sampling.

Cluster sampling resembles stratified random sampling. The difference is that in cluster sampling only some clusters are selected, and all the units of the selected clusters constitute the sample, whereas in stratified sampling the sampling units are selected from within every stratum.
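The contrast with stratified sampling shows up clearly in code. In this minimal sketch the clusters are assumed to be given as a dictionary of unit lists; the function name and the village data are hypothetical.

```python
import random

def cluster_sample(clusters, m, seed=None):
    """Select m whole clusters at random; every unit in them enters the sample."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)   # pick cluster names, not units
    return [unit for name in chosen for unit in clusters[name]]

# Hypothetical population grouped into three villages (clusters).
villages = {
    "village_A": ["a1", "a2", "a3"],
    "village_B": ["b1", "b2"],
    "village_C": ["c1", "c2", "c3", "c4"],
}
print(cluster_sample(villages, 2, seed=0))
```

No list of individual units is needed before selection, only a list of clusters, which is why the method suits large-scale surveys.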

Merits:

It introduces flexibility in sampling method.

It is suitable in large-scale survey, where the list preparation is difficult.

Demerits:

It is less accurate than other methods.

Non-probability sampling: The sampling method that does not follow any standard procedure and selects the units with unknown probability is called non-probability sampling. This method is the direct opposite of the probability sampling method.

The three types of non-probability sampling methods are,

1. Judgment or purposive sampling.

2. Convenience sampling.

3. Quota sampling.

Judgment/purposive sampling: The sampling method that selects the sample units to achieve a specific purpose is called judgment or purposive sampling. In this method the sampler's choice plays the major role in selecting the sampling units.

For example, to study the cultural activity of the students of a particular college, the sampler has to select the students who are interested in cultural activities. Only then will the study lead to a valid conclusion. Otherwise the sample will not reflect the characteristic under study, namely the cultural skill of the college. Hence the investigator has to find the students who are involved in that activity and collect the information from them.

Merits:

It is a simple method.

The sample collected is more representative.

This method can be adopted for public policy, decision making, etc.

Demerits:

Due to the sampler's interest, the sample may not be a true representative of the population.

It is difficult to correct sampling errors.

The estimates will not be accurate.

Quota sampling: This method is similar to stratified random sampling.

In this method the population is divided into various quotas, and then the sample is selected from each quota. The sample size per quota is fixed by personal judgment. This is also known as the stratified purposive sampling method.

Merits:

This method saves money and time.

Demerits:

The result depends on the investigators.

Personal bias is possible.

Since the sample selection is not based on random sampling, sampling errors cannot be estimated.

Convenience sampling: The sampling method that selects the sample units based on the convenience of the investigator is called convenience sampling. This method can be used if,

The universe is not clearly defined.

The sample unit is not clear.

A complete list is not available.

Demerits:

The sample is not a true representative of the population.

The results are biased.

But this method can be used for a pilot study.

PRESENTATION OF DATA

After the data collection is over, the researcher has raw data, i.e., the information prior to proper arrangement. Raw data are huge and confusing. As such, the researcher cannot carry out analysis on them, and they do not furnish any useful information. So, to condense the data and present them in a compact manner, we go for presentation of data. Presentation of data has three main types. They are,

1. Classification,

2. Tabulation, and

3. Graphical representation.

Classification: The process of arranging the data into sequences and groups according to their common characteristics, and separating them into different but related parts, is called classification.

Objectives:

The raw data are classified,

To condense the mass of data.

To present the data in a simpler form.

To bring out the similarities and dissimilarities among the data.

To facilitate comparison and statistical treatment.

To bring out relationships.

To facilitate further analysis.

To eliminate unnecessary data.

Rules for classification:

The classes should be rigidly defined, i.e., there should not be any ambiguity in their definitions.

The classes should not overlap, i.e., each item of data must have its place in only one class.

The classification must be flexible enough to adjust to new situations.

The items included in the totals and sub-totals of the classes and sub-classes must be the same.

Types of classification:

Geographical classification: Classifying the data based on the area of occurrence, such as states, districts, taluks, etc., is called geographical classification.

Chronological classification: Classifying the data based on the time of occurrence, such as decades, years, months, etc., is called chronological classification.

Quantitative classification: Classifying the data based on a characteristic that is capable of quantitative measurement, like age, price, weight, etc., is called quantitative classification.

Qualitative classification: Classifying the data based on qualitative characteristics, such as sex, honesty, literacy, etc., is called qualitative classification.

That is, the presence or absence of the characteristic is presented in this type of classification.

Tabulation: The systematic arrangement of numerical data in rows and columns in accordance with some characteristics is called tabulation.

Objectives:

To simplify complex data.

To clarify characteristics of data.

To facilitate comparison.

To detect errors and omissions in the data.

To facilitate statistical processing.

The parts of table are:

1. Table number,

2. Title,

3. Head note,

4. Caption,

5. Stubs,

6. Body of table,

7. Foot-note,

8. Source-note.

The table number is used to identify the table and to refer to it in future. For reference and explanation, the columns may also be numbered.

Each table has to be given a suitable title, that is, one that describes the content of the table.

The head note is a statement about the table that is placed below the table title within brackets. Usually it gives the units of measurement of the table, such as in millions, in crores, etc.

The headings of the columns are called captions. They must be brief and self-explanatory. A caption may have sub-headings.

The row headings are called stubs.

The most important part of the table, which contains the numerical information, is called the body of the table.

A footnote is used to provide any explanation about the items in the table. A source note states the source from which the data were taken.

Types of tabulation:

1. One-way tabulation,

2. Two way tabulation, and

3. Manifold tabulation.

One-way Table: The table that displays information on a single variable is called a one-way table or univariate table. The variable may be discrete or categorical.

Two-way Table: The table that displays information on the categories of one variable against the categories of another variable is known as a two-way table or bivariate table.

Manifold table: The table that shows information on the categories of more than two variables is known as a manifold table.

Frequency Distribution: A type of tabulation that summarizes the raw data in a table of variable values, or class intervals, and their corresponding frequencies is known as a frequency table. It may be of one-way, two-way or manifold type.

Moreover, a frequency table

1) Organizes the data in a compact manner without loss of essential information.

2) Describes how the total frequency is distributed over different classes or discrete points.

There are three types of frequency tables. They are,

1. Discrete frequency table.

2. Continuous frequency table.

3. Relative frequency table.

Discrete Frequency table: A frequency table that shows the distribution of frequencies at different distinct values of the variable is known as a discrete frequency table.

Procedure to form discrete frequency table:

1. Draw a table with three columns, namely variable, tally marks and frequency.

2. Take the first observation.

3. Write down the observation in the variable column and put a tally mark (|) against it in the tally marks column.

4. Take the next observation.

5. Check whether the observation is already entered in the variable column.

6. If it is entered, put another tally mark against it. Otherwise, enter it as in step 3.

7. Repeat steps 4 to 6 until all the observations are entered in the table.

8. Count the number of tally marks for each value of the variable and put the totals in the frequency column.

9. The resultant table is called the discrete frequency table.

Note: If a row already has four tally marks, the next occurrence of that value is marked by drawing a cross stroke over the four bars. This grouping into fives makes counting easier.
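The tallying procedure above amounts to counting how often each distinct value occurs. A minimal sketch using Python's collections.Counter follows; the function name and the data (number of children in ten families) are made up for illustration.

```python
from collections import Counter

def discrete_frequency_table(observations):
    """Return (value, frequency) pairs sorted by value."""
    counts = Counter(observations)     # one tally per occurrence
    return [(value, counts[value]) for value in sorted(counts)]

children = [2, 1, 3, 2, 0, 2, 1, 3, 2, 1]   # made-up observations
for value, freq in discrete_frequency_table(children):
    print(value, "|" * freq, freq)          # variable, tally marks, frequency
```

For these data the frequencies are 1, 3, 4 and 2 for the values 0, 1, 2 and 3, and they add up to the ten observations.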

Continuous Frequency table: A frequency table that shows the distribution of frequencies over different class intervals of values is known as a continuous frequency table.

Procedure to form Continuous frequency table:

1. Draw a table with three columns, namely variable, tally marks and frequency.

2. Find the smallest and largest observations in the data set.

3. Decide the class interval.

4. Write down the class limits with equal class intervals under the heading variables.

5. Take the first observation.

6. Decide in which class it falls.

7. Put a tally mark (|) against the variable class in the tally mark column.

8. Take the next observation.

9. Repeat steps 6 to 8 until all the observations are entered in the table.

10. Count the number of tally marks for each class and put the totals in the frequency column.

11. The resultant table is called the continuous frequency table.
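The procedure can be sketched as follows. The sketch assumes equal class intervals in which the lower limit is included and the upper limit is excluded (so 160 falls in 160-170, not 150-160); the function name, the starting limit and the height data are all illustrative choices.

```python
def continuous_frequency_table(observations, lower, width, num_classes):
    """Count observations in classes [lower, lower + width), and so on."""
    frequencies = [0] * num_classes
    for x in observations:
        index = int((x - lower) // width)    # step 6: decide the class
        if not 0 <= index < num_classes:
            raise ValueError(f"{x} falls outside the class limits")
        frequencies[index] += 1              # step 7: one tally mark
    return [
        ((lower + j * width, lower + (j + 1) * width), frequencies[j])
        for j in range(num_classes)
    ]

heights = [152, 158, 161, 165, 167, 170, 171, 174, 178, 183]  # made-up data, in cm
for (low, high), freq in continuous_frequency_table(heights, 150, 10, 4):
    print(f"{low}-{high}", "|" * freq, freq)
```

For these data the classes 150-160, 160-170, 170-180 and 180-190 receive frequencies 2, 3, 4 and 1.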

Relative Frequency Table: A frequency distribution in which the frequencies are expressed as fractions or percentages of the total number of observations is known as a relative frequency distribution.

Note that the sum of the relative frequencies is equal to one when the frequencies are expressed as fractions, and to 100 when they are expressed as percentages.
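As a small illustration, each relative frequency is the class frequency divided by the total number of observations. The function name and the frequencies below are made up.

```python
def relative_frequencies(frequencies):
    """Express each frequency as a fraction of the total."""
    total = sum(frequencies)
    return [f / total for f in frequencies]

freqs = [2, 3, 4, 1]                           # made-up class frequencies
fractions = relative_frequencies(freqs)
percentages = [100 * r for r in fractions]
print(fractions)      # fractions that add up to one
print(percentages)    # percentages that add up to 100
```

Here the fractions are 0.2, 0.3, 0.4 and 0.1, which correspond to 20%, 30%, 40% and 10%.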

Graphical representation:

Classification and tabulation are used to present the data in a neat, concise, systematic and understandable manner. But when a large amount of information extends over a large number of columns, it is difficult to understand the significance of the data. Hence statisticians introduced diagrams and graphs.

Classification is the process of grouping data into homogeneous groups or categories. Tabulation is the process of presenting the classified data in tabular form. The process of highlighting the salient features of the study through graphs and charts is called graphical representation. This type of presentation makes the data easy to understand. Moreover, attractive graphs and charts can be understood at a glance even by a layman.

Merits:

Diagrams are attractive and create interest in the mind of the reader.

Diagrams are easily understandable even to the layman.

In interpretation, a diagram saves much time.

i.e., people may not like to go through numerical figures, but they may like to go through diagrams.

Diagrams make data simple.

i.e., diagrams are remembered at a glance, and readers can easily understand the pattern of the data.

A diagram facilitates comparison of two or more sets of data.

Diagrams reveal patterns more readily than data in a table.

Limitations:

Diagrams cannot be used for further analysis.

Diagrams show approximate values only.

They expose only limited facts.

(i.e.) all details cannot be presented in the form of diagrams.

Construction of diagrams needs some intelligence and experience.

They supplement tabulation; they are not an alternative to it.

Rules for making diagrams:

Every diagram must be given a suitable title in bold letters.

The title should convey the main fact depicted by the diagram.

Sub-headings may also be given.

The title should be brief and self-explanatory.

To permit comparison, the diagram must be drawn accurately and neatly.

Each diagram should be numbered for further reference.

The type of diagram should be selected according to the nature of the data.

When many items are shown in the diagram through different patterns such as dots, cross-hatching, etc., an index must be given.

The diagram must be simple enough to be understood by the layman.

There are two types of graphical representation. They are,

1. Graphs,

a. Frequency curves,

b. Frequency polygon, and

c. Ogives.

i. Less than ogives, and

ii. More than Ogives.

2. Charts/ Diagrams.

a. Bar chart,

i. Simple bar chart,

ii. Multiple bar chart,

iii. Stacked bar chart, and

iv. Percentage bar chart.

b. Pie-chart, and

c. Histogram.

One-dimensional diagram: A diagram that is drawn for a single set of data is called a one-dimensional diagram. The bar and pie diagrams belong to this type.

Bar chart: The visual representation of qualitative, categorical or discrete numerical data by bars is called a bar chart. The heights of the bars are proportional to the frequencies. The bars may be horizontal or vertical. The distance between the bars is kept uniform. Bar charts are drawn only for discrete quantitative or categorical variables.

The types of bar diagrams are

Simple bar chart.

Multiple bar chart,

Stacked bar chart.

Simple bar chart: The bar diagram that is drawn for a single set of categorical or numerical data is called a simple bar diagram.

Multiple bar chart: The bar diagram that is drawn for a single variable observed under more than one phenomenon is called a multiple bar diagram. It facilitates comparison. The bars for the categories are drawn side by side. They are differentiated by different colours or patterns such as lines, dots, etc.

Stacked bar chart: A type of bar diagram that is drawn for a single variable with any number of (categorical or numerical) categories is called a stacked bar diagram. In this diagram each bar is divided into portions, one for each category of the categorical variable.

Percentage bar chart: A stacked bar chart drawn for the percentage frequencies of a categorical variable, with bars of equal height, is called a percentage bar diagram. The bars are divided among the categories according to their percentages. In this case all the bars are of equal height, representing 100%. In the stacked bar diagram, by contrast, the heights of the bars are unequal, that is, they are proportional to the frequencies of the categories of the base variable.

Pie diagram: The graphical representation of the categories of a single variable in the form of a circle is called a pie diagram. In this diagram the circle is divided into slices based on the frequencies. This type of diagram can be understood at a glance. The slices are obtained by taking the whole of the data as equal to 360 degrees.
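Since the whole of the data corresponds to 360 degrees, the angle of each slice is its frequency divided by the total frequency, multiplied by 360. The sketch below shows only this arithmetic; the function name and the category frequencies are hypothetical.

```python
def pie_angles(frequencies):
    """Angle of each slice in degrees: 360 * frequency / total."""
    total = sum(frequencies.values())
    return {category: 360 * f / total for category, f in frequencies.items()}

# Hypothetical frequencies of three categories.
modes = {"bus": 50, "cycle": 30, "walk": 20}
print(pie_angles(modes))   # {'bus': 180.0, 'cycle': 108.0, 'walk': 72.0}
```

The angles add up to 360 degrees, so the slices fill the circle exactly.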

Relative Frequency Histogram: A histogram constructed with the help of relative frequencies rather than absolute frequencies is known as relative frequency histogram.

Histogram: A bar diagram in which the bars are constructed continuously on the class intervals, without leaving space between them, in such a way that the heights of the bars are proportional to the frequencies of the respective classes is known as a histogram.

Frequency polygon: The graph formed by plotting the frequencies against the mid-points of the classes of a continuous frequency distribution and joining the points by straight lines is known as a frequency polygon.

It can also be obtained from the histogram by joining the mid-points of the tops of the bars with straight lines.
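The points of a frequency polygon are the pairs (class mid-point, frequency), where the mid-point is the average of the lower and upper class limits. A minimal sketch with made-up class limits and frequencies:

```python
def polygon_points(class_limits, frequencies):
    """Pair each class mid-point with the frequency of that class."""
    return [((low + high) / 2, f) for (low, high), f in zip(class_limits, frequencies)]

limits = [(150, 160), (160, 170), (170, 180), (180, 190)]   # made-up classes
freqs = [2, 3, 4, 1]
print(polygon_points(limits, freqs))
# [(155.0, 2), (165.0, 3), (175.0, 4), (185.0, 1)]
```

Joining these points in order by straight lines gives the polygon.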

Frequency Curve:

The graph formed by plotting the frequencies against the mid-points of the classes of a continuous frequency distribution and joining the points by a free-hand curve is known as a frequency curve.

It can also be obtained from the histogram by joining the mid-points of the tops of the bars with a free-hand curve.

Ogives:

The graph obtained by plotting the cumulative frequencies against the class limits of a continuous frequency distribution is known as an ogive.

The two types of ogives are,

1. Less than Ogive.

2. More than Ogive.

Less than Ogive:

The graph obtained by plotting the less-than cumulative frequencies against the upper class limits of a continuous frequency distribution and joining the points by a smooth curve is known as the less than ogive.

More than Ogive:

The graph obtained by plotting the more-than cumulative frequencies against the lower class limits of a continuous frequency distribution and joining the points by a smooth curve is known as the more than ogive.
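The two kinds of cumulative frequency can be computed as running totals, one from the first class onward and the other from the last class backward. This is a small sketch with made-up frequencies; the function names are illustrative.

```python
from itertools import accumulate

def less_than_cumulative(frequencies):
    """Running totals from the first class; plotted against upper class limits."""
    return list(accumulate(frequencies))

def more_than_cumulative(frequencies):
    """Running totals from the last class; plotted against lower class limits."""
    return list(accumulate(frequencies[::-1]))[::-1]

freqs = [2, 3, 4, 1]                 # made-up class frequencies
print(less_than_cumulative(freqs))   # [2, 5, 9, 10]
print(more_than_cumulative(freqs))   # [10, 8, 5, 1]
```

The less-than values rise to the total frequency, while the more-than values start at the total and fall.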

Choice of Data:

Before the data collection, the type of data should be decided, that is, primary data or secondary data. The choice of data depends on,

Nature and scope of study,

Availability of finance and time,

The degree of accuracy needed,

Nature of investigation (individual or government study).

Generally, primary data are preferable for most surveys.

Primary data: The first-hand information that is collected for the first time by the investigator for the purpose of his study is called primary data.

This is first-hand information.

This data is original in character.

The primary data collection methods: To collect the primary data five methods are commonly used. They are,

1. Direct personal investigation,

2. Indirect oral investigation,

3. Questionnaire method,

4. Local correspondent method, and

5. Enumeration method.

Direct personal investigation: In this method, the investigator personally meets the informants and collects the information by asking them questions. The persons from whom the information is collected are called informants. This method is intensive rather than extensive. The investigator must be a keen observer, and tactful and courteous in behavior.

Suitability:

This method can be employed when,

High accuracy is needed.

The coverage area is small.

Confidential data are needed.

An intensive study is needed.

Sufficient time is available.

Merits:

Original (first hand) data is collected.

The collected data are highly reliable.

The high degree of accuracy can be achieved.

Due to the personal approach, the response will be higher.

Correct information can be extracted from the informant.

Cross-examination is possible.

Misinterpretation on the informant's part can be avoided.

Demerits:

This method is not advisable when the coverage area is large and the available time and finance are limited.

The possibility of bias is higher.

An untrained investigator cannot bring good results.

It is expensive and time-consuming.

Indirect oral investigation:

If the informant is unwilling (reluctant) to provide information, this method can be used. In this method the investigator does not meet the actual informant. Instead, the investigator meets witnesses, third parties or friends who are in touch with the informant. The investigator interviews the people who are directly or indirectly connected to the informant and collects the information from them.

For example: When information relating to gambling, drinking or smoking habits is collected, the informant will not provide the information, and may not even respond to the study. In such situations the investigator has to approach friends, neighbors, etc., of the actual informant to collect the information. Usually the police department adopts this method.

Example: Police department, riots, alliance, etc.,

Merits:

It is a simple and convenient method.

It is suitable when the investigation area is large.

It saves time, money and labour factors.

The information is unbiased.

Adequate information can be collected.

Demerits:

The result is affected by the third parties' prejudice.

To get adequate information, a large number of persons may have to be interviewed.

An interview with an unsuitable person will spoil the result.

Bad information will spoil the result.

Mailed questionnaire method:

In this method, a separate questionnaire consisting of a list of questions for the enquiry is prepared. This questionnaire is sent to the informants, requesting them to extend their co-operation by filling in the questionnaire correctly and returning it.

To get a quick and better response, the postal expense is borne by the investigator. After the completed questionnaires are received back, the analysis is carried out. The research workers of state and central governments adopt this method.

Suitability:

This method is advisable, if,

The coverage area is wide.

There is a legal compulsion to supply information, so that non-response risk is eliminated.

Merits:

This method is the most economical compared with the other methods.

This method of data collection covers a wide area and reduces money, time and labour.

Bias is less since the data is collected directly from the respondents.

Demerits:

There is no direct contact between the investigator and respondent.

The accuracy and reliability are less.

This method is suitable among literate people only.

There is the possibility of delay in receiving questionnaire.

The people may furnish wrong information.

Asking supplementary questions is not possible.

Framing questionnaire:

In the mailed questionnaire method, the questionnaire is the communication medium between the investigator and the informant. Hence, the success of the investigation depends on the questionnaire. So the questionnaire must be designed with adequate skill, efficiency and experience.

Characteristics of Good questionnaire:

Number of questions should be minimum.

Questions should be short and simple to understand.

Questions should be arranged in logical order.

Questions may have multiple-choice answers.

Personal questions are to be avoided.

The questions that require calculations are to be avoided.

Questions of sensitive and personal type should be avoided.

The wording of the questionnaire should not hurt the feelings of respondents.

Questionnaire information must be given.

Questionnaire should look attractive.

Pre - Test:

After the questionnaire is prepared, a pre-test is to be done.

The process of refining and validating the questionnaire by collecting information from a small number of related respondents with the framed questionnaire, with a view to overcoming its shortcomings, is called the pre-test. If any shortcoming is found in the questionnaire, the necessary correction is incorporated in the questionnaire. After the required changes are incorporated, the pilot study is employed.

Pilot study:

Whenever the investigator has to deal with a large survey, he should not plunge into it directly. After the pre-test is over, a pilot study is carried out to overcome any shortcomings in the analysis. This is a small-scale survey with a small number of persons. The data collected through the pilot study is analyzed. If any technical difficulty is found in the analysis, the questionnaire is altered. The main survey is undertaken if the pilot study does not reveal any analytical difficulties. (See Figure 1.)

Local Correspondents Method:

In this method, instead of the researcher collecting the information, local agents are appointed to collect it. They collect the information from the informants and send the collected data to the actual researcher or investigator. The data collection is done according to the local correspondent's own taste. Newspaper agencies, magazines, etc. adopt this method.

Suitability:

If the data is required regularly from the wide area, this method can be used.

Merits:

Extensive information is collected.

This is the cheapest and most economical method.

Information will be collected regularly.

Demerits:

Information may be biased.

The degree of accuracy cannot be maintained.

Data may be of duplicate nature.

Enumerator method:

In this method, a number of enumerators are selected and trained to collect the data. They are provided the questionnaires and trained to fill up the questionnaire. They meet the informant along with the questionnaire and collect the data by filling up the questionnaire. The enumerator explains the object, purpose of the study to the informant.

Merits:

Intensive information is collected.

This method yields reliable and accurate results.

This method is helpful even if the informants are illiterate, because the enumerator records the information.

Due to personal contact, the non-response is less.

Demerits

This method requires more money and time.

Personal bias of enumerator leads to wrong conclusion.

Secondary data: The second hand information that is collected from already existing sources for the study is called secondary data. That is, the researcher gets the required information from data that has already been collected by someone else for his own purpose. The sources of secondary data are,

Published sources:

The data published by various governments and by local and international agencies are published data.

International publications:

IMF, IBRD, ICAFE and UNO etc. publish data at regular time intervals.

Central and state governments:

Departments of the union and state governments regularly publish data. Other sources are the RBI Bulletin, the Census of India, the Indian Trade Journal, etc.

Semi-official publications:

The semi-government institutions like district, panchayat, municipal, corporation etc. publish statistical data.

Research institutions publication:

Research institutions such as the Indian Statistical Institute (ISI), the Indian Agricultural Statistics Research Institute (IASRI), etc. publish data.

Journals and newspapers:

Some journals like Indian Finance, Commerce, etc. publish current and important material on statistics and socio-economic problems.

Unpublished sources:

There are various unpublished data sources. Various government and private offices maintain them. They also include the studies carried out by researchers in universities or research institutions.

Precautions In Using Secondary Data

Secondary data may not be reliable, and data collected long ago may be inadequate for the present study. So before using secondary data in the analysis, some precautions must be taken.

The precaution steps are,

Suitability of data:

The available data should be suitable for the study. This characteristic is to be examined by the investigator himself. The data should be coherent with the scope of the present analysis.

Adequacy of data:

After suitability is tested, the data must be checked for adequacy. That is, adequate data must be extracted from the source to carry out the analysis.

Reliability of data:

Reliability is checked by testing the findings or results from the data. If the agency has used proper methods to collect the data, the statistics may be relied upon.

DATA ANALYSIS:

The process of drawing or obtaining a representative measure from a raw mass of data is called data analysis. To carry out the analysis, statistical methods are used. Hence it is called statistical data analysis.

The three types of data analysis are

Univariate data analysis.

Bivariate data analysis.

Multivariate data analysis.

Univariate data analysis:

Analyzing or drawing a representative measure for a one-dimensional data set (it may be raw, grouped or ungrouped) is called univariate data analysis. That is, the characteristics of a single data set are studied. The four types of Univariate Data Analysis Tools are,

1. Measures of Central Tendency,

2. Measures of Dispersion,

3. Skewness, and

4. Kurtosis.

Measures of Central Tendency:

A set of statistical tools that yields a single representative measure describing the characteristics of the entire mass of data is known as Measures of Central Tendency. The human mind is incapable of remembering a mass of data. So if there is a summary measure that reveals its characteristics, it will be easy to remember. There are three types of measures of central tendency. They are

Mean,

Arithmetic Mean,

Weighted Mean,

Geometric Mean,

Harmonic Mean.

Median,

Mode,

The characteristics of good average are:

It should be precisely (rigidly) defined.

It should be

Easy to understand.

Easy (Simple) to compute.

Based on all observation.

Capable of further analysis.

Its definition should be in the form of mathematical formula.

It should not be influenced by extreme values.

It should have sampling stability. (Least affected by sampling fluctuations)

Merits of averages:

It facilitates quick understanding of complex data:

The purpose of an average is to represent a group of values in a simple and concise manner. That is, an average condenses the mass of data into a single figure.

It facilitates comparison.

It helps to know about the universe from the sample.

It helps in decision-making.

It establishes mathematical relationship.

Mean: A single representative figure of a mass of data, obtained by adding together all the values and dividing the sum by the total number of observations, is called the mean. That is, if the series $x_1, x_2, x_3, \dots, x_n$ has n observations, then the mean value of this series will be,

\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i \]

This is the most widely used measure of central tendency tool.

Properties:

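1. The sum of deviations taken from the arithmetic mean is zero. (i.e.,) $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$

2. The sum of squares of deviations taken from the mean is minimum, that is, it is not greater than the sum of squares of deviations taken from any other value.

(i.e.,) $\sum_{i=1}^{n} (x_i - \bar{x})^2 \le \sum_{i=1}^{n} (x_i - A)^2$, where A is any value and $\bar{x}$ is the mean of the observations.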
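The definition of the mean and its two properties can be checked numerically. The following is a minimal sketch with a hypothetical data set; the function names are assumptions made for illustration.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


def sum_of_squared_deviations(values, a):
    """Sum of squared deviations of the observations from the value a."""
    return sum((x - a) ** 2 for x in values)


# Hypothetical observations
data = [4, 8, 15, 16, 23, 42]
mean = arithmetic_mean(data)

# Property 1: deviations from the mean add up to zero
sum_of_deviations = sum(x - mean for x in data)

# Property 2: squared deviations are smallest when taken from the mean
about_mean = sum_of_squared_deviations(data, mean)
about_other = sum_of_squared_deviations(data, 20)  # A = 20, any other value

print(mean, sum_of_deviations, about_mean, about_other)
```

Trying other values of A in place of 20 should never give a sum of squares below the one taken about the mean.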

Merits:

It is easy to understand and calculate.

It is used in further calculations.

It is based on all the items.

It provides a good basis for comparison.

It is a more stable measure.

It is considered a good or ideal average.

Demerits:

Mean is unduly affected by extreme values.

It is unrealistic.

It may lead to wrong conclusion.

It is not useful for studying the qualitative characters.

It is not suitable measure in case of highly skewed distribution.

It gives greater importance for bigger values and smaller importance for the smaller values in the series.

It cannot be calculated for a frequency distribution with open-end classes.

Median: A measure of location, calculated from the set of values, that divides the series into two equal parts is called the median. That is, one part of the data set contains the items less than the median and the other part contains the items greater than the median value, and the number of observations on both sides is equal.

1). For ungrouped data:

a. Arrange the observations in either ascending or descending order of magnitude.

b. Find the number of observations in the data set. (i.e., n).

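c. If n is odd, then the median of the data set is the $\left(\frac{n+1}{2}\right)$th observation.

d. If n is even, then the median of the data set is the mean of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations.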
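The steps for ungrouped data can be sketched as follows, using the usual rule that the median is the middle observation when n is odd and the mean of the two middle observations when n is even. The data values are hypothetical.

```python
def median(values):
    """Median of ungrouped data."""
    ordered = sorted(values)      # step a: arrange in ascending order
    n = len(ordered)              # step b: number of observations
    mid = n // 2
    if n % 2 == 1:
        # step c: n odd, the ((n + 1) / 2)th observation (index mid, 0-based)
        return ordered[mid]
    # step d: n even, mean of the (n / 2)th and (n / 2 + 1)th observations
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 3, 9, 1, 5]))  # odd number of observations
print(median([7, 3, 9, 1]))     # even number of observations
```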

2). For grouped data: (Discrete frequency distribution)

1. Form the cumulative frequencies.

2. Find $\frac{N}{2}$, where N is the total frequency.

3. Find the cumulative frequency just greater than $\frac{N}{2}$.

4. The observation (x value) that corresponds to that frequency is the median of the set of observations.

3). For grouped data: (Continuous frequency distribution)

1. Form the cumulative frequencies.

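2. Find $\frac{N}{2}$, where N is the total frequency.

3. Find the cumulative frequency just greater than $\frac{N}{2}$.

4. Find its corresponding class; it is the median class.

5. Find median by using the formula,

\[ \text{Median} = l + \frac{\frac{N}{2} - m}{f} \times c \]

where l is the lower limit of the median class, m is the cumulative frequency of the class preceding the median class, f is the frequency of the median class and c is the width of the median class.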
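The procedure for a continuous frequency distribution can be sketched as below. It assumes the commonly used interpolation formula, median = l + ((N/2 - m) / f) × c, with l the lower limit of the median class, m the cumulative frequency before it, f its frequency and c the class width. The classes and frequencies are hypothetical, and equal class widths are assumed.

```python
def grouped_median(lower_limits, width, frequencies):
    """Median of a continuous frequency distribution with equal class widths."""
    half = sum(frequencies) / 2          # N / 2
    cumulative = 0                       # cumulative frequency before the class
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f > half:        # cumulative frequency just greater than N / 2
            # this is the median class; interpolate inside it
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("frequencies must contain at least one positive value")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50
lower_limits = [0, 10, 20, 30, 40]
frequencies = [5, 8, 12, 7, 3]
result = grouped_median(lower_limits, 10, frequencies)
print(result)
```

Here N/2 = 17.5, the median class is 20-30 (cumulative frequency 25), and the median works out to 20 + (17.5 - 13) / 12 × 10.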

Merits:

It is easy to understand and compute.

It is quite rigidly defined.

It eliminates the effect of extreme items.

It is amenable to further process.

Median can be calculated for even qualitative phenomenon.

Its value generally lies in the distribution.

It can be calculated for frequency distribution with open-end class interval.

This can be located graphically.

Demerits:

If the series is of irregular nature, median cannot be computed.

It ignores the extreme values.

In the case of a continuous series or an even number of observations, the median is estimated rather than calculated exactly.

It is not based on all observations.

It is not amenable to algebraic treatments.

It is affected by the fluctuations of sampling.

It cannot be calculated directly for a frequency distribution with inclusive type class intervals. To calculate the median, the class intervals have to be converted into exclusive type class intervals by adjusting both the limits (upper and lower) by half the gap between consecutive classes.

Mode: The single value that appears more frequently than the other observations in the data set is called the mode.

1). For ungrouped Data:

i). Count the frequency of each observation.

ii). The observation that has occurred the greatest number of times is the mode of that data set.

2). For Grouped data: (Discrete frequency Distribution)

i). From the frequency distribution identify the highest frequency.

ii). The observation corresponding to the highest frequency is the mode of distribution.

3). For Grouped data: (continuous frequency Distribution)

i). From the frequency distribution identify the highest frequency.

ii). The class interval corresponding to the highest frequency is the modal class.

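iii). Find mode by using the formula,

\[ \text{Mode} = l + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times c \]

where l is the lower limit of the modal class, $f_1$ is the frequency of the modal class, $f_0$ and $f_2$ are the frequencies of the classes preceding and following the modal class, and c is the width of the modal class.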
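The steps for a continuous frequency distribution can be sketched as follows. The sketch assumes the commonly used interpolation formula, mode = l + ((f1 - f0) / (2·f1 - f0 - f2)) × c, where f1 is the frequency of the modal class and f0, f2 are the frequencies of its neighbouring classes (taken as 0 when the modal class is first or last). The data are hypothetical and equal class widths are assumed.

```python
def grouped_mode(lower_limits, width, frequencies):
    """Mode of a continuous frequency distribution with equal class widths."""
    # steps i and ii: the class with the highest frequency is the modal class
    i = max(range(len(frequencies)), key=lambda k: frequencies[k])
    f1 = frequencies[i]
    f0 = frequencies[i - 1] if i > 0 else 0                     # preceding class
    f2 = frequencies[i + 1] if i + 1 < len(frequencies) else 0  # following class
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("mode is not defined by this formula for flat data")
    # step iii: interpolate inside the modal class
    return lower_limits[i] + (f1 - f0) / denominator * width


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50
lower_limits = [0, 10, 20, 30, 40]
frequencies = [5, 8, 12, 7, 3]
result = grouped_mode(lower_limits, 10, frequencies)
print(result)
```

If two classes share the highest frequency, `max` simply picks the first one, which reflects the ill-defined nature of the mode in such cases.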

Merits:

It is easy to understand and calculate.

It is not affected by extreme values.

It is simple and precise.

It can be located by mere inspection.

It can be determined by the graphic method.

Its value can be determined even for a distribution with open-end class intervals.

Demerits:

It is ill-defined. (If two observations occur an equal number of times, as in a bi-modal distribution, a unique mode cannot be determined.)

It is not amenable to further mathematical treatment.

It is not based on all observations.

It is difficult to compute, when there are both positive and negative data in the series.

It is stable only when the sample size is large.

If there are both positive and negative values, or if one or more observations are zero, we cannot find the mode of the distribution.

Comparison of Measures of Central Tendency Tools:

Characteristics           | Mean           | Median               | Mode

Precise definition        | Given          | Given                | Not given

Procedure understanding   | Easy           | Easy                 | Easy

Calculation               | Easy           | Easy                 | Easy

Observations utilization  | All observations | Not all observations | Not all observations

Further treatment         | Amenable       | Not amenable         | Not amenable

Sampling fluctuations     | Least affected | Much affected        | Much affected

Effect of extreme values  | Much affected  | Not affected         | Not affected

From the comparison table of Measures of Central Tendency it is noted that, among the tools, the Mean holds many of the characteristics of an ideal average. Hence, the Mean is considered a good or ideal average.

Measures of dispersion: The statistical tool that measures the variation or scatteredness of values from their representative (central) value is called dispersion.

Properties of good measure of variation are,

It should be easy to calculate and understand.

It should be rigorously defined.

It should be based on all observations and amenable to further treatment.

It must have sampling stability.

It should not be affected by extreme values.

The types of measures of dispersion are,

Range,

Variance and Standard Deviation,

Mean deviation.

Range: The simplest measure of dispersion, calculated by subtracting the minimum value from the maximum value of the data set, is called the range.

(i.e.) Range = maximum value - minimum value.

Standard deviation: The most widely used measure of dispersion, defined as the positive square root of the arithmetic mean of the squared deviations of the values from their arithmetic mean, is called the standard deviation. Standard deviation is denoted by $\sigma$.

The deviations are squared so that the negative and positive deviations do not cancel each other.

Formula for calculating standard deviation value is,

\[ \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
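As a minimal sketch, the range and the standard deviation as defined above (the positive square root of the arithmetic mean of squared deviations from the mean, dividing by n) can be computed as follows. The data set and function names are hypothetical.

```python
import math


def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)


def standard_deviation(values):
    """Positive square root of the mean of squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / n)


# Hypothetical observations
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(value_range(data))
print(standard_deviation(data))
```

Because the divisor is n, this corresponds to `statistics.pstdev` in the Python standard library rather than the sample version `statistics.stdev`, which divides by n - 1.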

Merits:

It is rigorously defined.

Its value is always definite.

It is based on all observation of data.

It is amenable for further analysis.

It is less affected by sampling fluctuations.

It serves as the basis for measuring the coefficient of correlation, and for sampling and statistical inference.

This is the most appropriate measure of the variability of a distribution.

As the best measure of dispersion, it possesses most of the characteristics of an ideal measure of dispersion.

Demerits:

It is not easy to understand and calculate.

It gives more weight to extreme values by squaring them.

It cannot be used for comparison.

Co-efficient of variation: Hundred times the co-efficient of dispersion based on standard deviation is known as the co-efficient of variation, that is, $CV = \frac{\sigma}{\bar{x}} \times 100$. To compare the variability of data sets, find the co-efficient of variation of each data set. The data set with the greater co-efficient of variation has more variability.

Bivariate data analysis:

Analyzing or obtaining the representative measure for two sets of variables by considering both the variables simultaneously is called bivariate data analysis. The variables type may be quantitative or qualitative.

The two types of bivariate measures are,

Associative measure and

Functional measure

Associative measure: The measure that is used to measure the inter-relationship between the two types of variables is called associative measure.

The two types of associative measures are,

Correlation and

Chi-square association

Correlation: The statistical method that discovers the amount and the direction of the relationship between two sets of quantitative variables is called correlation. The correlation provides the nature and extent of the relationship.

(i.e.) if the correlation between A and B is -0.48, then the negative sign expresses that the relationship is negative and the value 0.48 expresses the amount of relation between the variables A and B.

The value of correlation will always lie in the interval between -1 and +1.

The nature of correlation:

Positive correlation: The correlation is said to be positive if its value lies between 0 and +1.

Negative correlation: The correlation is said to be negative if its value lies between 0 and -1.

No correlation: There is said to be no correlation if its value is 0.
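These notes do not state which correlation coefficient is meant. As an illustration only, the sketch below assumes the Pearson product-moment coefficient and classifies the nature of the correlation from its sign. The function names and data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two quantitative variables."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def nature_of_correlation(r):
    """Classify a correlation value lying between -1 and +1."""
    if r > 0:
        return "positive correlation"
    if r < 0:
        return "negative correlation"
    return "no correlation"


# Hypothetical paired observations
a = [1, 2, 3, 4]
b = [8, 6, 4, 2]
r = pearson_r(a, b)
print(r, nature_of_correlation(r))
```

The sign gives the direction of the relationship and the absolute value gives its amount.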

Chi-square association: The bivariate method that is used to measure the relationship between two qualitative variables is called chi square association method. This method tests whether the two qualitative variables are dependent or independent.
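A minimal sketch of the chi-square statistic for a two-way table of counts is given below, assuming the usual Pearson form in which each expected count is (row total × column total) / grand total. The table is hypothetical, and all expected counts are assumed to be positive.

```python
def chi_square_statistic(table):
    """Chi-square statistic and degrees of freedom for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # count expected if the two variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, degrees_of_freedom


# Hypothetical 2 x 2 table: rows = one qualitative variable, columns = another
observed_counts = [[20, 30],
                   [30, 20]]
chi2, dof = chi_square_statistic(observed_counts)
print(chi2, dof)
```

The statistic is then compared with the tabulated chi-square value for the given degrees of freedom; a larger computed value points towards dependence between the two variables.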

Friedman test: A non-parametric statistical method that is applied to ranking data set to find the common agreement of ranking between the respondents about the various factors is called Friedman test.
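The notes do not give the test statistic. As a sketch only, the code below assumes the standard Friedman statistic without ties, 12 / (n·k·(k+1)) × ΣRj² - 3·n·(k+1), where n respondents each rank k factors and Rj is the rank total of factor j. The rankings are hypothetical.

```python
def friedman_statistic(rankings):
    """Friedman statistic for n respondents ranking k factors (no tied ranks)."""
    n = len(rankings)        # number of respondents
    k = len(rankings[0])     # number of factors ranked
    rank_sums = [sum(column) for column in zip(*rankings)]  # R_j per factor
    return 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)


# Hypothetical data: each row is one respondent's ranking of three factors
rankings = [
    [1, 2, 3],
    [1, 3, 2],
    [1, 2, 3],
    [2, 1, 3],
]
statistic = friedman_statistic(rankings)
print(statistic)
```

The statistic is zero when every factor has the same rank total; larger values indicate that the respondents rank the factors in a consistent order.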

Logistic regression: This method is used to examine the relationship among a set of variables. That is, the statistical method used to study a dichotomous response variable, which is explained by a number of explanatory variables, is called logistic regression. (The data may be ordinal, interval or ranking data.)

The assumptions for logistic regression are,

Response variable is binary

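The model relating the response and the explanatory variables is log linear, that is, the log-odds of the response is a linear function of the explanatory variables.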
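To illustrate the log-linear assumption, the sketch below computes the probability of a binary response from a linear combination of explanatory variables. The intercept and coefficients are hypothetical values chosen for illustration, not estimates from any data, and fitting the model is outside the scope of this sketch.

```python
import math


def logistic_probability(intercept, coefficients, features):
    """Probability that the binary response equals 1 under a logistic model."""
    # the log-odds is a linear function of the explanatory variables
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, features))
    # the logistic function maps the log-odds to a probability between 0 and 1
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical coefficients for two explanatory variables
p = logistic_probability(-1.5, [0.8, 0.4], [1.0, 2.0])
print(p)

# Taking log(p / (1 - p)) recovers the linear predictor
print(math.log(p / (1 - p)))
```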

Cohort Study:

A cohort is defined as a group of people who share a common characteristic or experience within a defined time period.

Example

Conducting a dental disease investigation among people aged 30-40 years.

Those who attend dental college clinics, or graduates, form our population.

Other examples

Exposure to a drug or vaccine

Pregnancy

Insured persons

Figure 1. Questionnaire refinement flowchart: Questionnaire, Pre-test (Is error? Yes: Review & Refine; No: Pilot survey), Pilot survey (Is analytical difficulty? Yes: Review & Refine; No: Main Survey).

[Figure: Statistics overview chart (analytical approach), covering population definition, choice of data type, data collection method, data presentation, data analysis and interpretation.]
