Statistics Basics UNIT-1

Embed Size (px)

Citation preview

  • 8/3/2019 Statistics Basics UNIT-1

    1/74

    History of Statistics

    The Word statistics have been derived from Latin word Status or the Italian word Statista,meaning of these words is Political State or a Government. In the past, the statistics was used

    by rulers. The application of statistics was very limited but rulers and kings needed informationabout lands, agriculture, commerce, population of their states to assess their military potential,their wealth, taxation and other aspects of government.

    Gottfried Achenwall used the word statistik at a German University in 1749 which means thatpolitical science of different countries. In 1771 W. Hooper (Englishman) used the word statisticsin his translation of Elements of Universal Erudition written by Baron B.F Bieford, in his bookstatistics has been defined as the science that teaches us what is the political arrangement of all themodern states of the known world. There is a big gap between the old statistics and the modernstatistics, but old statistics also used as a part of the present statistics.

    Meanings of Statistics

    The word statistics has three different meanings (sense) which are discussed below:(1) Plural Sense (2) Singular Sense (3) Plural of the word Statistic

    (1) Plural Sense:

    In plural sense, the word statistics refer to numerical facts and figures collected in asystematic manner with a definite purpose in any field of study. In this sense, statistics are alsoaggregates of facts which are expressed in numerical form. For example, Statistics on industrial

    production, statistics or population growth of a country in different years etc.

    (2) Singular Sense:

    In singular sense, it refers to the science comprising methods which are used in collection,analysis, interpretation and presentation of numerical data. These methods are used to drawconclusion about the population parameter

    For Example: If we want to have a study about the distribution of weights of students in a certaincollege. First of all, we will collect the information on the weights which may be obtained fromthe records of the college or we may collect from the students directly. The large number ofweight figures will confuse the mind. In this situation we may arrange the weights in groups suchas: 50 Kg to 60 Kg 60 Kg to 70 Kg and so on and find the number of students fall in each

    group. This step is called a presentation of data. We may still go further and compute the averagesand some other measures which may give us complete description of the original data.

    (3) Plural of Word Statistic:

    The word statistics is used as the plural of the word Statistic which refers to a numericalquantity like mean, median, variance etc, calculated from sample value.

    For Example: If we select 15 student from a class of 80 students, measure their heights and findthe average height. This average would be a statistic.

    Kinds or Branches of Statistics:

    Statistics may be divided into two main branches:

  • 8/3/2019 Statistics Basics UNIT-1

    2/74

    (1) Descriptive Statistics (2) Inferential Statistics

    (1) Descriptive Statistics:

    In descriptive statistics, it deals with collection of data, its presentation in various forms,such as tables, graphs and diagrams and findings averages and other measures which woulddescribe the data.

    For Example: Industrial statistics, population statistics, trade statistics etc Such as businessmanmake to use descriptive statistics in presenting their annual reports, final accounts, bankstatements.

    (2) Inferential Statistics:

    In inferential statistics, it deals with techniques used for analysis of data, making the estimates anddrawing conclusions from limited information taken on sample basis and testing the reliability ofthe estimates.

    For Example: Suppose we want to have an idea about the percentage of illiterates in our country.We take a sample from the population and find the proportion of illiterates in the sample. Thissample proportion with the help of probability enables us to make some inferences about the

    population proportion. This study belongs to inferential statistics.

    Definition of Statistics:

    Statistics like many other sciences is a developing discipline. It is not nothing static. It hasgradually developed during last few centuries. In different times, it has been defined in different

    manners. Some definitions of the past look very strange today but those definitions had their placein their own time. Defining a subject has always been difficult task. A good definition of todaymay be discarded in future. It is difficult to define statistics. Some of the definitions arereproduced here:

    (1) The kings and rulers in the ancient times were interested in their manpower. They conductedcensus of population to get information about their population. They used information to calculatetheir strength and ability for wars. In those days statistics was defined as

    the science of kings, political and science of statecraft

    (2) A.L. Bowley defined statistics as

    statistics is the science of counting

    This definition places the entries stress on counting only. A common man also thinks as ifstatistics is nothing but counting. This used to be the situation but very long time ago. Statisticstoday is not mere counting of people, counting of animals, counting of trees and counting offighting force. It has now grown to a rich methods of data analysis and interpretation.

    (3) A.L. Bowley has also defined as

    science of averages

  • 8/3/2019 Statistics Basics UNIT-1

    3/74

    This definition is very simple but it covers only some area of statistics. Average is very simpleimportant in statistics. Experts are interested in average deaths rates, average birth rates, averageincrease in population, and average increase in per capita income, average increase in standard ofliving and cost of living, average development rate, average inflation rate, average production ofrice per acre, average literacy rate and many other averages of different fields of practical life. Butstatistics is not limited to average only. There are many other statistical tools like measure of

    variation, measure of correlation, measures of independence etc Thus this definition is weak andincomplete and has been buried in the past.

    (4) Prof: Boddington has defined statistics as

    science of estimate and probabilities

    This definition covers a major part of statistics. It is close to the modern statistics. But it is notcomplete because it stress only on probability. There are some areas of statistics in which

    probability is not used.

    (5) A definition due to W.I. King is the science of statistics is the method of judging collection,natural or social phenomena from the results obtained from the analysis or enumeration or

    collection of estimates. This definition is close to the modern statistics. But it does notcover the entire scope of modern statistics. Secrist has given a detailed definition ofstatistics in plural sense. His definition is given on the previous. He has not given anyimportance to statistics in singular sense. Statistics both in the singular and the pluralsense has been combined in the following definition which is accepted as the moderndefinition of statistics.

    Statistics are the numerical statement of facts capable of analysis and interpretation and thescience of statistics is the study of the principles and the methods applied in collecting, presenting,

    analysis and interpreting the numerical data in any field of inquiry.

    Characteristics of Statistics: Some of its important characteristics are given below:

    Statistics are aggregates of facts. Statistics are numerically expressed. Statistics are affected to a marked extent by multiplicity of causes. Statistics are enumerated or estimated according to a reasonable standard of accuracy. Statistics are collected for a predetermine purpose. Statistics are collected in a systemic manner. Statistics must be comparable to each other.

    Limitations of Statistics: The important limitations of statistics are:

    (1) Statistics laws are true on average. Statistics are aggregates of facts. So single observation isnot a statistics, it deals with groups and aggregates only.

    (2) Statistical methods are best applicable on quantitative data.

    (3) Statistical cannot be applied to heterogeneous data.

    (4) If sufficient care is not exercised in collecting, analysing and interpretation the data, statisticalresults might be misleading.

  • 8/3/2019 Statistics Basics UNIT-1

    4/74

    (5) Only a person who has an expert knowledge of statistics can handle statistical data efficiently.

    (6) Some errors are possible in statistical decisions. Particularly the inferential statistics involvescertain errors. We do not know whether an error has been committed or not.

    Functions or Uses of Statistics:

    (1) Statistics helps in providing a better understanding and exact description of a phenomenon ofnature.

    (2) Statistics helps in proper and efficient planning of a statistical inquiry in any field of study.

    (3) Statistics helps in collecting an appropriate quantitative data.

    (4) Statistics helps in presenting complex data in a suitable tabular, diagrammatic and graphicform for an easy and clear comprehension of the data.

    (5) Statistics helps in understanding the nature and pattern of variability of a phenomenon throughquantitative observations.

    (6) Statistics helps in drawing valid inference, along with a measure of their reliability about thepopulation parameters from the sample data.

    Importance of Statistics in Different Fields:

    Statistics plays a vital role in every fields of human activity. Statistics has important role in

    determining the existing position of per capita income, unemployment, population growth rate,housing, schooling medical facilities etcin a country. Now statistics holds a central position inalmost every field like Industry, Commerce, Trade, Physics, Chemistry, Economics, Mathematics,Biology, Botany, Psychology, Astronomy etc, so application of statistics is very wide. Now wediscuss some important fields in which statistics is commonly applied.

    (1) Business:

    Statistics play an important role in business. A successful businessman must be very quickand accurate in decision making. He knows that what his customers wants, he should therefore,know what to produce and sell and in what quantities. Statistics helps businessman to plan

    production according to the taste of the costumers, the quality of the products can also be checkedmore efficiently by using statistical methods. So all the activities of the businessman based onstatistical information. He can make correct decision about the location of business, marketing ofthe products, financial resources etc

    (2) In Economics:

    Statistics play an important role in economics. Economics largely depends upon statistics.National income accounts are multipurpose indicators for the economists and administrators.Statistical methods are used for preparation of these accounts. In economics research statisticalmethods are used for collecting and analysis the data and testing hypothesis. The relationship

    between supply and demands is studies by statistical methods, the imports and exports, theinflation rate, the per capita income are the problems which require good knowledge of statistics.

  • 8/3/2019 Statistics Basics UNIT-1

    5/74

    (3) In Mathematics:

    Statistical plays a central role in almost all natural and social sciences. The methods ofnatural sciences are most reliable but conclusions draw from them are only probable, because theyare based on incomplete evidence. Statistical helps in describing these measurements more

    precisely. Statistics is branch of applied mathematics. The large number of statistical methods likeprobability averages, dispersions, estimation etc is used in mathematics and different techniques

    of pure mathematics like integration, differentiation and algebra are used in statistics.

    (4) In Banking:

    Statistics play an important role in banking. The banks make use of statistics for a numberof purposes. The banks work on the principle that all the people who deposit their money with the

    banks do not withdraw it at the same time. The bank earns profits out of these deposits by lendingto others on interest. The bankers use statistical approaches based on probability to estimate thenumbers of depositors and their claims for a certain day.

    (5) In State Management (Administration):

    Statistics is essential for a country. Different policies of the government are based onstatistics. Statistical data are now widely used in taking all administrative decisions. Suppose if thegovernment wants to revise the pay scales of employees in view of an increase in the living cost,statistical methods will be used to determine the rise in the cost of living. Preparation of federaland provincial government budgets mainly depends upon statistics because it helps in estimatingthe expected expenditures and revenue from different sources. So statistics are the eyes ofadministration of the state.

    (6) In Accounting and Auditing:

    Accounting is impossible without exactness. But for decision making purpose, so muchprecision is not essential the decision may be taken on the basis of approximation, know asstatistics. The correction of the values of current assets is made on the basis of the purchasing

    power of money or the current value of it.

    In auditing sampling techniques are commonly used. An auditor determines the samplesize of the book to be audited on the basis of error.

    (7) In Natural and Social Sciences:

    Statistics plays a vital role in almost all the natural and social sciences. Statistical methodsare commonly used for analysing the experiments results, testing their significance in Biology,Physics, Chemistry, Mathematics, Meteorology, Research chambers of commerce, Sociology,Business, Public Administration, Communication and Information Technology etc

    (8) In Astronomy:

    Astronomy is one of the oldest branches of statistical study; it deals with the measurementof distance, sizes, masses and densities of heavenly bodies by means of observations. During thesemeasurements errors are unavoidable so most probable measurements are founded by usingstatistical methods.

  • 8/3/2019 Statistics Basics UNIT-1

    6/74

    Example: This distance of moon from the earth is measured. Since old days the astronomers havebeen statistical methods like method of least squares for finding the movements of stars.

    Application of Inferential Statistics in Managerial Decision Making:

    The main objective of Business Statistics is to make inferences (e.g., prediction, making decisions)about certain characteristics of a population based on information contained in a random samplefrom the entire population. The condition for randomness is essential to make sure the sample isrepresentative of the population.

    Statistical inference refers to extending your knowledge obtained from a random sample from theentire population to the whole population. Its main application is in hypotheses testing about agiven population. Statistical inference guides the selection of appropriate statistical models.Models and data interact in statistical work. Inference from data can be thought of as the processof selecting a reasonable model, including a statement in probability language of how confident

    one can be about the selection.

    Estimation and Hypothesis Testing: Inference in statistics is of two types. The firstis estimation, which involves the determination, with a possible error due to sampling, of theunknown value of a population characteristic, such as the proportion having a specific attribute orthe average value of some numerical measurement. To express the accuracy of the estimates of

    population characteristics, one must also compute the standard errors of the estimates. The secondtype of inference is hypothesis testing. It involves the definitions of a hypothesis as one set of

    possible population values and an alternative, a different set. There are many statistical proceduresfor determining, on the basis of a sample, whether the true population characteristic belongs to theset of values in the hypothesis or the alternative.

    Some Basic Definitions in Statistics:

    Constant:

    A quantity which can be assuming only one value is called a constant. It is usually denotedby the first letters of alphabets a,b,c.

    For Example: Value of = 22/7 = 3.14159 and value of e = 2.71828

    Variable: A quantity which can vary from one individual or object to and other is called a variable. Itis usually denoted by the last letters of alphabets x,y,z.

    For Example: Heights and Weights of students, Income, Temperature, No. of Children in afamily etc

    Continuous Variable:

    A variable which can assume each and every value within a given range is called a

    continuous variable. It can occur in decimals.

  • 8/3/2019 Statistics Basics UNIT-1

    7/74

    For Example: Heights and Weights of students, Speed of a bus, the age of a Shopkeeper, the lifetime of a T.V etc

    Continuous Data:

    Data which can be described by a continuous variable is called continuous data.For Example: Weights of 50 students in a class.

    Discrete Variable:

    A variable which can assume only some specific values within a given range is calleddiscrete variable. It cannot occur in decimals. It can occur in whole numbers.For Example:Number of students in a class, number of flowers on the tree, number of houses ina street, number of chairs in a room etc

    Discrete Data:

    Data which can be described by a discrete variable is called discrete data.For Example:Number of students in a college.

    Quantitative Variable:

    A characteristic which varies only in magnitude from on individual to another is calledquantitative variable. It can be measurable.

    For Example: Wages, Prices, Heights, Weights etc

    Qualitative Variable:

    A characteristic which varies only in quality from one individual to another is calledqualitative variable. It cannot be measured.

    For Example: Beauty, Marital Status, Rich, Poor, Smell etc

    Collection of Statistical Data

    Statistical Data:

    A sequence of observation, made on a set of objects included in the sample drawn frompopulation is known as statistical data.

    (1) Ungrouped Data:

    Data which have been arranged in a systematic order are called raw data or ungrouped data.

    (2) Grouped Data:

    Data presented in the form of frequency distribution is called grouped data.

    Collection of Data:

  • 8/3/2019 Statistics Basics UNIT-1

    8/74

    The first step in any enquiry (investigation) is collection of data. The data may be collected for thewhole population or for a sample only. It is mostly collected on sample basis. Collection of data isvery difficult job. The enumerator or investigator is the well trained person who collects thestatistical data. The respondents (information) are the persons whom the information is collected.

    Types of Data:

    There are two types (sources) for the collection of data.(1) Primary Data (2) Secondary Data

    1) Primary Data:

    The primary data are the first hand information collected, compiled and published byorganization for some purpose. They are most original data in character and have not undergoneany sort of statistical treatment.

    Example: Population census reports are primary data because these are collected, complied andpublished by the population census organization.

    (2) Secondary Data:

    The secondary data are the second hand information which are already collected by someone (organization) for some purpose and are available for the present study. The secondary dataare not pure in character and have undergone some treatment at least once.Example: Economics survey of India is secondary data because these are collected by more thanone organization like NSSO, Indian statistical Organisation, Board of Revenue, the Banks etc

    Methods of Collecting Primary Data:

    Primary data are collected by the following methods:

    1. Observation method

    2. Interview method

    3.Through questionnaires

    4.Through schedules

    5. Other methods include

    Warranty cards

    Distributor audits Pantry audits Consumer panels Using mechanical devices Through projective techniques Depth interviews Content analysis

    OBSERVATION METHOD: The observation method is the most commonly used methodespecially in studies relating to behavioural sciences. In a general way we all observe thingsaround us, but this sort of observation is not scientific observation. Observation becomes ascientific tool and the method of data collection for the researcher, when it serves a formulatedresearch purpose, systematically planned and recorded and is subjected to checks and controls on

  • 8/3/2019 Statistics Basics UNIT-1

    9/74

    validity and reliability. Under the observation method, the information is sought by way ofinvestigators own direct observation without asking from the respondent.

    INTERVIEW METHOD: The interview method of collecting data involves presentation of oral-verbal stimuli and reply in terms of oral-verbal responses. This method can be used through

    personal interviews and, if possible, through telephone interviews.

    QUESTIONNAIRES METHOD: This method of data collection is quite popular, particularly incase of big enquiries. It is being adopted by private individuals, research workers, private and

    public organisations and even by governments. In this method a questionnaire is sent (usually bypost) to the persons concerned with a request to answer the questions and return the questionnaire.A questionnaire consists of a number of questions printed or typed in a definite order on a form orset of forms. The questionnaire is mailed to respondents who are expected to read and understandthe questions and write down the reply in the space meant for the purpose in the questionnaireitself. The respondents have to answer the questions on their own. The method of collecting data

    by mailing the questionnaires to respondents is most extensively employed in various economicand business surveys

    SCHEDULES: This method of data collection is very much like the collection of data throughquestionnaire, with little difference which lies in the fact that schedules (proforma containing a setof questions) are being filled in by the enumerators who are specially appointed for the purpose.These enumerators along with schedules, go to respondents, put to them the questions from the

    proforma in the order the questions are listed and record the replies in the space meant for thesame in the proforma. In certain situations, schedules may be handed over to respondents and

    enumerators may help them in recording their answers to various questions in the said schedules.Enumerators explain the aims and objects of the investigation and also remove the difficultieswhich any respondent may feel in understanding the implications of a particular question or thedefinition or concept of difficult terms. This method requires the selection of enumerators forfilling up schedules or assisting respondents to fill up schedules and as such enumerators should

    bevery carefully selected. The enumerators should be trained to perform their job well and thenature and scope of the investigation should be explained to them thoroughly so that they maywell understand the implications of different questions put in the schedule.Enumerators should be intelligent and must possess the capacity of cross examination in order to

    find out the truth. Above all, they should be honest, sincere, and hardworking and should havepatience and perseverance. This method of data collection is very useful in extensive enquiries andcan lead to fairly reliable results

    DIFFERENCE BETWEEN QUESTIONNAIRES AND SCHEDULES

    Both questionnaire and schedule are popularly used methods of collecting data in researchsurveys. There is much resemblance in the nature of these two methods and this fact has mademany people to remark that from a practical point of view the two methods can be taken to be thesame. But from the technical point of view there is difference between the two:

  • 8/3/2019 Statistics Basics UNIT-1

    10/74

    Questionnaire Method Schedule Method

    1. Questionnaire is generally sent through mailto informants to be answered.2. Data collection is cheap.3.Non response is usually high as many

    people do not respond.

    4. It is not clear that who replies.5. The questionnaire method is likely to be veryslow since many respondents do not return thequestionnaire.6.No personal contact is possible in caseof questionnaire.

    1. Schedules is generally filled by the resarchworker or enumerator, who can interpret thequestions when necessary.2.Data collection is more expensive asmoney is spent on enumerators.

    3.Non response is very low because thisis filled by enumerators.4.Identity of respondent is known.5.Information is collected well in time.6.Direct personal contact is established.

    Methods of Collecting Secondary Data:

    The secondary data are collected by the following sources:

    Official: e.g. The publications of the Statistical Division, Ministry of Finance, the FederalBureaus of Statistics, Ministries of Food, Agriculture, Industry, Labor etc

    Semi-Official: e.g. State Bank, Railway Board, Central Cotton Committee, Boards ofEconomic Enquiry etc

    Publication of Trade Associations, Chambers of Commerce etc

    Technical and Trade Journals and Newspapers.

    Research Organizations such as Universities and other institutions.

    Difference between Primary and Secondary Data:

    The difference between primary and secondary data is only a change of hand. The primarydata are the first hand data information which is directly collected form one source. They are mostoriginal data in character and have not undergone any sort of statistical treatment while thesecondary data are obtained from some other sources or agencies. They are not pure in character

    and have undergone some treatment at least once.For Example: Suppose we interested to find the average age of MS students. We collect the agesdata by two methods; either by directly collecting from each student himself personally or gettingtheir ages from the university record. The data collected by the direct personal investigation iscalled primary data and the data obtained from the university record is called secondary data.

    Editing of Data:

    After collecting the data either from primary or secondary source, the next step is itsediting. Editing means the examination of collected data to discover any error and mistake before

    presenting it. It has to be decided before hand what degree of accuracy is wanted and what extentof errors can be tolerated in the inquiry. The editing of secondary data is simpler than that of

    primary data.

  • 8/3/2019 Statistics Basics UNIT-1

    11/74

    Classification of Data:

    The process of arranging data into homogenous group or classes according to some commoncharacteristics present in the data is called classification.

    For Example: The process of sorting letters in a post office, the letters are classified according tothe cities and further arranged according to streets.

    Bases of Classification:

    There are four important bases of classification:

    (1) Qualitative Base

    (2) Quantitative Base

    (3) Geographical Base

    (4) Chronological or Temporal Base

    (1) Qualitative Base:

    When the data are classified according to some quality or attributes such as sex, religion,literacy, intelligence etc(2) Quantitative Base:

    When the data are classified by quantitative characteristics like heights, weights, ages,income etc

    (3) Geographical Base: When the data are classified by geographical regions or location, like states, provinces,cities, countries etc(4) Chronological or Temporal Base:

    When the data are classified or arranged by their time of occurrence, such as years, months,weeks, days etc For Example: Time series data.

    Types of Classification:

    (1) One -way Classification:

    If we classify observed data keeping in view single characteristic, this type of classification isknown as one-way classification.

    For Example: The population of world may be classified by religion as Muslim, Christians etc

    (2) Two -way Classification:

    If we consider two characteristics at a time in order to classify the observed data then we are doingtwo way classifications.For Example: The population of world may be classified by Religion and Sex.

    (3) Multi -way Classification:

    We may consider more than two characteristics at a time to classify given data or observed data.In this way we deal in multi-way classification.For Example: The population of world may be classified by Religion, Sex and Literacy.

  • 8/3/2019 Statistics Basics UNIT-1

    12/74

    Tabulation of Data:

    The process of placing classified data into tabular form is known as tabulation. A table is asymmetric arrangement of statistical data in rows and columns. Rows are horizontal arrangementswhereas columns are vertical arrangements. It may be simple, double or complex depending uponthe type of classification.

    Types of Tabulation:

    (1) Simple Tabulation or One-way Tabulation:

    When the data are tabulated to one characteristic, it is said to be simple tabulation or one-waytabulation.For Example: Tabulation of data on population of world classified by one characteristic likeReligion is example of simple tabulation.

    (2) Double Tabulation or Two-way Tabulation:

    When the data are tabulated according to two characteristics at a time. It is said to be doubletabulation or two-way tabulation.

    For Example: Tabulation of data on population of world classified by two characteristics likeReligion and Sex is example of double tabulation.

    (3) Complex Tabulation:

    When the data are tabulated according to many characteristics, it is said to be complex

    tabulation.

    For Example: Tabulation of data on population of world classified by two characteristics likeReligion, Sex and Literacy etcis example of complex tabulation.

    Construction of Statistical Table:

    A statistical table has at least four major parts and some other minor parts.(1) The Title(2) The Captions(3) The Stub

    (4) The Body(5) Prefatory Notes/Head Note(6) Foots Notes(7) Source NotesThe general sketch of table indicating its necessary parts is shown below:

    ----THE TITLE----

    ----Prefatory Notes----

    ----Box Head----

  • 8/3/2019 Statistics Basics UNIT-1

    13/74

    ----Row Captions---- ----Column Captions----

    ----Stub Entries---- ----The Body----

    Foot Notes

    Source Notes

    (1) The Title:

    A title is the main heading written in capital shown at the top of the table. It must explain thecontents of the table and throw light on the table as whole different parts of the heading can beseparated by commas there are no full stop be used in the little.

    (2) The Captions:

    The vertical heading and subheading of the column are called columns captions. The spaces werethese column headings are written is called box head. Only the first letter of the box head is incapital letters and the remaining words must be written in small letters.

    (3) The Stub:

    The horizontal headings and sub heading of the row are called row captions and the space wherethese rows headings are written is called stub.

    (4) The Body:

    It is the main part of the table which contains the numerical information classified with respect torow and column captions.

    (5) Prefatory Notes /Head Notes :

    A statement given below the title and enclosed in brackets usually describe the units ofmeasurement is called prefatory notes.

    (6) Foot Notes:

    It appears immediately below the body of the table providing the further additional explanation.

    (7) Source Notes:

    The source notes is given at the end of the table indicating the source from when information has

    been taken. It includes the information about compiling agency, publication etcGeneral Rules of Tabulation:

  • 8/3/2019 Statistics Basics UNIT-1

    14/74

    A table should be simple and attractive. There should be no need of further explanations(details).

    Proper and clear headings for columns and rows should be need. Suitable approximation may be adopted and figures may be rounded off. The unit of measurement should be well defined. If the observations are large in number they can be broken into two or three tables. Thick lines should be used to separate the data under big classes and thin lines to separate

    the sub classes of data.

    Difference Between Classification and Tabulation:

    (1) First the data are classified and then they are presented in tables, the classification andtabulation in fact goes together. So classification is the basis for tabulation.

    (2) Tabulation is a mechanical function of classification because in tabulation classified data areplaced in row and columns

    (3) Classification is a process of statistical analysis where as tabulation is a process of presentingthe data in suitable form.

    Frequency Distribution:

    A frequency distribution is a tabular arrangement of data into classes according to the size ormagnitude along with corresponding class frequencies (the number of values fall in each class).

    Ungrouped Data or Raw Data:

    Data which have not been arranged in a systemic order is called ungrouped or raw data.

    Grouped Data:

    Data presented in the form of frequency distribution is called grouped data.

    Array: The numerical raw data is arranged in ascending or descending order is called an array.

    Example: Array the following data in ascending or descending order 6, 4, 13, 7, 10, 16, 19.Solution:

    Array in ascending order is 4, 6, 7, 10, 13, 16, and 19Array in descending order id 19, 16, 13, 10, 7, 6, and 4

    Class Limits/Class Interval:

    The variant values of the classes or groups are called the class limits. The smaller value of theclass is called lower class limit and larger value of the class is called upper class limit. Class limitsare also called inclusive classes.

    For Example: Let us take the class 10 19, the smaller value 10 is lower class limit and largervalue 19 is called upper class limit.

  • 8/3/2019 Statistics Basics UNIT-1

    15/74

    Class Boundaries:

    The true values, which describe the actual class limits of a class, are called class boundaries. Thesmaller true value is called the lower class boundary and the larger true value is called the upperclass boundary of the class. It is important to note that the upper class boundary of a classcoincides with the lower class boundary of the next class. Class boundaries are also known asexclusive classes.

    For Example:

    Weights in Kg No of Students

    60 65 865 70 1270 75 5 25

    A student whose weights are between 60kg and 64.5kg would be included in the 60 65 class. A

    student whose weight is 65kg would be included in next class 65 70.

    Open-end Classes:

    A class has either no lower class limit or no upper class limit in a frequency table is called anopen-end class. We do not like to use open-end classes in practice, because they create problemsin calculation.

    For Example:

    Weights (Pounds) No of Persons

    Below 110 6110 120 12120 130 20130 140 10

    140 Above 2

    Class Mark or Mid Point:

    The class marks or mid point is the mean of lower and upper class limits or boundaries. So itdivides the class into two equal parts. It is obtained by dividing the sum of lower and upper classlimit or class boundaries of a class by 2.

    For Example: The class mark or mid point of the class 60 69 is 60+69/2 = 64.5

    Size of Class Interval:

    The difference between the upper and lower class boundaries (not between class limits) of a class

    or the difference between two successive mid points is called size of class interval.

  • 8/3/2019 Statistics Basics UNIT-1

    16/74

    Types of class Interval: Classes can be formed in two ways:

    Exclusive Class Interval: When the class intervals are fixed in such a way that the upper limit ofone class is the lower limit of the next class, it is termed as exclusive method of class interval.

    Example:

    Marks (Percentage) No. of Students

    0-1010-2020-3030-4040-5050-60

    151722303945

    Inclusive Class Interval: In case of inclusive class intervals, the upper limit of a class in notequal to the lower limit of the next class.

    Example:

    Marks (Percentage) No. of Students

    0-910-1920-29

    30-3940-49

    587

    1325

    Construction of Frequency Distribution:

    Following steps are involved in the construction of a frequency distribution.

    (1) Find the range of the data: The range is the difference between the largest and the smallestvalues.

    (2) Decide the approximate number of classes: Which the data are to be grouped. There are nohard and first rules for number of classes. Most of the cases we have 5 to 20 classes. H.A. Sturgeshas given a formula for determining the approximation number of classes.

    K = 1 + 3.322 log NWhere K = Number of Classes

    Where log N = Logarithm of the total number of observations

    For Example: If the total number of observations is 50, the number of classes would be

    K = 1 + 3.322 log N

  • 8/3/2019 Statistics Basics UNIT-1

    17/74

    K = 1 + 3.322 log 50

    K = 1 + 3.322 (1.69897)

    K = 1 + 5.644

    K = 6.644 Or 7 classes approximately.

    (3) Determine the approximate class interval size: The size of class interval is obtained bydividing the range of data by number of classes and denoted by h class interval

    sizeIn case of fractional results, the next higher whole number is taken as the size of the class interval.

    (4) Decide the starting point: The lower class limits or class boundary should cover the smallestvalue in the raw data. It is a multiple of class interval.For Example: 0, 5, 10, 15, 20 etcare commonly used.

    (5) Determine the remaining class limits (boundary): When the lowest class boundary of thelowest class has been decided, then by adding the class interval size to the lower class boundary,compute the upper class boundary. The remaining lower and upper class limits may be determined

    by adding the class interval size repeatedly till the largest value of the data is observed in the class.

    (6) Distribute the data into respective classes: All the observations are marked into respectiveclasses by using Tally Bars (Tally Marks) methods which is suitable for tabulating theobservations into respective classes. The number of tally bars is counted to get the frequencyagainst each class. The frequency of all the classes is noted to get grouped data or frequency

    distribution of the data. The total of the frequency columns must be equal to the number ofobservations.

    Example Construction of Frequency Distribution:

    Construct a frequency distribution with suitable class interval size of marks obtainedby 50 students of a class are given below:

    23, 50, 38, 42, 63, 75, 12, 33, 26, 39, 35, 47, 43, 52, 56, 59, 64, 77, 15, 21, 51, 54, 72, 68, 36, 65,52, 60, 27, 34, 47, 48, 55, 58, 59, 62, 51, 48, 50, 41, 57, 65, 54, 43, 56, 44, 30, 46, 67, 53

    Solution: Arrange the marks in ascending order as

    12, 15, 21, 23, 26, 27, 30, 33, 34, 35, 36, 38, 39, 41, 42, 43, 43, 44, 46, 47, 47, 48, 48, 50, 50, 51,51, 52, 52, 53, 54, 54, 55, 56, 56, 57, 58, 59, 59, 60, 62, 63, 64, 65, 65, 67, 68, 72, 75, 77

    Minimum Value = 12Maximum = 77Range = Maximum Value Minimum Value = 77 - 12 = 65

    Number of Classes = 1 + 3.322 log N= 1 + 3.322 log 50

    = 1 + 3.322 (1.69897)= 1 + 5.644= 6.644 Or 7 classes approximately.

  • 8/3/2019 Statistics Basics UNIT-1

    18/74

    Class Interval Size (h) = = = 9.3 or 10

    Marks

    Class Limits

    C.L

    Tally

    Marks

    Number of

    Students

    f (Frequency)

    Class

    Boundary

    C.B

    Class

    Marks

    x

    Note: For finding the class boundaries, we take half of the difference between lower class limit ofthe 2nd class and upper class limit of the 1st class (20-190/2 = = 0.5. This value is subtractedfrom lower class limit and added in upper class limit to get the required class boundaries.

    Frequency Distribution of Discrete Data:

    Discrete data is generated by counting; each and every observation is exact. When an observationis repeated. It is counted the number for which the observation is repeated is called frequency ofthat observation. The class limits in discrete data are true class limit; there are no class boundariesin discrete data.

    More than cumulative frequency distribution:

    It is obtained by finding the cumulate total of frequencies starting from the highest to the lowestclass. The less than cumulative frequency distribution and more than cumulative frequencydistribution for the frequency distribution given below are:

    Diagrammatic & Graphical Representation of Statistical Data:

  • 8/3/2019 Statistics Basics UNIT-1

    19/74

    In the previous section we have discussed the techniques of classification and tabulation that helpus in organizing the collected data in a meaningful fashion. However, this way of presentation ofstatistical data does not always prove to be interesting to a layman. Too many figures are oftenconfusing and fail to convey the massage effectively. One of the most effective and interestingalternative way in which a statistical data may be presented is through diagrams and graphs. Thereare several ways in which statistical data may be displayed pictorially such as different types ofgraphs and diagrams. The commonly used diagrams and graphs to be discussed in subsequent

    paragraphs are given as under:

    Types of Diagrams/Charts:

    1. Simple Bar Chart2. Multiple Bar Chart or Cluster Chart3. Staked Bar Chart or Sub-Divided Bar Chart or Component Bar Chart

    Simple Component Bar Chart Percentage Component Bar Chart Sub-Divided Rectangular Bar Chart Deviation Bar Diagram Pie Chart

    Types of Diagrams/Charts:

    1. Histogram2. Frequency Curve and Polygon3. Ogive(cumulative frequency curves)

    Simple Bar Chart:

    A simple bar chart is used to represents data involving only one variable classified on spatial,quantitative or temporal basis. In simple bar chart, we make bars of equal width but variablelength, i.e. the magnitude of a quantity is represented by the height or length of the bars.Following steps are undertaken in drawing a simple bar diagram:

    Draw two perpendicular lines one horizontally and the other vertically at an appropriateplace of the paper.

    Take the basis of classification along horizontal line (X-axis) and the observed variablealong vertical line (Y-axis) or vice versa.

    Marks signs of equal breath for each class and leave equal or not less than half breath inbetween two classes.

    Finally marks the values of the given variable to prepare required bars.

    Example:Draw simple bar diagram to represent the profits of a bank for 5 years.

    Years 1989 1990 1991 1992 1993

    Profit (million $) 10 12 18 25 42

    Simple bar chart showing the profit of a bank for 5 years.

  • 8/3/2019 Statistics Basics UNIT-1

    20/74

    Multiple Bar Chart:

    By multiple bars diagram two or more sets of inter-related data are represented (multiple bardiagram facilities comparison between more than one phenomena). The technique of simple barchart is used to draw this diagram but the difference is that we use different shades, colors, or dotsto distinguish between different phenomena. We use to draw multiple bar charts if the total ofdifferent phenomena is meaningless.

    Example:

    Draw a multiple bar chart to represent the import and export of a country (values inMillion $) for the years 1991 to 1995.

    Years Imports Exports

    1991 7930 4260

    1992 8850 5225

    1993 9780 6150

    1994 11720 7340

    1995 12150 8145

    Simple bar chart showing the import and export of a country from 1991 1995.

  • 8/3/2019 Statistics Basics UNIT-1

    21/74

    Component Bar Chart:

    Sub-divided or component bar chart is used to represent data in which the total magnitude is

    divided into different or components.

    In this diagram, first we make simple bars for each class taking total magnitude in that class andthen divide these simple bars into parts in the ratio of various components. This type of diagram

    shows the variation in different components within each class as well as between different classes.Sub-divided bar diagram is also known as component bar chart or staked chart.

    Example:

    The table below shows the quantity in hundred kgs of Wheat, Barley and Oats produced ona certain form during the years 1991 to 1994.

    Years Wheat Barley Oats

    1991 34 18 27

    1992 43 14 24

    1993 43 16 27

    1994 45 13 34

    Construct a component bar chart to illustrate this data.

  • 8/3/2019 Statistics Basics UNIT-1

    22/74

    Solution:

    To make the component bar chart, first of all we have to take year wise total production.

    Years Wheat Barley Oats Total

    1991 34 18 27 79

    1992 43 14 24 811993 43 16 27 86

    1994 45 13 34 92

    The required diagram is given below:

    Percentage Component Bar Chart:

    Sub-divided bar chart may be drawn on percentage basis. To draw sub-divided bar chart onpercentage basis, we express each component as the percentage of its respective total. In drawingpercentage bar chart, bars of length equal to 100 for each class are drawn at first step and sub-divided in the proportion of the percentage of their component in the second step. The diagram so

    obtained is called percentage component bar chart or percentage staked bar chart. This type ofchart is useful to make comparison in components holding the difference of total constant.

    Example:

    The table given on next page shows the quantity in hundred kgs of Wheat, Barley and Oatsproduced on a certain form during the years 1991 to 1994.

  • 8/3/2019 Statistics Basics UNIT-1

    23/74

    Years Wheat Barley Oats

    1991 34 18 27

    1992 43 14 24

    1993 43 16 27

    1994 45 13 34

    Construct a percentage component bar chart to illustrate this data.

    Solution: Necessary computations for the construction of percentage bar chartgiven below:

    Item 1991 1992 1993 1994

    % Cum % % Cum % % Cum % % Cum %

    Wheat 43.0 43.0 53.1 53.1 50.0 50.0 48.9 48.9

    Barley 22.8 65.8 17.3 70.4 18.6 68.6 14.1 63.0

    Oats 34.2 100 29.6 100 31.4 100 37.0 100

    Total 100 100 100 100

    % indicates Percentage of each item Cum % indicates the cumulative percentage.

  • 8/3/2019 Statistics Basics UNIT-1

    24/74

  • 8/3/2019 Statistics Basics UNIT-1

    25/74

    ItemsFamily Budget

    Expenditure $ Angle of Sectors Cumulative Angle

    Food 600 144 144

    Clothing 100 24 168

    House Rent 400 96 264

    Fuel and Lighting 100 24 288

    Miscellaneous 300 72 360

    Total 1500 360

    Histogram: Histogram is a two dimensional frequency diagram. The histograms are diagramswhich represent the class interval and the frequency in the form of a rectangle. There will be asmany adjoining rectangles as there are class intervals.

    Properties of Histogram:

    (1) In a histogram, class intervals are shown on X-axis and frequencies on Y-axis.(2) The scales for both the axes need not be the same.(3) Class intervals must be exclusive. If the intervals are in inclusive form, we have to convertthem to the exclusive form.(4) In a histogram, there are rectangles with class intervals as bases and the correspondingfrequencies as heights.(5) The class limits are marked on the horizontal axis and the frequency is marked on the verticalaxis. Thus, a rectangle is constructed on each class interval.

    (6) If the intervals are equal, then the height of each rectangle is proportional to the correspondingclass frequency.(7) If the intervals are unequal, then the area of each rectangle is proportional to the correspondingclass frequency.

  • 8/3/2019 Statistics Basics UNIT-1

    26/74

    Example:Class Interval 0-5 5-10 10-15 15-20 20-25

    Frequency 4 10 18 8 6

    Frequency Polygon:

    A frequency polygon is a line graph drawn by joining the mid-points of the tops of the adjoiningrectangles in a histogram. The mid-points of the first and the last classes are joined to the mid-

    points of the classes preceding and succeeding respectively at zero frequency to complete thepolygon.

    Example:

    Class (Height in Centimetres) 150-160

    160-170

    170-180

    180-190

    190-200

    Frequency 4 3 8 6 1

  • 8/3/2019 Statistics Basics UNIT-1

    27/74

    Ogives (Cumulative Frequency Curves): The curves that we have discussed earlier have a

    similar property that the frequency of each class was shown against it but sometimes it needs tocumulate the frequency and the curves that are based on cumulative frequencies are known ascumulative frequency curves or ogives.

    Series can be cumulated in two ways. In the first way, the frequencies of all the preceding classintervals are added to the frequency of a class and in the second way, the frequencies of all thesucceeding classes are added to the frequency if a class. The frequencies added in the first waywould show the frequency of the classes less than a particular value and the frequency added inthe second way would show the frequency of classes more than a particular value. Simply the firstmethod give a series cumulated upwards and a second one a series cumulated downwards.

    The method of drawing the frequency curve and the more than frequency curve are almost samewith a little difference that in case of a simple frequency curve, the frequency is plotted against themidpoint of a class interval but in case of a cumulative frequency curve, it is plotted at the upperor lower limit of the class depending upon the manner in which series has been cumulated.

    It will be plotted against the upper limit of the class interval when the cumulation has been doneupwards (less than series) but if the cumulation has been done downwards, i.e. the more thanseries has been formed, frequencies would be plotted against the lower limit of the class-intervals.

    Age Interval(yrs) Frequency Cumulative frequency

    15-19 13 13

    20-24 15 28

    25-29 20 48

    30-34 10 58

    35-39 8 66

    40-44 4 70

  • 8/3/2019 Statistics Basics UNIT-1

    28/74

    Measures of Central Tendency

  • 8/3/2019 Statistics Basics UNIT-1

    29/74

    Meaning of Central Tendency: Condensation of data is very much required in statistical analysisbecause a large number of figures are confusing and are very difficult to handle. To reduce thecomplexity of data, it is essential that the number of figures should be reduced to one figure so thatfigures can be compared easily. Such single figures are called measures of central tendency oraverages.

    According to Simpson & Kafka: A measure of central tendency is a typical value aroundwhich other figure congregate.

    Average:

    A single value which can represent the whole set of data is called an average. If theaverage tends to lie or indicating the center of the distribution is called measure of centraltendency or sometimes they locate the general position of the data, so they are also called measureof location.

    Desirable Qualities of a Good Average:

    An average possesses all or most of the following qualities (characteristics) is considered a goodaverage:

    (1) It should be easy to calculate and simple to understand.(2) It should be clearly defined by a mathematical formula.(3) It should not be affected by extreme values.(4) It should be based on all the observations.(5) It should be capable of further mathematical treatment.(6) It should have sample stability.

    Types of Averages:

    Mathematical averages are:

    (1) Arithmetic Mean

    (2) Geometric Mean

    (3) Harmonic Mean

    Positional averages are:

    (4) Median

    (5) Mode

    Arithmetic Mean:

    It is the most commonly used average or measure of the central tendency applicable only in caseof quantitative data. Arithmetic mean is also simply called mean. Arithmetic mean is defined as:

    Arithmeticmean is quotient of sum of the given values and number of the given values.

  • 8/3/2019 Statistics Basics UNIT-1

    30/74

    The arithmetic mean can be computed for both ungroup data (raw data: a data withoutany statistical treatment) and grouped data (a data arranged in tabular form containing differentgroups).

    If x is the involved variable, then arithmetic mean of x is abbreviated as A.M. of x and

    denoted by . The arithmetic mean of x can be computed by any of the following methods.

    Methods NameNature of Data

    Ungrouped Data Grouped Data

    Direct Method

    Indirect or

    Short-Cut Method

    Method of

    Step-Deviation

    Wherex : Indicates values of the variable X.n : Indicates number of values of X.f : Indicates frequency of different groups.

    A : Indicates assumed mean.D = Indicates deviation from A i.e, D = (x A)u = (x - A)/c

    u : Step-deviation andc : Indicates common divisorh : Indicates size of class or class interval in case of grouped data. : Summation or addition.

    Example (1):

    The one-sided train fare of five selected BS students is recorded as follows($):-10, 5, 15, 8 and 12. Calculate arithmetic mean of the following data.

    Solution:

    Let train fare is indicated by x, then

    Arithmetic mean of , we decide to use above-mentioned formula.Form the given data, we have x = 50 and n = 5. Placing these two quantities in above formula,

    we get the arithmetic mean for given data.

    ;

    x(S)105

    158

    12x = 50

  • 8/3/2019 Statistics Basics UNIT-1

    31/74

    .Weighted Arithmetic Mean:

    In calculation of arithmetic mean, the importance of all the items was considered to be equal.However, there may be situations in which all the items under considerations are not equalimportance. For example, we want to find average number of marks per subject who appeared in

    different subjects like Mathematics, Statistics, Physics and Biology. These subjects do not haveequal importance. If we find arithmetic mean by giving Mean.

    Thus, arithmetic mean computed by considering relative importance of each items is calledweighted arithmetic mean. To give due importance to each item under consideration, we assignnumber called weight to each item in proportion to its relative importance.Weighted Arithmetic Mean is computed by using following formula:

    Where:

    Stands for weighted arithmetic mean.Stands for values of the items andStands for weight of the item

    Example:

    A student obtained 40, 50, 60, 80, and 45 marks in the subjects of Math, Statistics, Physics,Chemistry and Biology respectively. Assuming weights 5, 2, 4, 3, and 1 respectively for the abovementioned subjects. Find Weighted Arithmetic Mean per subject.

    Solution:

    Subjects

    Marks Obtained

    x

    Weight

    wWx

    Math 40 5 200

    Statistics 50 2 100

    Physics 60 4 240

    Chemistry 80 3 240Biology 45 1 45

    Total w = 15 wx = 825

    Now we will find weighted arithmetic mean as:

    marks/subject.

    Merits and Demerits of Arithmetic Mean:

    Merits:

  • 8/3/2019 Statistics Basics UNIT-1

    32/74

    It is rigidly defined. It is easy to calculate and simple to follow. It is based on all the observations. It is determined for almost every kind of data. It is finite and not indefinite. It is readily put to algebraic treatment. It is least affected by fluctuations of sampling.

    Demerits:

    The arithmetic mean is highly affected by extreme values. It cannot average the ratios and percentages properly. It is not an appropriate average for highly skewed distributions. It cannot be computed accurately if any item is missing. The mean sometimes does not coincide with any of the observed value.

    Mathematical Properties of Mean:

    1. The sum of the deviations of the items from the mean (taking plus and minus sign) is alwaysequal to zero.

    2. The sum of squared deviations of the items from the mean is (minimum) less than the sum ofsquared deviations of items from any other value.

    3. If we have the arithmetic mean and the number of observations of two or more than two relatedgroups, we can compare combined average of these groups by applying the following formula:

    X12 = N1 X1 + N2 X2

    N1 + N2

    Where

    X12 = Combined mean of two groups

    N1 = No. of observations in the first group

    N2 = No. of observations in the second group

    X1 = Arithmetic mean of first group

    X2 = Arithmetic mean of second group

    Geometric Mean:

    It is another measure of central tendency based on mathematical footing like arithmetic mean.Geometric mean can be defined in the following terms:Geometric mean is the nth positive root of the product of n positive given values

    Hence, geometric mean for a value X containing n values such as x1, x2, x3, xn isdenoted by G.M. of X and given as under:

    (For Ungrouped Data)

  • 8/3/2019 Statistics Basics UNIT-1

    33/74

    If we have a series of n positive values with repeated values such as x1, x2, x3, xk arerepeated f1, f2, f3, fk times respectively then geometric mean will becomes:

    (For Grouped Data)

    Where

    Example:

    Find the Geometric Mean of the values 10, 5, 15, 8, 12Solution:

    Here x1 = 10, x2 = 5 x3 = 15, x4 = 8 x5 = 12 and n = 5

    Example:

    Find the Geometric Mean of the following Data

    X 13 14 15 16 17

    f 2 5 13 7 3

    Solution:

    We may write it as given below:

    Here x1 = 13, x2 = 14, x3 = 15, x4 = 16, x5 = 17f1 = 2, f2 = 5, f3 = 13, f4 = 7 f5 = 3

    n = f = f1 + f2 + f3 + f4 + f5 = 2+5+13+7+3 = 30

    Using the formula of geometric mean for grouped data, geometric mean in this case will become:

    The method explained above for the calculation of geometric mean is useful when thenumbers of values in given data are small in number and the facility of electronic calculator isavailable. When a set of data contains large number of values then we need an alternative way for

  • 8/3/2019 Statistics Basics UNIT-1

    34/74

    computing geometric mean. The modified or alternative way of computing geometric mean isgiven as under:

    For Ungrouped Data For Grouped Data

    Example:

    Find the Geometric Mean of the values 10, 5, 15, 8, 12

    Example:

    Find the Geometric Mean for the followingdistribution of students marks:

    Marks 0-30 30-50 50-80 80-100

    No. of Students 20 30 40 10

    Solution:

    Marks No. of Studentsf Mid Pointsx f log x

    0-30 20 15 20 log 15 = 23.5218

    30-50 30 40 30 log 40 = 48.0168

    50-80 40 65 40 log 65 = 72.5165

    80-100 10 90 10 log 90 = 19.5424

    Total f = 100 f log x = 163.6425

    X log x

    10 1.0000

    5 0.6990

    15 1.1761

    8 0.9031

    12 1.0792

    Total log x = 4.8573

  • 8/3/2019 Statistics Basics UNIT-1

    35/74

    Combined Geometric Mean: Just like the combined arithmetic mean, we can also calculatecombined geometric mean by using the following formula.

    Log G = N1 Log G1 + N2 Log G2

    N1+N2

    Where

    G1 and G2 are the respective geometric means for two groups and N1 and N2 means the number ofobservations in each group.

    Properties of Geometric Mean:

    The main properties of geometric mean are:

    The geometric mean is less than arithmetic mean, G.M

  • 8/3/2019 Statistics Basics UNIT-1

    36/74

    Demerits:

    It cannot be calculated if any of the observation is zero or negative. Its calculation is rather difficult. It is not easy to understand. It may not coincide with any of the observations.

    Harmonic Mean

    Harmonic mean is another measure of central tendency and also based on mathematic footing likearithmetic mean and geometric mean. Like arithmetic mean and geometric mean, harmonic meanis also useful for quantitative data. Harmonic mean is defined in following terms:

    Harmonic mean is quotient of number of the given values and sum of the reciprocals of the

    given values.

    Harmonic mean in mathematical terms is defined as follows:For Ungrouped Data For Grouped Data

    Example: Calculate the harmonic mean of the numbers: 13.5, 14.5, 14.8, 15.2 and 16.1Solution:

    The harmonic mean is calculated as below:

    Example:

    Given the following frequency distribution of first yearstudents of a particular college. Calculate the Harmonic Mean.

    Age (Years) 1314 15 16 17

    Number of Students 2 5 13 7 3

    x 1/x

    13.2 0.0758

    14.2 0.0704

    14.8 0.0676

    15.2 0.0658

    16.1 0.0621

    Total (1/x) = 0.3417

  • 8/3/2019 Statistics Basics UNIT-1

    37/74

    Solution:

    The given distribution belongs to a grouped data and the variable involved is ages of firstyear students. While the number of students Represent frequencies.

    Ages (Years)

    xNumber of Students

    f1/x

    13 2 0.1538

    14 5 0.3571

    15 13 0.8667

    16 7 0.4375

    17 3 0.1765

    Total f = 30 (1/x) = 1.9916

    Now we will find the Harmonic Mean as

    years.

    Merits and Demerits of Harmonic Mean:

    Merits:

    It is based on all observations. It not much affected by the fluctuation of sampling. It is capable of algebraic treatment. It is an appropriate average for averaging ratios and rates. It does not give much weight to the large items.

    Demerits:

    Its calculation is difficult. It gives high weight-age to the small items. It cannot be calculated if any one of the items is zero. It is usually a value which does not exist in the given data.

    Relationship among the A.M., G.M. and H.M.:

    In any distribution when the observations differ in size, the value of A.M., G.M. and H.M., will bein the following order

    A.M.> G.M. >H.M.

  • 8/3/2019 Statistics Basics UNIT-1

    38/74

    If all the items in the distribution have same value then the relationship will be

    A.M. G.M. H.M.

    Mode

    Mode is the value which occurs the greatest number of times in the data. When each value occurthe same numbers of times in the data, there is no mode. If two or more values occur the samenumbers of time, then there are two or more modes and distribution is said to be multi-mode. Ifthe data having only one mode the distribution is said to be uni-model and data having two modes,the distribution is said to be bi-model.

    Mode from Ungrouped Data:

    Mode is calculated from ungrouped data by inspecting the given data. We pick out that valuewhich occur the greatest numbers of times in the data.

    Mode from Grouped Data:

    When frequency distribution with equal class interval sizes, the class which has maximumfrequency is called model class.

    Or

    Where

    l = Lower class boundary of the model classfm = Frequency of the model class (maximum frequency)f1 = Frequency preceding the model class frequencyf2 = Frequency following the model class frequencyh = Class interval size of the model class

    Mode from Discrete Data:

    When the data follows discrete set of values, the mode may be found by inspection. Mode is thevalue ofXcorresponding to the maximum frequency.

    Example:

    Find the mode of the values 5, 7, 2, 9, 7, 10, 8, 5, 7

    Solution:

    Mode is 7 because it occur the greatest number of times in the data.

  • 8/3/2019 Statistics Basics UNIT-1

    39/74

    Example:

    The weights of 50 college students are given in the following table. Find the mode of thedistribution.

    Weights (Kg) 60 64 65 69 70 74 75 79 80 84

    No of Students 5 9 16 12 8

    Solution:

    Weights (Kg)No of Students

    fClass Boundary

    60 64 5 59.5 64.5

    65 69 9 64.5 69.5

    70 74 16 69.5 74.5

    75 79 12 74.5 79.5

    80 84 8 79.5 84.5

    Example:

    The following frequency distribution shows the numbers of children in each family in alocality. Find the mode.

    No of Children 0 1 2 3 4 5 6

    No of Families 6 30 42 55 25 18 5

    Solution:

    The data follows discrete set of values

    So, Mode = 3 (corresponding to the maximum frequency)

    Graphic Location of Mode: Mode being a positional average so it can be located graphically bythe following process.

    o A histogram of the frequency distribution is drawn.o In histogram, the highest rectangle represents the model class.

  • 8/3/2019 Statistics Basics UNIT-1

    40/74

    o The top left corner of the highest rectangle is joined with the top left corner of thefollowing rectangle and top right corner of the highest rectangle is joined with thetop right corner of the preceding rectangle respectively.

    o From the point of intersection of both the lines a perpendicular drawn on X-axis,check that point on X-axis. This will be the required value of mode.

    Important Notes:

    1. If there is a series with unequal class intervals, then the series must be regrouped to make theclass interval equal in magnitude.

    2. In case of inclusive class interval, the formula for calculation of Mode will be the same but theclass intervals have to be converted in exclusive type.

    Advantages and Disadvantages of Mode:

    Advantages:

    It is easy to understand and simple to calculate. It is not affected by extreme large or small values. It can be located only by inspection in ungrouped data and discrete frequency distribution. It can be useful for qualitative data. It can be computed in open-end frequency table. It can be located graphically.

    Disadvantages:

    It is not well defined.

    It is not based on all the values. It is stable for large values and it will not be well defined if the data consists of small

    number of values. It is not capable of further mathematical treatment. Sometimes, the data having one or more than one mode and sometimes the data having no

    mode at all.

    Median

    Median is the most middle value in the arrayed data. It means that when the data are arranged,median is the middle value if the number of values is odd and the mean of the two middle values,

    if the numbers of values is even. A value which divides the arrayed set of data in two equal parts iscalled median, the values greater than the median is equal to the values smaller than the median. Itis also known as a positional average.

    Median from Ungrouped Data(Individual & Discrete Series):

    Median = Value of item

    Example: Find the median of the values 4, 1, 8, 13, 11

  • 8/3/2019 Statistics Basics UNIT-1

    41/74

    Solution:

    Arrange the data 1, 4, 8, 11, 13

    Median = Value of item

    Median = Value of item = itemMedian = 8

    Example:

    Find the median of the values 5, 7, 10, 20, 16, 12

    Solution:

    Arrange the data 5, 7, 10, 12, 16, 20

    Median = Value of item

    Median = Value of item = item

    Median =

    Median from Grouped Data:

    The median for grouped data, we find the cumulative frequencies and then calculated the mediannumber n/2. The median lies in the group (class) which corresponds to the cumulative frequencyin which n/2 lies. We use following formula to find the median.

    Where

    l = Lower class boundary of the model class

    f = Frequency of the median class

    n = f = Number of values or total frequencyc = Cumulative frequency of the class preceding the median classh = Class interval size of the model class

    Example:

    Calculate median from the following data.

    Group 60 64 65 69 70 74 75 79 80 84 85 89

  • 8/3/2019 Statistics Basics UNIT-1

    42/74

    Frequency 1 5 9 12 7 2

    Solution:

    Group f Class Boundary

    Cumulative

    Frequency

    60 64 1 59.5 64.5 1

    65 69 5 64.5 69.5 6

    70 74 9 69.5 74.5 15

    75 79 12 74.5 79.5 27

    80 84 7 79.5 84.5 34

    85 89 2 84.5 89.5 36

    item item

    Median from Discrete Data:

    When the data follows the discrete set of values grouped by size. We use theformula (n+1/2)th item for finding median. First we form a cumulative frequency distribution andmedian is that value which corresponds to the cumulative frequency in which (n+1/2)th item lies.

    Example:

    The following frequency distribution is classified according to the number of leaves ondifferent branches. Calculate the median number of leaves per branch.

    No of Leaves 1 2 3 4 5 6 7

    No of Branches 2 11 15 20 25 18 10

    Solution:

    No of Leaves

    X

    No of Branches

    f

    Cumulative

    Frequency C.F

    1 2 2

    2 11 133 15 28

    4 20 48

  • 8/3/2019 Statistics Basics UNIT-1

    43/74

    5 25 73

    6 18 91

    7 10 101

    Total 101

    Median = Size of item item

    Median = 5 because item lies corresponding to5

    Quartiles:

    There are three quartiles called, first quartile, second quartile and third quartile. There quartilesdivides the set of observations into four equal parts. The second quartile is equal to the median.The first quartile is also called lower quartile and is denoted by Q1. The third quartile is also calledupper quartile and is denoted by Q3. The lower quartile Q1 is a point which has 25% observationsless than it and 75% observations are above it. The upper quartile Q3 is a point with 75%observations below it and 25% observations above it.

    Quartile for Individual Observations (Ungrouped Data):

    Quartile for a Frequency Distribution (Discrete Data):

    Quartile for Grouped Frequency Distribution:

  • 8/3/2019 Statistics Basics UNIT-1

    44/74

    Example:

    The wheat production (in Kg) of 20 acres is given as: 1120, 1240, 1320, 1040, 1080, 1200,1440, 1360, 1680, 1730, 1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470, 1750, and 1885. Findthe quartile deviation and coefficient of quartile deviation.

    Solution:

    After arranging the observations in ascending order, we get1040, 1080, 1120, 1200, 1240, 1320, 1342, 1360, 1440, 1470, 1600, 1680, 1720, 1730, 1750,1755, 1785, 1880, 1885, 1960.

    Example:

    Calculate the quartile deviation and coefficient of quartile deviation from the data givenbelow:

    Maximum Load

    (short-tons)Number of Cables

    9.3-9.7 2

    9.8-10.2 5

  • 8/3/2019 Statistics Basics UNIT-1

    45/74

    10.3-10.7 12

    10.8-11.2 17

    11.3-11.7 14

    11.8-12.2 6

    12.3-12.7 3

    12.8-13.2 1

    Solution:

    The necessary calculations are given below:

    Maximum Load(short-tons)

    Number of Cables(f)

    ClassBoundaries

    CumulativeFrequencies

    9.3-9.7 2 9.25-9.75 2

    9.8-10.2 5 9.75-10.25 2 + 5 = 7

    10.3-10.7 12 10.25-10.75 7 + 12 = 19

    10.8-11.2 17 10.75-11.25 19 + 17 = 36

    11.3-11.7 14 11.25-11.75 36 + 14 = 50

    11.8-12.2 6 11.75-12.25 50 + 6 =56

    12.3-12.7 3 12.25-12.75 56 + 3 = 59

    12.8-13.2 1 12.75-13.25 59 + 1 =60

    lies in the class

    Where , , , and

    lies in the class

  • 8/3/2019 Statistics Basics UNIT-1

    46/74

    Where , , , and

    Graphic Location of Median:

    Median and other partition values can be located from the graph of the cumulative frequencypolygon (Ogive Polygon). Suppose we have a graph of the cumulative frequency polygon asshown in the figure below.

    On Y-axis, we mark the height equal to and from this point we draw a straight line parallel to X-axis which intersects the polygon at the point m. from the point m, we draw a perpendicular whichtouches the X-axis at M. This point on X-axis is the median. Similarly, for the lower quartile wetake height equal to n/4 on the Y-axis. From this we draw a line parallel to X-axis which the

    polygon at the point q. From this point we draw perpendicular on X-axis which touches it at thepoint Q1 which is the first quartile. For upper quartile take the height on Y-axis equal to 3n/4

    Deciles and Percentiles:

    Deciles:

    The deciles are the partition values which divides the set of observations into tem equalparts. There are nine deciles namely D1,D2,D3 . . . . D9. The first decile is D1 is a point which has10% of the observations below it.

    Deciles for Individual Observations (Ungrouped Data):

  • 8/3/2019 Statistics Basics UNIT-1

    47/74

    Quartile for a Frequency Distribution (Discrete Data):

    Decile for Grouped Frequency Distribution:

    Percentiles:

    The percentiles are the points which divide the set of observations into one hundred equal

  • 8/3/2019 Statistics Basics UNIT-1

    48/74

    parts. These points are denoted byP1, P2, P3 . . . . P99, and are called the first, second, third, ,ninety ninth percentiles. The percentiles are calculated for very large number of observations likeworkers in factories and the population in provinces or countries. The percentiles are usually

    calculated for grouped data. The first percentile denoted by is calculated as P1 = Value of(n/100)thitem. We find the group in which the (n/100)th itemlies and then P1 is interpolated from

    the formula.

    Introduction to Measure of Dispersion:

    A modern student of statistics is mainly interested in the study of variability and uncertainty. Inthis section we shall discuss variability and its measures and uncertainty will be discussed in

    probability. We live in a changing world. Changes are taking place in every sphere of life. A manof statistics does not show much interest in those things which are constant. The total area of theearth may not be very important to a research minded person but the area under different crops,area covered by forests, area covered by residential and commercial buildings are figures of greatimportance because these figures keep on changing form time to time and from place to place.Very large number of experts is engaged in the study of changing phenomenon. Experts workingin different countries of the world keep a watch on forces which are responsible for bringing

    changes in the fields of human interest. The agricultural, industrial and mineral production andtheir transportation from one part to the other parts of the world are the matters of great interest tothe economists, statisticians, and other experts. The changes in human population, the changes instandard living, and changes in literacy rate and the changes in price attract the experts to make

  • 8/3/2019 Statistics Basics UNIT-1

    49/74

    detailed studies about them and then correlate these changes with the human life. Thus variabilityor variation is something connected with human life and study is very important for mankind.

    Dispersion:

    The word dispersion has a technical meaning in statistics. The average measures the centerof the data. It is one aspect observations. Another feature of the observations is as to how the

    observations are spread about the center. The observation may be close to the center or they maybe spread away from the center. If the observation are close to the center (usually the arithmeticmean or median), we say that dispersion, scatter or variation is small. If the observations arespread away from the center, we say dispersion is large. Suppose we have three groups of studentswho have obtained the following marks in a test. The arithmetic means of the three groups are alsogiven below:

    Group A: 46, 48, 50, 52, 54

    Group B: 30, 40, 50, 60, 70

    Group C: 40, 50, 60, 70, 80

    In a group A and B arithmetic means are equal i.e. . But in group A theobservations are concentrated on the center. All students of group A have almost the same level of

    performance. We say that there is consistence in the observations in group A. In group B the meanis 50 but the observations are not closed to the center. One observation is as small as 30 and oneobservation is as large as 70. Thus there is greater dispersion in group B. In group C the mean is60 but the spread of the observations with respect to the center 60 is the same as the spread of theobservations in group B with respect to their own center which is 50. Thus in group B and C the

    means are different but their dispersion is the same. In group A and C the means are different andtheir dispersions are also different. Dispersion is an important feature of the observations and it ismeasured with the help of the measures of dispersion, scatter or variation. The word variability isalso used for this idea of dispersion

    The study of dispersion is very important in statistical data. If in a certain factory there isconsistence in the wages of workers, the workers will be satisfied. But if some workers have highwages and some have low wages, there will be unrest among the low paid workers and they mightgo on strikes and arrange demonstrations. If in a certain country some people are very poor andsome are very high rich, we say there is economic disparity. It means that dispersion is large. The

    idea of dispersion is important in the study of wages of workers, prices of commodities, standardof living of different people, distribution of wealth, distribution of land among framers and variousother fields of life. Some brief definitions of dispersion are:

    1. The degree to which numerical data tend to spread about an average value is called thedispersion or variation of the data.

    2. Dispersion or variation may be defined as a statistics signifying the extent of thescatteredness of items around a measure of central tendency.

    3. Dispersion or variation is the measurement of the scatter of the size of the items of a seriesabout the average

    Measures of Dispersion:

  • 8/3/2019 Statistics Basics UNIT-1

    50/74

    For the study of dispersion, we need some measures which show whether the dispersion is smallor large. There are two types of measure of dispersion which are:

    (a) Absolute Measure of Dispersion

    (b) Relative Measure of Dispersion

    Absolute Measures of Dispersion:

    These measures give us an idea about the amount of dispersion in a set of observations. They givethe answers in the same units as the units of the original observations. When the observations arein kilograms, the absolute measure is also in kilograms. If we have two sets of observations, wecannot always use the absolute measures to compare their dispersion. We shall explain later as towhen the absolute measures can be used for comparison of dispersion in two or more than two setsof data. The absolute measures which are commonly used are:

    1. The Range

    2. The Quartile Deviation3. The Mean Deviation4. The Standard deviation and Variance

    Relative Measure of Dispersion:

    These measures are calculated for the comparison of dispersion in two or more than two sets ofobservations. These measures are free of the units in which the original data is measured. If theoriginal data is in dollar or kilometers, we do not use these units with relative measure ofdispersion. These measures are a sort of ratio and are called coefficients. Each absolute measure ofdispersion can be converted into its relative measure. Thus the relative measures of dispersion are:

    1. Coefficient of Range or Coefficient of Dispersion.2. Coefficient of Quartile Deviation or Quartile Coefficient of Dispersion.3. Coefficient of Mean Deviation or Mean Deviation of Dispersion.4. Coefficient of Standard Deviation or Standard Coefficient of Dispersion.5. Coefficient of Variation (a special case of Standard Coefficient of Dispersion)

    Range and Coefficient of Range:

    The Range:

    Range is defined as the difference between the maximum and the minimum observation of the

    given data. Ifxm denotes the maximum observation xo denotes the minimum observation then therange is defined as

    Range = xm - xo

    In case of grouped data, the range is the difference between the upper boundary of the highest

    class and the lower boundary of the lowest class. It is also calculated by using the differencebetween the mid points of the highest class and the lowest class. It is the simplest measure ofdispersion. It gives a general idea about the total spread of the observations. It does not enjoy any

    prominent place in statistical theory. But it has its application and utility in quality control

  • 8/3/2019 Statistics Basics UNIT-1

    51/74

    methods which are used to maintain the quality of the products produced in factories. The qualityof products is to be kept within certain range of values.

    The range is based on the two extreme observations. It gives no weight to the centralvalues of the data. It is a poor measure of dispersion and does not give a good picture of theoverall spread of the observations with respect to the center of the observations. Let us considerthree groups of the data which have the same range:

    Group A: 30, 40, 40, 40, 40, 40, 50Group B: 30, 30, 30, 40, 50, 50, 50Group C: 30, 35, 40, 40, 40, 45, 50

    In all the three groups the range is 50 30 = 20. In group A there is concentration ofobservations in the center. In group B the observations are friendly with the extreme corner and ingroup C the observations are almost equally distributed in the interval from 30 to 50. The rangefails to explain these differences in the three groups of data. This defect in range cannot beremoved even if we calculate the coefficient of range which is a relative measure of dispersion. Ifwe calculate the range of a sample, we cannot draw any inferences about the range of the

    population.

    Coefficient of Range:

    It is relative measure of dispersion and is based on the value of range. It is also called rangecoefficient of dispersion. It is defined as:

    Coefficient of Range

    The range xm - xo is standardized by the total xm + xo

    Let us take two sets of observations. Set A contains marks of five students in Mathematicsout of 25 marks and group B contains marks of the same student in English out of 100 marks.

    Set A: 10, 15, 18, 20, 20Set B: 30, 35, 40, 45, 50The values of range and coefficient of range are calculated as:

    Range Coefficient of Range

    Set A: (Mathematics) 20 - 10 = 10

    Set B: (English) 50 - 30 = 20

    In set A the range is 10 and in set B the range is 20. Apparently it seems as if there isgreater dispersion in set B. But this is not true. The range of 20 in set B is for large observations

  • 8/3/2019 Statistics Basics UNIT-1

    52/74

    and the range of 10 in set A is for small observations. Thus 20 and 10 cannot be compareddirectly. Their base is not the same. Marks in Mathematics are out of 25 and marks of English areout of 100. Thus, it makes no sense to compare 10 with 20. When we convert these two valuesinto coefficient of range, we see that coefficient of range for set A is greater than that of set B.Thus there is greater dispersion or variation in set A. The marks of students in English are morestable than their marks in Mathematics.

    Example: Following are the wages of 8 workers of a factory. Find the range and the coefficient ofrange. Wages in ($) 1400, 1450, 1520, 1380, 1485, 1495, 1575, 1440.

    Solution:

    Here Largest value xm = 1575 and Smallest Value xo =1380Range = xm - xo = 1575 1380 = 195

    Coefficient of Range

    Example:

    The following distribution gives the numbers of houses and the number of persons perhouse.

    Number of Persons 1 2 3 4 5 6 7 8 9 10Number of Houses 26 113 120 95 60 42 21 14 5 4

    Calculate the range and coefficient of range.

    Solution:

    Here Largest value xm = 10 and Smallest Value = xo = 1Range = xm - xo = 10 1 = 9

    Coefficient of Range

    Example:

    Find the range of the weight of the students of a university.

    Weights (Kg) 60-62 63-65 66-68 69-71 72-74

    Number of Students 5 18 42 27 8

    Calculate the range and coefficient of range.

    Solution:

    Weights (Kg) Class Boundaries Mid Value No. of Students

  • 8/3/2019 Statistics Basics UNIT-1

    53/74

    60-62 59.5-62.5 61 5

    63-65 62.5-65.5 64 18

    66-68 65.5-68.5 67 42

    69-71 68.5-71.5 70 27

    72.74 71.5-74.5 73 8

    Method 1:

    Here xm = Upper class boundary of the highest class = 74.5 xo = Lower class boundary of the lowest class = 59.5

    Range = xm - xo = 74.5 - 59.5 = 15 Kilogram

    Coefficient of Range

    Method 2:

    Here xm = Mid value of the highest class = 73 xo = Mid value of the lowest class = 61

    Range = xm - xo = 73 61 = 12 Kilogram

    Coefficient of Range

    Quartile Deviation and its Coefficient:

    Quartile Deviation:

    It is based on the lower quartile Q1and the upper quartile Q3. The difference Q3 - Q1 is called theinter quartile range. The difference Q3 - Q1 divided by 2 is called semi-inter-quartile range or thequartile deviation. Thus

    Quartile Deviation (Q.D)

    The quartile deviation is a slightly better measure of absolute dispersion than the range. But itignores the observation on the tails. If we take difference samples from a population and calculate

    their quartile deviations, their values are quite likely to be sufficiently different. This is calledsampling fluctuation. It is not a popular measure of dispersion. The quartile deviation calculated

  • 8/3/2019 Statistics Basics UNIT-1

    54/74

    from the sample data does not help us to draw any conclusion (inference) about the quartiledeviation in the population.

    Coefficient of Quartile Deviation:

    A relative measure of dispersion based on the quartile deviation is called the coefficient of quartile

    deviation. It is defined as

    Coefficient of Quartile Deviation

    It is pure number free of any units of measurement. It can be used for comparing the dispersion intwo or more than two sets of data.

    Example:

    The wheat production (in Kg) of 20 acres is given as: 1120, 1240, 1320, 1040, 1080, 1200,1440, 1360, 1680, 1730, 1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470, 1750, and 1885. Findthe quartile deviation and coefficient of quartile deviation.

    Solution:

    After arranging the observations in ascending order, we get1040, 1080, 1120, 1200, 1240, 1320, 1342, 1360, 1440, 1470, 1600, 1680, 1720, 1730, 1750,1755, 1785, 1880, 1885, 1960.

    Quartile Deviation (Q.D)

  • 8/3/2019 Statistics Basics UNIT-1

    55/74

    Coefficient of Quartile Deviation

    Mean Deviation and its Coefficient:

    The Mean Deviation:

    The mean deviation or the average deviation is defined as the mean of the absolute deviations ofobservations from some suitable average which may be the arithmetic mean, the median or themode. The difference (X - Average) is called deviation and when we ignore the negative sign, thisdeviation is written as |X - Average| and is read as mod deviations. The mean of these mod orabsolute deviations is called the mean deviation or the mean absolute deviation. Thus for sample

    data in which the suitable average is the , the mean deviation (M.D.) is given by the relation:

    Fo