Courrier des statistiques, English series no. 15, 2009

The how and why of statistical classifications¹

Michel Boeda*

Statisticians rely on coordinated classifications to map the economic and social sphere. They use regulatory classifications—chiefly because they are required to do so—and statistical classifications, which they create or help to develop. Moreover, the European Union framework is a driver for close harmonization of national classifications. Their convergence, now largely complete in the area of economic production, is still in progress in the socio-economic field.

Statisticians measure entities that have been defined beforehand in a field that has been identified, named, bounded, and “cadastered” (one might say “taxonomized”). They can use the most refined level of a classification like a zoom, a detailed level beyond which the possibility or significance of the measurement is lost. From another angle, the aggregated (“collapsed”) levels offer summary information.

Statisticians (here: official statisticians) are involved in many fields but use only a limited number of coordinated classifications. We can link classifications—for example, those of activities and products (which are symmetrical) and those of occupations and social categories (which are nested). We can also combine them (specializations and educational attainment) or create networks.

Classifications age because reality changes. They need to be revised periodically, but that is a difficult exercise for statisticians. The linkage between consecutive series is inevitably a complicated, imperfect, and problematic process that requires complex solutions (see article by Pinel). There is no exact correspondence between one or more items of the old and new classifications; otherwise, the “revision” would simply involve switching categories (why bother?) or nesting (change of scale).

Nesting is precisely the key to international comparability, and especially to European Union (EU) harmonization, now essential.
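As an illustration of why linking an old and a new classification is hard, the following minimal Python sketch labels each link of a correspondence table. The function name and all codes are invented for the example; it is not an official tool. A revision made only of 1:1 links would be a mere recoding, and one made only of 1:n links a nesting; the difficulty arises with n:m links.

```python
from collections import defaultdict


def classify_links(pairs):
    """Label each (old, new) link of a correspondence table.

    pairs: iterable of (old_code, new_code) tuples (hypothetical codes).
    Returns a dict mapping each pair to '1:1', '1:n' (old item split),
    'n:1' (old items merged) or 'n:m' (no clean correspondence).
    """
    pairs = list(pairs)
    new_per_old = defaultdict(set)
    old_per_new = defaultdict(set)
    for old, new in pairs:
        new_per_old[old].add(new)
        old_per_new[new].add(old)

    labels = {}
    for old, new in pairs:
        split = len(new_per_old[old]) > 1   # the old item feeds several new items
        merged = len(old_per_new[new]) > 1  # the new item draws on several old items
        if split and merged:
            labels[(old, new)] = "n:m"
        elif split:
            labels[(old, new)] = "1:n"
        elif merged:
            labels[(old, new)] = "n:1"
        else:
            labels[(old, new)] = "1:1"
    return labels


# Invented correspondence table between an old and a new classification.
table = [
    ("A1", "X1"),                # simple recoding
    ("A2", "X2"), ("A2", "X3"),  # A2 is split: nesting
    ("A3", "X4"), ("A4", "X4"),  # A3 and A4 are merged
    ("A5", "X5"), ("A5", "X6"), ("A6", "X6"),  # tangled links
]
labels = classify_links(table)
```

Only the links labelled n:m (and the items connected to them) require estimation when linking consecutive series; the others can be converted mechanically in at least one direction.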

Overview of classifications

Statisticians begin by finding their bearings in a territory and laying down markers in areas that have largely been classified without their participation, such as accounting, law, and regulations.

* Michel Boeda was Head of the Classifications Division at INSEE’s Head Office, 1989-1995, then Deputy Head of the Statistical and Accounting Standards Department, and ended his career at CEFIL (INSEE’s Training Center in Libourne), where, among other projects, he organized seminars to prepare statisticians from the European Union accession countries for their new statistical environment.

1. Originally published as “Les nomenclatures statistiques: pourquoi et comment,” Courrier des statistiques (French series), no. 125, Nov.-Dec. 2008, pp. 5-11, http://www.insee.fr/fr/ffc/docs_ffc/cs125b.pdf.

Systema Naturæ (“The System of Nature”) (1748), by the taxonomist Carl von Linnæus. Source: Wikipedia

Box 1: Nomenclatures and classifications

In ancient Rome, the nomenclator was the usher who announced the names and titles of senators: a nomenklatura before its time! The term “nomenclature” refers to the concept of naming. “Classifications” are more suggestive of the need to organize knowledge categories.

The structuring of the economic and social sphere has been a very gradual process. International harmonization is recent and incomplete. Some typologies are built from data analyses and intended mainly for study purposes. Multiple-use statistical classifications are informed by principles and objectives. To identify—for example, in a business register or by assigning a geographic code—is not to classify.

The terms “nomenclature” and “classification” exist in French and English. French speakers tend to use the first, English speakers the second. We can treat them as synonyms, each focusing on one aspect of the concept: a system for filing items in drawers, with specific instructions on what goes where (classification), and a set of labels describing the general content of each drawer (nomenclature).


But statisticians also contribute to the evolution of regulatory, economic, social, and other standards. One example is national accounting, which defines and organizes flows in the economic system. Similarly, sociodemographic statistics explore different aspects of people and their relationship to work, an indicator of social category. Statisticians flesh out and arrange the administrative foundations, thereby helping to structure the economic and social sphere.

French administrative territorial units, from regions to municipalities (communes), are the fruit of our history: regions are ranked in NUTS (EU Nomenclature of Territorial Units for Statistics) at the second level, the 36,000 communes at the fifth level. These indivisible atoms of the French Geographic Code account for about one-third of EU items, hardly a balanced situation. There are a host of other geographic divisions that statistical methods have helped to define. Special mention should be made of “employment areas” (zones d’emploi), which are based on commuting patterns but comply with the administrative constraints imposed by regional boundaries.

To structure the world of enterprises, the SIRENE register uses categories that reflect mandatory reporting by firms and offer an outline of “institutional sectors.” The General Chart of Accounts (Plan Comptable Général), a central reference, is not a statistical classification, although it incorporates features requested by statisticians. In France, corporate tax returns for “income from industrial and commercial activities” (bénéfices industriels et commerciaux: BIC) notably serve to prepare “intermediate corporate accounts” (comptes intermédiaires des entreprises), which can be broken down by economic activity.

National accounting, a representation of economic flows, makes reference to various classifications defined by the United Nations System of National Accounts (SNA 93) in the accounts of institutional sectors: transactions in goods and services, distribution-of-income transactions, and financial transactions. National accounting also relies on classifications of activities (for industry accounts) and products (for supply-and-use tables) as well as on functional classifications for household consumption and government expenditures. “Satellite accounts” build bridges between national accounting and various sectors with specific classifications, such as tourism, research, and agriculture.

The health field (diseases, causes of death, and so on) has its own international statistical norms. It also uses social-insurance management tools for tracking medical procedures, medical and paramedical occupations, and so on. The same is true for education with UNESCO’s International Standard Classification of Education (ISCED) and management tools used by French educational district authorities (rectorats).

Classifications regarding individuals (giving age, vital statistics, nationality, and other characteristics) are used by statisticians in the most neutral manner possible. But they are primarily administrative classifications subject to various limitations with respect to civil or penal legal age, number of tax-deduction units per household, and so on. Having been rejected by the Constitutional Council, ethnic- and religious-based typologies are not used in France.

Box 2: Classifications and mathematics

Partition

A “flat” (single-level) classification forms a partition of the field studied, that is, a breakdown into disjoint equivalence classes. A multi-level classification consists of nested partitions.

Partitions have a “lattice structure” with respect to nesting, just as whole numbers do with respect to divisibility: A nests within B, B nests within A, or there is no nesting. Like least common multiples (LCMs) and greatest common divisors (GCDs), the “product” classification (intersection of two classifications) is the one in which we need to collect information if we want to publish results in both classifications; the “sum” classification (union) is the one in which we can compare the result of the data collected in either classification (Arkhipoff, 1976).

We can identify equivalence classes (codes, descriptions) but there is no natural order. An international classification cannot serve as a bank of basic items and, at the same time, supply the nested aggregation categories for national classifications.

Tree structure

Classifications (espalier tree structures) fall within the scope of graph theory. This approach is better suited to research on the “closeness” of two classifications, notably between countries, by specifying a “distance” between tree structures. We can also assess the homogeneity of “more or less dense” tree structures using an entropic concept (disaggregation vs. aggregation) applied to information distribution.

INSEE has taken part in a European research project implicitly aimed at transcending dialectical debates by means of a technical approach, notably for revising the international product classification. There are indeed three ways to design such a classification:

– by origin (European approach)
– by purpose (American approach)
– by intrinsic nature of products (as in initial CPC).

Methodological advances have not sufficed to settle the debate (Boeda et al., 2002).

Data analysis

Instead of taking formal classification structure as its starting point, data analysis uses information on the objects to be classified in order to deduce aggregation classes and tree structures. It assumes the existence of data, a metric for the “distance” between two objects, and a choice of levels for aligning the tree-structure levels. The classification obtained depends on the data: any new information may call it into question (Volle et al., 1970).

Area divisions for study purposes routinely draw on data analysis. Its initial applications have revealed relevant and robust macroeconomic groupings. The classification of sports activities has relied on data analysis (Desrosières, 1972).
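The “product” and “sum” classifications mentioned in Box 2 can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each flat classification is represented as a dictionary from item to class label, both classifications cover the same items, and the items and labels are invented.

```python
def product(a, b):
    """Common refinement: two items stay together only if A and B both group them."""
    return {x: (a[x], b[x]) for x in a}


def partition_sum(a, b):
    """Finest common aggregation: merge classes of A and B that share an item."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for x in a:
        # Link the A-class and the B-class of every item.
        root_a, root_b = find(("A", a[x])), find(("B", b[x]))
        parent[root_a] = root_b
    return {x: find(("A", a[x])) for x in a}


def blocks(labelling):
    """Return the partition as a set of frozensets, ignoring label names."""
    groups = {}
    for item, label in labelling.items():
        groups.setdefault(label, set()).add(item)
    return {frozenset(g) for g in groups.values()}


# Two invented flat classifications of the same six items.
a = {"wheat": "agri", "barley": "agri", "milk": "agri",
     "cheese": "food", "tractor": "machinery", "plough": "machinery"}
b = {"wheat": "crops", "barley": "crops", "milk": "dairy",
     "cheese": "dairy", "tractor": "vehicles", "plough": "tools"}

finest = blocks(product(a, b))          # level at which data must be collected
coarsest = blocks(partition_sum(a, b))  # level at which results can be compared
```

In this toy case the product keeps wheat and barley together and isolates every other item, while the sum has only two classes (agricultural and dairy goods on one side, machinery on the other): results published in A and in B can be compared only at that coarse level.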

The customs classification, used extensively by statisticians, clearly illustrates the constraints of a regulatory classification. Its goal is to enable international trade to expand in a transparent setting and to provide a framework where rules can be stated with their legal consequences. Such rules apply to customs duties and refunds, quotas, narcotics, arms, hazardous products, and so on. The first obligation, therefore, is an unambiguous identification of all merchandise, objectively observable with state-of-the-art technology (for example: traces of genetically modified organisms [GMOs]). Customs categories are therefore far more focused on the boundaries of an item than on its core, the opposite of the statistician’s approach. Their description may be either a very long enumeration or a simple “other,” a balancing item spelled out at the next level of detail. Often, the economic destination of products is of little interest to customs authorities. The statistician consequently interprets a “crawler tractor” as a construction machine and a “wheeled tractor” as an agricultural machine.

Statisticians may also be called in to address a specific need. Three very different examples—education/training, waste, and physical and sports activities—illustrate “co-building” between classification supply and demand, involving various players and institutions.

The classification of education/training specializations addresses a long-latent need to reconstruct a classification that had become obsolete and mainly focused on public education programs provided in initial schooling. Technical training programs were poorly represented and, most importantly, lifelong education for working adults was ignored. Via CNIS, INSEE invited representatives from the public education system and the continuing education system (two worlds that usually do not work together) to sit around the same table (Gensbittel et al., 1992). A classification needs to be negotiated; it cannot be forced on users.

The classification of waste is a result of the technical impossibility of implementing the “European Waste Catalogue” prepared by jurists. To put it bluntly, the catalogue merely listed economic activities and inserted “waste from” before each. But many types of waste are not generated by activities. Examples include products at the end of their life cycle, from old documents to the French aircraft carrier Clemenceau.

At Eurostat’s request, IFEN, ADEME, INSEE, and a few international experts formed a working group. It defined waste categories by their nature, ranking them by hazard where applicable (chemical, radioactive, or biological), on the basis of degradability or recyclability in other cases. A secondary criterion was the sequence of mandatory treatment stages for the waste flows: collection, sorting, processing, and disposal.

The working group’s report was buried for three years. It was resurrected as an appendix to the 2002 EU regulation on waste statistics, but with an artificial linkage to the European Waste Catalogue, which goes to show how hard it is to abrogate a regulation.

The classification of physical and sports activities was developed by a working group composed of INSEE and the statistical unit of the Ministry of Youth Affairs and Sports (MJS, 2002). A wide variety of data, supplementing those of the 2000 Sports Participation Survey, were “crunched” through the ascending hierarchical classification (AHC) analysis method. The project leader was responsible for assigning weights to the different data categories. The outcome was a classification comprising 9 classes, 34 families, and 335 disciplines. While the names of the disciplines are drawn from standard sports terminology, the groupings identified by the data analysis are outright creations. Thus the labels invented to describe them mean nothing to people outside of the working group. Good luck to these new expressions!
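To give an idea of how an ascending hierarchical classification builds groupings from weighted data, here is a minimal pure-Python sketch. It is not the working group’s procedure: the two variables, their values, the weights, and the choice of average linkage are all assumptions made for the illustration.

```python
from itertools import combinations


def weighted_distance(u, v, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return sum(w * (p - q) ** 2 for w, p, q in zip(weights, u, v)) ** 0.5


def ascending_classification(points, weights, n_classes):
    """Merge the two closest clusters until n_classes remain (average linkage)."""

    def linkage(pair):
        c1, c2 = pair
        dists = [weighted_distance(points[a], points[b], weights)
                 for a in c1 for b in c2]
        return sum(dists) / len(dists)

    clusters = [[name] for name in points]  # start from singletons
    while len(clusters) > n_classes:
        c1, c2 = min(combinations(clusters, 2), key=linkage)
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(c1 + c2)
    return [sorted(c) for c in clusters]


# Invented data: (share practiced outdoors, share practiced in clubs).
points = {
    "football": (0.90, 0.80),
    "rugby": (0.95, 0.85),
    "swimming": (0.20, 0.40),
    "diving": (0.15, 0.45),
    "hiking": (1.00, 0.05),
}
weights = (1.0, 1.0)  # the analyst's weighting of the data categories
families = ascending_classification(points, weights, n_classes=3)
```

Changing the weights changes the distances and may change the groupings, which is why the weighting decision matters. The groupings come out of the data without names; labelling them remains a separate, human task.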

Box 3: Naming, or why words count

The Swiss classification of activities had distinguished between—and therefore named—metal roofing work as Bauspenglerei (German), travaux de ferblanterie (French), and lavori di lattoneria (Italian): three languages, three different metals. Moreover, in “French French,” ferblanterie would be replaced by zinguerie (evoking zinc rather than tin)! This example shows that literal translation is not always possible.

Translations from French into English and back again offer surprises: INSEE had suggested adding “certification of civil-engineering structures” (in French: certification des ouvrages d’art) to the explanatory notes for “technical inspection services.” The translation came back as “authentication of works of art” (authentification d’œuvres d’art).

In French, occupation can change with gender: the boulanger (“baker,” masculine) kneads the dough and minds the oven, the boulangère (“baker,” feminine) serves customers and operates the cash register. But the human brain is rather good at decoding ambiguities: of the three expressions coupe de cheveux (“haircut”), coiffeur (“hairdresser”), and salon de coiffure (“hair salon”), only the first denotes an activity; the second describes an occupation, and the third an establishment.

A likelihood test in a past population census had turned up a totally anomalous number of farmers in urban areas, nearly all of them female. By checking the sources, INSEE was able to identify the source of the anomaly: the occupation jardinière d’enfants (“kindergarten worker,” feminine) had been shortened to jardinière (“gardener,” feminine); the watchword, at the time, was to save computer processing space. The recurring error was easy to correct.


Structural classifications: activities and products

Formerly, each application program was implemented using a specific classification, for activities as well as products. There was no clear distinction between economic activities and individual activities (occupations). This Tower of Babel prevented full use of the information available.

Modern times

The inter-departmental register that preceded SIRENE created an opportunity to impose the classification of economic activities (NAE 59). The national accounts (most notably the input-output table) provided a strong case for a classification of products arranged in the same way as the classification of activities. At the same time, it made sense to submit product questionnaires to the firms that carried the corresponding Principal Economic Activity (APE) code. These desiderata were fulfilled by the “NAP 73” classification of activities and products. As its name does not indicate, it actually consisted of a pair of mirror classifications with 600 matching items. The “products” section was later refined in order to adapt it to the “industry” surveys and to begin the structuring of the vast tertiary sector (NODEP: detailed classification of products).

The inter-ministerial decree promulgating NAP made its use compulsory in official statistics. It specified that the classification by activity did not, in itself, create rights or duties for firms. And it reminded non-statistician users of their own responsibility (see article by Roussel).

The strictly national history of NAP ended after twenty years of good and faithful service (Lainé, 1999). But customs classifications had been sidelined, hence the lack of consistency at detailed level between the production and external-trade spheres.

Box 4: Understanding the activities-products correspondence

Each activity generates characteristic products. Must every product originate from a single activity? If we applied this principle to the lowest level of the classification of activities, we would be assuming a nesting relation that would make the product classification a sort of expanded version of the activity classification. The U.N. Statistical Commission eventually decided to design the CPC like the balance of payments.

Yet a concrete example brings the issue back to its proper proportions: in the old CPF, the “production of fish” activity corresponded to the “fish” product. But why deprive ourselves of the distinction between fishing and fish-farming activities, which exhibit major differences such as employment at sea or on land, processing equipment vs. boats, and resource management? Admittedly, statisticians cannot discern fish (unless, perhaps, they are also gourmets). We therefore have a product that is common to two activities at the detailed level (coding linked to the higher level). Taking an ordinary activities-products correspondence as our starting point, we had merely distinguished between two modes of production for reasons of relevance to business statistics, without impacting product statistics.

Retail trade offers a case that is more complex but open to the same analysis. Merchandising consists in offering customers the products they want in the right conditions. This is a service that justifies a profit margin. Each contract lists the products sold (invoice), and retail trade is accordingly broken down by product ranges sold. Retailing activity takes various forms, such as specialized stores, non-specialized department stores, street markets, mail order, and online vendors. Each form should be tracked, along with its effects on employment, urban planning, and the social bond. Here, the activities-products relationship takes the form of a matrix that cross-tabulates retailing methods and margins per product range sold.

The customs model

Starting in the late 1960s, the European Customs Union required Member States to use national customs classifications based on a “nesting” European matrix. The only latitude granted to individual countries was the right to subdivide any European item at the most detailed (“final”) level.

This model has been systematized. Since 1988, the same “Russian doll” arrangement applies:

– The first six digits of the customs code are those of the Harmonized System (HS).
– The next two allow a doubling of detailed breakdowns (from 5,000 to 10,000 items) in the EU’s “Combined Nomenclature” (CN) for the Common Customs Tariff and external trade statistics.
– The French classification (NGP) has a ninth position to express our exceptions: wine, cheese, and so on.

Russian dolls. Source: Wikipedia

These classifications evolve in tandem, their regulatory purpose leaving little room to accommodate statisticians’ needs.

The international and EU decision has been to use the customs classifications as the reference, which (in theory) settles the issues of consistency between the production sphere and the external-trade sphere. Every good is defined by a whole number of HS positions at international level and by a whole number of CN positions (if needed) in Europe. There are some adaptations, however. For example, customs classifications recognize only processed milk (a product of the food industry) and ignore raw milk (a product of livestock breeding) and many perishable products such as fresh pastry. Experience has led to a loosening of strict principles in the latest revision of the classifications (see article by Lacroix and Fuger).
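The “Russian doll” structure of customs codes (six HS digits, two more for the CN, a ninth position for the French NGP) means that each level can be read off a code by simple truncation. The short Python sketch below illustrates this; the function names are hypothetical and the code used is an invented digit string, not a real tariff line.

```python
def split_customs_code(code):
    """Split a 9-digit national code into its nested HS, CN and NGP levels."""
    digits = code.replace(" ", "").replace(".", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"expected a 9-digit code, got {code!r}")
    return {
        "HS": digits[:6],   # international Harmonized System position
        "CN": digits[:8],   # EU Combined Nomenclature position
        "NGP": digits,      # French national position
    }


def nests_within(detailed, aggregate):
    """True if the detailed code is a subdivision of the aggregate code."""
    return detailed.startswith(aggregate)


levels = split_customs_code("1234 56 78 9")  # invented code
```

Because nesting is visible in the code itself, aggregating national data to the EU or international level requires no correspondence table, only a prefix.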

The Single European Market

The first European Community classification of activities (NACE 70) is contemporary with NAP 73, but there is no one-for-one equivalence. The prospect of a single market in 1993 required good comparability of national statistics on the production sector. The solution was a revised NACE in which national classifications would be nested.

The operation was carried out in tandem with the third revision of the International Standard Industrial Classification of All Economic Activities (ISIC), administered by the United Nations, hence the same Russian-doll arrangement: ISIC Rev. 3 - NACE Rev. 1 - NAF. Each is broken down in detail in the next classification, but the nesting is not visible in the code.

The operation has just been repeated, in parallel with the fourth revision of ISIC. This time, EU transparency has been achieved. The present issue of Courrier des statistiques is largely devoted to the operation and its impact on French statistics.

The national implementation of the latest classification change reproduced the initial procedure (Boeda, 1996), but with better testing, supervision, and documentation. The timetable was tighter, as the statistical calendar was now more Europeanized.

The first result of EU discussions is the linkage between activities and products, as advocated by the French. This contributes to the overall consistency with customs classifications (see diagram on loose sheet inserted in this issue).

Along with the demands of national accountants, the decisive role fell to statisticians in charge of industrial statistics (Prodcom): to what should one link a European list of several thousand industrial goods, if not to the activity code of the industry of origin? Eurostat quickly understood that the EU classification of products would become an empty shell if it failed to fit in between Prodcom and NACE.

This realization led to the establishment of the Classification of Products by Activity (CPA). The CPA code reproduces the NACE code at aggregate levels, broken down into two supplementary positions for a detailed description, plus a two-digit position for the Prodcom list (for goods-producing industries).
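Because the CPA and Prodcom codes extend the NACE code position by position, data collected at the most detailed level can be rolled up to the coarser levels by truncating codes. The following Python sketch illustrates the idea. The length of the NACE code (four digits here) is an assumption not stated in the text, and the codes and values are invented.

```python
from collections import defaultdict

NACE_DIGITS = 4                  # assumption: length of the NACE class code
CPA_DIGITS = NACE_DIGITS + 2     # two supplementary positions (from the text)
PRODCOM_DIGITS = CPA_DIGITS + 2  # two more positions for the Prodcom list


def roll_up(values, n_digits):
    """Sum values recorded at Prodcom level up to a coarser level."""
    totals = defaultdict(float)
    for code, value in values.items():
        if len(code) != PRODCOM_DIGITS:
            raise ValueError(f"not a Prodcom-level code: {code!r}")
        totals[code[:n_digits]] += value
    return dict(totals)


# Invented production values by invented Prodcom-level code.
output = {"10511133": 40.0, "10511142": 10.0, "10515130": 25.0, "10521000": 5.0}

by_cpa = roll_up(output, CPA_DIGITS)
by_nace = roll_up(output, NACE_DIGITS)
```

The same product data thus feed product statistics (CPA level) and can be set against activity statistics (NACE level) without a separate correspondence table, which is the practical benefit of linking products to the industry of origin.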

In other words, this repeated the NAP 73 arrangement, fleshed out by NODEP and industry surveys. The now larger majority of EU statistical authorities has approved the CPA structure, although the latest CPC revision endorses its initial structure and U.S. statisticians have defended an alternative choice.

Macroeconomic classifications

Macroeconomic analysis must operate on large, economically significant categories, combining market characteristics and corporate strategy. For example, consumer industries have to be accommodative toward wholesalers and retailers, woo customers, and segment the market; by contrast, capital-goods industries exploit their technical know-how or that of a network of specialized subcontractors by catering to the needs of large customers. That is what the “Summary Economic Classification” (Nomenclature Économique de Synthèse: NES) sought to capture through its groupings, in contrast to the ISIC/NACE groupings, which were solely production-focused. Ultimately, NES was adopted only in France, whose statistical institute (INSEE) displays the singular characteristic of performing economic studies as well. Another aim was to counter non-coordinated groupings in official statistics. The issue re-emerged in the latest revision (see article by Madinier), leading to an EU compromise that involved the abandonment of NES and formalized the dissemination of statistics on different grouping levels.

French Classifications of Economic Activities and Products, 1973 version (NAP 73). Source: INSEE

Box 5: The association criterion

The French theoretical approach is based on a correspondence between activities and products and an association criterion. This prescribes a grouping of activities (including to form an elementary “building block”) that respects the associations most often encountered in units. Multiple activity is thus minimal and the significance of the classification is maximal.

Underneath this empirical observation lies a microeconomic determinism. If the market-entry cost of a key product is high (machinery, technology, research, etc.), the firm that takes the step sets up a near-monopoly on the production that depends on the key product. The firm has every incentive to press its advantage. Conversely, the producer of an ordinary product, exposed to stiff competition, will try to tailor the range to its customers, including as reseller. It is the combined set of products and activities that develops its structure in the market: groupings are shaped by a supply- or demand-driven rationale, as the case may be.

The association criterion, seldom articulated, informs discussions during revision processes. For example, the latest revision saw the end of the centuries-old association between printing and publishing, a separation between production and repair of industrial goods, a confirmation of the association between trade and repair of motor vehicles, and a convergence of multimedia activities.

Functional classifications

Beyond the production system, we need to track the uses of products—above all, household consumption.

The current international classification is the Classification of Individual Consumption by Purpose (COICOP: only the English abbreviation is used), with no EU or national version. A conversion table shows the correspondence with CPC (and hence CPA/CPF). COICOP is used for the Household Budget Survey (conducted throughout the EU) and to present the EU Harmonised Index of Consumer Prices (HICP). Purchasing power parities (PPPs) between countries are determined by means of a detailed breakdown of the lowest COICOP level.

The Classification of the Functions of Government (COFOG) is the international system used to categorize government expenditures. The latest version, revised for consistency with COICOP, is more specifically aimed at breaking down general-government final consumption (in the national-accounting sense) by category: general administration, defense, public order, education, health, social protection, and so on. Individual-consumption items such as education and health can thus be aggregated with similar items financed directly by households.

Structural classifications: occupations and socio-occupational categories

France’s system of “Occupations and Socio-occupational Categories” (Professions et Catégories Socioprofessionnelles: PCS) succeeds the “Socio-occupational Categories” (Catégories Socioprofessionnelles: CSP), still used in everyday language. This is a distinctly French development (see article by Desrosières). The very concept of social category, introduced in the middle of the Cold War, was bold. Technically, PCS comprises two nested classifications: occupations and socio-occupational categories.

The basic rationale is that social identity is built in the workplace. Occupation is decisive for social positioning. It is understood in the broad sense, i.e., including job description, skills, status, and terms of “collective agreements” between employers and employees in each industry. A person’s occupation reflects his or her education and training, family background, and the context in which (s)he engages in it. Income, lifestyle, and consumption patterns go hand in hand with occupation. The correlation also applies to retired people, and to households, so strongly does endogamy persist in occupational categories.

Econometricians therefore view the social category as an overall indicator that possesses a high explanatory power when applied to household behavior, and so eliminates the need for multiple kinds of information that are hard to access (such as income). There is no comparable international system: occupations lie within the scope of the International Labor Office (ILO), whereas social categories tend to be the object of academic study. By trying to promote convergence between these fields, the European Socio-economic Classification (ESeC) project is leading the way.

France has used PCS in censuses since 1982, introducing a new version in the 1999 census. A 2003 revision concerned only the “occupations” section. The classification is used in household surveys, while its variant for employees (PCS-ESE, where ESE stands for “Emploi Salarié en Entreprise” [paid employment in firms]) is used for surveys or administrative forms filed by employers.

At the same time as INSEE was updating its national classification of occupations, Eurostat was promoting the application of the International Standard Classification of Occupations (ISCO, 1988 version). However, ISCO was issued in a marginally adapted form for Europe called ISCO(COM). This venture registered some successes, particularly in new Member States with obsolete national classifications. But an old ambiguity endures: ISCO is focused not so much on people’s occupations as on the jobs they hold (see articles by Brousse and Torterat, who notably discuss ISCO and its European future).

PCS is very accurate if the protocol is followed strictly. That is not so easy a task, as it requires information ranging beyond the job position. ISCO is probably easier to code but leaves a wider margin for interpretation. The ILO expressly recognizes the need for national classifications of occupations, which should reflect the structure of national employment markets as faithfully as possible.

Work on ISCO 2008 is ending without truly convincing results, all the more so as it was performed on an international scale and certain now-established EU practices have been challenged, such as the use of the category “administrative managers in the public sector” (cadres administratifs publics).

Absent an approved international standard, social categorization remains a strictly EU undertaking. Eurostat had already been obliged to recommend pseudo-social categories (ISCO-based occupational groupings) for EU Household Budget Surveys: in so doing, it endorsed a founding principle of the French PCS. The theoretical inspiration for the current project is Goldthorpe’s table of classes; the project reference is the socio-economic classification (ESeC: see article by Brousse). The studies under way seek to measure the prototype’s capacity to capture occupations in a PCS and/or ISCO framework and to provide adequate explanatory power in various applications of the “common core” questionnaire in EU household surveys.

Conclusion

Economic classifications are highly standardized in the EU because the process began long ago, and because trade, technology, and the single market all provided incentives in the same direction (with reservations for services, where local specificities endure).

In the social sphere, the historical legacy of particularisms is becoming the rule. After half a century of European convergence, reciprocal recognition of education degrees has not made much progress, and the language barrier perpetuates heavy labor-market segmentation.

The advantages of international harmonization in the production sphere vastly outweigh the drawbacks. That argument is less self-evident in the social sphere. Despite their age, “tailor-made” national systems remain attractive by comparison to an off-the-shelf EU system that has not yet found its bearings.

Bibliography

Arkhipoff, O., 1976, “Taxonomie et sémantique: étude formelle des nomenclatures,” Journal de la Société statistique de Paris, vol. 117, no. 3.

Boeda, M., 1996, “Le changement de nomenclatures d’activités et de produits,” Actualités du CNIS, no. 16.

Boeda, M., Bruneau, E., Rivière, P., and Rousseau, R., 2002, “Insee contribution to the ‘Foundations’ Sub-project,” CLAMOUR project, www.statistics.gov.uk/methods_quality/clamour/.

Desrosières, A., 1972, “Un découpage de l’industrie en trois secteurs,” Économie et statistique, no. 40.

Gensbittel, M.-H., Hillau, B., and Join-Lambert, E., 1992, “Vers une nouvelle nomenclature des spécialités de formation,” Courrier des statistiques, no. 63.

Lainé, F., 1999, “Logiques sectorielles et nomenclatures d’activités,” Économie et statistique, no. 323.

MJS, 2002, “Une nomenclature pour les activités physiques et sportives,” MJS Stat Info, no. 02-02, March 2002.

Volle, M. et al., 1970, “L’analyse arborescente,” followed by “L’analyse de données et la construction des nomenclatures,” Annales de l’INSEE, no. 4.
