Empirical methods for knowledge evolution across Knowledge Organization Systems
Evolution and variation of classification systems – KnowEscape workshop, March 4-5, 2015, Amsterdam
Richard P. Smiraglia
* Knowledge Organization Systems (KOSs), including classifications, can be evaluated and explained by reference to concept theory—i.e., knowledge expresses concepts, which are represented by terms; knowledge elements represent predicates and referents of specific knowledge units, knowledge units represent the synthesis of concept characteristics. Classes are large knowledge units that represent groupings of concepts according to prescribed characteristics, often as represented in texts (Dahlberg 2006 and 1978; Hjørland 2009).
KOS = Concepts
* The concepts, their representations, and their groupings as represented in texts or other contextual environments (or domains) are derived according to a system known as warrant (see Beghtol 2010).
Warrant
* 1) how well the KOS represents its warranted concepts, both individually and in contextual groupings, and
* 2) how well the individual classes, divisions and subdivisions are populated by target objects.
Two tests
* Eleven empirical approaches (Smiraglia 2015)
  * Subject pathfinders
  * Special classifications and thesauri
  * Empirical user studies
  * Informetric studies
  * Historical studies
  * Document and genre studies
  * Epistemological and critical studies
  * Terminological studies
  * Database semantics
  * Discourse analyses
  * Cognition, expert knowledge and AI
Domain Analysis
Smiraglia, Richard P. 2015. Domain analysis for knowledge organization. London: Chandos-Elsevier. (Forthcoming)
* An ontological base that reveals an underlying teleology
  * Does the group share a common goal that is implicit or explicit in its knowledge base?
* A set of common hypotheses
  * Is there a theoretical paradigm in operation? If so, it will dictate the hypotheses used in the domain for testing theoretical parameters. In non-scholarly domains, a parallel consideration applies to the means employed by the group to contribute to the evolution of its common goal.
* An epistemological consensus on methodological approaches
  * Most domains that embrace a single theoretical paradigm (or a consistent set of such paradigms) will share methodological approaches rooted in different epistemological points of view.
* Social semantics
  * At the simplest level this means that the group should be visibly in conversation utilizing its common ontology. At higher levels of complexity it means that there should be records of communication and exchange of ideas; in scholarly domains citation, intercitation, and co-citation will be evidence of social semantics.
Operationalizing domains for analysis
* Knowledge Space Lab
  * Studying the collectivity of the UDC as a stable reference classification
* Big Classification
  * Studying the population of the UDC
  * Bibliographic characteristics associated with elements of UDC
  * WorldCat
  * Leuven
  * Data values associated with elements of UDC in both venues
Empirical “Ontogeny”
[Figure: The spectral signature of the UDC (branch weight), with panels labeled 2005 and 2008. Labeled branches: 53 Physics, 54 Chemistry, 55 Earth Sciences; 61 Medical sciences, 62 Engineering, 63 Agriculture; 65 Communication and transportation industries, 66 Chemical technology, 67 Various industries. Visualization: www.magnaview.nl]
SCIENCE AND KNOWLEDGE
PHILOSOPHY. PSYCHOLOGY
RELIGION. THEOLOGY
SOCIAL SCIENCES
MATHEMATICS. NATURAL SCIENCES
APPLIED SCIENCES. MEDICINE. TECHNOLOGY
THE ARTS
LANGUAGE. LINGUISTICS. LITERATURE
GEOGRAPHY. BIOGRAPHY. HISTORY
[Figure panels: 1905 and 2008]
Change over time
Changes in the Main Classes
Fig. 1: (a) Distribution of main UDC classes; inner ring 1905, outer ring 1994. (b) Distribution of main UDC classes in 1994 (innermost ring), 1997, 1998, 2005, 2008 and 2009 (outermost ring).
Fig. 2: (a) Distribution of special auxiliaries among the main classes over the years. (b) Percentage of special auxiliaries relative to main classes, and their changes over time.
Fig. 3: (a) Distribution of common auxiliaries over the years. (b) Changes in the record number of main classes from 1993 to 2009.
* 394.4 is a main UDC number standing for “Public ceremonial, coronations”.
* The colon “:” is a connecting symbol representing “simple relation”.
* Square brackets are used for subgrouping. Everything within the [...] brackets is a unit. This unit starts with another main UDC class number, 92, standing for “Biographical studies. Genealogy. Heraldry. Flags”.
* Parentheses () that start with a non-zero numeric character denote a common auxiliary number of place. (100+437) indicates “(100) All countries in general” AND “(437) Czechoslovakia (1918-1992)”.
* 329.15 stands for “Political parties with a communist attitude”.
* The auxiliary of place “(437) Czechoslovakia (1918-1992)” is intercalated between 329 and 15 to allow for collocation of all Czechoslovakian parties irrespective of their political orientation, which are then ordered by type. The entire number thus represents the topic “Communist party of Czechoslovakia”, which is further specified by a common auxiliary of form, (091), denoting presentation in historical form, to express “the history of the communist party of Czechoslovakia”.
* The plus “+” is the common auxiliary sign for addition/coordination, introducing the next UDC number combination in the string, which consists of two parts: “327.32 International solidarity of the working class” and “(100) All countries in general”.
Data processing 394.4 :[92(100+437):329(437).15(091)+327.32(100)]
Public celebrations/ceremonies with significant biographical and historical elements, or artifacts associated with celebrations (e.g. flags, banners), which involve historical personalities (both Czechoslovakian and international) linked to the history of the Communist Party of Czechoslovakia and to the international movement of working-class solidarity worldwide.
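The deconstruction of the example string can be sketched as a small tokenizer. This is an illustration only, not the procedure used in the study: the function name `tokenize_udc` and the token grammar are assumptions, and the grammar covers just the symbols that occur in this one string (class numbers, parenthesized common auxiliaries, the connectors “:”, “+”, “/”, and square brackets), not the full UDC syntax.

```python
import re
from collections import Counter

# Token patterns for the symbols in the example string only (assumed subset).
TOKEN_RE = re.compile(r"""
    (?P<aux>\([^()]*\))             # common auxiliary in parentheses
  | (?P<number>\.?\d+(?:\.\d+)*)    # class number or its continuation
  | (?P<op>[:+/])                   # connecting symbols
  | (?P<bracket>[\[\]])             # subgrouping brackets
""", re.VERBOSE)


def tokenize_udc(udc: str) -> list[tuple[str, str]]:
    """Split a UDC string into (kind, text) tokens."""
    s = udc.replace(" ", "")
    tokens = []
    pos = 0
    while pos < len(s):
        m = TOKEN_RE.match(s, pos)
        if m is None:
            raise ValueError(f"unrecognized symbol at position {pos}: {s[pos]!r}")
        kind, text = m.lastgroup, m.group()
        if kind == "aux":
            # Per the slide: a leading non-zero digit marks an auxiliary of
            # place; (091) is an auxiliary of form, so "(0..." is read as form.
            kind = "form" if text.startswith("(0") else "place"
        tokens.append((kind, text))
        pos = m.end()
    return tokens


EXAMPLE = "394.4 :[92(100+437):329(437).15(091)+327.32(100)]"
tokens = tokenize_udc(EXAMPLE)

# Count connecting symbols outside parentheses; the "+" inside (100+437)
# stays part of the place auxiliary token in this sketch.
operator_counts = Counter(text for kind, text in tokens if kind == "op")

for kind, text in tokens:
    print(f"{kind:8} {text}")
```

Keeping the auxiliaries as whole tokens makes it easy to record, per string, which operators and which kinds of auxiliaries occur, which is the information the later cross-tabulations rely on.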
Network of UDC classes in the OCLC dataset by the use of the operations “:” (Simple relation), “/” (Consecutive extension), “+” (Addition)
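One way such a network could be assembled is to reduce each pair of class numbers joined by an operator to its main classes and count the resulting links. The sketch below assumes this aggregation; the slides do not describe how the network was actually built. The helper names are hypothetical, links are treated as undirected, and the three input triples are a simplified reading of the example string 394.4 :[92(100+437):329(437).15(091)+327.32(100)], not data from the OCLC set.

```python
from collections import Counter


def main_class(udc_number: str) -> str:
    """Return the first digit of a UDC number, i.e. its main class."""
    return udc_number.lstrip(".")[0]


def build_network(triples):
    """Count operator-labeled links between main classes (undirected)."""
    edges = Counter()
    for left, op, right in triples:
        a, b = sorted((main_class(left), main_class(right)))
        edges[(a, op, b)] += 1
    return edges


# Illustrative (left number, operator, right number) triples.
triples = [
    ("394.4", ":", "92"),
    ("92", ":", "329"),
    ("329", "+", "327.32"),
]

network = build_network(triples)
for (a, op, b), weight in sorted(network.items()):
    print(f"class {a} {op} class {b}: {weight}")
```

The counts can serve as edge weights in any graph tool; whether “/” should be modeled as a link at all, given that it denotes a range of consecutive numbers, is a separate design decision.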
* Methodology
  * Random sample of 400 from 9 million UDC/OCLC number pairs / 95,000 Leuven strings
  * 95% confidence ±5%
  * MARC text records located
  * Bibliographic characteristics recorded
  * UDC numbers deconstructed and operators recorded
  * IBM-SPSS used to look for correlations, mostly by cross-tabulation
Big Classification: Population of the UDC
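The slides do not say how the sample size of 400 was chosen. As a plausibility check, the sketch below applies Cochran's standard formula for estimating a proportion, with a finite population correction, under the usual assumptions of z = 1.96 for 95% confidence, a ±5% margin, and maximum variance (p = 0.5). The function name is hypothetical.

```python
import math


def required_sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's sample size for a proportion, with finite population correction."""
    n0 = z ** 2 * p * (1 - p) / margin ** 2      # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)         # finite population correction
    return math.ceil(n)


# Population sizes as given on the slide.
for label, size in [("UDC/OCLC number pairs", 9_000_000),
                    ("Leuven strings", 95_000)]:
    print(label, required_sample_size(size))
```

Under these assumptions a sample of 400 is slightly larger than the minimum required for either population.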
Correlated characteristics
                ISBN   Edition   Series   Bibliog.   Linked record
ISBN                                      .003
Edition                          .006
Series                 .006                          .024
Bibliog.        .003
Linked record                    .024
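The values in this table read like significance levels from cross-tabulation, but the slides do not name the test statistic. The sketch below assumes a Pearson chi-square test of independence on a 2×2 table (presence or absence of two characteristics). The counts are invented for illustration and are not the study's data; the function name is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and its p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / margins
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts for 400 records:
#                    bibliography   no bibliography
#   ISBN present          120              80
#   ISBN absent            90             110
chi2, p_value = chi_square_2x2(120, 80, 90, 110)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

A p-value below .05 would be reported as a statistically significant association between the two characteristics, which is how the entries in the table are read here.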
Correlated subject indicators
name topic place Index term Genre Form
name
topic .046 .018
place .046
Index term .063 .008
Genre Form .018 .008
Correlate bibliographic characteristics and subject indicators
ISBN Edition Series Bibliog. Linked record
name
topic .043
place .091
Index term .001 .048
Genre Form
UDC population
UDC main class   Frequency   Proportion
0                38          .09
1                15          .03
2                15          .03
3                85          .21
5                31          .08
6                70          .18
7                38          .10
8                62          .16
9                34          .09
Auxiliary operators and UDC main classes
“+” “:” “/”
“+”
“:”
“/”
“+” “:” “/”
0 .001
1
2
3 .006
5
6 .019 .001 .006
7 .006
8
9 .019 .006
Statistically significant correlations occurred among most of the deconstructed components of the UDC numbers.
Statistically significant correlations were discovered among the MARC-designated elements of the respective bibliographic records.
Also, statistically significant correlations were discovered between the elements of classification and the bibliographic elements.
Correlate Everything
* Classification correlates with collection and catalog characteristics in predictable ways
* In a classified bibliographic dataset predictable pathways of association exist through the data.
* But these UDC numbers are recent and not collection-specific; and
* Values (e.g., subject headings, languages, publisher names) were not tested
* Leuven data show a slightly different picture.
It worked ..., but ...
Some bibliographic characteristics from Leuven data
Language %
English 28.2
French 21
Dutch 19.5
German 14.8
Italian 3.8
Spanish 1.8
Latin 1.5
Chinese 0.8
Polish 0.8
Japanese 0.5
Korean 0.3
ISBN 39.2%
Edition statement 10.6%
Series statement 38.6%
Population of the UDC
[Pie chart 1] UDC 0 20%; UDC 1 1%; Language 4 56%; UDC 5 6%; UDC 6 8%; UDC 7 6%; UDC 8 3%
[Pie chart 2] UDC 0 14%; UDC 1 2%; UDC 2 18%; UDC 3 21%; UDC 5 11%; UDC 6 16%; UDC 7 8%; UDC 8 1%; UDC 9 9%
[Pie chart 3] aux 0 20%; aux 1 1%; aux 4 57%; aux 5 6%; aux 6 7%; aux 7 6%; aux 8 3%
Common auxiliaries linked to UDC classes
UDC classes linked with common auxiliary signs
Correlated characteristics
UDC class ISBN Edition Series “:” “*” “-“ “/” “=” “<>”
0 X
1 X X
2 X X X
3 X X
5
6 X X
7 X
8
9
Sborníky 9
Marruecos 7
Učebnice vysokých škol 7
Energía de la biomasa 4
Křesťansky život 4
Painting 4
Painting, Modern 4
Sborníky konferenci 4
Asociaciones 3
Brožury 3
Česko 3
Christian life 3
Español (lengua) 3
Katalogy 3
Literatura española 3
malířstvi 3
Populárne-naučne publikace 3
Příručky 3
Textbooks (higher) 3
turizem 3
“Subject” terms in 65x fields in WorldCat
• Czech Republic $x cultural relations $y to 1918
• Danmark Illerup Jylland Arkeologi Parallelltekst
• Catequesis $v Manuales para animadores
• Unilateral acts (International law)
• Neutrophils
• Jews $x persecutions $y 1939-1945
• Weapons of mass destruction
Correlations: Place names, Publisher names
[Figure: place names (Madrid, Prague, Barcelona, Helsinki, Ljubljana, New York) and publisher names (SPN, Kirjalito) plotted against UDC main classes 0 1 2 3 5 6 7 8 9 and the operators “+”, “:”, “/”]
-41 Painting, Czech
Painting, Modern--20th century--Czech Republic
Painting, Modern--21st century--Czech Republic
-437.313 Painting, Modern--19th century--Russia
Painting, Modern--20th century--Russia
Painting--Czech Republic--Náchod
Painting, Russian
7.03 Painting--19th-20th centuries
72/76 Painting
75-x(1/9) Gothic painting--mural painting--Slovenia
75-x(1/9) painting--mural painting--frescoes--17th/18th cent.--Italy
“painting”: semantic cluster with UDC strings
* Energía de la biomasa--Aspectos ambientales--Países de la Unión Europea.
* 361
* Energías renovables--Aspecto del medio ambiente.
* 35-2â
* Energiförbrukning.
* 663.42
Richard P. Smiraglia, Knowledge Organization Research Group, 2014
small semantic cluster
The iSchool at UWM
* Leuven data
* Semantic clusters from populations
* Logistic regression with structural features
Still to be done
Beghtol, Clare. 2010. Classification theory. Encyclopedia of library and information sciences, 3rd ed., 1, no. 1: 1045-60.
Berry, John W. 1997. Immigration, acculturation, and adaptation. Applied psychology 46, no. 1: 5-34.
Dahlberg, Ingetraut. 1978. A referent-oriented, analytical concept theory for Interconcept. International classification 5: 142-51.
Dahlberg, Ingetraut. 2006. Knowledge organization: a new science? Knowledge organization 33: 11-19.
Hjørland, Birger. 2002. Domain analysis in information science: eleven approaches, traditional as well as innovative. Journal of documentation 58: 422-62.
Hjørland, Birger. 2009. Concept theory. Journal of the American Society for Information Science and Technology 60: 1519-36.
Salah, Almila Akdag, Cheng Gao, Krzysztof Suchecki, Andrea Scharnhorst and Richard P. Smiraglia. 2012. The evolution of classification systems: ontogeny of the UDC. In A. Neelameghan and K.S. Raghavan eds., Categories, contexts, and relations in knowledge organization: Proceedings of the Twelfth International ISKO Conference, 6-9 August 2012, Mysore, India. Advances in knowledge organization 13. Würzburg: Ergon Verlag, pp. 51-57.
Scharnhorst, Andrea and Richard P. Smiraglia. 2012. Evolution of classification systems. In Proceedings of the ASIST SIG/CR Classification Workshop, Baltimore, Md., October 25, 2012.
Smiraglia, Richard P. 2013. Big classification: using the empirical power of classification interaction. In D. Grant Campbell ed., Proceedings of the ASIST SIG/CR Classification Workshop, Montréal, 2 November 2013.
Smiraglia, Richard P. 2014. Classification interaction demonstrated empirically. In Wiesław Babik ed., Knowledge organization in the 21st century: Between Historical Patterns and Future Prospects, Proceedings of the 13th International ISKO Conference, Krakow, Poland, May 19-22, 2014. Advances in knowledge organization 14. Würzburg: Ergon Verlag, pp. 176-83.
Smiraglia, Richard P. 2015. Domain analysis for knowledge organization. London: Chandos-Elsevier. (Forthcoming)
Smiraglia, Richard, Andrea Scharnhorst, Almila Akdag Salah and Cheng Gao. 2013. UDC in action. In Aïda Slavic, Almila Akdag Salah and Sylvie Davies eds., Classification and Visualization: Interfaces to Knowledge, Proceedings of the International UDC Seminar, 24-25 October 2013, The Hague, The Netherlands. Würzburg: Ergon Verlag.
Tennis, Joseph T. 2006. Versioning concept schemes for persistent retrieval. Bulletin of the American Society of Information Science and Technology 32, no. 5: 13-16. Also available at: http://www.asis.org/Bulletin/Jun-06/tennis.html.
Tennis, Joseph T. 2007. Diachronic and synchronic indexing: modeling conceptual change in indexing languages. In C. Arsenault and K. Dalkir eds., Information Sharing in a Fragmented World, Crossing Boundaries: Proceedings of the 35th Annual Meeting of the Canadian Association for Information Science/L’Association Canadienne Des Sciences De L'information, Montreal. Montreal: Canadian Association for Information Science. Also available at: http://arizona.openrepository.com/arizona/bitstream/10150/105762/1/tennis_2007_dlist.pdf
References