© Copyright 2007 Dow Jones and Company, Inc.
Centralized Taxonomy Management for Enterprise Information Systems
Enterprise Search Summit Wednesday, September 24th, 2:00 pm – 2:30 pm
Dow Jones Client Solutions, Synaptica, [email protected] | ProQuest, Manager, Taxonomy, [email protected]
Proprietary and Confidential | © Copyright 2007 Dow Jones and Company, Inc.
Dow Jones Taxonomy Solutions
Words:
• Dow Jones taxonomy licensing
• Other taxonomy licensing (Taxonomy Warehouse)
• Taxonomy customization
• Taxonomy development
Expertise:
• Taxonomy Assessment
• Taxonomy Consulting
• Analysis
• Recommendations
• Implementation
• Workshops
Tools:
• Synaptica: Taxonomy / Metadata Management Tool
Some Definitions
A taxonomy is a hierarchical topic structure to which information can be assigned through the dual processes of classification (filing to a location) and categorisation (tagging with relevant metadata). A taxonomy provides browsable navigation and supports filtered searching.
A thesaurus is a controlled vocabulary linking an organisation’s common language to its taxonomy structure. It accommodates synonyms, acronyms, language variants and other near equivalences. It also signposts non-hierarchical linkages within and across the taxonomy facets. A thesaurus is usually employed to interpret and guide user search queries.
An ontology is the working model of entities and interactions in a particular domain of knowledge or content set. It is a set of concepts - such as things, events, and relations - that are specified in some way in order to create an agreed-upon vocabulary for exchanging information. An ontology is increasingly used to visualise (or map) a set of search results and discover new or hidden connections.
Classic taxonomy (UP / DOWN): groups things or concepts into families
Traditional thesaurus (SIDEWAYS): captures the different names of the family members and explores some more distant associations (cousins & close friends)
Emerging ontology (Multi-Directional): shows a network of multi-dimensional relationships and properties both within and outside the family groups
UP / DOWN: Telephones is a broader term than Mobile Phones
SIDEWAYS: Mobile Phones are also known as Cell Phones & Hand Phones, and are similar to Hand Held Devices & PDAs
Multi-Directional: Mobile Phones are made by Phone Manufacturers and use the networks of Telecoms Service Providers
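The phone example above can be sketched as three small data structures. This is a minimal illustration only: the dictionary layout, the predicate names (`made_by`, `uses_network_of`) and the `describe` helper are assumptions made for this sketch, not part of any product.

```python
# Hypothetical in-memory model of the phone example; all names are illustrative.

# Taxonomy: child -> parent (UP / DOWN).
BROADER = {"Mobile Phones": "Telephones"}

# Thesaurus: preferred term -> synonyms and related terms (SIDEWAYS).
SYNONYMS = {"Mobile Phones": ["Cell Phones", "Hand Phones"]}
RELATED = {"Mobile Phones": ["Hand Held Devices", "PDAs"]}

# Ontology: typed (subject, predicate, object) triples (multi-directional).
TRIPLES = [
    ("Mobile Phones", "made_by", "Phone Manufacturers"),
    ("Mobile Phones", "uses_network_of", "Telecoms Service Providers"),
]


def describe(term):
    """Collect everything the three structures say about one term."""
    return {
        "broader": BROADER.get(term),
        "narrower": sorted(c for c, p in BROADER.items() if p == term),
        "synonyms": SYNONYMS.get(term, []),
        "related": RELATED.get(term, []),
        "facts": [(p, o) for s, p, o in TRIPLES if s == term],
    }


if __name__ == "__main__":
    print(describe("Mobile Phones"))
```

The taxonomy only needs a parent pointer, the thesaurus adds flat lists of equivalents and associations, and the ontology needs a named relationship on every link, which is why each step is more expressive and more costly to maintain.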
Metadata’s Evolutionary Path
• Dictionaries & Flat Lists
• Hierarchical Taxonomies
• Controlled Vocabulary Thesauri
• Ontologies
• Structured Authority Files
Metadata is evolving organically – the less complex metadata elements form the building blocks for creating the more complex structures.
Practical Applications
• Portal navigation and browsable website menus
• Conceptual access to large databases
• Records management and cataloging
• e-Commerce online product catalogues
• Inventory control and de-duplication
• Auto-classification of internal documents and email
• Multilingual search and browse
• Metasearch of enterprise-wide resources
Centralized Taxonomy and Metadata Management
As a centralized repository for multi-lingual semantic management that is:
- Independent from systems like web-portal search and categorization systems
- Scalable; capable of evolving with emerging corporate semantic standards
[Diagram labels]
Centralized Taxonomy Management System: Synaptica®
Users: multiple users working in collaborative and compartmentalized space (Permissions)
Connected systems: Categorizers, Search Engines, Content Portals
Formats and interfaces: HTML, CSV, XML, ZThes, SKOS, OWL, Web Services
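To make one of the listed exchange formats concrete, here is a minimal sketch of a single concept written out as SKOS in Turtle syntax. The SKOS properties used (`skos:prefLabel`, `skos:altLabel`, `skos:broader`, `skos:related`) are standard; the `example.org` namespace and the `to_skos_turtle` helper are hypothetical and do not represent how Synaptica produces its exports.

```python
# Sketch only: the namespace URI and the to_skos_turtle helper are hypothetical.
BASE = "http://example.org/taxonomy/"


def slug(label):
    """Turn a label into a local name, e.g. 'Mobile Phones' -> 'mobile-phones'."""
    return "-".join(label.lower().split())


def to_skos_turtle(pref, alt=(), broader=(), related=()):
    """Serialise one concept as SKOS in Turtle syntax.

    Labels are assumed to contain no double quotes; a real exporter
    would need to escape them.
    """
    lines = [f"ex:{slug(pref)} a skos:Concept", f'    skos:prefLabel "{pref}"@en']
    lines += [f'    skos:altLabel "{a}"@en' for a in alt]
    lines += [f"    skos:broader ex:{slug(b)}" for b in broader]
    lines += [f"    skos:related ex:{slug(r)}" for r in related]
    header = (
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
        f"@prefix ex: <{BASE}> .\n\n"
    )
    return header + " ;\n".join(lines) + " .\n"


if __name__ == "__main__":
    print(to_skos_turtle(
        "Mobile Phones",
        alt=["Cell Phones", "Hand Phones"],
        broader=["Telephones"],
        related=["Hand Held Devices"],
    ))
```

Because the vocabulary is held once and serialised on demand, the same term can be delivered to a categorizer, a search engine and a portal without being re-keyed in each system.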
Why Centralized?
Metadata can transcend information islands and data silos, but only if the enterprise is committed to common standards
A centralized system that supports both collaboration and compartmentalization allows common metadata to be shared, while also allowing user communities the independence to manage specialized metadata files
Why Independent?
Enterprises are increasingly making use of multiple proprietary and open source software tools for categorization, search and portal tasks
While many of these tools support some level of metadata management, the diversity of standards, data formats and business rules they support can actually exacerbate the data silo problem by creating metadata silos
Where taxonomy fits with Search
[Diagram labels]
Content sources: DMS, CMS, Shared Docs, News & Research, Data
Search Engine
Taxonomy & Metadata Platform
Information Processing, Management and Storage
4 Good Reasons for Taxonomy
Search Relevancy
Search Completeness
Search Federation
Search Visualisation
Effective Research/Risk Mitigation
Knowledge Worker Productivity
Discovery & Innovation
Better & Faster Decisions
1. Improved Search Relevancy
Ambiguity of Language
Is a Blackberry a fruit or a handheld device?
By including this brand name in a taxonomy we can give context to the user search query
In a telecoms domain we can assume that the user means the latter and only return content tagged as such
Alternatively we can weight the results, promoting those documents about handheld devices above those that refer to the fruit
Either way the result is increased search precision, which translates into time savings
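The two options above (filter by the domain sense, or keep everything and promote the domain sense) can be sketched as follows. The documents, tag names and boost value are invented for illustration, and the `search` function is a toy, not a real search engine API.

```python
# Hypothetical documents, tags and boost value; not a real search engine API.
DOCS = [
    {"id": 1, "text": "Blackberry jam recipe", "tags": {"Fruit"}},
    {"id": 2, "text": "Blackberry handset review", "tags": {"Handheld Devices"}},
    {"id": 3, "text": "Blackberry email outage", "tags": {"Handheld Devices"}},
]


def search(query, domain_tag, mode="filter", boost=2.0):
    """Return ids of docs matching `query`, using `domain_tag` for context."""
    hits = [d for d in DOCS if query.lower() in d["text"].lower()]
    if mode == "filter":
        # Option 1: only return content tagged with the domain sense.
        return [d["id"] for d in hits if domain_tag in d["tags"]]
    # Option 2: keep every hit but promote the domain sense.
    scored = [(boost if domain_tag in d["tags"] else 1.0, d["id"]) for d in hits]
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


if __name__ == "__main__":
    print(search("blackberry", "Handheld Devices", mode="filter"))
    print(search("blackberry", "Handheld Devices", mode="weight"))
```

Filtering gives the highest precision but hides the other sense entirely; weighting keeps the fruit documents reachable at the bottom of the list.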
2. Improved Search Completeness
Synonymous and Related Term Relationships
Mobile Phone (PT) = Cell Phone (NPT) = Hand Phone (NPT)
Mobile Phone is related to Hand Held Device (RT)
User Search Query = “Cell Phones”
The taxonomy simultaneously broadens the search and prioritises the returned results, giving increased recall without compromising relevancy
Content tagged with the Mobile Phone category is promoted over untagged content using a weighting in the search algorithm
Content tagged with the Hand Held Device category may also receive a weighting
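A minimal sketch of the expansion and weighting just described, assuming the documents passed in have already matched the query text. The weights, the crude plural-stripping rule and the function names are assumptions chosen for illustration.

```python
# Hypothetical thesaurus entries and weights, for illustration only.
PREFERRED = {
    "cell phone": "Mobile Phone",   # NPT -> PT
    "hand phone": "Mobile Phone",   # NPT -> PT
    "mobile phone": "Mobile Phone",
}
RELATED = {"Mobile Phone": ["Hand Held Device"]}  # PT -> RTs
WEIGHTS = {"preferred": 2.0, "related": 1.5, "untagged": 1.0}


def normalise(query):
    """Lower-case and strip a trailing plural 's' (a deliberately crude rule)."""
    q = query.strip().lower()
    return q[:-1] if q.endswith("s") else q


def expand(query):
    """Map a user query to its preferred term (PT) and related terms (RT)."""
    pt = PREFERRED.get(normalise(query))
    return pt, RELATED.get(pt, [])


def rank(query, docs):
    """Order docs (dicts with 'id' and 'tags') by taxonomy-based weight."""
    pt, rts = expand(query)

    def weight(doc):
        if pt in doc["tags"]:
            return WEIGHTS["preferred"]
        if any(rt in doc["tags"] for rt in rts):
            return WEIGHTS["related"]
        return WEIGHTS["untagged"]

    return [d["id"] for d in sorted(docs, key=lambda d: (-weight(d), d["id"]))]


if __name__ == "__main__":
    docs = [
        {"id": 1, "tags": set()},
        {"id": 2, "tags": {"Hand Held Device"}},
        {"id": 3, "tags": {"Mobile Phone"}},
    ]
    print(expand("Cell Phones"))
    print(rank("Cell Phones", docs))
```

Nothing is dropped from the result set, so recall is preserved; the taxonomy only changes the order in which results are presented.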
3. Search federation and data integration
A snapshot or dashboard is often more desirable than a list of document titles or snippets, especially when looking for information on a customer, supplier or competitor
Also, information will most likely reside in a number of internal repositories, each with its own level of metadata structure
Taxonomy allows the combination of news, internal CI reports, price plans, coverage data, market share data, share price etc. in one consolidated view by providing mappings or cross-walks
This is essentially applying business intelligence discipline to the world of unstructured information
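One way to picture a mapping or cross-walk is as a lookup from each repository's local label to a shared taxonomy term. The repository names, local labels and records below are invented for this sketch.

```python
# Hypothetical repositories, local labels and records; illustration only.
CROSSWALK = {
    ("news", "Wireless carriers"): "Telecoms Service Providers",
    ("ci_reports", "Mobile operators"): "Telecoms Service Providers",
    ("pricing", "Handsets"): "Mobile Phones",
}

RECORDS = [
    {"repo": "news", "label": "Wireless carriers", "title": "Carrier merger announced"},
    {"repo": "ci_reports", "label": "Mobile operators", "title": "Q3 competitor review"},
    {"repo": "pricing", "label": "Handsets", "title": "Handset price plan"},
]


def consolidated_view(records, crosswalk):
    """Group records from all repositories under the shared taxonomy term."""
    view = {}
    for rec in records:
        # Records whose local label has no mapping are kept under "Unmapped".
        term = crosswalk.get((rec["repo"], rec["label"]), "Unmapped")
        view.setdefault(term, []).append((rec["repo"], rec["title"]))
    return view


if __name__ == "__main__":
    for term, items in consolidated_view(RECORDS, CROSSWALK).items():
        print(term, items)
```

Each repository keeps its own local vocabulary; only the cross-walk has to be maintained centrally.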
4. Search Visualisation
The previous three scenarios assume the user knows what they are looking for
But what about serendipitous discovery?
By being able to see across an aggregation of content and extract facts and relationships from deep within the information stores, true (and sometimes fortunate) discovery can take place
[Diagram labels]
Back End (Information Structure): Document, Content & Records Management; Filing & Storage; Metadata Tagging (Categorisation) Process
Synaptica® Vocabulary & Metadata Management: Taxonomies; Thesauri; Ontologies
Front End (Information Intelligence): Search Engine; Visualisation; Navigation; Intranet / Portal User Interface
People:
• Librarians; Taxonomists; Indexers; Knowledge & Information Managers
• Information Creators; Records Managers; Content Managers; Librarians; Indexers
• Information Users (the business; the public)
• CIOs; CTOs; IT Architects
Paula R. McCoy, Manager, Taxonomy Development
Centralized Taxonomy Management for Enterprise Information Systems
Topics of Discussion
• Description of ProQuest Controlled Vocabulary & Authority Files
• Taxonomy Management: Overview
• Managing Terms Manually
• Synaptica Thesaurus Management System
Access to over 125 billion digital pages of content from magazine, trade, & scholarly publications, current &
historical newspapers, original materials such as annual reports & civil war pamphlets, and daily wire feeds
Subscription-based ProQuest® online information service available in academic and public libraries
ProQuest Controlled Vocabulary used to index subjects; Authority Files used to index company, geographic, personal, product names
CV applied to non-periodical & third-party content via mapping, to allow cross-searching of multiple DBs with one vocabulary
ProQuest Controlled Vocabulary
Created in 1970s for ABI/INFORM business database
Based on Library of Congress Subject Headings
Natural language, hierarchical vocabulary complying with ANSI/NISO Standard Z39.19 (Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies)
Thesaurus subjects:
• Business, economics & trade – 4300 terms
• Science, math & technology – 1600 terms
• Medicine – 1150 terms
• Humanities – 960 terms
• Government & policy – 850 terms
• Education – 400 terms
Merged with general reference vocabulary in 1980s
Major development effort in past 4 years to boost science, education & medical terms
ProQuest CV: Statistics
Preferred terms: 11,046
Non-preferred terms: 5631
Scope Notes: 3194 (29%)
Cross-references (Broader, Narrower, Related terms): 67,700
Terms added in 2007: 77
Terms added in 2008: 58+
Authority Files: Statistics
Corporate/Organization Names: 438,098 (names added in 2008: 5489)
Personal Names: 416,239 (names added in 2008: 1526)
Geographic (Location) Names: 34,331 (names added in 2008: 144)
Product Names: 38,210 (names added in 2008: 54)
The Taxonomy Manager’s Job
• Add subject terms as dictated by new concepts and new content to index
• Maintain hierarchies & Scope Notes
• Load updated Thesaurus to ProQuest interface
• Manage authority files to maintain standards & control file size
OBJECTIVE: To ensure that indexers and searchers alike have access to a complete and accurate Thesaurus that they can use to maximize the discoverability of documents in ProQuest
Sample Subject Term
Chronic obstructive pulmonary disease
SN: Any lung disease, such as chronic bronchitis or emphysema, causing obstruction of bronchial airflow
UF COPD
BT Disease
BT Respiratory diseases
NT Asthma
NT Bronchitis
NT Emphysema
RT Airway management
RT Lungs

Key to the record:
• First line: preferred, or main term
• SN: scope note defining term and how it is used
• UF: non-preferred term; points to term used to index
• BT: terms broader in nature to main term (COPD is a disease, and specifically, a respiratory disease)
• NT: terms narrower in nature to main term (these are chronic lung diseases)
• RT: terms related to main term that might be used to narrow the search
Managing Terms Manually
• New scientific content requiring a huge enhancement to vocabulary
• Seven MS Word vocabulary documents (English and foreign language: French, German, Spanish) printed for internal use only
• Six authority files & 3 vocabulary files in Oracle databases, requiring duplicate entry of subject terms in Word and Oracle
• Legacy editorial system in process of being replaced
Thesaurus Management Systems: Buying Criteria
Thesaurus Management System: Requirements
Eliminate double entry
Improve editorial interface with vocabulary
Automate entry of reciprocal relationships
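The third requirement, automated reciprocal relationships, means that entering a link once also records its inverse. A minimal sketch using the COPD sample term follows; the dictionary layout and the `add_relationship` helper are hypothetical and are not Synaptica's data model.

```python
# Sketch only: add_relationship and the dict layout are hypothetical,
# not Synaptica's data model.

# Each relationship type and its reciprocal.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "UF": "USE", "USE": "UF"}


def add_relationship(thesaurus, term, rel, target):
    """Record `term rel target` and its reciprocal in one step."""
    inverse = RECIPROCAL[rel]
    thesaurus.setdefault(term, {}).setdefault(rel, set()).add(target)
    thesaurus.setdefault(target, {}).setdefault(inverse, set()).add(term)
    return thesaurus


if __name__ == "__main__":
    t = {}
    copd = "Chronic obstructive pulmonary disease"
    add_relationship(t, copd, "BT", "Respiratory diseases")
    add_relationship(t, copd, "NT", "Emphysema")
    add_relationship(t, copd, "RT", "Lungs")
    add_relationship(t, copd, "UF", "COPD")
    print(t["Respiratory diseases"])  # now lists COPD as a narrower term
    print(t["COPD"])                  # now points to the preferred term
```

With manual maintenance each of these four entries would need a second, matching edit on the other term; here the editor makes one entry and the inverse cannot be forgotten.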
Life With Synaptica
Word – Old, Bad
Synaptica – New, Good
Adding Terms Today: 3 Easy Steps
1. Enter term and relationships into Synaptica “Item Details” window
2. Export report of new terms into Word
3. Send Word document to editors
Improving Thesaurus Management: Categories Feature
Subject Term Categories
CORP Names – Categories & Website
Foreign-Language Vocabularies: Language Equivalents
Foreign-Language Vocabularies: Spanish, German, French (alphabetical by language)
Synaptica Updates
Synaptica version 6.0 released in early 2006
Synaptica version 7.0 is being implemented now:
• Enhanced user interface
• Semantic Web standardization (RDF, OWL, SKOS) and Web Services integration
• Expanded Reporting functionality
• Enhanced adding and editing of term relationships including “rapid-fire” simple drag-and-drop editing
• Improved global term editing
• Online help and user guides
Benefits of Synaptica
• Greater awareness of thesaurus standards and terminology, e.g. “preferred” and “non-preferred” instead of Use and Used For
• Long-needed updating and improvement in term hierarchies; ability to provide thesaurus statistics
• Increase in Company name NPTs: from 1935 to 8952 today
• Immediate responsiveness to indexer needs: real-time term additions, esp. NPTs and SNs
• Easier loading of updated Thesaurus on PQ interface
thank you!