Using a Controlled Vocabulary for Managing a Digital Library Platform
Sean Boisen, Logos Bible Software
SemTech 2010
Slides: http://semanticbible.com/other/talks/2010/semtech/lcv.html
Outline
• Introduce the Logos digital library
• Logos Controlled Vocabulary (LCV)
– What it is
– How do we use it
– What’s interesting about it
• Next steps
Who Am I?
• 19 years with BBN Technologies
– Information extraction, human language technology
– Scientist, technology manager
• 3+ years with Logos Bible Software
– Senior Information Architect
– Manager of Design & Editorial Dept.
– Academic Products Manager
The Importance of the Bible
• The most widely distributed book
– ~83M per year worldwide
• The most widely translated work
– > 2000 languages
– 50 languages at www.biblegateway.com
• Spans 1000s of years of ancient history
Logos Bible Software
• High-end desktop digital library
– > 10k titles
– > 100k users in 180 countries
– Extensive cross-indexing and hyperlinking
– Resources in a dozen languages
– Windows/Mac/iPhone/mobile
• Leading publisher and developer of digital resources for Bible study
• http://logos.com
Network Effects
• Rich markup and original content
• Information integration
Added Value Strategy
• Domain-specific focus
• Task-oriented guides that automate research
• Integrated tools and content
• Unique digital assets that integrate information and provide answers
Controlled Vocabularies
• Organized system for labeling content
– Using English terms
• Consistent representation of content
• More effective search
Logos Controlled Vocabulary (LCV)
• Domain-specific (Biblical studies)
• Semantic organization of reference book content, not just terms
• Mitigates problems of ambiguity, homographs, synonyms, spelling variation
LCV Value Proposition
• Recognizes key terms in the knowledge domain
• Provides alternate search terms and query expansion
• Supports user-created content and reading lists
• Integrates reference content
• Provides semantic “glue” for the library
Example: Ambiguity
Example: Homographs
Example: Variation
Scope
TimBL's rules for Linked Data:
• Use URIs to identify things (= Identity)
– Use HTTP URIs so people can look things up
• Provide useful information in a standard format when someone references a URI (= Utility)
• Include links to other URIs (= Relationships)
LCV as Linked Data: Prisca
Id: Prisca_Person
Label: “Prisca”
Type: Person
Name: True
PrefLabel: “Prisca”
Extra-biblical: False
AltLabel: “Priscilla”
Entities: agent:Prisca.1
Articles: Anchor.PRISCAPERSON, Tyndale.L4559, …
Topics: http://topics.logos.com/Prisca
Wikipedia: Priscilla and Aquila
Identity
Utility
Relationships
LCV as Linked Data: Deceit
Id: deceit
Label: “Deceit”
Type:
Name: False
PrefLabel: “Deceit”
Extra-biblical: False
AltLabel: “Deception”, “Deceitful”, “Deceive”
Articles: ISBE.DECEIT, NBD.R494, …
Topics: http://topics.logos.com/deceit
Identity
Utility
Relationships
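The two records above can be modeled as a small data structure. This is a minimal sketch, not the Logos implementation: the class name `LcvConcept`, its field names, and the grouping of fields under Identity, Utility, and Relationships are assumptions made for illustration. Only the values come from the slides, and the article lists are truncated there, so only the entries shown are included.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LcvConcept:
    """Hypothetical container for one LCV record as shown on the slides."""
    # Identity: a stable identifier for the concept.
    id: str
    # Utility: descriptive information about the concept.
    label: str
    pref_label: str
    is_name: bool
    extra_biblical: bool
    term_type: Optional[str] = None  # None when the slide shows no type
    alt_labels: List[str] = field(default_factory=list)
    # Relationships: links out to entities, articles, and topic pages.
    entities: List[str] = field(default_factory=list)
    articles: List[str] = field(default_factory=list)
    topic_url: Optional[str] = None

    def all_labels(self) -> List[str]:
        """Preferred label first, then the alternates."""
        return [self.pref_label] + self.alt_labels


prisca = LcvConcept(
    id="Prisca_Person", label="Prisca", pref_label="Prisca",
    is_name=True, extra_biblical=False, term_type="Person",
    alt_labels=["Priscilla"],
    entities=["agent:Prisca.1"],
    articles=["Anchor.PRISCAPERSON", "Tyndale.L4559"],  # list continues on the slide
    topic_url="http://topics.logos.com/Prisca",
)

deceit = LcvConcept(
    id="deceit", label="Deceit", pref_label="Deceit",
    is_name=False, extra_biblical=False,
    alt_labels=["Deception", "Deceitful", "Deceive"],
    articles=["ISBE.DECEIT", "NBD.R494"],  # list continues on the slide
    topic_url="http://topics.logos.com/deceit",
)

print(prisca.all_labels())
print(deceit.all_labels())
```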
Example Semantics
lcvinst:Aaron_Person rdf:type skos:Concept ;
    skos:prefLabel "Aaron"@en ;
    lcv:isname "true"^^xsd:boolean ;
    lcv:termType lcv:Person ;
    skos:related lcvinst:aaronsRod ;
    lcv:bkentity bk:Aaron .

res:anch.AARONPERSON rdf:type foaf:Document ;
    dct:subject lcvinst:Aaron_Person .
res:TYNBIBDCT.L1 rdf:type foaf:Document ;
    dct:subject lcvinst:Aaron_Person .
res:isbe.AARON rdf:type foaf:Document ;
    dct:subject lcvinst:Aaron_Person .
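To show how the dct:subject statements tie reference articles to a concept, here is a minimal sketch that holds a subset of the triples above as plain Python tuples and looks up the articles for one concept. The helper name `articles_about` is hypothetical, and a real system would more likely use an RDF store than a list.

```python
# A subset of the triples from the slide, as (subject, predicate, object) tuples.
TRIPLES = [
    ("lcvinst:Aaron_Person", "rdf:type", "skos:Concept"),
    ("lcvinst:Aaron_Person", "skos:prefLabel", "Aaron"),
    ("lcvinst:Aaron_Person", "skos:related", "lcvinst:aaronsRod"),
    ("res:anch.AARONPERSON", "dct:subject", "lcvinst:Aaron_Person"),
    ("res:TYNBIBDCT.L1", "dct:subject", "lcvinst:Aaron_Person"),
    ("res:isbe.AARON", "dct:subject", "lcvinst:Aaron_Person"),
]


def articles_about(concept, triples):
    """Return every resource whose dct:subject is the given concept."""
    return [s for (s, p, o) in triples if p == "dct:subject" and o == concept]


print(articles_about("lcvinst:Aaron_Person", TRIPLES))
```

Because three dictionaries point at the same concept, one lookup returns the parallel articles from all of them.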
Semantic Inter-relationships
[Diagram labels: Person, Concept, Thing, Place, Text; Concrete, Conceptual]
LCV Development
• Developed by merging content from 7 Bible dictionaries
– Extract headwords
– Do automatic alignment (conservative)
– Review manually
• Reduced > 40k concepts down to ~10k
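The slides do not say how the automatic alignment works. One way to read "conservative" is to merge headwords only when they match exactly after light normalization, and send everything else to manual review. The sketch below assumes that reading; the function names, the normalization rules, and the sample dictionaries ("DictA", "DictB") are all invented for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(headword):
    """Fold case, accents, and punctuation so trivial variants compare equal."""
    text = unicodedata.normalize("NFKD", headword)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def align(dictionaries):
    """Group headwords by normalized form.

    Returns (merged, review). merged maps a normalized key to the
    (dictionary, headword) pairs that matched in two or more dictionaries.
    review lists the pairs seen in only one dictionary, left for a person.
    """
    groups = defaultdict(list)
    for name, headwords in dictionaries.items():
        for hw in headwords:
            groups[normalize(hw)].append((name, hw))
    merged, review = {}, []
    for key, members in groups.items():
        sources = {name for name, _ in members}
        if len(sources) >= 2:
            merged[key] = members
        else:
            review.extend(members)
    return merged, review


# Made-up sample input, not data from the actual dictionaries.
sample = {
    "DictA": ["Aaron", "Aaron's Rod", "Prisca"],
    "DictB": ["AARON", "Aaron\u2019s Rod", "Priscilla"],
}
merged, review = align(sample)
print(sorted(merged))
print(review)
```

Exact matching never links synonyms such as Prisca and Priscilla, and it would wrongly merge homographs that share a spelling, which is one reason a manual review step still matters under this approach.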
LCV Development Continues
• Additional resources suggest new concepts:
– Archaeol. Dict. of the Holy Land: 90/547 (16%)
• Mostly very specific locations (%EinSamiya_Place)
– Nelson's Illus. Bible Dictionary: 200/4833 (4%)
– Harper's Bible Dictionary: 81/2962 (3%)
• Adding alternate terms
• Subject areas for further expansion:
– Individuals from church history
– Specialized theological concepts
Use Case: Improved Topic Search
• Link to the same concept regardless of how originally labeled
• Provide consistent semantics for content
• Suggest alternate concepts for the same term
• Provide query expansions for full text search
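A minimal sketch of two of these behaviors, suggesting alternate concepts for a term and expanding a query with every label of the matching concepts. The vocabulary below is a hypothetical miniature: the Prisca and Deceit labels come from the earlier slides, while `Dan_Person` and `Dan_Place` are invented ids that stand in for an ambiguous term. The `OR` query syntax is also an assumption, not the Logos search syntax.

```python
from collections import defaultdict

# Hypothetical miniature vocabulary: concept id -> (preferred label, alternates).
VOCAB = {
    "Prisca_Person": ("Prisca", ["Priscilla"]),
    "deceit": ("Deceit", ["Deception", "Deceitful", "Deceive"]),
    "Dan_Person": ("Dan", []),  # invented id for illustration
    "Dan_Place": ("Dan", []),   # invented id for illustration
}


def build_label_index(vocab):
    """Map each lowercased label to every concept that carries it."""
    index = defaultdict(list)
    for concept_id, (pref, alts) in vocab.items():
        for label in [pref] + alts:
            index[label.lower()].append(concept_id)
    return index


def concepts_for(term, index):
    """All concepts a term might name; more than one means the term is ambiguous."""
    return index.get(term.lower(), [])


def expand_query(term, vocab, index):
    """OR together every label of every concept the term might name."""
    labels = []
    for concept_id in concepts_for(term, index):
        pref, alts = vocab[concept_id]
        for label in [pref] + alts:
            if label not in labels:
                labels.append(label)
    return " OR ".join('"{}"'.format(label) for label in labels)


INDEX = build_label_index(VOCAB)
print(concepts_for("Dan", INDEX))
print(expand_query("Priscilla", VOCAB, INDEX))
```

Searching for an alternate label such as "Priscilla" resolves to the same concept as "Prisca", so content reaches the user regardless of which label the source used.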
Use Case: Information Discovery
• Automatically link
– Reference to concepts
– Concept to related concepts
– Concept to references
Text Mining: Reference to Concepts
• Aggregate reference counts
– Each article votes on most likely references
– Each concept votes on the most likely concepts for a reference
• Reverse index from reference to concepts
• Estimates should improve with more content
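The voting scheme is only outlined on the slide, so the following is a minimal sketch under stated assumptions: each article is reduced to a concept id plus the Bible references it cites, every citation counts as one vote, and a concept's score for a reference is its share of that reference's votes. The function names, the scoring rule, the `Aquila_Person` id, and the sample articles are hypothetical.

```python
from collections import Counter, defaultdict


def reference_counts(articles):
    """Aggregate, per concept, how often its articles cite each reference."""
    counts = defaultdict(Counter)
    for concept_id, references in articles:
        counts[concept_id].update(references)
    return counts


def reverse_index(counts):
    """Map each reference to its concepts, ranked by share of that reference's votes."""
    votes = defaultdict(Counter)
    for concept_id, refs in counts.items():
        for ref, n in refs.items():
            votes[ref][concept_id] += n
    index = {}
    for ref, by_concept in votes.items():
        total = sum(by_concept.values())
        index[ref] = [(c, n / total) for c, n in by_concept.most_common()]
    return index


# Made-up sample: two articles on one concept, one article on another.
articles = [
    ("Prisca_Person", ["Acts 18:2", "Rom 16:3"]),
    ("Prisca_Person", ["Acts 18:2"]),
    ("Aquila_Person", ["Acts 18:2"]),
]
index = reverse_index(reference_counts(articles))
for ref, ranked in index.items():
    print(ref, ranked)
```

With more parallel articles per concept, each reference receives more votes, which is consistent with the slide's point that estimates should improve as content is added.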
Text Mining: Related Concepts
• Extract and aggregate key terms
• Cluster documents
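The slide does not name a specific method. As one possible illustration, the sketch below extracts key terms by simple frequency, aggregates them across all articles for a concept, and compares concepts by Jaccard overlap, a similarity score that a clustering step could then consume. The stopword list, function names, and placeholder texts are all assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "was", "is", "his", "by", "with", "who"}


def key_terms(text, top_n=5):
    """Most frequent non-stopword terms in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {w for w, _ in counts.most_common(top_n)}


def aggregate(articles_by_concept, top_n=5):
    """Pool all articles for a concept, then extract its key terms."""
    return {c: key_terms(" ".join(texts), top_n)
            for c, texts in articles_by_concept.items()}


def jaccard(a, b):
    """Overlap between two term sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Placeholder snippets standing in for dictionary articles.
terms = aggregate({
    "Aaron_Person": ["Aaron priest rod", "Aaron brother Moses priest"],
    "aaronsRod": ["rod Aaron budded"],
})
print(jaccard(terms["Aaron_Person"], terms["aaronsRod"]))
```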
Conclusions
• Controlled vocabulary coupled with parallel content
• Platform for text mining, user contribution
• Future Work
– Continue adding resources
– Additional content extraction
– Add hierarchy (LCSH, WordNet)
– Crowdsourcing
Resources
• A Controlled Vocabulary for Biblical Studies (Boisen). Presentation at BibleTech:2010.
• Domain-Specific Tools to Add Value to E-Books (Pritchett). Presentation at O'Reilly Tools of Change for Publishing Conference 2010.
• Deploying Semantic Technologies for Digital Publishing (Boisen). Presentation at SemTech:2007.