The Indexer’s Legacy: Promoting Access to a Million Books
Michael Huggett, Edie Rasmussen
ICDL 2010
Overview
• Problem statement
• Background to study
  – Indexers and Indexes
  – From Print to Digital Book Collections
  – Searching Digital Collections
• Research Project
  – Pilot Study
  – Phase I: Building the Collection
  – Phase II: Deconstructing the Indexes
  – Phase III: Building a meta-index
  – Phase IV: Index-augmented search
Digital Book Projects
• Project Gutenberg (1971+)
• Million Book Project (2002+)
  – Universal Digital Library
• Google Books Library Project (2004+)
• Open Content Alliance (2005+)
  – Universal Digital Library, Internet Archive
• And many others...
Searching Digital Collections
• Combination of ‘dirty OCR’ of text plus page image
• Standard IR techniques: a query leads to relevance-ranked output
• Text-level vs. passage-level retrieval (e.g. INEX Book Track)
• Adequate for many purposes
• Problems with heterogeneity of text, ambiguity of terms
Problem Statement I
The “million books problem”
  – “…the human life contains only about 30,000 days; reading a book a day we would finish a million books only after 30 lifetimes of reading… No longer a distant probability, a digital representation of [the vast written record of humanity] is taking shape before us…”
  – “digitization does provide scale (or quantity) but does so at the price of rich, largely manual encoding” (Many More Than A Million, 2007)
Problem Statement II
• Role of indexes: the index is one of the oldest known information retrieval devices, representing a network of interrelationships among concepts in a text
• Intellectual effort: an index represents hours of interpretation and analysis
• Intellectual content: includes information about a book’s content but also incorporates the structure of knowledge in a given field
• Standard information retrieval techniques reduce index terms (and all text terms) to a ‘bag of words’ model
Research Goal
As we move from print to digital collections of scholarly works, how can we retain, extract and use the knowledge that is embedded in the indexes?
The goal of this research is to develop techniques that will help to capture, visualize and access the world’s digital knowledge through application of text processing techniques to digital indexes of legacy materials
The Indexing Process
• Read → identify indexable concepts (mark) → create vocabulary → invert? → sort and format (software) → add cross-references → edit for consistency
• Reduces contents of a book to its essentials (5 – 10%)
• Vocabulary is author’s plus indexer’s
• Goal is to facilitate access to material in the text
Knowledge in Indexes
• Premises:
  – The index identifies the most significant topics in the book
  – The index expresses the topics in the author’s vocabulary and in the vocabulary of the field (i.e. that of the reader)
  – The index provides links between concepts, showing how they are related
  – As indexes on a topic are aggregated, significant concepts related to that topic, and the relationships between them are reinforced, creating both a vocabulary and a guide to the collection
Challenges
• Not all books are indexed
• Indexing conventions have changed over time
• Books in public domain are older; quality of index may be lower
• Quality of OCR, errors in text
• No markup; index structure is indicated visually (e.g. indents, punctuation)
• Matching page numbers in index to physical pages in text
Related Research
• ‘key ideas’ (Schilit and Kolak, Google Research, 2008)
  – Mining and linking ideas in digital books
  – Quotation extraction (quote plus context)
• ‘Searching in a book’ (Liesaputra, Witten and Bainbridge, NZDL, 2009)
• E-book usability with indexes (Noorhidawata, 2007)
• Reorganizing indexes (Chi et al., 2004)
  – Creating mini-indexes ‘on the fly’
Pilot Study I
• Work on a small number of digital items
  – 3 biographies of Charles Darwin
  – 12 books on BC history from UBC University Press
• Software to parse indexes
  – From PDF to index structures
• Operator driven: scan and correct OCR errors; key indicators in database
• Parse index terms and entries by shared references
• Identify common words on shared page references
Pilot Study II
• Preliminary results:
  – Measure of coherence
    • Rank terms by frequency and normalize
    • Deviation = ∑(average rank – term rank)
    • Calculated for content, index entries, index words
    • Calculated for all terms, and for shared terms only
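The slide gives the coherence measure only in outline, so the following Python sketch is one possible reading and not the study's actual computation. It assumes that ranks are scaled to [0, 1] within each book, that a term's deviation is the absolute gap between its rank in one book and its average rank across the books that contain it, and that the score is the mean of those gaps. The function names and the toy data in the usage lines are hypothetical.

```python
from collections import Counter

def normalized_ranks(tokens):
    """Rank terms by descending frequency and scale the ranks to [0, 1]."""
    counts = Counter(tokens)
    # Sort by frequency (high first), then alphabetically for a stable order.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    denom = max(len(ordered) - 1, 1)
    return {term: i / denom for i, term in enumerate(ordered)}

def rank_deviation(books, shared_only=False):
    """Mean absolute gap between a term's rank in a book and its average rank.

    Assumption: absolute values are used, because signed gaps around a mean
    would cancel to zero.
    """
    ranks = [normalized_ranks(tokens) for tokens in books]
    vocab = set().union(*ranks)
    if shared_only:
        # Keep only terms that occur in every book.
        vocab = {t for t in vocab if all(t in r for r in ranks)}
    total, n = 0.0, 0
    for term in vocab:
        values = [r[term] for r in ranks if term in r]
        avg = sum(values) / len(values)
        total += sum(abs(avg - v) for v in values)
        n += len(values)
    return total / n if n else 0.0

# Hypothetical toy "books", each a list of already-tokenized terms.
books = [["darwin", "darwin", "beagle"], ["darwin", "beagle", "beagle"]]
print(rank_deviation(books))
print(rank_deviation(books, shared_only=True))
```

Under this reading, a lower value means the books agree more closely on which terms matter most.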
Pilot Study III
• Preliminary results:
  – Index terms show more coherence than corpus terms
  – Suggests that back-of-book (BoB) indexes are a good source of corpus-level keywords

                Corpus    Index Entries    Index Words
All terms       0.5361    0.3163           0.1913
Shared terms    0.3792    0.0129           0.0278
Phase I: Building a Test Collection
• Needed:
  – General collection
    • Collection of 1000 books
    • With indexes!
    • In the public domain
  – Topic-oriented collections (5-6?)
    • Collections of 100(?) books in a topic
• GRAs to identify and download target books
• Result: a test collection for this project (and others)
Phase II: Deconstructing the Indexes
• No BoB indexing standards
• No controlled vocabulary
• A few indexing conventions
  – Headings, subheadings, sub-subheadings...
  – Structure is indicated by spacing and punctuation
• Need to parse the index to identify entries and page references
• Parsing software written and tested
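As a rough illustration of parsing an index whose structure is carried only by spacing and punctuation, the Python sketch below handles a simple indented-style index. It is not the project's parsing software. It assumes four spaces per heading level and page references that are plain numbers or ranges at the end of each line; `parse_index`, `LOCATORS`, and the sample text are hypothetical.

```python
import re

# Trailing page references such as ", 12" or ", 45-48" at the end of a line.
LOCATORS = re.compile(r"(?:,\s*\d+(?:\s*[-–]\s*\d+)?)+\s*$")

def parse_index(text, indent_width=4):
    """Split indented index text into level, heading, and page references."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        stripped = line.lstrip(" ")
        # Heading depth is inferred from leading spaces (an assumption).
        level = (len(line) - len(stripped)) // indent_width
        match = LOCATORS.search(stripped)
        if match:
            heading = stripped[:match.start()].strip()
            pages = [p.strip() for p in match.group().split(",") if p.strip()]
        else:
            heading, pages = stripped.strip(), []
        entries.append({"level": level, "heading": heading, "pages": pages})
    return entries

sample = (
    "Darwin, Charles, 12, 45-48\n"
    "    education, 14\n"
    "    voyage of the Beagle, 20-31\n"
    "Galapagos Islands, 25, 27\n"
)
for entry in parse_index(sample):
    print(entry)
```

This sketch ignores run-on indexes, cross-references ("see also"), and headings that end in a number such as a year, all of which real indexes contain and a production parser would need to handle.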
Phase II: Research Questions
• How can index structure (run-on or indented, heading hierarchies) be extracted?
• Can keyphrases be extracted (proper nouns, concepts)?
• What are the syntax and semantics of indexes?
• Can we identify the historical development of indexes? How have they changed over time?
• Can we use XML to create a useful intermediate product?
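To make the XML question concrete, here is a minimal Python sketch of one possible intermediate format. The element names (`index`, `entry`, `heading`, `locator`) and the input records are assumptions for illustration; the slides do not specify a schema.

```python
import xml.etree.ElementTree as ET

def entries_to_xml(entries):
    """Nest flat (level, heading, pages) records into an XML tree."""
    root = ET.Element("index")
    stack = [root]  # stack[i] is the current parent for an entry at level i
    for level, heading, pages in entries:
        # Clamp so a malformed jump in depth still attaches somewhere.
        level = min(level, len(stack) - 1)
        del stack[level + 1:]
        entry = ET.SubElement(stack[level], "entry")
        ET.SubElement(entry, "heading").text = heading
        for page in pages:
            ET.SubElement(entry, "locator").text = page
        stack.append(entry)
    return root

# Hypothetical parsed index records: (level, heading, page references).
records = [
    (0, "Darwin, Charles", ["12", "45-48"]),
    (1, "education", ["14"]),
    (1, "voyage of the Beagle", ["20-31"]),
    (0, "Galapagos Islands", ["25", "27"]),
]
print(ET.tostring(entries_to_xml(records), encoding="unicode"))
```

Keeping subheadings as nested `entry` elements preserves the heading hierarchy explicitly, which the visual layout of a printed index only implies.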
Phase III: Building a Meta-Index
• Meta-index: a collection-level aggregation of the BoB indexes in a digital collection
• Merging/concatenating index entries
• May be a standard index format (alphabetical, hierarchical entries), i.e., a digital browsable index
• Or may use new formats, e.g. visualizations, topic maps
[Figure: diagram with the labels BOOK, INDEX, META-INDEX 1, DIGITAL COLLECTION, and META-INDEX 2]
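The merging step for a conventional, alphabetical meta-index can be sketched as follows. This is a simplified Python illustration, not the project's method: it assumes each book's index is already available as a heading-to-pages mapping and that headings can be matched after lowercasing and whitespace cleanup. The function name and book identifiers are hypothetical.

```python
from collections import defaultdict

def build_meta_index(book_indexes):
    """Merge per-book {heading: [pages]} maps into heading -> {book: pages}."""
    meta = defaultdict(dict)
    for book_id, index in book_indexes.items():
        for heading, pages in index.items():
            # Assumption: headings match after case folding and space cleanup.
            key = " ".join(heading.lower().split())
            meta[key].setdefault(book_id, []).extend(pages)
    # Alphabetical order gives a browsable, conventional index layout.
    return dict(sorted(meta.items()))

# Hypothetical per-book indexes.
book_indexes = {
    "book_a": {"Beagle, HMS": ["20-31"], "Finches": ["25"]},
    "book_b": {"beagle,  HMS": ["7"], "Wallace, Alfred": ["90"]},
}
for heading, locations in build_meta_index(book_indexes).items():
    print(heading, locations)
```

Real headings vary far more than this normalization covers (inverted names, synonyms, spelling variants), so matching entries across books is likely to be the hard part of the aggregation.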
Phase III: Research Questions
• Can digital versions of BoB indexes be used to facilitate access to digital collections?
• What form should these indexes take?
  – Conventional index format (alphabetical/searchable with headings and subheadings)
  – Index visualization
• How do these meta-indexes compare to a standard search engine when searching a digital collection?
• Evaluation: task-oriented evaluation with human subjects (e.g. Humanities scholars)
Phase IV: Index Augmented Search
• Using the index information in new ways
  – Building an ontology in domain areas
  – Identifying concept relationships between index vocabulary and term vocabulary
  – Use for
    • Query expansion
    • Question answering
    • Summarization
    • Categorization
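As one hedged example of the query expansion use, the Python sketch below expands a query term with index headings that share page references with it. The scoring rule (count of shared pages, summed over books) is an assumption chosen for simplicity, and the function names and sample indexes are hypothetical.

```python
from collections import Counter

def related_headings(book_indexes, query, top_n=3):
    """Rank headings that share page references with `query` in any book."""
    scores = Counter()
    for index in book_indexes.values():
        query_pages = set(index.get(query, []))
        for heading, pages in index.items():
            if heading == query:
                continue
            shared = query_pages & set(pages)
            if shared:
                scores[heading] += len(shared)
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [heading for heading, _ in ranked[:top_n]]

def expand_query(book_indexes, query, top_n=3):
    """Return the original query term followed by its related headings."""
    return [query] + related_headings(book_indexes, query, top_n)

# Hypothetical indexes: heading -> list of page numbers.
book_indexes = {
    "b1": {"darwin": [1, 2, 3], "beagle": [2, 3], "finches": [3]},
    "b2": {"darwin": [10], "beagle": [10], "wallace": [11]},
}
print(expand_query(book_indexes, "darwin"))
```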
Phase IV: Research Questions
• Based on standard text processing procedures, i.e. stemming, use of stopwords, keyphrase extraction, term weighting such as tf*idf or BM25
• How strong is the relationship between the index entry and the words on the page(s) referred to?
• Assume that for a single entry, this relationship is weak; over multiple similar entries in many books, do real relationships emerge and false ones disappear?
• Evaluation: using external collections, e.g. TREC or INEX, to measure contribution of index term relationships to retrieval performance
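The question of whether real relationships emerge when many entries are aggregated can be illustrated with a small Python sketch. It weights the words on each page with a basic tf*idf computed per book, then sums the weights over every page that a given heading points to, across books. The stopword list, the whitespace tokenizer, the omission of stemming, and all names and sample data are simplifying assumptions.

```python
import math
from collections import Counter, defaultdict

STOPWORDS = {"the", "of", "and", "a", "in", "to", "on"}  # toy list

def tfidf_by_page(pages):
    """pages: {page_no: text}. Returns {page_no: {word: tf*idf}}."""
    tokens = {p: [w for w in text.lower().split() if w not in STOPWORDS]
              for p, text in pages.items()}
    # Document frequency: number of pages in this book containing each word.
    df = Counter(w for words in tokens.values() for w in set(words))
    n = len(pages)
    return {p: {w: c * math.log(n / df[w]) for w, c in Counter(words).items()}
            for p, words in tokens.items()}

def entry_associations(books, heading):
    """Sum tf*idf of page words over all pages the heading refers to."""
    totals = defaultdict(float)
    for book in books:
        weights = tfidf_by_page(book["pages"])
        for page in book["index"].get(heading, []):
            for word, weight in weights.get(page, {}).items():
                totals[word] += weight
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical books with page text and a heading -> pages index.
books = [
    {"pages": {1: "finches on the islands", 2: "finches beaks vary",
               3: "geology of the coast"},
     "index": {"finches": [1, 2]}},
    {"pages": {1: "finches and beaks", 2: "letters to family"},
     "index": {"finches": [1]}},
]
for word, score in entry_associations(books, "finches"):
    print(word, round(score, 3))
```

In this toy data a word that recurs on the referenced pages of both books accumulates the highest total, which is the kind of reinforcement the slide asks about; whether that holds at collection scale is the open research question.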
Further Research
• Building themed or personalized collections (using index for book similarity measures)
• Ability to mine large multidisciplinary collections for references (historical, economic, etc.)
• Ability to mine collections and build special-format indexes and browsers (e.g. images, figures)
• Changes in topics over time, evolution of thinking on a subject
• Knowledge discovery: detecting previously undiscovered links between topics
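For the first item above, one simple way to turn indexes into a book similarity measure is Jaccard overlap between the sets of headings. The slides do not name a measure, so this Python sketch is an assumption, and the function names and sample headings are hypothetical.

```python
def jaccard(headings_a, headings_b):
    """Share of index headings two books have in common (0.0 to 1.0)."""
    a, b = set(headings_a), set(headings_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(target, library):
    """Order other books by index overlap with the target book."""
    scores = {book_id: jaccard(library[target], headings)
              for book_id, headings in library.items() if book_id != target}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical books, each represented by its index headings.
library = {
    "bio_1": ["darwin", "beagle", "finches"],
    "bio_2": ["darwin", "finches", "wallace"],
    "bc_history": ["fraser river", "gold rush"],
}
print(most_similar("bio_1", library))
```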
The Indexer’s Legacy…
(a) an archaic addendum to an obsolete medium?
OR
(b) value-added knowledge in electronic text that enhances access to digital collections?
Thank you!