The Indexer’s Legacy: Promoting Access to a Million Books
Michael Huggett, Edie Rasmussen
ICDL 2010


Overview

• Problem statement
• Background to study
  – Indexers and Indexes
  – From Print to Digital Book Collections
  – Searching Digital Collections
• Research Project
  – Pilot Study
  – Phase I: Building the Collection
  – Phase II: Deconstructing the Indexes
  – Phase III: Building a meta-index
  – Phase IV: Index-augmented search

Digital Book Projects

• Project Gutenberg (1971+)
• Million Book Project (2002+)
  – Universal Digital Library
• Google Books Library Project (2004+)
• Open Content Alliance (2005+)
  – Universal Digital Library, Internet Archive
• And many others...

Searching Digital Collections

• Combination of ‘dirty OCR’ of text plus page image

• Standard IR retrieval techniques: query leads to relevance ranked output

• Text level vs. Passage level retrieval (e.g. INEX Book Track)

• Adequate for many purposes

• Problems with heterogeneity of text, ambiguity of terms

Problem Statement I

The “million books problem”
  – “…the human life contains only about 30,000 days; reading a book a day we would finish a million books only after 30 lifetimes of reading…No longer a distant probability, a digital representation of [the vast written record of humanity] is taking shape before us…”
  – “digitization does provide scale (or quantity) but does so at the price of rich, largely manual encoding” (Many More Than A Million, 2007)

Problem Statement II

• Role of indexes: the index is one of the oldest known information retrieval devices, representing a network of interrelationships among concepts in a text

• Intellectual effort: an index represents hours of interpretation and analysis

• Intellectual content: includes information about a book’s content but also incorporates the structure of knowledge in a given field

• Standard information retrieval techniques reduce index terms (and all text terms) to a ‘bag of words’ model
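
To make the last point concrete, here is a minimal Python sketch (not part of the original study) of what a bag-of-words reduction does to an index. The two sample index fragments and the `bag_of_words` helper are invented for illustration: they nest the same words under opposite headings, yet reduce to identical bags.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Reduce text to unordered word counts, discarding layout and order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Two hypothetical index fragments with opposite heading/subheading structure.
index_a = "Darwin, Charles\n  Beagle voyage, 20\n  finches, 27"
index_b = "finches\n  Beagle voyage, 27\n  Darwin, Charles, 20"

# The hierarchies differ, but the word counts are the same.
same_bag = bag_of_words(index_a) == bag_of_words(index_b)
print(same_bag)
```

The heading hierarchy and the page references, which carry the indexer's analysis, are exactly what this representation throws away.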

Research Goal

As we move from print to digital collections of scholarly works, how can we retain, extract and use the knowledge that is embedded in the indexes?

The goal of this research is to develop techniques that will help to capture, visualize and access the world’s digital knowledge through application of text processing techniques to digital indexes of legacy materials

The Indexing Process

• Read → identify indexable concepts (mark) → create vocabulary → invert? → sort and format (software) → add cross references → edit for consistency

• Reduces contents of a book to its essentials (5 – 10%)

• Vocabulary is author’s plus indexer’s

• Goal is to facilitate access to material in the text

Knowledge in Indexes

• Premises:
  – The index identifies the most significant topics in the book
  – The index expresses the topics in the author’s vocabulary and in the vocabulary of the field (i.e. that of the reader)
  – The index provides links between concepts, showing how they are related
  – As indexes on a topic are aggregated, significant concepts related to that topic, and the relationships between them, are reinforced, creating both a vocabulary and a guide to the collection

Challenges

• Not all books are indexed
• Indexing conventions have changed over time
• Books in public domain are older; quality of index may be lower
• Quality of OCR, errors in text
• No markup; index structure is indicated visually (e.g. indents, punctuation)
• Matching page numbers in index to physical pages in text

Related Research

• ‘Key ideas’ (Schilit and Kolak, Google Research, 2008)
  – Mining and linking ideas in digital books
  – Quotation extraction (quote plus context)
• ‘Searching in a book’ (Liesaputra, Witten and Bainbridge, NZDL, 2009)
• E-book usability with indexes (Noorhidawata, 2007)
• Reorganizing indexes (Chi et al., 2004)
  – Creating mini-indexes ‘on the fly’

Pilot Study I

• Work on a small number of digital items
  – 3 biographies of Charles Darwin
  – 12 books on BC history from UBC University Press
• Software to parse indexes
  – From PDF to index structures
• Operator driven: scan and correct OCR errors; key indicators in database
• Parse index terms and entries by shared references
• Identify common words on shared page references

Pilot Study II

• Preliminary results:
  – Measure of coherence
    • Rank terms by frequency and normalize
    • Deviation = ∑(average rank – term rank)
    • Calculated for content, index entries, index words
    • Calculated for all terms, and for shared terms only
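
The slide gives only the outline of the measure, so the following Python sketch is one possible reading rather than the authors' implementation. It assumes ranks are normalized to the range 0 to 1 within each book, that a term's "average rank" is taken over the books containing it, and that the differences are taken as absolute values and averaged (without the absolute value the differences for each term would cancel). The function names are hypothetical.

```python
from collections import Counter

def normalized_ranks(tokens):
    """Map each term to a rank in [0, 1]; 0 is the most frequent term.
    Ties are broken alphabetically (an assumption of this sketch)."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    last = max(len(ordered) - 1, 1)
    return {term: i / last for i, term in enumerate(ordered)}

def deviation(books, shared_only=False):
    """Mean absolute gap between a term's rank in each book and its
    average rank over the books that contain it."""
    ranks = [normalized_ranks(tokens) for tokens in books]
    if not ranks:
        return 0.0
    if shared_only:
        # Keep only terms that occur in every book.
        vocab = set.intersection(*(set(r) for r in ranks))
    else:
        vocab = set().union(*ranks)
    total, n = 0.0, 0
    for term in vocab:
        observed = [r[term] for r in ranks if term in r]
        average = sum(observed) / len(observed)
        total += sum(abs(average - rank) for rank in observed)
        n += len(observed)
    return total / n if n else 0.0
```

Under these assumptions a lower value means that terms are ranked more consistently across books; the same function could be applied to tokens drawn from the body text, from whole index entries, or from individual index words.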

Pilot Study III

• Preliminary results:
  – Index terms show more coherence than corpus terms
  – Suggests that back-of-book (BoB) indexes are a good source of corpus-level keywords

Deviation by term source (lower values correspond to the greater coherence noted above):

               Corpus   Index Entries   Index Words
All terms      0.5361   0.3163          0.1913
Shared terms   0.3792   0.0129          0.0278

Phase I: Building a Test Collection

• Needed:
  – General collection
    • Collection of 1000 books
    • With indexes!
    • In the public domain
  – Topic-oriented collections (5-6?)
    • Collections of 100(?) books in a topic
• GRAs (graduate research assistants) to identify and download target books
• Result: a test collection for this project (and others)

Phase II: Deconstructing the Indexes

• No BoB indexing standards
• No controlled vocabulary
• A few indexing conventions
  – Headings, subheadings, sub-subheadings...
  – Structure is indicated by spacing and punctuation
• Need to parse the index to identify entries and page references
• Parsing software written and tested
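
The project's parsing software is not shown in the slides. As a rough illustration of the task, the sketch below parses a clean, indented-style index in which each level is indented by two spaces and page ranges are written in full (`20-35`, not `20-5`). The `Entry` class, the `parse_index` function and the regular expression are all assumptions made for this example; real OCR output would need far more error handling.

```python
import re
from dataclasses import dataclass

# A heading followed by comma-separated page numbers or ranges at line end.
LOCATORS = re.compile(
    r"^(?P<heading>.*?)[,\s]+"
    r"(?P<pages>\d+(?:[-–]\d+)?(?:,\s*\d+(?:[-–]\d+)?)*)\s*$"
)

@dataclass
class Entry:
    path: tuple   # heading hierarchy, e.g. ("Darwin, Charles", "marriage")
    pages: list   # every page number referenced by the entry

def expand(locators):
    """Turn '1-3, 7' into [1, 2, 3, 7]."""
    pages = []
    for part in locators.split(","):
        bounds = re.split(r"[-–]", part.strip())
        pages.extend(range(int(bounds[0]), int(bounds[-1]) + 1))
    return pages

def parse_index(text, indent_width=2):
    """Read indentation as heading depth and trailing numbers as locators."""
    entries, stack = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        level = (len(line) - len(line.lstrip(" "))) // indent_width
        body = line.strip()
        match = LOCATORS.match(body)
        heading = match.group("heading") if match else body
        pages = expand(match.group("pages")) if match else []
        stack = stack[:level] + [heading]
        entries.append(Entry(tuple(stack), pages))
    return entries

sample = "Darwin, Charles, 1-5, 12\n  Beagle voyage, 20-35\n  marriage, 48\nfinches, 27, 30-31"
for entry in parse_index(sample):
    print(entry.path, entry.pages)
```

A known limitation of this sketch: a heading that ends in a number (such as a year) would have that number read as a page reference.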

Phase II: Research Questions

• How can index structure (run-on or indented, heading hierarchies) be extracted?

• Can keyphrases be extracted (proper nouns, concepts)?

• What are the syntax and semantics of indexes?

• Can we identify the historical development of indexes? How have they changed over time?

• Can we use XML to create a useful intermediate product?
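
As one way to picture an XML intermediate product, the sketch below serializes parsed entries with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`index`, `entry`, `locator`, `book`, `heading`) and the book identifier are invented for illustration; the slides do not specify a schema. The function assumes each parent heading is listed before its subheadings.

```python
import xml.etree.ElementTree as ET

def index_to_xml(book_id, entries):
    """Build a nested XML tree from (heading path, pages) pairs."""
    root = ET.Element("index", {"book": book_id})
    nodes = {(): root}  # maps a heading path to its XML element
    for path, pages in entries:
        parent = nodes[path[:-1]]  # KeyError if the parent is missing
        node = ET.SubElement(parent, "entry", {"heading": path[-1]})
        for page in pages:
            ET.SubElement(node, "locator").text = str(page)
        nodes[path] = node
    return root

entries = [
    (("Darwin, Charles",), [1, 2]),
    (("Darwin, Charles", "Beagle voyage"), [20]),
    (("finches",), [27]),
]
root = index_to_xml("book-001", entries)
print(ET.tostring(root, encoding="unicode"))
```

Nesting `entry` elements keeps the heading hierarchy explicit, which is the information that is otherwise carried only by indentation and punctuation on the printed page.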

Phase III: Building a Meta-Index

• Meta-index: a collection-level aggregation of the BoB indexes in a digital collection

• Merging/concatenating index entries

• May be a standard index format (alphabetical, hierarchical entries), i.e., a digital browsable index

• Or may use new formats, e.g. visualizations, topic maps
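
A minimal sketch of the merging step, assuming each book's index has already been parsed into a mapping from heading to page numbers. The `build_meta_index` function, the book identifiers and the normalization rule (case-folding and collapsing whitespace) are assumptions for this example; a real meta-index would also need to reconcile variant headings and subheadings.

```python
from collections import defaultdict

def build_meta_index(book_indexes):
    """Merge per-book indexes into heading -> {book id: pages}."""
    meta = defaultdict(dict)
    for book_id, entries in book_indexes.items():
        for heading, pages in entries.items():
            # Treat headings that differ only in case or spacing as one.
            key = " ".join(heading.casefold().split())
            meta[key].setdefault(book_id, []).extend(pages)
    # Alphabetical order gives a conventional, browsable layout.
    return {heading: meta[heading] for heading in sorted(meta)}

books = {
    "bio-1": {"Finches": [27], "Beagle voyage": [20]},
    "bio-2": {"finches": [14, 15]},
}
meta_index = build_meta_index(books)
for heading, refs in meta_index.items():
    print(heading, refs)
```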

[Figure: diagram with the labels BOOK, INDEX, META-INDEX 1, DIGITAL COLLECTION, META-INDEX 2]

Phase III: Research Questions

• Can digital versions of BoB indexes be used to facilitate access to digital collections?
• What form should these indexes take?
  – Conventional index format (alphabetical/searchable with headings and subheadings)
  – Index visualization
• How do these meta-indexes compare to a standard search engine when searching a digital collection?
• Evaluation: task-oriented evaluation with human subjects (e.g. humanities scholars)

Phase IV: Index Augmented Search

• Using the index information in new ways
  – Building an ontology in domain areas
  – Identifying concept relationships between index vocabulary and term vocabulary
  – Use for
    • Query expansion
    • Question answering
    • Summarization
    • Categorization
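
One simple way index information might feed query expansion is sketched below: headings that point to the same pages are treated as related, and the most strongly related headings are appended to the query. This is an illustrative assumption, not the method proposed in the slides; `related_headings`, `expand_query` and the sample index are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def related_headings(index):
    """Count how many page references each pair of headings shares."""
    by_page = defaultdict(set)
    for heading, pages in index.items():
        for page in pages:
            by_page[page].add(heading)
    related = defaultdict(Counter)
    for headings in by_page.values():
        for a, b in combinations(sorted(headings), 2):
            related[a][b] += 1
            related[b][a] += 1
    return related

def expand_query(term, related, k=2):
    """Append up to k headings that most often share pages with term."""
    neighbours = related.get(term, {})
    ranked = sorted(neighbours, key=lambda h: (-neighbours[h], h))
    return [term] + ranked[:k]

index = {
    "finches": [27, 30, 31],
    "Galapagos": [27, 30],
    "natural selection": [30, 90],
    "marriage": [48],
}
related = related_headings(index)
print(expand_query("finches", related))
```

With a single book the shared-page signal is thin; aggregating the counts over many indexes on the same topic is where such relationships would be expected to become more reliable.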

Phase IV: Research Questions

• Based on standard text processing procedures, i.e. stemming, use of stopwords, keyphrase extraction, term weighting such as tf*idf or BM25

• How strong is the relationship between the index entry and the words on the page(s) referred to?

• Assume that for a single entry, this relationship is weak; over multiple similar entries in many books, do real relationships emerge and false ones disappear?

• Evaluation: using external collections, e.g. TREC or INEX, to measure contribution of index term relationships to retrieval performance
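
To illustrate the second question, the sketch below scores how strongly an index heading is tied to the words on a referenced page, using a plain tf*idf weighting with each page treated as a document. It is a simplified assumption for illustration (no stemming, no stopword list, no BM25 length normalization), and the function names and sample pages are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_by_page(pages):
    """Weight each word on each page by tf * log(N / df)."""
    tf = {no: Counter(tokenize(text)) for no, text in pages.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    n = len(pages)
    return {
        no: {term: count * math.log(n / df[term]) for term, count in counts.items()}
        for no, counts in tf.items()
    }

def entry_strength(heading, page_no, weights):
    """Sum the page weights of the words that make up the heading."""
    page = weights.get(page_no, {})
    return sum(page.get(term, 0.0) for term in tokenize(heading))

pages = {
    27: "finches on the islands finches vary",
    48: "the marriage to Emma",
    90: "selection acts on the variation",
}
weights = tfidf_by_page(pages)
print(entry_strength("finches", 27, weights))  # referenced page
print(entry_strength("finches", 48, weights))  # unrelated page
```

A word such as "the" that occurs on every page receives a weight of zero, so only distinctive words contribute to the score. Summing such scores for similar entries over many books is one way the emergence of real relationships could be tested.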

Further Research

• Building themed or personalized collections (using index for book similarity measures)

• Ability to mine large multidisciplinary collections for references (historical, economic, etc.)

• Ability to mine collections and build special-format indexes and browsers (e.g. images, figures)

• Changes in topics over time, evolution of thinking on a subject

• Knowledge discovery: detecting previously undiscovered links between topics
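
For the first item above, one candidate similarity measure is the Jaccard overlap between the sets of headings in two books' indexes. The slides do not name a measure, so this choice, the function names and the sample data are assumptions for illustration.

```python
def jaccard(headings_a, headings_b):
    """Share of distinct headings (case-insensitive) common to both indexes."""
    a = {h.casefold() for h in headings_a}
    b = {h.casefold() for h in headings_b}
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(target, collection):
    """Rank the other books by index overlap with the target book."""
    scores = [
        (jaccard(collection[target], headings), book)
        for book, headings in collection.items()
        if book != target
    ]
    return sorted(scores, key=lambda s: (-s[0], s[1]))

collection = {
    "bio-1": ["Finches", "Beagle voyage", "marriage"],
    "bio-2": ["finches", "Beagle voyage", "barnacles"],
    "bc-history": ["fur trade", "gold rush"],
}
print(most_similar("bio-1", collection))
```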

The Indexer’s Legacy…

(a) an archaic addendum to an obsolete medium?

OR

(b) value-added knowledge in electronic text that enhances access to digital collections?

Thank you!
