Describes the Castanet algorithm for inducing faceted metadata and its application to biological journal titles.
NLP Support for Faceted Navigation in Scholarly Collections
ACL’09 Workshop on NLP for Scholarly Collections
Marti Hearst and Emilia Stoica
Presented by Preslav Nakov
Marti Hearst, Taxonomy Bootcamp ‘06
Motivation
Faceted navigation is now standard for “vertical” content collections
  e-commerce stores
  image collections
It is also being used for digital libraries
  WorldCat, NCSU, Chicago
Problem: the categories within the SUBJECT facet need to be richer.
  How to automatically create these facets?
Our solution: Castanet applied to scholarly collections
Outline
Definition of faceted metadata
Examples of faceted navigation in use
Castanet: an algorithm for (semi) automatic creation of facet hierarchies
Application of Castanet to a scholarly collection
The Idea of Facets
Facets are a way of labeling data
  A kind of metadata (data about data)
  Can be thought of as properties of items
Facets vs. Categories
  Items are placed INTO a category system
  Multiple facet labels are ASSIGNED TO items
The Idea of Facets
Create INDEPENDENT categories (facets)
  Each facet has labels (sometimes arranged in a hierarchy)
Assign labels from the facets to every item
  Example: recipe collection
    Course: Main Course
    Cooking Method: Stir-fry
    Cuisine: Thai
    Ingredient: Bell Pepper, Curry, Chicken
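The recipe example can be written down as a small data structure. This is only a minimal sketch, assuming each item maps a facet name to a set of labels; the helper name `has_label` is hypothetical and not part of Flamenco or Castanet.

```python
# One recipe item carrying labels from four independent facets (values from the slide).
recipe = {
    "Course": {"Main Course"},
    "Cooking Method": {"Stir-fry"},
    "Cuisine": {"Thai"},
    "Ingredient": {"Bell Pepper", "Curry", "Chicken"},
}


def has_label(item, facet, label):
    """Return True if `label` was assigned to `item` under `facet`."""
    return label in item.get(facet, set())


print(has_label(recipe, "Ingredient", "Curry"))
```

Because the facets are independent, an item can carry several labels in one facet (three ingredients here) without affecting the others.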
The Idea of Facets
Break out all the important concepts into their own facets
Sometimes the facets are hierarchical
  Assign labels to items from any level of the hierarchy
Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze
Desserts: Cakes, Cookies, Dairy (Ice Cream, Sorbet, Flan)
Fruits: Cherries, Berries (Blueberries, Strawberries), Bananas, Pineapple
Using Facets
Now there are multiple ways to get to each item
Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze
Desserts: Cakes, Cookies, Dairy (Ice Cream, Sherbet, Flan)
Fruits: Cherries, Berries (Blueberries, Strawberries), Bananas, Pineapple
Example item 1:
  Fruit > Pineapple
  Dessert > Cake
  Preparation > Bake
Example item 2:
  Dessert > Dairy > Sherbet
  Fruit > Berries > Strawberries
  Preparation > Freeze
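The two sets of label paths on this slide can be used to show why there are multiple ways to reach each item. The following is a rough sketch, assuming labels are stored as tuples from facet root to leaf; the item names and the `browse` function are hypothetical.

```python
# Label paths copied from the slide; the item names are made up for illustration.
ITEMS = {
    "item 1": [("Fruit", "Pineapple"), ("Dessert", "Cake"), ("Preparation", "Bake")],
    "item 2": [
        ("Dessert", "Dairy", "Sherbet"),
        ("Fruit", "Berries", "Strawberries"),
        ("Preparation", "Freeze"),
    ],
}


def browse(items, *selections):
    """Return items that have, for every selected path, a label at or below it."""

    def under(path, selection):
        # A label matches when the selection is a prefix of its path.
        return path[: len(selection)] == selection

    return sorted(
        name
        for name, paths in items.items()
        if all(any(under(p, s) for p in paths) for s in selections)
    )


print(browse(ITEMS, ("Dessert",)))
print(browse(ITEMS, ("Fruit",), ("Preparation", "Bake")))
```

Selecting a label high in a hierarchy (Dessert) keeps every item labeled anywhere below it, and selections from different facets narrow the set together.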
Faceted navigation’s advantages:
Integrate browsing and searching seamlessly
Support exploration and learning
Avoid dead-ends, “pogo’ing”, and “lostness”
Uses of Faceted Navigation in Online Digital Libraries
WorldCat
U Chicago
Advantages of Facets
Can’t end up with empty result sets (except with keyword search)
Helps avoid feelings of being lost
Easier to explore the collection
  Helps users infer what kinds of things are in the collection
  Evokes a feeling of “browsing the shelves”
Is preferred over standard search for collection browsing in usability studies
  (Interface must be designed properly)
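One way to see why browsing cannot lead to an empty result set: the interface only offers labels that occur in the current results. The sketch below assumes a flat item representation and a hypothetical `refinements` helper; the three recipes are invented sample data, not taken from the study collections.

```python
from collections import Counter

# Invented sample data: each item has one label per facet.
RECIPES = [
    {"Cuisine": "Thai", "Course": "Main Course"},
    {"Cuisine": "Thai", "Course": "Dessert"},
    {"Cuisine": "Italian", "Course": "Main Course"},
]


def refinements(results, facet):
    """Count the labels of `facet` that occur in the current result set."""
    return Counter(item[facet] for item in results if facet in item)


# After the user picks Cuisine = Thai, only Course labels with matches are offered.
thai = [r for r in RECIPES if r["Cuisine"] == "Thai"]
options = refinements(thai, "Course")
print(options)
```

Every offered option has a count of at least one, so following it always yields results; free-text keyword search has no such guarantee.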
Limitation of Facets
Do not naturally capture MAIN THEMES
Facets do not show RELATIONS explicitly
  Aquamarine, Red, Orange
  Door, Doorway, Wall
  Which color is associated with which object?
Photo by J. Hearst, jhearst.typepad.com
Usability Studies (using Flamenco)
Usability studies done on 3 collections:
  Recipes (epicurious): 13,000 items
  Architecture Images: 40,000 items
  Fine Arts Images: 35,000 items
Conclusions:
  Users like and are successful with the dynamic faceted hierarchical metadata, especially for browsing tasks
  Very positive results, in contrast with studies on earlier iterations.
How to Create Facet Hierarchies?
Our Approach: Castanet
Biomedical Journal Titles (3275 Titles)
Journal of clinical hypertension
American journal of hypertension : journal of the American Society of Hypertension
Hypertension in pregnancy : official journal of the International Society for the Study of Hypertension in Pregnancy
Journal of interventional cardiac electrophysiology : an international journal of arrhythmias and pacing
Heart failure reviews
Hypertension research : official journal of the Japanese Society of Hypertension
Current hypertension reports
European journal of heart failure : journal of the Working Group on Heart Failure of the European Society of Cardiology
Congestive heart failure (Greenwich, Conn.)
Clinical and experimental hypertension (New York, N.Y. : 1993)
Hypertension
Journal of human hypertension
Castanet Output (Bio titles)
Castanet Output (LibraryThing tags)
Our Approach
Leverage the structure of WordNet
Documents -> Select terms -> Get hypernym paths (from WordNet) -> Build tree -> Compress tree -> Divide into facets
1. Select Terms
Select well distributed terms from the collection
Example terms: red, blue
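The slide does not define “well distributed”. One plausible reading, sketched here purely as an assumption, is a document-frequency filter: keep terms that occur in several documents but not in nearly all of them. The thresholds, the stopword list, and the function name `select_terms` are all hypothetical.

```python
from collections import Counter

# Assumed stopwords for journal titles; not taken from the slides.
STOPWORDS = {"of", "the", "journal"}


def select_terms(documents, min_docs=2, max_fraction=0.8):
    """Keep terms found in at least `min_docs` documents and in at most
    `max_fraction` of the collection (both thresholds are assumptions)."""
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document.
        doc_freq.update(set(text.lower().split()) - STOPWORDS)
    limit = max_fraction * len(documents)
    return sorted(t for t, df in doc_freq.items() if min_docs <= df <= limit)


titles = [
    "Journal of human hypertension",
    "Current hypertension reports",
    "Heart failure reviews",
    "European journal of heart failure",
    "Journal of clinical hypertension",
    "Hypertension research",
]
selected = select_terms(titles)
print(selected)
```

Terms that appear in only one title (such as "reviews") are dropped because they cannot group items together.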
2. Get Hypernym Path
red: abstraction => property => visual property => color => chromatic color => red, redness
blue: abstraction => property => visual property => color => chromatic color => blue, blueness
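Looking up a hypernym path amounts to following parent links until the top of the hierarchy. The sketch below uses a small hand-written table in place of WordNet, containing only the nodes shown on the slide; the names `HYPERNYM`, `SYNSET`, and `hypernym_path` are hypothetical.

```python
# Toy stand-in for WordNet: each synset label maps to its parent (hypernym).
HYPERNYM = {
    "red, redness": "chromatic color",
    "blue, blueness": "chromatic color",
    "chromatic color": "color",
    "color": "visual property",
    "visual property": "property",
    "property": "abstraction",
}

# Term -> synset label (one sense each in this toy example).
SYNSET = {"red": "red, redness", "blue": "blue, blueness"}


def hypernym_path(term):
    """Return the path from the most general node down to the term's synset."""
    node = SYNSET[term]
    path = [node]
    while node in HYPERNYM:
        node = HYPERNYM[node]
        path.append(node)
    return path[::-1]


print(" => ".join(hypernym_path("red")))
```

With the real WordNet a term can have several senses and a sense can have several paths, which is the ambiguity the Disambiguation slide addresses.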
3. Build Tree
Hypernym paths:
  red: abstraction => property => visual property => color => chromatic color => red, redness
  blue: abstraction => property => visual property => color => chromatic color => blue, blueness
Merged tree:
  abstraction
    property
      visual property
        color
          chromatic color
            red, redness
            blue, blueness
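Merging the paths is essentially a trie insertion: shared prefixes collapse into shared nodes. A minimal sketch, assuming the tree is represented as nested dictionaries; `build_tree` is a hypothetical name.

```python
def build_tree(paths):
    """Merge root-to-leaf paths into one nested-dict tree ({label: children})."""
    tree = {}
    for path in paths:
        node = tree
        for label in path:
            # Reuse the existing child if an earlier path already created it.
            node = node.setdefault(label, {})
    return tree


shared = ["abstraction", "property", "visual property", "color", "chromatic color"]
paths = [shared + ["red, redness"], shared + ["blue, blueness"]]
tree = build_tree(paths)

node = tree
for label in shared:
    node = node[label]
print(list(node))
```

The two paths share five nodes, so the merged tree branches only at "chromatic color".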
4. Compress Tree
Before:
  color
    chromatic color
      red, redness
        red
      blue, blueness
        blue
      green, greenness
        green
After:
  color
    chromatic color
      red
      blue
      green
4. Compress Tree (cont.)
Before:
  color
    chromatic color
      red
      blue
      green
After:
  color
    red
    blue
    green
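The slides show the effect of compression but not the rules. The sketch below reproduces both steps under two assumed rules: (a) a node with exactly one child is replaced by that child, and (b) a node whose label contains its parent's label ("chromatic color" under "color") is replaced by its children. The function name `compress` and the nested-dict representation are hypothetical.

```python
def compress(tree, parent=None):
    """Return a compressed copy of a nested-dict tree ({label: children})."""
    compressed = {}
    for label, children in tree.items():
        children = compress(children, parent=label)
        # Assumed rule (a): a node with a single child adds no useful grouping.
        has_single_child = len(children) == 1
        # Assumed rule (b): the label merely repeats the parent's label.
        # A plain substring test is used here as a simplification.
        repeats_parent = parent is not None and parent in label
        if has_single_child or repeats_parent:
            compressed.update(children)  # splice the children into this node's place
        else:
            compressed[label] = children
    return compressed


before = {
    "color": {
        "chromatic color": {
            "red, redness": {"red": {}},
            "blue, blueness": {"blue": {}},
            "green, greenness": {"green": {}},
        }
    }
}
print(compress(before))
```

Note that this sketch would also remove a single-child root; a fuller version would probably treat the root as a special case.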
5. Divide into Facets
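The slide gives no detail on how the tree is divided. One simple possibility, shown here only as an assumption, is to discard the most abstract top levels and treat each remaining subtree as a facet. The function name, the `drop_levels` parameter, and the toy tree are hypothetical.

```python
def divide_into_facets(tree, drop_levels):
    """Discard the top `drop_levels` levels; each surviving subtree is a facet."""
    forest = tree
    for _ in range(drop_levels):
        next_forest = {}
        for children in forest.values():
            next_forest.update(children)
        forest = next_forest
    return forest


# Toy tree with deliberately abstract top-level nodes.
tree = {
    "abstraction": {"color": {"red": {}, "blue": {}}},
    "organism, being": {"fish": {"tuna": {}}, "plant, flora": {"cactus": {}}},
}
facets = divide_into_facets(tree, drop_levels=1)
print(sorted(facets))
```

How many levels to drop would be a judgment call for the information architect reviewing the output.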
Disambiguation
Ambiguity in:
  Word senses
  Paths up the hypernym tree
Sense 1 for word “tuna”
  organism, being => plant, flora => vascular plant => succulent => cactus => tuna
Sense 2 for word “tuna”
  organism, being => fish => food fish => tuna
                          => bony fish => spiny-finned fish => percoid fish => tuna
Senses 1 and 2: 2 paths for same word
Within sense 2: 2 paths for same sense
How to Select the Right Senses and Paths?
First: build core tree
  (1) Create paths for words with only one sense
  (2) Use Domains
    WordNet has 212 Domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
    Automatically scan the collection to see which domains apply
    The user selects which of the suggested domains to use or may add own
    Paths for terms that match the selected domains are added to the core tree
Then: add remaining terms to the core tree.
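Scanning the collection for applicable domains could be as simple as counting which domains the collection's terms fall under. The sketch below assumes a tiny hand-written sense inventory; the domain labels, the `SENSES` table, and `suggest_domains` are hypothetical and do not reflect the actual WordNet Domains data.

```python
from collections import Counter

# Hypothetical sense inventory: term -> list of (sense label, domain) pairs.
SENSES = {
    "dip": [("depression", "geography"), ("magnetic dip", "physics"), ("dip (sauce)", "food")],
    "curry": [("curry", "food")],
    "tuna": [("tuna (cactus)", "botany"), ("tuna (fish)", "zoology")],
    "saute": [("saute", "food")],
}


def suggest_domains(terms, top_n=2):
    """Rank domains by how many collection terms have a sense in them."""
    counts = Counter()
    for term in terms:
        # Count each domain at most once per term.
        counts.update({domain for _, domain in SENSES.get(term, [])})
    return [domain for domain, _ in counts.most_common(top_n)]


print(suggest_domains(["dip", "curry", "tuna", "saute"], top_n=1))
```

The suggestions are only a starting list; as the slide notes, the user decides which domains to keep and may add others.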
Using Domains
dip glosses:
  Sense 1: A depression in an otherwise level surface
  Sense 2: The angle that a magnetic needle makes with the horizon
  Sense 3: Tasty mixture into which bite-size foods are dipped
dip hypernyms:
  Sense 1: solid => concave shape => depression
  Sense 2: shape, form => space => angle
  Sense 3: food => ingredient, fixings => flavorer
Given domain “food”, choose sense 3
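The “dip” example can be expressed as a filter over candidate senses. In this sketch a sense is taken to match when the selected domain appears as a label on its hypernym path; that matching rule is an assumption made for illustration, and `choose_senses` is a hypothetical name.

```python
# The three hypernym paths for "dip", as listed on the slide.
DIP_SENSES = [
    {"sense": 1, "path": ["solid", "concave shape", "depression"]},
    {"sense": 2, "path": ["shape, form", "space", "angle"]},
    {"sense": 3, "path": ["food", "ingredient, fixings", "flavorer"]},
]


def choose_senses(senses, domain):
    """Keep the senses whose hypernym path mentions the selected domain."""
    return [s for s in senses if domain in s["path"]]


chosen = choose_senses(DIP_SENSES, "food")
print([s["sense"] for s in chosen])
```

If no sense matches any selected domain, the term stays ambiguous and would be handled in the later “add remaining terms” step.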
Castanet Evaluation
This is a tool for information architects, so people of this type did the evaluation
We compared output on
  Recipes
  Biomedical journal titles
We compared to two state-of-the-art algorithms
  LDA (Blei et al. ’04)
  Subsumption (Sanderson & Croft ’99)
Subsumption Output (Bio titles)
LDA Output (Bio titles)
Evaluation Method
Information architects assessed the category systems
For each of 2 systems’ output:
  Examined and commented on top-level
  Examined and commented on two sub-levels
Then commented on overall properties
  Meaningful?
  Systematic?
  Likely to use in your work?
Evaluation Results (Bio titles)
15 participants, all PubMed users
Results for “Would you use this system in your work?”
  Answering “Yes in some cases” or “Yes definitely”:
    Pine (Castanet): 11/15
    Oak (LDA): 1/7
    Birch (Subsumption): 1/8
Evaluation Results (recipes)
Results on recipes collection for “Would you use this system in your work?”
  Answering “Yes in some cases” or “Yes definitely”:
    Pine (Castanet): 29/34
    Oak (LDA): 0/18
    Birch (Subsumption): 6/16
Results on quality of categories:
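Since each system was judged by a different number of participants, the “Would you use this system in your work?” counts are easier to compare as percentages. The short calculation below uses only the counts reported on the two results slides; the `yes_rates` helper is a hypothetical name.

```python
# (yes answers, participants) per system, copied from the two results slides.
RESULTS = {
    "Bio titles": {
        "Pine (Castanet)": (11, 15),
        "Oak (LDA)": (1, 7),
        "Birch (Subsumption)": (1, 8),
    },
    "Recipes": {
        "Pine (Castanet)": (29, 34),
        "Oak (LDA)": (0, 18),
        "Birch (Subsumption)": (6, 16),
    },
}


def yes_rates(results):
    """Convert (yes, total) counts into percentages rounded to one decimal."""
    return {
        collection: {
            system: round(100 * yes / total, 1)
            for system, (yes, total) in systems.items()
        }
        for collection, systems in results.items()
    }


rates = yes_rates(RESULTS)
for collection, systems in rates.items():
    print(collection, systems)
```

The sample sizes are small, so the percentages are best read as a rough ordering of the systems.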
Conclusions
Flexible application of hierarchical faceted metadata is a proven approach for navigating scholarly collections.
  Midway in complexity between simple hierarchies and deep knowledge representation.
Currently in use in digital library sites, but the SUBJECT categories need more work.
Algorithms are needed to help create faceted metadata structures
  Our WordNet-based algorithm, while not perfect, provides a good starting point for scholarly collections
For more information: flamenco.berkeley.edu
Thank you!
Preslav Nakov, Marti Hearst & Emilia Stoica