The Castanet Algorithm for semi-automatically inducing faceted metadata

Describes the Castanet algorithm for inducing faceted metadata and its application to biological journal titles.

Page 1: The Castanet Algorithm for semi-automatically inducing faceted metadata

NLP Support for Faceted Navigation in Scholarly Collections

ACL’09 Workshop on NLP for Scholarly Collections

Marti Hearst and Emilia Stoica

Presented by Preslav Nakov

Page 2: The Castanet Algorithm for semi-automatically inducing faceted metadata

Marti Hearst, Taxonomy Bootcamp ‘06

Motivation

Faceted navigation is now standard for “vertical” content collections: e-commerce stores, image collections

It is also being used for digital libraries: WorldCat, NCSU, Chicago

Problem: the labels for the SUBJECT facet need to be richer.

How to automatically create these facets?

Our solution: Castanet applied to scholarly collections

Page 3: The Castanet Algorithm for semi-automatically inducing faceted metadata

Outline

Definition of faceted metadata

Examples of faceted navigation in use

Castanet: an algorithm for (semi) automatic creation of facet hierarchies

Application of Castanet to a scholarly collection

Page 4: The Castanet Algorithm for semi-automatically inducing faceted metadata

The Idea of Facets

Facets are a way of labeling data: a kind of metadata (data about data); can be thought of as properties of items

Facets vs. categories: items are placed INTO a category system; multiple facet labels are ASSIGNED TO items

Page 5: The Castanet Algorithm for semi-automatically inducing faceted metadata

The Idea of Facets

Create INDEPENDENT categories (facets)

Each facet has labels (sometimes arranged in a hierarchy)

Assign labels from the facets to every item

Example: recipe collection
Course: Main Course
Cooking Method: Stir-fry
Cuisine: Thai
Ingredient: Bell Pepper, Curry, Chicken
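
The recipe example can be sketched as a small data structure. This is only an illustration: the `Item` class, the `has_label` helper, and the item name are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One item of the collection with the labels assigned to it, grouped by facet."""
    name: str
    labels: dict[str, set[str]] = field(default_factory=dict)


# Hypothetical item carrying the labels from the recipe example.
thai_curry = Item(
    name="Thai chicken curry stir-fry",
    labels={
        "Course": {"Main Course"},
        "Cooking Method": {"Stir-fry"},
        "Cuisine": {"Thai"},
        "Ingredient": {"Bell Pepper", "Curry", "Chicken"},
    },
)


def has_label(item: Item, facet: str, label: str) -> bool:
    """True if `label` from `facet` was assigned to the item."""
    return label in item.labels.get(facet, set())
```

The facets stay independent: adding a label under one facet never changes what is stored under another.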

Page 6: The Castanet Algorithm for semi-automatically inducing faceted metadata

The Idea of Facets

Break out all the important concepts into their own facets

Sometimes the facets are hierarchical

Assign labels to items from any level of the hierarchy

Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze

Desserts: Cakes, Cookies, Dairy (Ice Cream, Sorbet, Flan)

Fruits: Cherries, Berries (Blueberries, Strawberries), Bananas, Pineapple

Page 7: The Castanet Algorithm for semi-automatically inducing faceted metadata

Using Facets

Now there are multiple ways to get to each item

Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze

Desserts: Cakes, Cookies, Dairy (Ice Cream, Sherbet, Flan)

Fruits: Cherries, Berries (Blueberries, Strawberries), Bananas, Pineapple

One item: Fruit > Pineapple; Dessert > Cake; Preparation > Bake

Another item: Dessert > Dairy > Sherbet; Fruit > Berries > Strawberries; Preparation > Freeze
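
A minimal sketch of how several paths lead to the same item, assuming each item stores one label path per facet. The item names and the functions `matches`, `select`, and `refinements` are hypothetical and are not taken from Flamenco or any other system named in the slides.

```python
from collections import Counter

# Each item carries one hierarchical label path per facet (hypothetical data
# mirroring the two example items on the slide).
ITEMS = {
    "pineapple cake": [
        ("Fruit", "Pineapple"),
        ("Dessert", "Cake"),
        ("Preparation", "Bake"),
    ],
    "strawberry sherbet": [
        ("Dessert", "Dairy", "Sherbet"),
        ("Fruit", "Berries", "Strawberries"),
        ("Preparation", "Freeze"),
    ],
}


def matches(paths, query):
    """True if some label path of the item starts with the query path."""
    return any(path[:len(query)] == query for path in paths)


def select(items, *queries):
    """Names of the items that match every query path (conjunctive refinement)."""
    return sorted(name for name, paths in items.items()
                  if all(matches(paths, q) for q in queries))


def refinements(items, selected, facet):
    """Count the second-level labels under `facet` among the selected items.

    Only labels carried by at least one selected item are returned, so
    following one of them cannot produce an empty result set.
    """
    counts = Counter()
    for name in selected:
        for path in items[name]:
            if path[0] == facet and len(path) > 1:
                counts[path[1]] += 1
    return dict(counts)
```

For example, `select(ITEMS, ("Dessert",), ("Preparation", "Bake"))` and `select(ITEMS, ("Fruit", "Pineapple"))` both reach the same item by different routes.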

Page 8: The Castanet Algorithm for semi-automatically inducing faceted metadata

Faceted navigation’s advantages:

Integrate browsing and searching seamlessly

Support exploration and learning

Avoid dead-ends, “pogo’ing”, and “lostness”

Page 9: The Castanet Algorithm for semi-automatically inducing faceted metadata

Uses of Faceted Navigation in Online Digital Libraries

Page 10: The Castanet Algorithm for semi-automatically inducing faceted metadata

WorldCat

Page 11: The Castanet Algorithm for semi-automatically inducing faceted metadata

WorldCat

Page 12: The Castanet Algorithm for semi-automatically inducing faceted metadata

U Chicago

Page 13: The Castanet Algorithm for semi-automatically inducing faceted metadata

U Chicago

Page 14: The Castanet Algorithm for semi-automatically inducing faceted metadata

Advantages of Facets

Can’t end up with empty result sets (except with keyword search)

Helps avoid feelings of being lost. Easier to explore the collection.

Helps users infer what kinds of things are in the collection.

Evokes a feeling of “browsing the shelves”

Is preferred over standard search for collection browsing in usability studies. (Interface must be designed properly)

Page 15: The Castanet Algorithm for semi-automatically inducing faceted metadata

Limitation of Facets

Do not naturally capture MAIN THEMES

Facets do not show RELATIONS explicitly

Example photo labels: Aquamarine, Red, Orange (colors); Door, Doorway, Wall (objects)

Which color is associated with which object?

Photo by J. Hearst, jhearst.typepad.com

Page 16: The Castanet Algorithm for semi-automatically inducing faceted metadata

Usability Studies (using Flamenco)

Usability studies done on 3 collections:
Recipes (epicurious): 13,000 items
Architecture Images: 40,000 items
Fine Arts Images: 35,000 items

Conclusions: Users like and are successful with the dynamic faceted hierarchical metadata, especially for browsing tasks

Very positive results, in contrast with studies on earlier iterations.

Page 17: The Castanet Algorithm for semi-automatically inducing faceted metadata

How to Create Facet Hierarchies?

Our Approach: Castanet

Page 18: The Castanet Algorithm for semi-automatically inducing faceted metadata

Biomedical Journal Titles (3275 Titles)

Journal of clinical hypertension

American journal of hypertension : journal of the American Society of Hypertension

Hypertension in pregnancy : official journal of the International Society for the Study of Hypertension in Pregnancy

Journal of interventional cardiac electrophysiology : an international journal of arrhythmias and pacing

Heart failure reviews

Hypertension research : official journal of the Japanese Society of Hypertension

Current hypertension reports

European journal of heart failure : journal of the Working Group on Heart Failure of the European Society of Cardiology

Congestive heart failure (Greenwich, Conn.)

Clinical and experimental hypertension (New York, N.Y. : 1993)

Hypertension

Journal of human hypertension

Page 19: The Castanet Algorithm for semi-automatically inducing faceted metadata

Castanet Output (Bio titles)

Page 20: The Castanet Algorithm for semi-automatically inducing faceted metadata

Castanet Output (Bio titles)

Page 21: The Castanet Algorithm for semi-automatically inducing faceted metadata

Castanet Output (LibraryThing tags)

Page 22: The Castanet Algorithm for semi-automatically inducing faceted metadata

Castanet Output (LibraryThing Tags)

Page 23: The Castanet Algorithm for semi-automatically inducing faceted metadata

Castanet Output (LibraryThing Tags)

Page 24: The Castanet Algorithm for semi-automatically inducing faceted metadata

Our Approach: Leverage the structure of WordNet

Page 25: The Castanet Algorithm for semi-automatically inducing faceted metadata

Our Approach

Leverage the structure of WordNet

Pipeline: Documents => Select terms => Get hypernym paths (from WordNet) => Build tree => Compress tree => Divide into facets

Page 26: The Castanet Algorithm for semi-automatically inducing faceted metadata

1. Select Terms

Select well-distributed terms from the collection

Example terms: red, blue
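
The slide does not say how “well distributed” is measured. The sketch below assumes one simple reading: keep terms that occur in at least a few documents but not in most of them. The function name `select_terms` and both thresholds are assumptions made for illustration only.

```python
from collections import Counter


def select_terms(documents, min_docs=2, max_fraction=0.5):
    """Keep terms found in at least `min_docs` documents and in at most
    `max_fraction` of the collection (both thresholds are assumptions)."""
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document, ignoring case.
        doc_freq.update(set(text.lower().split()))
    limit = max_fraction * len(documents)
    return sorted(term for term, df in doc_freq.items()
                  if min_docs <= df <= limit)
```

With these settings a word that appears in every document (such as “the”) is dropped, and so is a word that appears only once.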

Page 27: The Castanet Algorithm for semi-automatically inducing faceted metadata

2. Get Hypernym Path

red: abstraction => property => visual property => color => chromatic color => red, redness

blue: abstraction => property => visual property => color => chromatic color => blue, blueness
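
A minimal sketch of this step, using a hand-written table in place of WordNet so that it runs without any data files. The `HYPERNYM` table and the `hypernym_path` function are hypothetical; the table holds only the entries shown on the slide.

```python
# Toy hypernym table standing in for WordNet: each entry points to its parent.
HYPERNYM = {
    "red": "red, redness",
    "blue": "blue, blueness",
    "red, redness": "chromatic color",
    "blue, blueness": "chromatic color",
    "chromatic color": "color",
    "color": "visual property",
    "visual property": "property",
    "property": "abstraction",
}


def hypernym_path(term, hypernym=HYPERNYM):
    """Return the path from the most general ancestor down to `term`."""
    path = [term]
    while path[-1] in hypernym:
        path.append(hypernym[path[-1]])
    return path[::-1]
```

A table with one parent per entry cannot represent a word with several senses or a sense with several paths, which is the ambiguity the Disambiguation slide deals with.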

Page 28: The Castanet Algorithm for semi-automatically inducing faceted metadata

3. Build Tree

Before (separate paths):

abstraction => property => visual property => color => chromatic color => red, redness => red

abstraction => property => visual property => color => chromatic color => blue, blueness => blue

After (merged tree):

abstraction
  => property
    => visual property
      => color
        => chromatic color
          => red, redness => red
          => blue, blueness => blue
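
Merging the paths can be sketched with nested dictionaries, where shared prefixes are stored once. The name `build_tree` and the dictionary representation are choices made for this illustration, not details given in the slides.

```python
# The two hypernym paths from the slide, root first.
PATHS = [
    ["abstraction", "property", "visual property", "color",
     "chromatic color", "red, redness", "red"],
    ["abstraction", "property", "visual property", "color",
     "chromatic color", "blue, blueness", "blue"],
]


def build_tree(paths):
    """Merge root-to-leaf paths into one nested-dict tree {label: subtree}."""
    tree = {}
    for path in paths:
        node = tree
        for label in path:
            # Reuse the existing child if a previous path already created it.
            node = node.setdefault(label, {})
    return tree
```

After the merge, “chromatic color” appears once and has two children, one per term.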

Page 29: The Castanet Algorithm for semi-automatically inducing faceted metadata

4. Compress Tree

Before:

color => chromatic color
  => red, redness => red
  => blue, blueness => blue
  => green, greenness => green

After:

color => chromatic color => red, blue, green

Page 30: The Castanet Algorithm for semi-automatically inducing faceted metadata

4. Compress Tree (cont.)

Before:

color => chromatic color => red, blue, green

After:

color => red, blue, green
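
The slides show the trees before and after compression but do not state the elimination criteria. The sketch below uses two assumed rules, chosen only so that it reproduces the pictured trees: a non-root node with a single child is replaced by that child, and a node whose label repeats its parent's label is dropped with its children moved up. The function name and the nested-dict representation are also assumptions.

```python
# The tree from the "Compress Tree" slide before compression.
TREE = {
    "color": {
        "chromatic color": {
            "red, redness": {"red": {}},
            "blue, blueness": {"blue": {}},
            "green, greenness": {"green": {}},
        }
    }
}


def compress(children, parent=None, drop_overlapping=True):
    """Return a compressed copy of `children`, a dict mapping label -> subtree."""
    out = {}
    for label, sub in children.items():
        sub = compress(sub, parent=label, drop_overlapping=drop_overlapping)
        if parent is not None and len(sub) == 1:
            # Assumed rule A: a non-root node with one child is replaced by that child.
            out.update(sub)
        elif drop_overlapping and parent is not None and sub and parent in label:
            # Assumed rule B: an inner node whose label contains its parent's
            # label is dropped and its children move up one level.
            out.update(sub)
        else:
            out[label] = sub
    return out
```

Calling `compress(TREE, drop_overlapping=False)` applies only rule A and gives the first “after” tree; `compress(TREE)` applies both rules and gives the second.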

Page 31: The Castanet Algorithm for semi-automatically inducing faceted metadata

5. Divide into Facets
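
The slide gives no detail for this step. One plausible reading, stated here as an assumption, is that the most general levels of the compressed tree are dropped and each remaining subtree becomes a facet. The function name, the `levels_to_drop` parameter, and the example tree are hypothetical.

```python
def divide_into_facets(tree, levels_to_drop=1):
    """Drop the top `levels_to_drop` levels of a nested-dict tree; every
    subtree that remains at the top is treated as one facet (assumption)."""
    forest = tree
    for _ in range(levels_to_drop):
        merged = {}
        for subtree in forest.values():
            merged.update(subtree)
        forest = merged
    return forest


# Hypothetical compressed tree with one overly general root.
EXAMPLE = {"entity": {"color": {"red": {}, "blue": {}}, "food": {"fruit": {}}}}
```

Here `divide_into_facets(EXAMPLE)` yields two facets, “color” and “food”. How many levels to drop would have to be chosen per collection.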

Page 32: The Castanet Algorithm for semi-automatically inducing faceted metadata

Disambiguation

Ambiguity in: word senses; paths up the hypernym tree

Sense 1 for word “tuna”:
organism, being => plant, flora => vascular plant => succulent => cactus => tuna

Sense 2 for word “tuna” (2 paths for the same sense):
organism, being => fish => food fish => tuna
organism, being => fish => bony fish => spiny-finned fish => percoid fish => tuna

Senses 1 and 2: 2 paths for the same word

Page 33: The Castanet Algorithm for semi-automatically inducing faceted metadata

How to Select the Right Senses and Paths?

First: build core tree
(1) Create paths for words with only one sense
(2) Use Domains

WordNet has 212 Domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.

Automatically scan the collection to see which domains apply

The user selects which of the suggested domains to use or may add own

Paths for terms that match the selected domains are added to the core tree

Then: add remaining terms to the core tree.

Page 34: The Castanet Algorithm for semi-automatically inducing faceted metadata

Using Domains

dip glosses:

Sense 1: A depression in an otherwise level surface

Sense 2: The angle that a magnetic needle makes with the horizon

Sense 3: Tasty mixture into which bite-size foods are dipped

dip hypernyms:

Sense 1: solid => concave shape => depression

Sense 2: shape, form => space => angle

Sense 3: food => ingredient, fixings => flavorer

Given domain “food”, choose sense 3
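
The “dip” example can be sketched as a lookup over the three hypernym paths listed on the slide. Matching a domain name against the labels on the path is a simplification assumed here; the slides do not describe how domains are attached to senses. `DIP_SENSES` and `senses_in_domains` are hypothetical names.

```python
# Hypernym paths for the three senses of "dip", as listed on the slide.
DIP_SENSES = {
    1: ["solid", "concave shape", "depression"],
    2: ["shape, form", "space", "angle"],
    3: ["food", "ingredient, fixings", "flavorer"],
}


def senses_in_domains(sense_paths, domains):
    """Return the senses whose hypernym path contains a selected domain name."""
    return [sense for sense, path in sense_paths.items()
            if any(domain in path for domain in domains)]
```

If no sense matches the selected domains, the list is empty and the term would be left for the later pass that adds the remaining terms to the core tree.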

Page 35: The Castanet Algorithm for semi-automatically inducing faceted metadata

Castanet Evaluation

Page 36: The Castanet Algorithm for semi-automatically inducing faceted metadata

Castanet Evaluation

This is a tool for information architects, so people of this type did the evaluation

We compared output on: recipes, biomedical journal titles

We compared to two state-of-the-art algorithms: LDA (Blei et al. ’04), Subsumption (Sanderson & Croft ’99)

Page 37: The Castanet Algorithm for semi-automatically inducing faceted metadata

Subsumption Output (Bio titles)

Page 38: The Castanet Algorithm for semi-automatically inducing faceted metadata

Subsumption Output (Bio titles)

Page 39: The Castanet Algorithm for semi-automatically inducing faceted metadata

LDA Output (Bio titles)

Page 40: The Castanet Algorithm for semi-automatically inducing faceted metadata

LDA Output (Bio titles)

Page 41: The Castanet Algorithm for semi-automatically inducing faceted metadata

Evaluation Method

Information architects assessed the category systems

For each of 2 systems’ output:
Examined and commented on top-level
Examined and commented on two sub-levels

Then comment on overall properties:
Meaningful?
Systematic?
Likely to use in your work?

Page 42: The Castanet Algorithm for semi-automatically inducing faceted metadata

Evaluation Results (Bio titles)

15 participants, all PubMed users

Results for “Would you use this system in your work?”, answering “Yes in some cases” or “Yes definitely”:
Pine (Castanet): 11/15
Oak (LDA): 1/7
Birch (Subsumption): 1/8

Page 43: The Castanet Algorithm for semi-automatically inducing faceted metadata

Evaluation Results (recipes)

Results on recipes collection for “Would you use this system in your work?”, answering “Yes in some cases” or “Yes definitely”:
Pine (Castanet): 29/34
Oak (LDA): 0/18
Birch (Subsumption): 6/16
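
For comparison across systems, the counts for both collections can be turned into percentages. The numbers below are copied from the two results slides; the `RESULTS` table and `yes_rate` helper are hypothetical names. The denominators for the two baselines add up to the Castanet denominator (7 + 8 = 15 and 18 + 16 = 34), which would be consistent with each participant rating Castanet plus one baseline, although the slides do not say so explicitly.

```python
# Counts of "yes in some cases" / "yes definitely" answers, copied from the slides.
RESULTS = {
    "bio titles": {"Castanet": (11, 15), "LDA": (1, 7), "Subsumption": (1, 8)},
    "recipes": {"Castanet": (29, 34), "LDA": (0, 18), "Subsumption": (6, 16)},
}


def yes_rate(yes, total):
    """Share of positive answers as a percentage, rounded to one decimal."""
    return round(100 * yes / total, 1)


RATES = {
    collection: {system: yes_rate(*counts) for system, counts in systems.items()}
    for collection, systems in RESULTS.items()
}
```

With samples this small, the percentages are best read as rough indicators rather than precise estimates.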

Results on quality of categories:

Page 44: The Castanet Algorithm for semi-automatically inducing faceted metadata

Conclusions

Flexible application of hierarchical faceted metadata is a proven approach for navigating scholarly collections.

Midway in complexity between simple hierarchies and deep knowledge representation.

Currently in use in digital library sites, but the SUBJECT categories need more work.

Algorithms are needed to help create faceted metadata structures

Our WordNet-based algorithm, while not perfect, provides a good starting point for scholarly collections

Page 45: The Castanet Algorithm for semi-automatically inducing faceted metadata

For more information: flamenco.berkeley.edu

Thank you!

Preslav Nakov, Marti Hearst & Emilia Stoica