How (Not) to Use a Semi-automated Clustering Tool Kat Hagedorn University of Michigan April 11, 2006


Update on UM’s efforts

Built three research portals
  DLF <http://www.hti.umich.edu/cgi/b/bib-idx?c=imls>
  MODS <http://www.hti.umich.edu/m/mods>
  Aquifer <http://www.hti.umich.edu/a/aquifer>
Improvements for search / display
  Integration of MODS format records
  Simple vs. advanced searching
  Inclusion of thumbnails

The need to cluster

Want to offer more than search within a generic, large corpus of data
How to partition the data?
  Emory’s MetaCombine tool promising as a topical clustering agent
  (Also interested in clustering by format, access restriction, OAI software used, etc.)

Clustering vs. classification

Clustering is main focus
  Huge amount of data
  Needed a tool to “find the topic”
  Preferably a disjunctive tool (placing files under more than one topic)
Classification is secondary focus
  Have potential classification (UM’s browse)
  Marrying to current system nigh on impossible

Results: duration

First tried with small repository of ~5500 records (amnh)
  Took around 25 minutes
Multiple tries with larger repository of ~270K records (dlps)
  Took around 12 hours
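
As a rough check on the two timings above, the implied throughput can be worked out directly. This is only a back-of-the-envelope sketch: the record counts and durations are the approximate figures from the slide, and the function name is made up for illustration.

```python
def records_per_minute(records: int, minutes: float) -> float:
    """Average number of records processed per minute."""
    return records / minutes

# Approximate figures from the slide ("~5500", "~270K", "around ...").
small = records_per_minute(5_500, 25)          # amnh: 25 minutes
large = records_per_minute(270_000, 12 * 60)   # dlps: 12 hours

print(f"amnh: {small:.0f} records/min, dlps: {large:.0f} records/min")
```

On these numbers the larger run was not slower per record; the 12 hours comes from the volume of data rather than from degraded throughput.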

Results: cluster names

Examples of set names from clustering UM’s metadata
  Good: “europe”, “mechanical”, “architecture”
  Not so good: “general”, “michigan”, “build”
  Favorite: “southern literari literature fine messenger”
Granted…
  Only asked for 20 clusters
  Didn’t cluster hierarchically

Caveats

Metadata will always be difficult to cluster
Using a tool developed as a Web service, with obvious benefits
Expect necessity of mapping set names to real topical cluster names
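
One way the mapping from generated set names to real topical names might look is a small hand-curated lookup table. The sketch below is an assumption, not part of MetaCombine or the UM portals: the left-hand keys are set names quoted in these slides, while the right-hand topic names and both function names are hypothetical.

```python
from typing import Iterable, List, Optional

# Hypothetical curated mapping; the topic names on the right are illustrative.
TOPIC_NAMES = {
    "europe": "Europe",
    "mechanical": "Mechanical Engineering",
    "architecture": "Architecture",
    "southern literari literature fine messenger": "Southern Literary Messenger",
}

def topical_name(set_name: str) -> Optional[str]:
    """Return the curated topic for a generated set name, or None if unmapped."""
    return TOPIC_NAMES.get(set_name.strip().lower())

def needs_review(set_names: Iterable[str]) -> List[str]:
    """List the generated set names that still lack a curated topic name."""
    return [name for name in set_names if topical_name(name) is None]

print(needs_review(["europe", "general", "michigan", "build"]))
```

Names such as “general” or “build” fall through to the review list, which is where a person would decide whether to rename, merge, or discard the cluster.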

What we need

Running the tool locally, with a local WSDL instance, would save lots (and lots) of time
Better set names…does this mean a better algorithm?
Ability to cluster by any criteria, not just topic, i.e., a post-processing module
Disjunctive clustering, meaning (so as not to hog storage) filename (not file) clustering
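
The last two wishes can be sketched together. Assuming the clustering output is available as (filename, label) pairs and each record carries a few metadata fields, clusters can hold filenames only, so a record listed under several topics is never copied, and a post-processing step can regroup the same filenames by any other field. All names here (build_clusters, regroup, the "format" field, the sample filenames) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

def build_clusters(assignments: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Disjunctive clustering by filename: one file may sit under many labels."""
    clusters: Dict[str, Set[str]] = defaultdict(set)
    for filename, label in assignments:
        clusters[label].add(filename)   # store the name, not the file contents
    return dict(clusters)

def regroup(records: Iterable[dict], key: str) -> Dict[str, Set[str]]:
    """Post-processing: group filenames by any metadata field, not just topic."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        groups[record.get(key, "unknown")].add(record["filename"])
    return dict(groups)

topics = build_clusters([
    ("a.xml", "europe"),
    ("a.xml", "architecture"),   # same file under a second topic
    ("b.xml", "europe"),
])
by_format = regroup(
    [{"filename": "a.xml", "format": "image"}, {"filename": "b.xml"}],
    "format",
)
```

Storing only filenames keeps the cost of placing a record under an extra topic to one more string in a set.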
