How (Not) to Use a Semi-automated Clustering Tool
Kat Hagedorn
University of Michigan
April 11, 2006

Update on UM’s efforts
- Built three research portals
  - DLF <http://www.hti.umich.edu/cgi/b/bib-idx?c=imls>
  - MODS <http://www.hti.umich.edu/m/mods>
  - Aquifer <http://www.hti.umich.edu/a/aquifer>
- Improvements for search / display
  - Integration of MODS format records
  - Simple vs. advanced searching
  - Inclusion of thumbnails

The need to cluster
- Want to offer more than search within a generic, large corpus of data
- How to partition the data?
- Emory’s MetaCombine tool promising as a topical clustering agent
- (Also interested in clustering by format, access restriction, OAI software used, etc.)

Clustering vs. classification
- Clustering is main focus
  - Huge amount of data
  - Needed a tool to “find the topic”
  - Preferably a disjunctive tool (placing files under more than one topic)
- Classification is secondary focus
  - Have potential classification (UM’s browse)
  - Marrying to current system nigh on impossible

Results: duration
- First tried with small repository of ~5500 records (amnh)
  - Took around 25 minutes
- Multiple tries with larger repository of ~270K records (dlps)
  - Took around 12 hours
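
A quick check of the two timings, as a minimal sketch. The record counts and durations are the approximate figures from this slide ("~" and "around"), so the rates are rough too. On these figures the larger run processed more records per minute, which suggests the 12 hours reflects sheer volume rather than a per-record slowdown.

```python
# Approximate figures from the slide; the variable names are illustrative only.
runs = {
    "amnh": {"records": 5_500, "minutes": 25},
    "dlps": {"records": 270_000, "minutes": 12 * 60},
}


def records_per_minute(records: int, minutes: float) -> float:
    """Average throughput of one clustering run."""
    return records / minutes


rates = {name: records_per_minute(**run) for name, run in runs.items()}

for name, rate in rates.items():
    print(f"{name}: about {rate:.0f} records per minute")
# amnh: about 220 records per minute
# dlps: about 375 records per minute
```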

Results: cluster names
- Examples of set names from clustering UM’s metadata
  - Good: “europe”, “mechanical”, “architecture”
  - Not so good: “general”, “michigan”, “build”
  - Favorite: “southern literari literature fine messenger”
- Granted…
  - Only asked for 20 clusters
  - Didn’t cluster hierarchically

Caveats
- Metadata will always be difficult to cluster
- Using a tool developed as a Web service, with obvious benefits
- Expect necessity of mapping set names to real topical cluster names
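
The mapping step in the last bullet could be as simple as a hand-curated lookup table. This is a sketch only: the raw set names are the examples quoted in this talk, while the display labels, the `DISPLAY_NAMES` table, and the `display_name` helper are hypothetical and not part of MetaCombine or UM's portals.

```python
from typing import Optional

# Raw set names come from the talk's examples; the display labels on the
# right are invented for illustration (the last one is a guess at the source).
DISPLAY_NAMES = {
    "europe": "Europe",
    "mechanical": "Mechanical Engineering",
    "architecture": "Architecture",
    "southern literari literature fine messenger": "Southern Literary Messenger",
}


def display_name(set_name: str) -> Optional[str]:
    """Return the curated label, or None if the set name still needs review."""
    return DISPLAY_NAMES.get(set_name.strip().lower())


raw_names = ["europe", "general", "build",
             "southern literari literature fine messenger"]

# Names with no curated label are queued for a person to look at.
needs_review = [name for name in raw_names if display_name(name) is None]
print(needs_review)
```

Vague names such as “general” are left unmapped on purpose, so they surface for manual review instead of getting a label that misleads users.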

What we need
- Running the tool locally, with a local WSDL instance, would save lots (and lots) of time
- Better set names…does this mean a better algorithm?
- Ability to cluster by any criteria, not just topic, i.e., a post-processing module
- Disjunctive clustering, meaning (so as not to hog storage) filename (not file) clustering
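
The last two wishes can be pictured together, under stated assumptions. In the sketch below each cluster stores only filenames, so one record can sit under several topics without being copied, and the same routine groups by any field. The records, the field names (`topics`, `format`), and the `cluster_filenames` helper are all hypothetical and do not describe the MetaCombine tool.

```python
from collections import defaultdict

# Hypothetical harvested records; filenames and field values are invented.
records = [
    {"filename": "rec001.xml", "topics": ["europe", "architecture"], "format": "image"},
    {"filename": "rec002.xml", "topics": ["mechanical"], "format": "text"},
    {"filename": "rec003.xml", "topics": ["architecture"], "format": "text"},
]


def cluster_filenames(records, key):
    """Map each value of `key` to the filenames that carry it.

    A field may hold one value or a list of values. A list puts the same
    filename under several clusters (disjunctive clustering) while the
    file itself is stored only once.
    """
    clusters = defaultdict(list)
    for record in records:
        values = record[key]
        if not isinstance(values, list):
            values = [values]
        for value in values:
            clusters[value].append(record["filename"])
    return dict(clusters)


by_topic = cluster_filenames(records, "topics")   # rec001.xml appears twice
by_format = cluster_filenames(records, "format")  # same routine, other criterion
print(by_topic)
print(by_format)
```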