A tutorial on using Open Refine based on a sample project of standardizing the names of cities of publication.
OMG! MY METADATA IS AS FRESH AS THE BACKSTREET BOYS:
HOW GOOGLE REFINE CAN UPDATE, CLEAN UP AND LINK YOUR METADATA TO THE WIDER WORLD
SARAH BETH WEEKS
LIBRARY TECHNOLOGY CONFERENCE 2013
[email protected]
@RASCALWHALE
SAMPLE PROJECT: NORDIC AMERICAN IMPRINTS
Situation: Wanted to match publishers of our books against a list of important Nordic American publishers (compiled by Penny Huffman) to find materials for our special collections.
Problem: Hard to compare when publication info is not controlled:
ANSWER: GOOGLE REFINE!
Google Refine can “match and merge” messy data filled with:
random, leading or trailing spaces
stray punctuation
typos
odd capitalization
and more!
1. CREATE YOUR PROJECT USING ANY SPREADSHEET
2. USE “COMMON TRANSFORMS” TO FIX “WHITESPACE” PROBLEMS IN A SINGLE CLICK
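For readers who want to see what the whitespace fixes amount to outside of Refine, here is a minimal Python sketch. It assumes the two relevant clean-ups are trimming leading/trailing whitespace and collapsing runs of internal whitespace; the function name `trim_and_collapse` and the sample city values are made up for illustration.

```python
import re

def trim_and_collapse(value: str) -> str:
    """Strip leading/trailing whitespace, then collapse internal runs to one space."""
    return re.sub(r"\s+", " ", value.strip())

# Hypothetical sample values, not taken from the project data.
cities = ["  Minneapolis", "Minneapolis  ", "St.   Paul"]
cleaned = [trim_and_collapse(c) for c in cities]
print(cleaned)
```

After this step, values that differed only in spacing become identical strings, so they count as one unique value.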
3. CLEAN UP STRAY CHARACTERS ([].?:) USING “TRANSFORM” AND REGULAR EXPRESSIONS
(OR JUST USE EXCEL FIND AND REPLACE FOR THIS)
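As a rough illustration of the regular-expression idea in step 3, the Python sketch below removes the five stray characters named above (`[`, `]`, `.`, `?`, `:`) with a single character class. This is an assumption about what the clean-up looks like, not the exact expression used in the project; `strip_stray` is a hypothetical helper name.

```python
import re

# Character class covering the stray characters listed in step 3: [ ] . ? :
STRAY = re.compile(r"[\[\].?:]")

def strip_stray(value: str) -> str:
    """Delete stray punctuation, then trim any space left behind."""
    return STRAY.sub("", value).strip()

print(strip_stray("[Minneapolis?]"))
print(strip_stray("Chicago :"))
```

Note that removing every period also changes abbreviations (for example "St. Paul" becomes "St Paul"), so it is worth checking the results before moving on.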
4. REPEAT COMMON TRANSFORMS
5. CLUSTER AND EDIT
(THIS IS WHERE THE MAGIC HAPPENS)
FUNCTION 1: FINGERPRINT (MOST RELIABLE)
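To show why fingerprinting is so dependable, here is a small Python sketch of the general idea: normalize each value to a key (trim, lowercase, fold accents to ASCII, drop punctuation, sort and de-duplicate the words) and group values that share a key. It is an approximation written for illustration, not Refine's own code; `fingerprint` and `cluster` are hypothetical names.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Build an order- and punctuation-insensitive key for a value."""
    text = value.strip().lower()
    # Fold accented letters to their ASCII base letters where possible.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))

def cluster(values):
    """Group values by key; keep only groups with more than one distinct spelling."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(set(g)) > 1]

print(cluster(["Minneapolis, Minn.", "Minn. Minneapolis", "Chicago"]))
```

Because two values only cluster when their keys are exactly equal, false matches are rare, which fits the "most reliable" label above.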
NGRAM METHOD
(STILL RELIABLE: MORE MATCHES BUT LESS RELIABILITY AS YOU DECREASE NGRAM SIZE)
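The trade-off between n-gram size and reliability can be seen in a short Python sketch. It assumes the key is built by lowercasing, removing punctuation and whitespace, and joining the sorted, de-duplicated n-grams; `ngram_fingerprint` is a hypothetical name and the details may differ from Refine's implementation.

```python
import string

def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Key made of the sorted, de-duplicated character n-grams of a value."""
    drop = string.punctuation + string.whitespace
    text = value.lower().translate(str.maketrans("", "", drop))
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))

# With n=1 a dropped letter is forgiven; with n=2 it is not.
print(ngram_fingerprint("Minneapolis", 1) == ngram_fingerprint("Mineapolis", 1))
print(ngram_fingerprint("Minneapolis", 2) == ngram_fingerprint("Mineapolis", 2))
```

The smaller size catches the typo, but it also treats any two values built from the same set of letters as identical, which is where the extra false matches come from.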
PHONETIC MATCHING
(ESPECIALLY USEFUL WHEN DEALING WITH TRANSLATED TEXT)
(MORE FALSE MATCHES TO WATCH FOR WITH PHONETIC FUNCTIONS)
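To give a feel for phonetic keys and their false matches, the sketch below implements classic American Soundex in Python. Soundex is used here only as a well-known stand-in; it is an assumption that it behaves similarly to whichever phonetic functions Refine offers, and the surnames are made-up examples.

```python
# Soundex digit for each consonant; vowels, h, w and y have no code.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}

def soundex(name: str) -> str:
    """Return the 4-character American Soundex code (ASCII letters only)."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        if c not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (letters[0].upper() + "".join(digits) + "000")[:4]

for name in ["Jensen", "Jenson", "Johnson"]:
    print(name, soundex(name))
```

"Jensen" and "Jenson" share a key, which is the desired match, but so does "Johnson", which is the kind of false match to watch for.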
NEAREST NEIGHBOR (PPM) MATCHING
(SLOWER AND MORE FALSE MATCHES BUT CATCHES WHAT OTHER METHODS MISS)
(SET RADIUS HIGHER, BLOCK CHARACTERS LOWER TO GENERATE MORE MATCHES)
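The roles of "radius" and "block characters" can be sketched in Python. This example swaps in Levenshtein edit distance instead of PPM, because it is much shorter to write out; the blocking step (only compare values that share a substring of `block_chars` characters) and the radius cut-off are assumptions about how the settings behave, and `neighbor_pairs` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def neighbor_pairs(values, radius=1, block_chars=6):
    """Pairs of values within `radius` edits that share a block of characters."""
    blocks = defaultdict(set)
    for v in set(values):
        key = v.lower()
        for i in range(max(len(key) - block_chars + 1, 1)):
            blocks[key[i:i + block_chars]].add(v)
    pairs = set()
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            if levenshtein(a.lower(), b.lower()) <= radius:
                pairs.add((a, b))
    return sorted(pairs)

cities = ["Minneapolis", "Minneapolls", "Mineapolis", "Chicago"]
print(neighbor_pairs(cities, radius=1))
print(neighbor_pairs(cities, radius=2))
```

Raising the radius accepts more distant spellings, and lowering the block size lets more values be compared with each other, so both changes produce more candidate matches at the cost of more comparisons and more false hits.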
AFTER USING OTHER METHODS, RUN THROUGH FINGERPRINT AND NGRAM AGAIN
BE AWARE THAT THINGS THAT WEREN’T CLUSTERED WON’T HAVE BEEN FIXED
6. USE THE TEXT FACET TO SEE ALL UNIQUE VALUES
YOU CAN SCROLL THROUGH THE LIST TO SPOT CHECK FOR PROBLEMS
CLICK EDIT TO TYPE NEW TEXT FOR ALL CELLS WITH THIS VALUE
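Conceptually, a text facet is a count of each distinct value, and the facet's edit action is a bulk replace of one value. The Python sketch below illustrates both under that assumption; `text_facet`, `edit_all` and the sample values are hypothetical.

```python
from collections import Counter

def text_facet(values):
    """List each unique value with its count, sorted by name."""
    return sorted(Counter(values).items())

def edit_all(values, old, new):
    """Replace every cell equal to `old` with `new`."""
    return [new if v == old else v for v in values]

cities = ["Minneapolis", "Mpls", "Chicago", "Minneapolis", "Mpls"]
facet = text_facet(cities)
fixed = edit_all(cities, "Mpls", "Minneapolis")
print(facet)
print(text_facet(fixed))
```

Scanning the sorted list is how leftovers that no clustering method caught, such as abbreviations, tend to show up.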
OTHER CLEAN-UP WE DID: PUBLISHERS
OTHER CLEAN-UP WE DID: GIFT NOTES
ALSO WORKS FOR NUMBERS/DATES
END RESULT?
Using Google Refine we were able to reduce the 3230 unique values for city (260|a) to just 1153. For publishers (260|b) we went from 11342 unique names for publishers to approximately 6500.
This project helped to identify over 2,000 potential candidates for our Nordic American Imprints collection. (These are still being evaluated.)
The controlled publishers, cities of publication and dates will be added to a local 9xx field for faceting in our future special collections discovery tool. Users will be able to browse our Nordic American Imprints collection by publisher, city or state.
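To put the unique-value counts reported above in proportion, this small Python calculation works out the percentage reduction for each field. The publisher figure uses the approximate end count of 6500, so its result is approximate as well; `percent_reduction` is a hypothetical helper.

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage by which the number of unique values dropped."""
    return round(100 * (before - after) / before, 1)

city = percent_reduction(3230, 1153)        # city, 260|a
publisher = percent_reduction(11342, 6500)  # publisher, 260|b (approximate)
print(city, publisher)
```

That is roughly a 64% reduction for cities and roughly a 43% reduction for publishers.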
BUT WAIT! THERE’S MORE!!
LINKED DATA!!!
FREEBASE IS THE DEFAULT SERVICE (WIKIPEDIA-ESQUE DATA OWNED BY GOOGLE)
CHOOSE THE RIGHT “TYPE” AND MOST CELLS WILL BE AUTO-MATCHED
FOR THE REST CLICK THE OPTIONS TO SEE WHAT EACH REPRESENTS
Then click “Match All Identical Cells” (or double checkmarks) to link all cells with this text to this Freebase topic
OR “SEARCH FOR MATCH” TO BRING UP AN AUTO-FILL LIST TO CHOOSE FROM
EVEN COOLER: NOW YOU CAN BRING DATA IN FROM FREEBASE!
CHOOSE WHAT INFO YOU WANT TO ADD
THIS NEW DATA IS NOW ADDED TO YOUR SPREADSHEET
TO SEE WHAT COLUMNS (DATA) YOU CAN ADD FROM FREEBASE:
Browse the properties at: http://schemas.freebaseapps.com/
MATCH LOCAL SUBJECT HEADING TO LC (FREEYOURMETADATA.ORG)
Install the RDF Extension for Google Refine: http://refine.deri.ie/
SPARQL ENDPOINTS
SPARQL Endpoints: http://labs.mondeca.com/sparqlEndpointsStatus/index.html
CKAN Data Hub: http://datahub.io/dataset/
ADD SPARQL-BASED RECONCILIATION SERVICE
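For a sense of what a SPARQL-based lookup involves, the Python sketch below only builds the text of a query that finds entities whose label matches a cell value. It is an illustrative assumption, not the query the RDF extension actually sends; `build_label_query` is a hypothetical helper, and `rdfs:label` is assumed as the label property (some endpoints use others, such as `skos:prefLabel`).

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

def build_label_query(label: str, label_property: str = RDFS_LABEL, limit: int = 5) -> str:
    """Return SPARQL text matching entities whose label equals `label`, ignoring case."""
    # Escape backslashes and double quotes so the value is a valid string literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?entity ?label WHERE {\n"
        f"  ?entity <{label_property}> ?label .\n"
        f'  FILTER (LCASE(STR(?label)) = LCASE("{escaped}"))\n'
        "}\n"
        f"LIMIT {limit}"
    )

query = build_label_query("Minneapolis")
print(query)
```

The printed text could be pasted into an endpoint's query form to check whether that endpoint holds usable labels before setting up a reconciliation service against it.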
Questions?
Link to a public version of this presentation at my (personal) blog: gardenandalibrary.blogspot.com
I’m also happy to take questions by e-mail
THANK YOU!