OMG! My metadata is as fresh as the Backstreet Boys: How Google Refine can update, clean up and link your metadata to the wider world

SARAH BETH WEEKS

LIBRARY TECHNOLOGY CONFERENCE 2013

[email protected] @RASCALWHALE

DESCRIPTION

A tutorial on using Open Refine based on a sample project of standardizing the names of cities of publication.


SAMPLE PROJECT: NORDIC AMERICAN IMPRINTS

Situation: We wanted to match the publishers of our books against a list of important Nordic American publishers (compiled by Penny Huffman) to find materials for our special collections.

Problem: It is hard to compare values when publication info is not controlled.

ANSWER: GOOGLE REFINE!

Google Refine can “match and merge” messy data filled with:
random, leading or trailing spaces
stray punctuation
typos
odd capitalization
and more!

1. CREATE YOUR PROJECT USING ANY SPREADSHEET

2. USE “COMMON TRANSFORMS” TO FIX “WHITESPACE” PROBLEMS IN A SINGLE CLICK
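For readers who want to see what the whitespace fixes amount to, here is a minimal Python sketch of the two operations (trimming leading and trailing whitespace, then collapsing runs of whitespace). The function names and sample city values are made up for illustration; this is not Refine's own code.

```python
import re


def trim_whitespace(value: str) -> str:
    """Remove leading and trailing whitespace."""
    return value.strip()


def collapse_whitespace(value: str) -> str:
    """Replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", value)


# Made-up sample values, not taken from the project data.
cities = ["  Minneapolis", "Decorah ", "St.   Paul"]
cleaned = [collapse_whitespace(trim_whitespace(c)) for c in cities]
print(cleaned)  # ['Minneapolis', 'Decorah', 'St. Paul']
```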

3. CLEAN UP STRAY CHARACTERS ([].?:) USING “TRANSFORM” AND REGULAR EXPRESSIONS

(OR JUST USE EXCEL FIND AND REPLACE FOR THIS)
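The same clean-up can be sketched outside Refine with a regular expression. The snippet below is only an illustration in Python: the character class covers the five characters named on the slide, and the sample values are invented.

```python
import re

# Character class matching the stray characters from the slide: [ ] . ? :
STRAY_CHARACTERS = re.compile(r"[\[\].?:]")


def strip_stray_characters(value: str) -> str:
    """Delete stray characters, then trim any whitespace left at the ends."""
    return STRAY_CHARACTERS.sub("", value).strip()


# Made-up sample values, not taken from the project data.
samples = ["[Minneapolis?]", "Chicago :", "Decorah, Iowa."]
print([strip_stray_characters(s) for s in samples])
# ['Minneapolis', 'Chicago', 'Decorah, Iowa']
```

Note that removing every period also changes abbreviations ("St. Paul" becomes "St Paul"), so it is worth checking the results before moving on.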

4. REPEAT COMMON TRANSFORMS

5. CLUSTER AND EDIT

(THIS IS WHERE THE MAGIC HAPPENS)

FUNCTION 1: FINGERPRINT (MOST RELIABLE)
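A rough idea of why fingerprinting is so reliable: each value is reduced to a normalized key (lowercased, punctuation removed, words sorted and de-duplicated), and values that share a key are clustered. The Python below is a simplified sketch of that idea, not Refine's implementation; the helper names and sample values are assumptions for illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key: lowercase, ASCII only, no punctuation, sorted unique words."""
    text = value.strip().lower()
    # Split accented letters into base letter + accent, then drop non-ASCII bytes.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values, key):
    """Group values by key and keep only groups with more than one member."""
    groups = defaultdict(list)
    for value in values:
        groups[key(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


# Made-up sample values, not taken from the project data.
cities = ["Minneapolis, Minn.", "minneapolis minn", "Minn. Minneapolis", "Decorah, Iowa"]
print(cluster(cities, fingerprint))
```

Only values with exactly the same key end up together, which is why this method produces few false matches.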

NGRAM METHOD (STILL RELIABLE: MORE MATCHES BUT LESS RELIABILITY AS YOU DECREASE NGRAM SIZE)
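To see why a smaller n-gram size finds more matches, here is a simplified Python sketch of an n-gram key: the value is stripped down to letters and digits, cut into overlapping chunks of n characters, and the sorted unique chunks form the key. The function name and the typo example are assumptions for illustration.

```python
import re


def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Key made of the sorted, unique n-character chunks of the cleaned value."""
    text = re.sub(r"[\W_]+", "", value.lower())  # drop spaces and punctuation
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))


# A made-up typo with two letters swapped at the end.
good, typo = "Minneapolis", "Minneapolsi"

# With single characters the two values share a key, so they cluster together.
print(ngram_fingerprint(good, 1) == ngram_fingerprint(typo, 1))  # True

# With pairs of characters the keys differ, so the typo is not caught.
print(ngram_fingerprint(good, 2) == ngram_fingerprint(typo, 2))  # False
```

The flip side is that with single characters any two values built from the same set of letters will also share a key, which is where the lost reliability comes from.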

PHONETIC MATCHING (ESPECIALLY USEFUL WHEN DEALING WITH TRANSLATED TEXT)

(MORE FALSE MATCHES TO WATCH FOR WITH PHONETIC FUNCTIONS)
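Phonetic methods key each value by how it sounds rather than how it is spelled. The sketch below uses classic Soundex as a stand-in because it fits in a few lines; it is an assumption for illustration and not necessarily the phonetic algorithm you will pick in Refine. The sample names are invented.

```python
# Soundex digit for each consonant; vowels, h, w and y have no digit.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Classic four-character Soundex code (first letter plus three digits)."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same digit
        code = CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated digit counts again
    return (result + "000")[:4]


# Two spellings of one city end up with the same code.
print(soundex("Stockholm"), soundex("Stokholm"))  # S324 S324

# But two different cities can collide as well: a false match to watch for.
print(soundex("Moorhead"), soundex("Marietta"))  # M630 M630
```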

NEAREST NEIGHBOR (PPM) MATCHING

(SLOWER AND MORE FALSE MATCHES BUT CATCHES WHAT OTHER METHODS MISS)

(SET RADIUS HIGHER, BLOCK CHARACTERS LOWER TO GENERATE MORE MATCHES)
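The two settings can be pictured like this: "block characters" decides which pairs of values get compared at all (they must share a chunk of that many characters), and "radius" is how far apart two values may be and still count as a match. The Python sketch below swaps in Levenshtein edit distance for PPM because it is short enough to show; the function names, defaults and sample values are assumptions for illustration.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def blocks(value: str, size: int) -> set:
    """All chunks of `size` consecutive characters in the lowercased value."""
    text = value.lower()
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def near_pairs(values, radius=1, block_chars=6):
    """Pairs that share at least one block and are within `radius` edits of each other."""
    pairs = []
    for a, b in combinations(values, 2):
        if blocks(a, block_chars) & blocks(b, block_chars):
            if levenshtein(a.lower(), b.lower()) <= radius:
                pairs.append((a, b))
    return pairs


# Made-up sample values, not taken from the project data.
cities = ["Minneapolis", "Minneapolsi", "Mineapolis", "Decorah"]
print(near_pairs(cities, radius=1))  # one pair
print(near_pairs(cities, radius=2))  # a higher radius finds an extra pair
```

Every pair inside a block has to be compared, which is why this method is slower than the key-based ones.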

AFTER USING OTHER METHODS, RUN THROUGH FINGERPRINT AND NGRAM AGAIN

BE AWARE THAT THINGS THAT WEREN’T CLUSTERED WON’T HAVE BEEN FIXED

6. USE THE TEXT FACET TO SEE ALL UNIQUE VALUES

YOU CAN SCROLL THROUGH THE LIST TO SPOT CHECK FOR PROBLEMS

CLICK EDIT TO TYPE NEW TEXT FOR ALL CELLS WITH THIS VALUE
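Conceptually, a text facet is a count of every distinct value in the column, and the edit link rewrites all cells holding one of those values. A small Python sketch of both, with invented values and a hypothetical `edit_value` helper:

```python
from collections import Counter

# Made-up column of city values, not taken from the project data.
cities = ["Minneapolis", "Decorah", "Minneapolis", "Chicago", "Mpls", "Decorah", "Minneapolis"]

# The facet: each unique value with the number of cells that hold it.
facet = Counter(cities)
for value, count in sorted(facet.items()):
    print(f"{value}: {count}")


def edit_value(rows, old, new):
    """Replace every cell equal to `old` with `new`, leaving other cells alone."""
    return [new if cell == old else cell for cell in rows]


# Fixing one leftover value by hand updates every matching cell at once.
edited = edit_value(cities, "Mpls", "Minneapolis")
print(Counter(edited)["Minneapolis"])  # 4
```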

OTHER CLEAN-UP WE DID: PUBLISHERS

OTHER CLEAN-UP WE DID: GIFT NOTES

ALSO WORKS FOR NUMBERS/DATES

END RESULT?

Using Google Refine we were able to reduce the 3,230 unique values for city (260|a) to just 1,153. For publishers (260|b) we went from 11,342 unique names to approximately 6,500.

This project helped to identify over 2,000 potential candidates for our Nordic American Imprints collection. (These are still being evaluated.)

The controlled publishers, cities of publication and dates will be added to a local 9xx field for faceting in our future special collections discovery tool. Users will be able to browse our Nordic American Imprints collection by publisher, city or state.
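To put the counts of unique values reported above in proportion, the reduction can be worked out as a percentage. The helper name below is made up, and the publisher figure is only as precise as the "approximately 6,500" it starts from.

```python
def percent_reduction(before: int, after: int) -> float:
    """Share of unique values eliminated, as a percentage rounded to one decimal."""
    return round(100 * (before - after) / before, 1)


print(percent_reduction(3230, 1153))   # cities (260|a): 64.3
print(percent_reduction(11342, 6500))  # publishers (260|b): about 42.7
```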

BUT WAIT! THERE’S MORE!! LINKED DATA!!!

FREEBASE IS THE DEFAULT SERVICE (WIKIPEDIA-ESQUE DATA OWNED BY GOOGLE)

CHOOSE THE RIGHT “TYPE” AND MOST CELLS WILL BE AUTO-MATCHED

FOR THE REST CLICK THE OPTIONS TO SEE WHAT EACH REPRESENTS

Then click “Match All Identical Cells” (or double checkmarks) to link all cells with this text to this Freebase topic

OR “SEARCH FOR MATCH” TO BRING UP AN AUTO-FILL LIST TO CHOOSE FROM

EVEN COOLER: NOW YOU CAN BRING DATA IN FROM FREEBASE!

CHOOSE WHAT INFO YOU WANT TO ADD

THIS NEW DATA IS NOW ADDED TO YOUR SPREADSHEET

TO SEE WHAT COLUMNS (DATA) YOU CAN ADD FROM FREEBASE:

Browse the properties at: http://schemas.freebaseapps.com/

MATCH LOCAL SUBJECT HEADING TO LC (FREEYOURMETADATA.ORG)

SPARQL ENDPOINTS

Install the RDF Extension for Google Refine: http://refine.deri.ie/

SPARQL Endpoints: http://labs.mondeca.com/sparqlEndpointsStatus/index.html

CKAN Data Hub: http://datahub.io/dataset/

ADD SPARQL-BASED RECONCILIATION SERVICE
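Roughly speaking, reconciling against a SPARQL endpoint means looking up each cell value as a label in the remote dataset. The Python sketch below only builds the text of such a query; it assumes the dataset uses SKOS preferred labels, does an exact match only, and does not contact any endpoint. The function name and the sample heading are made up, and the RDF Extension builds its own queries.

```python
def label_query(label: str, lang: str = "en", limit: int = 5) -> str:
    """Build a SPARQL query that finds concepts whose preferred label equals `label`."""
    # Escape backslashes and double quotes so the label is a valid SPARQL string.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?concept WHERE {\n"
        f'  ?concept skos:prefLabel "{escaped}"@{lang} .\n'
        "}\n"
        f"LIMIT {limit}"
    )


# Made-up local subject heading used only to show the generated query.
print(label_query("Norwegian Americans"))
```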

Questions?

Link to a public version of this presentation at my (personal) blog: gardenandalibrary.blogspot.com

I’m also happy to take questions by e-mail: [email protected]

THANK YOU!