49
LITA National Forum 2015 Data Designed for Discovery Roy Tennant OCLC Research

Data Designed for Discovery

  • Upload
    oclc

  • View
    4.105

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Data Designed for Discovery

LITA National Forum 2015

Data Designed for Discovery

Roy TennantOCLC Research

Page 2: Data Designed for Discovery

Place Speaker Photo Here

Cataloging Unchained!

Page 3: Data Designed for Discovery

THE PAST AND PRESENT SITUATION

Page 4: Data Designed for Discovery

The Canonical Entity (of the past)

Page 5: Data Designed for Discovery

The Canonical Entity (of the present)

Page 6: Data Designed for Discovery

• A collection of statements…• Taken from the piece itself…• Sometimes “enhanced” with inferred

parentheticals (e.g., [1975] )…• Or additional statements not on the piece (e.g.,

subject headings)• Where punctuation, which may or may not be

present, is used (inconsistently) for structure• Mostly uncontrolled text strings that are only

loosely connected to anything else

The Classic Bib Record

MARC is machine readable,

NOT machine understandable!

Page 7: Data Designed for Discovery

THE PROBLEM

Page 8: Data Designed for Discovery

• Identification Problems:– “The Hamlet Problem” (titles aren’t enough)– “The Wang/Li/Zhang Problem” (names aren’t enough)

• Linkage Problems:– “The Web Problem” (text strings aren’t enough, you

need links)– “The Language Problem” (the right translation for a

given user in a way they can understand)• Quality Problems:

– “The Legacy Problem” (strings are not controlled terms; often, they cannot be turned into them)

Actually, A Number of Problems

Page 9: Data Designed for Discovery
Page 10: Data Designed for Discovery
Page 11: Data Designed for Discovery

First, define ALL

THE THINGS

Page 12: Data Designed for Discovery

entity/ˈɛntɪti/noun

a thing with distinct and independent existence.

relationship/rɪˈleɪʃ(ə)nʃɪp/noun

the way in which two or more people or things are connected

Page 13: Data Designed for Discovery

RecordTitle: "War and Peace"Author: "Leo Tolstoy 1828-1910"ISBN: 0307266931

Type: WorkName: "War and Peace"Author: http://worldcat.org/entity/person/id/1234

Entity (http://worldcat.org/entity/work/id/115206288)

Type: PersonName: "Leo Tolstoy "Born: 1828Died: 1910Birthplace: http://worldcat.org/entity/place/id/8976

Entity (http://worldcat.org/entity/person/id/1234)

Type: PlaceName: "Yasnaya Polyana"SameAs: http://geonames.org/468686

Entity (http://worldcat.org/entity/place/id/8976)

⤵ ⟶

Page 14: Data Designed for Discovery

person place

object concept

organization work

Entities of Initial Focus

work

person

Page 15: Data Designed for Discovery

person place

object concept

organization work

subjectitemavailability

author

Relationships between Entities are established

Page 16: Data Designed for Discovery

Using authoritative sources whenever possible

Page 17: Data Designed for Discovery

And linkingto other authoritativedata sources

Page 18: Data Designed for Discovery

“Shredding”

Page 19: Data Designed for Discovery

From Records to Entities: Works

Page 20: Data Designed for Discovery
Page 21: Data Designed for Discovery
Page 22: Data Designed for Discovery
Page 23: Data Designed for Discovery
Page 24: Data Designed for Discovery

LCSH = Mixed entities

Headings before conversion

600 10 Bennet, Elizabeth (Fictitious character) $v Fiction.600 10 Darcy, Fitzwilliam (Fictitious character) $v Fiction.650 0 Gentry $z England $v Fiction.650 0 Social classes $z England $v Fiction.650 0 Young women $v Fiction.650 0 Mate selection $v Fiction.650 0 Courtship $v Fiction.650 0 Sisters $v Fiction.651 0 England $x Social life and customs $y 19th century $v Fiction.

Person/Character

Genre

Place

Topic

Time

Page 25: Data Designed for Discovery

FAST = Each entity in separate field

Entity/Facet Time Period……. 1800 - 1899 Person…………….. Bennet, Elizabeth (Fictitious character) Darcy, Fitzwilliam (Fictitious character) Topic………………. Courtship Gentry Manners and customs Mate selection Sisters Social classes Young women Place……………… England Form/Genre………. Fiction

Page 26: Data Designed for Discovery

Creating linked data assertions from record-oriented data

MARC Record

Enhanced WorldCat

MARC Record

MARC Records

• FRBR Clustering

• String matching with controlled vocabularies

• Addition of standard identifiers

Page 27: Data Designed for Discovery

OCLC Production Services

External OCLC Research Systems

Internal OCLC Research Resources

enhancedWorldCat

WORKS

Kindred Works

Classify

Identities

FictionFinder

Cookbook Finder

LCSH

FAST

VIAF

GMGPC

GSAFD

GTT

DDCLCTGM MeSH

Linked Data Entities

Page 28: Data Designed for Discovery

Creating linked data assertions from record-oriented data

MARC Record

Enhanced WorldCat

MARC Record

Persons

Organizations

Places

Concepts

Events

Works

MARC Records RDF Entities

• FRBR Clustering

• String matching with controlled vocabularies

• Addition of standard identifiers

Page 29: Data Designed for Discovery

Creating linked data assertions from record-oriented data

MARC Record

Enhanced WorldCat

MARC Record

Persons

Organizations

Places

Concepts

Events

Works

MARC Records RDF Entities Triples

• FRBR Clustering

• String matching with controlled vocabularies

• Addition of standard identifiers

Subject Predicate Object

Subject Predicate Object

Subject Predicate Object

Subject Predicate Object

Subject Predicate Object

Subject Predicate Object

Subject Predicate Object

Subject Predicate Object

Subject Predicate Object

Subject Predicate Object

Page 30: Data Designed for Discovery

A series of recent Google Research papers describe the use of probabilistic models and machine learning to assess the truth of statements made by multiple sources.• Li, X., Dong, X. L., Lyons, K., Meng, W., Srivastava, D. (2013). Truth

Finding on the Deep Web: Is the Problem Solved? • Dong, X. L., Gabrilovich, E., Heitz, G., Horn, W., Murphy, K., Sun, S., Zhang, W. (2013). From

Data Fusion to Knowledge Fusion.• Dong, X. L., Murphy, K., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., ... & Zhang, W. (2014). Knowledge

Vault: A Web-scale approach to probabilistic knowledge fusion• Dong, X. L., Gabrilovich, E., Murphy, K. Dang, V., Horn, W., … & Zhang, W. (2015). Knowledge-Based

Trust: Estimating the Trustworthiness of Web Sources

Estimating “Truthiness” of assertions

Page 31: Data Designed for Discovery

Data Sources

Extraction

Knowledge Triples

Scored Triples

Fusion KnowledgeVault

EnhancedWorldCat

VIAF

FAST

Knowledge Vault data flow

Extractor

Extractor

Extractor

Fusers

Collective

Etc.

Page 32: Data Designed for Discovery

OK…BUT SO WHAT?

Page 33: Data Designed for Discovery

Improving Discovery

Mockup

Page 34: Data Designed for Discovery
Page 36: Data Designed for Discovery
Page 37: Data Designed for Discovery
Page 38: Data Designed for Discovery
Page 39: Data Designed for Discovery

Solving the Hamlet Problem!

Page 40: Data Designed for Discovery
Page 41: Data Designed for Discovery

Embedding Authority Control

Solving the Wang/Li/Zhang Problem!

Page 42: Data Designed for Discovery

Title: Journey to the WestLanguage: EnglishTranslator: Anthony C. YuDate: 1977IsTranslationOf:

Title: Journey to the WestLanguage: EnglishTranslator: W. J. F. JennerDate: 1982-1984IsTranslationOf:

Title: 西遊記Language: ChineseAuthor: 吳承恩Created: 1592HasTranslation:

Title: Tay du ky binh khaoLanguage: VietnameseTranslator: Phan QuanDate: 1980IsTranslationOf:

Title: 西遊記Language: JapaneseTranslator: 中野美代子Date: 1986IsTranslationOf:

Title: PilgerfahrtLanguage: GermanTranslator: Georgette Boner Date: 1983IsTranslationOf:

Page 43: Data Designed for Discovery

Mockup

Page 44: Data Designed for Discovery

Mockup

Solving the Language Problem!

Page 45: Data Designed for Discovery

Exposing the Quality Problem

Page 46: Data Designed for Discovery

• Requires a new kind of thinking: “sets of assertions about something” NOT a “record”

• Requires much more than simply translating a record from MARC to a new format

• There are things we can do now to make MARC “linked data ready”

• Quality is a pursuit, not an achievable goal

It’s Not the Record, It’s the Linkage

Page 47: Data Designed for Discovery

Why OCLC wants to manage entities

Connectivity– Entity recognition helps to connect library data

to the networked environment/web• e.g., Schema.org, BIBFRAME, etc.

Efficiency & Quality– Facilitates efficient creation of quality metadata

Creativity– Opens the door to creative reuse of library data

Page 48: Data Designed for Discovery

• Person Entity Lookup service• Using EntityJS as one way for humans to

interact with, and evaluate, and potentially correct linked data

• Others as indentified…

Piloting New Services

Photo by drivethrucafehttps://www.flickr.com/photos/128758398@N07/CC BY-SA 2.0

Page 49: Data Designed for Discovery

SMTogether we make breakthroughs possible.

Thank you!Roy [email protected]/roytennant/

LITA National Forum 2015

©2015 OCLC This work is licensed under a Creative Commons Attribution 4.0 International License. Suggested attribution: This work uses content from “Data Designed for Discovery” © OCLC, used under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0/.