Natural Language Processing for LODLAM Presented at IGeLU 2014 by Corey A Harper 2014-09-16 A brief intro to machine learning & data science for Libraries


Natural Language Processing for LODLAM

Presented at IGeLU 2014

by Corey A Harper

2014-09-16

A brief intro to machine learning & data science

for Libraries


Harper – IGeLU – NLP 4 LODLAM – Sept 16, 2014

Context

Narrative

Story telling

The Library's story,

and the Archives' story,

but also…


Users’ stories

Scholars' stories

Adding context through recombinant metadata


Scholars' & Users' Stories – Tim Sherratt (@wragge)

Also: http://discontents.com.au/a-map-and-some-pins-open-data-and-unlimited-horizons/


Library Authority Data

“Include links to other URIs, so that they can discover more things.”

Short of providing and linking to URIs, this *is* authority data.

This is what our authority files are for.


Linked data is about context

authorities provide context

and yet our controlled vocabs

are nearly gone

because the interfaces to them were broken


The Death of Browse

• Next-Gen Discovery Systems don't make use of Authority Control

• “Browse” was/is broken as a UI Design

• Rich data in Authorities, disconnected from narrative, context, search

• Richer “Authority” type data outside libraries...

• “Next Gen Next Gen” Discovery…


Fuzzy Wuzzy – Seat Geek

Fuzzy Wuzzy – Awesome Library from SeatGeek

https://github.com/seatgeek/fuzzywuzzy

http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python


Slide courtesy of Doug Oard, Univ. of Maryland


Tools - Natural Language Processing

• DBPedia Spotlight: https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki

• Zemanta: http://www.zemanta.com/?wpst=1

• Open Calais: http://www.opencalais.com/

• Open Refine: http://openrefine.org/

• DataTXT: https://dandelion.eu/products/datatxt/

• AlchemyAPI: http://www.alchemyapi.com/

• FuzzyWuzzy: https://github.com/seatgeek/fuzzywuzzy


Where does this lead?

We need new interfaces

new tools

for a new kind of cataloger

for knowledge organization experts


Linked Jazz Back End


Primo PNX and Authorities

• Indexing Cross References

• New Browse Functionality

• Authority Control from Aleph / Alma

• What about non-MARC, or non-Aleph Data?

• Matching Strings to Authorities
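Matching a free-text name string to an authority heading can be sketched roughly as follows. This is a minimal illustration, not the FuzzyWuzzy API: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in scorer. The headings, the `normalize`, `similarity` and `best_match` helpers, and the threshold of 60 are all assumptions made up for the example.

```python
from difflib import SequenceMatcher

# Hypothetical authority headings, invented for illustration only.
AUTHORITY_HEADINGS = [
    "Twain, Mark, 1835-1910",
    "Austen, Jane, 1775-1817",
    "Dickens, Charles, 1812-1870",
]


def normalize(text):
    """Lowercase and keep only letters, digits and single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Return a 0-100 similarity score between two normalized strings."""
    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())


def best_match(name, headings, threshold=60):
    """Return (heading, score) for the closest heading, or None below threshold."""
    scored = [(heading, similarity(name, heading)) for heading in headings]
    heading, score = max(scored, key=lambda pair: pair[1])
    return (heading, score) if score >= threshold else None


if __name__ == "__main__":
    print(best_match("Dickens, Charls", AUTHORITY_HEADINGS))
    print(best_match("Zzzz Qqqq", AUTHORITY_HEADINGS))
```

A threshold like this is a judgment call: set it too low and wrong headings get attached, too high and misspelled or undated names fall through to manual review.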


Enter Open Refine

http://freeyourmetadata.org/


Match strings to vocabularies…


Like LCNAF…


Or Wikipedia


Automated Authority Control?


Open Refine RDF Skeleton


Proposed System Architecture


Hydra Modeling & Architecture

• Approaches to Provenance

• Prov-O

• Named Graphs

• Named Datastreams

• “n” nyucore “records”

• Same properties defined for each

• Keep data sources separate

• Merge for display in Blacklight & export to Primo


Separate Metadata Datastreams

• source_metadata, enrich_metadata

• Reload one or both without affecting other or native metadata

• native_metadata

• Edited only through Hydra UI

• Partitioned from external sources
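The idea of keeping each metadata source in its own datastream and merging only for display can be sketched in plain Python. This is not Hydra or Fedora code: the record contents, field names, and the `merge_for_display` and `reload` helpers are hypothetical, and the sketch assumes each datastream is a simple mapping from a property name to a list of values.

```python
# Hypothetical datastream contents; field names and values are invented.
source_metadata = {"title": ["Interview with a jazz musician"], "creator": ["Smith, J."]}
enrich_metadata = {"creator": ["Smith, John, 1920-1999"], "subject": ["Jazz"]}
native_metadata = {"subject": ["Oral histories"]}


def merge_for_display(**datastreams):
    """Union the values of each field, recording which datastream supplied each one."""
    merged = {}
    for stream_name, record in datastreams.items():
        for field, values in record.items():
            for value in values:
                merged.setdefault(field, []).append({"value": value, "from": stream_name})
    return merged


def reload(datastreams, stream_name, new_record):
    """Return a copy with one datastream replaced, leaving the others untouched."""
    updated = dict(datastreams)
    updated[stream_name] = new_record
    return updated


streams = {
    "source_metadata": source_metadata,
    "enrich_metadata": enrich_metadata,
    "native_metadata": native_metadata,
}
display = merge_for_display(**streams)
```

Because every merged value carries the name of the datastream it came from, a reloaded enrichment run can be swapped in with `reload` without touching locally edited values.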


Metadata Provenance


Fedora Datastreams


Blacklight User Interface


Where does this lead?

We need new interfaces

new tools

for a new kind of cataloger

for knowledge organization experts


A Role for Ex Libris

• Alma &/or Primo

• Named Entity Recognition

• Vocabulary Reconciliation

• Provenance Management

• Primo Central

• Named Entity Recognition on Full Text

• Auto Classification


A bit louder...

we need new interfaces

we need enterprise tools

integrated into our metadata management systems

for a new kind of cataloger

for knowledge organization experts


Simplified Workflow Proposal


More Tools – At Programming Level

• Open NLP: https://opennlp.apache.org/

• Stanford Natural Language Toolkit: http://nlp.stanford.edu/software/index.shtml

• Python Tools

• scikit-learn, pandas, NLTK, SciPy, NumPy

• https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience

• http://pandas.pydata.org/

• http://www.nltk.org/


More Data Science-ey Tools

http://www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html


Data Science Techniques

• Feature Extraction / Feature Engineering

• Predictive Modeling

• Probabilistic Classification – Large Multi-Class Problems

• Text Analytics

• Vectorization

• Bags & Sets of Words

• TF/IDF

• N-Grams

• Sparse Matrices
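The text analytics terms above (vectorization, bags and sets of words, n-grams, sparse matrices) can be illustrated with a small standard-library sketch. It is an assumption-laden toy, not how scikit-learn or NLTK implement these steps: the `tokenize`, `ngrams`, `bag_of_words`, `set_of_words` and `to_sparse_rows` helpers are hypothetical names, and the tokenizer is deliberately naive.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) in token order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text, max_n=1):
    """Count every 1..max_n-gram; the Counter acts as a sparse vector (zeros are not stored)."""
    tokens = tokenize(text)
    bag = Counter()
    for n in range(1, max_n + 1):
        bag.update(ngrams(tokens, n))
    return bag


def set_of_words(text):
    """Binary variant: record only whether a term occurs, not how often."""
    return set(tokenize(text))


def to_sparse_rows(texts, max_n=1):
    """Map terms to column indices and return (vocabulary, rows of {column: count})."""
    bags = [bag_of_words(t, max_n) for t in texts]
    vocabulary = {term: i for i, term in enumerate(sorted(set().union(*bags)))}
    rows = [{vocabulary[term]: count for term, count in bag.items()} for bag in bags]
    return vocabulary, rows
```

Each row stores only the columns that are non-zero, which is the point of a sparse matrix: a real vocabulary has tens of thousands of columns, and any one document uses very few of them.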


Simple Example – Predict Yelp Star Ratings


Fitting a Model – Naïve Bayes
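As a rough idea of what fitting a Naïve Bayes text classifier involves, here is a minimal multinomial Naïve Bayes with add-one smoothing, written with the standard library only. It is not the code shown on the slide. The four "reviews" and their star ratings are invented toy data rather than real Yelp records, and `fit`, `predict` and `tokenize` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Split on whitespace after lowercasing (deliberately naive)."""
    return text.lower().split()


def fit(texts, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log posterior, using add-one smoothing."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, doc_count in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # log prior
        for word in tokenize(text):
            if word in vocabulary:  # ignore words never seen in training
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy reviews standing in for real Yelp data.
reviews = [
    "great food friendly staff",
    "terrible service cold food",
    "great service great food",
    "cold rude terrible",
]
stars = [5, 1, 5, 1]
model = fit(reviews, stars)
```

With only two star values this is a binary problem; predicting all five ratings works the same way, with one set of word counts per class.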


Data Science Venn Diagram

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram


$$1 + \ln\left(\frac{\text{Total Document Count}}{\text{Documents Containing Term}}\right)$$

http://www.amazon.com/Data-Science-Business-data-analytic-thinking/dp/1449361323
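The expression on this slide, 1 + ln(total document count / documents containing term), is the inverse document frequency part of TF/IDF. A small worked sketch, assuming raw term counts for the TF part and an invented four-document toy corpus, might look like this; the `idf` and `tf_idf` helper names are hypothetical.

```python
import math
from collections import Counter


def idf(term, documents):
    """Inverse document frequency: 1 + ln(total documents / documents containing term)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return 1 + math.log(len(documents) / containing)


def tf_idf(document, documents):
    """Weight each term's raw count in `document` by its IDF over `documents`."""
    counts = Counter(document)
    return {term: count * idf(term, documents) for term, count in counts.items()}


# Invented toy corpus: each document is a list of tokens.
corpus = [
    ["jazz", "oral", "history"],
    ["jazz", "recording"],
    ["library", "authority", "history"],
    ["jazz", "jazz", "archive"],
]
weights = tf_idf(corpus[3], corpus)
```

In this corpus "jazz" appears in three of four documents, so its IDF is 1 + ln(4/3), while "archive" appears in only one and gets 1 + ln(4). The rarer term is weighted more heavily per occurrence.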


Where can we go from here?

• NER is just the beginning

• Feature Engineering

• Hiring Statisticians

• Clustering & Classification

• Vocabulary Pruning and Engineering

• Manageable 10-20k Class Text Classification Problems

• Domain Specific

• Ex Libris’ Activity in this space


Thanks!

[email protected]

212.998.2479

@chrpr