Information Extraction Pedro Szekely Information Sciences Institute, USC Viterbi School of Engineering 1

Information Extraction - GitHub Pages · 2020-06-17 · Pattern Complexity E.g., word patterns Closed set He was born in Alabama… Regular set Phone: (413) 545-1323 Complex pattern


Information Extraction

Pedro Szekely

Information Sciences Institute,

USC Viterbi School of Engineering


Agenda

Information extraction classification

Text extraction techniques

Storing extractions in knowledge graphs

myDIG demo

Summary


Document Features

Text paragraphs without formatting

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

Grammatical sentences plus some formatting & links

Non-grammatical snippets, rich formatting & links

Tables

Charts

Kejriwal, Szekely

Scope

Web site specific

Genre specific (e.g., forums)

Wide, non-specific

Pattern Complexity

E.g., word patterns

Closed set: U.S. states
● He was born in Alabama…
● The big Wyoming sky…

Regular set: U.S. phone numbers
● Phone: (413) 545-1323
● The CALD main office can be reached at 412-268-1299

Complex pattern: U.S. postal addresses
● University of Arkansas, P.O. Box 140, Hope, AR 71802
● Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210

Ambiguous patterns, needing context and many sources of evidence: Person names
● …was among the six houses sold by Hope Feldman that year.
● Pawel Opalinski, Software Engineer at WhizBang Labs.

Courtesy of Andrew McCallum

“YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish, Armenian and Filipino mixed princess :) ❤ Kim ❤ 7○7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses”

small amount of relevant content

irrelevant content very similar to relevant content

Practical Considerations

How good (precision/recall) is necessary?
● High precision when showing extractions to users
● High recall when used for ranking results

How long does it take to construct?
● Minutes, hours, days, months

What expertise do I need?
● None (domain expertise), patience (annotation), simple scripting, machine learning guru

What tools can I use?
● Many …


Information Extraction Process

Segmentation → Data Extraction

Name: Legacy Ventures Intl, Inc.

Stock: LGYV

Date: 2017-07-14

Market Cap: 391,030


Segmentation

Semi-structured extraction

Table extraction

Main content identification

Custom regular expressions

Text segments

Text Extraction Techniques

Glossary

Regular expressions

Natural language rules

Named entity recognition

Sequence labeling (Conditional Random Fields)


Glossary Extraction

Simple
● list of words or phrases to extract

Challenges
● Ambiguity: Charlotte is the name of both a person and a city
● Colloquial expressions: “Asia Broadband, Inc.” vs “Asia Broadband”

Research
● Improving precision of glossary extractions using context
● Creating/extending glossaries automatically
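A minimal sketch of glossary extraction, assuming a plain case-insensitive phrase lookup; the glossaries and the function names (`build_glossary_pattern`, `extract`) are hypothetical and only illustrate the idea. Longer phrases are tried first so that “Asia Broadband, Inc.” is preferred over “Asia Broadband”.

```python
import re

def build_glossary_pattern(glossary):
    """Compile one case-insensitive regex that matches any glossary phrase."""
    # Longest phrases first, so the longer variant of a name wins.
    phrases = sorted(glossary, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in phrases)
    # The lookarounds keep a match from starting or ending inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

def extract(text, glossary):
    """Return (start, end, matched_text) for every glossary hit in text."""
    pattern = build_glossary_pattern(glossary)
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

# Hypothetical glossaries, for illustration only.
CITIES = ["Charlotte", "Cincinnati"]
COMPANIES = ["Asia Broadband, Inc.", "Asia Broadband"]

company_hits = extract("Shares of Asia Broadband, Inc. rose", COMPANIES)
city_hits = extract("Charlotte moved to Cincinnati.", CITIES)
```

In the second call “Charlotte” is reported as a city even though the sentence uses it as a person's name, which is the ambiguity problem listed above; a lookup alone cannot resolve it without context.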


Regex Extraction


Extraction Using Regular Expressions

Too difficult for non-programmers
● regex for North American phone numbers:

^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$

Brittle and difficult to adapt to unusual domains
● unusual nomenclature and short-hands
● obfuscation
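The regex above can be tried out with Python's `re` module. This is a sketch only: the pattern is copied from the slide and split across string literals for readability, while the helper name `parse_phone` and the returned dictionary keys are assumptions made for illustration.

```python
import re

# North American phone-number regex from the slide; the literals below
# concatenate to the same pattern.
NANP_PHONE = re.compile(
    r"^(?:(?:\+?1\s*(?:[.-]\s*)?)?"
    r"(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)"
    r"|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?"
    r"([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?"
    r"([0-9]{4})"
    r"(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$"
)

def parse_phone(candidate):
    """Return the parts of a phone number, or None if the string does not match."""
    m = NANP_PHONE.match(candidate.strip())
    if m is None:
        return None
    paren_area, bare_area, exchange, line, extension = m.groups()
    return {
        "area": paren_area or bare_area,  # area code with or without parentheses
        "exchange": exchange,
        "line": line,
        "extension": extension,
    }

clean = parse_phone("(413) 545-1323")
obfuscated = parse_phone("7○7~7two7~7four77")  # obfuscated number from the ad snippet
```

Because the pattern is anchored with `^` and `$`, it validates a whole candidate string rather than searching inside running text, and the obfuscated number is rejected outright, which illustrates the brittleness noted above.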


NLP Rule-Based Extraction

Tokenization → Pattern Matching

Tokenization

My name is Pedro My name is Pedro

310-822-1511 310-822-1511

310 - 822 - 1511

Candy is here Candy is here

Candy is here
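The two tokenizations of the phone number above can be reproduced with a short sketch. The function names and the exact splitting rule are assumptions for illustration, not the tokenizer used in the slides.

```python
import re

# One token per run of letters, run of digits, or single other non-space
# character (punctuation, symbols, emojis).
FINE_TOKEN = re.compile(r"[^\W\d_]+|\d+|\S")

def tokenize_whitespace(text):
    """Coarse tokenization: split on white-space only."""
    return text.split()

def tokenize_fine(text):
    """Finer tokenization: also split at punctuation and letter/digit boundaries."""
    return FINE_TOKEN.findall(text)

coarse = tokenize_whitespace("310-822-1511")
fine = tokenize_fine("310-822-1511")
```

The coarse version keeps the phone number as one token, while the fine version yields the digit groups and dashes separately, which is usually more convenient for token-level patterns.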


Token Properties

Surface properties
● Literal, type, shape, capitalization, length, prefix, suffix, minimum, maximum

Language properties
● Part of speech tag, lemma, dependency
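The surface properties can be computed from the token string alone. The sketch below is one possible encoding; the shape alphabet (`d`, `X`, `x`), the category names, and the `affix_len` parameter are assumptions. Language properties (part of speech tag, lemma, dependency) need an NLP toolkit and are not covered here.

```python
def token_shape(token):
    """Map digits to 'd', upper-case letters to 'X', lower-case letters to 'x'."""
    out = []
    for ch in token:
        if ch.isdigit():
            out.append("d")
        elif ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        else:
            out.append(ch)  # punctuation and symbols are kept as-is
    return "".join(out)

def token_type(token):
    """Classify a token as numeric, alphabetic, alphanumeric or other."""
    if token.isdigit():
        return "numeric"
    if token.isalpha():
        return "alphabetic"
    if token.isalnum():
        return "alphanumeric"
    return "other"

def capitalization(token):
    """Describe the casing of a token."""
    if token.isupper():
        return "upper"
    if token.istitle():
        return "title"
    if token.islower():
        return "lower"
    return "mixed or none"

def surface_properties(token, affix_len=2):
    """Collect the surface properties of one token in a dictionary."""
    return {
        "literal": token,
        "type": token_type(token),
        "shape": token_shape(token),
        "capitalization": capitalization(token),
        "length": len(token),
        "prefix": token[:affix_len],
        "suffix": token[-affix_len:],
    }

props = surface_properties("Pedro")
```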


Token Types


Patterns

Pattern := Token-Spec
         | [Token-Spec]          (optional)
         | Token-Spec +          (one or more)
         | Token-Spec Pattern
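The grammar above can be sketched as a small backtracking matcher over a token list. The `TokenSpec` class, the `optional` and `repeat` flags, and the helper predicates are hypothetical names chosen for this illustration; they are not the API of any particular toolkit.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class TokenSpec:
    test: Callable[[str], bool]  # predicate over a single token
    optional: bool = False       # [Token-Spec]
    repeat: bool = False         # Token-Spec +

def match_at(tokens: List[str], start: int, pattern: List[TokenSpec]) -> Optional[int]:
    """Return the end index of a match starting at tokens[start], or None."""
    def step(pos: int, k: int) -> Optional[int]:
        if k == len(pattern):
            return pos
        spec = pattern[k]
        # A non-repeating spec may consume at most one token.
        limit = len(tokens) if spec.repeat else min(len(tokens), pos + 1)
        run = 0
        while pos + run < limit and spec.test(tokens[pos + run]):
            run += 1
        fewest = 0 if spec.optional else 1
        # Try the longest run first, then backtrack to shorter ones.
        for used in range(run, fewest - 1, -1):
            end = step(pos + used, k + 1)
            if end is not None:
                return end
        return None
    return step(start, 0)

def find_all(tokens: List[str], pattern: List[TokenSpec]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans for every position where the pattern matches."""
    spans = []
    for start in range(len(tokens)):
        end = match_at(tokens, start, pattern)
        if end is not None and end > start:
            spans.append((start, end))
    return spans

def digits(n):
    return lambda tok: tok.isdigit() and len(tok) == n

def literal(s):
    return lambda tok: tok == s

# Three digits, optional dash, three digits, optional dash, four digits.
PHONE = [
    TokenSpec(digits(3)), TokenSpec(literal("-"), optional=True),
    TokenSpec(digits(3)), TokenSpec(literal("-"), optional=True),
    TokenSpec(digits(4)),
]
# One or more capitalized tokens.
NAME = [TokenSpec(str.istitle, repeat=True)]

phone_spans = find_all(["call", "310", "-", "822", "-", "1511", "now"], PHONE)
```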


Positive/Negative Patterns

Positive (general)
● Generate candidates

Negative (specific)
● Remove candidates
● Output overlaps positive candidates
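One reading of the positive/negative scheme, sketched with plain regular expressions over text: general positive patterns propose candidate spans, and any candidate that overlaps the output of a more specific negative pattern is dropped. This interpretation, the function names, and the example patterns are assumptions made for illustration.

```python
import re

def spans(pattern, text):
    """Character spans of all matches of one pattern."""
    return [m.span() for m in re.finditer(pattern, text)]

def overlaps(a, b):
    """True if the half-open spans a and b share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def extract(text, positive, negative):
    """Keep positive candidates that do not overlap any negative match."""
    candidates = [s for p in positive for s in spans(p, text)]
    blocked = [s for p in negative for s in spans(p, text)]
    return [
        text[start:end]
        for (start, end) in candidates
        if not any(overlaps((start, end), b) for b in blocked)
    ]

# Hypothetical rules for price extraction from the ad snippet shown earlier:
# every number is a price candidate, unless it is part of a duration.
ad = "HH 80 roses Hour 120 roses 15 mins 60 roses"
prices = extract(ad, positive=[r"\b\d+\b"], negative=[r"\b\d+\s+mins\b"])
```

Here the general rule proposes 80, 120, 15 and 60, and the specific rule removes 15 because it overlaps “15 mins”.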


DIG Demo


https://spacy.io/docs/usage/rule-based-matching


Advantages/Disadvantages

Advantages
● Easy to define
● High precision
● Recall increases with number of rules

Disadvantages
● Text must follow strict patterns

NLP Rule-Based Extraction

Tokenization for unusual domains
● tokenize on white-space, punctuation and emojis

Token properties
● literal, part of speech tag, lemma, in/out of dictionary
● dependency parsing relationships (advanced)
● type (alphanumeric, alphabetic, numeric)
● shape (pattern of digits and characters), capitalization, prefix and suffix
● number of characters, range (numbers)

Pattern
● sequence of required/optional tokens
● positive and negative patterns


Named Entity Recognizers

Machine learning models
● people, places, organizations and a few others

spaCy
● complete NLP toolkit, Python (Cython), MIT license
● code: https://github.com/explosion/spaCy
● demo: http://textanalysisonline.com/spacy-named-entity-recognition-ner

Stanford NER
● part of Stanford’s NLP software library, Java, GNU license
● code: https://nlp.stanford.edu/software/CRF-NER.shtml
● demo: http://nlp.stanford.edu:8080/ner/process


https://spacy.io/docs/usage/entity-recognition


https://demos.explosion.ai/displacy-ent


Advantages/Disadvantages

Advantages
● Easy to use
● Tolerant of some noise
● Easy to train

Disadvantages
● Performance degrades rapidly for new genres, language models
● Requires hundreds to thousands of training examples

Conditional Random Fields


Discriminative Vs. Generative

● Generative model: a model that generates the observed data randomly
● Naïve Bayes: once the class label is known, all the features are independent
● Discriminative: directly estimate the posterior probability; aim at modeling the “discrimination” between different outputs
● MaxEnt classifier: linear combination of feature functions in the exponent

Both generative models and discriminative models describe distributions over (y, x), but they work in different directions.
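As a worked illustration of the two directions, the standard textbook forms are given below. They are not taken from the slide; the symbols $\lambda_j$ (weights), $f_j$ (feature functions), $K$ (number of features) and $Z(x)$ (normalizer) are notation introduced here. Naïve Bayes factors the joint distribution, while a MaxEnt classifier models the posterior directly.

```latex
% Generative (Naive Bayes): joint distribution, features independent given the label
p(y, x) = p(y) \prod_{k=1}^{K} p(x_k \mid y)

% Discriminative (MaxEnt): posterior with a linear combination of
% feature functions in the exponent
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{j} \lambda_j f_j(x, y) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{j} \lambda_j f_j(x, y') \Big)
```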

slide by Daniel Khashabi

Discriminative Vs. Generative

[Figure legend: unobservable, observable]

slide by Daniel Khashabi

Chain CRFs

● Each potential function will operate on pairs of adjacent label variables
● Parameters to be estimated
● Feature functions

[Figure legend: unobservable, observable]

slide by Daniel Khashabi

Chain CRF

● We can change it so that each state depends on more observations
● Or inputs at previous steps
● Or all inputs

[Figure legend: unobservable, observable]

slide by Daniel Khashabi

Modeling Problems With CRF

i  X1 (word)  X2 (capitalized)  X3 (POS Tag)     Y (entity)
1  My         1                 Possessive Pron  Other
2  name       0                 Noun             Other
3  is         0                 Verb             Other
4  Pedro      1                 Proper Noun      Person-Name
5  Szekely    1                 Proper Noun      Person-Name


Other common features:

lemma, prefix, suffix, length
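A sketch of how the columns of the table, plus the additional surface features, might be assembled into one feature dictionary per token. The function names and feature keys are assumptions; POS tags are passed in by hand here, and the lemma is left out because it needs an NLP toolkit. CRF libraries such as sklearn-crfsuite accept input in this list-of-dictionaries shape.

```python
def token_features(words, pos_tags, i, affix_len=2):
    """Feature dictionary for the token at position i of a sentence."""
    word = words[i]
    return {
        "word": word.lower(),                  # X1
        "capitalized": int(word[0].isupper()), # X2
        "pos": pos_tags[i],                    # X3 (supplied by a tagger)
        "prefix": word[:affix_len],
        "suffix": word[-affix_len:],
        "length": len(word),
        # Neighbouring words give the model some context.
        "prev_word": words[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "<EOS>",
    }

def sentence_features(words, pos_tags):
    """One feature dictionary per token, in sentence order."""
    return [token_features(words, pos_tags, i) for i in range(len(words))]

# The example sentence from the table above.
words = ["My", "name", "is", "Pedro", "Szekely"]
pos_tags = ["Possessive Pron", "Noun", "Verb", "Proper Noun", "Proper Noun"]
labels = ["Other", "Other", "Other", "Person-Name", "Person-Name"]  # Y

X = sentence_features(words, pos_tags)
```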


Feature functions: f_j(x, y_{i-1}, y_i, i)
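For reference, the standard linear-chain CRF form that uses feature functions with this signature is shown below. The weights $\lambda_j$, the normalizer $Z(x)$ and the example feature are notation and illustration introduced here, not content from the slide.

```latex
% Conditional probability of a label sequence y = (y_1, ..., y_n) given input x
p(y \mid x) = \frac{1}{Z(x)}
  \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(x, y_{i-1}, y_i, i) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(x, y'_{i-1}, y'_i, i) \Big)

% An illustrative binary feature function for the table above
f_1(x, y_{i-1}, y_i, i) =
\begin{cases}
1 & \text{if } x_i \text{ is capitalized and } y_i = \text{Person-Name} \\
0 & \text{otherwise}
\end{cases}
```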


Advantages/Disadvantages

Advantages
● Expressive
● Tolerant of noise
● Stood test of time
● Software packages available

Disadvantages
● Requires feature engineering
● Requires thousands of training examples

Open Information Extraction


http://openie.allenai.org/


Practical IE Technologies

           Glossary             Regex             NLP Rules           Semi-Structured  CRF                  NER          Table
Effort     assemble glossary    hours             hours               minutes          O(1000) annotations  zero         O(10) annotations
Expertise  minimal              high, programmer  low                 minimal          low-medium           zero         minimal
Precision  medium (ambiguity)   high              high                high             medium-high          medium-high  high
Recall     medium (formatting)  low, f(# regex)   medium, f(# rules)  high             medium               medium       high

how to represent KGs?


KG Definition

a directed, labeled multi-relational graph

representing facts/assertions as triples

(h, r, t) head entity, relation, tail entity

(s, p, o) subject, predicate, object
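A minimal sketch of the definition as a data structure: a set of (head, relation, tail) triples with an index of outgoing edges. The class and method names are hypothetical, and the example triples reuse entity and relation names that appear elsewhere in these slides; the edge directions are an assumption.

```python
from collections import defaultdict, namedtuple

Triple = namedtuple("Triple", ["head", "relation", "tail"])

class KnowledgeGraph:
    """Directed, labeled, multi-relational graph stored as a set of triples."""

    def __init__(self):
        self.triples = set()
        self._outgoing = defaultdict(list)  # head -> triples leaving that node

    def add(self, head, relation, tail):
        triple = Triple(head, relation, tail)
        if triple not in self.triples:
            self.triples.add(triple)
            self._outgoing[head].append(triple)

    def tails(self, head, relation):
        """All tail entities reachable from head over edges labeled relation."""
        return [t.tail for t in self._outgoing.get(head, []) if t.relation == relation]

kg = KnowledgeGraph()
# Edge directions below are assumed for illustration.
kg.add("Legacy Ventures International Inc", "stock-ticker", "LGYV")
kg.add("Legacy Ventures International Inc", "is-a", "Company")
kg.add("Legacy Ventures International Inc", "promoter", "Damn Good Penny Stocks")

ticker = kg.tails("Legacy Ventures International Inc", "stock-ticker")
```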


Simplest Knowledge Graph

Entities

[Figure: nodes LGYV, Legacy Ventures International Inc, Damn Good Penny Stocks; edge label: mentions]

Easiest to build

Simple, But Useful KG

Entities + properties

[Figure: nodes LGYV, Legacy Ventures International Inc, Damn Good Penny Stocks; edge label: company]

“Easy” to build

Semantic Web KG (RDF/OWL)

Entities + properties + classes

[Figure: nodes LGYV, Legacy Ventures International Inc, Damn Good Penny Stocks; class: Company; edge labels: stock-ticker, promoter, is-a]

Very hard to build

“Ideal” KG

Entities + properties + classes + qualifiers

[Figure: nodes LGYV, Legacy Ventures International Inc, Damn Good Penny Stocks; class: Company; edge labels: stock-ticker, promoter, is-a; qualifiers: start-date June 2017, source stockreads.com]

Very very hard to build

“More Ideal” KG

Entities + properties + provenance + confidence + qualifiers

“Not so hard” to build
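A sketch of what a record with provenance, confidence and qualifiers might look like, assuming each extraction is stored as its own object. The `Extraction` fields, the `method` labels, and the confidence numbers are made up for illustration; only the entity names, the source site and the start-date qualifier come from the earlier slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Extraction:
    subject: str
    predicate: str
    value: str
    source: str            # provenance: where the value was found
    method: str            # provenance: which extractor produced it
    confidence: float      # extractor's confidence in the value
    qualifiers: tuple = () # e.g. (("start-date", "June 2017"),)

def best_value(extractions, subject, predicate):
    """Highest-confidence extraction for a subject/predicate pair, or None."""
    candidates = [
        e for e in extractions
        if e.subject == subject and e.predicate == predicate
    ]
    return max(candidates, key=lambda e: e.confidence, default=None)

# Hypothetical extractions; confidence values are invented.
EXTRACTIONS = [
    Extraction("Legacy Ventures International Inc", "stock-ticker", "LGYV",
               source="stockreads.com", method="glossary", confidence=0.9),
    Extraction("Legacy Ventures International Inc", "stock-ticker", "LGY",
               source="stockreads.com", method="regex", confidence=0.4),
    Extraction("Legacy Ventures International Inc", "promoter", "Damn Good Penny Stocks",
               source="stockreads.com", method="nlp-rule", confidence=0.7,
               qualifiers=(("start-date", "June 2017"),)),
]

ticker = best_value(EXTRACTIONS, "Legacy Ventures International Inc", "stock-ticker")
```

Keeping every extraction, including the low-confidence ones, is what makes this form comparatively easy to build: conflicts are resolved at query time rather than during construction.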


Where to Store KGs?


Serializing Knowledge Graphs

Resource Description Framework (RDF)
● Database (triple store): AllegroGraph, Virtuoso, …
● Query: SPARQL (SQL-like)

Key-Value, Document Stores
● Data model: node-centric
● Databases: HBase, MongoDB, Elasticsearch, …
● Query: filters, keywords, aggregation (no joins)

Graph Databases
● Data model: graph
● Databases: Neo4j, Cayley, MarkLogic, GraphDB, Titan, OrientDB, Oracle, …
● Query: GraphQL, Gremlin, Cypher

Popularity Ranking Of Graph Databases

https://db-engines.com/en/ranking_trend/graph+dbms

Elasticsearch, MongoDB & Neo4j Have Wide Adoption

Triple Stores

https://db-engines.com/en/ranking_trend/graph+dbms

myDIG: A KG Construction Toolkit
● Python, MIT license, https://github.com/usc-isi-i2/dig-etl-engine

Enable end-users to construct domain-specific KGs
● end users from 5 government orgs constructed KGs in less than one day

Suite of extraction techniques
● semi-structured HTML pages, glossaries, NLP rules, NER, tables (coming soon)

KG includes provenance and confidences
● enable research to improve extractions and KG quality

Scalable
● runs on laptop (~100K docs), cluster (> 100M docs)

Robust
● deployed to many law enforcement agencies

Easy to install
● Docker deployment with single “docker compose up” installation

myDIG Demo


Summary

Partition pages into segments

Select technology based on segment features

Do knowledge graph completion (next topic)

Choose representation based on application demands