
NICTA Copyright 2014

Text, Knowledge, and Information Extraction

Lizhen Qu


A bit about Myself…

•  PhD: Databases and Information Systems Group (MPII)
–  Advisors: Prof. Gerhard Weikum and Prof. Rainer Gemulla
–  Thesis: “Sentiment Analysis with Limited Training Data”

•  Now: machine learning group at NICTA, adjunct research fellow at ANU.


Macquarie


News about Macquarie Bank


Negative News about Macquarie Bank


Simple Math Problem

Bob has 15 apples. He gives 9 to Sarah. How many apples does Bob have now?


Information Extraction


•  Named entity recognition
•  Named entity disambiguation
•  Relation extraction


Knowledge Bases (Open Linked Data)


Entity Graph

Economic Graph

OpenIE (Ollie, Reverb)

(Bob_Dylan, compose, Like_a_rolling_stone)
(The_Dark_Knight, directedBy, Christopher_Nolan)
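Facts like the two above are (subject, predicate, object) triples. As a rough sketch of how such facts can be stored and queried, assuming a plain in-memory list and a hypothetical helper `match` (neither is the API of YAGO, DBpedia, or Freebase):

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The two example facts from the slide.
KB: List[Triple] = [
    Triple("Bob_Dylan", "compose", "Like_a_rolling_stone"),
    Triple("The_Dark_Knight", "directedBy", "Christopher_Nolan"),
]


def match(kb: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return every triple that agrees with all fields that are not None."""
    return [
        t for t in kb
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Who directed what?
directed = match(KB, predicate="directedBy")
```

Leaving a field as `None` turns it into a wildcard, which is the basic pattern behind triple-pattern queries.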


YAGO
#classes: 350,000
#entities: 10 million
#facts: 120 million
#languages: 10


DBpedia
#classes: 735
#entities: 38.3 million
#triples: 6.9 billion
#languages: 128


Freebase
#entities: 50 million
#facts: 3 billion
#languages: almost 70


Construct YAGO from (Semi) Structured Data


IE Challenge: ambiguity of Natural Language


I made her duck.

i.  I cooked waterfowl for her.
ii.  I cooked waterfowl belonging to her.
iii.  I created the duck she owns.
iv.  I caused her to quickly lower her head or body.
v.  I waved my magic wand and turned her into a waterfowl.


Named Entity Recognition

Research at Stanford led to a search engine company, founded by Page and Brin.

TASK: Stanford → ORG, Page → PER, Brin → PER

Machine Learning Problem: assign one label to every token.

Tokens: Research at Stanford led to search engine company , founded by Page and Brin .
Labels: O O ORG O O O O O O O O PER O PER O
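The token-level labels above can be turned back into entity mentions by grouping runs of identical non-O labels. A minimal sketch, assuming the plain O/ORG/PER label set shown on the slide; `extract_mentions` is a hypothetical helper, and without a B-/I- prefix scheme two adjacent entities of the same type would be merged:

```python
def extract_mentions(tokens, labels):
    """Group maximal runs of identically labelled non-O tokens into mentions."""
    mentions = []
    current_tokens, current_label = [], "O"
    for token, label in zip(tokens, labels):
        # A change of label closes the mention that was being collected.
        if label != current_label and current_tokens:
            mentions.append((" ".join(current_tokens), current_label))
            current_tokens = []
        if label != "O":
            current_tokens.append(token)
        current_label = label
    if current_tokens:  # mention that runs to the end of the sentence
        mentions.append((" ".join(current_tokens), current_label))
    return mentions


tokens = "Research at Stanford led to search engine company , founded by Page and Brin .".split()
labels = "O O ORG O O O O O O O O PER O PER O".split()
mentions = extract_mentions(tokens, labels)
# [('Stanford', 'ORG'), ('Page', 'PER'), ('Brin', 'PER')]
```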


Learning and Prediction

[Figure: pipeline. Sentences → Feature Extraction → train models (has labels) or prediction (no labels) → Labeled Sentences]


Feature Extraction

•  Use features to represent each word.

Features of Stanford:
w-2 = Research, w-1 = at, w0 = Stanford, w+1 = led, w+2 = to, POS = noun, capitalized? = true

Features of search:
w-2 = to, w-1 = a, w0 = search, w+1 = engine, w+2 = company, POS = noun, capitalized? = false

•  Vectorise feature representations.

Feature:            w-2 = research | capitalized | w0 = stanford | w0 = search | …
Value for Stanford: 1              | 1           | 1             | 0           | …
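The two steps above can be sketched in a few lines of Python. This is an illustration only: `word_features` and `vectorise` are hypothetical helpers, the POS feature is left out because it would need a tagger, and the feature index is hand-picked to match the four columns shown on the slide.

```python
def word_features(tokens, i):
    """Window and capitalisation features for tokens[i] (POS omitted)."""
    feats = {}
    for offset in (-2, -1, 0, 1, 2):
        j = i + offset
        if 0 <= j < len(tokens):
            name = f"w{offset:+d}" if offset else "w0"  # w-2, w-1, w0, w+1, w+2
            feats[name] = tokens[j].lower()
    feats["capitalized"] = tokens[i][0].isupper()
    return feats


def vectorise(feats, index):
    """Binary vector: 1 where a name=value feature (or a true flag) is active."""
    active = {f"{k}={v}" for k, v in feats.items() if not isinstance(v, bool)}
    active |= {k for k, v in feats.items() if v is True}
    return [1 if name in active else 0 for name in index]


tokens = "Research at Stanford led to a search engine company , founded by Page and Brin .".split()
index = ["w-2=research", "capitalized", "w0=stanford", "w0=search"]

stanford_vec = vectorise(word_features(tokens, 2), index)  # [1, 1, 1, 0]
search_vec = vectorise(word_features(tokens, 6), index)    # [0, 0, 0, 1]
```

In a real system the index would be built from all features seen in the training data, so the vectors are very long and very sparse.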


Standard Model: Conditional Random Fields

•  Assigns local score to different (word, label) pairs.
•  Joint inference to find best label sequences.

CRF: $p(y \mid x) = \frac{1}{Z} \exp\left[ \sum_{t=1}^{T} \sum_{i} \lambda_i f_i(y_{t-1}, y_t, x_t) \right]$

Stanford NER [1]: 86%
Best system [8]: 89%
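To make the CRF formula concrete, here is a toy linear-chain model evaluated by brute force. Everything in it is an assumption for illustration: the label set, the feature templates, the hand-set weights (λ), and the dummy start label. It is not how Stanford NER is implemented; practical systems learn the weights and use the forward algorithm and Viterbi decoding instead of enumerating all label sequences.

```python
import itertools
import math

LABELS = ["O", "ORG", "PER"]
START = "<s>"  # assumed dummy label standing in for y_0

# Hypothetical weights lambda_i, keyed by feature name; unlisted features weigh 0.
WEIGHTS = {
    "cap=False:O": 2.0,
    "cap=True:ORG": 1.0,
    "cap=True:PER": 1.0,
    "word=stanford:ORG": 2.0,
    "word=page:PER": 2.0,
    "trans:ORG->ORG": 0.5,
}


def features(y_prev, y, x):
    """Names of the binary features f_i(y_{t-1}, y_t, x_t) that fire."""
    return [
        f"trans:{y_prev}->{y}",
        f"cap={x[0].isupper()}:{y}",
        f"word={x.lower()}:{y}",
    ]


def score(x, y):
    """Sum over t and i of lambda_i * f_i(y_{t-1}, y_t, x_t)."""
    total, prev = 0.0, START
    for x_t, y_t in zip(x, y):
        total += sum(WEIGHTS.get(name, 0.0) for name in features(prev, y_t, x_t))
        prev = y_t
    return total


def probability(x, y):
    """p(y | x) = exp(score) / Z, with Z summed over every label sequence."""
    z = sum(math.exp(score(x, cand))
            for cand in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(x, y)) / z


def best_sequence(x):
    """Joint inference by exhaustive search (only feasible for tiny inputs)."""
    return max(itertools.product(LABELS, repeat=len(x)),
               key=lambda y: score(x, y))


x = ["at", "Stanford", "by", "Page"]
best = best_sequence(x)  # ('O', 'ORG', 'O', 'PER')
```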


Named Entity Disambiguation

Research at Stanford led to a search engine company, founded by Page and Brin.

TASK:
Stanford (ORG) → Stanford University
Page (PER) → Larry Page
Brin (PER) → Sergey Brin


AIDA-light [2]


First Stage


Second Stage

AIDA-light [2]: 84.8%
DBpedia Spotlight: 75%


Relation Extraction

•  Relation mention extraction.

Research at Stanford led to a search engine company, founded by Page and Brin.

PER: Larry_Page, PER: Sergey_Brin, ORG: Stanford_University
Relation between the mentions: ?

•  Expand knowledge bases.

(Larry Page, ?, Stanford University)
(The Dark Knight, ?, Christopher Nolan)


Relation Mention Extraction

•  Multi-class classification.
•  Example features of a pair of entity mentions [3].

Research at Stanford led to a search engine company, founded by Page and Brin.

Relation between (Stanford, Page): ?

Words between (Stanford, Page): led, to, a, search, engine, company, founded, by
Named entity types: (ORG, PER)
Number of mentions between (Stanford, Page): 0

F-Measure on ACE: 71.2% [3]
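The three example features above are simple to compute once the sentence is tokenised and tagged. A minimal sketch, assuming single-token mentions and the O/ORG/PER labels used earlier in the talk; `pair_features` is a hypothetical helper and does not reproduce the full feature set of [3]:

```python
def pair_features(tokens, labels, i, j):
    """Features for the mention pair at token positions i < j."""
    between = range(i + 1, j)
    return {
        # Punctuation tokens are skipped, as in the slide's word list.
        "words_between": [tokens[k] for k in between if tokens[k].isalpha()],
        "entity_types": (labels[i], labels[j]),
        # Counts tagged tokens, which equals mentions only for one-token mentions.
        "mentions_between": sum(1 for k in between if labels[k] != "O"),
    }


tokens = "Research at Stanford led to a search engine company , founded by Page and Brin .".split()
labels = "O O ORG O O O O O O O O O PER O PER O".split()

stanford_page = pair_features(tokens, labels, 2, 12)
```

The resulting dictionary would then be vectorised and passed to a multi-class classifier that predicts the relation type, or "no relation".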


Expand Knowledge Base

•  Multi-instance, multi-label [4,5].
•  Distant supervision.

Freebase: (Larry Page, Sergey Brin), relation-level label

Research at Stanford led to a search engine company, founded by Page and Brin. (mention-level label: ?)

Larry Page and Sergey Brin explained why they just created Alphabet. (mention-level label: ?)

MAP [3]: 56%
MAP [4]: 66%
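Distant supervision, as outlined above, copies the relation-level label from the knowledge base onto every sentence that mentions both entities. A minimal sketch under loud assumptions: the `KB` and `ALIASES` dictionaries are hypothetical stand-ins for Freebase and for the output of entity disambiguation, the relation name "co-founders" is taken from a later slide of this talk, and mentions are found by naive substring matching.

```python
# Hypothetical knowledge base: entity pair -> relation-level labels.
KB = {("Larry Page", "Sergey Brin"): ["co-founders"]}

# Hypothetical surface forms; a real pipeline would use entity disambiguation.
ALIASES = {
    "Larry Page": ["Larry Page", "Page"],
    "Sergey Brin": ["Sergey Brin", "Brin"],
}


def mentions(entity, sentence):
    """True if any known surface form of the entity occurs in the sentence."""
    return any(alias in sentence for alias in ALIASES[entity])


def distant_labels(kb, sentences):
    """Assign each KB pair's labels to every sentence mentioning both entities."""
    labelled = []
    for sentence in sentences:
        for (e1, e2), relations in kb.items():
            if mentions(e1, sentence) and mentions(e2, sentence):
                labelled.append((sentence, (e1, e2), relations))
    return labelled


sentences = [
    "Research at Stanford led to a search engine company, founded by Page and Brin.",
    "Larry Page and Sergey Brin explained why they just created Alphabet.",
]
training_data = distant_labels(KB, sentences)
```

Both sentences receive the same label even though they may not both express it. This noise in the mention-level labels is what multi-instance, multi-label models are designed to handle.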


Open Information Extraction

•  Extract triples of any relations from the web [6].

It was exactly 50 years ago today that Bob Dylan walked into Studio A at Columbia Records in New York and recorded “Like a Rolling Stone”.

(“Bob Dylan”, “record”, “Like a rolling stone”)

•  Optional: link triples to knowledge bases.

[Figure: knowledge-base nodes The_Dark_Knight and Like_a_Rolling_Stone; relation record]

F1 [6]: 19.6%
F1 [9]: 28.3%


Harvest Domain-Specific Knowledge

•  Deep learning.
–  Learn cross-domain features.
–  Minimize training data.
•  Transfer learning.

Source domain: newswire
Target domain: nurse handovers


Word Representation

•  One-hot representation.

stanford    [ 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
university  [ 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
oxford      [ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
conference  [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 ]
talk        [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0 ]

•  Distributed representation.

Each word (stanford, university, oxford, conference, talk) = a dense vector such as [0.01, 0.3, -0.5, 0.6]
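The practical difference between the two representations shows up when comparing words. In the sketch below the one-hot positions follow the slide, while the dense vectors are illustrative assumptions: only [0.01, 0.3, -0.5, 0.6] appears in the talk (and the slide does not say which word it belongs to); the other two are made up so that related words point in similar directions.

```python
import math


def one_hot(position, size=14):
    """Vector of zeros with a single 1 at the word's vocabulary position."""
    return [1 if k == position else 0 for k in range(size)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot: any two different words are orthogonal.
stanford_1h, oxford_1h = one_hot(2), one_hot(0)
one_hot_similarity = cosine(stanford_1h, oxford_1h)  # 0.0

# Dense vectors (hypothetical values apart from the first).
dense = {
    "stanford": [0.01, 0.3, -0.5, 0.6],
    "oxford": [0.05, 0.25, -0.45, 0.55],
    "talk": [-0.6, 0.1, 0.4, -0.2],
}
close = cosine(dense["stanford"], dense["oxford"])
far = cosine(dense["stanford"], dense["talk"])
```

With one-hot vectors every pair of distinct words has similarity 0, so nothing learned about "stanford" carries over to "oxford". With dense vectors, similar words can share statistical strength.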


Distributed Representation


Apply Distributed Representations for NER

[Figure: feature matrix for the tokens "compare stanford university and" with labels o, UNI, UNI, o; "oxford" is also labelled UNI. Positions shown: current word, first word to the right, 2nd word to the right, first word to the left, 2nd word to the left.]

Represent words based on positions rather than IDs.
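One common way to represent a word by the positions around it is to concatenate the embeddings of the words in a fixed window. The sketch below assumes that reading of the figure; `window_vector`, the two-dimensional embeddings, and the zero-padding of missing or unknown words are all illustrative choices, not taken from the talk.

```python
def window_vector(tokens, i, embeddings, dim, size=2):
    """Concatenate embeddings of tokens[i-size .. i+size], zero-padding gaps."""
    vec = []
    for j in range(i - size, i + size + 1):
        word = tokens[j].lower() if 0 <= j < len(tokens) else None
        # Positions outside the sentence and unknown words map to zeros.
        vec.extend(embeddings.get(word, [0.0] * dim))
    return vec


# Hypothetical 2-dimensional embeddings.
embeddings = {
    "compare": [1.0, 0.0],
    "stanford": [0.2, 0.9],
    "oxford": [0.25, 0.85],
    "university": [0.3, 0.8],
    "and": [0.0, 1.0],
}

stanford_row = window_vector("compare stanford university and".split(), 1, embeddings, dim=2)
oxford_row = window_vector("compare oxford university and".split(), 1, embeddings, dim=2)
```

Because "stanford" and "oxford" have nearby embeddings, the two rows differ only slightly in the slots for the current word, which is what lets a classifier trained on one generalise to the other.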


Results of Named Entity Recognition [7]

•  Reduce the amount of training data.
•  Tiny differences between word embeddings.


NER for Novel Named Entity Types

•  Goals:
–  Minimize labeled training data.
–  Leverage existing resources:
   •  Labeled corpora.
   •  Unlabeled text.
   •  Existing knowledge bases.

[Figure: entity types person, organization, location, doctor, corporation, city, patient, hotel, country, arranged between source domain and target domain]


Experimental Results on I2B2


Learn Text Representations for Relations

•  Unsupervised pre-training.
•  Distant supervision.

Freebase: (Larry Page, Sergey Brin), relation: co-founders

Research at Stanford led to a search engine company, founded by Page and Brin. (inferred mention-level label)

Larry Page and Sergey Brin explained why they just created Alphabet. (inferred mention-level label)


NICTA Deep Learning for IE Toolkit

•  A fully integrated deep learning toolkit for NLP.
–  Pipelines include both NLP preprocessing and DL components.
–  Written in Scala/Java.
–  Easy to write new ML component.
–  Reuse UIMA NLP components.

•  Scalable.
–  Easy switch between GPUs and CPUs.
–  Learning on GPUs.
–  Make use of UIMA for prediction.


References

•  [1] Finkel, Jenny Rose, Trond Grenager, and Christopher Manning. "Incorporating non-local information into information extraction systems by Gibbs sampling." Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), 2005.

•  [2] Nguyen, Dat Ba, et al. "AIDA-light: High-throughput named-entity disambiguation." Linked Data on the Web at WWW2014 (2014).

•  [3] Chan, Yee Seng, and Dan Roth. "Exploiting background knowledge for relation extraction." Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010.

•  [4] Surdeanu, Mihai, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. "Multi-instance multi-label learning for relation extraction." Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012.

•  [5] Riedel, Sebastian, et al. "Relation extraction with matrix factorization and universal schemas." (2013).

•  [6] Schmitz, Michael, et al. "Open language learning for information extraction." Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012.

•  [7] Qu, Lizhen, et al. "Big Data Small Data, In Domain Out-of Domain, Known Word Unknown Word: The Impact of Word Representation on Sequence Labelling Tasks." arXiv preprint arXiv:1504.05319 (2015).

•  [8] Ando, Rie Kubota, and Tong Zhang. "A framework for learning predictive structures from multiple tasks and unlabeled data." Journal of Machine Learning Research 6 (2005): 1817–1853.

•  [9] Angeli, Gabor, Melvin Johnson Premkumar, and Christopher D. Manning. "Leveraging linguistic structure for open domain information extraction."


Resources

•  YAGO: http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago
•  DBpedia: http://wiki.dbpedia.org/
•  Alchemy: http://querybuilder.alchemyapi.com/builder
•  Deep learning: http://www.deeplearning.net/
•  Word2vec: https://code.google.com/p/word2vec/
•  Mallet (Java): http://mallet.cs.umass.edu/
•  Factorie (Scala): http://factorie.cs.umass.edu/
•  Stanford CoreNLP: http://nlp.stanford.edu:8080/corenlp/
•  NLP conferences.
–  ACL, EMNLP, COLING, NAACL, EACL …
•  NLP online courses.
–  https://www.coursera.org/course/nlangp
–  https://www.youtube.com/playlist?list=PL6397E4B26D00A269
•  ML online courses.
–  https://www.coursera.org/course/ml
–  https://www.coursera.org/course/neuralnets
–  http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial