95
Information extraction 9. Applications Simon Razniewski Winter semester 2019/20

Information extraction - Max Planck Institute for Informatics

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Information extraction - Max Planck Institute for Informatics

Information extraction

9. Applications

Simon Razniewski

Winter semester 2019/20

Page 2: Information extraction - Max Planck Institute for Informatics

Announcements

• Results assignment 8

• Evaluation (w/ break in middle)

• Tentative exam schedule

• 15 minutes/exam

• Sample questions today’s lab

• Conflicts? Forum

2

Page 3: Information extraction - Max Planck Institute for Informatics

Assignment 8 - Sample rules

3

Page 4: Information extraction - Max Planck Institute for Informatics

Outline

1. Academic projects

• Scraping and Harvesting

• Pattern-based text extraction and OpenIE

2. Industrial Knowledge Bases

3. Knowledge Base Question Answering

4. Semantic Web

4

Page 5: Information extraction - Max Planck Institute for Informatics

DBpedia (2007)

• Large-scale Wikipedia infobox+category scraping

• Manually designed mappings to consolidate synonymous

attributes

• See lecture/assignment 3

• Multilingual

• No persistent IDs

• For long considered the “core” of Semantic Web (see later)

• Data access

• Per entity: http://dbpedia.org/page/Max_Planck_Institute_for_Informatics

• SPARQL endpoint:

• http://dbpedia.org/snorql/?query=SELECT+%3Fitem+WHERE+%7B%0D%0A%3Fi

tem+dbo%3AalmaMater+dbr%3ASaarland_University%0D%0A%7D

• Data dumps

• https://wiki.dbpedia.org/develop/datasets

• https://wiki.dbpedia.org/downloads-2016-10

5

Page 6: Information extraction - Max Planck Institute for Informatics

YAGO (2007)

• Precision-oriented Wikipedia infobox+category extraction

• Subset of 76 important relations, cleaning steps (>95%

precision)

• Much focus on type extraction from categories

• “French writers” “Writer” + “French person”

• WordNet disambiguation and linking

• Data access

• Per-entity access: https://gate.d5.mpi-

inf.mpg.de/webyago3spotlx/Browser

• Or https://gate.d5.mpi-

inf.mpg.de/webyago3spotlxComp/SvgBrowser/

• SPARQL access: (currently down)

• Data dumps: https://www.mpi-

inf.mpg.de/departments/databases-and-information-

systems/research/yago-naga/yago/downloads/

6

Page 7: Information extraction - Max Planck Institute for Informatics

BabelNet (2012)

7

Focus on general terms, sense disambiguation, instead of

named entities

http://babelfy.org/

Page 8: Information extraction - Max Planck Institute for Informatics

Wikidata (2012)

• Largely supersedes YAGO and DBpedia

• Not itself built using automated IE techniques

• Community generally disapproves of automated extraction

• Isolated projects, e.g. https://github.com/google/sling

• https://www.wikidata.org/wiki/User:Anders-sandholm

• Nonetheless highly important for IE

• Disambiguation reference

• Training data source (distant supervision)

• Data access:

• SPARQL: https://w.wiki/DKU

• Individual entities: https://www.wikidata.org/wiki/Q565400

• JSON:

https://www.wikidata.org/wiki/Special:EntityData/Q565400.json

• Dumps:

https://www.wikidata.org/wiki/Wikidata:Database_download

• ~65 GB zipped

8

Page 9: Information extraction - Max Planck Institute for Informatics

Outline

1. Academic projects

• Scraping and Harvesting

• Pattern-based text extraction and OpenIE

2. Industrial Knowledge Bases

3. Knowledge Base Question Answering

4. Semantic Web

9

Page 10: Information extraction - Max Planck Institute for Informatics

10

(2010)

327 manually

designed relations

each with a few

curated training

examples

Sales point: Continuous nature of extraction and learning

http://rtw.ml.cmu.edu/rtw/

Page 11: Information extraction - Max Planck Institute for Informatics

11

Page 12: Information extraction - Max Planck Institute for Informatics

ReVerb/OpenIE 4.0

• Knowledge base built using open information

extraction

• 5 billion extractions from general web crawls

• https://openie.allenai.org/

• (previous lecture)

12

Page 13: Information extraction - Max Planck Institute for Informatics

13

Page 14: Information extraction - Max Planck Institute for Informatics

Outline

1. Academic projects

• Scraping and Harvesting

• Pattern-based text extraction and OpenIE

2. Industrial Knowledge Bases

3. Knowledge Base Question Answering

4. Semantic Web

14

Page 15: Information extraction - Max Planck Institute for Informatics

Industrial projects

15

- Baidu

Page 16: Information extraction - Max Planck Institute for Informatics

Google Knowledge Vault (2014)

• Ambitious project combining text extraction,

semistructured extraction, and predictive models

• See lecture 8

• Usage status unknown

16

[Dong, Xin, et al. "Knowledge vault: A

web-scale approach to probabilistic

knowledge fusion, KDD 2014]

Page 17: Information extraction - Max Planck Institute for Informatics

17

• https://developers.google.com/knowledge-graph

• Wikidata noise copied (see lecture 3)

(since ~2012)

Page 18: Information extraction - Max Planck Institute for Informatics

18

Page 19: Information extraction - Max Planck Institute for Informatics

19

Page 20: Information extraction - Max Planck Institute for Informatics

20

[Noy, Natasha, et al. "Industry-scale Knowledge Graphs:

Lessons and Challenges." Queue 17.2 (2019): 20]

Page 21: Information extraction - Max Planck Institute for Informatics

21

Page 22: Information extraction - Max Planck Institute for Informatics

22

Page 23: Information extraction - Max Planck Institute for Informatics

23

Watson outperformed the 74-fold human

winner in the Jeopardy quiz show

Page 24: Information extraction - Max Planck Institute for Informatics

24

Page 25: Information extraction - Max Planck Institute for Informatics

Baidu

• Non-English languages traditionally underrepresented

• Open (academic) solutions:

• Zhishi.me: Chinese-language equivalent of DBpedia

• Based on Baidu Baike, Hudong Baike, Chinese Wikipedia

• Xlore: English-Chinese alignment KB

• Baidu has apparently three internal knowledge graphs

• https://www.mdpi.com/2071-1050/10/9/3245/htm

• Huawei building a knowledge graph?

• https://www.huawei.com/en/press-

events/news/2019/9/atlas-series-products-cloud-services-

all-scenario-ai-solutions

25

Page 26: Information extraction - Max Planck Institute for Informatics

Outline

1. Academic projects

• Scraping and Harvesting

• Pattern-based text extraction and OpenIE

2. Industrial Knowledge Bases

3. Knowledge Base Question Answering

4. Semantic Web

26

Page 27: Information extraction - Max Planck Institute for Informatics

Evaluation

27

Page 28: Information extraction - Max Planck Institute for Informatics

Advertisement: Thesis topics

• http://simonrazniewski.com/#theses

1. Social commonsense knowledge extraction

• “Americans like guns, Germans speed on highways, Japanese bow for greetings”

• Treasure of world knowledge, yet risk of bias and prejudice

• Context: IE, ML

2. Commonsense extraction from children (audio)books

• Does infant content make commonsense more explicit?

• Context: IE

3. Stability and completeness prediction in Wikidata

• What information is complete, and which one is stable?

• Context: Data management, Machine Learning

4. Topical image representativeness and coverage

• What’s missing in an image collection?

• Context: Data management, (Computer Vision)

5. Knowledge-grounded story generation

• Can structured knowledge yield better stories?

• Context: Text generation

• Limited availability

For planning please write me till Feb 17

(actual start flexible)

28

Page 29: Information extraction - Max Planck Institute for Informatics

Outline

1. Scraping and Harvesting

• DBpedia, Yago, BabelNet, (Wikidata)

2. Pattern-based text extraction and OpenIE

• NELL and ReVerb

3. Industrial Knowledge Bases

4. Knowledge Base Question Answering

5. Semantic Web

29

Page 30: Information extraction - Max Planck Institute for Informatics

30

Question answering: Vital for information access

★ Direct answers to questions

★ Saves time and effort

★ Natural in voice UI

Page 31: Information extraction - Max Planck Institute for Informatics

31

Question answering: Vital for information access

Page 32: Information extraction - Max Planck Institute for Informatics

32

Approaches to question answering

• Traditional IR-style approach: Match question with

text phrases in documents

• “What is the capital of Belgium”

• “Brussels is the capital of Belgium”

• Works only for simple questions

• Misses additional conditions

• Google, Siri, Echo et al.

• Precision much more important than recall

• Answer origin needs to be debuggable/explainable

Question answering from structured sources much preferred

Page 33: Information extraction - Max Planck Institute for Informatics

33

Question answering is a hot topic

★ QA over knowledge graphs [Abujabal et al. 2018]

★ Reading comprehension QA [Reddy et al. 2018]

★ Visual and multimodal QA [Lu et al. 2016]

★ Community QA [Hoogeveen et al. 2018]

★ Passage retrieval and sentence selection [Shen et al. 2018]

★ Non-factoid: Causal, procedural, ...

Page 34: Information extraction - Max Planck Institute for Informatics

34

QA over Knowledge Graphs

Which Oscar nominations did Nolan

receive?

ChristopherNolan

gender

Male Director

type

BestDirector

nominatedFor

Inception

directed

AcademyAwardtype

<ChristopherNolan, gender, Male>

<ChristopherNolan, type, Director>

<ChristopherNolan, directed, Inception>

<ChristopherNolan, nominatedFor,

BestDirector>

<BestDirector, type, AcademyAward>

<ChristopherNolan, birthDate, 30 July 1970>birthDate 30 July 1970

Page 35: Information extraction - Max Planck Institute for Informatics

<ChristopherNolan, gender, Male>

<ChristopherNolan, type, Director>

<ChristopherNolan, directed, Inception>

<ChristopherNolan, nominatedFor, BestDirector>

<BestDirector, type, AcademyAward>

<ChristopherNolan, birthDate, 30 July 1970>

35

QA over Knowledge Graphs

Which Oscar nominations did Nolan

receive?

ChristopherNolan

?ANS

nominatedFor

AcademyAwardtype

SELECT ?ANSWHERE {ChristopherNolan nominatedFor ?ANS .?ANS type AcademyAward }

BestDirector

SPARQL

Page 36: Information extraction - Max Planck Institute for Informatics

36

Generalizing QA

★ If we can answer:

○ What are the Oscar award nominations of Nolan?

★ Then we should be able to answer:

○ What are the Cannes award nominations of Ryan Coogler?

○ Which Oscar award nominations did Nolan receive?

Same syntax!

Same semantics!

Page 37: Information extraction - Max Planck Institute for Informatics

★ Interpretable

Template-based Question Answering

37

Question

Who is Inception’s director?

Question template

Who is <NOUN1>’s <NOUN2>?

Query

?ANS director Inception

Query template

?ANS <PRED1> <ENT1>

1 SPARQL

triple pattern

Page 38: Information extraction - Max Planck Institute for Informatics

★ Generalizes to new domains

Template-based Question Answering

38

Question

Who is Inception’s director?

Question template

Who is <NOUN1>’s <NOUN2>?

Query

?ANS director Inception

Query template

?ANS <PRED1> <ENT1>

Who is Libya’s president?

Who is Messi’s manager?

1 SPARQL

triple pattern

Page 39: Information extraction - Max Planck Institute for Informatics

★ Generalizes to new domains

Template-based Question Answering

39

Question

Who plays the role of Cobb in

Inception?

Question template

Who <VERB> <DT> <NOUN>

<PREP> <NOUN>?

Query

?ANS playsIn Inception

?ANS role Cobb

Query template

?ANS <PRED1> <ENT1>

?ANS <PRED2> <ENT2>

2 SPARQL

triple patterns

Page 40: Information extraction - Max Planck Institute for Informatics

★ Hand-crafted by experts

(Fader et al. 2014; Unger et al. 2013)

★ Low coverage

★ Solution: Learn templates

○ Question templates

○ Query templates

○ Slot alignments

Challenges with templates

40

Page 41: Information extraction - Max Planck Institute for Informatics

Question: What are the Oscar award nominations of Nolan?

Dependency parse

Question template (labeled nodes and edges)

41

Dependency-parse-based templates

Page 42: Information extraction - Max Planck Institute for Informatics

Query: ChristopherNolan nominatedFor ?VAR .

?VAR awardTitle ?ANS .

?ANS type AcademyAward

Graphical Query

Query template

42

Graphical query templates

ChristopherNolan nominatedFor

awardTitle

?VAR

?ANStypeAcademyAward

ENT PRED1

PRED2

?VAR

?ANStypeTYPE

Page 43: Information extraction - Max Planck Institute for Informatics

43

Slot Alignments

ENT PRED1 PRED2?VAR ?ANS type TYPE

Page 44: Information extraction - Max Planck Institute for Informatics

Template-based question answering

Question-Query

log

TrainTemplate-based

answering

Template

bank

Match

New question

44

No answer found

Similarity-based

answering

Match

User feedback on

answers Add

QA pairs

Generalize

Page 45: Information extraction - Max Planck Institute for Informatics

45

Training template-based QA

★ Collecting question-query pairs difficult

★ Start with question-answer pairs instead

★ Create queries by distant supervision

★ Generalize to create slot-aligned templates

Page 46: Information extraction - Max Planck Institute for Informatics

46

ChristopherNolan nominatedFor

awardTitle

Award

BestDirector

type

AcademyAward

★ Retain shortest path between question and

answer entities

★ Retain answer type information

Question: What are the Oscar award nominations of Nolan?

Answer: Best Director

Distant supervision from Q-A pairs

Page 47: Information extraction - Max Planck Institute for Informatics

Question: What are the Oscar award nominations of Nolan?

Answer: Best Director

Query: ChristopherNolan nominatedFor ?VAR .

?VAR awardTitle ?ANS .

?ANS Type AcademyAward

47

ChristopherNolan nominatedFor

awardTitle

?VAR

?ANS

type

AcademyAward

Distant supervision from Q-A pairs

Page 48: Information extraction - Max Planck Institute for Informatics

Question-schema alignment

4848

nominatedFor AcademyAwardawardTitle

oscar nominations oscarnominations ofwhat nominations

award nominationsaward

oscar award

oscar award nominations nominationswhat are

Question

KB schema

Page 49: Information extraction - Max Planck Institute for Informatics

★ Bipartite graph with edge weights (Yahya et al. 2012)

★ Weights from lexicons LP

and LT

(Abujabal et al. 2017,

Berant and Liang 2013)

Create Candidate Alignments

4949

nominatedFor AcademyAwardawardTitle

oscar nominations oscarnominations ofwhat nominations

award nominationsaward

oscar award

oscar award nominations nominationswhat are

0.90.5

0.7

Page 50: Information extraction - Max Planck Institute for Informatics

Create Candidate Alignments

5050

nominatedFor AcademyAwardawardTitle

oscar nominations oscarnominations ofwhat nominations

award nominationsaward

oscar award

oscar award nominations nominationswhat are

0.90.5

0.7

Phrase KG Predicate Weight

nominee for nominatedFor 0.8

nominations of nominatedFor 0.9

oscar nominations nominatedFor 0.5

Phrase KG Type Weight

Academy Award AcademyAward 0.9

Oscar AcademyAward 0.7

Oscar Award AcademyAward 0.8

Page 51: Information extraction - Max Planck Institute for Informatics

★ Best alignment of items with Integer Linear Program (ILP)

★ At least/at most constraints

★ Type coherence

Optimal Mapping via ILP

51

nominatedFor AcademyAwardawardTitle

oscar nominations oscarnominations ofwhat nominations

award nominationsaward

oscar award

oscar award nominations nominationswhat are

0.90.5

0.7

Page 52: Information extraction - Max Planck Institute for Informatics

★ Best alignment of items with Integer Linear Program (ILP)

★ At least/at most constraints

★ Type coherence

Optimal Mapping via ILP

5252

nominatedFor AcademyAwardawardTitle

oscarnominations of

award

Page 53: Information extraction - Max Planck Institute for Informatics

53

Apply Alignment to Question-Query

ChristopherNolan nominatedFor awardTitle?VAR ?ANS

type

AcademyAward

Page 54: Information extraction - Max Planck Institute for Informatics

Answering with templates

New question: What are the Cannes award nominations of Ryan

Coogler?

54

Match Template

bank

Page 55: Information extraction - Max Planck Institute for Informatics

Instantiating Queries

What are the Cannes award nominations of Ryan Coogler?

Query template 1 Query template 2 ...

...Question template 1 Question template 2

Query 1 Query 2 ...

RyanCoogler nominated

?VAR .

?VAR awardTitle ?ANS .

?ANS Type CannesAward

Rank queries with learning to

rank and execute best query55

RyanCoogler awarded

?VAR .

?VAR awardTitle ?ANS .

?ANS Type GoldenGlobe

Page 56: Information extraction - Max Planck Institute for Informatics

★ So far, assumed all answers were correct: Pseudo-relevance

★ Pseudo-relevance degrades quality

★ Users provide feedback on answers

Question: Which Oscar nominations did Nolan receive?

Answer: Best Director

User:

★ Positive feedback:

○ Learn new template from question-query

○ Add new question-query to log

○ Update learning-to-rank model

Closing the Loop with User Feedback

56

Page 57: Information extraction - Max Planck Institute for Informatics

Outline

1. Academic projects

• Scraping and Harvesting

• Pattern-based text extraction and OpenIE

2. Industrial Knowledge Bases

3. Knowledge Base Question Answering

4. Semantic Web

57

Page 58: Information extraction - Max Planck Institute for Informatics

58

Page 59: Information extraction - Max Planck Institute for Informatics

59

Page 60: Information extraction - Max Planck Institute for Informatics

60

Page 61: Information extraction - Max Planck Institute for Informatics

61

Page 62: Information extraction - Max Planck Institute for Informatics

62

Page 63: Information extraction - Max Planck Institute for Informatics

63

Reminder: RDF

Page 64: Information extraction - Max Planck Institute for Informatics

64

Page 65: Information extraction - Max Planck Institute for Informatics

65

Page 66: Information extraction - Max Planck Institute for Informatics

66

Page 67: Information extraction - Max Planck Institute for Informatics

67

Page 68: Information extraction - Max Planck Institute for Informatics

68

Page 69: Information extraction - Max Planck Institute for Informatics

69

Page 70: Information extraction - Max Planck Institute for Informatics

70

Page 71: Information extraction - Max Planck Institute for Informatics

71

RDF and RDFS vocabularies

Page 72: Information extraction - Max Planck Institute for Informatics

72

Page 73: Information extraction - Max Planck Institute for Informatics

73

Page 74: Information extraction - Max Planck Institute for Informatics

74

Page 75: Information extraction - Max Planck Institute for Informatics

75

Page 76: Information extraction - Max Planck Institute for Informatics

76

Page 77: Information extraction - Max Planck Institute for Informatics

77

https://www.wikidata.org/wiki/Special:EntityData/Q565400.rdf

Page 78: Information extraction - Max Planck Institute for Informatics

78

Page 79: Information extraction - Max Planck Institute for Informatics

79

Page 80: Information extraction - Max Planck Institute for Informatics

80

Interlinking on the Semantic Web

Page 81: Information extraction - Max Planck Institute for Informatics

81

Page 82: Information extraction - Max Planck Institute for Informatics

82

Page 83: Information extraction - Max Planck Institute for Informatics

83

Page 84: Information extraction - Max Planck Institute for Informatics

84

Page 85: Information extraction - Max Planck Institute for Informatics

85

Page 86: Information extraction - Max Planck Institute for Informatics

86

https://suchanek.name/

index.html#contact

Page 87: Information extraction - Max Planck Institute for Informatics

87

• JSON-LD

Page 88: Information extraction - Max Planck Institute for Informatics

88

Page 89: Information extraction - Max Planck Institute for Informatics

89

Page 90: Information extraction - Max Planck Institute for Informatics

90

Page 91: Information extraction - Max Planck Institute for Informatics

91

Page 92: Information extraction - Max Planck Institute for Informatics

92

Page 93: Information extraction - Max Planck Institute for Informatics

References

• Selected references

• Further reading

• qa.mpi-inf.mpg.de

• Slides

• Adapted from Fabian Suchanek and Rishiraj Saha Roy

93

Page 94: Information extraction - Max Planck Institute for Informatics

Assignment 9

• No assignment

• Tutorial today: Exam questions

94

Page 95: Information extraction - Max Planck Institute for Informatics

Take home

• IE important tool for building structured knowledge

• Wikipedia popular resource

• Free text extraction harder but possible

• KBs in widespread use in tech companies

• Actual methods guarded secrets

• Source of data not always known

• Signature application: Question answering

• Challenge: From unstructured user question to structured KB query

• Semantic web: Vision of interlinked and machine-readable

internet

• Schema reuse essential for (simple) machine-readability

95