An Application-Oriented View of Automatic Tagging and Information Extraction
Katharina Morik
Technische Universität Dortmund, Fakultät für Informatik, LS 8



Overview

Handling texts: overview
Mark-up languages: services based on annotated texts
Automatic tagging: from layout information to tags
Named entity recognition: data-intensive approach
    Counting in a very large unlabeled corpus
    Turning frequencies into features
    Compiling sequences into features


Handling Texts

Granularity: hypertext structure, text, paragraph, word, letters

Learning mode: batch, incremental

Learning goal: adapted organization, classification or clustering, syntactic or semantic structures

Application tasks: personalization, optimization of information access, integration in business processes, reporting

Work by learning goal (rows) and granularity (columns: Hypertext, Text, Paragraph, Word):
Adaptation: Alesker, Joachims, Neifach, Veltmann, Hüppe, Mintert, Thomas, Helbig
Extraction: Rössler
Clustering: Schewe, Wurst
Classification: Joachims, Klinkenberg
Marked in the grid: this talk


Intelligent Publishing Using Mark-Ups

Search qualified by semantic category

Self-contained parts of text (atoms) as search result

Composition of one's own text

Presentation according to semantic category

IP4W3 system by Stefan Mintert, 1999

[Figure: IP4W3 workflow. Labels: User, Webserver, Text. Query: category + word. Result: list of atoms. Steps: Search, Selection, Composition, Presentation.]


Qualified search


Presentation of Results


Text Composition


Selected results from 2 Queries combined


Applications: e-Learning, e-Publishing

Intelligent publication on the web: users customize the material to their own needs. "IP4W3", Stefan Mintert 1999, Dortmund

Course material for different groups: from a central repository of presentations or texts, courses are designed for special interests. "Slicing Books", Ingo Dahm 2001, Koblenz-Landau

Additional sequence information allows courses to be tailored to learning types, e.g., top-down from definition to application, or bottom-up from application to definition. Moritz Thomas 1999, Dortmund


Behind the Curtain

Mark-up editor
Editor for defining qualified search

    <!-- search pattern -->
    <element type="block">
      <any>
        <target-element type="definition"></target-element>
      </any>
    </element>

fits

    <block>
      <em>
        <definition>Characters</definition> are the atomic unit of texts according to ISO/IEC 10646. …
      </em>
    </block>
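The matching of such a pattern against annotated text could look roughly as follows. This is a minimal sketch that assumes the annotated text is well-formed XML and uses Python's standard `xml.etree.ElementTree`; the function name `qualified_search`, its parameters, and the sample document are hypothetical and not taken from IP4W3.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated text; element names follow the slide's example.
DOC = """
<text>
  <block><em><definition>Characters</definition> are the atomic
  unit of texts according to ISO/IEC 10646.</em></block>
  <block><example>The letter A is a character.</example></block>
</text>
"""

def qualified_search(root, container, target, word=None):
    """Return `container` elements that hold a `target` element at any depth.

    This mirrors the pattern element/any/target-element. If `word` is
    given, it must occur in the text of the target element.
    """
    hits = []
    for block in root.iter(container):
        for node in block.iter(target):
            text = "".join(node.itertext())
            if word is None or word.lower() in text.lower():
                hits.append(block)
                break  # one matching target is enough for this atom
    return hits

root = ET.fromstring(DOC.strip())
atoms = qualified_search(root, "block", "definition", word="characters")
print(len(atoms), "atom(s) found")
```

The query combines a semantic category (the target element) with a word, and the result is the list of enclosing atoms, as in the query and result described for the system.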

[Figure: components behind the web server. Labels: Administrator, Author, DTD/Schema, Search patterns, Style sheets, Annotated text, Webserver. Callout: Bottleneck!]


Automatic tagging

WISDOM++, Univ. Bari: from scanned texts to blocks to XML tags; classification of blocks by C4.5. Altamura, Esposito, Malerba 2000

ADT, Univ. Dortmund: from RTF annotation to XML tags; classification by C4.5. Christian Hüppe 2003


ADT – Input Document


ADT – Manual Annotation of Examples


ADT – Attributes of Examples

RTF control words: presence of each control word in the current and the preceding paragraph
    ff: neither in this nor in the preceding paragraph
    ft: not in this but in the preceding paragraph
    tf: in this but not in the preceding paragraph
    tt: in both this and the preceding paragraph
Value of indentation in the current and the preceding paragraph
First and second word of the paragraph
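The attribute encoding can be illustrated with a short sketch. The dictionary layout for a paragraph (keys `controls`, `indent`, `text`) and both function names are assumptions made for this example; the slides do not show how ADT represents paragraphs internally.

```python
def presence_code(control_word, current, preceding):
    """Encode presence of an RTF control word as ff, ft, tf, or tt.

    First letter: present in the current paragraph.
    Second letter: present in the preceding paragraph.
    """
    cur = "t" if control_word in current else "f"
    prev = "t" if control_word in preceding else "f"
    return cur + prev

def paragraph_attributes(current, preceding, control_words):
    """Build the attribute dictionary for one paragraph.

    `current` and `preceding` are dicts with the hypothetical keys
    'controls' (set of control words), 'indent' (int), 'text' (str).
    """
    words = current["text"].split()
    attrs = {
        cw: presence_code(cw, current["controls"], preceding["controls"])
        for cw in control_words
    }
    attrs["indent"] = current["indent"]
    attrs["prev_indent"] = preceding["indent"]
    attrs["first_word"] = words[0] if words else ""
    attrs["second_word"] = words[1] if len(words) > 1 else ""
    return attrs

# Illustrative paragraphs: \b (bold), \i (italic), \ul (underline).
prev = {"controls": {"\\b"}, "indent": 0, "text": "1 Introduction"}
cur = {"controls": {"\\b", "\\i"}, "indent": 360,
       "text": "Definition 1 A character is"}
attrs = paragraph_attributes(cur, prev, ["\\b", "\\i", "\\ul"])
print(attrs)
```

Each paragraph then becomes one example for the decision tree learner, with the manually assigned tag as its class.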


ADT -- Learning


ADT – Classification of Paragraphs

No. of examples for each class    F-measure
1                                 41 %
2                                 94 %
3                                 98.3 %
4                                 99.68 %

9 classes (tags), 159 paragraphs


Application Options

Combination of ADT and IP4W3 offers qualified search, tailored courseware, and enhanced e-learning without tedious annotation by the author or administrator.

Semantic information within paragraphs cannot be captured.

Named entity recognition necessary!


Named Entity Recognition

Classification of single words into given semantic categories (e.g., person, location, date).

A phrase of a category is a sequence of words with the same label.

Features of a word:
    Linguistic features (e.g., part of speech)
    Letters (e.g., beginning with an upper-case letter)
    Word length
    N-grams

Knowledge-intensive vs. data-intensive approaches:
    Linguistic rules
    Examples
    Unlabeled text (corpus)

Training time and classification time depend on the size of the training and test sets.


The Task

Biomedical task on 22 000 word forms (JNLPBA)
    472 000 labeled occurrences for training
    54 173 occurrences for testing
    100 million word forms from Medline as background

German corpus, 33 000 word forms (CoNLL)
    220 189 labeled occurrences for training
    54 173 occurrences for testing
    40 million word forms from Frankfurter Rundschau as background

Fast learning and classification necessary!


Data-intensive Approach -- Marc Rössler

Knowledge-poor:
    No linguistic knowledge
    No given word lists
    No hand-written rules

Use of very large given corpora:
    Distribution of word occurrence in corpus
    Frequencies of words
    Frequencies of word sequences

Bootstrapping of features:
    1. Learn classifiers from examples
    2. Apply classifiers to unlabeled corpus
    3. Extract features from the now labeled corpus, enhance examples
    4. Learn classifiers from enhanced examples
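The four bootstrapping steps can be sketched as a generic loop. Everything below is illustrative: the function `bootstrap`, its callable parameters, and the toy "classifier" (a set of known person names) are assumptions for this example and stand in for the SVM training and corpus statistics of the actual system.

```python
from collections import Counter

def bootstrap(examples, corpus, learn, apply_model, extract, enhance, rounds=1):
    """Generic bootstrapping loop following the four steps.

    The callables are placeholders supplied by the caller:
      learn(examples) -> model
      apply_model(model, corpus) -> labeled corpus
      extract(labeled_corpus) -> corpus-based features
      enhance(examples, features) -> examples with added features
    """
    model = learn(examples)                      # step 1
    for _ in range(rounds):
        labeled = apply_model(model, corpus)     # step 2
        features = extract(labeled)              # step 3: extract ...
        examples = enhance(examples, features)   # ... and enhance
        model = learn(examples)                  # step 4
    return model, examples

# Toy stand-ins, only to make the loop runnable.
def learn(examples):
    return {ex["word"] for ex in examples if ex["label"] == "PER"}

def apply_model(model, corpus):
    return [(w, "PER" if w in model else "O") for w in corpus]

def extract(labeled):
    total = Counter(w for w, _ in labeled)
    per = Counter(w for w, lab in labeled if lab == "PER")
    return {w: per[w] / total[w] for w in total}

def enhance(examples, features):
    return [dict(ex, per_ratio=features.get(ex["word"], 0.0)) for ex in examples]

examples = [{"word": "Peter", "label": "PER"}, {"word": "hat", "label": "O"}]
corpus = ["Peter", "hat", "Peter", "Konkurrenz"]
model, enhanced = bootstrap(examples, corpus, learn, apply_model, extract, enhance)
print(enhanced)
```

Passing the steps in as functions keeps the outer loop independent of the learner, which fits the later remark that predictions of other learners can be integrated as features.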


The Base Classifier -- Input

Features:
    1 out of 30 word surface features (e.g., 4-digit number, uppercase only, starting with capital letter)
    Word length
    Positional substrings (at most 8), shown for "Konkurrenz":
        Last character: z
        Last two characters: nz
        Last three characters: enz
        First trigram: Kon
        Second trigram: onk
        …
        Fifth trigram: urr

Window of 3 preceding and 2 succeeding words: Ebenso schnell hat Peter Müllers Konkurrenz

Vector of 60 features for each occurrence
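A short sketch of the positional substrings and the sliding window. The exact layout of the eight substrings (three suffixes plus the first five trigrams) is inferred from the examples given for "Konkurrenz", and the function names are hypothetical.

```python
def positional_substrings(word):
    """Return up to 8 positional substrings of `word`.

    Assumed layout: the last 1, 2, and 3 characters, followed by the
    first five character trigrams.
    """
    suffixes = [word[-n:] for n in (1, 2, 3) if len(word) >= n]
    n_trigrams = min(5, max(len(word) - 2, 0))
    trigrams = [word[i:i + 3] for i in range(n_trigrams)]
    return suffixes + trigrams

def window(tokens, focus, before=3, after=2, pad=""):
    """Tokens from `before` positions left to `after` positions right of `focus`."""
    positions = range(focus - before, focus + after + 1)
    return [tokens[i] if 0 <= i < len(tokens) else pad for i in positions]

sentence = "Ebenso schnell hat Peter Müllers Konkurrenz".split()
win = window(sentence, sentence.index("Peter"))
print(win)
print(positional_substrings("Konkurrenz"))
```

One reading that matches the stated vector size: six window positions with ten features each (one surface feature, word length, eight substrings) give 60 features. That decomposition is an inference from the numbers, not something the slide spells out.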


The Base Classifier -- Output

A classifier $f_c$ is trained for each category $c$ against all others:

$$f_c(\vec{x}) = \vec{w} \cdot \vec{x} + b$$

A classifier $f_{NE}$ is trained for "is a NE" vs. "is no NE":

$$f_{NE}(\vec{x}) = \operatorname{sign}(\vec{w} \cdot \vec{x} + b)$$

Tagging the focus of the sliding window according to

$$\max_c f_c(\vec{x}_j), \quad f_{NE}(\vec{x}_j) > 0$$
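The tagging rule can be sketched as follows, reading it as: pick the category with the highest decision value, but only if the NE classifier fires. The label "O" for "no named entity" and both function names are assumptions for this example.

```python
def decision_value(w, x, b):
    """Linear decision function f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def tag(scores, ne_score, outside="O"):
    """Tag the focus token of the sliding window.

    scores:   dict mapping category -> decision value f_c(x_j)
    ne_score: value of the NE-vs-rest classifier f_NE(x_j)
    Returns the best-scoring category if f_NE(x_j) > 0, else `outside`.
    """
    if ne_score <= 0:
        return outside
    return max(scores, key=scores.get)

scores = {"PER": 0.8, "LOC": -0.2, "ORG": 0.1}
print(tag(scores, ne_score=1.0))
print(tag(scores, ne_score=-1.0))
```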


Corpus-based Features – Internal Evidence

Applying the base classifier to the corpus results in new features.

Membership frequencies: how often a word v was seen as a member of the category c, where v is the token described by the feature vector $\vec{x}$.

All $f_c > 0$ become a feature with the ratio as value.

[Example formulas: features 11 and 12 for the token "Peter" and the category PERSON, given as ratios over 10 occurrences and split by the value of $f_{PERSON}$ at the thresholds 0 and 0.5.]
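Counting membership frequencies over the automatically tagged corpus might look like this. The input format (triples of word, category, decision value) and the function name are assumptions for this sketch; the interval binning of $f_c$ mentioned later is left out.

```python
from collections import defaultdict

def membership_frequencies(tagged):
    """Compute membership ratios from a tagged corpus.

    tagged: iterable of (word, category, f_c) triples, one per occurrence.
    Returns {(word, category): share of occurrences with f_c > 0}.
    Pairs that never reach f_c > 0 are left out, since only positive
    memberships become features.
    """
    seen = defaultdict(int)
    positive = defaultdict(int)
    for word, category, score in tagged:
        seen[(word, category)] += 1
        if score > 0:
            positive[(word, category)] += 1
    return {key: positive[key] / seen[key] for key in positive}

# Illustrative data: "Peter" occurs 10 times, 3 of them classified PERSON.
tagged = ([("Peter", "PERSON", 0.7)] * 3
          + [("Peter", "PERSON", -0.4)] * 7
          + [("hat", "PERSON", -1.0)])
freqs = membership_frequencies(tagged)
print(freqs)
```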


Sequences -- Windows

Sequences of words with the same label are considered one token within the sliding window.

Ebenso  schnell  hat  Peter  Müllers  Konkurrenz
                      P      P

Ebenso  schnell  hat  [Peter Müllers]  Konkurrenz  die
3       2        1    focus            1           2

$$seq_c:\ \operatorname{sign}(f_c(\vec{x}_{j_s - 1})) \neq \operatorname{sign}(f_c(\vec{x}_{j_s})) = \dots = \operatorname{sign}(f_c(\vec{x}_{j_e})) \neq \operatorname{sign}(f_c(\vec{x}_{j_e + 1}))$$
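Merging labeled runs into single window tokens can be sketched with `itertools.groupby`. The label "O" for unlabeled words and the function name are assumptions for this example.

```python
from itertools import groupby

def merge_sequences(tokens, labels, outside="O"):
    """Collapse runs of adjacent tokens that share an entity label."""
    merged = []
    for label, run in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        words = [word for word, _ in run]
        if label == outside:
            merged.extend((word, label) for word in words)  # keep singly
        else:
            merged.append((" ".join(words), label))         # one token
    return merged

tokens = "Ebenso schnell hat Peter Müllers Konkurrenz die".split()
labels = ["O", "O", "O", "P", "P", "O", "O"]
merged = merge_sequences(tokens, labels)

# Window of 3 preceding and 2 succeeding tokens around the merged focus.
focus = [label for _, label in merged].index("P")
window = [word for word, _ in merged[max(focus - 3, 0): focus + 3]]
print(window)
```

Because "Peter Müllers" counts as one token, the window reaches one word further to the right and now includes "die".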


Compiling Sequences into Features

Membership frequencies (how often a word v was seen as the first (last) in a sequence labeled c): internal evidence

Membership frequencies become new features.

Example:
    "Peter": 13: 1st in $seq_{PERSON}$, $\frac{2}{10}$
    "Müller": 14: last in $seq_{PERSON}$, $\frac{2}{10}$
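A sketch of the first/last counts. Dividing by the total number of occurrences of the word is an assumption that matches ratios such as 2/10; the function name and input format are hypothetical.

```python
from collections import Counter

def boundary_frequencies(sequences, occurrences):
    """Share of a word's occurrences as first or last word of a sequence.

    sequences:   list of word lists, each one sequence labeled c
    occurrences: Counter of how often each word occurs in the corpus
    Returns (first_ratio, last_ratio) as two dictionaries.
    """
    first = Counter(seq[0] for seq in sequences if seq)
    last = Counter(seq[-1] for seq in sequences if seq)
    first_ratio = {w: n / occurrences[w] for w, n in first.items()}
    last_ratio = {w: n / occurrences[w] for w, n in last.items()}
    return first_ratio, last_ratio

# Illustrative data: two PERSON sequences, each word seen 10 times overall.
person_sequences = [["Peter", "Müller"], ["Peter", "Müller"]]
occurrences = Counter({"Peter": 10, "Müller": 10})
first_ratio, last_ratio = boundary_frequencies(person_sequences, occurrences)
print(first_ratio, last_ratio)
```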


Corpus-based Features – External Evidence

Context frequencies (how often a sequence $seq_c$ was preceded or succeeded by certain words)

A sequence s preceding $seq_c$ is written $seq_{preC}$.

Contexts with relative frequency > 0.01 become features of the preceding words in the sliding window.

[Example formulas: "hat" as first word and "Ebenso" as third word in $seq_{prePERSON}$, with relative frequencies such as 1/23 and 1/34.]
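Collecting preceding-context frequencies could be sketched as below. Normalizing by the number of sequences, the span format, and the function name are assumptions; only the 0.01 threshold and the first/second/third positions come from the slides.

```python
from collections import Counter

def preceding_context_features(tokens, spans, depth=3, threshold=0.01):
    """Relative frequencies of words preceding sequences of one category.

    tokens: corpus as a list of words
    spans:  (start, end) index pairs of sequences labeled c, end exclusive
    Returns {(position, word): relative frequency}, where position 1 is
    the word directly before the sequence. Only contexts with a relative
    frequency above `threshold` are kept.
    """
    if not spans:
        return {}
    counts = Counter()
    for start, _ in spans:
        for pos in range(1, depth + 1):
            if start - pos >= 0:
                counts[(pos, tokens[start - pos])] += 1
    total = len(spans)
    return {key: n / total for key, n in counts.items() if n / total > threshold}

tokens = "Ebenso schnell hat Peter Müllers Konkurrenz".split()
features = preceding_context_features(tokens, spans=[(3, 5)])
print(features)
```

Succeeding contexts would be collected the same way by stepping to the right of each span.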


Enhanced Features

Based on the tagging of the unlabeled corpus by the base classifiers, features are extracted:

Internal evidence:
    $f_c$ × intervals
    First/last in $seq_c$

External evidence:
    First, second, third in $seq_{preC}$
    First/second in $seq_{sucC}$

Training is again performed using the enriched feature set.

Tagging is enhanced by max(length($seq_i$)) (read again):

$$\max_c f_c(\vec{x}_j), \quad f_{NE}(\vec{x}_j) > 0, \quad \max_c \operatorname{length}(seq_c)$$


Experiments

Does the sequence in window focus enhance the learning result?
Does the use of unlabeled background corpus enhance learning results?
How is the enhancement per round? How many rounds are necessary?
Is the knowledge-poor approach compatible with approaches using linguistic knowledge?
Would a Hidden Markov Model be better?


Does the sequence in window focus enhance the learning result?

                  Instances in       F-measure  F-measure  F-measure  Overall    Overall  Overall
                  Training/Test      LOC        PER        ORG        Precision  Recall   F-measure
Regular N-grams   101 810 / 25 909   50.6       42.67      45.38      69.82      34.18    45.9
Sequences         113 245 / 30 792   52.68      44.19      49.1       89.72      33.13    48.39

Yes.
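The overall figures in this table are consistent with the balanced F-measure, the harmonic mean of precision and recall. The slides do not state which variant was used, so the check below is an inference from the numbers; the function name is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Overall precision and recall as reported in the table.
rows = {
    "Regular N-grams": (69.82, 34.18),
    "Sequences": (89.72, 33.13),
}
for name, (p, r) in rows.items():
    print(name, round(f_measure(p, r), 2))
```

The computation also makes the trade-off visible: the sequence variant gains about 20 points of precision, but recall stays near 33, so the F-measure rises only slightly.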


Does the use of unlabeled background corpus enhance learning results?

                                            F-measure  F-measure  F-measure  Overall    Overall  Overall
                                            LOC        PER        ORG        Precision  Recall   F-measure
No use of corpus, sequences                 52.68      44.19      49.10      89.72      33.13    48.39
Corpus for internal and external evidence   75.04      91.09      65.36      83.69      73.82    78.44

Yes.


How is the enhancement per round? How many rounds are necessary?

[Figure: overall recall, overall F-measure, and overall precision (scale 0 to 100) over rounds 0 to 7.]


Number of Support Vectors

[Figure: number of support vectors (#SV, scale 0 to 12 000) over rounds 0 to 4 for LOC, PER, and ORG.]


Is the knowledge-poor approach compatible with approaches using linguistic knowledge?

Author                             F-measure LOC  F-measure PER  F-measure ORG
Volk, Clematide 2001               85.7           88.9           78.4
Neumann, Piskorski 2002            81.1           88.0           79.4
Florian et al. 2003 (best CoNLL)   77.71          83.57          71.08
Rössler                            75.94          91.09          65.36

Hmm…


Would a Hidden Markov Model be better?

[Figure: bar chart (scale 0 to 80) comparing Base, HMM, and Base+HMM on Protein, DNA, RNA, Cell Type, Cell Line, overall recall, overall precision, and overall F-measure.]

No, but turning its classification into a feature helps the SVM!


Summary of Rössler's Approach

System consists of 3 components:
    SVM training: fast in large, sparse vector space
    Feature extraction from large corpora: fast automatic adaptation to new domain
    The outer loop:
        Splitting instances of an m-class learning problem into m-1 binary problems
        Tagging using a voting mechanism
        Enhancing examples by extracted features

The feature approach easily integrates linguistic knowledge or predictions of other learners, if given.

The data-driven approach is language independent.

Results are compatible with knowledge-based approaches.


Conclusion

Tagged data allow for enhanced services.

Automatic tagging of paragraphs or tables can easily be done using very few examples in an interactive, incremental way.

Named entity recognition for automatic tagging remains a challenge.
