An Application-Oriented View of Automatic Tagging and Information Extraction

Katharina Morik

Technische Universität Dortmund, Fakultät für Informatik, LS 8




Overview

Handling texts – overview

Mark-up languages
  Services based on annotated texts

Automatic tagging
  From lay-out information to tags

Named entity recognition
  Data-intensive approach
  Counting in very large unlabeled corpus
  Turning frequencies into features
  Compiling sequences into features




Handling Texts

Granularity: hypertext structure, text, paragraph, word, letters

Learning mode: batch, incremental

Learning goal: adapted organization, class or clustering, syntactic or semantic structures

Application tasks: personalization, optimization of information access, integration in business processes, reporting

Handling Texts

Hypertext

Text

Paragraph

Word

Adaptation

Alesker, Joachims, Neifach

Veltmann Hüppe, Mintert, Thomas

Helbig

Extraction Rössler

Clustering Schewe, Wurst

Classification

Joachims, Klinkenberg

this talk




Intelligent Publishing Using Mark-Ups

Search qualified by semantic category

Self-contained parts of text (atoms) as search result

Composition of one’s own text

Presentation according to semantic category

IP4W3 System by Stefan Mintert 1999

Mark-up languages

[Figure: IP4W3 workflow. The user sends a query (category + word) to the web server and receives a list of atoms as result. Labeled steps: Search, Selection, Composition, Presentation. Labeled data: Text.]




Qualified search





Presentation of Results





Text Composition


Selected results from 2 Queries combined




Applications e-Learning, e-Publishing

Intelligent publication on the web: users customize the material to their own needs. “IP4W3”, Stefan Mintert 1999, Dortmund

Course material for different groups: from the central repository of presentations or texts, courses are designed for special interests. “Slicing Books”, Ingo Dahm 2001, Koblenz-Landau

Additional sequence information allows courses to be tailored to learning types, e.g.
  Top-down: from definition to application
  Bottom-up: from application to definition
Moritz Thomas 1999, Dortmund




Behind the Curtain

Mark-up editor

Editor for defining qualified search

  <!-- search pattern -->
  <element type="block">
    <any>
      <target-element type="definition"></target-element>
    </any>
  </element>

fits

  <block>
    <em>
      <definition>Characters</definition> are the atomic unit of texts according to ISO/IEC 10646. …
    </em>
  </block>
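The matching idea behind such a search pattern can be sketched in a few lines. This is only an illustration, not the IP4W3 implementation: the function name find_targets, the wrapper element <text>, and the use of Python's xml.etree.ElementTree are assumptions made for the example. It returns the text of every target element nested at any depth inside a container element, which is one reading of the <any> wildcard.

```python
import xml.etree.ElementTree as ET


def find_targets(document_xml, container="block", target="definition"):
    """Return the text of each `target` element nested anywhere inside a `container`.

    Hypothetical helper: nesting at any depth stands in for the <any> wildcard.
    """
    root = ET.fromstring(document_xml)
    hits = []
    for block in root.iter(container):      # every container element, including the root
        for node in block.iter(target):     # descendants at any depth
            hits.append("".join(node.itertext()).strip())
    return hits


# Sample document modelled on the slide; the <text> wrapper is an assumption.
sample = (
    "<text><block><em><definition>Characters</definition> are the atomic "
    "unit of texts according to ISO/IEC 10646.</em></block></text>"
)
matches = find_targets(sample)
print(matches)
```

Note that nested containers would report the same target more than once; a real search engine would have to decide how to handle that case.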

[Figure: roles and artifacts around the web server. Labels: DTD/Schema, Search patterns, Style sheets, Annotated text, Administrator, Author. One element is marked “Bottleneck!”.]




Automatic tagging

WISDOM++, Univ. Bari: from scanned texts to blocks to XML tags – classification of blocks by C4.5. Altamura, Esposito, Malerba 2000

ADT, Univ. Dortmund: from RTF annotation to XML tags – classification by C4.5. Christian Hüppe 2003




ADT – Input Document





ADT – Manual Annotation of Examples





ADT – Attributes of Examples

RTF control words
  Presence of control word in current and preceding paragraph:
    ff: neither in this nor in the preceding paragraph
    ft: not in this but in the preceding paragraph
    tf: in this but not in the preceding paragraph
    tt: both in this and in the preceding paragraph

Value of indentation in current and preceding paragraph

First and second word of paragraph
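The attribute encoding can be illustrated with a small sketch. The paragraph representation (a dict with the keys control_words, indent, and text) and both function names are assumptions made for this example, not ADT's actual data model.

```python
def presence_code(in_current, in_preceding):
    """Two-letter code: first letter for the current, second for the preceding paragraph."""
    return ("t" if in_current else "f") + ("t" if in_preceding else "f")


def paragraph_attributes(current, preceding, control_words):
    """Build one attribute dict for a paragraph (hypothetical representation)."""
    attributes = {}
    for cw in control_words:
        attributes[cw] = presence_code(
            cw in current["control_words"], cw in preceding["control_words"]
        )
    attributes["indent_current"] = current["indent"]
    attributes["indent_preceding"] = preceding["indent"]
    words = current["text"].split()
    attributes["first_word"] = words[0] if len(words) > 0 else ""
    attributes["second_word"] = words[1] if len(words) > 1 else ""
    return attributes


# Toy paragraphs; the RTF control words \b (bold) and \i (italic) serve as examples.
previous = {"control_words": {"\\b"}, "indent": 0, "text": "Definition"}
paragraph = {"control_words": {"\\i"}, "indent": 720, "text": "Characters are atomic"}
example = paragraph_attributes(paragraph, previous, ["\\b", "\\i"])
print(example)
```

Attribute dicts of this kind could then be passed to a decision tree learner such as C4.5, one example per manually annotated paragraph.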





ADT -- Learning





ADT – Classification of Paragraphs

9 classes (tags), 159 paragraphs

No. of examples for each class    F-measure
1                                 41 %
2                                 94 %
3                                 98.3 %
4                                 99.68 %




Application Options

Combination of ADT and IP4W3 offers qualified search, tailored courseware, and enhanced e-learning without tedious annotation by the author or administrator.

Semantic information within paragraphs cannot be captured.

Named entity recognition necessary!




Named Entity Recognition

Classification of single words into given semantic categories (e.g., person, location, date).

A phrase of the category is a sequence of words with the same label.

Features of a word:
  Linguistic features (e.g., part of speech)
  Letters (e.g., beginning with an upper-case letter)
  Word length
  N-grams

Knowledge-intensive vs. data-intensive approaches:
  Linguistic rules
  Examples
  Unlabeled text (corpus)

Training time, classification time – size of training and test sets




The Task

Biomedical task on 22 000 word forms (JNLPBA)
  472 000 labeled occurrences for training
  54 173 occurrences for testing
  100 million word forms from Medline as background

German corpus, 33 000 word forms (CoNLL)
  220 189 labeled occurrences for training
  54 173 occurrences for testing
  40 million word forms from Frankfurter Rundschau as background

Fast learning and classification necessary!





Data-intensive Approach -- Marc Rössler

Knowledge-poor:
  No linguistic knowledge
  No given word lists
  No hand-written rules

Use of very large given corpora:
  Distribution of word occurrence in corpus
  Frequencies of words
  Frequencies of word sequences

Bootstrapping of features:
  1. Learn classifiers from examples
  2. Apply classifiers to unlabeled corpus
  3. Extract features from now labeled corpus, enhance examples
  4. Learn classifiers from enhanced examples


Stop
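The four bootstrapping steps can be written as a generic loop. The sketch below is an illustration only: the function bootstrap, its parameters, and the toy stand-ins (a memorizing "classifier" instead of an SVM, a single ratio feature named per_ratio) are all assumptions chosen so that the example runs on its own.

```python
from collections import Counter


def bootstrap(examples, corpus, train, tag, extract, enhance, rounds=2):
    """Generic bootstrapping loop; the four callables are supplied by the caller."""
    classifier = train(examples)                  # 1. learn classifiers from examples
    for _ in range(rounds):
        tagged = tag(classifier, corpus)          # 2. apply classifiers to unlabeled corpus
        features = extract(tagged)                # 3. extract features from now labeled corpus
        examples = enhance(examples, features)    #    and enhance the examples
        classifier = train(examples)              # 4. learn classifiers from enhanced examples
    return classifier, examples


# Toy stand-ins (assumptions, not the real components).
def train(examples):
    return {e["word"] for e in examples if e["label"] == "PER"}


def tag(classifier, corpus):
    return [(w, "PER" if w in classifier else "O") for w in corpus]


def extract(tagged):
    totals = Counter(w for w, _ in tagged)
    persons = Counter(w for w, label in tagged if label == "PER")
    return {w: persons[w] / totals[w] for w in totals}


def enhance(examples, features):
    return [dict(e, features={"per_ratio": features.get(e["word"], 0.0)}) for e in examples]


seed = [
    {"word": "Peter", "label": "PER", "features": {}},
    {"word": "kommt", "label": "O", "features": {}},
]
classifier, enhanced = bootstrap(seed, ["Peter", "kommt", "Peter"], train, tag, extract, enhance)
print(enhanced)
```

The number of rounds is a free parameter here; how many rounds are useful is an empirical question.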




The Base Classifier -- Input

Features:
  1 out of 30 word surface features (e.g., 4-digit number, uppercase only, starting with capital letter)
  Word length
  Positional substrings (at most 8), e.g., for “Konkurrenz”:
    Last character: z
    Last two characters: nz
    Last three characters: enz
    First trigram: Kon
    Second trigram: onk
    …
    Fifth trigram: urr

Window of 3 preceding and 2 succeeding words: Ebenso schnell hat Peter Müllers Konkurrenz

Vector of 60 features for each occurrence
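The positional substrings and the sliding window can be reproduced with a short sketch. The function names, the padding symbol, and the default parameters are assumptions for illustration. One possible reading of the 60 features is 6 window positions times 10 features per word (1 surface feature, 1 word length, 8 substrings); this decomposition is a guess and is not stated on the slide.

```python
def positional_substrings(word, n_suffixes=3, n_trigrams=5):
    """Up to 3 suffixes (length 1 to 3) followed by up to 5 leading trigrams."""
    suffixes = [word[-k:] for k in range(1, n_suffixes + 1) if len(word) >= k]
    trigrams = [word[i:i + 3] for i in range(n_trigrams) if i + 3 <= len(word)]
    return suffixes + trigrams


def window(tokens, focus, before=3, after=2, pad="<pad>"):
    """Tokens around index `focus`: 3 preceding, the focus itself, 2 succeeding."""
    padded = [pad] * before + list(tokens) + [pad] * after
    return padded[focus:focus + before + 1 + after]


tokens = "Ebenso schnell hat Peter Müllers Konkurrenz".split()
substrings = positional_substrings("Konkurrenz")
context = window(tokens, tokens.index("Peter"))
print(substrings)
print(context)
```

Short words simply yield fewer substrings, which matches the "at most 8" on the slide.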





The Base Classifier -- Output

A classifier fc is trained for each category against all others

A classifier fNE is trained for “is a NE” vs. “is no NE”

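Tagging the focus of the sliding window according to

$f(\vec{x}) = \vec{w} \cdot \vec{x} + b$

$f_{NE}(\vec{x}) = \mathrm{sign}(\vec{w} \cdot \vec{x} + b)$

$\max_c f_c(\vec{x}_j), \quad f_{NE}(\vec{x}_j) > 0$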
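Read as a decision rule, the formulas say: tag the focus word with the category whose classifier gives the largest value, provided the "is a NE" classifier is positive. A minimal sketch of this reading follows; the names decision_value and tag_focus, the label "O" for "no NE", and the toy scores are assumptions.

```python
def decision_value(weights, x, bias):
    """Linear decision function f(x) = w . x + b."""
    return sum(w_i * x_i for w_i, x_i in zip(weights, x)) + bias


def tag_focus(category_scores, ne_score, outside="O"):
    """Pick the category with the highest f_c, but only if f_NE is positive."""
    if ne_score <= 0:
        return outside
    return max(category_scores, key=category_scores.get)


# Toy decision values, not taken from the slides.
scores = {"PER": 0.8, "LOC": -0.2, "ORG": 0.1}
print(tag_focus(scores, ne_score=1.0))
print(tag_focus(scores, ne_score=-1.0))
```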




Corpus-based Features – Internal Evidence

Applying the base classifier to the corpus results in new features.

Membership frequencies: how often a word v was seen as a member of the category c, where v is the token described by $\vec{x}$.

All $f_c > 0$ become a feature with the ratio as value.

Example: “Peter” receives features 11 and 12 for $f_{PERSON}$, with interval bounds 0 and 0.5 and values 2/10 and 3/10.
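How such membership ratios could be computed from the tagged corpus is sketched below. The interval layout (one interval from 0 to 0.5, one above 0.5), the interval names "low" and "high", and the function name are assumptions; the slide only shows the bounds 0 and 0.5.

```python
from collections import defaultdict


def membership_ratios(tagged, bounds=(0.0, 0.5)):
    """Ratio of occurrences of each word whose decision value falls into an interval.

    `tagged` is an iterable of (word, decision_value) pairs for one category.
    Only positive decision values contribute a feature.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for word, value in tagged:
        totals[word] += 1
        if value > bounds[0]:
            interval = "low" if value <= bounds[1] else "high"
            hits[(word, interval)] += 1
    return {key: count / totals[key[0]] for key, count in hits.items()}


# Ten toy occurrences of "Peter" with made-up decision values for PERSON.
occurrences = [("Peter", 0.3)] * 2 + [("Peter", 0.9)] * 3 + [("Peter", -1.0)] * 5
ratios = membership_ratios(occurrences)
print(ratios)
```

With these toy values, the two features come out as 2/10 and 3/10, mirroring the numbers in the example.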




Sequences -- Windows

The sequences of words with the same label are considered one token within the sliding window.

Ebenso schnell hat Peter Müllers Konkurrenz
                   P     P

Ebenso schnell hat [Peter Müllers] Konkurrenz die
3      2       1                   1          2

$seq_c : \mathrm{sign}(f_c(\vec{x}_{j_s})) = \mathrm{sign}(f_c(\vec{x}_{j_s+1})) = \dots = \mathrm{sign}(f_c(\vec{x}_{j_e}))$
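Merging runs of equally labeled words into one token can be sketched with itertools.groupby. The function name and the label "O" for unlabeled words are assumptions; the sketch also assumes that only labeled runs are merged, since the unlabeled words in the example keep their own window positions.

```python
from itertools import groupby


def collapse_sequences(tokens, labels, outside="O"):
    """Merge each run of identical non-outside labels into a single token."""
    merged = []
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if label == outside:
            merged.extend((word, outside) for word in words)   # keep unlabeled words apart
        else:
            merged.append((" ".join(words), label))            # one token per labeled run
    return merged


tokens = "Ebenso schnell hat Peter Müllers Konkurrenz die".split()
labels = ["O", "O", "O", "PER", "PER", "O", "O"]
collapsed = collapse_sequences(tokens, labels)
print(collapsed)
```

Applied to the example sentence, the window around the merged token "Peter Müllers" then covers three words to the left and two to the right, including "die".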




Compiling Sequences into Features

Membership frequencies (how often a word v was seen as the first (last) in a sequence labeled c) – internal evidence

Membership frequencies become new features.

Example:
  “Peter”: 13: first in $seq_{PERSON}$, 2/10
  “Müller”: 14: last in $seq_{PERSON}$, 2/10




Corpus-based Features – External Evidence

Context frequencies (how often a sequence $seq_c$ was preceded or succeeded by certain words)

Sequence s preceding $seq_c$ is written $seq_{preC}$

Contexts with relative frequency > 0.01 become features of the preceding words in the sliding window

Example:
  “Ebenso”: third in $seq^1_{prePERSON}$, 1/23
  “hat”: first in $seq^1_{prePERSON}$, 1/23
  “hat”: first in $seq^2_{prePERSON}$, 1/34
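A reduced version of the context counting is sketched below. It only looks at the single word directly preceding a sequence and defines the relative frequency as count divided by the number of sequences of that category; both simplifications, as well as the function name and the toy sentences, are assumptions.

```python
from collections import Counter


def preceding_context_features(sentences, category, threshold=0.01):
    """Relative frequency of words directly preceding a sequence labeled `category`.

    `sentences` is a list of sentences, each a list of (word, label) pairs.
    Only contexts above `threshold` are kept as features.
    """
    contexts = Counter()
    n_sequences = 0
    for sentence in sentences:
        for i, (_, label) in enumerate(sentence):
            starts = label == category and (i == 0 or sentence[i - 1][1] != category)
            if starts:
                n_sequences += 1
                if i > 0:
                    contexts[sentence[i - 1][0]] += 1
    return {
        word: count / n_sequences
        for word, count in contexts.items()
        if count / n_sequences > threshold
    }


tagged_corpus = [
    [("hat", "O"), ("Peter", "PER"), ("Müllers", "PER")],
    [("sagt", "O"), ("Anna", "PER")],
    [("hat", "O"), ("Anna", "PER")],
]
features = preceding_context_features(tagged_corpus, "PER")
print(features)
```

Extending the sketch to the second and third preceding word, and to succeeding words, would follow the same counting scheme with a different offset.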




Enhanced Features

Based on the tagging of the unlabeled corpus by the base classifiers, features are extracted:
  Internal evidence:
    $f_c$ × intervals
    First/last in $seq_c$
  External evidence:
    First, second, third in $seq_{preC}$
    First/second in $seq_{sucC}$

Training is again performed using the enriched feature set.

Tagging is enhanced by max(length($seq_i$)) (read again)

$\max_c f_c(\vec{x}_j), \quad f_{NE}(\vec{x}_j) > 0, \quad \max \mathrm{length}(seq_c)$




Experiments

Does the sequence in window focus enhance the learning result?
Does the use of unlabeled background corpus enhance learning results?
How is the enhancement per round? How many rounds are necessary?
Is the knowledge-poor approach compatible with approaches using linguistic knowledge?
Would a Hidden Markov Model be better?





Does the sequence in window focus enhance the learning result?

                  Instances in     F-measure  F-measure  F-measure  Overall    Overall  Overall
                  Training/Test    LOC        PER        ORG        Precision  Recall   F-measure
Regular N-grams   101 810/25 909   50.6       42.67      45.38      69.82      34.18    45.9
Sequences         113 245/30 792   52.68      44.19      49.1       89.72      33.13    48.39

Yes.
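The overall F-measure values in this table are consistent with the usual harmonic mean of precision and recall. The short check below assumes that definition (the slides do not spell it out) and recomputes both rows from the reported precision and recall.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall precision and recall as reported in the table above.
f_ngrams = f_measure(69.82, 34.18)      # about 45.89, reported as 45.9
f_sequences = f_measure(89.72, 33.13)   # about 48.39, reported as 48.39
print(round(f_ngrams, 2), round(f_sequences, 2))
```

The check also makes the trade-off visible: the sequence variant gains about 20 points of precision at slightly lower recall, which yields a moderate gain in F-measure.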




Does the use of unlabeled background corpus enhance learning results?

                                            F-measure  F-measure  F-measure  Overall    Overall  Overall
                                            LOC        PER        ORG        Precision  Recall   F-measure
No use of corpus, sequences                 52.68      44.19      49.10      89.72      33.13    48.39
Corpus for internal and external evidence   75.04      91.09      65.36      83.69      73.82    78.44

Yes.




How is the enhancement per round? How many rounds are necessary?


[Figure: overall recall, overall F-measure, and overall precision (scale 0 to 100) over rounds 0 to 7.]




Number of Support Vectors


[Figure: number of support vectors (#SV, scale 0 to 12 000) over rounds 0 to 4 for LOC, PER, and ORG.]




Is the knowledge-poor approach compatible with approaches using linguistic knowledge?


Author                             F-measure LOC  F-measure PER  F-measure ORG
Volk, Clematide 2001               85.7           88.9           78.4
Neumann, Piskorski 2002            81.1           88.0           79.4
Florian et al. 2003 (best CoNLL)   77.71          83.57          71.08
Rössler                            75.94          91.09          65.36

Hmm…




Would a Hidden Markov Model be better?


[Figure: bar chart (scale 0 to 80) comparing Base, HMM, and Base+HMM on Protein, DNA, RNA, Cell Type, Cell Line, overall recall, overall precision, and overall F-measure.]

No, but turning its classification into a feature helps SVM!




Summary of Rössler’s Approach

System consists of 3 components:
  SVM training: fast in large, sparse vector space
  Feature extraction from large corpora: fast automatic adaptation to new domain
  The outer loop:
    Splitting instances of an m-class learning problem into m-1 binary problems
    Tagging using a voting mechanism
    Enhancing examples by extracted features

The feature approach easily integrates linguistic knowledge or predictions of other learners, if given.

The data-driven approach is language independent.

Results are compatible with knowledge-based approaches.





Conclusion

Tagged data allow for enhanced services.

Automatic tagging of paragraphs or tables can easily be done using very few examples in an interactive, incremental way.

Named entity recognition for automatic tagging remains a challenge.