21
Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff I-KNOW'03, Friday, 4th of July

Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Embed Size (px)

Citation preview

Page 1: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Automatic Discovery and Aggregation of Compound

Names for the Use in Knowledge Representations

Christian BiemannUwe QuasthoffKarsten BöhmChristian Wolff

I-KNOW'03, Friday, 4th of July

Page 2: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 2

Goals

• extraction of multiterms

• from unannotated text corpora

• for the use in information visualisation

Example: Company Names

Page 3: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 3

Topics

• Patterns of Company Names

• From Patterns to Pattern Rules

• Search and Verification Algorithm: Use Pattern Rules to find Name Parts

• Term Extraction using Name Parts

• Experiments and Evaluation

• Aggregation of Name Variants

• Application to Semantic Networks

Page 4: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 4

Patterns of Company Names

Regular Expression to capture the structure:

(ABBR(FS|CONJ)?)* (NAME|CONN)* (KIND(FS|CONJ)?)*

FS: Full StopCONJ: Conjunctions, like +,&, ...?: Zero or one *: Zero or more

ABBReviation NAME parts Legal Form (KIND)

A & W Elektrogeräte GmbH & Co. KGA. Baumgarten GmbH  Hagedorn GmbH  Institut für Angewandte Kreativität  

DASAG   GmbH  Japan Steel Works Ltd.

K.F.C. Germany Inc.LABSCO Laboratory Supply Company GmbH & Co. KG

Page 5: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 5

How to learn missing parts

Suppose: • Japanese is known NAME• Inc. is known KINDThen: • Steelworks should be NAME• ASD should be ABBR• comprising should not be part of the nameUse flat features: _UC Upper Case _CAP Capitalized _LC Lower Case _MIX Mixed Case

ASD Japanese Steelworks Inc....comprising Japanese Steelworks Inc., ...

Page 6: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 6

From Patterns to Pattern Rules

• PatternABBR NAME NAME KIND

• Pattern Rules_CAP* NAME NAME KIND -> ABBRABBR _UC* NAME KIND -> NAMEABBR NAME _UC* KIND -> NAMEABBR NAME NAME _MIX* -> KIND...

(ABBR(FS|CONJ)?)* (NAME|CONN)* (KIND(FS|CONJ)?)*

Page 7: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 7

Pattern Rules Characteristics

• operate on sequences of - flat features - classes of known words

• Problem: match too often- high coverage- low precision

ABBR _UC* NAME KIND -> NAME

Page 8: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 8

Search and Verification Algorithm

Initialise pattern rulesLet unused elements := initial set of elements with class Loop: For each unused element

Find candidates for new elements by the search step For each candidate Do the verification step Add accepted candidates to new unused elementsOutput new unused elementsUnused elements = new unused elements

Page 9: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 9

Search Step

• use unused element to find example sentences"Film" -> 100 sentences

• apply Pattern Rules to obtain candidates CineMedia NAME Odeon NAME Senator NAME Lunaris NAME Die NAME ...

Fragments containing "Film":

•Die CineMedia Film AG übernahm

•die Odeon Film AG mit

•darunter ein Film über

•zu jedem Film interessante

•die Senator Film AG über

•zukunftsweisenden Film "Jurassic Park"

•die Lunaris Film GmbH

•erfolgreichsten Film der

•. Die Film AG stellte nach...

•der Odeon Film AG.

_UC* NAME KIND -> NAME, AGKIND, FilmNAME

Page 10: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 10

Verification Step

• use candidate to find example sentences"Odeon" -> 30 sentences

• apply Pattern Rules and check classifications of candidate"Odeon" is NAME in 17/30 cases ->accept"Senator" is NAME in 2/30 cases -> reject"Die" is NAME in 0/30 cases -> reject

Fragments containing "Odeon":

•Die Odeon Film AG (3x)

•des Vorstands der Odeon Film AG

•rennomierte Viedovertriebskette Odeon

•teilen sich Hecos Odeon Sub/200/Center

•Rahmenvertrag mit Odeon Zwo

•setzt auf Odeon Film AG

•...

_UC* NAME KIND -> NAME, AGKIND, FilmNAME

Page 11: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 11

Extraction of Multiterms

• Patterns with delimiters can be used for extraction

• Patterns only select appropriate multiterms, not single occurrences of name parts

_DELIML NAME NAME KIND _DELIMR =company, AGKIND, Film NAME, OrionNAME

Multiterms containing "Odeon":

•Die Odeon Film AG (3x)

•des Vorstands der Odeon Film AG

•rennomierte Viedovertriebskette Odeon

•teilen sich Hecos Odeon Sub/200/Center

•Rahmenvertrag mit Odeon Zwo

•setzt auf Odeon Film AG

"Odeon Film AG"

Page 12: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 12

Experiment

• Prerequisites- take arbitrary company list, - sort words by frequency, - truncate top 1'000 - assign classes NAME, KIND

• Pattern Rules- Generate Patterns from Regexp- Generate Pattern Rules from Patterns- Add delimiters to Patterns to get Extraction Patterns

Page 13: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 13

Evaluation

• Input- 1'002 Items- 47 Pattern Rules- 106 Extraction Patterns

• Output- over 12'000 Items- over 6'000 multiterms (company names)

Category Correct With Specifier Fractions Errors

Example Odeon Film AG Grazer Andritz AG Großmarkt GmbH Plastik MiniDIL

Fraction 75.80 % 17.36 % 6.08 % 0.76 %

Page 14: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 14

Aggregation of Name Variants

• Rule 1: first word is location: remove if short form has high frequency

• Rule 2: generic name not aligned to term border: Keep to distinguish between subsidaries

• Rule 3: long form has generic name for first word: remove if short form has higher frequency

Long candidate Short candidate Correct name

Düsseldorfer Bank eG Bank eG Düsseldorfer Bank eG

Düsseldorfer Rheinmetall AG Rheinmetall AG Rheinmetall AG

Mannheimer Pharmexx GmbH Pharmexx GmbH Pharmexx GmbH

Infomatec Media AG Infomatec AG Infomatec Media AG

JENOPTIK Automatisierungstechnik GmbH

Jenoptik GmbH JENOPTIK Automatisierungstechnik GmbH

Jenoptik Bauentwicklung GmbH Jenoptik GmbH Jenoptik Bauentwicklung GmbH

Kleindienst Datentechnik GmbH Kleindienst GmbH Kleindienst Datentechnik GmbH

Nachrichtenagentur dpa-AFX dpa-AFX dpa-AFX

Infomatec-Tochtergesellschaft Igel GmbH

Igel GmbH Igel GmbH

Page 15: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 15

Application to Semantic NetworksMedia – Helkon Media AG, ProSieben Media AG, I-D Media AG

http://www.wortschatz.uni-leipzig.dehttp://www.texttech.de

Page 16: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 16

Media Example (2)Helkon Media AG

Page 17: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 17

Media Example (3)ProSieben Media AG

Page 18: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 18

Media Example (4)I-D Media AG

Page 19: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 19

Telekom ExampleTelekom vs. Deutsche Telekom AG

Page 20: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 20

Summary

• Example-based unsupervised learning algorithm for multiterm extraction

• Disambiguation of generic company name parts in knowledge representation

• More finegrained representation of complex concepts

Page 21: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations Christian Biemann Uwe Quasthoff Karsten Böhm Christian Wolff

Chris Biemann - IKnow'03 Graz 21

END

Thanks for your attention!