Improved Bibliographic Reference Parsing Based on Repeated Patterns

Guido Sautter, Klemens Böhm

ViBRANT Virtual Biodiversity

Guido Sautter, KIT


Bibliographic References - Parsing

• Why parse bibliographic references?
  – Generation of BibTeX records, etc.
  – Rendering in different styles
  – Reconciliation
  – …

Absolute necessity when compiling large bibliographies

Thor, A.U., Cond, S.E. 2012. The article title. The Journal 7: 8-15

Fields, in order of appearance: Author, Author, Year, Title, Journal, Volume, Pagination
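To make the first use case concrete, here is a minimal sketch of turning already parsed fields into a BibTeX record. The function name `to_bibtex`, the citation key `Thor2012`, and the choice of the `@article` entry type are assumptions made for illustration; they are not taken from the slides.

```python
def to_bibtex(key, fields):
    """Render a dict of parsed fields as a BibTeX @article record (illustrative only)."""
    lines = [f"@article{{{key},"]
    for name, value in fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


# Field values taken from the example reference above.
parsed = {
    "author": "Thor, A.U. and Cond, S.E.",
    "year": "2012",
    "title": "The article title",
    "journal": "The Journal",
    "volume": "7",
    "pages": "8--15",
}

print(to_bibtex("Thor2012", parsed))
```

Once the fields are separated, rendering in a different reference style is a matter of swapping the output template.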


Bibliographic References - Examples

Diversity with regard to
  – Reference style (order of fields, intermediate punctuation)
  – Type of referenced work

Thor, AU, SE Cond (2012) The article title. The Journal 7: 8-15

Thor, AU, Cond, SE. The article title, The Journal 7 (2012): 8-15

Thor, A.U. 2012. The paper title. Proc. ICST 2012, Location.

Thor, AU, Cond, SE. 2012. The chapter title. In: Itor, ED (Ed.) The book title. Location: Publisher: 8-15

Thor, A.U. 2012. The book title, Publisher, Location, 151 pp.

Thor, AU, SE Cond, 2012. The 3rd article title. In: Itor, ED (Ed.) The 1st special issue. The Journal 7: 8-15


Bibliographic References - Fields

• Fields present in references to (almost) all types of works
  – Authors (can be given in different styles)
  – Year of publication (four-digit Arabic number)
  – Title

• Fields present in references to specific types of works:
  – Publisher and Location / Journal name
  – Pagination ((mostly) Arabic number or number range)
  – Volume / issue / numero number (Arabic number)
  – Volume title / Proceedings title
  – Editors (can be given in different styles)
  – URL / DOI / ISBN / ISSN
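The split between universal and type-specific fields maps naturally onto a record with required and optional attributes. The sketch below is one possible representation; the class name `ParsedReference` and all attribute names are hypothetical and not part of RefParse.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParsedReference:
    # Fields present in references to (almost) all types of works
    authors: List[str]
    year: int
    title: str
    # Fields present only in references to specific types of works
    journal: Optional[str] = None
    publisher: Optional[str] = None
    location: Optional[str] = None
    pagination: Optional[str] = None
    volume: Optional[str] = None
    volume_title: Optional[str] = None
    editors: Optional[List[str]] = None
    identifier: Optional[str] = None  # URL / DOI / ISBN / ISSN


article = ParsedReference(
    authors=["Thor, A.U.", "Cond, S.E."], year=2012, title="The article title",
    journal="The Journal", volume="7", pagination="8-15",
)
book = ParsedReference(
    authors=["Thor, A.U."], year=2012, title="The book title",
    publisher="Publisher", location="Location", pagination="151 pp.",
)
```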


Overview

• Bibliographic References
• Previous Parsing Approaches
• The RefParse Algorithm
• Evaluation
• Summary & Outlook


Pattern Based Parsers

• Principle:
  – Patterns match individual field values
  – Meta patterns arrange field patterns
  – One meta pattern per reference style

• Most prominent: ParaCite (now offline)

• Strengths:
  – Numerical fields
  – Author names

• Weaknesses:
  – Meta patterns to be created for every single reference style
  – Combinatorial explosion with alternatives for individual fields
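The principle and its main weakness can be sketched with regular expressions. The field patterns and the meta pattern below are simplified assumptions written for this illustration (they are not ParaCite's patterns): the meta pattern handles one reference style and fails on the same reference rendered in another style.

```python
import re

# Field patterns (hypothetical, deliberately simple)
AUTHOR = r"[A-Z][a-z]+, (?:[A-Z]\.)+"
AUTHORS = rf"(?P<authors>{AUTHOR}(?:, {AUTHOR})*)"
YEAR = r"(?P<year>\d{4})"
TITLE = r"(?P<title>[^.]+)"
JOURNAL = r"(?P<journal>[^\d]+?)"
VOLUME = r"(?P<volume>\d+)"
PAGES = r"(?P<pages>\d+-\d+)"

# Meta pattern: arranges the field patterns for exactly one reference style
META = re.compile(rf"{AUTHORS} {YEAR}\. {TITLE}\. {JOURNAL} {VOLUME}: {PAGES}$")

style_a = "Thor, A.U., Cond, S.E. 2012. The article title. The Journal 7: 8-15"
style_b = "Thor, AU, SE Cond (2012) The article title. The Journal 7: 8-15"

match = META.match(style_a)
print(match.groupdict())    # all six fields are found for style A
print(META.match(style_b))  # None: style B would need its own meta pattern
```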


Learning Based Parsers

• Learn statistical models from pre-parsed references
  – Hidden Markov Models
  – Conditional Random Fields
  – Finite State Transducers
  – etc.

• Strengths:
  – Can handle all cases covered in training set
  – No handcrafting of rules or patterns

• Weaknesses:
  – Need for training data covering all cases
  – Usually do not exploit morphology
  – Incremental training hard


Knowledge Based Parsers

• Divide references into blocks at punctuation marks
• Classify blocks by comparing them to knowledge base

• Examples: FLUX-CiM, INFOMAP

• Strengths:
  – No handcrafting of rules or patterns
  – Learn domain specific journal names, etc. very well

• Weaknesses:
  – Need for representative training data covering domain
  – Abbreviations interfere with blocking
  – Problems with numerical fields
  – Problems with highly variable fields like author names
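The blocking step, and the way abbreviations disturb it, can be illustrated with a naive splitter. This is a sketch under the assumption that blocks end at `.`, `,`, `:` or `;` followed by whitespace or the end of the string; `split_blocks` is a hypothetical helper, not code from FLUX-CiM or INFOMAP.

```python
import re


def split_blocks(reference):
    """Split a reference into blocks at punctuation followed by whitespace or end of string."""
    parts = re.split(r"[.,:;](?:\s+|$)", reference)
    return [part.strip() for part in parts if part.strip()]


blocks = split_blocks("Thor, A.U. 2012. The paper title. Proc. ICST 2012, Location.")
print(blocks)
# The abbreviation "Proc." ends up in a block of its own, cut off from
# "ICST 2012", so the proceedings title can no longer be classified as a whole.
```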


Alignment Based Parsers

• Morphologically classify words, numbers, punctuation marks
• Interpret sequence of classes as gene sequence
• Try to align this sequence with learned one

• Strengths:
  – No handcrafting of rules or patterns
  – Learn reference styles

• Weaknesses:
  – Need for representative training data covering many cases
  – Abbreviations interfere with alignment
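A minimal sketch of the morphological classification step follows. The class symbols (`C` capitalized word, `U` all-uppercase word, `l` lowercase word, `N` number, punctuation kept as is) are an assumption chosen for this example; the actual alphabets used by alignment based parsers may differ.

```python
import re

# Numbers, runs of letters, or single punctuation marks
TOKEN = re.compile(r"\d+|[^\W\d_]+|[^\w\s]")


def token_class(token):
    """Map a token to a one-symbol morphological class."""
    if token.isdigit():
        return "N"
    if token.isalpha():
        if token.isupper():
            return "U"
        if token[0].isupper():
            return "C"
        return "l"
    return token  # punctuation marks stand for themselves


def class_sequence(reference):
    return " ".join(token_class(token) for token in TOKEN.findall(reference))


first = class_sequence("Thor, A.U. 2012. The Journal 7: 8-15")
second = class_sequence("Cond, S.E. 2011. Another Journal 12: 101-110")
print(first)
print(first == second)
```

Two references in the same style yield the same class sequence even though no word is shared, which is what makes sequence alignment applicable.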


RefParse: The Idea

• Observation of previous approaches:
  – For each field, some approach is strong
  – Reference styles need to be in training set or created manually

• Observation gathering data:
  – References rarely come individually
  – Paper bibliographies are a common source
  – Consequence: lists of references following the same style

• Idea:
  – Exploit structural redundancy given in reference lists
  – Use individual approaches for fields they handle best


Exploiting Redundancy

• Get field values that patterns identify reliably:
  – Author names (all possible styles)
  – Numerical elements (year, volume, etc., pagination)
  – Ambiguous numbers become candidates for all they match

• Generate all possible field arrangements
• Compare field arrangements across reference list …
• … and pick the one that fits the best

• Align references against one another …
• … to infer meta pattern at runtime


Reference Alignment - Example

Thor, AU. The article title. The Journal 1998 (1987): 1997
(each number is a candidate for: Volume? Year? Page?)

Cond, SE. Another article title. Another Journal 7 (2012): 8-15
(Volume: 7, Year: 2012, Pages: 8-15)

• Only alignment with the second reference disambiguates the numbers in the first one

• Exploiting redundancy overcomes inherent weaknesses of reference-by-reference parsers
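The example can be replayed in a few lines of code. This is a strongly simplified sketch, not the RefParse implementation: it looks only at the numbers (ignoring the surrounding punctuation), assumes each of the roles volume, year and pages occurs at most once, and uses a hypothetical year range of 1500 to 2100. All function names are made up for illustration.

```python
import re
from collections import Counter
from itertools import permutations

ROLES = ("volume", "year", "pages")
NUMBER = re.compile(r"\d+(?:\s*-\s*\d+)?")  # a number or a number range


def candidate_roles(number):
    """An ambiguous number is a candidate for every role it matches."""
    if "-" in number:
        return {"pages"}
    roles = {"volume", "pages"}
    if len(number) == 4 and 1500 <= int(number) <= 2100:
        roles.add("year")
    return roles


def arrangements(reference):
    """All role assignments compatible with the numbers in one reference."""
    numbers = NUMBER.findall(reference)
    return [
        roles
        for roles in permutations(ROLES, len(numbers))
        if all(role in candidate_roles(n) for role, n in zip(roles, numbers))
    ]


def best_arrangement(references):
    """Pick the arrangement supported by the most references in the list."""
    votes = Counter(a for ref in references for a in arrangements(ref))
    return votes.most_common(1)[0][0]


first = "Thor, AU. The article title. The Journal 1998 (1987): 1997"
second = "Cond, SE. Another article title. Another Journal 7 (2012): 8-15"

print(len(arrangements(first)))           # every ordering is possible in isolation
print(arrangements(second))               # only one ordering is possible
print(best_arrangement([first, second]))  # the list as a whole settles the first one
```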


Reference Alignment Result

• After alignment steps, RefParse has identified
  – Author lists, including style
  – Years of publication
  – Pagination (where present)
  – Volume / issue / numero numbers (where present)
  – Reference style (order of fields, intermediate punctuation)

Reference List

1. Base Element Extraction
2a. Author List Assembly
2b. Author List Selection
3. Reference Style Inference
4. Volume Reference Extraction
5. Periodical / Publisher Extraction
6. Title Extraction

Parsed References


Handling Volume References

• Embedded references to books or journal volumes
  – In principle, references on their own (save for year)
  – Extract and handle in recursive step

Thor, AU, Cond, SE. 2012. The chapter title. In: Itor, ED (Ed.) The book title. Location: Publisher: 8-15

Thor, AU, SE Cond, 2012. The 3rd article title. In: Itor, ED (Ed.) The 1st special issue. The Journal 7: 8-15
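A minimal sketch of the extraction step, assuming the embedded volume reference is always introduced by the marker `In:` as in the two examples above. The helper `split_volume_reference` is hypothetical; real data would need further markers and a check that `In:` is not part of a title.

```python
import re

IN_MARKER = re.compile(r"\bIn:\s*")


def split_volume_reference(reference):
    """Return (outer part, embedded volume reference or None)."""
    match = IN_MARKER.search(reference)
    if match is None:
        return reference, None
    return reference[:match.start()].rstrip(), reference[match.end():]


chapter = ("Thor, AU, Cond, SE. 2012. The chapter title. "
           "In: Itor, ED (Ed.) The book title. Location: Publisher: 8-15")
outer, embedded = split_volume_reference(chapter)
print(outer)
print(embedded)
```

The embedded part starts with an editor list where an author list would normally be and lacks a year, so it can be passed through the same parsing steps again. In this sketch the trailing pagination stays with the embedded part and would still have to be assigned to the chapter.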


Journal / Publisher Extraction

• Morphologically, names of journal and publisher very similar (word block in title case)

• Sometimes heavily abbreviated (dots interfere with blocking)
  – Recognize title case abbreviation blocks
  – Handle parts in brackets / quotes as single blocks

• Use patterns to find candidates (optionally, use lexicons)
• Choose candidate closest to volume number / pagination
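The candidate search and the proximity rule can be sketched as follows. The pattern treats a run of capitalized words, dots included, as one block so that abbreviations stay together; both the pattern and the helper `journal_candidate` are assumptions made for this illustration, and the position of the volume number is taken as already known from the earlier steps.

```python
import re

# A block of title case words; dots are allowed inside words so that
# abbreviations such as "J. Exp. Biol." form a single block.
TITLE_CASE_BLOCK = re.compile(r"[A-Z][\w.]*(?:\s+[A-Z][\w.]*)*")


def journal_candidate(reference, volume_start):
    """Choose the title case block that ends closest before the volume number."""
    candidates = [
        m for m in TITLE_CASE_BLOCK.finditer(reference) if m.end() <= volume_start
    ]
    if not candidates:
        return None
    best = min(candidates, key=lambda m: volume_start - m.end())
    return best.group()


reference = "Thor, A.U. 2012. The article title. J. Exp. Biol. 7: 8-15"
volume_start = reference.index("7:")  # assumed known from the alignment steps
print(journal_candidate(reference, volume_start))
```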


Title Extraction – Finally

• Title is the most important field of a reference …
• … but also the most variable one, so pattern matching is hard

• Having identified all other fields, however …
• … the title is what remains in the middle of the reference

• Circumvents matching or aligning the title
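The "what remains" idea can be sketched as a gap search over the character spans of the fields identified so far. The helpers `locate` and `remaining_title` are hypothetical, and taking the longest uncovered stretch is a simplifying assumption for this example.

```python
def locate(reference, value):
    """Character span of an already identified field value."""
    start = reference.index(value)
    return (start, start + len(value))


def remaining_title(reference, field_spans):
    """Return the longest stretch of text not covered by any identified field."""
    gaps, position = [], 0
    for start, end in sorted(field_spans):
        if start > position:
            gaps.append(reference[position:start])
        position = max(position, end)
    if position < len(reference):
        gaps.append(reference[position:])
    longest = max(gaps, key=len, default="")
    return longest.strip(" .,:;()")


reference = "Thor, A.U., Cond, S.E. 2012. The article title. The Journal 7: 8-15"
identified = ["Thor, A.U., Cond, S.E.", "2012", "The Journal", "7", "8-15"]
spans = [locate(reference, value) for value in identified]
print(remaining_title(reference, spans))
```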


Experimental Setup

• Corpora:
  – Cora Corpus: 500 individual references
  – Plazi Corpus: ~25,000 references from ~1,000 documents

• Experiments:
  – RefParse without training (empty lexicons)
  – RefParse with training (50% / 50% data split)
  – ParsCit (model based parser for comparison)
  – FreeCite (model based parser for comparison)


Experiments with Cora Corpus

RefParse clearly outperforms related approaches

Interestingly, accuracy is lower with training (explanation follows after the results)

                       | RefParse-g    | RefParse-l    | ParsCit    | FreeCite
Word / Token           | 91.5%         | 89.8%         | 83.0%      | 83.8%
Field:                 |               |               |            |
- Author / Editor      | 98.6% / 74.6% | 98.6% / 78.6% | 95.7% / 0% | 95.7% / 0%
- Title                | 79.0%         | 74.5%         | 91.0%      | 91.0%
- Year of Publication  | 98.8%         | 99.1%         | 96.7%      | 96.7%
- Pagination           | 97.7%         | 97.0%         | 88.9%      | 1.6%
- Part Designators     | 96.0%         | 89.2%         | 66.7%      | 96.0%
- Volume Title         | 38.8%         | 38.6%         | 46.3%      | 50%
- Journal / Publisher  | 68.0%         | 61.6%         | 53.1%      | 54.2%
Instance               | 58.4%         | 52.1%         | 23.4%      | 12.2%


Experiments with Plazi Corpus

RefParse clearly outperforms related approaches

Again, accuracy lower with training (next slide)

                       | RefParse-g    | RefParse-l    | ParsCit    | FreeCite
Word / Token           | 94.3%         | 93.7%         | 78.9%      | 79.7%
Field:                 |               |               |            |
- Author / Editor      | 97.2% / 83.7% | 97.7% / 81.0% | 88.3% / 0% | 88.0% / 0%
- Title                | 78.4%         | 78.5%         | 40.4%      | 32.4%
- Year of Publication  | 99.5%         | 99.5%         | 95.5%      | 89.7%
- Pagination           | 99.3%         | 99.3%         | 20.4%      | 0.3%
- Part Designators     | 97.7%         | 95.1%         | 42.0%      | 64.3%
- Volume Title         | 63.2%         | 52.5%         | 0.6%       | 0.3%
- Journal / Publisher  | 76.6%         | 75.5%         | 54.3%      | 44.3%
Instance               | 69.9%         | 69.2%         | 65.6%      | 3.4%


Lexicons can be Harmful?!

• Observation in experiments: accuracy for title and journal/publisher is lower with lexicons

• Totally counter-intuitive at first glance

• What happens:
  – A frequent infix of a long, rare journal name is found in the lexicon …
  – … and is taken as the journal name proper …
  – … preventing the whole journal name from being found

Infix Match Problem
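The effect is easy to reproduce with a naive lookup. The journal names below are invented for illustration, and `pick_journal` is a hypothetical strategy that prefers any lexicon entry contained in the candidate block; it is meant to show the failure mode, not how RefParse performs its lookup.

```python
def pick_journal(candidate_block, lexicon):
    """Naive strategy: prefer a lexicon entry found inside the candidate block."""
    hits = [name for name in sorted(lexicon) if name in candidate_block]
    return hits[0] if hits else candidate_block


# Invented names: a long, rare journal name containing a frequent shorter one.
block = "South African Journal of Zoology and Botany"
lexicon = {"Journal of Zoology"}

print(pick_journal(block, set()))    # empty lexicon: the whole block is kept
print(pick_journal(block, lexicon))  # with lexicon: only the infix is returned
```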


Summary

• RefParse algorithm:
  – Combines strengths of previous approaches
  – Processes whole reference lists
  – Infers reference style by mutual alignment
  – Independent of training data

• RefParse clearly outperforms previous approaches

• Lexicon lookup phenomenon: Infix Match Problem


Outlook

• Overcome infix match problem

• Improve overall accuracy in title and journal/publisher
  – Blocking & block scoring (akin to knowledge based parsers)
  – Exploiting redundancy to find separating punctuation

• Gather experience in real-world deployment


Questions?

Guido Sautter, Klemens Böhm: Improved Bibliographic Reference Parsing Based on Repeated Patterns

ViBRANT Virtual Biodiversity