An Open Corpus for Named Entity Recognition in Historic Newspapers
Clemens Neudecker, Berlin State Library
@cneudecker
LREC2016, 23-28 May 2016, Portorož, Slovenia
Background
• Europeana Newspapers EU project: www.europeana-newspapers.eu
• OCRed 12m pages of historic newspapers from Europe (an estimated 25 billion words!)
• Newspaper content from 23 libraries, in 40 languages, covering 4 centuries (1618-1990)
• Public domain full-text available for download per language/content provider
Formats & Standards
• Full-text produced in ALTO
• Metadata (structural) in METS
• Metadata (bibliographic) in EDM
• Not a fan of XML? Good ol' plain text (UTF-8) is also available: research.europeana.eu/itemtype/newspapers
• Currently working on:
  – API for text/search
  – API for images (IIIF)
Approach
• 3 languages selected for NER: Dutch, German, French – in collab. with
• Content in these languages constitutes about 50% of the overall full-text in the collection
Methodology
• Select 100 representative pages per language
  – If a classifier already exists for the given language, run it on the selected 100 pages
  – Ingest tagged/untagged pages into the annotation tool
  – Manually add/correct annotations (>=2 librarians per language)
  – Export and convert tagged data to BIO format
  – Train classifier from BIO data & gazetteers (if available)
  – Evaluate derived classifier using 4-fold cross-evaluation
  – Repeat until classification performance converges
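The BIO export step can be illustrated with a minimal sketch. It assumes annotations arrive as token-index spans of the form (start, end, type) with an exclusive end; the function name `to_bio` and the sample sentence are hypothetical and not taken from the project code. The output is one token per line with a tab-separated label, a layout commonly used for CRF training files.

```python
from typing import List, Tuple


def to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[Tuple[str, str]]:
    """Label each token as B-<type>, I-<type> or O.

    spans holds (start, end, type) token indices, end exclusive (assumed layout).
    """
    labels = ["O"] * len(tokens)
    for start, end, ent_type in spans:
        labels[start] = f"B-{ent_type}"          # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{ent_type}"          # continuation tokens
    return list(zip(tokens, labels))


# Hypothetical example sentence with one person and one location.
tokens = ["Herr", "Joshua", "Reynolds", "reiste", "nach", "Baltimore", "."]
spans = [(1, 3, "PER"), (5, 6, "LOC")]
bio = to_bio(tokens, spans)

for token, label in bio:
    print(f"{token}\t{label}")
```

Multi-token names keep their boundaries this way: "Joshua" opens the entity and "Reynolds" continues it, so two adjacent persons would not be merged into one.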
NER software
• Tested Stanford NER, OpenNLP, NLTK, GATE
• Adaptation of Stanford NER package (CRF)
  – Mature, well-documented, widely used
  – Open source (GPL)
  – Thread-safe & platform-independent (JVM)
  – Machine learning scales out more easily to multiple languages
  – Prior experience working with CRF
NER encoding in ALTO
• In ALTO versions >2.1, this is possible:
<String STYLEREFS="ID7" HEIGHT="132.0" WIDTH="570.0" HPOS="5937.0" VPOS="3279.0" CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5"></String>
<String STYLEREFS="ID7" HEIGHT="102.0" WIDTH="540.0" HPOS="18438.0" VPOS="22008.0" CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10"></String>
…
<Tags>
  <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
  <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
</Tags>
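To show how the TAGREFS/Tags mechanism can be read back, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions: the wrapper elements `alto` and `Layout`, the untagged word "visited" and the function name `tagged_strings` are hypothetical, and the snippet carries no XML namespace, whereas real ALTO files declare one (element names would then need the namespace prefix).

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free snippet modelled on the example above (assumption).
ALTO_SNIPPET = """<alto>
  <Tags>
    <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
    <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
  </Tags>
  <Layout>
    <String CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5"/>
    <String CONTENT="visited" WC="0.9"/>
    <String CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10"/>
  </Layout>
</alto>"""


def tagged_strings(alto_xml):
    """Return (word, entity type or None) for every String element."""
    root = ET.fromstring(alto_xml)
    # Map tag IDs to entity types, e.g. "Tag5" -> "Person".
    tag_types = {t.get("ID"): t.get("TYPE") for t in root.iter("NamedEntityTag")}
    result = []
    for string in root.iter("String"):
        refs = (string.get("TAGREFS") or "").split()
        types = [tag_types[r] for r in refs if r in tag_types]
        result.append((string.get("CONTENT"), types[0] if types else None))
    return result


entities = tagged_strings(ALTO_SNIPPET)
for word, ent_type in entities:
    print(word, ent_type)
```

Because the entity information lives in a separate Tags section, the word coordinates and OCR confidence on each String stay untouched.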
Annotation
• Evaluated BRAT, WebAnno, INL Attestation
• Reasons for selection of INL Attestation:
  – Speed
  – Support of ALTO format
  – Support from INL available
Annotation stats
Language   # tokens   # PER   # LOC   # ORG
French      207,000   5,672   5,614   2,574
Dutch       182,483   4,492   4,448   1,160
German       96,735   7,914   6,143   2,784
Language   tokens   % PER   % LOC   % ORG
French       100%   2.75%   2.71%   1.24%
Dutch        100%   2.46%   2.44%   0.64%
German       100%   8.18%   6.35%   2.88%
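The percentages are each entity count divided by the token count of that language. A short sketch of the arithmetic, using the counts from the annotation stats; the variable and function names are illustrative only, and the last digit can differ slightly from the slide depending on rounding.

```python
# Counts copied from the annotation stats table.
counts = {
    "French": {"tokens": 207_000, "PER": 5_672, "LOC": 5_614, "ORG": 2_574},
    "Dutch": {"tokens": 182_483, "PER": 4_492, "LOC": 4_448, "ORG": 1_160},
    "German": {"tokens": 96_735, "PER": 7_914, "LOC": 6_143, "ORG": 2_784},
}


def entity_shares(table):
    """Share of tokens (in percent) annotated as PER, LOC and ORG per language."""
    return {
        lang: {
            kind: round(100 * row[kind] / row["tokens"], 2)
            for kind in ("PER", "LOC", "ORG")
        }
        for lang, row in table.items()
    }


shares = entity_shares(counts)
for lang, row in shares.items():
    print(lang, row)
```

The German sample stands out: it has roughly half the tokens of the other two languages but more than three times the share of person names of either.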
Language   Word-Error-Rate (Bag of Words)   Reading Order Success Rate
French     16.6%                            19.9%
Dutch      17.6%                            23.2%
German     15.9% / 21.9%                    13.6%
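As a rough illustration of an order-independent word error measure, the sketch below compares word counts of a ground-truth text and its OCR output without looking at word positions. This is one simplified reading of "bag of words", stated as an assumption; the evaluation tool used for the figures above may define the rate differently. The function name and the sample strings are hypothetical.

```python
from collections import Counter


def bow_word_error_rate(ground_truth: str, ocr: str) -> float:
    """Fraction of ground-truth words not matched in the OCR text, ignoring order."""
    gt = Counter(ground_truth.split())
    hyp = Counter(ocr.split())
    if not gt:
        return 0.0
    # Counter subtraction keeps only words that occur more often in the ground truth.
    missed = sum((gt - hyp).values())
    return missed / sum(gt.values())


rate = bow_word_error_rate(
    "the ship arrived from Baltimore",
    "tbe ship arrived frorn Baltimore",
)
print(f"{rate:.1%}")
```

Ignoring order keeps the word-level measure separate from reading-order problems, which the table reports in their own column.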
Challenges
• Clear, comprehensive & common guidelines for manual annotation
• OCR quality – on average 80% word accuracy
• Wide variation in historical spelling
• Mix of languages on a single page
• Lack/loss of metadata on page/word level
• Some data corruption occurred when ingesting pre-tagged data into the annotation tool
Attempted workarounds
• Introduce OCR error patterns into training data
  – Actually yields lower precision/recall
• Introduce a spelling variation module in the NER classifier
  – Rewrite rules (e.g. "frorn" → "from")
  – High integration effort
  – Requires a reasonable amount of rules
  – Abandoned due to high complexity
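A rewrite rule of the "frorn" → "from" kind can be sketched as a lexicon-checked substitution. This is not the abandoned module itself, only a hedged illustration run as a preprocessing step: the rule list, the tiny lexicon and the function name `normalise` are all hypothetical.

```python
# Hypothetical OCR confusion pairs: (misread sequence, intended sequence).
RULES = [("rn", "m"), ("ii", "n"), ("vv", "w")]

# Tiny illustrative lexicon; a real one would hold period-appropriate word forms.
LEXICON = {"from", "corn", "new", "the"}


def normalise(token, rules=RULES, lexicon=LEXICON):
    """Apply a rewrite rule only if it turns an unknown token into a known word."""
    if token.lower() in lexicon:
        return token                      # already a known word, leave untouched
    for wrong, right in rules:
        candidate = token.replace(wrong, right)
        if candidate != token and candidate.lower() in lexicon:
            return candidate
    return token                          # no rule produced a known word


print(normalise("frorn"))     # rewritten via the rn -> m rule
print(normalise("corn"))      # known word, so the rn -> m rule must not fire
print(normalise("Reynolds"))  # unknown name, left as-is
```

Even this toy version hints at the complexity noted above: every rule needs a guard such as the lexicon check, and names, which matter most for NER, are rarely in any lexicon.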
Evaluation NL
Derived via 4-fold cross-evaluation (25 out of 100 annotated pages)
Evaluation FR
Derived via 4-fold cross-evaluation (25 out of 100 annotated pages)
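The evaluation setup can be sketched as follows: 100 pages are split into four folds of 25, each fold serves once as test set while the other 75 pages are used for training, and entity-level precision, recall and F1 are computed per fold. This is a minimal sketch under assumptions; the page identifiers, the span representation and both function names are hypothetical, and the actual training call is left out.

```python
import random


def four_fold_splits(pages, k=4, seed=0):
    """Yield (train, test) page lists so that every page is tested exactly once."""
    shuffled = list(pages)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


def precision_recall_f1(gold, predicted):
    """Entity-level scores; an entity counts only if span and type match exactly."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


pages = [f"page_{n:03d}" for n in range(100)]   # hypothetical page identifiers
splits = list(four_fold_splits(pages))
for train, test in splits:
    print(len(train), "training pages,", len(test), "test pages")

# Entities as (start, end, type) spans; one of two predictions is correct.
scores = precision_recall_f1(
    gold=[(0, 2, "PER"), (5, 6, "LOC")],
    predicted=[(0, 2, "PER"), (3, 4, "ORG")],
)
print(scores)
```

Averaging the per-fold scores gives a single figure per language and entity type.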
Use cases
• Improving search, information retrieval
  – Within digital newspapers, a vast majority of user queries are person and place names
• Linking of named entities to authority files to create linked data
  – The classification and disambiguation of named entities allows the assignment of unique identifiers from authoritative sources, thus enabling cross-language/cross-collection linking
Next steps
• Volunteers wanted! Help correct the corpus and collaboratively create a free dataset – instructions on GitHub wiki:
  – github.com/EuropeanaNewspapers/ner-corpora/wiki/Corpus-cleanup
• Plans to improve performance:
  – Add distributional similarity as feature (Clark 2003)
  – Semantic generalisation (Faruqui & Padó 2010)
  – Specialised gazetteers (e.g. list of historic place names)
  – Data, data, data
Open resources
• Europeana Newspapers NER dataset (CC0):
  – github.com/EuropeanaNewspapers/ner-corpora
• Europeana Newspapers NER software (EUPL):
  – github.com/EuropeanaNewspapers/europeananp-ner
  – github.com/EuropeanaNewspapers/europeananp-dbpedia-disambiguation
• Annotated ALTO files:
  – lab.kbresearch.nl/static/html/eunews.html
References
• C. Neudecker, W.J. Faber, L. Wilms, T. van Veen: Large scale refinement of digital historical newspapers with named entity recognition. Proceedings of the IFLA Newspaper Section Satellite Meeting, 2014, Geneva, Switzerland.
• Y. Mossalam, A. Abi-Haidar, J.G. Ganascia: Unsupervised named entity recognition and disambiguation: An application to old French journals. Advances in Data Mining. Applications and Theoretical Aspects, Springer LNCS, 2014.
Thank you for your attention! Questions?