Project Enki: Named Entity Recognition on Analog Documents

Amit Gupte

Program Manager Microsoft - MAIDAP

Collaborators

Alexey Romanov

Data Scientist Microsoft - MAIDAP

Jianjie Lui

Software Engineer Microsoft - MAIDAP

Dalitso Banda

Software Engineer Microsoft - MAIDAP

Raza Khan

Data Scientist Azure AI

Lakshmanan Ramu Meenal

Software EngineerAzure AI

Benjamin Han

Data Science ManagerAzure AI

Soundar Srinivasan

DirectorMicrosoft - MAIDAP

Sahitya Mantravadi

Data Scientist Microsoft - MAIDAP

Document Digitization

Optical Character Recognition

Search, Summarization

However, OCR on scanned Docs is far from

This is due to the noise in those documents

Errors in OCR Process Hampers

Downstream NLP tasks

Cannot mine Information

From Documents Accurately

Simulate OCR Noise

Improve performance

on Noisy Data

NER LabelledDocuments

Synthetic DataGeneration

Module

Analogdocuments

Inference

Training

AlignedLabelled

DocumentsPredicted

NER labels

OCRDocuments

OCR EngineRestoration

ModelTraining

OCRRestorationNER ModelScoring

Module

TextAlignment

Module

RestorationModel

NER F1Score

Genalog

synthetic documents

degradations

Azure OCR

text alignment i

https://microsoft.github.io/genalog/

Genalog

Alignment

RETAS method 100x∗

Text Alignment Label Propagation

A Fast Alignment Scheme for Automatic OCR Evaluation of Books. And we found similar results in our experiments.

Seq2Seq

• Suffers with long sequences

Bi-LSTM prediction

• Character shift problem

Action prediction

• Our approach

--w-Y------w-------

Nev ork is onderful

N e v Y o k i s

1d convolution

Char fully connected

Action fully connected

Char embedding

Dataset Degradation Type

OCR Accuracy Reconstruction

Accuracy

Char Word Char Word

CoNLL 2012 All Degradation Light 0.986 0.927 0.991 0.962

CoNLL 2003 All Degradation Light 0.989 0.942 0.994 0.969

CoNLL 2012 All Degradation Heavy 0.900 0.661 0.907 0.732

CoNLL 2003 All Degradation Heavy 0.900 0.646 0.903 0.700

Dataset Degradation TypeNER on

Clean Text

NER on

Degraded Text

NER on

Restored Text

Relative Gap

Reduction

CoNLL 2012 All Degradation Light 0.832 0.783 0.819 73%

CoNLL 2003 All Degradation Light 0.860 0.820 0.841 52%

CNN Daily Mail All Degradation Light 0.989 0.590 0.895 76%

synthetic images

degradations

Action prediction character shift

problem

synthetic data

restore the text from OCR errors

significantly reduces

downstream NER task

Checkout Genalog and give it a like

Give our Paper a read

Connect with us ☺

Two column Scientific Paper Two column Scientific Paper

with Plots Letter Style Block Text

Project Enki: Named Entity Recognition on Analog Documents

Documents

Named Entity Recognition I

Information Extraction and Named Entity Recognition

Salient named entity identification system

Adaptive Co-attention Network for Named Entity Recognition ... · Named entity recognition (NER) tries to identify the named entities in text. Named entities fall into pre-deﬁned

Named Entity Recognition Using BERT and ELMo · Group 8 : Mikaela Guerrero Vikash Kumar Nitya Sampath Saumya Shah. Introduction to Named Entity Recognition Named entity recognition

Naive Bayes Named Entity Recognizer

Named EntIty Recognition in Query

Crowdsourcing Named Entity Recognition and Entity Linking ... · Crowdsourcing Named Entity Recognition and Entity Linking Corpora 3 Fig. 1 The methodological framework for crowdsourcing

DataEngConf: Concrete Named Entity Disambiguation

Biomedical Named Entity Recognition

Malay Named Entity Recognition Based on Rule-Based Approach · 2017-10-19 · Index Terms—Information extraction, Malay named entity recognition, named entity recognition, rule-based

Named Entity Recognition

NEREx: Named-Entity Relationship Exploration in Multi ...bib.dbvis.de/uploadedFiles/namedentityrelationship.pdf · named-entity extraction, we abstract important entities into ten

Named-Entity-Recognition Pipeline · • Named-Entity-Recognition (NER) ... • Conclusion of preprocessing Several steps based on each other ... Predictive words

Dictionary-based named entity recognition

Analysis of Named Entity Recognition and Linking for Tweetsgiusepperizzo.github.io/publications/Derczynski_Manyard_Rizzo-IPM… · 2. Named Entity Recognition Named entity recognition

Named Entity Recognition from a Data-Driven Perspectivecseweb.ucsd.edu/classes/fa19/cse259-a/files/jingbo.pdf · What’s Named Entity Recognition? 3 q Wikipedia: q Named-entity recognition

Improving Language-Dependent Named Entity Detection

Named Entity Recognition Neural Network Based

Evaluation of Named Entity Extraction Systems · 2009-07-17 · choice or the design of more suitable NER tools for a specific project. Keywords: Named entity extraction, named entity