Mission impossible? Computer Aided Extraction of Generic Chemical Structures from Patents. A Critical Review of the Technologies Applied and Some Results of the Theseus-Project 'ChemProspector'

Josef Eiblmaier, Valentina Eigner-Pitto, Hans Kraut, Larisa Isenko, Heinz Saller and Peter Loew

InfoChem GmbH © 2012 Dr. Josef Eiblmaier, ACS National Meeting, Philadelphia, August 19 - 22, 2012

Outline

© cora / PIXELIO, www.pixelio.de

» Introduction

› ChemProspector, a THESEUS project

› Markush in a nutshell

» Goals and Approach

» Results

» Outlook

Introduction

» Research program initiated by the Federal Ministry of Economy and Technology (BMWi)

» ‘New Technologies for the Internet of Services’

» Supported with approx. 100 million Euros

» Duration: five years (2007 - 2011)

» Phase one: development of core technologies (2007 - 2008)

» Phase two: THESEUS SME (2009 - 2011)

ChemProspector: Basic Data

» Main emphasis:

‘The automatic extraction of Markush Structures from patent documents’

» Research SME-project within THESEUS research program

» Duration: July 2009 – end of 2011

What is a ‘Markush Structure’?

http://www.colorantshistory.org

» Dr. Eugene A. Markush (1888-1968), Pharma Chemical Corporation (1917)

» USP No. 1,506,316 (1924), first usage of generic structures in a patent

Markush Structure Example

Approach

© Gerd Altmann / PIXELIO, www.pixelio.de

Basic assumptions

» Markush notations follow particular grammar rules

› Generic graphical core structure

› Definition of generic groups in the subsequent text

› Usage of ‘Markush specific’ phrases

» Markush-Structures can be categorized

› Level 1 (easy)

› Level 2 (medium)

› Level 3 (hard)

Challenges in Markush Structures

» The information is contained in the text …

› Substituent variation

› Homology variation

› Topology variation

» ... in the images ...

› Position variation

› Frequency variation

Challenges in Markush Structures

» ... and in both text and images

› Frequency variation

› Bond variation
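
A minimal sketch of how the variation types named on these two slides could be recorded per generic group. The class and field names are hypothetical and are not taken from ChemProspector; the core structure is kept as a plain description string rather than a real structure encoding.

```python
from dataclasses import dataclass, field
from enum import Enum


class Variation(Enum):
    """Variation types listed on the 'Challenges' slides."""
    SUBSTITUENT = "substituent"
    HOMOLOGY = "homology"
    TOPOLOGY = "topology"
    POSITION = "position"
    FREQUENCY = "frequency"
    BOND = "bond"


@dataclass
class GroupDefinition:
    """One variable group (e.g. R1) with its allowed alternatives."""
    label: str
    alternatives: list[str]
    variations: set[Variation] = field(default_factory=set)


@dataclass
class MarkushStructure:
    """Generic core plus the group definitions found in the text."""
    core: str
    groups: dict[str, GroupDefinition] = field(default_factory=dict)

    def add_group(self, group: GroupDefinition) -> None:
        self.groups[group.label] = group

    def variation_types(self) -> set[Variation]:
        # Union of the variation types used by all groups.
        found: set[Variation] = set()
        for group in self.groups.values():
            found |= group.variations
        return found


# Hypothetical example: a core with one substituent/homology group.
example = MarkushStructure(core="benzene ring carrying R1")
example.add_group(
    GroupDefinition(
        label="R1",
        alternatives=["methyl", "alkyl"],
        variations={Variation.SUBSTITUENT, Variation.HOMOLOGY},
    )
)
```

Keeping the variation types per group makes it easy to see later which of the challenges a given structure actually involves.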

Classification of Markush Structures

» Level 1: simple standard notations (easy)

› relevant information is in the image or directly in the subsequent text

› simple grammar rules are used

› all variable parts are clearly defined

› variable parts do not contain further nested variable groups; generic organic groups (e.g. alkyl) are allowed

» Level 2: complex standard notations (medium)

› relevant information is in the image or in the text; references to other places in the document must be clear

› more complex grammar rules may be used, but they have to follow certain rules

› generic groups may contain further nested generic groups, but these must be comprehensible

› conditional R-groups are allowed as long as they are clearly structured and unambiguous

» Level 3: complex notations, singletons (hard)
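
The level criteria above can be read as a small decision procedure. The sketch below is one possible reading, not the project's classifier: the feature names are hypothetical, and it assumes that anything that is non-standard, undefined or ambiguous falls into Level 3.

```python
from dataclasses import dataclass


@dataclass
class NotationFeatures:
    """Hypothetical yes/no features of one Markush notation."""
    standard_notation: bool        # follows recognizable grammar rules
    all_variables_defined: bool    # every variable part has a definition
    unambiguous: bool              # clearly structured, comprehensible
    has_nested_groups: bool        # generic groups nested in generic groups
    has_conditional_groups: bool   # conditional R-groups
    refers_elsewhere: bool         # definitions referenced from other places


def classify_level(f: NotationFeatures) -> int:
    """Return 1 (easy), 2 (medium) or 3 (hard)."""
    # Assumption: anything outside clear standard notation is Level 3.
    if not (f.standard_notation and f.all_variables_defined and f.unambiguous):
        return 3
    # Nesting, conditions or remote references lift a notation to Level 2.
    if f.has_nested_groups or f.has_conditional_groups or f.refers_elsewhere:
        return 2
    return 1


simple = NotationFeatures(True, True, True, False, False, False)
nested = NotationFeatures(True, True, True, True, False, False)
unclear = NotationFeatures(True, False, True, False, False, False)
```

With these inputs, `classify_level` returns 1 for `simple`, 2 for `nested` and 3 for `unclear`.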

Main Components for Markush Recognition

» ICANNOTATOR

» Markush-Parser

» Image classifier

» Chemical recognition

Image Recognition: ICImg2Struct

Extracts chemical images

» Proprietary development

› started mid 2010

» Multiple step process:

› page segmentation

› image classification

› vectorization, OCR, reconstruction

» Components: Image classifier, Chemical recognition

Named Entity Recognition: ICANNOTATOR

Extracts chemical named entities

» Exact chemical entities: methyl, ethyl, n-propyl, phenyl, chloro, nitro, amino, hydroxy, hydrogen, carbon, 1-naphthyl, 2-pyridyl, tosyl, piperidyl ...

» Generic and homology groups, fragments: alkyl, alkoxy, aryl, halide, hydrocarbon ...

» Combinations: alkylamino, 4-aryl-phenyl, ...
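
A minimal sketch of the three entity classes on this slide, using plain dictionary lookup. This is an illustration only and not how ICANNOTATOR works internally; the term lists are a subset of the slide's examples, and treating any other token that contains a generic term as a combination is an assumption made here.

```python
import re

# Subsets of the example terms from the slide.
EXACT = {
    "methyl", "ethyl", "n-propyl", "phenyl", "chloro", "nitro", "amino",
    "hydroxy", "hydrogen", "carbon", "1-naphthyl", "2-pyridyl", "tosyl",
    "piperidyl",
}
GENERIC = {"alkyl", "alkoxy", "aryl", "hydrocarbon"}

# Tokens may contain digits and hyphens, e.g. "4-aryl-phenyl".
TOKEN = re.compile(r"[A-Za-z0-9][A-Za-z0-9'\-]*")


def tag_entities(text: str) -> list[tuple[str, str]]:
    """Return (token, class) pairs for the chemical terms found in text."""
    tags = []
    for match in TOKEN.finditer(text):
        token = match.group(0).lower()
        if token in EXACT:
            tags.append((token, "EXACT"))
        elif token in GENERIC:
            tags.append((token, "GENERIC"))
        elif any(term in token for term in GENERIC):
            # Assumption: a longer token built around a generic term
            # is a combination such as "alkylamino".
            tags.append((token, "COMBINATION"))
    return tags


found = tag_entities("R1 is methyl, alkyl or alkylamino; R2 is 4-aryl-phenyl.")
```

Here `found` holds `methyl` as EXACT, `alkyl` as GENERIC, and both `alkylamino` and `4-aryl-phenyl` as COMBINATION.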

Markush-Parser

Extracts Markush specific entities

» Formula definitions: formula 1, general formula (I), derivatives represented by (3), …

» Variable definitions: R, R1, R2, R', A, X, Y, Z, Ar, …

» Wherein definitions: where, wherein, in which, …

» Link group: represents, may be, one of, is selected from, …

» Chain lengths: 3-20 carbon atoms, …

» Topologic definitions: branched or unbranched, …

» Bond types: may contain double bonds, …

» References: as defined above, …

» Substitutions: optionally substituted by, …
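
Several of these entity classes lend themselves to regular expressions. The sketch below covers five of them with hypothetical patterns written for this example; they are not the patterns used by the Markush-Parser and would need many more cases in practice (single-letter variables, for instance, are easy to confuse with ordinary words).

```python
import re

# Hypothetical patterns for five of the entity classes on the slide.
PATTERNS = [
    ("FORMULA", re.compile(
        r"\b(?i:general\s+)?(?i:formula)\s+\(?(?:\d+|[IVX]+)\)?(?![A-Za-z])")),
    ("WHEREIN", re.compile(r"\b(?:wherein|where|in which)\b", re.I)),
    ("LINK", re.compile(
        r"\b(?:represents|may be|one of|is selected from)\b", re.I)),
    ("CHAIN_LENGTH", re.compile(r"\b\d+\s*-\s*\d+\s+carbon atoms\b", re.I)),
    # Case-sensitive on purpose: R, R1, R', Ar, A, X, Y, Z.
    ("VARIABLE", re.compile(r"\b(?:R\d*'?|Ar|[AXYZ])(?![A-Za-z])")),
]


def find_markush_entities(text: str) -> list[tuple[str, str]]:
    """Return (label, matched text) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group(0)))
    hits.sort()  # order by position in the text
    return [(label, value) for _, label, value in hits]


sentence = (
    "Compounds of general formula (I), wherein R1 represents alkyl "
    "having 3-20 carbon atoms and X is selected from O or S."
)
entities = find_markush_entities(sentence)
```

For the sample sentence, `entities` lists the formula reference, the wherein keyword, the variables `R1` and `X`, the two link phrases and the chain length, in reading order.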

SemanticParser

Finds patterns of entities / reassembles the components

» Grammar rule 1, Grammar rule 2, Grammar rule 3, …, Grammar rule n

» Entities in the diagram: FORMULA DEFINITION, CHEMICAL STRUCTURE, WHEREIN DEFINITION, LIGAND LIST

SemanticParser Grammar Rules

» 135 patterns (regular expressions)

» 102 macros (sequences)

» 291 rules (Backus-Naur form)
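
To illustrate what a rule in Backus-Naur form over entity labels might look like, the sketch below matches a sequence of already tagged entities against a tiny grammar. The two rules and the order of the entities are assumptions made for this example; the actual 291 rules are not shown in the slides.

```python
# Hypothetical grammar, written as BNF-style alternatives:
#   MARKUSH      ::= FORMULA_DEFINITION CHEMICAL_STRUCTURE
#                    WHEREIN_DEFINITION LIGAND_LISTS
#   LIGAND_LISTS ::= LIGAND_LIST LIGAND_LISTS | LIGAND_LIST
RULES = {
    "MARKUSH": [
        ["FORMULA_DEFINITION", "CHEMICAL_STRUCTURE",
         "WHEREIN_DEFINITION", "LIGAND_LISTS"],
    ],
    "LIGAND_LISTS": [
        ["LIGAND_LIST", "LIGAND_LISTS"],
        ["LIGAND_LIST"],
    ],
}


def parse(symbol, tags, pos=0):
    """Match symbol at tags[pos:]; return the end position or None."""
    if symbol not in RULES:
        # Terminal: must equal the entity label at the current position.
        if pos < len(tags) and tags[pos] == symbol:
            return pos + 1
        return None
    # Non-terminal: try each alternative in order, first match wins.
    for alternative in RULES[symbol]:
        cur = pos
        for part in alternative:
            cur = parse(part, tags, cur)
            if cur is None:
                break
        else:
            return cur
    return None


complete = ["FORMULA_DEFINITION", "CHEMICAL_STRUCTURE",
            "WHEREIN_DEFINITION", "LIGAND_LIST", "LIGAND_LIST"]
incomplete = ["FORMULA_DEFINITION", "CHEMICAL_STRUCTURE", "LIGAND_LIST"]
```

`parse("MARKUSH", complete)` consumes all five entities, while `parse("MARKUSH", incomplete)` returns `None` because the wherein definition is missing.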

Overall Workflow Approach

» Components in the workflow diagram: Page segmentation, OCR, Image classifier, Chemical recognition, ICANNOTATOR, Markush-Parser, SemanticParser

Results

© G. Altmann / PIXELIO, www.pixelio.de

Results: Image Recognition

» Columns: Original Image, ICImg2Struct, Accuracy

» Accuracy for the three examples shown: 100%, 100%, 96.3%

Image Recognition: More Challenges

» Wavy bonds

» Atom numbers

» Brackets

» Circles

» Fused characters

» Charges

» Crossing bonds

» Variable bonds

» …

Evaluation: Key Questions

» What is the distribution of Level 1, 2 and 3 within a defined test corpus?

» How many Markush-Structures are identified?

» How many correct core structures are identified?

» How many totally correct Markush-Structures are extracted?

© Rainer Sturm / PIXELIO, www.pixelio.de

Evaluation: Test Set of Documents

» Random selection of 100 patent documents

» Manual abstraction of Markush-Structures contained therein

» Level 1: 474 Markush-Structures

» Level 2: 453 Markush-Structures

» Automatic comparison
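
A short sketch of the arithmetic behind such an evaluation. The two counts are the ones given on this slide; the helper names are hypothetical, and `rate` is only a generic ratio that could serve any of the key questions (identified, correct core, totally correct). No result figures are assumed here.

```python
# Manually abstracted Markush structures in the test set (from the slide).
LEVEL_COUNTS = {"Level 1": 474, "Level 2": 453}


def distribution(counts):
    """Share of each level among all counted structures."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


def rate(hits, total):
    """Fraction of reference structures matched by the automatic run."""
    return hits / total if total else 0.0


total_structures = sum(LEVEL_COUNTS.values())
shares = distribution(LEVEL_COUNTS)
for level, share in shares.items():
    print(f"{level}: {share:.1%}")
```

The two levels together give 927 reference structures, split roughly evenly (about 51.1% Level 1 and 48.9% Level 2).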

Results: Document Classification

Results: Output ICF Proprietary Format

Example 2: Incomplete Recognition

Example 3: Full Recognition

Example 4: Full Recognition

Evaluation: Results

Conclusions

Mission impossible?

Not impossible …

… but not accomplished yet!

Outlook

» Extension of grammar rules for Markush Structures

» Further development of CDX extraction

» Further development of ICImg2Struct

Acknowledgements

» InfoChem ChemProspector team

» German Federal Ministry of Economy and Technology (BMWi)

© P. Storz / PIXELIO, www.pixelio.de

Thank you!

Questions?