Overview

Large-scale, Parallel Automatic Patent Annotation

Thomas Heitz & GATE Team
Computer Science Dept. - NLP Group - Sheffield University

Patent Information Retrieval 2008

30 October 2008

T. Heitz & GATE Team - NLP Group - Sheffield University Large-scale, Parallel Automatic Patent Annotation 1 / 33

Automatic Patent Annotation

Objectives

Fully automatic method.

Scaling up without sacrificing computational performance and accuracy.

Methods

Keyword-based queries: 10 degree, 20 degree Celsius, 18 °F, etc.

Semantic annotation-based queries: measurement.unit = 'degree Celsius', measurement.value = {10,30}; will find the Fahrenheit equivalent as well.
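A minimal Python sketch of why a semantic query on measurement.unit and measurement.value can also retrieve Fahrenheit mentions. The `Measurement` record, the `to_celsius` helper, the unit strings and the reading of {10,30} as a closed range are all illustrative assumptions, not the system's actual data model. The idea is to normalise every annotated value to one unit before comparing it with the queried range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical annotation record: a numeric value and its unit."""
    value: float
    unit: str


def to_celsius(m: Measurement) -> float:
    """Normalise a temperature measurement to degrees Celsius."""
    if m.unit == "degree Celsius":
        return m.value
    if m.unit == "degree Fahrenheit":
        return (m.value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unsupported unit: {m.unit}")


def query_celsius_range(annotations, low, high):
    """Return annotations whose Celsius-normalised value lies in [low, high]."""
    return [m for m in annotations if low <= to_celsius(m) <= high]


annotations = [
    Measurement(20, "degree Celsius"),
    Measurement(68, "degree Fahrenheit"),  # equals 20 degrees Celsius
    Measurement(18, "degree Fahrenheit"),  # roughly -7.8 degrees Celsius
]
hits = query_celsius_range(annotations, 10, 30)
```

A keyword query for "20 degree Celsius" would miss the 68 °F mention, whereas the normalised query returns both.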

Large-scale parallel Information Extraction

System characteristics

Insufficient training data for learning ⇒ Rule-based system.

Robust, scalable ⇒ Shallow IE (deep IE in PatExpert [16]).

Large volume of data ⇒ Automatic and parallel.

Results

Performance and quality

Processed 1.3 million patents in 6 days with 12 parallel processes.

Strict precision and recall greater than 90% for most annotations.
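As a rough illustration of what strict scoring measures, the sketch below counts a predicted annotation as correct only when its type and both offsets match a gold annotation exactly; a partial overlap counts as an error. The (start, end, type) tuple representation and the function name are assumptions made for this example, not the evaluation tool used in the talk.

```python
def strict_precision_recall(gold, predicted):
    """Compute strict precision and recall over (start, end, type) tuples."""
    gold_set = set(gold)
    pred_set = set(predicted)
    correct = len(gold_set & pred_set)  # exact span and type matches only
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


gold = [(0, 6, "Measurement"), (10, 20, "Reference"), (30, 42, "Measurement")]
predicted = [
    (0, 6, "Measurement"),
    (10, 18, "Reference"),    # overlaps a gold span but is not exact
    (30, 42, "Measurement"),
    (50, 55, "Measurement"),  # spurious
]
precision, recall = strict_precision_recall(gold, predicted)
```

Here two of four predictions are exact, so precision is 0.5, and two of three gold annotations are found, so recall is 2/3.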

Contents

1 Task: patent annotation
2 Tools: GATE gazetteers and rules
3 Experiments: large scale and parallel
4 Evaluation: gold standard

1 Task: patent annotation

Patent data and structure
Section annotations
Reference annotations
Measurement annotations

Patent data and structure

Dataset from Matrixware

American patents (USPTO): 1.3 million, 108 GB, average file size is 85 KB.

European patents (EPO): 27 thousand, 780 MB, average file size is 29 KB.

Structure in three main parts

The first page containing bibliographical data and abstract,
the description of the invention,
the usage of the invention,
the claim part
and the bibliography part.

Section annotations (EPO)

Section annotations

Sections

BibliographicData, Abstract and Claims sections are pre-existing.

Heading annotations give the beginning of a section, if present.

Keywords are used to guess the section type.

About 20 section types.

Reference annotations (USPTO)

Reference annotations

References

Claim, Example, Figure, Formula and Table references are quite straightforward, except for intervals like Fig. 1 to 3 and 5.

Patent references are a lot more difficult because of the variability of format.

Literature references are even more difficult; for example, authors can appear in numerous formats: Warwel, S.; S. Warwel; Siegfried Warwel; etc.
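The interval case mentioned above (Fig. 1 to 3 and 5) can be sketched in Python as follows. This is only an illustration with a regular expression; the function name and the accepted range separators ("to" and "-") are assumptions, and the system described in the talk uses JAPE rules instead.

```python
import re


def expand_reference_numbers(text):
    """Expand e.g. 'Fig. 1 to 3 and 5' into the list [1, 2, 3, 5]."""
    numbers = []
    # The range alternative 'A to B' / 'A-B' is tried before a single number.
    for match in re.finditer(r"(\d+)\s*(?:to|-)\s*(\d+)|(\d+)", text):
        start, end, single = match.groups()
        if single is not None:
            numbers.append(int(single))
        else:
            numbers.extend(range(int(start), int(end) + 1))
    return numbers


figures = expand_reference_numbers("Fig. 1 to 3 and 5")
```

Expanding the interval lets a query for Figure 2 find this mention even though the digit 2 never appears in the text.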

Measurement annotations (EPO)

Measurement annotations

Measurements

Most measurements comprise a scalarValue followed by a unit, e.g. 350 nm.

Two scalarValue, with or without unit, can be contained in an interval, e.g. 150 to 350 nm.

A large number of measurement units exist, so we used an ontology populated from a database.

One-letter units are ambiguous.
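A small Python sketch of the two patterns described above: a scalar value followed by a unit, and an interval of two scalar values. The four-entry unit list stands in for the much larger unit ontology, and the regular expressions and tuple output are assumptions for illustration only. One-letter units are deliberately left out because, as noted, they are ambiguous.

```python
import re

# Tiny illustrative unit list (assumption); longest units first so they win.
UNITS = ["nm", "mm", "kg", "degree Celsius"]
UNIT = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))
NUMBER = r"\d+(?:\.\d+)?"

# Interval: two scalar values, first unit optional, e.g. '150 to 350 nm'.
INTERVAL = re.compile(rf"({NUMBER})\s*(?:{UNIT})?\s*to\s*({NUMBER})\s*({UNIT})\b")
# Simple measurement: scalar value followed by a unit, e.g. '350 nm'.
SIMPLE = re.compile(rf"({NUMBER})\s*({UNIT})\b")


def find_measurements(text):
    """Return (kind, values, unit) tuples; intervals take priority."""
    results = []
    covered = []
    for m in INTERVAL.finditer(text):
        values = (float(m.group(1)), float(m.group(2)))
        results.append(("interval", values, m.group(3)))
        covered.append(m.span())
    for m in SIMPLE.finditer(text):
        # Skip simple matches that lie inside an interval already reported.
        if not any(s <= m.start() and m.end() <= e for s, e in covered):
            results.append(("simple", (float(m.group(1)),), m.group(2)))
    return results


found = find_measurements("a wavelength of 150 to 350 nm and a thickness of 2.5 mm")
```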

2 Tools: GATE gazetteers and rules

GATE
Gazetteers
Rules
Application

GATE

GATE and ANNIE

GATE [5], the General Architecture for Text Engineering, is a framework providing support for a variety of language engineering tasks.

It includes a vanilla information extraction system, ANNIE.

The processing resources we use from ANNIE are as follows: tokeniser, completely customised gazetteer and finite state transduction grammars.

Gazetteers

Reference and measurement unit gazetteers

The rules use some clue words like Table followed by a number for table references.

We use gazetteers to annotate such clue words with all their inflections.

For references: 314 entries.

For measurement units: more than 30K entries (created automatically from a database).
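The gazetteer idea can be sketched as a lookup table from surface forms, including inflections and abbreviations, to a reference type. The entries and the `lookup` function below are hypothetical and far smaller than the 314-entry reference gazetteer; they only show how clue words become annotations that later rules can use.

```python
# Hypothetical gazetteer: each surface form maps to a reference type.
GAZETTEER = {
    "table": "Table", "tables": "Table",
    "figure": "Figure", "figures": "Figure", "fig": "Figure", "figs": "Figure",
    "claim": "Claim", "claims": "Claim",
}


def lookup(tokens):
    """Return (token_index, reference_type) for each token in the gazetteer."""
    hits = []
    for i, token in enumerate(tokens):
        key = token.lower().rstrip(".")  # 'Fig.' and 'fig' share one entry
        if key in GAZETTEER:
            hits.append((i, GAZETTEER[key]))
    return hits


tokens = "As shown in Tables 2 and 3 and in Fig. 4".split()
hits = lookup(tokens)
```

A rule can then look for a clue-word hit followed by a number, without listing every inflection itself.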

Annotation rules

GATE JAPE

We use GATE JAPE rules; each rule consists of two parts: a left-hand side (LHS) and a right-hand side (RHS).

The LHS consists of an annotation pattern that should be matched in the text.

The RHS declares the action that should be taken when the pattern specified in the LHS is found in the document.

Annotation rules

To find a Measurement

E.g. 350 nm. In total, 30 rules are used for measurements.

Measurement Annotation rule

    Rule: Measurement
    ( // LHS
      {Number}
      {Unit}
    ):match
    --> // RHS
    :match.Measurement = {}
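The Measurement rule can be read as: wherever a Number annotation is immediately followed by a Unit annotation, create a Measurement annotation spanning both. The Python analogue below is only an illustration of that reading; it is neither JAPE nor the GATE API, and the (type, start, end) tuples are an assumed representation.

```python
def apply_measurement_rule(annotations):
    """Apply a Number-then-Unit pattern to (type, start, end) annotations.

    The input is assumed to be sorted by start offset.
    """
    measurements = []
    for left, right in zip(annotations, annotations[1:]):
        # LHS: a Number annotation directly followed by a Unit annotation.
        if left[0] == "Number" and right[0] == "Unit":
            # RHS: create a Measurement covering the whole match.
            measurements.append(("Measurement", left[1], right[2]))
    return measurements


# Annotations for the text '350 nm and', with '350 nm' at offsets 0 to 6.
annotations = [("Number", 0, 3), ("Unit", 4, 6), ("Token", 7, 10)]
result = apply_measurement_rule(annotations)
```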

Annotation rules

To find a literature reference

E.g. see: Peacock, R. D. “The Chemistry of Technetium and Rhenium” Elsevier: Amsterdam, 1966. 24 rules are used for references.

Literature Annotation rule

    Rule: Literature
    ( // LHS
      {LiteratureContext}
      ({LiteratureStart}
       {LiteratureEnd}
      ):match
    ):match-with-context
    --> // RHS
    :match.Literature = {}

Application

Application pipeline

Phase  GATE processing resource
1      Section Finder
2      English Tokeniser
3      Patent-specific gazetteers
4      Reference Finder
5      Measurements Finder

3 Experiments: large scale and parallel

Setup
Optimisation
Performance

Setup

Large Data Collider (LDC)

Our experiments were carried out on the IRF’s LDC with Java (jrockit-R27.4.0-jdk1.5.0_12) with up to 12 processes.

SGI Altix 4700 system comprising 20 nodes, each with four 1.4 GHz Itanium cores and 18 GB RAM.

In comparison, we found processing 4x faster on an Intel Core 2 at 2.4 GHz.

Specific applications

GATE batch mode: dispatches files to process on several GATE applications; does not stop on error.

GATE benchmarking: generates time stamps for each resource and displays charts from them.
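The batch-mode behaviour (dispatch files to several workers, do not stop on error) can be sketched in Python as below. Everything here is a stand-in: `annotate` is a hypothetical placeholder for running one pipeline on one file, and a thread pool is used for brevity where the real setup ran separate processes; `ProcessPoolExecutor` from the same module could be swapped in.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate(path):
    """Stand-in for running the annotation pipeline on one file."""
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return f"annotated:{path}"


def safe_annotate(path):
    """Run annotate() but capture any error so the batch keeps going."""
    try:
        return path, annotate(path), None
    except Exception as exc:
        return path, None, str(exc)


def run_batch(paths, workers=12):
    """Dispatch files to a pool of workers; collect successes and failures."""
    done, failed = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for path, result, error in pool.map(safe_annotate, paths):
            if error is None:
                done[path] = result
            else:
                failed[path] = error
    return done, failed


done, failed = run_batch(["a.xml", "b.bad", "c.xml"], workers=3)
```

Recording failures instead of raising matters at this scale: one malformed file among 1.3 million should not abort a run that takes days.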

Optimisation

Benchmarking and refactoring

Benchmarking of each processing resource.

Removal of unnecessary resources, such as the ANNIE morphological analyser and named entity recognition, to keep only the tokeniser.

Optimisation of the JAPE rules where the benchmarking detects abnormal execution times.
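Per-resource benchmarking of this kind can be sketched in Python by timing each pipeline stage separately. The stage functions and names below are hypothetical stand-ins, and the sleep merely simulates a rule with abnormal execution time; this is not the GATE benchmarking tool itself.

```python
import time
from collections import defaultdict


def run_with_benchmark(resources, document):
    """Run each (name, function) resource in order and time it in seconds."""
    timings = defaultdict(float)
    for name, resource in resources:
        start = time.perf_counter()
        document = resource(document)
        timings[name] += time.perf_counter() - start
    return document, dict(timings)


def tokenise(doc):
    """Hypothetical fast resource."""
    return doc + ["tokens"]


def slow_rule(doc):
    """Hypothetical resource with an abnormal execution time."""
    time.sleep(0.05)
    return doc + ["measurements"]


document, timings = run_with_benchmark(
    [("tokeniser", tokenise), ("measurement rules", slow_rule)], []
)
slowest = max(timings, key=timings.get)
```

Sorting the timings points directly at the resource or rule set worth optimising first.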

Performance

Baseline vs. optimized

Page 46: Large-scale, Parallel Automatic Patent Annotation · The first page containing bibliographical data and abstract, the description of the invention, the usage of the invention, the

Task: patent annotationTools: GATE gazetteers and rules

Experiments: large scale and parallelEvaluation: gold standard

Patent Gold StandardEvaluation on the Patent Gold Standard

Contents

1 Task: patent annotation

2 Tools: GATE gazetteers and rules

3 Experiments: large scale and parallel

4 Evaluation: gold standard


Patent Gold Standard

Creation of the Gold Standard

Selection of patents from two very different fields: mechanical engineering and biomedical technology.

Manual annotation of USPTO and EPO patents by more than 10 people, with several annotators for each patent.

In total: 51 patents, 2.5 million characters.


Statistics on Gold Standard

Annotation type            USPTO    EPO

Section.Abstract              23     28
S.BackgroundArt               19     22
S.BestMode                     2      5
S.BibliographicData           23     28
S.Bibliography                 0      8
S.Claims                      23      0
S.CrossReferenceToR.A.         6      1
S.DetailedDescription         11     18
S.DisclosureOfInvention        3      6
S.DrawingDescription          16     20
S.Effects                      1      2
S.Examples                    17     25
S.PreferredEmbodiment         10      7
S.PriorArt                     4      6
S.Sponsorship                  2      0
S.SummaryOfTheInvent.         20     18
S.TechnicalField              14     17
S.UsageOfInvention             1      6
Annotations/Doc              8.5      8

Annotation type            USPTO    EPO

Reference.Claim              352      2
R.Example                     99    264
R.Figure                     375    570
R.Formula                     79     66
R.Literature                 114    488
R.Patent                      92    182
R.Table                       59    105
Annotations/Doc               51     60

M.scalarValue               1998   3409
Measurement.unit            1613   2994
M.interval                   432    375
Annotations/Doc              176    242
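
The Annotations/Doc rows can be reproduced from the counts above. The sketch below assumes 23 USPTO and 28 EPO documents; the slides do not state this split directly, it is inferred from the Section.Abstract counts and the total of 51 patents.

```python
# Assumption: document counts inferred from the Section.Abstract row (23 + 28 = 51).
DOCS = {"USPTO": 23, "EPO": 28}

# (USPTO, EPO) counts copied from the Reference and Measurement rows.
REFERENCES = {
    "Claim": (352, 2),
    "Example": (99, 264),
    "Figure": (375, 570),
    "Formula": (79, 66),
    "Literature": (114, 488),
    "Patent": (92, 182),
    "Table": (59, 105),
}
MEASUREMENTS = {
    "scalarValue": (1998, 3409),
    "unit": (1613, 2994),
    "interval": (432, 375),
}


def per_doc(table):
    """Average number of annotations per document for each patent office."""
    totals = [sum(column) for column in zip(*table.values())]
    return {office: total / DOCS[office] for office, total in zip(DOCS, totals)}


references_per_doc = per_doc(REFERENCES)
measurements_per_doc = per_doc(MEASUREMENTS)
```

Under that assumption the averages round to 51 and 60 for references and to 176 and 242 for measurements, matching the Annotations/Doc rows.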


Results on Gold Standard: micro-averaged precision (P.), recall (R.) and F1

Annotation type         USPTO              EPO
                      P.    R.   F1     P.    R.   F1

S.BackgroundArt       74    74   74     56    68   61
S.DrawingDescr.       75    75   75     84    80   82
Section.Examples      65    65   65     61    56   58
S.SummaryOf.          89    80   84     83    83   83
S.TechnicalField      80    57   67     94    94   94

Reference.Claim      100   100  100    100   100  100
R.Example             97   100   99    100    99   99
R.Figure              99    99   99     99    98   98
R.Formula             99    99   99    100   100  100
R.Literature          69    75   72     70    74   72
R.Patent              76    77   77     72    84   78
R.Table              100    98   99    100   100  100

M.scalarValue         96    93   94     94    92   93
Measurement.unit      95    92   93     94    93   93
M.interval            93    92   93     82    81   82
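
For reference, a minimal Python sketch of micro-averaging: the per-document counts are pooled first, and precision, recall and F1 are computed once from the pooled totals. The `Counts` structure and the example numbers are hypothetical; the slides do not say how matches were counted (for instance strict versus lenient span matching).

```python
from dataclasses import dataclass


@dataclass
class Counts:
    correct: int   # system annotations that match the gold standard
    spurious: int  # system annotations absent from the gold standard
    missing: int   # gold-standard annotations the system did not find


def f1_score(p, r):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return 2 * p * r / (p + r) if p + r else 0.0


def micro_prf(per_doc):
    """Pool the counts over all documents, then compute P, R and F1 once."""
    correct = sum(c.correct for c in per_doc)
    spurious = sum(c.spurious for c in per_doc)
    missing = sum(c.missing for c in per_doc)
    p = correct / (correct + spurious) if correct + spurious else 0.0
    r = correct / (correct + missing) if correct + missing else 0.0
    return p, r, f1_score(p, r)


# Hypothetical counts for two documents, for illustration only.
p, r, f1 = micro_prf([Counts(8, 2, 0), Counts(1, 0, 9)])
```

Because counts are pooled, documents with many annotations weigh more than documents with few. The same F1 formula applied to the rounded table values gives, for example, about 84 for P. = 89 and R. = 80; small differences elsewhere in the table are to be expected, since the slide values are already rounded.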


Section annotation: Examples (EPO)


Reference annotation: Literature (USPTO)


Measurement annotation: interval (EPO)


Contents

In conclusion...


Conclusion

Fully automatic method that scales up (million patents, 100 GB).

Quality close to human annotators.

Perspective

Machine learning from annotated patents.

Semantic queries with Patent Ontology.
