Large-scale, Parallel
Automatic Patent Annotation
Thomas Heitz & GATE Team
Computer Science Dept. - NLP Group - Sheffield University
Patent Information Retrieval 2008
30 October 2008
T. Heitz & GATE Team - NLP Group - Sheffield University Large-scale, Parallel Automatic Patent Annotation 1 / 33
Overview
Automatic Patent Annotation
Objectives
Fully automatic method.
Scaling up without sacrificing computational performance and accuracy.
Methods
Keyword-based queries: 10 degree, 20 degree Celsius, 18 °F, etc.
Semantic annotation-based queries: measurement.unit = 'degree Celsius', measurement.value = {10,30}; will find Fahrenheit equivalents as well.
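The difference between the two query styles can be illustrated with a minimal Python sketch. It assumes that {10,30} denotes the range from 10 to 30; the `Measurement` class and the `to_celsius` and `query_celsius` helpers are hypothetical names for illustration, not part of the system described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str  # "degree Celsius" or "degree Fahrenheit"


def to_celsius(m: Measurement) -> float:
    """Normalise a temperature to degrees Celsius."""
    if m.unit == "degree Celsius":
        return m.value
    if m.unit == "degree Fahrenheit":
        return (m.value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unsupported unit: {m.unit}")


def query_celsius(annotations, low, high):
    """Return annotations whose Celsius-normalised value lies in [low, high]."""
    return [m for m in annotations if low <= to_celsius(m) <= high]


annotations = [
    Measurement(10, "degree Celsius"),
    Measurement(20, "degree Celsius"),
    Measurement(18, "degree Fahrenheit"),  # about -7.8 degrees Celsius, outside the range
    Measurement(68, "degree Fahrenheit"),  # exactly 20 degrees Celsius, inside the range
]
hits = query_celsius(annotations, 10, 30)
```

A keyword query for "20 degree Celsius" would miss the 68 °F mention, whereas the query on normalised annotations returns it.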
Large-scale parallel Information Extraction
System characteristics
Insufficient training data for learning ⇒ Rule-Based system
Robust, Scalable ⇒ Shallow IE (Deep in PatExpert [16]).
Large volume of data ⇒ Automatic and Parallel
Results
Performance and quality
Processed 1.3 million patents in 6 days with 12 parallel processes.
Strict precision and recall greater than 90% for most annotations.
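As a rough back-of-the-envelope check of these figures, the arithmetic below derives the implied throughput. It assumes the load was spread evenly over the 12 processes for the full 6 days, which the slides do not state.

```python
PATENTS = 1_300_000
DAYS = 6
PROCESSES = 12

seconds = DAYS * 24 * 60 * 60                 # 518,400 seconds of wall-clock time
overall_rate = PATENTS / seconds              # patents per second, all processes together
per_process_rate = overall_rate / PROCESSES   # patents per second in one process
seconds_per_patent = 1 / per_process_rate     # time one process spends per patent

print(f"{overall_rate:.2f} patents/s overall")          # 2.51
print(f"{per_process_rate:.2f} patents/s per process")  # 0.21
print(f"{seconds_per_patent:.1f} s per patent")         # 4.8
```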
Contents
1 Task: patent annotation
2 Tools: GATE gazetteers and rules
3 Experiments: large scale and parallel
4 Evaluation: gold standard
Patent data and structure
Dataset from Matrixware
American patents (USPTO): 1.3 million, 108 GB, average file size is 85 KB.
European patents (EPO): 27 thousand, 780 MB, average file size is 29 KB.
Structure in three main parts
The first page containing bibliographical data and abstract,
the description of the invention,
the usage of the invention,
the claim part
and the bibliography part.
Section annotations (EPO)
Section annotations
Sections
BibliographicData, Abstract and Claims sections are pre-existing.
Heading annotations give the beginning of a section, if present.
Keywords are used to guess the section type.
About 20 section types.
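A minimal sketch of keyword-based section typing follows. The section type names appear in the gold standard statistics of this talk, but the keyword lists and the `guess_section_type` function are assumptions made for illustration; the actual keywords used by the system are not given in the slides.

```python
# Hypothetical keyword lists; only the section type names come from the talk.
SECTION_KEYWORDS = {
    "BackgroundArt": ("background",),
    "DrawingDescription": ("drawing", "figures"),
    "Examples": ("example",),
    "TechnicalField": ("technical field", "field of the invention"),
}


def guess_section_type(heading: str):
    """Return the first section type whose keyword occurs in the heading, else None."""
    text = heading.lower()
    for section_type, keywords in SECTION_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return section_type
    return None


print(guess_section_type("BACKGROUND OF THE INVENTION"))        # BackgroundArt
print(guess_section_type("Brief Description of the Drawings"))  # DrawingDescription
```

In this sketch the first matching type wins, so the order of the dictionary entries acts as a priority list.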
Reference annotations (USPTO)
Reference annotations
References
Claim, Example, Figure, Formula and Table references are quite straightforward, except for intervals like Fig. 1 to 3 and 5.
Patent references are a lot more difficult because of the variability of format.
Literature references are even harder; for example, author names can take numerous formats: Warwel, S.; S. Warwel; Siegfried Warwel; etc.
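To show why intervals need special handling, here is a small Python sketch that expands a reference such as "Fig. 1 to 3 and 5" into the individual figure numbers. The function name and the regular expression are illustrative assumptions; the system itself uses JAPE rules for this.

```python
import re


def expand_reference_numbers(text: str) -> list[int]:
    """Expand ranges ('1 to 3', '4-6') and single numbers into a flat list."""
    numbers = []
    # Each match is either a range 'a to b' / 'a-b' or a single number.
    for match in re.finditer(r"(\d+)(?:\s*(?:to|-)\s*(\d+))?", text):
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        numbers.extend(range(start, end + 1))
    return numbers


print(expand_reference_numbers("Fig. 1 to 3 and 5"))  # [1, 2, 3, 5]
```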
Measurement annotations (EPO)
Measurement annotations
Measurements
Most measurements comprise a scalarValue followed by a unit, e.g. 350 nm.
Two scalarValue annotations, with or without a unit, can be contained in an interval, e.g. 150 to 350 nm.
A large number of measurement units exist, so we used an ontology populated from a database.
One-letter units are ambiguous.
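The scalarValue, unit and interval structure can be sketched with regular expressions as below. This is only an approximation under stated assumptions: the five-entry `UNITS` tuple is a toy stand-in for the ontology-derived unit list, one-letter units are deliberately left out because of their ambiguity, and `find_measurements` is a hypothetical helper.

```python
import re

# Toy unit list; the talk relies on an ontology with far more units.
UNITS = ("nm", "mm", "cm", "kg", "mg")
NUMBER = r"\d+(?:\.\d+)?"
UNIT = "(?:" + "|".join(UNITS) + ")"

INTERVAL = re.compile(rf"({NUMBER})\s*(?:to|-)\s*({NUMBER})\s*({UNIT})\b")
SINGLE = re.compile(rf"({NUMBER})\s*({UNIT})\b")


def find_measurements(text):
    """Return ('interval', low, high, unit) and ('single', value, unit) tuples."""
    results = []
    covered = []
    for m in INTERVAL.finditer(text):
        results.append(("interval", float(m.group(1)), float(m.group(2)), m.group(3)))
        covered.append(m.span())
    for m in SINGLE.finditer(text):
        # Skip values that already belong to an interval.
        if any(start <= m.start() and m.end() <= end for start, end in covered):
            continue
        results.append(("single", float(m.group(1)), m.group(2)))
    return results


print(find_measurements("a wavelength of 350 nm"))  # [('single', 350.0, 'nm')]
print(find_measurements("from 150 to 350 nm"))      # [('interval', 150.0, 350.0, 'nm')]
```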
GATE
GATE and ANNIE
GATE [5], the General Architecture for Text Engineering, is a framework providing support for a variety of language engineering tasks.
It includes a vanilla information extraction system, ANNIE.
The processing resources we use from ANNIE are as follows: tokeniser, completely customised gazetteer and finite state transduction grammars.
Gazetteers
Reference and measurement unit gazetteers
The rules use clue words, such as Table followed by a number for table references.
We use gazetteers to annotate such clue words with all their inflections.
For references: 314 entries.
For measurement units: more than 30K entries (created automatically from a database).
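The idea of a clue-word gazetteer that lists every inflection can be sketched in a few lines of Python. The entries below are invented examples, not the 314 entries of the actual reference gazetteer, and `find_references` is a hypothetical helper that only handles a clue word directly followed by a number.

```python
# Hypothetical gazetteer: each inflected clue word maps to a reference type.
REFERENCE_GAZETTEER = {
    "table": "Table", "tables": "Table",
    "fig.": "Figure", "figs.": "Figure", "figure": "Figure", "figures": "Figure",
    "claim": "Claim", "claims": "Claim",
}


def find_references(text):
    """Return (type, number) pairs for a clue word directly followed by a number."""
    tokens = text.split()
    found = []
    for word, following in zip(tokens, tokens[1:]):
        ref_type = REFERENCE_GAZETTEER.get(word.lower())
        number = following.rstrip(".,;:)")
        if ref_type and number.isdigit():
            found.append((ref_type, int(number)))
    return found


print(find_references("As shown in Table 2 and Fig. 3, the device of claim 1 works."))
# [('Table', 2), ('Figure', 3), ('Claim', 1)]
```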
Annotation rules
GATE JAPE
We use GATE JAPE rules. Each rule consists of two parts: a left-hand side (LHS) and a right-hand side (RHS).
The LHS consists of an annotation pattern that should be matched in the text.
The RHS declares the action that should be taken when the pattern specified in the LHS is found in the document.
Annotation rules
To find a Measurement
E.g. 350 nm. In total, 30 rules are used for measurements.
Measurement Annotation rule
Rule: Measurement
( // LHS
  {Number}
  {Unit}
):match
--> // RHS
:match.Measurement = {}
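For readers unfamiliar with JAPE, the behaviour of the rule above can be approximated in Python. This is a simplified emulation, not how GATE executes JAPE: it assumes annotations are given as (type, start, end) tuples sorted by start offset, and it treats list neighbours as adjacent.

```python
def apply_measurement_rule(annotations):
    """Emulate the Measurement rule on (type, start, end) tuples sorted by start.

    A Number immediately followed by a Unit yields a Measurement spanning both.
    """
    created = []
    for (type1, start1, _end1), (type2, _start2, end2) in zip(annotations, annotations[1:]):
        if type1 == "Number" and type2 == "Unit":        # LHS: the pattern to match
            created.append(("Measurement", start1, end2))  # RHS: the action to take
    return created


# Annotations for the text "a gap of 350 nm".
annotations = [
    ("Token", 0, 1), ("Token", 2, 5), ("Token", 6, 8),
    ("Number", 9, 12), ("Unit", 13, 15),
]
print(apply_measurement_rule(annotations))  # [('Measurement', 9, 15)]
```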
Annotation rules
To find a literature reference
E.g. see: Peacock, R. D. “The Chemistry of Technetium and Rhenium” Elsevier: Amsterdam, 1966.
24 rules are used for references.
Literature Annotation rule
Rule: Literature
( // LHS
  {LiteratureContext}
  (
    {LiteratureStart}
    {LiteratureEnd}
  ):match
):match-with-context
--> // RHS
:match.Literature = {}
Application
Application pipeline
Phase  GATE processing resource
1      Section Finder
2      English Tokeniser
3      Patent-specific gazetteers
4      Reference Finder
5      Measurements Finder
Setup
Large Data Collider (LDC)
Our experiments were carried out on the IRF’s LDC with Java (jrockit-R27.4.0-jdk1.5.0_12) with up to 12 processes.
SGI Altix 4700 system comprising 20 nodes, each with four 1.4 GHz Itanium cores and 18 GB RAM.
In comparison, we found processing to be 4x faster on an Intel Core 2 at 2.4 GHz.
Specific applications
GATE batch mode: dispatches the files to process to several GATE applications; does not stop on error.
GATE benchmarking: generates time stamps for each resource and displays charts from them.
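The batch-mode behaviour (dispatch files to several workers, keep going when one file fails) can be sketched as follows. This is not the GATE batch tool itself: it uses a Python thread pool for brevity where the experiments used separate processes, and `process_file` is a hypothetical stand-in for running the annotation application on one file.

```python
from concurrent.futures import ThreadPoolExecutor


def process_file(path: str) -> str:
    """Hypothetical stand-in for annotating one patent file."""
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return f"annotated:{path}"


def run_batch(paths, workers=12):
    """Dispatch files to a pool of workers; record failures instead of stopping."""
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_file, path): path for path in paths}
        for future, path in futures.items():
            try:
                done.append(future.result())
            except Exception as err:  # keep going on any per-file error
                failed.append((path, str(err)))
    return done, failed


done, failed = run_batch(["a.xml", "b.bad", "c.xml"], workers=2)
print(done)    # ['annotated:a.xml', 'annotated:c.xml']
print(failed)  # [('b.bad', 'cannot parse b.bad')]
```

Collecting the failures in a list makes it possible to rerun or inspect the problematic files after the whole batch has finished.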
Optimisation
Benchmarking and refactoring
Benchmarking of each processing resource.
Removal of unnecessary resources, such as the ANNIE morphological analyser and named entity recognition, to keep only the tokeniser.
Optimisation of the JAPE rules where the benchmarking detected abnormal execution times.
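Per-resource benchmarking amounts to timing each stage separately and looking for outliers. The sketch below illustrates the idea in Python; the two resources are made-up stand-ins (one artificially slowed with `time.sleep`), and the `benchmark` and `slowest` helpers are assumptions rather than the GATE benchmarking tool.

```python
import time


def benchmark(resources, doc):
    """Run each (name, function) pair in order; return seconds spent per resource."""
    timings = {}
    for name, run in resources:
        start = time.perf_counter()
        run(doc)
        timings[name] = time.perf_counter() - start
    return timings


def slowest(timings):
    """Name of the resource with the largest execution time."""
    return max(timings, key=timings.get)


# Hypothetical stand-ins for processing resources.
resources = [
    ("tokeniser", lambda doc: doc.split()),
    ("slow_rule", lambda doc: time.sleep(0.05)),
]
timings = benchmark(resources, "a gap of 350 nm")
print(slowest(timings))  # slow_rule
```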
Performance
Baseline vs. optimized
Patent Gold Standard
Creation of the Gold Standard
Selection of patents from two very different fields: mechanical engineering and biomedical technology.
Manual annotation of USPTO and EPO patents by more than 10 people, with several annotators for each patent.
In total: 51 patents, 2.5 million characters.
Statistics on Gold Standard
Annotation type            USPTO   EPO
Section.Abstract              23    28
S.BackgroundArt               19    22
S.BestMode                     2     5
S.BibliographicData           23    28
S.Bibliography                 0     8
S.Claims                      23     0
S.CrossReferenceToR.A.         6     1
S.DetailedDescription         11    18
S.DisclosureOfInvention        3     6
S.DrawingDescription          16    20
S.Effects                      1     2
S.Examples                    17    25
S.PreferredEmbodiment         10     7
S.PriorArt                     4     6
S.Sponsorship                  2     0
S.SummaryOfTheInvent.         20    18
S.TechnicalField              14    17
S.UsageOfInvention             1     6
Annotations/Doc              8.5     8

Annotation type            USPTO   EPO
Reference.Claim              352     2
R.Example                     99   264
R.Figure                     375   570
R.Formula                     79    66
R.Literature                 114   488
R.Patent                      92   182
R.Table                       59   105
Annotations/Doc               51    60

M.scalarValue               1998  3409
Measurement.unit            1613  2994
M.interval                   432   375
Annotations/Doc              176   242
Results on Gold Standard, Micro-averaged precision, recall
Annotation type       USPTO              EPO
                      P.    R.    F1     P.    R.    F1
S.BackgroundArt       74    74    74     56    68    61
S.DrawingDescr.       75    75    75     84    80    82
Section.Examples      65    65    65     61    56    58
S.SummaryOf.          89    80    84     83    83    83
S.TechnicalField      80    57    67     94    94    94

Reference.Claim      100   100   100    100   100   100
R.Example             97   100    99    100    99    99
R.Figure              99    99    99     99    98    98
R.Formula             99    99    99    100   100   100
R.Literature          69    75    72     70    74    72
R.Patent              76    77    77     72    84    78
R.Table              100    98    99    100   100   100

M.scalarValue         96    93    94     94    92    93
Measurement.unit      95    92    93     94    93    93
M.interval            93    92    93     82    81    82
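The scores in this table follow the usual definitions of precision, recall and F1. The Python sketch below shows the computation under strict matching, where a predicted annotation only counts if its type and both offsets equal a gold annotation. The function names and the toy annotations are illustrative assumptions; micro-averaging would sum the counts over all documents before dividing.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def strict_scores(gold, predicted):
    """Strict precision, recall and F1 over (type, start, end) annotations."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy data: the second prediction has a wrong end offset, so it does not count.
gold = [("unit", 4, 6), ("unit", 20, 22), ("interval", 30, 42)]
predicted = [("unit", 4, 6), ("unit", 20, 23), ("interval", 30, 42)]
precision, recall, f1 = strict_scores(gold, predicted)

# Consistency check against the S.TechnicalField (USPTO) row: P = 80, R = 57.
print(round(100 * f1_score(0.80, 0.57)))  # 67
```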
Section annotation: Examples (EPO)
Reference annotation: Literature (USPTO)
Measurement annotation: interval (EPO)
Conclusion
Fully automatic, scalable method (over a million patents, 100 GB).
Quality close to human annotators.
Perspective
Machine learning from annotated patents.
Semantic queries with Patent Ontology.