Query Enhancement for Patent Prior-Art-Search Based on Keyterm Dependency Relations and Semantic Tags

Master’s Thesis

Query Enhancement for Patent Prior Art Search

with Keyterm Dependency Relations

and Semantic Tags

Khanh Ly Nguyen

Department of Computer Science

KAIST

2011

Advisor: Professor Sung-Hyon Myaeng

By

Khanh Ly Nguyen

Department of Computer Science

KAIST

A thesis submitted to the faculty of KAIST in partial fulfillment of the requirements for the degree of Master of Science in Engineering in the Department of Computer Science. The study was conducted in accordance with the Code of Research Ethics.¹

23rd November, 2011

Approved by

Professor Sung-Hyon Myaeng

¹ Declaration of Ethical Conduct in Research: I, as a graduate student of KAIST, hereby declare that I have not committed any acts that may damage the credibility of my research. These include, but are not limited to: falsification, thesis written by someone else, distortion of research findings or plagiarism. I affirm that my thesis contains honest conclusions based on my own careful research under the guidance of my thesis advisor.

Query Enhancement for Patent Prior Art Search with

Keyterm Dependency Relations and Semantic Tags

Khanh Ly Nguyen

The present dissertation has been approved by the dissertation committee

as a master’s thesis at KAIST

November 23rd, 2011

Committee head: Professor Sung-Hyon Myaeng

Committee member: Professor Alice Oh

Committee member: Professor Ho-jin Choi

ICE

20074298

Khanh Ly Nguyen. Query Enhancement for Patent Prior Art Search with Keyterm Dependency Relations and Semantic Tags. Department of Information and Communication Engineering. Advisor Prof. Sung-Hyon Myaeng.

ABSTRACT

The steadily increasing number of patent applications and granted patents creates a critical demand for patent search. Prior art search is one of the most common types of patent search; its goal is to find patent documents that constitute prior art to a given patent. Current patent search systems are mostly keyword-based and, because patent documents are long and have complex structures, they do not perform very well. In this research, we propose a new query formulation method for patent prior art search that identifies the most discriminative terms using keyterm dependency relations. Instead of using only a single field, we examine individual fields and combinations of fields to find the best source for query formulation. Furthermore, we determine an appropriate number of key terms to include in the query by performing experiments with different query sizes. Our work differs from previously reported work in that, instead of using only keyterm extraction based on dependency relations, we combine the keyterm extraction with semantic tags, which are identified from patent documents, to find prior art patents with similar IPC codes. For prior art search evaluation, we applied a re-ranking method based on the IPC classification codes assigned to the patent document, since this method can aid in the identification of prior art patents without the extra cost of expert judgments and without relying on incomplete citations.

In this work, 36 experiments were conducted, and the results show that the proposed method achieves significant improvement over the baseline. The results indicate that: 1) for queries formulated from a single field, e.g. the top 10 terms from Abstract, we obtain improvements of 18% for Sub-class, 17% for Main-group, and 13% for Sub-group over the baseline method; 2) for queries formulated from combined fields, e.g. the top 10 terms from Abstract and the top 10 terms from Claims, we achieve improvements of 16% for Sub-class, 16% for Main-group, and 13% for Sub-group over the baseline method; 3) for queries combined with semantic tags, e.g. for Abstract, we achieve improvements of 46% for Sub-class, 42% for Main-group, and 45% for Sub-group over the baseline method. The experimental results also show that extracting terms from Description gives the best performance among all fields (e.g. Abstract, Claims). The reason is that the Description field specifies what the process or method of the invention is and how it differs from previous patents and technology. By identifying IFPS terms from Description, we achieve better performance when IFPS is used as a query by itself, and the best performance when it is combined with the terms selected by KDR. IFPS includes information about the areas a patent belongs to (IF), which can be very helpful for identifying the IPC sub-classes of a patent document, and Problems/Solutions (PS), which relate to limitations of previous patents and effects of the present invention and may help to identify the IPC main-groups or sub-groups of the query patent. We also show the effectiveness of IFPS terms when IFPS is combined with KDR terms or tf*idf terms. Adding IFPS yields much larger improvements, which shows that it is a good strategy for query expansion.

Our experiments show that terms about the details of the method or process of the invention are more significant for query formulation from Abstract or Claims, while terms about limitations or effects are more significant for query formulation from Description.

Keywords: patent retrieval, prior art retrieval, keyterm dependency relations, semantic tags, term co-occurrences

Contents

List of Tables
List of Figures
List of Abbreviations

Chapter 1. Introduction
1.1 Motivation
1.2 Contribution
1.3 Thesis Organization

Chapter 2. Background and Related Works
2.1 IPC Taxonomy
2.2 Patent document
2.3 Patent Analysis and Processing
2.4 Prior art Search

Chapter 3. Methodology
3.1 System Description
3.2 Pre-processing & Stop-word Removal
3.3 IFPS Extraction
3.3.1 Extraction of Invention Fields
3.3.2 Extraction of Problems and Solutions
3.4 Term Extraction based on Keyterm Dependency Relations
3.5 Query Formulation
3.6 Patent Indexing & Retrieval
3.7 Re-ranking based on IPC

Chapter 4. Experiments & Results
4.1 Data Collection and Preparation
4.2 Evaluation Metrics
4.3 Experimental Results
4.3.1 Data Statistics
4.3.2 Baseline
4.3.3 Experimental Results
4.5 Discussion
4.6 Conclusions & Future works

References
Acknowledgement
Curriculum Vitae
Publication

List of Tables

Table 1. IPC classifications
Table 2. IPC sections
Table 3. Details of Experimental Query Sets
Table 4. Statistics of the relevant IPC codes
Table 5. Statistics of the data extracted by KDR method
Table 6. Statistics of Semantic tags: Invention Fields (IF), Problems/Solutions (PS)
Table 7. Results of queries extracted from Abstract field
Table 8. Results of queries extracted from Claims field
Table 9. Results of queries extracted from Description field
Table 10. MAP values of queries from different fields
Table 11. Results of queries formulated from field combinations of Abstract and Claims
Table 12. Results of queries formulated from field combinations of Abstract and Description
Table 13. Results of queries formulated from field combinations of Claims and Description
Table 14. Results of queries formulated from field combinations of Abstract, Claims and Description
Table 15. Comparison of KDR queries when Titles are added
Table 16. Results of IFPS queries compared with tf-idf queries
Table 17. Results of KDR queries when adding IFPS compared with tf-idf queries for Abstract
Table 18. Results of KDR queries when adding IFPS compared with tf-idf queries for Claims
Table 19. Results of KDR queries when adding IFPS compared with tf-idf queries for field combination of Abstract and Claims
Table 20. Results of KDR queries formulated by top 10 terms from Abstract expanded with IFPS compared with tf*idf queries formulated by top 10 terms from Abstract and top 58 terms from Description
Table 21. Results of KDR queries formulated by top 20 terms from Claims expanded with IFPS compared with tf*idf queries formulated by top 10 terms from Claims and top 58 terms from Description
Table 22. Results of KDR queries formulated by combination of top 10 terms from Abstract plus top 20 terms from Claims expanded with IFPS compared with that of tf*idf queries expanded with top 58 terms from Description
Table 23. Results of tf*idf queries formulated by the top 10 terms from Abstract when IFPS is added
Table 24. Experiments results of tf*idf queries formulated by top 10 terms from Abstract plus IFPS compared with top 10 terms from Abstract plus top 58 terms from Description
Table 25. Results of tf*idf queries formulated by combination of top 10 terms from Abstract, top 20 terms from Claims when IFPS is added
Table 26. Example of top 10 terms extracted by KDR and tf*idf for Abstract field
Table 27. Example of top 10 terms extracted by KDR and tf*idf for Claims field
Table 28. Example of top 30 terms extracted by KDR and tf*idf for Description field
Table 29. Example of top 40 ~ 60 terms extracted by KDR and tf*idf for Description field

List of Figures

Figure 1. Example of a section hierarchy in IPC
Figure 2. Example of a patent document
Figure 3. System Architecture
Figure 4. Example of relations between semantic tags and the IPC of H01M
Figure 5. Example of Invention Field under applicant defined tag
Figure 6. Example of Invention Field with no applicant defined tag
Figure 7. Sample extracted IFPS
Figure 8. Problem Sample Patterns
Figure 9. Solution Sample Patterns
Figure 10. Example of a KDR graph
Figure 11. Steps of re-ranking based on IPC codes and Example

List of Abbreviations

IPC International Patent Classification
USPTO United States Patent and Trademark Office
NLP Natural Language Processing
KDR Keyterm Dependency Relation
IFPS Invention Fields, Problems and Solutions
IF Invention-Field
PS Problem/Solution
NP Noun Phrase
VP Verb Phrase
SC Sub-class
MG Main-group
SG Sub-group

Chapter 1. Introduction

1.1 Motivation

Patents are legal documents granted by patenting authorities to protect inventors’ rights. Patents can show technological details and relations, reveal business trends, inspire novel industrial solutions, or help shape investment policy, all of which is valuable to industry, business, law, and other fields. Companies and inventors who wish to file a new patent are interested in verifying that the invention is actually new with reference to the current state of the art. At the same time, they are interested in discovering infringements of their granted patents. Researchers are interested in finding patent information to avoid duplicating solutions already covered by patents and/or to freely reuse expired patents. Managers can exploit patent information to assess competitors, partners and suppliers, and to identify technology trends and new business opportunities. Finally, venture capitalists and investors can leverage patent-related information to select the targets of their financial operations, while third-party resellers can benefit from patent information when selecting their suppliers.

The number of patent applications and granted patents has been increasing constantly worldwide, creating a greater demand for patent analysis and search. Patent analysis aims at obtaining relevant patents and analyzing them in aggregate to produce patent maps [22] or discover trends [3] [23] [24]. Patent search is often conducted by inventors, patent attorneys, and technical and business experts to find the prior art and mitigate risks. There are many types of patent search, such as prior art or novelty search, validity search, infringement search, and clearance search. Prior art search is one of the most common types of search; its goal is to find patent documents that constitute prior art to a given patent [17]. Prior art search is performed before filing an application to ascertain the patentability of an invention by determining its novelty, or to invalidate a patent’s claim of originality. During the application process, patent experts examine a patent against all of the patents that have an earlier priority date, called prior art patents, to ensure that the claims of the target patent and the prior art patents do not overlap.

Traditionally, the patent examination process has been performed manually, which requires considerable effort and expertise in information retrieval, domain-specific technologies, and business intelligence. In addition, the increasing amount of patent information and the growing need to access it require the development of automatic search tools and new methodologies that shorten search times for patent awarding and can also increase the quality of the patents granted. Current patent search systems are mostly keyword-based and, because of the complex structures of patent documents, they do not perform very well. The success of automatic prior art search relies upon the selection of relevant search queries. However, queries are typically built by extracting terms from some textual fields of patent documents using TF/IDF [20][21], giving preference to terms in the Title [19], or by taking all words, mostly from Claims, without filtering [18]. Such queries may contain many ambiguous and vague terms that can affect the retrieval results. Also, it is difficult to know which terms are good for formulating a query. Patent documents may be related and relevant to the query but not contain the exact keyword or phrase. Similarly, many patents returned as the result of a query do contain the keywords but have no relevance to the intent of the searcher. The query size is also difficult to set. With few query terms, query processing is fast but the information need might be misrepresented. With many query terms, on the other hand, processing time becomes prohibitive and the query can contain many noisy terms. Therefore, a good query formulation is a key factor in achieving good effectiveness, and in this work we propose a query enhancement method for patent prior art search based on keyterm dependency relations and semantic tags.
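
As an illustration of the tf*idf-based query construction discussed above, the following minimal sketch ranks the terms of one patent field by tf*idf and keeps the top k terms as the query. It is not the implementation used in the cited systems or in this thesis: the function name `top_k_tfidf_terms`, the smoothed idf formula, and the token-list input format are assumptions made for illustration only.

```python
import math
from collections import Counter


def top_k_tfidf_terms(field_tokens, corpus_docs, k=10):
    """Return the k highest-scoring tf*idf terms of one patent field.

    field_tokens: list of (already stop-word-filtered) tokens from one field.
    corpus_docs:  list of token lists, one per document in the collection.
    Both input formats are assumptions made for this sketch.
    """
    n_docs = len(corpus_docs)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in corpus_docs:
        df.update(set(doc))

    # Term frequency inside the field used to build the query.
    tf = Counter(field_tokens)

    # Smoothed idf (an assumption; other idf variants are equally common).
    scores = {
        term: count * math.log((n_docs + 1) / (df[term] + 1))
        for term, count in tf.items()
    }

    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]
```

Here the parameter `k` plays the role of the query size: a small `k` keeps query processing fast but may drop informative terms, whereas a large `k` admits more noisy terms.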

1.2 Contribution

Previous work on prior art search focused on methods of formulating queries by identifying keywords from patent documents based on weighting schemes and by using cited patents to add further keywords, so as to increase the probability of retrieving relevant results. Most of this work stresses the complexity of patent structure and uses as the search query the Claims field extracted from a topic patent, which is considered the most informative part of a patent. In contrast, we propose a method for better query formulation to improve prior art search in the patent domain based on keyterm dependency relations in combination with semantic tags (IFPS). Instead of using only the Claims field as reported in [4] [5] [18], our idea is to use keywords from different fields of a patent document and from combinations of those fields to explore which source is best for query formulation. Furthermore, we determine the number of keywords that should be included in the query by running experiments with different query sizes. To improve the query formulation, we propose an algorithm that selects the most representative terms based on the dependency relations of terms in the same sentence. More specifically, instead of using only the term ranking algorithm based on dependency relations, we use this method in combination with semantic tags extracted from patent documents, which has not been done before, to get better results. For prior art search evaluation, we applied a re-ranking method based on the IPC classification codes assigned to the patent document, since this method can aid in the identification of prior art patents without the extra cost of expert judgments and without relying on incomplete citations.

1.3 Thesis Organization

This thesis is organized as follows. Chapter 2 describes the background and related work; it introduces the IPC taxonomy and the characteristics of a patent document, and discusses work related to patent analysis and patent search. Chapter 3 gives details of our methodology for query term extraction from patent documents using keyterm dependency relations and semantic tags. Chapter 4 reports evaluation results for our method using the corpus provided by NTCIR-6 and the test set we crawled from the USPTO database. Chapter 4 also discusses the results and compares our approach with the tf*idf baseline. Lastly, we conclude with a short summary and an outline of future work.

Chapter 2. Background and Related Works

2.1 IPC Taxonomy

The International Patent Classification (IPC) is a standard taxonomy developed by the World Intellectual Property Organization (WIPO) for classifying patents and patent applications. The IPC covers all areas of technology, including chemistry, mechanics, and electronics, which are classified into sections, classes, subclasses and groups, so that a specific topic can be identified easily and accurately. The IPC contains eight sections, about 120 classes, about 630 subclasses, 6,923 groups and approximately 60,700 subgroups, as shown in Table 1. Each section is designated by a capital letter from A to H, as shown in Table 2.

Table 1. IPC classifications

Table 2. IPC sections

A Human Necessities
B Performing Operations, Transporting
C Chemistry, Metallurgy
D Textiles, Paper
E Fixed Constructions
F Mechanical Engineering, Lighting, Heating, Weapons
G Physics
H Electricity

Each section is subdivided into 11 classes, whose symbols consist of the section symbol followed by a two-digit number. The classification symbol is made up of a letter denoting the IPC section, followed by a two-digit number denoting the IPC class (e.g., H01). Optionally, the classification can be followed by a letter denoting the IPC subclass (e.g., H01M), a number of one to three digits denoting the IPC main group (e.g., H01M 11), and a forward slash (“/”) followed by a number of one to three digits denoting the IPC subgroup (e.g., H01M 11/00). Table 2 shows a section and its classes/subclasses in IPC. An example of a section hierarchy in IPC is shown in Figure 1.
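
The structure of a classification symbol described above can be made concrete with a small parser. The sketch below is illustrative only and is not part of the system described in this thesis; the names `IPCSymbol` and `parse_ipc` are hypothetical, and the regular expression is an assumption that covers symbols of the form shown in the examples (H01, H01M, H01M 11, H01M 11/00).

```python
import re
from typing import NamedTuple, Optional


class IPCSymbol(NamedTuple):
    """One IPC symbol split into its hierarchy levels (hypothetical type)."""
    section: str
    ipc_class: str
    subclass: Optional[str]
    main_group: Optional[str]
    subgroup: Optional[str]


_IPC_RE = re.compile(
    r"^([A-H])(\d{2})"       # section letter followed by the two-digit class
    r"(?:([A-Z])"            # optional subclass letter
    r"(?:\s*(\d{1,3})"       # optional main group, one to three digits
    r"(?:/(\d+))?)?)?$"      # optional subgroup after "/" (any digit count accepted)
)


def parse_ipc(symbol: str) -> IPCSymbol:
    """Split a symbol such as 'H01M 11/00' into its hierarchy levels."""
    match = _IPC_RE.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"not a recognisable IPC symbol: {symbol!r}")
    section, cls, sub, main, subgrp = match.groups()
    ipc_class = section + cls
    subclass = ipc_class + sub if sub else None
    main_group = f"{subclass} {main}" if main else None
    subgroup = f"{main_group}/{subgrp}" if subgrp else None
    return IPCSymbol(section, ipc_class, subclass, main_group, subgroup)
```

Comparing the `subclass`, `main_group` and `subgroup` fields of two parsed symbols gives one simple way to check whether two patents agree at the Sub-class, Main-group or Sub-group level.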

Figure 1. Example of a section hierarchy in IPC

H SECTION H - ELECTRICITY
H01 BASIC ELECTRIC ELEMENTS
H01B CABLES; CONDUCTORS; INSULATORS; SELECTION OF MATERIALS FOR THEIR CONDUCTIVE, INSULATING, OR DIELECTRIC PROPERTIES (selection for magnetic properties H01F 1/00; waveguides H01P; installation of cables or lines, or of combined optical and electric, cables or lines H02G)
H01C RESISTORS
…
H01M PROCESSES OR MEANS, e.g. BATTERIES, FOR THE DIRECT CONVERSION OF CHEMICAL ENERGY INTO ELECTRICAL ENERGY (electrochemical processes or apparatus in general C25; semiconductor or other solid state devices for converting light or heat into electrical energy H01L, e.g. H01L 31/00, H01L 35/00, H01L 37/00)
H01M 2/00 Constructional details, or processes of manufacture, of the non-active parts
H01M 2/02 . Cases, jackets, or wrappings (working of plastics or substances in a plastic state
H01M 2/04 .. Lids or covers
…

2.2 Patent document

A patent document contains many items for analysis, including structured items, which are uniform in semantics and format (e.g. patent number, application number, patent class, filing date, issue date), and unstructured items, which are free text of varying length (e.g. Title, Abstract, Claims, and Description). For patent search, the unstructured items are the important text fields and dominate query formulation, but they are known to be difficult to process with traditional text processing and text retrieval techniques because of technical terminology, vague terms and complex structure. This complicates the examination of a patent document and particularly affects the patent retrieval process, because a precise query is needed to narrow the search and find relevant documents. Titles provide the least reliable clues for determining the relevancy of a patent because they contain only relatively short keywords and phrases. Abstracts are more informative and provide summaries of the claimed inventions. Claims contain the most central content of a patent and disclose the novelty of an invention. By reading the claims, we can determine the scope of the patent; however, the claims may be directed to only one embodiment, method, etc. Typically, claims are written in a patent-specific style consisting of one long sentence, starting with “We claim:” or “What is claimed is:” followed by a numbered list of items. Claims consist of multiple components (e.g. parts of a machine or substances of a chemical compound), and the terminology used in patent claims is highly dependent on the specific topic domain of the patent (e.g. secondary battery). There are two types of claims: independent claims and dependent claims. Independent claims broadly describe the invention and are not associated with any other claims, while dependent claims depend on a single claim or several claims and add some further limitation, such as a specific compound or condition. The Description may be the longest text in a patent; it elaborates the same content as the Claims in detail and is further segmented into: Field of the invention; Background/Prior Art, describing the problems that the invention solves and information related to the technical background; Summary, often a restatement of the Claims showing how the problem is solved; and Detailed description, a full description of the invention with definitions, specific examples and drawings. Some patents may not have all of these segments. Figure 2 shows an example of a patent document.
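
To illustrate the claim structure described above, the following sketch splits a Claims field into its numbered items and flags each item as dependent or independent. It is a rough heuristic written for illustration, not a component of the system in this thesis: the function name `split_claims`, the regular expressions, and the rule that a claim is dependent when its text refers to another claim by number (e.g. "according to claim 1") are all assumptions.

```python
import re

# Opening phrase of the Claims field, as described in the text.
_PREAMBLE_RE = re.compile(r"^\s*(?:We claim|What is claimed is)\s*:\s*", re.IGNORECASE)
# A claim number such as "1. " followed by a capitalised word (assumed layout).
_ITEM_RE = re.compile(r"(?:^|\s)(\d+)\.\s+(?=[A-Z])")
# A reference to another claim, e.g. "claim 1" or "claims 2" (assumed cue).
_REFERENCE_RE = re.compile(r"\bclaims?\s+\d+", re.IGNORECASE)


def split_claims(claims_text):
    """Return a list of dicts with the number, text and type of each claim."""
    body = _PREAMBLE_RE.sub("", claims_text, count=1)
    markers = list(_ITEM_RE.finditer(body))
    claims = []
    for i, marker in enumerate(markers):
        # A claim runs from its own number up to the next claim number.
        end = markers[i + 1].start() if i + 1 < len(markers) else len(body)
        text = body[marker.end():end].strip()
        claims.append({
            "number": int(marker.group(1)),
            "text": text,
            "dependent": bool(_REFERENCE_RE.search(text)),
        })
    return claims
```

Real claims contain many layouts that this heuristic does not handle (for example, numbered sub-items inside a claim), so it should be read only as a sketch of the independent/dependent distinction.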

2.3 Patent Analysis and Processing

Patent analysis and processing have long been considered useful in product innovation, patent mapping [22] and trend discovery [3] [23] [24]. Patent documents contain important technical knowledge and research results; however, they are lengthy and contain so much terminology that analyzing them requires a great deal of human effort. To obtain useful information, experts have to scan or read indexed patent documents from long lists of noisy results, which is a rather time-consuming task and requires careful manual selection. With the rapid increase in the number of patent documents, there is a need for a way to obtain useful and precise patent information quickly. Thus, automatic tools for patent analysis and processing that assist innovators or patent applicants are in great demand.

Figure 2. Example of a patent document

Patent No.: 7,897,284
Publication Date: March 1, 2011
Title: Lithium secondary battery
Abstract: A lithium secondary battery is provided with a positive electrode, a negative electrode (1), a separator interposed between the positive and negative electrodes…
Claims: What is claimed is: 1. A lithium secondary battery comprising: a negative electrode comprising a negative electrode current collector and… 2…
Description:
- Field of the invention: The present invention relates to lithium secondary batteries, and more particularly…
- Description of Related Art: Various mobile communication devices and mobile electronic devices such as laptop computers have emerged in recent years, and this has lead to…
- Summary of the invention: Accordingly, it is an object of the present invention to provide a lithium secondary battery that is capable of minimizing…
- Description of the drawings: FIG. 1 is a cross-sectional view illustrating a portion of the negative electrode of one example of the lithium secondary battery…
- Detailed Description of the invention: The lithium secondary battery according to the present invention is provided with…

A patent document contains structured and unstructured text. Approaches to patent analysis based on structured text have existed for years [28] [29]. For unstructured text, text mining techniques have been applied to derive information that assists patent analysis and processing tasks. In [30], a number of text mining techniques were developed, including text segmentation, summary extraction, feature selection, term association, cluster generation, topic identification, and topic mapping. Sentences were extracted by simply splitting the text at periods and question marks. Each sentence was then weighted by the number of keywords, title words, and clue words it contains, by the position of the paragraph containing the sentence, and by the position of the sentence within that paragraph. Natural language processing techniques have also been applied to the analysis of patent claims [33], to similarity analysis [34] [35] and to improving the readability of patents [36] [37]. In [32], an NLP methodology was proposed for analyzing patent claims that combines symbolic grammar formalisms with data-intensive methods while enhancing analysis robustness. [31] focused on discovering significant-rare words from Claims in a patent database. [33] presented a system called COA (Claim Originality Analysis) that assesses a patent by evaluating the originality of the invention described in it. [32] proposed an approach to find problem-solved concepts in the Detailed Description of a patent document by assigning more weight to the sentences appearing at the beginning and end of the text.
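
The sentence extraction and weighting summarized above for [30] can be sketched as follows. The text only names the features (keywords, title words, clue words, paragraph position, sentence position), so the additive scoring and the one-point position bonuses below are assumptions made for illustration; the function names `split_sentences` and `score_sentences` are hypothetical, and the sketch should not be read as the scoring formula of [30].

```python
import re


def split_sentences(paragraph):
    """Split a paragraph after each period or question mark."""
    parts = re.split(r"(?<=[.?])\s+", paragraph.strip())
    return [part for part in parts if part]


def score_sentences(paragraphs, keywords, title_words, clue_words):
    """Return (score, sentence) pairs, highest score first.

    keywords, title_words and clue_words are sets of lower-case terms.
    """
    scored = []
    for p_idx, paragraph in enumerate(paragraphs):
        for s_idx, sentence in enumerate(split_sentences(paragraph)):
            tokens = re.findall(r"[a-z0-9]+", sentence.lower())
            # One point for every keyword, title word and clue word occurrence.
            content = sum(
                (tok in keywords) + (tok in title_words) + (tok in clue_words)
                for tok in tokens
            )
            # Assumed position bonus: one point for the first paragraph and
            # one point for the first sentence of a paragraph.
            position = (p_idx == 0) + (s_idx == 0)
            scored.append((content + position, sentence))
    # sorted() is stable, so equally scored sentences keep document order.
    return sorted(scored, key=lambda pair: -pair[0])
```

The highest-ranked sentences can then be taken as a summary of the text; different feature weights would change the ranking, which is why the weights are left as explicit assumptions here.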

2.4 Prior art Search

Since patents play an important role in intellectual property protection, there has recently been growing interest in research into patent retrieval. Patent retrieval research started with NTCIR-3 [1], which released patent test collections to enable researchers to systematically evaluate their methodologies. In NTCIR-4 [2], a search task related to prior art search, also called an invalidity search run, was presented. The goal of the prior art search was to identify previously published patents in the collection that constitute the closest prior art to a given patent. It is also relevant for technical surveys that evaluate novelty or aim to invalidate a patent’s claim of originality. Prior art search is an essential step in the examination process of patent applications; however, it is time-consuming and laborious. Therefore, it is important to identify discriminative terms from patent documents to formulate queries that enhance the success of automatic prior art search.

Most previous research has focused on the Claims field, applying different term weighting methods for query generation, because Claims are thought to be the most informative part of a patent. To enhance the initial query, query expansion was performed by extracting effective and concrete terms from the Description field. In [4], Claims are first broken into components, and each component is then used separately to extract query terms. Query expansion is performed by using these terms to extract related query terms from the Detailed Description field of the patent document. A similar approach was introduced in [14], where query terms were components extracted from the topic claim and were expanded with terms from the explanation sentences related to those components in the Detailed Description. [26] studies the rhetorical structure of a claim and applies an associative document retrieval method, in which a document is used as a query to search for other similar documents. To produce an initial query, each claim is segmented into multiple components, which are then used to search for candidate documents on a component-by-component basis. [27] uses two retrieval stages based on query terms extracted from Claims. In the first stage, the query from Claims is used to retrieve the top 1,000 patents; in the second stage, several techniques are used to re-rank them. Evaluation results show that the effectiveness of the method varies depending on the test set used. In contrast, [18] does not distil any terms from the Claims but takes all the words as one long query, with no query expansion. In [18] [20] [21], queries are built by extracting terms from one of the text fields such as Title, Abstract, Claims, or Description. [46] shows that words from the Title field are the least useful for prior art search, whereas [19] gives preference to TF/IDF and to terms in the Title.

In NTCIR-4, expert judgments were used as the relevance data for patent evaluation; however, only 34 query topics were developed because of the cost. Also in NTCIR-4, the IPC codes were integrated into a probabilistic retrieval model for estimating the document prior. In NTCIR-5 and NTCIR-6, citations were used and thousands of query topics were developed automatically. However, evaluation based on citations has some limitations: citations have different degrees of relevancy, the citation language may differ from the patent application’s own publication language, and citation lists are incomplete [47]. Therefore, the IPC codes have been used as a feature for document filtering and patent retrieval. In [26], the authors use IPC codes for document filtering and show how this feature can help in patent retrieval.


Chapter 3. Methodology

This chapter describes our methodology for query formulation for patent prior art search.

3.1 System Description

Figure 3 shows the overall architecture of our patent retrieval system. The system is composed of query formulation based on semantic tags (IFPS), query formulation based on keyterm dependency relations, patent indexing, patent retrieval, re-ranking, and evaluation of the results.

In query formulation based on semantic tags (left part of Figure 3), only the Description field of a patent document is extracted as input text. IFPS extraction has two steps: extracting the Invention Field (IF) and extracting Problems and Solutions (PS). Details of IF extraction are discussed in Section 3.3. For Problem-Solution extraction, the Description field is tagged with the OpenNLP POS tagger [49]. We then apply pattern matching to extract Problems and Solutions. After that, we combine IF and PS and remove all redundant words and stop-words to formulate queries.

In query formulation based on keyterm dependency relations, terms are extracted from each patent field. Each field is used as input text and pre-processed: all redundant words and stop-words are removed, and the text is segmented into sentences at stop punctuation. Each sentence is represented as a graph in which each term is a node. The weight of each node is calculated, and the nodes are ranked in descending order of weight. The top N terms from each field are then selected to formulate queries. Queries are also formulated by merging queries from different fields.

The queries are then sent to the patent index to retrieve similar documents with relevance scores. The retrieved documents are re-ranked based on their IPC codes. Finally, we evaluate the results.


Figure 3. System Architecture

3.2 Pre-processing & Stop-word Removal

Given the input text, we segment it into sentences at stop punctuation and delete unimportant terms. We used van Rijsbergen’s stopword list, which consists of 570 words. We also used a list of 150 stopwords that we collected manually from patent documents; these words occur frequently in patents but carry little information about the content of a patent (e.g. figure, relates, said, apparatus, method, device). In total, we used 720 stopwords in this research.
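The pre-processing step can be sketched as follows. This is a minimal illustration, not the thesis implementation: the stop-word sets are short hypothetical stand-ins for the 570-word and 150-word lists, and the choice of ".", "!", "?" and ";" as stop punctuation is an assumption.

```python
import re

# Hypothetical, abbreviated stop-word lists; the real lists (570 general
# and 150 patent-specific words) are not reproduced here.
GENERAL_STOPWORDS = {"a", "an", "the", "of", "to", "is", "for", "and",
                     "in", "thus", "can", "be"}
PATENT_STOPWORDS = {"figure", "relates", "said", "apparatus", "method", "device"}
STOPWORDS = GENERAL_STOPWORDS | PATENT_STOPWORDS


def split_sentences(text):
    """Split text at assumed stop punctuation (. ! ? ;) followed by whitespace."""
    parts = re.split(r"(?<=[.!?;])\s+", text.strip())
    return [p for p in parts if p]


def content_terms(sentence):
    """Lower-case, tokenize on letters and hyphens, and drop stop-words."""
    tokens = re.findall(r"[a-z][a-z\-]*", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]


text = ("Thus, a nickel-metal hydride storage battery of high capacity "
        "can be provided. The said device relates to batteries.")
sentences = split_sentences(text)
terms = [content_terms(s) for s in sentences]
```

A real run would need a tokenizer that does not split on abbreviations such as "FIG. 1", which this sketch ignores.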

3.3 IFPS Extraction

A patent can be captured by a few elements such as “What problem does the invention solve?”, “What is the invention?”, and “What does the invention do?” [48]. The problem that an invention is going to solve is called the Problem (P), and what the invention is and what it does to solve the problem is called the Solution (S). For example, “long-cycle-life lithium secondary cells” is the problem and “utilizing a lithium ionic reaction” is the solution.

Problems and Solutions can be shared among a number of patents in the same domain. Intuitively, Problems and Solutions are important for describing the gist of a patent without processing lengthy queries. In addition, the Invention Field (IF) of a patent helps describe the area of technology (domain) to which the patent belongs (e.g. secondary battery). As in [45], patents are considered to belong to the same domain if they share the same semantic tags, which are defined by patent applicants (e.g. Means of solving the problems, Effects of the invention, Application field). For US patents, we examine whether semantic tags such as Invention Field, Problems, or Solutions are related to the IPC codes, which can aid in identifying related prior art patents. By extracting Invention Fields, Problems, and Solutions, we can therefore reduce the size of an input patent query, which helps in searching for prior art efficiently. In the domain of “secondary battery”, for example, we can retrieve about 1,000 patents from the USPTO database index for each patent query. However, it would be very difficult and time-consuming to process the words one by one to identify which patents are most related to the topic patent.

Figure 4 shows an example of IFPS phrases extracted from patents in the Batteries domain and how IFPS phrases assist in identifying the IPCs to which a patent belongs. As shown in the figure, IF phrases such as “rechargeable batteries”, “alkaline storage batteries”, or “high power nickel metal hydride batteries” contain the word “batteries”, which matches the name of the IPC Sub-class (Batteries). Likewise, problem phrases such as “positive electrode”, “positive electrode material”, or “composite positive electrode material” all contain the word “electrode”, which matches the name of the IPC Main-group (Electrodes). Similarly, solution phrases such as “nickel based multi metals oxide”, “nickel hydroxide material”, and “composite nickel electrode hydroxide particulate” all contain “nickel”, which matches the name of the IPC Sub-group.

The task of IFPS phrase extraction is to extract Invention Fields and Problem/Solution phrases from a patent document. It consists of the following three steps:

Step 1: Invention Fields are extracted from each patent document.

Step 2: Patent documents are tagged with the OpenNLP POS tagger, and pattern matching is applied to extract key terms as Noun Phrases or Verb Phrases.

Step 3: After generating the two candidate lists, we merge all the key phrases in the lists and remove all stop-words and redundant words. The result is a set of IFPS phrases.

The details of Steps 1 and 2 are described as follows:


Figure 4. Example of relations between semantic tags and the IPC of H01M

3.3.1 Extraction of Invention Fields

Invention Fields are extracted from the Description field; the Invention Field is generally the first sub-field of the Description of a patent document. Although all patent documents have a similar structure, as described in Section 2.2, only the titles of the main fields are fixed; the names of the detailed elements are labeled by applicants with no standard format. Automatically identifying the Invention Field part of a patent document is therefore a challenge. A number of patents have a separate Invention Field but label it with inconsistent phrases such as “Field of the Invention” or “Technical Field”. Other patents do not separate it and instead include the Invention Field in variations of “Background of the Invention”, “Prior art”, “Description of the Related Art”, etc. Meanwhile, a few patents (about 10%) do not have an Invention Field at all. To extract Invention Fields, we first extract the sub-fields whose tags contain variations of “Field of the Invention”. As shown in Figure 5, the Description field contains a separate Invention Field under the applicant-defined tag “Field of the Invention”, and the Invention Field we extract is shown in italic font. For patents that do not have a separate “Field of the Invention”, we extract sentences that contain “relates to”, a phrase mostly used for describing Invention Fields, from the variations of “Background of the Invention”. As shown in Figure 6, the Description field contains a non-separated Invention Field, namely the first two sentences containing “relates to” under the tag “Background of the Invention”. We do not use the clue “relates to” for all extractions because it would bring in too many sentences from other fields (e.g. Embodiments, Detailed Description) that may not be relevant to Invention Fields.

Figure 5. Example of Invention Field under applicant defined tag.

Figure 6. Example of Invention Field with no applicant defined tag.
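The two-case extraction of Invention Fields described in this section can be sketched roughly as below. The representation of a Description as a list of (heading, text) pairs, the function name, and the exact heading variants in the regular expressions are assumptions made for illustration; the thesis does not list the full set of variants it matched.

```python
import re

# Assumed heading variants, taken from the examples named in the text.
FIELD_HEADINGS = re.compile(r"field of the invention|technical field", re.IGNORECASE)
BACKGROUND_HEADINGS = re.compile(
    r"background of the invention|prior art|description of the related art",
    re.IGNORECASE,
)


def extract_invention_field(subfields):
    """Return Invention Field text from (heading, text) pairs in document order."""
    # Case 1: a separate, applicant-labelled Invention Field sub-field.
    for heading, text in subfields:
        if FIELD_HEADINGS.search(heading):
            return [text.strip()]
    # Case 2: fall back to "relates to" sentences, but only inside
    # Background-like sub-fields, to avoid noise from other fields.
    found = []
    for heading, text in subfields:
        if BACKGROUND_HEADINGS.search(heading):
            for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
                if "relates to" in sentence.lower():
                    found.append(sentence)
    return found


patent_a = [
    ("Field of the Invention", "The present invention relates to lithium secondary batteries."),
    ("Detailed Description", "This embodiment relates to a casing."),
]
patent_b = [
    ("Background of the Invention", "This invention relates to fuel cells. Fuel cells are known."),
    ("Detailed Description", "The example relates to a pump."),
]
patent_c = [("Detailed Description", "The example relates to a pump.")]
```

Here `patent_c` stands for the roughly 10% of patents without an Invention Field, for which the sketch returns an empty list.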

3.3.2 Extraction of Problems and Solutions

Problems and Solutions are also extracted from the Description field, since Problems are often found in “Background of the Invention” while Solutions are mostly found in the Summary part that follows. We use the OpenNLP POS tagger to tag the input descriptions. We manually analyzed patent documents to generate a list of clues that are used by a large number of patents. From these linguistic clues we created 24 patterns, so that Problems and Solutions can be extracted through a pattern matching process.

After extracting IFPS, we remove all redundant words and stop-words using the stop-word list (Section 3.2) to formulate queries.

Figure 7 shows examples of sentences that contain Invention Fields, Problems and Solutions in italic.


Figure 7. Sample extracted IFPS

Figure 8 and Figure 9 show examples of Problem and Solution patterns, respectively. The rationale behind developing patterns based on the clues is as follows. Since Problems and Solutions can be Noun Phrases (NP) or Verb Phrases (VP), we look for patterns that indicate PS phrases. For example, the pattern “method/NN for/IN” is usually followed by a noun phrase, and a noun phrase can precede the pattern “can/MD be/VB provided/VBN”. The patterns were extracted by analyzing the data and generalized by unifying patterns that share common syntactic labels. For example, “can/MD be/VB provided/VBN” and “can/MD be/VB obtained/VBN” are unified as “can/MD be/VB provided/VBN|obtained/VBN”.

Figure 8. Problem Sample Patterns

Pattern: {NP} + can/MD be/VB provided/VBN|improved/VBN|obtained/VBN
Input text: Thus, a nickel-metal hydride storage battery of high capacity can be provided.

Pattern: {NP} + is/VBZ improved/VBN in/IN + {NP}
Input text: …alkaline storage battery is improved in charging efficiency

Pattern: apparatus/NN|methods/NNS for/IN + {NP|V-ing + NP}
Input text: Apparatus for integrated-circuit battery devices

Pattern: provided/VBN + {NP}
Input text: There is provided an alkaline storage battery …

Figure 9. Solution Sample Patterns

Pattern: utilizing/VBG|employing/VBG|using/VBG + {NP}
Input text: lithium secondary battery employing the nonaqueous electrolyte.

Pattern: to/TO + {VBG+NP}
Input text: fuel cell within an external/JJ circuit/NN
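As an illustration of how one such pattern might be applied, the sketch below matches the trigger “can/MD be/VB provided|improved|obtained/VBN” in a POS-tagged sentence and collects the noun phrase in front of it. The word/TAG input format, the hand-tagged example sentence, the crude definition of a noun phrase as a run of DT/JJ/NN tokens (with “of” allowed inside), and all function names are assumptions; the actual 24 patterns and the chunking used in the thesis are not given.

```python
import re

NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP"}
TRIGGER = re.compile(r"can/MD be/VB (?:provided|improved|obtained)/VBN")


def parse_tagged(tagged):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for tok in tagged.split():
        if "/" in tok:
            word, tag = tok.rsplit("/", 1)
            pairs.append((word, tag))
    return pairs


def np_before_trigger(tagged):
    """Return the noun phrase directly preceding the trigger, or None."""
    match = TRIGGER.search(tagged)
    if match is None:
        return None
    tokens = parse_tagged(tagged[: match.start()])
    phrase = []
    # Walk backwards over a contiguous run of NP-like tokens;
    # "of/IN" is allowed inside the phrase (a simplifying assumption).
    for word, tag in reversed(tokens):
        if tag in NP_TAGS or (word.lower(), tag) == ("of", "IN"):
            phrase.append(word)
        else:
            break
    phrase.reverse()
    # Drop leading determiners or a dangling "of".
    while phrase and phrase[0].lower() in {"a", "an", "the", "of"}:
        phrase.pop(0)
    return " ".join(phrase) or None


tagged_sentence = (
    "Thus/RB ,/, a/DT nickel-metal/JJ hydride/NN storage/NN battery/NN "
    "of/IN high/JJ capacity/NN can/MD be/VB provided/VBN ./."
)
candidate = np_before_trigger(tagged_sentence)
```

In practice the tagged input would come from a POS tagger rather than being written by hand, and each of the patterns would need its own trigger and direction (phrase before or after the clue).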


3.4 Term Extraction based on Keyterm Dependency Relations

In the traditional process of term extraction, researchers represent a story as a bag of words (BOW) and use some criteria to score and sort these words. Words are thereby assumed to be independent, and the strong dependency relations that words can have when describing an event are ignored. Hence these methods often introduce noise, which reduces precision and recall. Recent studies have demonstrated the importance of dependency relations between words for topic tracking [40], text classification [38] [39], query expansion [41], and passage retrieval [42].

Our approach is based on the method for building a Keyword Dependency Profile [40], which utilizes keyword dependency relations (KDR) for topic tracking. The intuition is that a word may have strong dependency relations with other words, and these relations are important for describing information. Keyword dependency relations are evaluated by the co-occurrence of words in the same sentences. The weight of a keyword is high if the keywords it depends on are important, where the initial weight of a word is its tf-idf value. For example, consider two sentences:

Sentence 1: “Thus, a nickel-metal hydride storage battery of high capacity can be provided.”

Sentence 2: “Nickel based alloy layer for perpendicular recording media.”

In the first sentence, “nickel” and “battery” co-occur, so the sentence is probably related to the Battery domain. In the second sentence, “nickel” co-occurs with other words but not with “battery”, so the sentence is unlikely to be related to the Battery domain.

Figure 10 shows an example built from the sentence “Thus, a hydride storage battery of high capacity can be provided”. After removing all stop-words and punctuation, we have the list of keywords K = “hydride, storage, battery, capacity, provided”. The graph of words is created as shown in Figure 10. The number on a word is its initial importance weight, calculated by tf*idf, and the numbers beside the edges (e.g. 1, 2, 3) are the frequencies with which two words co-occur in the same sentences. After weighting by KDR, the weights of the terms change as shown in Figure 10. Words that have more edges and are connected to more important nodes receive higher weights; for example, “hydride” has a higher weight since it connects to important nodes such as “capacity”, “storage”, and “battery”.


Figure 10. Example of a KDR graph

An input text is segmented into sentences. After all redundant words and stop-words are removed, each sentence is represented as a graph in which each word is a node n and each edge e is the connection between two nodes. The weight of each node is calculated by the following formula:

$$w(n_k) = \sum_{l=1}^{m} tf(n_l) \times idf(n_l) \times tf(e_{k,l}) + 1$$

in which w(n_k) is the weight of node n_k,

m is the number of nodes that co-occur with node n_k in the same sentence,

n_l is a neighbor node that co-occurs with node n_k in the same sentence,

tf(n_l) is the term frequency of node n_l,

idf(n_l) is the inverse document frequency of node n_l,

e_{k,l} is the edge that connects node n_k and node n_l, and tf(e_{k,l}) is the frequency of edge e_{k,l} in the input text.
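The node weighting can be sketched as below, under one reading of the formula: the weight of a node sums, over its co-occurring neighbors, the neighbor's tf times its idf times the edge frequency, and 1 is added once to the sum. That grouping, the counting of edge frequency as the number of sentences in which a pair co-occurs, the precomputed `idf` dictionary, and the function names are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def kdr_weights(sentences, idf):
    """Weight terms by the importance of their co-occurring neighbors.

    sentences: list of term lists (already stop-word filtered).
    idf: dict term -> inverse document frequency (assumed precomputed).
    """
    tf = Counter(term for sent in sentences for term in sent)
    edge_tf = Counter()
    for sent in sentences:
        # One edge per unordered pair of distinct terms in a sentence.
        for a, b in combinations(sorted(set(sent)), 2):
            edge_tf[(a, b)] += 1

    weights = {term: 1.0 for term in tf}  # the assumed "+ 1" term
    for (a, b), freq in edge_tf.items():
        weights[a] += tf[b] * idf.get(b, 0.0) * freq
        weights[b] += tf[a] * idf.get(a, 0.0) * freq
    return weights


def top_terms(weights, n):
    """Return the n highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Toy input with hypothetical idf values.
sentences = [["hydride", "storage", "battery"], ["nickel", "battery"]]
idf = {"hydride": 2.0, "storage": 1.0, "battery": 1.0, "nickel": 3.0}
weights = kdr_weights(sentences, idf)
```

In this toy input “battery” ends up with the highest weight because it is connected to every other term, including the high-idf term “nickel”.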

3.5 Query Formulation

Query formulation for prior art search selects the most informative terms from a query patent document to form an effective query that can distinguish relevant patents from non-relevant patents in the patent collection. Our experiments focused on finding the most significant field, or combination of fields, for query formulation. Instead of selecting terms from a single field only, we also choose a particular number of terms from each field to obtain a better query. We determined the appropriate number of terms to include in the query by running experiments with different query sizes. We do not use Titles as a separate field because they contain relatively few key words and phrases, but we examine the value of Titles when combined with other fields. To obtain better results, instead of using only terms extracted based on keyterm dependency relations, we also combine them with IFPS phrases extracted from patent documents, which has not been done before.


After terms are extracted from a field by applying a weighting algorithm (e.g. KDR or tf*idf), query formulation is performed by taking the top N terms with the highest weights and formulating these N terms as one query. We used four types of queries in the retrieval process: queries from a separate field, queries from merged fields, queries merged with Titles, and queries merged with IFPS. Table 3 shows the details of the query sets.

For separate fields, we choose the query size N = 10 for Abstract, since the minimum number of terms in an Abstract is 11; N = 10, 20 for Claims, since the minimum number of terms is 23; and N = 10, 20, 30, 40, 60 for Description, since the minimum number of terms in a Description is 61.

Table 3. Details of Experimental Query Sets

Group | No. | Query Set | Query Description
Separate Field | 1 | Abs | Top 10 words from Abstract
Separate Field | 2 | Cla | Top N words from Claims (N = 10 ~ 20)
Separate Field | 3 | Des | Top N words from Description (N = 10 ~ 60)
Merged Field | 4-5 | 10Abs + 10/20Cla | Top 10 words from Abstract + Top 10/20 words from Claims
Merged Field | 6 | 10Abs + 60Des | Top 10 words from Abstract + Top 60 words from Description
Merged Field | 7-8 | 10/20Cla + 60Des | Top 10/20 words from Claims + Top 60 words from Description
Merged Field | 9 | 10Abs + 10/20Cla + 60Des | Top 10 words from Abstract + Top 10/20 words from Claims + Top 60 words from Description
Merged with Titles | 10 | Tit + 10Abs | Tit + Top 10 words from Abstract
Merged with Titles | 11 | Tit + 20Cla | Tit + Top 20 words from Claims
Merged with Titles | 12 | Tit + 60Des | Tit + Top 60 words from Description
Merged with Titles | 13 | Tit + 10Abs + 20Cla | Tit + Top 10 words from Abstract + Top 20 words from Claims
Merged with Titles | 14 | Tit + 10Abs + 20Cla + 60Des | Tit + Top 10 words from Abstract + Top 20 words from Claims + Top 60 words from Description
Merged with IFPS | 15 | IFPS | IFPS phrases
Merged with IFPS | 16 | IFPS + 10Abs | IFPS + Top 10 words from Abstract (by KDR)
Merged with IFPS | 17 | IFPS + 20Cla | IFPS + Top 10 words from Claims (by KDR)
Merged with IFPS | 18 | IFPS + 10Abs + 20Cla | IFPS + Top 10 words from Abstract (by KDR) + Top 10 words from Claims (by KDR)


For combinations of fields, we have 7 sets of queries formulated as follows: the top 10 terms from Abstract merged with the top 10/20 terms from Claims; the top 10 terms from Abstract merged with the top 60 terms from Description; the top 10/20 terms from Claims merged with the top 60 terms from Description; and the top 10 terms from Abstract merged with the top 10/20 terms from Claims and the top 60 terms from Description.

For combinations with Titles, we choose only the number of terms that gave the best results for each field, such as 20 terms from Claims and 60 terms from Description. For combinations with IFPS, we have 4 different sets of queries, as shown in Table 12. We do not combine IFPS with terms from Description since IFPS phrases are identified from this field.
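Selecting the top N terms per field and merging them into one query might look like the sketch below. The function names and the toy weights are hypothetical, and dropping duplicate terms when merging is an assumption; the thesis does not say how terms shared between fields are handled.

```python
def top_n(weights, n):
    """Return the n highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


def merge_queries(*term_lists):
    """Concatenate term lists, keeping the first occurrence of each term."""
    seen, merged = set(), []
    for terms in term_lists:
        for term in terms:
            if term not in seen:
                seen.add(term)
                merged.append(term)
    return merged


# Toy per-field term weights (e.g. produced by KDR or tf*idf).
abstract_weights = {"battery": 3.0, "nickel": 2.0, "layer": 1.0}
claims_weights = {"electrode": 5.0, "battery": 4.0, "oxide": 1.0}

# A miniature analogue of the "10Abs + 20Cla" query set, with N = 2 per field.
query = merge_queries(top_n(abstract_weights, 2), top_n(claims_weights, 2))
```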

3.6 Patent Indexing & Retrieval

The Lemur Indri search engine, which is based on a combination of language modeling and an inference network framework, was used to index patent documents and to retrieve similar documents for a given query. No stemming or stop-word removal was done. For each query, we retrieved the top 1,000 patents from the corpus that contain query terms. Each retrieved patent was assigned a relevance score. We used the Okapi BM25 formula, which has been used in many retrieval systems, as the ranking model.

3.7 Re-ranking based on IPC

Patent retrieval results can be evaluated by comparison with expert judgments, citations, or IPC codes. Of these three, we employ IPC codes to improve the ranked list of retrieved patent documents. Since IPC codes are assigned to each patent by patent experts, they avoid the limitations of the other two methods, namely cost and incomplete citations.

Figure 11 shows the steps of re-ranking based on IPC codes, together with an example. As shown in Figure 11, after the retrieval process, the top N retrieved patents with their relevance scores are re-ranked using their IPC codes. The retrieved patent ids are mapped to IPC codes using the IPC list provided by NTCIR-6. We also separate the IPC codes into Sub-class, Main-group, and Sub-group, to which weights are applied separately. Then, we calculate the average relevance score at the Sub-class, Main-group, and Sub-group levels with the following formula:

Score(IPC_i) = (∑ scores of the distinct IPCs) / (# of all distinct IPCs)

Here, X indicates an IPC code and n is the number of patents that X is assigned to within the top N retrieved patents.

Figure 11. Steps of re-ranking based on IPC codes and Example
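The sketch below illustrates two ingredients of this step: splitting an IPC code into its Sub-class, Main-group, and Sub-group levels, and averaging the relevance scores of the retrieved patents that carry each code at a given level. The averaging is one reading of the score described above (the sum of scores of patents assigned a code, divided by the number n of such patents); how the averaged scores and the per-level weights are combined into the final ranking is not specified in the text and is not modelled here. Function names and the toy scores are hypothetical.

```python
import re
from collections import defaultdict


def ipc_levels(code):
    """Split an IPC code such as 'H01M 4/52' into its three levels."""
    m = re.fullmatch(r"\s*([A-H]\d{2}[A-Z])\s*(\d+)/(\d+)\s*", code)
    if m is None:
        raise ValueError(f"unrecognised IPC code: {code!r}")
    subclass, main, sub = m.groups()
    return {
        "subclass": subclass,
        "main_group": f"{subclass} {int(main)}",
        "subgroup": f"{subclass} {int(main)}/{sub}",
    }


def average_ipc_scores(retrieved, level):
    """Average the relevance scores of the patents assigned each IPC code.

    retrieved: list of (relevance_score, [ipc_code, ...]) for the top-N patents.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for score, codes in retrieved:
        # Count a patent once per distinct code at this level.
        for key in {ipc_levels(c)[level] for c in codes}:
            totals[key] += score
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


retrieved = [
    (2.0, ["H01M 4/52", "H01M 4/32"]),
    (4.0, ["H01M 10/24"]),
    (3.0, ["H01G 9/00"]),
]
by_subclass = average_ipc_scores(retrieved, "subclass")
by_main_group = average_ipc_scores(retrieved, "main_group")
```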


Chapter 4. Experiments & Results

In this chapter, we present the experimental results. Query formulation was carried out with two methods: keyword dependency relations and semantic tags. We compared the results with tf-idf and evaluated them at three IPC levels: Sub-class, Main-group, and Sub-group. We chose tf-idf as the baseline since it has been used in much previous research. The following sections describe our data collection, evaluation metrics, experimental results, and future work.

4.1. Data Collection and Preparation

For the experiments, three data sets were collected: (1) a corpus of patent documents to search; (2) a set of patent queries; and (3) relevance judgments for patent documents in the corpus. For (1), we use the NTCIR-6 corpus, which consists of 1,315,470 patent documents published from 1993 until 2002. All fields of a patent (e.g. Title, Abstract, Claims, Description) were indexed using the Lemur toolkit [25]. For (2), we choose patent documents that belong to the domain of Batteries (H01M) and were published from 2003 onward. Although the data is related to Batteries, our methodology can also be applied to other domains, since a patent is assigned more than one IPC code. To collect the patent documents, we issued several queries containing the International Classification of Batteries (e.g. ICL/H01M004/52 to search for H01M 4/52) on the USPTO patent search website [44], and then crawled only patents published after 2003. For (3), to evaluate the experiments, the IPC codes of the query patents are compared with the IPC codes of the retrieved patents. For the retrieved patents, the list of IPC codes provided by NTCIR-6 for the corpus is used. For the query patents, we extract the IPC codes from the IPC tags contained in the crawled data, since the list provided by NTCIR-6 does not cover the IPCs of patents published after 2002. IPC codes are separated into Sub-class, Main-group, and Sub-group. This separation allows us to apply weights to Sub-class, Main-group, and Sub-group separately and to determine their relative influence on the ranked list of retrieved documents. Table 4 shows the statistics of the relevant IPC codes of the patent query set. The total number of relevant IPC codes is 415, 1001, and 1829 for Sub-class, Main-group, and Sub-group, respectively.

Table 4. Statistics of the relevant IPC codes

Relevant IPCs Total Min Max

Sub-class 415 1 7

Main-group 1001 2 8

Sub-group 1829 6 23


4.2 Evaluation Metrics

To evaluate the experimental results, we chose the most commonly used metrics in IR. For the prior art search task, these are Mean Average Precision (MAP), Recall (R), and Precision@5 (P@5). The measures were computed with the trec_eval program [50], which was written by Chris Buckley and is commonly used in the TREC evaluation campaigns.

Recall measures the ability of a system to retrieve the relevant items in the collection in response to a query; it is the fraction of all relevant items that are retrieved. Recall is calculated as follows:

$$\mathrm{Recall} = \frac{|\{\text{relevant items}\} \cap \{\text{retrieved items}\}|}{|\{\text{relevant items}\}|}$$

Precision measures the ability of a system to retrieve only relevant items; it is the fraction of retrieved items that are relevant. Precision is calculated as follows:

$$\mathrm{Precision} = \frac{|\{\text{relevant items}\} \cap \{\text{retrieved items}\}|}{|\{\text{retrieved items}\}|}$$

P@N is the precision at a rank cutoff N. Rather than considering the entire retrieved set, which can be quite large or even the entire collection, we pick a rank cutoff and calculate the precision among only the top N ranked documents; in this work, the rank cutoff is 5. A high P@5 indicates that a user can expect to see many relevant documents near the top of the list, even if the precision of the entire retrieved set is low.
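For readers who want to check the measures by hand, the following minimal sketch computes Recall, P@k, Average Precision, and MAP from a ranked list and a set of relevant items using their standard definitions. It is not the trec_eval implementation; the function names and the toy document ids are illustrative.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall(ranked, relevant):
    """Fraction of all relevant items that appear in the ranked list."""
    if not relevant:
        return 0.0
    return len(set(ranked) & set(relevant)) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant hit,
    divided by the total number of relevant items."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
p5 = precision_at_k(ranked, relevant, 5)
r = recall(ranked, relevant)
ap = average_precision(ranked, relevant)
```

In the toy run two of the three relevant documents are retrieved, at ranks 1 and 3, so the unretrieved document lowers both Recall and Average Precision.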

4.3 Experimental Results

In this section, we present and discuss the experimental results for query formulation based on two methods: keyword dependency relations and semantic tags. The results demonstrate the effectiveness of these methods in patent prior art search.

4.3.1 Data Statistics

Table 5 and Table 6 show the statistics of keyword dependency relations (KDR) and semantic tags (IFPS), respectively.

As shown in Table 5, the average number of graphs per document is 40, while the minimum and maximum numbers of graphs are 6 and 256, respectively. Each graph contains 8 nodes on average, with a minimum of 3 and a maximum of 58. Each node has 7 edges on average, with a minimum of 2 and a maximum of 57.

Statistics of KDR Average Min Max

#of graphs per document 40 6 256

#of nodes per graph 8 3 58

#of edges per node 7 2 57

Table 5. Statistics of the data extracted by KDR method

Statistics of IFPS Average Min Max

#of IF per document 16 0 113

#of PS per document 53 7 263

#of IFPS per document 58 11 279

Table 6. Statistics of Semantic tags: Invention Fields (IF), Problems/Solutions (PS)

As shown in Table 6, the average number of Invention Field phrases per document is 16, while the minimum and maximum numbers are 0 and 113, respectively. The minimum is 0 because about 10% of patents do not have an Invention Field, as mentioned in Section 3.3.1. The average number of Problem/Solution phrases per document is 53, while the minimum and maximum numbers are 7 and 263, respectively. Only about 1% of patents have more than 260 Problem/Solution phrases. These patents discuss the problems or solutions of previous inventions at length, so they yield more candidate phrases than other patents. After merging Problem/Solution phrases and Invention Field phrases and removing all redundant words and stop-words, the average number of IFPS phrases per document is 58, while the minimum and maximum numbers are 11 and 279, respectively.

4.3.2 Baseline

We chose tf*idf as our baseline for comparison since it is the most commonly used method in previous work on patent prior art search. Tf*idf is a statistical measure of how important a word is to a document in a collection. The importance increases proportionally to the number of times the word appears in the document but is offset by the frequency of the word in the corpus. Tf*idf assigns a weight to a term t in document d given by:

$$\text{tf-idf}_{t,d} = tf_{t,d} \times idf_t$$

where tf_{t,d} is the frequency of term t in document d and idf_t is the inverse document frequency, which is calculated as follows:

$$idf_t = \log \frac{|D|}{|\{d : t \in d\}|}$$

where |D| is the total number of documents in the corpus and |{d : t ∈ d}| is the number of documents in which term t appears.
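A small worked sketch of the baseline weighting follows. It assumes the common logarithmic form of idf with the natural logarithm and raw term counts for tf; the base of the logarithm and any normalisation used in the experiments are not stated, and the function names and toy corpus are illustrative.

```python
import math
from collections import Counter


def idf(term, corpus):
    """log(|D| / number of documents containing the term); 0 if unseen."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0


def tf_idf(doc, corpus):
    """Map each term of `doc` to its raw count times its idf."""
    tf = Counter(doc)
    return {term: count * idf(term, corpus) for term, count in tf.items()}


# Toy corpus: each document is a list of already pre-processed terms.
corpus = [
    ["nickel", "battery", "battery"],
    ["nickel", "alloy"],
    ["battery", "storage"],
]
scores = tf_idf(corpus[0], corpus)
```

Here “battery” scores twice as high as “nickel” in the first document: both terms occur in two of the three documents, but “battery” appears twice in the document itself.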

4.3.3 Experimental Results

We ran experiments on query formulation with each method to test its effectiveness individually. We also ran experiments on query formulation that combines the two methods to increase retrieval effectiveness. In Section 4.3.3.1, we present the experimental results of keyword dependency relations (KDR). In Section 4.3.3.2, we present the experimental results of semantic tags (IFPS). In Section 4.3.3.3, we present the experimental results of combining the two methods.

4.3.3.1 Results of Query Formulation by Keyword Dependency Relation (KDR)

Keyword Dependency Relation for Query Formulation from Separate Field

In general, KDR outperformed tf-idf in selecting terms from a patent field. KDR calculates the importance of terms using dependency relations between key terms, while tf-idf does not reflect this information. As mentioned in Section 2.2, Titles contain very few terms, so using Titles as a separate field is not advantageous. Consequently, we use Title terms only in combination with other queries to see their value. For query formulation from separate fields, we ran experiments on three fields: Abstract, Claims, and Description. For Abstract, we ran experiments with only 10 query terms since the smallest Abstract in our data has 11 terms. For Claims, we ran experiments with query lengths of 10 and 20 terms, and for Description, with query lengths from 10 to 60 terms. The results show that increasing the query length improves the scores; however, once the query length exceeds a limit, adding more terms does not further improve performance.

Tables 7 to 9 report the performance of query formulation by the keyword dependency relation method for the three fields Abstract, Claims, and Description. Furthermore, Table 10 briefly summarizes the most important results for query formulation from each field shown in Tables 7 to 9.

Table 7 shows the experimental results for queries extracted from the Abstract field. For Abstract, KDR improves MAP significantly: by 18.2% for Sub-class, 17.1% for Main-group, and 13.4% for Sub-group. Although Recall decreases very slightly for Sub-class (-0.2%) and Main-group (-1.3%), it increases for Sub-group (+7.3%).


Query Length | Method | Sub-Class Recall | Sub-Class MAP | Sub-Class P@5 | Main-Group Recall | Main-Group MAP | Main-Group P@5 | Sub-Group Recall | Sub-Group MAP | Sub-Group P@5
10 | Tf-idf | 0.985 | 0.5663 | 0.1951 | 0.928 | 0.4422 | 0.3221 | 0.768 | 0.2062 | 0.2254
10 | KDR | 0.983 (-0.2%) | 0.6691 (+18.2%) | 0.210 (+7.9%) | 0.916 (-1.3%) | 0.5178 (+17.1%) | 0.3803 (+18.1%) | 0.824 (+7.3%) | 0.2338 (+13.4%) | 0.2434 (+8.0%)

Table 7. Results of queries extracted from Abstract field
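The percentages in parentheses appear to be the change of the KDR value relative to the tf-idf value in the same column. This reading is an assumption, but it reproduces the reported figures for the MAP and Recall entries checked below; the helper name is hypothetical.

```python
def relative_change(baseline, value):
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# Sub-class MAP and Recall for Abstract queries (tf-idf vs. KDR).
map_change = round(relative_change(0.5663, 0.6691), 1)
recall_change = round(relative_change(0.985, 0.983), 1)
```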

Table 8 shows the experimental results for queries extracted from the Claims field. For top-10-term queries, KDR improves MAP over tf-idf by 7.0% for Sub-class and 5.5% for Main-group, and is slightly worse than tf-idf for Sub-group (-3.7%). Since the Claims field contains many components, the frequency of terms in this field is higher than in other fields. This explains why KDR does not work as well as tf-idf for top-20-term queries.

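Query Length | Method | Sub-Class Recall | Sub-Class MAP | Sub-Class P@5 | Main-Group Recall | Main-Group MAP | Main-Group P@5 | Sub-Group Recall | Sub-Group MAP | Sub-Group P@5
10 | Tf-idf | 0.988 | 0.6401 | 0.2172 | 0.945 | 0.4992 | 0.3566 | 0.840 | 0.2370 | 0.2680
10 | KDR | 0.986 (-0.2%) | 0.6852 (+7.0%) | 0.2189 (+0.8%) | 0.942 (-0.3%) | 0.5270 (+5.5%) | 0.3730 (+4.6%) | 0.830 (-1.1%) | 0.2282 (-3.7%) | 0.2393 (-10.7%)
20 | Tf-idf | 0.995 | 0.7299 | 0.2352 | 0.960 | 0.5670 | 0.4066 | 0.885 | 0.2747 | 0.3008
20 | KDR | 0.995 (0%) | 0.7205 (-1.3%) | 0.2352 (0%) | 0.965 (+0.5%) | 0.5639 (-0.5%) | 0.4041 (-0.6%) | 0.875 (-1.1%) | 0.2554 (-7.0%) | 0.2721 (-9.5%)

Table 8. Results of queries extracted from Claims field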

Table 9 shows the experimental results for queries extracted from the Description field. In general, longer queries gave better results. For queries of 40 to 60 terms, KDR gave better Sub-class MAP than tf-idf, and better Main-group MAP at 50 and 60 terms. For queries of 10 to 30 terms, KDR did not work as well as tf-idf. This is because short tf-idf queries contain terms about problems or solutions, which have very high frequency in Description, while short KDR queries include abbreviations. In the following example, the top 10 term query by KDR contains three abbreviations (ag, ca, cr), while the tf*idf query contains none and has more terms about problems (e.g. charging, overvoltage, storage):

Top 10 terms by KDR: “electrode positive nickel oxide temperature ag ca material effect cr”.

Top 10 terms by tf*idf: “charging nickel overvoltage storage alkaline batteries absorbing positive effect oxygen”.

Query           Sub-Class                                          Main-Group                                         Sub-Group
Length  Method  Recall           MAP              P@5              Recall           MAP              P@5              Recall           MAP              P@5
10      Tf-idf  0.993            0.7243           0.2402           0.961            0.5582           0.4008           0.878            0.2662           0.2852
        KDR     0.990 (-0.3%)    0.7024 (-3.0%)   0.2279 (-5.1%)   0.944 (-1.7%)    0.5167 (-7.4%)   0.3811 (-4.9%)   0.850 (-3.2%)    0.2399 (-9.8%)   0.2623 (-8.0%)
20      Tf-idf  0.990            0.7598           0.2475           0.964            0.5913           0.4328           0.899            0.2824           0.3000
        KDR     0.988 (-0.2%)    0.7468 (-1.7%)   0.2443 (-1.3%)   0.948 (-1.6%)    0.5633 (-4.7%)   0.4090 (-5.5%)   0.884 (-1.7%)    0.2619 (-7.3%)   0.2934 (-2.2%)
30      Tf-idf  0.995            0.7745           0.2533           0.964            0.5958           0.4361           0.913            0.2921           0.3066
        KDR     0.990 (-0.5%)    0.7653 (-1.2%)   0.2557 (+1.0%)   0.959 (-0.5%)    0.5760 (-3.3%)   0.4131 (-5.3%)   0.907 (-0.6%)    0.2745 (-6.0%)   0.2984 (-2.6%)
40      Tf-idf  0.998            0.7658           0.2533           0.965            0.5920           0.4344           0.912            0.2883           0.3033
        KDR     0.993 (-0.5%)    0.7773 (1.5%)    0.2525 (-0.3%)   0.945 (-2.0%)    0.5875 (0.7%)    0.4205 (-3.2%)   0.910 (-0.2%)    0.2796 (-3%)     0.3033 (0%)
50      Tf-idf  0.995            0.7725           0.2566           0.962            0.5933           0.4344           0.917            0.2894           0.3139
        KDR     0.995 (0%)       0.7894 (+2.2%)   0.2582 (+0.6%)   0.965 (+0.3%)    0.6024 (+1.5%)   0.4385 (+0.9%)   0.908 (-0.9%)    0.2863 (-1.1%)   0.3164 (+0.8%)
60      Tf-idf  0.995            0.7751           0.2541           0.962            0.5921           0.4336           0.909            0.2890           0.3123
        KDR     1.000 (+0.5%)    0.8083 (+4.3%)   0.2557 (+0.6%)   0.966 (+0.4%)    0.6086 (+2.8%)   0.4320 (-0.3%)   0.910 (+0.1%)    0.2885 (-0.1%)   0.3230 (+3.4%)

Table 9. Results of queries extracted from Description field


Table 10 summarizes the most important results for query formulation from each field, as previously shown in Tables 7 to 9. KDR gave better MAP than tf-idf for Abstract and, at the Sub-class and Main-group levels, for Description; for Claims at length 20 it was slightly below tf-idf at those two levels. The query length for Abstract is 10 because the minimum number of Abstract terms in our data is 11; the most appropriate query length is 20 for Claims and 60 for Description. Table 10 also shows that extracting terms from Description gave the best performance of all fields. The reason is that the Description field specifies what the process or method of the invention is and how it differs from previous patents and technology. Description also starts with general background information on the area to which the invention belongs and then presents the invention in increasing levels of detail. Therefore, Description contains terms related to the area of the patent, which help to identify the IPC Sub-class; terms about limitations of previous patents and effects of the present invention, which may help to identify the IPC Main-group; and terms about details of the method or process, which can help to identify the IPC Sub-group.

Field        Query Length  Method  Sub-Class  Main-group  Sub-group
Abstract     10            Tf-idf  0.5663     0.4422      0.2062
                           KDR     0.6691     0.5178      0.2338
Claims       20            Tf-idf  0.7299     0.5670      0.2747
                           KDR     0.7205     0.5639      0.2554
Description  60            Tf-idf  0.7751     0.5921      0.2890
                           KDR     0.8083     0.6086      0.2885

Table 10. MAP values of queries from different fields

Keyword Dependency Relation for Query Formulation from Combined Fields

In this section, we present experimental results for query formulation by combining fields, to assess the effectiveness of field combinations. Queries were created by selecting the top N terms from field A and combining them with the top N terms from field B; all redundant terms were removed. There are four types of combined queries: three are combinations of two fields (Abstract and Claims; Abstract and Description; Claims and Description), and the fourth is a combination of all three fields (Abstract, Claims and Description). For Abstract, the query length is 10 since the minimum number of terms is 11. For Claims and Description, we chose query sizes of 20 and 60, respectively, since those are the most appropriate numbers of terms for those fields, as shown by our previous experiments (Section 4.3.3.1).
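The combination step can be sketched as follows: a minimal illustration, assuming that each field already has a ranked term list and that redundancy means exact repetition of a term. The function name `combine_fields` and the dictionary layout are hypothetical; the example terms are the top 10 KDR terms for Abstract and Claims listed in Tables 26 and 27.

```python
from typing import Dict, List


def combine_fields(ranked_terms: Dict[str, List[str]], top_n: Dict[str, int]) -> List[str]:
    """Concatenate the top-N terms of each field, dropping repeated terms."""
    query: List[str] = []
    seen = set()
    for field, n in top_n.items():
        for term in ranked_terms[field][:n]:
            if term not in seen:  # a term already taken from another field is redundant
                seen.add(term)
                query.append(term)
    return query


# Ranked KDR terms per field for one query patent.
ranked_terms = {
    "abstract": "cr ti ca material electrode active positive metal oxide conductive".split(),
    "claims": "ti cr ca material ni battery metal oxide active nickel".split(),
}

query = combine_fields(ranked_terms, {"abstract": 10, "claims": 10})
print(len(query), " ".join(query))
```

In this example the 20 selected terms collapse to 13 distinct query terms, so a combined query is usually shorter than the sum of the per-field lengths.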

Tables 11 to 14 show the results of combined queries by KDR compared with tf*idf. KDR gave better performance than the baseline for all combined queries. We obtained the best results for the three-field combination queries, which were formulated from the top 10 terms from Abstract, the top 20 terms from Claims and the top 60 terms from Description. Among the queries formulated from combinations of two fields, those formulated from the top 20 terms from Claims combined with the top 60 terms from Description achieved better results than the others.

Table 11 shows the results for queries formulated from the top 10 terms from Abstract combined with the top 10 terms from Claims. As shown in Table 11, KDR achieves a MAP improvement of 16% for Sub-class, 16.2% for Main-group and 13.3% for Sub-group over the baseline.

                                 Sub-Class                                          Main-Group                                         Sub-Group
Query                    Method  Recall           MAP              P@5              Recall           MAP              P@5              Recall           MAP              P@5
Top 10 Abs + top 10 Cla  Tf-idf  1.000            0.6676           0.2336           0.953            0.5200           0.3787           0.856            0.2475           0.2770
                         KDR     0.995 (-0.5%)    0.7742 (+16%)    0.2418 (+3.5%)   0.955 (+0.2%)    0.6041 (+16.2%)  0.4369 (+15.4%)  0.885 (+3.4%)    0.2803 (+13.3%)  0.2918 (+5.3%)

Table 11. Results of queries formulated from field combinations of Abstract and Claims

Table 12 shows the results for queries formulated from the top 10 terms from Abstract combined with the top 60 terms from Description. As shown in Table 12, KDR achieves a MAP improvement of 5.2% for Sub-class, 4.4% for Main-group and 1.5% for Sub-group over the baseline.

Table 13 shows the results for queries formulated from the top 20 terms from Claims combined with the top 60 terms from Description. As shown in Table 13, KDR achieves a MAP improvement of 4.9% for Sub-class, 3.5% for Main-group and 1.8% for Sub-group over the baseline.

Table 14 shows the results for queries formulated from the three-field combination, which includes the top 10 terms from Abstract, the top 20 terms from Claims and the top 60 terms from Description. As shown in Table 14, KDR achieves a MAP improvement of 5.4% for Sub-class, 4.8% for Main-group and 2.4% for Sub-group over the baseline.


                                 Sub-Class                                          Main-Group                                         Sub-Group
Query                    Method  Recall           MAP              P@5              Recall           MAP              P@5              Recall           MAP              P@5
Top 10 Abs + top 60 Des  Tf-idf  0.995            0.7719           0.2574           0.965            0.5902           0.4295           0.909            0.2900           0.3131
                         KDR     0.998 (+0.3%)    0.8124 (+5.2%)   0.2615 (+1.6%)   0.963 (-0.2%)    0.6159 (+4.4%)   0.4459 (+3.8%)   0.907 (-0.2%)    0.2944 (+1.5%)   0.3270 (+4.4%)

Table 12. Results of queries formulated from field combinations of Abstract and Description

                                 Sub-Class                                          Main-Group                                         Sub-Group
Query                    Method  Recall           MAP              P@5              Recall           MAP              P@5              Recall           MAP              P@5
Top 20 Cla + top 60 Des  Tf-idf  0.995            0.7779           0.2590           0.964            0.5978           0.4377           0.909            0.2919           0.3205
                         KDR     0.998 (+0.3%)    0.8157 (+4.9%)   0.2648 (+2.2%)   0.967 (+0.3%)    0.6184 (+3.5%)   0.4434 (+1.3%)   0.992 (+9.1%)    0.2970 (+1.8%)   0.3221 (+0.5%)

Table 13. Results of queries formulated from field combinations of Claims and Description

                                              Sub-Class                                          Main-Group                                         Sub-Group
Query                                 Method  Recall           MAP              P@5              Recall           MAP              P@5              Recall           MAP              P@5
Top 10 Abs + top 20 Cla + top 60 Des  Tf-idf  0.995            0.7770           0.2557           0.966            0.5957           0.4328           0.910            0.2931           0.3189
                                      KDR     0.998 (+0.3%)    0.8189 (+5.4%)   0.2656 (+3.9%)   0.970 (+0.4%)    0.6245 (+4.8%)   0.4533 (+4.7%)   0.923 (+1.4%)    0.3002 (+2.4%)   0.3328 (+4.4%)

Table 14. Results of queries formulated from field combinations of Abstract, Claims and Description

Keyword Dependency Relation for Query Formulation combined with Titles

To see the value of Titles when combined with other fields, five experiments were performed on queries extracted by KDR and combined with Titles. We compare the results of each field with those of the same field combined with the Title (e.g. the top 10 words from Abstract compared with those of Abstract when the Title was added). As can be seen from Table 15, adding Titles to queries extracted by KDR improves performance over queries without Titles, especially for queries extracted from Abstract (MAP improvement of 16.2% for Sub-class, 14.3% for Main-group and 19.8% for Sub-group). Table 15 also shows that, for queries combining 10 terms from Abstract, 20 terms from Claims and 60 terms from Description, MAP is very slightly worse for Sub-class (-0.2%) and Main-group (-0.3%) and slightly improved for Sub-group (+0.1%). The experiments show the importance of Titles when they are added to other fields.

                                Sub-Class                                          Main-Group                                         Sub-Group
Query (KDR)                     Recall           MAP              P@5              Recall           MAP              P@5              Recall           MAP              P@5
10 Abs                          0.983            0.6691           0.2100           0.916            0.5178           0.3803           0.824            0.2338           0.2434
Tit + 10 Abs                    0.993 (+1.02%)   0.7777 (+16.2%)  0.2410 (+14.8%)  0.951 (+3.8%)    0.5919 (+14.3%)  0.4336 (+14.0%)  0.872 (+5.83%)   0.2801 (+19.8%)  0.3033 (+24.6%)
20 Cla                          0.995            0.7205           0.2352           0.965            0.5639           0.4041           0.875            0.2554           0.2721
Tit + 20 Cla                    0.995 (0%)       0.7837 (+8.8%)   0.2525 (+7.4%)   0.971 (+0.6%)    0.6078 (+7.8%)   0.4525 (+12%)    0.905 (+3.4%)    0.2864 (+12.1%)  0.3090 (+13.6%)
60 Des                          1.000            0.8083           0.2557           0.966            0.6086           0.4320           0.910            0.2885           0.3230
Tit + 60 Des                    0.995 (-0.5%)    0.8233 (+1.9%)   0.2664 (+4.2%)   0.962 (-0.4%)    0.6267 (+3%)     0.4467 (+3.4%)   0.915 (+0.5%)    0.2967 (+2.8%)   0.3221 (-0.3%)
10 Abs + 20 Cla                 0.998            0.7738           0.2533           0.974            0.6050           0.4426           0.914            0.2881           0.3000
Tit + 10 Abs + 20 Cla           1.000 (0.2%)     0.7948 (+2.7%)   0.2549 (+0.6%)   0.975 (+0.1%)    0.6208 (+2.6%)   0.4574 (+3.3%)   0.917 (+0.3%)    0.2973 (+3.2%)   0.3164 (+5.5%)
10 Abs + 20 Cla + 60 Des        0.998            0.8189           0.2656           0.970            0.6245           0.4533           0.923            0.3002           0.3328
Tit + 10 Abs + 20 Cla + 60 Des  0.998 (0%)       0.8176 (-0.2%)   0.2664 (+0.3%)   0.969 (-0.1%)    0.6227 (-0.3%)   0.4508 (-0.6%)   0.925 (+0.2%)    0.3006 (+0.1%)   0.3328 (0%)

Table 15. Comparison of KDR queries when Titles are added

4.3.3.2 Results of Query Formulation by Semantic tags (IFPS)

We ran experiments on semantic tags that identify Invention Fields (IF) and Problems/Solutions (PS) in Description. We compared the results of IFPS with those of tf*idf queries formulated from 58 terms from Description, since IFPS is extracted from Description and 58 is the average number of terms in IFPS queries.
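The length-matched baseline can be sketched as a top-N selection by tf*idf score. The sketch assumes the common weighting tf * log(N / df), which may differ from the exact variant used in the experiments; the function name, the toy field and the document frequencies are all made up for illustration. In the experiments N would be 58.

```python
import math
from collections import Counter
from typing import Dict, List


def top_n_tfidf(tokens: List[str], doc_freq: Dict[str, int], num_docs: int, n: int) -> List[str]:
    """Return the n terms of one field with the highest tf * log(N / df) score."""
    tf = Counter(tokens)
    scores = {
        term: count * math.log(num_docs / doc_freq.get(term, 1))
        for term, count in tf.items()
    }
    # Sort by descending score; break ties alphabetically so the result is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Toy Description field and made-up document frequencies in a 100-patent collection.
tokens = ["nickel", "nickel", "battery", "battery", "battery", "oxide"]
doc_freq = {"nickel": 10, "battery": 100, "oxide": 50}

print(top_n_tfidf(tokens, doc_freq, num_docs=100, n=2))
```

Here "battery" is the most frequent term but occurs in every document, so its score is zero and it is not selected.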


Table 16 shows the experimental results of IFPS compared with the baseline. As shown in Table 16, IFPS achieves a MAP improvement of 7.6% for Sub-class, 4.3% for Main-group and 1.7% for Sub-group over the baseline.

Query           Sub-Class                                          Main-Group                                         Sub-Group
Length  Method  Recall           MAP              P@5              Recall           MAP              P@5              Recall           MAP              P@5
58      Tf*idf  0.998            0.7720           0.2533           0.962            0.5930           0.4328           0.914            0.2890           0.3131
        IFPS    0.998 (+0%)      0.8305 (+7.6%)   0.2689 (+6.2%)   0.972 (+1%)      0.6185 (+4.3%)   0.4574 (+5.7%)   0.918 (+0.4%)    0.2940 (+1.7%)   0.3213 (+2.6%)

Table 16. Results of IFPS queries compared with tf-idf queries.

4.3.3.3 Results of Query Formulation by combining Keyword Dependency Relation (KDR) and Semantic tags (IFPS)

In order to validate the usefulness of IFPS in patent prior art search, we conducted experiments combining Keyword Dependency Relation and IFPS.

Table 17 shows the experimental results of KDR queries with IFPS added, compared with tf*idf queries, for the Abstract field. As shown in Table 17, KDR queries with IFPS added achieve a significant MAP improvement of 46.8% for Sub-class, 42.6% for Main-group and 45.3% for Sub-group over the baseline. Our experiment shows that KDR gave better results than the baseline, and that the results are significantly improved when IFPS is added.

                     Sub-Class                                          Main-Group                                         Sub-Group
Method               Recall           MAP              P@5              Recall           MAP              P@5              Recall           MAP              P@5
10 Abs (Tf*idf)      0.985            0.5663           0.1951           0.928            0.4422           0.3221           0.768            0.2062           0.2254
10 Abs (KDR) + IFPS  0.998 (+1.3%)    0.8312 (+46.8%)  0.2664 (+36.5%)  0.970 (+4.5%)    0.6306 (+42.6%)  0.4672 (+45%)    0.921 (+19.9%)   0.2997 (+45.3%)  0.3328 (+47.6%)

Table 17. Results of KDR queries when adding IFPS compared with tf-idf queries for Abstract

Table 18 shows the experimental results of KDR queries with IFPS added, compared with tf*idf queries, for the Claims field. As shown in Table 18, KDR queries with IFPS added achieve a significant MAP improvement of 12.5% for Sub-class, 11.5% for Main-group and 10.7% for Sub-group over the baseline.

                     Sub-Class                                          Main-Group                                         Sub-Group
Method               Recall           MAP              P@5              Recall           MAP              P@5              Recall           MAP              P@5
20 Cla (Tf*idf)      0.995            0.7299           0.2352           0.960            0.5670           0.4066           0.885            0.2747           0.3008
20 Cla (KDR) + IFPS  0.998 (+0.3%)    0.8209 (+12.5%)  0.2705 (+15%)    0.973 (+1.4%)    0.6321 (+11.5%)  0.4631 (+13.9%)  0.923 (+4.3%)    0.3041 (+10.7%)  0.3344 (+11.2%)

Table 18. Results of KDR queries when adding IFPS compared with tf-idf queries for Claims

Table 19 shows the experimental results of KDR queries formulated from the top 10 terms from Abstract combined with the top 20 terms from Claims, with IFPS added, compared with the corresponding tf*idf queries. As shown in Table 19, KDR queries with IFPS added achieve a significant MAP improvement of 11.3% for Sub-class, 10.3% for Main-group and 9.4% for Sub-group over the baseline.

                              Sub-Class                                          Main-Group                                         Sub-Group
Method                        Recall           MAP              P@5              Recall           MAP              P@5              Recall           MAP              P@5
10 Abs + 20 Cla (Tf*idf)      0.998            0.7379           0.2418           0.964            0.5737           0.4066           0.878            0.2831           0.3016
10 Abs + 20 Cla (KDR) + IFPS  0.995 (-0.3%)    0.8216 (+11.3%)  0.2721 (+12.5%)  0.973 (+0.9%)    0.6330 (10.3%)   0.4730 (+16.3%)  0.923 (+5.1%)    0.3097 (+9.4%)   0.3410 (+13.1%)

Table 19. Results of KDR queries when adding IFPS compared with tf-idf queries for field combination of Abstract and Claims

The experimental results show that queries extracted by KDR and expanded with IFPS terms improve significantly over the baseline; for queries from Abstract in particular, we achieve the highest MAP improvements of 46.8% for Sub-class, 42.6% for Main-group and 45.3% for Sub-group. This shows that KDR can change the weights of terms in a way that improves retrieval performance, and that adding IFPS terms gives much more improvement, which makes it a good strategy for query expansion.

In order to validate the effectiveness of KDR in combination with IFPS, we conducted experiments comparing our approach with tf*idf queries to which the same number of terms from Description was added. As explained before, IFPS is extracted from Description and the average number of IFPS terms is 58; we therefore expanded the tf*idf queries with the top 58 terms from Description. This ensures that the expanded KDR and tf*idf queries have, on average, the same number of added terms.

Table 20 shows the experimental results of KDR queries formulated from the top 10 terms from Abstract expanded with IFPS, compared with tf*idf queries formulated from the top 10 terms from Abstract and the top 58 terms from Description. As shown in Table 20, KDR queries with IFPS added achieve a MAP improvement of 8.6% for Sub-class, 7.1% for Main-group and 3.7% for Sub-group over the baseline.

                       Sub-Class                                          Main-Group                                         Sub-Group
Method                 Recall           MAP              P@5              Recall           MAP              P@5              Recall           MAP              P@5
10 Abs + Des (Tf*idf)  0.995            0.7657           0.2566           0.962            0.5890           0.4279           0.910            0.2889           0.3139
10 Abs (KDR) + IFPS    0.998 (+0.3%)    0.8312 (+8.6%)   0.2664 (+3.8%)   0.970 (+0.8%)    0.6306 (+7.1%)   0.4672 (+9.2%)   0.921 (+1.2%)    0.2997 (+3.7%)   0.3328 (+6.0%)

Table 20. Results of KDR queries formulated by top 10 terms from Abstract expanded with IFPS compared with tf*idf queries formulated by top 10 terms from Abstract and top 58 terms from Description

Table 21 shows the experimental results of KDR queries formulated from the top 20 terms from Claims expanded with IFPS, compared with tf*idf queries formulated from the top 20 terms from Claims and the top 58 terms from Description. As shown in Table 21, KDR queries with IFPS added achieve a MAP improvement of 5.9% for Sub-class, 5.6% for Main-group and 3.4% for Sub-group over the baseline.

Table 22 shows the experimental results of KDR queries formulated from the combination of the top 10 terms from Abstract and the top 20 terms from Claims expanded with IFPS, compared with the corresponding tf*idf queries expanded with the top 58 terms from Description. As shown in Table 22, KDR queries expanded with IFPS achieve a MAP improvement of 6.7% for Sub-class, 6.2% for Main-group and 5.6% for Sub-group over the baseline.

                       Sub-Class                                          Main-Group                                         Sub-Group
Method                 Recall           MAP              P@5              Recall           MAP              P@5              Recall           MAP              P@5
20 Cla + Des (Tf*idf)  0.995            0.7750           0.2557           0.966            0.5987           0.4369           0.910            0.2939           0.3213
20 Cla (KDR) + IFPS    0.998 (+0.3%)    0.8209 (+5.9%)   0.2705 (+5.6%)   0.973 (+0.7%)    0.6321 (+5.6%)   0.4631 (+6.0%)   0.923 (+1.4%)    0.3041 (+3.4%)   0.3344 (+4.1%)

Table 21. Results of KDR queries formulated by top 20 terms from Claims expanded with IFPS compared with tf*idf queries formulated by top 20 terms from Claims and top 58 terms from Description

                                   Sub-Class                                          Main-Group                                         Sub-Group
Method                             Recall           MAP              P@5              Recall           MAP              P@5              Recall           MAP              P@5
10 Abs + 20 Cla + 58 Des (Tf*idf)  0.995            0.7700           0.2541           0.966            0.5960           0.4344           0.908            0.2933           0.3180
10 Abs + 20 Cla (KDR) + IFPS       0.995 (0%)       0.8216 (+6.7%)   0.2721 (+7.1%)   0.973 (+0.7%)    0.6330 (6.2%)    0.4730 (+8.9%)   0.923 (+1.7%)    0.3097 (+5.6%)   0.3410 (+7.2%)

Table 22. Results of KDR queries formulated by combination of top 10 terms from Abstract plus top 20 terms from Claims expanded with IFPS compared with that of tf*idf queries expanded with top 58 terms from Description.

4.3.3.4 Results of Query Formulation by combining Tf*idf and Semantic tags (IFPS)

In order to validate the effectiveness of IFPS in query formulation for patent prior art search, we conducted experiments comparing terms extracted from Description by tf*idf with terms extracted by IFPS. To the top N terms from Abstract (10 terms) or Claims (20 terms), we added either the top 58 tf*idf terms from Description, which matches the average number of IFPS terms, or the IFPS terms, and compared the two resulting queries.

Table 23 shows the experimental results of tf*idf queries formulated from the top 10 terms from Abstract plus the top 58 terms from Description, compared with the same Abstract terms plus IFPS. As shown in Table 23, when IFPS is added to tf*idf queries we achieve a MAP improvement of 8.5% for Sub-class, 5.7% for Main-group and 1.7% for Sub-group compared with terms extracted by tf*idf.

Table 24 shows the experimental results of tf*idf queries formulated from the top 20 terms from Claims plus IFPS, compared with the same Claims terms plus the top 58 terms from Description. As shown in Table 24, when IFPS is added to tf*idf queries we achieve a MAP improvement of 6% for Sub-class, 5.3% for Main-group and 4.4% for Sub-group.

Table 25 shows the experimental results of tf*idf queries formulated from the combination of the top 10 terms from Abstract and the top 20 terms from Claims expanded with IFPS, compared with the same Abstract and Claims terms expanded with the top 58 terms from Description. As shown in Table 25, when IFPS is added to tf*idf queries we achieve a MAP improvement of 6.6% for Sub-class, 5.7% for Main-group and 4.8% for Sub-group.

                          Sub-Class                                          Main-Group                                         Sub-Group
Method                    Recall           MAP              P@5              Recall           MAP              P@5              Recall           MAP              P@5
10 Abs + 58 Des (Tf*idf)  0.995            0.7657           0.2566           0.962            0.5890           0.4279           0.910            0.2889           0.3139
10 Abs (tf*idf) + IFPS    0.998 (+0.3%)    0.8305 (+8.5%)   0.2689 (+4.8%)   0.974 (+1.3%)    0.6228 (+5.7%)   0.4623 (+8.0%)   0.919 (0.99%)    0.2939 (+1.7%)   0.3262 (+3.9%)

Table 23. Results of tf*idf queries formulated by the top 10 terms from Abstract when IFPS is added.

                          Sub-Class                                          Main-Group                                         Sub-Group
Method                    Recall           MAP              P@5              Recall           MAP              P@5              Recall           MAP              P@5
20 Cla + 58 Des (Tf*idf)  0.995            0.7750           0.2557           0.966            0.5987           0.4369           0.910            0.2939           0.3213
20 Cla (tf*idf) + IFPS    0.998 (+0.3%)    0.8216 (+6.0%)   0.2705 (+5.79%)  0.978 (+1.2%)    0.6304 (+5.29%)  0.4656 (+6.57%)  0.925 (+1.65%)   0.3067 (+4.4%)   0.3410 (+6.13%)

Table 24. Results of tf*idf queries formulated by top 20 terms from Claims plus IFPS compared with top 20 terms from Claims plus top 58 terms from Description

                                   Sub-Class                                          Main-Group                                         Sub-Group
Method                             Recall           MAP              P@5              Recall           MAP              P@5              Recall           MAP              P@5
10 Abs + 20 Cla + 58 Des (Tf*idf)  0.995            0.7700           0.2541           0.966            0.5960           0.4344           0.908            0.2933           0.3180
10 Abs + 20 Cla (tf*idf) + IFPS    0.998 (+0.3%)    0.8212 (+6.6%)   0.2689 (+5.8%)   0.978 (+1.2%)    0.6298 (+5.7%)   0.4648 (+7%)     0.921 (+1.4%)    0.3073 (+4.8%)   0.3434 (+8%)

Table 25. Results of tf*idf queries formulated by combination of top 10 terms from Abstract, top 20 terms from Claims when IFPS is added.

4.5 Discussion

In this section, the experiments and the implications of the approach and the evaluation results are discussed.

We carried out experiments on query formulation by two methods: keyword dependency relations (KDR) and semantic tags (IFPS). Queries were extracted by taking the top N terms from each field or from a combination of two or three fields. The results were then evaluated at three IPC levels (Sub-class, Main-group and Sub-group) by comparing them with those of tf*idf. The experimental results show that: 1) Description is the best field for query formulation compared with Abstract or Claims; 2) query formulation by combining the top N terms from Abstract, Claims and Description (e.g. the top 10 terms from Abstract plus the top 20 terms from Claims plus the top 60 terms from Description) gives better performance than query formulation from a single field; 3) KDR gave better performance than tf*idf, since KDR can identify important terms by changing the weight of a term based on the importance of its neighbor terms; 4) IFPS gave better performance than tf*idf; and 5) the best performance was achieved when KDR was combined with IFPS. Moreover, we found that 6) for Sub-class, the highest results were achieved by IFPS queries alone or by KDR queries extracted from Abstract combined with IFPS; for Main-group, the highest results were achieved by KDR queries extracted from Abstract, Claims or both, combined with IFPS terms; and for Sub-group, the highest results were achieved by KDR queries extracted from Claims combined with IFPS terms.

Our approach points out distinct features for improving the effectiveness of prior art patent search which have not been discovered before. Most previous research used words from a single field (e.g. the Claims field) as a query. Instead, we show that formulating queries based on keyterm dependency relations, by selecting the top N terms from each field and combining those terms as the search query, is more effective for prior art search. We also show that combining these queries with IFPS terms improves the results much more significantly, and that each field plays a different role in identifying the IPC codes of a query patent. The Abstract field in combination with IFPS is more significant for identifying the IPC sub-class, while terms from the Claims field in combination with IFPS are more significant for identifying the IPC sub-group. Abstract and Claims in combination with IFPS have almost the same importance for identifying the IPC main-group.

Field: Abstract
Terms extracted by KDR:    cr ti ca material electrode active positive metal oxide conductive
Terms extracted by tf*idf: charging decrease efficiency oxide nickel oxygen battery temperature yb supplement

Table 26. Example of top 10 terms extracted by KDR and tf*idf for Abstract field.

Field: Claims
Terms extracted by KDR:    ti cr ca material ni battery metal oxide active nickel
Terms extracted by tf*idf: overvoltage increases electrically conductive material oxygen alkaline coating oxide nickel

Table 27. Example of top 10 terms extracted by KDR and tf*idf for Claims field.

Field: Description
Terms extracted by KDR:    electrode positive nickel oxide temperature ag ca material effect cr metal negative ba hydride alloy increasing improve absorbing hydrogen problems surface composed electrolyte element capable electrochemically battery releasing time object
Terms extracted by tf*idf: charging nickel overvoltage storage alkaline batteries absorbing positive effect oxygen hydride efficiency battery electrode increasing capacity oxide powders active add hydrogen decreases temperatures hydroxide proposals material increased time releasing negative

Table 28. Example of top 30 terms extracted by KDR and tf*idf for Description field.

Keyword dependency relation works better than tf*idf since it is based on the relations between words, in which the importance of a word depends on the importance of its neighbor words. If a term has more relations to important neighbors, it is assigned more weight. Tables 26 to 28 give examples of terms extracted by KDR and tf*idf. As these tables show, KDR selects terms about details of the method or process (e.g. cr, ti, ca, active, material), while tf*idf selects terms about limitations or effects (e.g. charging, decrease, efficiency). In the Description example (Table 28), the top 10 term query by KDR contains three abbreviations (ag, ca, cr), while the tf*idf query contains none and has more terms about problems (e.g. charging, overvoltage, storage). For Abstract and Claims, KDR queries perform better than tf*idf queries, which shows that terms about details of the method or process of a patent are more important for prior art search than terms about limitations or effects. For the Description field, however, terms about limitations or effects are more effective. For queries of 10 to 30 terms (Table 28), KDR identified terms mostly about the method or process, and it performed worse than tf*idf. For queries of 40 to 60 terms (Table 29), KDR identified more terms about limitations or effects, which results in higher performance than tf*idf. Terms from the Description field are more important than those from other fields since Description contains terms related to the area to which a patent belongs, terms about limitations of previous patents and effects of the present invention, and terms about details of the method or process.
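The idea that a term gains weight from its important neighbors can be illustrated with an iterative propagation over a term graph. This is a PageRank-style stand-in written only for illustration; it is not the KDR formula used in this thesis, and the function name, the damping value, the iteration count and the example relations are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def propagate_weights(
    relations: Iterable[Tuple[str, str]],
    base: Dict[str, float],
    damping: float = 0.85,
    iterations: int = 30,
) -> Dict[str, float]:
    """Iteratively let each term inherit weight from the terms it is related to."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for a, b in relations:
        neighbors[a].add(b)
        neighbors[b].add(a)
    terms = set(base) | set(neighbors)
    scores = {t: base.get(t, 0.0) for t in terms}
    for _ in range(iterations):
        updated = {}
        for term in terms:
            # Each neighbor shares its current score equally among its own neighbors.
            inherited = sum(scores[n] / len(neighbors[n]) for n in neighbors.get(term, ()))
            updated[term] = (1 - damping) * base.get(term, 0.0) + damping * inherited
        scores = updated
    return scores


# Hypothetical dependency relations between keyterms of one patent field.
relations = [
    ("electrode", "positive"),
    ("electrode", "nickel"),
    ("electrode", "oxide"),
    ("nickel", "oxide"),
    ("charging", "efficiency"),
]
base = {t: 1.0 for pair in relations for t in pair}  # equal starting weights

scores = propagate_weights(relations, base)
for term in sorted(scores, key=scores.get, reverse=True):
    print(f"{term}: {scores[term]:.3f}")
```

With equal starting weights, "electrode" ends up with the highest score because it is related to the most terms, and "positive" ranks below "nickel" and "oxide" because its only neighbor divides its weight three ways.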

Field: Description
Terms extracted by KDR:    active add storage high oxidation generated cadmium efficiency hydroxide demand reaction alkaline batteries rising reduction temperatures capacity caused solution charging solid elements reducing contained place great cost sealing energy decomposition
Terms extracted by tf*idf: similarly additives sized elements average cadmium competitively merits radiating proportioned conspicuous dispersibility increasing efficiently particle metal composed agglomeration agglomerate rising explanation apt industrialized caused electrochemically sharp ca alloy cr beryllium

Table 29. Example of top 40 ~ 60 terms extracted by KDR and tf*idf for Description field.

Based on the analysis of the experiments, we found that KDR queries include many abbreviations, which can reduce the performance of KDR. This problem could be addressed by using a dictionary resource such as Wikipedia or WordNet; however, words in patents are mostly very technical and may not exist in those resources. Therefore, constructing a patent dictionary is one way to resolve the problem.

Extracting terms from Description gave the best performance of all fields. The reason is that the Description field specifies what the process or method of the invention is and how it differs from previous patents and technology. Description also starts with general background information on the area to which the invention belongs and then presents the invention in increasing levels of detail. However, it is difficult to identify those terms based on frequency. By identifying IFPS terms from Description, we can achieve better performance when IFPS is used as a query by itself, and the best performance when it is used in combination with query terms selected by KDR. Our analysis shows that Invention Field (IF) terms include information related to the area to which a patent belongs, which can be very helpful for identifying the IPC sub-class of a patent document. Since the frequency of terms that describe an invention domain is relatively low, IF phrases cannot be extracted by a frequency-based method. Problems/Solutions (PS) terms include information related to limitations of previous patents and effects of the present invention, which may help to identify the IPC main-group.

Our experiments also show that, when combined with IFPS, terms from Abstract are more significant for identifying the IPC Sub-class of a query patent; terms from Abstract and Claims are both significant for identifying the IPC Main-group; and terms from Claims are more significant for identifying the IPC Sub-group.

Through the experiments performed in this work, we show that extracting terms based on dependency relations is a good way to adjust term weights, assigning higher weights to more important terms. We also show how IFPS terms can contribute to the effectiveness of query formulation for prior art search, especially when terms extracted by keyword dependency relations and IFPS terms are combined into a query.

4.6 Conclusions & Future works

A new method for query enhancement in patent prior art search, based on keyterm dependency relations and semantic tags, was proposed in this thesis; it outperforms the tf-idf baseline. The experiments demonstrated significant improvements for query formulation by extracting the top N terms from each field and combining those terms as a query, rather than using terms from a single field. We show that queries formulated from the combination of three fields (the top 10 terms from Abstract, the top 20 terms from Claims and the top 60 terms from Description) give the best result. Our work also shows that query formulation with IFPS terms improves on the same number of terms extracted by tf*idf from the Description field. IFPS terms outperform tf*idf terms because IFPS includes Invention Field (IF) information related to the area to which a patent belongs, which can be very helpful for identifying the IPC sub-class of a patent document, and Problems/Solutions (PS) information related to limitations of previous patents and effects of the present invention, which may help to identify the IPC main-group or sub-group of the query patent. We also show the effectiveness of IFPS terms when they are combined with KDR terms or tf*idf terms: when IFPS is added we gain much more improvement, which makes it a good strategy for query expansion.

Our experiments show that terms about details of the method or process of the invention (e.g. ag, ca, cr) are more significant for query formulation from Abstract or Claims, while terms about limitations or effects (e.g. charging, decrease, efficiency) are more significant for query formulation from Description. In the Description example of Table 28, the top 10 term query by KDR contains three abbreviations (ag, ca, cr), while the tf*idf query contains none and has more terms about problems (e.g. charging, overvoltage, storage).

Our experiments suggest a way to improve the identification of IPC codes by selecting terms from particular fields instead of using various fields or the whole document. For example, someone who wants to know only the sub-class of a patent can focus on query terms from Abstract and IFPS, or on terms from Claims and IFPS for main-groups or sub-groups.

The methods proposed in this work were applied to patent documents in the Batteries domain; however, they can also be applied to other domains. As future work, we intend to apply our approach to a larger corpus covering various domains. We will also consider how to use dependency relations between terms to identify phrases in patent documents instead of using single words. In particular, dependency relations between IFPS terms are expected to bring further improvement and should therefore be considered. A patent term dictionary and a synonym dictionary should also be developed for better term-matching accuracy. Furthermore, ways to improve the original keyword dependency relation method should be analyzed.




Summary

Query Enhancement for Patent Prior Art Search with Keyterm Dependency Relations

and Semantic Tags

This thesis proposed a new method for query enhancement in patent prior art search that uses keyterm dependency relations and semantic tags and outperforms the tf*idf baseline. The experiments show that, when a query is formulated from a single field, Description is the most effective field for improving the ranking of retrieved prior art patents, with 60 terms as the appropriate query size. When a query is formulated from combined fields without IFPS, the combination of the top 10 terms from Abstract, the top 20 terms from Claims, and the top 60 terms from Description gives the best result. Overall, the best query is obtained by combining the top 10 terms from Abstract, the top 20 terms from Claims, and the IFPS terms. Query formulation using IFPS terms alone still significantly improves results over the baseline.

The proposed methods were applied to patent documents in the batteries domain; however, they can also be applied to other domains.

Keywords: patent retrieval, prior art retrieval, keyterm dependency relations, semantic tags, term co-occurrences
