Text mining and Indexing: Assessing the Results of Deeper Indexing for Patent Search. John Tait, Chief Scientific Officer, IRF



Page 1: Text mining and Indexing: Assessing the Results of Deeper Indexing for Patent Search

Text mining and Indexing: Assessing the Results of Deeper Indexing for Patent Search

John Tait
Chief Scientific Officer
IRF


Acknowledgements

• Mihai Lupu of the IRF
• Jian-Han Zhu of University College London
• Jimmy Huang of York University of Canada
• Giovanna Roda and colleagues in the CLEF-IP team for Matrixware and the IRF
• Royal Society of Chemistry for generously making their Scientific Journal Collections available to us

IRF Member Services 2


An apology

• The content of this talk was planned on the basis that we could discuss the details of TREC CHEM. Over the weekend it became clear that NIST policy is that the results should not be made public until the TREC conference in the US in November

• Therefore much detail had to be removed


Outline

• Introduction to the IRF
• TREC CHEM
  – 2009
  – Plans for the future

• Summary and Conclusions


The IRF


The Information Retrieval Facility

An international not-for-profit institution, founded in 2006 and based in Vienna, to promote and facilitate research in large-scale information retrieval


The IRF Mission

• To bridge the gap between the needs of industry and academic know-how.
• To maintain a facility that enables large-scale information retrieval and in-depth processing of data for research.
• To bring the latest information retrieval technology to the community of patent professionals and other professional searchers.


IRF – Founding Members


The Information Retrieval Facility

A platform initiated by Matrixware which:
• improves the transfer of knowledge between professionals in Intellectual Property and Information Retrieval, and
• promotes collaboration between experts on the development of new research methodologies for international patent data.


Distinctive Patent Search Characteristics

• High Recall: a single missed document can invalidate a patent
• Session based: single searches may involve days of cycles of results review and query reformulation
• Defendable: process and results may need to be defended in court


CLEF-IP

The goal of the CLEF-IP track is to investigate multilingual IR techniques in the Intellectual Property domain.

• Target data: more than 1 million granted EPO patent documents in three languages: English, German, French

• Tasks: prior art search, invalidity search

• Test collection constructed using the available EPO prior art reports

CLEF-IP Track


IRF Member Services

• Scientific Members
  – Access to data and resources
  – Project links to industry
• Industrial Members
  – Consultancy and research in IR and IP search
  – Training and support: systems evaluation, semantic computing
  – Links to academia


TREC Chemistry Information Retrieval Track 2009

John Tait
Chief Scientific Officer
IRF


TREC

• Organised by the US National Institute of Standards and Technology (NIST)
• Has run annually since 1991
  – Originally focused on ad hoc text retrieval with long queries
  – Regularly extended
    • Video
    • Web
    • Genomics
    • Legal


Origins of TREC CHEM

• IRF approached NIST about using our patent data and computing facilities as a means to promote scientific co-operation

• Jian-Han Zhu, then of the UK Open University, approached NIST about Chemistry

• NIST were interested in domain-specific retrieval to follow up the Genomics track etc. and helped us get going


Data

• 1.2 million patent files (IRF)

• 59k scientific articles (RSC)


Tasks

• Technology Survey (TS)
  – Search for all potentially relevant documents, in both collections
  – 18 manually defined and evaluated topics
• Prior Art (PA)
  – Search for patents that may invalidate a given patent
  – 1000 automatically created and evaluated topics (1000 patent files)


Participants

• 15 institutions registered to get the data
  – 6 submitted 31 runs for the TS task:
    • University of Applied Science Geneva, Information Retrieval Laboratory of Dalian University of Technology, Fraunhofer SCAI, Milwaukee School of Engineering, Purdue University, York University
  – 8 submitted 59 runs for the PA topics:
    • University of Applied Science Geneva, Carnegie Mellon University, Information Retrieval Laboratory of Dalian University of Technology, University of Iowa, Fraunhofer SCAI, Milwaukee School of Engineering, Purdue University, York University


Methods

• Basic vector space model
  – Different sections, weights on each section
  – BM25
• Additional filtering/weighting based on IPC codes
• Linguistic processing
  – Emphasis on Noun Phrases
• Concept based search
• Query expansion
  – Using Oscar3, MeSH
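
The section weighting and BM25 items above can be illustrated with a minimal sketch. It is not any participant's system: the section names, the weights in SECTION_WEIGHTS, the whitespace tokenizer and the choice to fold weighted section counts into one term frequency (a simplification in the spirit of BM25F) are all assumptions made for illustration. The idf form used is the non-negative variant, log(1 + (N - df + 0.5) / (df + 0.5)).

```python
import math
from collections import Counter

# Hypothetical section weights; the slides do not give the values any group used.
SECTION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "claims": 1.5, "description": 1.0}


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def weighted_term_counts(doc):
    """Fold all sections of one document into a single weighted term-frequency map."""
    counts = Counter()
    for section, text in doc.items():
        weight = SECTION_WEIGHTS.get(section, 1.0)
        for term in tokenize(text):
            counts[term] += weight
    return counts


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return (doc_id, score) pairs sorted by descending BM25 score."""
    tfs = {doc_id: weighted_term_counts(doc) for doc_id, doc in docs.items()}
    lengths = {doc_id: sum(tf.values()) for doc_id, tf in tfs.items()}
    avg_len = sum(lengths.values()) / len(lengths)
    n_docs = len(docs)
    query_terms = set(tokenize(query))

    scores = {}
    for doc_id, tf in tfs.items():
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Document frequency: number of documents containing the term.
            df = sum(1 for other in tfs.values() if term in other)
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy documents, invented for this example.
docs = {
    "P1": {"title": "catalyst for olefin polymerization",
           "description": "a zirconium catalyst"},
    "P2": {"title": "battery electrode",
           "description": "the catalyst is optional"},
    "P3": {"title": "pigment dispersion",
           "description": "stable aqueous dispersion"},
}
ranking = bm25_rank("catalyst", docs)
```

In this toy collection the document that mentions the query term in its title ranks above the one that mentions it only in the description, which is the effect section weighting is meant to have.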


Evaluations

• Technology Survey tasks
  – 8 chemistry grad students
  – 5 experts
  – Each topic evaluated by 2 students and 1 expert
• Prior Art tasks
  – Automatically evaluated based on citations within patents and family members
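
The automatic Prior Art evaluation can be sketched roughly as follows, assuming the relevant set for a topic patent is its cited patents plus the family members of those patents. The function names, the data layout and the patent identifiers are hypothetical, and the measure shown is plain average precision rather than the inferred variant reported by the track.

```python
def build_relevant_set(cited, families):
    """Expand the cited patents with the members of their patent families."""
    relevant = set()
    for patent_id in cited:
        relevant.add(patent_id)
        relevant.update(families.get(patent_id, ()))
    return relevant


def average_precision(ranked, relevant):
    """Plain (not inferred) average precision of a ranked list without duplicates."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


# Invented identifiers: the topic patent cites EP1 and US7,
# and EP1 has two family members.
cited = ["EP1", "US7"]
families = {"EP1": ["US1", "JP1"]}
relevant = build_relevant_set(cited, families)

run = ["US1", "X9", "US7", "X3", "EP1"]
ap = average_precision(run, relevant)
recall = recall_at_k(run, relevant, k=5)
```

Because patent search is recall oriented, a recall figure at a fixed cut-off is shown next to average precision; no human judgement is needed once citations and families are available.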


Initial Results

• Manual evaluations have some conflicting results
  – Not more than other manually evaluated topics
• Using entity recognition and synonyms proves successful
  – Some groups manually extended the queries
• “Simple methods” seem to also perform well (e.g. Lucene-based, BM25)
  – E.g. for Inferred Average Precision they reach 97% of highest score
• Disclaimer: results analysis is still ongoing


TREC CHEM 2010 onwards

• Subject to discussion at TREC in November
  – Increased numbers of patents
  – Include images
  – Task extensions/refinements
    • Searching for numerical ranges (independent of unit)
    • Searching for specific roles of specific chemical components
    • The use of Markush structures
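
One possible reading of searching for numerical ranges independent of unit is sketched below, for temperature ranges only. Everything here is an assumption made for illustration: the regular expression, the three supported units and the rule that a document matches when its range overlaps the query range. The track itself had not defined the task at the time of this talk.

```python
import re

# Converters from each supported unit to Kelvin.
TO_KELVIN = {
    "K": lambda value: value,
    "C": lambda value: value + 273.15,
    "F": lambda value: (value - 32.0) * 5.0 / 9.0 + 273.15,
}

# Matches forms such as "80 to 120 °C" or "176-248 F" (illustrative, not exhaustive).
RANGE_PATTERN = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*(?:-|to)\s*(-?\d+(?:\.\d+)?)\s*°?\s*([CFK])\b"
)


def extract_ranges_kelvin(text):
    """Return every temperature range found in text as a (low, high) pair in Kelvin."""
    ranges = []
    for low, high, unit in RANGE_PATTERN.findall(text):
        convert = TO_KELVIN[unit]
        first, second = convert(float(low)), convert(float(high))
        ranges.append((min(first, second), max(first, second)))
    return ranges


def overlaps(query_range, doc_range):
    """True when two closed intervals share at least one point."""
    return query_range[0] <= doc_range[1] and doc_range[0] <= query_range[1]


# A query for 90 to 100 °C should match a document that states the range in Fahrenheit.
query = extract_ranges_kelvin("90 to 100 °C")[0]
doc_ranges = extract_ranges_kelvin("the mixture is heated at 176-248 F for two hours")
matches = any(overlaps(query, doc_range) for doc_range in doc_ranges)
```

Normalising to a single base unit at indexing time keeps the query-time comparison a simple interval test.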


Summary and Conclusions

• The IRF is promoting collaboration between information retrieval and intellectual property professionals through promoting evaluations and joint technology development projects

• TREC CHEM has provided an objective and independent means of assessing the effectiveness of technologies on two sorts of retrieval tasks


Thank you for your attention

Any questions?

www.ir-facility.org
www.matrixware.com
www.matrixware.net

IRF Newsletter Chemistry Issue: http://www.ir-facility.org/the_irf/newsletter