1 Thi Nhu Truong, ChengXiang Zhai Paul Ogilvie, Bill Jerome John Lafferty, Jamie Callan Carnegie Mellon University David Fisher, Fangfang Feng Victor Lavrenko James Allan, Bruce Croft University of Massachusetts The Lemur Toolkit

Thi Nhu Truong, ChengXiang Zhai

Paul Ogilvie, Bill Jerome

John Lafferty, Jamie Callan

Carnegie Mellon University

David Fisher, Fangfang Feng

Victor Lavrenko

James Allan, Bruce Croft

University of Massachusetts

The Lemur Toolkit

Outline

• What is the Lemur toolkit

• Release 1.9

• Release 2.0

• Plans for the future

• Audience comments and suggestions

What is Lemur?

• Objective: A flexible toolkit to support research on language modeling applied to text retrieval and other language technologies

• Written in C++

• Three releases, available at http://www.cs.cmu.edu/~lemur

• Developed as part of the Lemur Project sponsored by ARDA

• Open source (BSD-style license)

– Use, change, distribute as you see fit

Current Components

• Three Indexers

• General retrieval architecture

• Specific retrieval models (TFIDF, Okapi, Unigram LM)

• General support for language models

• Smoothing algorithms

• Feedback algorithms

• Distributed information retrieval

• Text summarization

• Various utilities

• Runs on Unix (Solaris, Linux) and Windows (NT, 2000, XP)
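
To make the "Unigram LM" and "Smoothing algorithms" items concrete, here is a minimal sketch of query-likelihood scoring with a smoothed unigram document model. It is not the Lemur API: the type `TermCounts`, both function names, and the choice of Dirichlet prior smoothing are assumptions made for illustration, since the slides do not say which smoothing methods are included.

```cpp
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type for illustration only; not part of the Lemur API.
using TermCounts = std::unordered_map<std::string, double>;

// Dirichlet-smoothed unigram probability:
//   p(w | d) = (c(w, d) + mu * p(w | C)) / (|d| + mu)
double dirichletProb(double termCountInDoc, double docLength,
                     double collectionProb, double mu) {
    return (termCountInDoc + mu * collectionProb) / (docLength + mu);
}

// Log query likelihood: the sum over query terms of log p(w | d).
double queryLogLikelihood(const std::vector<std::string>& query,
                          const TermCounts& docCounts, double docLength,
                          const TermCounts& collectionProbs, double mu) {
    double score = 0.0;
    for (const std::string& term : query) {
        auto inDoc = docCounts.find(term);
        auto inColl = collectionProbs.find(term);
        double count = (inDoc != docCounts.end()) ? inDoc->second : 0.0;
        // Assumption of this sketch: terms unseen in the collection get a
        // tiny floor probability so the logarithm stays finite.
        double pColl = (inColl != collectionProbs.end()) ? inColl->second : 1e-9;
        score += std::log(dirichletProb(count, docLength, pColl, mu));
    }
    return score;
}
```

Documents are ranked by descending `queryLogLikelihood`. The smoothing parameter `mu` controls how strongly the collection model pulls on short documents.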

Lemur Usage

• The Lemur Toolkit has played an important role in recent research at CMU and UMass

• Also being used at other locations

– 95 people on Lemur email list

– Downloaded to over 600 locations worldwide

» http://www.cs.cmu.edu/~lemur/

– Example uses: Question answering, filtering, educational uses, noun phrases, cross-lingual IR

Lemur Version 1.9

• Released July 2002

• New features:

– Two-stage smoothing

– Text summarization: whole document, query-based

– Distributed IR: query-based sampling, DB selection, result merging

– Simple document manager

– Index upgrades

– Bug fixes

Lemur Version 2.0

• Planned for late September 2002

• New features:

– Upgraded distributed IR result merging

– Integrate UMass additions

» Chinese and Arabic retrieval, preliminary support for multilingual operations

» Inquery-style query operators

» Simple incremental indexing

» Passage indexer

» New KL relevance models

» Cosine similarity retrieval method

» Query clarity

– Bug fixes (and probably new bugs)
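
Query clarity is usually defined as the KL divergence, in bits, between a query language model and the collection language model. A minimal sketch under that definition follows. The `LanguageModel` type and the `clarityScore` name are hypothetical, and estimating the query model itself (typically from top-ranked documents) is out of scope here.

```cpp
#include <cmath>
#include <string>
#include <unordered_map>

// Hypothetical type for illustration only; maps a term to its probability.
using LanguageModel = std::unordered_map<std::string, double>;

// clarity = sum over w of p(w | Q) * log2(p(w | Q) / p(w | C))
double clarityScore(const LanguageModel& queryModel,
                    const LanguageModel& collectionModel) {
    double score = 0.0;
    for (const auto& entry : queryModel) {
        double pQuery = entry.second;
        auto inColl = collectionModel.find(entry.first);
        // Assumption of this sketch: terms with zero query probability, or
        // with no positive collection probability, are skipped.
        if (pQuery <= 0.0 || inColl == collectionModel.end() ||
            inColl->second <= 0.0) {
            continue;
        }
        score += pQuery * std::log2(pQuery / inColl->second);
    }
    return score;
}
```

A query model identical to the collection model scores 0. Higher scores indicate a query whose term distribution is more sharply focused than the collection as a whole.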

Plans for the Future

• Efficiency (speed)

• Incremental indexing

• Structured query operators for language modeling retrieval

• Better support for interactive applications (and a GUI)

• Quality assurance and regression test sets

• Better modularity, “Lemur lite”

• Locality-based retrieval

• Clustering

• Aspect-oriented retrieval

• Filtering

• Multilingual capability

• New feedback models

• Fields and metadata

• Your suggestions and comments?