www.nlsearch.com Classification at Northern Light Presentation to Access 98 October 4, 1998


www.nlsearch.com

Classification at Northern Light

Presentation to Access 98

October 4, 1998


“This year, the World Wide Web has arrived as a serious supplier of ‘serious’ online information.”

Sue Feldman, “Web Search Services in 1998: Trends and Challenges,” Searcher Magazine, June 1998


Search engines are being held to higher standards

All users want freshness and manageable results sets

Professional information seekers want

– high relevance and high quality content first

– good descriptive information for all results

– precision searching

– text and tables


Web search environment

constant growth in all dimensions (pages, countries, languages, file formats)

constantly increasing traffic

continuous onslaught of spam


Practical considerations for search engines

significant engineering time spent counteracting spam

constantly adding disk space: 3 terabytes at Northern Light

crawler efficiency: must balance new page discovery with known-page re-crawl


You step in the stream, but the water has moved on.

This page is not here.


Search engines: limitations

lack the higher quality sources not found on the Web

no concept of classification as found in library systems

like an index of every word on every page in every book in your library

– with no subject catalog


Northern Light’s fundamental goals

Combine Web data with quality information not on the Web in a single integrated search

Make results set manageable for user (already a problem; worse after non-Web data is added)


Research Engine: Content as of Oct 98

Web

– 96,000,000 pages

Special Collection

– 3,600,000+ full-text documents

– 4,600 journals, magazines, books, trusted reference works, etc.

Mixes free (Web) and fee (Special Collection) content


Relevancy ranking still critical

Engines continue to improve their ranking algorithms

All seem to agree that relevancy ranking alone is not enough to manage results lists of the size commonly seen now


Techniques for taming results sets

abridge the database (Excite, Lycos, Infoseek)

re-sort by popularity (HotBot/Direct Hit)

suggest further refinement steps to user (AltaVista Refine)

sort based on number of inbound links (Infoseek…?)

sort by classification metadata (Northern Light)


Research Engine: Classification

classify the Web according to the same standards found in journal literature

sort results for user, based on this classification

work with the user to refine the question (reference interview approach)
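The idea of sorting results by classification can be sketched as grouping a relevance-ranked list into subject folders. This is only an illustration under stated assumptions: the function name `build_folders`, the `subjects` field, the sample results, and the rule of showing the largest folders first with an "all others..." catch-all are all hypothetical, since the slides do not describe how Northern Light actually selects folders.

```python
from collections import defaultdict


def build_folders(results, max_folders):
    """Group relevance-ranked results into subject folders.

    Assumption: each result carries a list of subject labels assigned
    by some classifier; the largest folders are shown first.
    """
    folders = defaultdict(list)
    for result in results:  # results are already in relevance order
        for subject in result["subjects"]:
            folders[subject].append(result["title"])

    # Largest folders first; sorted() is stable, so ties keep first-seen order.
    ranked = sorted(folders, key=lambda s: len(folders[s]), reverse=True)
    top = {subject: folders[subject] for subject in ranked[:max_folders]}

    # Anything not covered by a displayed folder goes into a catch-all.
    shown = {title for titles in top.values() for title in titles}
    others = [r["title"] for r in results if r["title"] not in shown]
    if others:
        top["all others..."] = others
    return top


# Hypothetical results for the query "baseball"
results = [
    {"title": "Red Sox schedule", "subjects": ["Baseball teams"]},
    {"title": "Yankees roster", "subjects": ["Baseball teams"]},
    {"title": "History of baseball", "subjects": ["Sports history"]},
    {"title": "Baseball card prices", "subjects": ["Collectibles"]},
]

folders = build_folders(results, max_folders=2)
for name, titles in folders.items():
    print(name, "->", titles)
```

Within each folder the original relevance order is preserved, so classification narrows the list and ranking still orders what remains.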


Relevancy ranking has its limits

Library patron: “I need some baseball information.”

Librarian: “OK. Here are 41,536 books and sources about baseball, relevancy ranked.”

Good general sources may be ranked on top, but the user probably had something more specific in mind...


Reference librarian approach: work with the user to refine the question

“I need some baseball information.”

“OK. Tell me more. Do you want general info, teams and players, recent news...?”

“Um... team info”

“OK. Red Sox, Yankees, ...?”

“Red Sox.”


Classification helps organize results

shows aspects of a topic (‘baseball’, ‘diagnostic tests’)

disambiguates queries (‘what is balance’)

sometimes answers questions directly (‘12th President’)


Search Current News

Computer networks
Local area networks
Modems
Cable modems
all others...

Special Collection

Personal computers
Computer caches
Buses (computer)
Health care software
Software industry
Circuit design


Special Collection documents
Commercial sites
Sociology of the family
Employee assistance programs
Neurology
Online banking
Helicopters
Martial arts
Chinese philosophy
all others...

1. WHAT IS BALANCE?
84% - Articles & General info: WHAT IS BALANCE? Back to New Evangelicanism Reports. Back to the Way of Life Home Page Way of Life Literature Online Catalog You Can Own…
11/09/97
Personal Page: http://www.dsinclair.com/~dcloud/fbns/whatisbalance.htm

2. Emotional Stability is Balance
77% - Articles & General info: Emotional Stability is Balance Emotional Stability is Balance - 1 He is unbalanced - 2 She’s not on an even keel - 3 They’re upset…
03/24/95
Educational site: http://cogsci.berkeley.edu/metaphors/EmotionalStabilityIsBalance.html

3. What is balance?
73% - Biographical sources: “What is balance?” This is an ongoing, soul-searching, head-scratching question that my husband, Don, and I ponder on a regular bases….
07/01/96
Exceptional parent (magazine): Available at Northern Light


Subject classification of Web documents

exists for sites in Web directories (Yahoo, Looksmart, The Mining Co)

exists behind CGI interfaces

doesn’t exist at the document level

except where supplied by the page creator


Cost of document classification

Original cataloging of book: $37

Creating a journal article abstract: $1.50

Deriving subject headings from journal abstract: $0.20

Abstract plus subject headings ($1.70 per document) for 95,000,000 Web documents = $161.5 million
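A quick check of the total, assuming it combines the abstract cost and the subject-heading cost for every Web document (the slide does not spell out the calculation, but the figures are consistent with this reading). The variable names are illustrative only.

```python
from decimal import Decimal

# Figures quoted on the slide; combining them per document is an assumption.
ABSTRACT_COST = Decimal("1.50")         # creating a journal article abstract
SUBJECT_HEADING_COST = Decimal("0.20")  # deriving subject headings from it
WEB_DOCUMENTS = 95_000_000

per_document = ABSTRACT_COST + SUBJECT_HEADING_COST
total = per_document * WEB_DOCUMENTS

print(f"Per document: ${per_document}")
print(f"Total: ${total / 1_000_000:.1f} million")
```

Decimal arithmetic is used so the dollar amounts are not affected by binary floating-point rounding.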


Metadata manufacturing

Automatically determine document’s subject, type, source and language metadata

Controlled vocabularies interoperate with classifier system

System classifies pages

Fraction of cent per document


NL’s controlled vocabularies

Editorially developed

Hierarchical in form (graph)

Exist for subjects, types, and sources


NL’s subject vocabulary

Subject scope is unlimited (as in LC, Dewey, Yahoo)

Major points of reference were DDC, LC Subject headings, UMI subject headings, and subject-specialized classification schemes

Unique, selective conflation of these

Mapping NL’s vocabulary to content partners’ vocabularies gives freshness and completeness

20,000 concepts; 200,000-300,000 concept equivalents


Subject classification process

Three main techniques:

– mapping

– automatic classification

– editorial classification of whole web sites


Mapping

Indexing vocabularies of content partners are normalized with NL vocabularies

Excellent source of new terms; helps maintain freshness and ensure complete coverage of a topic

All terms become synonyms, equivalents of NL terms and are used in automatic classification... creating a ‘network effect’ of subject knowledge
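Mapping can be pictured as building a table of equivalents for each NL concept from the partners' indexing terms. The sketch below is hypothetical throughout: the partner names, the sample terms, the `normalize` rule, and the data layout are assumptions for illustration, not Northern Light's actual mapping tables.

```python
from collections import defaultdict

# Hypothetical partner indexing terms, each mapped by an editor to an NL concept.
PARTNER_MAPPINGS = {
    "partner_a": {
        "LANs": "Local area networks",
        "Cable modem service": "Cable modems",
    },
    "partner_b": {
        "Local-area networks": "Local area networks",
    },
}


def normalize(term):
    """Lowercase, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(term.lower().replace("-", " ").split())


def build_equivalents(mappings):
    """Collect every partner term as an equivalent of its NL concept."""
    equivalents = defaultdict(set)
    for partner_terms in mappings.values():
        for partner_term, nl_concept in partner_terms.items():
            equivalents[nl_concept].add(normalize(partner_term))
    # The NL concept name itself is also a valid equivalent.
    for concept in list(equivalents):
        equivalents[concept].add(normalize(concept))
    return dict(equivalents)


equivalents = build_equivalents(PARTNER_MAPPINGS)
for concept, terms in equivalents.items():
    print(concept, "->", sorted(terms))
```

Each newly mapped partner adds terms to the same table, which is one way to read the ‘network effect’: every added vocabulary gives the automatic classifier more evidence for existing concepts.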


Partner vocabularies mapped to date

journal aggregators: UMI, IAC, Ethnic News Watch, Responsive Database Services

news databases: AP News, Comtex Newswires, Newsbytes

others: U.S. Pharmacopeia, American Banker, Engineering News Record


Automatic classification

based on words contained in document

uses Term Frequency/Inverse Document Frequency methods

document must have a strong degree of ‘aboutness’ to be classified
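A minimal sketch of word-based classification with TF/IDF weighting and an ‘aboutness’ threshold. It is a textbook-style illustration, not Northern Light's classifier: the tiny corpus, the concept term sets, the single-word matching, and the threshold value are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def inverse_document_frequency(corpus):
    """IDF = log(N / df): terms found in fewer documents weigh more."""
    document_count = len(corpus)
    document_frequency = Counter()
    for document in corpus:
        document_frequency.update(set(tokenize(document)))
    return {
        term: math.log(document_count / count)
        for term, count in document_frequency.items()
    }


def classify(document, concepts, idf, threshold):
    """Return concepts whose summed TF*IDF score meets the threshold."""
    tokens = tokenize(document)
    term_counts = Counter(tokens)
    total = len(tokens) or 1
    scores = {}
    for concept, terms in concepts.items():
        score = sum((term_counts[t] / total) * idf.get(t, 0.0) for t in terms)
        if score >= threshold:  # weak 'aboutness' means no class is assigned
            scores[concept] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical corpus and concept vocabulary (concept -> equivalent terms)
corpus = [
    "The Red Sox won the baseball game after a late home run",
    "Cable modems bring fast network access to the home",
    "The modem connects the computer to the network",
]
concepts = {
    "Baseball": {"baseball", "sox", "inning"},
    "Modems": {"modem", "modems", "cable"},
}

idf = inverse_document_frequency(corpus)
for document in corpus:
    print(classify(document, concepts, idf, threshold=0.05), "<-", document)
```

A document that mentions none of a concept's terms, or mentions them only in passing within a long text, scores below the threshold and is left unclassified rather than forced into a class.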


NL’s type classification

This scheme too is hierarchical, e.g.

Reviews

– Book reviews

– Movie reviews

– Product reviews

classification process based on words and structure of document


Librarians at Northern Light

Build and maintain controlled vocabulary

Map vocabularies of new partners

Continually tune classification performance

Help design and test user interface

Mine and classify whole web sites

Edit databases


Database editing

Classification used to slice NL database into “vertical search engines”

Since Feb 98, we’ve released

– 17 subject search engines on NL Power Search

– 26 industry databases (for NL; also on Netscape Netcenter)

– 5 personal finance databases (for Doubleclick)

– music industry database (with Billboard magazine)

– construction industry database (with Engineering News Record)


Automatic classification is still a fledgling technology, however...

it has proved practical for classifying close to 100 million web pages

it is remarkably accurate, given the breadth of concept space it covers

it is responsive to tuning

it is effective in managing results sets for users


Joyce Ward
Director, Content Classification
Northern Light Technology LLC
222 Third St.
Cambridge, MA
[email protected]