Search Engine Industry Trends – Impact for Digital Libraries
Dr. John M. Lervik, CEO FAST
7th International Bielefeld Conference 2004
Fast Search & Transfer (FAST)
[Map of locations: Oslo, Boston, Tokyo, Munich, San Francisco, Chicago, Rome, London, Washington DC, Rio de Janeiro]
Since 1997, FAST has grown globally
– Public company (OSE: ’FAST’)
– 200+ employees, 80 in R&D
– Profitable and well capitalized
– Fast growing
• > 900 customers & partners (Univ. Lib Bielefeld, HBZ, ZIB, Norwegian Nat’l Lib, Elsevier, LexisNexis, etc)
• #2 fastest-growing company in Europe 1998-2002
– Internet business sold to Overture/Yahoo!
– Acquired AltaVista software with 200 customers
“Industrial Strength”
“Magic Quadrant: Most Visionary”
“Excellent Choice”
[Further map locations: New York, Tromsø]
Mission-Critical Business Search
• Search has become mission-critical & strategic:
– Internet portals: Google, MSN, Yahoo!, …
– E-commerce: Amazon, eBay, …
– Corporate web sites: Dell.com, IBM.com, ...
– Yellow Pages: SEAT PG, TPI PA, Findexa, …
– Directory services: Thomas Publishing, Bonnier…
– Mobile: Vodafone live!, …
• Common purpose: Connect buyer with seller
Search Trends
• “The Google effect”
– Users demand simple one-field search
– Users demand relevant results
– Paid search (advertisement) is the main business driver
• Challenge: Search is much more difficult in the academic and corporate world
– Academic and corporate search: Need to provide the relevant (correct) answer
– Web search: Enough to provide a relevant answer
• Solution: 3rd generation search technology
– Improved relevance through content and query analysis
– Tools for navigation, discovery, and visualization
Digital Library Challenges
• Digital libraries face an information management challenge
– Huge and increasing amount of digital data
– Data/content aggregation, data store (repository), information retrieval & discovery, etc.
• Increasing volumes and types of digital data
– Media types: Books, magazines, CDs, ...
– Media formats: Text/numbers (incl. metadata), audio files, images, video
– Must support various access patterns, copyright, etc.
• Need flexible and efficient interfaces between information and users
– Search engine as unified information access layer
Current Role of Search: Point Solutions
[Diagram: The Corporation, served by isolated solutions. Search applications: site search, mail search, DMS search, BI search, corporate search, e-commerce search. Underlying sources: intranet documents, mail system, documents (DMS, CMS), RDBMS, ERP, CRM, legacy data, data warehouse, data marts.]
… to a Horizontal Search Platform…
[Diagram: Enterprise Search Platform]
Content sources (unstructured, structured, real-time):
– RDBMS (JDBC, ODBC, SQLNet, DW, DM)
– Applications (e.g. ERM, CRM, Help Desk)
– Legacy data (e.g. ISAM, VSAM, IMS)
– Message queues (e.g. TIBCO, MQ-Series)
– DMS (e.g. M’Soft CMS, Documentum)
– eMail systems (e.g. Notes, Exchange)
– Files (e.g. Word, Excel, pdf, images, mp3)
– Portals (e.g. WebSphere, WebLogic)
– WWW (HTML, XML, WML, JavaScript)
– Private webs (e.g. news feeds, intranets)
– Direct push
Search applications on the platform: site search, mail search, BI search, DMS search, corporate search, e-commerce search, …
A common, unified service for intelligent, dynamic information retrieval
• Web services
• GRID computing
Search Engine: How It Works
[Architecture diagram: content sources (web content; files, documents; databases; custom applications; multimedia) feed connectors (web crawler, file traverser, database connector, content push). Content passes through the document processing pipeline into the index files. Search runs against the index, with query and result processing pipelines and a filter handling queries, results, and alerts for vertical applications, portals, custom front-ends, and mobile devices. Tuning and administration span the whole system.]
Open, modular, scalable architecture
Search Engine: How It Works
• Connect to content sources and get data
– Web pages (e.g. XML, HTML, WML): Crawler
– Files, documents (e.g. Word, Excel, pdf): File traverser
– Database content (e.g. Oracle, DB2): Database connectors
– Applications (e.g. Notes, Exchange, CMS/DMS): Application connectors
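The connector step can be sketched as a set of small generators that all emit the same record type. This is a minimal illustration, not FAST's connector API: the `Document` schema, the function names, and the `articles` table are hypothetical, and SQLite stands in for a database such as Oracle or DB2.

```python
import sqlite3
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass
class Document:
    """Common record every connector emits (hypothetical schema)."""
    doc_id: str
    source: str
    text: str


def file_traverser(root: Path) -> Iterator[Document]:
    """Walk a directory tree and emit one Document per .txt file."""
    for path in sorted(root.rglob("*.txt")):
        yield Document(str(path.relative_to(root)), "file",
                       path.read_text(encoding="utf-8"))


def database_connector(conn: sqlite3.Connection, table: str) -> Iterator[Document]:
    """Emit one Document per row of a table with (id, body) columns."""
    # The table name is interpolated only because this is a toy example.
    for row_id, body in conn.execute(f"SELECT id, body FROM {table} ORDER BY id"):
        yield Document(f"{table}/{row_id}", "database", body)


def gather_demo() -> list:
    """Pull from both sources into one uniform stream of Documents."""
    docs = []
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "report.txt").write_text("Annual library report", encoding="utf-8")
        docs.extend(file_traverser(root))
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("INSERT INTO articles (body) VALUES ('Echocardiography study')")
    docs.extend(database_connector(conn, "articles"))
    conn.close()
    return docs
```

Because every connector yields the same `Document` type, the later pipeline stages do not need to know where a record came from.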
Search Engine: How It Works
• Analyze and index content to make it searchable
– Convert and process content through pre-processing pipeline:
  • Lemmatization, entity extraction, taxonomy classification, ontology
  • Custom logic (e.g. adding special tags)
– Write content to index files
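A rough sketch of the indexing step, assuming a pipeline is simply an ordered list of stages and the index is an in-memory inverted index. The stage names, the tiny `LEMMAS` lookup table, and the `digital-library` tag are made up for illustration; a real lemmatizer and real index files are far more involved.

```python
import re
from collections import defaultdict

# Toy lemma dictionary standing in for real linguistic resources.
LEMMAS = {"libraries": "library", "archives": "archive"}


def tokenize(doc: dict) -> dict:
    """Lowercase the text and split it into alphanumeric tokens."""
    doc["tokens"] = re.findall(r"[a-z0-9]+", doc["text"].lower())
    return doc


def lemmatize(doc: dict) -> dict:
    """Map each token to its dictionary form when one is known."""
    doc["tokens"] = [LEMMAS.get(t, t) for t in doc["tokens"]]
    return doc


def add_custom_tag(doc: dict) -> dict:
    """Custom logic stage: tag documents that mention 'library'."""
    if "library" in doc["tokens"]:
        doc["tags"] = ["digital-library"]
    return doc


PIPELINE = [tokenize, lemmatize, add_custom_tag]


def index_documents(raw_docs):
    """Run each document through the pipeline, then write it to the index."""
    index = defaultdict(set)  # term -> set of document ids
    for raw in raw_docs:
        doc = dict(raw)  # work on a copy so the input is left untouched
        for stage in PIPELINE:
            doc = stage(doc)
        for term in doc["tokens"] + doc.get("tags", []):
            index[term].add(doc["id"])
    return index
```

Adding a stage (for example entity extraction) means appending one more function to `PIPELINE`, which is the kind of customization the slide refers to as custom logic.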
Search Engine: How It Works
• Analyze query
– Use query language or query API
– Convert and process query through query pipeline:
  • Linguistic processing
  • Custom logic (e.g. query term modification/addition)
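The query side can be sketched the same way: a linguistic stage normalizes the terms, then a custom stage adds terms. The `LEMMAS` and `SYNONYMS` tables and the function name are hypothetical and only illustrate the idea of query term modification and addition.

```python
# Toy linguistic resources, assumed for illustration only.
LEMMAS = {"diseases": "disease", "libraries": "library"}
SYNONYMS = {"heart": ["cardiac"]}


def analyze_query(raw: str) -> list:
    """Return the list of terms that would be matched against the index."""
    # Linguistic processing: lowercase and map each term to its lemma.
    terms = [LEMMAS.get(t, t) for t in raw.lower().split()]
    # Custom logic: add synonyms for known terms without duplicating any.
    expanded = list(terms)
    for term in terms:
        for synonym in SYNONYMS.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded
```

Applying the same lemma table at index time and at query time keeps both sides in the same normalized vocabulary, so "diseases" in a query finds documents indexed under "disease".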
Search Engine: How It Works
• Match query to content index
– Query- and content-adaptive matching
– Exploit all information and structure in the data
Search Engine: How It Works
• Return results to user
– Convert and process results through result pipeline:
  • Re-sort, filter for security, analyze for navigation and discovery (dynamic drilldown)
– Pass results on to application (generated or through API)
– Push results to alert engine and then external environment (e.g. mail, queue)
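Two of the result-pipeline stages, the security filter and the re-sort, can be sketched as follows. The `Hit` record, its `allowed_groups` field, and the group names are assumptions made for this example; the slides do not describe how access rights are represented.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    score: float
    year: int
    allowed_groups: frozenset  # groups permitted to see this document


def process_results(hits, user_groups, sort_by="score"):
    """Result pipeline: security filter first, then re-sort."""
    # Security stage: keep only hits shared with at least one of the user's groups.
    visible = [h for h in hits if h.allowed_groups & user_groups]
    # Re-sort stage: descending order on the chosen field.
    return sorted(visible, key=lambda h: getattr(h, sort_by), reverse=True)
```

Filtering before sorting means that restricted documents never influence what the user sees, including the ordering and any counts derived from the visible hits.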
Search Engine Features: Relevant, Organized Information
• Linguistic Analysis
– Auto-language detection
– Natural language processing
– Approximate matching (spelling)
– Lemmatization (grammar)
– Entity extraction, anti-phrasing
– Multiple dictionaries, thesauri
• Taxonomy and Classification
– Structured, unstructured data
– Supervised, unsupervised categorization
– Dynamic classification
– Auto-taxonomy generation (terms, Web)
– Taxonomy toolkit
– Ontologies
• Open, Flexible Relevancy Model
– Absolute and relative query boosting
– Relative document boosting
– Custom processing logic (pre-index, query)
– Rule-based matching
• Powerful Query Language
– Exact matches, wildcards, multiple terms
– “more like this” (query by example), “near”
– Text, integer, Boolean expressions (unlimited levels of parentheses)
– Integer comparisons (>, ≥, =, <, ≤, ≠)
– Fuzzy queries, concept,
• Flexible Search and Sort
– Range searching
– Default sort, sort by field
– Static & dynamic teasers, any field
– Full inclusion, exclusion URI control
– Robot aware
• Navigation, Discovery & Visualization
– Structured, unstructured data
– Dynamic drill-down (faceted browsing)
– Results-based binning
– Statistical analysis
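As one illustration of the approximate matching (spelling) feature listed above, the sketch below suggests the closest known term for a misspelled query word. It uses Python's standard `difflib` similarity ratio as a stand-in technique; the vocabulary, the cutoff value, and the function name are assumptions, and the slides do not say which method the product uses.

```python
import difflib

# Toy vocabulary standing in for the terms present in an index.
VOCABULARY = ["echocardiography", "cardiology", "radiology", "library"]


def suggest(term, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest known term, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(term.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A fairly high cutoff keeps the engine from "correcting" a query into an unrelated term; lowering it trades precision for recall.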
Relevance & Information Discovery
• Traditional: Result sets are typically lists of document identifiers
• 3rd generation: Result set depends on the query intentions
– Traditional result set lists
– Dynamic clustering: Supervised and unsupervised
– Live analytics (dynamic drill-down) for navigation and discovery
– Visualization...
2 ways to search:
– “I know what I want, but I don’t know where it is”
– “I’m not sure what I’m looking for but I know how to get there”
Intelligent Organization
The search bar
Live analytics
Traditional Result Set
• Languages
– 77 languages auto-detectable, searchable, sortable
– 20 languages include advanced linguistics
– Multiple code sets for each language
• Multiple field sorting
The search bar
• Linguistics
– Auto-language detection
– Approximate matching (spelling)
– Lemmatization (grammar)
– Phrase detection
– Anti-phrasing, stop words
– Proximity search
– Multiple dictionaries, thesauri
– Full search language (incl. text, integer, boolean)
Relevance: Ranking – The FCASQ Framework
• Completeness
– How well does the query match superior contexts like the title or the URL?
– Example: query = “Mexico”. Is “Mexico” or “University of New Mexico” best?
• Authority
– Is the document considered an authority for this query?
– Examples: Web link cardinality, article references (citations), product revenue, page impressions, ...
• Statistics
– How well do the contents of this document overall match the query?
– Examples: Proximity, context weights, tf-idf, degree of linguistic normalization, etc.
• Quality
– What is the quality of the document?
– Examples: Homepage?, Entry point to product group?, Press release?, ...
• Freshness
– How fresh is the document compared to the time of the query?
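The five signals can be illustrated with a small scoring sketch. The slides do not state how the signals are combined, so the weighted sum, the weight values, and the word-overlap completeness measure below are all assumptions chosen to reproduce the “Mexico” example.

```python
from dataclasses import dataclass


@dataclass
class RankSignals:
    """One value in [0, 1] per FCASQ dimension."""
    freshness: float
    completeness: float
    authority: float
    statistics: float
    quality: float


# Hypothetical weights; a real deployment would tune these per application.
WEIGHTS = {"freshness": 0.1, "completeness": 0.3, "authority": 0.2,
           "statistics": 0.3, "quality": 0.1}


def completeness(query: str, context: str) -> float:
    """Share of the context's words covered by the query (toy measure)."""
    query_terms = set(query.lower().split())
    words = context.lower().split()
    return sum(w in query_terms for w in words) / len(words) if words else 0.0


def fcasq_score(signals: RankSignals, weights=WEIGHTS) -> float:
    """Combine the five signals as a weighted sum (an assumed combination)."""
    return sum(w * getattr(signals, name) for name, w in weights.items())
```

With every other signal held equal, a document titled “Mexico” outranks one titled “University of New Mexico” for the query “Mexico”, because the query covers the whole title in the first case and only a quarter of it in the second.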
Navigation & Discovery
Live Analytics
• Multi-Dimensional Navigation
– Taxonomic, ontological
– Clustering of extracted entities
– Field-based categories
• Dynamic, Automatic Generation
– Auto-generated from configuration definitions
– Re-generated on each query
– Internal scoring for further refinement
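A minimal sketch of navigators that are generated from a configuration and re-generated on each query. The field names in `NAVIGATOR_FIELDS`, the dictionary-based result records, and the function names are hypothetical; counting values per field is the simplest possible form of the live analytics described above.

```python
from collections import Counter

# Hypothetical configuration: which fields get a navigator.
NAVIGATOR_FIELDS = ["year", "journal"]


def build_navigators(results, fields=NAVIGATOR_FIELDS):
    """Count field values over the current result set, most frequent first."""
    return {
        field: Counter(r[field] for r in results if field in r).most_common()
        for field in fields
    }


def drill_down(results, field, value):
    """Refine the result set to the records matching one navigator entry."""
    return [r for r in results if r.get(field) == value]
```

After each drill-down the navigators are rebuilt from the refined result set, so the counts always describe what the user is currently looking at.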
Automatically Extracted Entities
Information Discovery
Example: Scirus Metadata
Information Discovery
Example: Medical Information (Medline) – 12M Documents
Discovery
• MeSH keywords
• Publication year
• Journal title
• Author(s)
• Chemical substances
• Etc.
Information Discovery
Example: Medical Information – 12M Documents
Analytical Search
Example: Author Analysis
Data source: 12M Medline Publications
Example: Echocardiography – Author drill-down
Jim Seward, Mayo
Jim Seward: Publishing pattern
Co-Authors
A Tajik 56
J Oh 25
P Pellikka 16
B Khanderia 16
D Hagler 13
V Roger 13
K Bailey 13
F Miller 11
Research Topics
– stress echocardiography
– image orientation
– regurgitant orifice
– abnormal relaxation
– two-dimensional echocardiography
– ventricular response in patients
– initial repair
– mitral lesions
– echocardiographic contrast
– myocardial infarction
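The co-author column of such an author drill-down amounts to counting who appears on the same publications. The sketch below assumes each publication is just a list of author names; the function name and the sample data in the usage note are invented for illustration and are not taken from Medline.

```python
from collections import Counter


def coauthor_counts(publications, author):
    """Count how often each other author shares a publication with `author`."""
    counts = Counter()
    for authors in publications:
        if author in authors:
            counts.update(a for a in authors if a != author)
    return counts.most_common()
```

For example, `coauthor_counts([["X", "A", "B"], ["X", "A"], ["C", "B"]], "X")` counts "A" twice and "B" once, and ignores the publication that "X" did not write.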
Example 1: Scirus (www.scirus.com)
Scirus is the leading online search engine for scientific content
Twice winner of the SEW Best Specialty Search Engine award
• Scientific Web Pages: 140 million Web pages (.edu, .gov, .org, .com, …)
• Proprietary Databases: 30M article records (Medline, ScienceDirect, …)
• Value-Added Functionalities
– Large-scale content aggregation
– Automatic content & page classification
– Query refinements (1-D drill-down)
Example 2: Norwegian National Library (www.nb.no)
• One integrated search engine across many diverse projects
– One search interface for all catalogs, instead of searching in 100+ databases
– Information from objects of all types of media (multimedia, textual content, metadata)
– Used in in-house library production systems, end-user services, and ongoing innovation projects
• Projects
– The Digital Radio Archive (DRA): NRK Radio historical radio archive (300,000 programs)
– Culture Net Norway: The official gateway to Norwegian culture on the web
– The Digital Newspaper Library: 300,00 pages from year 1763 and onwards
– Cultural Heritage Ekofisk: Content related to the Ekofisk oil field (incl. OAI metadata harvester)
– The National Library’s public web site
– Paradigma (Preservation, Arrangement & Retrieval of Assorted DIGital MAterials)
– The Nordic Web Archive (NWA): Harvesting and archiving of web documents
Summary
• Search engines can do more than just search…
– Unified information access solution for digital libraries
– Open, scalable and modular architecture: Allows for customization
– Adapts to content and queries
– Powerful data discovery, navigation, and visualization
• Many exciting technology developments to come
– More advanced content and query analysis
– Adaptive, personalized query- & content-sensitive matching
– Dynamic result set presentation, navigation, discovery, visualization
– Federation across external content applications
Thank you!