ENTERPRISE SEARCH an introduction

Introduction to Search Engines


DESCRIPTION

This presentation gives an introduction to search engines: what they are and how they work. It also includes a brief introduction to Solr and Lucene.


Page 1: Introduction to Search Engines

ENTERPRISE SEARCH

an introduction

Page 2: Introduction to Search Engines

Web Search

Desktop Search

Enterprise Search

Page 3: Introduction to Search Engines

so what is a

Search Engine?

Page 4: Introduction to Search Engines

a piece of SOFTWARE

• that builds an index on text

• and answers queries using that index

Page 5: Introduction to Search Engines

Any search application has two major components

SEARCH component - of importance to the users

INDEXING component - of importance to us developers (read: headache)

Page 6: Introduction to Search Engines

[Diagram]

INDEXING component: data is indexed into INDEX FILES

SEARCH component: user sends search query, receives search results

Page 7: Introduction to Search Engines

Let’s start with

INDEXING

Page 8: Introduction to Search Engines

is it easy to search here . . .

Page 9: Introduction to Search Engines

or here . . .

Page 10: Introduction to Search Engines

• that’s information that looks like garbage

• no structure

• comes in all kinds of shapes, sizes, formats

Page 11: Introduction to Search Engines

• And this is what indexing does:

• It puts data into a structured format that is easy to search.

Page 12: Introduction to Search Engines

so what needs to be

Indexed and Searched?

Page 13: Introduction to Search Engines

various FILE FORMATS

Text Files

HTML

PDF

MS Word

PPT

Page 14: Introduction to Search Engines

coming from various DATA SOURCES

Emails

CMS

File System

Database

Web Pages

Page 15: Introduction to Search Engines

data ( documents )

text that should be indexed is fed to the Analyzer, which applies:

• removing stop words such as "a" or "the"

• converting all text to lowercase letters for case-insensitive searching

• stemming (a stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word, "fish")

tokenized text is passed to the Index Writer, which writes the INDEX FILES

user sends search query, receives search results
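The three analyzer steps on this slide can be sketched in a few lines of Python. This is only an illustration, not Lucene's Analyzer: the stop-word list contains just the two words the slide names, and `naive_stem` is a hypothetical, deliberately crude suffix stripper that happens to handle the "fish" example.

```python
import re

# Assumed stop-word list; the slide names only "a" and "the".
STOP_WORDS = {"a", "the"}

# Deliberately naive suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "er")


def naive_stem(token):
    """Strip one known suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, split into word tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(analyze("The fisher fished a fish while fishing"))
# ['fish', 'fish', 'fish', 'while', 'fish']
```

The output of `analyze` is the "tokenized text" that would be handed to the index writer.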

Page 16: Introduction to Search Engines

Document 1: Coffee isn't my cup of tea.

Document 2: Chocolate, men, coffee - some things are better rich.

INDEX (term - documents containing it)

coffee - 1,2
cup - 1
tea - 1
chocolate - 2
men - 2
things - 2
better - 2
rich - 2
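A minimal sketch of how such an index could be built from the two example documents. The stop-word set is an assumption, chosen so that only the terms listed on the slide remain; no stemming is applied. Each term maps to the ids of the documents that contain it.

```python
import re
from collections import defaultdict

# Assumption: stop words chosen so that only the slide's terms remain.
STOP_WORDS = {"a", "the", "isn't", "my", "of", "some", "are"}

DOCUMENTS = {
    1: "Coffee isn't my cup of tea.",
    2: "Chocolate, men, coffee - some things are better rich.",
}


def build_inverted_index(documents):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z']+", text.lower()):
            if token not in STOP_WORDS:
                postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, term):
    """Answer a single-term query by looking the term up in the index."""
    return index.get(term.lower(), [])


index = build_inverted_index(DOCUMENTS)
print(search(index, "Coffee"))  # [1, 2]
print(search(index, "rich"))    # [2]
```

Answering a query is then a dictionary lookup rather than a scan over every document, which is the point of building the index up front.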

Page 17: Introduction to Search Engines

And now the

SEARCH Component

Page 18: Introduction to Search Engines

[Diagram]

data is indexed into INDEX FILES

user sends search query; its search terms are looked up in INDEX FILES

user receives search results

Page 19: Introduction to Search Engines

Search Request Terms

Spelling Index: Correct Search Terms + Incorrect Search Terms

Taxonomy: Search Terms + Related Terms from Taxonomy + Concept IDs

Search engine (INDEX)

Search results with

1) Actual Location of the result

2) Rank

3) Details

4) Facet Categorization

Results’ Page
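The first two stages of this flow (spelling correction, then expansion with related terms) might look roughly like the sketch below. `SPELLING_INDEX` and `TAXONOMY` are hypothetical stand-ins with made-up entries, the similarity cutoff of 0.8 is an arbitrary choice, and concept IDs are left out. Spelling suggestions come from Python's standard `difflib` module, not from any search engine's own spell checker.

```python
import difflib

# Hypothetical spelling index: terms known to the search index.
SPELLING_INDEX = ["coffee", "tea", "chocolate"]

# Hypothetical taxonomy: term -> related terms.
TAXONOMY = {"coffee": ["espresso", "latte"], "tea": ["chai"]}


def expand_query(terms):
    """Return original terms, spelling corrections, and related terms."""
    expanded = []
    for term in terms:
        term = term.lower()
        candidates = [term]
        # Closest known spelling, if any; the original term is kept as well.
        candidates += difflib.get_close_matches(
            term, SPELLING_INDEX, n=1, cutoff=0.8
        )
        for candidate in list(candidates):
            candidates += TAXONOMY.get(candidate, [])
        # Preserve order while dropping duplicates.
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["cofee"]))  # ['cofee', 'coffee', 'espresso', 'latte']
```

The expanded term list is what would then be sent to the search engine's index.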

Page 20: Introduction to Search Engines

introducing

LUCENE

Page 21: Introduction to Search Engines

Full-text search library

Open Source

Documents are sets of fields (XML is the format Solr accepts them in)

Can operate on its own or via Solr

Page 24: Introduction to Search Engines

Ways of storing the fields of a document:

Indexed: the field is searchable.

Stored: the content can be displayed in the search results, even if you choose not to make the field searchable. Example: “summary associated with a page”

Tokenized: the field is run through an Analyzer, which converts the content into a sequence of tokens.
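To make the three options concrete, here is a small sketch with a hypothetical `FieldOptions` class and a made-up schema. It is not Lucene's API; it only shows how the flags decide what ends up searchable and what can be shown in results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldOptions:
    """Hypothetical per-field flags mirroring the three options above."""
    indexed: bool    # searchable
    stored: bool     # original value can be shown in results
    tokenized: bool  # split into tokens before indexing


# Hypothetical schema for a web page document.
SCHEMA = {
    "title": FieldOptions(indexed=True, stored=True, tokenized=True),
    "body": FieldOptions(indexed=True, stored=False, tokenized=True),
    "summary": FieldOptions(indexed=False, stored=True, tokenized=False),
}


def index_document(doc):
    """Return (searchable terms per field, values kept for display)."""
    terms, stored = {}, {}
    for name, value in doc.items():
        options = SCHEMA[name]
        if options.indexed:
            # Tokenized fields become lowercase words; others one exact term.
            terms[name] = value.lower().split() if options.tokenized else [value]
        if options.stored:
            stored[name] = value
    return terms, stored


terms, stored = index_document(
    {"title": "Intro to Search", "body": "Solr wraps Lucene", "summary": "A short intro"}
)
```

Here `summary` can be displayed but not searched, while `body` can be searched but its text is not kept for display.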

Page 25: Introduction to Search Engines

introducing

SOLR

Solr

Lucene

Index

Page 26: Introduction to Search Engines

• open source

• handles index and query requests to Lucene via HTTP and XML (also JSON)

• manages document update, add and delete requests to Lucene

• straightforward schema and config files

• comprehensive HTML Admin Interfaces

• highly configurable

Page 27: Introduction to Search Engines

Adding Documents to SOLR

Page 28: Introduction to Search Engines

HTTP POST to /update

<add>
  <doc boost="2">
    <field name="type">05991</field>
    <field name="from">Apache Solr</field>
    <field name="subject">An intro...</field>
    <field name="category">search</field>
    <field name="category">lucene</field>
    <field name="body">Solr is a full...</field>
  </doc>
</add>
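A message like this can be generated instead of written by hand. The sketch below builds the XML with Python's standard library and prepares the POST without sending it. The helper names are hypothetical, the base URL is taken from the query example later in the deck, and the `text/xml` content type is an assumption about what the server expects.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(fields, boost=None):
    """Serialize (name, value) pairs into an <add><doc>...</doc></add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


def build_update_request(xml_body, base_url="http://localhost:8983/solr"):
    """Prepare (but do not send) an HTTP POST to the /update endpoint."""
    return urllib.request.Request(
        base_url + "/update",
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},  # assumed
        method="POST",
    )


xml_body = build_add_xml(
    [("type", "05991"), ("category", "search"), ("category", "lucene")],
    boost=2,
)
request = build_update_request(xml_body)
# urllib.request.urlopen(request) would send it to a running Solr instance.
```

A multi-valued field such as `category` is simply repeated, which is why the fields are passed as a list of pairs rather than a dictionary.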

Page 29: Introduction to Search Engines

schema.xml: field indexing and display definition

Page 30: Introduction to Search Engines

solrconfig.xml file

defines cache size, faceted field type, request handler customization

Page 31: Introduction to Search Engines

Deleting Documents

• Delete by Id

<delete><id>05591</id></delete>

• Delete by Query (multiple documents)

<delete>

<query>manufacturer:microsoft</query>

</delete>
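Both delete messages can be produced the same way. A short sketch with two hypothetical helper functions; building the XML through a library rather than string concatenation means special characters in the id or query are escaped properly.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Build a <delete> message that removes a single document by its id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = doc_id
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Build a <delete> message that removes every document matching a query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


print(delete_by_id("05591"))
# <delete><id>05591</id></delete>
print(delete_by_query("manufacturer:microsoft"))
# <delete><query>manufacturer:microsoft</query></delete>
```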

Page 32: Introduction to Search Engines

Search Results

http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price

Page 33: Introduction to Search Engines

Default Parameters

param   default    description

q                  The query
start   0          Offset into the list of matches
rows    10         Number of documents to return
fl      *          Stored fields to return
qt      standard   Query type; maps to query handler
df      (schema)   Default field to search

http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
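The example URL can be assembled from these parameters. A minimal sketch with a hypothetical `build_select_url` helper; the defaults for `start` and `rows` are the ones listed on this slide, and `fl` is only sent when specific fields are requested.

```python
from urllib.parse import urlencode


def build_select_url(query, start=0, rows=10, fields=None,
                     base_url="http://localhost:8983/solr"):
    """Build a /select URL from the q, start, rows and fl parameters."""
    params = {"q": query, "start": start, "rows": rows}
    if fields:
        params["fl"] = ",".join(fields)
    # safe="," keeps the comma in the field list readable instead of %2C.
    return base_url + "/select?" + urlencode(params, safe=",")


print(build_select_url("video", rows=2, fields=["name", "price"]))
# http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
```

Using `urlencode` also takes care of escaping spaces and other special characters in the query text.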

Page 34: Introduction to Search Engines

<response>
  <responseHeader>
    <status>0</status>
    <QTime>1</QTime>
  </responseHeader>
  <result numFound="16173" start="0">
    <doc>
      <str name="name">Apple 60 GB iPod with Video</str>
      <float name="price">399.0</float>
    </doc>
    <doc>
      <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
      <float name="price">479.95</float>
    </doc>
  </result>
</response>
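On the client side, a response like this has to be turned back into usable values. The sketch below parses the XML from this slide with Python's standard library; `parse_response` is a hypothetical helper that only handles the `str` and `float` elements that appear here.

```python
import xml.etree.ElementTree as ET

# The response from the slide, copied as a string.
RESPONSE_XML = """<response><responseHeader><status>0</status>
<QTime>1</QTime></responseHeader>
<result numFound="16173" start="0">
<doc><str name="name">Apple 60 GB iPod with Video</str>
<float name="price">399.0</float></doc>
<doc><str name="name">ASUS Extreme N7800GTX/2DHTV</str>
<float name="price">479.95</float></doc>
</result></response>"""


def parse_response(xml_text):
    """Return (total number of matches, list of documents as dicts)."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        record = {}
        for child in doc:
            value = child.text
            if child.tag == "float":
                value = float(value)
            record[child.get("name")] = value
        docs.append(record)
    return int(result.get("numFound")), docs


num_found, docs = parse_response(RESPONSE_XML)
print(num_found)  # 16173
print(docs[0])    # {'name': 'Apple 60 GB iPod with Video', 'price': 399.0}
```

Note that `numFound` is the total number of matches (16173), while only `rows=2` documents are actually returned.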

Page 35: Introduction to Search Engines

[Architecture diagram: Solr Core on top of Lucene]

Entry points: HTTP Request Servlet, Update Servlet, XML Update Interface, Admin Interface

Request handlers: Standard Request Handler, Disjunction Max Request Handler, Custom Request Handler

Other components: XML Response Writer, Update Handler, Caching, Config, Analysis, Concurrency, Replication, Schema

Annotations on the diagram: "Search Requests hit here", "New document to be added here"
