Why do we need a dedicated search engine to search our unstructured text data? Why can't we just rely on the features built into most databases?
Why Search? (starring Elasticsearch)
Doug Turnbull
OpenSource Connections
Hello
• Me: @[email protected]
• Us: http://o19s.com
  o World-class search consultants
  o Right here in C’ville!
  o Hiring passionate interns!
Why Search?
• What does a dedicated search engine do that a database doesn’t?
• Why not [MySQL | MongoDB | Cassandra | etc.]?
• Why a dedicated search engine?
Why not MySQL?
• We’ve got rows of stuff in tables. E.g., for SciFi StackExchange, we’ve stored ~20K posts:
PostID | UserId | CreationDate            | ViewCount | Body
0      | 1      | 2011-01-11T20:52:46.753 | 124       | <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>
1      | 2      | 2013-02-01T12:44:46.525 | 525       | <p>Been meaning to read the Foundation Series, what should I read first?</p>
Why not MySQL?
• Our mission: Find every mention of “Darth Vader” in the SciFi StackExchange posts!
P | U | C | V | Body
0 | 1 | 2 | 1 | <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>  <- Found!
1 | 2 | 2 | 5 | <p>Been meaning to read the Foundation Series, what should I read first?</p>  <- Missing!
Why not MySQL – SQL Like?
• SQL “LIKE” operator – scan all rows for a specific wildcard match
SELECT * FROM posts WHERE body LIKE "%darth vader%"
• Every row is checked in turn: Match? Match? Match? Match?
• Performs a table scan
• Approx. 300 ms to search a measly 20K docs! (What if we had 20 million?)
SQL Like – other problems
• Can’t search for words out of order:
SELECT * FROM posts WHERE body LIKE "%vader, darth%"
0 results
• Can’t search for alternate forms of a word:
SELECT * FROM posts WHERE body LIKE "%kittie pictures%"
SELECT * FROM posts WHERE body LIKE "%kitteh pictures%"
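The out-of-order problem can be reproduced in a small sketch. It uses Python's built-in sqlite3 module rather than MySQL, purely so the example is self-contained (an assumption; SQLite's LIKE happens to be case-insensitive for ASCII by default). The helper name like_search is hypothetical, and the two rows are the sample posts from the earlier slide.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL posts table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO posts (id, body) VALUES (?, ?)",
    [
        (0, "<p>What exactly did Obiwan know about Anakin and Darth Vader "
            "before a New Hope started?</p>"),
        (1, "<p>Been meaning to read the Foundation Series, "
            "what should I read first?</p>"),
    ],
)


def like_search(pattern):
    """Hypothetical helper: return ids of posts whose body matches a LIKE pattern."""
    rows = conn.execute(
        "SELECT id FROM posts WHERE body LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]


in_order = like_search("%darth vader%")       # literal substring is present
out_of_order = like_search("%vader, darth%")  # same words, different order

print(in_order, out_of_order)
```

The second pattern finds nothing even though the post is clearly about Darth Vader, because LIKE only matches the literal character sequence.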
SQL Like – other problems
• No Ranking of Results – given these two docs:
Doc A: One might ask how none of the Jedi at Qui-Gon's funeral noticed that there was a Dark Lord of the Sith standing right behind them. Darth Vader and Obi-Wan only noticed each other when on the same station … It's apparently hard to pick up another force-user without knowing he or she is there…
- Darth Vader is a side topic here
Doc B: I seem to remember a novel, I think it was Dark Lord: The Rise of Darth Vader, that addressed this. It made the assertion that while Darth Vader had lost both hands, he was still as formidable, in the force sense,
- Directly about Darth Vader
Which should come first?
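One crude way to answer that question is to count how often the phrase occurs in each document and sort by the count. The sketch below does only that; it is an illustration of the idea, not how Lucene scores documents, and the keys "funeral" and "novel" as well as the function phrase_count are hypothetical names.

```python
# The two example documents from the slide, keyed by hypothetical short names.
docs = {
    "funeral": (
        "One might ask how none of the Jedi at Qui-Gon's funeral noticed that "
        "there was a Dark Lord of the Sith standing right behind them. Darth "
        "Vader and Obi-Wan only noticed each other when on the same station … "
        "It's apparently hard to pick up another force-user without knowing he "
        "or she is there…"
    ),
    "novel": (
        "I seem to remember a novel, I think it was Dark Lord: The Rise of "
        "Darth Vader, that addressed this. It made the assertion that while "
        "Darth Vader had lost both hands, he was still as formidable, in the "
        "force sense,"
    ),
}


def phrase_count(text, phrase):
    """Count case-insensitive, non-overlapping occurrences of phrase in text."""
    return text.lower().count(phrase.lower())


# Documents that mention the phrase more often are listed first.
ranked = sorted(
    docs, key=lambda name: phrase_count(docs[name], "darth vader"), reverse=True
)

print(ranked)
```

LIKE gives no such ordering at all: both rows simply match.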
SQL LIKE | CTRL+F | grep is
1. Extremely Slow
2. Not fuzzy -- Needs exact literal matches, no fuzziness!
3. Unranked -- Simply says y/n whether there is a match
Search needs to be
1. FAST! A data structure that can efficiently take search terms and return a set of documents
2. FUZZY! A way to record positional and fuzzy modifications to text to assist matching
3. FRUITFUL! Relevant documents bubble to the top.
Let’s play with an implementation
• Lucene -> Elasticsearch
Lucene, Solr, Elasticsearch
• Lucene, 1999, by Doug Cutting
  o Java library for search
• Solr, 2006, Yonik Seeley
  o First to put Lucene behind an HTTP interface
  o Still going strong
• Elasticsearch, 2010, Shay Banon
  o Alternative implementation
  o Extremely REST-y
• Your database’s full-text search features
  o MySQL, for example, has a FULLTEXT index
  o Works for trivial cases, not the path of wisdom
Elasticsearch
• Create an index
curl -XPUT http://localhost:9200/stackexchange
• Index some docs!
curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
  "Body": "<p>Darth Vader dined with Luke</p>",
  "Title": "..."
}'
What is being built?
The answer can be found in your textbook…
Book index:
• Topics -> page no.
• Very efficient tool – compare to scanning the whole book!
Lucene uses an index:
• Tokens => document ids:
laser => [2, 4]
light => [2, 5]
lightsaber => [0, 1, 5, 7]
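A token-to-document-ids mapping like the one above can be sketched in a few lines. This is a minimal in-memory illustration of an inverted index, not Lucene's on-disk structure. The function build_inverted_index is a hypothetical name, and the document texts are made up so that the result reproduces the postings shown on the slide.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() so a token repeated inside one document is listed only once.
        for token in set(docs[doc_id].lower().split()):
            index[token].append(doc_id)
    return dict(index)


# Hypothetical documents chosen to match the slide's postings lists.
docs = {
    0: "lightsaber",
    1: "lightsaber",
    2: "laser light",
    4: "laser",
    5: "light lightsaber",
    7: "lightsaber",
}

index = build_inverted_index(docs)
print(index["lightsaber"])
```

Looking up a term is then a single dictionary access instead of a scan over every document.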
Computers == Dumb
• Humans are smart
  o I see “cat” or “cats” in the back of a book, no duh – jump to page 9
• Computers are dumb
  o “CAT” != “cat” – no match returned
  o “cat” != “cats” – no match returned
• Hence, when indexing, normalize text to a more searchable form:
cats -> cat
fitted -> fit
alumnus -> alumnu
Normalization aka Text Analysis
• Raw input, filtered (char filter)
  o <p>Darth Vader dined with Luke</p>
  o Darth Vader dined with Luke
• Tokenized
  o Darth Vader dined with Luke
  o [Darth] [Vader] [dined] [with] [Luke]
• Token filters (lowercased, synonyms applied, pointless words removed)
  o [darth] [vader] [dine] [luke]
• Most importantly: this is highly configurable
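The three stages above (char filter, tokenizer, token filters) can be sketched as a small pipeline. Everything here is a toy stand-in: the stop-word list and the suffix-stripping rules in toy_stem are assumptions for illustration and are far cruder than the real Lucene/Snowball analyzers, and all function names are hypothetical.

```python
import re

# Hypothetical stop-word list; real analyzers ship their own.
STOP_WORDS = {"a", "an", "the", "with", "of", "and"}


def strip_html(text):
    """Char filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def toy_stem(token):
    """Toy stemmer (assumption): only handles a trailing 'ed' or a plural 's'."""
    if token.endswith("ed") and len(token) > 4:
        return token[:-1]   # dined -> dine
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]   # cats -> cat
    return token


def analyze(text):
    """Run char filter, tokenizer, then token filters in order."""
    tokens = tokenize(strip_html(text))
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in tokens]


tokens = analyze("<p>Darth Vader dined with Luke</p>")
print(tokens)
```

Because the same analyze function would be applied to both documents and queries, "CAT", "cat" and "cats" all end up as the same token.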
Normalization aka Text Analysis
curl -XGET 'http://localhost:9200/_analyze?analyzer=snowball' -d 'Darth Vader dined with Luke'
{ "tokens": [
  { "end_offset": 5, "position": 1, "start_offset": 0, "token": "darth", "type": "<ALPHANUM>" },
  { "end_offset": 11, "position": 2, "start_offset": 6, "token": "vader", "type": "<ALPHANUM>" },
  { "end_offset": 17, "position": 3, "start_offset": 12, "token": "dine", "type": "<ALPHANUM>" },
  { "end_offset": 27, "position": 5, "start_offset": 23, "token": "luke", "type": "<ALPHANUM>" }
] }
What is being built?
field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>
curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
  "Body": "<p>Darth Vader dined with Luke</p>",
  "Title": "..."
}'
curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{
  "Body": "<p>We love Darth</p>",
  "Title": "..."
}'
Ranking
field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>
curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
  "Body": "<p>Darth Vader dined with Luke</p>",
  "Title": "..."
}'
curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{
  "Body": "<p>We love Darth</p>",
  "Title": "..."
}'
Can we store anything here (in <metadata>) to help decide how relevant this term is for this doc?
Yes!
- Term frequency
  - How much “darth” is in this doc?
- Position within document
  - Helps when we search for the phrase “darth vader”
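The per-document metadata described above can be sketched by storing token positions in each posting: the term frequency is then the number of positions, and a two-word phrase matches when the second word sits exactly one position after the first. This is a simplified illustration rather than Lucene's postings format. The function names are hypothetical, and the token lists are assumed to be the already-analyzed bodies of the two example posts.

```python
from collections import defaultdict


def index_positions(docs):
    """Build term -> {doc_id: [positions]} from already-analyzed token lists."""
    postings = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            postings[token].setdefault(doc_id, []).append(position)
    return postings


def term_frequency(postings, term, doc_id):
    """How many times does term occur in the given document?"""
    return len(postings.get(term, {}).get(doc_id, []))


def phrase_match(postings, first, second):
    """Ids of documents where `second` appears directly after `first`."""
    matches = []
    for doc_id, positions in postings.get(first, {}).items():
        following = set(postings.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            matches.append(doc_id)
    return sorted(matches)


# Assumed analyzer output for the two posts indexed on the slide.
docs = {
    1: ["darth", "vader", "dine", "luke"],
    2: ["we", "love", "darth"],
}

postings = index_positions(docs)
print(phrase_match(postings, "darth", "vader"))
```

Both documents contain "darth", but only doc 1 matches the phrase, since doc 2 has no "vader" at the following position.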
Query Documents
• When did Darth Vader and Luke have dinner?
curl -X POST "http://localhost:9200/stackexchange/_search?pretty=true" -d '{
  "query": { "match": { "Body": "luke darth dinner" } }
}'
User query: luke darth dinner
What happens when we query?
User query: luke darth dinner
The index:
field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>
How to consult the index for matches?
1. Analysis: luke darth dinner -> [luke] [darth] [dine]
2. Look up each term: [darth], [dine], ...
3. Score for [darth] docs (1 and 2)
4. Score for [dine] docs (1)
5. Return sorted docs to the client
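The lookup-and-score flow can be sketched as follows. The index literal mirrors the terms shown on the slide, with an assumed term frequency of 1 stored for each posting. Summing raw term frequencies is a deliberate simplification for illustration; Lucene's actual scoring is more involved. The function name search is hypothetical.

```python
# term -> {doc_id: term frequency}, mirroring the slide's index (tf values assumed).
index = {
    "darth": {1: 1, 2: 1},
    "vader": {1: 1},
    "dine": {1: 1},
}


def search(index, query_terms):
    """Sum term frequencies per document and return (doc_id, score), best first."""
    scores = {}
    for term in query_terms:
        # Terms missing from the index simply contribute nothing.
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + tf
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Query terms after analysis, as on the slide.
results = search(index, ["luke", "darth", "dine"])
print(results)
```

Doc 1 matches both [darth] and [dine], so it sorts ahead of doc 2, which matches only [darth].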
So Elasticsearch!
• FAST!
  o Inverted index data structure is blazing fast
  o Lucene is probably the most tuned implementation
• FUZZY!
  o We use analysis to normalize text to canonical forms
  o We can use positional information when querying (not shown here)
• FRUITFUL!
  o Relevant documents are scored based on relative term frequency
BUT WAIT THERE’S MORE
• Many non-traditional applications of “search”
  o Rank file directory by proximity to current directory
  o Geographic-aided search, rank based on distance and search relevancy
  o Q & A systems – Watson has a ton of Lucene
  o Log aggregation, e.g. Kibana -- because in Lucene everything is indexed!
• And many features!
  o Spellchecking
  o Facets
  o More-like-this document
QUESTIONS?