Why Search? (starring Elasticsearch)

Doug Turnbull

OpenSource Connections


Hello

• Me: [email protected]

• Us: http://o19s.com
o World-class search consultants, right here in C’ville!
o Hiring passionate interns!



Why Search?

• What does a dedicated search engine do that a database doesn’t?

• Why not [MySQL|mongoDB|Cassandra | etc]?

• Why a dedicated search engine?


Why not MySQL?

• We’ve got rows of stuff in tables. E.g., for SciFi StackExchange, we’ve stored ~20K posts:


PostID  UserId  CreationDate             ViewCount  Body
0       1       2011-01-11T20:52:46.753  124        What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?
1       2       2013-02-01T12:44:46.525  525        Been meaning to read the Foundation Series, what should I read first?

Why not MySQL?

• Our mission: Find all the “Darth Vader” in SciFi StackExchange Posts!


P  U  C  V  Body                                                                                   Result
0  1  2  1  What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?  Found!
1  2  2  5  Been meaning to read the Foundation Series, what should I read first?                 Missing!

Why not MySQL – SQL LIKE?

• SQL “LIKE” operator – scan all rows for a specific wildcard match

SELECT * FROM posts WHERE body LIKE "%darth vader%"


• Performs a table scan: every row is checked in turn (Match? Match? Match? Match?)

• Approx. 300 ms to search a measly 20K docs! (What if we had 20 million?)
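A minimal sketch of why the scan is slow, assuming a small in-memory list as a hypothetical stand-in for the posts table (this is not how MySQL is implemented, and case-insensitive matching is assumed here). The point is that every row gets its own substring check, so the work grows with the number of rows.

```python
# Hypothetical in-memory stand-in for the posts table (not a real database).
posts = [
    {"PostID": 0, "Body": "What exactly did Obiwan know about Anakin and "
                          "Darth Vader before a New Hope started?"},
    {"PostID": 1, "Body": "Been meaning to read the Foundation Series, "
                          "what should I read first?"},
]


def like_scan(rows, needle):
    """Mimic body LIKE '%needle%' by testing every row in turn."""
    needle = needle.lower()  # assumption: case-insensitive comparison
    matches = []
    for row in rows:  # one substring check per row: a full table scan
        if needle in row["Body"].lower():
            matches.append(row["PostID"])
    return matches


print(like_scan(posts, "darth vader"))  # [0]
```

With 20K rows the loop body runs 20K times per query; with 20 million rows it runs 20 million times.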

SQL LIKE – other problems

• Can’t search for words out of order:

SELECT * FROM posts WHERE body LIKE "%vader, darth%"
-> 0 results

• Can’t search for alternate forms of a word:

SELECT * FROM posts WHERE body LIKE "%kittie pictures%"

SELECT * FROM posts WHERE body LIKE "%kitteh pictures%"
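Both problems can be reproduced with a plain substring test. This is only a sketch: the `like` helper is a hypothetical stand-in for a `%pattern%` match, and the "kitteh" sentence is a made-up example document.

```python
def like(text, pattern):
    """Hypothetical stand-in for text LIKE '%pattern%' (case-insensitive)."""
    return pattern.lower() in text.lower()


body = ("What exactly did Obiwan know about Anakin and Darth Vader "
        "before a New Hope started?")

in_order = like(body, "darth vader")        # words in the stored order
out_of_order = like(body, "vader, darth")   # same words, different order
# Made-up document that uses a different spelling than the query.
variant = like("Cute kitteh pictures", "kittie pictures")

print(in_order, out_of_order, variant)  # True False False
```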


SQL LIKE – other problems

• No ranking of results – given these two docs:

Doc 1: One might ask how none of the Jedi at Qui-Gon's funeral noticed that there was a Dark Lord of the Sith standing right behind them. Darth Vader and Obi-Wan only noticed each other when on the same station … It's apparently hard to pick up another force-user without knowing he or she is there…

Doc 2: I seem to remember a novel, I think it was Dark Lord: The Rise of Darth Vader, that addressed this. It made the assertion that while Darth Vader had lost both hands, he was still as formidable, in the force sense,

- Directly about Darth Vader

- Darth Vader is a side topic here

Which should come first?

SQL LIKE | CTRL+F | grep is

1. Extremely Slow

2. Not fuzzy -- Needs exact literal matches, no fuzziness!

3. Unranked -- Simply says y/n whether there is a match


Search needs to be

1. FAST! A data structure that can efficiently take search terms and return a set of documents

2. FUZZY! A way to record positional and fuzzy modifications to text to assist matching

3. FRUITFUL! Relevant documents bubble to the top.


Let’s play with an implementation

• Lucene -> Elasticsearch


• Lucene, 1999, by Doug Cutting
o Java library for search

• Solr, 2006, Yonik Seeley
o First to put Lucene behind an HTTP interface
o Still going strong

• Elasticsearch, 2010, Shay Banon
o Alternative implementation
o Extremely REST-y

• Your database’s full-text search features
o MySQL, for example, has a FULLTEXT index
o Works for trivial cases, not the path of wisdom

Elasticsearch

• Create an index

curl -XPUT http://localhost:9200/stackexchange

• Index some docs!

curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
  "Body": "Darth Vader dined with Luke",
  "Title": "..."
}'




What is being built?

The answer can be found in your textbook…


Book index:
• Topics -> page no.
• Very efficient tool – compare to scanning the whole book!

Lucene uses an index:
• Tokens => document ids:

laser => [2, 4]
light => [2, 5]
lightsaber => [0, 1, 5, 7]
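The token-to-document-ids mapping can be sketched in a few lines. This is a toy illustration rather than Lucene's actual on-disk structure, and the three documents are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy documents, keyed by document id.
docs = {
    0: "red lightsaber",
    1: "blue lightsaber",
    2: "laser light show",
}


def build_index(documents):
    """Map each token to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for token in documents[doc_id].lower().split():
            postings = index[token]
            # Record each document at most once per token.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return dict(index)


index = build_index(docs)
print(index["lightsaber"])  # [0, 1]
print(index["laser"])       # [2]
```

Looking up a token is now a single dictionary access instead of a pass over every document, which is the same trade a book index makes against reading the whole book.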

Computers == Dumb

• Humans are smart
o I see “cat” or “cats” in the back of a book, no duh – jump to page 9

• Computers are dumb
o “CAT” != “cat” – no match returned
o “cat” != “cats” – no match returned

• Hence, when indexing, normalize text to a more searchable form:

cats -> cat
fitted -> fit
alumnus -> alumnu


Normalization aka Text Analysis

• Raw input -> filtered (char filter)
o Darth Vader dined with Luke
o Darth Vader dined with Luke

• Tokenized
o Darth Vader dined with Luke
o [Darth] [Vader] [dined] [with] [Luke]

• Token filters (lowercased, synonyms applied, pointless words removed)
o [darth] [vader] [dine] [luke]

• Most importantly: this is highly configurable
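The three stages can be sketched as a small pipeline. Everything here is an illustrative assumption: the tag-stripping char filter, the one-word stopword list, and the tiny stem lookup table (a stand-in for a real stemmer) are hypothetical and are not Lucene's analyzers.

```python
import re

STOPWORDS = {"with"}                     # hypothetical stopword list
STEMS = {"dined": "dine", "cats": "cat"}  # stand-in for a real stemmer


def char_filter(text):
    """Character filter: here, strip HTML-like tags from the raw input."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text):
    """Tokenizer: split the text into runs of word characters."""
    return re.findall(r"\w+", text)


def token_filters(tokens):
    """Token filters: lowercase, drop stopwords, then apply the stem table."""
    lowered = [token.lower() for token in tokens]
    kept = [token for token in lowered if token not in STOPWORDS]
    return [STEMS.get(token, token) for token in kept]


def analyze(text):
    return token_filters(tokenize(char_filter(text)))


print(analyze("Darth Vader dined with Luke"))
# ['darth', 'vader', 'dine', 'luke']
```

Because each stage is a separate function, swapping in a different tokenizer or filter list changes what ends up in the index, which is the sense in which analysis is configurable.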


Normalization aka Text Analysis


curl -XGET 'http://localhost:9200/_analyze?analyzer=snowball' -d 'Darth Vader dined with Luke'

{
  "tokens": [
    { "end_offset": 5, "position": 1, "start_offset": 0, "token": "darth", "type": "<ALPHANUM>" },
    { "end_offset": 11, "position": 2, "start_offset": 6, "token": "vader", "type": "<ALPHANUM>" },
    { "end_offset": 17, "position": 3, "start_offset": 12, "token": "dine", "type": "<ALPHANUM>" },
    { "end_offset": 27, "position": 5, "start_offset": 23, "token": "luke", "type": "<ALPHANUM>" }
  ]
}

What is being built?


field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>

Built from doc 1:

curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
  "Body": "Darth Vader dined with Luke",
  "Title": "..."
}'

and from doc 2, whose body is:

{
  "Body": "We love Darth",
  "Title": "..."
}




Ranking



term darth
  doc 1 <metadata>
  doc 2 <metadata>

Can we store anything here (in <metadata>) to help decide how relevant this term is for this doc?

Yes!

• Term frequency
o How much “darth” is in this doc?

• Position within document
o Helps when we search for the phrase “darth vader”
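One way to picture this per-document metadata is a postings map that stores token positions, from which term frequency falls out as the number of positions. This is a simplified sketch, not Lucene's format: the token lists are assumed to be already analyzed, and positions are plain 0-based offsets into those lists.

```python
from collections import defaultdict

# Assumption: both documents have already been through text analysis.
analyzed_docs = {
    1: ["darth", "vader", "dine", "luke"],
    2: ["we", "love", "darth"],
}


def build_postings(docs):
    """term -> {doc_id: [positions]}; term frequency is len(positions)."""
    postings = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            postings[token].setdefault(doc_id, []).append(position)
    return postings


def phrase_match(postings, first, second):
    """Ids of docs where `second` appears immediately after `first`."""
    hits = []
    second_postings = postings.get(second, {})
    for doc_id, positions in postings.get(first, {}).items():
        following = set(second_postings.get(doc_id, []))
        if any(position + 1 in following for position in positions):
            hits.append(doc_id)
    return sorted(hits)


postings = build_postings(analyzed_docs)
print(postings["darth"])                        # {1: [0], 2: [2]}
print(phrase_match(postings, "darth", "vader"))  # [1]
```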



Query Documents

• When did Darth Vader and Luke have dinner?


curl -X POST "http://localhost:9200/stackexchange/_search?pretty=true" -d '{
  "query": {
    "match": {
      "Body": "luke darth dinner"
    }
  }
}'

User Query

What happens when we query?


User query: luke darth dinner

1. Analysis: [luke] [darth] [dine]

2. How to consult the index for matches? Look up each term: [darth], [dine], ...

3. Score for [darth] docs (1 and 2)

4. Score for [dine] docs (1)

5. Return sorted docs to client
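The query flow above can be sketched end to end. The scoring rule here (sum the term frequencies of the matching terms) is a deliberately simple assumption for illustration and is not Lucene's actual scoring formula; the documents and query terms are assumed to be already analyzed.

```python
from collections import Counter, defaultdict

# Assumption: documents and query have already been through text analysis.
analyzed_docs = {
    1: ["darth", "vader", "dine", "luke"],
    2: ["we", "love", "darth"],
}
query_terms = ["luke", "darth", "dine"]


def build_tf_index(docs):
    """term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, count in Counter(tokens).items():
            index[term][doc_id] = count
    return index


def search(index, terms):
    """Look up each query term, add up per-doc scores, sort best first."""
    scores = Counter()
    for term in terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf  # hypothetical scoring: sum of frequencies
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


tf_index = build_tf_index(analyzed_docs)
print(search(tf_index, query_terms))  # [(1, 3), (2, 1)]
```

Doc 1 matches all three query terms and doc 2 matches only "darth", so doc 1 sorts first. Only the postings for the query terms are touched; documents that contain none of them are never visited.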

So Elasticsearch!


• FAST!
o Inverted index data structure is blazing fast
o Lucene is probably the most tuned implementation

• FUZZY!
o We use analysis to normalize text to canonical forms
o We can use positional information when querying (not shown here)

• FRUITFUL!
o Relevant documents are scored based on relative term frequency

BUT WAIT THERE’S MORE

• Many non-traditional applications of “search”
o Rank file directory by proximity to current directory
o Geographic-aided search, rank based on distance and search relevancy
o Q & A systems – Watson has a ton of Lucene
o Log aggregation, e.g. Kibana -- because in Lucene everything is indexed!

• And many features!
o Spellchecking
o Facets
o More-like-this document


QUESTIONS?
