59
Elasticsearch - Speed is key Alexander Reelsen @spinscale

Speed is Key: Elasticsearch under the Hood

Embed Size (px)

Citation preview

Page 1: Speed is Key: Elasticsearch under the Hood

Elasticsearch - Speed is key

Alexander Reelsen @spinscale

Page 2: Speed is Key: Elasticsearch under the Hood

www.elastic.co2

• Introduction • Elasticsearch

• Speed by Example • Search • Aggregations • Operating System • Distributed aspects

Agenda

Page 3: Speed is Key: Elasticsearch under the Hood

www.elastic.co3

• Me • joined in march 2013 • working on Elasticsearch

& Shield • Interested in all things

scale & search

About

• Elastic • Founded in 2012 • Behind: Elasticsearch, Logstash,

Kibana, Marvel, Shield, ES for Hadoop, Elasticsearch clients

• Support subscriptions • Public & private trainings

Page 4: Speed is Key: Elasticsearch under the Hood

www.elastic.co4

• Me • joined in march 2013 • working on Elasticsearch

& Shield • Interested in all things

scale & search

About

• Elastic • Founded in 2012 • Behind: Elasticsearch, Logstash,

Kibana, Marvel, Shield, ES for Hadoop, Elasticsearch clients

• Support subscriptions • Public & private trainings

Page 5: Speed is Key: Elasticsearch under the Hood

www.elastic.co5

Elasticsearch - High level overview

Introduction

Page 6: Speed is Key: Elasticsearch under the Hood

www.elastic.co6

an open source, distributed, scalable,

highly available, document-oriented, RESTful

full text search engine

with real-time search and analytics capabilities

Elasticsearch is...

Page 7: Speed is Key: Elasticsearch under the Hood

www.elastic.co7

an open source, distributed, scalable, highly available, document-oriented,

RESTful, full text search engine with real-time search and analytics capabilities

Elasticsearch is...

Apache 2.0 License https://www.apache.org/licenses/LICENSE-2.0

Page 8: Speed is Key: Elasticsearch under the Hood

www.elastic.co8

an open source, distributed, scalable, highly available, document-oriented,

RESTful, full text search engine with real-time search and analytics capabilities

Elasticsearch is...

Page 9: Speed is Key: Elasticsearch under the Hood

www.elastic.co9

an open source, distributed, scalable, highly available, document-oriented,

RESTful, full text search engine with real-time search and analytics capabilities

Elasticsearch is...

Page 10: Speed is Key: Elasticsearch under the Hood

www.elastic.co10

an open source, distributed, scalable, highly available, document-oriented,

RESTful, full text search engine with real-time search and analytics capabilities

Elasticsearch is...

Page 11: Speed is Key: Elasticsearch under the Hood

www.elastic.co11

an open source, distributed, scalable, highly available, document-oriented,

RESTful, full text search engine with real-time search and analytics capabilities

Elasticsearch is...

{ "name" : "Craft" "geo" : { "city" : "Budapest", "lat" : 47.49, "lon" : 19.04 } }

Source:  http://json.org/

Page 12: Speed is Key: Elasticsearch under the Hood

www.elastic.co12

an open source, distributed, scalable, highly available, document-oriented,

RESTful, full text search engine with real-time search and analytics capabilities

Elasticsearch is...

Source:  https://httpwg.github.io/asset/http.svg

Page 13: Speed is Key: Elasticsearch under the Hood

www.elastic.co13

an open source, distributed, scalable, highly available, document-oriented,

RESTful, full text search engine with real-time search and analytics capabilities

Elasticsearch is...

Page 14: Speed is Key: Elasticsearch under the Hood

www.elastic.co14

an open source, distributed, scalable, highly available, document-oriented,

RESTful, full text search engine with real-time search and analytics capabilities

Elasticsearch is...

Page 15: Speed is Key: Elasticsearch under the Hood

www.elastic.co15

Getting up and running... is easy

# wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.5.1.zip

# unzip elasticsearch-1.5.1.zip # cd elasticsearch-1.5.1

# ./bin/elasticsearch

# curl http://localhost:9200

Page 16: Speed is Key: Elasticsearch under the Hood

www.elastic.co16

Scaling

Page 17: Speed is Key: Elasticsearch under the Hood

www.elastic.co17

Cluster: A collection of nodes

node

cluster

Page 18: Speed is Key: Elasticsearch under the Hood

www.elastic.co18

Cluster: A collection of nodes

node

cluster

node

Page 19: Speed is Key: Elasticsearch under the Hood

www.elastic.co

cluster

19

Cluster: A collection of nodes

node node

Page 20: Speed is Key: Elasticsearch under the Hood

www.elastic.co20

Cluster: A collection of nodes

node node node node

cluster

Page 21: Speed is Key: Elasticsearch under the Hood

www.elastic.co21

Shards: Unit of scale

a0 a1 a2a3

# curl -X PUT http://localhost:9200/a -d ' { "index.number_of_shards" : 4 }'

Page 22: Speed is Key: Elasticsearch under the Hood

www.elastic.co22

Shards: Unit of scale

a0 a1 a2 a3

# curl -X PUT http://localhost:9200/a -d ' { "index.number_of_shards" : 4 }'

Page 23: Speed is Key: Elasticsearch under the Hood

www.elastic.co23

Shards: Unit of scale

a0 a1 a2 a3

# curl -X PUT http://localhost:9200/a/_settings -d ' { "index.number_of_replicas" : 1 }'

Page 24: Speed is Key: Elasticsearch under the Hood

www.elastic.co24

Replication

a0 a1 a2 a3

# curl -X PUT http://localhost:9200/a/_settings -d ' { "index.number_of_replicas" : 1 }'

a0a3 a2a1

Page 25: Speed is Key: Elasticsearch under the Hood

www.elastic.co25

Search

Page 26: Speed is Key: Elasticsearch under the Hood

www.elastic.co26

CRUD

PUT books/book/1 { "name" : "Elasticsearch - The definitive guide", "authors" : [ "Clinton Gormley", "Zachary Tong" ], "pages" : 722, "published_at" : "2015/01/31" }

Page 27: Speed is Key: Elasticsearch under the Hood

www.elastic.co27

CRUD

PUT books/book/1 { "name" : "Elasticsearch - The definitive guide", "authors" : [ "Clinton Gormley", "Zachary Tong" ], "pages" : 722, "published_at" : "2015/01/31" }

GET books/book/1

Page 28: Speed is Key: Elasticsearch under the Hood

www.elastic.co28

CRUD

PUT books/book/1 { "name" : "Elasticsearch - The definitive guide", "authors" : [ "Clinton Gormley", "Zachary Tong" ], "pages" : 722, "published_at" : "2015/01/31" }

GET books/book/1

DELETE books/book/1

Page 29: Speed is Key: Elasticsearch under the Hood

www.elastic.co29

CRUD

PUT books/book/1 { "name" : "Elasticsearch - The definitive guide", "authors" : [ "Clinton Gormley", "Zachary Tong" ], "pages" : 722, "published_at" : "2015/01/31" }

GET books/book/1

DELETE books/book/1

GET books/book/_search?q=elasticsearch

Page 30: Speed is Key: Elasticsearch under the Hood

www.elastic.co30

Searching

GET books/book/_search { "query" : { "filtered" : { "query" : { "match" : { "name" : "elasticsearch" }}, "filter" : { "range" : { "published_at" : { "gte" : "now-1y" } } } }} }

Page 31: Speed is Key: Elasticsearch under the Hood

www.elastic.co31

Searching

GET books/book/_search { "query" : { "filtered" : { "query" : { "match" : { "name" : "elasticsearch" }}, "filter" : { "range" : { "published_at" : { "gte" : "now-1y" } } } }} }

{ "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.15342641, "hits": [ { "_index": "books", "_type": "book", "_id": "1", "_score": 0.15342641, "_source": { "name": "Elasticsearch - The definitive guide", "authors": [ "Clinton Gormley", "Zachary Tong" ], "pages": 722, "category": "search" "published_at": "2015/01/31", } } ] } }

Page 32: Speed is Key: Elasticsearch under the Hood

www.elastic.co32

Searching

GET books/book/_search { "query" : { "filtered" : { "query" : { "match" : { "name" : "elasticsearch" }}, "filter" : { "range" : { "published_at" : { "gte" : "now-1y" } } } }}, "aggs" : { "category" : { "terms" : { "field" : "category" } } } }

Page 33: Speed is Key: Elasticsearch under the Hood

www.elastic.co

GET books/book/_search { "query" : { "filtered" : { "query" : { "match" : { "name" : "elasticsearch" }}, "filter" : { "range" : { "published_at" : { "gte" : "now-1y" } } } }}, "aggs" : { "category" : { "terms" : { "field" : "category" } } } }

33

Searching

{ "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.15342641, "hits": [ ... ] }, "aggregations": { "category": { "buckets": [ { "key": "search", "doc_count": 1 }, { ... } ] } } }

Page 34: Speed is Key: Elasticsearch under the Hood

www.elastic.co34

• Hits all relevant shards • Searches for top-N results per shard • Reduces to top-N total • Gets top-N documents/data from relevant shards • Returns data to requesting client

Search

Page 35: Speed is Key: Elasticsearch under the Hood

www.elastic.co35

• Lucene is doing the heavy lifting • A single shard is a Lucene index • Each field is its own inverted index and can be searched in

Search on a single shard

term docid

clinton 1

gormley 1

tong 1

zachary 1

Page 36: Speed is Key: Elasticsearch under the Hood

www.elastic.co36

• It is very hard to reconstruct the original data from the inverted index

• Solution: Just store the whole document in its own field and retrieve it, when returning data to the client

Example: Return original JSON in search response

{ "name" : "Elasticsearch - The definitive guide", "authors" : [ "Clinton Gormley", "Zachary Tong" ], "pages" : 722, "published_at" : "2015/01/31" }

_source

Page 37: Speed is Key: Elasticsearch under the Hood

www.elastic.co

Elasticsearch - The definitive guide Clinton Gormley Zachary Tong 722 2015/01/31

37

Example: _all field

{ "name" : "Elasticsearch - The definitive guide", "authors" : [ "Clinton Gormley", "Zachary Tong" ], "pages" : 722, "published_at" : "2015/01/31" }

_all

Page 38: Speed is Key: Elasticsearch under the Hood

www.elastic.co

Elasticsearch - The definitive guide Clinton Gormley Zachary Tong

38

Example: copy_to field (name & authors)

{ "name" : "Elasticsearch - The definitive guide", "authors" : [ "Clinton Gormley", "Zachary Tong" ], "pages" : 722, "published_at" : "2015/01/31" }

copy_to

Page 39: Speed is Key: Elasticsearch under the Hood

www.elastic.co39

• Filters do not contribute to score & can be cached using a BitSet • range filter for a date/price • term filter for a category • geo filter for a bounding box

Search: Using filters

0 1 1 0 0 1 { "term" : { "category" : "search" } }

0 1 0 0 1 0 { "term" : { "category" : "reduced" } }

Page 40: Speed is Key: Elasticsearch under the Hood

www.elastic.co40

• Problem: How to search in an inverted index for non-existing fields (exists & missing filter)?

• Costly: Need to merge postings lists of all existing terms (expensive for high-cardinality fields!)

• Solution: Index document field names under _field_names

Filters: Missing fields

Page 41: Speed is Key: Elasticsearch under the Hood

www.elastic.co41

Aggregations

Page 42: Speed is Key: Elasticsearch under the Hood

www.elastic.co42

• Aggregations: Buckets & metrics

• Aggregations cannot make use of the inverted index

• Meet Fielddata: Uninverting the index

• Inverted index: Maps term to document id • Fielddata: Maps document id to terms

Aggregations

Page 43: Speed is Key: Elasticsearch under the Hood

www.elastic.co43

Aggregations

docid term

1 Clinton Gormley, Zachary Tong

Inverted Index

Fielddata

term docid

clinton 1

gormley 1

tong 1

zachary 1

Page 44: Speed is Key: Elasticsearch under the Hood

www.elastic.co44

Aggregations

docid term

1 Clinton Gormley, Zachary Tong

Inverted Index

Fielddata

term docid

Clinton Gormley 1

Zachary Tong 1

Page 45: Speed is Key: Elasticsearch under the Hood

www.elastic.co45

• Fielddata is an in-memory data structure, lazily constructed

• Easy to go OOM (wrong field or too many documents)

• Solution: • circuit breaker • doc_values: index-time data structure, no heap, uses the file

system cache, better compression

Aggregations: Fielddata

Page 46: Speed is Key: Elasticsearch under the Hood

www.elastic.co46

• Problem: Count distinct elements

• Naive: Load all data into a set, then check the size (distributed?)

• Solution: cardinality Aggregation, that uses HyperLogLog++

• configurable precision, allows to trade memory for accuracy • excellent accuracy on low-cardinality sets • fixed memory usage: no matter if there are tens or billions of unique values, memory usage only

depends on configured precision

Aggregations: Probabilistic data structures

Page 47: Speed is Key: Elasticsearch under the Hood

www.elastic.co47

• Problem: Calculate percentiles

• Naive: Maintain a sorted list of all values

• Solution: percentiles Aggregation, that uses T-Digest

• extreme percentiles are more accurate • small sets can be up to 100% accurate • while values are added to a bucket, the algorithm trades accuracy for memory savings

Aggregations: Probabilistic data structures

Page 48: Speed is Key: Elasticsearch under the Hood

www.elastic.co48

Operating system & Hardware

Page 49: Speed is Key: Elasticsearch under the Hood

www.elastic.co49

• CPU Indexing, searching, highlighting

• I/O Indexing, searching, merging

• Memory Aggregation, indices

• Network Relocation, Snapshot & Restore

Elasticsearch can easily max out...

Page 50: Speed is Key: Elasticsearch under the Hood

www.elastic.co50

• CPU: Threadpools are sized on number of cores

• Disk: SSD

• Memory: ∞

• Network: GbE or better

Hardware

Page 51: Speed is Key: Elasticsearch under the Hood

www.elastic.co51

• file system cache

• file handles

• memory locking: bootstrap.mlockall

• dont swap, no OOM killer

Operating system

Page 52: Speed is Key: Elasticsearch under the Hood

www.elastic.co52

Distributed aspects

Page 53: Speed is Key: Elasticsearch under the Hood

www.elastic.co53

• The network is reliable • Latency is zero • Bandwidth is infinite • The network is secure

Fallacies of distributed computing

• Topology doesn't change • There is one administrator • Transport cost is zero • The network is homogeneous

by  Peter  Deutsch  https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing

Page 54: Speed is Key: Elasticsearch under the Hood

www.elastic.co54

Wrapup

Page 55: Speed is Key: Elasticsearch under the Hood

www.elastic.co55

• Speed is key!

• Search is a tradeoff: Query time vs. index time

• Benchmark your use-case • http://benchmarks.elasticsearch.org/

Summary

Page 56: Speed is Key: Elasticsearch under the Hood

www.elastic.co56

• Automatic I/O throttling • Clusterstate incremental updates • Faster recovery • Aggregations 2.0 • Merge queries and filters • Reindex API • Changes API • Expression scripting engine

Elasticsearch 2.x

Page 57: Speed is Key: Elasticsearch under the Hood

www.elastic.co57

• Speed improvements in queries (must_not, sloppy phrase) • Automated caching • BitSet compression vastly improved (roaring bitsets) • Index compression (on disk + memory) • Indexing performance (adaptive merge throttling, SSD detection) • Index safety: atomic commits, segment commit identifiers, verify

integrity at merge • ...

Lucene 5.x

Page 58: Speed is Key: Elasticsearch under the Hood

www.elastic.co58

Alexander Reelsen [email protected]

@spinscale

Thanks for listening! Questions?

We’re  hiring  https://www.elastic.co/about/careers  

We’re  helping  https://www.elastic.co/subscriptions

Page 59: Speed is Key: Elasticsearch under the Hood

www.elastic.co59

http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-field-names-field.html http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html#copy-to http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html http://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html http://www.elastic.co/guide/en/elasticsearch/guide/current/percentiles.html http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-percentile-aggregation.html http://www.elastic.co/elasticon/2015/sf/elasticsearch-architecture-amusing-algorithms-and-details-on-data-structures/ http://speakerdeck.com/elastic/all-about-aggregations http://www.elastic.co/elasticon/2015/sf/updates-from-lucene-land http://speakerdeck.com/elastic/resiliency-in-elasticsearch-and-lucene http://www.elastic.co/elasticon/2015/sf/level-up-your-clusters-upgrading-elasticsearch http://speakerdeck.com/elasticsearch/maintaining-performance-in-distributed-systems

References