beiske
DESCRIPTION
Elasticsearch easily lets you develop amazing things, and it has gone to great lengths to make Lucene's features readily available in a distributed setting. However, when it comes to running Elasticsearch in production, you still have a fairly complicated system on your hands: a system with high demands on network stability, a huge appetite for memory, and a system that assumes all users are trustworthy. This talk will cover some of the lessons we've learned from securing and herding hundreds of Elasticsearch clusters.
Who?
Senior software engineer at Found AS
Working with Elasticsearch for 2 years
Herding hundreds of Elasticsearch clusters
Agenda
• Anti-patterns
• Memory / Resource Usage
• Distributed problems
• Security
• Client concerns
• Changing a cluster
found.no/foundation
Snapshot / Restore
Circuit breakers
Document values
Aggregations
Distributed percolation
Suggesters
…
Anti-Patterns
Arbitrary Keys
• “Schema Free”
• One field per value
• Ever-growing cluster state
acls:
  1234: READ
  42: WRITE
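One possible way around the "one field per value" problem, sketched here under the assumption that the ACL entries can be stored as a list of objects with fixed field names. The names `id` and `permission` are hypothetical and do not come from the talk.

```python
def acls_to_entries(acls):
    """Turn {1234: "READ", 42: "WRITE"} into a list of fixed-shape objects.

    The field names "id" and "permission" are illustrative assumptions.
    Every document then uses the same two fields, so the mapping (and
    with it the cluster state) stops growing with each new key.
    """
    return [
        {"id": str(key), "permission": value}
        for key, value in sorted(acls.items(), key=lambda kv: str(kv[0]))
    ]


doc = {"acls": acls_to_entries({1234: "READ", 42: "WRITE"})}
```

In Elasticsearch a list like this would typically need a nested mapping so that each `id` stays paired with its own `permission` at query time.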
Heavy Updating
• Update = Delete + Reindex
• Be careful with counters
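Because every update is a delete plus a reindex, incrementing a counter once per event rewrites the whole document each time. A minimal sketch of one mitigation, assuming it is acceptable to buffer increments in the client and send one update per document per flush. `CounterBuffer` is a hypothetical helper, not something from the talk.

```python
from collections import Counter


class CounterBuffer:
    """Accumulate increments in memory and emit one update per document."""

    def __init__(self):
        self._pending = Counter()

    def increment(self, doc_id, amount=1):
        # Only touches local memory; nothing is sent to the cluster here.
        self._pending[doc_id] += amount

    def drain(self):
        """Return {doc_id: total_increment} and reset the buffer."""
        pending, self._pending = dict(self._pending), Counter()
        return pending


buffer = CounterBuffer()
for _ in range(1000):
    buffer.increment("page-1")
buffer.increment("page-2", 5)
updates = buffer.drain()  # two updates to send instead of 1001
```

The trade-off is that buffered increments are lost if the client process dies before draining.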
Slow queries
• WHERE foo ILIKE '%bar%'
• {"query_string": {"query": "foo:*bar*"}}
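A minimal sketch of guarding against the leading-wildcard case, assuming the `allow_leading_wildcard` option of the `query_string` query. `safe_query_string` is a hypothetical helper name.

```python
def safe_query_string(user_input):
    """Build a query_string query that refuses leading wildcards.

    With allow_leading_wildcard set to false, a query such as
    "foo:*bar*" should be rejected at parse time instead of scanning
    every term in the field.
    """
    return {
        "query_string": {
            "query": user_input,
            "allow_leading_wildcard": False,
        }
    }


query = safe_query_string("foo:bar*")
```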
Arbitrary searches
query:
  filtered:
    filter:
      term:
        user_id: 42
    query: [user's query here]
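The wrapper on this slide can be sketched as a request body like the one below, using the `filtered` query syntax shown. `scoped_query` and the `title` field are hypothetical. Note that the filter only restricts which documents can match; it does nothing to limit how expensive the user's query is, which is presumably why this appears among the anti-patterns.

```python
def scoped_query(user_id, user_query):
    """Wrap a user-supplied query in a term filter on user_id."""
    return {
        "query": {
            "filtered": {
                "filter": {"term": {"user_id": user_id}},
                # Passed through untouched: this is the risky part.
                "query": user_query,
            }
        }
    }


body = scoped_query(42, {"match": {"title": "elasticsearch"}})
```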
Time Bomb
Memory
• Field caches
• Filter caches
• Page caches
• Aggregations
• Index building
Page Cache
• Keeping index pages in memory
• Can’t have too much
• Outgrow: Gradual slowdown
Heap Space
• Memory used by Elasticsearch process
• Field / Filter caches
• Aggregations
Time Bomb
OutOfMemoryError
Woah there
I ate all the memories
Your cluster may or may not work any more
OutOfMemory
• Growing too big
• Selecting too large a time span in Kibana
• Document ingestion peak
Preventing OOMs
• Have enough memory :-)
• Understand your search’s memory profile
• Bulk / Circuit breaker settings
• Monitoring
• Document values
Marvel ( /_stats )
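For monitoring without Marvel, the same numbers can be read from the `/_stats` response directly. A minimal sketch, assuming the response exposes `fielddata` and `filter_cache` sections with a `memory_size_in_bytes` value under `_all.total`. Those field names may differ between versions, and the sample payload below is made up for illustration.

```python
def cache_memory_bytes(stats):
    """Extract cache memory usage from a parsed /_stats response."""
    total = stats["_all"]["total"]
    return {
        name: total[name]["memory_size_in_bytes"]
        for name in ("fielddata", "filter_cache")
        if name in total
    }


# Illustrative payload only; real responses contain many more sections.
sample = {
    "_all": {
        "total": {
            "fielddata": {"memory_size_in_bytes": 1048576, "evictions": 0},
            "filter_cache": {"memory_size_in_bytes": 2048, "evictions": 3},
        }
    }
}
usage = cache_memory_bytes(sample)
```

Tracking these values over time shows how close the caches are pushing the heap toward its limit.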
Document Values
"my_field": {
  "type": "string",
  "fielddata": { "format": "doc_values" }
}
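To show where the fragment above would sit, here is a sketch of a full index-creation body. The type name `my_type` is hypothetical, and the `"index": "not_analyzed"` line is an added assumption: as far as I know, doc values on string fields were only supported for non-analyzed strings in this generation of Elasticsearch.

```python
import json


def doc_values_string(field_name):
    """Mapping for a string field whose field data lives on disk."""
    return {
        field_name: {
            "type": "string",
            "index": "not_analyzed",  # assumption, see lead-in
            "fielddata": {"format": "doc_values"},
        }
    }


body = {"mappings": {"my_type": {"properties": doc_values_string("my_field")}}}
print(json.dumps(body, indent=2))
```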
Sizing
• Test, don’t guess
• Start big, scale down
• Index, search, monitor
Glitch Meltdown
• Tie-breaker can be a cheap master-node
• Applies to data centers / availability zones too
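The reason a tie-breaker helps can be sketched with the usual majority rule, which is what the `discovery.zen.minimum_master_nodes` setting is normally set to. The function names are hypothetical.

```python
def minimum_master_nodes(master_eligible):
    """Smallest number of master-eligible nodes that forms a majority."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def survives_partition(master_eligible, reachable):
    """True if the reachable side of a partition can still elect a master."""
    return reachable >= minimum_master_nodes(master_eligible)
```

With two master-eligible nodes the quorum is two, so losing either node (or the link between them) stops the cluster. Adding a third, cheap master-only node keeps the quorum at two while tolerating one failure.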
Data-only nodes
Master-only nodes
Jepsen
• Kyle Kingsbury’s series on distributed systems
• Distributed systems are hard
• aphyr.com
Security
• “Not my job!” – Elasticsearch
• That’s fine!
Dynamic Scripts
• Scoring
• Aggregations
• Updating
Dynamic Scripts
Runtime.getRuntime().exec(…)
Security
• Disable dynamic scripts
• Mind index patterns
• Even then, don’t accept arbitrary requests
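A minimal sketch of what "don't accept arbitrary requests" could look like in a proxy placed in front of Elasticsearch, assuming only search and count on a single, explicitly named index should pass. The allowed endpoints and the index-name pattern are illustrative choices, not recommendations from the talk.

```python
import re

# Plain index names only: no wildcards, no commas, no leading underscore
# (so "_all" and multi-index patterns are rejected).
ALLOWED_INDEX = re.compile(r"[a-z0-9][a-z0-9_-]*")
ALLOWED_ENDPOINTS = {"_search", "_count"}


def is_allowed(method, path):
    """Whitelist check on the HTTP method and the /index/endpoint path."""
    if method not in ("GET", "POST"):
        return False
    parts = path.strip("/").split("/")
    if len(parts) != 2:
        return False
    index, endpoint = parts
    return bool(ALLOWED_INDEX.fullmatch(index)) and endpoint in ALLOWED_ENDPOINTS
```

This only covers the URL. The request body still needs its own validation, for example to reject scripts.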
Client Concerns
• Connection pools
• Idempotent requests
• Have sane syncing/indexing strategies
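One common way to make indexing requests idempotent is to derive the document ID from the source record instead of letting Elasticsearch generate one, so that a retried request overwrites the same document rather than creating a duplicate. A minimal sketch; `document_id`, `orders-db`, and the key format are hypothetical.

```python
import hashlib


def document_id(source, record_key):
    """Deterministic document ID derived from the record's origin."""
    raw = "{}:{}".format(source, record_key).encode("utf-8")
    return hashlib.sha1(raw).hexdigest()


first = document_id("orders-db", 1001)
retry = document_id("orders-db", 1001)  # same ID, so a retry is harmless
```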
# BOOM !
Cluster changes
• Make new nodes join existing cluster
• No rolling restarts
• Easy rollback if things go bad
v1.0.0 v1.0.1
Cluster changes
• Test first
• Mind the recover_* settings
Multi-Cluster Workflows
• Snapshot/Restore
• Operations across clusters
• Swap clusters!
• Works well with good syncing strategy
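The snapshot-then-restore step of such a workflow can be sketched as two HTTP requests against the snapshot API, the first sent to the old cluster and the second to the new one. This assumes the same repository is already registered on both clusters. The repository and snapshot names are hypothetical.

```python
def snapshot_request(repository, snapshot):
    """Request to send to the old cluster; blocks until the snapshot is done."""
    path = "/_snapshot/{}/{}?wait_for_completion=true".format(repository, snapshot)
    return ("PUT", path)


def restore_request(repository, snapshot):
    """Request to send to the new cluster once the snapshot exists."""
    path = "/_snapshot/{}/{}/_restore".format(repository, snapshot)
    return ("POST", path)


steps = [
    snapshot_request("my_backup", "before-swap"),
    restore_request("my_backup", "before-swap"),
]
```

Documents indexed after the snapshot was taken still have to reach the new cluster, which is where the syncing strategy comes in.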
Misc
• Same JVM
• ulimits
• Unicast and cluster name
• SSD? noop-scheduler
@foundsays
Learn More!
found.no/foundation
@beiske