Deploying and managing Solr at Scale
Who am I?
• Anshum Gupta, Apache Lucene/Solr committer, Lucidworks employee
• Interested in search and related problems
• Working with Apache Lucene since 2006 and Solr since 2010
• Organizations I am or have been a part of:
Apache Solr has a huge install base and tremendous momentum
• The most widely used search solution on the planet
• 8M+ total downloads; 250,000+ monthly downloads
• Solr is both established and growing
• Tens of thousands of applications in production. You use Solr every day.
• 2,500+ open Solr jobs
Activity Summary (via https://www.openhub.net/p/solr)
30-day summary (Dec 06, 2014 - Jan 05, 2015):
• 135 commits, 17 contributors
12-month summary (Jan 5, 2014 - Jan 5, 2015):
• 1363 commits, 30 contributors
Getting started with Solr
• Download
• Untar/Unzip
• bin/solr start -e cloud -noprompt
• open http://localhost:8983/solr
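The steps above amount to the following at a shell (the version and archive name here are illustrative; use whichever release you downloaded):

```shell
# Illustrative release; substitute the archive you downloaded
tar xzf solr-4.10.3.tgz
cd solr-4.10.3
bin/solr start -e cloud -noprompt   # starts an example SolrCloud cluster
# then browse to http://localhost:8983/solr
```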
Recent usability improvements
• Start scripts
• Schema APIs
• Config API: register custom handlers via the API
• Status APIs, and more…
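As a sketch of what registering a handler through the Config API looks like (the handler path and class here are illustrative, and assume a collection named collection1 on a local node):

```shell
curl http://localhost:8983/solr/collection1/config \
  -H 'Content-type:application/json' \
  -d '{
    "add-requesthandler": {
      "name": "/mypath",
      "class": "solr.DumpRequestHandler"
    }
  }'
```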
SolrCloud Architecture
[Diagram: two shards, each with a leader and followers, all coordinated through a ZooKeeper ensemble]
Multiple nodes = need for coordination
Production scale?
• A real ZooKeeper ensemble, NOT the embedded ZooKeeper
• Multiple Solr nodes
• Manually run (or script) the four getting-started steps for each node?
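For reference, a standalone three-node ensemble needs a zoo.cfg like the following on each host (hostnames illustrative), plus a myid file under dataDir containing that server's number:

```properties
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
```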
Solr Scale Toolkit
• Open source!
• Fabric (Python) toolset for deploying and managing SolrCloud clusters in the cloud
• Code to support benchmark tests (Pig script for data generation / indexing, JMeter samplers)
• EC2 for now; more cloud providers coming soon via Apache libcloud
• No *need* to know Python!
The building blocks: a lot of Python!
• boto – Python API for AWS (EC2, S3, etc.)
• Fabric – Python-based tool for automating system admin tasks over SSH
• pysolr – Python library for Solr (sending commits, queries, ...)
• kazoo – Python client tools for ZooKeeper
Supporting cast:
• JMeter – run tests, generate reports
• collectd – system monitoring
• Logstash4Solr – log aggregation
• JConsole/VisualVM – monitor the JVM during indexing / queries
Overview of features
• Provisioning N machine instances in EC2
• Configuring / starting ZooKeeper (1 to n servers)
• Configuring / starting N Solr instances in cloud mode (M x N nodes)
• Integrating with Logstash4Solr and other supporting services, e.g. collectd
• Day-to-day operations on an existing cluster
Architecture
[Diagram: the Solr-Scale-Toolkit provisions a ZooKeeper ensemble (ZK hosts 1 to N), M machines each running N Solr nodes (ports 8983 to 89xx, multiple cores per node) built from a custom AMI, and a meta node running SiLK; collectd and JMX provide system monitoring of the M machines]
Provisioning cluster nodes
• Custom built AMI (one for PV instances and one for HVM instances) – Amazon Linux
• Dedicated disk per Solr node
• Launch and then poll status until they are live
• Verify SSH connectivity
• Tag each instance with a cluster ID and username
fab new_ec2_instances:test1,n=3,instance_type=m3.xlarge
Deploy ZooKeeper ensemble
• Two options to use the ensemble:
• Provision 1 to N nodes when you launch Solr cluster
• use existing named ensemble
• The Fabric command simply creates the myid files and the zoo.cfg file for the ensemble, plus some cron scripts for managing snapshots
• Basic health checking of ZooKeeper status:
• echo srvr | nc localhost 2181
fab new_zk_ensemble:zk1,n=3
Deploy SolrCloud cluster
• Uses bin/solr in Solr 4.10 to control Solr nodes
• Set system props: jetty.port, host, zkHost, JVM opts
• One or more Solr nodes per machine
• JVM mem opts dependent on instance type and # of Solr nodes per instance
• Optionally configure log4j.properties to append messages to RabbitMQ for SiLK integration
fab new_solrcloud:test1,zk=zk1,nodesPerHost=2
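Under the hood, each Solr node is started roughly like this (port, heap size, and ZooKeeper hosts are illustrative):

```shell
bin/solr start -cloud -p 8984 \
  -z zk1:2181,zk2:2181,zk3:2181 \
  -m 4g
```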
Demo
• Launch ZooKeeper Ensemble
• 3 nodes to establish quorum
• Launch SolrCloud cluster
• Create new collection and index some docs
• Run a healthcheck on the collection
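The collection step goes through the Collections API; a sketch of the equivalent request (collection name and sizing are illustrative):

```shell
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=demo&numShards=2&replicationFactor=2"
```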
Dashboards
Other useful stuff
• Patch from a local build
• fab mine: see clusters I'm running (or for other users too)
• fab kill_mine: terminate all instances I'm running
• fab ssh_to: quick way to SSH to one of the nodes in a cluster
• fab stop/recover/kill: basic commands for controlling specific Solr nodes in the cluster
• fab jmeter: execute a JMeter test plan against your cluster
• An example test plan and Java sampler are included with the source
Testing Methodology
• Transparent, repeatable results
• Ideally hoping for something owned by the community
• Synthetic docs, ~1K each on disk, mix of field types
• Data set created using code borrowed from PigMix
• English text fields generated using a Zipfian distribution
• Java 1.7u67, Amazon Linux, r3.2xlarge nodes
• Enhanced networking enabled, placement group, same AZ
• Stock Solr (cloud) 4.10
• Using custom GC tuning parameters and auto-commit settings
• Use Elastic MapReduce to generate indexing load
• As many nodes as needed to drive Solr!
Indexing performance

Cluster Size | # of Shards | # of Replicas | Reducers | Time (secs) | Docs / sec
10 | 10 | 1 | 48 | 1762 | 73,780
10 | 10 | 2 | 34 | 3727 | 34,881
10 | 20 | 1 | 48 | 1282 | 101,404
10 | 20 | 2 | 34 | 3207 | 40,536
10 | 30 | 1 | 72 | 1070 | 121,495
10 | 30 | 2 | 60 | 3159 | 41,152
15 | 15 | 1 | 60 | 1106 | 117,541
15 | 15 | 2 | 42 | 2465 | 52,738
15 | 30 | 1 | 60 | 827 | 157,195
15 | 30 | 2 | 42 | 2129 | 61,062
Indexing performance lessons
• Solr has no built-in throttling support: it will accept work until it falls over; build throttling into your indexing application logic
• Oversharding helps parallelize indexing work and gives you an easy way to add more hardware to your cluster
• GC tuning is critical
• Auto-hard commit to keep transaction logs manageable
• Auto soft-commit to see docs as they are indexed
• Replication is expensive! (Work in progress, SOLR-6816)
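Since Solr provides no built-in throttling, one way to cap client-side indexing pressure is to bound the number of in-flight batches. This is a minimal sketch, not part of the toolkit; the send_batch callable stands in for whatever actually posts documents (e.g. a pysolr add call):

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ThrottledIndexer:
    """Caps the number of batches concurrently in flight to Solr."""

    def __init__(self, send_batch, max_in_flight=4):
        self.send_batch = send_batch  # hypothetical callable that posts docs
        self.sem = threading.BoundedSemaphore(max_in_flight)
        self.pool = ThreadPoolExecutor(max_workers=max_in_flight)

    def submit(self, batch):
        # Blocks the producer when max_in_flight batches are outstanding,
        # so a slow Solr cluster naturally slows the indexing client down.
        self.sem.acquire()

        def run():
            try:
                self.send_batch(batch)
            finally:
                self.sem.release()

        return self.pool.submit(run)

    def close(self):
        self.pool.shutdown(wait=True)
```

The same back-pressure idea works with a bounded queue feeding a fixed worker pool; the point is that the cap lives in the client, since Solr itself will keep accepting work.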
Query Performance
• Still a work in progress!
• Measuring sustained QPS and 99th-percentile execution time
• Stable: ~5,000 QPS / 99th percentile at 300ms while indexing ~10,000 docs / sec
• Using the TermsComponent to build queries based on the terms in each field
• Harder to accurately simulate user queries over synthetic data
• Need a mix of faceting, paging, sorting, grouping, boolean clauses, range queries, boosting, filters (some cached, some not), etc.
• Start with one server (1 shard) to determine baseline query performance
• Look for inefficiencies in your schema and other config settings
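A request of this shape pulls the highest-frequency terms for a field via the TermsComponent, which can then seed generated queries (collection and field names are illustrative):

```shell
curl "http://localhost:8983/solr/collection1/terms?terms.fl=text&terms.limit=100&wt=json"
```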
More on query performance…
• Higher risk of full GC pauses (facets, filters, sorting)
• Use optimized data structures (DocValues) for facet / sort fields, Trie-based numeric fields for range queries, and facet.method=enum for low-cardinality fields
• Add more replicas; load-balance
• -Dhttp.maxConnections=## (default = 5; increase to accommodate more threads sending queries)
• Avoid increasing the ZooKeeper client timeout: ~15000 ms (15 seconds) is about right
• Don't just keep throwing more memory at Java! (-Xmx128G)
Roadmap
• Not just AWS
• No need for a custom AMI; configurable download paths and versions
Questions?
References
• Solr Scale Toolkit
• Blog: http://lucidworks.com/blog/introducing-the-solr-scale-toolkit/
• Podcast: http://solrcluster.podbean.com/e/tim-potter-on-the-solr-scale-toolkit/
• github: https://github.com/LucidWorks/solr-scale-tk
Connect @
http://www.twitter.com/anshumgupta
http://www.linkedin.com/in/anshumgupta/