Deploying and managing Solr at Scale
Who am I?
• Anshum Gupta, Apache Lucene/Solr committer, Lucidworks employee
• Interested in search and related problems
• Working with Apache Lucene since 2006 and Solr since 2010
• Organizations I am or have been a part of:
Apache Solr has a huge install base and tremendous momentum
• The most widely used search solution on the planet
• 8M+ total downloads; 250,000+ monthly downloads
• Solr is both established and growing
• Tens of thousands of applications in production. You use Solr every day.
• 2,500+ open Solr jobs
Activity Summary (via https://www.openhub.net/p/solr)
30-day summary (Dec 06, 2014 - Jan 05, 2015):
• 135 commits, 17 contributors
12-month summary (Jan 5, 2014 - Jan 5, 2015):
• 1363 commits, 30 contributors
Getting started with Solr
• Download
• Untar/Unzip
• bin/solr start -e cloud -noprompt
• open http://localhost:8983/solr
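The steps above amount to the following at a shell (the version and archive name here are illustrative; use whichever release you downloaded):

```shell
# Illustrative release; substitute the archive you downloaded
tar xzf solr-4.10.3.tgz
cd solr-4.10.3
bin/solr start -e cloud -noprompt   # starts an example SolrCloud cluster
# then browse to http://localhost:8983/solr
```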
Recent usability improvements
• Start scripts
• Schema APIs
• Config API: register custom handlers via the API
• Status APIs, and more…
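As a sketch of what registering a handler through the Config API looks like (the handler path and class here are illustrative, and assume a collection named collection1 on a local node):

```shell
curl http://localhost:8983/solr/collection1/config \
  -H 'Content-type:application/json' \
  -d '{
    "add-requesthandler": {
      "name": "/mypath",
      "class": "solr.DumpRequestHandler"
    }
  }'
```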
SolrCloud Architecture
[Diagram: two shards, each with a leader and followers, all coordinated through a ZooKeeper ensemble]
Multiple nodes = need for coordination
Production scale?
• A real ZooKeeper ensemble, NOT the embedded ZooKeeper
• Multiple Solr nodes
• Manually run (or script) the four getting-started steps for each node?
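For reference, a standalone three-node ensemble needs a zoo.cfg like the following on each host (hostnames illustrative), plus a myid file under dataDir containing that server's number:

```properties
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
```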
Solr Scale Toolkit
• Open source!
• Fabric (Python) toolset for deploying and managing SolrCloud clusters in the cloud
• Code to support benchmark tests (Pig script for data generation / indexing, JMeter samplers)
• EC2 for now; more cloud providers coming soon via Apache libcloud
• No *need* to know Python!
The building blocks: a lot of Python!
• boto – Python API for AWS (EC2, S3, etc.)
• Fabric – Python-based tool for automating system admin tasks over SSH
• pysolr – Python library for Solr (sending commits, queries, ...)
• kazoo – Python client tools for ZooKeeper
Supporting cast:
• JMeter – run tests, generate reports
• collectd – system monitoring
• Logstash4Solr – log aggregation
• JConsole/VisualVM – monitor the JVM during indexing / queries
Overview of features
• Provisioning N machine instances in EC2
• Configuring / starting ZooKeeper (1 to n servers)
• Configuring / starting N Solr instances in cloud mode (M x N nodes)
• Integrating with Logstash4Solr and other supporting services, e.g. collectd
• Day-to-day operations on an existing cluster
Architecture
[Diagram: the Solr-Scale-Toolkit provisions a ZooKeeper ensemble (ZK hosts 1 to N), M machines each running N Solr nodes (ports 8983 to 89xx, multiple cores per node) built from a custom AMI, and a meta node running SiLK; collectd and JMX provide system monitoring of the M machines]
Provisioning cluster nodes
• Custom built AMI (one for PV instances and one for HVM instances) – Amazon Linux
• Dedicated disk per Solr node
• Launch and then poll status until they are live
• Verify SSH connectivity
• Tag each instance with a cluster ID and username
fab new_ec2_instances:test1,n=3,instance_type=m3.xlarge
Deploy ZooKeeper ensemble
• Two options to use the ensemble:
• Provision 1 to N nodes when you launch Solr cluster
• use existing named ensemble
• The Fabric command simply creates the myid files and the zoo.cfg file for the ensemble, plus some cron scripts for managing snapshots
• Basic health checking of ZooKeeper status:
• echo srvr | nc localhost 2181
fab new_zk_ensemble:zk1,n=3
Deploy SolrCloud cluster
• Uses bin/solr in Solr 4.10 to control Solr nodes
• Set system props: jetty.port, host, zkHost, JVM opts
• One or more Solr nodes per machine
• JVM mem opts dependent on instance type and # of Solr nodes per instance
• Optionally configure log4j.properties to append messages to RabbitMQ for SiLK integration
fab new_solrcloud:test1,zk=zk1,nodesPerHost=2
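Under the hood, each Solr node is started roughly like this (port, heap size, and ZooKeeper hosts are illustrative):

```shell
bin/solr start -cloud -p 8984 \
  -z zk1:2181,zk2:2181,zk3:2181 \
  -m 4g
```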
Demo
• Launch ZooKeeper Ensemble
• 3 nodes to establish quorum
• Launch SolrCloud cluster
• Create new collection and index some docs
• Run a healthcheck on the collection
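The collection step goes through the Collections API; a sketch of the equivalent request (collection name and sizing are illustrative):

```shell
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=demo&numShards=2&replicationFactor=2"
```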
Dashboards
Other useful stuff
• Patch from a local build
• fab mine: see clusters I'm running (or for other users too)
• fab kill_mine: terminate all instances I'm running
• fab ssh_to: quick way to SSH to one of the nodes in a cluster
• fab stop/recover/kill: basic commands for controlling specific Solr nodes in the cluster
• fab jmeter: execute a JMeter test plan against your cluster
• An example test plan and Java sampler are included with the source
Testing Methodology
• Transparent, repeatable results
• Ideally hoping for something owned by the community
• Synthetic docs, ~1K each on disk, mix of field types
• Data set created using code borrowed from PigMix
• English text fields generated using a Zipfian distribution
• Java 1.7u67, Amazon Linux, r3.2xlarge nodes
• Enhanced networking enabled, placement group, same AZ
• Stock Solr (cloud) 4.10
• Using custom GC tuning parameters and auto-commit settings
• Use Elastic MapReduce to generate indexing load
• As many nodes as needed to drive Solr!
Indexing performance

Cluster Size | # of Shards | # of Replicas | Reducers | Time (secs) | Docs / sec
10 | 10 | 1 | 48 | 1762 | 73,780
10 | 10 | 2 | 34 | 3727 | 34,881
10 | 20 | 1 | 48 | 1282 | 101,404
10 | 20 | 2 | 34 | 3207 | 40,536
10 | 30 | 1 | 72 | 1070 | 121,495
10 | 30 | 2 | 60 | 3159 | 41,152
15 | 15 | 1 | 60 | 1106 | 117,541
15 | 15 | 2 | 42 | 2465 | 52,738
15 | 30 | 1 | 60 | 827 | 157,195
15 | 30 | 2 | 42 | 2129 | 61,062
Indexing performance lessons
• Solr has no built-in throttling support: it will accept work until it falls over; build throttling into your indexing application logic
• Oversharding helps parallelize indexing work and gives you an easy way to add more hardware to your cluster
• GC tuning is critical
• Auto-hard commit to keep transaction logs manageable
• Auto soft-commit to see docs as they are indexed
• Replication is expensive! (Work in progress, SOLR-6816)
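Since Solr provides no built-in throttling, one way to cap client-side indexing pressure is to bound the number of in-flight batches. This is a minimal sketch, not part of the toolkit; the send_batch callable stands in for whatever actually posts documents (e.g. a pysolr add call):

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ThrottledIndexer:
    """Caps the number of batches concurrently in flight to Solr."""

    def __init__(self, send_batch, max_in_flight=4):
        self.send_batch = send_batch  # hypothetical callable that posts docs
        self.sem = threading.BoundedSemaphore(max_in_flight)
        self.pool = ThreadPoolExecutor(max_workers=max_in_flight)

    def submit(self, batch):
        # Blocks the producer when max_in_flight batches are outstanding,
        # so a slow Solr cluster naturally slows the indexing client down.
        self.sem.acquire()

        def run():
            try:
                self.send_batch(batch)
            finally:
                self.sem.release()

        return self.pool.submit(run)

    def close(self):
        self.pool.shutdown(wait=True)
```

The same back-pressure idea works with a bounded queue feeding a fixed worker pool; the point is that the cap lives in the client, since Solr itself will keep accepting work.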
Query Performance
• Still a work in progress!
• Measuring sustained QPS and 99th-percentile execution time
• Stable: ~5,000 QPS / 99th percentile at 300ms while indexing ~10,000 docs / sec
• Using the TermsComponent to build queries based on the terms in each field
• Harder to accurately simulate user queries over synthetic data
• Need a mix of faceting, paging, sorting, grouping, boolean clauses, range queries, boosting, filters (some cached, some not), etc.
• Start with one server (1 shard) to determine baseline query performance
• Look for inefficiencies in your schema and other config settings
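A request of this shape pulls the highest-frequency terms for a field via the TermsComponent, which can then seed generated queries (collection and field names are illustrative):

```shell
curl "http://localhost:8983/solr/collection1/terms?terms.fl=text&terms.limit=100&wt=json"
```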
More on query performance…
• Higher risk of full GC pauses (facets, filters, sorting)
• Use optimized data structures (DocValues) for facet / sort fields, Trie-based numeric fields for range queries, and facet.method=enum for low-cardinality fields
• Add more replicas; load-balance
• -Dhttp.maxConnections=## (default = 5; increase to accommodate more threads sending queries)
• Avoid increasing the ZooKeeper client timeout: ~15000 ms (15 seconds) is about right
• Don't just keep throwing more memory at Java! (-Xmx128G)
Roadmap
• Not just AWS
• No need for a custom AMI; configurable download paths and versions
Questions?
References
• Solr Scale Toolkit
• Blog: http://lucidworks.com/blog/introducing-the-solr-scale-toolkit/
• Podcast: http://solrcluster.podbean.com/e/tim-potter-on-the-solr-scale-toolkit/
• github: https://github.com/LucidWorks/solr-scale-tk
Connect @
http://www.twitter.com/anshumgupta
http://www.linkedin.com/in/anshumgupta/