Understanding DSE Search by Matt Stump

DSE 4.7 Search
Matt Stump, Chief Architect/Manager for SWAT, DataStax

Webinar Housekeeping

Thank you for joining. We will begin shortly.

All attendees placed on mute

Input questions at any time using the online interface

Agenda

1 Data Locality
2 Bitmap Indexing
3 IO Path
4 Demo
5 Performance
6 Why DSE?

Hash(“some bytes”) => A Number
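A minimal sketch of how a hash turns a key into a number that decides where data lives. Assumptions: MD5 from Python's hashlib stands in for the real partitioner hash, and the three-node ring with its token values and node names is hypothetical.

```python
import bisect
import hashlib


def token_for(key: bytes) -> int:
    """Hash arbitrary bytes to a signed 64-bit number (MD5 is a stand-in hash)."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


# Hypothetical ring: each node owns the token range that ends at its token.
RING = [
    (-6_000_000_000_000_000_000, "node-a"),
    (0, "node-b"),
    (6_000_000_000_000_000_000, "node-c"),
]


def owner_of(key: bytes) -> str:
    """Return the first node whose token is >= the key's token, wrapping around."""
    tokens = [token for token, _ in RING]
    idx = bisect.bisect_left(tokens, token_for(key))
    return RING[idx % len(RING)][1]


print(token_for(b"some bytes"), owner_of(b"some bytes"))
```

The same bytes always hash to the same number, so any node can work out which node owns a key without a lookup table.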


V1 OR V2
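A small illustration of why boolean queries are cheap on bitmaps: each value maps to a bitmap with one bit per document, and OR across values is a single bitwise operation. This is only a sketch; the document ids are made up, and a Python int stands in for a real bitmap structure.

```python
def bitmap(doc_ids):
    """Build a bitmap (as a Python int) with one bit set per matching document id."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def doc_ids(bits):
    """Decode a bitmap back into a sorted list of document ids."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Hypothetical postings: which documents contain each value.
v1 = bitmap([0, 3, 5])
v2 = bitmap([3, 4])

either = v1 | v2  # V1 OR V2
both = v1 & v2    # V1 AND V2

print(doc_ids(either), doc_ids(both))
```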

Quick to Read
Expensive to Update

Near Real Time is Expensive

Use 32 vnodes in DSE 4.7.1

{
  'asin': '0007148089',
  'title': "Blood and Roses: The Tumultuous Wars of the Roses",
  'price': 5.98,
  'imUrl': 'http://ecx.images-amazon.com/images/I/518p8d64F8L.jpg',
  'related': {
    'also_bought': ['0061430765', '0061430773', 'B00A4E8E78'],
    'buy_after_viewing': ['0061430773', '0345404335', 'B00A4E8E78', '0975126407']
  },
  'salesRank': {'Books': 326205},
  'categories': [['Books']]
}

CREATE TABLE IF NOT EXISTS amazon.metadata (
  asin text,
  title text,
  imurl text,
  price double,
  categories set<text>,
  also_bought set<text>,
  buy_after_viewing set<text>,
  PRIMARY KEY(asin)
);

CREATE TABLE IF NOT EXISTS amazon.rank (
  asin text,
  category text,
  rank int,
  PRIMARY KEY(asin, category)
);
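One way the sample record could be split across the two tables, as a hedged sketch: the nested 'related' lists become sets on the metadata row, and each 'salesRank' entry becomes its own rank row. The function name to_rows and the choice to flatten 'categories' into a single set are assumptions, not something the slides specify.

```python
def to_rows(product):
    """Split one product record into a metadata row and a list of rank rows."""
    related = product.get("related", {})
    metadata = {
        "asin": product["asin"],
        "title": product.get("title"),
        "imurl": product.get("imUrl"),
        "price": product.get("price"),
        # 'categories' is a list of lists in the source record; flatten into one set.
        "categories": {c for path in product.get("categories", []) for c in path},
        "also_bought": set(related.get("also_bought", [])),
        "buy_after_viewing": set(related.get("buy_after_viewing", [])),
    }
    # One row per (asin, category), matching PRIMARY KEY(asin, category).
    ranks = [
        {"asin": product["asin"], "category": category, "rank": rank}
        for category, rank in product.get("salesRank", {}).items()
    ]
    return metadata, ranks


sample = {
    "asin": "0007148089",
    "title": "Blood and Roses: The Tumultuous Wars of the Roses",
    "price": 5.98,
    "related": {"also_bought": ["0061430765", "0061430773", "B00A4E8E78"]},
    "salesRank": {"Books": 326205},
    "categories": [["Books"]],
}
metadata_row, rank_rows = to_rows(sample)
print(metadata_row["categories"], rank_rows)
```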

dsetool create_core amazon.metadata generateResources=true

dsetool create_core amazon.rank generateResources=true

http://localhost:8983/solr/#/amazon.metadata

http://localhost:8983/solr/#/amazon.rank

Index Size

• Core index size
• Fields, term frequency, count, and settings
• Number of dynamic fields and frequency using Luke
• termVectors="false"
• termPositions="false"
• termOffsets="false"
• omitNorms="true"
• Only index fields you intend to search

http://localhost:8983/solr/#/amazon.metadata/plugins/core?entry=core

http://localhost:8983/solr/#/amazon.metadata/schema-browser

http://localhost:8983/solr/amazon.metadata/admin/luke

Dynamic Fields

http://localhost:8983/solr/demo.dynamic/admin/luke
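A rough sketch of counting dynamic fields from a Luke response. Assumptions: the handler is asked for JSON with wt=json and numTerms=0, and each concrete field that came from a dynamic pattern carries a "dynamicBase" entry under "fields"; check that shape against the actual response before relying on it. The sample response below is invented for illustration.

```python
from collections import Counter
from urllib.parse import urlencode


def luke_url(base, core):
    """Build a Luke request URL that asks for JSON and skips per-field top terms."""
    return f"{base}/{core}/admin/luke?" + urlencode({"wt": "json", "numTerms": 0})


def dynamic_field_counts(luke_response):
    """Count concrete fields per dynamic-field pattern (assumed 'dynamicBase' key)."""
    counts = Counter()
    for info in luke_response.get("fields", {}).values():
        base = info.get("dynamicBase")
        if base:
            counts[base] += 1
    return counts


# Invented response fragment, shaped the way this sketch assumes.
sample_response = {
    "fields": {
        "id": {"type": "string"},
        "color_s": {"type": "string", "dynamicBase": "*_s"},
        "size_s": {"type": "string", "dynamicBase": "*_s"},
        "weight_i": {"type": "int", "dynamicBase": "*_i"},
    }
}

print(luke_url("http://localhost:8983/solr", "demo.dynamic"))
print(dynamic_field_counts(sample_response))
```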

Indexing throughput

• Set autoSoftCommit as high as possible
• Disable all caches except filterCache
• Increase RAM buffer to 512-1024MB
• Enable realtime indexing
• Large heap (20GB) with G1 or CASSANDRA-8150 tuning
• Increase back_pressure_threshold_per_core to 2000-5000
• Set max_solr_concurrency_per_core to number of cores
• Recommend more cores (32)

Live Indexing Throughput



Query Latency and Throughput

• Set autoSoftCommit as high as possible
• Disable all caches except filterCache
• Use docValues for faceted or sorted fields
• Large heap (20GB) with G1 or CASSANDRA-8150 tuning
• Move query parameters to filters
• Use single pass queries where possible
• Recommend more cores (32)
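"Move query parameters to filters" can be sketched as keeping only the text match in q and pushing exact or range restrictions into fq, where Solr can serve them from the filterCache. The helper name solr_params is hypothetical; the field names come from the amazon.metadata table in the demo, and the query values are made up.

```python
from urllib.parse import urlencode


def solr_params(text_query, filters):
    """Build a Solr query string with one q and one fq per filter clause."""
    params = [("q", text_query)]
    params += [("fq", clause) for clause in filters]
    return urlencode(params)


# Instead of q=title:roses AND categories:Books AND price:[0 TO 10]
query_string = solr_params(
    "title:roses",
    ["categories:Books", "price:[0 TO 10]"],
)
print(query_string)
```

Each fq clause is cached on its own, so a filter such as categories:Books can be reused by later queries that differ only in q.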

Query Latency and Throughput

• DSETool Performance objects
• Solr slow query log
• Tracing
• Use JMX MBean com.datastax.bdp.search (DSP-2792)
  – EXECUTE
  – RETRIEVE
  – COORDINATE

CASSANDRA-8150 Tuning

MAX_HEAP_SIZE="20G"
HEAP_NEWSIZE="6G"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
JVM_OPTS="$JVM_OPTS -XX:+PerfDisableSharedMem"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=2"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"

JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=10"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"

JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=10000"
JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"

CASSANDRA-7486 (G1) Tuning

MAX_HEAP_SIZE="20G"
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
JVM_OPTS="$JVM_OPTS -XX:+PerfDisableSharedMem"
JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"

# set these to the number of cores
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=8"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=8"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"

JVM_OPTS="$JVM_OPTS -XX:G1ReservePercent=15"
JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=25"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
JVM_OPTS="$JVM_OPTS -XX:G1HeapRegionSize=32"

DSE 4.7 Improvements

DSP-4477 - Pivot faceting
DSP-4476 - Pagination
DSP-3740 - Live indexing
DSP-4091 - Remove support for stored copy fields
DSP-4703 - Query Solr from Spark
DSP-4518 - Improved memory usage for faceting
DSP-3931 - Filter cache sizing is now global across all segments
DSP-4475 - Verify/Integrate single pass distributed queries (SOLR-5768)
DSP-4072 - Fault-tolerant distributed queries
DSP-3958 - Improve shard routing by taking into account node health factors
DSP-3935 - Implement faceting inside CQL Solr queries

DSE vs ElasticSearch

Feature | DSE | ElasticSearch
Replication and multiple datacenters | Based on Cassandra, multi-DC support for free, real-time replication, high availability | Master slave, long replication delay, doesn't do multi-DC well
Scalability | Hundreds of nodes, hundreds of terabytes | 10s of nodes, a couple terabytes
Data loss possible | No | Yes
Primary Data Store | Yes | No
Operational Complexity | Single system | Multiple systems
Analytics | Yes | No
Dynamic Schema | Sorta | Sorta, slightly easier

Increased performance by 700% while growing data by 500%

Reduced operational costs by 40%

Deleted 15,000 lines of code
