DSE 4.7 Search
Matt Stump, Chief Architect/Manager for SWAT, DataStax

Thank you for joining. We will begin shortly.

Understanding DSE Search by Matt Stump



Webinar Housekeeping

• All attendees placed on mute
• Input questions at any time using the online interface


Agenda

1 Data Locality
2 Bitmap Indexing
3 IO Path
4 Demo
5 Performance
6 Why DSE?


Hash(“some bytes”) => A Number
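The idea on this slide, that a hash turns arbitrary bytes into a number which can then decide where data lives, can be sketched in a few lines. This is an illustration only: it uses MD5 from Python's standard library as a stand-in hash (Cassandra's default partitioner uses Murmur3, which is not in the standard library), and the `token_for` and `owner_of` helpers and the four-position ring are hypothetical.

```python
import hashlib
from bisect import bisect_left

def token_for(key: bytes) -> int:
    """Map arbitrary bytes to a number in [0, 2**64). MD5 is a stand-in hash."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big")

def owner_of(token: int, ring: list) -> int:
    """Return the index of the first ring position >= token, wrapping to 0."""
    return bisect_left(ring, token) % len(ring)

# Hypothetical ring: four nodes at evenly spaced positions.
ring = [2**62, 2**63, 2**63 + 2**62, 2**64 - 1]

token = token_for(b"some bytes")
print(token, "-> node", owner_of(token, ring))
```

The same bytes always hash to the same number, so any node can work out which node owns a key without a lookup table.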




V1 OR V2
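A bitmap index keeps one bitmap per value, with bit i set when document i contains that value, so a query such as V1 OR V2 reduces to a bitwise OR of two bitmaps. A minimal sketch using Python integers as bitmaps; the document IDs and the `bitmap` and `doc_ids` helpers are made up for illustration and are not a description of how Lucene stores its index.

```python
def bitmap(ids):
    """Build a bitmap (as an int) with one bit set per document ID."""
    bits = 0
    for d in ids:
        bits |= 1 << d
    return bits

def doc_ids(bits):
    """List the document IDs whose bits are set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

v1 = bitmap([0, 3, 5])  # documents containing value V1
v2 = bitmap([1, 3, 6])  # documents containing value V2

print(doc_ids(v1 | v2))  # V1 OR V2  -> [0, 1, 3, 5, 6]
print(doc_ids(v1 & v2))  # V1 AND V2 -> [3]
```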


Quick to Read, Expensive to Update


Near Real Time is Expensive


Use 32 vnodes in DSE 4.7.1


{
  'asin': '0007148089',
  'title': "Blood and Roses: The Tumultuous Wars of the Roses",
  'price': 5.98,
  'imUrl': 'http://ecx.images-amazon.com/images/I/518p8d64F8L.jpg',
  'related': {
    'also_bought': ['0061430765', '0061430773', 'B00A4E8E78'],
    'buy_after_viewing': ['0061430773', '0345404335', 'B00A4E8E78', '0975126407']
  },
  'salesRank': {'Books': 326205},
  'categories': [['Books']]
}


CREATE TABLE IF NOT EXISTS amazon.metadata (
  asin text,
  title text,
  imurl text,
  price double,
  categories set<text>,
  also_bought set<text>,
  buy_after_viewing set<text>,
  PRIMARY KEY (asin)
);

CREATE TABLE IF NOT EXISTS amazon.rank (
  asin text,
  category text,
  rank int,
  PRIMARY KEY (asin, category)
);
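The nested product record has to be flattened before it fits these two tables. The sketch below shows one plausible mapping, not necessarily the one used in the demo: the `to_rows` helper is hypothetical, and it assumes that nested category paths are collapsed into a single set and that each `salesRank` entry becomes one row in amazon.rank.

```python
def to_rows(record):
    """Split one product record into a metadata row and a list of rank rows."""
    related = record.get("related", {})
    metadata = {
        "asin": record["asin"],
        "title": record.get("title"),
        "imurl": record.get("imUrl"),
        "price": record.get("price"),
        # Assumption: flatten the list of category paths into one set.
        "categories": {c for path in record.get("categories", []) for c in path},
        "also_bought": set(related.get("also_bought", [])),
        "buy_after_viewing": set(related.get("buy_after_viewing", [])),
    }
    ranks = [
        {"asin": record["asin"], "category": category, "rank": rank}
        for category, rank in record.get("salesRank", {}).items()
    ]
    return metadata, ranks

record = {
    "asin": "0007148089",
    "title": "Blood and Roses: The Tumultuous Wars of the Roses",
    "price": 5.98,
    "imUrl": "http://ecx.images-amazon.com/images/I/518p8d64F8L.jpg",
    "related": {
        "also_bought": ["0061430765", "0061430773", "B00A4E8E78"],
        "buy_after_viewing": ["0061430773", "0345404335", "B00A4E8E78", "0975126407"],
    },
    "salesRank": {"Books": 326205},
    "categories": [["Books"]],
}

metadata, ranks = to_rows(record)
print(metadata)
print(ranks)
```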


dsetool create_core amazon.metadata generateResources=true

dsetool create_core amazon.rank generateResources=true

http://localhost:8983/solr/#/amazon.metadata

http://localhost:8983/solr/#/amazon.rank


Index Size


• Core index size
• Fields, term frequency, count, and settings
• Number of dynamic fields and frequency using Luke
• termVectors="false"
• termPositions="false"
• termOffsets="false"
• omitNorms="true"
• Only index fields you intend to search


Indexing throughput

• Set autoSoftCommit as high as possible
• Disable all caches except filterCache
• Increase RAM buffer to 512-1024MB
• Enable realtime indexing
• Large heap (20GB) with G1 or CASSANDRA-8150 tuning
• Increase back_pressure_threshold_per_core to 2000-5000
• Set max_solr_concurrency_per_core to the number of CPU cores
• Recommend more CPU cores (32)


Live Indexing Throughput


Query Latency and Throughput

• Set autoSoftCommit as high as possible
• Disable all caches except filterCache
• Use docValues for faceted or sorted fields
• Large heap (20GB) with G1 or CASSANDRA-8150 tuning
• Move query parameters to filters
• Use single pass queries where possible
• Recommend more CPU cores (32)
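"Move query parameters to filters" refers to putting restrictive clauses in Solr's `fq` parameter instead of `q`: filter clauses do not contribute to scoring and their results can be served from the filterCache. A small sketch that builds both forms of the request URL with the standard library. The field names come from the amazon.metadata table above; the `/select` URL, the `select_url` helper, and the example clauses are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumed Solr select endpoint for the demo core.
BASE = "http://localhost:8983/solr/amazon.metadata/select"

def select_url(q, filters=()):
    """Build a select URL with one q parameter and one fq per filter clause."""
    params = [("q", q)] + [("fq", f) for f in filters] + [("wt", "json")]
    return BASE + "?" + urlencode(params)

# Everything in the main query: all clauses are evaluated and scored together.
everything_in_q = select_url("title:roses AND categories:Books AND price:[0 TO 10]")

# Restrictive clauses moved to filters: each fq can be cached independently.
with_filters = select_url("title:roses", ["categories:Books", "price:[0 TO 10]"])

print(everything_in_q)
print(with_filters)
```

Keeping each restriction in its own `fq` lets a cached filter be reused by later queries that share the same clause.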


Query Latency and Throughput

• DSETool Performance objects
• Solr slow query log
• Tracing
• Use MBean com.datastax.bdp.search (DSP-2792)
  – EXECUTE
  – RETRIEVE
  – COORDINATE


CASSANDRA-8150 Tuning

MAX_HEAP_SIZE="20G"
HEAP_NEWSIZE="6G"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
JVM_OPTS="$JVM_OPTS -XX:+PerfDisableSharedMem"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=2"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"

JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=10"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"

JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=10000"
JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"


CASSANDRA-7486 (G1) Tuning

MAX_HEAP_SIZE="20G"
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
JVM_OPTS="$JVM_OPTS -XX:+PerfDisableSharedMem"
JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"

# set these to the number of cores
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=8"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=8"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"

JVM_OPTS="$JVM_OPTS -XX:G1ReservePercent=15"
JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=25"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
JVM_OPTS="$JVM_OPTS -XX:G1HeapRegionSize=32"
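The note to set the GC thread counts to the number of cores can be automated when an environment file is generated per machine. A sketch that derives the two thread-count lines from the local core count; the `gc_thread_opts` helper is hypothetical, and it only prints lines in the style shown above rather than editing any file.

```python
import os

def gc_thread_opts(cores=None):
    """Return the two GC thread-count lines for the given number of cores."""
    # Fall back to the local core count, or 1 if it cannot be determined.
    cores = cores or os.cpu_count() or 1
    return [
        f'JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads={cores}"',
        f'JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads={cores}"',
    ]

for line in gc_thread_opts():
    print(line)
```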


DSE 4.7 Improvements

DSP-4477 - Pivot faceting
DSP-4476 - Pagination
DSP-3740 - Live indexing
DSP-4091 - Remove support for stored copy fields
DSP-4703 - Query Solr from Spark
DSP-4518 - Improved memory usage for faceting
DSP-3931 - Filter cache sizing is now global across all segments
DSP-4475 - Verify/Integrate single pass distributed queries (SOLR-5768)
DSP-4072 - Fault-tolerant distributed queries
DSP-3958 - Improve shard routing by taking into account node health factors
DSP-3935 - Implement faceting inside CQL Solr queries


DSE vs ElasticSearch

Feature | DSE | ElasticSearch
Replication and multiple datacenters | Based on Cassandra, multi-DC support for free, real-time replication, high availability | Master-slave, long replication delay, doesn't do multi-DC well
Scalability | Hundreds of nodes, hundreds of terabytes | 10s of nodes, a couple of terabytes
Data loss possible | No | Yes
Primary Data Store | Yes | No
Operational Complexity | Single system | Multiple systems
Analytics | Yes | No
Dynamic Schema | Sorta | Sorta, slightly easier


Increased performance by 700% while growing data by 500%


Reduced operational costs by 40%


Deleted 15,000 lines of code