DSE 4.7 Search
Matt Stump, Chief Architect/Manager for SWAT, DataStax
Thank you for joining. We will begin shortly.
Webinar Housekeeping
• All attendees placed on mute
• Input questions at any time using the online interface
Agenda
1 Data Locality
2 Bitmap Indexing
3 IO Path
4 Demo
5 Performance
6 Why DSE?
Hash(“some bytes”) => A Number
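The hashing idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: MD5 stands in for Cassandra's Murmur3 partitioner, the node names are hypothetical, and ring positions are derived from the node name rather than assigned randomly as on a real cluster.

```python
import bisect
import hashlib

def token(key: bytes) -> int:
    """Hash arbitrary bytes to a 64-bit number (MD5 stand-in, not Murmur3)."""
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big")

def build_ring(nodes, vnodes_per_node=32):
    """Give every node several positions (vnodes) on the token ring."""
    ring = []
    for node in nodes:
        for i in range(vnodes_per_node):
            # Hypothetical placement: hash "<node>-<i>" to get a ring position.
            ring.append((token(f"{node}-{i}".encode()), node))
    ring.sort()
    return ring

def owner(ring, key: bytes) -> str:
    """Return the node holding the first ring position at or after the key's token."""
    tokens = [t for t, _ in ring]
    idx = bisect.bisect_left(tokens, token(key)) % len(ring)  # wrap around the ring
    return ring[idx][1]

ring = build_ring(["node1", "node2", "node3"])
print(owner(ring, b"some bytes"))
```

Because the same bytes always hash to the same number, any node can work out where a row lives without a lookup table, which is what makes data locality possible.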
V1 OR V2
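A minimal sketch of what "V1 OR V2" means for a bitmap index, assuming each value keeps one bit per document ID. Python integers serve as the bitmaps here; this illustrates the general technique and is not Lucene's actual on-disk structure.

```python
def bitmap(doc_ids):
    """Build a bitmap with bit d set for every document ID d."""
    bits = 0
    for d in doc_ids:
        bits |= 1 << d
    return bits

def doc_ids(bits):
    """Decode a bitmap back into a sorted list of document IDs."""
    return [i for i in range(bits.bit_length()) if bits >> i & 1]

v1 = bitmap([0, 3, 5])   # documents matching value V1
v2 = bitmap([3, 4])      # documents matching value V2

print(doc_ids(v1 | v2))  # V1 OR V2  -> [0, 3, 4, 5]
print(doc_ids(v1 & v2))  # V1 AND V2 -> [3]
```

Combining two values is a single bitwise operation, which is why such indexes are quick to read; changing which documents match a value means rewriting the bitmaps.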
Quick to Read
Expensive to Update
Near Real Time is Expensive
Use 32 vnodes in DSE 4.7.1
{
  'asin': '0007148089',
  'title': "Blood and Roses: The Tumultuous Wars of the Roses",
  'price': 5.98,
  'imUrl': 'http://ecx.images-amazon.com/images/I/518p8d64F8L.jpg',
  'related': {
    'also_bought': ['0061430765', '0061430773', 'B00A4E8E78'],
    'buy_after_viewing': ['0061430773', '0345404335', 'B00A4E8E78', '0975126407']
  },
  'salesRank': {'Books': 326205},
  'categories': [['Books']]
}
CREATE TABLE IF NOT EXISTS amazon.metadata (
  asin text,
  title text,
  imurl text,
  price double,
  categories set<text>,
  also_bought set<text>,
  buy_after_viewing set<text>,
  PRIMARY KEY(asin)
);

CREATE TABLE IF NOT EXISTS amazon.rank (
  asin text,
  category text,
  rank int,
  PRIMARY KEY(asin, category)
);
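One way the sample record could be split across the two tables is sketched below. The helper name `to_rows` is hypothetical, and flattening the nested `categories` lists into a single set is an assumption about how the demo loaded the data; the slides do not show the loader itself.

```python
record = {
    "asin": "0007148089",
    "title": "Blood and Roses: The Tumultuous Wars of the Roses",
    "price": 5.98,
    "imUrl": "http://ecx.images-amazon.com/images/I/518p8d64F8L.jpg",
    "related": {
        "also_bought": ["0061430765", "0061430773", "B00A4E8E78"],
        "buy_after_viewing": ["0061430773", "0345404335", "B00A4E8E78", "0975126407"],
    },
    "salesRank": {"Books": 326205},
    "categories": [["Books"]],
}

def to_rows(rec):
    """Split one product record into an amazon.metadata row and amazon.rank rows."""
    related = rec.get("related", {})
    metadata = {
        "asin": rec["asin"],
        "title": rec.get("title"),
        "imurl": rec.get("imUrl"),
        "price": rec.get("price"),
        # Assumption: nested category paths are flattened into one set<text>.
        "categories": {c for path in rec.get("categories", []) for c in path},
        "also_bought": set(related.get("also_bought", [])),
        "buy_after_viewing": set(related.get("buy_after_viewing", [])),
    }
    # One amazon.rank row per (asin, category) pair, matching the primary key.
    ranks = [
        {"asin": rec["asin"], "category": cat, "rank": rank}
        for cat, rank in rec.get("salesRank", {}).items()
    ]
    return metadata, ranks

metadata_row, rank_rows = to_rows(record)
print(metadata_row["categories"], rank_rows)
```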
dsetool create_core amazon.metadata generateResources=true
dsetool create_core amazon.rank generateResources=true
http://localhost:8983/solr/#/amazon.metadata
http://localhost:8983/solr/#/amazon.rank
Index Size
• Core index size
• Fields, term frequency, count, and settings
• Number of dynamic fields and frequency using Luke
• termVectors="false"
• termPositions="false"
• termOffsets="false"
• omitNorms="true"
• Only index fields you intend to search
Dynamic Fields
Indexing throughput
• Set autoSoftCommit as high as possible
• Disable all caches except filterCache
• Increase RAM buffer to 512-1024MB
• Enable realtime indexing
• Large heap (20GB) with G1 or 8150 tuning
• Increase back_pressure_threshold_per_core to 2000-5000
• Set max_solr_concurrency_per_core to number of cores
• Recommend more cores (32)
Live Indexing Throughput
Query Latency and Throughput
• Set autoSoftCommit as high as possible
• Disable all caches except filterCache
• Use docValues for faceted or sorted fields
• Large heap (20GB) with G1 or 8150 tuning
• Move query parameters to filters
• Use single pass queries where possible
• Recommend more cores (32)
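"Move query parameters to filters" can be illustrated with Solr's standard `q` and `fq` request parameters. The sketch below only builds a request URL; the helper name, the example field clauses, and the core URL (derived from the admin URLs shown earlier) are assumptions for illustration.

```python
from urllib.parse import urlencode

def solr_select_url(base, filters, q="*:*", rows=10):
    """Build a select URL that puts restricting clauses in fq instead of q."""
    params = [("q", q), ("rows", rows), ("wt", "json")]
    # Each fq clause can be cached independently by the filterCache.
    params += [("fq", f) for f in filters]
    return f"{base}/select?{urlencode(params)}"

# Instead of q="categories:Books AND price:[0 TO 10]":
url = solr_select_url(
    "http://localhost:8983/solr/amazon.metadata",  # assumed core URL
    ["categories:Books", "price:[0 TO 10]"],
)
print(url)
```

Clauses passed as `fq` restrict the result set without affecting scoring, so repeated filters can be answered from the filterCache, the one cache the slide recommends keeping enabled.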
Query Latency and Throughput
• DSETool Performance objects
• Solr slow query log
• Tracing
• Use MBean com.datastax.bdp.search DSP-2792
  – EXECUTE
  – RETRIEVE
  – COORDINATE
CASSANDRA-8150 Tuning
MAX_HEAP_SIZE="20G"
HEAP_NEWSIZE="6G"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
JVM_OPTS="$JVM_OPTS -XX:+PerfDisableSharedMem"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=2"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"

JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=10"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"

JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=10000"
JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
CASSANDRA-7486 (G1) Tuning
MAX_HEAP_SIZE="20G"
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
JVM_OPTS="$JVM_OPTS -XX:+PerfDisableSharedMem"
JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"

# set these to the number of cores
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=8"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=8"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"

JVM_OPTS="$JVM_OPTS -XX:G1ReservePercent=15"
JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=25"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
JVM_OPTS="$JVM_OPTS -XX:G1HeapRegionSize=32m"
DSE 4.7 Improvements
DSP-4477 - Pivot faceting
DSP-4476 - Pagination
DSP-3740 - Live indexing
DSP-4091 - Remove support for stored copy fields
DSP-4703 - Query Solr from Spark
DSP-4518 - Improved memory usage for faceting
DSP-3931 - Filter cache sizing is now global across all segments
DSP-4475 - Verify/Integrate single pass distributed queries (SOLR-5768)
DSP-4072 - Fault-tolerant distributed queries
DSP-3958 - Improve shard routing by taking into account node health factors
DSP-3935 - Implement faceting inside CQL Solr queries
DSE vs ElasticSearch
Feature | DSE | ElasticSearch
Replication and multiple datacenters | Based on Cassandra, multi-DC support for free, real-time replication, high availability | Master slave, long replication delay, doesn't do multi-DC well
Scalability | Hundreds of nodes, hundreds of terabytes | 10s of nodes, a couple terabytes
Data loss possible | No | Yes
Primary Data Store | Yes | No
Operational Complexity | Single system | Multiple systems
Analytics | Yes | No
Dynamic Schema | Sorta | Sorta, slightly easier
Increased performance by 700% while growing data by 500%
Reduced operational costs by 40%
Deleted 15,000 lines of code