Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Page 1: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Interactive Queries on Compressed RDDs

Succinct Spark

Rachit Agarwal, AMPLab

[email protected]

Twitter: @_ragarwal_

Page 2: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct: A distributed compressed data store

No secondary indexes, no data scans, no data decompression

Point queries

• search
• random access
• range queries
• regular expressions

Unified Interface

• Unstructured data
• Key-value store
• Document store
• Tables

Page 3: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Interactive point queries

Random access

Search

Range Queries

Regular Expressions

Aggregate queries

Updates

Graph queries

Page 4: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Existing systems, e.g., search()

Data Scans: low storage, high latency

Indexes: high storage, low latency

[Figure: Search( ) against an index holding one list of offsets per symbol: {0, 10, 14, 16, 19, 26, 29}, {1, 4, 5, 8, 20, 22, 24}, {2, 15, 17, 27}, {3, 6, 7, 9, 12, 13, 18, 23, ..}, {11, 21}]
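The contrast on this slide can be sketched in a few lines of plain Scala. This is only an illustration of the general trade-off, not code from Succinct: the object name, the function names, and the sample input are all hypothetical, and the index shown (one offset list per symbol) is an assumption based on the offset lists in the figure.

```scala
// Illustrative sketch of "data scan" vs. "index" for search(); not Succinct code.
object ScanVsIndex {
  // Data scan: no extra storage, but every query reads the whole input.
  def scanSearch(input: String, symbol: Char): Seq[Int] =
    input.indices.filter(i => input(i) == symbol)

  // Index: one pass up front builds symbol -> ascending offsets.
  // This costs extra storage, but each later query becomes a lookup.
  def buildIndex(input: String): Map[Char, Vector[Int]] =
    input.zipWithIndex
      .groupBy(_._1)
      .map { case (c, pairs) => c -> pairs.map(_._2).toVector }

  def indexSearch(index: Map[Char, Vector[Int]], symbol: Char): Seq[Int] =
    index.getOrElse(symbol, Vector.empty)

  def main(args: Array[String]): Unit = {
    val input = "abracadabra" // hypothetical input
    val index = buildIndex(input)
    println(scanSearch(input, 'a'))
    println(indexSearch(index, 'a'))
  }
}
```

Both functions return the same offsets; they differ only in where the cost is paid (per query for the scan, in storage for the index).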

Page 5: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Existing systems "at scale" (qualitatively)

[Plot: Query Latency vs. Input size for Data scans and Indexes. Annotations: Scans in faster storage; Scans in slower storage; Indexes in faster storage; Indexes in slower storage; executing queries off slower storage]

Page 6: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

What makes Succinct unique

Succinct: low storage, low latency

Queries executed directly on the compressed representation

• No additional indexes: query responses embedded within the compressed representation
• No data scans: functionality of indexes
• No decompression: queries directly on the compressed representation (except for data access queries)

Page 7: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct tradeoffs

[Plot: Query Latency vs. Input size for Data scans, Indexes, and Succinct. Annotations: Avoiding data scans; Avoiding queries off slower storage]

Page 8: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct Data Model and Functionality

Input: flat (unstructured) files

[Figure: Original Input, compressed into Succinct]

Search: returns offsets of arbitrary strings in uncompressed file

Search() = {0, 10, 14, 16, 19, 26, 29}

Extract: returns data at arbitrary offsets in uncompressed file

Extract(0, 5) = {,,,,}

Count: returns count of arbitrary strings in uncompressed file

Count() = 7

Append(,,,,)

Range queries
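The three operations defined on this slide can be pinned down with a naive reference model over the uncompressed input. The sketch below is an assumption-laden illustration: it uses a String instead of bytes for readability, it scans the input (which Succinct is designed to avoid), and the object name and sample data are hypothetical. It only shows what each call is expected to return.

```scala
// Naive reference semantics for Search / Count / Extract on an uncompressed input.
// Not how Succinct executes queries; it only models the results.
object FlatFileSemantics {
  // Offsets of every (possibly overlapping) occurrence of `query` in `data`.
  def search(data: String, query: String): Seq[Int] =
    if (query.isEmpty) Seq.empty
    else (0 to data.length - query.length).filter(i => data.startsWith(query, i))

  // Number of occurrences of `query` in `data`.
  def count(data: String, query: String): Int = search(data, query).size

  // `len` characters starting at `offset`, clipped to the end of the input.
  def extract(data: String, offset: Int, len: Int): String =
    data.slice(offset, offset + len)

  def main(args: Array[String]): Unit = {
    val data = "abracadabra" // hypothetical input
    println(search(data, "abra"))
    println(count(data, "a"))
    println(extract(data, 4, 3))
  }
}
```

Note that Count always equals the size of the Search result, matching the slide's example where Search returns seven offsets and Count returns 7.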

Page 9: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct tradeoffs

What do we lose?

• Preprocessing time
• CPU (data access)
• Sequential scan throughput
• "In-place" updates

Supported, but traded off in favor of point queries on compressed data

Page 10: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct: A distributed compressed data store

No secondary indexes, no data scans, no data decompression

Point queries

• search
• random access
• range queries
• regular expressions

Unified Interface

• Unstructured data
• Key-value store
• Document store
• Tables

Page 11: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct Data Model: Flat File Interface

Unified Interface

• Unstructured data
• Key-value stores (Voldemort, Dynamo)
• Document store (Elasticsearch, MongoDB)
• Tables (Cassandra, BigTable)
• And many more…

With all the powerful queries on values, documents, columns

Page 12: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct Flat File Interface: Unification

Search(Column1, )

Search()
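One way a column-scoped search could be reduced to a flat-file search is to wrap every cell in a delimiter that is unique to its column, so that Search(Column1, value) becomes a plain Search for the delimited value. The sketch below illustrates that idea under stated assumptions: the delimiter characters, the `FlatTable` type, and all function names are hypothetical, cell values are assumed never to contain a delimiter or newline, and the naive substring scan stands in for Succinct's own search.

```scala
// Illustrative sketch: a table flattened into one string with per-column delimiters.
object FlatFileUnification {
  // Hypothetical delimiters, one per column; assumed absent from all cell values.
  val delimiters: Vector[Char] = Vector('\u0001', '\u0002', '\u0003')

  // The flat file plus the offset at which each row starts.
  case class FlatTable(data: String, rowStarts: Vector[Int])

  def flatten(rows: Seq[Seq[String]]): FlatTable = {
    // Each cell is wrapped in its column's delimiter; each row ends with a newline.
    val encoded = rows.map { row =>
      row.zipWithIndex.map { case (cell, col) =>
        s"${delimiters(col)}$cell${delimiters(col)}"
      }.mkString + "\n"
    }
    val starts = encoded.scanLeft(0)(_ + _.length).init.toVector
    FlatTable(encoded.mkString, starts)
  }

  // Plain substring search over the flat file (stand-in for Succinct's search).
  def search(data: String, query: String): Seq[Int] =
    (0 to data.length - query.length).filter(i => data.startsWith(query, i))

  // Search(column, value) rewritten as a flat-file search; returns matching row ids.
  def searchColumn(table: FlatTable, col: Int, value: String): Seq[Int] = {
    val d = delimiters(col)
    search(table.data, s"$d$value$d").map { off =>
      table.rowStarts.lastIndexWhere(_ <= off) // row that contains this offset
    }
  }
}
```

Because the delimiter differs per column, a value that appears in several columns is matched only in the column being queried.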

Page 13: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct: A distributed compressed data store

Where are we?

• Succinct
• Succinct Spark

Where are we going?

• Industry collaboration
• Succinct++

Page 14: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct: Where are we?

Succinct

• System (prototyped & tested)
• As a library
• C++, Java, Scala
• for ease of integration
• All functionalities supported

Page 15: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct: Where are we?

Succinct Spark

Queries on compressed RDDs

• A Spark package
• Enables new functionalities
• Document stores
• Point queries
• Faster filters
• Compressed RDDs: More in-memory
• Dataframes API not so mature

Page 16: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct Spark

If you are already using Spark

• New functionalities: Document store, Key-Value store; search on documents, values
• Faster operations into RDDs: random access, filters; avoid scans
• More in-memory: Compressed RDDs; no decompression overheads

Page 17: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct Spark: SuccinctRDD (unstructured data)

// Import classes
import edu.berkeley.cs.succinct._

// Create an RDD
val rdd = ctx.textFile(...).map(_.getBytes)

// Compress using Succinct
val succinctRDD = rdd.succinct

// Extract 100 bytes from offset 50
val bytes = succinctRDD.extract(50, 100)

// Count # occurrences of "Berkeley"
val count = succinctRDD.count("Berkeley")

// Find all occurrences of "Berkeley"
val offsets = succinctRDD.search("Berkeley")

Page 18: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct Spark: SuccinctKVRDD (document store)

// Import classes
import edu.berkeley.cs.succinct.kv._

// Load data
val kvRDD = rdd.zipWithIndex.map(t => (t._2, t._1.getBytes))

// Compress using Succinct
val succinctKVRDD = kvRDD.succinctKV

// Get value for key 0
val value = succinctKVRDD.get(0)

// Extract 100 bytes at offset 50 in the value for key 0
val valueData = succinctKVRDD.extract(0, 50, 100)

// Find all keys for values that contain "Berkeley"
val keys = succinctKVRDD.search("Berkeley")
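The results of the three key-value calls on this slide can be modeled locally without Spark. The sketch below is a hypothetical stand-in, not the SuccinctKVRDD API: it uses an in-memory Map with String values, wraps results in Option, and scans every value, purely to show which keys or byte ranges each call is expected to yield.

```scala
// Local model of the key-value operations shown on the slide; not the SuccinctKVRDD API.
object KVSemantics {
  type Store = Map[Long, String]

  // Value stored under `key`, if any.
  def get(store: Store, key: Long): Option[String] = store.get(key)

  // `len` characters starting at `offset` inside the value for `key`.
  def extract(store: Store, key: Long, offset: Int, len: Int): Option[String] =
    store.get(key).map(_.slice(offset, offset + len))

  // Keys whose values contain `query`, in ascending key order.
  def search(store: Store, query: String): Seq[Long] =
    store.collect { case (k, v) if v.contains(query) => k }.toSeq.sorted

  def main(args: Array[String]): Unit = {
    // Hypothetical documents keyed by id.
    val store: Store = Map(
      0L -> "AMPLab, UC Berkeley",
      1L -> "Stanford",
      2L -> "Berkeley, California"
    )
    println(get(store, 0L))
    println(extract(store, 0L, 8, 2))
    println(search(store, "Berkeley"))
  }
}
```

The difference from the flat-file Search is the return value: search here yields keys rather than offsets.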

Page 19: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct Evaluation

• 5x Amazon EC2 servers, 30 GB RAM each
• Wikipedia dataset, 40 GB
• Spark, Elasticsearch
• search queries
• #occurrences 1-10k

Page 20: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct Spark Evaluation (search latency)

Take-away: Succinct Spark is 2.75x faster than Elasticsearch while being 2.5x more space efficient (data fits in memory for all systems)

Page 21: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct Spark now supports Regular Expressions!

SuccinctRDD

// Find all matches for the RegEx "William.*Clinton"
val matches = succinctRDD.regexSearch("William.*Clinton")

SuccinctKVRDD

// Find all keys for values that contain matches for the RegEx "William.*Clinton"
val matchKeys = succinctKVRDD.regexSearch("William.*Clinton")
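To make the two regex calls concrete, here is a local model built on the standard `scala.util.matching.Regex`. It is a sketch of the expected results only: the object and function names are hypothetical, the (offset, length) return shape is an assumption rather than the library's documented type, and matching follows Java regex rules (greedy `.*`, no match across newlines), which may differ from Succinct's regex engine.

```scala
import scala.util.matching.Regex

// Local model of regex search results; not the Succinct Spark implementation.
object RegexSemantics {
  // (offset, length) of each non-overlapping match, left to right.
  def regexSearch(data: String, pattern: String): Seq[(Int, Int)] =
    new Regex(pattern).findAllMatchIn(data).map(m => (m.start, m.end - m.start)).toList

  // Keys whose values contain at least one match, in ascending key order.
  def regexSearchKeys(store: Map[Long, String], pattern: String): Seq[Long] = {
    val re = new Regex(pattern)
    store.collect { case (k, v) if re.findFirstIn(v).isDefined => k }.toSeq.sorted
  }

  def main(args: Array[String]): Unit = {
    println(regexSearch("xx William J. Clinton yy", "William.*Clinton"))
    println(regexSearchKeys(
      Map(0L -> "William Jefferson Clinton", 1L -> "Hillary Clinton"),
      "William.*Clinton"))
  }
}
```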

Page 22: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct Spark Evaluation (RegEx latency)

Take-away: Succinct significantly speeds up RegEx queries even when all the data fits in memory for all systems

Page 23: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Succinct Spark now supports JSON documents!

// Get JSON document with id 0
val jsonDoc = succinctJsonRDD.get(0)

// Filter JSON documents where "city = Berkeley"
val ids1 = succinctJsonRDD.filter("city", "Berkeley")

// Search for JSON documents containing "AMPLab"
val ids2 = succinctJsonRDD.search("AMPLab")

Page 24: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Where are we going?

• More testing, benchmarking
• Succinct Spark Dataframes
• New functionalities

Page 25: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

New functionalities

• BlowFish
• Succinct Encryption: queries on compressed and encrypted data
• Succinct Graphs: queries on compressed graphs

[Plot: Query Latency vs. Storage, showing Succinct, BlowFish, and Indexes]

Page 26: Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

AND MANY MORE!

succinct.cs.berkeley.edu