Upload
xml
View
227
Download
0
Embed Size (px)
Citation preview
7/29/2019 Shark - Hive on Spark
1/48
Shark
CliffEngle,AntonioLupher,ReynoldXin,MateiZaharia,MichaelFranklin,IonStoica,
ScottShenker
HiveonSpark
7/29/2019 Shark - Hive on Spark
2/48
Agenda
IntrotoSpark ApacheHive Shark SharksImprovementsoverHive Demo Alphastatus Futuredirections
7/29/2019 Shark - Hive on Spark
3/48
WhatSparkIs
NotawrapperaroundHadoop Separate,fast,MapReduce-likeengine
In-memorydatastorageforveryfastiterativequeries
PowerfuloptimizationandschedulingInteractiveusefromScalainterpreter
CompatiblewithHadoopsstorageAPIsCanread/writetoanyHadoop-supportedsystem,includingHDFS,HBase,SequenceFiles,etc
7/29/2019 Shark - Hive on Spark
4/48
ProjectHistory
Startedinsummerof2009 Opensourcedinearly2010 InuseatUCBerkeley,UCSF,Princeton,Klout,Conviva,Quantifind,Yahoo!Research
7/29/2019 Shark - Hive on Spark
5/48
SparkProgrammingModel
Resilientdistributeddatasets(RDDs)
DistributedcollectionsofScalaobjects
Canbecachedinmemoryacrossclusternodes
ManipulatedlikelocalScalacollectionsAutomaticallyrebuiltonfailure
7/29/2019 Shark - Hive on Spark
6/48
Example:LogMining
Loaderrormessagesfromalogintomemory,theninteractivelysearchforvariouspatterns
lines = spark.textFile(hdfs://...)
errors = lines.filter(_.startsWith(ERROR))messages = errors.map(_.split(\t)(2))
cachedMsgs = messages.cache()
Block1
Block2
Block3
Worker
Worker
Worker
Driver
cachedMsgs.filter(_.contains(foo)).count
cachedMsgs.filter(_.contains(bar)).count. . .
tasks
results
Cache1
Cache2
Cache3
BaseRDD
TransformedRDD
Action
Result:full-textsearchofWikipediain
7/29/2019 Shark - Hive on Spark
7/48
ApacheHive
DatawarehousesolutiondevelopedatFacebook
SQL-likelanguagecalledHiveQLtoquerystructureddatastoredinHDFS
QueriescompiletoHadoopMapReducejobs
7/29/2019 Shark - Hive on Spark
8/48
7/29/2019 Shark - Hive on Spark
9/48
HivePrinciples
SQLprovidesafamiliarinterfaceforusers
Extensibletypes,functions,andstorageformats
Horizontallyscalablewithhighperformanceonlargedatasets
7/29/2019 Shark - Hive on Spark
10/48
HiveApplications
Reporting Adhocanalysis ETLformachinelearning
7/29/2019 Shark - Hive on Spark
11/48
HiveDownsides Notinteractive
Hadoopstartuplatencyis~20seconds,evenforsmalljobs
NoquerylocalityIfqueriesoperateonthesamesubsetofdata,theystillrunfromscratch
Readingdatafromdiskisoftenbottleneck Requiresseparatemachinelearningdataflow
7/29/2019 Shark - Hive on Spark
12/48
SharkMotivation
Exploittemporallocality:workingsetofdatacanoftenfitinmemorytobe
reusedbetweenqueries Providelowlatencyonsmallqueries IntegratedistributedUDFsintoSQL
7/29/2019 Shark - Hive on Spark
13/48
IntroducingShark
Shark=Spark+ Hive
RunHiveQLqueriesthroughSparkwithHiveUDF,UDAF,SerDeUtilizeSparksin-memoryRDDcachingandflexiblelanguagecapabilities
7/29/2019 Shark - Hive on Spark
14/48
SharkintheAMPStack
Mesos
Spark
PrivateCluster AmazonEC2
Hadoop
MPI
Bagel(Pregelon
Spark)Shark
DebugTools
Streaming
Spark
7/29/2019 Shark - Hive on Spark
15/48
Shark
PhysicaloperatorsusingSpark RelyonSparksfastexecution,faulttolerance,andin-memoryRDDs
ReuseasmuchHivecodeaspossibleConvertlogicalqueryplangeneratedfromHiveintoSparkexecutiongraph
CompatiblewithHiveRunHiveQLqueriesonexistingHDFSdatausingHivemetadata,withoutmodifications
7/29/2019 Shark - Hive on Spark
16/48
7/29/2019 Shark - Hive on Spark
17/48
0.1alpha:84%Hivetestspassing
(575outof683)
http://github.com/amplab/shark
7/29/2019 Shark - Hive on Spark
18/48
0.1alpha ExperimentalSQL/RDDintegration UserselectedcachingwithCTASColumnarRDDcache
Someperformanceimprovements
7/29/2019 Shark - Hive on Spark
19/48
Caching
UserselectedcachingwithCTASCREATETABLEmytable_cachedASSELECT*FROMmytableWHEREcount>10;
mytable_cachedbecomesanin-memorymaterializedviewthatcanbeusedtoanswerqueries.
7/29/2019 Shark - Hive on Spark
20/48
SQL/SparkIntegration
AllowuserstoimplementsophisticatedalgorithmsasUDFsinSpark
QueryprocessingUDFsarestreamlinedval rdd = sc.sql2rdd("select foo, count(*) c from pokes group by foo")
println(rdd.count)println(rdd.mapRows(_.getInt(c)).reduce(_+_))
7/29/2019 Shark - Hive on Spark
21/48
Performanceoptimizations
Hash-basedshuffle(speedsupgroup-bys)Hadoopdoessort-basedshufflewhichcanbeexpensive.
Limitpush-downinorderbyselect*frompokesorderbyfoolimit10
Columnarcache
7/29/2019 Shark - Hive on Spark
22/48
LargeCachesinJVM
Javaobjectshavelargeoverhead(2-4xtheactualdatasize)
Boxing/UnboxingisslowGCisabottleneckifwehavemanylong-livedobjects
7/29/2019 Shark - Hive on Spark
23/48
Example:CachingRows
Int,Double,Boolean
25 10.7 True
24+24+16=
64bytes*
Shouldbe~13bytes
*Estimatedon64bitVM.Implementationsmayvary.
7/29/2019 Shark - Hive on Spark
24/48
PossibleSolutions
1. StoredeserializedJavaobjects2. Serializeobjectstoin-memory
bytearray
3. UsememoryslabexternaltoJVM
4. Storedatainprimitivearrays
7/29/2019 Shark - Hive on Spark
25/48
DeserializedJavaObjects
+Fast(requireslittleCPU)
(Upto5Xspeedup)
-Poormemoryusage(3Xworse)
-LotsofGarbageCollectionoverhead
7/29/2019 Shark - Hive on Spark
26/48
SerializetoBytes
+Bestmemoryefficiency
+EfficientGC-CPUusagetoserialize/
deserializeeachobj
7/29/2019 Shark - Hive on Spark
27/48
StoreinExternalSlab
+Efficientmemory
+NoGC-CPUusagetoserialize/
deserializeeachobj-Moredifficulttoimplement
7/29/2019 Shark - Hive on Spark
28/48
PrimitiveColumnArrays
+Efficientmemory
(similartobytearrays)+EfficientGC
+NoextraCPUoverhead
7/29/2019 Shark - Hive on Spark
29/48
ColumnvsRowStorage
1
Column Row2 3 1
john mike sally
4.1 3.5 6.4
john 4.1
2 mike 3.5
3 sally 6.4
7/29/2019 Shark - Hive on Spark
30/48
ColumnarBenefits
BettercompressionFewerobjectstotrackOnlymaterializenecessarycolumns=>LessCPU
7/29/2019 Shark - Hive on Spark
31/48
ColumnarinShark
Storeallprimitivecolumnsinprimitivearraysoneachnode
LessdeserializationoverheadSpaceefficientandeasiertogarbagecollect
7/29/2019 Shark - Hive on Spark
32/48
Result
Achievedefficientstoragespacewithoutsacrificingperformance
7/29/2019 Shark - Hive on Spark
33/48
LimitPushdowninSort
SELECT*FROMtableORDERBYkeyLIMIT10;
7/29/2019 Shark - Hive on Spark
34/48
MainIdea
Thelimitcanbeappliedbeforesortingalldatatoincrease
performance.EachmappercomputesitsTopKandthenasinglereducermergesthese
7/29/2019 Shark - Hive on Spark
35/48
PossibleImplementations
1.PriorityQueue(Heap)tokeeprunningtopkinonetraversalon
eachmapper.
O(nlogk)
7/29/2019 Shark - Hive on Spark
36/48
PossibleImplementations
2.QuickSelectalgorithm.Essentiallyquicksort,butprunes
elementsononesideofpivotwhenpossible.
O(n)
7/29/2019 Shark - Hive on Spark
37/48
Result
WechosetouseGoogleGuavasimplementationofQuickSelect.
WeobservedmuchbetterperformancethanHivewhich
appliesthelimitaftersortingoneachreducer.
7/29/2019 Shark - Hive on Spark
38/48
Demo
3-nodeEC2largecluster(2virtualcores,7GBofRAM)
SyntheticTPC-Hdata(2xscalefactor)
7/29/2019 Shark - Hive on Spark
39/48
SomenumberswhilewewaitforHivetofinishrunning
Brown/Stonebrakerbenchmark70GB1AlsousedonHivemailinglist2
10AmazonEC2HighMemoryNodes(30GBofRAM/node)
NaivelycacheinputtablesCompareSharktoHive0.7
1http://database.cs.brown.edu/projects/mapreduce-vs-dbms/2https://issues.apache.org/jira/browse/HIVE-3961
7/29/2019 Shark - Hive on Spark
40/48
Benchmarks:Query1
SELECT * FROM grep WHERE field LIKE %XYZ%;
30GBinputtable
7/29/2019 Shark - Hive on Spark
41/48
Benchmark:Query2
5GBinputtableSELECT pagerank, pageURL FROM rankings WHEREpagerank > 10;
7/29/2019 Shark - Hive on Spark
42/48
Status
Passing575Hivetestsoutof683.Youcanhelpusdecidewhattoimplementnext.
7/29/2019 Shark - Hive on Spark
43/48
UnsupportedHiveFeatures
ADDFILE(useHadoopdistributedcachetodistributeuser-definedscripts)
STREAMTABLEhintnotsupportedOuterjoinwithnon-equijoinfilter
Tablestatistics(piggybackfilesink) Tablewithbuckets ORDERBYwithmultiplereducershasabug(willbefixedthisweek!)
MapJoinhasaserializationbug(willbefixedthisweek!)
7/29/2019 Shark - Hive on Spark
44/48
UnsupportedHiveFeatures
Automaticallyconvertjoinstomapjoins Sort-mergemapjoins Partitionswithdifferentinputformats Uniquejoins WritingtotwoormoretablesinasingleSELECTINSERT
Archivingdata virtualcolumns(INPUT__FILE__NAME,BLOCK__OFFSET__INSIDE__FILE) Mergesmallfilesfromoutput
7/29/2019 Shark - Hive on Spark
45/48
Whatscomingup(inamonth)?
Performanceimprovements(groupbys,joins)DemoofSharkona100-nodeclusterinSIGMOD2012(May22,Scottsdale,AZ)miningWikipediavisitlogdata
7/29/2019 Shark - Hive on Spark
46/48
Whatscomingup(ayear)?
Shark-specificqueryoptimizer Automatictuningofparameters(e.g.numberofreducers)
Off-heapcaching(e.g.EhCache) AutomaticcachingbasedonqueryanalysisSharedcachinginfrastructure
7/29/2019 Shark - Hive on Spark
47/48
Conclusion
SharkisawarehousesolutioncompatiblewithApacheHive
Canbeordersofmagnitudefaster Wearelookingforpeopletotryandwearehappytoprovidesupport
7/29/2019 Shark - Hive on Spark
48/48
Thankyou!
Questions?
http://shark.cs.berkeley.edu(willbeupdatedtonight)