Kudu Cloudera Meetup Paris

1©Cloudera,Inc.Allrightsreserved.

IntrotoApacheKuduHadoopStorageforFastAnalyticsonFastData@[email protected]


Agenda

•WhyKudu?Motivationsanddesigngoals•Usecases•Designandsomeinternals• Performance•Howtogetstarted


WhyKudu?• Gapsincapabilitiesofcurrentstoragesystems• Changinghardwarelandscape


PreviousstoragelandscapeoftheHadoopecosystemHDFS (GFS)excelsat:

• Batchingestonly(eg hourly)• Efficientlyscanninglargeamounts

ofdata(analytics)HBase (BigTable)excelsat:

• Efficientlyfindingandwritingindividualrows

• Makingdatamutable

Gapsexistwhenthesepropertiesareneededsimultaneously


• HighthroughputforbigscansGoal:Within2xofParquet

• Low-latencyforshortaccessesGoal: 1msread/writeonSSD

• Database-like semantics(initiallysingle-rowACID)

• Relationaldatamodel• SQLqueriesareeasy• “NoSQL”stylescan/insert/update(Java/C++client)

Kududesigngoals


Changinghardwarelandscape

• Spinningdisk->solidstatestorage• NANDflash:Upto450kread250kwriteiops,about2GB/secreadand1.5GB/secwritethroughput, atapriceoflessthan$3/GBanddropping

• 3DXPoint memory (1000xfasterthanNAND,cheaperthanRAM)

• RAM ischeaperandmoreabundant:• 64->128->256GBoverlastfewyears

• Takeaway:The nextbottleneckisCPU,andcurrentstoragesystemsweren’tdesignedwithCPUefficiencyinmind.


WhereKudufitsintheHadoopbigdatastackStorageforfast(lowlatency)analyticsonfast(highthroughput)data

• Simplifiesthearchitectureforbuildinganalyticapplicationsonchangingdata

• Optimizedforfastanalyticperformance

• NativelyintegratedwiththeHadoopecosystemofcomponents

FILESYSTEMHDFS

NoSQLHBASE

INGEST– SQOOP,FLUME,KAFKA

DATAINTEGRATION&STORAGE

SECURITY– SENTRY

RESOURCEMANAGEMENT– YARN

UNIFIEDDATASERVICES

BATCH STREAM SQL SEARCH MODEL ONLINE

DATAENGINEERING DATADISCOVERY&ANALYTICS DATAAPPS

SPARK,HIVE,PIG

SPARK IMPALA SOLR SPARK HBASE

RELATIONALKUDU


Kudu:Scalableandfasttabularstorage

• Scalable• Testedupto275nodes(~3PBcluster)• Designedtoscaleto1000sofnodes,tensofPBs

• Fast• Millions ofread/writeoperationspersecondacrosscluster• MultipleGB/secondreadthroughputpernode

• Tabular• Storetableslikeanormaldatabase(cansupportSQL,Spark,etc)• Individualrecord-levelaccessto100+billionrowtables(Java/C++/PythonAPIs)


Kuduusage

• TablehasaSQL-likeschema• Finitenumberofcolumns(unlikeHBase/Cassandra)• Types:BOOL,INT8,INT16,INT32,INT64,FLOAT,DOUBLE,STRING,BINARY,TIMESTAMP

• Somesubsetofcolumnsmakesupapossibly-compositeprimarykey• FastALTERTABLE

• JavaandC++“NoSQL”styleAPIs• Insert(),Update(),Delete(),Scan()

• IntegrationswithMapReduce,Spark,andImpala• ApacheDrillwork-in-progress


Usecasesandarchitectures


Kuduusecases

Kuduisbestforusecasesrequiringasimultaneouscombinationofsequentialandrandomreadsandwrites

• Timeseries• Examples:Streamingmarketdata,frauddetection/prevention,riskmonitoring• Workload:Insert,updates,scans,lookups

• Machinedataanalytics• Example:Networkthreatdetection• Workload:Inserts,scans,lookups

• Onlinereporting• Example:Operationaldatastore(ODS)• Workload:Inserts,updates,scans,lookups


“Traditional”real-timeanalyticsinHadoopFrauddetectionintherealworldmeansstoragecomplexity

Considerations:● HowdoIhandlefailure

duringthisprocess?

● HowoftendoIreorganizedatastreaminginintoaformatappropriateforreporting?

● Whenreporting,howdoIseedatathathasnotyetbeenreorganized?

● HowdoIensurethatimportantjobsaren’tinterruptedbymaintenance?

NewPartition

MostRecentPartition

HistoricalData

HBase

ParquetFile

Haveweaccumulatedenoughdata?

ReorganizeHBasefileintoParquet

• Waitforrunningoperationstocomplete• DefinenewImpalapartitionreferencingthenewlywrittenParquetfile

IncomingData(MessagingSystem)

ReportingRequest

StorageinHDFS


Real-timeanalyticsinHadoopwithKudu

Improvements:● Onesystemtooperate

● Nocronjobsorbackgroundprocesses

● Handlelatearrivalsordatacorrectionswithease

● Newdataavailableimmediatelyforanalyticsoroperations

HistoricalandReal-timeData

IncomingData(MessagingSystem)

ReportingRequest

StorageinKudu


Xiaomiusecase

• World’s4th largestsmart-phonemaker(mostpopularinChina)

• GatherimportantRPCtracingeventsfrommobileappandbackendservice.

• Servicemonitoring&troubleshootingtool.

Highwritethroughput

• >5Billionrecords/dayandgrowing

Querylatestdataandquickresponse

• Identifyandresolveissuesquickly

Cansearchforindividualrecords

• Easyfortroubleshooting


XiaomibigdataanalyticspipelineBeforeKudu

LargeETLpipelinedelays● Highdatavisibilitylatency

(from1hourupto1day)

● Dataformatconversionwoes

Orderingissues● Logarrival(storage)not

exactlyincorrectorder

● Mustread2– 3daysofdatatogetallofthedatapointsforasingleday


XiaomibigdataanalyticspipelineSimplifiedwithKudu

LowlatencyETLpipeline● ~10sdatalatency

● ForappsthatneedtoavoiddirectbackpressureorneedETLforrecordenrichment

Directzero-latencypath● Forappsthatcantolerate

backpressureandcanusetheNoSQLAPIs

● Appsthatdon’tneedETLenrichmentforstorage/retrieval

OLAPscanSidetablelookupResultstore


HowitworksReplicationandfaulttolerance


Tablesandtablets

• Eachtableishorizontallypartitionedinto tablets• Range orhash partitioning

• PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS

• Translation: bucketNumber = hashCode(row[‘timestamp’]) % 100• EachtablethasNreplicas(3or5),withRaftconsensus• Tabletservershosttabletsonlocaldiskdrives


Metadata

• Replicatedmaster• Actsasatabletdirectory• Actsasacatalog(whichtablesexist,etc)• Actsasaloadbalancer(tracksTSliveness,re-replicatesunder-replicatedtablets)

• Cachesallmetadata inRAMforhighperformance• Clientconfiguredwithmasteraddresses

• Asksmasterfortabletlocationsasneededandcachesthem


Client

HeyMaster!Whereistherowfor‘mpercy’intable“T”?

It’spartoftablet2,whichisonservers{Z,Y,X}.BTW,here’sinfoonothertabletsyoumightcareabout:T1,T2,T3,…

UPDATEmpercySETcol=foo

MetaCacheT1:…T2:…T3:…


Faulttolerance

• Operationsreplicated usingRaft consensus• Strictquorumalgorithm.SeeRaftpaperfordetails

• Transientfailures:• Followerfailure:Leadercanstillachievemajority• Leaderfailure:automaticleaderelection(~5seconds)• RestartdeadTSwithin5minanditwillrejointransparently

• Permanentfailures• After5minutes,automaticallycreatesanewfollowerreplicaandcopiesdata

• Nreplicascantoleratemaximumof(N-1)/2failures


HowitworksColumnarstorage


Columnarstorage

{25059873,22309487,23059861,23010982}

Tweet_id

{newsycbot,RideImpala,fastly,llvmorg}

User_name

{1442865158,1442828307,1442865156,1442865155}

Created_at

{Visualexp…,Introducing..,MissingJuly…,LLVM3.7….}

text


Columnarstorage

{25059873,22309487,23059861,23010982}

Tweet_id

{newsycbot,RideImpala,fastly,llvmorg}

User_name

{1442865158,1442828307,1442865156,1442865155}

Created_at

{Visualexp…,Introducing..,MissingJuly…,LLVM3.7….}

text

SELECTCOUNT(*)FROMtweetsWHEREuser_name =‘newsycbot’;

Onlyread1column

1GB 2GB 1GB 200GB


Columnarcompression

{1442865158,1442828307,1442865156,1442865155}

Created_at

Created_at Diff(created_at)

1442865158 n/a

1442828307 -36851

1442865156 36849

1442865155 -1

64bitseach 17bitseach

• Manycolumnscancompresstoafewbitsperrow!

• Especially:• Timestamps• Timeseriesvalues• Low-cardinalitystrings

• Massivespacesavingsandthroughputincrease!


Integrations


SparkDataSource integration(WIP)

sqlContext.load("org.kududb.spark",Map("kudu.table" -> “foo”,

"kudu.master" -> “master.example.com”)).registerTempTable(“mytable”)

df = sqlContext.sql(“select col_a, col_b from mytable “ +“where col_c = 123”)

https://github.com/tmalaska/SparkOnKudu


Impalaintegration

• CREATE TABLE … DISTRIBUTE BY HASH(col1) INTO 16 BUCKETS AS SELECT … FROM …

• INSERT/UPDATE/DELETE

• Optimizations like predicate pushdown, scan parallelism, plans for more on the way

• Not an Impala user? Apache Drill integration in the works by Dremio


MapReduceintegration

•MostKuduintegration/correctnesstestingviaMapReduce•Multi-frameworkcluster(MR+HDFS+Kuduonthesamedisks)•KuduTableInputFormat /KuduTableOutputFormat

•Supportforpushingdownpredicates,columnprojections,etc.


Performance


TPC-H(analyticsbenchmark)

• 75servercluster• 12(spinning)diskseach,enoughRAMtofitdataset• TPC-HScaleFactor100(100GB)

• Examplequery:• SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue FROM customer, orders, lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA' AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01’ GROUP BY n_name ORDER BY revenue desc;


• KuduoutperformsParquetby31%(geometricmean)forRAM-residentdata


VersusotherNoSQLstorage• ApachePhoenix:OLTPSQLenginebuiltonHBase• 10nodecluster(9worker,1master)• TPC-HLINEITEMtableonly(6Brows)

2152

21976

131

0,04

1918

13,2

1,7

0,7

0,15

155

9,3

1,4 1,5 1,37

0,01

0,1

1

10

100

1000

10000Load TPCHQ1 COUNT(*)

COUNT(*)WHERE…

single-rowlookup

Time(sec)

Phoenix

Kudu

Parquet


WhataboutNoSQL-stylerandomaccess?(YCSB)

• YCSB 0.5.0-snapshot• 10nodecluster(9worker,1master)

• 100Mrowdataset• 10Moperationseachworkload


Demo


Kuducommunity

Yourcompanyhere!


Gettingstartedasauser

• https://kudu.apache.org/

• Quickstart VM• Easiestwaytogetstarted• ImpalaandKuduinaneasy-to-installVM

• CSDandParcels• ForinstallationonaClouderaManager-managedcluster


Gettingstartedasadeveloper

• https://github.com/apache/kudu

• Apache2.0licensedopensourceandanASFincubatorproject.• Contributionsarewelcomeandencouraged!Becomeacommitter!


Questions?http://getkudu.io@ApacheKudu

Data & Analytics

Kudu Cloudera Meetup Paris