Building Multi-Petabyte Data Warehouses with ClickHouse · PDF file– Paraccel (now...

Preview:

Citation preview

BuildingMulti-PetabyteDataWarehouseswithClickHouse

AlexanderZaitsev

LifeSteet,Altinity

PerconaLiveDublin,2017

Altinity

WhoamI

•  GraduatedMoscowStateUniversityin1999

•  Softwareengineersince1997

•  Developeddistributedsystemssince2002

•  Focusedonhighperformanceanalyticssince2007

•  DirectorofEngineeringinLifeStreet

•  Co-founderofAltinity

•  AdTechcompany(adexchange,adserver,RTB,DMPetc.)

since2006

•  10,000,000,000+events/day

•  2K/event

•  3monthsretention(90-120days)

10B*2K*[90-120]=[1.8-2.4]PB

•  Tried/used/evaluated:

–  MySQL(TokuDB,ShardQuery)

–  InfiniDB

–  MonetDB

–  InfoBrightEE–  Paraccel(nowRedShift)

–  Oracle

–  Greenplum

–  SnowflakeDB

–  Vertica

ClickHouse

Flashback:ClickHouseat08/2016

•  1-2monthsinOpenSource

•  InternalYandexproduct–nootherinstallations

•  Nosupport,roadmap,communicatedplans

•  3officialdevs

•  Anumberofvisiblelimitations(andmanyinvisible)

•  Storiesofotherdoomedopen-sourcedDBs

Developproductionsystemwith“that”?

ClickHouseis/wasmissing:

•  Transactions

•  Constraints

•  Consistency

•  UPDATE/DELETE

•  NULLs(addedfewmonthsago)

•  Milliseconds

•  Implicittypeconversions

•  StandardSQLsupport

•  Partitioningbyanycolumn(dateonly)

•  Enterpriseoperationtools

SQLdevelopersreaction:

Butwetriedandsucceeded

Beforeyougo:

ü Confirmyourusecase

ü Checkbenchmarks

ü Runyourown

ü Considerlimitations,notfeatures

ü MakeaPOC

Migrationproblem:basicthingsdonotfit

MainChallenges

•  Efficientschema

– UseClickHousebests

– Workaroundlimitations

•  Reliabledataingestion

•  Shardingandreplication

•  Clientinterfaces

LifeStreetUseCase

•  Publisher/Advertiserperformance

•  Campaign/Creativeperformanceprediction

•  Realtimealgorithmicbidding

•  DMP

LifeStreetRequirements

•  Load10Brows/day,500dimensions/row

•  Ad-hocreportson3monthsofdata

•  Lowdataandquerylatency

•  HighAvailability

Multi-DimensionalAnalysis

N-dimensionalcube

M-dimensionalprojection

slice

OLAPquery:aggregation+filter+groupby

Rangefilter

Queryresult

Disclaimer:averageslie

Typicalschema:“star”

•  Facts•  Dimensions•  Metrics•  Projections

StarSchemaApproach

De-normalized:dimensionsinafacttable

Normalized:dimensionkeysinafacttableseparatedimensiontables

Singletable,simple Multipletables

Simplequeries,nojoins Morecomplexquerieswithjoins

Datacannotbechanged Dataindimensiontablescanbechanged

Sub-efficientstorage Efficientstorage

Sub-efficientqueries Moreefficientqueries

Normalizedschema:traditionalapproach-joins

•  LimitedsupportinClickHouse(1level,

cascadesub-selectsformultiple)

•  Dimensiontablesarenotupdatable

Dictionaries-ClickHousedimensionsapproach

•  Lookupservice:key->value

•  Supportsdifferentexternalsources(files,

databasesetc.)

•  Refreshable

Dictionaries.ExampleSELECT country_name, sum(imps) FROM T ANY INNER JOIN dim_geo USING (geo_key) GROUP BY country_name; vs SELECT dictGetString(‘dim_geo’, ‘country_name’, geo_key) country_name, sum(imps) FROM T GROUP BY country_name;

Dictionaries.Configuration<dictionary>

<name></name>

<source>…</source>

<lifetime>...</lifetime>

<layout>…</layout>

<structure>

<id>...</id>

<attribute>...</attribute>

<attribute>...</attribute>

...

</structure>

</dictionary>

Dictionaries.Sources•  file

•  mysqltable

•  clickhousetable

•  odbcdatasource

•  executablescript

•  httpservice

Dictionaries.Layouts

•  flat

•  hashed

•  cache

•  complex_key_hashed

•  range_hashed

Dictionaries.range_hashed

•  ‘EffectiveDated’queries

<layout>

<range_hashed/>

</layout>

<structure>

<id>

<name>id</name>

</id>

<range_min>

<name>start_date</name>

</range_min>

<range_max>

<name>end_date</name>

</range_max>

dictGetFloat32('srv_ad_serving_costs','ad_imps_cpm',toUInt64(0),event_day)

Dictionaries.Updatevalues•  Bytimer(default)

•  AutomaticforMySQLMyISAM

•  Using‘invalidate_query’

•  Manuallytouchingconfigfile

•  Warning:Ndict*Mnodes=N*MDB

connections

Dictionaries.Restrictions

•  ‘Normal’keysareonlyUInt64

•  Noondemandupdate(addedinSep2017

1.1.54289)

•  Everyclusternodehasitsowncopy

•  XMLconfig(DDLwouldbebetter)

Dictionariesvs.Tables

+NoJOINs

+Updatable

+Alwaysinmemoryforflat/hash(faster)

-  Notapartoftheschema

-  Somewhatinconvenientsyntax

Tables

•  Engines

•  Sharding/Distribution

•  Replication

Engine=?

•  Inmemory:

–  Memory

–  Buffer

–  Join

–  Set

•  Ondisk:

–  Log,TinyLog

–  MergeTreefamily

•  Interface:

–  Distributed–  Merge

–  Dictionary•  Specialpurpose:

–  View

–  MaterializedView

–  Null

Mergetree•  Whatis‘merge’

•  PKsortingandindex

•  Datepartitioning

•  Queryperformance

Block1 Block2

Mergedblock

PKindex

Seedetailsat:https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324

MergeTreefamily

ReplicatedReplacingCollapsingSummingAggergatingGraphite

MergeTree+ +

DataLoad

•  Multipleformatsaresupported,includingCSV,TSV,

JSONs,nativebinary

•  Errorhandling•  SimpleTransformations

•  Loadlocally(better)ordistributed(possible)

•  Temptableshelp

•  Replicatedtableshelpwithde-dup

ThepowerofMaterializedViews

•  MVisatable,i.e.engine,replicationetc.

•  Updatedsynchronously

•  Summing/AggregatingMergeTree–consistentaggregation

•  Altersareproblematic

DataLoadDiagram

Temptables(local)

Facttables(shard)

SummingMergeTree(shard)

SummingMergeTree(shard)

LogFiles

INSERT

MV MV

INSERT Buffertables(local)

Realtimeproducers

INSERT

Bufferflush

MySQL

Dictionaries

CLICKHOUSENODE

Updatesanddeletes

•  Dictionariesarerefreshable

•  ReplacingandCollapsingmergetrees

– eventuallyupdates

– SELECT…FINAL

•  Partitions

ShardingandReplication•  ShardingandDistribution=>Performance

–  FacttablesandMVs–distributedovermultipleshards

–  Dimensiontablesanddicts–replicatedateverynode

(localjoinsandfilters)

•  Replication=>Reliability–  2-3replicaspershard

–  CrossDC

DistributedQuerySELECTfooFROMdistributed_tableGROUPbycol1

Server1,2or3

SELECTfooFROMlocal_tableGROUPBYcol1

•  Server1

SELECTfooFROMlocal_tableGROUPBYcol1

•  Server2

SELECTfooFROMlocal_tableGROUPBYcol1

•  Server3

Replication•  Pertabletopologyconfiguration:

–  Dimensiontables–replicatetoanynode–  Facttables–replicatetomirrorreplica

•  Zookepertocommunicatethestate

–  State:whatblocks/partstoreplicate

•  Asynchronous=>fasterandreliableenough

•  Synchronous=>slower

•  Isolatequerytoreplica

•  Replicationqueues

SQL•  SupportsbasicSQLsyntax

•  Non-standardJOINsimplementation:

–  1levelonly

–  ANYvsALL

–  onlyUSING

•  Aliasingeverywhere

•  Arrayandnesteddatatypes,lambda-expressions,ARRAYJOIN

•  GLOBALIN,GLOBALJOIN

•  Approximatequeries

•  Someanalyticalfunctions

HardwareandDeployment

•  LoadisCPUintensive=>morecores

•  Queryisdiskintensive=>fasterdisks

•  10-12SATARAID10–  SAS/SSD=>x2performanceforx2priceforx0.5capacity

•  10TB/serverseemsoptimal

•  Zookeper–keepinonDCforfastquorum

•  RemoteDCworkbad(e.g.EastanWestcoastinUS)

MainChallengesRevisited

•  Designefficientschema

– UseClickHousebests

– Workaroundlimitations

•  Designshardingandreplication

•  Reliabledataingestion

•  Clientinterfaces

Migrationprojecttimelines•  August2016:POC

•  October2016:firsttestruns

•  December2016:productionscaledataload:

–  10-50Bevents/day,20TBdata/day–  12x2serverswith12x4TBRAID10

•  March2017:ClientAPIready,startingmigration

–  30+clienttypes,20req/squeryload

•  May2017:extensionto20x3servers

•  June2017:migrationcompleted!

–  2-2.5PBuncompresseddata

Fewexamples

:)selectcount(*)fromdw.ad8_fact_eventwhereaccess_day=today()-1;SELECTcount(*)FROMdw.ad8_fact_eventWHEREaccess_day=(today()-1)┌────count()─┐│7585106796│└────────────┘1rowsinset.Elapsed:0.503sec.Processed12.78billionrows,25.57GB(25.41billionrows/s.,50.82GB/s.)

:)selectdictGetString('dim_country','country_code',toUInt64(country_key))country_code,count(*)cntfromdw.ad8_fact_eventwhereaccess_day=today()-1groupbycountry_codeorderbycntdesclimit5;SELECTdictGetString('dim_country','country_code',toUInt64(country_key))AScountry_code,count(*)AScntFROMdw.ad8_fact_eventWHEREaccess_day=(today()-1)GROUPBYcountry_codeORDERBYcntDESCLIMIT5┌─country_code─┬────────cnt─┐│US│2159011287││MX│448561730││FR│433144172││GB│352344184││DE│336479374│└──────────────┴────────────┘5rowsinset.Elapsed:2.478sec.Processed12.78billionrows,55.91GB(5.16billionrows/s.,22.57GB/s.)

:)SELECTdictGetString('dim_country','country_code',toUInt64(country_key))AScountry_code,sum(cnt)AScntFROM(SELECTcountry_key,count(*)AScntFROMdw.ad8_fact_eventWHEREaccess_day=(today()-1)GROUPBYcountry_keyORDERBYcntDESCLIMIT5)GROUPBYcountry_codeORDERBYcntDESC┌─country_code─┬────────cnt─┐│US│2159011287││MX│448561730││FR│433144172││GB│352344184││DE│336479374│└──────────────┴────────────┘5rowsinset.Elapsed:1.471sec.Processed12.80billionrows,55.94GB(8.70billionrows/s.,38.02GB/s.)

:)SELECTcountDistinct(name)ASnum_cols,formatReadableSize(sum(data_compressed_bytes)ASc)AScomp,formatReadableSize(sum(data_uncompressed_bytes)ASr)ASraw,c/rAScomp_ratioFROMlf.columnsWHEREtable='ad8_fact_event_shard'┌─num_cols─┬─comp───────┬─raw──────┬──────────comp_ratio─┐│308│325.98TiB│4.71PiB│0.06757640834769944│└──────────┴────────────┴──────────┴─────────────────────┘1rowsinset.Elapsed:0.289sec.Processed281.46thousandrows,33.92MB(973.22thousandrows/s.,117.28MB/s.)

ClickHouseatfall2017

•  1+yearOpenSource

•  100+prodinstallsworldwide

•  Publicchangelogs,roadmap,andplans

•  5+2Yandexdevs,fewcommunitycontributors

•  Activecommunity,blogs,casestudies

•  Alotoffeaturesaddedbycommunityrequests

•  SupportbyAltinity

Sonowitismucheasier

ClickHouseandMySQL

•  MySQLiswidespreadbutweakforanalytics

– TokuDB,InfiniDBsomewhathelp

•  ClickHouseisbestinanalytics

Howtocombine?

Imagine

MySQLflexibilityatClickHousespeed?

Dreams….

ClickHousewithMySQL•  ProxySQLtoaccessClickHouse

dataviaMySQLprotocol(moreat

thenextsession)

•  BinlogsintegrationtoloadMySQL

datainClickHouseinrealtime(in

progress)

MySQL CH

ProxySQL

binlogconsumer

ClickHouseinsteadofMySQL

•  Weblogsanalytics

•  Monitoringdatacollectionandanalysis

– Percona’sPMM

–  InfinidatInfiniMetrics

•  Othertimeseriesapps

•  ..andmore!

Questions?

Contactme:

alexander.zaitsev@lifestreet.comalz@altinity.comskype:alex.zaitsev

Altinity

Recommended