APACHE KUDU: ANALYTICS & RANDOM ACCESS IN A SINGLE ...biconsulting.hu/letoltes/2018budapestdata/...kudu.pdf · Fast Scans, Analytics and Processing of Static Data Fast On-Line Updates

© Cloudera, Inc. All rights reserved. 1

APACHE KUDU: ANALYTICS & RANDOM ACCESS IN A SINGLE DISTRIBUTED DATABASEAttila Bukor


HDFS

Fast Scans, Analytics

and Processing of Static Data

Fast On-Line Updates &

Data Serving

Arbitrary Storage

(Active Archive)

Fast Analytics(on fast-changing or

frequently-updated data)

Unchanging

Fast ChangingFrequent Updates

HBase

Append-Only

Real-Time

Complex Hybrid Architectures

Analytic Gap

Pace of Analysis

Pace

of D

ata


HDFS

Fast Scans, Analytics

and Processing of Static Data

Fast On-Line Updates &

Data Serving

Arbitrary Storage

(Active Archive)

Fast Analytics(on fast-changing or

frequently-updated data)

Unchanging

Fast ChangingFrequent Updates

HBase

Append-Only

Real-Time

KuduKudu fills the Gap

Modern analytic applications often

require complex data flow and difficult

integration work to move data between

HBase and HDFS

Analytic Gap

Pace of Analysis

Pace

of D

ata


NEXT GENERATION HARDWARE

• Cheaper and faster every year.

• CPUs adding cores and SIMD width instead of clock speed

• Cheaper and bigger every year


SCALABLE AND FAST TABULAR STORAGE

● ScalableTested up to 275 nodes (~3PB cluster)

Designed to scale to 1000s of nodes and tens of PBs

● FastMillions of read/write operations per second across cluster

Multiple GB/second read throughput per node

● TabularRepresents data in structured tables like a relational database

Individual record-level access to 100+ billion row tables


STORING RECORDS IN KUDU TABLES

● SQL-like schema○ Finite number of columns○ Strongly typed○ Primary keys - ordered subset of columns○ Fast ALTER TABLE (async)

● Java, Python, and C++ NoSQL-style APIs● SQL via integrations with Impala and Spark● CRUD with low-milliseconds latencies


TABLES, TABLETS AND TABLET SERVERS

● Horizontal partitions - tablets hosted on Tablet servers

○ Range and/or hash partitioning on primary key

● 2n+1 replicas with Raft consensus (http://raft.github.io)

○ MTTR ~5 seconds

● Hosted on local disks, decoupled from Hadoop

● HA Master servers

○ metadata operations (DDL)

○ tablet directory


XIAOMI’S USE-CASE

• 4th largest smartphone maker

• RPC tracing

• Service monitoring & troubleshooting tool

• High write throughput

• Real-time querying

• Search for individual records


PIPELINE BEFORE KUDU

● Long pipeline

● No ordering


● ETL Pipeline (0~10s latency)

● Direct Pipeline (no latency)

NEW PIPELINE WITH KUDU


● ATLAS EventIndex - catalogue of all events

● Tens of billions of ~1kB records

● Needs a fast and efficient selection of events of interest

○ various criteria

○ referencing events in millions of files scattered around a world-wide

distributed computing system

CERN’S USE-CASE


● Benchmark

○ same dataset

○ same Hadoop cluster

○ data access with Apache Impala

○ 14 machines with 2 x 8 cores @2.60GHz, 64GB RAM, 24x2TB SAS drives

CERN’S USE-CASE


● Storage efficiency○ Total volume reduced by a factor of 10 in Kudu and Parquet with Snappy

● Ingestion speed○ Ingestion rate 2 to 50 times faster in all tested solutions

● Random access○ Random lookup typically below 200ms in HBase or Kudu

● Analytics○ 300k records/s per CPU for aggregation, filtering and reporting with

Parquet and Kudu● Data mutability

○ Kudu and HBase can modify records in place.

BENCHMARK RESULTS


http://tiny.cloudera.com/kudu-cern (https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines)

BENCHMARK RESULTS

http://tiny.cloudera.com/kudu-cern


From Zbigniew Baranowski’s Strata New York 2017 presentation "Scaling database and analytic workloads with Apache Kudu”

NEW BENCHMARK RESULTS


THANK YOUAttila [email protected]@r1pp3rj4ck

http://kudu.apache.orghttp://github.com/apache/kudu

mailto:[email protected]

http://kudu.apache.org

http://github.com/apache/kudu

Documents

APACHE KUDU: ANALYTICS & RANDOM ACCESS IN A SINGLE ...biconsulting.hu/letoltes/2018budapestdata/...kudu.pdf · Fast Scans, Analytics and Processing of Static Data Fast On-Line Updates