Apache Kudu
Zbigniew Baranowski
Intro
What is KUDU?
• New storage engine for structured data (tables) – does not use HDFS!
• Columnar store
• Mutable (insert, update, delete)
• Written in C++
• Apache-licensed – open source
• Quite new – the 1.0 version was recently released
  – First commit on October 11th, 2012
  – …and immature?
KUDU tries to fill the gap
• HBase (on HDFS) excels at
  – Fast random lookups by key
  – Making data mutable
• HDFS excels at
  – Scanning large amounts of data at speed
  – Accumulating data with high throughput
Table-oriented storage
• A Kudu table has an RDBMS-like schema
  – Primary key (one or many columns)
    • No secondary indexes
  – Finite and constant number of columns (unlike HBase)
  – Each column has a name and a type
    • boolean, int(8,16,32,64), float, double, timestamp, string, binary
• Horizontally partitioned (by range or hash) – partitions are called tablets
  – tablets can have 3 or 5 replicas
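To make the partitioning concrete, here is a minimal sketch of how hash partitioning maps a composite primary key to one of N tablets. This is an illustration of the principle only – Kudu's actual hash function is internal to Kudu and differs from `Objects.hash`:

```java
import java.util.Objects;

// Sketch: equal primary keys always land in the same tablet (bucket),
// so both point lookups and inserts can be routed deterministically.
public class HashPartitioning {
    static int bucketFor(long runnumber, long eventnumber, int numBuckets) {
        int h = Objects.hash(runnumber, eventnumber);
        return Math.floorMod(h, numBuckets); // non-negative bucket index
    }

    public static void main(String[] args) {
        int b1 = bucketFor(169864L, 1L, 64);
        int b2 = bucketFor(169864L, 1L, 64);
        assert b1 == b2 : "same key must map to the same tablet";
        assert b1 >= 0 && b1 < 64;
        System.out.println("bucket = " + b1);
    }
}
```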
Data Consistency
• Writing
  – Single-row mutations are done atomically across all columns
  – No multi-row ACID transactions
• Reading
  – Tuneable freshness of the data
    • read whatever is available
    • or wait until all changes committed in the WAL are available
  – Snapshot consistency
    • changes made during a scan are not reflected in the results
    • point-in-time queries are possible (based on a provided timestamp)
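The snapshot-consistency semantics above can be illustrated with a minimal MVCC sketch: a scan fixes a timestamp and ignores any mutation applied after it. This shows the semantics only, not Kudu's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Toy multi-version row: versions are appended in timestamp order,
// and a read at a snapshot timestamp sees only versions at or before it.
public class SnapshotScan {
    static class Version {
        final long ts; final String value;
        Version(long ts, String value) { this.ts = ts; this.value = value; }
    }

    static String readAt(List<Version> versions, long snapshotTs) {
        String result = null;
        for (Version v : versions)
            if (v.ts <= snapshotTs) result = v.value; // newest visible version wins
        return result;
    }

    public static void main(String[] args) {
        List<Version> row = new ArrayList<>();
        row.add(new Version(10, "v1"));
        row.add(new Version(20, "v2")); // applied while a scan at ts=15 runs
        assert readAt(row, 15).equals("v1"); // the scan does not see the update
        assert readAt(row, 25).equals("v2"); // a later scan does
        System.out.println("snapshot reads are stable");
    }
}
```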
Kudu simplifies the Big Data deployment model for online analytics (low-latency ingestion and access)
• Classical low latency design

[Diagram: stream sources send events both to a staging area of indexed data (flushed immediately, serving fast data access) and, flushed periodically as big files, to HDFS for batch processing]
Implementing low latency with Kudu
[Diagram: stream sources send events directly to Kudu, which serves both batch processing and fast data access from the same store]
Kudu Architecture
Architecture overview
• Master server (there can be multiple masters for HA)
  – Stores metadata – table definitions
  – Tablets directory (tablet locations)
  – Coordinates cluster reconfigurations
• Tablet servers (worker nodes)
  – Write and read tablets
    • Stored on local disks (no HDFS)
  – Track the status of tablet replicas (followers)
  – Replicate the data to followers
Tables and tablets

Map of table TEST (stored on the Master):

TabletID | Leader | Follower1 | Follower2
TEST1    | TS1    | TS2       | TS3
TEST2    | TS4    | TS1       | TS2
TEST3    | TS3    | TS4       | TS1

[Diagram: TabletServer1 hosts the leader of TEST1 and followers of TEST2 and TEST3; TabletServer2 hosts followers of TEST1 and TEST2; TabletServer3 hosts a follower of TEST1 and the leader of TEST3; TabletServer4 hosts the leader of TEST2 and a follower of TEST3]
Data changes propagation in Kudu (Raft Consensus - https://raft.github.io)
[Diagram: the client locates the tablet leader via the Master, then sends the write to the leader replica of Tablet 1 on tablet server X; the leader records it in its WAL and replicates it to the follower WALs on tablet servers Y and Z; the write commits once a majority has acknowledged it]
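The commit rule in the diagram above – a write counts as committed once a majority of the replica set has logged it – can be sketched as follows. This illustrates the Raft majority rule only, not Kudu's actual consensus code:

```java
// Raft majority-commit rule as used for Kudu tablet replication:
// an entry is committed once a strict majority of replicas
// (leader included) have written it to their WALs.
public class RaftCommit {
    static boolean isCommitted(int acks, int replicas) {
        return acks > replicas / 2; // strict majority required
    }

    public static void main(String[] args) {
        assert isCommitted(2, 3);  // leader + 1 follower of 3: committed
        assert !isCommitted(1, 3); // leader alone: not committed
        assert !isCommitted(2, 5); // 2 of 5: not committed
        assert isCommitted(3, 5);  // 3 of 5: committed
        System.out.println("majority-commit rule holds");
    }
}
```

This is why tablets have 3 or 5 replicas: an odd-sized replica set can always form a majority while tolerating 1 or 2 failed replicas.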
Tablet server
Insert into tablet (without uniqueness check)
INSERT
MemRowSet
DiskRowSet1 (32MB) PK {min, max}
B+treeRow: Col1,Col2, Col3
Row1,Row2,Row3Leafs sorted by
Primary Key
Flush
Col1 Col2 Col3
Columnar store encoded similarly to ParquetRows sorted by PK.
Bloom filters
Bloom filters for PK ranges. Stored in cached btree
There might be Ks of sets per tablet
Interval tree Interval tree keeps
track of PK ranges within DiskRowSets
PK
DiskRowSet2 (32MB) PK {min, max}
Col1 Col2 Col3 Bloom filters
PK
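The point of the PK ranges and Bloom filters above is to prune DiskRowSets during key lookups. Here is a simplified sketch of that pruning; real Kudu uses proper Bloom filters and an interval tree, while a 64-bit mask stands in for the filter here:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a key lookup only touches DiskRowSets whose PK {min, max}
// interval contains the key AND whose Bloom filter reports possible
// membership; all other sets are skipped without disk I/O.
public class DiskRowSetLookup {
    static class DiskRowSet {
        final long minPk, maxPk;
        long bloomBits = 0; // toy 64-bit "Bloom filter"
        DiskRowSet(long min, long max) { minPk = min; maxPk = max; }
        void add(long pk) { bloomBits |= 1L << (Long.hashCode(pk) & 63); }
        boolean mightContain(long pk) {
            if (pk < minPk || pk > maxPk) return false; // interval check
            return (bloomBits & (1L << (Long.hashCode(pk) & 63))) != 0;
        }
    }

    static List<DiskRowSet> candidates(List<DiskRowSet> sets, long pk) {
        List<DiskRowSet> out = new ArrayList<>();
        for (DiskRowSet s : sets)
            if (s.mightContain(pk)) out.add(s); // only these are read from disk
        return out;
    }

    public static void main(String[] args) {
        DiskRowSet a = new DiskRowSet(0, 100);
        DiskRowSet b = new DiskRowSet(200, 300);
        a.add(42);
        b.add(250);
        List<DiskRowSet> sets = List.of(a, b);
        assert candidates(sets, 42).contains(a);
        assert !candidates(sets, 42).contains(b); // pruned by PK range
        assert candidates(sets, 500).isEmpty();   // outside every range
        System.out.println("pruning works");
    }
}
```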
DiskRowSet compaction

• Periodic background task
• Removes deleted rows
• Reduces the number of sets with overlapping PK ranges
• Does not create bigger DiskRowSets – the 32MB size of each DRS is preserved

Before: DiskRowSet1 (32MB) PK {A, G}, DiskRowSet2 (32MB) PK {B, E}
After compaction: DiskRowSet1 (32MB) PK {A, D}, DiskRowSet2 (32MB) PK {E, G}
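The effect shown above – two overlapping sets becoming two non-overlapping ones of the same size – amounts to merge-sorting the rows and re-splitting them. A minimal sketch (rows reduced to their PK strings; dropping deleted rows is omitted):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of what compaction achieves: rows from DiskRowSets with
// overlapping PK ranges are merge-sorted and rewritten into
// equally sized sets with disjoint PK ranges.
public class Compaction {
    static List<List<String>> compact(List<String> drs1, List<String> drs2) {
        List<String> merged = new ArrayList<>(drs1);
        merged.addAll(drs2);
        Collections.sort(merged); // rows sorted by PK across both sets
        int half = (merged.size() + 1) / 2; // preserve per-set size
        return List.of(merged.subList(0, half),
                       merged.subList(half, merged.size()));
    }

    public static void main(String[] args) {
        // PK ranges {A,G} and {B,E} overlap before compaction
        List<List<String>> out = compact(List.of("A", "C", "G"), List.of("B", "E"));
        // Afterwards the ranges are disjoint
        System.out.println(out); // [[A, B, C], [E, G]]
    }
}
```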
How columns are stored on disk (DiskRowSet)

• Each column is stored separately as a sequence of ~256KB pages (values + page metadata)
• A per-column B-tree index maps row offsets to pages
• A PK B-tree index maps primary keys to row offsets
• Pages are encoded with a variety of encodings, such as dictionary encoding, bitshuffle, or RLE
• Pages can be compressed: Snappy, LZ4, or zlib
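Of the page encodings listed above, run-length encoding is the simplest to illustrate. A toy RLE for a column of longs (this is not Kudu's on-disk format, just the idea that repeated values in a sorted column compress well):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy run-length encoding for a column page: consecutive equal
// values are stored once with a repetition count.
public class RlePage {
    static List<long[]> encode(long[] values) {
        List<long[]> runs = new ArrayList<>(); // each run: {value, length}
        for (long v : values) {
            if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == v)
                runs.get(runs.size() - 1)[1]++;
            else
                runs.add(new long[]{v, 1});
        }
        return runs;
    }

    static long[] decode(List<long[]> runs) {
        List<Long> out = new ArrayList<>();
        for (long[] r : runs)
            for (long i = 0; i < r[1]; i++) out.add(r[0]);
        return out.stream().mapToLong(Long::longValue).toArray();
    }

    public static void main(String[] args) {
        long[] column = {7, 7, 7, 3, 3, 9};
        List<long[]> runs = encode(column);
        assert runs.size() == 3;                          // runs: 7x3, 3x2, 9x1
        assert Arrays.equals(decode(runs), column);       // lossless round-trip
        System.out.println(runs.size() + " runs for " + column.length + " values");
    }
}
```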
Kudu deployment
3 options for deployment

• Build from source
• Using RPMs
  – 1 core rpm
  – 2 service rpms (master and servers)
  – One shared config file
• Using Cloudera Manager – click, click, click, done
Interfacing with Kudu
Table access and manipulations
• Operations on tables (NoSQL)
  – insert, update, delete, scan
  – Python, C++, Java APIs
• Integrated with
  – Impala & Hive (SQL), MapReduce, Spark
  – Flume sink (ingestion)
Manipulating Kudu tables with SQL (Impala/Hive)

• Table creation

CREATE TABLE `kudu_example` (
  `runnumber` BIGINT,
  `eventnumber` BIGINT,
  `project` STRING,
  `streamname` STRING,
  `prodstep` STRING,
  `datatype` STRING,
  `amitag` STRING,
  `lumiblockn` BIGINT,
  `bunchid` BIGINT
)
DISTRIBUTE BY HASH (runnumber) INTO 64 BUCKETS
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'example_table',
  'kudu.master_addresses' = 'kudu-master.cern.ch:7051',
  'kudu.key_columns' = 'runnumber, eventnumber'
);

• DMLs

insert into kudu_example values (1,30,'test',….);
insert into kudu_example select * from data_parquet;
update kudu_example set datatype='test' where runnumber=1;
delete from kudu_example where project='test';

• Queries

select count(*), max(eventnumber) from kudu_example
where datatype like '%AOD%' group by runnumber;

select * from kudu_example k, parquet_table p
where k.runnumber = p.runnumber;
Creating table with Java
import org.kududb.*;

// CREATING TABLE
String tableName = "my_table";
String KUDU_MASTER_NAME = "master.cern.ch";
KuduClient client = new KuduClient.KuduClientBuilder(KUDU_MASTER_NAME).build();

List<ColumnSchema> columns = new ArrayList<>();
columns.add(new ColumnSchema.ColumnSchemaBuilder("runnumber", Type.INT64)
    .key(true)
    .encoding(ColumnSchema.Encoding.BIT_SHUFFLE)
    .nullable(false)
    .compressionAlgorithm(ColumnSchema.CompressionAlgorithm.SNAPPY)
    .build());
columns.add(new ColumnSchema.ColumnSchemaBuilder("eventnumber", Type.INT64)
    .key(true)
    .encoding(ColumnSchema.Encoding.BIT_SHUFFLE)
    .nullable(false)
    .compressionAlgorithm(ColumnSchema.CompressionAlgorithm.SNAPPY)
    .build());
……..
Schema schema = new Schema(columns);

List<String> partColumns = new ArrayList<>();
partColumns.add("runnumber");
partColumns.add("eventnumber");
CreateTableOptions options = new CreateTableOptions()
    .addHashPartitions(partColumns, 64)
    .setNumReplicas(3);
client.createTable(tableName, schema, options);
Inserting rows with Java
// INSERTING
KuduTable table = client.openTable(tableName);
KuduSession session = client.newSession();
Insert insert = table.newInsert();
PartialRow row = insert.getRow();
row.addLong(0, 1);
row.addString(2, "test");
….
session.apply(insert); // buffered in memory on the client side (for batch upload)
session.flush();       // sends the data to Kudu
Scanner in Java
// configuring the column projection
List<String> projectColumns = new ArrayList<>();
projectColumns.add("runnumber");
projectColumns.add("datatype");

// setting a scan range
Schema schema = table.getSchema();
PartialRow start = schema.newPartialRow();
start.addLong("runnumber", 8);
PartialRow end = schema.newPartialRow();
end.addLong("runnumber", 10);

KuduScanner scanner = client.newScannerBuilder(table)
    .lowerBound(start)
    .exclusiveUpperBound(end)
    .setProjectedColumnNames(projectColumns)
    .build();

while (scanner.hasMoreRows()) {
    RowResultIterator results = scanner.nextRows();
    while (results.hasNext()) {
        RowResult result = results.next();
        System.out.println(result.getString(1)); // getting the 2nd projected column
    }
}
Spark with Kudu
wget http://central.maven.org/maven2/org/apache/kudu/kudu-spark_2.10/1.0.0/kudu-spark_2.10-1.0.0.jar
spark-shell --jars kudu-spark_2.10-1.0.0.jar

import org.apache.kudu.spark.kudu._

// Read a table from Kudu
val df = sqlContext.read.options(
  Map("kudu.master" -> "kudu_master.cern.ch:7051",
      "kudu.table" -> "kudu_table")).kudu

// Query using the DF API...
df.select(df("runnumber"), df("eventnumber"), df("db0"))
  .filter($"runnumber" === 169864)
  .filter($"eventnumber" === 1)
  .show()

// ...or register a temporary table and use SQL
df.registerTempTable("kudu_table")
sqlContext.sql("select id from kudu_table where id >= 5").show()

// Create a new Kudu table from a dataframe schema
// NB: no rows from the dataframe are inserted into the table
kuduContext.createTable("test_table", df.schema, Seq("key"), new CreateTableOptions().setNumReplicas(1))

// Insert data
kuduContext.insertRows(df, "test_table")
Kudu Security
To be done!
Performance (based on the ATLAS EventIndex case)

Average row length

• Each row consists of 56 attributes
  – Most of them are strings
  – A few integers and floats
• Very good compression ratio – the same as Parquet

[Bar chart: average row length in bytes for kudu, parquet, hbase, and avro with no compression, Snappy, and GZip-like compression, compared with the row length in CSV]
Insertion rates (per machine, per partition) with Impala
• Average ingestion speed
  – worse than Parquet
  – better than HBase

[Bar chart: insertion speed (kHz) for kudu, parquet, hbase, and avro with no compression, Snappy, and GZip-like compression]
Random lookup with Impala
• Good random data lookup speed
  – Similar to HBase

[Bar chart: average random lookup time [s] for kudu, parquet, hbase, and avro with no compression, Snappy, and GZip-like compression]
Data scan rate per core with a predicate on a non-PK column (using Impala)

• Quite good data scanning speed
  – Much better than HBase
  – If natively supported predicate operations are used, it is even faster than Parquet

[Bar chart: scan speed (kHz) for kudu, parquet, and hbase with no compression, Snappy, and GZip-like compression]
Kudu monitoring
Cloudera Manager
• A lot of metrics are published through the servers' HTTP endpoints
• All are collected by CM agents and can be plotted
• Predefined CM dashboards
  – Monitoring of Kudu processes
  – Workload plots
• CM can also be used for Kudu configuration
CM – Kudu host status
CM - Workload plots
CM - Resource utilisation
Observations & Conclusions
What is nice about Kudu
• The first in the Big Data open-source world to try to combine a columnar store with indexing
• Simple to deploy
• It works (almost) without problems
• It scales (depending on how the schema is designed)
  – Writing, accessing, scanning
• Integrated with mainstream Big Data processing frameworks
  – Spark, Impala, Hive, MapReduce
  – SQL and NoSQL on the same data
• Gives more flexibility in optimizing schema design compared to HBase (two levels of partitioning)
• Cloudera is pushing to deliver production quality of the software ASAP
What is bad about Kudu?
• No security yet (it should be added in the next releases)
  – authentication (who connected)
  – authorization (ACLs)
• Raft consensus does not always work as it should
  – Too frequent tablet leader changes (sometimes a leader cannot be elected at all)
  – The period without a leader is quite long (sometimes it never ends)
  – This freezes updates on the affected tables
• Handling disk failures
  – you have to erase/reinitialize the entire server
• Only one index per table
• No nested types (but there is a binary type)
• Cannot control tablet placement on servers
When can Kudu be useful?
• When you have structured 'big data'
  – Like in an RDBMS
  – Without complex types
• When sequential and random data access are required simultaneously and have to scale
  – Data extraction and analytics at the same time
  – Time series
• When low ingestion latency is needed
  – and a lambda architecture is too expensive
Learn more
• Main page: https://kudu.apache.org/
• Video: https://www.oreilly.com/ideas/kudu-resolving-transactional-and-analytic-trade-offs-in-hadoop
• Whitepaper: http://kudu.apache.org/kudu.pdf
• KUDU project: https://github.com/cloudera/kudu
• Some Java code examples: https://gitlab.cern.ch:8443/zbaranow/kudu-atlas-eventindex
• Get the Cloudera Quickstart VM and test it