Apache Kudu Zbigniew Baranowski

Apache Kudu - Indico… · KUDU tries to fill the gap • HBASE (on HDFS) excels at – Fast random lookups by key – Making data mutable HDFS excels at • Scanning of large amount


Intro

What is KUDU?

• New storage engine for structured data (tables) – does not use HDFS!

• Columnar store

• Mutable (insert, update, delete)

• Written in C++

• Apache-licensed, open source
  – Quite new: version 1.0 was recently released
    • First commit on October 11th, 2012
  – …and immature?

KUDU tries to fill the gap

• HBase (on HDFS) excels at
  – Fast random lookups by key
  – Making data mutable
• HDFS excels at
  – Scanning large amounts of data at speed
  – Accumulating data with high throughput

Table oriented storage

• A Kudu table has an RDBMS-like schema
  – Primary key (one or many columns)
    • No secondary indexes
  – Finite and constant number of columns (unlike HBase)
  – Each column has a name and a type
    • boolean, int (8, 16, 32, 64), float, double, timestamp, string, binary
• Horizontally partitioned (range, hash)
  – partitions are called tablets
  – tablets can have 3 or 5 replicas
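
As a rough illustration of hash partitioning, the sketch below assigns a row to one of a fixed number of buckets based on its primary-key columns. The class `HashPartitionSketch`, the method `bucketFor` and the use of `Objects.hash` are assumptions made for this example; Kudu uses its own key encoding and hash function.

```java
import java.util.Objects;

/** Illustrative only: maps a composite primary key to a hash bucket (tablet). */
final class HashPartitionSketch {

    /** Returns a bucket index in [0, numBuckets) for the given key columns. */
    static int bucketFor(int numBuckets, Object... keyColumns) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        // Objects.hash is a stand-in; Kudu's real hash function differs.
        int hash = Objects.hash(keyColumns);
        // floorMod keeps the result non-negative even for negative hashes.
        return Math.floorMod(hash, numBuckets);
    }

    public static void main(String[] args) {
        long runnumber = 169864L;
        long eventnumber = 1L;
        int bucket = bucketFor(64, runnumber, eventnumber);
        System.out.println("row goes to bucket " + bucket);
    }
}
```

The same key always lands in the same bucket, which is what lets a client route a lookup by primary key to a single tablet.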

Data Consistency

• Writing
  – Single-row mutations are done atomically across all columns
  – No multi-row ACID transactions
• Reading
  – Tuneable freshness of the data
    • read whatever is available
    • or wait until all changes committed in the WAL are available
  – Snapshot consistency
    • changes made during scanning are not reflected in the results
    • point-in-time queries are possible (based on a provided timestamp)
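
One way to picture snapshot reads and point-in-time queries is a store that keeps several timestamped versions of each row and answers a read with the newest version not later than the requested timestamp. The sketch below is a minimal model under that assumption; `SnapshotReadSketch`, `write` and `readAt` are hypothetical names, and the slides do not describe how Kudu implements this internally.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: a multi-version map that supports reads at a timestamp. */
final class SnapshotReadSketch {

    // For each row key, the history of values ordered by write timestamp.
    private final Map<String, TreeMap<Long, String>> versions = new HashMap<>();

    /** Records a new version of the row at the given timestamp. */
    void write(String key, long timestamp, String value) {
        versions.computeIfAbsent(key, k -> new TreeMap<>()).put(timestamp, value);
    }

    /** Returns the value visible at snapshotTimestamp, or null if none existed yet. */
    String readAt(String key, long snapshotTimestamp) {
        TreeMap<Long, String> history = versions.get(key);
        if (history == null) {
            return null;
        }
        // Newest version written at or before the snapshot; later writes are ignored.
        Map.Entry<Long, String> visible = history.floorEntry(snapshotTimestamp);
        return visible == null ? null : visible.getValue();
    }

    public static void main(String[] args) {
        SnapshotReadSketch store = new SnapshotReadSketch();
        store.write("row1", 10L, "a");
        store.write("row1", 20L, "b");
        System.out.println(store.readAt("row1", 15L)); // the write at 20 is not visible
    }
}
```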

Kudu simplifies the Big Data deployment model for online analytics (low-latency ingestion and access)

• Classical low latency design

[Diagram: classical low-latency design. Labels: Stream Source (several), Events, Staging area, Flush periodically, HDFS (Big Files), Flush immediately, Indexed data, Batch processing, Fast data access]

Implementing low latency with Kudu

[Diagram: low-latency design with Kudu. Labels: Stream Source (several), Events, Batch processing, Fast data access]

Kudu Architecture

Architecture overview

• Master server (there can be multiple masters for HA)
  – Stores metadata: table definitions
  – Keeps the tablet directory (tablet locations)
  – Coordinates cluster reconfigurations
• Tablet servers (worker nodes)
  – Write and read tablets
    • Stored on local disks (no HDFS)
  – Track the status of tablet replicas (followers)
    • Replicate the data to followers

Tables and tablets

Master
Map of table TEST:

TabletID  Leader  Follower1  Follower2
TEST1     TS1     TS2        TS3
TEST2     TS4     TS1        TS2
TEST3     TS3     TS4        TS1

TabletServer1  TabletServer2  TabletServer3  TabletServer4
Leader TEST1   TEST1          TEST1          Leader TEST2
TEST2          TEST2          Leader TEST3   TEST3
TEST3

Data changes propagation in Kudu (Raft Consensus - https://raft.github.io)

[Diagram: a client, the master, and tablet servers X, Y and Z. Tablet server X holds the leader replica of Tablet 1, and tablet servers Y and Z hold follower replicas. Each replica has its own WAL and commits the change.]
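
The commit rule in Raft can be sketched as follows: the leader appends a change to its own WAL, sends it to the followers, and treats the change as committed once a majority of replicas have it. This is a deliberately simplified, single-threaded model. `RaftCommitSketch` and its methods are hypothetical names, terms, elections and log matching are left out, and none of this is Kudu's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: majority acknowledgement for one replicated log entry. */
final class RaftCommitSketch {

    /** One replica of a tablet, with an in-memory list standing in for its WAL. */
    static final class Replica {
        final String name;
        final boolean reachable;
        final List<String> wal = new ArrayList<>();

        Replica(String name, boolean reachable) {
            this.name = name;
            this.reachable = reachable;
        }

        /** Appends the entry if the replica is reachable and returns the acknowledgement. */
        boolean append(String entry) {
            if (!reachable) {
                return false;
            }
            wal.add(entry);
            return true;
        }
    }

    /** Returns true if the entry reached a majority of replicas, leader included. */
    static boolean replicate(Replica leader, List<Replica> followers, String entry) {
        int acks = leader.append(entry) ? 1 : 0;
        for (Replica follower : followers) {
            if (follower.append(entry)) {
                acks++;
            }
        }
        int replicas = followers.size() + 1;
        return acks > replicas / 2;
    }

    public static void main(String[] args) {
        Replica x = new Replica("X", true);  // leader
        Replica y = new Replica("Y", true);  // follower
        Replica z = new Replica("Z", false); // follower that is currently down
        boolean committed = replicate(x, Arrays.asList(y, z), "row1=a");
        System.out.println("committed with Z down: " + committed);
    }
}
```

With three replicas, one unreachable follower does not block writes, while losing two replicas does. This is why tablets are configured with an odd number of replicas.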

Tablet server

Insert into tablet (without uniqueness check)

• INSERT goes into the MemRowSet
  – B+tree, row: Col1, Col2, Col3
  – Leaves (Row1, Row2, Row3) sorted by primary key
• Flush writes the MemRowSet to DiskRowSets (DiskRowSet1, DiskRowSet2, ...), 32MB each, each with a PK {min, max} range
  – Columnar store (Col1, Col2, Col3) encoded similarly to Parquet, rows sorted by PK
  – Bloom filters for PK ranges, stored in a cached B-tree
  – There might be thousands of sets per tablet
• An interval tree keeps track of PK ranges within DiskRowSets
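
The role of the PK ranges and Bloom filters can be sketched as a two-step pruning of row sets before any data is read: skip sets whose {min, max} range cannot contain the key, then skip sets whose Bloom filter rules the key out. The code below is a toy model under several assumptions: `RowSetLookupSketch` and its nested classes are hypothetical, a linear scan stands in for the interval tree, and the Bloom filter uses two simple hash slots rather than Kudu's implementation.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.TreeSet;

/** Illustrative only: pruning row sets by PK range and Bloom filter. */
final class RowSetLookupSketch {

    /** A tiny Bloom filter: may report false positives, never false negatives. */
    static final class BloomFilter {
        private final BitSet bits;
        private final int size;

        BloomFilter(int size) {
            this.size = size;
            this.bits = new BitSet(size);
        }

        private int slot(String key, int seed) {
            return Math.floorMod(key.hashCode() * 31 + seed * 17, size);
        }

        void add(String key) {
            bits.set(slot(key, 1));
            bits.set(slot(key, 2));
        }

        boolean mightContain(String key) {
            return bits.get(slot(key, 1)) && bits.get(slot(key, 2));
        }
    }

    /** Stand-in for a DiskRowSet: sorted keys, their {min, max} range and a filter. */
    static final class RowSet {
        final TreeSet<String> keys = new TreeSet<>();
        final BloomFilter filter = new BloomFilter(1024);

        RowSet(String... pks) {
            for (String pk : pks) {
                keys.add(pk);
                filter.add(pk);
            }
        }

        boolean rangeCovers(String pk) {
            return !keys.isEmpty()
                    && keys.first().compareTo(pk) <= 0
                    && keys.last().compareTo(pk) >= 0;
        }
    }

    /** Returns true if some row set holds the key, reading as few sets as possible. */
    static boolean contains(List<RowSet> rowSets, String pk) {
        for (RowSet rowSet : rowSets) {
            if (!rowSet.rangeCovers(pk)) {
                continue; // pruned by the PK range
            }
            if (!rowSet.filter.mightContain(pk)) {
                continue; // pruned by the Bloom filter
            }
            if (rowSet.keys.contains(pk)) {
                return true; // the only step that touches the stored keys
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<RowSet> rowSets = Arrays.asList(new RowSet("A", "C", "G"), new RowSet("B", "E"));
        System.out.println("C present: " + contains(rowSets, "C"));
        System.out.println("D present: " + contains(rowSets, "D"));
    }
}
```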

DiskRowSet compaction

• Periodic task
• Removes deleted rows
• Reduces the number of sets with overlapping PK ranges
• Does not create bigger DiskRowSets
  – 32MB size for each DRS is preserved

Before compaction:
  DiskRowSet1 (32MB) PK {A, G}
  DiskRowSet2 (32MB) PK {B, E}
After compaction:
  DiskRowSet1 (32MB) PK {A, D}
  DiskRowSet2 (32MB) PK {E, G}
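
The effect of compaction on PK ranges can be sketched as a merge followed by a split: the keys of the overlapping row sets are merged in sorted order and cut into new sets of bounded size, so the resulting ranges no longer overlap. In this sketch `CompactionSketch` and `compact` are hypothetical names, a maximum row count stands in for the 32MB size limit, and the removal of deleted rows is not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Illustrative only: rewrites overlapping row sets into non-overlapping ones. */
final class CompactionSketch {

    /** Merges all keys in sorted order and splits them into sets of at most maxRowsPerSet. */
    static List<List<String>> compact(List<List<String>> rowSets, int maxRowsPerSet) {
        if (maxRowsPerSet <= 0) {
            throw new IllegalArgumentException("maxRowsPerSet must be positive");
        }
        TreeSet<String> merged = new TreeSet<>();
        for (List<String> rowSet : rowSets) {
            merged.addAll(rowSet);
        }
        List<List<String>> compacted = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String pk : merged) {
            current.add(pk);
            if (current.size() == maxRowsPerSet) {
                compacted.add(current); // this set is full, start the next one
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            compacted.add(current);
        }
        return compacted;
    }

    public static void main(String[] args) {
        // Ranges {A, G} and {B, E} overlap, as in the example on the slide.
        List<List<String>> before = Arrays.asList(
                Arrays.asList("A", "D", "G"),
                Arrays.asList("B", "E"));
        System.out.println(compact(before, 3)); // prints [[A, B, D], [E, G]]
    }
}
```

The output sets cover the ranges {A, D} and {E, G}, matching the before and after ranges in the slide's example.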

How columns are stored on disk (DiskRowSet)

[Diagram: a 32MB DiskRowSet with Column1, Column2, Column3 and the PK. Each is stored as a series of pages (values plus page metadata; one page is labelled "Size 256KB") with a B-tree index.]

• The B-tree index of a column maps row offsets to pages
• The B-tree index of the PK maps PK to row offset
• Pages are encoded with a variety of encodings, such as dictionary encoding, bitshuffle, or RLE
• Pages can be compressed: Snappy, LZ4 or ZLib
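
Of the encodings named above, run-length encoding (RLE) is the easiest to show in a few lines: consecutive repeats of a value are stored once together with a count, which works well for columns sorted by primary key. The sketch below is a generic textbook version; `RunLengthSketch` is a hypothetical name and the on-disk layout Kudu uses is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: run-length encoding of a column of 64-bit integers. */
final class RunLengthSketch {

    /** Encodes values as a list of {value, runLength} pairs. */
    static List<long[]> encode(long[] values) {
        List<long[]> runs = new ArrayList<>();
        for (long value : values) {
            long[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
            if (last != null && last[0] == value) {
                last[1]++; // extend the current run
            } else {
                runs.add(new long[] {value, 1}); // start a new run
            }
        }
        return runs;
    }

    /** Expands {value, runLength} pairs back into the original values. */
    static long[] decode(List<long[]> runs) {
        int total = 0;
        for (long[] run : runs) {
            total += (int) run[1];
        }
        long[] values = new long[total];
        int next = 0;
        for (long[] run : runs) {
            for (long n = 0; n < run[1]; n++) {
                values[next++] = run[0];
            }
        }
        return values;
    }

    public static void main(String[] args) {
        long[] runnumbers = {169864L, 169864L, 169864L, 169865L, 169865L};
        List<long[]> runs = encode(runnumbers);
        System.out.println("runs: " + runs.size()); // 5 values stored as 2 runs
        System.out.println(Arrays.equals(runnumbers, decode(runs)));
    }
}
```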

Kudu deployment

3 options for deployment

• Build from source
• Using RPMs
  – 1 core RPM
  – 2 service RPMs (master and servers)
  – One shared config file
• Using Cloudera Manager
  – Click, click, click, done

Interfacing with Kudu

Table access and manipulations

• Operations on tables (NoSQL)
  – insert, update, delete, scan
  – Python, C++, Java API
• Integrated with
  – Impala & Hive (SQL), MapReduce, Spark
  – Flume sink (ingestion)

Manipulating Kudu tables with SQL (Impala/Hive)

• Table creation

CREATE TABLE `kudu_example` (
  `runnumber` BIGINT,
  `eventnumber` BIGINT,
  `project` STRING,
  `streamname` STRING,
  `prodstep` STRING,
  `datatype` STRING,
  `amitag` STRING,
  `lumiblockn` BIGINT,
  `bunchid` BIGINT
)
DISTRIBUTE BY HASH (runnumber) INTO 64 BUCKETS
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'example_table',
  'kudu.master_addresses' = 'kudu-master.cern.ch:7051',
  'kudu.key_columns' = 'runnumber, eventnumber'
);

• DMLs

insert into kudu_example values (1,30,'test',….);
insert into kudu_example select * from data_parquet;
update kudu_example set datatype='test' where runnumber=1;
delete from kudu_example where project='test';

• Queries

select count(*), max(eventnumber) from kudu_example where datatype like '%AOD%' group by runnumber;

select * from kudu_example k, parquet_table p where k.runnumber=p.runnumber;

Creating table with Java

import java.util.ArrayList;
import java.util.List;
import org.kududb.*;        //ColumnSchema, Schema, Type
import org.kududb.client.*; //KuduClient, CreateTableOptions

//CREATING TABLE
String tableName = "my_table";
String KUDU_MASTER_NAME = "master.cern.ch";
KuduClient client = new KuduClient.KuduClientBuilder(KUDU_MASTER_NAME).build();

//key columns, bit-shuffle encoded and Snappy compressed
List<ColumnSchema> columns = new ArrayList<>();
columns.add(new ColumnSchema.ColumnSchemaBuilder("runnumber", Type.INT64)
    .key(true).encoding(ColumnSchema.Encoding.BIT_SHUFFLE).nullable(false)
    .compressionAlgorithm(ColumnSchema.CompressionAlgorithm.SNAPPY).build());
columns.add(new ColumnSchema.ColumnSchemaBuilder("eventnumber", Type.INT64)
    .key(true).encoding(ColumnSchema.Encoding.BIT_SHUFFLE).nullable(false)
    .compressionAlgorithm(ColumnSchema.CompressionAlgorithm.SNAPPY).build());
……..
Schema schema = new Schema(columns);

//hash partitioning on both key columns into 64 buckets, 3 replicas per tablet
List<String> partColumns = new ArrayList<>();
partColumns.add("runnumber");
partColumns.add("eventnumber");
CreateTableOptions options = new CreateTableOptions().addHashPartitions(partColumns, 64).setNumReplicas(3);
client.createTable(tableName, schema, options);
……..

Inserting rows with Java

//INSERTING
KuduTable table = client.openTable(tableName);
KuduSession session = client.newSession();

Insert insert = table.newInsert();
PartialRow row = insert.getRow();
row.addLong(0, 1);        //column 0, addressed by index
row.addString(2, "test"); //column 2
….
session.apply(insert); //with a manual flush mode, buffers the operation on the client side (for batch upload)
session.flush(); //sends data to Kudu
……..

Scanner in Java

//configuring column projection
List<String> projectColumns = new ArrayList<>();
projectColumns.add("runnumber");
projectColumns.add("dataType");

//setting a scan range (lower bound inclusive, upper bound exclusive)
Schema s = table.getSchema(); //schema of the table opened earlier
PartialRow start = s.newPartialRow();
start.addLong("runnumber", 8);
PartialRow end = s.newPartialRow();
end.addLong("runnumber", 10);

KuduScanner scanner = client.newScannerBuilder(table)
    .lowerBound(start)
    .exclusiveUpperBound(end)
    .setProjectedColumnNames(projectColumns)
    .build();

//iterating over the batches of rows returned by the scanner
while (scanner.hasMoreRows()) {
    RowResultIterator results = scanner.nextRows();
    while (results.hasNext()) {
        RowResult result = results.next();
        System.out.println(result.getString(1)); //getting 2nd column
    }
}

Spark with Kudu

wget http://central.maven.org/maven2/org/apache/kudu/kudu-spark_2.10/1.0.0/kudu-spark_2.10-1.0.0.jar
spark-shell --jars kudu-spark_2.10-1.0.0.jar

import org.apache.kudu.client._ // CreateTableOptions
import org.apache.kudu.spark.kudu._

// Read a table from Kudu
val df = sqlContext.read.options(
  Map("kudu.master" -> "kudu_master.cern.ch:7051",
      "kudu.table" -> "kudu_table")).kudu

// Query using the DF API...
df.select(df("runnumber"), df("eventnumber"), df("db0")).filter($"runnumber" === 169864).filter($"eventnumber" === 1).show()

// ...or register a temporary table and use SQL
df.registerTempTable("kudu_table")
sqlContext.sql("select id from kudu_table where id >= 5").show()

// kuduContext is not defined on the slide; assumed to be created like this
val kuduContext = new KuduContext("kudu_master.cern.ch:7051")

// Create a new Kudu table from a dataframe schema
// NB: No rows from the dataframe are inserted into the table
kuduContext.createTable("test_table", df.schema, Seq("key"), new CreateTableOptions().setNumReplicas(1))

// Insert data
kuduContext.insertRows(df, "test_table")

Kudu Security

To be done!

Performance (based on ATLAS EventIndex case)

Average row length

• Very good compaction ratio
  – The same as Parquet
• Each row consists of 56 attributes
  – Most of them are strings
  – Few integers and floats

[Chart: row length in bytes (axis 0 to 3000) for kudu, parquet, hbase and avro, with series No compression, Snappy and GZip-like, plus the length in CSV. Bar labels in extraction order: 777890, 2819, 1559, 171 189, 538, 326, 87 90, 314217]

Insertion rates (per machine, per partition) with Impala

• Average ingestion speed
  – worse than Parquet
  – better than HBase

[Chart: insertion speed in kHz (axis 0 to 140) for kudu, parquet, hbase and avro, with series No compression, Snappy and GZip-like. Bar labels in extraction order: 7.21, 64, 5.3, 115, 11.34, 85, 4.4, 70, 10.9, 38, 4.9, 49]

Random lookup with Impala

• Good random data lookup speed
  – Similar to HBase

[Chart: average random lookup speed in seconds (axis 0 to 30) for kudu, parquet, hbase and avro, with series No compression, Snappy and GZip-like. Bar labels in extraction order: 0.27, 0.62, 0.56, 16, 0.45, 0.86, 0.4, 19, 0.32, 0.89, 0.5, 27]

Data scan rate per core with a predicate on non PK column (using Impala)

• Quite good data scanning speed
  – Much better than HBase
  – If natively supported predicate operations are used, it is even faster than Parquet

[Chart: scan speed in kHz (axis 0 to 600) for kudu, parquet and hbase, with series No compression, Snappy and GZip-like. Bar labels in extraction order: 136, 237, 129, 232, 345, 488, 131, 215, 260, 435, 120, 62]

Kudu monitoring

Cloudera Manager

• Many metrics are published by the servers over HTTP
• All are collected by CM agents and can be plotted
• Predefined CM dashboards
  – Monitoring of Kudu processes
  – Workload plots
• CM can also be used for Kudu configuration

CM – Kudu host status

CM - Workload plots

CM - Resource utilisation

Observations & Conclusions

What is nice about Kudu

• The first one in the open source Big Data world trying to combine columnar store + indexing
• Simple to deploy
• It works (almost) without problems
• It scales (this depends on how the schema is designed)
  – Writing, Accessing, Scanning
• Integrated with mainstream Big Data processing frameworks
  – Spark, Impala, Hive, MapReduce
  – SQL and NoSQL on the same data
• Gives more flexibility in optimizing schema design compared to HBase (two levels of partitioning)
• Cloudera is pushing to deliver production-like quality of the software ASAP

What is bad about Kudu?

• No security (it should be added in the next releases)
  – authentication (who connected)
  – authorization (ACLs)
• Raft consensus does not always work as it should
  – Too frequent tablet leader changes (sometimes a leader cannot be elected at all)
  – The period without a leader is quite long (sometimes it never ends)
  – This freezes updates on tables
• Handling disk failures
  – you have to erase/reinitialize the entire server
• Only one index per table
• No nested types (but there is a binary type)
• Cannot control tablet placement on servers

When can Kudu be useful?

• When you have structured ‘big data’
  – Like in an RDBMS
  – Without complex types
• When sequential and random data access are required simultaneously and have to scale
  – Data extraction and analytics at the same time
  – Time series
• When low ingestion latency is needed
  – and lambda architecture is too expensive

Learn more

• Main page: https://kudu.apache.org/
• Video: https://www.oreilly.com/ideas/kudu-resolving-transactional-and-analytic-trade-offs-in-hadoop
• Whitepaper: http://kudu.apache.org/kudu.pdf
• KUDU project: https://github.com/cloudera/kudu
• Some Java code examples: https://gitlab.cern.ch:8443/zbaranow/kudu-atlas-eventindex
• Get Cloudera Quickstart VM and test it