Apache Kudu
Zbigniew Baranowski
Intro
What is KUDU?
• New storage engine for structured data (tables) – does not use HDFS!
• Columnar store
• Mutable (insert, update, delete)
• Written in C++
• Apache-licensed – open source
• Quite new – the 1.0 version was recently released
  – First commit on October 11th, 2012
  – …and immature?
KUDU tries to fill the gap
• HBase (on HDFS) excels at
  – Fast random lookups by key
  – Making data mutable
• HDFS excels at
  – Scanning large amounts of data at speed
  – Accumulating data with high throughput
Table-oriented storage
• A Kudu table has an RDBMS-like schema
  – Primary key (one or many columns)
    • No secondary indexes
  – Finite and constant number of columns (unlike HBase)
  – Each column has a name and a type
    • boolean, int(8,16,32,64), float, double, timestamp, string, binary
• Horizontally partitioned (by range or hash) – partitions are called tablets
  – tablets can have 3 or 5 replicas
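To make the partitioning concrete, here is a minimal sketch of how hash partitioning maps a composite primary key to one of N tablets. This is an illustration of the principle only – Kudu's actual hash function is internal to Kudu and differs from `Objects.hash`:

```java
import java.util.Objects;

// Sketch: equal primary keys always land in the same tablet (bucket),
// so both point lookups and inserts can be routed deterministically.
public class HashPartitioning {
    static int bucketFor(long runnumber, long eventnumber, int numBuckets) {
        int h = Objects.hash(runnumber, eventnumber);
        return Math.floorMod(h, numBuckets); // non-negative bucket index
    }

    public static void main(String[] args) {
        int b1 = bucketFor(169864L, 1L, 64);
        int b2 = bucketFor(169864L, 1L, 64);
        assert b1 == b2 : "same key must map to the same tablet";
        assert b1 >= 0 && b1 < 64;
        System.out.println("bucket = " + b1);
    }
}
```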
Data Consistency
• Writing
  – Single-row mutations are done atomically across all columns
  – No multi-row ACID transactions
• Reading
  – Tuneable freshness of the data
    • read whatever is available
    • or wait until all changes committed in the WAL are available
  – Snapshot consistency
    • changes made during a scan are not reflected in the results
    • point-in-time queries are possible (based on a provided timestamp)
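The snapshot-consistency semantics above can be illustrated with a minimal MVCC sketch: a scan fixes a timestamp and ignores any mutation applied after it. This shows the semantics only, not Kudu's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Toy multi-version row: versions are appended in timestamp order,
// and a read at a snapshot timestamp sees only versions at or before it.
public class SnapshotScan {
    static class Version {
        final long ts; final String value;
        Version(long ts, String value) { this.ts = ts; this.value = value; }
    }

    static String readAt(List<Version> versions, long snapshotTs) {
        String result = null;
        for (Version v : versions)
            if (v.ts <= snapshotTs) result = v.value; // newest visible version wins
        return result;
    }

    public static void main(String[] args) {
        List<Version> row = new ArrayList<>();
        row.add(new Version(10, "v1"));
        row.add(new Version(20, "v2")); // applied while a scan at ts=15 runs
        assert readAt(row, 15).equals("v1"); // the scan does not see the update
        assert readAt(row, 25).equals("v2"); // a later scan does
        System.out.println("snapshot reads are stable");
    }
}
```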
Kudu simplifies the Big Data deployment model for online analytics (low-latency ingestion and access)
• Classical low latency design

[Diagram: stream sources send events both to a staging area of indexed data (flushed immediately, serving fast data access) and, flushed periodically as big files, to HDFS for batch processing]
Implementing low latency with Kudu
[Diagram: stream sources send events directly to Kudu, which serves both batch processing and fast data access from the same store]
Kudu Architecture
Architecture overview
• Master server (there can be multiple masters for HA)
  – Stores metadata – table definitions
  – Tablets directory (tablet locations)
  – Coordinates cluster reconfigurations
• Tablet servers (worker nodes)
  – Write and read tablets
    • Stored on local disks (no HDFS)
  – Track the status of tablet replicas (followers)
  – Replicate the data to followers
Tables and tablets

Map of table TEST (stored on the Master):

TabletID | Leader | Follower1 | Follower2
TEST1    | TS1    | TS2       | TS3
TEST2    | TS4    | TS1       | TS2
TEST3    | TS3    | TS4       | TS1

[Diagram: TabletServer1 hosts the leader of TEST1 and followers of TEST2 and TEST3; TabletServer2 hosts followers of TEST1 and TEST2; TabletServer3 hosts a follower of TEST1 and the leader of TEST3; TabletServer4 hosts the leader of TEST2 and a follower of TEST3]
Data changes propagation in Kudu (Raft Consensus - https://raft.github.io)
[Diagram: the client locates the tablet leader via the Master, then sends the write to the leader replica of Tablet 1 on tablet server X; the leader records it in its WAL and replicates it to the follower WALs on tablet servers Y and Z; the write commits once a majority has acknowledged it]
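The commit rule in the diagram above – a write counts as committed once a majority of the replica set has logged it – can be sketched as follows. This illustrates the Raft majority rule only, not Kudu's actual consensus code:

```java
// Raft majority-commit rule as used for Kudu tablet replication:
// an entry is committed once a strict majority of replicas
// (leader included) have written it to their WALs.
public class RaftCommit {
    static boolean isCommitted(int acks, int replicas) {
        return acks > replicas / 2; // strict majority required
    }

    public static void main(String[] args) {
        assert isCommitted(2, 3);  // leader + 1 follower of 3: committed
        assert !isCommitted(1, 3); // leader alone: not committed
        assert !isCommitted(2, 5); // 2 of 5: not committed
        assert isCommitted(3, 5);  // 3 of 5: committed
        System.out.println("majority-commit rule holds");
    }
}
```

This is why tablets have 3 or 5 replicas: an odd-sized replica set can always form a majority while tolerating 1 or 2 failed replicas.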
Tablet server
Insert into tablet (without uniqueness check)
INSERT
MemRowSet
DiskRowSet1 (32MB) PK {min, max}
B+treeRow: Col1,Col2, Col3
Row1,Row2,Row3Leafs sorted by
Primary Key
Flush
Col1 Col2 Col3
Columnar store encoded similarly to ParquetRows sorted by PK.
Bloom filters
Bloom filters for PK ranges. Stored in cached btree
There might be Ks of sets per tablet
Interval tree Interval tree keeps
track of PK ranges within DiskRowSets
PK
DiskRowSet2 (32MB) PK {min, max}
Col1 Col2 Col3 Bloom filters
PK
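The point of the PK ranges and Bloom filters above is to prune DiskRowSets during key lookups. Here is a simplified sketch of that pruning; real Kudu uses proper Bloom filters and an interval tree, while a 64-bit mask stands in for the filter here:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a key lookup only touches DiskRowSets whose PK {min, max}
// interval contains the key AND whose Bloom filter reports possible
// membership; all other sets are skipped without disk I/O.
public class DiskRowSetLookup {
    static class DiskRowSet {
        final long minPk, maxPk;
        long bloomBits = 0; // toy 64-bit "Bloom filter"
        DiskRowSet(long min, long max) { minPk = min; maxPk = max; }
        void add(long pk) { bloomBits |= 1L << (Long.hashCode(pk) & 63); }
        boolean mightContain(long pk) {
            if (pk < minPk || pk > maxPk) return false; // interval check
            return (bloomBits & (1L << (Long.hashCode(pk) & 63))) != 0;
        }
    }

    static List<DiskRowSet> candidates(List<DiskRowSet> sets, long pk) {
        List<DiskRowSet> out = new ArrayList<>();
        for (DiskRowSet s : sets)
            if (s.mightContain(pk)) out.add(s); // only these are read from disk
        return out;
    }

    public static void main(String[] args) {
        DiskRowSet a = new DiskRowSet(0, 100);
        DiskRowSet b = new DiskRowSet(200, 300);
        a.add(42);
        b.add(250);
        List<DiskRowSet> sets = List.of(a, b);
        assert candidates(sets, 42).contains(a);
        assert !candidates(sets, 42).contains(b); // pruned by PK range
        assert candidates(sets, 500).isEmpty();   // outside every range
        System.out.println("pruning works");
    }
}
```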
DiskRowSet compaction

• Periodic background task
• Removes deleted rows
• Reduces the number of sets with overlapping PK ranges
• Does not create bigger DiskRowSets – the 32MB size of each DRS is preserved

Before: DiskRowSet1 (32MB) PK {A, G}, DiskRowSet2 (32MB) PK {B, E}
After compaction: DiskRowSet1 (32MB) PK {A, D}, DiskRowSet2 (32MB) PK {E, G}
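The effect shown above – two overlapping sets becoming two non-overlapping ones of the same size – amounts to merge-sorting the rows and re-splitting them. A minimal sketch (rows reduced to their PK strings; dropping deleted rows is omitted):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of what compaction achieves: rows from DiskRowSets with
// overlapping PK ranges are merge-sorted and rewritten into
// equally sized sets with disjoint PK ranges.
public class Compaction {
    static List<List<String>> compact(List<String> drs1, List<String> drs2) {
        List<String> merged = new ArrayList<>(drs1);
        merged.addAll(drs2);
        Collections.sort(merged); // rows sorted by PK across both sets
        int half = (merged.size() + 1) / 2; // preserve per-set size
        return List.of(merged.subList(0, half),
                       merged.subList(half, merged.size()));
    }

    public static void main(String[] args) {
        // PK ranges {A,G} and {B,E} overlap before compaction
        List<List<String>> out = compact(List.of("A", "C", "G"), List.of("B", "E"));
        // Afterwards the ranges are disjoint
        System.out.println(out); // [[A, B, C], [E, G]]
    }
}
```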
How columns are stored on disk (DiskRowSet)

• Each column is stored separately as a sequence of ~256KB pages (values + page metadata)
• A per-column B-tree index maps row offsets to pages
• A PK B-tree index maps primary keys to row offsets
• Pages are encoded with a variety of encodings, such as dictionary encoding, bitshuffle, or RLE
• Pages can be compressed: Snappy, LZ4, or zlib
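Of the page encodings listed above, run-length encoding is the simplest to illustrate. A toy RLE for a column of longs (this is not Kudu's on-disk format, just the idea that repeated values in a sorted column compress well):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy run-length encoding for a column page: consecutive equal
// values are stored once with a repetition count.
public class RlePage {
    static List<long[]> encode(long[] values) {
        List<long[]> runs = new ArrayList<>(); // each run: {value, length}
        for (long v : values) {
            if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == v)
                runs.get(runs.size() - 1)[1]++;
            else
                runs.add(new long[]{v, 1});
        }
        return runs;
    }

    static long[] decode(List<long[]> runs) {
        List<Long> out = new ArrayList<>();
        for (long[] r : runs)
            for (long i = 0; i < r[1]; i++) out.add(r[0]);
        return out.stream().mapToLong(Long::longValue).toArray();
    }

    public static void main(String[] args) {
        long[] column = {7, 7, 7, 3, 3, 9};
        List<long[]> runs = encode(column);
        assert runs.size() == 3;                          // runs: 7x3, 3x2, 9x1
        assert Arrays.equals(decode(runs), column);       // lossless round-trip
        System.out.println(runs.size() + " runs for " + column.length + " values");
    }
}
```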
Kudu deployment
3 options for deployment

• Build from source
• Using RPMs
  – 1 core rpm
  – 2 service rpms (master and servers)
  – One shared config file
• Using Cloudera Manager – click, click, click, done
Interfacing with Kudu
Table access and manipulations
• Operations on tables (NoSQL)
  – insert, update, delete, scan
  – Python, C++, Java APIs
• Integrated with
  – Impala & Hive (SQL), MapReduce, Spark
  – Flume sink (ingestion)
Manipulating Kudu tables with SQL (Impala/Hive)

• Table creation

CREATE TABLE `kudu_example` (
  `runnumber` BIGINT,
  `eventnumber` BIGINT,
  `project` STRING,
  `streamname` STRING,
  `prodstep` STRING,
  `datatype` STRING,
  `amitag` STRING,
  `lumiblockn` BIGINT,
  `bunchid` BIGINT
)
DISTRIBUTE BY HASH (runnumber) INTO 64 BUCKETS
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'example_table',
  'kudu.master_addresses' = 'kudu-master.cern.ch:7051',
  'kudu.key_columns' = 'runnumber, eventnumber'
);

• DMLs

insert into kudu_example values (1,30,'test',….);
insert into kudu_example select * from data_parquet;
update kudu_example set datatype='test' where runnumber=1;
delete from kudu_example where project='test';

• Queries

select count(*), max(eventnumber) from kudu_example
where datatype like '%AOD%' group by runnumber;

select * from kudu_example k, parquet_table p
where k.runnumber = p.runnumber;
Creating table with Java
import org.kududb.*;

// CREATING TABLE
String tableName = "my_table";
String KUDU_MASTER_NAME = "master.cern.ch";
KuduClient client = new KuduClient.KuduClientBuilder(KUDU_MASTER_NAME).build();

List<ColumnSchema> columns = new ArrayList<>();
columns.add(new ColumnSchema.ColumnSchemaBuilder("runnumber", Type.INT64)
    .key(true)
    .encoding(ColumnSchema.Encoding.BIT_SHUFFLE)
    .nullable(false)
    .compressionAlgorithm(ColumnSchema.CompressionAlgorithm.SNAPPY)
    .build());
columns.add(new ColumnSchema.ColumnSchemaBuilder("eventnumber", Type.INT64)
    .key(true)
    .encoding(ColumnSchema.Encoding.BIT_SHUFFLE)
    .nullable(false)
    .compressionAlgorithm(ColumnSchema.CompressionAlgorithm.SNAPPY)
    .build());
……..
Schema schema = new Schema(columns);

List<String> partColumns = new ArrayList<>();
partColumns.add("runnumber");
partColumns.add("eventnumber");
CreateTableOptions options = new CreateTableOptions()
    .addHashPartitions(partColumns, 64)
    .setNumReplicas(3);
client.createTable(tableName, schema, options);
Inserting rows with Java
// INSERTING
KuduTable table = client.openTable(tableName);
KuduSession session = client.newSession();
Insert insert = table.newInsert();
PartialRow row = insert.getRow();
row.addLong(0, 1);
row.addString(2, "test");
….
session.apply(insert); // buffered in memory on the client side (for batch upload)
session.flush();       // sends the data to Kudu
Scanner in Java
// configuring the column projection
List<String> projectColumns = new ArrayList<>();
projectColumns.add("runnumber");
projectColumns.add("datatype");

// setting a scan range
Schema schema = table.getSchema();
PartialRow start = schema.newPartialRow();
start.addLong("runnumber", 8);
PartialRow end = schema.newPartialRow();
end.addLong("runnumber", 10);

KuduScanner scanner = client.newScannerBuilder(table)
    .lowerBound(start)
    .exclusiveUpperBound(end)
    .setProjectedColumnNames(projectColumns)
    .build();

while (scanner.hasMoreRows()) {
    RowResultIterator results = scanner.nextRows();
    while (results.hasNext()) {
        RowResult result = results.next();
        System.out.println(result.getString(1)); // getting the 2nd projected column
    }
}
Spark with Kudu
wget http://central.maven.org/maven2/org/apache/kudu/kudu-spark_2.10/1.0.0/kudu-spark_2.10-1.0.0.jar
spark-shell --jars kudu-spark_2.10-1.0.0.jar

import org.apache.kudu.spark.kudu._

// Read a table from Kudu
val df = sqlContext.read.options(
  Map("kudu.master" -> "kudu_master.cern.ch:7051",
      "kudu.table" -> "kudu_table")).kudu

// Query using the DF API...
df.select(df("runnumber"), df("eventnumber"), df("db0"))
  .filter($"runnumber" === 169864)
  .filter($"eventnumber" === 1)
  .show()

// ...or register a temporary table and use SQL
df.registerTempTable("kudu_table")
sqlContext.sql("select id from kudu_table where id >= 5").show()

// Create a new Kudu table from a dataframe schema
// NB: no rows from the dataframe are inserted into the table
kuduContext.createTable("test_table", df.schema, Seq("key"), new CreateTableOptions().setNumReplicas(1))

// Insert data
kuduContext.insertRows(df, "test_table")
Kudu Security
To be done!
Performance (based on the ATLAS EventIndex case)

Average row length

• Each row consists of 56 attributes
  – Most of them are strings
  – A few integers and floats
• Very good compression ratio – the same as Parquet

[Bar chart: average row length in bytes for kudu, parquet, hbase, and avro with no compression, Snappy, and GZip-like compression, compared with the row length in CSV]
Insertion rates (per machine, per partition) with Impala
• Average ingestion speed
  – worse than Parquet
  – better than HBase

[Bar chart: insertion speed (kHz) for kudu, parquet, hbase, and avro with no compression, Snappy, and GZip-like compression]
Random lookup with Impala
• Good random data lookup speed
  – Similar to HBase

[Bar chart: average random lookup time [s] for kudu, parquet, hbase, and avro with no compression, Snappy, and GZip-like compression]
Data scan rate per core with a predicate on a non-PK column (using Impala)

• Quite good data scanning speed
  – Much better than HBase
  – If natively supported predicate operations are used, it is even faster than Parquet

[Bar chart: scan speed (kHz) for kudu, parquet, and hbase with no compression, Snappy, and GZip-like compression]
Kudu monitoring
Cloudera Manager
• A lot of metrics are published through the servers' HTTP endpoints
• All are collected by CM agents and can be plotted
• Predefined CM dashboards
  – Monitoring of Kudu processes
  – Workload plots
• CM can also be used for Kudu configuration
CM – Kudu host status
CM - Workload plots
CM - Resource utilisation
Observations & Conclusions
What is nice about Kudu
• The first in the Big Data open-source world to try to combine a columnar store with indexing
• Simple to deploy
• It works (almost) without problems
• It scales (depending on how the schema is designed)
  – Writing, accessing, scanning
• Integrated with mainstream Big Data processing frameworks
  – Spark, Impala, Hive, MapReduce
  – SQL and NoSQL on the same data
• Gives more flexibility in optimizing schema design compared to HBase (two levels of partitioning)
• Cloudera is pushing to deliver production quality of the software ASAP
What is bad about Kudu?
• No security yet (it should be added in the next releases)
  – authentication (who connected)
  – authorization (ACLs)
• Raft consensus does not always work as it should
  – Too frequent tablet leader changes (sometimes a leader cannot be elected at all)
  – The period without a leader is quite long (sometimes it never ends)
  – This freezes updates on the affected tables
• Handling disk failures
  – you have to erase/reinitialize the entire server
• Only one index per table
• No nested types (but there is a binary type)
• Cannot control tablet placement on servers
When can Kudu be useful?
• When you have structured 'big data'
  – Like in an RDBMS
  – Without complex types
• When sequential and random data access are required simultaneously and have to scale
  – Data extraction and analytics at the same time
  – Time series
• When low ingestion latency is needed
  – and a lambda architecture is too expensive
Learn more
• Main page: https://kudu.apache.org/
• Video: https://www.oreilly.com/ideas/kudu-resolving-transactional-and-analytic-trade-offs-in-hadoop
• Whitepaper: http://kudu.apache.org/kudu.pdf
• KUDU project: https://github.com/cloudera/kudu
• Some Java code examples: https://gitlab.cern.ch:8443/zbaranow/kudu-atlas-eventindex
• Get the Cloudera Quickstart VM and test it