Deep dive into enterprise data lake through Impala

A Hadoop near real-time solution Impala

Evans Ye, 2014.7.28

04/07/2023 Confidential | Copyright 2013 TrendMicro Inc. 1

Agenda

• Introduction to Impala
• Impala Architecture
• Query Execution
• Getting Started
• Parquet File Format
• ACL via Sentry

Introduction to Impala

Impala

• General-purpose SQL query engine
• Supports queries that take from milliseconds to hours (near real-time)
• Supports most of the Hadoop file formats
  – Text, SequenceFile, Avro, RCFile, Parquet
• Suitable for data scientists and business analysts

Why do we need it?

SPEED

Current ad hoc query solution: Pig

• Hourly count on an Akamai log
  – A = load '/test_data/2014/07/20/00' using PigStorage();
    B = foreach (group A all) generate COUNT_STAR(A);
    dump B;
  – output: … 0% complete, 100% complete, (194202349)
  – elapsed time: 4mins, 28sec

Using Impala

• No memory cache: 96.46s
  – > select count(*) from test_data where day=20140720 and hour=0
  – 194202349
• With OS cache: 9.07s
• A further query: 6.57s
  – > select count(*) from test_data where day=20140720 and hour=00 and c='US';
  – 41118019

Status quo

• Developed by Cloudera
• Open source under Apache License
• 1.0 available in 05/2013
• Current version is 1.4
• Connect via ODBC/JDBC/Hue/impala-shell
• Authenticate via Kerberos or LDAP
• Fine-grained ACL via Apache Sentry

Benefits

• High Performance
  – C++
  – direct access to data (no MapReduce)
  – in-memory query execution
• Flexibility
  – query across existing data (no duplication)
  – supports multiple Hadoop file formats
• Scalable
  – scale out by adding nodes

Impala Architecture

[Architecture diagram: active and standby master nodes (NN, JT, HM); four worker nodes, each running Datanode, Tasktracker, Regionserver, and impala daemon; plus State store, Catalog, and Hive Metastore]

Components

• Impala daemon
  – collocated with datanodes
  – handles all Impala internal requests related to query execution
  – a user can submit a request to the impala daemon running on any node, and that node serves as the coordinator node
• State store daemon
  – communicates with impala daemons to confirm which nodes are healthy and can accept new work
• Catalog daemon
  – broadcasts metadata changes from Impala SQL statements to all the impala daemons

Fault tolerance

• No fault tolerance for impala daemons
  – if a node fails, the query fails
• State store offline
  – query execution still functions normally
  – cannot update metadata (create, alter…)
  – if another impala daemon goes down, the entire cluster cannot execute any query
• Catalog offline
  – cannot update metadata

Query Execution

Getting Started

Impala-shell (secure cluster)

• $ yum install impala-shell
• $ kinit -kt evans_ye.keytab evans_ye
• $ impala-shell --kerberos \
      --impalad IMPALA_HOST

Create and insert

• > create table t1 (col1 string, col2 int);
• > insert into t1 values ('foo', 10);
  – only supports writing to TEXT and PARQUET tables
  – every insert creates 1 tiny HDFS file
  – by default, the file will be stored under /user/hive/warehouse/t1/
  – use it for setting up small dimension tables, for experiments, or with HBase tables

Create external table to read existing files

• > create external table t2 (col1 string, col2 int)
    row format delimited fields terminated by '\t'
    location '/user/evans_ye/test_data';
  – location must be a directory (for example, pig output directory)
  – files to read:
    • read: /user/evans_ye/test_data/part-r-00000
    • not read: /user/evans_ye/test_data/_logs/history
    • not read: /user/evans_ye/test_data/20140701/00/part-r-00000
    • no recursive
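
The file list above can be reproduced with a small sketch. It is not Impala code: `files_read_by_table` is a hypothetical helper that only mimics the rule on this slide (files directly under the table location are read, nothing below subdirectories). Skipping names that start with `_` or `.` is an added assumption about hidden files, not something the slide states.

```python
import posixpath


def files_read_by_table(location, hdfs_paths):
    """Return the paths a non-recursive table location would pick up.

    Hypothetical helper for illustration only.
    """
    location = location.rstrip("/")
    selected = []
    for path in hdfs_paths:
        parent, name = posixpath.split(path)
        if parent != location:
            # The file sits in a subdirectory (or elsewhere): not read.
            continue
        if name.startswith(("_", ".")):
            # Assumption: hidden files are ignored.
            continue
        selected.append(path)
    return selected


paths = [
    "/user/evans_ye/test_data/part-r-00000",
    "/user/evans_ye/test_data/_logs/history",
    "/user/evans_ye/test_data/20140701/00/part-r-00000",
]
print(files_read_by_table("/user/evans_ye/test_data", paths))
```

Only the first path is returned; the other two sit below subdirectories of the table location.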

No recursive?

• Then how to add external data with a folder structure like this:
  – /user/evans_ye/test_data/20140701/00
    /user/evans_ye/test_data/20140701/01
    …
    /user/evans_ye/test_data/20140701/02
    …
    /user/evans_ye/test_data/20140702/00

Partitioning

Create the table with partitions

• > create external table t3 (col1 string, col2 int)
    partitioned by (`date` int, hour tinyint)
    row format delimited fields terminated by '\t';
• No need to specify the location on create

Add partitions into the table

• > alter table t3 add partition (`date`=20130330, hour=0)
    location '/user/evans_ye/test_data/20130330/00';
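
Adding one partition per hourly directory by hand gets tedious, so the statements are usually generated. Below is a minimal sketch that prints one `alter table ... add partition` statement per date and hour. `add_partition_statements` is a hypothetical helper, and it assumes every `YYYYMMDD/HH` directory in the range exists under the base directory.

```python
from datetime import date, timedelta


def add_partition_statements(table, base_dir, start, end, hours=range(24)):
    """Build one ADD PARTITION statement per (date, hour) directory.

    Hypothetical helper; it does not check that the directories exist.
    """
    statements = []
    day = start
    while day <= end:
        stamp = day.strftime("%Y%m%d")
        for hour in hours:
            statements.append(
                f"alter table {table} add partition "
                f"(`date`={stamp}, hour={hour}) "
                f"location '{base_dir}/{stamp}/{hour:02d}';"
            )
        day += timedelta(days=1)
    return statements


stmts = add_partition_statements(
    "t3", "/user/evans_ye/test_data", date(2014, 7, 1), date(2014, 7, 2)
)
print(len(stmts))  # 2 days x 24 hours
print(stmts[0])
```

The generated statements could then be saved to a file and run through impala-shell.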

Partition number

• thousands of partitions per table
  – OK
• tens of thousands of partitions per table
  – Not OK

Compute table statistics

• > compute stats t3;
• > show table stats t3;
• Helps Impala optimize join queries: broadcast join, partitioned join
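
To see why table statistics matter for the join choice, here is a toy cost comparison. It is not Impala's planner: `choose_join_strategy` and its cost formulas are illustrative assumptions. A broadcast join is modelled as copying the right-hand table to every node, and a partitioned join as sending each row of both tables across the network once.

```python
def choose_join_strategy(left_bytes, right_bytes, num_nodes):
    """Pick the join strategy with the lower estimated network traffic.

    Toy model only; the sizes would come from table statistics.
    """
    # Broadcast: the right-hand table is copied to every node.
    broadcast_cost = right_bytes * num_nodes
    # Partitioned: both tables are hash-distributed once.
    partitioned_cost = left_bytes + right_bytes
    if broadcast_cost <= partitioned_cost:
        return "broadcast"
    return "partitioned"


# Large fact table joined with a small dimension table on 10 nodes.
print(choose_join_strategy(10**9, 10**6, 10))
# Two large tables on 10 nodes.
print(choose_join_strategy(10**9, 10**9, 10))
```

Without statistics the sizes are unknown, so a planner cannot make this comparison with any confidence.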

Parquet File Format

Parquet

• Apache incubator project
• column-oriented file format
  – compression is better since all the values in a column are of the same type
  – encoding is better since values are often the same and repeated
  – SQL projection can avoid unnecessary reads and decoding of columns
• Supported by Pig, Impala, Hive, MR and Cascading
• Impala uses Snappy with Parquet by default
• Impala + Parquet = Google Dremel
  – Dremel doesn't support joins
  – Impala doesn't support nested data structures (yet)
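
The encoding point can be illustrated with a few lines of Python. This is not the Parquet format itself, only a sketch of run-length encoding applied to the same made-up records laid out row by row and column by column. `run_length_encode` and the sample rows are assumptions for illustration.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Made-up records: (country, HTTP status).
rows = [("US", 200), ("US", 200), ("US", 404), ("JP", 200), ("JP", 200)]

# Row-oriented layout: fields of different columns alternate.
row_layout = [field for row in rows for field in row]

# Column-oriented layout: all countries first, then all statuses.
column_layout = [value for column in zip(*rows) for value in column]

print(len(run_length_encode(row_layout)))     # 10 runs, nothing collapses
print(len(run_length_encode(column_layout)))  # 5 runs
```

Storing equal values next to each other is what lets the columnar layout collapse into fewer runs.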

Transform a text-file table into Parquet format

• > create table t4 like t3 stored as parquet;
• > insert overwrite t4 partition (`date`, hour) select * from t3;

Using parquet in Pig

• $ yum install parquet
• $ pig
• > A = load '/user/hive/warehouse/t4' using parquet.pig.ParquetLoader as (x: chararray, y: int);
• > store A into '/user/evans_ye/parquet_out' using parquet.pig.ParquetStorer;

ACL via Sentry

Sentry

• Apache incubator project
• provides fine-grained, role-based authorization
• currently integrates with Hive and Impala
• requires strong authentication such as Kerberos

Enable Sentry for Impala

• turn on Sentry authorization for Impala
  – add two lines to the impala daemon's configuration file (/etc/default/impala)
  – auth-policy.ini: Sentry policy file
  – server1: a symbolic name used in the policy file
    • all impala daemons must specify the same server name

Sentry policy file example

• roles:
  – on server1, spn_user_role has permission to read (SELECT) all tables in the spn database
• groups:
  – evans_ye group has role spn_user_role
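
The rules on this slide can be written down and checked with a short sketch. The policy text below assumes the usual `[groups]` / `[roles]` layout of a Sentry policy file, reconstructed from the bullets and not copied from the original slide. `load_policy` and `can_select` are hypothetical helpers. The checker is a simplification and does not reproduce Sentry's own privilege resolution.

```python
import configparser

# Assumed policy text matching the bullets on this slide.
POLICY = """
[groups]
evans_ye = spn_user_role

[roles]
spn_user_role = server=server1->db=spn->table=*->action=SELECT
"""


def load_policy(text):
    """Parse the policy into {group: [roles]} and {role: [privileges]}."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep group and role names case-sensitive
    parser.read_string(text)
    groups = {
        group: [role.strip() for role in value.split(",")]
        for group, value in parser["groups"].items()
    }
    roles = {}
    for role, value in parser["roles"].items():
        roles[role] = [
            dict(part.split("=", 1) for part in privilege.strip().split("->"))
            for privilege in value.split(",")
        ]
    return groups, roles


def can_select(groups, roles, group, server, db, table):
    """Simplified check: may members of `group` SELECT from db.table?"""
    for role in groups.get(group, []):
        for privilege in roles.get(role, []):
            if privilege.get("server") != server or privilege.get("db") != db:
                continue
            # A missing table or action is treated as "everything" here.
            if privilege.get("table", "*") not in ("*", table):
                continue
            if privilege.get("action", "ALL").upper() in ("SELECT", "ALL", "*"):
                return True
    return False


groups, roles = load_policy(POLICY)
print(can_select(groups, roles, "evans_ye", "server1", "spn", "any_table"))
print(can_select(groups, roles, "evans_ye", "server1", "other_db", "t1"))
```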

Sentry policy file example

• roles:
  – evans_data has permission to access /user/evans_ye
    • allows you to add data under /user/evans_ye as partitions
  – foo_db_role can do anything in the foo database
    • create, alter, insert, drop

Impala 2014 Roadmap

• 1.4 (now available)
  – order by without limit
• 2.0
  – nested data types (structs, arrays, maps)
  – disk-based joins and aggregations

Q&A
