Impala use case @ edge

• We build innovative advertising management platforms that help our customers make smarter decisions and reach their business goals faster, in real time.

• We are proud to have the most cutting-edge products and to lead the performance and video online advertising market, while striving to build long-term relationships with our clients and partners.

• Our intention is to reduce the complexity of the ad-tech industry and enable our customers to earn more revenue using our products and services.

• Edge is led by a team of industry veterans, has offices in NY, Tel-Aviv and Beijing, and employs over 100 team members.

• Data & BI Team Leader at Edge

• Experienced in a wide range of RDBMS technologies

• Working with Hadoop since 2014

• Certified Cloudera Administrator and trainer

• Oracle Certified Professional

• Big Data at Edge

• Our Goals

• About Impala

• High-Level Overview

• Why We Chose to Work with Impala

• Our Challenges

• Our Setup

• Designing Impala Tables

• Deliver insights on data in real time
  o Fraud detection
  o Time-series analysis
  o Predictive analytics
  o Interactive exploratory analytics on our data sets

• Provide a convenient way to interact with the data

• Continuously load batches of data, and make them visible with minimal delay

• Handle a high number of concurrent users

• Cloudera's open source massively parallel processing (MPP) SQL query engine

• Runs on Hadoop clusters

• 100% open source, released under the Apache License

• Does not rely on a general purpose data processing engine such as MapReduce

• Executes queries directly on the Hadoop cluster

• Well-suited for executing interactive analytics queries on large data sets

• Tables are really directories of files in HDFS

• Impala Servers run on each node of a cluster

• The Impala State Store Server is responsible for confirming which nodes are healthy and can accept new work

• The Catalog Server (new in CDH5) is responsible for sending metadata changes to all other Impala nodes

• You can submit a query to the Impala Server running on any node
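
Because any Impala Server can accept a query, a client can spread its connections across nodes. The sketch below shows one simple way to do that with round-robin selection. The hostnames are hypothetical, and 21050 is only the port Impala daemons commonly expose to HiveServer2-protocol clients, so check your own cluster configuration.

```python
from itertools import cycle

# Hypothetical daemon addresses; replace with the nodes of your cluster.
IMPALA_DAEMONS = [
    ("impala-node1.example.com", 21050),
    ("impala-node2.example.com", 21050),
    ("impala-node3.example.com", 21050),
]


def coordinator_picker(daemons):
    """Return a function that hands out daemons in round-robin order."""
    ring = cycle(daemons)
    return lambda: next(ring)


pick = coordinator_picker(IMPALA_DAEMONS)

# Four consecutive queries land on node1, node2, node3, then node1 again.
first_four = [pick()[0] for _ in range(4)]
```

In practice a load balancer in front of the daemons often plays this role. The sketch only illustrates that no single node has to receive every query.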

• Supports a large set of SQL statements and features, including SELECT, INSERT, joins, subqueries, and SQL analytic functions.

• Highly compatible with HiveQL

• Impala services can be deployed and managed using Cloudera Manager

• Queries can be run from Hue

• Impala is certified to run against Tableau

• Queries data stored in HDFS with distributed, high-performance execution

• Each Impala daemon can handle multiple concurrent client requests

• Impala is pioneering the use of the Parquet file format, a columnar storage layout that is optimized for the large-scale queries typical in data warehouse scenarios
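
To see why a columnar layout suits large analytic queries, the toy sketch below pivots a few records from a row layout into one list per column. It is an illustration only, not Parquet's actual on-disk format (which also involves row groups, encodings and compression), and the column names are made up for the example.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"campaign": "a", "country": "US", "clicks": 10},
    {"campaign": "b", "country": "IL", "clicks": 7},
    {"campaign": "c", "country": "CN", "clicks": 5},
]


def to_columns(records):
    """Pivot a list of records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one column only needs that column's values;
# the "campaign" and "country" values are never touched.
total_clicks = sum(columns["clicks"])
```

A query such as a sum over a single metric reads one column and skips the rest, which is the access pattern typical of data warehouse workloads.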

• Supports partitioning

• By default, all the data files for a table are located in a single directory. Partitioning is a technique for physically dividing the data during loading, based on values from one or more columns.

• Impala is a widely adopted standard across the ecosystem, with many users and extensive documentation

• Hadoop Cluster
  o Cluster sizing
  o Workload testing (query throughput and response time)

• Database Design
  o Identify access patterns based on real use cases
  o Make sure we’re not generating too many partitions
  o Make sure the data in each partition is large enough
  o Design our “Star Schema” data warehouse

• Data Types
  o Data consistency across Pig, Hive, and Impala

• File formats

• Tune queries
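
The trade-off between partition count and partition size can be checked with simple arithmetic before committing to a design. The sketch below compares daily and hourly partitioning. The data volume (50 GiB per day for one year) is an assumed figure for illustration, not a number from our cluster.

```python
GIB = 1024 ** 3


def partition_stats(total_bytes, days, partitions_per_day):
    """Return (partition count, average bytes per partition)."""
    count = days * partitions_per_day
    return count, total_bytes / count


# Assumed volume: 50 GiB per day, kept for one year.
total = 365 * 50 * GIB

daily_count, daily_avg = partition_stats(total, 365, 1)
hourly_count, hourly_avg = partition_stats(total, 365, 24)

# Report both options side by side.
for label, count, avg in [
    ("daily", daily_count, daily_avg),
    ("hourly", hourly_count, hourly_avg),
]:
    print(f"{label}: {count} partitions, {avg / GIB:.2f} GiB each")
```

Going from daily to hourly partitions multiplies the partition count by 24 and shrinks each partition by the same factor, so finer granularity only pays off when queries actually filter at that level.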

● Year = 2017
  ○ Month = 03
    ■ Day = 01
    ■ Day = 02
    ■ Day = 03
    ■ …
  ○ Month = 04
    ■ Day = 01
    ■ Day = 02
    ■ ...
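
A year/month/day layout like this maps each partition to one nested directory in HDFS. The sketch below builds those directory paths and a matching `ALTER TABLE ... ADD PARTITION` statement for a date. The table name `events`, the base directory `/data/events`, and the choice of zero-padded string partition columns are assumptions made for the example.

```python
from datetime import date, timedelta


def partition_dir(table_dir, day):
    """Directory holding the data files for one day's partition."""
    return (
        f"{table_dir}/year={day.year}"
        f"/month={day.month:02d}/day={day.day:02d}"
    )


def add_partition_sql(table, day):
    """Statement that registers one day's partition with the table."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
        f"(year='{day.year}', month='{day.month:02d}', day='{day.day:02d}')"
    )


def days_between(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


# Hypothetical table location; three daily partitions in March 2017.
paths = [
    partition_dir("/data/events", d)
    for d in days_between(date(2017, 3, 1), date(2017, 3, 3))
]
statement = add_partition_sql("events", date(2017, 4, 2))
```

A query that filters on year, month and day then only has to read the files under the matching directories.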

• Although we use a “Star Schema” design in Impala, there are many architectural differences between our Impala layout and the old RDBMS system.

• Keep that in mind, and avoid carrying your existing RDBMS data storage and processing strategies over to Impala

• Instead of using MapReduce, Impala reads the HDFS data directly

• Impala allows users to query data in HDFS using an SQL-like language

• The administrative tasks related to Impala are greatly simplified by Cloudera Manager

• Quickly get started with Cloudera using a preconfigured VM or a Docker image: https://www.cloudera.com/downloads/quickstart_vms/5-8.html

• Impala Frequently Asked Questions: https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_faq.html#faq_sql

• More details on Apache Parquet: https://parquet.apache.org

• The Impala Cookbook: https://www.slideshare.net/cloudera/the-impala-cookbook-42530186
