Impala use case @ edge

Page 2: Impala use case @ edge

• We are building innovative advertising management platforms that help our customers make smarter decisions and reach their business goals faster and better, in real time.

• We are proud to have the most cutting-edge products and to lead the performance and video online advertising market, while striving to build long-term relationships with our clients and partners.

• Our intention is to simplify the complexity of the ad-tech industry and enable our customers to earn more revenue using our products and services.

• Edge is led by a team of industry veterans, has offices in NY, Tel-Aviv and Beijing, and employs over 100 team members.

Page 3: Impala use case @ edge

• Data & BI Team Leader at Edge

• Experienced in a wide range of RDBMS technologies

• Working with Hadoop since 2014

• Certified Cloudera Administrator and trainer

• Oracle Certified Professional

Page 4: Impala use case @ edge

• Big Data at Edge

• Our Goals

• About Impala

• High-Level Overview

• Why We Chose to Work with Impala

• Our Challenges

• Our Setup

• Designing Impala Tables

Page 6: Impala use case @ edge

• Deliver insights on data in real time

  o Fraud detection

  o Time-series analysis

  o Predictive analytics

  o Interactive exploratory analytics on our data sets

• Provide a convenient way to interact with the data

• Continuously load batches of data and make them visible with minimal delay.

• Handle a high number of concurrent users

Page 7: Impala use case @ edge

• Cloudera's open source massively parallel processing (MPP) SQL query engine

• Runs on Hadoop clusters

• 100% open source, released under the Apache License

Page 8: Impala use case @ edge

• Does not rely on a general purpose data processing engine such as MapReduce

• Executes queries directly on the Hadoop cluster

• Well-suited for executing interactive analytics queries on large data sets

• Tables are really directories of files in HDFS

Page 9: Impala use case @ edge

• Impala Servers (the impalad daemons) run on each node of a cluster.

• The Impala State Store Server is responsible for confirming which nodes are healthy and can accept new work.

• The Catalog Server (new in CDH5) is responsible for relaying metadata changes to all other Impala nodes.

• You can submit a query to the Impala Server running on any node.
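
The division of roles above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `ToyStateStore`, `plan_query`, the node names, and the timeout-based health rule are hypothetical illustrations, not Impala's actual implementation or API.

```python
from typing import Dict, List


class ToyStateStore:
    """Toy stand-in for the state store: remembers the last heartbeat per node."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        # Record that `node` was alive at time `now` (seconds).
        self.last_seen[node] = now

    def healthy_nodes(self, now: float) -> List[str]:
        # A node counts as healthy if it reported within the timeout window.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen <= self.timeout_s
        )


def plan_query(coordinator: str, store: ToyStateStore, now: float) -> dict:
    """Any healthy node may coordinate; work goes only to healthy nodes."""
    healthy = store.healthy_nodes(now)
    if coordinator not in healthy:
        raise RuntimeError(f"{coordinator} is not healthy")
    return {"coordinator": coordinator, "workers": healthy}


if __name__ == "__main__":
    store = ToyStateStore(timeout_s=10)
    for node in ("node1", "node2", "node3"):
        store.heartbeat(node, now=0)
    store.heartbeat("node1", now=20)
    store.heartbeat("node2", now=20)
    # node3 has gone silent, so it is left out of the plan.
    print(plan_query("node2", store, now=25))
```

The point of the sketch is only that the client can pick any healthy node as its entry point, while the health bookkeeping lives in one place.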

Page 10: Impala use case @ edge

• Supports a large set of SQL features, including SELECT and INSERT statements, joins, subqueries, and SQL analytic functions.

• Highly compatible with HiveQL

• Impala services can be deployed and managed using Cloudera Manager

• Queries can be run from Hue

• Impala is certified to run against Tableau

Page 11: Impala use case @ edge

• Queries data stored in HDFS, providing distributed, high-performance query execution

• Each Impala daemon can handle multiple concurrent client requests

• Impala is pioneering the use of the Parquet file format, a columnar storage layout that is optimized for the large-scale queries typical of data warehouse scenarios.

Page 12: Impala use case @ edge

• Supports partitioning

• By default, all the data files for a table are located in a single directory. Partitioning is a technique for physically dividing the data during loading, based on values from one or more columns.

• Impala is widely adopted across the Hadoop ecosystem, with many users and extensive documentation
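
The partitioning idea described above, dividing rows by the values of one or more columns, can be sketched in a few lines of Python. This is a simplified illustration, not how Impala loads data: the function name `partition_rows` and the sample columns (`year`, `month`, `clicks`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def partition_rows(rows: List[dict], partition_cols: Sequence[str]) -> Dict[str, List[dict]]:
    """Group rows into partition directories keyed by partition column values."""
    partitions: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        # One sub-directory level per partition column, e.g. "year=2017/month=3".
        key = "/".join(f"{col}={row[col]}" for col in partition_cols)
        # The partition values live in the directory name, so the stored
        # row keeps only the remaining columns.
        stored = {k: v for k, v in row.items() if k not in partition_cols}
        partitions[key].append(stored)
    return dict(partitions)


if __name__ == "__main__":
    rows = [
        {"year": 2017, "month": 3, "clicks": 10},
        {"year": 2017, "month": 3, "clicks": 5},
        {"year": 2017, "month": 4, "clicks": 7},
    ]
    for directory, data in partition_rows(rows, ["year", "month"]).items():
        print(directory, data)
```

A query that filters on the partition columns then only needs to read the matching directories rather than the whole table.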

Page 13: Impala use case @ edge

• Hadoop Cluster
  o Cluster sizing
  o Workload testing (query throughput and response time)

• Database Design
  o Identify access patterns based on real use cases
  o Make sure we’re not generating too many partitions
  o Make sure the data in each partition is large enough
  o Design our “Star Schema” data warehouse

• Data Types
  o Data consistency across Pig, Hive, and Impala

• File formats

• Tune queries
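
The two partition checks in the database design list (not too many partitions, and enough data in each) can be estimated before a table is created. A rough Python sketch follows; the thresholds `max_partitions` and `min_partition_mb` and their default values are illustrative assumptions, not figures from the slides, and should be replaced with numbers from your own workload testing.

```python
from dataclasses import dataclass


@dataclass
class PartitionEstimate:
    partitions: int          # total partitions the scheme would create
    avg_partition_mb: float  # average data volume per partition, in MB
    too_many: bool           # partition count exceeds the chosen limit
    too_small: bool          # average partition is below the chosen minimum


def estimate_partitions(total_gb: float, days: int, values_per_day: int = 1,
                        max_partitions: int = 10_000,
                        min_partition_mb: float = 256.0) -> PartitionEstimate:
    """Estimate partition count and average size for a date-based scheme."""
    partitions = days * values_per_day
    if partitions <= 0:
        raise ValueError("days and values_per_day must be positive")
    avg_mb = total_gb * 1024 / partitions
    return PartitionEstimate(
        partitions=partitions,
        avg_partition_mb=avg_mb,
        too_many=partitions > max_partitions,    # assumed limit, tune per cluster
        too_small=avg_mb < min_partition_mb,     # assumed minimum, tune per cluster
    )


if __name__ == "__main__":
    # Two years of data partitioned by day, then by day and hour.
    print(estimate_partitions(total_gb=730, days=730))
    print(estimate_partitions(total_gb=730, days=730, values_per_day=24))
```

In this example, adding an hour-level partition column to the same data multiplies the partition count by 24 and shrinks each partition accordingly, which is the trade-off the list warns about.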

Page 15: Impala use case @ edge

● Year = 2017
  ○ Month = 03
    ■ Day = 01
    ■ Day = 02
    ■ Day = 03
    ■ …
  ○ Month = 04
    ■ Day = 01
    ■ Day = 02
    ■ ...
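
The year/month/day hierarchy above corresponds to nested directories in HDFS, one per partition. A minimal Python sketch of that mapping, assuming the Hive-style `key=value` directory naming used for partitioned tables; the base path `/warehouse/events`, the lowercase column names, and the zero-padded values (which imply string-typed partition columns) are assumptions for illustration, not details from the slides.

```python
from datetime import date, timedelta
from typing import List


def partition_path(base: str, d: date) -> str:
    """Return the partition directory for a single day."""
    return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}"


def partition_paths(base: str, start: date, end: date) -> List[str]:
    """List partition directories for every day from start to end, inclusive."""
    days = (end - start).days + 1
    return [partition_path(base, start + timedelta(days=i)) for i in range(days)]


if __name__ == "__main__":
    # "/warehouse/events" is a hypothetical table location.
    for path in partition_paths("/warehouse/events", date(2017, 3, 1), date(2017, 3, 3)):
        print(path)
```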

Page 16: Impala use case @ edge

• Although we use a “Star Schema” design in Impala, there are many architectural differences between our Impala layout and the old RDBMS system.

• Keep that in mind and avoid carrying your existing RDBMS data storage and processing strategies over to Impala

Page 19: Impala use case @ edge

• Instead of using MapReduce, Impala reads the HDFS data directly

• Impala allows users to query data in HDFS using an SQL-like language

• The administrative tasks related to Impala are greatly simplified by Cloudera Manager
