
Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis


Page 1: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Egor Pakhomov
Data Architect, AnchorFree
[email protected]

Data infrastructure architecture for a medium size organization: tips for collecting, storing and analysis.

Page 2: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Medium organization (<500 people):

• DATA CUSTOMERS: >10
• DATA VOLUME: “Big data”
• DATA TEAM PEOPLE RESOURCES: enough to integrate and support some open source stack
• FINANCIAL RESOURCES: enough to buy hardware for a Hadoop cluster

Big organization (>500 people):

• DATA CUSTOMERS: >100
• DATA VOLUME: “Big data”
• DATA TEAM PEOPLE RESOURCES: enough to write our own data tools
• FINANCIAL RESOURCES: enough to buy some cloud solution (Databricks Cloud, Google BigQuery...)

Data infrastructure architecture

Page 3: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

HOW TO MANAGE BIG DATA

WHEN YOU ARE NOT THAT BIG?

Page 4: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

About me

• Data Architect at AnchorFree
• Spark contributor since 0.9
• Integrated Spark in Yandex Islands; worked in Yandex Data Factory
• Participated in the development of “Alpine Data”, a Spark-based data platform

Page 5: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Agenda

1. Data Querying: Why is SQL important and how do you use it in Hadoop?
   • SQL vs R/Python
   • Impala vs Spark
   • Zeppelin vs SQL desktop client

2. Data Storage: How do you store data to query it fast and change it easily?
   • JSON vs Parquet
   • Schema vs schema-less

3. Data Aggregation: How do you aggregate your data to work better with BI tools?
   • Aggregate your data!
   • SQL code is code!

Page 6: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

1. Data Querying

Why is SQL important and how do you use it in Hadoop?

1. SQL vs R/Python
2. Impala vs Spark
3. Zeppelin vs SQL desktop client

Page 7: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

SQL is the common interface between BI tools, analysts, QA, and regular data transformations.

Page 8: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

What do you need from an SQL engine?

• Fast
• Reliable
• Able to process terabytes of data
• Supports the Hive metastore
• Supports modern SQL statements

Page 9: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Hive metastore role

The Hive metastore maps table names to the files that back them on HDFS:

table_1 -> file341, file542, file453
table_2 -> file457, file458, file459
table_3 -> file37, file568, file359
table_4 -> file3457, file568, file349
...

The driver of each SQL engine resolves a table through the metastore and sends the work to its executors, which read the files from HDFS.
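
As a sketch of how a table ends up in this mapping (the column, the HDFS path, and the Impala INVALIDATE METADATA note are illustrative assumptions, not from the talk), a raw log table can be registered once and is then visible to any engine that talks to the metastore:

    -- Hypothetical registration of a raw JSON log table in the Hive metastore.
    -- Each record is one line of JSON text; the files themselves stay on HDFS.
    CREATE EXTERNAL TABLE logs.json_datasource (
      line STRING
    )
    LOCATION 'hdfs:///data/logs/json';

    -- Both engines now resolve the table through the metastore:
    --   Spark SQL: SELECT count(*) FROM logs.json_datasource;
    --   Impala:    INVALIDATE METADATA logs.json_datasource;  -- let Impala see the new table
    --              SELECT count(*) FROM logs.json_datasource;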

Page 10: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Which one would you choose? Both!

                                  SparkSQL   Impala
SUPPORT HIVE METASTORE               +          +
FAST                                 -          +
RELIABLE (WORKS NOT ONLY IN RAM)     +          -
JSON SUPPORT                         +          -
HIVE COMPATIBLE SYNTAX               +          -
OUT OF THE BOX YARN SUPPORT          +          -
MORE THAN JUST A SQL FRAMEWORK       +          -

Page 11: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Step 1: Connect Tableau to Hadoop

Tableau -> ODBC/JDBC server -> Hadoop

Page 12: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Step 2: Give SQL to users

SQL desktop clients -> ODBC/JDBC server -> Hadoop

Page 13: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

This would not work:

1. Managing a desktop application on N laptops
2. One Spark context shared by many users
3. Lack of visualization
4. No decent resource scheduling

Page 14: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

No decent resource scheduling: One user blocks everyone

Page 15: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

No decent resource scheduling: Hadoop is good at resource scheduling!

Page 16: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Apache Zeppelin is our solution

Page 17: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

1. Web-based

2. Notebook-based

3. Great visualisation

4. Works with both Impala and Spark

5. Has a cloud offering with support: Zeppelin Hub from NFLabs

It’s great!

Page 18: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Apache Zeppelin integration

Zeppelin runs in the browser and sends its Spark and Impala queries to the Hadoop cluster.
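
As a rough sketch of the workflow (assuming the Spark SQL interpreter is bound to %sql; the query and the field name are illustrative, not from the talk), a paragraph in a Zeppelin note looks like this, and the result grid can be flipped to a bar or pie chart right in the browser:

    %sql
    -- Hypothetical Zeppelin paragraph: the query runs on the cluster through the
    -- Spark SQL interpreter and the result is rendered inside the notebook.
    SELECT get_json_object(line, '$.Country') AS country,
           count(*)                           AS records
    FROM logs.json_datasource
    GROUP BY get_json_object(line, '$.Country')
    ORDER BY records DESC
    LIMIT 10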

Page 19: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

2. Data Storage

How do you store data to query it fast and change it easily?

1. JSON vs Parquet
2. Schema vs schema-less

Page 20: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

What do you need from data storage?

• Flexible format
• Fast querying
• Access to “raw” data
• Has a schema

Page 21: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Can we choose just one data format? We need both!

                        JSON   Parquet
FLEXIBLE                 +        -
ACCESS TO “RAW” DATA     +        -
FAST QUERYING            -        +
HAS A SCHEMA             -        +
IMPALA SUPPORT           -        +

Page 22: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Let's compare elegance and speed:

FORMAT    QUERY                                                                          TIME
Parquet   SELECT sum(some_field) FROM logs.parquet_datasource                            136 sec
JSON      SELECT sum(get_json_object(line, '$.some_field')) FROM logs.json_datasource    764 sec

Parquet is 5 times faster! But when you need raw data, 5 times slower is not that bad.
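
This is why both formats stay around: the raw JSON remains queryable, and the frequently used fields get materialized into Parquet. A minimal sketch of such a conversion job (the extracted fields and the CAST are illustrative assumptions), runnable as Hive-style SQL in Spark:

    -- Hypothetical conversion: pull the frequently queried fields out of the raw
    -- JSON table into a Parquet-backed table for fast scans.
    CREATE TABLE logs.parquet_datasource
    STORED AS PARQUET
    AS
    SELECT
      CAST(get_json_object(line, '$.some_field') AS BIGINT) AS some_field,
      get_json_object(line, '$.Country')                    AS country
    FROM logs.json_datasource;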

Page 23: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

How data in these formats compares:

JSON:

{"First name": "Mike", "Last name": "Smith", "Gender": "Male", "Country": "US"}
{"First name": "Anna", "Last name": "Smith", "Age": "45", "Country": "Canada", "Comments": "Some additional info"}
...

Parquet:

FIRST NAME   LAST NAME   GENDER   AGE
Mike         Smith       Male     NULL
Anna         Smith       NULL     45
...          ...         ...      ...
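
When an ad hoc question needs a field that never made it into the Parquet schema (like the Comments field above), the raw JSON table still has it. A sketch, with illustrative field names:

    -- Hypothetical ad hoc query against the raw JSON for a field that is not part
    -- of the Parquet schema. Slower than Parquet, but the data is not lost.
    SELECT get_json_object(line, '$.Country')  AS country,
           get_json_object(line, '$.Comments') AS comments
    FROM logs.json_datasource
    WHERE get_json_object(line, '$.Comments') IS NOT NULL;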

Page 24: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

3. Data Aggregation

How do you aggregate your data to work better with BI tools?

1. Aggregate your data!
2. SQL code is code!

Page 25: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Aggregate your data!

● “Big data” does not mean you need to query all of your data daily
● BI tools should not run big queries

Page 26: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

How does aggregation work?

The queries live in git. A query executor runs them on a schedule and writes the results into aggregated tables. The BI tool then only issues small queries (select * from ...) against those aggregated tables.
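
A sketch of what one of the queries in git could look like (the aggregated table, its columns, the ds partition column, and the hard-coded date are illustrative assumptions; in practice the date would be parameterized by the scheduler):

    -- Hypothetical daily aggregation, version-controlled in git and run once a day
    -- by the query executor. Assumes agg.daily_usage exists and is partitioned by ds.
    INSERT OVERWRITE TABLE agg.daily_usage PARTITION (ds = '2016-06-01')
    SELECT
      country,
      count(*)        AS events,
      sum(some_field) AS total_some_field
    FROM logs.parquet_datasource
    WHERE ds = '2016-06-01'
    GROUP BY country;

    -- The BI tool then only needs cheap queries such as:
    --   SELECT * FROM agg.daily_usage WHERE ds >= '2016-05-01';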

Page 27: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Report development process

1. Creating an aggregated table in Zeppelin
2. Creating a BI report based on this table
3. Adding the queries to git to run daily
4. Publishing the report

Process for changing the data behind a report:

1. Change the query in git

Page 28: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

One more tip

Page 29: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

We do not use the Spark that comes with the Hadoop distribution, because we need to:

1. Apply our own patches to the source code
2. Move to new versions before any official release
3. Move part of the infrastructure to a new version while the rest remains on the old one

Page 30: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Questions?

Contact:
Egor Pakhomov
[email protected]@gmail.com
https://www.linkedin.com/in/egor-pakhomov-35179a3a