
Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis


Page 1: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Egor Pakhomov
Data Architect, AnchorFree
[email protected]

Data infrastructure architecture for a medium size organization: tips for collecting, storing and analysis.

Page 2: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Medium organization (<500 people):

• DATA CUSTOMERS: >10
• DATA VOLUME: “Big data”
• DATA TEAM PEOPLE RESOURCES: enough to integrate and support some open source stack
• FINANCIAL RESOURCES: enough to buy hardware for a Hadoop cluster

Big organization (>500 people):

• DATA CUSTOMERS: >100
• DATA VOLUME: “Big data”
• DATA TEAM PEOPLE RESOURCES: enough to write our own data tools
• FINANCIAL RESOURCES: enough to buy some cloud solution (Databricks Cloud, Google BigQuery...)

Data infrastructure architecture

Page 3: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

HOW TO MANAGE BIG DATA

WHEN YOU ARE NOT THAT BIG?

Page 4: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

About me

• Data Architect at AnchorFree
• Spark contributor since 0.9
• Integrated Spark in Yandex Islands; worked in Yandex Data Factory
• Participated in the development of “Alpine Data”, a Spark-based data platform

Page 5: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Agenda

1. Data Querying: Why is SQL important and how do you use it in Hadoop?
   • SQL vs R/Python
   • Impala vs Spark
   • Zeppelin vs SQL desktop client

2. Data Storage: How do you store data to query it fast and change it easily?
   • JSON vs Parquet
   • Schema vs schema-less

3. Data Aggregation: How do you aggregate your data to work better with BI tools?
   • Aggregate your data!
   • SQL code is code!

Page 6: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

1. Data Querying

Why is SQL important and how do you use it in Hadoop?

1. SQL vs R/Python
2. Impala vs Spark
3. Zeppelin vs SQL desktop client

Page 7: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

SQL is the common interface between BI tools, analysts, QA, and regular data transformations.

Page 8: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

What do you need from an SQL engine?

• Fast
• Reliable
• Able to process terabytes of data
• Supports the Hive metastore
• Supports modern SQL statements

Page 9: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Hive metastore role

The Hive metastore maps table names to the files that back them on HDFS:

table_1 -> file341, file542, file453
table_2 -> file457, file458, file459
table_3 -> file37, file568, file359
table_4 -> file3457, file568, file349
...

The driver of each SQL engine resolves a table through the metastore and sends the work to its executors, which read the files from HDFS.
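
As a sketch of how a table ends up in this mapping (the column, the HDFS path, and the Impala INVALIDATE METADATA note are illustrative assumptions, not from the talk), a raw log table can be registered once and is then visible to any engine that talks to the metastore:

    -- Hypothetical registration of a raw JSON log table in the Hive metastore.
    -- Each record is one line of JSON text; the files themselves stay on HDFS.
    CREATE EXTERNAL TABLE logs.json_datasource (
      line STRING
    )
    LOCATION 'hdfs:///data/logs/json';

    -- Both engines now resolve the table through the metastore:
    --   Spark SQL: SELECT count(*) FROM logs.json_datasource;
    --   Impala:    INVALIDATE METADATA logs.json_datasource;  -- let Impala see the new table
    --              SELECT count(*) FROM logs.json_datasource;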

Page 10: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Which one would you choose? Both!

                                  SparkSQL   Impala
SUPPORT HIVE METASTORE               +          +
FAST                                 -          +
RELIABLE (WORKS NOT ONLY IN RAM)     +          -
JSON SUPPORT                         +          -
HIVE COMPATIBLE SYNTAX               +          -
OUT OF THE BOX YARN SUPPORT          +          -
MORE THAN JUST A SQL FRAMEWORK       +          -

Page 11: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Step 1: Connect Tableau to Hadoop

Tableau -> ODBC/JDBC server -> Hadoop

Page 12: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Step 2: Give SQL to users

SQL desktop clients -> ODBC/JDBC server -> Hadoop

Page 13: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

This would not work:

1. Managing a desktop application on N laptops
2. One Spark context shared by many users
3. Lack of visualization
4. No decent resource scheduling

Page 14: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

No decent resource scheduling: One user blocks everyone

Page 15: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

No decent resource scheduling: Hadoop is good at resource scheduling!

Page 16: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Apache Zeppelin is our solution

Page 17: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

1. Web-based

2. Notebook-based

3. Great visualisation

4. Works with both Impala and Spark

5. Has a cloud offering with support: Zeppelin Hub from NFLabs

It’s great!

Page 18: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Apache Zeppelin integration

Zeppelin runs in the browser and sends its Spark and Impala queries to the Hadoop cluster.
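
As a rough sketch of the workflow (assuming the Spark SQL interpreter is bound to %sql; the query and the field name are illustrative, not from the talk), a paragraph in a Zeppelin note looks like this, and the result grid can be flipped to a bar or pie chart right in the browser:

    %sql
    -- Hypothetical Zeppelin paragraph: the query runs on the cluster through the
    -- Spark SQL interpreter and the result is rendered inside the notebook.
    SELECT get_json_object(line, '$.Country') AS country,
           count(*)                           AS records
    FROM logs.json_datasource
    GROUP BY get_json_object(line, '$.Country')
    ORDER BY records DESC
    LIMIT 10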

Page 19: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

2. Data Storage

How do you store data to query it fast and change it easily?

1. JSON vs Parquet
2. Schema vs schema-less

Page 20: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

What do you need from data storage?

• Flexible format
• Fast querying
• Access to “raw” data
• Has a schema

Page 21: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Can we choose just one data format? We need both!

                        JSON   Parquet
FLEXIBLE                 +        -
ACCESS TO “RAW” DATA     +        -
FAST QUERYING            -        +
HAS A SCHEMA             -        +
IMPALA SUPPORT           -        +

Page 22: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Let's compare elegance and speed:

FORMAT    QUERY                                                                          TIME
Parquet   SELECT sum(some_field) FROM logs.parquet_datasource                            136 sec
JSON      SELECT sum(get_json_object(line, '$.some_field')) FROM logs.json_datasource    764 sec

Parquet is 5 times faster! But when you need raw data, 5 times slower is not that bad.
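
This is why both formats stay around: the raw JSON remains queryable, and the frequently used fields get materialized into Parquet. A minimal sketch of such a conversion job (the extracted fields and the CAST are illustrative assumptions), runnable as Hive-style SQL in Spark:

    -- Hypothetical conversion: pull the frequently queried fields out of the raw
    -- JSON table into a Parquet-backed table for fast scans.
    CREATE TABLE logs.parquet_datasource
    STORED AS PARQUET
    AS
    SELECT
      CAST(get_json_object(line, '$.some_field') AS BIGINT) AS some_field,
      get_json_object(line, '$.Country')                    AS country
    FROM logs.json_datasource;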

Page 23: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

How data in these formats compares:

JSON:

{"First name": "Mike", "Last name": "Smith", "Gender": "Male", "Country": "US"}
{"First name": "Anna", "Last name": "Smith", "Age": "45", "Country": "Canada", "Comments": "Some additional info"}
...

Parquet:

FIRST NAME   LAST NAME   GENDER   AGE
Mike         Smith       Male     NULL
Anna         Smith       NULL     45
...          ...         ...      ...
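
When an ad hoc question needs a field that never made it into the Parquet schema (like the Comments field above), the raw JSON table still has it. A sketch, with illustrative field names:

    -- Hypothetical ad hoc query against the raw JSON for a field that is not part
    -- of the Parquet schema. Slower than Parquet, but the data is not lost.
    SELECT get_json_object(line, '$.Country')  AS country,
           get_json_object(line, '$.Comments') AS comments
    FROM logs.json_datasource
    WHERE get_json_object(line, '$.Comments') IS NOT NULL;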

Page 24: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

3. Data Aggregation

How do you aggregate your data to work better with BI tools?

1. Aggregate your data!
2. SQL code is code!

Page 25: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Aggregate your data!

● “Big data” does not mean you need to query all of your data daily
● BI tools should not run big queries

Page 26: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

How does aggregation work?

The queries live in git. A query executor runs them on a schedule and writes the results into aggregated tables. The BI tool then only issues small queries (select * from ...) against those aggregated tables.
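
A sketch of what one of the queries in git could look like (the aggregated table, its columns, the ds partition column, and the hard-coded date are illustrative assumptions; in practice the date would be parameterized by the scheduler):

    -- Hypothetical daily aggregation, version-controlled in git and run once a day
    -- by the query executor. Assumes agg.daily_usage exists and is partitioned by ds.
    INSERT OVERWRITE TABLE agg.daily_usage PARTITION (ds = '2016-06-01')
    SELECT
      country,
      count(*)        AS events,
      sum(some_field) AS total_some_field
    FROM logs.parquet_datasource
    WHERE ds = '2016-06-01'
    GROUP BY country;

    -- The BI tool then only needs cheap queries such as:
    --   SELECT * FROM agg.daily_usage WHERE ds >= '2016-05-01';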

Page 27: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Report development process

1. Creating an aggregated table in Zeppelin
2. Creating a BI report based on this table
3. Adding the queries to git to run daily
4. Publishing the report

Process for changing the data behind a report:

1. Change the query in git

Page 28: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

One more tip

Page 29: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

We do not use the Spark that comes with the Hadoop distribution, because we need to:

1. Apply our own patches to the source code
2. Move to new versions before any official release
3. Move part of the infrastructure to a new version while the rest remains on the old one

Page 30: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis

Questions?

Contact:
Egor Pakhomov
[email protected]@gmail.com
https://www.linkedin.com/in/egor-pakhomov-35179a3a