Big Data - Big Pitfalls

Roman Nikitchenko, 04.12.2014

www.vitech.com.ua

Any real big data is just about DIGITAL LIFE FOOTPRINT

THE SAME IS ABOUT...

NOT ALL THINGS IN OUR LIFE ARE NICE

BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE IT.

YARN

Don't shoot your own foot with BIG GUN!

Some aspects are more special.

Most dangerous things in Big Data

Basics

A couple of specific notes

Beware!

THE MOST SERIOUS BIG DATA FAILURE IS ...

NO DATA

NO DATA

NO MONEY

The biggest mistake in a BIG DATA strategy is to limit the amount of data you collect.

WHERE ARE

YOU?

DATA LAKE

Take as much data about your business processes as you can. The more data you have, the more value you can get from it.

YOU ALWAYS HAVE AN OPTION

● We have developed our own online storage which lowers maintenance and stores anything.

The most serious errors in Big Data are about operations and infrastructure, not about algorithms or code.

LIVE WITH IT

● We have a special engineering roadmap for big data infrastructure development.

Why hadoop?

[Figure: pictorial equation "... x MAX + ... = BIG DATA"]

Use robust solutions

What is HADOOP?

● Hadoop is an open source framework for big data, covering both distributed storage and processing.

● Hadoop is reliable and fault tolerant without relying on hardware for these properties.

● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.

Hadoop: don't do it yourself

● HortonWorks are 'barely open source'. Innovative, but 'running too fast'. Most of their key technologies are not so mature yet. Some people LOVE them.

● Cloudera is stable enough but not stale. Hadoop 2.5 with YARN, HBase 0.98.x, Spark 1.x. A good balance as of late 2014.

● MapR focuses on performance per node, but they are slightly outdated in terms of functionality, and their distribution costs money. For cases where node performance is a high priority.

Option? Our experience is:

HBase motivation

● Designed for throughput, not for latency.

● HDFS blocks are expected to be large. There is an issue with lots of small files.

● Write once, read many times ideology.

● MapReduce is not very flexible, and neither is any database built on top of it.

● How about realtime?
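The small-files issue can be made concrete with a back-of-the-envelope estimate. The sketch below assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (one per file, one per block) and a 128 MB block size. Both numbers, and the function name `namenode_heap_bytes`, are assumptions for illustration and do not come from the slides.

```python
import math

BYTES_PER_OBJECT = 150        # assumed rule of thumb, not a measured value
BLOCK_SIZE = 128 * 1024**2    # assumed HDFS block size (128 MB)


def namenode_heap_bytes(total_bytes, file_size):
    """Rough NameNode heap needed to track total_bytes stored as files of file_size."""
    files = math.ceil(total_bytes / file_size)
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    # One namespace object per file plus one per block of that file.
    return files * (1 + blocks_per_file) * BYTES_PER_OBJECT


ONE_TB = 1024**4
small = namenode_heap_bytes(ONE_TB, 64 * 1024)   # 1 TB stored as 64 KB files
large = namenode_heap_bytes(ONE_TB, 1024**3)     # 1 TB stored as 1 GB files
print(f"64 KB files: {small / 1024**2:.0f} MB of NameNode heap")
print(f"1 GB files:  {large / 1024:.0f} KB of NameNode heap")
```

Under these assumptions the same terabyte costs thousands of times more metadata when it is stored as small files, which is why large files fit HDFS so much better.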

Hadoop is...

● 64 GB of RAM is considered a pretty small amount. 128 GB is an increasingly common configuration.

● 2xCPU with 6 cores each is considered commodity.

● 4xHDD is a minimum. SSDs are used more and more often.

Uses commodity hardware...

The meaning of the word 'commodity' keeps growing

Virtualization

NOT SO REAL ELEPHANT

VIRTUALIZATION

CONCERNS

● Possible for key nodes. Not for workers unless you are really big.

● Several nodes on a single physical host: what happens if this host fails?

● Loaded services on a VM: is it meaningful? Double duties?

Virtualization: practical case

● Apache ZooKeeper is a QUORUM-based service.

● If a host running 2 ZK instances fails, everything fails, which breaks the tolerance to 1 failure.

● Can you guarantee equal performance for ZK service instances?

● DON'T PUT QUORUM SERVICES IN A VIRTUAL ENVIRONMENT!
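The quorum argument can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a 3-instance ensemble and strict-majority quorum. The host names and the helper functions are hypothetical and only illustrate the placement problem.

```python
def quorum_survives(placement, failed_host):
    """placement maps a physical host name to the number of ZK instances on it."""
    total = sum(placement.values())
    alive = total - placement.get(failed_host, 0)
    # A quorum is a strict majority of the whole ensemble.
    return alive >= total // 2 + 1


def tolerates_any_single_host_failure(placement):
    """True if losing any one physical host still leaves a quorum."""
    return all(quorum_survives(placement, host) for host in placement)


# Three ZK virtual machines packed onto two physical hosts.
virtualized = {"host-a": 2, "host-b": 1}
# Three ZK instances, one per physical host.
physical = {"host-a": 1, "host-b": 1, "host-c": 1}

print(tolerates_any_single_host_failure(virtualized))
print(tolerates_any_single_host_failure(physical))
```

With two instances on one host, losing that host leaves 1 of 3 instances, which is below the majority of 2, so the ensemble that looked tolerant to one failure is not.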

[Figure: REAL EXAMPLE with two HOSTs]

● Indeed, there are lots of options with virtualization. The only concern is your ability to use your own brain.

HBase motivation

Need online storage for big data?

LATENCY, SPEED and all Hadoop properties.

NO SECONDARY INDEXES OUT OF THE BOX.
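To see what the lack of secondary indexes means in practice, here is a minimal sketch that uses plain dictionaries as a stand-in for a store that can only look rows up by row key. `KeyValueTable` and `IndexedTable` are hypothetical names and not HBase APIs. The point is that the application has to maintain the index itself on every write.

```python
class KeyValueTable:
    """Stand-in for a store that can only look rows up by row key."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, columns):
        self.rows[row_key] = dict(columns)

    def get(self, row_key):
        return self.rows.get(row_key)


class IndexedTable:
    """Keeps a hand-rolled secondary index on one column next to the data."""

    def __init__(self, indexed_column):
        self.data = KeyValueTable()
        self.index = {}  # column value -> set of row keys
        self.indexed_column = indexed_column

    def put(self, row_key, columns):
        old = self.data.get(row_key)
        if old is not None and self.indexed_column in old:
            # Drop the stale index entry before writing the new value.
            self.index[old[self.indexed_column]].discard(row_key)
        self.data.put(row_key, columns)
        if self.indexed_column in columns:
            self.index.setdefault(columns[self.indexed_column], set()).add(row_key)

    def find(self, value):
        """Return the row keys whose indexed column equals value."""
        return sorted(self.index.get(value, set()))


users = IndexedTable("city")
users.put("user1", {"city": "Kyiv"})
users.put("user2", {"city": "Kyiv"})
users.put("user1", {"city": "Lviv"})  # an update must also fix the index
print(users.find("Kyiv"), users.find("Lviv"))
```

Without the index, answering "who lives in Kyiv?" would require scanning every row.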

YOU ALWAYS HAVE LOTS OF OPTIONS

● We have built our own search indexing technology.

● SOLR indexes documents. What is stored in the SOLR index is not what you index. SOLR is NOT A STORAGE, ONLY AN INDEX.

● But it can index ANYTHING. The search result is a document ID.

INDEX UPDATE

Search responses

INDEX QUERY

An index update request is analyzed, tokenized, transformed... and the same goes for queries.
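The idea that updates and queries pass through the same analysis, and that the index holds only tokens and document IDs, can be sketched with a toy inverted index. This is not SOLR code. `analyze` and `TinyIndex` are hypothetical names, and the analysis chain (lowercase, then split on non-alphanumerics) is an assumption chosen for brevity.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analysis chain: lowercase, then split on non-alphanumeric characters."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]


class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of document IDs

    def update(self, doc_id, text):
        # Only tokens and the document ID are kept, not the original text.
        for token in analyze(text):
            self.postings[token].add(doc_id)

    def query(self, text):
        # The query goes through the same analysis as the indexed text.
        tokens = analyze(text)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


index = TinyIndex()
index.update("row-17", "Big Data, BIG pitfalls")
index.update("row-42", "Small data")
print(index.query("big PITFALLS"))
```

The query returns document IDs only. The caller then has to fetch the actual documents from wherever they are stored.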

INDEX ALTERNATIVE: SOLR

● HBase handles online requests that change user data.

● NGData Lily indexer handles the stream of changes and transforms them into SOLR index change requests.

● Indexes are built on SOLR so HBase data are searchable.

HBase: Data and search integration

● Client (data update): the user just puts (or deletes) data.

● HBase cluster (HBase regions): REPLICATION can be set up at the column family level.

● Lily HBase NRT indexer: translates data changes into SOLR index updates.

● SOLR cloud: finally provides search. Search requests (HTTP) come in, search responses go back.

● Apache ZooKeeper does all coordination.

● HDFS: serves as the low-level file system.

ETL

LOAD YOUR DATA WITH CARE

ETL

ENTERPRISE DATA HUB

Don't ruin your existing data warehouse. Just extend it with a new, centralized big data storage through a data migration solution.

ETL & BD: main stages

[Diagram: SQL server (Table1, Table2, Table3, Table4) -> Transform -> BIG DATA shards]

● SQL solutions are usually not as distributed as Big Data ones. How do you partition your data?

● Big data stores are mostly non-relational. You have to map table relations onto objects. Where do you put this complexity?
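Both questions, how to partition and how to map relations onto objects, can be illustrated with a small sketch. It assumes stable hash partitioning on the object key and a one-to-many relation folded into a nested object. The tables `customers` and `orders`, their columns, and the function names are hypothetical.

```python
import hashlib


def shard_for(key, shard_count):
    """Stable hash partitioning: the same key always lands on the same shard."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def to_objects(customers, orders):
    """Fold a one-to-many relation (customer -> orders) into nested objects."""
    objects = {
        c["id"]: {"id": c["id"], "name": c["name"], "orders": []}
        for c in customers
    }
    for o in orders:
        objects[o["customer_id"]]["orders"].append({"id": o["id"], "total": o["total"]})
    return list(objects.values())


customers = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [
    {"id": 10, "customer_id": 1, "total": 25.0},
    {"id": 11, "customer_id": 1, "total": 40.0},
    {"id": 12, "customer_id": 2, "total": 5.0},
]

# Each nested object goes to the shard chosen by its key.
shards = {}
for obj in to_objects(customers, orders):
    shards.setdefault(shard_for(obj["id"], 3), []).append(obj)
print(shards)
```

Wherever `to_objects` runs, on the SQL side or on the big data side, is where the join complexity ends up.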

JOIN Partition

EXTRACT TRANSFORM LOAD

ETL & BD: complexity on SQL

[Diagram: SQL server performs the JOIN over Table1, Table2, Table3 and sends a single ETL stream to the BIG DATA shards]

● It's hard to transform SQL relationships into NoSQL objects: complex joins.

● A simple stream on the big data side and lower network traffic, but a HUGE load on SQL.

● What if you have several SQL servers and you need 2 times faster import?

SQL dies on this

ETL & BD: complexity on BD side

[Diagram: SQL server streams Table1, Table2, Table3 as separate ETL streams; the JOIN happens on the BIG DATA shards]

● Simple streaming from SQL. Things like joins happen on the Big Data side.

● Even if you have 100 SQL servers, you only have to scale a single cluster.

● Network load is more intensive.

Much more scalable

ETL stream

ETL stream

ETL stream

● YARN forms the resource management layer and completes a real distributed data OS, so heterogeneous clusters and multi-tenancy become real.

● New distributed processing approaches: MapReduce is now just one YARN application among others.

YARN: future of Hadoop

World's first ever DATA OS

A 10,000-node computer... Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy, fault tolerance.

This is how retail agents often work.

YARN

This is how it often works.

YARN

[Diagram: the CPUs YARN presents (CPU CPU CPU CPU) versus what can be reality (CPU, CPU CPU CPU)]

It's about reservation. Indeed, you could have no resources left because of a service that is not aware of YARN.
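The gap between reservation and reality can be shown with simple bookkeeping. This is a deliberately simplified sketch. It assumes YARN only subtracts its own container allocations from the capacity configured for the node (in a real cluster that capacity is a setting such as `yarn.nodemanager.resource.cpu-vcores`) and knows nothing about other processes on the host. The function names and numbers are hypothetical.

```python
def yarn_view(node_vcores, allocated_to_containers):
    """What YARN believes is free: configured capacity minus its own reservations."""
    return node_vcores - allocated_to_containers


def real_free(physical_cores, used_by_containers, used_by_other_services):
    """What is actually free on the host once non-YARN services are counted."""
    return max(0, physical_cores - used_by_containers - used_by_other_services)


# A 4-core node with one core reserved by a YARN container
# and a YARN-unaware service keeping three cores busy.
print("YARN presents:", yarn_view(4, 1), "free vcores")
print("Reality:      ", real_free(4, 1, 3), "free cores")
```

Here YARN would still hand out three more vcores even though the host has nothing left, so new containers would compete with the unmanaged service.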

Apache Spark

● A better MapReduce, with at least some MapReduce elements that can be reused.

● New job models. Not only Map and Reduce.

● Scala and Python API in addition to Java. Functional model support.

● Results can be passed through memory, including the final one.

● Works much better if it knows the size of the job to do. Streaming is just a sequence of small jobs.

● Requires proper YARN tuning to use resources properly. No dynamic allocation of executors.

● Persistence: an int limitation of 2G. A HUGE amount of memory as of today.

● You cannot partition data 'on the fly'. You have to guess the right way up front.
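Since partitioning has to be guessed up front, a rough lower bound on the partition count helps. The sketch below assumes the 2G limit means that a single persisted block cannot exceed Java's `Integer.MAX_VALUE` bytes. The `skew_factor` and `headroom` knobs are hypothetical tuning parameters for illustration and are not Spark settings.

```python
import math

MAX_BLOCK_BYTES = 2**31 - 1  # Java Integer.MAX_VALUE, the assumed per-block limit


def min_partitions(dataset_bytes, skew_factor=1.0, headroom=0.5):
    """Smallest partition count that keeps the largest partition under the limit.

    skew_factor: assumed ratio of the largest partition to the average one.
    headroom:    fraction of the limit we allow a partition to use.
    """
    budget = MAX_BLOCK_BYTES * headroom
    return max(1, math.ceil(dataset_bytes * skew_factor / budget))


ONE_TB = 2**40
print(min_partitions(ONE_TB))                   # evenly spread data
print(min_partitions(ONE_TB, skew_factor=4.0))  # largest partition 4x the average
```

The result is only a floor. Skewed keys or data growth push the safe number higher, which is why guessing too low is a pitfall.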

● Dynamic, faster to start up, reuses resources.

● Unified management infrastructure such as logging.

Map-reduce + Spark

YARN

Your cluster is ready for the next tasks

It is simply too good to wait...

TRUST ME ;-)

Share your knowledge!

DO NOT HIDE YOUR EXPERIENCE

Questions and discussion
