
Big Data - An Overview


Arvind Kalyan

Engineer at LinkedIn

What makes data big?

Size?

The 100 GB 1980 U.S. Census database was considered ‘big’ at the time and required sophisticated machinery

http://www.columbia.edu/cu/computinghistory/mss.html

Size?

7 billion people on earth…

storing name (~20 chars), age (7 bits), and gender (1 bit) for everyone on earth:

7 billion * (20 * 8 + 7 + 1) / 8 bytes = 147 GB

In the last 25 years the world’s population hasn’t grown much and is still in the same order of magnitude…
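
A quick sanity check of that arithmetic, as a minimal sketch in Python (the field sizes are the same assumptions used above):

```python
# Back-of-the-envelope: bytes needed to store name, age, and gender
# for every person on earth, using the field sizes assumed above.
people = 7_000_000_000
bits_per_person = 20 * 8 + 7 + 1      # ~20-char name, 7-bit age, 1-bit gender
total_bytes = people * bits_per_person // 8
print(total_bytes / 1e9, "GB")        # -> 147.0 GB
```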

Size?

i.e., the cardinality of real-world entities doesn’t change/grow very fast… so by itself that really is not that much data…

Size?

A server with 128 GB of RAM and multiple TB of disk space is not difficult to find in 2015

Size?

but observations on those entities can be far more numerous

Those 7 billion people interacting with other people, websites, products, etc. at different points in time and at different locations quickly explode into a huge number of data points

Analysis

‘big-data’ becomes a challenge when you want to analyze all of those observations to identify trends & patterns

RDBMS for performing analysis

RDBMS these days can hold *much* larger volumes on a single machine than they could in the past

RDBMS for performing analysis

RDBMS is good at storing data & fetching individual rows satisfying your query

RDBMS for performing analysis

MySQL, for example, when used right, can guarantee data durability and at the same time provide some of the lowest read latencies

RDBMS for performing analysis

RDBMS can also be used for ad-hoc analysis

performance of such analysis can be improved by sacrificing some ‘relational’ principles:

for example, by denormalizing tables (cost of copy vs seek)
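
A minimal sketch of that trade-off (the schema and column names here are made up purely for illustration): the denormalized table copies the user attribute onto every row, so the analysis becomes a single-table scan instead of a join.

```python
# Denormalization: trade extra copies of data (the "copy") for fewer
# joins/index lookups (the "seek") at analysis time.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- normalized: user attributes live only in `users`
    CREATE TABLE users  (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (user_id INTEGER, amount REAL);

    -- denormalized: `country` is copied onto every order row
    CREATE TABLE orders_denorm (user_id INTEGER, country TEXT, amount REAL);
""")

# normalized analysis needs a join
db.execute("""
    SELECT u.country, SUM(o.amount)
    FROM orders o JOIN users u ON u.id = o.user_id
    GROUP BY u.country
""").fetchall()

# denormalized analysis is a single-table scan
db.execute(
    "SELECT country, SUM(amount) FROM orders_denorm GROUP BY country"
).fetchall()
```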

RDBMS for performing analysis

But this doesn’t scale when you want to run the analysis across all rows for every user

some queries can end up running for seconds or even days, depending on the size of the tables

RDBMS for performing analysis

.. and then consider the overhead of doing this on an on-going basis

RDBMS for performing analysis

RDBMS is not the right system if you want to look at & process the full data-set

RDBMS for performing analysis

These days data is an asset and businesses & organizations need to extrapolate trends, patterns, etc. from that data

What’s the solution?

.. if not RDBMS

“Numbers Everyone Should Know”

L1 cache reference 0.5 ns

Branch mispredict 5 ns

L2 cache reference 7 ns

Mutex lock/unlock 100 ns

Main memory reference 100 ns

Compress 1K bytes with Zippy 10,000 ns

Send 2K bytes over 1 Gbps network 20,000 ns

Read 1 MB sequentially from memory 250,000 ns

Round trip within same datacenter 500,000 ns

Disk seek 10,000,000 ns

Read 1 MB sequentially from network 10,000,000 ns

Read 1 MB sequentially from disk 30,000,000 ns

Send packet CA->Netherlands->CA 150,000,000 ns

Let’s look closely at disk vs memory

Source: http://deliveryimages.acm.org/10.1145/1570000/1563874/jacobs3.jpg

Let’s look closely at disk vs memory

in general:

disk is slower than SSD, and

SSD is slower than memory

but more importantly…

Let’s look closely at disk vs memory

Memory is not always faster than disk

SSD is not always faster than disk

(e.g., sequential reads from disk can beat random reads from memory)

.. and at network vs disk..

network is not always slower than disk

plus a local machine and its disk have limits, but the network enables you to grow!

Lesson learned

access pattern is important

depending on data-set (size) & use-case, the access pattern alone can make a difference between possible & not possible!
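
A rough illustration using the latency numbers quoted above (a sketch, not a benchmark): reading 1 GB from a spinning disk as random 4 KB reads, each paying a seek, versus streaming it sequentially.

```python
# Rough cost of reading 1 GB from disk, using the "Numbers Everyone
# Should Know" figures quoted earlier.
SEEK_NS = 10_000_000            # one disk seek
SEQ_READ_1MB_NS = 30_000_000    # read 1 MB sequentially from disk

mb_to_read = 1024               # 1 GB

# random access: 4 KB per read -> 256 reads per MB, one seek each
# (transfer time ignored; seeks dominate)
random_ns = mb_to_read * 256 * SEEK_NS

# sequential access: just stream the 1024 MB
sequential_ns = mb_to_read * SEQ_READ_1MB_NS

print(f"random:     {random_ns / 1e9:,.0f} s")      # ~2,621 s (~44 minutes)
print(f"sequential: {sequential_ns / 1e9:,.0f} s")  # ~31 s
```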

The solution

more often, these days, it is better to have a distributed system to crunch the data

one node gathers partial results from many other nodes and produces the final result

individual nodes are in turn optimized to return results quickly from their partial, local data

big-data tech

‘big data’ technologies & tools help you define your task in a higher order language, abstracting out the details of distributed systems underneath

big-data tech

i.e., they encourage/force you to follow a certain access pattern

big-data tech

these are still tools

following the suggested best-practices helps make the best use of the tool

at the same time, following anti-patterns can make the situation appear more challenging

big-data tech: DSLs

some of the most popular frameworks happen to be DSLs that generate another set of instructions that actually execute

pig, hive, scalding, cascading, etc.

big-data tech: DSLs

the biggest challenges with these DSLs are getting them to work and long-term maintenance

big-data tech: DSLs

when using DSLs, code is written in one language and executed in another

if anything fails, the error message is usually associated with the latter and you have to know enough about the abstracted layer to be able to translate it back to your code

big-data tech: DSLs

but it usually works for the most part, because..

some of these popular frameworks have active user-groups and blog posts that’ll help you get the job done

big-data tech: DSLs

so, the more popular the technology, the better the documentation & support

also, the bigger the community, the better the odds that the technology keeps evolving

so start with the most popular/common technology that does the job, even if it means giving up a ‘cool’ feature provided by another, currently less popular tool, unless that feature is absolutely necessary

big-data tech: map/reduce

most of the current (as of Feb 2015) ‘big-data’ frameworks revolve around the map/reduce paradigm

… and use hadoop technologies underneath

big-data tech: map/reduce

hadoop technologies for big data processing can be seen as having 2 major components

hdfs => distributed filesystem for storage

‘map/reduce’ => programming model

big-data tech: hdfs

hdfs stores data in an immutable format

data is ‘partitioned’ & stored on different machines

each ‘part’ is also replicated, i.e., stored as multiple copies

big-data tech: hdfs

‘partitioning’ enables faster processing in parallel

also helps with data locality: code can run where data is and not the other way around

big-data tech: hdfs

‘replication’ increases availability of the data itself

it also helps with overall performance & availability of the tasks running on it (speculative execution): run the same task on multiple replicas, and wait for the first one to finish
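
As a rough sketch of what partitioning and replication mean in practice (the 128 MB block size and replication factor of 3 are common HDFS defaults, not something stated in this deck):

```python
# How a 1 TB file might be laid out on HDFS, assuming the common
# defaults of a 128 MB block size and a replication factor of 3.
import math

file_size_mb = 1024 * 1024      # 1 TB expressed in MB
block_size_mb = 128
replication = 3

blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_tb = file_size_mb * replication / (1024 * 1024)

print(f"blocks (units of parallel work): {blocks}")     # 8192
print(f"raw storage consumed: {raw_storage_tb:.0f} TB") # 3 TB
```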

big-data tech: map/reduce

‘map/reduce’ is a functional programming concept

map => processes & transforms each set of data points in parallel and outputs (key, value) pairs

reduce => gathers those partial results and comes up with the final result

big-data tech: map/reduce

‘map/reduce’ in hadoop also has one more important step between map and reduce

shuffle: makes sure all values for a given ‘key’ end up on the same reducer

big-data tech: map/reduce

‘map/reduce’ in hadoop also has a slightly different ‘reduce’

reduce: values are aggregated per key, not across the whole dataset
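
A minimal sketch of those three steps in plain Python (a word count simulated in-process, just to show the data flow; this is not Hadoop code):

```python
# Word count expressed as map -> shuffle -> reduce.
from collections import defaultdict

lines = ["big data is big", "data is an asset"]

# map: each input record is turned into (key, value) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# shuffle: group all values for the same key together
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# reduce: aggregate per key, not across the whole dataset
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'big': 2, 'data': 2, 'is': 2, 'an': 1, 'asset': 1}
```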

big-data tech: map/reduce

in general map/reduce shines for ‘embarrassingly parallel’ problems

trying to run non-parallelizable jobs on hadoop (like requiring global ordering, or something similar) might work now, but may not scale in the long run

http://en.wikipedia.org/wiki/Embarrassingly_parallel

big-data tech: map/reduce

But surprisingly a *lot* of ‘big-data’ problems can be modeled directly on map/reduce with little to no change

big-data tech: map/reduce

map/reduce on hadoop is a multi-tenant distributed system running on disk-local data

non-interactive analysis

big-data analysis/processing has typically been associated with non-interactive jobs. i.e., the user doesn’t expect the results to come back in a few seconds

the job usually takes a few mins to a few hours, or even days

the need for speed

What if we made it run on in-memory data?

the need for speed

this is the current trend

Spark and Presto are some noteworthy examples

not based on map/reduce programming paradigm

but still take advantage of the underlying distributed filesystem (hdfs)

faster big-data: spark

Spark is a Scala DSL

Resilient Distributed Datasets (RDDs): the primary abstraction in Spark

RDDs: collections of data kept in memory

provides a collections API, comparable to Scala’s, whose operations transform one RDD into another

faster big-data: spark

fault-tolerance: retries computation on certain failures

so it’s also good for the ‘backend’ / scheduled jobs in addition to interactive usage

faster big-data: spark

where it shines: complex, iterative algorithms, like in machine learning

Since RDDs can be cached in memory and shared across the cluster, the computation speeds up considerably in the absence of disk I/O
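
A minimal sketch of that idea using Spark’s Python API (PySpark is used here for consistency with the other snippets; the deck itself describes the Scala API). The HDFS path and the surrounding cluster setup are assumptions:

```python
# Iterative computation over a cached RDD: the data is read from HDFS
# once, kept in memory, and reused on every subsequent pass.
from pyspark import SparkContext

sc = SparkContext(appName="cached-iterations")

points = (sc.textFile("hdfs:///data/points.txt")   # hypothetical input path
            .map(float)
            .cache())                               # keep the parsed RDD in memory

total = points.count()                              # first action fills the cache
for i in range(10):                                 # later passes avoid disk I/O
    shifted_avg = points.map(lambda x, i=i: x + i).sum() / total
    print(i, shifted_avg)

sc.stop()
```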

faster big-data: presto

Presto is from Facebook; written in Java

essentially a distributed, read-only SQL query engine

designed specifically for interactive ad-hoc analysis over petabytes of data

data stays on disk, but the processing pipeline is fully in-memory

faster big-data: presto

no fault-tolerance, but extremely fast

ideal for ad-hoc queries on extremely large data that finish fast, but not for long-running or scheduled jobs

as of today (Feb 2015) there is no UDF support

references

http://www.akkadia.org/drepper/cpumemory.pdf

http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/stanford-295-talk.pdf

http://queue.acm.org/detail.cfm?id=1563874

http://en.wikipedia.org/wiki/Apache_Hadoop

https://prestodb.io/

Leave questions & comments below, or reach out through LinkedIn!

Arvind Kalyan
https://www.linkedin.com/in/base16