16
Components of Big Data 10/01/2018 25

Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

Components

of Big Data

10/01/2018 25

Page 2: Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

Storage of Big Data

Data is growing faster

than Moore’s Law

Too much data to fit

on a single machine

Partitioning

Replication

Fault-tolerance

10/01/2018 26

Page 3: Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

Hadoop Distributed File System(HDFS)

The most widely used distributed file system

Fixed-sized partitioning

3-way replication

Write-once read-many

10/01/2018

128MB 128MB 128MB 128MB 128MB 128MB …

27

Page 4: Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

Indexing

Data-aware organization

Global Index partitions the records into blocks

Local Indexes organize the records in a partition

Challenges:

Big volume

HDFS limitation

New programming

paradigms

Ad-hoc indexes

10/01/2018

Global index

Local indexes

28

Page 5: Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

Fault Tolerance

Replication

Redundancy

Multiple masters

10/01/2018 29

Page 6: Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

Streaming

Sub-second latency for queries

One scan over the data

(Partial) preprocessing

Continuous queries

Eviction strategies

In-memory indexes

10/01/2018

…1000100010101011101110101010110111010111011101110100…

Processing window

30

Page 7: Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

Task ExecutionMapReduce

Map-Shuffle- Reduce

Resiliency through

materialization

Resilient Distributed Datasets (RDD)

Directed-Acyclic-Graph (DAG)

In-memory processing

Resiliency through lineages

Hyracks

Stragglers

Load balance10/01/2018

M1 M2 … Mm

R1 R2 Rn

31

Page 8: Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

Query Optimization

Finding the most efficient query plan

e.g., grouped aggregation

Cost model (CPU – Disk – Network)

10/01/2018

Agg

Agg

Agg

Merge

Merge

Partition

Partition

Partition

Agg

Agg

Vs

32

Page 9: Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

Provenance

Debugging in distributed systems is painful

We need to keep track of transformations on

each record

10/01/2018 33

Page 10: Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

Big Graphs

Motivated by social networks

Billions of nodes and trillions of edges

Tens of thousands of insertions per second

Complex queries with graph traversals

10/01/2018 34

Page 11: Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

Hadoop Ecosystem

10/01/2018

Hadoop Distributed File System (HDFS)

Yet Another Resource Negotiator (YARN)

MapReduce Query Engine

Administration

Pig

35

Page 12: Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

Spark Ecosystem

10/01/2018

Hadoop Distributed File System (HDFS)

Yet Another

Resource Negotiator (YARN)

Resilient Distributed Dataset (RDD) a.k.a Spark Core

Data Frames MLlib GraphX SparkRSpark

Streaming

Spark SQL

36

Kubernetes

Page 13: Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

10/01/2018

Hyracks Data-parallel Platform

Algebricks

Algebra Layer

Hadoop MapReduce

CompatibilityPregelix

HiveSterixAsteixDBOther

compilersHyracks

jobs

Pregel

Jobs

MapReduce

Jobs

PigLatinHiveQLAsterixQL

37

Page 14: Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

Impala

10/01/2018

Hadoop Distributed File System (HDFS)

Yet Another Resource Negotiator (YARN)

Query Executor

Query Planner

Query Parser

38

Page 15: Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

SpatialHadoop

10/01/2018

Hadoop Distributed File System (HDFS) + Spatial Indexing

Yet Another Resource Negotiator (YARN)

MapReduce Processing + Spatial Query Processing

Spatial Visualization

Pig Latin + Pigeon

39

Page 16: Components of Big Dataeldawy/18FCS226/slides/CS... · Components of Big Data 10/01/2018 25. Storage of Big Data Data is growing faster than Moore’s Law Too much data to fit on a

Reading Material

“The Age of Analytics in a Data-driven World”

[Executive Summary]

by McKinsey & Company

10/01/2018 40