Big Data Sharing Session

Febiyan Rachman, Data Science Indonesia

[email protected]

@febiyanr

http://id.linkedin.com/in/febiyan

Logos, images, and pictures shown in this presentation belong to their respective owners. No copyright infringement intended.

Introduction to Big Data and Hadoop

Agenda

• Background of Big Data

• History of Hadoop

• Technologies Around Big Data

• What Big Data Should Be

Data Explosion

Human Generated

Business Generated

Machine Generated

Interaction Generated

Artwork Courtesy of Teradata

Image courtesy of http://bigdatabloggin.blogspot.com/

Hadoop Was Started by These Two Gentlemen

Mike Cafarella and Doug Cutting

Image courtesy of gigaom.com

How Would You Index This?

Google File System Paper to the Rescue (2003)

MapReduce Paper Completes the Vision (2004)

History of Hadoop

YEAR        WHAT
2002        Doug Cutting and Mike Cafarella started working on Nutch
2003        Google’s GFS paper
2004        Nutch Distributed File System (NDFS), by Doug Cutting
2004        Google’s MapReduce paper
2004-2005   Nutch MapReduce implementation
2006        NDFS and Nutch MapReduce became Hadoop
2008        Hadoop became a top-level Apache project

Why Hadoop?

• Stores and processes huge amounts of data (“Big Data”)

• Designed for affordable commodity servers

• Scales horizontally

HDFS

• Distributed file system

• Presents a single logical storage

• Breaks files into blocks

• Keeps 3 replicas of each block by default, for fault tolerance
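To make the HDFS points concrete, here is a minimal sketch in Python of the two ideas on this slide: breaking a file into blocks and keeping 3 replicas of each block on different nodes. It is a toy model, not HDFS code. The function names, the node names, and the round-robin placement are all assumptions made for illustration; real HDFS uses a rack-aware placement policy. The 128 (MB) block size in the example is only an illustrative value.

```python
from typing import Dict, List, Set


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the size of each block for a file of `file_size` units."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an assumption for this sketch only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + i) % len(nodes)] for i in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement: Dict[int, List[str]],
                           failed: Set[str]) -> bool:
    """A file stays readable if every block has at least one live replica."""
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    blocks = split_into_blocks(300, 128)          # a 300 MB file, 128 MB blocks
    nodes = ["node1", "node2", "node3", "node4"]  # hypothetical node names
    placement = place_replicas(len(blocks), nodes)
    print(blocks)
    print(placement)
    print(readable_after_failure(placement, {"node1", "node2"}))
```

With 3 replicas per block, the file in this toy model survives the loss of any two nodes, which is the fault-tolerance point of the slide.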

MapReduce

• A processing framework

• Processes data locally: bring the apps to the data!

• Distributed processing
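The MapReduce idea can be sketched with the classic word count, written here in plain Python rather than against the Hadoop API. Each string in `blocks` stands in for a block of data held by a different node; the map step runs on each block independently, a shuffle step groups values by key, and the reduce step aggregates each group. All function names are hypothetical, and everything runs in one process, so this only illustrates the programming model, not the distribution.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(block: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(mapped: Iterable[List[Tuple[str, int]]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as happens between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(blocks: List[str]) -> Dict[str, int]:
    """Run map on each block, shuffle, then reduce each key."""
    mapped = [map_phase(block) for block in blocks]  # one map task per block
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data hadoop"]))
```

Because each map task only needs its own block, the work can be sent to the node that already stores that block, which is what "bring the apps to the data" refers to.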

Technologies Around Big Data

Ingest

• Flume

• Kafka

• Sqoop

• …

Store

• HDFS

• HBase

• Cassandra

• MapR-FS

• MapR-DB

• …

Orchestrate

• ZooKeeper

• YARN

• Oozie

• Mesos

• Hue

• …

Technologies Around Big Data (II)

APIs and Interfaces

• Hive

• Impala

• Pig

• Mahout

• Zeppelin

• …

Framework/Platform

• MapReduce

• Spark

• Storm

• Flink

• Teradata Aster

• …

Big Data?

It is not just about technology.

It is not just about acquiring and storing data.

“It is more of an initiative that demands more analytics from all available data.”

Data-Driven Companies Outperform

[Infographic comparing data-driven companies with companies with low reliance on data. Artwork courtesy of Teradata.]

• Data-driven companies are more likely to outperform their competitors when it comes to profitability.

• They are also more likely to have a culture of creativity and innovation.

• They are better positioned for top-down and bottom-up cultural evolution and success, with top leaders who launch and drive data initiatives.

• They are more likely to realize the benefits of data, including better knowledge sharing, a more collaborative organization, greater quality and speed of execution, and faster decisions.

Percentage pairs shown in the infographic, in the order extracted: 68% vs. 40%, 78% vs. 37%, 65% vs. 42%, 70% vs. 41%, 59% vs. 33%, 55% vs. 24%, 55% vs. 28%.

Start with a vision.

Start with valuable use cases.

Thank You

[email protected]

@febiyanr

http://id.linkedin.com/in/febiyan
