Accelerating Startup Growth with Big Data. Dwika Sudrajat, IT Consultant. Florida, Hong Kong & Jakarta. November 23rd, 2016. email: [email protected] Florida: +1-407-2502812 Hong Kong: +852-54152971 Jakarta: +62-8161108571 FB: dwika.sudrajat TW: @dwikasudrajat managingconsultant.blogspot.com dwikasudrajat.blogspot.com dwikasudrajat.wordpress.com

Sharing bisnis big data v3 part1


Page 1: Sharing  bisnis big data v3 part1


Page 2: Sharing  bisnis big data v3 part1

Job Opportunities


Page 3: Sharing  bisnis big data v3 part1


Startup Team at Work

Page 4: Sharing  bisnis big data v3 part1


Startup Team Creating Mobile Apps

Page 5: Sharing  bisnis big data v3 part1


What technologies do you think they are running on?

Page 6: Sharing  bisnis big data v3 part1


Conventional Startup Development Team

Page 7: Sharing  bisnis big data v3 part1


Today's Startup Development Team

Page 8: Sharing  bisnis big data v3 part1


From LAMP to MEAN

Page 9: Sharing  bisnis big data v3 part1


Page 10: Sharing  bisnis big data v3 part1


Modern web development stack

Page 11: Sharing  bisnis big data v3 part1


MEAN.JS: a full-stack JavaScript solution using MongoDB, Express, AngularJS, and Node.js

Page 12: Sharing  bisnis big data v3 part1

What is Big Data?

Page 13: Sharing  bisnis big data v3 part1


Data

Page 14: Sharing  bisnis big data v3 part1


Hadoop, Why?

Page 15: Sharing  bisnis big data v3 part1

Hadoop, Volume, Velocity, Variety

Page 16: Sharing  bisnis big data v3 part1


Data Growing

Page 17: Sharing  bisnis big data v3 part1

Real Application of Big Data Today

Page 18: Sharing  bisnis big data v3 part1

Challenges

SHORT LIFESPAN OF THE DATA

FAST MOVING DATA

FAST DATA PROCESSING

HIGH VARIETY OF DATA

Page 19: Sharing  bisnis big data v3 part1


Data Volume and Variety

Page 20: Sharing  bisnis big data v3 part1

Four V’s and a C

It is not only volume that makes big data big; it is about the three V's (high Volume, Variety, Velocity), plus a fourth: high Value!

In addition, there is the Challenge: the data is very complex in nature and often unstructured (text documents, emails, images and videos, click-stream data, social media feed data, etc.).

Page 21: Sharing  bisnis big data v3 part1


Eliminate a Single Point of Failure

Make sure the load balancer itself does not become a single point of failure. Load balancers must be implemented in a high-availability cluster.
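A minimal sketch of the idea, assuming a simple active/standby arrangement: requests go to the first healthy balancer, so losing the primary does not stop traffic. The names `LoadBalancer`, `BalancerCluster`, `lb-primary`, and `lb-standby` are hypothetical and do not come from the slides; a real deployment would normally rely on a dedicated high-availability mechanism rather than application code like this.

```python
from dataclasses import dataclass


@dataclass
class LoadBalancer:
    """Hypothetical balancer with a health flag (illustration only)."""
    name: str
    healthy: bool = True

    def forward(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class BalancerCluster:
    """Active/standby group: requests go to the first healthy balancer."""

    def __init__(self, balancers):
        self.balancers = list(balancers)

    def active(self) -> LoadBalancer:
        for balancer in self.balancers:
            if balancer.healthy:
                return balancer
        raise RuntimeError("no healthy load balancer left")

    def handle(self, request: str) -> str:
        return self.active().forward(request)


cluster = BalancerCluster([LoadBalancer("lb-primary"), LoadBalancer("lb-standby")])
before = cluster.handle("GET /")
cluster.balancers[0].healthy = False  # simulate the primary failing
after = cluster.handle("GET /")
print(before)
print(after)
```

With a single balancer in the list, the same failure would raise an error for every request, which is the single point of failure the slide warns about.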

Page 22: Sharing  bisnis big data v3 part1


Page 23: Sharing  bisnis big data v3 part1


Page 24: Sharing  bisnis big data v3 part1


Page 25: Sharing  bisnis big data v3 part1


Page 26: Sharing  bisnis big data v3 part1

A Typical Hadoop Cluster

1. Client

2. Master Node: Name Node, Job Tracker

3. Slave Nodes (Rack 1, Rack 2, Rack 3; three per rack): Data Nodes, Task Trackers, Map / Reduce

Diagram labels: JOB ASSIGNMENT, TASK ASSIGNMENT, DATA ASSIGNMENT TO NODES, DATA READ, DATA WRITE, METADATA FOR BLOCK INFO

Page 27: Sharing  bisnis big data v3 part1

Hadoop File System (HDFS)

1. Client consults Name Node
2. Client writes block to Data Node
3. Data Node replicates block
4. Cycle repeats for next blocks

Rack 1: Data Node 1, Data Node 2, Data Node 3

Rack 2: Data Node 4, Data Node 5, Data Node 6

Rack 3: Data Node 7, Data Node 8, Data Node 9

Diagram labels: Client, FILE, Name Node, DATA ASSIGNMENT TO NODES, DATA READ, DATA WRITE, METADATA FOR BLOCK INFO

Name Node metadata: Rack 1: Data Node 1, Data Node 2, … Rack 2: Data Node 4, …
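The four write steps can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the placement rule below (one replica per rack, rotating with the block index) is a simplification chosen for the example and is not HDFS's actual placement policy, and `NameNode`, `write_file`, and `read_file` are hypothetical names, not Hadoop APIs.

```python
RACKS = {
    "Rack 1": ["Data Node 1", "Data Node 2", "Data Node 3"],
    "Rack 2": ["Data Node 4", "Data Node 5", "Data Node 6"],
    "Rack 3": ["Data Node 7", "Data Node 8", "Data Node 9"],
}


def split_into_blocks(data: bytes, block_size: int):
    """Cut a file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class NameNode:
    """Toy name node: holds only metadata, i.e. which nodes store each block."""

    def __init__(self, racks, replication=3):
        assert replication <= len(racks), "this sketch places one replica per rack"
        self.racks = racks
        self.replication = replication
        self.block_map = {}  # (filename, block index) -> list of data nodes

    def choose_targets(self, filename, index):
        # Assumed policy for illustration: one node from each of
        # `replication` different racks, rotating with the block index.
        rack_names = sorted(self.racks)
        targets = []
        for offset in range(self.replication):
            nodes = self.racks[rack_names[(index + offset) % len(rack_names)]]
            targets.append(nodes[index % len(nodes)])
        self.block_map[(filename, index)] = targets
        return targets


def write_file(name_node, storage, filename, data, block_size):
    for index, block in enumerate(split_into_blocks(data, block_size)):
        first, *rest = name_node.choose_targets(filename, index)  # 1. consult Name Node
        storage[first][(filename, index)] = block                 # 2. write block to a Data Node
        for node in rest:                                         # 3. that Data Node replicates it
            storage[node][(filename, index)] = storage[first][(filename, index)]
        # 4. the loop repeats for the next block


def read_file(name_node, storage, filename):
    """Reassemble the file from the first replica of each block."""
    indexes = sorted(i for (f, i) in name_node.block_map if f == filename)
    return b"".join(
        storage[name_node.block_map[(filename, i)][0]][(filename, i)] for i in indexes
    )


name_node = NameNode(RACKS)
storage = {node: {} for nodes in RACKS.values() for node in nodes}
write_file(name_node, storage, "FILE", b"abcdefghij", block_size=4)
print(name_node.block_map)
print(read_file(name_node, storage, "FILE"))
```

Because every block ends up on three different racks in this sketch, losing any single node or rack still leaves two copies of each block.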

Page 28: Sharing  bisnis big data v3 part1

MapReduce

Stages: Input, Splitting, Map, Shuffle/Sort, Reduce, Output

Map:

the, 1; quick, 1; brown, 1; fox, 1

the, 1; fox, 1; ate, 1; the, 1; mouse, 1

how, 1; now, 1; brown, 1; cow, 1

Shuffle/Sort:

the, 1; the, 1; the, 1

fox, 1; fox, 1

quick, 1

brown, 1; brown, 1

ate, 1

mouse, 1

how, 1

now, 1

cow, 1

Reduce:

the, 3

fox, 2

quick, 1

brown, 2

ate, 1

mouse, 1

how, 1

now, 1

cow, 1

Output:

the, 3; fox, 2; quick, 1; brown, 2; ate, 1; mouse, 1; how, 1; now, 1; cow, 1

The Map function processes one line at a time, splits it into tokens separated by whitespace, and emits a key-value pair <word, 1> for each token.

The Reduce function just sums up the values, which are the occurrence counts for each key (i.e. words in this example).
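The word count flow above can be sketched in plain Python, with no Hadoop involved. The three input lines are an assumption, reconstructed from the key-value pairs shown in the Map stage; the function names `map_line`, `shuffle`, `reduce_counts`, and `word_count` are made up for this example.

```python
from collections import defaultdict


def map_line(line):
    """Map: split one line on whitespace and emit a (word, 1) pair per token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle/sort: group all emitted values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(key, values):
    """Reduce: sum the occurrence counts collected for one key."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_line(line)]
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


# Assumed input, inferred from the Map output on the slide.
lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = word_count(lines)
print(counts)
```

On a real cluster the map calls run in parallel on different splits and the shuffle moves data between machines; here everything runs in one process purely to show the data flow.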

Page 29: Sharing  bisnis big data v3 part1

MapReduce Wordcount Example in R

Map function.

Reduce function.

Reading the input from HDFS: from.dfs().

Writing the results back to HDFS: to.dfs().

Page 30: Sharing  bisnis big data v3 part1

What is MapReduce used for?

• At Google:
– Index building for Google Search
– Article clustering for Google News
– Statistical machine translation

• At Yahoo!:
– Index building for Yahoo! Search
– Spam detection for Yahoo! Mail

• At Facebook:
– Data mining
– Ad optimization
– Spam detection

Page 31: Sharing  bisnis big data v3 part1

Who uses Hadoop?

▐ Facebook (Hadoop, Hive, Scribe)
▐ Google File System (HDFS)
▐ Yahoo! (Hadoop in Yahoo Search)
▐ IBM Transarc (Andrew File System)
▐ Amazon/A9

Goals of HDFS - Hadoop Distributed File System

▐ Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
▐ Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
▐ Optimized for Batch Processing
– Provides very high aggregate bandwidth

Page 32: Sharing  bisnis big data v3 part1

Hadoop, Why?

▐ Need to process multi-petabyte datasets
▐ Need common infrastructure
– Efficient, reliable, open source (Apache License)
▐ The above goals are the same as Condor's, but
– Workloads are I/O-bound and not CPU-bound

Hive, Why?

▐ Need a multi-petabyte warehouse
▐ Hive is a Hadoop subproject!

Page 33: Sharing  bisnis big data v3 part1

What is MapReduce?

▐ Data-parallel programming model for clusters of commodity machines
▐ Pioneered by Google
– Processes 20 PB of data per day
▐ Popularized by open-source Hadoop project
– Used by Yahoo!, Facebook, Amazon, …

Hadoop at Facebook

▐ Production cluster
– 4800 cores, 600 machines, 16 GB per machine – April 2009
– 8000 cores, 1000 machines, 32 GB per machine – July 2009
– 4 SATA disks of 1 TB each per machine
– 2-level network hierarchy, 40 machines per rack
– Total cluster size is 2 PB, projected to be 12 PB in Q3 2009
▐ Test cluster
– 800 cores, 16 GB each
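As a rough sanity check on these figures, the raw disk capacity of each production configuration can be worked out from the machine and disk counts. This is a back-of-the-envelope sketch: the replication factor of 3 is an assumption (it is the HDFS default, but the slides do not say what this cluster used), and decimal units (1 PB = 1000 TB) are assumed.

```python
TB_PER_DISK = 1
DISKS_PER_MACHINE = 4
REPLICATION = 3  # assumption: HDFS default; not stated on the slide


def raw_capacity_pb(machines):
    """Raw disk capacity in PB, using decimal units (1 PB = 1000 TB)."""
    return machines * DISKS_PER_MACHINE * TB_PER_DISK / 1000


for label, machines in [("April 2009", 600), ("July 2009", 1000)]:
    raw = raw_capacity_pb(machines)
    unique = raw / REPLICATION
    print(f"{label}: {raw:.1f} PB raw, about {unique:.2f} PB of unique data at {REPLICATION}x replication")
```

The 600-machine configuration comes out at 2.4 PB raw, which is in the same range as the 2 PB total quoted on the slide.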

Page 34: Sharing  bisnis big data v3 part1

2016 - Hadoop clusters

▐ ~20,000 machines running Hadoop
▐ Largest clusters are currently 2000 nodes
▐ Several petabytes of user data (compressed, unreplicated)
▐ Run hundreds of thousands of jobs every month

Page 35: Sharing  bisnis big data v3 part1

2016 - Big Data Server Farm


Page 36: Sharing  bisnis big data v3 part1

Conclusions

The Digital Age brings many opportunities but also challenges.

Big Data and analytics can help meet the challenges and realize the opportunities.

It is within anyone's grasp; do it incrementally and iteratively.

Hadoop cloud solutions are scalable, flexible and cost-efficient, but sometimes limited in functionality (or not standardized).

Good data scientists are needed in a team of mixed competences to make the right choices.

Page 37: Sharing  bisnis big data v3 part1

Conclusions

Page 38: Sharing  bisnis big data v3 part1
Page 39: Sharing  bisnis big data v3 part1

QUESTIONS?


Q&A

Thanks