Accelerating Startup Growth with Big Data
Dwika Sudrajat, IT Consultant
Florida, Hong Kong & Jakarta. November 23rd, 2016
▐ email: [email protected]
▐ Florida: +1-407-2502812
▐ Hong Kong: +852-54152971
▐ Jakarta: +62-8161108571
▐ FB: dwika.sudrajat
▐ TW: @dwikasudrajat
▐ managingconsultant.blogspot.com
▐ dwikasudrajat.blogspot.com
▐ dwikasudrajat.wordpress.com
Job Opportunities
Startup Team at Work
Startup Team Creating Mobile Apps
What technologies do you think they are running on?
Conventional Startup Development Team
Today's Startup Development Team
From LAMP to MEAN
Modern web development stack
MEAN.JS: a full-stack JavaScript framework using MongoDB, Express, AngularJS, and Node.js
What is Big Data?
Data
Hadoop, Why?
Hadoop, Volume, Velocity, Variety
Data Growth
Real Application of Big Data Today
▐ Short lifespan of the data
▐ Fast-moving data
▐ Fast data processing
▐ High variety of data
Challenges
Data Volume and Variety
Four V’s and a C
It is not only volume that makes big data big; it is about four V's: high Volume, Variety, Velocity, and high Value.
In addition comes the C, the Challenge: the data is very complex in nature and often unstructured: text documents, emails, images and videos; clickstream data, social media feed data, etc.
Eliminate a Single Point of Failure: make sure the load balancer itself does not become a single point of failure. Load balancers must be deployed in a high-availability cluster.
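To illustrate the failover idea, a hedged sketch of round-robin balancing that skips backends marked unhealthy (a toy single-process model with invented backend names; it cannot show the HA pairing of the balancer itself, which the slide calls for):

```javascript
// Round-robin load balancing with failover: unhealthy backends are
// skipped so a single backend failure does not stop traffic. The
// balancer process itself would also be duplicated in an HA pair.
class RoundRobinBalancer {
  constructor(backends) {
    this.backends = backends.map((host) => ({ host, healthy: true }));
    this.next = 0;
  }

  // Called by a health check when a backend stops responding.
  markDown(host) {
    const b = this.backends.find((x) => x.host === host);
    if (b) b.healthy = false;
  }

  // Returns the next healthy backend, or null if all are down.
  pick() {
    for (let i = 0; i < this.backends.length; i++) {
      const idx = (this.next + i) % this.backends.length;
      if (this.backends[idx].healthy) {
        this.next = (idx + 1) % this.backends.length;
        return this.backends[idx].host;
      }
    }
    return null;
  }
}
module.exports = { RoundRobinBalancer };
```

After `markDown('app3')`, requests keep rotating over `app1` and `app2` only, which is the behavior that removes the backend tier as a single point of failure.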
A Typical Hadoop Cluster

[Diagram: a master node and three racks of slave nodes]
1. Client: data read/write; data assignment to nodes; consults the master for metadata on block info
2. Master Node: Name Node and Job Tracker (job assignment)
3. Slave Nodes: Data Nodes and Task Trackers running Map/Reduce (task assignment)

Write path:
1. Client consults the Name Node
2. Client writes a block to a Data Node
3. The Data Node replicates the block
4. The cycle repeats for the next blocks
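The write path above can be sketched as a toy simulation (a hedged JavaScript illustration, not real HDFS code; the node names and round-robin placement are invented, and real HDFS additionally does rack-aware placement, while a replication factor of 3 is the common default):

```javascript
// Toy simulation of the HDFS write path: for each block, the name
// node picks target data nodes, the client writes to the first one,
// and that node's pipeline replicates to the rest. Placement here
// is plain round-robin; real HDFS is rack-aware.
const REPLICATION = 3;

function writeFile(blockIds, dataNodes) {
  const metadata = {}; // name-node view: block id -> nodes holding it
  blockIds.forEach((blockId, i) => {
    // 1. Name node picks REPLICATION target nodes for this block.
    const targets = [];
    for (let r = 0; r < REPLICATION; r++) {
      targets.push(dataNodes[(i + r) % dataNodes.length]);
    }
    // 2. Client writes the block to targets[0].
    // 3. The data node pipeline replicates it to the remaining targets.
    metadata[blockId] = targets;
    // 4. The cycle repeats for the next block.
  });
  return metadata;
}
module.exports = { writeFile };
```

Because every block ends up on three nodes, the loss of any single data node leaves at least two replicas, which is the fault-tolerance property the cluster diagram is built around.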
Hadoop File System (HDFS)

[Diagram: a Name Node and nine Data Nodes spread across three racks]
– Client: data read/write; a file is split into blocks, and blocks are assigned to nodes
– Name Node: holds metadata for block info, e.g. Rack 1: Data Node 1, Data Node 2, …; Rack 2: Data Node 4, …
MapReduce
Phases: Input → Splitting → Map → Shuffle/Sort → Reduce → Output

Input (three splits):
  the quick brown fox
  the fox ate the mouse
  how now brown cow

Map (per split):
  the,1 quick,1 brown,1 fox,1
  the,1 fox,1 ate,1 the,1 mouse,1
  how,1 now,1 brown,1 cow,1

Shuffle/Sort (group by key):
  the,[1,1,1] fox,[1,1] quick,[1] brown,[1,1] ate,[1] mouse,[1] how,[1] now,[1] cow,[1]

Reduce / Output:
  the,3 fox,2 quick,1 brown,2 ate,1 mouse,1 how,1 now,1 cow,1
The Map function processes one line at a time, splits it into tokens separated by whitespace, and emits a key-value pair <word, 1>.
The Reduce function simply sums up the values, which are the occurrence counts for each key (i.e. the words in this example).
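The map/shuffle/reduce flow just described can be sketched as a single-process JavaScript simulation (no Hadoop involved; this only mirrors the dataflow, with function names chosen for illustration):

```javascript
// Single-process simulation of MapReduce wordcount: map emits
// <word, 1> pairs, shuffle groups values by key, reduce sums each
// group. In real Hadoop these phases run distributed across nodes.
function map(line) {
  return line.split(/\s+/).filter(Boolean).map((word) => [word, 1]);
}

function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

function reduce(key, values) {
  return [key, values.reduce((a, b) => a + b, 0)];
}

function wordcount(lines) {
  const mapped = lines.flatMap(map);
  const result = {};
  for (const [key, values] of shuffle(mapped)) {
    const [word, count] = reduce(key, values);
    result[word] = count;
  }
  return result;
}
module.exports = { wordcount };
```

Running it on the three example lines from the slide reproduces the output above: "the" appears 3 times, "fox" and "brown" twice, and every other word once.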
MapReduce Wordcount Example in R
▐ Map function
▐ Reduce function
▐ Reading the input from HDFS: from.dfs()
▐ Writing the results back to HDFS: to.dfs()
What is MapReduce used for?
• At Google:
  – Index building for Google Search
  – Article clustering for Google News
  – Statistical machine translation
• At Yahoo!:
  – Index building for Yahoo! Search
  – Spam detection for Yahoo! Mail
• At Facebook:
  – Data mining
  – Ad optimization
  – Spam detection
Who uses Hadoop?
▐ Facebook (Hadoop, Hive, Scribe)
▐ Google File System (HDFS)
▐ Yahoo! (Hadoop in Yahoo Search)
▐ IBM Transarc (Andrew File System)
▐ Amazon/A9
Goals of HDFS - Hadoop Distributed File System
▐ Very Large Distributed File System
  – 10K nodes, 100 million files, 10 PB
▐ Assumes Commodity Hardware
  – Files are replicated to handle hardware failure
  – Detects failures and recovers from them
▐ Optimized for Batch Processing
  – Provides very high aggregate bandwidth
Hadoop, Why?
▐ Need to process multi-petabyte datasets
▐ Need common infrastructure
  – Efficient, reliable, open source (Apache License)
▐ The above goals are the same as Condor's, but the workloads are IO-bound, not CPU-bound
Hive, Why?
▐ Need a multi-petabyte warehouse
▐ Hive is a Hadoop subproject!
What is MapReduce?
▐ Data-parallel programming model for clusters of commodity machines
▐ Pioneered by Google, which processes 20 PB of data per day with it
▐ Popularized by the open-source Hadoop project
  – Used by Yahoo!, Facebook, Amazon, …
Hadoop at Facebook
▐ Production cluster
  – 4800 cores, 600 machines, 16 GB per machine (April 2009)
  – 8000 cores, 1000 machines, 32 GB per machine (July 2009)
  – 4 SATA disks of 1 TB each per machine
  – 2-level network hierarchy, 40 machines per rack
  – Total cluster size is 2 PB, projected to be 12 PB in Q3 2009
▐ Test cluster
  – 800 cores, 16 GB each
2016 - Hadoop clusters
▐ ~20,000 machines running Hadoop
▐ Largest clusters are currently 2000 nodes
▐ Several petabytes of user data (compressed, unreplicated)
▐ Run hundreds of thousands of jobs every month
2016 - Big Data Server Farm
Conclusions
The Digital Age brings many opportunities but also challenges.
Big Data and analytics can help meet these challenges and realize the opportunities.
It is within anyone's grasp; adopt it incrementally and iteratively.
Hadoop cloud solutions are scalable, flexible, and cost-efficient, but sometimes limited in functionality (or not standardized).
Good data scientists are needed, in a team of mixed competences, to make the right choices.
QUESTIONS?
Q&A
Thanks