© Cloudera, Inc. All rights reserved.
Big data introduction workshop
Updated October 24, 2018
http://sli.do/bigdataworkshop
AGENDA
Introduction
What is big data / distributed computing
Map-Reduce concept
Map-Reduce Java workshop
Hadoop history & additional components
Q&A
About me
Zoltan Siegl
Software [email protected]
ASSESS CUSTOMER BEHAVIOUR
• Do you think you get that pink lollipop at random when playing Candy Crush?
• Well… think again...
BEHAVIOUR
DETECTING INTERACTIVE ADVERSE DRUG EFFECTS
• No clinical trials for cross effects
• 91 additional ADEs (adverse drug effects) revealed
• 200 000 man-years of clinical trials did not reveal these
HEALTHCARE
DIGITAL DEFENDERS OF CHILDREN
• Uses big data against human trafficking
• In 1 year:
  • Prevented over 860 cases
  • Helped identify over 300 victims
  • Of which 50 were children
PROTECT
WHAT DO YOU MEAN BIG DATA
The amount of DATA possessed and processed

It is 300 000 000 000 000 000 bytes
• 600 TB processed per day
• 1 billion users / month
• 2.7 billion likes / day
• 300 million photos uploaded / day

NSA
5 × 10^18 bytes (that is pronounced "exabytes")
• 30 PB processed per day
• 1.6% of internet traffic touched / day
• Web searches, websites visited, phone calls, Skype calls, credit card transactions, etc.
* This is a rough estimate. For legal purposes I have to state that I do not have possession of the data they store. You hear that, NSA, right? :)

• 100 PB processed per day
• 60 trillion pages indexed
• >1 billion unique search users / month
• 2.3 billion searches / second
Reference: 2016 Janani Ravi - Building blocks of Hadoop course on Pluralsight
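To get a feel for these magnitudes, the raw byte counts can be converted into named units. The sketch below is only an illustration: it assumes decimal (SI) units, and the helper name `to_unit` is made up for this example.

```python
# Decimal (SI) unit sizes in bytes; binary units (PiB, EiB) would differ slightly.
UNITS = {"TB": 10**12, "PB": 10**15, "EB": 10**18}


def to_unit(num_bytes: int, unit: str) -> float:
    """Convert a raw byte count into the given decimal unit."""
    return num_bytes / UNITS[unit]


stored_first = 300_000_000_000_000_000  # first figure on the slide
stored_nsa = 5 * 10**18                 # NSA estimate on the slide

print(to_unit(stored_first, "PB"))  # 300.0
print(to_unit(stored_nsa, "EB"))    # 5.0

# Days needed to pass over the first store once at 600 TB processed per day.
days_for_full_pass = to_unit(stored_first, "TB") / 600
print(days_for_full_pass)           # 500.0
```

In other words, the first figure is 300 PB, and at 600 TB per day a single full pass over it would take roughly 500 days.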
NO SINGLE COMPUTER IS BIG ENOUGH
• Not even in the 90’s
• Supercomputers are superexpensive
• Amount of data grows. Computers don’t.
REQUIREMENTS FOR A BIG DATA SYSTEM
What expectations do we have
• STORAGE
• SCALABILITY
• COMPUTATION
DISTRIBUTED COMPUTING
Shard data, and shard computing capacity
STORAGE + COMPUTATION
STORAGE + COMPUTATION
STORAGE + COMPUTATION
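The idea of sharding both the data and the computation can be sketched on a single machine. This is a toy illustration, not how a real cluster works: threads stand in for nodes, and the names `shard` and `local_sum` as well as the round-robin split are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(items, num_shards):
    """Split items round-robin into num_shards lists (one per 'node')."""
    shards = [[] for _ in range(num_shards)]
    for index, item in enumerate(items):
        shards[index % num_shards].append(item)
    return shards


def local_sum(shard_items):
    """The computation each 'node' runs on the data it stores."""
    return sum(shard_items)


data = list(range(1, 101))
shards = shard(data, 3)

# Each worker only touches its own shard; threads stand in for nodes here.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(local_sum, shards))

# Combining the small partial results is cheap compared to moving the data.
total = sum(partials)
print(total)  # 5050
```

The point of the design is that the computation travels to where the data is stored, and only the small partial results are moved and combined.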
MAPPER AND REDUCER
Logical view
Mapper
Map(key1, value1) → list(key2, value2)
Reducer
Reduce(key2, list(value2)) → list(value3)
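The two signatures can be tied together with a tiny in-memory driver. This is a minimal sketch under stated assumptions, not Hadoop's API: `run_map_reduce`, `parity_mapper` and `sum_reducer` are hypothetical names, and the grouping step in the middle (usually called shuffle) is done with a plain dictionary.

```python
from collections import defaultdict


def run_map_reduce(inputs, mapper, reducer):
    """Apply mapper to every (key1, value1), group by key2, then reduce.

    mapper(key1, value1)         -> iterable of (key2, value2)
    reducer(key2, list(value2))  -> list(value3)
    """
    groups = defaultdict(list)
    for key1, value1 in inputs:
        for key2, value2 in mapper(key1, value1):
            groups[key2].append(value2)  # the "shuffle": group values by key2
    return {key2: reducer(key2, values) for key2, values in groups.items()}


def parity_mapper(shard_id, numbers):
    """Emit each number under the key 'even' or 'odd'."""
    return [("even" if n % 2 == 0 else "odd", n) for n in numbers]


def sum_reducer(key, values):
    """Collapse all values of one key into a single-element list."""
    return [sum(values)]


inputs = [("shard1", [1, 2, 3]), ("shard2", [4, 5])]
result = run_map_reduce(inputs, parity_mapper, sum_reducer)
print(result)  # {'odd': [9], 'even': [6]}
```

Note that the reducer sees all values for a key regardless of which shard produced them; that grouping is what the framework provides between the two phases.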
BIG-O COMPLEXITY OF MAP-REDUCE
This is completely unreasonable
O(n log n * s * (1/p)) where:
- n is the number of items
- s is the number of nodes
- p is the ping time between nodes (assuming equal ping times between all nodes in the network)
MAPPER AND REDUCER
Logical view
Mapper
Map("shard1", "to be or not to be") → <"to", 1>, <"be", 1>, <"or", 1>,
<"not", 1>, <"to", 1>, <"be", 1>
Reducer
Reduce("to", <1,1,1>) → <"to", 3>
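The word-count example can be reproduced in a few lines. In this sketch the first shard is the one from the slide; the second shard ("to sleep") is a made-up assumption, added only so that the reducer for "to" receives three ones as shown above. The function names are hypothetical as well.

```python
from collections import defaultdict


def map_words(shard_id, text):
    """Mapper: emit <word, 1> for every word in the shard."""
    return [(word, 1) for word in text.split()]


def reduce_counts(word, ones):
    """Reducer: add up all the ones collected for a single word."""
    return (word, sum(ones))


shards = {
    "shard1": "to be or not to be",  # from the slide
    "shard2": "to sleep",            # hypothetical second shard
}

# Shuffle: gather the values of each word across all shards.
shuffled = defaultdict(list)
for shard_id, text in shards.items():
    for word, one in map_words(shard_id, text):
        shuffled[word].append(one)

counts = dict(reduce_counts(word, ones) for word, ones in shuffled.items())

print(shuffled["to"])  # [1, 1, 1]
print(counts["to"])    # 3
```

The mapper never sees more than its own shard, and the reducer never sees more than one word, which is why both phases can run on many nodes at once.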
[Figure: MAPPER AND REDUCER, logical view]
LET’S DO SOME CODING
HISTORY OF HADOOP
Started from Google
• HDFS
• SCALABILITY
• COMPUTATION
HISTORY OF HADOOP
Hadoop 2
• HDFS
• SCALABILITY
• COMPUTATION
HADOOP ECOSYSTEM
Other components
CLOUDERA MANAGER
CLOUDERA INTERNSHIP PROGRAM
http://bit.do/clouderaintern
You are
❏ A Bachelor's, Master's or PhD full-time student at a Hungarian university where you have learnt programming
❏ Based in Hungary and currently enrolled at a Hungarian university
❏ Finishing your studies soon: either in summer 2019 or in winter 2020
❏ Able to work 20 hours / week in the active semester (schedule is flexible)
❏ Able to start the internship in early February 2019
You have
❏ Good verbal and written English skills
❏ The urge to learn about system software and algorithms, software quality assurance or tooling
THANK YOU