Spark introduction, RDD, Building and running Spark applications · 2018-04-17

Spark introduction, RDD, Building and running Spark applications

Lightning-fast cluster computing

[Timeline figure: 2007 Hive (along with Pig) · 2009 NoSQL · 2012 RDD concept paper published]

The beginning of Spark

• Originator: Matei Zaharia
  • Started in 2009 as a class project in UC Berkeley’s AMPLab
  • Need to do machine learning faster on HDFS
• Doctoral dissertation (2013)
  • http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf
• Hear Matei talking
  • https://www.youtube.com/watch?v=BFtQrfQ2rn0

IBM

• 2015.6
• https://www-03.ibm.com/press/us/en/pressrelease/47107.wss

2015.9

What is Spark?

• A general execution engine to improve/replace MapReduce
  • Spark’s operators are a strict superset of MapReduce

What’s wrong with the original MapReduce?

• Limitations of MapReduce
  • Originated around year 2000. Old technology.
  • Designed for batch-processing large amounts of webpages at Google
  • And it does that job very well!
• Not fit for
  • Complex, multi-pass algorithms
  • Interactive ad-hoc queries
  • Real-time stream processing

We are asking too much from MapReduce

The Spark way!

Easier to develop on Spark

•  Think of Assembly language

•  Python print “Hello world!”

Original MapReduce

Spark

Word count

• MapReduce:
  • https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v2.0
• Spark:
  • https://spark.apache.org/examples.html

Spark is not just in-memory processing -- it is faster on disk too!

A unified engine

Core Spark data abstraction

• Resilient Distributed Dataset (RDD)

RDDs

Creating RDD

RDD operations

RDD operations: Actions

RDD operations: Transformation

Example: map and filter
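A minimal sketch of what these two transformations do, in plain Python rather than on a cluster. The sample lines and the helper names `to_upper` and `starts_with_t` are made up for illustration; in PySpark the same steps would be `rdd.map(...)` and `rdd.filter(...)` on an RDD.

```python
# Plain-Python stand-in for an RDD of text lines (sample data is made up).
lines = ["the cat sat", "a dog barked", "the end"]

def to_upper(line):
    """Applied to every element, in the way rdd.map(to_upper) would be."""
    return line.upper()

def starts_with_t(line):
    """Decides which elements are kept, in the way rdd.filter(...) would."""
    return line.startswith("T")

# map: exactly one output element per input element
mapped = [to_upper(line) for line in lines]

# filter: keep only the elements for which the predicate is true
filtered = [line for line in mapped if starts_with_t(line)]

print(filtered)
```

`map` never changes the number of elements, while `filter` can only shrink it.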

Lazy execution
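Transformations only describe the work; nothing is computed until an action asks for a result. The sketch below imitates that behaviour with Python generators. It is an analogy under made-up data, not Spark's implementation, and `tracked_upper` and `log` are hypothetical names used to observe when work happens.

```python
log = []  # records each element at the moment it is actually processed

def tracked_upper(line):
    log.append(line)
    return line.upper()

lines = ["spark", "hadoop", "hive"]

# Building the pipeline does no work yet: generators are lazy,
# much like a chain of RDD transformations.
upper = (tracked_upper(line) for line in lines)
short = (line for line in upper if len(line) <= 5)
work_before_action = len(log)  # nothing has been processed so far

# Consuming the pipeline plays the role of an action such as collect().
result = list(short)
work_after_action = len(log)   # now every line has been processed

print(work_before_action, work_after_action, result)
```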

Chaining transformations

RDD lineage and toDebugString

Functional programming in Spark

Passing functions as parameters

Passing named functions

Anonymous functions

Creating RDDs from collections

Creating RDDs from files (1)

Creating RDDs from files (2)

Whole file-based RDDs (1)

Whole file-based RDDs (2)

Some other RDD operations

Example: flatMap and distinct
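A plain-Python sketch of the two operations, with made-up input lines. In PySpark the corresponding calls would be `rdd.flatMap(...)` and `rdd.distinct()`; the code here only mirrors their semantics.

```python
from itertools import chain

lines = ["to be or not", "to be"]

# flatMap: each input element yields zero or more outputs,
# and the results are flattened into a single dataset.
words = list(chain.from_iterable(line.split() for line in lines))

# distinct: drop duplicate elements. Spark gives no ordering guarantee here;
# sorting is only used to make this sketch's output stable.
unique_words = sorted(set(words))

print(words)
print(unique_words)
```

Unlike `map`, `flatMap` can change the number of elements: two lines become six words.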

Example: multi-RDD transformations

Some other RDD operations

Conclusion

Aggregating data with pair RDDs

Pair RDDs

Creating pair RDDs

Example: a simple pair RDD

Example: keying by user ID
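A rough sketch of keying records by user ID, in plain Python. The log lines and their layout (`<time> <user id> <page>`) are assumptions made for illustration, as is the helper `user_id`; in PySpark the pairing step would be `rdd.keyBy(user_id)`.

```python
# Made-up web log lines; assumed layout: "<time> <user id> <page>".
log_lines = [
    "10:01 u42 /index.html",
    "10:02 u7 /faq.html",
    "10:05 u42 /kb/1.html",
]

def user_id(line):
    """Extract the second whitespace-separated field as the key."""
    return line.split()[1]

# Pair every record with its key, giving (key, value) tuples: a pair RDD
# in Spark terms. The value here is the whole original line.
pairs = [(user_id(line), line) for line in log_lines]

for key, value in pairs:
    print(key, "->", value)
```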

Question 1: pairs with complex values

Answer 1: pairs with complex values

Question 2: mapping single rows to multiple pairs

Answer 2: mapping single rows to multiple pairs

Map-reduce

Map-reduce in Spark

Example: word-count (1)
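Word count is the usual map-reduce illustration: emit a `(word, 1)` pair per word, then sum the values for each key. The sketch below does this in plain Python with made-up input; `reduce_by_key` is a hypothetical helper that imitates what `reduceByKey` does, not Spark's own code. In PySpark the pipeline is typically written as `flatMap`, then `map`, then `reduceByKey`.

```python
lines = ["the cat sat on the mat", "the aardvark sat on the sofa"]

# Map phase: split lines into words and emit one (word, 1) pair per word.
pairs = [(word, 1) for line in lines for word in line.split()]

def reduce_by_key(pairs, func):
    """Merge all values that share a key using func (a two-argument function)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Reduce phase: add up the 1s for each word.
counts = reduce_by_key(pairs, lambda a, b: a + b)

for word in sorted(counts):
    print(word, counts[word])
```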

ReduceByKey (1)

ReduceByKey (2)

Other pair RDD operations

Example: pair RDD operations

Example: joining by key

Using join
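A join on pair RDDs matches elements by key and returns `(key, (left_value, right_value))` for every matching combination. The plain-Python sketch below mirrors that inner-join behaviour; `inner_join` is a hypothetical helper and the data is made up. In PySpark the call would be `left_rdd.join(right_rdd)`.

```python
def inner_join(left, right):
    """Inner join of two lists of (key, value) pairs on their keys."""
    by_key = {}
    for key, value in right:
        by_key.setdefault(key, []).append(value)
    # Keys that are missing on either side produce no output.
    return [(key, (lv, rv)) for key, lv in left for rv in by_key.get(key, [])]

# Made-up data: (movie id, title) and (movie id, year).
movies = [(1, "Casablanca"), (2, "Star Wars"), (3, "Annie Hall")]
years = [(1, 1942), (2, 1977), (4, 2012)]

joined = inner_join(movies, years)
print(joined)
```

Keys 3 and 4 each appear on only one side, so they are dropped from the result.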

Example: join web log with knowledge base article

Example output

Other pair operations

Writing and deploying Spark applications

The SparkContext

Python example: word-count

Building a Spark application

Running a Spark application

Running Spark applications locally

Running Spark applications on a cluster

Starting the shell locally or on a cluster
