cloudcamp-chicago
Provides distributed processing
Main unit of abstraction is the RDD
Can be used with cluster managers like Mesos or YARN
Supports Java, Python and Scala
https://spark.apache.org/
What is Spark?
Can be created from: files or HDFS, an in-memory iterable, Cassandra or SQL tables
Transformations: lazily create a new RDD from an existing one
Actions: usually return a value, force computation of the RDD
Resilient Distributed Dataset
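The difference between lazy transformations and computation-forcing actions can be illustrated without a cluster. The following is a minimal pure-Python sketch, not Spark's implementation: the class name LazyDataset and the log list are hypothetical and exist only to show when work actually happens.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not part of Spark)."""

    def __init__(self, source, log):
        self._source = source  # zero-argument callable returning an iterator
        self._log = log        # shared list recording when work is really done

    @classmethod
    def from_iterable(cls, items, log):
        items = list(items)
        return cls(lambda: iter(items), log)

    def map(self, fn):
        # Transformation: returns a new dataset; fn is not called yet.
        def source():
            for item in self._source():
                self._log.append(("map", item))
                yield fn(item)
        return LazyDataset(source, self._log)

    def filter(self, pred):
        # Transformation: also lazy, wraps the parent's iterator.
        def source():
            return (item for item in self._source() if pred(item))
        return LazyDataset(source, self._log)

    def collect(self):
        # Action: runs the whole chain and returns a value.
        return list(self._source())


log = []
numbers = LazyDataset.from_iterable([1, 2, 3, 4], log)
squares = numbers.map(lambda x: x * x).filter(lambda x: x > 4)
work_before_action = len(log)   # no map calls have run so far
result = squares.collect()      # the action triggers the computation
work_after_action = len(log)    # one log entry per mapped element
```

Building the chain records what to do; only collect() makes the mapped function run, which is the behaviour the slide describes for RDD actions.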
Sample Text
Spark Example
Spark Shell
Shell Example
Gists
#!/bin/python
import re
import string

# Matches any single punctuation character.
regex = re.compile('[%s]' % re.escape(string.punctuation))

def word_count(sc, in_file_name, out_file_name):
    # Clean each line, emit (word, 1) pairs, sum per word, write "word,count" lines.
    sc.textFile(in_file_name) \
        .map(lambda line: regex.sub(' ', line).strip().lower()) \
        .flatMap(lambda line: [(word, 1) for word in line.split()]) \
        .reduceByKey(lambda a, b: a + b) \
        .map(lambda pair: '%s,%s' % pair) \
        .saveAsTextFile(out_file_name)
Example: Word Count
#!/bin/python
import re
import string

# Matches any single punctuation character.
regex = re.compile('[%s]' % re.escape(string.punctuation))

def word_count(sc, in_file_name, out_file_name):
    # Same pipeline as before, with one small transformation per step.
    sc.textFile(in_file_name) \
        .map(lambda line: regex.sub(' ', line)) \
        .map(lambda line: line.strip()) \
        .map(lambda line: line.lower()) \
        .flatMap(lambda line: line.split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .map(lambda pair: '%s,%s' % pair) \
        .saveAsTextFile(out_file_name)
Example: Alternate Word Count
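To see what each stage of the pipeline produces, the same steps can be traced locally on a couple of lines. This is a plain-Python sketch with no Spark involved: the function name local_word_count and the sample input are made up for illustration, and the final sort is only there to make the output easy to read.

```python
import re
import string
from collections import defaultdict

regex = re.compile('[%s]' % re.escape(string.punctuation))

def local_word_count(lines):
    # map: replace punctuation with spaces, trim, lowercase
    cleaned = [regex.sub(' ', line).strip().lower() for line in lines]
    # flatMap + map: one (word, 1) pair per word
    pairs = [(word, 1) for line in cleaned for word in line.split()]
    # reduceByKey: sum the ones for each distinct word
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    # final map: format as "word,count" (sorted here for readability only)
    return ['%s,%s' % (word, count) for word, count in sorted(totals.items())]

sample = ["An arm, an ARM!", "All are able."]
counts = local_word_count(sample)
print(counts)
```

The list comprehension for pairs plays the role of flatMap, since each input line can yield any number of output records.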
$ pyspark
...
Using Python version 2.7.2 (default)
SparkContext available as sc.
>>> from word_count import word_count
>>> word_count(sc, 'text.txt', 'text_counts')
Running the Example
a,23
able,1
about,6
above,1
accept,1
accuse,1
ago,2
alarm,2
all,7
although,1
always,2
an,1
and,26
anger,1
another,1
any,2
anyone,1
arches,1
are,1
arm,1
armour,1
as,7
assistant,2
...
The Results From Spark
#!/bin/bash
text=$(cat ${1} | tr "[:punct:]" " " | \
    tr "[:upper:]" "[:lower:]")
parsed=(${text})
for w in ${parsed[@]}; do echo ${w}; done | sort | uniq -c
A (Bad) Shell Version
23 a
1 able
6 about
1 above
1 accept
1 accuse
2 ago
2 alarm
7 all
1 although
2 always
1 an
26 and
1 anger
1 another
2 any
1 anyone
1 arches
1 are
1 arm
1 armour
7 as
2 assistant
...
The Results From the Shell
Our Use Case
[Diagram: data flow labelled 3rd party, distinct(), join(), union(), distinct(), foreach(), 1st party]