10 Things About Spark


Description

A presentation prepared for Data Stack as part of their interview process on July 20. This presentation, in Ignite format, covers 10 items that you might not know about the Spark 1.0 release.


The 10 Apache Spark Features You (Likely) Haven’t Heard About

Roger Brinkley, Technical Evangelist

The 10 Apache Spark Features You (Likely) Haven’t Heard About

• 10 minutes – 10 slides
• Ignite format
• No stopping!
• No going back!
• Questions? Sure, but only if and until time remains on the slide (otherwise, save for later)
• Hire me, I’ll find 45 more

It’s Fast. Really Fast.

• 10–100x faster than MapReduce (a short caching sketch follows this list)
• 10–100x faster than Hive
• Historical perspective
  – JRuby: 2–3x faster with the JVM’s invokedynamic
  – Hardware rarely gets greater than 10x faster per year
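Much of that speedup comes from keeping working data in memory rather than re-reading it from disk. As a minimal sketch (the HDFS path, app name, and timing helper are illustrative, not from the slides), the effect shows up when the same action runs twice on a cached RDD:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup; the HDFS path is a placeholder.
val sc = new SparkContext(new SparkConf().setAppName("CacheTiming"))
val cached = sc.textFile("hdfs://...").cache()

// Small timing helper, not part of Spark.
def timeMs[T](body: => T): Long = {
  val start = System.nanoTime()
  body
  (System.nanoTime() - start) / 1000000
}

val cold = timeMs(cached.count()) // first pass reads from disk and fills the cache
val warm = timeMs(cached.count()) // subsequent passes read from memory
println("cold: " + cold + " ms, warm: " + warm + " ms")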

MapReduce Is Listed as the Most Recent “Most Important Software Innovation”

And Spark Blew the Lid Off of MapReduce

It’s Pure Open Source

• Commons-based peer production
  – Apache Software Foundation top-level project
  – 200 people from 50 organizations contributing
  – 12 organizations committing
  – Peer governance
  – Participative decision making

“The very essence of a free government consists in considering offices as public trusts, bestowed for the good of the country, and not for the benefit of an individual or a party.”

— John C. Calhoun, 2/13/1835

“The very essence of free software consists in considering contributing roles as public trusts, bestowed for the good of the community, and not for the benefit of an individual or a party.”

— John C. Calhoun, adapted for modern FOSS

Strong Enterprise Relationships

• Spark ships in every major Hadoop distribution
• Vertical enterprise use
  – Internet companies, government, financials
  – Churn analysis, fraud detection, risk analytics
• Used with other data stores
  – DataStax (Cassandra)
  – MongoDB
• Databricks offers a cloud-based implementation

Enhances Other Big Data Implementations

• Hadoop – replacement for MapReduce
• Cassandra – analytics (see the connector sketch below)
• Hive – faster SQL processing
• SAP HANA – faster interactive analysis
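To make the Cassandra line concrete, here is a minimal sketch using the DataStax spark-cassandra-connector. It assumes the connector is on the classpath; the host, keyspace, table, and column names are all hypothetical:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext

// Hypothetical connection settings, not from the slides.
val conf = new SparkConf()
  .setAppName("CassandraAnalytics")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Read a Cassandra table as an RDD and aggregate it with ordinary Spark operations.
// "sales" and "purchases" are hypothetical keyspace and table names.
val totalByUser = sc.cassandraTable("sales", "purchases")
  .map(row => (row.getString("user_id"), row.getDouble("amount")))
  .reduceByKey(_ + _)
totalByUser.take(10).foreach(println)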

API Stability

• Guaranteed stability of the core API for the 1.X line
• Spark has always been conservative with API changes
• Clearly defined annotations for evolving APIs (a sketch follows this list)
  – Experimental
  – Alpha
  – Developer
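For context, these markers are source-level annotations in Spark’s org.apache.spark.annotation package. A minimal sketch of how annotated declarations look; the class and methods here are hypothetical, written only to illustrate the markers:

import org.apache.spark.annotation.{AlphaComponent, DeveloperApi, Experimental}

// Hypothetical component, annotated the way Spark marks evolving APIs.
@AlphaComponent
class QueryEngine {

  // Stable: covered by the 1.X core-API compatibility guarantee.
  def run(query: String): Long = 0L

  // :: Experimental :: may change or be removed in a future release.
  @Experimental
  def runApproximate(query: String, timeoutMs: Long): Long = 0L

  // :: DeveloperApi :: aimed at developers building on Spark internals,
  // with a weaker stability guarantee than the public core API.
  @DeveloperApi
  def executionPlan(query: String): String = "plan"
}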

Don’t Need to Learn a New Language

• Scala
• Java – 25%
• Python – 30%
• And soon, R

Java 8 Lambda Support

JavaRDD<String> lines = sc.textFile("hdfs://log.txt");

// Map each line to multiple words
JavaRDD<String> words = lines.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String line) {
      return Arrays.asList(line.split(" "));
    }
  });

// Turn the words into (word, 1) pairs
JavaPairRDD<String, Integer> ones = words.mapToPair(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String w) {
      return new Tuple2<String, Integer>(w, 1);
    }
  });

// Group up and add the pairs by key to produce counts
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  });

counts.saveAsTextFile("hdfs://counts.txt");

// The same word count written with Java 8 lambdas
JavaRDD<String> lines = sc.textFile("hdfs://log.txt");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
JavaPairRDD<String, Integer> counts = words
  .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
  .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://counts.txt");

Real-Time Stream Processing

// Streaming: count hashtags arriving on a socket, in 10-second batches
val ssc = new StreamingContext(args(0), "NetworkHashCount", Seconds(10),
  System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))

val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()

// For comparison, a batch word count has the same shape
val file = sc.textFile("hdfs://.../pagecounts-*.gz")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://.../word-count")

Caching Interactive Algorithms

// Logistic regression: the cached point set is reused on every iteration
val points = sc.textFile("...").map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)

New Security Integration

• Complete integration with the Hadoop/YARN security model
  – Authenticate job submissions
  – Securely transfer HDFS credentials
  – Authenticate communication between components
• Other deployments are supported via configuration:

val conf = new SparkConf
conf.set("spark.authenticate", "true")
conf.set("spark.authenticate.secret", "good")

And Lots More

• Apache Spark website
• Databricks – making big data easy
  – Introduction to Apache Spark
    • Jul 28 – Austin, TX – More Info & Registration
    • Aug 25 – Chicago, IL – More Info & Registration
