10 Things About Spark


Description

A presentation prepared for Data Stack as part of their interview process on July 20. This presentation, in Ignite format, covers 10 items that you might not know about the Spark 1.0 release.


The 10 Apache Spark Features You (Likely) Haven’t Heard About

Roger Brinkley, Technical Evangelist

The 10 Apache Spark Features You (Likely) Haven’t Heard About

• 10 minutes – 10 slides
• Ignite format
• No stopping!
• No going back!
• Questions? Sure, but only if and until time remains on the slide (otherwise, save for later)
• Hire me, I’ll find 45 more

It’s Fast. Really Fast.

• 10–100x faster than MapReduce (a short caching sketch follows this list)
• 10–100x faster than Hive
• Historical perspective
  – JRuby: 2–3x faster with the JVM’s invokedynamic
  – Hardware rarely gets greater than 10x faster per year
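Much of that speedup comes from keeping working data in memory rather than re-reading it from disk. As a minimal sketch (the HDFS path, app name, and timing helper are illustrative, not from the slides), the effect shows up when the same action runs twice on a cached RDD:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup; the HDFS path is a placeholder.
val sc = new SparkContext(new SparkConf().setAppName("CacheTiming"))
val cached = sc.textFile("hdfs://...").cache()

// Small timing helper, not part of Spark.
def timeMs[T](body: => T): Long = {
  val start = System.nanoTime()
  body
  (System.nanoTime() - start) / 1000000
}

val cold = timeMs(cached.count()) // first pass reads from disk and fills the cache
val warm = timeMs(cached.count()) // subsequent passes read from memory
println("cold: " + cold + " ms, warm: " + warm + " ms")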

MapReduce Is Listed as the Most Recent “Most Important Software Innovation”

And Spark Blew the Lid Off of MapReduce

It’s Pure Open Source

• Commons-based peer production
  – Apache Software Foundation top-level project
  – 200 people from 50 organizations contributing
  – 12 organizations committing
  – Peer governance
  – Participative decision making

“The very essence of a free government consists in considering offices as public trusts, bestowed for the good of the country, and not for the benefit of an individual or a party.”

— John C. Calhoun, 2/13/1835

“The very essence of free software consists in considering contributing roles as public trusts, bestowed for the good of the community, and not for the benefit of an individual or a party.”

— John C. Calhoun, adapted for modern FOSS

Strong Enterprise Relationships

• Spark ships in every major Hadoop distribution
• Vertical enterprise use
  – Internet companies, government, financials
  – Churn analysis, fraud detection, risk analytics
• Used with other data stores
  – DataStax (Cassandra)
  – MongoDB
• Databricks offers a cloud-based implementation

Enhances Other Big Data Implementations

• Hadoop – replacement for MapReduce
• Cassandra – analytics (see the connector sketch below)
• Hive – faster SQL processing
• SAP HANA – faster interactive analysis
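To make the Cassandra line concrete, here is a minimal sketch using the DataStax spark-cassandra-connector. It assumes the connector is on the classpath; the host, keyspace, table, and column names are all hypothetical:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext

// Hypothetical connection settings, not from the slides.
val conf = new SparkConf()
  .setAppName("CassandraAnalytics")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Read a Cassandra table as an RDD and aggregate it with ordinary Spark operations.
// "sales" and "purchases" are hypothetical keyspace and table names.
val totalByUser = sc.cassandraTable("sales", "purchases")
  .map(row => (row.getString("user_id"), row.getDouble("amount")))
  .reduceByKey(_ + _)
totalByUser.take(10).foreach(println)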

API Stability

• Guaranteed stability of the core API for the 1.X line
• Spark has always been conservative with API changes
• Clearly defined annotations for evolving APIs (a sketch follows this list)
  – Experimental
  – Alpha
  – Developer
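For context, these markers are source-level annotations in Spark’s org.apache.spark.annotation package. A minimal sketch of how annotated declarations look; the class and methods here are hypothetical, written only to illustrate the markers:

import org.apache.spark.annotation.{AlphaComponent, DeveloperApi, Experimental}

// Hypothetical component, annotated the way Spark marks evolving APIs.
@AlphaComponent
class QueryEngine {

  // Stable: covered by the 1.X core-API compatibility guarantee.
  def run(query: String): Long = 0L

  // :: Experimental :: may change or be removed in a future release.
  @Experimental
  def runApproximate(query: String, timeoutMs: Long): Long = 0L

  // :: DeveloperApi :: aimed at developers building on Spark internals,
  // with a weaker stability guarantee than the public core API.
  @DeveloperApi
  def executionPlan(query: String): String = "plan"
}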

Don’t Need to Learn a New Language

• Scala
• Java – 25%
• Python – 30%
• And soon, R

Java 8 Lambda Support

JavaRDD<String> lines = sc.textFile("hdfs://log.txt");

// Map each line to multiple words
JavaRDD<String> words = lines.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String line) {
      return Arrays.asList(line.split(" "));
    }
  });

// Turn the words into (word, 1) pairs
JavaPairRDD<String, Integer> ones = words.mapToPair(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String w) {
      return new Tuple2<String, Integer>(w, 1);
    }
  });

// Group up and add the pairs by key to produce counts
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  });

counts.saveAsTextFile("hdfs://counts.txt");

// The same word count written with Java 8 lambdas
JavaRDD<String> lines = sc.textFile("hdfs://log.txt");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
JavaPairRDD<String, Integer> counts = words
  .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
  .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://counts.txt");

Real-Time Stream Processing

// Streaming: count hashtags arriving on a socket, in 10-second batches
val ssc = new StreamingContext(args(0), "NetworkHashCount", Seconds(10),
  System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))

val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()

// For comparison, a batch word count has the same shape
val file = sc.textFile("hdfs://.../pagecounts-*.gz")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://.../word-count")

Caching Interactive Algorithms

// Logistic regression: the cached point set is reused on every iteration
val points = sc.textFile("...").map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)

New Security Integration

• Complete integration with the Hadoop/YARN security model
  – Authenticate job submissions
  – Securely transfer HDFS credentials
  – Authenticate communication between components
• Other deployments are supported via configuration:

val conf = new SparkConf
conf.set("spark.authenticate", "true")
conf.set("spark.authenticate.secret", "good")

And Lots More

• Apache Spark website
• Databricks – making big data easy
  – Introduction to Apache Spark
    • Jul 28 – Austin, TX – More Info & Registration
    • Aug 25 – Chicago, IL – More Info & Registration
