Apache Spark as Crossover Hit for Data Science
Sean Owen / Director of Data Science / Cloudera
Investigative vs. Operational Analytics

• Data Scientist: exploratory analytics
• Predictive data products: operational analytics

Tools of the Trade
Trade-offs of the Tools

            Investigative                       Operational
Data        Historical subset; sample           Production data; large-scale;
                                                shared cluster
Context     Workstation; ad hoc                 Continuous operation; online
            investigation; offline
Metrics     Accuracy                            Throughput, QPS
Library     Many, sophisticated                 Few, simple
Language    Scripting, high-level               Systems language
            (ease of development)               (performance)

R sits at the investigative end of this spectrum.
Python + scikit: likewise at the investigative end of the same trade-off chart.
MapReduce, Crunch, Mahout: at the operational end of the same trade-off chart.
Spark: Something For Everyone
• Now an Apache TLP; from UC Berkeley
• Scala-based: expressive, efficient, JVM-based
• Scala-like API: distributed works like local; like Crunch, it is Collection-like
• REPL: interactive
• Distributed and Hadoop-friendly: integrates with where the data already is
• ETL no longer a separate step; MLlib
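The "distributed works like local" point can be sketched with plain Scala collections. The RDD API deliberately mirrors these method names (map, filter, reduce), so this local example is only an analogy, not Spark code:

```scala
// Plain Scala collections, using the functional style Spark's RDD API mirrors.
// On an RDD (e.g. built with sc.parallelize(words)) the same calls would run
// distributed across the cluster instead of on one local List.
val words = List("apache", "spark", "scala", "hadoop")

val lengths = words.map(_.length)              // List(6, 5, 5, 6)
val sWords  = words.filter(_.startsWith("s"))  // List(spark, scala)
val total   = lengths.reduce(_ + _)            // 22

println(lengths)
println(sWords)
println(total)
```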
Spark spans the whole trade-off chart, from investigative to operational.

Stack Overflow Tag Recommender Demo
• Questions have tags like java or mysql
• Goal: recommend new tags to questions
• Available as a data dump: Posts.xml (Jan 20, 2014)
• 24.4GB; 2.1M questions; 9.3M tag occurrences (34K unique)
<row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="251" ViewCount="15207" Body="<p>I want to use a track-bar to change a form's opacity.</p>

<p>This is my code:</p>

<pre><code>decimal trans = trackBar1.Value / 5000;
this.Opacity = trans;
</code></pre>

<p>When I try to build it, I get this error:</p>

<blockquote>
 <p>Cannot implicitly convert type 'decimal' to 'double'.</p>
</blockquote>

<p>I tried making <strong>trans</strong> to <strong>double</strong>, but then the control doesn't work. This code has worked fine for me in VB.NET in the past.</p>
" OwnerUserId="8" LastEditorUserId="2648239" LastEditorDisplayName="Rich B" LastEditDate="2014-01-03T02:42:54.963" LastActivityDate="2014-01-03T02:42:54.963" Title="When setting a form's opacity should I use a decimal or double?" Tags="<c#><winforms><forms><type-conversion><opacity>" AnswerCount="13" CommentCount="25" FavoriteCount="23" CommunityOwnedDate="2012-10-31T16:42:47.213" />
• CDH 5.0.1; Spark 0.9.0
• Standalone mode; install libgfortran
• 1 master, 5 workers
• 24 cores, 64GB RAM
val postsXML = sc.textFile("hdfs:///user/srowen/SparkDemo/Posts.xml")
postsXML: org.apache.spark.rdd.RDD[String] = MappedRDD[13] at textFile at <console>:15

postsXML.count
...
res1: Long = 18066983
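Note that textFile returns immediately and only count actually runs a job: RDDs are lazy. Scala's own lazy views behave analogously, as this small non-Spark sketch shows:

```scala
// Lazy-evaluation analogue of RDDs: defining the view does no work;
// forcing it (like calling count on an RDD) does.
var evaluated = 0
val view = (1 to 5).view.map { x => evaluated += 1; x * 2 }

println(evaluated)         // 0: nothing computed yet
val n = view.toList.size   // forcing the view runs the map function
println(evaluated)         // 5: all elements were computed on demand
println(n)                 // 5
```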
<row Id="4" ... Tags="...c#...winforms..."/>

(4,"c#") (4,"winforms") ...

(4,3104,1.0) (4,2148819,1.0) ...
val postIDTags = postsXML.flatMap { line =>
  val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  // Tags appear HTML-escaped in the raw dump, e.g. &lt;c#&gt;&lt;winforms&gt;
  val tagRegex = "&lt;([^&]+)&gt;".r
  idTagRegex.findFirstMatchIn(line) match {
    case None => None
    case Some(m) => {
      val postID = m.group(1).toInt
      val tagsString = m.group(2)
      val tags = tagRegex.findAllMatchIn(tagsString).map(_.group(1)).toList
      if (tags.size >= 4) tags.map((postID,_)) else None
    }
  }
}
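The regex extraction inside that flatMap can be checked on a single line of plain Scala, no cluster required. The sample row below is a made-up fragment using the dump's HTML-escaped tag format:

```scala
// Same two regexes as the flatMap above, applied to one hypothetical row.
val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
val tagRegex = "&lt;([^&]+)&gt;".r  // tags are HTML-escaped in the raw XML

val line = """<row Id="4" Tags="&lt;c#&gt;&lt;winforms&gt;&lt;forms&gt;&lt;opacity&gt;" />"""

val pairs = idTagRegex.findFirstMatchIn(line) match {
  case Some(m) =>
    val postID = m.group(1).toInt
    tagRegex.findAllMatchIn(m.group(2)).map(_.group(1)).toList.map((postID, _))
  case None => Nil
}

println(pairs)  // List((4,c#), (4,winforms), (4,forms), (4,opacity))
```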
def nnHash(tag: String) = tag.hashCode & 0x7FFFFF
var tagHashes = postIDTags.map(_._2).distinct.map(tag => (nnHash(tag),tag))

import org.apache.spark.mllib.recommendation._
val alsInput = postIDTags.map(t => Rating(t._1, nnHash(t._2), 1.0))

val model = ALS.trainImplicit(alsInput, 40, 10)
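nnHash deserves a note: masking hashCode with 0x7FFFFF keeps only the low 23 bits, so every tag maps to a non-negative Int in [0, 8388607] — likely why the demo defines it, since Rating takes Int IDs. With roughly 8.4M slots and 34K unique tags, collisions are possible but unlikely. A quick plain-Scala check of those properties:

```scala
// nnHash as defined above: keep only the low 23 bits of hashCode,
// guaranteeing a non-negative ID even when hashCode itself is negative.
def nnHash(tag: String) = tag.hashCode & 0x7FFFFF

val tags = List("java", "mysql", "c#", "ruby-on-rails")
val ids = tags.map(nnHash)

val allInRange = ids.forall(id => id >= 0 && id <= 0x7FFFFF)
println(allInRange)  // true
```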
def recommend(questionID: Int, howMany: Int = 5): Array[(String, Double)] = {
  val predictions = model.predict(tagHashes.map(t => (questionID,t._1)))
  val topN = predictions.top(howMany)(Ordering.by[Rating,Double](_.rating))
  topN.map(r => (tagHashes.lookup(r.product)(0), r.rating))
}
recommend(7122697).foreach(println)
(sql,0.1666023080230586)
(database,0.14425980384610013)
(oracle,0.09742911781766687)
(ruby-on-rails,0.06623183702418671)
(sqlite,0.05568507618047555)
"I have a large table with a text field, and want to make queries to this table, to find records that contain a given substring, using ILIKE. It works perfectly on small tables, but in my case it is a rather time-consuming operation, and I need it work fast, because I use it in a live-search field in my website. Any ideas would be appreciated..."

Tags: postgresql, query-optimization, substring, text-search

stackoverflow.com/questions/7122697/how-to-make-substring-matching-query-work-fast-on-a-large-table
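The recommend function above ranks predictions with Ordering.by on the rating field. The same top-N pattern can be sketched on plain tuples, with hypothetical scores standing in for Rating objects:

```scala
// Top-N-by-score pattern, as recommend uses via predictions.top(n)(ordering).
val scored = List(("sql", 0.16), ("database", 0.14), ("oracle", 0.09),
                  ("scala", 0.02), ("sqlite", 0.05))

// Ordering.by extracts the sort key; reverse + take mirrors RDD.top
val byScore = Ordering.by[(String, Double), Double](_._2)
val top3 = scored.sorted(byScore.reverse).take(3).map(_._1)

println(top3)  // List(sql, database, oracle)
```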
blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
goo.gl/[email protected]