
Hands-on 3: Stream processing with Spark

1. Objective
The objective of this hands-on is to let you “touch” the challenges implied in processing streams. In class, we will use Spark for implementing a streaming version of word count and an example using:

• A TCP server.
• Twitter streaming.

2. Material
- Download https://github.com/javieraespinosa/dxlab-spark

3. Getting started with Spark Streaming
Spark Streaming is an extension of the core Spark API that enables stream processing of live data streams. Data can be harvested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to file systems, databases, and live dashboards. Internally, Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
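As an illustration of this model, the minimal sketch below shows the general shape of a streaming program: create a StreamingContext, define an input DStream and its transformations, declare an output operation, then start the computation. It assumes the pyspark shell (where the SparkContext sc is already defined); the hostname and port are placeholders, and Section 5 develops this skeleton into a complete word count.

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)                    # 1-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)  # input DStream created from a source
lines.count().pprint()                           # a transformation plus an output operation
ssc.start()                                      # start receiving and processing batches
ssc.awaitTermination()                           # run until the computation is stopped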

Spark Streaming is based on the notion of discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs. This guide shows you how to start writing Spark Streaming programs with DStreams. You can write Spark Streaming programs in Scala, Java or Python (see the full Spark guide for details: https://spark.apache.org/docs/latest/streaming-programming-guide.html). You will find tabs throughout that guide that let you choose between code snippets in different languages.

4. Basic concepts: DStreams
Discretized Stream (DStream) is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval, as shown in the following figure.


Any operation applied on a DStream translates to operations on the underlying RDDs. These underlying RDD transformations are computed by the Spark engine. The DStream operations hide most of these details and provide the developer with a higher-level API for convenience.

4.1.1 Input DStreams and Receivers
Input DStreams are DStreams representing the stream of input data received from streaming sources. Every input DStream (except file stream) is associated with a Receiver object which receives the data from a source and stores it in Spark’s memory for processing. Spark Streaming provides two categories of built-in streaming sources.

• Basic sources: directly available in the StreamingContext API. Examples: file systems, socket connections, etc. (a short sketch follows this list).

• Advanced sources: sources like Kafka, Flume, Kinesis, Twitter, etc. are available through extra utility classes. These require linking against extra dependencies (see the Twitter exercise next).
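For illustration, here is a minimal sketch of creating input DStreams from two basic sources directly through the StreamingContext API (again assuming the pyspark shell where sc is already defined; the host, port and directory are placeholders):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)                           # 1-second batches
socket_lines = ssc.socketTextStream("localhost", 9999)  # socket source: uses a receiver
file_lines = ssc.textFileStream("/data/incoming")       # file source: new files in a directory, no receiver needed

When running locally, make sure the master URL allocates at least two threads (e.g. local[2]), so that one thread can run the socket receiver while another processes the data; this point is developed in the next paragraph.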

Note that, if you want to receive multiple streams of data in parallel in your streaming application, you can create multiple input DStreams. This will create multiple receivers which will simultaneously receive multiple data streams. Yet, note that a Spark worker/executor is a long-running task, hence it occupies one of the cores allocated to the Spark Streaming application. Therefore, it is important to remember that a Spark Streaming application needs to be allocated enough cores (or threads, if running locally) to process the received data, as well as to run the receiver(s).

4.2 Counting words example
Let us take a quick look at what a simple Spark Streaming program looks like. Let us say we want to count the number of words in text data received from a data server. Note that for converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream, as shown in the following figure.

We will develop two versions, the first one listening from a TCP socket and the other one listening from Twitter. For this exercise, we will work in Python. Note that the full Spark guide has examples using Scala and Java.


5. Counting words produced by a TCP socket

Terminal 1. Netcat server
Go to your dxlab-spark-master folder.
$ docker-compose build
Run the TCP server.
$ docker-compose run netcat nc -l -p 9999

Terminal 2. Spark interpreter for Python (pyspark)
$ docker-compose run pyspark

First, we import StreamingContext, which is the main entry point for all streaming functionality. We create a local StreamingContext with two execution threads and a batch interval of 1 second (in the pyspark shell, the SparkContext sc is already available).

from pyspark import SparkContext

from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch interval of 1 second

ssc = StreamingContext(sc, 1)

Using this context, we can create a DStream that represents streaming data from a TCP source, specified as a hostname and port. Here we use the address of the netcat server started in Terminal 1 (11.0.0.10) and port 9999.

# Create a DStream that will connect to hostname:port, like localhost:9999

lines = ssc.socketTextStream("11.0.0.10", 9999)

This lines DStream represents the stream of data that will be received from the data server. Each record in this DStream is a line of text. Next, we want to split the lines by space into words.

# Split each line into words

words = lines.flatMap(lambda line: line.split(" "))

flatMap is a one-to-many DStream operation that creates a new DStream by generating multiple new records from each record in the source DStream. In this case, each line will be split into multiple words and the stream of words is represented as the words DStream. Next, we want to count these words.

# Count each word in each batch

pairs = words.map(lambda word: (word, 1))

wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console

wordCounts.pprint()

ssc.start() # Start the computation


ssc.awaitTermination() # Wait for the computation to terminate

The words DStream is further mapped (one-to-one transformation) to a DStream of (word, 1) pairs, which is then reduced to get the frequency of words in each batch of data. Finally, wordCounts.pprint() will print a few of the counts generated every second.

Test your exercise by typing words on the server side (the netcat terminal). What happens on the client side?

Look for a representative set of texts that can let you see the word count in action.
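For instance (an illustrative run; the exact timestamp and how the input is split into batches depend on when you type), entering the following line in the netcat terminal within one batch interval:

hello world hello

should make the pyspark terminal print something similar to:

-------------------------------------------
Time: <batch time>
-------------------------------------------
('hello', 2)
('world', 1)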

6. Counting words from Twitter posts
This exercise will receive and process Twitter’s real sample tweet stream. This section first introduces the basic system setup of the streaming program, and then guides you through the steps necessary to create the Twitter authentication tokens needed for processing Twitter’s real-time sample stream.

6.1 Twitter credentials setup
Our hands-on is based on Twitter’s sample tweet stream, so we need to configure authentication with a Twitter account. To do this, you need to set up a consumer key+secret pair and an access token+secret pair using a Twitter account.

6.2 Creating temporary Twitter access keys
Follow the instructions below to set up these temporary access keys with your Twitter account. These instructions will not require you to provide your Twitter username/password. You will only be required to provide the consumer key and access token pairs that you will generate, which you can easily destroy once you have finished the tutorial. So, your Twitter account will not be compromised in any way.
Open Twitter’s Application Settings page1. This page lists the set of Twitter-based applications that you own and have already created consumer keys and access tokens for. This list will be empty if you have never created any applications. For this tutorial, create a new temporary application. To do this, click on the blue “Create a new application” button. The new application page should look like the page shown below. Provide the required fields:
• The Name of the application must be globally unique, so using your Twitter username as a prefix to the name should ensure that. For example, set it as [your-twitter-handle]-test.
• For the Description, anything longer than 10 characters is fine.
• For the Website, similarly, any website is fine, but ensure that it is a fully-formed URL with the prefix http://.
Then, click on the “Yes, I agree” checkbox below the Developer Rules of the Road. Finally, fill in the CAPTCHA and click on the blue “Create your Twitter application” button.

1 https://apps.twitter.com


Confirmation
Once you have created the application, you will be presented with a confirmation page like the one shown below. Click on the API Key tab.

Application settings
You should be able to see the API key and the API secret that have been generated. To generate the access token and the access token secret, click on the “Create my access token” button at the bottom of the page (marked in red in the figure below). Note that there will be a small green confirmation at the top of the page saying that the token has been generated.


Final result of the setup process
Finally, the page should look like the following. Notice the API Key, API Secret, Access Token and Access Token Secret. We are going to use these 4 keys in the next section.


6.3 Running the exercise

- Open the tweets-tcp.py file in the dxlab-spark-master/python folder and replace the credentials specified there with your Twitter credentials.

Terminal 3. Prepare the tweets producer

- $ docker-compose run tweets

Terminal 4. Start receiving tweets in Spark

- $ docker-compose run pyspark

Once inside pyspark, copy/paste the code in python/spark-streaming.py.
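Before opening the file, it may help to picture what such a consumer typically looks like. The sketch below is a hypothetical version, not the contents of python/spark-streaming.py: the address 11.0.0.10 and port 9999 are assumed to match the tweets producer, and the batch interval is an arbitrary choice. It reads tweet text line by line and counts hashtags per batch.

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                               # 10-second batches (assumption)
tweets = ssc.socketTextStream("11.0.0.10", 9999)             # lines of tweet text sent by the producer
words = tweets.flatMap(lambda line: line.split(" "))         # split each tweet into words
hashtags = words.filter(lambda word: word.startswith("#"))   # keep only hashtags
counts = hashtags.map(lambda tag: (tag, 1)).reduceByKey(lambda x, y: x + y)
counts.pprint()                                              # print hashtag counts for each batch
ssc.start()
ssc.awaitTermination()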

Note that your current solution monitors the hashtag #Barcelona, which is hardcoded in the tweets-tcp.py file. You can modify it if you want to listen to another hashtag.
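For reference, one common way to write such a producer is with the tweepy library: subscribe to Twitter’s streaming API filtered on a track term, and forward each tweet’s text over a TCP socket for Spark to consume. The sketch below is hypothetical and is not the actual tweets-tcp.py from the repository; the credentials are placeholders, the host/port are assumptions, and it uses the tweepy 3.x style API.

import json
import socket
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

# Placeholders: replace with the keys generated in section 6.2
auth = OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

class TweetForwarder(StreamListener):
    def __init__(self, client):
        super(TweetForwarder, self).__init__()
        self.client = client
    def on_data(self, data):
        # Extract the tweet text and send it as a single line over the socket
        text = json.loads(data).get("text", "").replace("\n", " ")
        self.client.sendall((text + "\n").encode("utf-8"))
        return True

# Accept one TCP client (the pyspark consumer), then start streaming
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("0.0.0.0", 9999))
server.listen(1)
client, _ = server.accept()

stream = Stream(auth, TweetForwarder(client))
stream.filter(track=["#Barcelona"])   # the hardcoded hashtag you can change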
