EECS E6893 Big Data Analytics
HW3: Twitter data analysis with Spark Streaming

Tingyu Li
10/04/2019

Spark Streaming

https://spark.apache.org/docs/latest/streaming-programming-guide.html

DStream

● Represents a continuous stream of data
● Is a continuous series of RDDs
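To make the "series of RDDs" idea concrete, here is a plain-Python illustration (not Spark code) of how timestamped events could be cut into fixed batch intervals. The function name `discretize` and the sample events are made up for this sketch; in Spark each inner list would correspond to one RDD.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp_seconds, value) events into consecutive batches.

    Each returned list plays the role of one RDD in a DStream.
    This is an illustration only and does not use Spark.
    """
    buckets = defaultdict(list)
    for timestamp, value in events:
        # Integer index of the interval this event falls into.
        buckets[int(timestamp // batch_interval)].append(value)
    if not buckets:
        return []
    last = max(buckets)
    # Intervals with no events still produce an (empty) batch.
    return [buckets.get(i, []) for i in range(last + 1)]


if __name__ == "__main__":
    sample = [(0.5, "a"), (1.2, "b"), (6.0, "c"), (17.3, "d")]
    print(discretize(sample, batch_interval=5))
```

With a 5-second interval, the events at 0.5 s and 1.2 s land in the first batch, the event at 6.0 s in the second, and the event at 17.3 s in the fourth, leaving the third batch empty.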

Architecture

[Diagram. Components: Twitter API, Socket, Spark Streaming (Spark Context), Google Storage, BigQuery. Arrow labels: request / data (twice), put streaming data, read data, write data.]

Register on Twitter Apps

Socket

Uses TCP; the server must provide an IP address and port for the client to connect to.
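A minimal sketch of the socket idea, using only Python's standard library. The helper names `serve_lines` and `read_all` and the sample text are hypothetical; a real server would forward tweets from the Twitter API instead of a fixed list, and the client side would be Spark Streaming.

```python
import socket
import threading


def serve_lines(lines, host="localhost", port=0):
    """Start a TCP server that sends each line to the first client, then closes.

    Returns the bound (ip, port) address and the server thread.
    port=0 asks the OS for any free port; pass a fixed port if needed.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    address = server.getsockname()

    def handle():
        conn, _ = server.accept()  # blocks until a client connects
        with conn:
            for line in lines:
                # Newline-delimited text, one record per line.
                conn.sendall((line + "\n").encode("utf-8"))
        server.close()

    thread = threading.Thread(target=handle, daemon=True)
    thread.start()
    return address, thread


def read_all(address):
    """Connect as a client and read until the server closes the connection."""
    chunks = []
    with socket.create_connection(address) as client:
        while True:
            data = client.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8").splitlines()


if __name__ == "__main__":
    addr, worker = serve_lines(["first tweet #spark", "second tweet #bigdata"])
    print(read_all(addr))
    worker.join()
```

Sending one record per line suits a text-stream client, which splits incoming data on newlines.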

Spark Streaming

Create a local StreamingContext with two worker threads and a batch interval of 5 seconds.

Create a stream from a TCP socket at IP localhost and port 9001.
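These two steps might look roughly like the following in PySpark. This is a sketch under assumptions: the application name and the `create_stream` wrapper are illustrative, and the imports sit inside the function so the module loads even where `pyspark` is not installed.

```python
MASTER = "local[2]"          # run locally with two worker threads
BATCH_INTERVAL_SECONDS = 5   # one batch (one RDD) every 5 seconds
HOST, PORT = "localhost", 9001


def create_stream(app_name="TwitterStreamSketch"):
    """Return (ssc, lines) for a socket text stream. Requires pyspark.

    app_name is an arbitrary label chosen for this sketch.
    """
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(MASTER, app_name)
    ssc = StreamingContext(sc, BATCH_INTERVAL_SECONDS)
    # Each line of text received on the socket becomes one record.
    lines = ssc.socketTextStream(HOST, PORT)
    return ssc, lines
```

Two threads matter here because the socket receiver occupies one of them, leaving the other free to process the batches.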

Spark Streaming

Start streaming context

Stop after 120 seconds

Save results to BigQuery
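One way to express the start and timed stop, assuming `ssc` is a PySpark `StreamingContext`. The helper name `run_for` is hypothetical; other approaches, such as sleeping for 120 seconds before stopping, would work as well.

```python
def run_for(ssc, seconds=120):
    """Start the streaming context, wait up to `seconds`, then stop it."""
    ssc.start()
    try:
        # Returns early if the context terminates on its own.
        ssc.awaitTerminationOrTimeout(seconds)
    finally:
        # Stop streaming only; keeping the SparkContext alive allows
        # results to be saved afterwards.
        ssc.stop(stopSparkContext=False, stopGraceFully=True)
```

Stopping gracefully lets batches that were already received finish processing before shutdown.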

Task1: hashtagCount
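As a rough sketch of what a hashtag count could look like: the assignment's exact requirements are not stated here, so the lowercase normalization, the regular expression, and the choice of a running total via `updateStateByKey` are all assumptions. The function names are hypothetical.

```python
import re

HASHTAG = re.compile(r"#\w+")  # assumed definition of a hashtag


def extract_hashtags(text):
    """Return the lowercase hashtags found in one tweet's text."""
    return [tag.lower() for tag in HASHTAG.findall(text)]


def update_count(new_values, running_total):
    """State update in the shape updateStateByKey expects.

    new_values: counts seen in the current batch for one key.
    running_total: previous state, or None the first time a key appears.
    """
    return sum(new_values) + (running_total or 0)


def hashtag_count(lines):
    """Map a DStream of tweet text to a DStream of (hashtag, running count).

    Requires checkpointing to be enabled on the StreamingContext,
    e.g. ssc.checkpoint(<directory>).
    """
    return (lines.flatMap(extract_hashtags)
                 .map(lambda tag: (tag, 1))
                 .updateStateByKey(update_count))
```

The two helper functions are plain Python, so they can be checked on sample strings before being wired into a stream.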

Task2: wordCount

Task3: Save results

Create a dataset:

bq mk <Dataset name>

Replace with your own bucket and dataset name:
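A minimal sketch of how the bucket and dataset names might be used, assuming results are first written to Google Storage as newline-delimited JSON and then loaded with `bq load`. The placeholder values, the table and directory names, and both helper functions are hypothetical; the storage format is an assumption.

```python
BUCKET = "your-bucket-name"     # placeholder: replace with your bucket
DATASET = "your_dataset_name"   # placeholder: replace with your dataset


def output_uri(bucket, directory):
    """Google Storage URI of the directory holding saved results."""
    return f"gs://{bucket}/{directory}"


def bq_load_command(dataset, table, bucket, directory):
    """Build a `bq load` command for newline-delimited JSON part files.

    Assumes the results were saved as JSON lines under the given directory.
    """
    return [
        "bq", "load",
        "--source_format=NEWLINE_DELIMITED_JSON",
        "--autodetect",
        f"{dataset}.{table}",
        f"{output_uri(bucket, directory)}/part-*",
    ]


if __name__ == "__main__":
    # "hashtags" and "hashtag_output" are illustrative names only.
    print(" ".join(bq_load_command(DATASET, "hashtags", BUCKET, "hashtag_output")))
```

Returning the command as a list makes it straightforward to pass to `subprocess.check_call` without shell quoting issues.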

Task3: Save results