Online Tweet Sentiment Analysis with Apache Spark

Davide Nardone, 0120/131

Parthenope University

1. Introduction
2. Bag-of-words
3. Spark Streaming
4. Apache Kafka
5. DataFrame and SQL operations
6. Machine Learning library (MLlib)
7. Apache Zeppelin
8. Implementation and results

Summary

Sentiment Analysis (SA) refers to the use of Natural Language Processing (NLP) and Text Analysis to extract, identify or otherwise characterize the sentiment content of a text unit.

Introduction

Example sentences and their sentiment labels:
Positive: "The main dish was delicious"
Neutral: "It is an dish"
Negative: "The main dish was salty and horrible"

Existing SA approaches can be grouped into three main categories:
1. Knowledge-based techniques;
2. Statistical methods;
3. Hybrid approaches.

Statistical methods take advantage of Machine Learning (ML) techniques such as Latent Semantic Analysis (LSA), Multinomial Naïve Bayes (MNB) and Support Vector Machines (SVM).

Introduction (cont.)

The bag-of-words model is a simplifying representation used in NLP and Information Retrieval (IR).

In this model, a text is represented as the bag (multiset) of its words, ignoring grammar and even word order but keeping multiplicity.

The bag-of-words model is commonly used in document classification methods, where the frequency of occurrence of each word (term frequency, TF) is used as a feature for training a classifier.
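
As a minimal sketch of the idea, the snippet below builds a bag of words with Python's standard `collections.Counter`. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for illustration; they are not taken from the original implementation.

```python
from collections import Counter

def bag_of_words(text):
    """Return a multiset (word -> count) for a text, ignoring word order."""
    # Lowercase and split on whitespace; a real tokenizer would also handle punctuation.
    return Counter(text.lower().split())

a = bag_of_words("the main dish was delicious")
b = bag_of_words("delicious was the main dish")
c = bag_of_words("the dish the chef made was the main dish")

print(a == b)      # word order is ignored
print(c["the"])    # multiplicity is kept
```

The two sentences with the same words in a different order map to the same bag, while repeated words keep their counts, which is what makes the counts usable as term-frequency features.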

Bag-of-words

1. Tokenizing;
2. Stopping (stop-word removal);
3. Stemming;
4. Computation of TF (term frequency) and IDF (inverse document frequency);
5. Using a machine learning classifier for the tweet classification (e.g., Naïve Bayes, Support Vector Machine, etc.)
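
Steps 1 to 4 above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the pipeline used in this work: the stop-word list is a tiny hypothetical one, `stem` is a crude suffix stripper standing in for a real stemmer such as Porter's, and the IDF uses the smoothed form log((N + 1) / (df + 1)).

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list; real systems use a much larger one.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "it", "of", "to"}

def tokenize(text):
    """Step 1: split a tweet into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Step 2: drop very common words."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Step 3: crude suffix stripping (a stand-in for a real stemmer)."""
    if token.endswith(("ss", "us")):
        return token
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(t) for t in remove_stop_words(tokenize(text))]

def tf_idf(docs):
    """Step 4: TF-IDF per document, with IDF = log((N + 1) / (df + 1))."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((n + 1) / (count + 1)) for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()} for doc in docs]

tweets = [
    "The main dish was delicious",
    "The main dish was salty and horrible",
    "Loved the dishes and the desserts",
]
docs = [preprocess(t) for t in tweets]
vectors = tf_idf(docs)
```

With these three tweets, the stem "dish" occurs in every document and therefore receives a weight of zero, while rarer terms such as "delicious" receive a higher weight than "main". The resulting vectors are what a classifier in step 5 would be trained on.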

Bag-of-words (cont.)

Spark Streaming is an extension of the core Spark API.

Data can be ingested from many sources like Kafka, etc.

Processed data can be pushed out to filesystems, databases, etc.

Furthermore, it’s possible to apply Spark’s machine learning algorithms on data streams.

Spark Streaming

Spark Streaming receives live input data streams and divides the data into batches.

Spark Streaming provides a high-level abstraction called a Discretized Stream (DStream), which represents a continuous stream of data.

DStreams can be created either from input data streams from sources such as Kafka and Flume, or by applying high-level operations on other DStreams.
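
The idea of dividing a live stream into batches can be illustrated without Spark. The sketch below is a conceptual model only: `discretize` is a hypothetical helper, not part of the Spark Streaming API, and it assumes records arrive ordered by timestamp.

```python
from itertools import groupby

def discretize(records, batch_interval):
    """Group (timestamp, value) records into consecutive batches of fixed duration.

    Assumes records arrive ordered by timestamp. Intervals with no records
    produce no batch here, which is a simplification.
    """
    for batch_index, group in groupby(records, key=lambda r: int(r[0] // batch_interval)):
        yield batch_index, [value for _, value in group]

stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = list(discretize(stream, batch_interval=1.0))

# A stateless per-batch transformation, similar in spirit to a map over each batch.
upper = [(i, [v.upper() for v in batch]) for i, batch in batches]
```

Each batch can then be processed independently with ordinary batch operations, which is the property that lets a streaming job reuse the same transformations as a batch job.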

Spark Streaming (cont.)

Kafka is a distributed streaming platform that behaves like a partitioned, replicated commit log service.

It provides the functionality of a messaging system.

Kafka is run as a cluster on one or more servers. The Kafka cluster stores streams of records in categories called topics.

Apache Kafka

Two of Kafka's four core APIs are:
1. The Producer API, which allows an application to publish a stream of records to one or more Kafka topics;
2. The Consumer API, which allows an application to subscribe to one or more topics and process the stream of records produced to them.

Apache Kafka (cont.)

So, at a high level, producers send messages over the network to the Kafka cluster, which in turn serves them up to consumers.
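
The producer, cluster and consumer roles can be mimicked with a toy in-memory model. The classes `ToyCluster`, `ToyProducer` and `ToyConsumer` below are hypothetical names invented for this sketch; they are not the Kafka client API, and they ignore partitioning, replication and networking.

```python
from collections import defaultdict

class ToyCluster:
    """In-memory stand-in for a Kafka cluster: one append-only log per topic."""

    def __init__(self):
        self.topics = defaultdict(list)

    def append(self, topic, record):
        self.topics[topic].append(record)
        return len(self.topics[topic]) - 1  # offset of the new record

    def read(self, topic, offset):
        return self.topics[topic][offset:]

class ToyProducer:
    def __init__(self, cluster):
        self.cluster = cluster

    def send(self, topic, record):
        return self.cluster.append(topic, record)

class ToyConsumer:
    def __init__(self, cluster, topics):
        self.cluster = cluster
        self.offsets = {topic: 0 for topic in topics}  # next offset to read per topic

    def poll(self):
        """Return all records published since the previous poll."""
        records = []
        for topic, offset in list(self.offsets.items()):
            new = self.cluster.read(topic, offset)
            records.extend((topic, r) for r in new)
            self.offsets[topic] = offset + len(new)
        return records

cluster = ToyCluster()
producer = ToyProducer(cluster)
consumer = ToyConsumer(cluster, ["tweets"])

producer.send("tweets", "The main dish was delicious")
producer.send("tweets", "The main dish was salty and horrible")
first = consumer.poll()
second = consumer.poll()
```

The consumer tracks its own offset into the log, so the first poll returns both records and the second returns nothing until the producer publishes again.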

Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD (renamed DataFrame in Spark 1.3), which provides support for structured and semi-structured data.

Spark SQL also provides JDBC connectivity and can access several databases using both Hadoop connectors and Spark connectors.

To store data in a database or read data from it, it's necessary to:
- Define an SQLContext (the entry point) for using Spark SQL's functionality;
- Create a table schema by means of a StructType, which is then used to create a DataFrame.

By using JDBC drivers, the DataFrame built on that schema is written to a database.

Output operations for DStream

MLlib is Spark's library of machine learning functions.

MLlib contains a variety of learning algorithms and is accessible from all of Spark's supported programming languages.

It consists of common learning algorithms and utilities, including classification, regression, clustering, etc.

Machine Learning with MLlib

The mllib.feature package contains several classes for common feature transformations. These include algorithms to construct feature vectors from text and ways to normalize and scale features.

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
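
For reference, TF-IDF can be written as below, using the smoothed IDF given in the MLlib documentation. The symbols are notation introduced here: t is a term, d a document, D the corpus, TF(t, d) the number of times t appears in d, and DF(t, D) the number of documents that contain t.

```latex
\[
\mathrm{IDF}(t, D) = \log \frac{|D| + 1}{\mathrm{DF}(t, D) + 1},
\qquad
\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D)
\]
```

A term that appears in every document has DF(t, D) = |D|, so its IDF is log(1) = 0 and it contributes nothing to the feature vector, whereas rare terms are weighted up.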

Feature extraction

Classification and regression are two common forms of supervised learning, where algorithms attempt to predict a variable from features of objects using labeled training data.

Both classification and regression use the LabeledPoint class in MLlib.

MLlib includes a variety of methods for classification and regression, including simple linear methods and decision trees and forests.

Classification

Naïve Bayes is a multiclass classification algorithm that scores how well each point belongs in each class based on a linear function of the features.

It’s commonly used in text classification with TF-IDF features, among other applications such as Tweet Sentiment Analysis.

In MLlib, it’s possible to use Naïve Bayes through the mllib.classification.NaiveBayes class.
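
To make the "linear function of the features" concrete, here is a small multinomial Naïve Bayes written from scratch in plain Python. It is an illustrative sketch, not the MLlib implementation: `train_nb` and `predict_nb` are hypothetical names, the training tweets are made up, and raw term counts are used instead of TF-IDF weights.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, smoothing=1.0):
    """Fit multinomial Naive Bayes; returns per-class log priors and log likelihoods."""
    vocab = {t for doc in docs for t in doc}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Additive (Laplace) smoothing avoids zero probabilities for unseen terms.
        total = sum(term_counts[c].values()) + smoothing * len(vocab)
        log_like[c] = {t: math.log((term_counts[c][t] + smoothing) / total) for t in vocab}
    return log_prior, log_like

def predict_nb(model, doc):
    """Score each class as log prior + sum of log likelihoods (linear in term counts)."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in doc if t in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Made-up, already tokenized training tweets.
train_docs = [
    ["delicious", "dish"],
    ["great", "tasty", "dish"],
    ["salty", "horrible", "dish"],
    ["awful", "horrible", "service"],
]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_nb(train_docs, train_labels)
pred_a = predict_nb(model, ["tasty", "delicious"])
pred_b = predict_nb(model, ["horrible", "dish"])
```

Working in log space turns the product of per-term probabilities into a sum, so the class score is a weighted sum of term counts plus a bias (the log prior). Terms never seen in training are simply skipped at prediction time in this sketch.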

Naïve Bayes

Clustering is the unsupervised learning task that involves grouping objects into clusters of high similarity.

Unlike the supervised tasks, where the data is labeled, clustering can be used to make sense of unlabeled data.

It is commonly used in data exploration and in anomaly detection.

Clustering

In addition to the popular "offline" K-means algorithm, MLlib also provides an "online" version for clustering data streams.

When data arrive in a stream, the algorithm dynamically:
1. Estimates the cluster membership of the incoming data;
2. Updates the centroids of the clusters.
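
The two steps above can be sketched as a single update function in plain Python. This is modeled on the weighted-average update described in the MLlib documentation, but it is not MLlib code: `update_centroids`, the `decay` parameter name and the example data are assumptions made for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def update_centroids(centroids, weights, batch, decay=1.0):
    """One streaming k-means step on a batch of points.

    centroids: list of coordinate lists; weights: points seen so far per cluster.
    decay=1.0 weighs all history equally; decay=0.0 forgets it.
    """
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    # 1. Estimate membership: assign each point to its nearest centroid.
    for point in batch:
        j = min(range(k), key=lambda i: squared_distance(point, centroids[i]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
    # 2. Update each centroid as a weighted average of its old position and the batch.
    new_centroids, new_weights = [], []
    for j in range(k):
        old = weights[j] * decay
        total = old + counts[j]
        if counts[j] == 0:
            new_centroids.append(list(centroids[j]))  # no new points: keep as is
        else:
            new_centroids.append(
                [(centroids[j][d] * old + sums[j][d]) / total for d in range(dim)]
            )
        new_weights.append(total)
    return new_centroids, new_weights

centroids = [[0.0, 0.0], [10.0, 10.0]]
weights = [1.0, 1.0]
batch = [[2.0, 0.0], [0.0, 2.0], [10.0, 12.0]]
centroids, weights = update_centroids(centroids, weights, batch)
```

Calling the function once per incoming batch keeps the centroids up to date without storing past data. A decay below 1.0 lets the clusters follow drift in the stream by giving recent batches more influence.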

Streaming K-means

In MLlib, it’s possible to use Streaming K-means through the mllib.clustering.StreamingKMeans class.

Streaming K-means (cont.)

Given a dataset of points in a high-dimensional space, we are often interested in reducing the dimensionality of the points so that they can be analyzed with simpler tools.

For example, we might want to plot the points in two dimensions, or just reduce the number of features to train models more efficiently.

In MLlib, it's possible to use PCA through the mllib.feature.PCA class.
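
The dimensionality reduction described above can be illustrated with a small pure-Python sketch that finds the first principal component by power iteration on the covariance matrix. Power iteration is chosen here only for brevity and is not a statement about how MLlib computes PCA; the function names and sample points are hypothetical.

```python
import math

def first_principal_component(points, iterations=100):
    """Return (mean, direction), where direction is the unit vector of largest variance."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - mean[d] for d in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return mean, v

def project(points, mean, direction):
    """Reduce each point to one coordinate along the principal direction."""
    return [sum((p[d] - mean[d]) * direction[d] for d in range(len(mean))) for p in points]

points = [[1.0, 1.0], [2.0, 2.1], [3.0, 2.9], [4.0, 4.0]]
mean, direction = first_principal_component(points)
reduced = project(points, mean, direction)
```

For these points, which lie close to the line y = x, the direction found is close to the diagonal, and each two-dimensional point is reduced to a single coordinate along it. The fixed starting vector is a simplification: it would fail for data whose principal direction is exactly orthogonal to it.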

Principal Component Analysis (PCA)

Apache Zeppelin is a web-based notebook that enables interactive data visualization.

Apache Zeppelin's interpreter concept allows any language or data-processing backend, such as JDBC, to be plugged into Zeppelin.

Apache Zeppelin

Because the Python Spark Streaming API lacks support for accessing a Twitter account, the tweet streams have been simulated using Apache Kafka.

In particular, the entity responsible for this task is a Producer, which publishes a stream of data on a specific topic.

The training and testing data streams have been retrieved from [1].

On the other side, each received DStream is processed by a Consumer, using stateless Spark functions such as map and transform.

Implementation and results

Naïve Bayes classification results

Clustering results

Future work:
- Integrate the Twitter API's methods to retrieve tweets from accounts.
- Use an alternative feature extraction method for the Streaming K-means task.

[1] http://help.sentiment140.com/for-students/
[2] Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc., 2015.
[3] Bogomolny, A. Benford's Law and Zipf's Law. http://www.cut-the-knot.org/doyouknow/zipfLaw.shtml

References


For any questions, contact me at: [email protected]
