Data Mining Presentation - Twitter Classification

Categorization of users on Twitter

CS-248 Data Mining

Muhammad Usman Riaz | ID: 101620043

Daud Khan | ID: 101620016

Muhammad Ain Ul Hassan | ID: 101620005

Muzamil Asad | ID: 101620013

Abid Javed | ID: 101620025

Spring 2014

Outline

Introduction
Pre-processing
Classification
Results

Problem statement

Twitter is an online social networking and microblogging service that enables users to send and read short text messages of up to 140 characters, called "tweets". Registered users can read and post tweets, but unregistered users can only read them. The objective of this project is to categorize Twitter users into classes such as company or individual, professional or home user, sportsman, student, and teacher.

Dataset

Raw dataset

(a) Raw data

(b) Data organization

Attributes of ‘category’

Pre-processing

Conversion to ARFF format
Removal of unnecessary attributes
Tweets (strings) converted into words (using the Weka "StringToWordVector" filter)
Removal of stop words (are, as, at, etc.)
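The tokenization and stop-word steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Weka's implementation: the function names, the short stop-word list, and the sample tweet are assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative stop-word list (assumption); real stop-word lists are much longer.
STOP_WORDS = {"are", "as", "at", "a", "an", "the", "is", "to", "of", "and"}

def tokenize(tweet):
    """Lowercase a tweet and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", tweet.lower())

def to_word_counts(tweet, stop_words=STOP_WORDS):
    """Return a bag-of-words Counter for one tweet, with stop words removed."""
    return Counter(w for w in tokenize(tweet) if w not in stop_words)

# Hypothetical tweet used only to exercise the functions.
counts = to_word_counts("We are hiring engineers at the Lahore office")
```

Each remaining word becomes one attribute of the instance, which is the role the word vector plays in the training data.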

Training data after pre-processing

Classification

Conversion of test data to ARFF format using batch filtering.

Batch filtering is used when a second dataset, normally the test set, needs to be processed with the same statistics as the first dataset, normally the training set.
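A minimal sketch of why this matters for word vectors, assuming a simple word-presence representation: the vocabulary is learned from the training set only and then reused for the test set, so both sets share the same attributes. The function names and sample documents are hypothetical, and this does not reproduce Weka's filter.

```python
def build_vocabulary(train_docs):
    """Collect the sorted set of words seen in the training documents."""
    return sorted({w for doc in train_docs for w in doc.lower().split()})

def vectorize(doc, vocabulary):
    """Map a document to 0/1 word-presence features over a fixed vocabulary."""
    words = set(doc.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

# Hypothetical documents, for illustration only.
train = ["new phone launch today", "exam results today"]
test = ["phone exam tomorrow"]

vocab = build_vocabulary(train)                       # learned from training data only
train_vectors = [vectorize(d, vocab) for d in train]
test_vectors = [vectorize(d, vocab) for d in test]    # same columns as the training set
```

The word "tomorrow" never occurs in the training data, so it has no column and is dropped from the test vector. A filter fitted separately on the test set would instead produce a different, incompatible set of attributes.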

Classification

Classification using supplied test data-set
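Evaluating on a supplied test set amounts to comparing predicted categories against the known ones. The sketch below computes accuracy and confusion counts; the `evaluate` helper and the label lists are hypothetical and are not results from this project.

```python
from collections import Counter

def evaluate(actual, predicted):
    """Return accuracy and a Counter of (actual, predicted) label pairs."""
    confusion = Counter(zip(actual, predicted))
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    return correct / len(actual), confusion

# Made-up labels for illustration only.
actual = ["company", "student", "student", "teacher"]
predicted = ["company", "student", "teacher", "teacher"]
accuracy, confusion = evaluate(actual, predicted)
```

Off-diagonal pairs such as `("student", "teacher")` are the classifier errors.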

Results

NaiveBayes

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
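To make the independence assumption concrete, here is a small multinomial Naive Bayes sketch with add-one smoothing. It is one common variant chosen for illustration and is not necessarily the variant Weka's NaiveBayes applies; the function names and the toy documents are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate class counts and per-class word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
    vocab = {w for words in docs for w in words}
    return class_counts, word_counts, vocab

def predict_nb(model, words):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # ignore words never seen in training
            # Naive assumption: per-word likelihoods multiply (add in log space).
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, for illustration only.
docs = [["match", "goal", "team"], ["exam", "lecture", "homework"], ["goal", "league"]]
labels = ["sportsman", "student", "sportsman"]
model = train_nb(docs, labels)
```

Calling `predict_nb(model, ["exam", "homework"])` picks the class whose word statistics best explain the input, treating each word as independent evidence.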

Results: Classification using NaiveBayes

Classifier errors using NaiveBayes

X: Category Y: Predicted Category

Results: Classification using SMO

Sequential Minimal Optimization (SMO) is an algorithm for efficiently solving the optimization problem which arises during the training of support vector machines.
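For reference, the optimization problem in question is usually written as the soft-margin SVM dual below. The notation is the standard textbook one and does not come from the slides: the alpha_i are Lagrange multipliers, y_i in {-1, +1} are class labels, K is the kernel function, and C is the regularization parameter.

```latex
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0
```

SMO repeatedly selects two multipliers and optimizes them analytically while holding the others fixed, which keeps the equality constraint satisfied at every step.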

Conclusion

SMO is a simple algorithm that achieves high classification accuracy on our dataset.

It performs well when the training data has a balanced class distribution.

Thanks

Questions?
