Upload
usman-riaz
View
24
Download
0
Embed Size (px)
DESCRIPTION
Data Mining Presentation - Twitter Classification
Citation preview
Categorization of users on Twitter
CS-248 Data Mining
Muhammad Usman Riaz | ID: 101620043
Daud Khan | ID: 101620016
Muhammad Ain Ul Hassan | ID: 101620005
Muzamil Asad | ID: 101620013
Abid Javed | ID: 101620025
Spring 2014
Outline
Introduction Pre-processing Classification Results
Problem statement
Twitter is an online social networking and microblogging service that enables users to send and read short 140-character text messages, called "tweets". Registered users can read and post tweets, but unregistered users can only read them. The objective of this project includes categorization of Twitter users into different classes like company or individual, professional or home user, sportsman, student, teacher etc.
Dataset Raw dataset
(a) Raw data
(b) Data organization
Attributes of ‘category’
Pre-processing
Conversion to ARFF format Removal of unnecessary attributes. Tweets (strings) converted into words
(using weka “StringtoWordVector” filter)
Removal of stop words (are, as, at etc)
Training data after pre-processing
Classification
Conversion of test data to ARFF format using batch filtering.
Batch filtering is used if a second dataset, normally the test set, needs to be processed with the same statistics as the the first dataset, normally the training set.
Classification
Classification using supplied test data-set
Results
NaiveBayes Naive Bayes classifiers are a family
of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
ResultsClassification using NaiveBayes
Classifier errors using NaiveBayes
X: Category Y: Predicted Category
ResultsClassification using SMO
Sequential Minimal Optimization (SMO) is an algorithm for efficiently solving the optimization problem which arises during the training of support vector machines.
ResultsClassification using SMO
Conclusion
SMO is a simple algorithm with high classification accuracy for our dataset.
It shows high performance with balanced distribution training data as input.
ThanksQuestion?