Customer Linguistic Profiling: Predicting Personality Traits on Facebook Statuses
Vishweshwara Keekan, Dmitrij Petrov, Dustin Nguyen

Agenda
1. Goals
2. Stylometry and its use cases
3. Predicting Big Five personality traits
   1. Split the dataset
   2. Train and test statistical models
   3. Evaluate the performance and show final results
4. Summary

Goals
• Getting into the field of stylometry and natural language processing
• Conducting various data experiments on a Facebook dataset
Non-Goals
• Achieving better results than existing studies

Stylometry
• Emerged in the second half of the 19th century
• Wincenty Lutosławski coined the term in 1897
• Def.: “the statistical analysis of literary style” dealing with “the study of individual or group characteristics in written language” (e.g. sentence length); Holmes & Kardos (2003), Knight (1993)
• Applied to authorship attribution and profiling, plagiarism detection, etc.

Examples
• Authorship Identification in Greek Tweets
  • Modern Greek Twitter corpus of 12,973 tweets retrieved from 10 popular Greek users
  • Character and word n-grams
• Forensic Stylometry for Anonymous Emails
  • Frequent-pattern technique
  • Company email dataset containing 200,399 real-life emails from 158 employees
• Dream of the Red Chamber (1759) by Cao Xueqin
  • First circulated as a hand-written novel of 80 chapters
  • Cheng-Gao’s first printed edition added 40 chapters
  • The “chrono-divide” was most recently confirmed via SVC-RFE with 10-50 features*
* Hu et al. (2014)
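The character and word n-grams mentioned for the Greek tweets study can be illustrated with a minimal sketch. The function names and the whitespace tokenisation are assumptions made for illustration only; this is not the feature extraction used in the cited study.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams (simplification: split on whitespace)."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_profile(text, n):
    """Count lower-cased character n-grams; such counts can serve as a style profile."""
    return Counter(char_ngrams(text.lower(), n))
```

Comparing such profiles between a disputed text and texts of known authors is one simple way to approach authorship attribution.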

Supervised Machine Learning (S-ML)
• Dataset from the MyPersonality.org project:
  • 9,917 Facebook status updates from 250 users
  • Statuses have, for our purposes, not been pre-processed (e.g. tokens such as “OMG” remained)
  • Statuses are classified into binary Big Five personality traits (Extroversion, Agreeableness, Neuroticism, Openness to experience, Conscientiousness)
• S-ML (vs. unsupervised ML)
  • Dataset contains many input and (desired) output variables
  • S-ML learns by example and, after several iterations, is able to classify an input

Methodology & Tools
• Tools: NLTK, scikit-learn, Jupyter notebooks, Python 3, (R), GitHub*, etc.
• Methodology of S-ML:
  • Extract relevant stylometric (NLP) features -> Prepare data
  • Split dataset into training and testing sets -> Prepare data
  • Train the model on the training set -> Learn by examples
  • Test the model on the ‘unseen’ set -> Classify
  • Validate the performance of the model -> Evaluate
* https://github.com/dmpe/CaseSolvingSeminar/

Extracted features from statuses
Labels from ODS (5): cNEU, cAGR, cOPN, cCON, cEXT
Feature from ODS: STATUS
Extracted features:
  Lexical (6)                 Character (8)
  # functional words          string length
  lexical diversity [0-1]     # dots
  # words                     # commas
  # personal pronouns         smileys
  Parts-of-speech tags        # semicolons
  Bag-of-words (n-grams)      # colons
                              # *PROPNAME*
                              average word length
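Several of the character-level features listed above can be derived with plain string operations. The following is a minimal sketch; the function name and dictionary keys are hypothetical, and it does not reproduce the project's actual extraction code (for example, colons that belong to smileys such as ":)" are counted as colons here).

```python
def character_features(status):
    """Compute a few character-level features for one status string."""
    words = status.split()  # simplification: whitespace tokenisation
    return {
        "string_length": len(status),
        "n_dots": status.count("."),
        "n_commas": status.count(","),
        "n_semicolons": status.count(";"),
        "n_colons": status.count(":"),
        # *PROPNAME* is the placeholder token that appears in the dataset
        "n_propname": status.count("*PROPNAME*"),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }
```

For example, `character_features("OMG, what a day. *PROPNAME* rocks!")` returns one dictionary per status, which can then be stacked into numeric columns next to the STATUS text.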
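
Splitting dataset using stratified k-fold CV
• Create 5 trait datasets based on our labels
• Use a stratified split into the training and testing set; 10-fold cross-validation is applied later when scoring the models
>>> from sklearn.model_selection import train_test_split  # replaces the removed sklearn.cross_validation module
>>> # assumes agr is a pandas DataFrame, hence .iloc for positional column slicing
>>> train_X, test_X, train_Y, test_Y = train_test_split(agr.iloc[:, 1:9], agr["cAGR"], train_size=0.66, stratify=agr["cAGR"], random_state=5152)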
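To make the idea of stratification concrete, here is a small pure-Python sketch of stratified k-fold index generation. It is an illustrative assumption, not the project's code; in practice scikit-learn's `StratifiedKFold` in `sklearn.model_selection` provides this functionality. The function name `stratified_k_fold` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs whose folds roughly keep the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal the shuffled indices of each class round-robin into the k folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves once as the test set; the remaining folds form the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, test_idx
```

With a binary label such as cAGR, every test fold then contains approximately the same share of positive and negative statuses as the full dataset.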

Classification Metrics -> Confusion Matrix (1)
                                   “Gold Standard” (real truth values)
Observed                           Positive                              Negative
Predicted positive                 True Positive (TP)                    False Positive (FP, Type 1 error)     -> Precision
Predicted negative                 False Negative (FN, Type 2 error)     True Negative (TN)
                                   -> Recall/Sensitivity                 -> Specificity
Accuracy  = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1-score  = 2 * Precision * Recall / (Precision + Recall)
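The four metrics follow directly from the confusion-matrix counts. A minimal sketch using the standard definitions; the function name is hypothetical, and returning 0.0 for undefined ratios is a choice made here for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

A classifier that predicts almost everything as positive obtains a very high recall at moderate precision, which is why accuracy, precision, recall and F1 are best read together.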
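
Learning and predicting
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import average_precision_score, recall_score, f1_score
• Head-on approach: classifiers only
    # Assumption: features are numeric values only
    classifier = MultinomialNB()
    classifier.fit(train_X, train_Y).predict(test_X)  # results in a prediction for test_X
• But: “Status” is a string and not numeric
    nb_pipeline = Pipeline([
        ('vectorizer_tfidf', TfidfVectorizer(ngram_range=(1, 2))),
        ('nb', MultinomialNB()),
    ])
    predicted = nb_pipeline.fit(train_X, train_Y).predict(test_X)
• Validation of results
    # train_X + test_X concatenates the two sets; this assumes plain Python lists
    scores = cross_val_score(nb_pipeline, train_X + test_X, train_Y + test_Y, cv=10, scoring='accuracy')
    accuracy, std_deviation = scores.mean(), scores.std() * 2
    # note: average_precision_score summarises the precision-recall curve;
    # sklearn.metrics.precision_score would give plain TP / (TP + FP)
    precision = average_precision_score(test_Y, predicted)
    recall = recall_score(test_Y, predicted, labels=[False, True])
    f1 = f1_score(test_Y, predicted, labels=[False, True])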

Pipeline: Source Code Example
    pipeline = sklearn.pipeline.Pipeline([
        ('features', sklearn.pipeline.FeatureUnion(
            transformer_list=[
                ('status_string', sklearn.pipeline.Pipeline([
                    # tfidf on status
                    ('tf_idf_vect', sklearn.feature_extraction.text.TfidfVectorizer()),
                ])),
                ('derived_numeric', sklearn.pipeline.Pipeline([
                    # aggregator creates derived values
                    ('derived_cols', Aggregator([LexicalDiversity(), NumberOfFunctionalWords()])),
                    ('scaler', sklearn.preprocessing.MinMaxScaler()),
                ])),
            ],
        )),
        ('classifier_naive_bayes', sklearn.naive_bayes.MultinomialNB()),
    ])
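The slides do not show how `Aggregator` or `LexicalDiversity` are implemented. The sketch below is a hypothetical stand-in that only illustrates the `fit`/`transform` interface a pipeline step is expected to offer. It assumes lexical diversity means the ratio of distinct tokens to all tokens, which matches the [0-1] range given in the feature list but is not confirmed by the source. In scikit-learn, such a class would usually also inherit from `BaseEstimator` and `TransformerMixin`.

```python
class LexicalDiversitySketch:
    """Hypothetical transformer: maps each status to its type-token ratio."""

    def fit(self, statuses, y=None):
        # Nothing is learned from the data; fit only exists to satisfy the interface.
        return self

    def transform(self, statuses):
        rows = []
        for status in statuses:
            tokens = status.lower().split()  # simplification: whitespace tokenisation
            diversity = len(set(tokens)) / len(tokens) if tokens else 0.0
            rows.append([diversity])  # one numeric column per status
        return rows
```

Because the ratio already lies between 0 and 1 it is non-negative, which `MultinomialNB` requires; the `MinMaxScaler` step in the pipeline serves the same purpose for unbounded counts such as the number of functional words.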

Parameter fine-tuning
• Most transformers and classifiers accept different parameters
• Parameters can heavily influence the result
    grid_params = {
        'features__status_string__tf_idf_vect__ngram_range': ((1, 1), (1, 2), (2, 3)),
    }
    grid_search = GridSearchCV(pipeline, param_grid=grid_params, cv=2, n_jobs=-1, verbose=0)
    y_pred_trait = grid_search.fit(train_X, train_Y).predict(test_X)
    # print best parameters
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(grid_params.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Baseline 1: STATUS (TF-IDF) column only
Achieved results with 10-fold CV
Trait dataset   Accuracy mean   Accuracy std. dev.   Recall   Precision   F1-score   Best algorithm
NEU             0.426           +/- 0.03             0.81     0.63        0.51       k-NN
OPN             0.747           +/- 0.03             0.998    0.87        0.854      Bernoulli-NB
AGR             0.585           +/- 0.03             0.91     0.76        0.698      Bernoulli-NB
EXT             0.600           +/- 0.03             0.45     0.60        0.48       Linear-SVC
CON             0.514           +/- 0.04             0.90     0.70        0.61       k-NN

Baseline 2: Derived columns
Achieved results with 10-fold CV
Trait dataset   Accuracy mean   Accuracy std. dev.   Recall    Precision   F1-score   Best algorithm
NEU             + 0.196         +/- 0.03             - 0.794   - 0.238     - 0.480    Bernoulli-NB
OPN             - 0.004         +/- 0.03             + 0.002   + 0.001     - 0.001    Bernoulli-NB
AGR             - 0.054         +/- 0.03             - 0.773   - 0.231     - 0.433    SVC
EXT             - 0.015         +/- 0.03             - 0.273   - 0.071     - 0.215    Bernoulli-NB
CON             + 0.025         +/- 0.04             - 0.5     - 0.114     - 0.167    Bernoulli-NB

Pipeline 3: Mix of STATUS and NON-STATUS cols.
Achieved results with 10-fold CV
Trait dataset   Accuracy mean   Accuracy std. dev.   Recall    Precision   F1-score   Best algorithm
NEU             + 0.064         +/- 0.25             + 0.170   + 0.050     + 0.030    Linear SVC
OPN             - 0.017         +/- 0.03             - 0.002   +/- 0       - 0.004    Multinomial-NB
AGR             - 0.082         +/- 0.07             + 0.082   - 0.020     - 0.008    Linear SVC
EXT             - 0.073         +/- 0.02             - 0.070   - 0.060     - 0.070    k-NN
CON             + 0.001         +/- 0.08             + 0.096   + 0.030     + 0.020    Linear SVC

Results/Summary
• Hardly any improvement from the head-on approach, at least relative to the baseline
• Limited:
  • strongly by hardware (CPU)
    • grid_search: Count(Algorithms) * Count(Parameters) * Count(Labels) -> rapidly growing effort
    • grid_search for 1 label, 3 parameters and LinearSVC took >20 minutes
    • Future research: look into GPUs (NVIDIA)
  • by inconsistent data (multiple languages, e.g. Spanish)
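The growth in grid-search effort can be estimated before running anything. The sketch below counts model fits as algorithms × parameter combinations × labels, as on the slide, and additionally multiplies by the number of CV folds, which is an assumption added here. The function name, the `alpha` parameter and the example numbers are purely illustrative.

```python
def count_fits(n_algorithms, param_grid, n_labels, cv_folds):
    """Estimate how many model fits an exhaustive grid search would perform."""
    combinations = 1
    for values in param_grid.values():
        combinations *= len(values)  # every value is combined with every other parameter's values
    return n_algorithms * combinations * n_labels * cv_folds


# Hypothetical grid: 3 n-gram ranges and 2 smoothing values give 6 combinations.
example_grid = {
    "ngram_range": [(1, 1), (1, 2), (2, 3)],
    "alpha": [0.5, 1.0],
}
print(count_fits(n_algorithms=5, param_grid=example_grid, n_labels=5, cv_folds=2))
```

Each additional parameter multiplies the total, so even small grids become expensive once they are repeated for every trait dataset and classifier.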