Customer Linguistic Profiling

Predicting Personality Traits on Facebook Statuses

Vishweshwara Keekan, Dmitrij Petrov, Dustin Nguyen

Agenda
1. Goals
2. Stylometry and its use-cases
3. Predicting Big 5 personality traits
   1. Split the dataset
   2. Train and test statistical models
   3. Evaluate the performance & show final results
4. Summary

Goals
• Getting into the field of stylometry & natural language processing
• Conducting various data experiments on the Facebook dataset

Non-Goals
• Achieving better results than existing studies

Stylometry
• Emerged in the second half of the 19th century
• Term coined by Wincenty Lutosławski in 1897
• Def.: “the statistical analysis of literary style” dealing with “the study of individual or group characteristics in written language” (e.g. sentence length) (Holmes & Kardos, 2003; Knight, 1993)
• Applied to authorship attribution & profiling, plagiarism detection etc.

Examples
• Authorship Identification in Greek Tweets
  • Modern Greek Twitter corpus consisting of 12,973 tweets retrieved from 10 popular Greek users
  • Character and word n-grams
• Forensic Stylometry for Anonymous Emails
  • Frequent pattern technique
  • Company email dataset containing 200,399 real-life emails from 158 employees
• Dream of the Red Chamber (1759) by Cao Xueqin
  • First circulated as a hand-written manuscript of 80 chapters
  • Cheng-Gao’s first printed edition added 40 further chapters
  • The “chrono-divide” was most recently demonstrated via SVC-RFE with 10-50 features*

* Hu et al. (2014)

Supervised Machine Learning (S-ML)
• Dataset from the MyPersonality.org project:
  • 9,917 Facebook status updates from 250 users
  • For our purposes, the statuses have not been pre-processed (e.g. “OMG” remained)
  • Statuses are or will be classified into the binary Big Five personality traits (Extroversion, Agreeableness, Neuroticism, Openness to experience, Conscientiousness)
• S-ML (vs. unsupervised ML)
  • Dataset contains many input & (desired) output variables
  • S-ML learns by example and, after several iterations, is able to classify an input

Methodology & Tools
• Tools: NLTK, scikit-learn, Jupyter notebooks, Python 3, (R), GitHub* etc.
• Methodology of S-ML:
  • Extract relevant stylometric (NLP) features -> Prepare data
  • Split dataset into training & testing set -> Prepare data
  • Train the model on the training set -> Learn by examples
  • Test the model on the ‘unseen’ set -> Classify
  • Validate the performance of the model -> Evaluate

*https://github.com/dmpe/CaseSolvingSeminar/

Extracted features from statuses
• 5 labels from ODS: cNEU, cAGR, cOPN, cCON, cEXT
• Feature from ODS: STATUS
• Extracted ones:
  • Lexical (6): # functional words, lexical diversity [0-1], # words, # personal pronouns, parts-of-speech tags, bag-of-words (ngrams)
  • Character (8): string length, # dots, # commas, smileys, # semicolons, # colons, # *PROPNAME*, average word length
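A few of the listed lexical and character features could be computed roughly as follows. This is only a sketch: the function name `extract_features`, the two short word sets and the whitespace tokenization are assumptions made for illustration, not the extraction code used in the project.

```python
import string

# Hypothetical, deliberately short word lists; a real run would use fuller lexicons.
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in", "on", "is"}
PERSONAL_PRONOUNS = {"i", "me", "you", "he", "him", "she", "her", "it", "we", "us", "they", "them"}


def extract_features(status):
    """Return a dict of simple stylometric features for one status string."""
    # Naive tokenization: split on whitespace, strip surrounding punctuation.
    words = [w.strip(string.punctuation).lower() for w in status.split()]
    words = [w for w in words if w]
    n_words = len(words)
    return {
        "string_length": len(status),
        "n_words": n_words,
        # Share of distinct words, in [0, 1].
        "lexical_diversity": len(set(words)) / n_words if n_words else 0.0,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "n_function_words": sum(w in FUNCTION_WORDS for w in words),
        "n_personal_pronouns": sum(w in PERSONAL_PRONOUNS for w in words),
        "n_dots": status.count("."),
        "n_commas": status.count(","),
        "n_semicolons": status.count(";"),
        "n_colons": status.count(":"),
        "n_propname": status.count("*PROPNAME*"),
    }


features = extract_features("I like the sea, and the sea likes *PROPNAME*.")
```

Returning a flat dict per status keeps the sketch easy to turn into a numeric matrix later, one column per feature.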

Splitting dataset using stratified k-fold CV
• Create 5 trait datasets based on our labels
• Use stratified k-fold cross-validation to split into the training and testing set

>>> from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
>>> # agr is assumed to be a pandas DataFrame; columns 1-8 hold the features
>>> train_X, test_X, train_Y, test_Y = train_test_split(
...     agr.iloc[:, 1:9], agr["cAGR"], train_size=0.66,
...     stratify=agr["cAGR"], random_state=5152)
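Stratified k-fold splitting itself can be illustrated with scikit-learn's `StratifiedKFold`. The sketch below uses made-up toy data (the arrays `X` and `y` are assumptions, not the project's dataset) and only shows that every fold keeps the label proportions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 10 samples, 2 numeric features, 6 False / 4 True labels.
X = np.arange(20).reshape(10, 2)
y = np.array([False] * 6 + [True] * 4)

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=5152)
fold_true_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Each test fold receives the same share of True labels as the full set.
    fold_true_counts.append(int(y[test_idx].sum()))
```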

Classification Metrics -> Confusion Matrix (1)

                      “Gold Standard” (Real Truth Values)
Observed              Positive                         Negative
Predicted positive    True Positive                    False Positive (Type 1 error)    -> Precision
Predicted negative    False Negative (Type 2 error)    True Negative
                      -> Recall/Sensitivity            -> (Specificity)

Accuracy  = (TP + TN) / (TP + FP + FN + TN)
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1-score  = 2 * Precision * Recall / (Precision + Recall)
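The four metrics can be computed directly from the confusion-matrix counts. The sketch below uses the standard textbook definitions; the function name `classification_metrics` and the example counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```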

Learning and predicting
• Head-on approach: classifiers only
    from sklearn.naive_bayes import MultinomialNB

    # Assumption: features are numeric values only
    classifier = MultinomialNB()
    classifier.fit(train_X, train_Y).predict(test_X)  # results in a prediction for test_X

• But: “Status” is a string and not numeric
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline

    nb_pipeline = Pipeline([
        ('vectorizer_tfidf', TfidfVectorizer(ngram_range=(1, 2))),
        ('nb', MultinomialNB())])
    predicted = nb_pipeline.fit(train_X, train_Y).predict(test_X)

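• Validation of results
    from sklearn import model_selection  # formerly sklearn.cross_validation
    from sklearn.metrics import average_precision_score, f1_score, recall_score

    # train_X + test_X assumes plain Python lists, so "+" concatenates them
    scores = model_selection.cross_val_score(
        nb_pipeline, train_X + test_X, train_Y + test_Y, cv=10, scoring='accuracy')
    accuracy, std_deviation = scores.mean(), scores.std() * 2
    # average_precision_score summarizes the precision-recall curve;
    # sklearn.metrics.precision_score would give plain TP / (TP + FP)
    precision = average_precision_score(test_Y, predicted)
    recall = recall_score(test_Y, predicted, labels=[False, True])
    f1 = f1_score(test_Y, predicted, labels=[False, True])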
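The snippets on this slide depend on the project's data. A self-contained variant on toy data might look like the sketch below; the statuses and labels are invented for illustration, and `cv=2` is chosen only because the toy set is tiny (the slides use 10 folds).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy statuses with a binary trait label (not the real dataset).
statuses = [
    "party with friends tonight",
    "great night out with everyone",
    "meeting new people at the party",
    "friends coming over for dinner",
    "reading alone at home",
    "quiet evening with a book",
    "staying home alone again",
    "home with tea and a book",
]
labels = [True, True, True, True, False, False, False, False]

# TF-IDF turns each status string into a numeric vector that Naive Bayes can use.
nb_pipeline = Pipeline([
    ("vectorizer_tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("nb", MultinomialNB()),
])

# Stratified 2-fold cross-validation; one accuracy value per fold.
scores = cross_val_score(nb_pipeline, statuses, labels, cv=2, scoring="accuracy")
accuracy, std_deviation = scores.mean(), scores.std() * 2

predicted = nb_pipeline.fit(statuses, labels).predict(["a book at home"])
```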

Pipeline: Source Code Example
pipeline = sklearn.pipeline.Pipeline([
    ('features', sklearn.pipeline.FeatureUnion(
        transformer_list=[
            ('status_string', sklearn.pipeline.Pipeline([
                # tfidf on status
                ('tf_idf_vect', sklearn.feature_extraction.text.TfidfVectorizer()),
            ])),
            ('derived_numeric', sklearn.pipeline.Pipeline([
                # aggregator creates derived values
                ('derived_cols', Aggregator([LexicalDiversity(), NumberOfFunctionalWords()])),
                ('scaler', sklearn.preprocessing.MinMaxScaler()),
            ])),
        ],
    )),
    ('classifier_naive_bayes', sklearn.naive_bayes.MultinomialNB())
])
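`Aggregator`, `LexicalDiversity` and `NumberOfFunctionalWords` are project-specific classes that are not shown on the slide. To make the idea runnable, the sketch below swaps in a hypothetical stand-in transformer, `DerivedFeatures`, which computes two simple numeric columns; the toy statuses are likewise assumptions.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler


class DerivedFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the project's Aggregator: strings -> numeric columns."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        rows = []
        for text in X:
            words = text.lower().split()
            diversity = len(set(words)) / len(words) if words else 0.0
            rows.append([diversity, float(len(text))])
        return np.array(rows)


pipeline = Pipeline([
    ("features", FeatureUnion(transformer_list=[
        # TF-IDF on the raw status string
        ("status_string", TfidfVectorizer()),
        # derived numeric columns, scaled to [0, 1] because MultinomialNB
        # rejects negative feature values during fitting
        ("derived_numeric", Pipeline([
            ("derived_cols", DerivedFeatures()),
            ("scaler", MinMaxScaler()),
        ])),
    ])),
    ("classifier_naive_bayes", MultinomialNB()),
])

# Invented toy data for illustration.
statuses = ["party party tonight", "quiet evening with a book",
            "out with friends", "home alone reading"]
labels = [True, False, True, False]

predicted = pipeline.fit(statuses, labels).predict(statuses)
features = pipeline.named_steps["features"].transform(statuses)
```

`FeatureUnion` concatenates the TF-IDF columns and the derived columns side by side, so the classifier sees both feature groups in one matrix.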

Parameter fine-tuning
• Most transformers and classifiers accept different parameters
• Parameters can heavily influence the result

from sklearn.model_selection import GridSearchCV  # formerly sklearn.grid_search

grid_params = {
    'features__status_string__tf_idf_vect__ngram_range': ((1, 1), (1, 2), (2, 3))
}

grid_search = GridSearchCV(pipeline, param_grid=grid_params, cv=2, n_jobs=-1, verbose=0)
y_pred_trait = grid_search.fit(train_X, train_Y).predict(test_X)

# print best parameters
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(grid_params.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
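A self-contained grid search on toy data can look like the sketch below. The statuses, labels and the second tuned parameter (`nb__alpha`) are assumptions for illustration; `best_params_` is used as a shorter way to read the winning combination.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy statuses with a binary trait label.
statuses = [
    "party with friends tonight",
    "great night out with everyone",
    "meeting new people at the party",
    "friends coming over for dinner",
    "reading alone at home",
    "quiet evening with a book",
    "staying home alone again",
    "home with tea and a book",
]
labels = [True, True, True, True, False, False, False, False]

pipeline = Pipeline([
    ("tf_idf_vect", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Step name + "__" + parameter name addresses a parameter inside the pipeline.
grid_params = {
    "tf_idf_vect__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 1.0],
}

grid_search = GridSearchCV(pipeline, param_grid=grid_params, cv=2)
grid_search.fit(statuses, labels)

best = grid_search.best_params_
n_candidates = len(grid_search.cv_results_["params"])
```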

Baseline 1: STATUS (TF-IDF) column only

Achieved results with 10-fold CV:

Trait dataset  Accuracy mean  Accuracy std. dev.  Recall  Precision  F1-score  Best algorithm
NEU            0.426          +/- 0.03            0.81    0.63       0.51      k-NN
OPN            0.747          +/- 0.03            0.998   0.87       0.854     Bernoulli-NB
AGR            0.585          +/- 0.03            0.91    0.76       0.698     Bernoulli-NB
EXT            0.600          +/- 0.03            0.45    0.60       0.48      Linear-SVC
CON            0.514          +/- 0.04            0.90    0.70       0.61      k-NN

Baseline 2: Derived columns

Achieved results with 10-fold CV (signed values: change vs. Baseline 1):

Trait dataset  Accuracy mean  Accuracy std. dev.  Recall   Precision  F1-score  Best algorithm
NEU            + 0.196        +/- 0.03            - 0.794  - 0.238    - 0.480   Bernoulli-NB
OPN            - 0.004        +/- 0.03            + 0.002  + 0.001    - 0.001   Bernoulli-NB
AGR            - 0.054        +/- 0.03            - 0.773  - 0.231    - 0.433   SVC
EXT            - 0.015        +/- 0.03            - 0.273  - 0.071    - 0.215   Bernoulli-NB
CON            + 0.025        +/- 0.04            - 0.5    - 0.114    - 0.167   Bernoulli-NB

Pipeline 3: Mix of STATUS and NON-STATUS cols.

Achieved results with 10-fold CV (signed values: change vs. the baseline):

Trait dataset  Accuracy mean  Accuracy std. dev.  Recall   Precision  F1-score  Best algorithm
NEU            + 0.064        +/- 0.25            + 0.170  + 0.050    + 0.030   Linear SVC
OPN            - 0.017        +/- 0.03            - 0.002  +/- 0      - 0.004   Multinomial-NB
AGR            - 0.082        +/- 0.07            + 0.082  - 0.020    - 0.008   Linear SVC
EXT            - 0.073        +/- 0.02            - 0.070  - 0.060    - 0.070   k-NN
CON            + 0.001        +/- 0.08            + 0.096  + 0.030    + 0.020   Linear SVC

Results/Summary
• Hardly any improvement from the head-first approach, at least over the baseline
• Limitations:
  • Strongly limited by hardware (CPU)
    • grid_search: Count(Algorithms) * Count(Parameters) * Count(Labels) means rapidly growing effort
    • grid_search for 1 label, 3 parameters and LinearSVC took >20 minutes
    • Future research: look into GPUs (NVIDIA)
  • Inconsistent data (multiple languages, e.g. Spanish)
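The growth of the grid-search effort can be made concrete by counting model fits. In the sketch below the parameter values, the number of algorithms and the number of folds are assumptions chosen for illustration; only the five labels and the multiplicative structure come from the slides. Each candidate is additionally refit once per cross-validation fold, which the slide's formula leaves out.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: three tuned parameters with a few values each.
param_grid = {
    "ngram_range": [(1, 1), (1, 2), (2, 3)],
    "alpha": [0.1, 1.0],
    "use_idf": [True, False],
}

# ParameterGrid enumerates every combination: 3 * 2 * 2 candidates.
n_candidates = len(ParameterGrid(param_grid))

n_algorithms = 5  # assumed number of classifiers tried
n_labels = 5      # the five Big Five trait labels
cv_folds = 10     # assumed; each candidate is fit once per fold

total_fits = n_candidates * n_algorithms * n_labels * cv_folds
```

Adding one more parameter with three values would triple `total_fits`, which is why the search time grows so quickly.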
