


The Linguistic Features of Fake News Headlines and Statements

Urja Khurana
10739947

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
mw. dr. T. Deoskar
Institute for Language and Logic
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 30th, 2017


Abstract

The recent rise of fake news has heavily influenced people, from the 2016 US Presidential Elections to Pizzagate. It is therefore essential to address this socially relevant phenomenon. Until now, most research has been related to satire and clickbait. This thesis explores the linguistic features that can distinguish fake from real news and statements. By extracting different linguistic features from statements and headlines, the predictive power of each feature is explored, and the overall approach is evaluated with several classifiers. Unigrams, POS tag sequences, punctuation and generality turn out to be among the features with the most predictive power. In the end, the performances were above baseline.


Contents

1 Introduction
2 Related Work
3 Approach
  3.1 Algorithms & NLP-techniques
    3.1.1 N-grams
    3.1.2 POS Tagging
    3.1.3 Sentiment Analysis
    3.1.4 Likelihood
    3.1.5 Naive Bayes
    3.1.6 k-Nearest Neighbors
    3.1.7 Decision Trees
    3.1.8 SVM
    3.1.9 Logistic Regression
4 Experiments and Results
  4.1 Data
    4.1.1 Possible Datasets
    4.1.2 LIAR Dataset
    4.1.3 Headlines Dataset
    4.1.4 Pre-processing
  4.2 Methodology
    4.2.1 Feature Extraction
    4.2.2 Feature Selection
  4.3 Results
    4.3.1 Dropped features
    4.3.2 LIAR Statements - Train and Validation Set
    4.3.3 Statements on Headlines
    4.3.4 Headlines
    4.3.5 Total Results
5 Conclusion
6 Discussion and Future Work
7 References
A Accuracies of All Features Statements


Acknowledgements

I would like to thank my supervisor, Tejaswini Deoskar, for her constant support and suggestions during each and every step, and for giving me insight into the research process. She encouraged me to take up this challenge and inspired me to keep working hard. I would also like to thank Houda Alberts for proof-reading my thesis.


1 Introduction

Nowadays, social media has become a major part of society. From the circulation of memes to staying in touch with contacts, it has enabled many different ways of communication, which some might see as a blessing. However, a disadvantage that has recently started to attract more attention is the effortless propagation of fake news. This is a phenomenon where the content of a news article does not correspond to the actual truth, be it a mix of false and true statements or purely lies. Many social media users have fallen prey to this issue, especially during the US Presidential Election in 2016¹. Several voters were influenced by fake news, which had a significant impact on their choice of vote. Since the rise of fake news during that election, it has become a controversial topic and its ramifications are feared by many². Thus, it is important to find a way to control the spread of such misinformation, in order to prevent people from believing falsehoods that influence important factors such as their demeanor toward certain concepts.

Due to the relatively recent nature of this issue, not much research has been conducted on fake news itself. That is not the case for related topics such as clickbait, satire, or the credibility of social media posts, and the research done on these topics is worth mentioning, as it functions as a base to start from. First of all, the credibility of tweets has been explored during highly-covered events: in Gupta and Kumaraguru (2012), linguistic features such as swear words and pronouns were good predictors of the credibility of tweets. A different approach to this problem was taken by Tan et al. (2014), who explored the wording and propagation of a tweet; it turned out that the linguistic properties of a tweet influence its degree of propagation. Research on clickbait and satire has also yielded promising results, with linguistic features able to make a decent differentiation, as in Potthast et al. (2016) and Rubin et al. (2016), respectively. Some properties of fake news have been explored as well. Potthast et al. (2017) aimed to detect fake news and examined the writing style of left-wing, right-wing, and mainstream media. There were promising results regarding the distinction between the writing style of hyperpartisan and mainstream media. However, correctly classifying fake news proved hard, with an accuracy less than chance. This demonstrates the scope for improvement on this classification task.

Building on the findings so far, this thesis focuses on the linguistic features of fake news headlines, and whether they hold any predictive power for distinguishing fake from real news. The prime reason for investigating news headlines is that the headline is what prompts a consumer to read an article. Due to the lack of a reliable dataset of sufficient size to experiment properly, news statements will be explored as well. These are statements by politicians that are also used in news.

¹ http://www.politifact.com/truth-o-meter/article/2016/dec/13/2016-lie-year-fake-news/
² http://www.bbc.com/news/technology-39718034


Because of the absence of a clean dataset of labeled news headlines, it will also be examined to what extent classifiers trained on the statements apply to headlines. Thus, the research question is: What are the linguistic features of fake news statements and headlines that make them distinguishable from real news and make them propagate? In order to give a proper answer to this question, it is essential to address the following subquestions, which will help shape the outcome.

• Which NLP-techniques can be used to extract linguistic features?

• How does wording affect the distinction between fake and real news?

• Which linguistic features have the most predictive power regarding propagation and distinction?

From the research conducted so far, the expectation is that linguistic features do in fact hold predictive power and can decently distinguish between fake and real news headlines. In the studies above, POS tagging and language models appear to be widely used NLP techniques for extracting linguistic features. Along with that, certain words seem to help the distinction, especially when it comes to propagation. POS tags, and in particular sequences of POS tags, which give a feel of the syntactic structure used, seem to give the most insight regarding distinction and predictive power. Hence, these subquestions shall aid in answering the research question.

This thesis first discusses related work that will assist in shaping the methodology, which is elaborated after that: linguistic features are extracted and classifiers are trained on them to predict whether a statement or headline is fake or not. On the basis of this approach, the results are presented. The conclusion follows, after which the discussion and future work conclude the thesis.


2 Related Work

Thus far, there have been several approaches to topics close to fake news and statements, but as of yet there has been no proper analysis of fake news itself. Exploring the methods used for similar tasks gives insight into how to deal with fake news and statements.

With the use of NLP-techniques, Tan et al. (2014) explored the effect of changing the wording of a tweet on its degree of propagation. The data consisted of tweets from several sources that posted about the same URL but with different wordings. The features were mostly linguistic ones, such as informativeness (length) and requests to share a tweet ("please", "pls share"). Logistic regression applied to the data yielded an accuracy of 98.8% on the most and least retweeted unpaired tweets, using the timing of the tweet and the follower count of the tweeter. These unpaired tweets are neither about the same URL nor from the same author. The use of 39 custom features resulted in a decrease to 63%, though this was applied to paired tweets; it still performed above the paper's baseline. These custom features include pronouns, length, adjectives, sentiment, uni- and bigrams, retweet information and generality. Another finding was that performance appeared to vary with the data size. Eventually, the relevant features were: length, verbs, proper nouns, numbers, positive words, indefinite articles, and adjectives. These features that influence propagation can be used to explore whether they also apply to fake news statements and headlines.

Contrarily, in Potthast et al. (2017) the goal was to classify fake news and to determine whether left- and right-wing news are more alike than mainstream news. The BuzzFeed Fake News dataset, which contains articles manually fact-checked by journalists, was used, and Unmasking was applied to inspect whether left- and right-wing publishers are more alike. The features used were: n-grams, stop words, POS tags and dictionary features. Ultimately, classifying fake news proved to be a complicated task, as the accuracy was less than chance. However, distinguishing hyperpartisan news from mainstream news yielded promising results. This paper indicates that classifying fake news still has a long way to go and that there is certainly a need for better features. It will assist with constructing a base of linguistic features, with the aim of attaining a better result.

Satire and fake news are closely related concepts, so similarly, in Rubin et al. (2016) the goal was to distinguish satire from real news by using textual features of satire. The dataset consisted of satirical news articles and matching legitimate news articles. Machine learning was used for the prediction, and the experiments were conducted with several combinations of features to determine the best performing combination. F-measures and tf-idf³ were used as a baseline. The highest F-measure was achieved when grammar, punctuation, and absurdity were used.

³ Term frequency-inverse document frequency: http://www.tfidf.com/


Individual textual features of syntax and punctuation marks are a reliable indicator of the presence of satire. Along with that, satirical news mostly consists of complex sentences. Ultimately, a precision of 90%, a recall of 84% and an F1-score of 87% were achieved. Headlines turned out to be very relevant for detecting satire, since the first line of a satirical article repeats the content of the headline, whereas real news packs in new information. Additionally, sentence length and complexity were vital as well. The success of the above features might also prove beneficial for fake news and statements; thus, the emphasis in this thesis will also lie on those features.

The challenge of clickbait is quite similar to that of fake news. Biyani et al. (2016) explored clickbait detection in news streams. A large dataset with clickbait and non-clickbait articles, retrieved from various news sources, was used; however, how the articles were labeled is unclear, as it was left unmentioned in the original paper. The approach was to use gradient boosted decision trees, with POS tags as features. Information gain was used to rank the features and discard the k lowest-scoring ones. In the end there were 7677 features, and a precision of 75.5% and a recall of 76% were achieved on the test set. What stood out most were informality features, which capture the level of readability, since clickbait tends to use more informal language than legitimate news. The features used in this paper might prove useful, since clickbait is closely related to fake news headlines and statements.

Potthast et al. (2016) used a different approach for the same task, developing a clickbait model from an annotated dataset of tweets. The features were divided into three categories: teaser message, linked webpage and meta information. For the teaser message, features such as sentiment and sentence complexity were used. Logistic regression, Naive Bayes and Random Forest were applied to the dataset, with precision, recall, and ROC-AUC as metrics. The discovery was that the teaser message features alone outperform the other features with all three algorithms, especially the n-grams, which capture writing style. The ROC-AUC results were 0.74 for Random Forest, 0.72 for Logistic Regression and 0.69 for Naive Bayes. The teaser message features can be applied to fake news statements and headlines, since they seem to be successful for clickbait.

Anand et al. (2016) applied another method, using bidirectional recurrent neural networks (BiLSTMs) to detect clickbait. With hidden states, the algorithm determines whether an input word is worth remembering; to prevent the method from putting more emphasis on recent elements, each input is propagated both forward and backward. The dataset consisted of news headlines, of which 50% were clickbait and the rest non-clickbait. As features, they used word embeddings as well as character embeddings: the former provide insight into the semantic and syntactic properties of the words, while the latter offer insight into orthographic and morphological features. For training, mini-batch gradient descent and a binary cross entropy loss were applied.


The character embeddings and word embeddings were tested separately and combined. When using both of them in a BiLSTM, they achieved an accuracy, precision, recall and F1-score of 0.98 and a ROC-AUC⁴ of 0.99. Thus, it can be interesting to see whether other types of features yield similarly high results for fake news headlines and statements as word and character embeddings do here.

A large problem in combating fake news is the lack of datasets. To alleviate this, Wang (2017) created the LIAR dataset, consisting of 12791 political statements by politicians, labeled with how true or false they are. The labeling was done by Politifact.com editors, a website dedicated to debunking political rumors. As stated by the author, these statements can be used for fake news detection, and because of the dataset's large size it is suitable for machine learning. Using logistic regression, SVM and a bi-LSTM, the aim was to find out whether surface-level linguistic features have some kind of influence. The six-way classification consisted of the following classes: pants on fire, false, barely true, half true, mostly true and true. Using word embeddings, an accuracy of around 25% was achieved with SVM and logistic regression, while the bi-LSTM appeared to suffer from overfitting. Another task was to see whether a hybrid approach with meta-data would yield better results; with the use of a CNN, the accuracy increased to around 27%. This demonstrates that there is definitely scope for improvement. This dataset will also be used in this thesis.

A source of inspiration for more linguistic features that might give insight into the propagation of fake news is Danescu-Niculescu-Mizil et al. (2012). The aim was to discover linguistic features that make a quote memorable. The generated dataset consisted of memorable and non-memorable quotes from movies. To explore the distinctiveness of a quote, a language model was built over common language and the likelihood of each quote was calculated, on words and on POS tags (up to trigrams). In 60% of the pairs, the memorable quote was more distinctive than its forgettable counterpart. In contrast to the latter, memorable quotes contain a distinctive word sequence paired with rather common syntax. The exploration of the generality of memorable quotes made use of the following features: personal pronouns, indefinite articles, past tense and present tense. While personal pronouns were the best indicator, with an accuracy of around 60% on the datasets, the other three features also scored above chance, with accuracies ranging from 55% to 60%. With the use of an SVM, a comparison task was conducted where, given a pair of quotes, the memorable quote had to be predicted. Initially, a bag-of-words approach yielded an accuracy of 59.67%; using distinctive and general features, an increase to 64.46% was observed. The predictive features of this paper will be applied in this thesis, since it might be that fake news headlines are more memorable and thus propagate faster and further.

From the above articles, several features and algorithms can be drawn upon to analyze the linguistic properties of fake news headlines and what makes them propagate.

⁴ The area under a ROC curve: http://gim.unmc.edu/dxtests/roc3.htm



3 Approach

In order to find a proper answer to the research question, certain steps need to be taken. Using the LIAR dataset as a train and development set, the linguistic features that proved useful in the papers discussed above will be extracted. Based on an inspection of the dataset, other features that appear to have some predictive power (e.g. the number of capital letters) will be extracted as well. This is done with NLP-techniques such as n-grams and POS tagging, which give insight into the semantics and syntax of a statement or headline, respectively.

After these features are extracted, different classifiers will be applied to find out which work best for this kind of problem. From there, the underperforming classifiers are left behind because of their lack of predictive power. It is essential to keep in mind that some classifiers are prone to overfitting; to combat this, 3-fold cross-validation is used, both when training and testing on the headlines and when using the train and validation set for the statements. In order to find out which linguistic features have the most predictive power, the accuracy of each individual feature is calculated, and then different combinations of features are used to see which combination works best overall.

The final results are obtained from a new self-annotated dataset of headlines, which is used as a test set. The annotation process is explained in section 4.1.3.

To assign meaning to the outcome of a classifier, evaluation is essential. The metric widely used in the papers above is used in this thesis as well: accuracy, since it gives insight into a classifier's performance. The results then fit into the larger picture and a fair comparison can be made.

3.1 Algorithms & NLP-techniques

3.1.1 N-grams

N-grams are a powerful NLP-technique for capturing semantic and syntactic sequences. A sentence is divided into smaller sequences, where the parameter n indicates the number of elements in each sequence. For instance, applying word n-grams with n = 2 to the sentence "I like bread" yields: [(<s>, "I"), ("I", "like"), ("like", "bread"), ("bread", </s>)]. Here, start and stop symbols are added to provide context at the sentence boundaries. The larger the n, the more context is captured. This technique can be applied to POS tags as well.
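A minimal sketch of such padded n-gram extraction (not the exact implementation used in this thesis):

```python
def ngrams(tokens, n):
    """Pad with start/stop symbols and return all n-grams as tuples."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

print(ngrams("I like bread".split(), 2))
# [('<s>', 'I'), ('I', 'like'), ('like', 'bread'), ('bread', '</s>')]
```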

10

Page 11: The Linguistic Features of Fake News Headlines and Statements · sify fake news, the accuracy being less than chance. This demonstrates the scope of improvement regarding this classifying

3.1.2 POS Tagging

Some words have the same form but vary in meaning; for example, the word fear can be a verb as well as a noun. To find out which part-of-speech category a word belongs to, POS tagging is used. Given a sentence, a tagger returns a sequence of the same length that contains, instead of the words, the POS tag of each word.
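For example, with nltk's default tagger (assuming its tokenizer and tagger models are downloaded; exact tags may vary by version):

```python
from nltk import word_tokenize, pos_tag

# The same surface form receives a different tag depending on its role.
print(pos_tag(word_tokenize("They fear the outcome")))  # 'fear' tagged as a verb (VBP)
print(pos_tag(word_tokenize("The fear is real")))       # 'fear' tagged as a noun (NN)
```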

3.1.3 Sentiment Analysis

In order to understand the emotion of a text, sentiment analysis is used. This is a widely used NLP-technique where, given some text, its sentiment is extracted. Most methods do this by assigning scores to words that are highly indicative (e.g. 'amazing' or 'horrible'); the total score over the whole sentence then determines the sentiment of the text.
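A toy sketch of this scoring idea, with a tiny hand-made lexicon (the scores are made up; real lexicons are far larger):

```python
# Hypothetical word scores in [-1, 1]; negative means negative sentiment.
LEXICON = {"amazing": 1.0, "good": 0.5, "horrible": -1.0, "bad": -0.5}

def sentence_sentiment(sentence):
    """Average the scores of the indicative words; 0.0 means neutral."""
    scores = [LEXICON[w] for w in sentence.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(sentence_sentiment("An amazing story with a horrible ending"))  # 0.0
```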

3.1.4 Likelihood

To decide how probable the occurrence of a sentence is, its likelihood is calculated. Language models are used for this, where word frequencies determine how probable a word is. For unigrams, the probability of a word is its frequency divided by the total number of words in the corpus. For n equal to or greater than two, the probability of an n-gram is its frequency divided by the frequency of its (n-1)-gram prefix. The probabilities of the individual n-grams are then multiplied, which results in the likelihood of the sentence:

P(\text{sentence}) = \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})

where m is the number of n-grams in the sentence.

If the probability of a single n-gram is 0 because of words unseen in the corpus, the likelihood of the whole sentence becomes 0. This is undesirable, and smoothing is used to prevent it. In this thesis, Good-Turing smoothing is used, which reallocates probability mass from observed n-grams to unseen ones based on the frequencies of frequencies (Jurafsky (2000)).
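A minimal sketch of a smoothed unigram likelihood model over a news corpus (the thesis also uses bigrams; this simplification assumes nltk and its Reuters corpus data are installed):

```python
import math
from nltk import FreqDist
from nltk.corpus import reuters  # requires nltk.download('reuters')
from nltk.probability import SimpleGoodTuringProbDist

# Unigram counts over the corpus; Good-Turing reserves mass for unseen words.
fdist = FreqDist(w.lower() for w in reuters.words())
model = SimpleGoodTuringProbDist(fdist)

def log_likelihood(sentence):
    """Sum of log-probabilities of the words (unigram model, no padding)."""
    return sum(math.log(model.prob(w.lower())) for w in sentence.split())

print(log_likelihood("the senator voted against the bill"))
```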

3.1.5 Naive Bayes

Assuming that all features are independent of each other, Naive Bayes is a fast algorithm. Regardless of this "naive" assumption, it is a decent classifier: given a set of features, the probability of those features belonging to each class is calculated, and the class with the highest probability is chosen. Even though features are rarely independent of each other, it still works well when the dependencies are evenly distributed (Zhang (2004)). It shall be seen whether this is also the case for text.

11

Page 12: The Linguistic Features of Fake News Headlines and Statements · sify fake news, the accuracy being less than chance. This demonstrates the scope of improvement regarding this classifying

3.1.6 k-Nearest Neighbors

This simple algorithm is a distance-based classifier, where each new datapoint is compared to the existing datapoints, i.e. its neighbors. The new datapoint is assigned the class that prevails among the existing datapoints closest to it⁵, where k indicates the number of neighbors checked when a new datapoint has to be assigned a label.

3.1.7 Decision Trees

Decision Trees are a family of algorithms that will be used as classifiers in this thesis. Trees are generated from the data by extracting if-then rules, according to which the trees are modeled. There are many variants of this algorithm, and it appears to fit the data well; the variants used in this thesis are discussed below.

• Random Forest: In a Random Forest⁶, the splits in the decision trees are based on random subsets of features, instead of taking all of them into consideration. A disadvantage is that the bias increases; on the other hand, the variance decreases, as a result of the bias-variance trade-off.

• Extra Trees: This is another randomized decision tree algorithm, with the same advantages and disadvantages as Random Forest. The difference is that the splits per random subset of features are chosen from random thresholds.

• Adaboost: Freund and Schapire (1995) introduced this algorithm, where weak classifiers are fit on reweighted versions of the data over multiple iterations⁷. This gives the classifier the advantage of picking up on the smallest distinguishing details.

• Gradient Boosting: This tree classifier makes use of differentiable loss functions⁸, which makes it a strong classifier that can handle outliers well.

3.1.8 SVM

The support vector machine is an algorithm that will be used for classification in this thesis⁹. A shortcoming is that it takes some time to run, since it is computationally expensive. Nevertheless, it does perform well when the data is high-dimensional.

3.1.9 Logistic Regression

Logistic Regression is an algorithm that uses a logistic function to decide which class a datapoint belongs to¹⁰, by minimizing a cost function.

⁵ http://scikit-learn.org/stable/modules/neighbors.html
⁶ http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees
⁷ http://scikit-learn.org/stable/modules/ensemble.html#adaboost
⁸ http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting
⁹ http://scikit-learn.org/stable/modules/svm.html#classification
¹⁰ http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


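To make the evaluation setup concrete, the sketch below compares these scikit-learn classifiers with 3-fold cross-validation; X and y are random placeholders standing in for the extracted features and labels, not the thesis's data:

```python
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Placeholder data: 300 items, 20 non-negative features, three classes.
rng = np.random.default_rng(0)
X = rng.random((300, 20))
y = rng.integers(0, 3, 300)

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "k-NN": KNeighborsClassifier(),
    "ExtraTrees": ExtraTreesClassifier(),
    "RandomForest": RandomForestClassifier(),
    "Adaboost": AdaBoostClassifier(),
    "GradientBoosting": GradientBoostingClassifier(),
    "SVM": SVC(),
    "Logistic": LogisticRegression(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=3)  # 3-fold cross-validation
    print(f"{name}: {scores.mean():.2%}")
```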

4 Experiments and Results

4.1 Data

Due to its recent manifestation, not many resources are available when it comes to fake news. Another complicating factor is the notion of knowing what is fake and what is real: the line between the two is blurred, and content can at times consist of mixed information. Few people are experts at knowing the truth, so finding a reliable dataset is a challenge. Significant time was dedicated to obtaining a proper dataset; the candidate datasets considered in this thesis are therefore discussed below, along with why they were not suitable.

4.1.1 Possible Datasets

At first, the idea was to use the BuzzFeed Fake News Dataset¹¹, which contains political Facebook posts labeled with how true or false they are. However, the label distribution did not contain many fake posts in comparison to true posts. Another downside was that the Facebook Graph API would not suffice to extract the headlines from the given links, and scraping Facebook without permission is illegal. Thus, this dataset was dropped.

Then, the dataset from the Fake News Challenge¹² was considered. This challenge aims to use Artificial Intelligence to restrain the circulation of fake news. A shortcoming, however, was that the headlines were labeled by their stance toward the content of the article, i.e. does the headline agree with the article, or does it contain information that contradicts it? The organizers opted for such a classification task because of the complexity that labeling an article as true or false brings with it. Since this is a slightly different goal, this dataset was left behind as well.

The last contender was the Kaggle Fake News Dataset¹³, which consists of news articles originating from known fake news sources. That origin, however, does not guarantee that every article from such a source is fake: in some instances the reported news might turn out to be true, or a mix of false and true information. Furthermore, a look at the data made clear that the items in this dataset are a rather obvious and extreme version of fake news, using formatting that makes it too obvious what is real and what is fake, which does not address the actual problem of content that is not immediately detectable by the human eye.

Now that the possible datasets and their shortcomings have been addressed, the datasets used in this thesis will be introduced and explained thoroughly.

¹¹ https://github.com/BuzzFeedNews/2016-10-facebook-fact-check
¹² http://www.fakenewschallenge.org/
¹³ https://www.kaggle.com/mrisdal/fake-news



4.1.2 LIAR Dataset

The LIAR Dataset (Wang (2017)) consists of 12791 short political statements obtained from Politifact.com. These statements are labeled with how true they are, thus also indicating their degree of falsity. As suggested by its makers, the dataset can be used for fake news detection. The distribution is as follows:

pants-fire false barely-true half-true mostly-true true

1047 2507 2103 2627 2454 2053

Table 1: Original distribution of data

As can be seen in the table above, some labels overlap, and in contrast to the original paper, the focus of this thesis is solely on distinguishing between fake and real news. Thus, the decision was made to create new labels that capture the uniformity between the six labels, yet do not eliminate essential information. The resulting labels and their distribution can be seen in Table 3; Table 2 demonstrates the conversion from old label to new label.

old label    new label
false        false
pants-fire   false
barely-true  false
half-true    half-true
mostly-true  true
true         true

Table 2: Old labels to new labels

false half-true true

5657 2627 4507

Table 3: New distribution of labels
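As an illustration, the conversion of Table 2 can be applied with a simple mapping (a pandas sketch; the column handling in the actual experiments may differ):

```python
import pandas as pd

# The six-to-three conversion from Table 2.
LABEL_MAP = {"pants-fire": "false", "false": "false", "barely-true": "false",
             "half-true": "half-true", "mostly-true": "true", "true": "true"}

labels = pd.Series(["pants-fire", "mostly-true", "half-true", "true"])
print(labels.map(LABEL_MAP).tolist())  # ['false', 'true', 'half-true', 'true']
```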

The dataset was already divided into a train, validation and test set. When using the train and validation set together, the distribution was as follows, with a total of 11525 items:


false half-true true

5104 2362 4059

Table 4: Distribution of the train and validation set

From the above distribution, the prior probability of everything being classified as false is around 44.28%, for half-true around 20.49%, and for true around 35.21%. Thus, the baseline when using the train and validation set is the highest of these: 44.28%.

Figure 1: Example LIAR Dataset, above: false, below: true

4.1.3 Headlines Dataset

The final set of experiments on headlines uses a dataset that was self-annotated for this thesis. The rumor citation dataset¹⁴ consists of three different sources, each containing claims that have circulated on the internet together with their respective truth values. To support the truth value assigned to a claim, different (news) articles are cited. The headlines of these news articles come from two of the sources in the dataset: Emergent and Politifact. Emergent¹⁵ is a website that debunks rumors, as is Politifact¹⁶, which focuses more on political rumors.

The Politifact data retrieved from Kaggle did not come with a label for the pages supporting a claim, so these were annotated by looping over the pages and manually checking each page's stance on the website. Some claims cited interviews, YouTube video titles and law enforcement webpages, data irrelevant for this thesis, so these were discarded. A total of 500 items were manually labeled.

In the Emergent dataset, each claim was labeled as having an uncertain, true or false truth value. Just like the previous source, pages were indicated, but here each page was also labeled with its stance toward the claim: for, against, observing or discussing. Since observing and discussing do not say much about the validity of the content of a news article, those were neglected. By comparing the legitimacy of the claim with the page's viewpoint on it, a truthfulness label could be assigned to the page. For instance, as can be seen in Figure 2, when a claim is false and the page is against it, the label of that page is true. Unlike the Politifact source, this process was automated rather than manually labeled, since it is easily programmable.

¹⁴ https://www.kaggle.com/arminehn/rumor-citation
¹⁵ http://www.emergent.info/
¹⁶ http://www.politifact.com/


Figure 2: Example Emergent Dataset

Combining the two sources into one dataset of news headlines yields the following distribution:

false  half-true  true

284 0 318

Table 5: Data distribution of manually annotated dataset headlines

4.1.4 Pre-processing

From both of the datasets mentioned above, the statements and the headlines are the only necessary information. The LIAR Dataset is already clean, hence not much had to be done to retrieve the statements. To work with the dataset conveniently, it was loaded into a pandas DataFrame¹⁷, which makes it easy to fetch individual features or rows.

Regarding the other dataset, the Emergent source already contained the headline as a separate field, so no prior processing was needed. Politifact, however, contained the headline along with the website of origin and the date, mostly in the following format: website, "headline," date. To extract only the headlines, a regular expression following that specific pattern was used.
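A sketch of such an extraction; the citation string below is a made-up example of the described format:

```python
import re

# Hypothetical Politifact citation in the format: website, "headline," date
citation = 'The Washington Post, "Senate passes budget deal," Oct. 27, 2015'

# Grab the text between the first pair of double quotes,
# dropping the trailing comma that sits inside the quotes.
match = re.search(r'"(.+?),?"', citation)
if match:
    print(match.group(1))  # Senate passes budget deal
```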

As mentioned before, the statements and headlines are the only information needed, and the linguistic features are extracted from them. The only data pre-processing conducted was for features whose values are lexical classes rather than binary or numeric values: these were converted with the LabelEncoder from scikit-learn¹⁸, which turns lexical values into integer classes. The features themselves are discussed in the following section.
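For instance, encoding a lexical feature such as mood (a minimal sketch):

```python
from sklearn.preprocessing import LabelEncoder

# Turn lexical feature values into integer classes (sorted alphabetically).
encoder = LabelEncoder()
moods = ["indicative", "imperative", "indicative", "conditional"]
print(encoder.fit_transform(moods).tolist())  # [2, 1, 2, 0]
```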

4.2 Methodology

To find linguistic differences between fake and real statements and headlines, features are extracted on the semantic and syntactic level. Below is the list of features, with an explanation of what each stands for; the manner of extraction for the non-trivial features is explained in section 4.2.1. Many of these are inspired by features discussed in section 2 that have turned out to be helpful in related areas.

¹⁷ https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
¹⁸ http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html



• Amount Sentences: How many sentences does the item contain?
• Length: How long is the item on a word level?
• Mood: Is the statement/headline indicative, imperative, conditional or subjunctive?
• Amount Capital Letters: How many capital letters does the sentence contain?
• Ratio Capital Letters: The number of capital letters relative to the number of words. Abbreviations tend to consist of capital letters only, so the ratio gives a feel for how balanced the sentence is.
• Amount Punctuation Marks: How many punctuation characters are there?
• Punctuation Marks: Does the sentence contain any punctuation marks? (One feature per punctuation symbol.)
• Amount Quotes: How many quotes does the statement or headline contain?
• Average Quote Length: The summed length of the quotes, divided by the number of quotes.
• POS Tag n-grams: The sequence of POS tags of a statement or headline, up to n = 3.
• Sentiment: Is it a positive or negative statement?
• Subjectivity: How subjective is the statement? This ranges between 0 and 1.
• Word n-grams: Sequences of words, up to n = 2.
• Likelihood of words: How rare is the occurrence of the words used in the sentence, also up to n = 2.
• Likelihood of POS tags: The same as for the words, but over the POS tags.
• Commonness of words: How many common words are used in the sentence?
• Amount of definite articles: How many times does the word the occur in the sentence? This gives an indication of generality.
• Amount of indefinite articles: How many times do the words a and/or an occur in the sentence? This, too, indicates generality.

4.2.1 Feature Extraction

Now that it is clear which features will be used, the method of extraction is explained below. Trivial features, e.g. the length of a headline or the presence of punctuation marks, are skipped.

When it comes to the mood of a sentence, the pattern.en package¹⁹ was used

¹⁹ http://www.clips.ua.ac.be/pages/pattern-en


to retrieve it, since the language of the statements and headlines is English and the detailed documentation, which makes clear what the input and output of each function are, made the package easy to work with. When feeding a sentence as input, one of the four moods is returned. The same package was used to extract the sentiment and subjectivity of a headline or statement: given a sentence, the sentiment and subjectivity are returned in a Python tuple. Sentiment ranges between -1 and 1, where -1 is negative, 0 is neutral and 1 is positive, whereas subjectivity ranges between 0 and 1, with 0 indicating an objective sentence and 1 the opposite, a subjective one.
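A sketch of this extraction (assuming pattern 2.6's API, in which mood() takes a parsed Sentence):

```python
from pattern.en import sentiment, parse, Sentence, mood

text = "The senator definitely misled voters about the budget."

# sentiment() returns a (polarity, subjectivity) tuple:
# polarity in [-1, 1], subjectivity in [0, 1].
polarity, subjectivity = sentiment(text)

# mood() returns "indicative", "imperative", "conditional" or "subjunctive".
sentence_mood = mood(Sentence(parse(text, lemmata=True)))

print(polarity, subjectivity, sentence_mood)
```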

The POS tag sequence of a sentence was retrieved with the POS tagger from nltk²⁰, which, given a sentence, returns its POS tag sequence.

Another feature explored in this thesis is the commonness of words and POS tag sequences. To capture commonness, a likelihood model was built over the Reuters Corpus retrieved from nltk. This method was also used in Danescu-Niculescu-Mizil et al. (2012), with the Brown Corpus instead. Language models give insight into the complexity of words and POS tag sequences: does a sentence contain prevalent words and/or syntax, or unusual wordings and/or sentence structures? The language model built over the Reuters Corpus is used to calculate the likelihood of each sentence. A lower likelihood indicates a rarer sentence, while a higher likelihood makes the sentence more likely to occur.

Additionally, a different method was used for the same type of feature, based on the frequencies of words in the Reuters Corpus. A cut-off frequency k is chosen; then, looping over the words of a sentence in the dataset, the frequency of each word is looked up in the Reuters Corpus. If the frequency is equal to or larger than k (which means the word is rather common), the total common score of the sentence is increased by 1; if it is lower than k, the score remains the same. For instance, take the sentence "Man exquisitely nabs Superman-like vigilante." and the corpus frequencies displayed in Table 6, with k = 5. The total common score of this sentence is then [1 + 0 + 0 + 0 + 0] = 1.

²⁰ http://www.nltk.org/book/ch05.html


Word         Frequency
superpowers  15
arrow        10
man          9
nabs         4
green        2
evil         1

Table 6: Example to calculate total common score

After looking at the distribution of the frequencies in the Reuters Corpus, a sample of which can be seen in Figure 3, the decision was made to set k equal to 500, since words below that frequency started to become more topic-specific. As mentioned before, the Reuters Corpus gives an indication of which terms can be deemed common or rare in news articles: whereas outstanding, which has a frequency larger than k, is a general and widely used word, Gulf and Reuters do not seem to be ordinary terms that occur in general usage.

Figure 3: Sample distribution frequencies Reuters corpus
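A sketch of the common-score computation described above (assuming nltk's Reuters corpus data is available):

```python
from nltk import FreqDist
from nltk.corpus import reuters  # requires nltk.download('reuters')

# Frequency of every word in the Reuters corpus.
freqs = FreqDist(w.lower() for w in reuters.words())
K = 500  # the cut-off frequency chosen above

def common_score(sentence):
    """Count the words whose Reuters frequency reaches the cut-off K."""
    return sum(1 for w in sentence.split() if freqs[w.lower()] >= K)

print(common_score("Man exquisitely nabs Superman-like vigilante."))
```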

4.2.2 Feature Selection

Features that lack variance are not desirable in the dataset, since they do not contribute to the distinction when it comes to prediction: if a feature has the same value for all entries across different labels, its consistency cannot yield desirable results, as it does not say anything significant. Such features were discarded by checking which features contain only one kind of value. The result of this is discussed in section 4.3.1.


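A sketch of this zero-variance filtering with scikit-learn's VarianceThreshold (the feature frame here is made up):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Made-up feature frame with one constant (zero-variance) column.
X = pd.DataFrame({"length": [5, 9, 7], "has_tilde": [0, 0, 0], "n_caps": [1, 4, 2]})

selector = VarianceThreshold(threshold=0.0)  # drops features with a single value
X_reduced = selector.fit_transform(X)

print(X.columns[selector.get_support()].tolist())  # ['length', 'n_caps']
```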

4.3 Results

4.3.1 Dropped features

As mentioned in section 4.2.2, the features without any kind of variance were discarded. This resulted in the following punctuation features being dropped: ['&', '*', '<', '>', '^', '{', '|', '}', '~']. Dropping these features makes sense, since these characters barely occur in news articles, with the possible exception of &.

4.3.2 LIAR Statements - Train and Validation Set

At first, the simple features, along with the unigrams, were used to get a sense of what could be expected from the performance of the classifiers. The cut-off value for the unigram frequencies here is k = 55. This gives insight into which classifiers work better with the dataset and which underperform. To keep the classifiers from overfitting, 3-fold cross-validation was used. The accuracies can be found in Table 7. As can be seen, Naive Bayes underperforms compared to the other classifiers. This might be because of the independence assumption, which does not hold for text, given how heavily dependent words are on each other.

Algorithm         Accuracy
Naive Bayes       35.89%
k-NN              42.42%
ExtraTrees        46.11%
RandomForest      45.70%
Adaboost          48.39%
GradientBoosting  49.03%
SVM               48.06%
Logistic          48.76%

Table 7: Accuracies Basic Features + Unigrams

However, it should be noted that when using word and POS tag n-grams, treating every n-gram as a feature leads to a sparse dataset. Thus, the decision was made to experiment with different cut-off values for the frequency an n-gram needs in order to be considered a feature: for instance, with a cut-off value of k = 50, all n-grams with a frequency below 50 are not considered as features. Plotting the accuracy of all the simple features plus the unigrams against the value of k gives insight into what works best. The result for the word unigrams can be observed in Figure 4, where, once again, Naive Bayes underperforms. The best performing algorithms, Logistic Regression and Gradient Boosting along with SVM and Adaboost, show an arc, rising until k is around 100 and then holding steady before a slight dip around k = 200. Therefore, it can be concluded that k = 100 works best for word unigrams, since the accuracy is high there while not much information is discarded.



Figure 4: Plot of different cut-off values; word unigrams
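A sketch of how such a frequency cut-off could be applied to unigram features (the helper function select_unigrams is hypothetical):

```python
from collections import Counter

def select_unigrams(statements, k=100):
    """Keep only unigrams whose total frequency in the data reaches k."""
    counts = Counter(w for s in statements for w in s.lower().split())
    return {w for w, c in counts.items() if c >= k}
```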

In Figure 5, the same pattern observed for the word unigrams can be seen for the word bigrams, but with a different value of k: there is a rise until k = 50, and a slight dip around k = 100. Hence, the cut-off value for the frequencies of word bigrams seems optimal at k = 50.

Figure 5: Plot of different cut-off values; word bigrams

The equivalent process was carried out for the POS tag n-grams. Unigrams were not plotted, since there is only a limited number of POS tags, but bigrams and trigrams of POS tags bring along many different combinations.


When it comes to POS tag bigrams, Figure 6 shows a different pattern, where only logistic regression seems to perform well. Since most of the classifiers achieve their highest accuracy at the beginning, the optimal k was set to 30.

Figure 6: Plot of different cut-off values; POS tags bigrams

Looking at Figure 7, k turns out to be optimal at 100 for the POS tag trigrams, since the accuracy settles there after rising, once again dipping slightly at the end.

Figure 7: Plot of different cut-off values; POS tags trigrams

In order to get a feel for which features work for distinguishing statements, the accuracy of each individually predicting feature was calculated, so that insight into its predictive power could be gained.


A χ²-test had been the first choice, but it does not work with negative values, which were present in the dataset (e.g. the sentiment of a headline). In Appendix A, the accuracies for each feature and algorithm are available in a table. The most noteworthy features performing above the baseline were the unigrams, the POS tag trigrams and the POS tags JJS and JJR, which represent superlative and comparative adjectives, respectively. The other features did not perform better than the baseline. Table 8 showcases the noteworthy features, along with two of the features that have been linked with memorability; those features, as can be seen, did not yield promising results.

                   Unigrams  Tri POS  Likelihood Uni  Likelihood Bi  JJS     JJR
Extra Trees        43.63%    43.08%   44.28%          44.28%         45.17%  45.79%
Random Forest      44.99%    42.98%   44.28%          44.28%         45.17%  45.79%
Adaboost           47.32%    45.05%   44.28%          44.28%         45.17%  45.79%
Gradient Boosting  48.60%    45.97%   44.28%          44.28%         45.17%  45.79%
SVM                44.29%    44.46%   44.28%          44.28%         45.17%  45.79%
Logistic           47.80%    44.99%   44.28%          44.28%         45.17%  45.79%

Table 8: Accuracy per feature, train + valid statements set

4.3.3 Statements on Headlines

To discover which features of statements transfer properly to the headlines, the same strategy applied to the statements themselves was applied here as well: for each individual feature, the accuracy was calculated to gauge its predictive power. Table 9 showcases some features. Since training on the statements means that the most common label is false, the baseline for applying the classifiers to the headlines is 47.17%. Once again, the likelihood of a sentence using unigrams does not seem to have any predictive power. The performance of the unigrams themselves varies per algorithm: Logistic Regression works well with this feature, while others even perform under the baseline, and the same pattern can be seen for subjectivity. Once more, the POS tag trigrams have much more predictive power, even reaching an accuracy of around 50% with Adaboost. The same can be said for punctuation, which also yields accuracies of around 50%.

                   Punctuation  POS trigrams  Likelihood Uni  Unigrams  Subjectivity
Extra Trees        50.16%       41.86%        47.17%          42.19%    41.86%
Random Forest      50.49%       47.01%        47.17%          42.69%    43.02%
Adaboost           49.83%       50.16%        47.17%          46.01%    45.84%
Gradient Boosting  47.50%       47.17%        47.17%          46.51%    43.85%
SVM                47.17%       47.17%        47.17%          47.17%    47.17%
Logistic           47.50%       49.83%        47.17%          48.33%    48.00%

Table 9: Accuracy per feature, train on statements, test on headlines


4.3.4 Headlines

When training and testing on the headlines, the baseline changes again: because the most common label in this set is true, the baseline here is 52.82%. Once again, the likelihood of a sentence using unigrams performs exactly at baseline. However, besides the POS tag trigrams, other features now perform better than the baseline as well. The POS tags yield the highest individual accuracies, around 71% for the Extra Trees classifier. The presence of indefinite and definite articles also has predictive power, as does the sentiment of a headline. This can be seen in Table 10.

                   # Capital Letters  POS Tags  Tri POS  Articles  Sentiment  Likelihood Uni
Extra Trees        59.13%             71.76%    69.59%   57.97%    59.62%     52.82%
Random Forest      59.96%             68.93%    67.94%   57.31%    59.46%     52.82%
Adaboost           60.62%             61.46%    65.61%   54.48%    59.29%     52.82%
Gradient Boosting  59.96%             61.96%    66.27%   54.48%    59.13%     52.82%
SVM                60.29%             60.30%    52.82%   55.65%    53.15%     52.82%
Logistic           60.46%             61.79%    64.78%   55.98%    52.15%     52.82%

Table 10: Accuracy per feature, train on and test on headlines

4.3.5 Total Results

In order to find out how well all the features perform together, the features are combined and the accuracy is calculated for three settings: the train and validation set on statements, training on statements and testing on headlines, and training and testing only on the headlines.

When using the train and validation set for the statements, the following accuracies are achieved:

                  Features + Unigrams  Features + Bigrams  Features + POS Tag Bigrams  Features + POS Tag Trigrams
Baseline          44.28%               44.28%              44.28%                      44.28%
ExtraTrees        45.56%               43.49%              45.35%                      46.28%
RandomForest      44.73%               44.31%              45.85%                      45.94%
Adaboost          48.52%               48.18%              48.01%                      47.63%
GradientBoosting  49.28%               48.35%              48.41%                      48.19%
SVM               48.72%               48.12%              48.51%                      48.03%
Logistic          49.93%               48.96%              47.89%                      47.48%

Table 11: Final accuracies; train and validation set on statements

Most of the performances are above baseline. The highest result is achieved when using all of the features along with the unigrams at the optimal cut-off frequency: Logistic Regression yields the best result, with an accuracy of 49.93%.

When training on the statements and testing on the headlines, the accuracies can be seen in Table 12.


While all of the features combined with the word bigrams struggle to beat the baseline, the POS tag bigrams yield better results than the word bigrams. The best results are achieved when combining all of the features with the POS tag trigrams; once again, Logistic Regression yields the highest accuracy, around 50%.

                  Features + Unigrams  Features + Bigrams  Features + POS Tag Bigrams  Features + POS Tag Trigrams
Baseline          47.17%               47.17%              47.17%                      47.17%
ExtraTrees        47.67%               43.18%              42.52%                      49.00%
RandomForest      48.33%               47.17%              45.18%                      45.01%
Adaboost          47.50%               46.67%              48.17%                      48.33%
GradientBoosting  45.68%               45.51%              46.01%                      45.18%
SVM               47.17%               47.01%              46.84%                      47.01%
Logistic          47.34%               47.17%              49.00%                      50.16%

Table 12: Final Accuracies; training on statements, testing on headlines

Table 13 showcases the accuracies when training and testing on the headlines. Here, all of the classifiers perform better than baseline. The best accuracy is achieved when combining the features with the POS tag bigrams, with Extra Trees reaching an accuracy of around 77.5%.

                   Features + Unigrams  Features + Bigrams  Features + POS Tag Bigrams  Features + POS Tag Trigrams
Baseline           52.82%               52.82%              52.82%                      52.82%
Extra Trees        76.74%               73.58%              77.57%                      74.09%
Random Forest      72.42%               73.25%              76.74%                      76.58%
Adaboost           70.59%               67.77%              74.91%                      70.76%
Gradient Boosting  71.92%               67.77%              75.07%                      72.92%
SVM                62.28%               62.45%              62.62%                      62.62%
Logistic           71.42%               64.12%              76.08%                      73.59%

Table 13: Final Accuracies; Training and Testing on Headlines

To give an overview of the best results, Table 14 contains the best combination of features for each type of training and testing that has been done. To highlight the amount of improvement, the baseline per task has been added as well.


                  Train & Valid: Statements  Train: Statements & Test: Headlines  Train & Test: Headlines
                  Features + Unigrams        Features + POS trigrams              Features + POS bigrams
Baseline          44.28%                     47.17%                               52.82%
ExtraTrees        46.11%                     49.00%                               77.57%
RandomForest      45.70%                     45.01%                               76.74%
Adaboost          48.39%                     48.33%                               74.91%
GradientBoosting  49.03%                     45.18%                               75.07%
SVM               48.06%                     47.01%                               62.62%
Logistic          48.76%                     50.16%                               76.08%

Table 14: Final Best Results


5 Conclusion
The main goal of this thesis is to find out which linguistic features are able to distinguish between fake and real news. This is an essential problem to address, since its ramifications have an impact on daily life. To extract these linguistic features, POS tagging, sentiment analysis and likelihood models are crucial NLP techniques. Certain words do appear to help separate the two types of news, since the overall accuracy rises when word n-grams, especially unigrams, are included as features. Along with that, POS tag sequences, punctuation and generality (because of the good performance of articles) emerge as the features with the most predictive power for discriminating between fake and real statements and news, respectively. In conclusion, the highest achieved accuracy was 49.03% for statements, using Gradient Boosting, around 5% above the baseline. When training on statements and testing on headlines, the best accuracy achieved was 50.16%, with Logistic Regression, 3% above the baseline. With a performance 25% above the baseline, Extra Trees yields the best accuracy for the headlines, achieving 77.57%.

6 Discussion and Future Work
The findings of this thesis reinforce the fact that fake news detection is a difficult task to tackle and needs more research. As in Potthast et al. (2017) and Wang (2017), the results are not strong enough to claim that the problem can be solved thoroughly with NLP.

The biggest problem at hand is the lack of a reliable dataset that contains news articles and their corresponding label of being real or fake news. The LIAR Dataset is the first of its kind of such a large magnitude; however, its main focus is politics and statements. Even though that is a part of fake news, it does make the detection more difficult. As discussed before in section 4.1.1, the current datasets do not provide a proper and solid foundation for the detection. Therefore, reliable resources are essential to be able to conduct proper research on this topic.

Something to be noted is the difficulty of telling what is fake and what is real. Biases might play a role in the determination, as well as a lack of knowledge of which events have played out and how. As can be seen in the self-annotated headlines dataset, there is no such label as half-true. This is because of limited expertise regarding the field, which made it hard to establish whether something is half-true, as this is often not clearly mentioned on PolitiFact's pages either.

When it comes to the extraction of one of the features, there might be a better way to calculate the amount of common words in a sentence. The approach used to extract this feature in this thesis is a rather naive one; finding a better method is therefore essential. One possible direction is sketched below.
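As one illustration of the idea, the sketch below scores a sentence by the share of its tokens that occur in a reference list of frequent words; the COMMON_WORDS set is a tiny placeholder for a real frequency lexicon derived from a large corpus.

# Sketch of a common-word ratio feature. COMMON_WORDS is a placeholder;
# a real lexicon (e.g. the top-k words of a large corpus) is assumed.
COMMON_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and", "that"}

def common_word_ratio(sentence):
    # Fraction of tokens that belong to the common-word list.
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(token in COMMON_WORDS for token in tokens) / len(tokens)

print(common_word_ratio("The senate is in session"))  # 0.6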


In the future, a proper dataset of a large magnitude might enable more room for research with diverse data. Along with that, it might be useful to combine NLP techniques with information retrieval and reinforcement learning, to see if such hybrid approaches improve the distinction. Other linguistic features, such as the amount of spelling and grammar mistakes, can be experimented with as well; a sketch of such a feature follows.
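As a starting point for the suggested spelling feature, the sketch below counts tokens that are missing from a word list; the DICTIONARY set is a placeholder, and a real implementation would load a full lexicon or use a spell-checking library.

# Sketch of a spelling-mistake count. DICTIONARY is a tiny placeholder
# for a full word list; punctuation is stripped before the lookup.
DICTIONARY = {"the", "president", "signed", "a", "new", "bill"}

def misspelling_count(text):
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    return sum(token not in DICTIONARY for token in tokens if token)

print(misspelling_count("The presidant signed a new bil."))  # 2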


7 References
Anand, A., Chakraborty, T., and Park, N. (2016). We used neural networks to detect clickbaits: You won't believe what happened next! arXiv preprint arXiv:1612.01340.

Biyani, P., Tsioutsiouliklis, K., and Blackmer, J. (2016). 8 amazing secrets for getting more clicks: Detecting clickbaits in news streams using article informality. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 94–100. AAAI Press.

Danescu-Niculescu-Mizil, C., Cheng, J., Kleinberg, J., and Lee, L. (2012). You had me at hello: How phrasing affects memorability. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 892–901. Association for Computational Linguistics.

Freund, Y. and Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37. Springer.

Gupta, A. and Kumaraguru, P. (2012). Credibility ranking of tweets during high impact events. In Proceedings of the 1st Workshop on Privacy and Security in Online Social Media, page 2. ACM.

Jurafsky, D. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.

Potthast, M., Kiesel, J., Reinartz, K., Bevendorff, J., and Stein, B. (2017). A stylometric inquiry into hyperpartisan and fake news. arXiv preprint arXiv:1702.05638.

Potthast, M., Köpsel, S., Stein, B., and Hagen, M. (2016). Clickbait detection. In European Conference on Information Retrieval, pages 810–817. Springer.

Rubin, V. L., Conroy, N. J., Chen, Y., and Cornwell, S. (2016). Fake news or truth? Using satirical cues to detect potentially misleading news. In Proceedings of NAACL-HLT, pages 7–17.

Tan, C., Lee, L., and Pang, B. (2014). The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter. arXiv preprint arXiv:1405.1438.

Wang, W. Y. (2017). "Liar, liar pants on fire": A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648.

Zhang, H. (2004). The optimality of naive Bayes. AA, 1(2):3.


A Accuracies of All Features: Statements
The tables below list, per individual feature, the accuracy of each classifier on the statements.


                   Amount Sentences  Length    Mood      Amount Capital Letters  Ratio Capital Letters
Naive Bayes        0.433196          0.439997  0.442264  0.439919                0.447582
k-NN               0.416386          0.397464  0.399102  0.428426                0.399188
Extra Trees        0.442030          0.433664  0.443124  0.447971                0.440153
Random Forest      0.442030          0.433977  0.443124  0.447815                0.435853
Adaboost           0.441952          0.440153  0.443124  0.448206                0.447267
Gradient Boosting  0.442030          0.437730  0.443124  0.448127                0.449301
SVM                0.441873          0.434211  0.443124  0.448362                0.442186
Logistic           0.442264          0.442030  0.442264  0.442108                0.450944

                   Amount of Punctuation Marks  Amount of Quotes  Average Quote Length  PRP$
Naive Bayes        0.441560                     0.435776          0.439058              0.442264
k-NN               0.422243                     0.411385          0.211320              0.415605
Extra Trees        0.441795                     0.442186          0.439763              0.442264
Random Forest      0.442108                     0.442186          0.438746              0.442264
Adaboost           0.441795                     0.442186          0.439137              0.442264
Gradient Boosting  0.441795                     0.442186          0.438825              0.442264
SVM                0.442108                     0.442108          0.439998              0.442264
Logistic           0.440935                     0.442264          0.442030              0.442264

                   VBG       VBD       ``        POS       ''        VBP       WDT       JJ        WP        VBZ       DT        #
Naive Bayes        0.442264  0.442264  0.438903  0.441482  0.442264  0.438276  0.442264  0.442264  0.439057  0.442264  0.442264  0.442030
k-NN               0.389881  0.413410  0.414356  0.412018  0.442264  0.439919  0.338154  0.442264  0.403409  0.395046  0.389655  0.442030
Extra Trees        0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442030
Random Forest      0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264
Adaboost           0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442030
Gradient Boosting  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442030
SVM                0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442030
Logistic           0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264  0.442030

                   RP        $         NN        )         (         FW        ,         TO        PRP       RB        :         NNS
Naive Bayes        0.432255  0.428346  0.442264  0.436948  0.436869  0.442342  0.442264  0.442264  0.442264  0.442264  0.412544  0.442264
k-NN               0.365933  0.388155  0.418192  0.379094  0.374951  0.442342  0.395683  0.411151  0.418577  0.417941  0.442264  0.429281
Extra Trees        0.442264  0.442264  0.442264  0.442264  0.442264  0.442811  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264
Random Forest      0.442264  0.442264  0.442264  0.442264  0.442264  0.442342  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264
Adaboost           0.442264  0.442264  0.442264  0.442264  0.442264  0.442811  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264
Gradient Boosting  0.442264  0.442264  0.442264  0.442264  0.442264  0.442811  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264
SVM                0.442264  0.442264  0.442264  0.442264  0.442264  0.442811  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264
Logistic           0.442264  0.442264  0.442264  0.442264  0.442264  0.442811  0.442264  0.442264  0.442264  0.442264  0.442264  0.442264

                   NNP       VB        WRB       CC        PDT       RBS       RBR       VBN       EX        IN        WP$
Naive Bayes        0.448518  0.442264  0.431789  0.442264  0.442811  0.440623  0.440857  0.442264  0.438042  0.399108  0.205535
k-NN               0.442264  0.401463  0.310623  0.383163  0.442342  0.412096  0.442499  0.401225  0.441874  0.420068  0.442186
Extra Trees        0.448518  0.442264  0.442264  0.442264  0.442811  0.441717  0.444610  0.442264  0.441248  0.442264  0.442186
Random Forest      0.448518  0.442264  0.442264  0.442264  0.442811  0.441717  0.444610  0.442264  0.441248  0.442264  0.442186
Adaboost           0.448518  0.442264  0.442264  0.442264  0.442811  0.441717  0.444610  0.442264  0.441248  0.442264  0.442186
Gradient Boosting  0.448518  0.442264  0.442264  0.442264  0.442811  0.441717  0.444610  0.442264  0.441248  0.442264  0.442186
SVM                0.448518  0.442264  0.442264  0.442264  0.442811  0.441717  0.444610  0.442264  0.441248  0.442264  0.442186
Logistic           0.448518  0.442264  0.442264  0.442264  0.442811  0.441717  0.444610  0.442264  0.441248  0.442264  0.442186

                   CD        MD        NNPS      JJS       JJR       UH        !         "         %         '         +         -
Naive Bayes        0.454147  0.442264  0.438823  0.451723  0.457901  0.352435  0.354468  0.438668  0.441717  0.441560  0.284450  0.436167
k-NN               0.383304  0.377271  0.439294  0.333410  0.370092  0.441951  0.441482  0.407866  0.442108  0.290234  0.442264  0.381745
Extra Trees        0.454147  0.442264  0.442264  0.451723  0.457901  0.442264  0.442264  0.442264  0.442264  0.441091  0.442264  0.442264
Random Forest      0.454147  0.442264  0.442264  0.451723  0.457901  0.442264  0.442264  0.442264  0.442264  0.441091  0.442342  0.442264
Adaboost           0.454147  0.442264  0.442264  0.451723  0.457901  0.442264  0.442264  0.442264  0.442264  0.441091  0.442264  0.442264
Gradient Boosting  0.454147  0.442264  0.442264  0.451723  0.457901  0.442264  0.442264  0.442264  0.442264  0.441091  0.442264  0.442264
SVM                0.454147  0.442264  0.442264  0.451723  0.457901  0.442264  0.442264  0.442264  0.442264  0.441091  0.442264  0.442264
Logistic           0.454147  0.442264  0.442264  0.451723  0.457901  0.442264  0.442264  0.442264  0.442264  0.441091  0.442264  0.442264

                   /         ;         =         ?         @         [         \         ]         `         Sentiment  Headline
Naive Bayes        0.441248  0.441951  0.303319  0.412402  0.333138  0.441561  0.442264  0.441561  0.442264  0.333197   0.439918
k-NN               0.441248  0.442264  0.442264  0.440387  0.442264  0.442264  0.442264  0.442264  0.442264  0.442186   0.422249
Extra Trees        0.442264  0.441951  0.442264  0.442264  0.442108  0.442264  0.442264  0.442264  0.442264  0.442186   0.440311
Random Forest      0.442264  0.441873  0.442264  0.442264  0.442108  0.442264  0.442264  0.442264  0.442264  0.442186   0.439998
Adaboost           0.442264  0.441951  0.442264  0.442264  0.442108  0.442264  0.442264  0.442264  0.442264  0.442186   0.444375
Gradient Boosting  0.442264  0.441951  0.442264  0.442264  0.442108  0.442264  0.442264  0.442264  0.442264  0.442186   0.443437
SVM                0.442264  0.441951  0.442264  0.442264  0.442108  0.442264  0.442264  0.442264  0.442264  0.442186   0.442264
Logistic           0.442264  0.441951  0.442264  0.442264  0.442108  0.442264  0.442264  0.442264  0.442264  0.442186   0.442498

                   Tri POS Likelihood  Uni Likelihood  Bi Likelihood  Uni Tag Likelihood  Bi Tag    Articles  Common
Naive Bayes        0.425511            0.403384        0.442863       0.442863            0.442863  0.442863  0.440954
k-NN               0.445208            0.418654        0.385857       0.392104            0.390802  0.388286  0.365897
Extra Trees        0.436355            0.430802        0.442863       0.442863            0.442690  0.440867  0.442083
Random Forest      0.449978            0.429847        0.442863       0.442863            0.442603  0.441822  0.440087
Adaboost           0.473231            0.450585        0.442863       0.442863            0.442690  0.442603  0.441996
Gradient Boosting  0.486073            0.459784        0.442863       0.442863            0.442690  0.441909  0.441649
SVM                0.442950            0.444685        0.442863       0.442863            0.442863  0.442863  0.441475
Logistic           0.478091            0.449978        0.442863       0.442863            0.442863  0.442863  0.444946