Syntactic, Semantic, and Topics: The Cognitive Framework of Fake News

Leah C. Windsor, Research Assistant Professor, Institute for Intelligent Systems, The University of Memphis
Zhiqiang “Carl” Cai, Research Assistant Professor, Institute for Intelligent Systems, The University of Memphis
James Grayson Cupit, Junior Software Developer, Institute for Intelligent Systems, The University of Memphis


Abstract

The problem with detecting fake news from categories such as bias, conspiracy, hate, fake science, satire, state-run media, and bullshit is that these types of information may appear similar to information coming from reputable news sources. Further, current computational approaches often do no better than chance or human ratings at distinguishing fake from real news in the aggregate. Some researchers have suggested that fake news erodes the foundations of democracy by undermining the role of legitimate journalists reporting accurate information, causing citizens to form erroneous conclusions based on inaccurate information about important scientific, social, and political issues. Using a computational linguistics approach to analyzing news, we can identify language features of syntax, sentiment, and topical variation that distinguish fake from real news. The benefit of doing this lies in the potential downstream automated applications, such as browser extensions that alert users to potentially disreputable or questionable sources or articles and that track emerging trends in fake news. Our paper identifies the syntactic, sentiment, and topical differences in fake and real news using a Kaggle fake news corpus and a proprietary verified news corpus. In this study, we analyzed a corpus of 12,999 posts downloaded from 244 websites identified as “bullshit” by BS Detector and compared it with 6,079 real news articles downloaded from six high-reputation news agency websites.


Wikileaks Exposes Clinton Satanic Ritual, FBI Calls Hillary the Antichrist (Fake News Headline)

State Finds 30 Deleted Clinton Emails On Benghazi

(Real News Headline)

Introduction

The term “fake news” refers to multiple phenomena, including the deliberate spread of false

information, satire, outdated/revived content, hoaxes, clickbait, propaganda, and disinformation.

The key problem with fake news is determining truth from fiction, a form of deception detection;

readers interact with both headlines and the body of text to discern the veracity of the source. At

first glance, fake news may appear similar to news from reputable outlets, both stylistically and

linguistically. In fact, many computational approaches often do no better than chance or human

ratings at distinguishing fake from real news in the aggregate (1–3). Yet, as this paper

demonstrates, syntactic, sentiment, and topical features

provide theoretically meaningful distinctions between fake and real news, namely that fake news

aligns with cognitive frameworks for shallow information processing. Extant research has

examined fake news from various angles, including social media engagement from sources like

Facebook and Twitter (4). We demonstrate that a computational linguistics approach helps

identify language features about the syntax, sentiment, and topical variation that distinguish fake

from real news using both headlines and full article text, providing useful information about how

audiences process news sources in the modern media era.

Our paper identifies the syntactic, sentiment, and topical differences in fake and real

news headlines and full articles using a Kaggle fake news corpus, and a proprietary verified news

corpus. We analyze a corpus of 12,999 posts downloaded from 244 websites identified as “bullshit” by BS Detector and

compare it with 6,079 real news articles downloaded from six high-reputation news agency websites,

including the Wall Street Journal, CNN, Fox News, Reuters, The Economist, MSNBC, and the


Washington Post. We examine the linguistic features of fake and real news using syntactic,

sentiment, and topic modeling methods as well as a long short-term memory (LSTM) neural

network approach, and we do so in four parts: fake news headlines; fake news stories; real news

headlines; and real news stories. There are three key takeaways from our findings: first, fake and

real news use vastly different syntactic patterns that illuminate the cognitive framework,

motivating the persuasive method used by susceptible media consumers; second,

Our paper proceeds as follows: first we briefly discuss some of the relevant literature and

theoretical implications; we then describe our data generating process (DGP) and methods; next,

we present the results of our empirical models for each area (syntax, sentiment, topics); finally,

we discuss and provide context for our findings as well as future applications for this workflow.

Real Approaches to Analyzing Fake News

Fake News Typologies

Recent studies have demonstrated the social and political hazards related to the rise and spread of

fake news. The term ‘fake news’ refers to multiple phenomena, including the deliberate spread of

false information, satire, outdated/revived content, hoaxes, clickbait, propaganda, and

disinformation. Volkova et al. use several linguistic measures to distinguish between these

types of suspicious news items (5), finding that adding syntax and grammar features does not

improve the predictive value of their models, but that linguistic and social interaction features do

improve classification results between the four types of suspicious news stories they investigate

(satire, hoaxes, clickbait, and propaganda). Fake news is more viral than real news, and it

presents more novel information, piquing the curiosity of readers (6). Similarly, Rashkin et al.

use LIWC (Linguistic Inquiry and Word Count) dictionaries for subjective lexicon to classify

news items as propaganda, satire, or hoax (2). Their results demonstrate that exaggerating terms


such as superlatives, subjectives, and adjectives, all appear more frequently in fake news items.

These studies concur that linguistic features of fake and real news are different in substantive

ways; combined with the mainstream media-fueled rising political polarization and partisanship,

fake news sources reach a receptive audience that diffuses this information farther and faster than

reputable media messages spread (6).

Recently, scholars have pioneered automated fake news detection systems by mapping

the diffusion pattern of ‘likes’ and ‘shares’ across social media platforms, while social media

networks such as Facebook have crowd-sourced the problem of identifying fake news (7,8). The

key problem with fake news is determining truth from fiction, akin to deception detection

(Hauch, Sporer, Michael, and Meissner, 2014; Mihalcea and Strapparava, 2009; Rubin and

Conroy, 2011). Readers interact with both headlines and the body of text to discern the veracity

of the source, taking cues from both the content and stylistic elements, such as punctuation,

concrete nouns, and emotional tone. Scholars are exploring many paths to help distinguish

between fake and real news, including automated fake news detection that maps the diffusion

pattern of ‘likes’ and ‘shares’ through automatic hoax detection systems (7). Some fake news

detectors rely on human raters, such as the BS Detector, Fake News Alert, and Politifact. Others, such as FiB and

Stop-the-Bullshit (9), use automated tools built for social media that are not available for

the Internet at large (7,10). Facebook has crowd-sourced solutions to the problem of identifying

fake news, including verified news sites (akin to Twitter’s user verification strategy), separating

‘shares’ from personal information, time delays on ‘reshares’, Snopes partnership, and headline

and content analysis (11,12).


Permutations of fake news are found in media worldwide, and tap into a longstanding

political tradition of distracting domestic audiences using propaganda.1 Recent research on the

“50c army” in China shows that social media posts serve to distract and redirect public narrative

during times of crisis or negative publicity around socially significant events, which may be the

goals of fake news propagators more broadly (13). Crisis propaganda also takes the form of the

“rally ‘round the flag” effect (14–16). Crisis propaganda utilizes the media to foster support for

the leader and to facilitate a sense of unity. In China during times of crisis, the Internet is flooded with an

array of positive and distracting messages, a phenomenon that runs contrary to the standard

narrative of media censorship (13). Rather, the contrived social media posts draw attention away

from an undesirable event or even undermine the credibility of democratic processes, such as

human rights protests in China or the 2016 American presidential election (17). Similar to

propaganda, populism is resurgent across Latin America (18) and throughout Europe (19–21).

Yet broadly speaking, even in democracies citizens have winnowed their news sources, in

part because of a lack of diversity from consolidation of media markets, as well as the rise of the

Internet and increasingly individualized and personalized news consumption (22). Recent

scholarship found that cable media accounts for a substantial portion of recent partisan

polarization in the United States (23). Citizens have comparatively less exposure to a variety of

perspectives than in previous generations with common news sources read or viewed by people

across the political spectrum (24), trust the media and government less overall (25), and

increasingly engage only with news items that reinforce their existing beliefs – a cocktail of self-

selection bias and cognitive dissonance (26).

1 Table 3 in the Appendix provides the definitions used by Sieradski 2016 (9). We use these classifications in our analysis.


Hardwired for Soft News?

Cognitive Framework

Humans have a demonstrably difficult time distinguishing between fake and real news, which

compromises their ability to make informed decisions about a wide array of issues, like

candidates and elections (8), as well as issues that straddle the public-private spheres such as

vaccinations (27). Fuzzy trace theory helps to account for humans’ unreliability in distinguishing

between real and fake news. As Broniatowski notes, when audiences retrieve rote, or verbatim,

information, their reasoning processes are inhibited as compared to gist information that

encourages reasoning. Reyna (2012) writes that, “Verbatim memory is memory for surface form,

for example, memory representations of exact words, numbers, and pictures. Verbatim memory

is a symbolic, mental representation of the stimulus, not the stimulus itself. Gist memory is

memory for essential meaning, the “substance” of information irrespective of exact words,

numbers, or pictures. Hence, gist is a symbolic, mental representation of the stimulus that

captures meaning (28).” Fake news tends to rely more on messages that evoke gist and use

emotional cues rather than facts to convey information. Additionally, gist representations often

correspond to the peripheral route to persuasion that appeals more to emotion than logic, whereas

verbatim representations tend toward the central route that relies more on facts and complex

explanations or descriptions (29,30).

Citizens use these heuristic shortcuts and emotional connections to make decisions about

leaders and issues (31,32), and the attention-grabbing headlines can vary in content from the

articles they summarize (33). Research has demonstrated that some voters disregard “hard”

media sources in favor of informal news sources, “including infotainment”; the reliance on news

from outside of mainstream media coupled with the castigation of this genre follows the logic of


low-information rationality (24,34,35). In other words, voters are cognitive misers: they seek

information from familiar and easy sources, and they integrate information selectively into their

worldview. Worldview itself appears to be hardwired, as evidenced by findings on support for

authoritarianism in the American National Election Studies (36,37). As Lakoff notes, partisans

conceptualize the nature of problems broadly speaking, and specific political problems

themselves, in vastly different ways (38,39). The notion of thought shaping language, and

language shaping thought has implications for how citizens classify, integrate, and/or disregard

information (40–42), especially in the era of ubiquitous fake news.

Syntax and Political Language

The syntactic, sentiment, and topical characteristics of text influence an audience’s receptivity to

an idea. Some audiences receive information best when senders use straightforward,

uncomplicated syntax with abstract concepts, repetitive key words, and within a standard

narrative framework. Others receive information best when presented with complex phrasing and

concrete concepts delivered within a more expository, cohesive framework. These concepts are

operationalized along five syntactic constructs that shape how audiences process and send

information: syntactic simplicity; word concreteness; narrativity; deep cohesion; and referential

cohesion (43). Syntactic simplicity refers to the complexity or simplicity of sentence phrasing.

Left-embedded sentences, i.e., those with dependent clauses before the main subject and verb

(like this sentence) are syntactically complex, require greater focus, and demand a greater

cognitive workload for the listener. Simple syntax generally follows the SVO (subject verb

object) framework, and is easier on the receiver.

Word concreteness refers to the abstractness of the text base. Concrete words are nouns

that have real world extensions, such as car, boat, or chair. Abstract words include concepts like


hope, fear, and security. Simple syntax and abstract concepts often trend together, especially in

persuasive populist rhetoric. Narrativity refers to the narrative arc, where the information is

presented in a storylike fashion with an introduction, rising action, and resolution. The opposite

is expository presentation whereby the sender conveys a litany of information, often organized

thematically or conceptually. A text has deep cohesion when the components of the discourse are

connected by underlying concepts. On the other hand, referential cohesion refers to more

localized co-referents, often sentence-to-sentence or through repetition of a key term. Referential

and deep cohesion are often inversely related; whereas the former is more locally cohesive, the

latter is more globally cohesive. A speaker with high referential cohesion is often more

“quotable”, producing useful sound bites for short media highlights; discourse with greater deep

cohesion often requires summarizing to convey the main point.

Collectively, these five syntactic components provide clues to the route to persuasion

used by a source (29,30,44). The peripheral route often relies on simple syntax and abstract

concepts presented narratively with low deep cohesion and high referential cohesion. This

approach makes information easy to parse and is cognitively un-demanding. The central route to

persuasion, on the other hand, likely has more syntactic complexity and more concrete terms

presented in an expository fashion, having greater deep cohesion and less referential cohesion.

This approach is the more cognitively demanding and often appeals to a more discerning

audience. Given this discussion, we propose the following expectations about real and fake

news:

Expectation 1: Real news will have more syntactic complexity, more concrete words, less

narrativity, more deep cohesion, and less referential cohesion.


Expectation 2: Fake news will have more syntactic simplicity, more abstract words, more

narrativity, less deep cohesion, and more referential cohesion.

Syntax Corpus

Fake news websites may talk about any topic real ones talk about. In this study, we analyzed a

corpus of 12,999 posts downloaded from 244 websites identified as “bullshit” by BS detector

and compared it with 6,079 real news articles downloaded from six high-reputation news agency websites.

The fake news corpus was downloaded from Kaggle (45). Each item contains the publish date,

headline, texts, the source url and other information (see Table 4 in the Appendix). Some of them

had empty texts. All fake news posts were published between Oct. 26, 2016 and Nov. 25, 2016,

the month of the US presidential election. The real news posts were published between June 3, 2016

and June 3, 2017. The posts were collected from 244 websites identified as “bullshit” by the BS

Detector Chrome Extension created by Daniel Sieradski (9). The real news corpus was

downloaded from six websites, published between 6/3/2016 and 6/3/2017, totaling n = 6,079

articles. Of these, 391 were from sub-domains of the above six websites, such as

nytlive.nytimes.com or stream.aljazeera.com.

Methods

T-tests with unequal variances show that all variables save narrativity show statistically

significant differences between fake and real news, with the coefficients in the expected

directions (see Figure 1).
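The unequal-variance comparison described above can be sketched as follows. This is a minimal illustration that computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom for one syntactic component; the function name and the toy scores are hypothetical and are not the study's data. A p-value would additionally require the t distribution, for example from a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Per-group squared standard errors, using sample variances (n - 1).
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical syntactic-simplicity scores, for illustration only.
fake_scores = [0.9, 1.1, 1.3, 0.8, 1.2]
real_scores = [0.2, 0.4, 0.1, 0.5, 0.3, 0.3]
t_stat, dof = welch_t(fake_scores, real_scores)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Because the two corpora differ substantially in size, the Welch form avoids assuming that fake and real news share a common variance.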


Figure 1. T-test with unequal variance for syntactic principal components for fake and real news

Presented differently, Figure 7 shows box plots for the values of each variable for both

fake and real news (where fake news is 0 and real news is 1). Syntactic simplicity refers to the

grammatical complexity of an utterance, including features such as dependent clauses,

conjunctions, and left-embedded phrases that make parsing more taxing for the reader or listener.

Real news is more syntactically complex than fake news, making it more cognitively

demanding. Word concreteness refers to how concrete (nouns that refer to

tangible people, places, or things) or how abstract (concepts such as love, fear, loyalty, or

patriotism) the text base is. Real news uses more concrete terms, whereas fake news uses more

abstract concepts; this aligns with our theoretical expectations about routes to persuasion.

To demonstrate how language aligns conceptually with political partisanship, we

compare the same five syntactic categories using a corpus of speeches given by all major


Republican and Democratic presidential candidates in the 2016 election (for a full list, see Table

5 in the Appendix). Figure 2 shows that the Democratic candidates had more syntactic

complexity, whereas Republican candidates used more straightforward syntactic constructions.

Democratic candidates used more concrete words than Republican candidates, and Republicans’

language was demonstrably more narrative than Democrats’. Finally, Democratic candidates

used less referential cohesion than the Republican candidates, but more deep cohesion. There are

striking parallels between the syntactic structures used in fake and real news and the

ways in which partisans use language, which may have implications for the susceptibility of

audiences to ensnarement by fake news. We discuss this in the conclusions.

Figure 2. T-test with unequal variance for syntactic principal components for Republican and Democratic presidential candidates, 2016


Getting Sentimental About Fake and Real News

Automated sentiment analysis also provides clues about how listeners conceive of the world

around them; this includes how discerning they are about the quality of news information they

consume. Pennebaker and others have demonstrated that sentiment analysis can be utilized to

deduce truthfulness and deception, as well as personality and well-being (46–50). Given that

fake news is, by design, untruthful in a multitude of ways, we generate the following

expectations:

Expectation 3: Fake news text will display features of deceptive language.

Expectation 4: Real news will display features of honest language.

We use the Kaggle “Getting Real about Fake News” dataset as ground truth for fake

news stories, consisting of 12,999 posts from 244 fake news websites. We remove non-English

language entries and entries that lack a headline or full article text, resulting in a usable dataset of

(n = 11,568). For each headline in the fake news corpus, we use the Buzzsumo service to

measure social media engagement of the story (sum of all shares across popular Social Network

platforms). For the sentiment analysis, we assembled news stories published by reputable news

outlets (CNN, Fox News, MSNBC, New York Times, Reuters, and Al Jazeera) to establish a

comparison corpus. We used the Buzzsumo service to search for English language articles from

any web domains associated with those sites (51). This process yields a dataset of headlines

along with social media share counts and URLs. To facilitate a balanced distribution of news

sources, we randomly select 1,833 headlines from each source (n = 10,998). We used an HTML

scraper implemented in Python to retrieve article texts from these URLs. After discarding entries

that have no usable article texts (e.g. videos), we generated a usable corpus of real news articles

(n = 6,081).
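The balanced selection step described above can be sketched as follows, assuming each headline is a record tagged with its source. The field names `source` and `headline`, the function name, and the toy records are hypothetical; the actual Buzzsumo export format is not described here.

```python
import random
from collections import defaultdict


def balanced_sample(records, per_source, seed=0):
    """Randomly keep up to `per_source` records from each news source."""
    by_source = defaultdict(list)
    for record in records:
        by_source[record["source"]].append(record)
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = []
    for source in sorted(by_source):
        items = by_source[source]
        sample.extend(rng.sample(items, min(per_source, len(items))))
    return sample


# Hypothetical records; the study drew 1,833 headlines per source.
records = [
    {"source": src, "headline": f"{src} headline {i}"}
    for src, count in [("cnn", 5), ("reuters", 8), ("foxnews", 3)]
    for i in range(count)
]
subset = balanced_sample(records, per_source=3)
print(len(subset))
```

Capping every source at the same count keeps any single outlet from dominating the comparison corpus.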


We generate four corpora from these datasets: real news headlines, fake news headlines,

real news articles, and fake news articles (see Table 6 in the Appendix). We analyze the

headlines to provide a comparison to other research performed on short utterances, such as

Twitter (5), given that this type of discourse differs from full-text articles. We generated the real

news article corpus by implementing an HTML scraper in Python to retrieve article texts from

the URLs returned by Buzzsumo. We discarded entries that have no usable article text (e.g.

videos) resulting in a usable corpus (n = 6,081). For each document in each corpus, we analyzed

the document using Linguistic Inquiry and Word Count (LIWC) 2015, resulting in 93 measures

per document describing the cognitive, affective, and grammatical processes of the text.

Headlines

For both real and fake news headline corpora, we analyzed document headlines using

Linguistic Inquiry and Word Count (LIWC) 2015, resulting in 93 measures per document

describing the cognitive, affective, and grammatical processes of the text (46). We used a

truncated singular value decomposition (SVD) to compress each data point to the top 70 singular

values, preserving 97.7% of the variance. From this, we performed a t-Stochastic Neighbor

Embedding (t-SNE) to assess the separability of the data (52). We then performed a two-tailed,

independent samples t-test between LIWC features of fake and real headlines and found that 68 of

93 measures are significantly different. Fake news headlines use significantly more quotation

marks, function words, and conjunctions, and less male language (e.g. boy, his, dad) on average

than do real news headlines.
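The dimensionality reduction step can be sketched with NumPy as follows. This is an illustration only: whether the LIWC features were centered or standardized before the decomposition is not stated, so this sketch centers the columns so that squared singular values correspond to variance. The function names and the random stand-in matrix are hypothetical.

```python
import numpy as np


def variance_retained(features, k):
    """Fraction of total variance kept by the top-k singular values."""
    centered = features - features.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    energy = singular_values ** 2
    return float(energy[:k].sum() / energy.sum())


def project_top_k(features, k):
    """Project centered rows onto the top-k right singular vectors."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


# Hypothetical stand-in for a (documents x 93 LIWC measures) matrix.
rng = np.random.default_rng(0)
liwc = rng.normal(size=(200, 93))
reduced = project_top_k(liwc, k=70)
print(reduced.shape, round(variance_retained(liwc, k=70), 3))
```

The reduced matrix could then be passed to a t-SNE implementation to obtain a two-dimensional embedding for visual inspection of separability.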

Full-text articles

We apply an identical methodology to the corpus of full text articles. To first assess the

separability of the data, we perform a truncated singular value decomposition (SVD) to compress


each data point to the top 70 singular values, preserving 97.7% of the variance of all headlines

and 98.7% of the variance for all articles. As shown in Figure 3, we use a t-SNE algorithm to

embed 70 dimensions into two and we observe a greater degree of separability (52). This inspires

confidence in the ability of traditional classification algorithms to perform well on the dataset.

We again performed a two-tailed, independent samples t-test on linguistic features and 80 out of

93 LIWC measures are significantly different. We also found larger effect sizes when

considering the article bodies (see Table 9 in the Appendix). We observed that the full article

text exhibits more intrinsic separability in linguistic space than the headline text.

Figure 3. Results of t-SNE for headlines (left) and full-text articles (right)

The results of the t-SNE process suggest that there are statistically significant differences

in the linguistic properties of real and fake news. To verify this, we perform a two-tailed,

independent samples t-test between the raw LIWC features of fake and real headlines as well as

fake and real articles. For headlines, we observe 68 of the 93 measures to exhibit significance at

the p<0.05 level. In the case of articles, we observe 80 of 93 measures to be statistically


significant. Table 1 lists samples of significant variables along with

effect sizes, calculated using Cohen’s d (53,54) as shown in Equation 1.

Equation 1. Effect size calculation
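A minimal sketch of an effect size calculation follows, assuming the common pooled-standard-deviation form of Cohen's d: d = (M_fake - M_real) / s_pooled, where s_pooled = sqrt(((n_fake - 1) s_fake^2 + (n_real - 1) s_real^2) / (n_fake + n_real - 2)). The sign convention (fake minus real) is inferred from Table 1, where measures that are higher in fake news carry positive values of d. The function name and toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(fake, real):
    """Cohen's d with a pooled standard deviation (fake minus real)."""
    n_f, n_r = len(fake), len(real)
    pooled_var = (
        (n_f - 1) * variance(fake) + (n_r - 1) * variance(real)
    ) / (n_f + n_r - 2)
    return (mean(fake) - mean(real)) / sqrt(pooled_var)


# Hypothetical headline word counts, for illustration only.
fake_wc = [12, 15, 11, 14, 13]
real_wc = [10, 11, 9, 12, 8]
print(round(cohens_d(fake_wc, real_wc), 2))
```

Under this convention a positive d means the measure is higher in fake news and a negative d means it is higher in real news.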

From Table 1 we see that fake news headlines on average have a higher word count; use

more quotations, exclamations, and swear words; and use language that is less analytic and more

certain. When we examine the full article text, we see that fake news articles are much more

focused on the present, much less focused on the past, and are more likely to use personal

pronouns than real articles. Several of these concepts are correlated with honest and deceptive

communication: increased incidence of pronoun usage is associated with truthfulness; our

findings show that fake news has more pronouns than real news. We caution that this should not

be interpreted as fake news conveying truthful intentions; rather, we speculate that real news

coming from mainstream journalists uses fewer pronouns by design to convey a higher register

and more professional, less colloquial, tone. Similar to others’ findings, we see that fake news

headlines have more exaggerated punctuation. Both fake headlines and full-text articles are less

analytic than are real ones as well.

Table 1. LIWC variables showing significant differences between fake and real news

LIWC variable    p       d

Headlines
WC               0.00    0.52
Quote            0.00    0.32
Exclam           0.00    0.20
certain          0.00    0.18
Analytic         0.00   -0.14
swear            0.00    0.11

Full text articles
focuspast        0.00   -0.75
focuspresent     0.00    0.64
Analytic         0.00   -0.43
you              0.00    0.37
we               0.00    0.36
they             0.00    0.13

Topics in Fake and Real News

We constructed a 50-topic model with the combined fake and real stories data using the Mallet topic modeling tool, and the topics were labeled and sorted based on their average proportion scores. Using the tags from the BS Detector Chrome Extension created by Daniel Sieradski (9), we grouped the fake news into 8 categories (see Table 3). The whole corpus of fake and real news items combined equals 18,244 documents. We compared the topic proportion distributions of the real news websites and each category of fake news websites.
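The category-level comparison can be sketched as follows, assuming the topic model yields one vector of topic proportions per document. The function names, topic labels, and toy values are hypothetical and do not reproduce Mallet's output format.

```python
from collections import defaultdict


def mean_topic_proportions(doc_topics, categories):
    """Average each topic's proportion over the documents in each category."""
    sums = {}
    counts = defaultdict(int)
    for proportions, category in zip(doc_topics, categories):
        if category not in sums:
            sums[category] = [0.0] * len(proportions)
        for topic, value in enumerate(proportions):
            sums[category][topic] += value
        counts[category] += 1
    return {cat: [v / counts[cat] for v in totals] for cat, totals in sums.items()}


def top_topics(means, labels, k):
    """Return the k topic labels with the highest mean proportion."""
    ranked = sorted(range(len(means)), key=lambda t: means[t], reverse=True)
    return [labels[t] for t in ranked[:k]]


# Hypothetical 3-topic output; the study used a 50-topic Mallet model.
labels = ["syria", "voting", "zika"]
doc_topics = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]]
categories = ["state", "state", "junksci"]
averages = mean_topic_proportions(doc_topics, categories)
print(top_topics(averages["state"], labels, k=2))
```

Ranking topics by their mean proportion within a category is one straightforward way to arrive at a per-category top-ten list.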

Table 2 shows the top ten topics by news category. State media – news disseminated by official

ministries of information or media – tends to focus on international phenomena. Junk science

captures current controversies in health and nature, while hate media focuses both on concrete

political topics as well as more esoteric concepts. Satirical news spans the topics of social media

and public policy, as well as familiar targets of parody such as ancient aliens. Conspiratorial

news sources blend mainstream political topics with more marginal themes found in other topics

such as hate and junksci. On the surface, BS and biased news sources appear centrist, although

they, too, blend their stories with marginal themes such as Wikileaks. Finally, real news covers

both domestic and foreign policy issues, as well as blanket topics with generic nouns such as

“person, place, or thing (ppt) potpourri”.

Table 2. Top 10 topics by news category

Media Type    Top 10 Topics

State         syria; US defense; brexit; russia; tur-egy-venez1; tur-egy-venez2; intl trade; 2016 election candidates; deals with iran; eurasia-asia

Junksci       medical research; infowars nutrition; zika; public policy; ppt contractions; amgov; time and place; climate change; ppt potpourri; taxes

Hate          public policy; ppt contractions; amgov; hillary emails; social media; 2016 election candidates; enlightenment; nsfw family; world religions; wikileaks clintons

Satire        ppt contractions; time and place; nsfw family; infowars nutrition; tur-egy-venez; enlightenment; social media; public policy; trump presidency; ancient aliens

Fake          trump presidency; code 2; voting; 2016 election candidates; social media; intl trade; wikileaks clintons; ppt contractions; infowars nutrition; public policy

Conspiracy    wikileaks clintons; voting; 2016 election candidates; infowars nutrition; social media; syria; ppt contractions; intelligence; public policy; hillary emails

BS            ppt contractions; public policy; social media; amgov; 2016 election candidates; voting; hillary emails; wikileaks clintons; syria; syr-lib-irq

Bias          2016 election candidates; voting; hillary emails; social media; ppt contractions; public policy; nsfw family; wikileaks clintons; police; trump presidency

Real          trump presidency; 2016 election candidates; tur-egy-venez1; ppt potpourri; intelligence; syria; oil market; time and place; company business; public policy

Figure 4 graphs these categories for all fifty topics by their average topic proportions.

Given that the Kaggle fake news corpus only captures the month before the 2016 presidential

election, we isolate a smaller subset of the real news corpus corresponding to the Kaggle corpus

dates to maintain similarity of comparison.
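The per-category comparison described above can be illustrated with a minimal sketch. It assumes each document has already been reduced to a category tag plus a vector of topic proportions (for example, from a topic model's document-topic output); the function names, the toy labels, and the toy numbers are hypothetical and are not taken from the study's data or code.

```python
from collections import defaultdict


def mean_topic_proportions(docs):
    """Average the topic-proportion vectors of all documents in each category.

    docs: iterable of (category, proportions) pairs, where proportions is a
    list of floats with one entry per topic.
    """
    sums = {}
    counts = defaultdict(int)
    for category, props in docs:
        if category not in sums:
            sums[category] = [0.0] * len(props)
        for k, p in enumerate(props):
            sums[category][k] += p
        counts[category] += 1
    return {c: [s / counts[c] for s in vec] for c, vec in sums.items()}


def top_topics(mean_vector, labels, n=10):
    """Return the labels of the n topics with the highest average proportion."""
    ranked = sorted(range(len(mean_vector)),
                    key=lambda k: mean_vector[k], reverse=True)
    return [labels[k] for k in ranked[:n]]


# Toy data (hypothetical): three topics, two categories.
labels = ["trump presidency", "syria", "zika"]
docs = [
    ("real", [0.6, 0.3, 0.1]),
    ("real", [0.4, 0.5, 0.1]),
    ("junksci", [0.1, 0.1, 0.8]),
]
means = mean_topic_proportions(docs)
top_real = top_topics(means["real"], labels, n=2)
top_junksci = top_topics(means["junksci"], labels, n=1)
```

With real data, `n=10` would yield one row of a table like Table 2 per category, and the full mean vectors would supply the bars for a plot like Figure 4.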

Figure 4. Average topic proportions on each category Oct. 25-Nov. 26, 2016

Topic Changes

While Figure 1 collapses all observations into a topic mean, we also model topic changes over time. To address the shifting content of news stories and the magnitude of change, we examined all topic patterns over time, aggregated by month. The topic proportions for each month are plotted as a stacked area chart in Figure 5, with topics ordered from bottom to top by their average proportions over the whole dataset (18,244 documents). Unsurprisingly, the two major topics in the real corpus were Topics 1 and 2, related to Trump and to the election more generally. Topic 14, related to the police, drops off, while Topics 16 and 12, related to health care and intelligence respectively, increase. The largest change in the proportions of the two major topics occurred in November 2016, when Trump won the election.
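The monthly aggregation behind a stacked area chart of this kind can be sketched as follows. This is an illustration under stated assumptions, not the study's code: each document is assumed to carry a month key in "YYYY-MM" form and a topic-proportion vector, and all names and numbers are hypothetical.

```python
from collections import defaultdict


def monthly_topic_means(docs):
    """Average topic proportions over all documents published in each month.

    docs: iterable of (month, proportions) pairs, with month as 'YYYY-MM'.
    """
    sums = {}
    counts = defaultdict(int)
    for month, props in docs:
        if month not in sums:
            sums[month] = [0.0] * len(props)
        for k, p in enumerate(props):
            sums[month][k] += p
        counts[month] += 1
    return {m: [s / counts[m] for s in vec] for m, vec in sums.items()}


def stacking_order(docs):
    """Order topic indices by average proportion over the whole dataset,
    largest first, so the largest topic sits at the bottom of the stack."""
    docs = list(docs)
    n_topics = len(docs[0][1])
    overall = [sum(props[k] for _, props in docs) / len(docs)
               for k in range(n_topics)]
    return sorted(range(n_topics), key=lambda k: overall[k], reverse=True)


# Toy data (hypothetical): two topics across two months.
docs = [
    ("2016-10", [0.2, 0.8]),
    ("2016-10", [0.4, 0.6]),
    ("2016-11", [0.7, 0.3]),
]
monthly = monthly_topic_means(docs)
order = stacking_order(docs)
```

The `monthly` vectors, rearranged by `order`, are the series a plotting library would stack month by month.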

Figure 5. 50 topics over time (real and fake news combined)

Pre-election topical feeding frenzy

We also examine the topic change in each month by computing the correlation $R_i$ between the topic proportions of month $i$ and the topic proportions of the previous month $i - 1$. We use $C_i = 1 - R_i^2$ to measure the topic change between the two consecutive months. Since $R^2$ is often interpreted as the “proportion of explained variance”, we use $C_i$ to measure the proportion of topic change. Interestingly, the highest topic change (23%) occurred between October and November 2016, exactly the election month (see Figure 6). The topic changes after the election month range from 4% to 12%, much lower than in the months before the election (11% to 23%). As Figure 6 shows, the political issue space experiences increased entropy as both fake and real news sources cover more topics. This finding is interesting in that it demonstrates how media is not simply recycling and rehashing familiar information in the weeks and months prior to the

election; rather, they are introducing new topics for voters to sort, integrate, and/or discard

before casting their votes. It may also represent increased linkages, or chaining, of seemingly

unrelated issues such as belief in the link between vaccinations and autism, belief in deleterious

consequences of fluoridated water, and belief that President Obama is a Muslim (55–57). This

pattern corroborates what Vosoughi et al. found in their study on social engagement and fake

news: fake news has more novelty than real news, and as such is likely to garner more likes and

shares that disperse it more widely (6).
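The month-to-month topic change measure used in this section, one minus the squared correlation between consecutive monthly topic-proportion vectors, can be sketched in a few lines. The text does not specify the correlation type, so Pearson correlation is assumed here; the function names and toy vectors are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def topic_change(prev_props, curr_props):
    """Topic change between consecutive months: 1 - R^2 (Pearson assumed)."""
    r = pearson_r(prev_props, curr_props)
    return 1.0 - r * r


def monthly_changes(monthly):
    """Map each month (except the first) to its change from the prior month.

    monthly: dict of 'YYYY-MM' -> topic-proportion vector; these keys sort
    chronologically as plain strings.
    """
    months = sorted(monthly)
    return {curr: topic_change(monthly[prev], monthly[curr])
            for prev, curr in zip(months, months[1:])}


# Toy data (hypothetical): three topics over three months.
monthly = {
    "2016-10": [0.5, 0.3, 0.2],
    "2016-11": [0.5, 0.3, 0.2],  # identical to October: no change expected
    "2016-12": [0.2, 0.3, 0.5],  # reshuffled emphasis: some change expected
}
changes = monthly_changes(monthly)
```

A value near 0 means the month's topic mix is almost a linear rescaling of the previous month's; a value near 1 means the two mixes are nearly uncorrelated.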

Figure 6. Changes in topics between June 2016 and April 2017

Conclusions

In this paper we have presented three complementary approaches to analyzing fake and real news

for syntax, sentiment, and topics. We extend existing studies of fake and real news by analyzing

the full text of articles. We find that the syntactic properties of fake and real news vary

substantially, as do the sentiment categories. We also observe curious patterns of topic changes

over time, and between types of fake news. As Pennycook and others suggest, audiences who

prefer the peripheral route to persuasion marked by more simple syntax, and abstract terms that

evoke, stoke, and validate feelings and beliefs may be more susceptible to ideas presented by

fake news (58,59). It is also possible that, given the variation in language use by partisan

candidates, individuals are inclined to receive information presented through either central or

peripheral cues. This raises a chicken-and-egg question: is fake news compelling and virulent

based on its intrinsic linguistic properties, or are some voters more susceptible to this type of

news because it conforms to a linguistic and cognitive framework with which they are already

familiar? Does fake news introduce worldviews, or does it crowdsurf through pre-existing ones?

We hope to continue to disentangle these complex relationships using computational linguistics

methodology.

The larger implications of fake news bear mentioning: humans have a difficult time

distinguishing between fake and real news, presenting a challenge to the role of the "Fourth

Estate," i.e., free and independent media. Democratic systems of governance rely on expository

journalism to provide information to citizens about leaders and politics. When the veracity of

information from news sources is unreliable, the foundations of democratic checks and balances

are called into question. Some have suggested that fake news is more than a nuisance or passing

phenomenon; inasmuch as fake news fosters political polarization, it undermines moderation

while leading citizens to form erroneous conclusions based on inaccurate information about

important scientific, social, and political issues (60). Some scholars have expressed concern

about fake news undermining governance and affecting freedom of the press (61,62), while

others have sounded an alarm that some countries are in danger of democratic backsliding, citing

media literacy and challenges to independent press as indicators (63). Political propaganda is

widely and reliably used to persuade citizens, and its effectiveness especially in non-democracies

is noteworthy, as a competitive, free media is often repressed and the flow of information tightly

controlled by the ruling elite (64–67). In this scenario, citizens in autocratic regimes like North

Korea are susceptible to propaganda as they lack access to counter-perspectives.

Evidence of foreign influence in the 2016 American election continues to unfold, with the

issue of fake news remaining front and center (68). It may be, as some have suggested, that

audiences are less motivated by partisan cues in evaluating the veracity of news sources, and

more motivated by inertia (69). We see at least three benefits of this approach. First, the full text

of articles provides enough information for linguistic style matching (LSM) and semantic

similarity to sort similar styles, an approach used in plagiarism detection (70,71) and authorship

identification (72,73) that may be used to isolate the intellectual entrepreneurs of fake news

content. Second, we see downstream automated applications, such as browser extensions that

alert audiences to potentially disreputable or questionable sources or articles, as well as track

emerging trends in fake news. Finally, by understanding the cognitive heuristics that characterize

fake and real news, we may be able to better calibrate inoculation strategies to counteract the

deleterious effects of fake news in society and politics.

Works Cited

1. Kucharski A. Post-truth: Study epidemiology of fake news. Nature. 2016;540(7634):525.

2. Rashkin H, Choi E, Jang JY, Volkova S, Choi Y. Truth of Varying Shades: Analyzing Language in Fake News and Political Fact-Checking. In 2017. p. 2921–2927.

3. Horne BD, Adali S. This Just In: Fake News Packs a Lot in Title, Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News. arXiv preprint arXiv:170309398. 2017;

4. Narayanan V, Barash V, Kelly J, Kollanyi B, Neudert L-M, Howard PN. Polarization, Partisanship and Junk News Consumption over Social Media in the US [Internet]. Computational Propaganda Research Project: University of Oxford; 2018. Report No.: 2018.1. Available from: http://comprop.oii.ox.ac.uk/research/polarization-partisanship-and-junk-news/

5. Volkova S, Shaffer K, Jang JY, Hodas N. Separating Facts from Fiction: Linguistic Models to Classify Suspicious and Trusted News Posts on Twitter. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) [Internet]. 2017 [cited 2017 Oct 14]. p. 647–653. Available from: http://www.aclweb.org/anthology/P17-2102

6. Vosoughi S, Roy D, Aral S. The spread of true and false news online. Science. 2018 Mar 9;359(6380):1146–51.

7. Tacchini E, Ballarin G, Della Vedova ML, Moret S, de Alfaro L. Some Like it Hoax: Automated Fake News Detection in Social Networks. arXiv:170407506 [cs] [Internet]. 2017 Apr 24 [cited 2017 Jun 7]; Available from: http://arxiv.org/abs/1704.07506

8. Allcott H, Gentzkow M. Social media and fake news in the 2016 election [Internet]. National Bureau of Economic Research; 2017 [cited 2017 Jun 7]. Available from: http://www.nber.org/papers/w23089

9. Sieradski D. B.S. Detector [Internet]. 2016. Available from: http://bsdetector.tech/

10. Mele N, Lazer D, Baum M, Grinberg N, Friedland L, Joseph K, et al. Combating Fake News: An Agenda for Research and Action. 2017 [cited 2017 Jun 7]; Available from: https://shorensteincenter.org/wp-content/uploads/2017/05/Combating-Fake-News-Agenda-for-Research-1.pdf

11. Funke D. It’s been a year since Facebook partnered with fact-checkers. How’s it going? [Internet]. Poynter Institute. 2017. Available from: https://www.poynter.org/news/its-been-year-facebook-partnered-fact-checkers-hows-it-going

12. Woolf N. How to solve Facebook’s fake news problem: experts pitch their ideas. The Guardian [Internet]. 2016 Nov 29; Available from: https://www.theguardian.com/technology/2016/nov/29/facebook-fake-news-problem-experts-pitch-ideas-algorithms

13. King G, Pan J, Roberts ME. How the Chinese Government Fabricates Social Media Posts for Strategic Distraction, Not Engaged Argument. American Political Science Review. 2017 Aug;111(3):484–501.

14. Baker WD, Oneal JR. Patriotism or opinion leadership? The nature and origins of the “rally’round the flag” effect. Journal of conflict resolution. 2001;45(5):661–687.

15. Groeling T, Baum MA. Crossing the water’s edge: Elite rhetoric, media coverage, and the rally-round-the-flag phenomenon. The Journal of Politics. 2008;70(4):1065–1085.

16. Sobek D. Rallying around the Podesta: Testing diversionary theory across time. Journal of Peace Research. 2007;44(1):29–45.

17. Isaac M, Wakabayashi D. Russian Influence Reached 126 Million Through Facebook Alone. The New York Times [Internet]. 2017 Oct 30 [cited 2017 Oct 31]; Available from: https://www.nytimes.com/2017/10/30/technology/facebook-google-russia.html

18. Love G, Windsor L. Populism and Popular Support: Vertical Accountability, Exogenous Events, and Leader Discourse in Venezuela. Political Research Quarterly. 2017 Oct 25;

19. Alduy C, Wahnich S. Marine Le Pen prise aux mots. Décryptage du nouveau discours frontiste, Paris, Seuil. 2015;94–98.

20. Mudde C. Europe’s Populist Surge: A Long Time in the Making. Foreign Aff. 2016;95:25.

21. Mudde C, Kaltwasser CR. Populism in Europe and the Americas: Threat Or Corrective for Democracy? Cambridge University Press; 2012. 275 p.

22. Bakshy E, Messing S, Adamic LA. Exposure to ideologically diverse news and opinion on Facebook. Science. 2015 Jun 5;348(6239):1130–2.

23. Martin GJ, Yurukoglu A. Bias in Cable News: Persuasion and Polarization. American Economic Review. 2017 Sep;107(9):2565–99.

24. Baum MA, Jamison AS. The Oprah effect: How soft news helps inattentive citizens vote consistently. The Journal of Politics. 2006;68(4):946–959.

25. Keele L. Social capital and the dynamics of trust in government. American Journal of Political Science. 2007;51(2):241–254.

26. Gupta A, Kumaraguru P, Castillo C, Meier P. TweetCred: Real-Time Credibility Assessment of Content on Twitter. In: Aiello LM, McFarland D, editors. Social Informatics [Internet]. Cham: Springer International Publishing; 2014 [cited 2017 Oct 31]. p. 228–43. Available from: http://link.springer.com/10.1007/978-3-319-13734-6_16

27. Broniatowski DA, Hilyard KM, Dredze M. Effective vaccine communication during the disneyland measles outbreak. Vaccine. 2016;34(28):3225–3228.

28. Reyna VF. Risk perception and communication in vaccination decisions: A fuzzy-trace theory approach. Vaccine. 2012;30(25):3790–3797.

29. Petty RE, Cacioppo JT. The effects of involvement on responses to argument quantity and quality: Central and peripheral routes to persuasion. Journal of personality and social psychology. 1984;46(1):69.

30. Petty RE, Cacioppo JT, Strathman AJ, Priester JR. To Think or Not to Think: Exploring Two Routes to Persuasion. In: Brock TC, Green MC, editors. Persuasion: Psychological insights and perspectives, 2nd ed. Thousand Oaks, CA, US: Sage Publications, Inc; 2005. p. 81–116.

31. Popkin SL. Information shortcuts and the reasoning voter. Information, participation and choice: An economic theory of democracy in perspective. 1995;17–35.

32. Groenendyk E. Competing Motives in the Partisan Mind: How Loyalty and Responsiveness Shape Party Identification and Democracy. OUP USA; 2013. 218 p.

33. Andrew BC. Media-generated shortcuts: Do newspaper headlines present another roadblock for low-information rationality? Harvard International Journal of Press/Politics. 2007;12(2):24–43.

34. Baum MA. Sex, Lies, and War: How Soft News Brings Foreign Policy to the Inattentive Public. American Political Science Review. 2002 Mar;96(1):91–109.

35. Reinemann C, Stanyer J, Scherr S, Legnante G. Hard and soft news: A review of concepts, operationalizations and key findings. Journalism. 2012 Feb 1;13(2):221–39.

36. MacWilliams MC. Who decides when the party doesn’t? Authoritarian voters and the rise of Donald Trump. PS: Political Science & Politics. 2016;49(4):716–721.

37. Hetherington M, Suhay E. Authoritarianism, Threat, and Americans’ Support for the War on Terror. American Journal of Political Science. 2011 Jul 1;55(3):546–60.

38. Lakoff G. Moral politics: How liberals and conservatives think [Internet]. University of Chicago Press; 2002 [cited 2012 Nov 23]. Available from: http://books.google.com/books?hl=en&lr=&id=R-4YBCYx6YsC&oi=fnd&pg=PR9&dq=george+lakoff&ots=WMji9KEiUP&sig=bq1V4Ky-K9WvHjJLe3Z8sOVe9BY

39. Lakoff G. Simple Framing: An introduction to framing and its uses in politics. Retrieved September. 2005;20:2005.

40. Lucy JA. Sapir-Whorf Hypothesis. 2015;

41. Staff PO. Correction: The Sapir-Whorf Hypothesis and Probabilistic Inference: Evidence from the Domain of Color. PloS one. 2016;11(8):e0161521.

42. Boroditsky L. How language shapes thought. Scientific American. 2011;304(2):62–65.

43. McNamara DS, Graesser AC, McCarthy PM, Cai Z. Automated evaluation of text and discourse with Coh-Metrix. Cambridge University Press; 2014.

44. Windsor L. The Predictive Power of Political Discourse. Washington D.C.: National Academy of Sciences; 2017 Feb. (Social and Behavioral Sciences Decadal Survey).

45. Risdal M. Getting Real about Fake News [Internet]. [cited 2017 Jun 9]. Available from: https://www.kaggle.com/mrisdal/fake-news

46. Pennebaker JW, Boyd RL, Jordan K, Blackburn K. The development and psychometric properties of LIWC2015. UT Faculty/Researcher Works [Internet]. 2015 [cited 2016 Dec 9]; Available from: https://utexas-ir.tdl.org/handle/2152/31333

47. Pennebaker JW. The secret life of pronouns: How our words reflect who we are. New York, NY: Bloomsbury. 2011;

48. Hancock JT, Curry LE, Goorha S, Woodworth M. On Lying and Being Lied To: A Linguistic Analysis of Deception in Computer-Mediated Communication. Discourse Processes. 2007 Dec 17;45(1):1–23.

49. Slatcher RB, Chung CK, Pennebaker JW, Stone LD. Winning words: Individual differences in linguistic style among US presidential and vice presidential candidates. J Res Pers. 2007 Feb;41(1):63–75.

50. Chung C, Pennebaker J. The Psychological Functions of Function Words.

51. BuzzSumo: Find the Most Shared Content and Key Influencers [Internet]. BuzzSumo. [cited 2018 Jul 15]. Available from: http://buzzsumo.com/

52. Maaten L van der, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research. 2008;9(Nov):2579–2605.

53. Sawilowsky SS. New effect size rules of thumb. 2009 [cited 2017 Oct 13]; Available from: http://digitalcommons.wayne.edu/coe_tbf/4/

54. Cohen J. A power primer. Psychological bulletin. 1992;112(1):155.

55. Barzilay R, Elhadad M. Using lexical chains for text summarization. Advances in automatic text summarization. 1999;111–121.

56. Rapp DN, Braasch JL. Accurate and inaccurate knowledge acquisition. Processing inaccurate information: Theoretical and applied perspectives from cognitive science and the educational sciences. 2014;1–10.

57. Lewandowsky S, Ecker UK, Seifert CM, Schwarz N, Cook J. Misinformation and its correction: Continued influence and successful debiasing. Psychological Science in the Public Interest. 2012;13(3):106–131.

58. Pennycook G, Rand DG. Assessing the Effect of “Disputed” Warnings and Source Salience on Perceptions of Fake News Accuracy [Internet]. Rochester, NY: Social Science Research Network; 2017 Sep [cited 2017 Sep 13]. Report No.: ID 3035384. Available from: https://papers.ssrn.com/abstract=3035384

59. Lazer DMJ, Baum MA, Benkler Y, Berinsky AJ, Greenhill KM, Menczer F, et al. The science of fake news. Science. 2018 Mar 9;359(6380):1094–6.

60. Levitsky S, Ziblatt D. Opinion | How Wobbly Is Our Democracy? The New York Times [Internet]. 2018 Apr 13 [cited 2018 Jul 13]; Available from: https://www.nytimes.com/2018/01/27/opinion/sunday/democracy-polarization.html

61. Wandhöfer T, Taylor S, Walland P, Geana R, Weichselbaum R, Fernandez M, et al. Determining citizens’ opinions about stories in the news media: analysing Google, Facebook and Twitter. eJournal of eDemocracy & Open Government (JeDEM). 2012;4(2):198–221.

62. Stier S, Bleier A, Lietz H, Strohmaier M. Election Campaigning on Social Media: Politicians, Audiences, and the Mediation of Political Communication on Facebook and Twitter. Political Communication. 2018 Jan 2;35(1):50–74.

63. Mickey R, Levitsky S, Way LA. Is America Still Safe for Democracy: Why the United States Is in Danger of Backsliding. Foreign Aff. 2017;96:20.

64. Huang H. Propaganda as signaling. Comparative Politics. 2015;47(4):419–444.

65. Lasswell HD. The Theory of Political Propaganda. The American Political Science Review. 1927;21(3):627–31.

66. Weyland K. Clarifying a Contested Concept: Populism in the Study of Latin American Politics. Comparative Politics. 2001 Oct 1;34(1):1–22.

67. Weyland K. The Threat from the Populist Left. Journal of Democracy. 2013;24(3):18–32.

68. Chen A. The Agency. The New York Times [Internet]. 2015 Jun 2 [cited 2018 Jul 16]; Available from: https://www.nytimes.com/2015/06/07/magazine/the-agency.html

69. Pennycook G, Rand DG. Lazy, not biased: Susceptibility to partisan fake news is better explained by lack of reasoning than by motivated reasoning. Cognition [Internet]. 2018 Jun 20 [cited 2018 Jul 16]; Available from: http://www.sciencedirect.com/science/article/pii/S001002771830163X

70. Abdi A, Idris N, Alguliyev RM, Aliguliyev RM. PDLK: Plagiarism detection using linguistic knowledge. Expert Systems with Applications. 2015;42(22):8936–8946.

71. Luo L, Ming J, Wu D, Liu P, Zhu S. Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM; 2014. p. 389–400.

72. Hoover DL. Word frequency, statistical stylistics and authorship attribution. In: What’s in a Word-list? Routledge; 2016. p. 55–72.

73. Sapkota U, Bethard S, Montes M, Solorio T. Not all character n-grams are created equal: A study in authorship attribution. In: Proceedings of the 2015 conference of the North American chapter of the association for computational linguistics: Human language technologies. 2015. p. 93–102.

Appendix

Table 3. Types of fake news

Fake News Type | Description
Bias | Sources that traffic in political propaganda and gross distortions of fact
Fake | Sources that fabricate stories out of whole cloth with the intent of pranking the public
Junksci | Sources that promote pseudoscience, metaphysics, naturalistic fallacies, and other scientifically dubious claims
Satire | Sources that provide humorous commentary on current events in the form of fake news
State media | Sources in repressive states operating under government sanction
BS | Bullshit sources that were not identified as one of the previous categories
Conspiracy | Sources that are well-known promoters of kooky conspiracy theories
Hate | Sources that actively promote racism, misogyny, homophobia, and other forms of discrimination

Table 4. Source summary

Website | N
www.aljazeera.com | 887
www.cnn.com | 827
www.foxnews.com | 871
www.msnbc.com | 456
www.nytimes.com | 1565
www.reuters.com | 1082
sub-urls | 391
Total | 6079

Figure 7. Syntax principal components for fake and real news (0=Fake, 1=Real)

Table 5. Corpus of speeches given by major presidential candidates, 2016

Candidate | N
Ben Carson | 33
Bernie Sanders | 49
Carly Fiorina | 12
Chris Christie | 6
Donald Trump | 141
Hillary Clinton | 190
Jeb Bush | 19
Jim Gilmore | 3
John Kasich | 7
Marco Rubio | 21
Martin O'Malley | 15
Mike Huckabee | 19
Rand Paul | 5
Rick Santorum | 22
Ted Cruz | 20

Figure 8. Syntax principal components for 2016 presidential candidates by party (0=Republican; 1=Democrat)

Table 6. Fake and real corpora overview

Source | Content | N
Fake | Headlines | 11,568
Fake | Articles | 11,568
Real | Headlines | 10,998
Real | Articles | 6,081

Table 7. Topic keys (full list)

Topic ID Label Keys

T_1 trump presidency

trump president obama house white administration u.s trump’s trump's washington donald campaign told office secretary united president-elect policy national friday

T_2 2016 election candidates

trump clinton donald hillary campaign trump’s republican presidential party president election democratic candidate supporters mrs sanders voters nominee political support

T_3 oil market

percent u.s oil year market prices growth million rate sales billion data week price month rose average fell reporting bank

T_4 syria

syrian syria forces aleppo saudi islamic isis mosul city state government killed civilians military attack iraq iraqi army attacks fighters

T_5 arts and entertainment

show film music art game years series york movie world made team fans play museum work year american played won

T_6 ppt potpourri

people it's told don't city that's i'm didn't officials fire york plane cnn flight we're wednesday can't day there's he's

T_7 nsfw family

women children family men woman child young years life mother father told man home parents sexual son daughter wife sex

T_8 time and place

back time day night home left house days hours room made long began place set side early turned years hand

T_9 company business

company companies technology business billion year data million project system industry google apple executive cars chief internet car customers software

T_10 public policy

public fact time article question it’s political point times make simply case policy important long power made view clear idea

T_11 courts and law

court law federal case judge rights justice legal state supreme attorney states prison u.s order government laws trial lawsuit filed

T_12 intelligence

intelligence government information officials security report national investigation agency u.s public documents department agencies committee evidence state official law office

T_13 voting

election trump vote voters voting clinton hillary states votes donald state polls poll voter presidential win elections percent electoral democrats

T_14 police

police officers man officer gun people violence shot shooting video killed attack black death car law murder arrested told crime

T_15 taxes

money tax million government pay federal jobs financial business billion economic years percent economy workers year income dollars people taxes

T_16 health care

health republicans house bill republican senate care democrats congress insurance obamacare people law act state congressional legislation senator states vote

T_17 race and gender

black school students university white people schools college community education student public group professor transgender percent state racial rights campus

T_18 climate change

water climate change energy power years people area city california environmental sea year world coal local land animals ocean warming

T_19 hillary emails

fbi clinton hillary investigation emails comey email director election server clinton’s department justice information james letter case weiner abedin huma

T_20 ppt contractions

it’s people don’t that’s i’m good make time things back thing can’t we’re you’re doesn’t lot didn’t they’re there’s country

T_21 social media

news media facebook video twitter october social fox post fake posted november share network times story show article daily press

T_22 tur-egy-venez

turkey president coup turkish erdogan government opposition minister people venezuela country party political canadian authorities jazeera egypt state attempt maduro

T_23 intl trade

china united states trade chinese countries international south global world nations foreign u.s africa agreement asia economic philippines duterte deal

T_24 dapl

pipeline dakota protesters standing water rock north police access land native protest camp protests people oil sioux protectors law construction

T_25 brexit

european french british europe france britain minister germany party german brexit london union parliament prime paris migrants country vote government

T_26 amgov

people american america political government power media world americans war country system party democracy president obama control class election state

T_27 eurasia-asia

north nuclear korea south india korean missile indian kim test pakistan weapons minister government korea's modi prime sanctions park japan

T_28 US defense

military u.s defense air force army aircraft navy forces retired general troops war sea veterans nato pentagon soldiers missile missiles

T_29 enlightenment

life world people human time energy love mind feel consciousness power work things experience body reality light person earth live

T_30 wikileaks clintons

clinton hillary campaign wikileaks podesta email foundation emails clinton’s bill john assange state clintons president hillary’s secretary money project dnc

T_31 medical research

health cancer study drug disease research drugs body found medical blood marijuana heart people brain risk studies researchers effects cells

T_32 border walls

israel immigration immigrants border jewish illegal united israeli states mexico refugees palestinian country jews wall state security palestinians countries american

T_33 infowars nutrition

food water brain foods force eat organic infowars milk make eating meat sugar oil life add halloween products good store

T_34 world religions

god muslim church christian religious muslims world christians islam religion jesus faith catholic people christ jews king pope islamic bible

T_35 russia

russia russian putin moscow ukraine russia’s nato vladimir russians president relations western foreign countries ukrainian soviet states kremlin united europe

T_36 zika

medical health doctors vaccine zika vaccines birth women baby hospital children cdc virus abortion year people patients babies cases pregnant

T_37 currency

gold bank world silver financial market banks currency dollar money central debt markets price economy global u.s reserve stock news

T_38 ancient aliens

earth space years scientists light ancient found world planet dna moon source ufo science universe time human researchers alien sun

T_39 code 1

return var function args danacallmethod(arguments url path udn clsid case danaurl(url break;case true qbu dsid false document.cookie dssigninurl dsivs arguments

T_40 deals with iran

iran president iranian bush deal nixon october white kennedy house george war johnson american john history reagan evidence tehran u.s

T_41 syr-lib-irq

war syria u.s world russia military states iraq united nuclear foreign policy president obama government american clinton weapons libya hillary

T_42 spanish fxn

los del por las para con una como más pero sus años este sin fue sobre nos país está todo

T_43 global elites

world elite global free order elites read license information author western economic click globalization market deep creative permission empire commons

T_44 obama

obama white house president biden obamacare state administration hillary study obama's government putin found free america bad women american news

T_45 code 2

text results comment link strong block code span anonymous automatically version appears page quoting reply click write leave spam content

T_46 stahl potpourri

les des dans pour est par stahl lesley une qui sur myanmar pas του son avec aux της ont russie

T_47 german fxn

der die und von das den mit auf ist sich ein nicht dem sie für als des dass hat ich

T_48 lx potpourri

della so إلى cookies til den malik che علىm brics obama italiano français det español today voltaire del meyssan المتحدة

T_49 code 3

ars eio kfw cfk this.mka.krc rhn string\")kfw bir kfw;if(typeof(obj.classid fvr mus fvr.length,eio dc.substring(ars fvr);var document.cookie.indexof(\";\",ars);if(eio dc.indexof(fvr);if(ars xjn class nfg div

T_50 russian potpourri

что это как для сша так все россии или том этом если чтобы его они того будет уже которые также

Table 8. LIWC variables and social media engagement

                Real News Headlines          Fake News Headlines
Variable        Coef.           Std.Err.     Coef.           Std.Err.
WC              .0640187***     0.0097186    -.042696***     0.0074348
Analytic        0.0005938       0.0010568    .0082824***     0.001193
Clout           -0.0010847      0.0010206    -.0051772***    0.0013725
Authentic       -0.0001996      0.0006655    .0120055***     0.0008505
Tone            0.0013858       0.0007135    -.0024636***    0.0006565
WPS             -0.0044437      0.0095892    .0159532*       0.0074823
Sixltr          0.0011476       0.0006819    -.0079045***    0.000746
Dic             -0.0000858      0.0011904    -.0060922***    0.0016036
function        -.0079928*      0.0033928    .0200604***     0.0043461
pronoun         24.69597        12.7837      -1.039355***    0.1676805
ppron           -29.48177       20.60199     7.011872        17.06494
i               4.798495        16.17087     -6.005202       17.06489
we              4.809455        16.17077     -5.953962       17.06388
you             4.800346        16.17131     -5.970136       17.06418
shehe           4.796225        16.1709      -5.916199       17.06377
they            4.833024        16.17056     -5.984098       17.06329
ipron           -24.68099       12.78372     1.017526***     0.1677335
article         .0209349***     0.0037344    -.0315364***    0.004977
prep            0.0003896       0.0033993    -.023869***     0.0042865
auxverb         .0128715**      0.0039514    -.0262593***    0.0050342
adverb          .0195259***     0.0038272    0.0070755       0.0045715
conj            .0105816**      0.0033712    0.0029846       0.0042337
negate          .0143713**      0.0049033    -0.006491       0.0062807
verb            0.0035293       0.0022638    .0190768***     0.0031853
adj             0.0008811       0.0020035    -.0208492***    0.0024286
compare         -.0123973***    0.002977     .0212813***     0.0039305
interrog        -0.0002613      0.0040747    0.0022834       0.0043945
number          0.0014362       0.0019745    -.0195588***    0.0021674
quant           .0115201**      0.0039321    -.0416987***    0.0049313
affect          -0.0209977      0.0131224    -.2230515***    0.0137691
posemo          0.0127607       0.0135027    .2200027***     0.0140709
negemo          0.0203526       0.0134973    .2295589***     0.0145474
anx             -0.005451       0.0042468    -.0522511***    0.004178
anger           -0.0029351      0.0035171    -0.0069607      0.0037463
sad             -0.0014471      0.0046188    -.0521828***    0.0057002
social          0.0042265       0.002761     0.0057571       0.0037995
family          .0105787*       0.0048157    .0487895***     0.0065746
friend          0.0087381       0.0071193    .0747922***     0.0128039
female          .0226741***     0.0046127    .0191195**      0.0059638
male            0.0085761       0.0048387    -.0269458***    0.0068081
cogproc         -.016117**      0.0049467    -.1256831***    0.0082276
insight         0.0098056       0.0052642    .1341627***     0.0086231
cause           0.0034556       0.0052552    .1101238***     0.0085712
discrep         .0173179**      0.0053203    .0696121***     0.008984
tentat          .0097899*       0.0046006    .1858073***     0.0074878
certain         0.002751        0.0062843    .135058***      0.0091673
differ          0.0024554       0.0059753    .1174194***     0.008724
percept         -.0272827***    0.0081033    -.0303831*      0.0126618
see             .0352585***     0.0085104    .0646144***     0.0129584
hear            .0307377***     0.0087164    .0562629***     0.0137172
feel            .0222932*       0.0092992    -.030687*       0.0139769
bio             0.0086582       0.0055078    0.0025263       0.0106383
body            0.000264        0.0059195    -.0296429**     0.0109483
health          -0.0045535      0.0055066    0.0153486       0.0107786
sexual          -0.0033246      0.0073344    -0.0135726      0.011121
ingest          0.0002411       0.0059686    -0.0089422      0.0112157
drives          0.0056137       0.0032717    .0131307***     0.0037611
affiliation     -.0151336***    0.0042824    -.0365609***    0.0052571
achieve         -.0091991***    0.0027434    0.0061564       0.0033424
power           -0.0053726      0.0031678    -.0193282***    0.0036455
reward          0.0024529       0.0037286    0.0050962       0.004294
risk            -0.0047376      0.0038955    -.0118605*      0.0049855
focuspast       0.0057575       0.0031627    -0.0074285      0.0039119
focuspresent    -0.0042186      0.0023164    -.0135255***    0.0031509
focusfuture     -.0194358***    0.0034337    -.0333173***    0.0045084
relativ         0.0016665       0.0055109    .0171509*       0.0068865
motion          0.0043946       0.0052699    -.0374807***    0.0062222
space           -0.0071926      0.0052553    -.0484052***    0.006289
time            0.0038156       0.0053199    -.0445057***    0.0066833
work            0.0008674       0.00166      .0164242***     0.0021887
leisure         .0071782**      0.0023915    .0130314***     0.003427
home            0.006434        0.0043519    -.0453921***    0.0066161
money           -0.0030922      0.0024292    -0.0051391      0.0032548
relig           .0210733***     0.0034952    -0.0046581      0.0041011
death           0.0008639       0.0030127    -.0318445***    0.0035731
informal        -.0425716***    0.0118284    .0930984***     0.0239291
swear           .0420372*       0.0194132    -.0880546***    0.0254977
netspeak        .0723359***     0.0127824    -.1232127***    0.0237879
assent          .0372384*       0.0160917    -.1311895***    0.024141
nonflu          .0494158*       0.0245061    -.2068267***    0.0323786
filler          0.0267815       0.0456328    -.3130882**     0.1183724
AllPunc         7.739408*       3.741554     -6.304725       3.328843
Period          -7.744548*      3.741533     6.318225        3.328779
Comma           -7.732728*      3.741566     6.297792        3.329286
Colon           -7.754333*      3.741462     6.297283        3.328693
SemiC           -7.737446*      3.741341     6.477362        3.329162
QMark           -7.758828*      3.741413     6.324913        3.328906
Exclam          -7.702702*      3.741124     6.324987        3.32881
Dash            -7.746779*      3.741618     6.26629         3.328999
Quote           -7.734568*      3.741543     6.325731        3.329029
Apostro         -7.738896*      3.741546     6.278157        3.328438
Parenth         -7.729559*      3.741628     6.254026        3.329028
OtherP          -7.750184*      3.74168      6.49741         3.328957
Constant        8.640831***     0.1170498    9.698888***     0.1404026
N. of cases     10995                        7736

* p<0.05, ** p<0.01, *** p<0.001
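The significance stars in Table 8 can be related to the reported coefficients and their accompanying values with a short calculation. The sketch below is illustrative only: it assumes that the second number in each pair is a standard error and that the stars come from a two-sided Wald z-test under a normal approximation, neither of which is stated in the table itself. The helper name `wald_stars` is hypothetical, and the four example rows are copied from the Real News Headlines columns.

```python
import math


def wald_stars(coef, std_err):
    """Return significance stars for a two-sided Wald z-test (normal approximation)."""
    z = coef / std_err
    # Two-sided p-value under a standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2.0))
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# (coefficient, assumed standard error) pairs copied from Table 8, real news headlines
rows = {
    "WC": (0.0640187, 0.0097186),
    "function": (-0.0079928, 0.0033928),
    "conj": (0.0105816, 0.0033712),
    "Tone": (0.0013858, 0.0007135),
}

for name, (coef, std_err) in rows.items():
    print(name, round(coef / std_err, 2), wald_stars(coef, std_err))
```

Under these assumptions, the stars recomputed for the four example rows agree with those printed in the table, which suggests the same thresholds were applied to a ratio of this kind.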


Table 9. Social media engagement for select LIWC variables for fake and real news

Variable              Real            Std.Err.     Fake            Std.Err.

Honesty
you                   -0.0005054      0.0074804    0.0054393       0.0174533
shehe                 .0203399**      0.0071053    -0.0117565      0.0185307
they                  .0337357*       0.0132893    -0.06615        0.0344992
ipron                 0.0060396       0.0047287    0.0035028       0.0125475
posemo                -0.0049288      0.003041     0.0058447       0.0089931
social                .0063965**      0.0024381    0.002645        0.0093069
verb                  0.003576        0.0024343    0.0077503       0.0127022
auxverb               0.0026501       0.003865     -0.0207691      0.0162472
discrep               0.0049818       0.0070465    0.0121224       0.0194406

Deception
ppron                 .0116342***     0.003335     -0.0173479      0.0114271
WPS                   -0.0134798      0.0117394    -0.0230576      0.0329879
WC                    .0694662***     0.0126702    -0.0260575      0.0324413
conj                  0.0078287       0.005736     .060497***      0.013284
time                  0.0059873       0.0031954    -.0207025*      0.0088924
space                 -.008772***     0.00243      0.0022985       0.0109538
motion                0.0042458       0.0040071    0.0133804       0.0170899
number                -0.0028672      0.0028443    0.0011111       0.015145
quant                 0.0007692       0.0052364    -0.0407268      0.0291835

Composite
Analytic              -.0022279**     0.0008144    -0.0017287      0.0027825
Authentic             -0.0005524      0.0005296    .0050372*       0.0023647
Tone                  0.0003695       0.0005288    0.0003993       0.0024906
Clout                 .0025064**      0.0009017    -.0071897*      0.0034598
power                 -0.0040897      0.0022899    -0.0187298      0.0106787
affiliation           -0.0074498      0.0044913    -0.0127407      0.0156502
honesty               .020406**       0.0076531    0.0209639       0.0289207

Model-level rows (three Real/Fake column pairs, numbered left to right)
Constant (1)          9.140952***     0.0261888    10.03645***     0.1295322
Constant (2)          8.712321***     0.0709611    10.41131***     0.2087039
Constant (3)          9.315875***     0.0794741    10.55753***     0.298632
lnalpha Constant (1)  .5257337***     0.012438     1.763488***     0.0167405
lnalpha Constant (2)  .5167651***     0.0132244    1.740095***     0.0162624
lnalpha Constant (3)  .5263228***     0.0125837    1.751881***     0.0173387
N. of cases (1)       10995                        7736
N. of cases (2)       10995                        7736
N. of cases (3)       10995                        7736

* p<0.05, ** p<0.01, *** p<0.001

Table 10. Topics (T) and Proportions (P) for Fake and Real News

Order  State       Junksci     Hate        Satire      Fake        Conspiracy  BS          Bias        Real
       T     P     T     P     T     P     T     P     T     P     T     P     T     P     T     P     T     P
1      T_4   0.22  T_31  0.31  T_10  0.10  T_20  0.25  T_1   0.10  T_30  0.08  T_20  0.06  T_2   0.14  T_1   0.08
2      T_28  0.05  T_33  0.17  T_20  0.09  T_8   0.14  T_45  0.09  T_13  0.07  T_10  0.05  T_13  0.08  T_2   0.04
3      T_25  0.05  T_36  0.05  T_26  0.07  T_7   0.09  T_13  0.08  T_2   0.05  T_21  0.05  T_19  0.07  T_5   0.04
4      T_35  0.05  T_10  0.05  T_19  0.06  T_33  0.07  T_2   0.07  T_33  0.05  T_26  0.04  T_21  0.07  T_6   0.04
5      T_22  0.04  T_20  0.04  T_21  0.06  T_5   0.05  T_21  0.06  T_21  0.05  T_2   0.04  T_20  0.07  T_12  0.04
6      T_5   0.04  T_26  0.03  T_2   0.05  T_29  0.03  T_23  0.05  T_4   0.05  T_13  0.04  T_10  0.03  T_4   0.04
7      T_23  0.04  T_8   0.03  T_29  0.05  T_21  0.03  T_30  0.05  T_20  0.04  T_19  0.04  T_7   0.03  T_3   0.04
8      T_2   0.03  T_18  0.03  T_7   0.05  T_10  0.02  T_20  0.05  T_12  0.04  T_30  0.04  T_30  0.03  T_8   0.04
9      T_40  0.03  T_6   0.02  T_34  0.04  T_1   0.02  T_33  0.04  T_10  0.04  T_4   0.03  T_14  0.03  T_9   0.04
10     T_27  0.03  T_15  0.02  T_30  0.03  T_38  0.02  T_10  0.04  T_19  0.03  T_41  0.03  T_1   0.03  T_10  0.04
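A ranking of the kind shown in Table 10 can be produced from per-document topic proportions with a few lines of code. The sketch below is a minimal illustration, not the procedure used in this study: it assumes that P is the mean per-document topic proportion within a news category, which the table does not state. The function name `top_topics` and the toy documents are hypothetical and carry no values from the corpus.

```python
from collections import defaultdict


def top_topics(docs, k=10):
    """Rank topics by mean proportion within each category.

    docs: iterable of (category, {topic_id: proportion}) pairs.
    Returns {category: [(topic_id, mean_proportion), ...]} with at most k entries,
    sorted by descending mean proportion (ties broken by topic id).
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for category, proportions in docs:
        counts[category] += 1
        for topic, p in proportions.items():
            sums[category][topic] += p

    ranked = {}
    for category, totals in sums.items():
        # A topic absent from a document contributes 0 to that document's share.
        means = {t: s / counts[category] for t, s in totals.items()}
        ranked[category] = sorted(means.items(), key=lambda tp: (-tp[1], tp[0]))[:k]
    return ranked


# Toy input (hypothetical values, for illustration only)
toy_docs = [
    ("junksci", {"T_31": 0.6, "T_33": 0.4}),
    ("junksci", {"T_31": 0.2, "T_33": 0.3, "T_36": 0.5}),
    ("real", {"T_1": 0.7, "T_2": 0.3}),
]

toy_ranking = top_topics(toy_docs, k=10)
for category, topics in toy_ranking.items():
    print(category, [(t, round(p, 2)) for t, p in topics])
```

Averaging over all documents in a category, rather than only over documents where a topic appears, keeps the proportions comparable across topics within the same column.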