
Using Textual Sentiment in Corporate Disclosures to Predict Financial Outcomes

Christine Kim, Kim Strauch, and Derek Racine

    1 The Problem and Motivations.

Research on quantifying textual sentiment is growing rapidly, especially in finance, where new sources of information, particularly those that are difficult to obtain, can help an investor gain a competitive edge in the market. There are a variety of methods for accomplishing this task, but the majority of them use a bag-of-words model, which relies on the frequency of unigrams found within a document. To normalize these values for use in a formal model, several features have been defined, including term frequency (TF), TF-inverse document frequency (TF-IDF), and LOG1P, among others. These counts are then used to produce a measure of sentiment, based either on a priori knowledge of how the features should be combined or on weights learned from linear regression. A dictionary is commonly used to reduce the dimensionality of the data and infer more general latent variables, like positive or negative tone. However, models of this type fail to account for both local context and long-range dependencies. As a result, they may not be as accurate at predicting company outcomes. In particular, efforts to use annual corporate disclosures, or 10-Ks, to predict subsequent stock returns or volatility have been only marginally effective.

    2 A Survey of Existing Work.

Loughran and McDonald (2011) use 10-Ks to predict stock performance in the period from 1994 to 2008, demonstrating a weak relationship with textual sentiment (1). Rather than using the Harvard dictionary, the researchers compile their own unigram lists in order to avoid misclassifying words that take on a specific meaning within the financial domain. For example, the word "risk" often has a negative connotation in everyday speech, but in finance it is associated with a real value and frequently has no relation to the sentiment of the text. These LM lists measure positive (abundance, influential, revolutionize, upturn), negative (abnormal, bankruptcy, bribery, crime), litigious (indict, interrogator, notarized, usurious), weak modal (could, depending, might, possibly), strong modal (always, highest, must, will), and uncertain (arbitrary, rumors, undeterminable, unknown) sentiments. Overall, they find a significant correlation between the tone measurement generated by one or more of their word lists and subsequent stock volatility, but, notably, not future returns (1).

Socher et al. (2013) use a recursive neural tensor network (RNTN) to improve the state of the art for positive-versus-negative sentence classification from 80% to 85.4% accuracy (2). Furthermore, their model is, thus far, the only one that can accurately capture the effect of contrastive conjunctions, as well as negation and its scope at various tree levels, for both positive and negative phrases (2). They use a publicly available dataset of labeled movie reviews, the Stanford Sentiment Treebank, to learn their model parameters and are crowd-sourcing further labeling on their website: http://nlp.stanford.edu/sentiment/.

Kogan et al. (2009) use support vector regression to predict stock volatility from word features of the MD&A sections of 10-Ks. Overall, they find a significant improvement in a model that combines their sentiment measure with historical volatility over models that use historical volatility alone to predict future volatility (4). One of their key findings is that the model improves following the passage of the Sarbanes-Oxley Act in 2002, suggesting the act successfully increased the information conveyed in Securities and Exchange Commission (SEC) filings (4).


3 A Description of our Data, Algorithms, and Methods.

Data:

We aggregated a dataset of approximately 300,000 10-K filings from 1994 to 2013 from the SEC website at http://www.sec.gov/edgar.shtml and extracted Item 7, "Management's Discussion and Analysis of Financial Condition and Results of Operations," from each of them. We used Yahoo! Finance to obtain corresponding historical stock returns two weeks before and after the release of each company's disclosure. We then calculated the average difference in adjusted closing price and in volatility before and after the event to use as our dependent variables. We used the Punkt sentence tokenizer from the Python NLTK package to pre-process our corpus so that bigram counts did not include contexts that crossed a sentence boundary. After reconciling differences in the available financial data we were able to acquire, the data set (N=6300) was divided into a training set (60%), two cross-validation sets (15% each), and a test set (10%).
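
As a brief illustration, the sketch below computes these event-window deltas, assuming daily adjusted closing prices are already in a pandas Series indexed by trading date (the data layout and function name are ours, not part of the original pipeline):

    import pandas as pd

    def event_window_deltas(prices: pd.Series, filing_date, window: int = 10):
        # prices: daily adjusted closing prices indexed by trading date.
        # window: ~two calendar weeks of trading days on each side of the filing.
        returns = prices.pct_change().dropna()
        before = returns[returns.index < filing_date].tail(window)
        after = returns[returns.index >= filing_date].head(window)
        delta_return = after.mean() - before.mean()        # change in average daily return
        delta_volatility = after.std() - before.std()      # change in return volatility
        return delta_return, delta_volatility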

The Loughran and McDonald unigram word lists (Positive, Negative, Uncertainty, Litigious, Modal Strong, and Modal Weak) served as our initial dictionary. We used MATLAB to perform our machine learning tasks, as it provided a convenient format for big data manipulation, and piped output into Excel spreadsheets as an intermediate step between programs.

Furthermore, we extended Richard Socher's Recursive Neural Tensor Network (RNTN) model, which is trained on the Stanford Sentiment Treebank, a set of publicly available parsed and labeled phrases from Rotten Tomatoes movie reviews. We used the same training and test sets from our word-feature models to evaluate Socher's model.


    Algorithms and Methods:

Initially, we used unigrams to establish a baseline model and learn more about our data. While Loughran and McDonald simply counted the number of words within each list, weighting each one equally, we wanted to learn the weights from the data. Therefore, we regressed our dependent variables, stock returns and volatility, on TF counts for the LM words within each list in our training set. TF normalizes raw frequency by the length of the document (TF = raw frequency / total words in document).
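
A minimal sketch of this feature computation, assuming tokens is a tokenized Item 7 and vocabulary is one LM word list (both names are illustrative):

    from collections import Counter

    def tf_features(tokens, vocabulary):
        # TF = raw count of the term / total words in the document.
        counts = Counter(t.lower() for t in tokens)
        n = len(tokens)
        return {w: counts[w] / n for w in vocabulary}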

Second, we wanted to see how informative each LM list was in terms of its predictive ability. In 10-Ks, positive words could be less useful for measuring sentiment because they're frequently negated: for example, it's common to see phrases like "did not benefit." As such, we calculated the least-squares error (LSE) on our test set for each list. Then, in our first cross-validation set, we regressed the actual stock returns and volatility on the values predicted for each list, using the weights learned from our training set, to see if we could improve our model's performance.
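
This two-stage procedure can be sketched as follows, assuming X_train[name] and X_cv[name] hold documents-by-words TF matrices per LM list and y_train, y_cv hold the dependent variable; the variable names are ours (the original analysis was done in MATLAB):

    import numpy as np

    def fit_ls(X, y):
        # Ordinary least squares with an intercept column appended.
        A = np.column_stack([X, np.ones(len(y))])
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        return w

    def predict_ls(X, w):
        return np.column_stack([X, np.ones(len(X))]) @ w

    # Stage 1: per-word weights, one regression per LM list, on the training set.
    #   weights = {name: fit_ls(X_train[name], y_train) for name in lm_lists}
    # Stage 2: per-list weights, regressing actuals on per-list predictions in the CV set.
    #   P_cv = np.column_stack([predict_ls(X_cv[name], weights[name]) for name in lm_lists])
    #   w_lists = fit_ls(P_cv, y_cv)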

Having based the previous models on TF, we wanted to explore other metrics for representing document word counts, such as TF-IDF and LOG1P. TF-IDF normalizes the raw frequency both to the length of the document and to the number of documents in the corpus in which the term appears (TF-IDF = TF * log(total documents in corpus / documents in which the term appears at least once)). This adjustment makes rare words count for more than they would otherwise, as they occur at low frequencies by definition. We speculated that the appearance of such infrequent words could be very informative for differentiating financial texts, as terms such as "bankrupt" and "merger" do not usually occur at high frequency within a document, but could be strong predictors of future stock performance. LOG1P adjusts the raw frequency so that there are diminishing returns to larger term counts (LOG1P = log(raw frequency + 1)). Note that the plus 1 prevents frequencies of 0 from producing the log of 0, which is undefined. As with our TF model, we first regressed stock returns and volatility on the word features for each of the lists on the training set, then learned the weights for each list on our first cross-validation set, and finally evaluated our results on the test set.
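
Both features follow directly from the formulas above; a sketch, assuming docs is a list of token lists, one per 10-K:

    import math
    from collections import Counter

    def tfidf_and_log1p(docs, vocabulary):
        n_docs = len(docs)
        df = Counter()                           # document frequency per term
        for doc in docs:
            df.update(set(doc))
        features = []
        for doc in docs:
            counts, n = Counter(doc), len(doc)
            # TF-IDF = TF * log(total docs / docs containing the term); 0 if never seen.
            tfidf = {w: (counts[w] / n) * math.log(n_docs / df[w]) if df[w] else 0.0
                     for w in vocabulary}
            # LOG1P = log(raw frequency + 1), giving diminishing returns to large counts.
            log1p = {w: math.log(counts[w] + 1) for w in vocabulary}
            features.append((tfidf, log1p))
        return features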

The bag-of-words models are naïve in that they ignore local context, which may provide important qualifying information. For example, in the phrases "did not succeed" and "beyond expectations," the "not" preceding "succeed" and the "beyond" before "expectations" significantly modify the sentiment of the words alone. To determine whether incorporating such context would improve our predictions, we compiled TF counts for all bigrams containing an LM word in each of the lists. Our model was formed in the same way as for unigrams: first the weights for each word were learned on the training set, and then the weights for the individual lists were learned from the first cross-validation set. Finally, we again regressed stock returns and volatility in our second cross-validation set on the predictions made by unigrams and bigrams to see whether or not our model was more accurate.
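
A sketch of the bigram extraction, mirroring the Punkt preprocessing described earlier so that bigrams never cross a sentence boundary (requires NLTK's punkt data; lm_words stands in for one LM list):

    import nltk
    from collections import Counter

    def lm_bigram_tf(text, lm_words):
        counts = Counter()
        total = 0
        for sent in nltk.sent_tokenize(text):            # Punkt sentence tokenizer
            tokens = [t.lower() for t in nltk.word_tokenize(sent)]
            total += max(len(tokens) - 1, 0)
            for a, b in nltk.bigrams(tokens):
                if a in lm_words or b in lm_words:       # keep only LM-anchored bigrams
                    counts[(a, b)] += 1
        return {bg: c / total for bg, c in counts.items()} if total else {}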

After verifying the importance of local context in predicting financial performance, we wanted to see whether long-range dependencies would further enhance our predictions. Socher's code trains an RNTN on a labeled movie-review data set of parsed sentences, learning composition functions at each node. As such, the effects of negation, with the word "not," and of contrastive conjunctions, with terms like "but," "however," and "nonetheless," are propagated through the sentence's parse tree. Therefore, we adapted Socher's model to analyze the tone of our 10-Ks. In its original form, Socher's implementation classifies the sentiment of a sentence as "Very negative," "Negative," "Neutral," "Positive," or "Very positive." We mapped these categories to real-valued sentiment scores between 0 and 1 (see Figure 1) in order to create our predictive model. To capture the overall tone of a document, we simply took an average of the sentiment scores for each sentence. As in our other analyses, we regressed stock returns and volatility on these measurements in our training set and then evaluated the model on the test set.
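
The mapping and the document-level average are straightforward; in the sketch below, classify_sentence stands in for a call into the RNTN classifier and is hypothetical:

    # Figure 1 mapping of Socher's labels to real-valued scores.
    LABEL_SCORES = {"Very negative": 0.1, "Negative": 0.3, "Neutral": 0.5,
                    "Positive": 0.7, "Very positive": 0.9}

    def document_tone(sentences, classify_sentence):
        # Average the per-sentence scores to get one tone measure per 10-K.
        scores = [LABEL_SCORES[classify_sentence(s)] for s in sentences]
        return sum(scores) / len(scores) if scores else 0.5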

    Challenges:

Our first challenge was collecting the set of 300,000 10-K filings from the online database at http://www.sec.gov/edgar.shtml. In particular, we did not anticipate the memory, time, and computational constraints that would accompany managing such a massive amount of data. For example, the dataset was so large that we could not store all of it on one computer. Our second roadblock came when we endeavored to use an all-inclusive vocabulary, only to find the task intractable: the time complexity was such that it took days to run our program to obtain feature vectors for a few thousand documents. Thus, having a large dataset turned out to be both a blessing and a curse: large enough to derive meaningful results, but also large enough to cause a headache or two over scalability issues. Furthermore, we ran into underflow errors when doing our analyses on TF-IDF counts.

    4 The Results of our Experiments.

We established our baseline using a linear regression of stock returns and volatility on TF counts in the training set for the LM unigrams in each of the lists. The least squares error (LSE) of these models can be seen in Figure 2. Based on the absolute value of the weights assigned to each word, we determined which words were the most informative in each list. The top five words for each list can be seen in Figure 4. As expected, all of the positive words have positive weights, and four out of the five negative words have negative weights. These results correspond with the intuition that positive sentiment correlates with an increase in returns whereas negative sentiment predicts a decrease. With the exception of positive words, the top five most valuable words for all of the lists were the same for both returns and volatility, although their weights differed (see Figure 5). While the channel through which tone affects volatility is less well understood, our findings suggest that positive words correlate with increased volatility whereas negative words predict a decrease.

We then regressed stock returns and volatility in our cross-validation set on the predictions for each of the LM lists. The LSE for this regression can be seen in Figure 3a. We divided companies into quintiles based on what our TF model predicted for average returns and plotted the actual cumulative returns in the two-week window following the release of these companies' 10-Ks (see Figure 8). In learning the weights for each of the LM lists, we were somewhat surprised by the results, as we expected negative words to be more informative than positive ones. In contrast, for returns, the most important lists in order were: Positive, Litigious, Weak Modal, Strong Modal, Uncertain, and Negative. For volatility, they were: Positive, Weak Modal, Strong Modal, Uncertain, Litigious, and Negative. The actual coefficients in the linear regression appear in Figure 6. Not surprisingly, the weight for the Negative list was negative and that for the Positive list was positive. One explanation for the unexpected order of importance of the lists, however, could be that our sample size was simply not large enough: because our cross-validation set was small, the results could easily be biased.
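
The quintile analysis behind Figures 7 through 11 amounts to the following sketch, assuming predicted is a pandas Series of predicted average returns indexed by company and daily_returns is a days-by-companies DataFrame of actual post-filing returns (both names are ours):

    import pandas as pd

    def quintile_cumulative_returns(predicted, daily_returns):
        quintiles = pd.qcut(predicted, 5, labels=False)   # 0 = lowest predicted returns
        paths = {}
        for q in range(5):
            cols = predicted.index[quintiles == q]
            # Equal-weighted cumulative return path for the quintile.
            paths[q + 1] = (1 + daily_returns[cols].mean(axis=1)).cumprod() - 1
        return pd.DataFrame(paths)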

The LSE for returns and volatility using TF-IDF and LOG1P features of LM unigrams can be seen in Figure 3a. The LSE for returns using TF-IDF was an order of magnitude smaller than that using TF. In contrast, the LSEs for volatility using TF and TF-IDF were comparable. Thus, our hypothesis that TF-IDF would outperform TF was confirmed for returns. LOG1P, however, performed worse than both TF and TF-IDF. This finding suggests that the frequencies of common words within a document contain valuable information.


Our combined unigram and bigram model improved on any of the unigram models in terms of LSE (see Figure 3c), supporting our hypothesis that local context is important and more explanatory than single words alone. That said, the improvement was not as large as we expected, especially for predicting volatility. The cumulative returns graph for this model can be seen in Figure 11.

While we were initially skeptical about using an RNTN model trained on a Rotten Tomatoes movie-review dataset, we were pleasantly surprised by the results. On our training set, we ran a linear regression of stock returns and volatility on the sentiment scores obtained from running Socher's code, which decreased the LSEs (see Figure 3b) compared to those for the TF, TF-IDF, and LOG1P models of LM unigrams as well as for bigrams. This suggests that long-range dependencies are very informative and predict future performance better than unigram or bigram context. We divided companies into quintiles based on average returns predicted using Socher's model, similar to our analysis for the other models (see Figure 7). Though our results weren't perfect, they were surprisingly good.

Bag-of-words models based on dictionaries reduce the dimensionality of the data and have been the industry standard for evaluating the sentiment of financial texts with computational methods. However, this approach fails to account for context of any kind. From our research, we have concluded that incorporating local context and long-range dependencies into a sentiment analysis model helps predict a company's subsequent stock returns and volatility within a two-week window. From our initial TF results for bigrams containing LM words, we found that the simple step of expanding the token window significantly improved predictions. We found even better results using Socher's RNTN model, which can better capture the semantic structure of sentences.


5 Ideas for Future Research.

Though our models outperformed the traditional bag-of-words model, we have a number of ideas for future research. In particular, we would like to explore the significance of features other than TF, TF-IDF, and LOG1P in predicting financial outcomes. For example, we might add the Gunning fog index for readability, the ratio of forward-looking to retrospective sentences, and overall document length to our document vectors.
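
Of these, the fog index is simple to compute: 0.4 * (average sentence length + 100 * fraction of words with three or more syllables). A sketch, using a rough vowel-group syllable heuristic of our own rather than a full syllable counter:

    import re
    import nltk

    def syllables(word):
        # Rough heuristic: count groups of consecutive vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def gunning_fog(text):
        sentences = nltk.sent_tokenize(text)
        words = [w for s in sentences for w in nltk.word_tokenize(s) if w.isalpha()]
        if not sentences or not words:
            return 0.0
        complex_frac = sum(1 for w in words if syllables(w) >= 3) / len(words)
        return 0.4 * (len(words) / len(sentences) + 100 * complex_frac)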

Also, our project only looked at Item 7, which has traditionally been viewed as the section with the most predictive value. Further work would analyze the importance of other sections as well.

Finally, we would like to address some of the problems inherent in training Socher's model on a movie-review corpus and then predicting the sentiment of financial texts. For example, the model assigns the phrase "as well as a serious debt to Road Warrior"1 a "Very positive" sentiment score. However, if we were to come across the closely related phrase "Blackberry has acquired a serious debt... of $1 billion" in a 10-K, this would obviously not be the desirable outcome. Therefore, in order to prevent domain-dependent issues like this, we would modify the phrase dictionary in the Stanford model to use sentiment scores of bigram phrases learned from our linear regression model on financial text. With this more robust initialization of the phrase-to-sentiment dictionary, we expect we would see even better predictions.

1 Taken from a Rotten Tomatoes review of Reign of Fire.


    Figures.

Sentiment Label    Real-valued Score
Very negative      0.1
Negative           0.3
Neutral            0.5
Positive           0.7
Very positive      0.9

Figure 1: The table above shows our mapping of Socher's sentiment labels to real-valued scores between 0 and 1.

LM List        Returns LSE    Volatility LSE
Positive       3.1156e+10     6.9538e+10
Negative       4.2570e+11     8.9766e+11
Uncertain      4.7217e+10     1.0439e+11
Litigious      1.1229e+11     2.4287e+11
Weak Modal     2.4033e+10     5.6009e+10
Strong Modal   7.6806e+09     2.1980e+10

Figure 2: We regressed volatility as well as returns on term frequency. The table above lists the least squares error for each LM list, based on each regression.

LM Unigrams    Returns LSE    Volatility LSE
TF             1.2474e+09     7.2906e+09
TF-IDF         6.7211e+08     7.1549e+09
LOG1P          1.7754e+09     7.8272e+09

Figure 3a: The table above shows the least squares error for the regression of volatility and of returns over TF, TF-IDF, and LOG1P of all LM words.

Returns LSE    Volatility LSE
2.2529e+08     7.4709e+08

Figure 3b: The table above shows the least squares error for the regression of volatility and of returns over sentiment scores, using Socher's model.


Returns LSE    Volatility LSE
5.8612e+08     7.1025e+09

Figure 3c: The table above shows the least squares error for the regression of volatility and of returns over unigrams and bigrams.

Positive:      Creativity (2.0520e+12), Invented (1.5234e+12), Complimentary (1.2176e+12), Happiness (8.3957e+11), Prospered (8.0903e+11)
Negative:      Disastrous (-1.3996e+14), Understating (-7.9882e+13), Refusal (-7.8511e+13), Unaccounted (7.1499e+13), Overproduces (-6.8838e+13)
Uncertain:     Speculating (6.5505e+12), Unconfirmed (3.5659e+12), Suggest (-2.6084e+12), Riskiest (1.7041e+12), Unforecasted (1.5421e+12)
Litigious:     Replevin (-5.6243e+13), Cedant (4.8033e+13), Juror (2.9430e+13), Jurist (-2.4876e+13), Nonfeasance (-2.4719e+13)
Weak Modal:    Suggests (-2.4830e+12), Might (6.7969e+11), Seldom (-5.2999e+11), Appeared (-3.8511e+11), Uncertain (2.5676e+11)
Strong Modal:  Clearly (-8.3762e+11), Undoubtedly (-7.0534e+10), Undisputed (4.2742e+10), Strongly (3.7738e+10), Always (3.4733e+10)

Figure 4: The table above lists the 5 most important terms (with their weights, in decreasing order of importance) in each of the LM lists for predicting returns.

Positive:      Creativity (4.3742e+12), Invented (3.3071e+12), Complimentary (2.2796e+12), Happiness (1.9427e+12), Bolstered (1.9404e+12)
Negative:      Disastrous (-2.8544e+14), Understating (-1.6426e+14), Refusal (-1.6265e+14), Unaccounted (1.4769e+14), Overproduces (-1.4247e+14)
Uncertain:     Speculating (1.3716e+13), Unconfirmed (7.6333e+12), Suggest (-5.5072e+12), Riskiest (3.5959e+12), Unforecasted (3.2765e+12)
Litigious:     Replevin (-1.2000e+14), Cedant (1.0299e+14), Juror (6.1004e+13), Jurist (-5.2298e+13), Nonfeasance (-5.1974e+13)
Weak Modal:    Suggests (-5.2395e+12), Might (1.4396e+12), Seldom (-1.1185e+12), Appeared (-8.0127e+11), Uncertain (5.3388e+11)
Strong Modal:  Clearly (-1.7695e+12), Undoubtedly (-1.6557e+11), Undisputed (8.3130e+10), Strongly (7.5230e+10), Always (6.7526e+10)

Figure 5: The table above lists the 5 most important terms (with their weights, in decreasing order of importance) in each of the LM lists for predicting volatility.

  • !!

    12!

LM List        Returns Weight       Volatility Weight
Positive       0.0221 (1)           0.0050 (1)
Negative       -3.9834e-05 (6)      -8.1918e-06 (6)
Uncertain      -6.8091e-05 (5)      -2.2408e-05 (4)
Litigious      3.0667e-04 (2)       -1.9367e-05 (5)
Weak Modal     -2.0899e-04 (3)      7.6307e-05 (2)
Strong Modal   9.1578e-05 (4)       -4.8578e-05 (3)

Figure 6: We regressed volatility as well as returns on the values obtained from each of the LM lists, using the weights learned from the training set. The table above lists the weight for each LM list in the regression described. The lists are ranked by magnitude of their weight from 1 to 6, in parentheses.

Figure 7: Actual performance of a random sample of ~6,000 companies, bucketed into quintiles based on predicted performance using Socher's model.

Figure 8: Actual performance of a random sample of ~6,000 companies, bucketed into quintiles based on predicted performance using TF.

Figure 9: Actual performance of a random sample of ~6,000 companies, bucketed into quintiles based on predicted performance using TF-IDF.

Figure 10: Actual performance of a random sample of ~6,000 companies, bucketed into quintiles based on predicted performance using LOG1P.

Figure 11: Actual performance of a random sample of ~6,000 companies, bucketed into quintiles based on predicted performance using LM unigram and bigram TF-IDF counts.


References:

1. Loughran, Tim, and Bill McDonald. "When Is a Liability Not a Liability?" Journal of Finance 66 (2011): 35-65.
2. Socher, Richard, et al. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank." Stanford, CA: Stanford University, 2013.
3. Kearney, Colm, and Sha Liu. "Textual Sentiment in Finance: A Survey of Methods and Models." March 2013.
4. Kogan, Shimon, et al. "Predicting Risk from Financial Reports with Regression." Proc. NAACL Human Language Technologies Conf., 2009.