Using Textual Sentiment in Corporate Disclosures to Predict Financial Outcomes
Christine Kim, Kim Strauch, and Derek Racine
1 The Problem and Motivations.
Research on quantifying textual sentiment is growing rapidly, particularly in the field of
finance, where new sources of information, especially those that are difficult to obtain, can help
an investor gain a competitive edge on the market. There are a variety of methods for
accomplishing this task, but the majority of them use a bag-of-words model, which relies on the
frequency of unigrams found within a document. In order to normalize the values for use in a
formal model, several other features have been defined, including term-frequency (TF), TF-
inverse-document-frequency (TF-IDF), and LOG1P among others. These counts are then used to
produce a measure of sentiment based on a priori knowledge of how the features should be
combined or weights learned from linear regression. A dictionary is commonly used to reduce
the dimensionality of the data and infer more general latent variables, like positive or negative
tone. However, models of this type fail to account for both local context and long-range
dependencies. As a result, they may not be as accurate at predicting company outcomes.
Specifically, efforts to use annual corporate disclosures, or 10-Ks, to predict subsequent stock
returns or volatility have only been marginally effective.
2 A Survey of Existing Work.
Loughran and McDonald (2011) use 10-Ks to predict stock performance in the
period from 1994 to 2008, demonstrating a weak relationship with textual sentiment (1). Rather
than using the Harvard dictionary, the researchers compile their own unigram lists in order to
avoid misclassifying words that take on a specific meaning within the financial domain. For
example, the word "risk" often has a negative connotation in everyday speech, but in finance is
associated with a real value and frequently has no relation to the sentiment of the text. These
LM lists include those for measuring positive (abundance, influential, revolutionize, upturn),
negative (abnormal, bankruptcy, bribery, crime), litigious (indict, interrogator, notarized,
usurious), weak modal (could, depending, might, possibly), strong modal (always, highest, must,
will), and uncertain (arbitrary, rumors, undeterminable, unknown) sentiments. Overall, they find
a significant correlation between the tone measurement generated by one or more of their word
lists and subsequent stock volatility, but, notably, not future returns (1).
Socher et al. (2013) use a recursive neural tensor network (RNTN) to improve the
state of the art for positive versus negative sentence classification
from 80% to 85.4% accuracy (2). Furthermore, their model is, thus far, the only one that can
accurately capture the effect of contrastive conjunctions as well as negation and its scope at
various tree levels for both positive and negative phrases (2). They use a publicly available
dataset of labeled movie reviews called the Stanford Sentiment Treebank to learn their model
parameters and are currently crowd-sourcing further labeling on their website:
http://nlp.stanford.edu/sentiment/.
Kogan et al. (2009) use support vector regression to predict stock volatility using word
features of the MD&A sections of 10-Ks. Overall, they find a significant improvement in a
model that combines their sentiment measure with historical volatility over those that use
historical volatility alone to predict future volatility (4). One of their key findings was that the
model improves following the passage of the Sarbanes-Oxley Act in 2002, suggesting it
successfully increased the information conveyed in Securities and Exchange Commission (SEC)
filings (4).
3 A Description of our Data, Algorithms, and Methods.
Data:
We aggregated a dataset of approximately 300,000 10-K filings from 1994 to 2013 from
the SEC website at http://www.sec.gov/edgar.shtml and extracted Item 7, "Management's
Discussion and Analysis of Financial Condition and Results of Operations," from each of them.
We used Yahoo! Finance to obtain corresponding historical stock returns two weeks before and
after the release of each company's disclosure. We then calculated the average difference in
adjusted closing price and volatility before and after the event to be used as our dependent
variables. We used the Punkt sentence tokenizer from the Python NLTK package to pre-process our
corpus such that bigram counts did not include contexts that crossed a sentence boundary. After
reconciling differences in the available financial data we were able to acquire, the data set
(N=6300) was divided into a training set (60%), two cross-validation sets (15% each), and a test
set (10%).
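The sentence-boundary constraint on bigram counts can be sketched as follows. This is an illustrative snippet rather than our actual pipeline: a naive regex splitter stands in for NLTK's Punkt tokenizer, and `sentence_bigrams` is a hypothetical helper name.

```python
import re

def sentence_bigrams(text):
    """Collect bigrams within each sentence separately, so that no
    bigram crosses a sentence boundary."""
    bigrams = []
    # Naive splitter standing in for the Punkt sentence tokenizer.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        bigrams.extend(zip(tokens, tokens[1:]))
    return bigrams
```

Here, for "Revenue grew. Costs fell." the bigram ("grew", "costs") is never counted, because the two words sit in different sentences.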
The Loughran and McDonald unigram word lists (Positive, Negative, Uncertainty,
Litigious, Modal Strong, and Modal Weak) served as our initial dictionary. We used MATLAB
to perform our machine learning tasks, as it provided a convenient format for big data
manipulation, and piped output into Excel spreadsheets as an intermediate step between
programs.
Furthermore, we extended Richard Socher's Recursive Neural Tensor Network (RNTN)
model, which is trained on the Stanford Sentiment Treebank, a set of publicly available parsed
and labeled phrases from Rotten Tomatoes movie reviews. We used the same training and test
sets from our word feature models to evaluate Socher's model.
Algorithms and Methods:
Initially, we used unigrams to establish a baseline model and learn more about our data.
While Loughran and McDonald just counted up the number of words within each list, weighting
each one equally, we wanted to learn the weights from the data. Therefore, we regressed our
dependent variables, stock returns and volatility, on TF counts for the LM words within each list
in our training set. TF normalizes raw frequency to the length of the document (TF = raw
frequency / total words in document).
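A minimal sketch of this baseline step follows. Our actual analysis was done in MATLAB; this is an illustrative Python analogue using NumPy's least squares, and `tf_vector` and `fit_weights` are hypothetical helper names.

```python
import numpy as np

def tf_vector(doc_tokens, vocabulary):
    """TF = raw frequency / total words in the document, per vocabulary word."""
    total = len(doc_tokens)
    return np.array([doc_tokens.count(w) / total for w in vocabulary])

def fit_weights(X, y):
    """Least-squares weights for regressing an outcome y (returns or
    volatility) on a (documents x words) TF feature matrix X."""
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights
```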
Second, we wanted to see how informative each LM list was in terms of their predictive
ability. In 10-Ks, positive words could be less useful for measuring sentiment because they're
frequently negated: for example, it's common to see phrases like "did not benefit." As such, we
calculated the least-squares error (LSE) on our test set for each list. Then we regressed the actual
stock returns and volatility on those predicted for each list, using the weights learned from our
training set, in our first cross-validation set to see if we could improve our model's performance.
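This second-stage regression, which learns one combining weight per LM list from each list's stand-alone predictions, can be sketched as follows (illustrative Python; `combine_lists` is a hypothetical helper name).

```python
import numpy as np

def combine_lists(per_list_predictions, actual):
    """Regress the actual outcome on each LM list's stand-alone
    predictions (one column per list) to learn combining weights."""
    P = np.column_stack(per_list_predictions)
    weights, *_ = np.linalg.lstsq(P, actual, rcond=None)
    return weights
```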
Having based the previous models on TF, we wanted to explore other metrics for
representing document word counts, such as TF-IDF and LOG1P. TF-IDF normalizes the raw
frequency to both the length of the document and the number of documents in the corpus in
which the term appears (TF-IDF = TF * log(total documents in corpus / documents in which
term appeared at least once)). This adjustment makes rare words count for more than they would
otherwise, as they occur at low frequencies by definition. We speculated that the appearance of
such infrequent words could be very informative for differentiating financial texts, as terms such
as "bankrupt" and "merger" do not usually occur at high frequency within a document, but could
be strong predictors of future stock performance. LOG1P adjusts the raw frequency such that
there are diminishing returns to larger term counts (LOG1P = log(raw frequency + 1)). Note the
plus 1 is to prevent frequencies of 0 from causing the feature value to be the log of 0, which is
undefined. As with our TF model, we first regressed stock returns and volatility on the word
features for each of the lists on the training set, then learned the weights for each list on our first
cross-validation set, and finally evaluated our results on the test set.
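The two feature transformations above follow directly from their definitions; a small sketch (illustrative code, with hypothetical function names):

```python
import math

def tf_idf(tf, docs_with_term, total_docs):
    """TF-IDF = TF * log(total documents / documents containing the term)."""
    return tf * math.log(total_docs / docs_with_term)

def log1p_feature(raw_frequency):
    """LOG1P = log(raw frequency + 1); the +1 keeps a zero count from
    producing log(0), which is undefined."""
    return math.log(raw_frequency + 1)
```

Note that, for a fixed TF, a term appearing in fewer documents gets a larger TF-IDF value, matching the intuition that rare words count for more.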
The bag-of-words models are naïve in that they ignore local context, which may provide
important qualifying information. For example, in the phrases "did not succeed" and "beyond
expectations," the "not" preceding "succeed" and the "beyond" before "expectations" significantly
modify the sentiment of the words alone. To determine whether incorporating such context
would improve our predictions, we compiled TF counts for all bigrams containing an LM word
in each of the lists. Our model was formed in the same way it was for unigrams: first the weights
for each word were learned on the training set and then the weights for the individual lists were
learned from the first cross-validation set. Finally, we regressed stock returns and volatility in
our second cross-validation set on the predictions made by unigrams and bigrams again to see
whether or not our model was more accurate.
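Compiling TF counts for bigrams containing an LM word can be sketched as follows (illustrative Python; `lm_bigram_tf` is a hypothetical helper name).

```python
def lm_bigram_tf(doc_tokens, lm_words):
    """TF counts for every bigram that contains at least one LM word,
    normalized by the number of bigrams in the document."""
    total = max(len(doc_tokens) - 1, 1)  # number of bigrams
    counts = {}
    for a, b in zip(doc_tokens, doc_tokens[1:]):
        if a in lm_words or b in lm_words:
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return {bigram: c / total for bigram, c in counts.items()}
```

This way a negated phrase such as "not benefit" becomes a feature distinct from "benefit" on its own.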
After verifying the importance of local context in predicting financial performance, we
wanted to see whether long-range dependencies would further enhance our predictions. Socher's
code trains an RNTN on a labeled movie-review data set of parsed sentences, learning a
composition function at each node. As such, the effects of negation, with the word "not," and of
contrastive conjunctions, with terms like "but," "however," and "nonetheless," are propagated
through the sentence. Therefore, we adapted Socher's model to analyze the tone of our 10-Ks. In
its original form, Socher's implementation classifies the sentiment of a sentence as "Very
negative," "Negative," "Neutral," "Positive," or "Very positive." We mapped these categories to
real-valued sentiment scores between 0 and 1 (see Figure 1) in order to create our predictive
model. To capture the overall tone of a document, we simply took an average of the sentiment
scores for each sentence. Similar to our other analyses, we regressed stock returns and volatility
on these measurements in our training set and then evaluated the model on the test set.
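The label-to-score mapping and the document-level averaging can be sketched as follows (illustrative Python; the scores follow Figure 1, and `document_tone` is a hypothetical helper name).

```python
# Mapping of Socher's five sentiment labels to real values, per Figure 1.
LABEL_SCORE = {"Very negative": 0.1, "Negative": 0.3, "Neutral": 0.5,
               "Positive": 0.7, "Very positive": 0.9}

def document_tone(sentence_labels):
    """Average the per-sentence sentiment scores to obtain a single
    tone value for the whole document."""
    scores = [LABEL_SCORE[label] for label in sentence_labels]
    return sum(scores) / len(scores)
```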
Challenges:
Our first challenge was collecting the set of 300,000 10-K filings from the online
database at http://www.sec.gov/edgar.shtml. In particular, we did not anticipate the memory,
time, and computational constraints that would accompany managing such a massive amount of
data. For example, the dataset was so large that we could not store all of it on one computer. Our
second roadblock came from endeavoring to use an all-inclusive vocabulary, a task we soon
found to be intractable. Specifically, the time complexity was such that it took days to run our
program to obtain feature vectors for a few thousand documents. Thus, having a large dataset
turned out to be both a blessing and a curse: large enough to derive meaningful results, but also
the source of a headache or two over scalability issues. Furthermore, we ran into underflow errors when doing
our analyses on TF-IDF counts.
4 The Results of our Experiments.
We established our baseline using a linear regression of stock returns and volatility on TF
counts in the training set for the LM unigrams in each of the lists. The least squares error (LSE)
of these models can be seen in Figure 2. Based on the absolute value of the weights assigned to
each word, we determined which words were the most informative in each list. The top five
words for each list can be seen in Figure 4. As expected, all of the positive words have positive
weights and four out of the five negative words have negative weights. These results correspond
with the intuition that positive sentiment correlates with an increase in returns whereas negative
sentiment predicts a decrease. With the exception of positive words, the top five most valuable
words for all of the lists were the same for both returns and volatility, although their weights
obviously differed. While the channel for how tone affects volatility is less well understood, our
findings suggest that positive words correlate with increased volatility whereas negative words
predict a decrease.
We then regressed stock returns and volatility in our cross-validation set on the
predictions for each of the LM lists. The LSE for this regression can be seen in Figure 3a. We
divided companies into quintiles based on what our TF model predicted for average returns and
plotted the actual cumulative returns in the two-week window following the release of these
companies' 10-Ks (see Figure 7). In learning the weights for each of the LM lists, we were
somewhat surprised by the results, as we expected negative words to be more informative than
positive ones. In contrast, for returns, the most important lists in order were: Positive, Litigious,
Weak Modal, Strong Modal, Uncertain, and Negative. For volatility, they were: Positive, Weak
Modal, Strong Modal, Uncertain, Litigious, and Negative. The actual coefficients in the linear
regression appear in Figure 6. Not surprisingly, the weight for the Negative list was negative and
that for the Positive list was positive. One explanation for the unexpected order of importance of
the lists, however, could be that our sample size was simply not large enough. Because our cross-
validation set was small, the results could be easily biased.
The LSE for returns and volatility using TF-IDF and LOG1P features of LM unigrams
can be seen in Figure 3a. The LSE for returns using TF-IDF was roughly half of that using TF.
In contrast, the LSEs for volatility using TF and TF-IDF were comparable.
Thus, our hypothesis that TF-IDF would outperform TF was confirmed, at least for returns. LOG1P, however,
performed worse than both TF and TF-IDF. This finding suggests that the frequencies of
common words within a document contain valuable information.
Our combined unigram and bigram model improved on our results in terms of LSE
compared to any of the unigram models (See Figure 3c), thus supporting our hypothesis that
local context is important, and more explanatory than single words alone. That said, the
improvement was not as large as we expected, especially for predicting volatility. The
cumulative returns graph for this model can be seen in Figure 11.
While we were initially skeptical about using a RNTN model trained on a Rotten
Tomatoes Movie Review dataset, we were pleasantly surprised by the results. On our training
set, we ran a linear regression of stock returns and volatility on the sentiment scores obtained
from running Socher's code, and obtained lower LSEs (see Figure 3b) than those for the TF,
TF-IDF, and LOG1P models of LM unigrams as well as bigrams. This suggests that long-range
dependencies are very informative and predict future performance better than unigrams or
bigram context. We divided companies into quintiles based on predicted average returns using
Socher's model, similar to our analysis for the other models (see Figure 7). Though our results
weren't perfect, they were surprisingly good.
Bag-of-words models based on dictionaries reduce the dimensionality of the data and
have been the industry-standard for evaluating the sentiment of financial texts with
computational methods. However, this method fails to account for context of any
kind. From our research, we have concluded that incorporating local context and long-range
dependencies into a sentiment analysis model is helpful in predicting a company's subsequent
stock returns and volatility within a two-week window. From our initial results for TF counts of
bigrams containing LM words, we found that the simple step of expanding the token window
significantly improved predictions. We found even better results using Socher's RNTN model,
which can better understand the semantic structure of sentences.
5 Ideas for future research.
Though our models outperformed the traditional bag-of-words model, we have a number
of ideas for future research. In particular, we would like to explore the significance of features
other than TF, TF-IDF, or LOG1P in predicting financial outcomes. For example, we might add
the Gunning fog index for readability, the ratio of forward-looking to retrospective sentences,
and overall document length to our document vectors.
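For instance, the Gunning fog index combines average sentence length with the share of complex (three-or-more-syllable) words. A sketch of the standard formula follows; the raw counts themselves would come from a tokenizer and syllable counter.

```python
def gunning_fog(num_words, num_sentences, num_complex_words):
    """Gunning fog index: 0.4 * (average sentence length +
    percentage of complex, i.e. three-or-more-syllable, words)."""
    return 0.4 * (num_words / num_sentences
                  + 100.0 * num_complex_words / num_words)
```

A document with 100 words across 5 sentences, 10 of them complex, scores 0.4 * (20 + 10) = 12, roughly the reading level of a high-school senior.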
Also, our project only looked at Item 7, which has traditionally been viewed as the
section with the most predictive value. Further work would analyze the importance of other
sections as well.
Finally, we would like to address some of the problems inherent in using a movie review
corpus to train Socher's model and yet predict the sentiment of financial texts. For example, the
model assigns the phrase "as well as a serious debt to Road Warrior"1 a "Very positive"
sentiment score. However, if we were to come across the closely related phrase "Blackberry has
acquired a serious debt... of $1 billion" in a 10-K, this would obviously not be the desirable
outcome. Therefore, in order to prevent domain-dependent issues like this, we would modify the
phrase dictionary in the Stanford model to use sentiment scores of bigram phrases learned from
our linear regression model on financial text. With this more robust initialization of the phrase-
to-sentiment dictionary, we expect we would see even better predictions.
1 Taken from a Rotten Tomatoes review of The Reign of Fire.
Figures.
Sentiment Label   Real-valued Score
Very negative     0.1
Negative          0.3
Neutral           0.5
Positive          0.7
Very positive     0.9
Figure 1: The table above shows our mapping of Socher's sentiment labels to real-valued scores between 0 and 1.
LM List        Returns LSE     Volatility LSE
Positive       3.1156e+10      6.9538e+10
Negative       4.2570e+11      8.9766e+11
Uncertain      4.7217e+10      1.0439e+11
Litigious      1.1229e+11      2.4287e+11
Weak Modal     2.4033e+10      5.6009e+10
Strong Modal   7.6806e+09      2.1980e+10
Figure 2: We regressed volatility as well as returns on term frequency. The table above lists the least squares error
for each LM list, based on each regression.
LM Unigrams    Returns LSE     Volatility LSE
TF             1.2474e+09      7.2906e+09
TF-IDF         6.7211e+08      7.1549e+09
LOG1P          1.7754e+09      7.8272e+09
Figure 3a: The table above shows the least squares error for the regression of volatility and of returns over TF, TF-IDF, and LOG1P of all LM words.
Returns LSE    Volatility LSE
2.2529e+08     7.4709e+08
Figure 3b: The table above shows the least squares error for the regression of volatility and of returns over sentiment scores, using Socher's model.
Returns LSE    Volatility LSE
5.8612e+08     7.1025e+09
Figure 3c: The table above shows the least squares error for the regression of volatility and of returns over unigrams and bigrams.
Positive:      Creativity (2.0520e+12), Invented (1.5234e+12), Complimentary (1.2176e+12), Happiness (8.3957e+11), Prospered (8.0903e+11)
Negative:      Disastrous (-1.3996e+14), Understating (-7.9882e+13), Refusal (-7.8511e+13), Unaccounted (7.1499e+13), Overproduces (-6.8838e+13)
Uncertain:     Speculating (6.5505e+12), Unconfirmed (3.5659e+12), Suggest (-2.6084e+12), Riskiest (1.7041e+12), Unforecasted (1.5421e+12)
Litigious:     Replevin (-5.6243e+13), Cedant (4.8033e+13), Juror (2.9430e+13), Jurist (-2.4876e+13), Nonfeasance (-2.4719e+13)
Weak Modal:    Suggests (-2.4830e+12), Might (6.7969e+11), Seldom (-5.2999e+11), Appeared (-3.8511e+11), Uncertain (2.5676e+11)
Strong Modal:  Clearly (-8.3762e+11), Undoubtedly (-7.0534e+10), Undisputed (4.2742e+10), Strongly (3.7738e+10), Always (3.4733e+10)
Figure 4: The table above lists the 5 most important terms (ranked 1-5) and their weights in each of the LM lists for predicting returns.
Positive:      Creativity (4.3742e+12), Invented (3.3071e+12), Complimentary (2.2796e+12), Happiness (1.9427e+12), Bolstered (1.9404e+12)
Negative:      Disastrous (-2.8544e+14), Understating (-1.6426e+14), Refusal (-1.6265e+14), Unaccounted (1.4769e+14), Overproduces (-1.4247e+14)
Uncertain:     Speculating (1.3716e+13), Unconfirmed (7.6333e+12), Suggest (-5.5072e+12), Riskiest (3.5959e+12), Unforecasted (3.2765e+12)
Litigious:     Replevin (-1.2000e+14), Cedant (1.0299e+14), Juror (6.1004e+13), Jurist (-5.2298e+13), Nonfeasance (-5.1974e+13)
Weak Modal:    Suggests (-5.2395e+12), Might (1.4396e+12), Seldom (-1.1185e+12), Appeared (-8.0127e+11), Uncertain (5.3388e+11)
Strong Modal:  Clearly (-1.7695e+12), Undoubtedly (-1.6557e+11), Undisputed (8.3130e+10), Strongly (7.5230e+10), Always (6.7526e+10)
Figure 5: The table above lists the 5 most important terms (ranked 1-5) and their weights in each of the LM lists for predicting volatility.
LM List        Returns Weight        Volatility Weight
Positive       0.0221 (1)            0.0050 (1)
Negative       -3.9834e-05 (6)       -8.1918e-06 (6)
Uncertain      -6.8091e-05 (5)       -2.2408e-05 (4)
Litigious      3.0667e-04 (2)        -1.9367e-05 (5)
Weak Modal     -2.0899e-04 (3)       7.6307e-05 (2)
Strong Modal   9.1578e-05 (4)        -4.8578e-05 (3)
Figure 6: We regressed volatility as well as returns on the values obtained from each of the LM lists, using the
weights learned from the training set. The table above lists the weight for each LM list in the regression described. The lists are ranked by the magnitude of their weight from 1 to 6, in parentheses.
Figure 7: Actual Performance of a random sample of ~6,000 companies, bucketed in quintiles based on predicted performance using Socher's model
Figure 8: Actual Performance of a random sample of ~6,000 companies, bucketed in quintiles based on predicted performance using TF
Figure 9: Actual Performance of a random sample of ~6,000 companies, bucketed in quintiles based on predicted performance using TF-IDF
Figure 10: Actual Performance of a random sample of ~6,000 companies, bucketed in quintiles based on predicted performance using LOG1P
Figure 11: Actual Performance of a random sample of ~6,000 companies, bucketed in quintiles based on predicted performance using LM unigram and bigram TF-IDF Counts
References:
1. Loughran, Tim, and Bill McDonald. "When Is a Liability Not a Liability?" Journal of Finance 66 (2011): 35-65.
2. Socher, Richard, et al. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank." Stanford, CA: Stanford University, 2013.
3. Kearney, Colm, and Sha Liu. "Textual Sentiment in Finance: A Survey of Methods and Models." March 2013.
4. Kogan, Shimon, et al. "Predicting Risk from Financial Reports with Regression." Proc. NAACL Human Language Technologies Conf., 2009.