
Using Textual Sentiment in Corporate Disclosures to Predict Financial Outcomes

Christine Kim, Kim Strauch, and Derek Racine

    1 The Problem and Motivations.

Research on quantifying textual sentiment is growing rapidly, especially in finance, where new sources of information, particularly those that are difficult to obtain, can help an investor gain a competitive edge in the market. There are a variety of methods for accomplishing this task, but the majority of them use a bag-of-words model, which relies on the frequency of unigrams found within a document. To normalize these values for use in a formal model, several features have been defined, including term frequency (TF), TF-inverse document frequency (TF-IDF), and LOG1P, among others. These counts are then used to produce a measure of sentiment, based either on a priori knowledge of how the features should be combined or on weights learned from linear regression. A dictionary is commonly used to reduce the dimensionality of the data and infer more general latent variables, like positive or negative tone. However, models of this type fail to account for both local context and long-range dependencies. As a result, they may not be as accurate at predicting company outcomes. In particular, efforts to use annual corporate disclosures, or 10-Ks, to predict subsequent stock returns or volatility have been only marginally effective.

    2 A Survey of Existing Work.

Loughran and McDonald (2011) use 10-Ks to predict stock performance in the period from 1994 to 2008, demonstrating a weak relationship with textual sentiment (1). Rather than using the Harvard dictionary, the researchers compile their own unigram lists in order to avoid misclassifying words that take on a specific meaning within the financial domain. For example, the word "risk" often has a negative connotation in everyday speech, but in finance it is associated with a real value and frequently has no relation to the sentiment of the text. These LM lists measure positive (abundance, influential, revolutionize, upturn), negative (abnormal, bankruptcy, bribery, crime), litigious (indict, interrogator, notarized, usurious), weak modal (could, depending, might, possibly), strong modal (always, highest, must, will), and uncertain (arbitrary, rumors, undeterminable, unknown) sentiments. Overall, they find a significant correlation between the tone measurement generated by one or more of their word lists and subsequent stock volatility, but, notably, not future returns (1).

Socher et al. (2013) use a recursive neural tensor network (RNTN) to improve the state of the art for positive-versus-negative sentence classification from 80% to 85.4% accuracy (2). Furthermore, their model is, thus far, the only one that can accurately capture the effect of contrastive conjunctions, as well as negation and its scope at various tree levels, for both positive and negative phrases (2). They use a publicly available dataset of labeled movie reviews, the Stanford Sentiment Treebank, to learn their model parameters and are crowd-sourcing further labeling on their website: http://nlp.stanford.edu/sentiment/.

Kogan et al. (2009) use support vector regression to predict stock volatility from word features of the MD&A sections of 10-Ks. Overall, they find a significant improvement in a model that combines their sentiment measure with historical volatility over models that use historical volatility alone to predict future volatility (4). One of their key findings is that the model improves following the passage of the Sarbanes-Oxley Act in 2002, suggesting the act successfully increased the information conveyed in Securities and Exchange Commission (SEC) filings (4).


3 A Description of our Data, Algorithms, and Methods.

Data:

We aggregated a dataset of approximately 300,000 10-K filings from 1994 to 2013 from the SEC website at http://www.sec.gov/edgar.shtml and extracted Item 7, "Management's Discussion and Analysis of Financial Condition and Results of Operations," from each of them. We used Yahoo! Finance to obtain corresponding historical stock returns two weeks before and after the release of each company's disclosure. We then calculated the average difference in adjusted closing price and in volatility before and after the event to use as our dependent variables. We used the Punkt sentence tokenizer from the Python NLTK package to pre-process our corpus so that bigram counts did not include contexts that crossed a sentence boundary. After reconciling differences in the available financial data we were able to acquire, the data set (N=6300) was divided into a training set (60%), two cross-validation sets (15% each), and a test set (10%).
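
As a brief illustration, the sketch below computes these event-window deltas, assuming daily adjusted closing prices are already in a pandas Series indexed by trading date (the data layout and function name are ours, not part of the original pipeline):

    import pandas as pd

    def event_window_deltas(prices: pd.Series, filing_date, window: int = 10):
        # prices: daily adjusted closing prices indexed by trading date.
        # window: ~two calendar weeks of trading days on each side of the filing.
        returns = prices.pct_change().dropna()
        before = returns[returns.index < filing_date].tail(window)
        after = returns[returns.index >= filing_date].head(window)
        delta_return = after.mean() - before.mean()        # change in average daily return
        delta_volatility = after.std() - before.std()      # change in return volatility
        return delta_return, delta_volatility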

The Loughran and McDonald unigram word lists (Positive, Negative, Uncertainty, Litigious, Modal Strong, and Modal Weak) served as our initial dictionary. We used MATLAB to perform our machine learning tasks, as it provided a convenient format for big data manipulation, and piped output into Excel spreadsheets as an intermediate step between programs.

Furthermore, we extended Richard Socher's Recursive Neural Tensor Network (RNTN) model, which is trained on the Stanford Sentiment Treebank, a set of publicly available parsed and labeled phrases from Rotten Tomatoes movie reviews. We used the same training and test sets from our word-feature models to evaluate Socher's model.


    Algorithms and Methods:

Initially, we used unigrams to establish a baseline model and learn more about our data. While Loughran and McDonald simply counted the number of words within each list, weighting each one equally, we wanted to learn the weights from the data. Therefore, we regressed our dependent variables, stock returns and volatility, on TF counts for the LM words within each list in our training set. TF normalizes raw frequency by the length of the document (TF = raw frequency / total words in document).
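
A minimal sketch of this feature computation, assuming tokens is a tokenized Item 7 and vocabulary is one LM word list (both names are illustrative):

    from collections import Counter

    def tf_features(tokens, vocabulary):
        # TF = raw count of the term / total words in the document.
        counts = Counter(t.lower() for t in tokens)
        n = len(tokens)
        return {w: counts[w] / n for w in vocabulary}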

Second, we wanted to see how informative each LM list was in terms of its predictive ability. In 10-Ks, positive words could be less useful for measuring sentiment because they're frequently negated: for example, it's common to see phrases like "did not benefit." As such, we calculated the least-squares error (LSE) on our test set for each list. Then, in our first cross-validation set, we regressed the actual stock returns and volatility on the values predicted for each list, using the weights learned from our training set, to see if we could improve our model's performance.
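
This two-stage procedure can be sketched as follows, assuming X_train[name] and X_cv[name] hold documents-by-words TF matrices per LM list and y_train, y_cv hold the dependent variable; the variable names are ours (the original analysis was done in MATLAB):

    import numpy as np

    def fit_ls(X, y):
        # Ordinary least squares with an intercept column appended.
        A = np.column_stack([X, np.ones(len(y))])
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        return w

    def predict_ls(X, w):
        return np.column_stack([X, np.ones(len(X))]) @ w

    # Stage 1: per-word weights, one regression per LM list, on the training set.
    #   weights = {name: fit_ls(X_train[name], y_train) for name in lm_lists}
    # Stage 2: per-list weights, regressing actuals on per-list predictions in the CV set.
    #   P_cv = np.column_stack([predict_ls(X_cv[name], weights[name]) for name in lm_lists])
    #   w_lists = fit_ls(P_cv, y_cv)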

Having based the previous models on TF, we wanted to explore other metrics for representing document word counts, such as TF-IDF and LOG1P. TF-IDF normalizes the raw frequency both to the length of the document and to the number of documents in the corpus in which the term appears (TF-IDF = TF * log(total documents in corpus / documents in which the term appears at least once)). This adjustment makes rare words count for more than they would otherwise, as they occur at low frequencies by definition. We speculated that the appearance of such infrequent words could be very informative for differentiating financial texts, as terms such as "bankrupt" and "merger" do not usually occur at high frequency within a document, but could be strong predictors of future stock performance. LOG1P adjusts the raw frequency so that there are diminishing returns to larger term counts (LOG1P = log(raw frequency + 1)). Note that the plus 1 prevents frequencies of 0 from producing the log of 0, which is undefined. As with our TF model, we first regressed stock returns and volatility on the word features for each of the lists on the training set, then learned the weights for each list on our first cross-validation set, and finally evaluated our results on the test set.
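
Both features follow directly from the formulas above; a sketch, assuming docs is a list of token lists, one per 10-K:

    import math
    from collections import Counter

    def tfidf_and_log1p(docs, vocabulary):
        n_docs = len(docs)
        df = Counter()                           # document frequency per term
        for doc in docs:
            df.update(set(doc))
        features = []
        for doc in docs:
            counts, n = Counter(doc), len(doc)
            # TF-IDF = TF * log(total docs / docs containing the term); 0 if never seen.
            tfidf = {w: (counts[w] / n) * math.log(n_docs / df[w]) if df[w] else 0.0
                     for w in vocabulary}
            # LOG1P = log(raw frequency + 1), giving diminishing returns to large counts.
            log1p = {w: math.log(counts[w] + 1) for w in vocabulary}
            features.append((tfidf, log1p))
        return features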

The bag-of-words models are naïve in that they ignore local context, which may provide important qualifying information. For example, in the phrases "did not succeed" and "beyond expectations," the "not" preceding "succeed" and the "beyond" before "expectations" significantly modify the sentiment of the words alone. To determine whether incorporating such context would improve our predictions, we compiled TF counts for all bigrams containing an LM word in each of the lists. Our model was formed in the same way as for unigrams: first the weights for each word were learned on the training set, and then the weights for the individual lists were learned from the first cross-validation set. Finally, we again regressed stock returns and volatility in our second cross-validation set on the predictions made by unigrams and bigrams to see whether or not our model was more accurate.
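
A sketch of the bigram extraction, mirroring the Punkt preprocessing described earlier so that bigrams never cross a sentence boundary (requires NLTK's punkt data; lm_words stands in for one LM list):

    import nltk
    from collections import Counter

    def lm_bigram_tf(text, lm_words):
        counts = Counter()
        total = 0
        for sent in nltk.sent_tokenize(text):            # Punkt sentence tokenizer
            tokens = [t.lower() for t in nltk.word_tokenize(sent)]
            total += max(len(tokens) - 1, 0)
            for a, b in nltk.bigrams(tokens):
                if a in lm_words or b in lm_words:       # keep only LM-anchored bigrams
                    counts[(a, b)] += 1
        return {bg: c / total for bg, c in counts.items()} if total else {}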

After verifying the importance of local context in predicting financial performance, we wanted to see whether long-range dependencies would further enhance our predictions. Socher's code trains an RNTN on a labeled movie-review data set of parsed sentences, learning composition functions at each node. As such, the effects of negation, with the word "not," and of contrastive conjunctions, with terms like "but," "however," and "nonetheless," are propagated through the sentence's parse tree. Therefore, we adapted Socher's model to analyze the tone of our 10-Ks. In its original form, Socher's implementation classifies the sentiment of a sentence as "Very negative," "Negative," "Neutral," "Positive," or "Very positive." We mapped these categories to real-valued sentiment scores between 0 and 1 (see Figure 1) in order to create our predictive model. To capture the overall tone of a document, we simply took an average of the sentiment scores for each sentence. As in our other analyses, we regressed stock returns and volatility on these measurements in our training set and then evaluated the model on the test set.
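
The mapping and the document-level average are straightforward; in the sketch below, classify_sentence stands in for a call into the RNTN classifier and is hypothetical:

    # Figure 1 mapping of Socher's labels to real-valued scores.
    LABEL_SCORES = {"Very negative": 0.1, "Negative": 0.3, "Neutral": 0.5,
                    "Positive": 0.7, "Very positive": 0.9}

    def document_tone(sentences, classify_sentence):
        # Average the per-sentence scores to get one tone measure per 10-K.
        scores = [LABEL_SCORES[classify_sentence(s)] for s in sentences]
        return sum(scores) / len(scores) if scores else 0.5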

    Challenges:

Our first challenge was collecting the set of 300,000 10-K filings from the online database at http://www.sec.gov/edgar.shtml. In particular, we did not anticipate the memory, time, and computational constraints that would accompany managing such a massive amount of data. For example, the dataset was so large that we could not store all of it on one computer. Our second roadblock came when we endeavored to use an all-inclusive vocabulary, only to find the task intractable: the time complexity was such that it took days to run our program to obtain feature vectors for a few thousand documents. Thus, having a large dataset turned out to be both a blessing and a curse: large enough to derive meaningful results, but also large enough to cause a headache or two over scalability issues. Furthermore, we ran into underflow errors when doing our analyses on TF-IDF counts.

    4 The Results of our Experiments.

We established our baseline using a linear regression of stock returns and volatility on TF counts in the training set for the LM unigrams in each of the lists. The least squares error (LSE) of these models can be seen in Figure 2. Based on the absolute value of the weights assigned to each word, we determined which words were the most informative in each list. The top five words for each list can be seen in Figure 4. As expected, all of the positive words have positive weights, and four out of the five negative words have negative weights. These results correspond with the intuition that positive sentiment correlates with an increase in returns whereas negative sentiment predicts a decrease. With the exception of positive words, the top five most valuable words for all of the lists were the same for both returns and volatility, although their weights differed (see Figure 5). While the channel through which tone affects volatility is less well understood, our findings suggest that positive words correlate with increased volatility whereas negative words predict a decrease.

We then regressed stock returns and volatility in our cross-validation set on the predictions for each of the LM lists. The LSE for this regression can be seen in Figure 3a. We divided companies into quintiles based on what our TF model predicted for average returns and plotted the actual cumulative returns in the two-week window following the release of these companies' 10-Ks (see Figure 8). In learning the weights for each of the LM lists, we were somewhat surprised by the results, as we expected negative words to be more informative than positive ones. In contrast, for returns, the most important lists in order were: Positive, Litigious, Weak Modal, Strong Modal, Uncertain, and Negative. For volatility, they were: Positive, Weak Modal, Strong Modal, Uncertain, Litigious, and Negative. The actual coefficients in the linear regression appear in Figure 6. Not surprisingly, the weight for the Negative list was negative and that for the Positive list was positive. One explanation for the unexpected order of importance of the lists, however, could be that our sample size was simply not large enough: because our cross-validation set was small, the results could easily be biased.
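
The quintile analysis behind Figures 7 through 11 amounts to the following sketch, assuming predicted is a pandas Series of predicted average returns indexed by company and daily_returns is a days-by-companies DataFrame of actual post-filing returns (both names are ours):

    import pandas as pd

    def quintile_cumulative_returns(predicted, daily_returns):
        quintiles = pd.qcut(predicted, 5, labels=False)   # 0 = lowest predicted returns
        paths = {}
        for q in range(5):
            cols = predicted.index[quintiles == q]
            # Equal-weighted cumulative return path for the quintile.
            paths[q + 1] = (1 + daily_returns[cols].mean(axis=1)).cumprod() - 1
        return pd.DataFrame(paths)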

The LSE for returns and volatility using TF-IDF and LOG1P features of LM unigrams can be seen in Figure 3a. The LSE for returns using TF-IDF was an order of magnitude smaller than that using TF. In contrast, the LSEs for volatility using TF and TF-IDF were comparable. Thus, our hypothesis that TF-IDF would outperform TF was confirmed for returns. LOG1P, however, performed worse than both TF and TF-IDF. This finding suggests that the frequencies of common words within a document contain valuable information.


Our combined unigram and bigram model improved on any of the unigram models in terms of LSE (see Figure 3c), supporting our hypothesis that local context is important and more explanatory than single words alone. That said, the improvement was not as large as we expected, especially for predicting volatility. The cumulative returns graph for this model can be seen in Figure 11.

While we were initially skeptical about using an RNTN model trained on a Rotten Tomatoes movie-review dataset, we were pleasantly surprised by the results. On our training set, we ran a linear regression of stock returns and volatility on the sentiment scores obtained from running Socher's code, which decreased the LSEs (see Figure 3b) compared to those for the TF, TF-IDF, and LOG1P models of LM unigrams as well as for bigrams. This suggests that long-range dependencies are very informative and predict future performance better than unigram or bigram context. We divided companies into quintiles based on average returns predicted using Socher's model, similar to our analysis for the other models (see Figure 7). Though our results weren't perfect, they were surprisingly good.

Bag-of-words models based on dictionaries reduce the dimensionality of the data and have been the industry standard for evaluating the sentiment of financial texts with computational methods. However, this approach fails to account for context of any kind. From our research, we have concluded that incorporating local context and long-range dependencies into a sentiment analysis model helps predict a company's subsequent stock returns and volatility within a two-week window. From our initial TF results for bigrams containing LM words, we found that the simple step of expanding the token window significantly improved predictions. We found even better results using Socher's RNTN model, which can better capture the semantic structure of sentences.


5 Ideas for Future Research.

Though our models outperformed the traditional bag-of-words model, we have a number of ideas for future research. In particular, we would like to explore the significance of features other than TF, TF-IDF, and LOG1P in predicting financial outcomes. For example, we might add the Gunning fog index for readability, the ratio of forward-looking to retrospective sentences, and overall document length to our document vectors.
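
Of these, the fog index is simple to compute: 0.4 * (average sentence length + 100 * fraction of words with three or more syllables). A sketch, using a rough vowel-group syllable heuristic of our own rather than a full syllable counter:

    import re
    import nltk

    def syllables(word):
        # Rough heuristic: count groups of consecutive vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def gunning_fog(text):
        sentences = nltk.sent_tokenize(text)
        words = [w for s in sentences for w in nltk.word_tokenize(s) if w.isalpha()]
        if not sentences or not words:
            return 0.0
        complex_frac = sum(1 for w in words if syllables(w) >= 3) / len(words)
        return 0.4 * (len(words) / len(sentences) + 100 * complex_frac)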

Also, our project only looked at Item 7, which has traditionally been viewed as the section with the most predictive value. Further work would analyze the importance of other sections as well.

Finally, we would like to address some of the problems inherent in training Socher's model on a movie-review corpus and then predicting the sentiment of financial texts. For example, the model assigns the phrase "as well as a serious debt to Road Warrior"1 a "Very positive" sentiment score. However, if we were to come across the closely related phrase "Blackberry has acquired a serious debt... of $1 billion" in a 10-K, this would obviously not be the desirable outcome. Therefore, in order to prevent domain-dependent issues like this, we would modify the phrase dictionary in the Stanford model to use sentiment scores of bigram phrases learned from our linear regression model on financial text. With this more robust initialization of the phrase-to-sentiment dictionary, we expect we would see even better predictions.

1 Taken from a Rotten Tomatoes review of Reign of Fire.


    Figures.

Sentiment Label    Real-valued Score
Very negative      0.1
Negative           0.3
Neutral            0.5
Positive           0.7
Very positive      0.9

Figure 1: The table above shows our mapping of Socher's sentiment labels to real-valued scores between 0 and 1.

LM List        Returns LSE    Volatility LSE
Positive       3.1156e+10     6.9538e+10
Negative       4.2570e+11     8.9766e+11
Uncertain      4.7217e+10     1.0439e+11
Litigious      1.1229e+11     2.4287e+11
Weak Modal     2.4033e+10     5.6009e+10
Strong Modal   7.6806e+09     2.1980e+10

Figure 2: We regressed volatility as well as returns on term frequency. The table above lists the least squares error for each LM list, based on each regression.

LM Unigrams    Returns LSE    Volatility LSE
TF             1.2474e+09     7.2906e+09
TF-IDF         6.7211e+08     7.1549e+09
LOG1P          1.7754e+09     7.8272e+09

Figure 3a: The table above shows the least squares error for the regression of volatility and of returns over TF, TF-IDF, and LOG1P of all LM words.

Returns LSE    Volatility LSE
2.2529e+08     7.4709e+08

Figure 3b: The table above shows the least squares error for the regression of volatility and of returns over sentiment scores, using Socher's model.


Returns LSE    Volatility LSE
5.8612e+08     7.1025e+09

Figure 3c: The table above shows the least squares error for the regression of volatility and of returns over unigrams and bigrams.

Positive:      Creativity (2.0520e+12), Invented (1.5234e+12), Complimentary (1.2176e+12), Happiness (8.3957e+11), Prospered (8.0903e+11)
Negative:      Disastrous (-1.3996e+14), Understating (-7.9882e+13), Refusal (-7.8511e+13), Unaccounted (7.1499e+13), Overproduces (-6.8838e+13)
Uncertain:     Speculating (6.5505e+12), Unconfirmed (3.5659e+12), Suggest (-2.6084e+12), Riskiest (1.7041e+12), Unforecasted (1.5421e+12)
Litigious:     Replevin (-5.6243e+13), Cedant (4.8033e+13), Juror (2.9430e+13), Jurist (-2.4876e+13), Nonfeasance (-2.4719e+13)
Weak Modal:    Suggests (-2.4830e+12), Might (6.7969e+11), Seldom (-5.2999e+11), Appeared (-3.8511e+11), Uncertain (2.5676e+11)
Strong Modal:  Clearly (-8.3762e+11), Undoubtedly (-7.0534e+10), Undisputed (4.2742e+10), Strongly (3.7738e+10), Always (3.4733e+10)

Figure 4: The table above lists the 5 most important terms (with their weights, in decreasing order of importance) in each of the LM lists for predicting returns.

Positive:      Creativity (4.3742e+12), Invented (3.3071e+12), Complimentary (2.2796e+12), Happiness (1.9427e+12), Bolstered (1.9404e+12)
Negative:      Disastrous (-2.8544e+14), Understating (-1.6426e+14), Refusal (-1.6265e+14), Unaccounted (1.4769e+14), Overproduces (-1.4247e+14)
Uncertain:     Speculating (1.3716e+13), Unconfirmed (7.6333e+12), Suggest (-5.5072e+12), Riskiest (3.5959e+12), Unforecasted (3.2765e+12)
Litigious:     Replevin (-1.2000e+14), Cedant (1.0299e+14), Juror (6.1004e+13), Jurist (-5.2298e+13), Nonfeasance (-5.1974e+13)
Weak Modal:    Suggests (-5.2395e+12), Might (1.4396e+12), Seldom (-1.1185e+12), Appeared (-8.0127e+11), Uncertain (5.3388e+11)
Strong Modal:  Clearly (-1.7695e+12), Undoubtedly (-1.6557e+11), Undisputed (8.3130e+10), Strongly (7.5230e+10), Always (6.7526e+10)

Figure 5: The table above lists the 5 most important terms (with their weights, in decreasing order of importance) in each of the LM lists for predicting volatility.

  • !!

    12!

LM List        Returns Weight       Volatility Weight
Positive       0.0221 (1)           0.0050 (1)
Negative       -3.9834e-05 (6)      -8.1918e-06 (6)
Uncertain      -6.8091e-05 (5)      -2.2408e-05 (4)
Litigious      3.0667e-04 (2)       -1.9367e-05 (5)
Weak Modal     -2.0899e-04 (3)      7.6307e-05 (2)
Strong Modal   9.1578e-05 (4)       -4.8578e-05 (3)

Figure 6: We regressed volatility as well as returns on the values obtained from each of the LM lists, using the weights learned from the training set. The table above lists the weight for each LM list in the regression described. The lists are ranked by magnitude of their weight from 1 to 6, in parentheses.

Figure 7: Actual performance of a random sample of ~6,000 companies, bucketed into quintiles based on predicted performance using Socher's model.

Figure 8: Actual performance of a random sample of ~6,000 companies, bucketed into quintiles based on predicted performance using TF.

Figure 9: Actual performance of a random sample of ~6,000 companies, bucketed into quintiles based on predicted performance using TF-IDF.

Figure 10: Actual performance of a random sample of ~6,000 companies, bucketed into quintiles based on predicted performance using LOG1P.

Figure 11: Actual performance of a random sample of ~6,000 companies, bucketed into quintiles based on predicted performance using LM unigram and bigram TF-IDF counts.


References:

1. Loughran, Tim, and Bill McDonald. "When Is a Liability Not a Liability?" Journal of Finance 66 (2011): 35-65.
2. Socher, Richard, et al. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank." Stanford, CA: Stanford University, 2013.
3. Kearney, Colm, and Sha Liu. "Textual Sentiment in Finance: A Survey of Methods and Models." March 2013.
4. Kogan, Shimon, et al. "Predicting Risk from Financial Reports with Regression." Proc. NAACL Human Language Technologies Conf., 2009.