Automatic Analysis of Financial Event Phrases and
Keywords in Form 8-K Disclosures
Darina M. Slattery1 and Richard F. E. Sutcliffe2
1 School of Languages, Literature, Culture and Communication, University of Limerick, Limerick, Ireland
2 Department of Computer Science and Information Systems, University of Limerick, Limerick, Ireland
{darina.slattery, richard.sutcliffe}@ul.ie
Abstract. It is generally accepted that there are three different types of financial
information: information in past stock prices, information that is available to all
the public, and information that is both available to the public and available pri-
vately to insiders [1-4]. This study looks at publicly-available information in
Form 8-K disclosures filed on the Securities and Exchange Commission’s
EDGAR system. We developed a prototype financial event phrase (FEP) recog-
nizer and automatically classified a small sample of 8-K disclosures by likely
share price response, using automatically-recognized FEPs and hand-chosen
keywords as disclosure features. We used C4.5 and SVM-Light for classifica-
tion but found that C4.5 was more successful overall. In one experiment, we
found that C4.5 was able to correctly classify 63.2% of disclosures that had a
positive share price reaction around the disclosure filing date, as against 58.2%
at chance.
Keywords: financial events, share price reaction, classification, content analy-
sis, disclosures, phrases, keywords.
1 Introduction
It is generally agreed that there are three different types of financial information: in-
formation in past stock prices, information that is available to all the public, and in-
formation that is both available to the public and available privately to insiders [1-4].
There is considerable debate about the possible impact that different kinds of in-
formation can have on the value of financial instruments. On the one hand, the theory
of market efficiency, or the efficient markets hypothesis (EMH), states that the price
of a financial instrument properly reflects all available information immediately [1]. If
security prices respond to all available information quickly, then the market is deemed
efficient and no excess profits or returns can be made. Even though the EMH has
many advocates, investment analysis is still a thriving and, oftentimes, a profitable
Aguado de Cea et al. (Eds.): Proceedings of the 10th Terminology and Knowledge Engineering Conference (TKE 2012), pp.291-305. 19-22 June 2012, Madrid, Spain
industry. The two main kinds of investment analyst—technical and fundamental—
would argue that the market is inefficient because information disseminates slowly
through the market and prices under- or over-react to the information [2].
Technical analysis involves evaluating time series patterns and trends relating to
previous prices of the financial instrument and the volume of trading, with a view to
predicting future prices and volumes. Fundamental analysis, on the other hand, in-
volves evaluating the value of a financial instrument using quantitative and qualitative
content derived from company financial statements, news stories, analysts’ reports,
and message forums. Relevant fundamental content can include ratios and variables
such as the earnings per share (EPS), net profit, and trading volume, as well as an-
nouncements relating to company policies and possible structural changes. Frequently
used online news sources include Bloomberg, Raging Bull, the Motley Fool, the Wall
Street Journal, and Yahoo! Finance. Other sources of quantitative and qualitative
information include online financial analysis tools (e.g. Datastream, Dow Jones News
Analytics, and Quotestream), and online databases such as the Sageworks Database
Platform and the Securities and Exchange Commission’s Electronic Data Gathering
Analysis and Retrieval (EDGAR) system. This latter system (EDGAR) hosts corpo-
rate disclosures such as annual Form 10-Ks and interim Form 8-Ks. Our study focuses
on fundamental content contained in Form 8-K disclosures filed on EDGAR.
A number of different data sources, features, goals, and methods have been used to
automatically analyze financial documents. Data sources have included online news
stories or messages and on- or off-line statements or reports. Document features have
included single words, keyword records and phrases, and financial ratios and varia-
bles. Some researchers have used combinations of features, whilst others have used
just one feature type. Goals have included prediction of prices, prediction of direction
or trends, and identification of market or message sentiment. Finally, a number of
different methods have also been employed; the most frequently used being rule in-
duction algorithms, neural networks, support vector machines, and Bayesian methods.
However, statistical methods, language modeling, multiple discriminant analysis, k-
nearest neighbor, and genetic algorithms have also been used. See [5-12] for some
relevant studies.
To date, little research has been undertaken on automatic event phrase recognition and classification of online financial reports; to that end, our
goal was (1) to develop a prototype financial event phrase recognizer and (2) to clas-
sify the likely share price responses to online Form 8-Ks, using automatically recog-
nized financial event phrases (FEPs) and hand-chosen keywords as document fea-
tures. In various experiments, we used the C4.5 decision tree/rule induction suite of
programs and SVM-Light, a support vector machine program, to classify the 8-Ks.
Our two datasets comprised 8-Ks filed by 50 randomly-chosen S&P 500 companies
before and after the 2004 SEC rule changes (these rule changes will be discussed in
Section 2).
A number of assumptions underlie our research. Firstly, we assume that the market is not fully efficient, i.e. that prices do not always fully reflect all available information immediately. We also assume that Form 8-Ks have some informational value for prediction purposes. We assume that prices adjust within a day to news filed in 8-Ks and that all companies are equally affected by external factors, i.e. we do not control for other variables. We assume that firms voluntarily disclose positive news but that they only disclose negative news when they are legally obliged to do so. Finally, we assume that positive or good news is easier to classify automatically than negative news, as the latter is often disguised in the midst of positive news.
2 Form 8-K Disclosures
Form 8-Ks, also known as interim reports, must be filed within a few days of certain
financial events occurring. Prior to 2004, corporations were required to report only nine types of events, including changes in control of registrant, acquisition or disposition
of assets, resignation of registrant’s directors, and other materially important events.
In 2002, the SEC put forward a proposal to increase the number of events and to
shorten the filing deadline. Due to the large number of comment and complaint letters
they received, they did not implement this proposal. However, in July 2002, Congress
passed the Sarbanes-Oxley Act. Amongst other things, this Act required public com-
panies to disclose information that was of material interest to investors and other rele-
vant parties, on a rapid and current basis. Following on from this Act, and having
taken on board many of the comments received after the 2002 proposal, the SEC offi-
cially amended the Form 8-K requirements in March 2004 [13]. Eight new event
types (or “items”) were added to the list of events that must be reported, two items
were transferred from the periodic reports (Form 10-Qs and Form 10-Ks), and two
existing items were modified. Items were reorganized and renumbered and similar
items were grouped. Also, the time period within which such filings must be made
was shortened to four business days after the triggering event. The changes took effect in August 2004.
When we compared the number of 8-Ks filed in 1997 and 2005 (before and after
the rule changes), we found there was an almost five-fold increase in the number of
filings in 2005, the year after the changes took effect (24,098 in 1997 vs. 116,282 in
2005). Interestingly, we also noted that there was a slight decline in filings after 2005,
which was possibly caused by a greater overall understanding of the requirements of
the new filing rules.
3 Methodology
In this section, we will discuss the datasets we used and why we decided to use closing prices and a three-day reaction window (Section 3.1). In Section 3.2, we will describe how we devised the event categories that were subsequently used by the automatic financial event phrase (FEP) recognizer. In Section 3.3, we will outline the automatic FEP recognition process, and in Section 3.4 we will outline some patterns that emerged from a preliminary pattern analysis of the recognized FEPs. Finally, in Section 3.5, we will discuss our findings from a precision and recall evaluation of our recognizer’s performance.
3.1 Datasets, Closing Prices, and Reaction Windows
This study uses three datasets of Form 8-Ks. The first dataset we obtained related to
specific types of companies (software-related) that filed over a five-year period in the
mid to late 1990s and were therefore only required to file a limited number of event
types, as discussed in Section 2. We used this initial dataset for some preliminary
experiments (see, for example, [14]) and to create the original event listing for the
financial event recognizer (see Section 3.2). The second and third datasets related to
two four-year periods but to a much broader spectrum of companies. The second da-
taset related to the four-year period from 1997 to 2000 and the third dataset related to
2005-2008. For both of these datasets, we randomly selected 50 companies from the
Standard and Poor’s 500, as this listing represents a cross-section of US industry and
is considered a leading indicator of US equities.
For each of the 8-Ks in the second and third datasets, we also obtained the closing
prices around the 8-K filing date. We used a three-day window when calculating the
share price return, by comparing the closing price the day after the filing was made
with the closing price the day before (t-1 to t+1, where t was the disclosure filing
date). Like van Bunningen [15], we used closing prices rather than intraday prices as
only closing prices were readily available. Also, it was not always possible to deter-
mine the specific time of the day when filings were submitted to the SEC and subse-
quently posted online, so it was best to use a consistent pricing method. Whilst
Mittermayer and Knolmayer [16] argue that closing prices should not be used as the
market reacts very quickly, and Andersen and Bollerslev [17] argue that closing pric-
es that are similar to the previous opening prices may disguise intraday volatility,
MacKinlay [18] was not convinced of the net benefit of high-frequency intraday data.
We worked on the assumption that closing prices within a day of an event would still
capture the effect of the event. We chose a three-day window for a number of reasons.
Firstly, we assumed that reactions to 8-Ks—as opposed to news headlines—do not
always occur immediately and may take up to a day. Secondly, as SEC filings are
sometimes posted after trade has ceased on day t, the effect of an 8-K might not be
seen until day t+1. Also, there can be a short lag between when an EDGAR filing is
made and when a filing appears on the public site [19]. However, because the filing
deadline was reasonably generous prior to 2004 (5 business or 15 calendar days), it is
possible that the event information had already been leaked by the time our 8-Ks
were filed. Carter and Soo [20] found evidence of a limited market response to 8-Ks
and that timely filing was the single most important factor influencing the market
relevance of the 8-K. This was one of the reasons for creating the third dataset, which
spanned 2005 to 2008. 8-Ks filed since 2004 have to be filed more promptly so the
price reaction could be more closely related to the content, than previously. Antweiler
and Frank [21] suggested using windows greater than three days for robustness as
they found that the returns varied depending on the length of the window used.
Tetlock [22] also experimented with different window sizes and found that some neg-
ative effects were reversed within a week. Asthana and Balsam [23] used a five-day
window (t-1 to t+3) but reported similar results with three-day (t-1 to t+1) and seven-
day (t-1 to t+5) windows. Whisenant et al. [24] used three- and seven-day windows
but suggested that if leakage of information occurs before an 8-K filing is made, then
the use of a three-day window could potentially “underestimate the information con-
tent of the disclosures” [p. 185]. We chose a three-day window as we wanted to elim-
inate the likelihood of confounding events which could occur in a longer window and
because a three-day window is consistent with several research studies e.g. [25-27].
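As a concrete illustration, the three-day return described above can be computed from closing prices as follows. This is our own sketch in Python; the paper does not specify an implementation, and the function and variable names are ours:

```python
def three_day_return(closing_prices, t):
    """Three-day share price return around a disclosure filing date:
    compare the closing price on day t+1 with the closing price on
    day t-1, where t is the index of the filing date."""
    p_before = closing_prices[t - 1]  # close on day t-1
    p_after = closing_prices[t + 1]   # close on day t+1
    return (p_after - p_before) / p_before

# Example: close rises from 100.0 (day t-1) to 103.0 (day t+1),
# a +3% return over the window.
print(three_day_return([100.0, 101.5, 103.0], 1))  # 0.03
```

A positive return would label the disclosure an ‘up’, a negative return a ‘down’, and a zero return a ‘nochange’, matching the categories used in Section 3.2.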
3.2 Devising Event Categories for the Financial Event Phrase Recognizer
We manually analyzed the content of 185 disclosures from the first dataset, compris-
ing the top 56 ‘ups’, the top 56 ‘downs’, and all 73 ‘nochanges’. By top 56 ‘ups’, we
mean the 56 disclosures with the greatest positive change in share price return around
the disclosure filing date. Likewise, the 56 ‘downs’ had the greatest negative changes
and the 73 ‘nochanges’ had no change in share price return. We specifically chose the
top ‘ups’ and ‘downs’ because we felt they were more likely to contain content that
could impact the share price. During this manual content analysis, we identified 20
major event categories and several key event types within those categories. We then
identified all the disclosures that contained keywords relating to each major event
category, using a combination of search and concordance tools. Sample search key-
words included dismissal, appointment, resign and merger. Using search tools, we
manually analyzed disclosures that matched the search keywords and extracted the
key phrases that described the relevant events. Also, using concordance tools, we
manually analyzed all the returned sentences and extracted the most critical and inter-
esting parts of the sentences to create event phrases. These event phrases were then
edited manually to remove references to company names, product names, and finan-
cial values. Sample phrases included consummated a private offering, reported record
quarterly net income, and entered into an agreement to acquire all of the outstanding
capital stock. This filtering created a ‘best choices’ list of key phrases for each event
type. We later fine-tuned several key event types. Some new events were identified,
some events were split and others were merged. We then used these key phrases to
develop a prototype financial event phrase (FEP) recognizer.
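The keyword search and concordance steps described above can be approximated with a simple keyword-in-context routine. This is our own sketch; the original analysis used off-the-shelf search and concordance tools, and the example sentence is hypothetical:

```python
def concordance(text, keyword, window=5):
    """Return each occurrence of keyword with up to `window` words of
    context on either side (a simple keyword-in-context listing)."""
    tokens = text.lower().split()
    hits = []
    for i, tok in enumerate(tokens):
        if keyword in tok:  # matches 'resign', 'resigned', 'resignation', ...
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append(" ".join(left + [tok] + right))
    return hits

lines = concordance(
    "The chief financial officer announced his resignation effective today.",
    "resign")
print(lines)  # ['chief financial officer announced his resignation effective today.']
```

Extracting the most critical parts of such returned sentences, and then stripping company names, product names, and financial values, yields candidate event phrases of the kind listed above.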
We decided to develop our own list of FEP phrases, rather than use events and
keywords identified in other studies, because the majority of the other studies used
online news stories or messages, rather than disclosures e.g. see [5] and [28-30]. The
language in formal reports can be quite different to the language used in news head-
lines, stories and discussion board postings. We also decided to examine as many
events as possible, rather than just specific events, to avoid what Fama [31] referred
to as “dredging for anomalies” [p.287]. Nonetheless, developing the FEP lists for the
recognizer was not a straightforward task. It was difficult at times to interpret the
meaning of a phrase and to determine if a phrase was one type of event or another.
The stock-related events, in particular, were difficult to interpret at times. We are not
yet convinced that this is the ideal categorization of stock-related events and recom-
mend that further work is carried out in this area. Other researchers who highlighted
similar kinds of difficulties when trying to interpret the meaning of financial text in-
clude Hildebrandt and Snyder [32], who discussed the importance of context; Thomas
[33], who studied condensations and contractions; and Gillam et al. [34], who dis-
cussed problems with negation, double negation and ambiguous terms.
Whilst we were developing the FEP lists, we also developed a list of named enti-
ties (NEs). We identified NEs such as accountant names (e.g. KPMG Peat Marwick
LLP and Price Waterhouse LLP) and types of employee (e.g. Senior Vice President,
Strategy, Finance and Administration and Chairman of the Board of Directors). We
also developed a list of types of financial object (TFO)—for example, we identified
84 different types of loss (e.g. loss before provision for income tax and unrealized
loss on investments), 193 types of stock, right or option (e.g. redeemable convertible
preferred shares and preferred stock) and 72 types of stock option agreement or plan
(e.g. stock option agreement and supplemental stock plan). The prototype FEP recognizer used the NEs and TFOs to ensure that as many variations of a financial event as possible would be recognized and that, where FEPs could not be recognized in an 8-K, at least NEs and/or TFOs would be written to the output, to facilitate classification later on.
We also devised a list of the most frequently-occurring words. We then fine-tuned
this list by removing “useless” stop words and other words to create a list of 1,568
interesting keywords. Whilst not all these words were of a financial nature, they all
appeared frequently in our 8-Ks. These words were used as additional document fea-
tures in the classification experiments described in Section 4.
3.3 Automatic Recognition of FEPs
As discussed in the previous section, we devised lists of FEPs, NEs, TFOs, and keywords. The following steps were undertaken to recognize these in 8-Ks¹:
1. The full 8-K text was read in as a single string.
2. The 8-K string was split into a list of sentences.
3. Each sentence string was converted to lowercase.
4. Each lowercase sentence string was tokenized. Full-stops within words were re-
tained. Each tokenized sentence became a list of atoms.
5. The FEPs, NEs, and TFOs within each tokenized sentence were then recognized
using our FEP recognizer and written to an output file.
6. Every single word in the 8-K was written to a second output file. The items recog-
nized in step 5 were appended to this second output file.
Once the FEPs, NEs, and TFOs had been recognized and appended to all the words in
the 8-K, this second output file was compared to a list of 1,635 possible attributes,
comprising the 1,568 hand-chosen keywords, 49 FEP types, and 18 NE and TFO
types. A simple Perl program then wrote the value for each attribute (e.g. true or
false), in preparation for the C4.5 and SVM-Light classification experiments (see
Section 4).
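Steps 1-6 and the attribute-vector step can be sketched as follows. The original attribute writer was a Perl program; this Python version is purely our reconstruction, and the toy phrase lists are hypothetical stand-ins for the real inventories of 49 FEP types, 18 NE/TFO types, and 1,568 keywords:

```python
import re

# Toy stand-ins for the hand-built lists (hypothetical entries).
FEP_PHRASES = {"entered into an agreement to acquire": "merger",
               "reported record quarterly net income": "income_reporting"}
KEYWORDS = ["merger", "resign", "appointment", "dismissal"]

def process_8k(text):
    # Steps 1-4: read the 8-K as one string, split it into sentences,
    # lowercase each sentence, and tokenize (full-stops within words kept).
    sentences = [s.lower() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[\w.'-]+", s) for s in sentences]

    # Step 5: recognize FEPs in each sentence (simple substring matching here).
    feps = [label for s in sentences
            for phrase, label in FEP_PHRASES.items() if phrase in s]

    # Step 6: every word in the 8-K plus the items recognized in step 5.
    words = {tok for toks in tokenized for tok in toks}
    features = words | set(feps)

    # Attribute-vector step: one true/false value per candidate attribute.
    attributes = KEYWORDS + list(FEP_PHRASES.values())
    return {attr: attr in features for attr in attributes}

vec = process_8k("The company entered into an agreement to acquire "
                 "all of the outstanding capital stock.")
print(vec["merger"], vec["resign"])  # True False
```

The resulting true/false vector is what the classifiers in Section 4 consume, one attribute per hand-chosen keyword or recognized FEP/NE/TFO type.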
¹ As mentioned earlier, we did not recognize FEPs in the first dataset or use it for our main classification experiments, as it was not based on the S&P 500 listing.
3.4 Automatic Pattern Analysis of FEPs
We carried out a preliminary analysis of FEPs found by our prototype recognizer, to
see if we could identify any interesting patterns about events in 8-Ks, prior to carrying
out the classification experiments. For the second and third datasets, we examined the
total number of words, the number of FEPs recognized, the different FEP types rec-
ognized, and the occurrences of each FEP type, in ‘downs’ and ‘ups’. We will now
summarize some of the more interesting findings.
With regard to the number of words, there were 807,620 words in the first S&P 500 dataset (1997 to 2000) and 2,682,155 words in the second (2005 to 2008),
which represents a 30% increase in words. We would expect there to be more words
in the latter dataset, as there were 44% more 8-Ks. The fact that there were 44% more
8-Ks but only 30% more words could suggest a number of things: (1) that there was
less content in the more recent disclosures, as 8-Ks focused on only one or two very
recent events and/or (2) that disclosure writers adopted more concise writing strate-
gies in the more recent 8-Ks, possibly because their reporting practices were being
more closely monitored by the SEC. Perhaps of greater interest is the word count
difference between ‘downs’ and ‘ups’, in both datasets. Looking firstly at the first
dataset, there were 29% more words in the ‘ups’ compared to the ‘downs’, even
though there were only 14% more ‘up’ disclosures. When we performed the same
type of comparison with the second dataset, we found that there were 17% more
words in the ‘ups’ compared to the ‘downs’, and 17% more ‘up’ disclosures. This
suggests that our ‘ups’ contained more verbose language than the ‘downs’ before the
2004 rule changes but that the ‘ups’ and ‘downs’ were fairly similar (word-count
wise) after 2004. Assuming that there is a correlation between the disclosure content
and share price reaction (see Section 1 for a list of our assumptions), this finding for
the first S&P 500 dataset would appear to correlate with the Kohut and Segars [35]
finding that high-performing firms use more verbose language than poor-performing
firms. Li [36] found that the annual reports of poor-performing firms tend to be less
readable and that firms with more readable documents tend to have more persistent
positive earnings. Hildebrandt and Snyder [32] applied the ‘Pollyanna Hypothesis’ to
the writing of annual reports and found that there were significantly more positive
words than neutral or negative words, regardless of whether it was a financially good
or bad year, suggesting that there is a general preference for using positive words in
disclosures. Assuming there is a correlation between content and share price return,
this could explain why the ‘ups’ contained more words than the ‘downs’ before 2004.
Also, it seems clear that the report writing style changed sometime in the intervening
period; either the ‘ups’ became more concise than previously or the ‘downs’ became
more verbose.
With regard to the FEPs recognized in both datasets, we initially assumed that our
recognizer would find more FEPs in the second dataset, compared to the first, because
the second dataset was larger and firms were legally obliged to file more event types.
However, this was not the case for the ‘downs’. Even though there were more than
twice as many ‘downs’ in the second dataset compared to the first (574 vs. 256), our
recognizer found FEPs in proportionately fewer 8-Ks (20.4% vs. 30.9%). With regard to the ‘ups’, the recognizer did find more FEPs in the second dataset, but it found proportionately fewer overall (24.3% vs. 30.8%)².
When we compared the average number of FEPs recognized per 8-K, as well as the
minimum and maximum number, we found that the recognizer was less successful in
finding FEPs in the second dataset. On average, the recognizer found 1 FEP per file in the first, but only 0.4 (i.e. often none) in the second. The maximum number of FEPs recognized
also decreased in the second dataset, but this could also be because of more accurate
and timely reporting i.e. companies might have started to file single events more fre-
quently than before, rather than file several events in one disclosure. As already men-
tioned in this section, we noted that the recognizer found FEPs in a smaller percentage
of 8-Ks in the second dataset. One possible reason for this could be that the language
changed significantly in the intervening period, and therefore the recognizer was not
able to recognize as many FEPs. It is possible that the marked increase in the number
of events that needed to be filed, brought with it a change in the style of reporting
language. Also, as we know that the number of event types that had to be filed was
increased from 2004, it is quite likely that some of the new event types were not ca-
tered for in our FEP recognizer.
Our pattern analyses also yielded additional interesting findings. For example, we
found that whilst the total number of FEPs recognized decreased in the second da-
taset, the number of unique FEPs recognized increased; one possible reason for this
could be that there was less repetition of each event, as auditors became more consci-
entious about their reporting style after the 2004 rule changes. We also found that the
majority of FEP types that were recognized in the first dataset, were also recognized
in the second dataset. Of the 49 FEP types, 18 were recognized in one or more 8-Ks in
the first dataset and 19 were recognized in the second. Only the FEPs relating to ac-
countant dismissals, dividend distributions, and private placements of stock were
recognized in the second dataset, but not in the first; conversely, only the stock offer-
ing, stock option agreement or plan, and loss reporting FEP types were found in the
first dataset but not in the second. The only FEP type that appeared to be correlated
with an increase in share price was accountant appointment, which was only recog-
nized in the ‘ups’. We might expect a new appointment to be correlated with good
news, especially if there was an issue with a previous accountant. When we examined
the FEP types that were recognized most often, we found that merger events were
recognized in a fairly even number of ‘ups’ and ‘downs’, in both datasets. Merger and
acquisition agreements can have different implications for companies, depending on
whether they are the acquiring company or the company being acquired. Also, specif-
ic details regarding a merger can have different implications for shareholders as they
often have associated stock changes.
² It is important to note here that these figures relate to FEPs found by our recognizer; it is likely that there were more FEPs in the 8-Ks but that these were not recognized.
3.5 Precision and Recall
In the previous section, we reported that our recognizer found FEPs in proportionately fewer ‘downs’ in the second S&P 500 dataset, compared to the first (20.4% vs. 30.9%). With regard to the ‘ups’, the recognizer found more FEPs in the second dataset, compared to the first, but proportionately fewer overall (24.3% vs. 30.8%).
As it became evident that further work was needed to improve the recall of the rec-
ognizer, we then decided to further evaluate the performance by evaluating the preci-
sion and recall on a subset of documents, using only 8-Ks that had one or more FEPs
automatically recognized in them. For this purpose, we selected 10% of the disclo-
sures used in one of our experiments. This experiment used 280 disclosures filed be-
tween 2005 and 2008, so we selected 14 ‘ups’ and 14 ‘downs’ (28 in total, or 10%).
We avoided using more than one disclosure from the same company for two reasons.
Firstly, it was possible that a company released more than one disclosure relating to
the same event and secondly, the auditors probably used similar language in each of
these disclosures. Using more than one disclosure describing the same event could
skew our evaluation. As this evaluation of precision and recall (P&R) was only on a
subset of the entire collection, and comprised only disclosures that had one or more
recognized FEPs in them, we will refer to this P&R as adjusted P&R.
We found that the adjusted recall of the ‘downs’ (0.65) was greater than the adjusted recall of the ‘ups’ (0.42). The adjusted precision was also higher for the ‘downs’ (0.77) than for the ‘ups’ (0.52). The combined adjusted recall and precision for ‘downs’ and ‘ups’ were 0.54 and 0.65 respectively. It is probably not surprising that
our overall adjusted recall is only moderate, as the recognizer was initially developed using
disclosures filed before the 2004 rule changes and we know from earlier sections that
the word counts and event types changed after these changes took effect. However,
we would expect the precision to be higher so we will now summarize some of the
difficulties we encountered when determining the precision. One problem related to
repetition of the same event in a disclosure. Whilst the first occurrence may have been accurate, we counted the other occurrences as inaccurate, as there was really only one
event. Precision was also affected by previous events, which were recognized but not
truly accurate as we really only wanted to recognize recent events. There were also a
couple of outright inaccuracies, where the language used was misinterpreted as anoth-
er event type. Further work is needed to eliminate or at least minimize these inaccura-
cies.
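The adjusted precision and recall reported above follow the standard definitions; as a sketch (our own illustration, with hypothetical counts chosen only to reproduce the two rounded ‘downs’ figures):

```python
def precision_recall(tp, fp, fn):
    """Precision: fraction of recognized FEPs that were correct.
    Recall: fraction of actual FEPs that were recognized."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts that reproduce the 'downs' figures above
# (precision 0.77, recall 0.65): 17 correct, 5 spurious, 9 missed.
p, r = precision_recall(tp=17, fp=5, fn=9)
print(round(p, 2), round(r, 2))  # 0.77 0.65
```

Under these definitions, the repeated-event and previous-event matches discussed above count as false positives, which is why they depress precision.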
4 Classification Experiments
We carried out a number of classification experiments using the second and third
datasets (see Section 3.1 for an overview of our three datasets), with a view to predict-
ing the likely share price responses to the financial events contained within the 8-Ks.
Table 1 summarizes the results from two experiments and a textual description of
both experiments is provided in Sections 4.1 and 4.2. A summary of the overall find-
ings is presented in Section 4.3. In both experiments, we used the C4.5 decision tree/
rule induction suite of programs and SVM-Light, a support vector machine program,
to classify the 8-Ks. Also, in both experiments, all the 8-Ks had one or more automat-
ically recognized financial event phrases (FEPs) in them, as well as a number of other
features—named entities (NEs), types of financial object (TFOs) and keywords. In
other words, we used these recognized FEPs, NEs, TFOs, and keywords as input fea-
tures for classification.
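The 10-fold cross-validation protocol used in both experiments can be sketched as follows. This is our illustrative Python, not the actual C4.5/SVM-Light runs; the majority-vote stand-in classifier simply reproduces the class proportion, which is also the arbitrary-classification baseline reported in Table 1:

```python
import random

def ten_fold_accuracy(examples, labels, make_classifier):
    """Average accuracy over 10 folds: each fold is held out once
    while a classifier is built from the remaining nine folds."""
    indices = list(range(len(examples)))
    random.Random(0).shuffle(indices)
    folds = [indices[i::10] for i in range(10)]
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train_labels = [labels[i] for i in indices if i not in held_out]
        classify = make_classifier(train_labels)
        correct = sum(classify(examples[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)

# Toy stand-in classifier: always predict the majority training label.
def majority_classifier(train_labels):
    majority = max(set(train_labels), key=train_labels.count)
    return lambda example: majority

labels = ['up'] * 163 + ['down'] * 117   # Experiment 2 class sizes
examples = list(range(len(labels)))      # dummy placeholders for 8-K features
acc = ten_fold_accuracy(examples, labels, majority_classifier)
print(round(acc * 100, 1))  # 58.2 -- the 'up' class proportion
```

A classifier only adds value to the extent that its per-class accuracy beats this majority-class figure, which is how the comparisons in Table 1 should be read.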
Table 1. Summary of Results for Both Experiments.

                           Average accuracy of      Average accuracy of        Average accuracy of
                           ‘ups’ and ‘downs’        ‘ups’ (vs. arbitrary       ‘downs’ (vs. arbitrary
                           together (vs. chance)    classification to ‘up’)    classification to ‘down’)
Experiment 1   C4.5        52.1% (vs. 50%)          53.3% (vs. 53.3%)          50.6% (vs. 46.7%)
               SVM-Light   49.7% (vs. 50%)          70% (vs. 53.3%)            29.4% (vs. 46.7%)
Experiment 2   C4.5        52.5% (vs. 50%)          63.2% (vs. 58.2%)          37.6% (vs. 41.8%)
               SVM-Light   49.6% (vs. 50%)          98.8% (vs. 58.2%)          0.4% (vs. 41.8%)
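The ‘vs. arbitrary classification’ figures in Table 1 are simply the class proportions in each dataset; as an arithmetic check (our own illustration):

```python
def class_baselines(n_ups, n_downs):
    """Accuracy of arbitrarily assigning every disclosure to one class."""
    total = n_ups + n_downs
    return round(100 * n_ups / total, 1), round(100 * n_downs / total, 1)

# Experiment 1: 90 'ups' and 79 'downs'; Experiment 2: 163 and 117.
print(class_baselines(90, 79))    # (53.3, 46.7)
print(class_baselines(163, 117))  # (58.2, 41.8)
```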
4.1 Experiment 1: Second Dataset (1997 to 2000) (C4.5 and SVM-Light)
This dataset contained 169 8-K disclosures, comprising 90 ‘ups’ and 79 ‘downs’.
Using C4.5, the average accuracy of the ‘ups’ and ‘downs’ together was 52.1% using
10-fold cross-validation, which is only marginally better than chance, assuming a 50-
50 up-down decision (see Table 1). When we focused on the performance of the ‘ups’
specifically, C4.5 correctly classified 53.3% of them. Coincidentally, the ‘ups’ com-
prised 53.3% of the full dataset (90/169 disclosures), so our prototype recognizer and
subsequent classification did not yield improved results over arbitrary classification
i.e. if we had assigned every document to the ‘up’ category as a baseline benchmark.
However, when we focused on the performance of the ‘downs’, C4.5 correctly classi-
fied 50.6% of them (highlighted in bold in Table 1), whereas the full dataset com-
prised only 46.7% ‘downs’ (79/169 disclosures). If we had arbitrarily assigned all 169
disclosures to the ‘down’ category, we would only have been correct 46.7% of the
time; our automatic system, however, was correct 50.6% of the time. This is a posi-
tive result for the ‘downs’. Unfortunately, it was not possible to undertake a hypothe-
sis test for a population proportion to determine the statistical significance of this
result, as we did not have data relating to the entire population of ‘up’ and ‘down’
disclosures.
Using SVM-Light and 10-fold cross-validation, the average accuracy of the ‘ups’
and ‘downs’ together was 49.7%. Looking more closely at the ‘ups’ and ‘downs’
separately, we found that SVM-Light correctly classified 70% of the ‘ups’ (highlight-
ed in bold in Table 1) but only 29.4% of the ‘downs’. As we will discuss in the next
section, this high result for the ‘ups’ appears to be due to SVM-Light selecting the
majority class (‘up’) as the default classification.
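The default-classifier behaviour we attribute to SVM-Light can be illustrated with a toy majority-class predictor: it ignores document content entirely, so it scores near 100% on the majority class and near 0% on the minority class. This is a hypothetical sketch, not SVM-Light itself.

```python
from collections import Counter

def majority_class_predictor(train_labels):
    """Return a 'classifier' that always predicts the most frequent
    training class, regardless of the document's content."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _document: majority

# Class balance of the second dataset: 90 'ups' and 79 'downs'.
predict = majority_class_predictor(["up"] * 90 + ["down"] * 79)

print(predict("positive earnings 8-K"))  # up
print(predict("negative earnings 8-K"))  # up
```

Any classifier whose per-class accuracies approach the 100%/0% pattern on imbalanced data should be compared against this degenerate baseline before its headline accuracy is trusted.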
4.2 Experiment 2: Third Dataset (2005 to 2008) (C4.5 and SVM-Light)
This dataset contained 280 disclosures, comprising 163 ‘ups’ and 117 ‘downs’. Using
C4.5, the average accuracy of the ‘ups’ and ‘downs’ together was 52.5% using 10-
fold cross-validation, which is only marginally better than chance, assuming a 50-50
up-down decision. When we focused on the performance of the ‘ups’, C4.5 correctly
classified 63.2% (highlighted in bold), whereas the full dataset comprised only 58.2%
‘ups’ (163/280 disclosures). If we had arbitrarily assigned all 280 disclosures to the
‘up’ category, we would only have been correct 58.2% of the time; our automatic
system, however, was correct 63.2% of the time so this is a positive result. When we
focused on the performance of the ‘downs’, C4.5 correctly classified 37.6% of them,
whereas the dataset comprised 41.8% 'downs' (117/280 disclosures). Our system did not improve on arbitrary classification in this case.
Using SVM-Light and 10-fold cross-validation, the average accuracy of the ‘ups’
and ‘downs’ together was 49.6%. Looking more closely, SVM-Light correctly classi-
fied 98.8% of the ‘ups’ (highlighted in bold) but only 0.4% of the ‘downs’. This sig-
nificant difference in accuracy levels between the ‘ups’ and ‘downs’ seems to confirm
that SVM-Light behaved like a default classifier here, assigning the majority of test
cases to the class with the most data [37].
4.3 Summary of Results
To summarize the results, when the ‘ups’ and ‘downs’ were classified together, the
average accuracy was around chance (50%). However, when we examined the performance of the 'ups' and 'downs' separately, the results were more revealing. Taking both datasets together, C4.5 always equaled or outperformed arbitrary classification on the 'ups', and SVM-Light always outperformed it. On the 'downs', C4.5 outperformed arbitrary classification only in the second (earlier) dataset, and SVM-Light never did.
Overall, C4.5 appeared to be better at classifying the ‘ups’ and ‘downs’. Whilst the
accuracy rate for the ‘ups’ was particularly high for SVM-Light in the third (more
recent) dataset, we must remember that SVM-Light seemed to use the default classifi-
cation (‘up’) for the majority of the classifications. Relying on the default classifica-
tion would not be recommended in a real-world investment scenario.
We should also mention here that the C4.5 average training error rates were rela-
tively low, ranging from 2.9% (second dataset) to 7.7% (third dataset). However, with
SVM-Light, the average training error rates were much higher, ranging from 20%
(second dataset) to 35.3% (third dataset).
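In outline, the 10-fold protocol behind these accuracy and training-error figures looks like the following generic harness. This is a sketch, not the actual C4.5/SVM-Light tooling; the demo labelling mirrors the 90/79 class balance of the second dataset but is otherwise invented.

```python
import random
from collections import Counter

def ten_fold_cross_validation(documents, labels, train_fn, k=10, seed=0):
    """Average test accuracy and average training error over k folds.
    `train_fn(train_docs, train_labels)` must return a predict(doc) callable."""
    order = list(range(len(documents)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]          # k disjoint test folds
    test_accs, train_errs = [], []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f in range(k) if f != held_out for i in folds[f]]
        predict = train_fn([documents[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        test_accs.append(sum(predict(documents[i]) == labels[i]
                             for i in test_idx) / len(test_idx))
        train_errs.append(sum(predict(documents[i]) != labels[i]
                              for i in train_idx) / len(train_idx))
    return sum(test_accs) / k, sum(train_errs) / k

# Demo with a majority-class 'trainer' on an invented 169-document labelling.
docs = list(range(169))
labels = ["up"] * 90 + ["down"] * 79
majority = lambda ds, ls: (lambda _d, m=Counter(ls).most_common(1)[0][0]: m)
avg_acc, avg_train_err = ten_fold_cross_validation(docs, labels, majority)
```

The training error is measured on the same documents the model was fitted to, which is why the low C4.5 figures (2.9% to 7.7%) indicate a close fit to the training data rather than generalization ability.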
5 Conclusions
This paper describes one approach to automatic financial event phrase recognition in
Form 8-K disclosures. It also describes the results from two classification experi-
ments; one based on disclosures filed before the SEC’s 2004 rule changes, the other
based on disclosures filed afterwards.
We found that the combined accuracy levels for ‘ups’ and ‘downs’ together did not
yield promising results (combined average accuracies ranged from 52.1% to 52.5% for C4.5 and from 49.6% to 49.7% for SVM-Light). However, when we investigated the classification accuracy for the 'ups' and 'downs' separately, the results were slightly more
promising. For example, if the share price for a particular disclosure is going to in-
crease (unbeknownst to the user), then both systems are more likely to predict ‘up’, so
one might expect to make a profit. However, if the share price is going to decrease,
then C4.5 would be the safer option. Lerman and Livnat [38] found that market reac-
tions to 8-Ks and other disclosure types varied by event—some events caused strong
positive returns, whereas others caused negative returns. They also suggested that for
some events there may have been an absence of information or inconsistent reactions
(an event that was good for one firm may have been bad for another).
Mittermayer and Knolmayer [16] highlighted the level of noise that exists in training
data—for example, several 8-Ks could be released in a short timeframe, each describ-
ing the same event. However, only one of these may have actually impacted the price
(possibly the first one).
Focusing on the C4.5 results only (since it did not appear to rely on a default
classification), our findings suggest that, using event phrases, keywords, and other
such features of the 8-Ks, the ‘downs’ seemed to be easier to classify than the ‘ups’
before the 2004 rule changes came into effect. At the time, corporations were only
required to file a limited number of event types and there was greater flexibility with
regards the filing deadline. We think that corporations would only have mentioned
negative news if it was a recent negative news event that they were obliged to dis-
close. This lack of ‘noise’ in negative news disclosures could have facilitated the
classification of the ‘downs’. Also, our pattern analysis (see Section 3.4) revealed that
the ‘ups’ contained proportionately more words than the ‘downs’ during this period
(when compared to post-2004); this could partially explain the difficulties faced by
the classifier when attempting to classify the ‘ups’. On the other hand, post 2004, the
‘ups’ appeared to be easier to classify than the ‘downs’. Our pattern analysis revealed
that the ‘ups’ became more concise (relative to the ‘downs’) during this period; one
possible reason for this improved conciseness could have been the stricter filing rules.
Because negative news disclosures tend to discuss positive news as well, in order to soften the likely impact of the negative news, this could have hampered the classification of the 'downs'.
There are a number of limitations to the experiments described in this study. For real-world implementation, the recall of the recognizer would need to be greatly improved.
This could be done by incorporating new event types and phrases that have emerged
in more recent years. Secondly, the two classification experiments described here use
relatively small datasets (169 and 280 disclosures respectively). Whilst the relatively
small size of the datasets is due to the performance of the recognizer (we only classi-
fied disclosures that contained one or more automatically recognized events), ideally
we would test the classifier using much larger datasets. Thirdly, we used a large num-
ber of classification attributes (1,635); further refinement of the event phrases and
keywords could yield more interesting classification results. Finally, we cannot be
certain that these disclosures were the sole cause of the share price changes; other
variables such as company history, industry type, and macroeconomic factors could
also have impacted the prices.
Despite these limitations, we think this research also has a number of strengths.
Our study differs from most other studies in that we analyze complete documents and
we use a combination of features (financial event phrases, named entities, types of
financial object, and keywords) when attempting to classify 8-Ks by likely share price
response. We do not claim to have developed the definitive solution; rather, we have
identified one possible solution that could facilitate the arduous task of analyzing 8-
Ks. With some more work, the output from the recognizer could be used by various
parties. For example, disclosure writers could examine the financial event phrases we
identified and incorporate the most appropriate phrases (or avoid using certain
phrases) to encourage the desired share price response. Likewise, fundamental ana-
lysts could incorporate these phrases into their content analysis toolkits. Individual
investors could search for specific phrases and then purchase or sell stocks based on
the content of disclosures.
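As a sketch of the phrase-based screening an individual investor might perform, the following naive matcher counts occurrences of positive and negative phrases. The phrase lists here are invented examples, not our actual FEP inventory.

```python
# Hypothetical phrase lists -- NOT the actual FEPs from our recognizer.
POSITIVE_PHRASES = ["record revenue", "raised guidance", "declared a dividend"]
NEGATIVE_PHRASES = ["filed for bankruptcy", "lowered guidance", "restated earnings"]

def screen_disclosure(text):
    """Flag an 8-K as 'up', 'down', or 'neutral' by naive phrase counting."""
    lower = text.lower()
    pos = sum(lower.count(p) for p in POSITIVE_PHRASES)
    neg = sum(lower.count(p) for p in NEGATIVE_PHRASES)
    if pos > neg:
        return "up"
    if neg > pos:
        return "down"
    return "neutral"

print(screen_disclosure("The company reported record revenue and raised guidance."))  # up
```

A production version would of course need the full recognizer's event phrases and the disambiguation it performs; raw substring counting is only a starting point.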
Some recommendations for future work include further refinement of the financial
event phrases and keywords, classification of larger datasets (possibly using more
volatile companies outside the S&P 500), and incorporation of additional classifica-
tion variables beyond event phrases and keywords.
References

1. Fama, E.F.: Efficient Capital Markets: A Review of Theory and Empirical Work. J Financ.
25(2), 383-417 (1970)
2. Haugen, R.A.: Modern Investment Theory. 2nd ed. Prentice Hall, Englewood Cliffs, New
Jersey (1990)
3. Hellstrom, T., Holmstrom, K.: Predicting the Stock Market. Technical Report Series IMa-
TOM-1997-07. Mälardalen University, Sweden (1998)
4. Elton, E.J., Gruber, M.J., Brown, S.J., Goetzmann, W.N.: Modern Portfolio Theory and Investment Analysis. John Wiley & Sons, Inc., New York (2003)
5. Wüthrich, B., Peramunetilleke, D., Leung, S., Cho, V., Zhang, J., Lam, W.: Daily Predic-
tion of Major Stock Indices from Textual WWW Data. In: Agrawal, R., Stolorz, P.,
Piatetsky, G. (eds.) Fourth International Conference on Knowledge Discovery and Data
Mining, 2720-2725. AAAI Press, New York (1998)
6. Lavrenko, V., Schmill, M., Lawrie, D., Ogilvie, P., Jensen, D., Allan, J.: Language Models
for Financial News Recommendation. In: Agah, A., Callan, J., Rundensteiner, E. (eds.)
Ninth International Conference on Information and Knowledge Management, 389-396.
ACM Press (2000)
7. Peramunetilleke, D., Wong, R.K.: Currency Exchange Rate Forecasting from News Head-
lines. Aust. Comp. S. 24(2), 131-139 (2002)
8. Antweiler, W., Frank, M.Z.: Is All that Talk Just Noise? The Information Content of Inter-
net Stock Message Boards. J Financ. 59(3), 1259-1294 (2004)
9. Lam, M.: Neural Network Techniques for Financial Performance Prediction: Integrating
Fundamental and Technical Analysis. Decis. Support. Syst. 37(4), 567-581 (2004)
10. Fung, G.P.C., Yu, J.X., Lu, H.: The Predicting Power of Textual Information on Financial
Markets. IEEE Intell. Inform. Bull. 5(1), 1-10 (2005)
11. Tetlock, P.C., Saar-Tsechansky, M., Macskassy, S.: More than Words: Quantifying Lan-
guage to Measure Firms’ Fundamentals. J Financ. 63(3), 1437-1467 (2008)
12. Loughran, T., McDonald, B.: When is a Liability not a Liability? Textual Analysis, Dic-
tionaries, and 10-Ks. J Financ. 66(1), 35-65 (2011)
13. SEC, Final Rule: Additional Form 8-K Disclosure Requirements and Acceleration of Fil-
ing Date, http://www.sec.gov/rules/final/33-8400.htm (2004)
14. Slattery, D.M., Sutcliffe, R.F.E., Walsh, E.J.: Automatic Analysis of Corporate Financial
Disclosures. In: Gillam, L. (ed.) Making Money in the Financial Services Industry, a
Workshop at the Terminology and Knowledge Engineering Conference. Nancy, France
(2002)
15. van Bunningen, A.H.: Augmented Trading: From News Stories to Stock Price Predictions
Using Syntactic Analysis. Unpublished thesis (M.Sc.), University of Twente (2004)
16. Mittermayer, M.-A., Knolmayer, G.F.: Text Mining Systems for Market Response to
News: A Survey. Working Paper No 184, Institute of Information Systems, University of
Bern (2006)
17. Andersen, T.G., Bollerslev, T.: Deutsche Mark-Dollar Volatility: Intraday Activity Pat-
terns, Macroeconomic Announcements, and Longer Run Dependencies. J Financ. 53(1),
219-265 (1998)
18. MacKinlay, A.C.: Event Studies in Economics and Finance. J. Econ. Lit. 35(1), 13-39
(1997)
19. Griffin, P.A.: Got Information: Investor Response to Form 10-K and Form 10-Q EDGAR
Filings. Rev. Acc. Stud. 8(4), 433-460 (2003)
20. Carter, M.E., Soo, B.S.: The Relevance of Form 8-K Reports. J. Account. Res. 37(1), 119-
132 (1999)
21. Antweiler, W., Frank, M.Z.: Do U.S. Stock Markets Typically Overreact to Corporate
News Stories? Working Paper, University of British Columbia (2006)
22. Tetlock, P.C.: Giving Content to Investor Sentiment: The Role of Media in the Stock Mar-
ket. J Financ. 62(3), 1139-1168 (2007)
23. Asthana, S., Balsam, S.: The Effect of EDGAR on the Market Reaction to 10-K Filings. J.
Account. Public Pol. 20(4-5), 349-372 (2001)
24. Whisenant, J.S., Sankaraguruswamy, S., Raghunandan, K.: Market Reactions to Disclo-
sure of Reportable Events. Aud.: J. Prac. Theory. 22(1), 181-194 (2003)
25. Ball, R., Kothari, S.P.: Security Returns Around Earnings Announcements. Account. Rev.
66(4), 718-738 (1991)
26. Francis, J., Schipper, K., Vincent, L.: Earnings Announcement and Competing Infor-
mation. J. Account. Econ. 33(3), 313-342 (2002)
27. Henry, E.: Are Investors Influenced by how Earnings Press Releases are Written? J. Bus.
Commun. 45(4), 363-407 (2008)
28. Cho, V., Wüthrich, B., Zhang, J.: Text Processing for Classification. J. Comput. Intell. Fin.
7(2), 6-22 (1999)
29. Seo, Y.-W., Giampapa, J., Sycara, K.: Financial News Analysis for Intelligent Portfolio
Management. Technical Report CMU-RI-TR-04-04, Robotics Institute, Carnegie Mellon
University (2004)
30. Schumaker, R.P., Chen, H.: Textual Analysis of Stock Market Prediction Using Financial
News Articles. In: Rodriguez-Abitia, G., Ania, B., Ignacio (eds.) 12th Americas Confer-
ence on Information Systems, Acapulco, Mexico (2006)
31. Fama, E.F.: Market Efficiency, Long-Term Returns, and Behavioral Finance. J. Financ.
Econ. 49(3), 283-306 (1998)
32. Hildebrandt, H.W., Snyder, R.D.: The Pollyanna Hypothesis in Business Writing: Initial
Results, Suggestions for Research. J. Bus. Commun. 18(1), 5-15 (1981)
33. Thomas, J.: Disclosure in the Marketplace: The Making of Meaning in Annual Reports. J.
Bus. Commun. 34(1), 47-66 (1997)
34. Gillam, L., Ahmad, K., Ahmad, S., Casey, M., Cheng, D., Taskaya Temizel, T., de
Oliviera, P.C.F., Manomaisupat, P.: Economic News and Stock Market Correlation: A
Study of the UK Stock Market. In: Gillam, L. (ed.) Making Money in the Financial Ser-
vices Industry, a Workshop at the Terminology and Knowledge Engineering Conference.
Nancy, France (2002)
35. Kohut, G.F., Segars, A.H.: The President’s Letter to Stockholders: An Examination of
Corporate Communication Strategy. J. Bus. Commun. 29(1), 7-21 (1992)
36. Li, F.: Annual Report Readability, Current Earnings, and Earnings Persistence. J. Account.
Econ. 45(2-3), 221-247 (2008)
37. Joachims, T.: Estimating the Generalization Performance of a SVM Efficiently. In: Lang-
ley, P. (ed.) 17th International Conference on Machine Learning, pp. 431-438. Morgan
Kaufmann, San Francisco, CA (2000)
38. Lerman, A., Livnat, J.: The New Form 8-K Disclosures. Rev. Acc. Stud. 15(4), 752-778
(2009)