Automatic Analysis of Financial Event Phrases and
Keywords in Form 8-K Disclosures
Darina M. Slattery1 and Richard F. E. Sutcliffe2
1 School of Languages, Literature, Culture and Communication, University of Limerick, Limerick, Ireland
2 Department of Computer Science and Information Systems, University of Limerick, Limerick, Ireland
{darina.slattery, richard.sutcliffe}@ul.ie
Abstract. It is generally accepted that there are three different types of financial
information: information in past stock prices, information that is available to all
the public, and information that is both available to the public and available pri-
vately to insiders [1-4]. This study looks at publicly-available information in
Form 8-K disclosures filed on the Securities and Exchange Commission’s
EDGAR system. We developed a prototype financial event phrase (FEP) recog-
nizer and automatically classified a small sample of 8-K disclosures by likely
share price response, using automatically-recognized FEPs and hand-chosen
keywords as disclosure features. We used C4.5 and SVM-Light for classifica-
tion but found that C4.5 was more successful overall. In one experiment, we
found that C4.5 was able to correctly classify 63.2% of disclosures that had a
positive share price reaction around the disclosure filing date, as against 58.2%
at chance.
Keywords: financial events, share price reaction, classification, content analy-
sis, disclosures, phrases, keywords.
1 Introduction
It is generally agreed that there are three different types of financial information: in-
formation in past stock prices, information that is available to all the public, and in-
formation that is both available to the public and available privately to insiders [1-4].
There is considerable debate about the possible impact that different kinds of in-
formation can have on the value of financial instruments. On the one hand, the theory
of market efficiency, or the efficient markets hypothesis (EMH), states that the price
of a financial instrument properly reflects all available information immediately [1]. If
security prices respond to all available information quickly, then the market is deemed
efficient and no excess profits or returns can be made. Even though the EMH has
many advocates, investment analysis is still a thriving and, oftentimes, a profitable
Aguado de Cea et al. (Eds.): Proceedings of the 10th Terminology and Knowledge Engineering Conference (TKE 2012), pp.291-305. 19-22 June 2012, Madrid, Spain
industry. The two main kinds of investment analyst—technical and fundamental—
would argue that the market is inefficient because information disseminates slowly
through the market and prices under- or over-react to the information [2].
Technical analysis involves evaluating time series patterns and trends relating to
previous prices of the financial instrument and the volume of trading, with a view to
predicting future prices and volumes. Fundamental analysis, on the other hand, in-
volves evaluating the value of a financial instrument using quantitative and qualitative
content derived from company financial statements, news stories, analysts’ reports,
and message forums. Relevant fundamental content can include ratios and variables
such as the earnings per share (EPS), net profit, and trading volume, as well as an-
nouncements relating to company policies and possible structural changes. Frequently
used online news sources include Bloomberg, Raging Bull, the Motley Fool, the Wall
Street Journal, and Yahoo! Finance. Other sources of quantitative and qualitative
information include online financial analysis tools (e.g. Datastream, Dow Jones News
Analytics, and Quotestream), and online databases such as the Sageworks Database
Platform and the Securities and Exchange Commission’s Electronic Data Gathering
Analysis and Retrieval (EDGAR) system. This latter system (EDGAR) hosts corpo-
rate disclosures such as annual Form 10-Ks and interim Form 8-Ks. Our study focuses
on fundamental content contained in Form 8-K disclosures filed on EDGAR.
A number of different data sources, features, goals, and methods have been used to
automatically analyze financial documents. Data sources have included online news
stories or messages and on- or off-line statements or reports. Document features have
included single words, keyword records and phrases, and financial ratios and varia-
bles. Some researchers have used combinations of features, whilst others have used
just one feature type. Goals have included prediction of prices, prediction of direction
or trends, and identification of market or message sentiment. Finally, a number of
different methods have also been employed; the most frequently used being rule in-
duction algorithms, neural networks, support vector machines, and Bayesian methods.
However, statistical methods, language modeling, multiple discriminant analysis, k-
nearest neighbor, and genetic algorithms have also been used. See [5-12] for some
relevant studies.
To date, little research has been undertaken on automatic event phrase recognition and classification of online financial reports; to that end, our
goal was (1) to develop a prototype financial event phrase recognizer and (2) to clas-
sify the likely share price responses to online Form 8-Ks, using automatically recog-
nized financial event phrases (FEPs) and hand-chosen keywords as document fea-
tures. In various experiments, we used the C4.5 decision tree/rule induction suite of
programs and SVM-Light, a support vector machine program, to classify the 8-Ks.
Our two datasets comprised 8-Ks filed by 50 randomly-chosen S&P 500 companies
before and after the 2004 SEC rule changes (these rule changes will be discussed in
Section 2).
A number of assumptions underlie our research. Firstly, we assume that the market is not fully efficient, i.e. that prices do not always fully reflect all available information immediately. We also assume that Form 8-Ks have some informational value for prediction purposes. We assume that prices adjust within a day to news filed in 8-Ks and that all companies are equally affected by external factors, i.e. we do not control for other variables. We assume that firms voluntarily disclose positive news but that they only disclose negative news when they are legally obliged to do so. Finally, we assume that positive or good news is easier to classify automatically than negative news, as the latter is often disguised in the midst of positive news.
2 Form 8-K Disclosures
Form 8-Ks, also known as interim reports, must be filed within a few days of certain
financial events occurring. Prior to 2004, corporations were required to report only nine types of events, including changes in control of registrant, acquisition or disposition
of assets, resignation of registrant’s directors, and other materially important events.
In 2002, the SEC put forward a proposal to increase the number of events and to
shorten the filing deadline. Due to the large number of comment and complaint letters
they received, they did not implement this proposal. However, in July 2002, Congress
passed the Sarbanes-Oxley Act. Amongst other things, this Act required public com-
panies to disclose information that was of material interest to investors and other rele-
vant parties, on a rapid and current basis. Following on from this Act, and having
taken on board many of the comments received after the 2002 proposal, the SEC offi-
cially amended the Form 8-K requirements in March 2004 [13]. Eight new event
types (or “items”) were added to the list of events that must be reported, two items
were transferred from the periodic reports (Form 10-Qs and Form 10-Ks), and two
existing items were modified. Items were reorganized and renumbered and similar
items were grouped. Also, the time period within which such filings must be made
was shortened to four business days after the triggering event. The changes took effect in August 2004.
When we compared the number of 8-Ks filed in 1997 and 2005 (before and after
the rule changes), we found there was an almost five-fold increase in the number of
filings in 2005, the year after the changes took effect (24,098 in 1997 vs. 116,282 in
2005). Interestingly, we also noted that there was a slight decline in filings after 2005,
which was possibly caused by a greater overall understanding of the requirements of
the new filing rules.
3 Methodology
In this section, we will discuss the datasets we used and why we decided to use closing prices and a three-day reaction window (Section 3.1). In Section 3.2, we will describe how we devised the event categories that were subsequently used by the automatic financial event phrase (FEP) recognizer. In Section 3.3, we will outline the automatic FEP recognition process, and in Section 3.4 we will outline some patterns that emerged from a preliminary pattern analysis of the recognized FEPs. Finally, in Section 3.5, we will discuss our findings from a precision and recall evaluation of our recognizer’s performance.
3.1 Datasets, Closing Prices, and Reaction Windows
This study uses three datasets of Form 8-Ks. The first dataset we obtained related to
specific types of companies (software-related) that filed over a five-year period in the
mid to late 1990s and were therefore only required to file a limited number of event
types, as discussed in Section 2. We used this initial dataset for some preliminary
experiments (see, for example, [14]) and to create the original event listing for the
financial event recognizer (see Section 3.2). The second and third datasets related to
two four-year periods but to a much broader spectrum of companies. The second da-
taset related to the four-year period from 1997 to 2000 and the third dataset related to
2005-2008. For both of these datasets, we randomly selected 50 companies from the
Standard and Poor’s 500, as this listing represents a cross-section of US industry and
is considered a leading indicator of US equities.
For each of the 8-Ks in the second and third datasets, we also obtained the closing
prices around the 8-K filing date. We used a three-day window when calculating the
share price return, by comparing the closing price the day after the filing was made
with the closing price the day before (t-1 to t+1, where t was the disclosure filing
date). Like van Bunningen [15], we used closing prices rather than intraday prices as
only closing prices were readily available. Also, it was not always possible to deter-
mine the specific time of the day when filings were submitted to the SEC and subse-
quently posted online, so it was best to use a consistent pricing method. Whilst
Mittermayer and Knolmayer [16] argue that closing prices should not be used as the
market reacts very quickly, and Andersen and Bollerslev [17] argue that closing pric-
es that are similar to the previous opening prices may disguise intraday volatility,
MacKinlay [18] was not convinced of the net benefit of high-frequency intraday data.
We worked on the assumption that closing prices within a day of an event would still
capture the effect of the event. We chose a three-day window for a number of reasons.
Firstly, we assumed that reactions to 8-Ks—as opposed to news headlines—do not
always occur immediately and may take up to a day. Secondly, as SEC filings are
sometimes posted after trade has ceased on day t, the effect of an 8-K might not be
seen until day t+1. Also, there can be a short lag between when an EDGAR filing is
made and when a filing appears on the public site [19]. However, because the filing
deadline was reasonably generous prior to 2004 (5 business or 15 calendar days), it is
possible that the event information had already been leaked by the time our 8-Ks
were filed. Carter and Soo [20] found evidence of a limited market response to 8-Ks
and that timely filing was the single most important factor influencing the market
relevance of the 8-K. This was one of the reasons for creating the third dataset, which
spanned 2005 to 2008. 8-Ks filed since 2004 have to be filed more promptly so the
price reaction could be more closely related to the content, than previously. Antweiler
and Frank [21] suggested using windows greater than three days for robustness as
they found that the returns varied depending on the length of the window used.
Tetlock [22] also experimented with different window sizes and found that some neg-
ative effects were reversed within a week. Asthana and Balsam [23] used a five-day
window (t-1 to t+3) but reported similar results with three-day (t-1 to t+1) and seven-
day (t-1 to t+5) windows. Whisenant et al. [24] used three- and seven-day windows
but suggested that if leakage of information occurs before an 8-K filing is made, then
the use of a three-day window could potentially “underestimate the information con-
tent of the disclosures” [p. 185]. We chose a three-day window as we wanted to elim-
inate the likelihood of confounding events which could occur in a longer window and
because a three-day window is consistent with several research studies e.g. [25-27].
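As a concrete illustration, the three-day return described above can be computed from closing prices as follows. This is our own sketch in Python; the paper does not specify an implementation, and the function and variable names are ours:

```python
def three_day_return(closing_prices, t):
    """Three-day share price return around a disclosure filing date:
    compare the closing price on day t+1 with the closing price on
    day t-1, where t is the index of the filing date."""
    p_before = closing_prices[t - 1]  # close on day t-1
    p_after = closing_prices[t + 1]   # close on day t+1
    return (p_after - p_before) / p_before

# Example: close rises from 100.0 (day t-1) to 103.0 (day t+1),
# a +3% return over the window.
print(three_day_return([100.0, 101.5, 103.0], 1))  # 0.03
```

A positive return would label the disclosure an ‘up’, a negative return a ‘down’, and a zero return a ‘nochange’, matching the categories used in Section 3.2.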
3.2 Devising Event Categories for the Financial Event Phrase Recognizer
We manually analyzed the content of 185 disclosures from the first dataset, compris-
ing the top 56 ‘ups’, the top 56 ‘downs’, and all 73 ‘nochanges’. By top 56 ‘ups’, we
mean the 56 disclosures with the greatest positive change in share price return around
the disclosure filing date. Likewise, the 56 ‘downs’ had the greatest negative changes
and the 73 ‘nochanges’ had no change in share price return. We specifically chose the
top ‘ups’ and ‘downs’ because we felt they were more likely to contain content that
could impact the share price. During this manual content analysis, we identified 20
major event categories and several key event types within those categories. We then
identified all the disclosures that contained keywords relating to each major event
category, using a combination of search and concordance tools. Sample search key-
words included dismissal, appointment, resign and merger. Using search tools, we
manually analyzed disclosures that matched the search keywords and extracted the
key phrases that described the relevant events. Also, using concordance tools, we
manually analyzed all the returned sentences and extracted the most critical and inter-
esting parts of the sentences to create event phrases. These event phrases were then
edited manually to remove references to company names, product names, and finan-
cial values. Sample phrases included consummated a private offering, reported record
quarterly net income, and entered into an agreement to acquire all of the outstanding
capital stock. This filtering created a ‘best choices’ list of key phrases for each event
type. We later fine-tuned several key event types. Some new events were identified,
some events were split and others were merged. We then used these key phrases to
develop a prototype financial event phrase (FEP) recognizer.
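The keyword search and concordance steps described above can be approximated with a simple keyword-in-context routine. This is our own sketch; the original analysis used off-the-shelf search and concordance tools, and the example sentence is hypothetical:

```python
def concordance(text, keyword, window=5):
    """Return each occurrence of keyword with up to `window` words of
    context on either side (a simple keyword-in-context listing)."""
    tokens = text.lower().split()
    hits = []
    for i, tok in enumerate(tokens):
        if keyword in tok:  # matches 'resign', 'resigned', 'resignation', ...
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append(" ".join(left + [tok] + right))
    return hits

lines = concordance(
    "The chief financial officer announced his resignation effective today.",
    "resign")
print(lines)  # ['chief financial officer announced his resignation effective today.']
```

Extracting the most critical parts of such returned sentences, and then stripping company names, product names, and financial values, yields candidate event phrases of the kind listed above.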
We decided to develop our own list of FEP phrases, rather than use events and
keywords identified in other studies, because the majority of the other studies used
online news stories or messages, rather than disclosures e.g. see [5] and [28-30]. The
language in formal reports can be quite different to the language used in news head-
lines, stories and discussion board postings. We also decided to examine as many
events as possible, rather than just specific events, to avoid what Fama [31] referred
to as “dredging for anomalies” [p.287]. Nonetheless, developing the FEP lists for the
recognizer was not a straightforward task. It was difficult at times to interpret the
meaning of a phrase and to determine if a phrase was one type of event or another.
The stock-related events, in particular, were difficult to interpret at times. We are not
yet convinced that this is the ideal categorization of stock-related events and recom-
mend that further work is carried out in this area. Other researchers who highlighted
similar kinds of difficulties when trying to interpret the meaning of financial text in-
clude Hildebrandt and Snyder [32], who discussed the importance of context; Thomas
[33], who studied condensations and contractions; and Gillam et al. [34], who dis-
cussed problems with negation, double negation and ambiguous terms.
Whilst we were developing the FEP lists, we also developed a list of named enti-
ties (NEs). We identified NEs such as accountant names (e.g. KPMG Peat Marwick
LLP and Price Waterhouse LLP) and types of employee (e.g. Senior Vice President,
Strategy, Finance and Administration and Chairman of the Board of Directors). We
also developed a list of types of financial object (TFO)—for example, we identified
84 different types of loss (e.g. loss before provision for income tax and unrealized
loss on investments), 193 types of stock, right or option (e.g. redeemable convertible
preferred shares and preferred stock) and 72 types of stock option agreement or plan
(e.g. stock option agreement and supplemental stock plan). The prototype FEP recognizer used the NEs and TFOs to ensure that as many variations of a financial event as possible would be recognized and that, where FEPs could not be recognized in an 8-K, at least NEs and/or TFOs would be written to the output, to facilitate classification later on.
We also devised a list of the most frequently-occurring words. We then fine-tuned
this list by removing “useless” stop words and other words to create a list of 1,568
interesting keywords. Whilst not all these words were of a financial nature, they all
appeared frequently in our 8-Ks. These words were used as additional document fea-
tures in the classification experiments described in Section 4.
3.3 Automatic Recognition of FEPs
As discussed in the previous section, we devised lists of FEPs, NEs, TFOs, and keywords. The following steps were undertaken to recognize these in 8-Ks¹:
1. The full 8-K text was read in as a single string.
2. The 8-K string was split into a list of sentences.
3. Each sentence string was converted to lowercase.
4. Each lowercase sentence string was tokenized. Full-stops within words were re-
tained. Each tokenized sentence became a list of atoms.
5. The FEPs, NEs, and TFOs within each tokenized sentence were then recognized
using our FEP recognizer and written to an output file.
6. Every single word in the 8-K was written to a second output file. The items recog-
nized in step 5 were appended to this second output file.
Once the FEPs, NEs, and TFOs had been recognized and appended to all the words in
the 8-K, this second output file was compared to a list of 1,635 possible attributes,
comprising the 1,568 hand-chosen keywords, 49 FEP types, and 18 NE and TFO
types. A simple Perl program then wrote the value for each attribute (e.g. true or
false), in preparation for the C4.5 and SVM-Light classification experiments (see
Section 4).
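Steps 1-6 and the attribute-vector step can be sketched as follows. The original attribute writer was a Perl program; this Python version is purely our reconstruction, and the toy phrase lists are hypothetical stand-ins for the real inventories of 49 FEP types, 18 NE/TFO types, and 1,568 keywords:

```python
import re

# Toy stand-ins for the hand-built lists (hypothetical entries).
FEP_PHRASES = {"entered into an agreement to acquire": "merger",
               "reported record quarterly net income": "income_reporting"}
KEYWORDS = ["merger", "resign", "appointment", "dismissal"]

def process_8k(text):
    # Steps 1-4: read the 8-K as one string, split it into sentences,
    # lowercase each sentence, and tokenize (full-stops within words kept).
    sentences = [s.lower() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[\w.'-]+", s) for s in sentences]

    # Step 5: recognize FEPs in each sentence (simple substring matching here).
    feps = [label for s in sentences
            for phrase, label in FEP_PHRASES.items() if phrase in s]

    # Step 6: every word in the 8-K plus the items recognized in step 5.
    words = {tok for toks in tokenized for tok in toks}
    features = words | set(feps)

    # Attribute-vector step: one true/false value per candidate attribute.
    attributes = KEYWORDS + list(FEP_PHRASES.values())
    return {attr: attr in features for attr in attributes}

vec = process_8k("The company entered into an agreement to acquire "
                 "all of the outstanding capital stock.")
print(vec["merger"], vec["resign"])  # True False
```

The resulting true/false vector is what the classifiers in Section 4 consume, one attribute per hand-chosen keyword or recognized FEP/NE/TFO type.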
¹ As mentioned earlier, we did not recognize FEPs in the first dataset or use it for our main classification experiments, as it was not based on the S&P 500 listing.
3.4 Automatic Pattern Analysis of FEPs
We carried out a preliminary analysis of FEPs found by our prototype recognizer, to
see if we could identify any interesting patterns about events in 8-Ks, prior to carrying
out the classification experiments. For the second and third datasets, we examined the
total number of words, the number of FEPs recognized, the different FEP types rec-
ognized, and the occurrences of each FEP type, in ‘downs’ and ‘ups’. We will now
summarize some of the more interesting findings.
With regard to the number of words, there were 807,620 words in the first S&P 500 dataset (1997 to 2000) and 2,682,155 words in the second (2005 to 2008),
which represents a 30% increase in words. We would expect there to be more words
in the latter dataset, as there were 44% more 8-Ks. The fact that there were 44% more
8-Ks but only 30% more words could suggest a number of things: (1) that there was
less content in the more recent disclosures, as 8-Ks focused on only one or two very
recent events and/or (2) that disclosure writers adopted more concise writing strate-
gies in the more recent 8-Ks, possibly because their reporting practices were being
more closely monitored by the SEC. Perhaps of greater interest is the word count
difference between ‘downs’ and ‘ups’, in both datasets. Looking firstly at the first
dataset, there were 29% more words in the ‘ups’ compared to the ‘downs’, even
though there were only 14% more ‘up’ disclosures. When we performed the same
type of comparison with the second dataset, we found that there were 17% more
words in the ‘ups’ compared to the ‘downs’, and 17% more ‘up’ disclosures. This
suggests that our ‘ups’ contained more verbose language than the ‘downs’ before the
2004 rule changes but that the ‘ups’ and ‘downs’ were fairly similar (word-count
wise) after 2004. Assuming that there is a correlation between the disclosure content
and share price reaction (see Section 1 for a list of our assumptions), this finding for
the first S&P 500 dataset would appear to correlate with the Kohut and Segars [35]
finding that high-performing firms use more verbose language than poor-performing
firms. Li [36] found that the annual reports of poor-performing firms tend to be less
readable and that firms with more readable documents tend to have more persistent
positive earnings. Hildebrandt and Snyder [32] applied the ‘Pollyanna Hypothesis’ to
the writing of annual reports and found that there were significantly more positive
words than neutral or negative words, regardless of whether it was a financially good
or bad year, suggesting that there is a general preference for using positive words in
disclosures. Assuming there is a correlation between content and share price return,
this could explain why the ‘ups’ contained more words than the ‘downs’ before 2004.
Also, it seems clear that the report writing style changed sometime in the intervening
period; either the ‘ups’ became more concise than previously or the ‘downs’ became
more verbose.
With regard to the FEPs recognized in both datasets, we initially assumed that our
recognizer would find more FEPs in the second dataset, compared to the first, because
the second dataset was larger and firms were legally obliged to file more event types.
However, this was not the case for the ‘downs’. Even though there were more than
twice as many ‘downs’ in the second dataset compared to the first (574 vs. 256), our
recognizer found FEPs in proportionately fewer 8-Ks (20.4% vs. 30.9%). With regard to the ‘ups’, the recognizer did find more FEPs in the second dataset, but it found proportionately fewer overall (24.3% vs. 30.8%)².
When we compared the average number of FEPs recognized per 8-K, as well as the
minimum and maximum number, we found that the recognizer was less successful in
finding FEPs in the second dataset. On average, the recognizer found 1 FEP per file in the first, but only 0.4 (i.e. often none) in the second. The maximum number of FEPs recognized
also decreased in the second dataset, but this could also be because of more accurate
and timely reporting i.e. companies might have started to file single events more fre-
quently than before, rather than file several events in one disclosure. As already men-
tioned in this section, we noted that the recognizer found FEPs in a smaller percentage
of 8-Ks in the second dataset. One possible reason for this could be that the language
changed significantly in the intervening period, and therefore the recognizer was not
able to recognize as many FEPs. It is possible that the marked increase in the number
of events that needed to be filed, brought with it a change in the style of reporting
language. Also, as we know that the number of event types that had to be filed was
increased from 2004, it is quite likely that some of the new event types were not ca-
tered for in our FEP recognizer.
Our pattern analyses also yielded additional interesting findings. For example, we
found that whilst the total number of FEPs recognized decreased in the second da-
taset, the number of unique FEPs recognized increased; one possible reason for this
could be that there was less repetition of each event, as auditors became more consci-
entious about their reporting style after the 2004 rule changes. We also found that the
majority of FEP types that were recognized in the first dataset, were also recognized
in the second dataset. Of the 49 FEP types, 18 were recognized in one or more 8-Ks in
the first dataset and 19 were recognized in the second. Only the FEPs relating to ac-
countant dismissals, dividend distributions, and private placements of stock were
recognized in the second dataset, but not in the first; conversely, only the stock offer-
ing, stock option agreement or plan, and loss reporting FEP types were found in the
first dataset but not in the second. The only FEP type that appeared to be correlated
with an increase in share price was accountant appointment, which was only recog-
nized in the ‘ups’. We might expect a new appointment to be correlated with good
news, especially if there was an issue with a previous accountant. When we examined
the FEP types that were recognized most often, we found that merger events were
recognized in a fairly even number of ‘ups’ and ‘downs’, in both datasets. Merger and
acquisition agreements can have different implications for companies, depending on
whether they are the acquiring company or the company being acquired. Also, specif-
ic details regarding a merger can have different implications for shareholders as they
often have associated stock changes.
² It is important to note here that these figures relate to FEPs found by our recognizer; it is likely that there were more FEPs in the 8-Ks but that these were not recognized.
3.5 Precision and Recall
In the previous section, we reported that our recognizer found FEPs in proportionately fewer ‘downs’ in the second S&P 500 dataset, compared to the first (20.4% vs. 30.9%). With regard to the ‘ups’, the recognizer found more FEPs in the second dataset, compared to the first, but proportionately fewer overall (24.3% vs. 30.8%).
As it became evident that further work was needed to improve the recall of the rec-
ognizer, we then decided to further evaluate the performance by evaluating the preci-
sion and recall on a subset of documents, using only 8-Ks that had one or more FEPs
automatically recognized in them. For this purpose, we selected 10% of the disclo-
sures used in one of our experiments. This experiment used 280 disclosures filed be-
tween 2005 and 2008, so we selected 14 ‘ups’ and 14 ‘downs’ (28 in total, or 10%).
We avoided using more than one disclosure from the same company for two reasons.
Firstly, it was possible that a company released more than one disclosure relating to
the same event and secondly, the auditors probably used similar language in each of
these disclosures. Using more than one disclosure describing the same event could
skew our evaluation. As this evaluation of precision and recall (P&R) was only on a
subset of the entire collection, and comprised only disclosures that had one or more
recognized FEPs in them, we will refer to this P&R as adjusted P&R.
We found that the adjusted recall of the ‘downs’ (0.65) was greater than the adjusted recall of the ‘ups’ (0.42). The adjusted precision was also higher for the ‘downs’ (0.77) than for the ‘ups’ (0.52). The combined adjusted recall and precision for ‘downs’ and ‘ups’ were 0.54 and 0.65 respectively. It is probably not surprising that
our overall adjusted recall is only moderate, as the recognizer was initially developed using
disclosures filed before the 2004 rule changes and we know from earlier sections that
the word counts and event types changed after these changes took effect. However,
we would expect the precision to be higher so we will now summarize some of the
difficulties we encountered when determining the precision. One problem related to
repetition of the same event in a disclosure. Whilst the first occurrence may have been accurate, we counted the other occurrences as inaccurate, as there was really only one
event. Precision was also affected by previous events, which were recognized but not
truly accurate as we really only wanted to recognize recent events. There were also a
couple of outright inaccuracies, where the language used was misinterpreted as anoth-
er event type. Further work is needed to eliminate or at least minimize these inaccura-
cies.
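The adjusted precision and recall reported above follow the standard definitions; as a sketch (our own illustration, with hypothetical counts chosen only to reproduce the two rounded ‘downs’ figures):

```python
def precision_recall(tp, fp, fn):
    """Precision: fraction of recognized FEPs that were correct.
    Recall: fraction of actual FEPs that were recognized."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts that reproduce the 'downs' figures above
# (precision 0.77, recall 0.65): 17 correct, 5 spurious, 9 missed.
p, r = precision_recall(tp=17, fp=5, fn=9)
print(round(p, 2), round(r, 2))  # 0.77 0.65
```

Under these definitions, the repeated-event and previous-event matches discussed above count as false positives, which is why they depress precision.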
4 Classification Experiments
We carried out a number of classification experiments using the second and third
datasets (see Section 3.1 for an overview of our three datasets), with a view to predict-
ing the likely share price responses to the financial events contained within the 8-Ks.
Table 1 summarizes the results from two experiments and a textual description of
both experiments is provided in Sections 4.1 and 4.2. A summary of the overall find-
ings is presented in Section 4.3. In both experiments, we used the C4.5 decision tree/
rule induction suite of programs and SVM-Light, a support vector machine program,
to classify the 8-Ks. Also, in both experiments, all the 8-Ks had one or more automat-
ically recognized financial event phrases (FEPs) in them, as well as a number of other
features—named entities (NEs), types of financial object (TFOs) and keywords. In
other words, we used these recognized FEPs, NEs, TFOs, and keywords as input fea-
tures for classification.
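The 10-fold cross-validation protocol used in both experiments can be sketched as follows. This is our illustrative Python, not the actual C4.5/SVM-Light runs; the majority-vote stand-in classifier simply reproduces the class proportion, which is also the arbitrary-classification baseline reported in Table 1:

```python
import random

def ten_fold_accuracy(examples, labels, make_classifier):
    """Average accuracy over 10 folds: each fold is held out once
    while a classifier is built from the remaining nine folds."""
    indices = list(range(len(examples)))
    random.Random(0).shuffle(indices)
    folds = [indices[i::10] for i in range(10)]
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train_labels = [labels[i] for i in indices if i not in held_out]
        classify = make_classifier(train_labels)
        correct = sum(classify(examples[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)

# Toy stand-in classifier: always predict the majority training label.
def majority_classifier(train_labels):
    majority = max(set(train_labels), key=train_labels.count)
    return lambda example: majority

labels = ['up'] * 163 + ['down'] * 117   # Experiment 2 class sizes
examples = list(range(len(labels)))      # dummy placeholders for 8-K features
acc = ten_fold_accuracy(examples, labels, majority_classifier)
print(round(acc * 100, 1))  # 58.2 -- the 'up' class proportion
```

A classifier only adds value to the extent that its per-class accuracy beats this majority-class figure, which is how the comparisons in Table 1 should be read.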
Table 1. Summary of Results for Both Experiments.

                           Average accuracy of      Average accuracy of        Average accuracy of
                           ‘ups’ and ‘downs’        ‘ups’ (vs. arbitrary       ‘downs’ (vs. arbitrary
                           together (vs. chance)    classification to ‘up’)    classification to ‘down’)
Experiment 1   C4.5        52.1% (vs. 50%)          53.3% (vs. 53.3%)          50.6% (vs. 46.7%)
               SVM-Light   49.7% (vs. 50%)          70% (vs. 53.3%)            29.4% (vs. 46.7%)
Experiment 2   C4.5        52.5% (vs. 50%)          63.2% (vs. 58.2%)          37.6% (vs. 41.8%)
               SVM-Light   49.6% (vs. 50%)          98.8% (vs. 58.2%)          0.4% (vs. 41.8%)
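The ‘vs. arbitrary classification’ figures in Table 1 are simply the class proportions in each dataset; as an arithmetic check (our own illustration):

```python
def class_baselines(n_ups, n_downs):
    """Accuracy of arbitrarily assigning every disclosure to one class."""
    total = n_ups + n_downs
    return round(100 * n_ups / total, 1), round(100 * n_downs / total, 1)

# Experiment 1: 90 'ups' and 79 'downs'; Experiment 2: 163 and 117.
print(class_baselines(90, 79))    # (53.3, 46.7)
print(class_baselines(163, 117))  # (58.2, 41.8)
```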
4.1 Experiment 1: Second Dataset (1997 to 2000) (C4.5 and SVM-Light)
This dataset contained 169 8-K disclosures, comprising 90 ‘ups’ and 79 ‘downs’.
Using C4.5, the average accuracy of the ‘ups’ and ‘downs’ together was 52.1% using
10-fold cross-validation, which is only marginally better than chance, assuming a 50-
50 up-down decision (see Table 1). When we focused on the performance of the ‘ups’
specifically, C4.5 correctly classified 53.3% of them. Coincidentally, the ‘ups’ com-
prised 53.3% of the full dataset (90/169 disclosures), so our prototype recognizer and
subsequent classification did not yield improved results over arbitrary classification
i.e. if we had assigned every document to the ‘up’ category as a baseline benchmark.
However, when we focused on the performance of the ‘downs’, C4.5 correctly classi-
fied 50.6% of them (highlighted in bold in Table 1), whereas the full dataset com-
prised only 46.7% ‘downs’ (79/169 disclosures). If we had arbitrarily assigned all 169
disclosures to the ‘down’ category, we would only have been correct 46.7% of the
time; our automatic system, however, was correct 50.6% of the time. This is a posi-
tive result for the ‘downs’. Unfortunately, it was not possible to undertake a hypothe-
sis test for a population proportion to determine the statistical significance of this
result, as we did not have data relating to the entire population of ‘up’ and ‘down’
disclosures.
Using SVM-Light and 10-fold cross-validation, the average accuracy of the ‘ups’
and ‘downs’ together was 49.7%. Looking more closely at the ‘ups’ and ‘downs’
separately, we found that SVM-Light correctly classified 70% of the ‘ups’ (highlight-
ed in bold in Table 1) but only 29.4% of the ‘downs’. As we will discuss in the next
section, this high result for the ‘ups’ appears to be due to SVM-Light selecting the
majority class (‘up’) as the default classification.
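The default-classifier behaviour we attribute to SVM-Light can be illustrated with a toy majority-class predictor: it ignores document content entirely, so it scores near 100% on the majority class and near 0% on the minority class. This is a hypothetical sketch, not SVM-Light itself.

```python
from collections import Counter

def majority_class_predictor(train_labels):
    """Return a 'classifier' that always predicts the most frequent
    training class, regardless of the document's content."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _document: majority

# Class balance of the second dataset: 90 'ups' and 79 'downs'.
predict = majority_class_predictor(["up"] * 90 + ["down"] * 79)

print(predict("positive earnings 8-K"))  # up
print(predict("negative earnings 8-K"))  # up
```

Any classifier whose per-class accuracies approach the 100%/0% pattern on imbalanced data should be compared against this degenerate baseline before its headline accuracy is trusted.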
4.2 Experiment 2: Third Dataset (2005 to 2008) (C4.5 and SVM-Light)
This dataset contained 280 disclosures, comprising 163 ‘ups’ and 117 ‘downs’. Using
C4.5, the average accuracy of the ‘ups’ and ‘downs’ together was 52.5% using 10-
fold cross-validation, which is only marginally better than chance, assuming a 50-50
up-down decision. When we focused on the performance of the ‘ups’, C4.5 correctly
classified 63.2% (highlighted in bold), whereas the full dataset comprised only 58.2%
‘ups’ (163/280 disclosures). If we had arbitrarily assigned all 280 disclosures to the
‘up’ category, we would only have been correct 58.2% of the time; our automatic
system, however, was correct 63.2% of the time so this is a positive result. When we
focused on the performance of the ‘downs’, C4.5 correctly classified 37.6% of them,
whereas the dataset comprised 41.8% 'downs' (117/280 disclosures). Our system did not improve on arbitrary classification in this case.
Using SVM-Light and 10-fold cross-validation, the average accuracy of the ‘ups’
and ‘downs’ together was 49.6%. Looking more closely, SVM-Light correctly classi-
fied 98.8% of the ‘ups’ (highlighted in bold) but only 0.4% of the ‘downs’. This sig-
nificant difference in accuracy levels between the ‘ups’ and ‘downs’ seems to confirm
that SVM-Light behaved like a default classifier here, assigning the majority of test
cases to the class with the most data [37].
4.3 Summary of Results
To summarize the results, when the ‘ups’ and ‘downs’ were classified together, the
average accuracy was around chance (50%). However, when we examined the performance of the 'ups' and 'downs' separately, the results were more revealing. Taking both datasets together, C4.5 always equaled or outperformed arbitrary classification on the 'ups', and SVM-Light always outperformed it. On the 'downs', C4.5 outperformed arbitrary classification only in the second (earlier) dataset, and SVM-Light never did.
Overall, C4.5 appeared to be better at classifying the ‘ups’ and ‘downs’. Whilst the
accuracy rate for the ‘ups’ was particularly high for SVM-Light in the third (more
recent) dataset, we must remember that SVM-Light seemed to use the default classifi-
cation (‘up’) for the majority of the classifications. Relying on the default classifica-
tion would not be recommended in a real-world investment scenario.
We should also mention here that the C4.5 average training error rates were rela-
tively low, ranging from 2.9% (second dataset) to 7.7% (third dataset). However, with
SVM-Light, the average training error rates were much higher, ranging from 20%
(second dataset) to 35.3% (third dataset).
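In outline, the 10-fold protocol behind these accuracy and training-error figures looks like the following generic harness. This is a sketch, not the actual C4.5/SVM-Light tooling; the demo labelling mirrors the 90/79 class balance of the second dataset but is otherwise invented.

```python
import random
from collections import Counter

def ten_fold_cross_validation(documents, labels, train_fn, k=10, seed=0):
    """Average test accuracy and average training error over k folds.
    `train_fn(train_docs, train_labels)` must return a predict(doc) callable."""
    order = list(range(len(documents)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]          # k disjoint test folds
    test_accs, train_errs = [], []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f in range(k) if f != held_out for i in folds[f]]
        predict = train_fn([documents[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        test_accs.append(sum(predict(documents[i]) == labels[i]
                             for i in test_idx) / len(test_idx))
        train_errs.append(sum(predict(documents[i]) != labels[i]
                              for i in train_idx) / len(train_idx))
    return sum(test_accs) / k, sum(train_errs) / k

# Demo with a majority-class 'trainer' on an invented 169-document labelling.
docs = list(range(169))
labels = ["up"] * 90 + ["down"] * 79
majority = lambda ds, ls: (lambda _d, m=Counter(ls).most_common(1)[0][0]: m)
avg_acc, avg_train_err = ten_fold_cross_validation(docs, labels, majority)
```

The training error is measured on the same documents the model was fitted to, which is why the low C4.5 figures (2.9% to 7.7%) indicate a close fit to the training data rather than generalization ability.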
5 Conclusions
This paper describes one approach to automatic financial event phrase recognition in
Form 8-K disclosures. It also describes the results from two classification experi-
ments; one based on disclosures filed before the SEC’s 2004 rule changes, the other
based on disclosures filed afterwards.
We found that the combined accuracy levels for ‘ups’ and ‘downs’ together did not
yield promising results (combined average accuracies ranged from 52.1% to 52.5% for C4.5 and from 49.6% to 49.7% for SVM-Light). However, when we investigated the classification accuracy for the 'ups' and 'downs' separately, the results were slightly more
promising. For example, if the share price for a particular disclosure is going to in-
crease (unbeknownst to the user), then both systems are more likely to predict ‘up’, so
one might expect to make a profit. However, if the share price is going to decrease,
then C4.5 would be the safer option. Lerman and Livnat [38] found that market reac-
tions to 8-Ks and other disclosure types varied by event—some events caused strong
positive returns, whereas others caused negative returns. They also suggested that for
some events there may have been an absence of information or inconsistent reactions
(an event that was good for one firm may have been bad for another).
Mittermayer and Knolmayer [16] highlighted the level of noise that exists in training
data—for example, several 8-Ks could be released in a short timeframe, each describ-
ing the same event. However, only one of these may have actually impacted the price
(possibly the first one).
Focusing on the C4.5 results only (since it did not appear to rely on a default
classification), our findings suggest that, using event phrases, keywords, and other
such features of the 8-Ks, the ‘downs’ seemed to be easier to classify than the ‘ups’
before the 2004 rule changes came into effect. At the time, corporations were only
required to file a limited number of event types and there was greater flexibility with
regards the filing deadline. We think that corporations would only have mentioned
negative news if it was a recent negative news event that they were obliged to dis-
close. This lack of ‘noise’ in negative news disclosures could have facilitated the
classification of the ‘downs’. Also, our pattern analysis (see Section 3.4) revealed that
the ‘ups’ contained proportionately more words than the ‘downs’ during this period
(when compared to post-2004); this could partially explain the difficulties faced by
the classifier when attempting to classify the ‘ups’. On the other hand, post 2004, the
‘ups’ appeared to be easier to classify than the ‘downs’. Our pattern analysis revealed
that the ‘ups’ became more concise (relative to the ‘downs’) during this period; one
possible reason for this improved conciseness could have been the stricter filing rules.
Because negative news disclosures tend to discuss positive news as well, in order to soften the likely impact of the negative news, this could have hampered the classification of the 'downs'.
There are a number of limitations to the experiments described in this study. For real-world implementation, the recall of the recognizer would need to be greatly improved.
This could be done by incorporating new event types and phrases that have emerged
in more recent years. Secondly, the two classification experiments described here use
relatively small datasets (169 and 280 disclosures respectively). Whilst the relatively
small size of the datasets is due to the performance of the recognizer (we only classi-
fied disclosures that contained one or more automatically recognized events), ideally
we would test the classifier using much larger datasets. Thirdly, we used a large num-
ber of classification attributes (1,635); further refinement of the event phrases and
keywords could yield more interesting classification results. Finally, we cannot be
certain that these disclosures were the sole cause of the share price changes; other
variables such as company history, industry type, and macroeconomic factors could
also have impacted the prices.
Despite these limitations, we think this research also has a number of strengths.
Our study differs from most other studies in that we analyze complete documents and
we use a combination of features (financial event phrases, named entities, types of
financial object, and keywords) when attempting to classify 8-Ks by likely share price
response. We do not claim to have developed the definitive solution; rather, we have
identified one possible solution that could facilitate the arduous task of analyzing 8-
Ks. With some more work, the output from the recognizer could be used by various
parties. For example, disclosure writers could examine the financial event phrases we
identified and incorporate the most appropriate phrases (or avoid using certain
phrases) to encourage the desired share price response. Likewise, fundamental ana-
lysts could incorporate these phrases into their content analysis toolkits. Individual
investors could search for specific phrases and then purchase or sell stocks based on
the content of disclosures.
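As a sketch of the phrase-based screening an individual investor might perform, the following naive matcher counts occurrences of positive and negative phrases. The phrase lists here are invented examples, not our actual FEP inventory.

```python
# Hypothetical phrase lists -- NOT the actual FEPs from our recognizer.
POSITIVE_PHRASES = ["record revenue", "raised guidance", "declared a dividend"]
NEGATIVE_PHRASES = ["filed for bankruptcy", "lowered guidance", "restated earnings"]

def screen_disclosure(text):
    """Flag an 8-K as 'up', 'down', or 'neutral' by naive phrase counting."""
    lower = text.lower()
    pos = sum(lower.count(p) for p in POSITIVE_PHRASES)
    neg = sum(lower.count(p) for p in NEGATIVE_PHRASES)
    if pos > neg:
        return "up"
    if neg > pos:
        return "down"
    return "neutral"

print(screen_disclosure("The company reported record revenue and raised guidance."))  # up
```

A production version would of course need the full recognizer's event phrases and the disambiguation it performs; raw substring counting is only a starting point.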
Some recommendations for future work include further refinement of the financial
event phrases and keywords, classification of larger datasets (possibly using more
volatile companies outside the S&P 500), and incorporation of additional classifica-
tion variables beyond event phrases and keywords.
References

1. Fama, E.F.: Efficient Capital Markets: A Review of Theory and Empirical Work. J Financ.
25(2), 383-417 (1970)
2. Haugen, R.A.: Modern Investment Theory. 2nd ed. Prentice Hall, Englewood Cliffs, New
Jersey (1990)
3. Hellstrom, T., Holmstrom, K.: Predicting the Stock Market. Technical Report Series IMa-
TOM-1997-07. Mälardalen University, Sweden (1998)
4. Elton, E.J., Gruber, M.J., Brown, S.J., Goetzmann, W.N.: Modern Portfolio Theory and Investment Analysis. John Wiley & Sons, Inc., New York (2003)
5. Wüthrich, B., Peramunetilleke, D., Leung, S., Cho, V., Zhang, J., Lam, W.: Daily Predic-
tion of Major Stock Indices from Textual WWW Data. In: Agrawal, R., Stolorz, P.,
Piatetsky, G. (eds.) Fourth International Conference on Knowledge Discovery and Data
Mining, 2720-2725. AAAI Press, New York (1998)
6. Lavrenko, V., Schmill, M., Lawrie, D., Ogilvie, P., Jensen, D., Allan, J.: Language Models
for Financial News Recommendation. In: Agah, A., Callan, J., Rundensteiner, E. (eds.)
Ninth International Conference on Information and Knowledge Management, 389-396.
ACM Press (2000)
7. Peramunetilleke, D., Wong, R.K.: Currency Exchange Rate Forecasting from News Head-
lines. Aust. Comp. S. 24(2), 131-139 (2002)
8. Antweiler, W., Frank, M.Z.: Is All that Talk Just Noise? The Information Content of Inter-
net Stock Message Boards. J Financ. 59(3), 1259-1294 (2004)
9. Lam, M.: Neural Network Techniques for Financial Performance Prediction: Integrating
Fundamental and Technical Analysis. Decis. Support. Syst. 37(4), 567-581 (2004)
10. Fung, G.P.C., Yu, J.X., Lu, H.: The Predicting Power of Textual Information on Financial
Markets. IEEE Intell. Inform. Bull. 5(1), 1-10 (2005)
11. Tetlock, P.C., Saar-Tsechansky, M., Macskassy, S.: More than Words: Quantifying Lan-
guage to Measure Firms’ Fundamentals. J Financ. 63(3), 1437-1467 (2008)
12. Loughran, T., McDonald, B.: When is a Liability not a Liability? Textual Analysis, Dic-
tionaries, and 10-Ks. J Financ. 66(1), 35-65 (2011)
13. SEC, Final Rule: Additional Form 8-K Disclosure Requirements and Acceleration of Fil-
ing Date, http://www.sec.gov/rules/final/33-8400.htm (2004)
14. Slattery, D.M., Sutcliffe, R.F.E., Walsh, E.J.: Automatic Analysis of Corporate Financial
Disclosures. In: Gillam, L. (ed.) Making Money in the Financial Services Industry, a
Workshop at the Terminology and Knowledge Engineering Conference. Nancy, France
(2002)
15. van Bunningen, A.H.: Augmented Trading: From News Stories to Stock Price Predictions
Using Syntactic Analysis. Unpublished thesis (M.Sc.), University of Twente (2004)
16. Mittermayer, M.-A., Knolmayer, G.F.: Text Mining Systems for Market Response to
News: A Survey. Working Paper No 184, Institute of Information Systems, University of
Bern (2006)
17. Andersen, T.G., Bollerslev, T.: Deutsche Mark-Dollar Volatility: Intraday Activity Pat-
terns, Macroeconomic Announcements, and Longer Run Dependencies. J Financ. 53(1),
219-265 (1998)
18. MacKinlay, A.C.: Event Studies in Economics and Finance. J. Econ. Lit. 35(1), 13-39
(1997)
19. Griffin, P.A.: Got Information: Investor Response to Form 10-K and Form 10-Q EDGAR
Filings. Rev. Acc. Stud. 8(4), 433-460 (2003)
20. Carter, M.E., Soo, B.S.: The Relevance of Form 8-K Reports. J. Account. Res. 37(1), 119-
132 (1999)
21. Antweiler, W., Frank, M.Z.: Do U.S. Stock Markets Typically Overreact to Corporate
News Stories? Working Paper, University of British Columbia (2006)
22. Tetlock, P.C.: Giving Content to Investor Sentiment: The Role of Media in the Stock Mar-
ket. J Financ. 62(3), 1139-1168 (2007)
23. Asthana, S., Balsam, S.: The Effect of EDGAR on the Market Reaction to 10-K Filings. J.
Account. Public Pol. 20(4-5), 349-372 (2001)
24. Whisenant, J.S., Sankaraguruswamy, S., Raghunandan, K.: Market Reactions to Disclo-
sure of Reportable Events. Aud.: J. Prac. Theory. 22(1), 181-194 (2003)
25. Ball, R., Kothari, S.P.: Security Returns Around Earnings Announcements. Account. Rev.
66(4), 718-738 (1991)
26. Francis, J., Schipper, K., Vincent, L.: Earnings Announcement and Competing Infor-
mation. J. Account. Econ. 33(3), 313-342 (2002)
27. Henry, E.: Are Investors Influenced by how Earnings Press Releases are Written? J. Bus.
Commun. 45(4), 363-407 (2008)
28. Cho, V., Wüthrich, B., Zhang, J.: Text Processing for Classification. J. Comput. Intell. Fin.
7(2), 6-22 (1999)
29. Seo, Y.-W., Giampapa, J., Sycara, K.: Financial News Analysis for Intelligent Portfolio
Management. Technical Report CMU-RI-TR-04-04, Robotics Institute, Carnegie Mellon
University (2004)
30. Schumaker, R.P., Chen, H.: Textual Analysis of Stock Market Prediction Using Financial
News Articles. In: Rodriguez-Abitia, G., Ania, B., Ignacio (eds.) 12th Americas Confer-
ence on Information Systems, Acapulco, Mexico (2006)
31. Fama, E.F.: Market Efficiency, Long-Term Returns, and Behavioral Finance. J. Financ.
Econ. 49(3), 283-306 (1998)
32. Hildebrandt, H.W., Snyder, R.D.: The Pollyanna Hypothesis in Business Writing: Initial
Results, Suggestions for Research. J. Bus. Commun. 18(1), 5-15 (1981)
33. Thomas, J.: Disclosure in the Marketplace: The Making of Meaning in Annual Reports. J.
Bus. Commun. 34(1), 47-66 (1997)
34. Gillam, L., Ahmad, K., Ahmad, S., Casey, M., Cheng, D., Taskaya Temizel, T., de
Oliviera, P.C.F., Manomaisupat, P.: Economic News and Stock Market Correlation: A
Study of the UK Stock Market. In: Gillam, L. (ed.) Making Money in the Financial Ser-
vices Industry, a Workshop at the Terminology and Knowledge Engineering Conference.
Nancy, France (2002)
35. Kohut, G.F., Segars, A.H.: The President’s Letter to Stockholders: An Examination of
Corporate Communication Strategy. J. Bus. Commun. 29(1), 7-21 (1992)
36. Li, F.: Annual Report Readability, Current Earnings, and Earnings Persistence. J. Account.
Econ. 45(2-3), 221-247 (2008)
37. Joachims, T.: Estimating the Generalization Performance of a SVM Efficiently. In: Lang-
ley, P. (ed.) 17th International Conference on Machine Learning, pp. 431-438. Morgan
Kaufmann, San Francisco, CA (2000)
38. Lerman, A., Livnat, J.: The New Form 8-K Disclosures. Rev. Acc. Stud. 15(4), 752-778
(2009)