
Project Acronym: FIRST

Project Title: Large scale information extraction and integration infrastructure for supporting financial decision making

Project Number: 257928

Instrument: STREP

Thematic Priority: ICT-2009-4.3 Information and Communication Technology

D2.1 Technical requirements and state-of-the-art

Work Package: WP2 - Technical analysis, scaling strategy, and architecture

Due Date: 31/03/2011

Submission Date: 31/03/2011

Start Date of Project: 01/10/2010

Duration of Project: 36 Months

Organisation Responsible for Deliverable: JSI

Version: 1.0

Status: Submitted

Author Name(s): Miha Grcar (ed.), Marko Brakus, Marko Bohanec, Martin Znidarsic, Janez Kranjc, Elena Ikonomovska, Borut Sluban, Vid Podpečan, Matjaž Juršič, Igor Mozetič, Nada Lavrač (JSI); Mateusz Radzimski, Tomás Pariente Lobo (ATOS); Markus Gsell (IDMS); Wiltrud Kessler, Dominic Ressel, Achim Klein (UHOH); Mykhailo Saienko, Michael Siering (GUF)

Reviewer(s): Michael Diefenthäler (IDMS), Paolo Miozzo (MPS)

Nature: R – Report

Dissemination Level: PU - Public

Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)


Revision history

0.1, 21/11/2010, Miha Grčar (JSI): First version of the TOC provided.
0.2, 03/12/2010, Mateusz Radzimski (ATOS): TOC refinement.
0.3, 05/12/2010, Miha Grčar (JSI): Revised the TOC and "Purpose of this document" according to the latest discussions and comments.
0.4, 24/12/2010, Miha Grčar (JSI): Revised according to comments from Wiltrud, Achim, Mateusz, and Markus.
0.5, 11/01/2011, Miha Grčar (JSI): Incorporated Mateusz's last comments. TOC now finalised.
0.6, 28/01/2011, Wiltrud Kessler, Dominic Ressel, Achim Klein (UHOH), Mateusz Radzimski (ATOS): Added Sections 2.3, 4.2, and 2.2.1. Initial version of the process overview.
0.7, 30/01/2011, Miha Grčar (JSI): Changes to the TOC (merged "Problem analysis" and "State of the art") and the report template (Annex 1, 2, 3...; Appendix A, B, C...). Included preliminary contents provided by the partners.
0.8, 02/02/2011, Miha Grčar (JSI): Added more content (Elena I. on stream data mining, Miha G. on visualisation). This version is the official preliminary draft of D2.1.
0.9, 02/02/2011, Miha Grčar (JSI): Added even more content (Section 3.5 by Markus and Section 2.1.6 by Borut).
0.9.1, 07/02/2011, Miha Grčar (JSI): First round of editing. Updated Sections 3.5, 2.1, 3.3, and 3.5.2.
0.9.2, 11/02/2011, Igor Mozetič (JSI): Shortened Elena's contribution.
0.9.3, 23/02/2011, Miha Grčar (JSI): First thorough revision of the JSI contributions. Inserted Janez's contributions (Sections 3.3 and 3.4). Added introductions, etc. Updated the UHOH contributions.
0.9.4, 28/02/2011, Mateusz Radzimski (ATOS), Miha Grčar (JSI): Revised the Requirements analysis section.
0.9.5, 02/03/2011, Miha Grčar (JSI): Accepted all changes, revised the Introduction, and sent the document into internal review.
0.9.6, 03/03/2011, Igor Mozetič (JSI): Revision.
0.9.7, 06/03/2011, Miha Grčar (JSI), Mykhailo Saienko (GUF), Michael Siering (GUF): Revision according to Markus's comments. Revised Section 3.5.1 (GUF).
0.9.8, 20/03/2011, Miha Grčar (JSI): Fixed references. Added Achim's input to the technical requirements.
0.9.9, 21/03/2011, Miha Grčar (JSI): Executive summary added. Responded to the reviewers' comments.
0.9.9.1, 23/03/2011, Miha Grčar (JSI), Tomás Pariente Lobo (ATOS): Filled out the Abbreviations and acronyms table. Added Conclusions. Proofing and polishing.
1.0, 31/03/2011, Miha Grčar (JSI), Achim Klein (UHOH), Tomás Pariente Lobo (ATOS): Final review, proofing, and quality control. Ready for submission.


Copyright © 2011, FIRST Consortium

The FIRST Consortium (www.project-first.eu) grants third parties the right to use and distribute all or parts of this document, provided that the FIRST project and the document are properly referenced.

THIS DOCUMENT IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS DOCUMENT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.



Executive summary

The goal of WP2 is to provide the foundation for a comprehensive, coherent, and integrated conceptual and technical architecture for the Integrated Financial Market Information System. The requirements captured in the first two WP2 deliverables, D2.1 and D2.2, cover the technical and the conceptual point of view, respectively.

The purpose of this document (D2.1) is, on one hand, to present the state of the art related to the envisioned software components and, on the other, to present technical requirements posed by the envisioned software components towards the FIRST infrastructure. Note that in this document, we do not focus on the requirements of the use cases as this is covered by D1.1 and D1.2.

The content is divided into three main parts. The first part discusses the state of the art related to the technologies employed in the envisioned data preprocessing and analysis pipeline. In this context, the data acquisition pipeline is presented first, followed by a discussion of semantic resources and evolving ontologies. In FIRST, ontologies will be fit for the purpose of entity recognition and sentiment analysis. Sentiment analysis has proven to be the main common denominator of all three use cases studied in FIRST, and this topic is therefore discussed rather thoroughly in the report. The report also discusses data and information integration as well as decision support systems. In addition, two important general-purpose techniques for increasing the throughput of the stream-processing components, pipelining and parallelisation, are discussed; they form the basis for the development of the FIRST scaling strategy (D2.3). The main purpose of the first part of the report is to give the reader (including the FIRST industrial partners) an understanding of the know-how possessed by the technology providers in the project and of the current state of the art (i.e., capabilities and limitations) of the corresponding technologies.

The second part of the report is about the technical requirements posed by the envisioned software components towards the FIRST infrastructure. We define requirements such as hardware and software infrastructure requirements, data storage requirements, scaling requirements, and runtime environment requirements. The purpose of this part is to collect several "bottom-up" requirements in order to provide the basis for the development of the FIRST software and hardware infrastructure, which will be further defined in the follow-up deliverable D2.2.

Last but not least, the report presents two concrete examples of the preliminary work done in FIRST: ontology-based sentiment analysis and the visualisation of document streams. The main purpose of this part is to give the reader a more complete picture of what the FIRST stream-processing pipelines, spanning all the technical work packages, might look like in the end.


Table of contents

Executive summary
Abbreviations and acronyms
1 Introduction
2 State of the art
2.1 Data acquisition pipeline
2.1.1 Data sources
2.1.2 HTML preprocessing
2.1.3 Boilerplate removal
2.1.4 Language detection
2.1.5 Detecting near-duplicates in document streams
2.1.6 Spam and opinion spam detection
2.2 Ontology learning and existing semantic resources
2.2.1 Ontology learning
2.2.2 Existing relevant semantic resources
2.3 Sentiment classification and semantic feature extraction
2.3.1 Problem analysis for sentiment classification
2.3.2 Assessment criteria
2.3.3 State of the art in sentiment classification
2.3.4 State of the art in semantic feature extraction
2.4 Approaches to information integration
2.4.1 Physical integration
2.4.2 Virtual integration
2.4.3 Hybrid approaches
2.5 Decision support systems
2.5.1 Machine learning models
2.5.2 Stream-based data mining
2.5.3 Qualitative multi-attribute models
2.5.4 Visualisation
2.6 General-purpose scaling techniques
2.6.1 Processing pipelines
2.6.2 Parallelisation
3 Requirements study
3.1 Process overview
3.2 Technical requirements
4 Preliminary work
4.1 Sentiment analysis
4.2 Document stream visualisation
4.2.1 Document corpora visualisation pipeline
4.2.2 Visualisation of document streams
4.2.3 Preliminary implementation
5 Conclusions
References
Annex 1. Preliminary empirical evaluation of the implemented language detection technique
Annex a. Experiments
Annex b. Conclusions
Annex 2. Preliminary empirical evaluation of the implemented boilerplate removal technique
Annex a. Feature selection
Annex b. Datasets
Annex c. Implemented method
Annex d. Experiments and results
Annex e. Conclusions
Annex 3. Manually created resources for sentiment words
Annex 4. Sentiment classification – problem definition
Annex a. Sentiment
Annex b. Subproblems
Annex i. Sentiment retrieval
Annex ii. Subjectivity classification
Annex iii. Topic relevance and topic shift
Annex iv. Sentiment holder extraction
Annex 5. Document stream visualization pipeline implementation and testing
Annex 6. Data storage requirements

Index of Figures

Figure 1: Architecture of the FIRST analytical pipeline.
Figure 2: The data acquisition pipeline.
Figure 3: The high-level architecture of a typical general-purpose Web crawler.
Figure 4: A "difficult" HTML and the corresponding correct interpretation by a browser.
Figure 5: Coverage of topics by the state of the art
Figure 6: Exemplary decision tree.
Figure 7: Exemplary neural network.
Figure 8: Exemplary Support Vector Machine (SVM).
Figure 9: General structure of a hierarchical multi-attribute model.
Figure 10: Screenshots of DEXi, a computer program for qualitative multi-attribute decision modelling.
Figure 11: Topic space of company descriptions provided by Yahoo! Finance.
Figure 12: Canyon flow temporal visualisation of scientific publications through time.
Figure 13: An illustration of an example pipeline with three stages.
Figure 14: A Gantt chart of the burst of five units in the pipeline.
Figure 15: An example pipeline with multiple processing units forming the same stage.
Figure 16: A Gantt chart of a pipeline with two B processing units and the arrival rate of 1 data unit per second.
Figure 17: An example pipeline where a single unit may be processed in a parallel fashion.
Figure 18: A Gantt chart of the runtime of the pipeline shown in Figure 17.
Figure 19: High-level FIRST process overview.
Figure 20: Document space visualisation pipeline.
Figure 21: Document stream visualisation pipeline.
Figure 22: Language detection accuracy of different n-grams for different lengths of test texts. Bigrams are the most precise.
Figure 23: Language detection accuracy of different n-grams for different cutoffs.
Figure 24: Features ordered by the information gain for the 2-class (boilerplate vs. full content) problem.
Figure 25: Features ordered by the information gain for the 6-class (all text classes) problem.
Figure 26: First few levels of a boilerplate removal decision tree.
Figure 27: Time spent in separate stages of the pipeline when streaming the news into the system in chronological order.
Figure 28: Time spent in separate stages of the pipeline when streaming the news into the system in random order.
Figure 29: The delay between packets exiting the pipeline.

Index of Tables

Table 1: Overview of machine learning methods.
Table 2: List of technical requirements.
Table 3: Language detection accuracy [%] for long and short test texts.
Table 4: Code page detection accuracy [%] for long and short test texts.
Table 5: Language detection accuracy [%] for three different similarity measures on long test texts (1000 letters) without cutoff.
Table 6: Classification accuracies of the C4.5 for the 2-class (boilerplate vs. full content) problem.
Table 7: Classification accuracies of the C4.5 for the 6-class (all text classes) problem.
Table 8: Misclassifications for the 6-class problem.


Abbreviations and acronyms

AIS-BN Adaptive Importance Sampling algorithm for evidential reasoning in large Bayesian Networks, an approach to training Bayesian networks in machine learning

ANNIE A Nearly-New Information Extraction System, an information extraction (IE) system distributed with GATE; ANNIE relies on finite state algorithms and the JAPE language

API Application programming interface, a particular set of rules and specifications that a software program can follow to access and make use of the services and resources provided by another particular software program

ATOS Atos Research and Innovation, a project partner

C4.5 An algorithm for decision tree learning

CART Classification And Regression Tree, a category of decision trees in data mining

COIN Context integration framework, a framework for facilitating a continuous information integration process

CSS Cascading Style Sheets, a style sheet language used to describe the presentation semantics (the look and formatting) of a document written in a markup language (e.g., HTML)

CSV Comma-separated values, a set of file formats used to store tabular data in which numbers and text are stored in plain textual form that can be read in a text editor

DAG Directed acyclic graph, a directed graph with no directed cycles, used to model several different kinds of structure in mathematics and computer science

DEX, DEXi A methodology (DEX) and the corresponding computer program (DEXi) for qualitative multi-attribute decision modelling

DOM Document Object Model, a cross-platform and language-independent convention for representing and interacting with objects in HTML, XHTML, and XML documents

DS Decision support

DSS Decision support systems

Dx.y Deliverable x.y; unless explicitly stated otherwise, in this document, Dx.y refers to a project report delivered or to be delivered by FIRST

EH Exponential histogram, a data structure used to improve the data-stream clustering algorithm devised by Guha et al. (2001)

EMM Europe Media Monitor, a number of news aggregation and analysis systems to support EU institutions and Member State organisations

ETL Extract, Transform, Load, a process to populate a physically integrated and consolidated database

EU European Union

FIFO First-In First-Out, an abstraction in ways of organizing and manipulating data relative to time and prioritization


FIRST Large scale information extraction and integration infrastructure for supporting financial decision making

FP Framework Programme

GATE General Architecture for Text Engineering, a Java suite of tools for all sorts of natural language processing tasks, including information extraction in many languages (originally developed at the University of Sheffield)

GB Gigabytes

HDD Hard disk drive

HTML HyperText Markup Language, the predominant markup language for Web pages

HTTP Hypertext Transfer Protocol, the foundation of data communication for the World Wide Web

ICT Information and Communication Technologies

IDMS Interactive Data Managed Solutions, a project partner

IE Information extraction

IR Information retrieval

ISO International Organization for Standardization, an international standard-setting body composed of representatives from various national standards organizations

IST Information Society Technologies

JAPE Java Annotation Patterns Engine, a component of the GATE platform

JDPA J. D. Power and Associates; in this report, we refer to the JDPA sentiment corpus for the automotive domain

JSI Jozef Stefan Institute, a project partner

KL (divergence) Kullback-Leibler divergence, a non-symmetric measure of the difference between two probability distributions

LATINO Link analysis and text mining toolbox, an open-source software library containing various data mining algorithms and models

LSQR (solver) Least-squares (solver), a standard approach to the approximate solution of over-determined systems, i.e., sets of equations in which there are more equations than unknowns; in this report, we mainly refer to a particular implementation by Paige and Saunders (1982).

MAIDS A comprehensive data stream mining system by Dong et al. (2003)

MDS Multi-Dimensional Scaling, a method for dimensionality reduction and feature vector projection in machine learning

MPS Monte dei Paschi di Siena, a project partner

.NET A software framework for Microsoft Windows operating systems; it includes a large software library, and supports several programming languages which allows language interoperability

NEXT b-next, a project partner

NLP Natural-language processing


OBIE Ontology-based information extraction

ODS Operational data stores, central repositories continuously fed with updated data

OL Ontology learning

PDF Portable Document Format, an open standard for document exchange

POS (tagger) Part-of-speech (tagger)

QMAM Qualitative multi-attribute model/modelling

RAM Random-access memory

RDF Resource Description Framework, a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model

REST Representational State Transfer, a style of software architecture for distributed hypermedia systems such as the World Wide Web

ROC (curve) Receiver Operating Characteristic (curve), a graphical plot of the sensitivity, or true positive rate, vs. false positive rate, for a binary classification problem; can be used as an evaluation metric in item ranking scenarios

RSS Really Simple Syndication, a family of Web feed formats used to publish frequently updated works—such as blog entries, news headlines, audio, and video—in a standardized format

SIRUP Semantic Integration Reflecting User-specific semantic Perspectives, a framework that focuses on user-specific information integration

SMAC SIGMEA MAize Coexistence, a decision-support tool for the assessment of coexistence between genetically modified and conventional maize developed in the European project SIGMEA (FP6-SSP1-2002-502981)

SOA Service-Oriented Architecture, a flexible set of design principles used during the phases of systems development and integration in computing

SOAP Simple Object Access Protocol, a protocol specification for exchanging structured information in the implementation of Web Services in computer networks

SPARQL SPARQL Protocol and RDF Query Language, an RDF query language, considered a key Semantic Web technology; an official W3C Recommendation as of January 2008

SQL Structured Query Language, a database computer language designed for managing data in relational databases

SSD Solid-state disk

SSE Stuttgart Stock Exchange, a project partner

STREP Specific Targeted Research Project, a medium-sized research project funded by the European Commission in the FP7 funding program

SVM Support vector machines, a set of related supervised learning methods that analyse data and recognize patterns, used for classification and regression analysis in machine learning

TB Terabytes

TF-IDF Term-Frequency, Inverse Document Frequency, a weighting scheme often used in information retrieval and text mining; a statistical measure used to evaluate how important a word is to a document in a collection of documents

UCS Universal Character Set, a standard set of characters upon which many character encodings are based

UHOH University of Hohenheim, a project partner

URL Uniform Resource Locator, specifies where the corresponding resource (e.g., an HTML document) is available and the mechanism for retrieving it

UTF-8 UCS Transformation Format (8-bit), a multi-byte character encoding for Unicode

VFDT Very Fast Decision Trees, a decision tree algorithm based on VFML

VFKM Very Fast K-Means, a k-means clustering algorithm based on VFML

VFML Very Fast Machine Learning, a general-purpose method for scaling up machine learning algorithms

VIPS Vision-based Page Segmentation, one of the approaches to boilerplate removal

WCF Windows Communication Foundation, an application programming interface (API) in the .NET Framework for building connected, service-oriented applications

WWW World Wide Web

XHTML eXtensible HyperText Markup Language, a family of XML markup languages that mirror or extend versions of the widely-used Hypertext Markup Language (HTML), the language in which Web pages are written

XML Extensible Markup Language, a set of rules for encoding documents in machine-readable form; several higher-level formats such as RSS, XMPP, and XHTML (but not HTML) are based on XML

XMPP Extensible Messaging and Presence Protocol, an open-standard communications protocol for message-oriented middleware based on XML


1 Introduction

The goal of WP2 is to provide the foundation for a comprehensive, coherent, and integrated conceptual and technical architecture for the Integrated Financial Market Information System. The architecture design will consider the conceptual and technical requirements collected in this report, addressing the user and use case requirements collected in WP1. Furthermore, the scaling strategy that will "drive" the development efforts in the project will be specified within WP2.

Note that in this document, we do not focus on the requirements of the use cases as this will be done in WP1. However, certain use-case aspects already need to be considered as they influence the choice and anticipated characteristics of individual components. Since the use cases are not yet fully developed at this stage, we discuss two more generic use cases in this report: the ontology-based sentiment analysis scenario and the visualisation of document streams.

The content is divided into three main sections. Section 2 discusses the state of the art related to the technologies employed in the envisioned data preprocessing and analysis pipeline. In this context, the data acquisition pipeline is presented first. The purpose of the data acquisition pipeline is to deliver Web documents (i.e., HTML pages) in a form suitable for further text analysis, together with the news provided through proprietary APIs. Another aspect covered in this section is concerned with semantic resources and evolving ontologies. In FIRST, ontologies will be "connected to" data streams constantly flowing into the system, and will have to adapt to the current events evident from the streams. The ontologies will be fit for the purpose of entity recognition (i.e., entities such as companies, securities, and financial indices will be recognised) and sentiment analysis. Sentiment analysis has proven to be the main common denominator of all three use cases studied in FIRST, and this topic is therefore discussed rather thoroughly in the report. The report also discusses data and information integration and decision support systems. The latter will be employed for solving the tasks put forward by the use cases, by employing, among other features, the sentiment index computed in the sentiment analysis process. Several approaches to decision support modelling are considered: machine learning models, qualitative multi-attribute models, and visualisation techniques. In addition, two important general-purpose techniques for increasing the throughput of the stream-processing components, pipelining and parallelisation, are discussed; they form the basis for the development of the FIRST scaling strategy (D2.3). The use of these two techniques can be perceived as another technical requirement to be taken into account by the infrastructure. The main purpose of Section 2 is to give the reader an understanding of the know-how possessed by the technology providers in the project and of the current state of the art (i.e., capabilities and limitations) of the corresponding technologies.

Section 3 is about the technical requirements posed by the envisioned software components towards the FIRST infrastructure. We define requirements such as hardware and software infrastructure requirements, data storage requirements, scaling requirements, and runtime environment requirements. In general, the requirements captured in the first two WP2 deliverables, D2.1 and D2.2, cover the technical and the conceptual point of view. The technical aspect (i.e., bottom-up), presented in this deliverable, studies individual components, their dependencies, and their roles. Because most of the developed components will be based on existing software libraries and frameworks rather than being implemented from scratch, the technologies used will pose additional constraints on the integration infrastructure. On the other hand, the conceptual view (i.e., top-down) is concerned with the properties of the integrated software solution, including its configurability, extensibility, and deployability in real-life settings. The conceptual aspects will mostly be covered in D2.2 and will complement the requirements presented in this document. The purpose of this part is to collect several "bottom-up" requirements (some of which are already implicitly evident from other sections in this report) in order to provide the basis for the development of the FIRST software and hardware infrastructure, which will be further defined in the follow-up deliverable D2.2.


Last but not least, Section 4 presents two concrete examples of the preliminary work done in FIRST: ontology-based sentiment analysis and the visualisation of document streams. The main purpose of this part is to give the reader a more complete picture of what the FIRST stream-processing pipelines, spanning all the technical work packages, might look like in the end.


2 State of the art

This section, in its broadest sense, presents the fields of computer science that provide the methods required for the implementation of the FIRST analytical pipeline. The pipeline is shown in Figure 1, taken out of the context of the FIRST information system architecture (end-user interfaces and customised solutions are not shown in the figure).

Figure 1: Architecture of the FIRST analytical pipeline.

The FIRST analytical pipeline starts with the data acquisition components developed in WP3. Section 2.1 presents the data acquisition part of the pipeline. The data acquisition pipeline consists of several technologies that interoperate to achieve the desired goal. These technologies are discussed in greater detail (including the related problem statement, the state of the art, and, in some cases, even a preliminary implementation) in the respective subsections and referenced appendices.

The acquired data is fed, on one hand, into the ontology evolution components developed in WP3 (see Section 2.2), and on the other, into the information extraction components developed in WP4 (see Section 2.3 and the corresponding appendices). The ontology evolution components are responsible for constantly updating the ontology with respect to the incoming data stream. Information extraction components employ the ontology mainly for the purpose of sentiment extraction. Even though other relevant semantic features will potentially be extracted in WP4, sentiment analysis seems to be the most important task according to the preliminary analysis of the use cases devised in WP1 (hence the focus of Section 2.3 is on sentiment analysis). A preliminary ontology-supported sentiment extraction experiment is briefly discussed further on in Section 4.1.

Decision support (DS) components developed in WP6 conclude the FIRST analytical pipeline (see Section 2.5). The DS components will be implemented by resorting to (1) machine learning and stream-based data mining techniques ("ML/DM"), (2) qualitative multi-attribute modelling ("QMAM"), and (3) visualisation techniques ("Visual."). These areas of computer science and the respective state of the art are discussed in detail in the corresponding subsections. A preliminary implementation of a document stream visualisation pipeline is discussed in Section 4.2 and Annex 5.

In addition, Section 2.4 presents the state of the technology concerning persistent repositories (such as knowledge bases and databases) and provides the basis for the developments in WP5. In FIRST, a persistent repository that holds historical data and extracted knowledge is important from two perspectives. On one hand, historical data will be used for conducting scientific experiments and evaluation, and on the other, it will allow the end-users to review past events.

Last but not least, in Section 2.6, two general-purpose scaling techniques, pipelining and parallelisation, are discussed. This section represents the preliminary basis for the FIRST scaling strategy that will be devised in the scope of WP2 in the upcoming months.

2.1 Data acquisition pipeline

The FIRST data acquisition pipeline, to be developed in WP3, will be responsible for acquiring unstructured data from several data sources, preparing it for the analysis, and brokering it to the appropriate analytical components (e.g., information extraction components developed in WP4). The data acquisition pipeline will be running continuously, polling the Web and proprietary APIs for new content, turning it into a stream of preprocessed text documents.

When dealing with official news streams—such as those provided to the consortium by IDMS—a lot of preprocessing steps can be avoided. Official news are provided in a semi-structured fashion such that titles, publication dates, and other metadata are clearly indicated. Furthermore, named entities (i.e., company names and stock symbols) are identified in texts, and article bodies are provided in a raw textual format without any boilerplate (i.e., undesired content such as advertisements, copyright notices, navigation elements, and recommendations).

Figure 2: The data acquisition pipeline.

Content from blogs, forums, and other Web sources, however, is not immediately ready to be processed by the text analysis methods. Web pages contain a lot of "noise" that needs to be identified and removed before the content can be analysed. In this section, we present the envisioned stream-data acquisition pipeline in which several technologies interoperate in order to achieve the desired goal.

The envisioned pipeline is shown in Figure 2. First, an HTML page is preprocessed (Section 2.1.2) and stripped of boilerplate (Section 2.1.3). After that, a language detector is employed to "route" pages to the appropriate language-dependent text processing components1 (Section 2.1.4). Next, near-duplicates are identified and clearly marked (Section 2.1.5). Finally, a component checks whether a document is (opinion) spam spreading untruthful information (Section 2.1.6). In the following subsections, we discuss these technologies and the related state of the art in greater detail.

1 In FIRST, we will only analyse English content. Non-English pages will be filtered out.
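As an illustration of how such a chain of stages can be composed, the following sketch wires up placeholder stage functions over a stream (iterator) of documents. The stage bodies and the document fields are invented for the example and do not correspond to the actual FIRST components.

# Sketch of the acquisition pipeline as a chain of generator stages.
# Every stage consumes and yields documents; filtering stages simply do not
# yield the documents they drop. All stage bodies are placeholders.

def preprocess_html(docs):
    for doc in docs:
        doc["html"] = doc["raw_html"]            # HTML normalisation would go here
        yield doc

def remove_boilerplate(docs):
    for doc in docs:
        doc["text"] = doc["html"]                # main-content extraction placeholder
        yield doc

def filter_language(docs, keep="en"):
    for doc in docs:
        doc["lang"] = "en"                       # language detection placeholder
        if doc["lang"] == keep:                  # non-English pages are dropped
            yield doc

def mark_near_duplicates(docs):
    seen = set()                                 # exact matching stands in for
    for doc in docs:                             # real near-duplicate detection
        doc["near_duplicate"] = doc["text"] in seen
        seen.add(doc["text"])
        yield doc

def filter_spam(docs):
    spam_markers = ("click here to win", "100% guaranteed returns")  # toy blacklist
    for doc in docs:
        if not any(marker in doc["text"].lower() for marker in spam_markers):
            yield doc

def acquisition_pipeline(docs):
    stream = preprocess_html(docs)
    stream = remove_boilerplate(stream)
    stream = filter_language(stream)
    stream = mark_near_duplicates(stream)
    stream = filter_spam(stream)
    return stream

documents = [{"raw_html": "<p>Quarterly results beat expectations.</p>"}]
for doc in acquisition_pipeline(documents):
    print(doc["text"], doc["near_duplicate"])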



2.1.1 Data sources

The data for Web mining (or Web-based stream mining) obviously comes from the Web. It is normally acquired through HTTP1 and HTTP-based protocols such as Really Simple Syndication (RSS) and proprietary Application Programming Interfaces (APIs) implemented as Web services. In the following paragraphs, we briefly look into some of these protocols, their main characteristics, and their role in FIRST.

2.1.1.1 RSS feeds

RSS stands for Really Simple Syndication and enables the user to acquire content updates from blog sites, forums, news sites, and news aggregators. Software applications called RSS readers are used to retrieve content updates from selected sites. RSS readers periodically "poll" these sites to retrieve the most recent RSS documents. An RSS document is essentially an XML document containing titles and short descriptions (summaries) of a certain number of the most recent posts. Each RSS item also provides a link (i.e., a URL) to a Web page containing the corresponding full content.

From the perspective of automatic document stream analysis, RSS suffers from two relevant drawbacks. Firstly, despite the fact that RSS feeds conceptually provide streams of documents, the RSS protocol is a "pull" rather than a "push" protocol. This means that the client is not notified if/when a certain RSS document changes. Instead, the client "blindly" requests the RSS document at regular time intervals to see if there are updates available (this approach is called "polling"). Secondly and more importantly, in RSS documents, full contents are not provided in a "clean" textual form the way titles and summaries are. Instead, each RSS item merely provides a link to a Web page containing the full content. The full content is thus normally in the HTML format which is not immediately ready to be processed by the text analysis methods. Preprocessing steps, such as boilerplate removal (see Section 2.1.3), are required to prepare HTML documents for automatic text analysis.

Despite its shortcomings, RSS will be extensively used and will most probably be the main mechanism for Web content acquisition. This is because RSS has several advantages over crawling, which is often used for Web data acquisition (see Section 2.1.1.2 on crawling). The first clear advantage is the ability to identify new, so far unseen content by examining the metadata in an RSS document. A new content item can be identified by its title, description, and publication date, all of which are normally provided as metadata in the corresponding RSS XML.

The RSS feed component will regularly poll numerous Web sites for RSS documents and collect the referenced Web pages. The collected content will be turned into an actual stream of (HTML) documents which will be consumed by the subsequent components in the pipeline (i.e., preprocessing and analytical components).
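As an illustration of the polling approach described above, the following sketch periodically fetches an RSS 2.0 feed using the Python standard library and uses the item metadata (guid, title, publication date) to detect previously unseen items. The feed URL and the polling interval are arbitrary example values, not FIRST configuration settings, and the sketch is not the actual RSS feed component.

# Minimal RSS polling sketch (standard library only); RSS 2.0 layout assumed:
# <rss><channel><item><title/><link/><pubDate/><guid/></item>...</channel></rss>
import time
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://example.com/news/rss"   # placeholder feed URL
POLL_INTERVAL_SECONDS = 300                # example polling period

def fetch_items(feed_url):
    with urllib.request.urlopen(feed_url) as response:
        root = ET.fromstring(response.read())
    for item in root.iter("item"):
        guid = item.findtext("guid") or item.findtext("link")
        yield guid, {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        }

def poll_forever():
    seen = set()                            # identifiers of already processed items
    while True:
        for guid, meta in fetch_items(FEED_URL):
            if guid not in seen:
                seen.add(guid)
                print("new item:", meta["title"], meta["link"])
                # here the referenced HTML page would be downloaded and
                # pushed into the preprocessing pipeline
        time.sleep(POLL_INTERVAL_SECONDS)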

In the context of the reputational-risk assessment use case (WP1), a list of slightly over 100 Web sites of interest was provided. Most of the listed Web sites provide one or more RSS feeds through which the data will be acquired.

1 HTTP stands for Hyper-Text Transfer Protocol and implies transferring content written in Hyper-Text Markup Language (HTML) or, to put it simply, transferring Web pages from a server to a client. However, HTTP is used to transfer any kind of files, especially multimedia content and data encoded in non-HTML formats such as Extensible Markup Language (XML) and Resource Description Framework (RDF).


2.1.1.2 Crawling

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner1.

A Web crawler receives a set of seed URLs and starts downloading Web pages according to its page selection policy (e.g., a breadth-first selection policy). From each downloaded Web page, the links are extracted and sent into the URL queue. The HTML content of the page, on the other hand, is usually stored into a file or a database. The crawler needs to resolve several technical problems such as link filtering and normalisation, loop detection, and parallelisation. It also needs to download pages politely, meaning that it should occupy only a small portion of the server's bandwidth. The high-level architecture of a typical general-purpose Web crawler is given in Figure 3.

Figure 3: The high-level architecture of a typical general-purpose Web crawler.
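A minimal sketch of the crawling loop described above is given below (standard library only). The breadth-first FIFO frontier, the visited set, and a fixed politeness delay stand in for the page selection and politeness policies of a real crawler; the robots.txt handling, per-host queues, and loop detection mentioned in the text are deliberately omitted.

# Minimal breadth-first crawler sketch: seed URLs, a FIFO frontier, link
# extraction, and a crude politeness delay. Not a production crawler.
import re
import time
import urllib.request
from collections import deque
from urllib.parse import urljoin, urldefrag

HREF_RE = re.compile(r'href\s*=\s*["\']([^"\'#]+)', re.IGNORECASE)

def crawl(seed_urls, max_pages=100, delay_seconds=1.0):
    frontier = deque(seed_urls)           # URL queue (breadth-first policy)
    visited = set()
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue                       # skip unreachable pages
        pages[url] = html                  # would normally be stored in a file or DB
        for link in HREF_RE.findall(html):
            absolute = urldefrag(urljoin(url, link))[0]  # normalise, drop fragment
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)
        time.sleep(delay_seconds)          # politeness: throttle requests
    return pages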

A crawler can also be developed for a particular purpose (e.g., content acquisition from a particular Web site). Since the structure of the Web site is known, it is possible to direct crawling only to the parts of the Web site which contain the desired content. Furthermore, since the template of the Web site is usually known, it is possible to "clean up" the content by using relatively simple regular expressions2. Such crawlers can also contain advanced logic to handle sites which are generally difficult to crawl (e.g., sites that require authentication and/or user interaction). Note, however, that this approach is only feasible for acquiring data from a few sites that contain a lot of valuable content in a limited number of common design templates. The reason is that, for each different template, a set of regular expressions needs to be devised manually, which is a time-consuming and relatively expensive process.
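For a site-specific crawler, such template-based clean-up with a handcrafted regular expression could look roughly as follows. The <div class="article-body"> marker is a hypothetical template element used only for illustration; each real site template would need its own expression.

# Sketch of template-specific content extraction with a regular expression.
# The marker below is hypothetical and assumes the article body is not nested
# inside further <div> elements; it has to be adapted to each site template.
import re

ARTICLE_RE = re.compile(
    r'<div class="article-body">(.*?)</div>',   # assumed template marker
    re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")                 # strips any remaining tags

def extract_article(html):
    match = ARTICLE_RE.search(html)
    if match is None:
        return None                             # template changed or no article found
    return TAG_RE.sub("", match.group(1)).strip()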

Web crawlers suffer from the same deficiencies as RSS feeds when it comes to document stream analysis. What is more, in this setting there is no "syndication" document to tell which pages are new or updated. To check if a Web page was updated, it needs to be downloaded3 and the database needs to be examined for near-duplicates (see Section 2.1.5). The only clear advantage that crawling has over RSS feeds is the ability to potentially acquire archives (i.e., old content) rather than just recent updates.

1 Taken from Wikipedia (http://en.wikipedia.org/wiki/Web_crawler).

2 See http://en.wikipedia.org/wiki/Regular_expression

3 In theory, it is enough to acquire the HTTP response header of the page.
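As noted in footnote 3, a crawler can in principle check for updates from the HTTP response headers alone. A minimal sketch of such a check is given below; the usefulness of the Last-Modified and ETag headers varies across servers, and many dynamically generated pages do not provide reliable values for them.

# Sketch of an update check based on HTTP headers only (cf. footnote 3):
# a HEAD request returns the response headers without the page body.
import urllib.request

def page_headers(url):
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return dict(response.headers)

# If the Last-Modified or ETag value is unchanged since the last visit, the
# page most likely does not need to be re-downloaded.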



In FIRST, we will most probably need to develop several crawlers for specific Web sites with valuable content. It is however not clear whether a general-purpose crawler will be employed as most of the Web sites of interest provide RSS feeds.

2.1.1.3 Proprietary APIs

Data can also be acquired through proprietary Application Programming Interfaces (APIs) provided by organisations that own and/or broker the data. Such APIs are normally implemented as Web services that are SOAP or REST-compliant1, but other protocols can be provided as well. Technologically advanced companies provide proper streaming protocols (i.e., push protocols) such as long polling (also referred to as Comet programming)2 and XMPP (Extensible Messaging and Presence Protocol)3.
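The long-polling pattern mentioned above can be sketched as follows: the client issues a request that the server holds open until new data is available (or a timeout expires) and then immediately re-requests. The endpoint URL and the timeout are illustrative only and do not describe any particular provider's API.

# Generic long-polling client sketch (not tied to any specific provider API).
import time
import urllib.request

ENDPOINT = "http://example.com/stream/updates"   # placeholder endpoint

def long_poll(handle_update):
    while True:
        try:
            # The server is expected to hold the request open until it has new
            # data; the timeout bounds how long we are willing to wait.
            with urllib.request.urlopen(ENDPOINT, timeout=120) as response:
                handle_update(response.read())
        except OSError:
            time.sleep(1)                        # timeout or network error: re-poll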

In FIRST, official news streams will be provided by IDMS through their elaborate API. The API is HTTP-based and supports both pull and push protocols.

2.1.2 HTML preprocessing

Over the years, and in a quite ad-hoc fashion, Web pages have evolved into relatively complex mixtures of JavaScript and HTML. This form makes HTML a difficult format for any kind of automatic processing. Figure 4 shows an HTML document that uses JavaScript to "inject" some content, Cascading Style Sheets (CSS) to hide some content, and character references to encode the main part of the content. A naive HTML processing component would normally exclude styles and scripts (grey text in the figure) and end up with the text elements shown in red in the figure. We can clearly see that such a processing component would fail to grasp the true message communicated to the user by a Web browser that correctly interprets the given HTML document.

1 See http://en.wikipedia.org/wiki/Web_service

2 See http://en.wikipedia.org/wiki/Comet_%28programming%29

3 See http://en.wikipedia.org/wiki/Extensible_Messaging_and_Presence_Protocol


Figure 4: A “difficult” HTML and the corresponding correct interpretation by a browser.

For this reason, we will implement a proxy server that will provide "normalised" HTML documents. The proxy will execute JavaScript code, remove hidden elements, and resolve character entities. The resulting HTML will represent a much better basis for further automatic processing. The proxy will be based on one of the existing Web browser engines (such as the Internet Explorer object available in .NET, WebKit, or Gecko1).
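The following sketch illustrates two of the normalisation steps described above, resolving character references and skipping content hidden via inline styles or placed in script/style elements, using Python's standard HTML parser. It is not the FIRST proxy: executing JavaScript and applying stylesheet rules (e.g., the class="hidden" case in Figure 4) still require a real browser engine, which is why the proxy will build on one.

# Sketch of partial HTML normalisation: character references are decoded and
# text inside script/style or inline-hidden elements is dropped. JavaScript
# execution and stylesheet-based hiding are NOT handled here. Reasonably
# well-formed tag nesting is assumed.
from html.parser import HTMLParser

HIDDEN_STYLES = ("display:none", "visibility:hidden")

class VisibleTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)   # decodes &#104; etc. into text
        self.stack = []                           # per-tag "is hidden" flags
        self.hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        hidden = tag in ("script", "style") or any(s in style for s in HIDDEN_STYLES)
        self.stack.append(hidden)
        if hidden:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if self.stack and self.stack.pop():
            self.hidden_depth -= 1

    def handle_data(self, data):
        if self.hidden_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

Applied to the markup in Figure 4, this sketch drops the two inline-hidden divs and the script, and decodes the character references into readable text, but it still wrongly keeps the div hidden via the CSS class and cannot produce the JavaScript-generated sentence; this is precisely the gap that a browser-engine-based proxy closes.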

2.1.3 Boilerplate removal

2.1.3.1 Problem statement

The Web offers a freely available, almost unlimited amount of heterogeneous data. Different kinds of information can be extracted from an average HTML page. Among the most informative types of pages are, above all, news articles and blog posts. Most often, it is the main content (i.e., the article text, or any meaningful text) of the HTML page that we are interested in. The undesired content of a Web page is called boilerplate (a reusable text or layout formulation commonly found in newspaper articles) and mostly includes scripts, styles, advertisements, etc. It is also desirable to distinguish between the different types of relevant content, such as the article body, user comments, and headlines.

Some of the numerous reasons for extracting the main content, besides further linguistic processing, include indexing for a search engine, detection of Web pages with (near-)duplicate content, display of Web page content on a small screen (e.g., a cell phone), etc.

Considering the immense proliferation of informative Web content and the necessity of extracting meaningful text from Web pages, it is surprising that there is no existing or proposed HTML standard for marking up the different semantic segments of a Web page. Defining heuristic rules to distinguish between Web page parts is probably the simplest solution, but it proves inflexible across different Web sources and ever-changing page styles. Machine learning techniques therefore become a necessity.

1 See http://en.wikipedia.org/wiki/List_of_web_browser_engines

The HTML source of the example shown in Figure 4:

<html>
  <head>
    <style type="text/css">
      .hidden { visibility: hidden }
    </style>
  </head>
  <body>
    <div style="visibility:hidden">Some hidden text.</div>
    <div style="display:none">More hidden text.</div>
    <div class="hidden">Even more hidden text.</div>
    <script language="javascript">
      document.write("This is some js generated text.");
    </script>
    T&#104;&#105;&#115;&#032;&#105;&#115;&#032;&#115;&#111;&#109;&#101;&#032;&#101;&#110;&#099;&#111;&#100;&#101;&#100;&#032;&#116;&#101;&#120;&#116;&#046;
  </body>
</html>

In the following sections, we present an overview of the boilerplate removal problem, show several popular methods, and explain one particular method in more detail. Furthermore, in Annex 2, we present the preliminary evaluation of the implemented boilerplate removal component. We show the influence of different features on the extraction accuracy and attempt to choose the most effective ones. After outlining the final results, we point out the advantages and the weaknesses of the implemented method and conclude with several suggestions for further improvements.

2.1.3.2 Existing approaches

For a given Web page, we first wish to determine whether it contains some meaningful content (i.e., longer, informative, not necessarily contiguous text, resembling a newspaper article). Then, we wish to extract the main content without the surrounding or interleaving boilerplate. Besides the main goal of separating the main content from the boilerplate, we would also like to distinguish between various subtypes of the main content, such as headlines, user comments, related content, and supplemental content. The extracted text should retain all the original formatting and punctuation marks, with the exception of HTML tags, both for the purpose of displaying the content to the user and for the purpose of information extraction (many information extraction steps, e.g., sentence splitting, rely on punctuation marks).

A first method that comes to mind when skimming through the HTML of a group of Web pages from the same source is to handcraft a rule that separates the meaningful text from the boilerplate. Such a rule may provide the desired accuracy for a particular Web page template, but quickly becomes obsolete when the template changes or when dealing with many different Web sources. Apart from being impractical in the long term, this manual work is also relatively expensive.

To overcome these issues, Web pages from many different sources should be used as a training dataset for machine learning methods that automatically discover rules (models) which are accurate and yet general enough to suit various Web sources.

Looking at several news article Web pages, it becomes obvious that the different semantic parts usually occupy the same places. The main article content is in the middle, headlines are above the main content, user comments are at the end, and the unwanted advertisements are on the sides. The Visual Page Segmentation (VIPS) technique (Cai, Yu, Wen, & Ma, 2003) makes use of these page layout features to obtain a partitioning of the Web page: a tree of HTML blocks is built according to their position on the page.

Most of the methods for boilerplate removal rely on dividing the Web page into contiguous blocks. This is implied by the HTML tree structure, where textual content is enclosed in blocks by the tags. On such a sequence of blocks, features can be constructed and existing methods for finding and labelling sequences can be applied.

The basic way of annotating the main content of a Web page is to mark its beginning and end. The maximum subsequence segmentation method (Pasternack & Roth, 2009) finds such a beginning and end by maximising the sum of the probabilities assigned to the individual tokens (i.e., words, symbols, and tags). The probability that a token belongs to the article is estimated by a local (Naive Bayes) classifier trained on a dataset of HTML Web pages in which the starts and ends of the news articles are marked. To be accurately extracted, the article text should be contiguous, coherent, and more than about eight sentences long. Undesired content inside the identified article block is removed using simple heuristics. The article boundary detection technique proved too coarse for content other than the contiguous article text. Besides its fairly good accuracy, its noticeable advantage is the linear


complexity and thus the suitability for a fast real-time pipeline. An on-line demonstration of this method is available at http://took.cs.uiuc.edu/MSS/default.aspx.
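The core of this segmentation step can be sketched compactly. In the sketch below we assume that per-token article probabilities are already available (in the original method they come from the local Naive Bayes classifier); the probabilities are shifted so that boilerplate tokens contribute negative scores, and the contiguous token span with the maximum score sum is found with Kadane's algorithm. The function names and the 0.5 shift are illustrative choices, not the exact formulation of the paper.

# A minimal sketch of maximum subsequence segmentation over per-token scores.
# The per-token probabilities are assumed to be given; in the original method
# they are produced by a local Naive Bayes classifier.

def max_subsequence(scores):
    """Return (start, end, total) of the maximum-sum contiguous subsequence."""
    best_start, best_end, best_sum = 0, 0, float("-inf")
    cur_start, cur_sum = 0, 0.0
    for i, s in enumerate(scores):
        if cur_sum <= 0:                 # start a new candidate span here
            cur_start, cur_sum = i, s
        else:
            cur_sum += s
        if cur_sum > best_sum:
            best_start, best_end, best_sum = cur_start, i + 1, cur_sum
    return best_start, best_end, best_sum

def extract_article(tokens, token_probabilities):
    # Shift probabilities so that tokens unlikely to belong to the article
    # contribute negative scores (the 0.5 threshold is an illustrative choice).
    scores = [p - 0.5 for p in token_probabilities]
    start, end, _ = max_subsequence(scores)
    return tokens[start:end]

tokens = ["<div>", "Menu", "</div>", "The", "market", "rallied", "today", "<a>", "Ads", "</a>"]
probs  = [0.1, 0.2, 0.1, 0.9, 0.95, 0.9, 0.85, 0.2, 0.1, 0.1]
print(extract_article(tokens, probs))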

More elaborate text extraction can be achieved by dividing a Web page into smaller blocks and classifying each block separately. The method based on shallow text features (Kohlschütter, Fankhauser, & Nejdl, 2010) extracts each block of text bounded by an opening or closing HTML tag. Such granularity, with a proper choice of features, allows rather accurate extraction not only of not necessarily contiguous content (i.e., the main article text), but also of other valuable content such as headlines and user comments, which differ from the main text only subtly. The article promotes rather simple features for boilerplate detection, namely text block features such as the number of words per block, text density, and link density (see the following section). This choice is backed up by accurate classification results when employing a simple decision tree model. This is the method we chose to implement; we discuss it in more detail in Annex 2. An on-line demonstration of this method is available at http://boilerpipe-web.appspot.com/.

2.1.3.3 Text block features distinguishing between text classes

Features for classifying text blocks extracted from Web pages can be defined on several different levels. Site features are specific to all documents originating from the same Web source. These are usually omitted as they may lead to over-fitting to the specific source. Structural features originate from the HTML structure, more specifically from the HTML tags preceding and following the text block. Specific CSS classes and sequences of HTML tags may lead to over-fitting and are therefore not considered. When examining text blocks, we extract language-independent higher-level shallow text features. These features are word-oriented and also include simple heuristics such as the number of a certain type of characters (such as digits and uppercase letters). Nearly as significant are densitometric features, primarily link density and text density.

The number of potential features is large. We list a few interesting ones here, originally described in (Kohlschütter, Fankhauser, & Nejdl, 2010; Spousta, Marek, & Pecina, 2008; Gibson, Wellner, & Lubar, 2007); a minimal sketch showing how some of them can be computed is given after the list:

Containment in a specific tag (e.g., <p>, <hx>, <div>...).

Names of the ancestor, descendant, and sibling tags.

Presence of a specific tag immediately before or after the block (e.g., <img> to exclude photo captions).

Number and type of the contiguous tags separating two blocks.

Count (relative, absolute) of a certain type of characters (whitespace, digits, punctuation marks).

Position of a block (relative, absolute) within the HTML document.

Number of sentences in the block (acquired by counting the proper punctuation marks).

Average length of a sentence (in words).

Detection of certain strings (e.g., times, dates, URLs) in the block.

Language profile of the text block. This can be used as a single major feature in some approaches to boilerplate removal (Evert, 2008).

Link density (i.e., anchor percentage) – a percentage of tokens inside the anchor tags.

The number of named entities in the block (e.g., organisations, people, locations).

The number of words starting with an upper-case letter.


Inter-block features (e.g., "previous vs. current block" quotients) such as the number-of-words quotient and the text density quotient.
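The sketch below computes a handful of the features listed above for a single text block and applies a simple densitometric rule in the spirit of the decision tree of (Kohlschütter, Fankhauser, & Nejdl, 2010). The block representation, the line-wrapping approximation, and the thresholds are illustrative assumptions, not the values learned in the paper.

# A minimal sketch of computing a few shallow text block features and applying
# a simple densitometric content/boilerplate rule. Thresholds are illustrative.
import re

def block_features(text, anchor_text="", wrap_width=80):
    words = text.split()
    anchor_words = anchor_text.split()
    lines = max(1, len(text) // wrap_width)          # approximate wrapped lines
    return {
        "num_words": len(words),
        "link_density": len(anchor_words) / max(1, len(words)),
        "text_density": len(words) / lines,          # words per wrapped line
        "upper_case_ratio": sum(w[0].isupper() for w in words) / max(1, len(words)),
        "num_sentences": len(re.findall(r"[.!?]", text)),
    }

def is_content(features):
    # Illustrative rule: long, link-poor blocks are likely main content.
    return features["num_words"] > 15 and features["link_density"] < 0.33

block = "Shares of the company rose sharply after the quarterly report was published on Monday, analysts said."
print(is_content(block_features(block)))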

2.1.3.4 Conclusions

In the preceding sections, boilerplate removal was discussed. Boilerplate mostly refers to text fragments in an HTML document that should be excluded from further analyses. Such text fragments include advertisements, copyright notices, navigation elements, and recommendations.

After reviewing the current state of the art, we decided to implement the method presented by Kohlschütter et al. (2010). It is based on shallow language-independent text-block features and exhibits fairly high accuracy on news articles and blog posts. In addition, the method is flexible enough to detect other potentially relevant text segments such as, for example, user comments.

We discuss our implementation of the chosen boilerplate removal method in more detail in Annex 2. Our preliminary experiments confirm the accuracy reported in the original paper.

2.1.4 Language detection

2.1.4.1 Problem statement

Most of the text mining and natural-language processing (NLP) tools are language-specific. In text mining, stemming (or lemmatisation) and lists of stop words depend on the language, and in NLP, POS tagging, chunking, and deep parsing are all language-dependent technologies. The first stage in the Web content mining tasks is usually gathering HTML pages from the Web (e.g., Web crawling or fetching pages through RSS feeds). The cleaning steps that follow need to take care of boilerplate removal, language detection, and code-page detection. This ensures that irrelevant content and HTML tags are removed (boilerplate removal), special characters are encoded correctly (code-page detection), and documents which cannot be handled by the selected language-dependent analysis tools are removed from the corpus (language detection and filtering).

In this section (and further on in Annex 1), we present a framework for language and code-page detection based on n-gram statistics. We provide experimental results for several Eastern European languages and for English. We vary the length of the documents and the cut-off parameter to see how this influences the prediction accuracy. We experiment with several similarity measures: the out-of-place measure, the cosine similarity, and the Spearman rank correlation coefficient.

2.1.4.2 Existing approaches

The majority of language classification (detection or identification) methods rely on statistical properties of text and on supervised learning from given reference languages. The text is usually divided into smaller parts such as letters, sequences of letters, or whole words. There are two popular families of methods: word-based and n-gram-based. A word-based method tokenises the text (language corpus) into words. One form of this method builds discriminators for languages by using all the words from the corresponding dictionaries. In general, it is not necessary to take into account all the words in a dictionary, so other methods focus only on specific words. One method uses only short, frequent words of at most four to five letters (Grefenstette, 1995; Ingle, 1976). Another method generalises this and takes a specific


number of the most frequent words of arbitrary length (Souter, Churcher, Hayes, & Hughes, 1994; Cowie, Ludovic, & Zacharski, 1999).

In contrast to word-based models, n-grams can be analysed to build language-detection models (Cavnar & Trenkle, 1994). N-grams are sequences of n letters created by slicing up words (prior tokenisation is usually needed). A profile (a histogram of n-gram occurrence counts or probabilities) is usually created for each language from the corresponding corpora. The various n-gram methods differ mainly in the similarity measure applied between language profiles and in the way n-grams are ranked. Despite the higher dimensionality (i.e., there are more n-grams than words in a text), n-gram methods offer better precision and reliability than word-based methods (Cavnar & Trenkle, 1994).
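The following sketch illustrates this profile-based scheme with the out-of-place measure of (Cavnar & Trenkle, 1994). The training snippets, the trigram choice, and the cut-off value are illustrative; real profiles would be built from large corpora, and the same machinery can be applied to byte n-grams for code-page detection, as discussed below.

# A minimal sketch of n-gram language profiles and the out-of-place measure.
from collections import Counter

def ngram_profile(text, n=3, cutoff=300):
    text = " " + " ".join(text.lower().split()) + " "
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    ranked = [gram for gram, _ in counts.most_common(cutoff)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    # Sum of rank differences; missing n-grams get the maximum penalty.
    max_penalty = len(lang_profile)
    return sum(abs(rank - lang_profile.get(gram, max_penalty))
               for gram, rank in doc_profile.items())

def detect_language(text, lang_profiles):
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and runs away"),
    "sl": ngram_profile("hitra rjava lisica skoci cez lenega psa in odtece proc"),
}
print(detect_language("the fox runs over the dog", profiles))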

Another problem that we address in our language-detection framework is code-page detection. Although Unicode accounts for the majority of today's text encodings, local encodings are still in use. Most document format standards (HTML, XML, etc.) allow the encoding to be specified in the document's header, but this feature is not always used. The problem of code-page detection (or character-encoding detection) is similar to that of language detection. If there are no special marks at the beginning of a text file (such as the Byte Order Mark common in Unicode encodings), there is no other way to detect the encoding than to rely on statistical properties of the encoded text.

Three different methods for code-page detection are used in practice (Li & Momoi, 2001). In the coding scheme method, a state machine is defined for each encoding. By feeding the state machines one byte at a time, illegal byte sequences can be discovered, which effectively excludes certain encodings from consideration. While this method is appropriate for some types of multi-byte encodings, it fails for single-byte encodings. The other two methods are based on frequency statistics. The character distribution method relies on the fact that some characters are more frequent than others and counts the occurrences of each character. This method has been shown to be suitable for code-page detection of Asian languages (Chinese, Japanese, Korean, etc.). For languages with fewer characters (e.g., European languages), computing the frequencies of single characters alone does not provide adequate distinction between encodings. As an improvement, we can use the very same n-gram technique as for language detection; at least 2-grams (two-character sequences) are needed for satisfactory code-page detection. The n-gram technique has been shown to be suitable for detecting all types of encodings.

As the n-gram technique suits both language and code-page detection, we employ it to solve both tasks in a single framework. In our framework, a language profile is built for each supported language. To also cover different encodings, a separate profile is created for each encoding of the same language. In Annex 1, we describe experiments with several encodings of the Slovenian language.

2.1.5 Detecting near-duplicates in document streams

2.1.5.1 Problem statement

Much of the relevant Web content is duplicated; news stories are an everyday example. Exact duplicates can be identified by relatively simple hash-based methods. More problematic is near-duplicate content, which differs only in subtle details (such as copyright notices and advertisements) that are irrelevant for most further text processing. The first issue to consider when designing a system for near-duplicate detection arises from the ever-growing size of the Web: such a system should scale to several billions of indexed Web pages and also support a high throughput rate in a stream-based setting.


2.1.5.2 Related work and preliminary implementation

Ideally, there would be a mapping (hashing) from a document, viewed as an ordered list of words, to a unique, small, easily comparable number or symbol, such that the original inter-document similarity is retained among the mapped values. The simhash fingerprinting technique (Charikar, 2002) has such a locality-sensitivity property. It maps a document to a 64-bit fingerprint, with the Hamming distance used as the distance measure: the Hamming distance between two completely different documents is large, while for two similar documents it is small. A near-duplicate of a document is any document whose fingerprint differs from the given one in at most k bits (e.g., k = 3). Thus, the problem of looking up existing near-duplicates reduces to solving the Hamming distance problem: quickly finding at least one fingerprint that differs from the given one in at most k bits.

In our preliminary implementation, we compute a 64-bit simhash fingerprint by processing the bag-of-words representation of a document. In general, other specific features could be used instead of words. The final 64-bit document fingerprint is obtained by combining the 64-bit hash codes of the individual words as described in (Manku, Jain, & Sarma, 2007).
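A minimal sketch of this fingerprinting step is given below: each word contributes its weight to 64 bit-position counters, positively where its hash has a 1 bit and negatively where it has a 0 bit, and the sign of each counter yields the corresponding fingerprint bit. The word hash (a truncated MD5 digest) and the whitespace tokenisation are illustrative choices, not those of our actual implementation.

# A minimal sketch of 64-bit simhash fingerprinting over a bag-of-words
# document representation.
import hashlib
from collections import Counter

def word_hash64(word):
    # Stable 64-bit hash of a word (first 8 bytes of its MD5 digest).
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")

def simhash64(text):
    counters = [0] * 64
    for word, weight in Counter(text.lower().split()).items():
        h = word_hash64(word)
        for bit in range(64):
            counters[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit in range(64):
        if counters[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Two near-duplicate documents should yield a small Hamming distance.
d1 = simhash64("shares of the company rose sharply after the earnings report")
d2 = simhash64("shares of the company rose sharply after the quarterly earnings report")
print(hamming_distance(d1, d2))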

Because of the large number of crawled documents (often in the billions), we aim to look up near-duplicates (i.e., to solve the Hamming distance problem) in sublinear time. By storing permuted fingerprints in several look-up tables, it is possible to achieve fast look-ups without significantly increasing the space complexity. For example, for k = 3, each 64-bit fingerprint is split into 5 blocks of 13, 13, 13, 13, and 12 bits, respectively. There are 10 ways to choose 2 blocks (where 2 = 5 - k) from these 5, so we build 10 look-up tables, each indexing fingerprints by the corresponding two blocks. The tables are implemented as binary trees with 25 or 26 levels each (i.e., the size of two blocks). When searching the existing fingerprints for possible near-duplicates, each of these ten binary trees is traversed. Candidates for near-duplicates are discovered quickly in this way and are then checked to see whether they are indeed near-duplicates (i.e., whether they differ in 3 bits or less). It can be shown that the number of identified candidates remains manageably small even for billions of fingerprints in the tables.
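The 2-out-of-5 block indexing scheme works because k = 3 differing bits can touch at most three of the five blocks, so any near-duplicate shares at least one pair of blocks verbatim with the query fingerprint. The sketch below uses plain dictionaries instead of binary trees, which is a simplification for illustration only.

# A minimal sketch of the 2-out-of-5 block indexing scheme for near-duplicate
# look-up. Dictionaries stand in for the binary trees described above.
from itertools import combinations

BLOCK_BITS = [13, 13, 13, 13, 12]              # five blocks of a 64-bit fingerprint
BLOCK_PAIRS = list(combinations(range(5), 2))  # 10 ways to choose 2 of 5 blocks

def split_blocks(fp):
    blocks, shift = [], 0
    for width in BLOCK_BITS:
        blocks.append((fp >> shift) & ((1 << width) - 1))
        shift += width
    return blocks

class NearDuplicateIndex:
    def __init__(self, k=3):
        self.k = k
        self.tables = [dict() for _ in BLOCK_PAIRS]   # one table per block pair

    def add(self, fp):
        blocks = split_blocks(fp)
        for table, (i, j) in zip(self.tables, BLOCK_PAIRS):
            table.setdefault((blocks[i], blocks[j]), []).append(fp)

    def find_near_duplicate(self, fp):
        blocks = split_blocks(fp)
        for table, (i, j) in zip(self.tables, BLOCK_PAIRS):
            for candidate in table.get((blocks[i], blocks[j]), []):
                if bin(candidate ^ fp).count("1") <= self.k:   # verify the candidate
                    return candidate
        return None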

The paper (Manku, Jain, & Sarma, 2007) also describes other configurations for the look-up tables. A fingerprint compression technique and batch queries are also discussed in the paper.

2.1.6 Spam and opinion spam detection

For the rapidly increasing amount of information available on the Internet, there exists only little quality control, especially over the user-generated content that can be found on forums, blogs, and other Web sites where users can post comments. Although it is recognised that user-generated content contains valuable information for a variety of applications, the lack of quality control attracts spammers, who have found many ways to profit from spamming; some even make a living from it.

There are different types of spam, each with its own target group and goals. The best known is e-mail spam, i.e., unsolicited e-mail messages, often of a commercial nature, advertising products or even broadcasting political or social commentary. Spam filters are widely used; they are built into users' e-mail programs and/or mail servers and incorporate techniques for detecting keywords, templates, sentence structures, suspicious attachments, etc. that are typical of spam e-mails. Spammers continuously find ways to bypass spam filters, so on-going research in e-mail spam filtering is necessary (Sahami, Dumais, Heckerman, & Horvitz, 1998; Li, Zhong, & Liu, 2006; Fette, Sadeh-Koniecpol, & Tomasic, 2007).

Another type of spam is Web spam, whose objective is to achieve a higher ranking of certain Web pages in search engines. This objective is mainly pursued in two ways: content spam and link spam. Link spam is frequent on forums and Web sites that allow users to leave comments.


Content spam tries to include irrelevant or only remotely relevant words in target pages and in this way fool search engines into ranking those pages higher. Research papers dealing with Web spam include (Gyongyi & Garcia-Molina, 2004; Ntoulas, Najork, Manasse, & Fetterly, 2006; Wu, Goel, & Davison, 2006; Castillo, et al., 2006; Wu & Davison, 2006).

Opinion spam, on the other hand, gives an untruthful opinion on a certain topic or product. It can be found among reviews and commentaries on e-commerce Web sites, news Web sites, review Web sites, etc. Spammers try to promote or damage the reputation of people, businesses, products, or services by posting untruthful opinions. A lot of work has been done on analysing the sentiment of user-generated online content, but the focus has been only on whether the user's opinion is negative or positive (Dave, Lawrence, & Pennock, 2003; Pang, Lee, & Vaithyanathan, 2002; Popescu & Etzioni, 2005; Hu & Liu, 2004). Opinion spam, however, has not yet been studied extensively. Existing studies focus on consumer reviews of certain products as a place to look for opinion spam.

One of the research papers on this topic (Jindal & Liu, 2008) divides spam reviews into three types. Type 1 spam consists of untruthful reviews that deliberately give undeservedly positive reviews to a product in order to promote it and/or unjust or malicious negative reviews to other products in order to damage their reputation. Type 2 spam consists of reviews on brands only, i.e., reviews that do not comment on a specific product but only on the brand, manufacturer, or seller of the product; although such reviews may be useful, the authors consider them spam because they do not target specific products and are often biased. Type 3 spam consists of non-reviews, which can be roughly categorised into two main subtypes: (1) advertisements and (2) other irrelevant texts containing no opinions (e.g., questions, answers, and random text).

Type 2 and Type 3 reviews can be detected by employing standard machine learning techniques for classification using manually labelled spam and non-spam reviews, because these two types of spam reviews can be recognised manually. Therefore the problem of detecting those two types of spam is translated into the task of finding effective features for classification model construction.
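As an illustration of this classification setting, the sketch below trains a standard text classifier on a handful of manually labelled reviews. The tiny training set and the choice of word n-gram features are purely illustrative, since finding effective features is precisely the open research question.

# A minimal sketch of treating Type 2 and Type 3 review spam detection as a
# standard supervised classification task over manually labelled reviews.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "Buy cheap watches at www.example.com best prices",    # non-review (advertisement)
    "I hate this brand, all their products are rubbish",   # brand-only review
    "The battery lasts two days and the screen is sharp",  # genuine review
    "Great camera, autofocus is fast and photos are crisp" # genuine review
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(reviews, labels)
print(model.predict(["Visit www.example.com for the best deals on watches"]))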

Detecting Type 1 spam, on the other hand, proves to be much harder, since manual labelling by simply reading reviews is very difficult, if not impossible. However, using duplicate or near-duplicate reviews as guidance for labelling a review as spam allows spam-detection models constructed from data labelled in this way to predict likely harmful reviews reasonably well.

Another approach to opinion spam detection in consumer reviews uses language modelling techniques. For example, (Lai, Xu, Lau, Li, & Jing, 2010) present a computational model based on KL divergence and probabilistic language modelling as an efficient approach to detecting untruthful reviews. Similarly, (Lai, Xu, Lau, Li, & Song, 2010) propose an inferential language model equipped with high-order concept association knowledge and show it to be effective for detecting untruthful reviews when compared with other baseline methods.

An empirical study of online consumer review spam is presented in (Lau, Liao, & Xu, 2010), proposing an effective methodology for detection of untruthful consumer reviews that enables an econometric analysis to examine the impact of fake reviews on product sales.

Another group of research papers focuses instead on detecting suspicious reviewers who are likely to produce untruthful reviews. In (Lim, Nguyen, Jindal, Liu, & Lauw, 2010), product review spammers are detected by a scoring method that measures the degree of spam for each reviewer. In (Jindal, Liu, & Lim, 2010), unusual review patterns are identified that can represent atypical reviewer behaviour. The task is to find unexpected rules or rule groups; these rules describe behavioural patterns of reviewers that deviate from the expectations of a truthful reviewer and thus indicate spam activity.

All in all, not a lot of work has been done in the area of opinion spam detection, and it is not clear at this point which approach will be adopted in FIRST. Most likely, we will first analyse


duplicates in the acquired data in order to better understand their nature, frequency, quantity, and purpose (e.g., spamming vs. "adoption" of content by other sources).

2.2 Ontology learning and existing semantic resources

2.2.1 Ontology learning

An ontology is a formal representation of a set of concepts within a domain and of the relationships between those concepts. It is used to reason about the properties of the domain and may be used to define the domain. In the project, an ontology will be employed to support the information extraction task (i.e., the construction of high-level features used in decision support models). Rather than employing "standard" knowledge acquisition methodologies (e.g., (Fernandez-Lopez, Gomez-Perez, & Juristo, 1997)), we plan to use ontology learning (OL) to (1) discover topics in streams of financial news, (2) enhance ontologies with gazetteers required for the information extraction tasks (e.g., sentiment vocabulary gazetteers), and (3) reuse existing ontologies and other (semantic) resources related to the financial domain (e.g., populate the ontology with named entities from proprietary databases). We opt for automatic ontology evolution over standard ontology engineering approaches because the latter are relatively expensive and inappropriate for data streams, where the ontologies need to be constantly adapted to new data.

In SEKT: Semantically-Enabled Knowledge Technologies (IP IST-2003-506826), several tools for learning ontologies from textual corpora were developed. Even though text is an unstructured data source which makes it a difficult source for OL, a lot of effort has been put into designing algorithms for learning ontologies from text. This is mainly due to two reasons: (1) text mining and NLP techniques are mature enough to take on this challenge, and (2) textual resources are available for every domain and in the context of nearly every application.

In general, there are two complementary approaches to OL from textual corpora: (1) the approach based on statistical analysis in which whole documents are perceived as instances, and (2) the approach based on NLP in which linguistic constructs extracted from a document are perceived as ontological entities (Gómez-Pérez & Manzano-Maho, 2003).

OntoGen (Fortuna, Grobelnik, & Mladenic, 2006), initially developed in SEKT, belongs to the first category. It is a tool for semi-automatic ontology construction from textual corpora. Each document, converted into its bag-of-words representation (Salton, 1991), is treated as an instance. Instances are grouped into concepts through document clustering, and the concepts are arranged into a hierarchy in an iterative process controlled by the user. Each concept is named with "strong" keywords extracted from the corresponding cluster. Apart from these machine learning techniques, OntoGen employs active learning for advanced concept discovery and topic space visualisation for providing visual insight into the bag-of-words space (Fortuna, Mladenic, & Grobelnik, 2006). OntoGen is capable of inducing populated concept hierarchies (also termed "topic ontologies").
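The underlying machinery can be sketched as follows: documents are turned into TF-IDF weighted bag-of-words vectors, clustered into concepts, and each concept is labelled with the "strong" keywords that carry the highest weight in its centroid. This is only an illustration of the general idea using flat k-means clustering, not OntoGen's actual implementation or its interactive hierarchy construction.

# A minimal sketch of topic discovery over bag-of-words vectors: cluster the
# documents and name each cluster with its strongest centroid terms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "stocks fell sharply as investors feared rising interest rates",
    "the central bank raised interest rates to curb inflation",
    "the new smartphone features a faster processor and better camera",
    "reviewers praised the camera and battery life of the device",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = vectorizer.get_feature_names_out()
for c in range(kmeans.n_clusters):
    top = kmeans.cluster_centers_[c].argsort()[::-1][:3]
    print("concept", c, "keywords:", [terms[i] for i in top])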

Text-To-Onto (Maedche & Staab, 2000) partially belongs to the category of tools that learn ontologies from textual corpora by employing NLP. It combines outputs of several different algorithms to produce a coherent ontology definition. Prior to applying statistical analysis, documents are processed with NLP tools (a shallow text processor for German language is used). In addition to concept extraction and hierarchical clustering, domain dictionaries are exploited and association rule mining techniques are used to extend the set of relations within the ontology.

Ontologies can be learnt from other types of data sources as well. Dictionaries, knowledge bases, semi-structured schemata, relational schemata, and even software source code (Grcar,


Grobelnik, & Mladenic, 2007) have already been considered as data sources for OL. An excellent survey was conducted in the scope of the European project OntoWeb (IST-2000-29243) (Gómez-Pérez & Manzano-Maho, 2003).

In TAO: Transitioning Applications to Ontologies (STREP IST-2004-026460), a methodology for transitioning application domains to ontologies was proposed, with OL playing an important role in the process (Amardeilh, Vatant, Gibbins, & others, 2004). In the TAO OL process, data sources are first converted into an intermediate representation consisting of graphs and text documents. From here on, an algorithm based on link analysis and text mining is employed to prepare input for OntoGen which is the next tool in the pipeline (Grcar, Mladenic, Grobelnik, & others, 2007).

There is a lack of truly large-scale ontology construction algorithms that would learn ontologies from massive streams of documents. Such algorithms are expected to constantly update the ontology with respect to the associated document streams. A valid approach to adaptive hierarchical clustering is presented in (Novak, 2008). In FIRST, we will implement a large-scale adaptive topic ontology construction algorithm by replacing the document clustering algorithm in the original OL pipeline with a large-scale adaptive hierarchical clustering algorithm.

Publicly available semantic resources lack ontologies that are fit for the information extraction required by this project. Such ontologies need to be augmented with vocabularies and gazetteers extracted from textual corpora in an automated fashion. We will provide domain ontologies enriched with vocabularies and gazetteers, which will make them more suitable for information extraction tasks.

2.2.2 Existing relevant semantic resources

There are very few existing ontologies in the area of finance. Probably the best known example is Eddy Vanderlinden's ontology of financial instruments, involved parties, processes, and procedures in securities handling1. Descriptions of ontologies containing financial instruments can be found in works by Thomas Locke Hobbs2 and Mike Bennett of Hypercube Ltd3. (Zhang, Zhang, & Ong, 2000) built an ontology for financial investment for use with a multi-agent system, but it does not appear to be available anywhere.

In the project DIP4 (Data, Information and Process Integration with Semantic Web Services, FP6-507483), a financial ontology was developed for the first eBanking case study (a mortgage simulator/comparator). The ontology contains financial services, financial products, and a stock market ontology describing stock market operations.

Other semantic (and also not strictly semantic) resources potentially relevant to FIRST include:

McDonald's word list (http://www.nd.edu/~mcdonald/Word_Lists.html)

Ontology developed by UHOH, used in their preliminary sentiment analysis experiments

Publicly available sentiment-labelled datasets designed for sentiment analysis experiments, e.g.:

o http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
o http://langtech.jrc.it/JRC_Resources.html#Sentiment-quotes

DMoz (http://www.dmoz.org/)

DBpedia (http://dbpedia.org/)

ResearchCyc (http://research.cyc.com/)

SentiWordNet (http://sentiwordnet.isti.cnr.it/)

1 Available from http://www.fadyart.com/ontologies/data/Finance.owl

2 Description available from http://www.isi.edu/~hobbs/open-domain/

3 Description available from www.hypercube.co.uk/docs/ontologyexploration.doc

4 Deliverables available from http://dip.semanticweb.org/deliverables.html, see especially D10.3 and D10.7


WordNet (http://wordnet.princeton.edu/)

Financial glossaries (e.g., http://www.forbes.com/tools/glossary/index.jhtml)

Proprietary data available to the consortium via the IDMS API (contains companies, instruments, and notations)

2.3 Sentiment classification and semantic feature extraction

2.3.1 Problem analysis for sentiment classification

A wide variety of vocabulary is used in the literature on the automatic extraction of sentiment. The same task is referred to as sentiment analysis, sentiment extraction, opinion mining, subjectivity analysis, emotional polarity computation, and other terms. The same is true for the object of classification, the polarity of opinions: commonly used terms are sentiment orientation, polarity of opinion, and semantic orientation. We follow (Pang & Lee, 2008) and (Liu, 2010) in regarding these terms as synonyms.

The definition of sentiment and sentiment orientation in the literature is often vague. A variety of more or less formal definitions can be found in Annex 4.a.

The classification of sentiment can be done at several levels: words, phrases, sentences, paragraphs, documents, and even multiple documents. At each level the classification can be done directly, or it can build on the results of previous levels by aggregating them, and different techniques are suitable at different levels. The levels of sentiment analysis are therefore used to structure the following section. Methods for classifying sentiment at the sentence and at the document level are often similar, which is why they are treated together in a single section.

The problem we want to address in FIRST is the extraction, classification, and aggregation of sentiment. Apart from that, there is a variety of other subproblems in the area of sentiment analysis; short introductions to these can be found in Annex 4.b.

2.3.2 Assessment criteria

The research approach taken in our work can be characterised along several dimensions. Figure 5 illustrates this characterisation; its main purpose is to delimit the literature reviewed for the state of the art in sentiment classification.


Figure 5: Coverage of topics by the state of the art

A major characteristic of the object of research is the focus on texts, so we are only interested in work about text classification. The source of these texts is limited to the World Wide Web. A text can contain subjective or objective statements; we are interested in the former. The general type of sentiment object is a kind of product (so sentiments about, e.g., persons are excluded). The domain of these products is mainly finance; nevertheless, there has been some research in the movie domain, and as the applied methods are independent of the domain, scientific work from other domains is also included.

The problem that is tackled is the classification of these sentiments.

The perspective is automatic information extraction, which can be further divided into approaches based on machine learning methods and approaches based on extraction rules. The use of extraction rules encompasses what is sometimes also called methods of "semantic orientation" (Chaovalit & Zhou, 2005). Both alternatives are considered in this literature review. Furthermore, the extraction systems can be supported (or "guided", (Wimalasuriya & Dou, 2010)) by ontologies, and these approaches are of special interest.

This leads to the following criteria for the inclusion of previous scientific work: it has to deal with the problem of automatic classification of text-based sentiments about products found in the World Wide Web.


2.3.3 State of the art in sentiment classification

This section describes the scientific state of the art for sentiment classification from unstructured documents. The different levels of sentiment analysis are used for structuring this work, as on each level different techniques are suitable.

2.3.3.1 Sentiment classification of words

There are two basic strategies for labelling words with sentiment polarities. The first strategy relies on WordNet and the relations it provides. Starting with a set of seed words with known sentiment, the relations are used to propagate the sentiment of these seed words through the WordNet concept graph. (Hu & Liu, 2004) use crisp classes (positive and negative) as labels. SentiWordNet1 (Esuli, Sebastiani, & Moruzzi, 2006) assigns three numerical scores in the interval [0.0, 1.0] to each synset, indicating how positive, negative, and objective the terms contained in the synset are. (Andreevskaia & Bergler, 2006) assign a fuzzy score to each word based on its labels in several classification runs using different seed words.

The second approach to labelling words with sentiment polarities relies on linguistic properties of the distribution of the data. In their pioneering work, (Hatzivassiloglou & McKeown, 1997) postulate that linguistic constructs (such as conjunctions) impose constraints on the semantic orientation of their arguments. They present a method for automatically determining the semantic orientation (positive or negative) of adjectives from conjunctions found in a corpus using a log-linear regression model. (Turney & Littman, 2003) present a method to calculate the semantic orientation of a word as the strength of its association with a set of positive words minus the strength of its association with a set of negative words, where the strength of association is computed as the pointwise mutual information with the seed words. Based on the assumption that sentiment terms of similar orientation tend to co-occur at the document level, (Turney & Littman, 2003) also classify the semantic orientation of words using Latent Semantic Analysis. (Aue & Gamon, 2005) extend the approach of Turney and Littman by adding the assumption that sentiment terms of opposite orientation tend not to co-occur at the sentence level.
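The sketch below illustrates this kind of association-based scoring: the semantic orientation of a word is its summed PMI with a set of positive seed words minus its summed PMI with a set of negative seed words. The seed sets, the smoothing, and the co-occurrence counts are made up for illustration and are not the exact corpus statistics used by (Turney & Littman, 2003).

# A minimal sketch of SO-PMI style semantic orientation from co-occurrence counts.
import math

POSITIVE_SEEDS = {"good", "excellent"}
NEGATIVE_SEEDS = {"bad", "poor"}

def pmi(word, seed, cooc, word_count, total):
    p_joint = cooc.get((word, seed), 0.01) / total      # smoothed joint probability
    p_word = word_count.get(word, 1) / total
    p_seed = word_count.get(seed, 1) / total
    return math.log2(p_joint / (p_word * p_seed))

def semantic_orientation(word, cooc, word_count, total):
    return (sum(pmi(word, s, cooc, word_count, total) for s in POSITIVE_SEEDS)
            - sum(pmi(word, s, cooc, word_count, total) for s in NEGATIVE_SEEDS))

# Illustrative corpus statistics (document-level co-occurrence counts).
word_count = {"profit": 120, "loss": 110, "good": 300, "excellent": 80,
              "bad": 250, "poor": 90}
cooc = {("profit", "good"): 60, ("profit", "excellent"): 20,
        ("profit", "bad"): 5, ("profit", "poor"): 2,
        ("loss", "good"): 4, ("loss", "excellent"): 1,
        ("loss", "bad"): 55, ("loss", "poor"): 25}
total = 10000

print("profit:", round(semantic_orientation("profit", cooc, word_count, total), 2))
print("loss:", round(semantic_orientation("loss", cooc, word_count, total), 2))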

Additionally, there are manually constructed lists of sentiment words that are often used as a gold standard for the evaluation of the approaches presented above. Some of these resources are described in Annex 3.

Most methods and most lists presented in the previous sections regard sentiment as a crisp value. (Andreevskaia & Bergler, 2006) compare the agreement of the human annotations in the General Inquirer list and the Hatzivassiloglou and McKeown list and find only 78.7% agreement. They state that disagreement between human annotators is not necessarily a quality problem, but a structural problem of the semantic category: for words on the periphery of the category, membership is more ambiguous and may be labelled differently by different human annotators. It therefore makes sense to encode sentiment as a fuzzy category. SentiWordNet (Esuli, Sebastiani, & Moruzzi, 2006) uses values from the interval [0.0, 1.0] to represent sentiment, but does not base this representation on fuzzy theories. (Andreevskaia & Bergler, 2006) explicitly regard sentiment as a fuzzy category following Zadeh, where membership is gradual and some members are more central than others. Manually created word lists usually contain crisp sentiment labels, as it would be difficult to annotate fuzzy sentiment manually.

(Andreevskaia & Bergler, 2006) further state that neutral words and words at the border between the positive and negative categories are a great challenge and a source of errors. (Baccianella, Esuli, & Sebastiani, 2010) have integrated a category 'objective' into their work, while most others only have the categories positive and negative. In the latter approaches, the category 'objective' can be inferred to contain all words that are not in the list or that have no sentiment orientation value assigned.

1 Available from http://sentiwordnet.isti.cnr.it/


Still, there exist words that express a neutral sentiment, such as mediocre. Recognising that a neutral sentiment is expressed can be important for applications like subjectivity classification; for sentiment classification, it is a problem only if the category 'neutral' also has to be extracted at the sentence or document level.

Some modifiers can intensify, weaken, or reverse the semantic orientation of the modified word; e.g., very nice is more positive than just nice, and not beautiful is negative while beautiful is positive. These words must be treated in order to determine the sentiment of a sentence, but the information about which words affect semantic orientation must come from the lexical resources. Most of the manually created lists provide categories for such words, but the problem then is to judge how much the sentiment changes. Linguistic hedges try to address this problem. The introduction of linguistic hedges in fuzzy logic is based on (Zadeh, 1972). Linguistic hedges are treated as special linguistic operators that modify other terms: the category membership of a term can be intensified or diminished, and hedges can also make a crisp term fuzzy (the table is roughly square vs. the table is square). There are two basic types of hedges: type 1 hedges are simple operators acting on a fuzzy set (very, slightly, etc.), while type 2 hedges act on components of the operand (actually, essentially, etc.). The effect of linguistic hedges depends strongly on syntactic and semantic constructions. There is no single solution for the implementation of linguistic hedges, but the basic idea is to model a linguistic hedge as an operator (a mathematical function) that transforms one fuzzy set into another. The operators are expressed through a set of basic operators such as concentration, dilation, or intensification. There are very few lists of linguistic hedges; one for English can be found in (Lakoff, 1973).
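The sketch below shows the classical Zadeh-style definitions of two type 1 hedges as operators on fuzzy membership degrees (concentration for "very", dilation for "slightly") together with a simple complement for negation. The membership value of "nice" in the category "positive" is an illustrative assumption.

# A minimal sketch of type 1 linguistic hedges as operators on fuzzy
# membership degrees.

def very(membership):        # concentration: sharpens the category
    return membership ** 2

def slightly(membership):    # dilation: weakens the category
    return membership ** 0.5

def negate(membership):      # complement, e.g., "not beautiful"
    return 1.0 - membership

nice_is_positive = 0.8       # assumed membership of "nice" in "positive"
print("nice          ->", nice_is_positive)
print("very nice     ->", round(very(nice_is_positive), 2))
print("slightly nice ->", round(slightly(nice_is_positive), 2))
print("not nice      ->", round(negate(nice_is_positive), 2))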

(Hatzivassiloglou & McKeown, 1997) use only adjectives in their approach. Because the list they created for evaluation is often used to evaluate other approaches, many of these also work only on adjectives. While it is certainly true that adjectives are often used to express sentiment, words of other parts of speech can have a semantic orientation too, such as disappoint or mistake. According to (Turney & Littman, 2003), the approach of (Hatzivassiloglou & McKeown, 1997) can be adapted for adverbs as well, but not for nouns and verbs, because a conjunction of these parts of speech with different sentiment orientations is perfectly possible ("the rise and fall of the Roman Empire" is acceptable, while "the tax proposal was simplistic and well received" is not). WordNet-based methods, word lists, and the other corpus-based approaches work for all parts of speech.

As WordNet is a general, non-domain-specific resource, the methods that use WordNet are independent of the domain. On the one hand, this is an advantage, because resources are readily available for any domain. On the other hand, sentiment is often expressed in a domain-specific way, and using non-domain-specific vocabulary may lead to misclassifications (Loughran & McDonald, 2010). Therefore, either domain-specific word lists have to be created manually or a corpus-based method has to be used. Fortunately, in the domain of finance, the work of (Loughran & McDonald, 2010) provides such a list as a starting point.

A big problem that needs to be addressed when determining the semantic orientation of words is the ambiguity of words. Including word sense disambiguation into applications may help solve this problem.

2.3.3.2 Sentiment extraction and classification of sentences and documents

One of the main tasks of FIRST is to classify sentiments. The classification of sentiment has been the focus of work of many scientists from different perspectives. This section gives a short overview of the methods used for sentiment classification at the sentence and document level. The focus of this section is on methods for sentiment classification in general, not on the use of


classification methods in finance, since in this field texts are often labelled according to stock price reactions (facts rather than sentiment).

2.3.3.2.1 Classification via approaches based on machine learning

Several machine learning methods are commonly used for sentiment classification, including artificial neural networks, support vector machines, naive Bayes classifiers, maximum entropy, decision trees, k-nearest neighbours, n-gram classifiers, and the Rocchio algorithm. (Pang & Lee, 2008) provide a survey of machine learning classifiers for sentiment classification. An example in the domain of finance is the work of (Das & Chen, 2007), who use five different classifiers based on machine learning and on the valence (polarity) of words to extract sentiment from stock message board postings and classify each message according to an absolute majority vote.

The work of (Dave, Lawrence, & Pennock, 2003) is an often cited reference for different features on which to base classification. Their method assigns every feature a score based on the frequency of this feature in the two classes, the overall classification for a document is then the linear combination of the scores for all features. They experiment with a wide range of features, for example unigrams vs. n-grams, substitutions of product names and/or numbers, WordNet similarity, Part-of-Speech, parsing, negations. They also try out different ways for feature selection by thresholding and experiment with the smoothing of frequencies and weighting of features by strength of evidence.
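The sketch below illustrates this kind of frequency-based feature scoring with a linear combination over a document. The particular score definition (a smoothed, normalised frequency difference) and the toy training data are illustrative assumptions, not the exact formulation used by (Dave, Lawrence, & Pennock, 2003).

# A minimal sketch of scoring features by their class frequencies and
# classifying a document by the sum (linear combination) of feature scores.
from collections import Counter

positive_docs = ["great phone great battery", "excellent screen and great camera"]
negative_docs = ["poor battery and terrible screen", "terrible support poor quality"]

pos_counts = Counter(w for d in positive_docs for w in d.split())
neg_counts = Counter(w for d in negative_docs for w in d.split())

def feature_score(word):
    p = pos_counts[word] + 1     # add-one smoothing
    n = neg_counts[word] + 1
    return (p - n) / (p + n)     # in [-1, 1]: positive favours the positive class

def classify(document):
    total = sum(feature_score(w) for w in document.split())
    return "positive" if total > 0 else "negative"

print(classify("great camera but poor battery"))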

2.3.3.2.2 Classification via rule-based approaches

The simplest rule-based systems for determining the sentiment of a document are based on counting the positive and negative terms the document contains. Terms can consist of single words or multi-word expressions. The overall value for the document can then be the average of all sentiment orientations (Zhou & Chaovalit, 2008), the net word count (Das & Chen, 2007), or a majority vote (Kennedy & Inkpen, 2006). This word-counting method can be enhanced by treating negations (Kennedy & Inkpen, 2006), weighting words (Das & Chen, 2007), and taking into account intensifiers and diminishers that strengthen or weaken an opinion (Kennedy & Inkpen, 2006). Accuracy can be further increased by linguistic preprocessing such as lemmatisation, word sense disambiguation, or part-of-speech tagging. Such systems are often used as a baseline for comparison.
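A minimal sketch of such a term-counting baseline with simple negation and intensifier handling is shown below; the word lists are illustrative placeholders for a proper sentiment lexicon.

# A minimal sketch of the term-counting baseline with negation and intensifiers.
POSITIVE = {"good", "great", "gain", "profit", "strong"}
NEGATIVE = {"bad", "poor", "loss", "weak", "decline"}
NEGATIONS = {"not", "no", "never"}
INTENSIFIERS = {"very": 2.0, "extremely": 2.0, "slightly": 0.5}

def term_count_sentiment(sentence):
    score, weight, negated = 0.0, 1.0, False
    for word in sentence.lower().split():
        if word in NEGATIONS:
            negated = True
        elif word in INTENSIFIERS:
            weight = INTENSIFIERS[word]
        elif word in POSITIVE or word in NEGATIVE:
            polarity = 1.0 if word in POSITIVE else -1.0
            if negated:
                polarity = -polarity
            score += weight * polarity
            weight, negated = 1.0, False      # modifiers apply to the next term only
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(term_count_sentiment("The results were not good and showed a very weak quarter"))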

The second group of systems uses what one normally understands by rules. These rules (or patterns) are manually constructed; systems for learning rules from corpora have not been examined in this work. Rules can be applied sequentially for different tasks (Shaikh, Prendinger, & Mitsuru, 2007), can have different types (Kanayama, Nasukawa, & Watanabe, 2004) or all rules can be stored in a common pattern database (Yi, Nasukawa, Bunescu, & Niblack, 2003). These rule-based systems classify sentiment on the sentence level and do not aggregate it further to the document level. All rule-based classification systems use crisp classification categories, mostly only positive and negative, some use also a neutral category.

2.3.3.2.3 Ontology-guided approaches

The use of ontologies in opinion mining is a recent development. Ontologies are not a classification method in themselves, but they can be used to enhance an existing method. The first to use ontologies for opinion mining were (Zhou & Chaovalit, 2008), who add an ontology to a supervised machine learning technique (n-gram language models) and to a basic term-counting method. In their work, as in most other works, the ontology contains product features and is used only to identify these features in the text; a polarity is then assigned to each feature using the method the ontology enhances. The sentiment orientation of the whole document is then the weighted average of the polarity values of all segments (Zhou & Chaovalit, 2008), the category with the maximum sum of the scores of the child nodes (Zhao & Li, 2009), or the polarity that the majority of concepts are tagged with (Vallés-Balaguer, Rosso, Locoro, & Mascardi, 2010).


(Zhou & Chaovalit, 2008) construct the ontology they use manually, while (Zhao & Li, 2009) extract it automatically from the corpus; both use the evaluation corpus for the construction of the ontology. (Vallés-Balaguer, Rosso, Locoro, & Mascardi, 2010) criticise constructing an ontology from the evaluation corpus and propose a method for using (multiple) existing ontologies.

(Cadilhac, Benamara, & Aussenac-Gilles, 2010) note that all the above-mentioned works actually use the ontology only as a taxonomy, since they rely only on the is-a relation between concepts. They present an approach that also uses other relations between concepts for creating summaries of opinions; e.g., if there is a relation "look at" between the concepts "customer" and "design", the feature "design" can be extracted from the sentence "nice to look at". They do not classify the summarised opinions.

2.3.4 State of the art in semantic feature extraction

Apart from sentiment, FIRST will extract other semantic features using information extraction (IE) methods. IE automatically extracts structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources (Sarawagi, 2008).

Possible entities to be extracted in FIRST are financial instruments, indicators (economic, technical), markets, and organisations (companies). These entities are conceptualised in an ontology, so ontology-based information extraction (OBIE) can be used. In OBIE, an ontology guides the extraction process (Wimalasuriya & Dou, 2010). OBIE is particularly well suited to specific domains, which both narrow the space of potential information and require dedicated domain knowledge for extracting the right facts.

Probably the best known tool for information extraction is GATE1 (Cunningham, Maynard, Bontcheva, & Tablan, 2002). During the European project SEKT, GATE was extended with ontology support (Bontcheva, Tablan, Maynard, & Cunningham, 2004). One of the ontology-based tools is the ontology-based gazetteer, which is aimed primarily at augmenting content (e.g., assigning concepts from the domain ontology to terms in documents) and at populating ontologies with newly identified instances. GATE's rule-based annotation processing engine (the JAPE transducer) was also extended with the ability to take ontologies into account.
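To illustrate the idea of an ontology-based gazetteer (this is a conceptual sketch only, not GATE's implementation), the snippet below looks up surface forms from a tiny, made-up gazetteer and annotates the matched spans with the corresponding ontology concept URIs.

# A minimal sketch of ontology-based gazetteer annotation: surface forms are
# mapped to (made-up) ontology concept URIs. A real gazetteer would respect
# token boundaries and handle inflection.
import re

GAZETTEER = {
    "deutsche bank": "http://example.org/finance#Company",
    "dax": "http://example.org/finance#MarketIndex",
    "share": "http://example.org/finance#FinancialInstrument",
}

def annotate(text):
    annotations = []
    lowered = text.lower()
    for term, concept in GAZETTEER.items():
        for match in re.finditer(re.escape(term), lowered):
            annotations.append((match.start(), match.end(), concept))
    return sorted(annotations)

doc = "Deutsche Bank shares rose as the DAX closed higher."
for start, end, concept in annotate(doc):
    print(f"{doc[start:end]!r} [{start}:{end}] -> {concept}")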

2.4 Approaches to information integration

The broad term information integration describes the task of overcoming the heterogeneity of disparate data sources in hardware, software, syntax, and/or semantics by providing access tools that enable interoperability. The fact that the data is scattered among different heterogeneous sources with differing conceptual representations, i.e., differing data structures and potentially differing data semantics, is encapsulated by these tools and hidden from the user. From the user's perspective, the set of heterogeneous data sources appears as a single, homogeneous data source that can be accessed uniformly.

The underlying motivations for such integration efforts as well as the variety of data characteristics are manifold and both affect the choice of the appropriate integration methodology.

The motivation for integration may be based on strategic or operational considerations. For strategic considerations and analyses, it may not be necessary to integrate the data constantly; integrating data snapshots taken at certain points in time may suffice. For operational analyses, real-time integration of the most up-to-date information may be required.

1 http://gate.ac.uk/


Besides differing data structures and formats1, data characteristics also encompass whether the underlying data sources contain completely complementary data, i.e., refer to different (sub-)domains, or semantically overlapping data. In the latter case, the technical integration has to be enhanced to cater for semantic integration as well.

In the context of FIRST, the data to be integrated is provided by separate technical work packages that have a clear-cut focus, i.e., they refer to a specific (sub-)domain without any overlap. E.g., there is a unique source for extracted sentiments to be stored. Therefore, no semantic integration has to be performed with other sentiments and the integration task is focused on dealing with different data structures and formats.

To enable the integration of data with differing structures or formats, there are basically three straightforward approaches:

Brute-force data conversion: all necessary data conversions are explicitly implemented. For N data sources and M different clients, 2*N*M converters thus need to be implemented and maintained to interchange data between sources and clients.

Global data standardisation: across all data sources, a common data standard is agreed upon, which causes adaptation and transformation efforts in all data sources.

Interchange standardisation: a standard for data interchange is implemented. Each data source needs to be able to map data to and from this interchange format, which requires a total of 2N data format converters.

Because information integration is typically not a one-off conversion but an on-going task, there is the additional constraint that the chosen solution must be robust in terms of adaptability, extensibility, and scalability. The aforementioned traditional approaches fulfil these requirements only to a limited extent; therefore, various concepts have been proposed to better cope with them. E.g., the Context Integration (COIN) framework (Gannon, Madnick, Moulton, Sabbouh, Siegel, & Zhu, 2009) proposes a mediator-based solution that addresses heterogeneous data semantics by extending interchange standardisation with mediators that use local ontologies to perform conversion tasks. The SIRUP2 approach further improves adaptability by focusing the integration effort on user-specific integration; see, e.g., (Ziegler & Dittrich, 2004).

Regardless of the chosen methodology, information integration in any form combines heterogeneous sources to a uniform entity. This entity can either be physically or logically integrated.

2.4.1 Physical integration

Physical integration of information means the actual import and aggregation of all available information into a common database within a uniform data model3. The advantages of this approach are improved performance, as no costly data model transformations are necessary at query time, and the fact that all the available information and its dependencies can be exploited for queries. However, there are several drawbacks. The definition of a uniform data model constrains the flexibility of the overall system in terms of adaptability and extensibility. Furthermore, many legacy systems have to be adapted to work seamlessly with the new integrated data model. As this is most often not possible at reasonable cost, physical integration is mainly used for strategic rather than operational analysis. Therefore, only snapshots of the actual data in the operational data stores are integrated from time to time, but

1 Data structure refers to the logical data structure, e.g., a graph representation or a relational database representation. Data format refers to the syntactical data structure, e.g., RDF or table and column names.
2 Semantic Integration Reflecting User-specific semantic Perspectives
3 This uniform data model reproduces a special form of the aforementioned global data standardisation.


no real-time integration of data is conducted. Well-known examples of such physical integration of snapshots are data warehouses (DWH). Such snapshot analysis is performed to discover tacit knowledge contained in the heterogeneous, disparate data sources1.

If central repositories are continuously fed with updated data, they are referred to as operational data stores (ODS). However, having an ODS in place creates redundancy and it becomes difficult to maintain a consistent state between the redundant parts of the data. ODS are often read-only data sources for users.

The process of populating this physically integrated and consolidated database typically follows the extract, transform, and load steps (the ETL process). The first step is to extract the relevant data from the heterogeneous data sources; this typically requires the data source to be in idle mode to ensure the consistency of the extracted data. Afterwards, the extracted data is transformed to comply with the integrated target data format, which typically includes data cleansing and schema mapping. Depending on the prerequisites, a semantic transformation may be required in addition to this syntactic transformation. Finally, the result of the transformation task is loaded into the consolidated database. As this process requires much information about the data sources, it is mainly used for intra-organisation integration.

2.4.2 Virtual integration

Virtual integration does not import data from the original data sources but rather provides a uniform access layer that encapsulates and hides the distributed structure of the data sources. Access to the data sources is technically performed using wrappers and/or mediators. The advantage is that every update to an original data source is immediately available in the virtually integrated information base as well, which is vitally important in an environment where data changes dynamically. Furthermore, there is no need to adapt existing data sources in any way. The drawback is that such wrappers may fail if the structure of the underlying data sources changes. Furthermore, the indirect access to the data may constrain query versatility if several sources contain data for overlapping (sub-)domains that have to be combined to satisfy a query.

For virtual integration, several approaches exist. Wrappers2 are a structural design pattern3 that solely performs an interface transformation: e.g., a wrapper transforms a query from the virtual, global data model into the data model of a specific data source. Mediators, on the other hand, are a behavioural design pattern. They control the interaction among data recipients and data sources, and the processing of a query, i.e., where necessary they generate several sub-queries, dispatch them to the disparate data sources, and transform their results into one common result in the virtual global data model. The data recipient only interacts with the mediator and does not have to care about the internal structure of the virtually integrated data source or about any transformation requirements. Both the wrapper and the mediator approach are instances of the aforementioned interchange standardisation approach, as the data recipient does not know about any local data models but only uses the interchange data model for its requests.
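The division of labour between wrappers and mediators can be sketched in a few lines of Python. The sketch below is purely illustrative: the sources, field mappings, and query interface are hypothetical and merely show how a mediator dispatches a global-model query to source-specific wrappers and merges their results.

# Illustrative sketch of the wrapper/mediator interplay (class and field names are hypothetical).

class Wrapper:
    """Structural pattern: translates global-model queries into source-specific ones."""
    def __init__(self, source, field_map):
        self.source = source          # a list of dicts standing in for a data source
        self.field_map = field_map    # global attribute name -> local attribute name

    def query(self, global_field, value):
        local_field = self.field_map[global_field]
        # translate matching records back into the global (interchange) data model
        return [{g: rec[l] for g, l in self.field_map.items()}
                for rec in self.source if rec.get(local_field) == value]

class Mediator:
    """Behavioural pattern: dispatches sub-queries to wrappers and merges the results."""
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, global_field, value):
        result = []
        for w in self.wrappers:                 # one sub-query per source
            result.extend(w.query(global_field, value))
        return result                           # one common result in the global model

# usage: two sources with different local schemas, one global "company" attribute
src_a = [{"firm": "ACME", "close": 10.2}]
src_b = [{"company_name": "ACME", "px": 10.3}]
mediator = Mediator([Wrapper(src_a, {"company": "firm", "price": "close"}),
                     Wrapper(src_b, {"company": "company_name", "price": "px"})])
print(mediator.query("company", "ACME"))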

With the arrival of standardised Web services and service-oriented architectures (SOA), the actual technology for implementing the information integration by mediators has experienced significant change – the data recipients are no longer required to use monolithic, SQL-based access tools. Access is distributed among loosely coupled entities that enable a dynamic, context-sensitive composition of individual result sets.

1 Regarding tacit knowledge, see (Nonaka & Takeuchi, 1995)

2 Also known as Adapters

3 Regarding design pattern, see (Gamma, Helm, Johnson, & Vlissides, 1994)


2.4.3 Hybrid approaches

Physical and virtual approaches mark the endpoints of the spectrum of information integration methodologies. There are, of course, hybrid approaches as well, where, e.g., the largely static parts of the data are physically integrated to increase performance, while the more dynamic parts are virtually integrated to ensure the accuracy and timeliness of the available data. Furthermore, some hybrid approaches cache the moderately dynamic data rather than physically storing it.

2.5 Decision support systems

In general, Decision Support Systems (DSSs) are interactive computer-based systems intended to help decision makers utilise data and models to identify and solve problems and make decisions (Sprague & Carlson, 1982). Their main characteristics are:

DSSs incorporate both data and models;

they are designed to assist decision-makers in their decision processes in semi-structured or unstructured tasks;

they support, rather than replace, managerial judgment;

their objective is to improve the quality and effectiveness (rather than efficiency) of decisions.

In the following subsections, we discuss several complementary building blocks that constitute Decision Support Systems, namely, machine learning models and stream-based data mining, qualitative multi-attribute models, and document corpora/stream visualisation techniques that can be utilised to support a decision-making process.

2.5.1 Machine learning models

Since the relationships in financial markets are often highly dynamic and non-linear, hand-crafted models may quickly become obsolete. This leads to a need for machine learning models that are able to adapt efficiently to new market constellations. Depending on the expected output and the constraints on the model's behaviour, more than a dozen different paradigms can be employed to ensure adequate decision support. The most common ones are listed in Table 1.

Decision trees can be used to classify data instances according to their attribute values. A decision tree consists of nodes, branches, and leaf nodes. Each internal node represents an attribute, each branch represents a value of that attribute, and the leaf nodes represent the associated classes. To classify an instance, one starts at the root node of the decision tree, considers the value of the corresponding attribute, and moves down the tree along the branch with the matching attribute value. This procedure is repeated at every node until a leaf node is reached; the leaf node determines the class of the instance (Turban, Aronson, Liang, & Sharda, 2010; Mitchell, 1997). An exemplary decision tree is shown in Figure 6.

Approach | Learning methods | Available software
Decision Trees | CART, C4.5, See5.0, Random Forests | RapidMiner, Weka, C4.5, See5.0
Support Vector Machines | Lagrangian minimisation using Mercer's kernel functions | libSVM, RapidMiner, Weka
Bayesian Networks | Structure: CI-based methods, score-based methods; weighting: probabilistic sampling, likelihood weighting, AIS-BN, etc. | RapidMiner, Weka
Artificial Neural Networks | Perceptron rule, delta rule, backpropagation, etc. | RapidMiner, Weka

Table 1: Overview of machine learning methods.

There are several algorithms available to build decision trees, for example ID3 and its successor C4.5.
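The classification procedure described above can be illustrated with a toy decision tree; the attributes and classes below are hypothetical and chosen only for the example.

# A toy decision tree (hypothetical attributes and classes) traversed as described above.

tree = {
    "attribute": "sentiment",
    "branches": {
        "positive": {"leaf": "buy"},
        "negative": {"leaf": "sell"},
        "neutral": {
            "attribute": "volatility",
            "branches": {"low": {"leaf": "hold"}, "high": {"leaf": "sell"}},
        },
    },
}

def classify(node, instance):
    """Walk from the root down the branch matching the instance's attribute value."""
    while "leaf" not in node:
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node["leaf"]

print(classify(tree, {"sentiment": "neutral", "volatility": "high"}))  # -> "sell"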

Given a set of training examples represented as single points in space (the problem space), the Support Vector Machine (Figure 8) tries to find a hyperplane (decision surface) that divides the points into two mutually exclusive classes with as large a gap between the classes as possible. This is usually done by solving the corresponding constrained optimisation problem via its Lagrangian or a special form of its dual, the so-called Wolfe dual (Burges, 1998).
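For reference, the standard (hard-margin) formulation of this optimisation problem is

\[
\min_{\mathbf{w},b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2} \quad \text{subject to} \quad y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b)\ \ge\ 1, \qquad i = 1,\dots,n,
\]

and the corresponding Wolfe dual, which is the form usually solved in practice, is

\[
\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n}\alpha_i \;-\; \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,(\mathbf{x}_i\cdot\mathbf{x}_j) \quad \text{subject to} \quad \alpha_i \ge 0,\ \ \sum_{i=1}^{n}\alpha_i y_i = 0,
\]

where the dot product of the training examples is replaced by a kernel function in the non-linear case discussed below.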

What distinguishes SVMs from other linear discriminant functions is the following:

The optimal decision surface is defined only by so-called support vectors, i.e., the training examples closest to the decision surface. The support vectors represent the training examples that are most difficult to classify and, as such, are most informative for the classification task. As a result, SVMs tend to be less prone to over-fitting than some other methods (Duda, Hart, & Stork, 2001, p. 262).

If no adequate hyperplane can be found in the original problem space, then a non-linear function is chosen that maps the original space to a higher-dimensional space, in which a linear decision surface exists (Boser, Guyon, & Vapnik, 1992).

Support Vector Machines are only directly applicable to two-class classification problems (Cortes & Vapnik, 1995). However, it is generally possible to reduce a multi-class classification problem to a sequence of dichotomy (two-class) classifications (Har-Peled, Roth, & Zimak, 2003).

A Bayesian (Belief) Network (also called a causal network) is a directed acyclic graph (DAG) in which each node represents a variable that assumes several discrete states, and each link from node A to node B represents the conditional probability of B assuming a certain state given A's state. Contrary to decision trees, in which the associated classes are represented by the leaves of the tree, there is normally only one node encoding the class variable. This node can be located anywhere in the graph, depending on the causal relationships between the variables.
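Formally, such a network encodes the joint distribution of the variables X_1, ..., X_n as

\[
P(X_1,\dots,X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Parents}(X_i)\bigr),
\]

so that each node only has to store the conditional probability of its own states given the states of its parents.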

Typically, the task of learning a Bayesian network can be divided into two subtasks: first, learning the DAG structure of the network, and then determining its parameters. In the most general case, automatically learning a previously unknown graph structure is a more than exponential problem (Kotsiantis, 2007); therefore, several methods exist that aim at reducing the number of possible node combinations (Cooper, Heckerman, & Meek, 1997).

Depending on the structure of a Bayesian network, it may show certain similarities to probabilistic neural networks, thus, enabling us to use connectionist techniques to learn the network‘s parameters (Neal, 1992). For an extensive overview of parameter fitting methods and algorithms, see (Buntine, 1996).

Artificial Neural Networks are inspired by biological neural networks, which consist of interconnected neurons. Accordingly, an Artificial Neural Network consists of interconnected artificial neurons. An artificial neuron is a unit that processes inputs from other neurons and delivers its output to further neurons. Each input into a neuron has a weight that is adjusted during the learning phase; adjusting the weights makes it possible to model non-linear relationships (Mitchell, 1997). Figure 7 shows an exemplary artificial neural network. It consists of an input and an


output layer as well as a hidden layer. Unfortunately, there is no firm rule that states how many hidden neurons should be chosen. To train the network, the input data (e.g., attribute values) are fed into the input layer. Thereafter, the network computes the output, which can be compared with the desired output (e.g., a pre-defined class). If the result is not satisfactory, the weights are adjusted and the output is recalculated.
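As a reference point, the perceptron/delta training rule mentioned in Table 1 adjusts each weight according to

\[
\Delta w_i \;=\; \eta\,(t - o)\,x_i,
\]

where η is the learning rate, t the desired output, o the output actually computed by the unit, and x_i the value of the i-th input; backpropagation generalises this idea to the hidden layers.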

Figure 6: Exemplary decision tree.

Figure 7: Exemplary neural network.

Figure 8: Exemplary Support Vector Machine (SVM).


Under certain circumstances, classification problems can be solved more accurately and efficiently on a so-called reduced problem space. A reduced space contains fewer random variables than the original space. The following dimensionality reduction methods have proven successful in the financial domain:

Fisher‘s Discriminant Analysis

Principal Component Analysis (Wang, 2006)

First attempts have been made to apply nonlinear dimensionality reduction methods, such as Curvilinear Component Analysis, when analysing the dynamics of stock markets. The results were, however, not satisfactory (Lendasse, Lee, de Bodt, Wertz, & Verleysen, 2001).

2.5.2 Stream-based data mining

Recently, a new class of emerging applications has become widely recognised: applications in which data is generated at very high rates in the form of transient data streams. Examples include financial applications, network monitoring, security, telecommunication data management, Web applications, manufacturing, sensor networks, and others. In the data stream model, individual data items may be relational tuples, e.g., network measurements, call records, Web page visits, sensor readings, and so on. However, their continuous arrival in multiple, rapid, time-varying, possibly unpredictable and unbounded streams opens up fundamental new research problems. This rapid generation of continuous streams of information has challenged the storage, computation, and communication capabilities of computing systems.

Over the last decade, data mining, i.e., extracting useful information or knowledge from large amounts of data, has become the key technique for analysing and understanding data. Typical data mining tasks include association mining, classification, and clustering. These techniques help find interesting patterns, regularities, and anomalies in the data. However, traditional data mining techniques cannot be directly applied to data streams. This is because most of them require multiple scans of the data to extract the information, which is unrealistic for stream data. The amount of past events is usually overwhelming, so they can either be dropped after processing or archived separately in secondary storage. More importantly, the characteristics of the data stream can change over time, and the evolving pattern needs to be captured. Furthermore, we also need to consider the problem of resource allocation in mining data streams. Due to the large volume and the high speed of streaming data, mining algorithms must cope with the effects of a system overload. Thus, achieving optimum results under various resource constraints becomes a challenging task.

In this section, we present the theoretical foundations of data stream analysis and critically review the stream data mining techniques.

2.5.2.1 Foundations of stream mining

The foundations on which stream data mining solutions rely come from the fields of statistics, complexity, and computational theory. The online nature of data streams and their potentially high arrival rates impose high resource requirements on data stream processing systems. In order to deal with resource constraints in a graceful manner, many data summarisation techniques have been adopted from the field of statistics. They provide means to examine only a subset of the whole dataset or to transform the data vertically or horizontally to an approximate, smaller-size data representation so that known data mining techniques can be used. Also, techniques from computational theory have been implemented to achieve time- and space-efficient solutions.

Summarisation techniques are often used for producing approximate answers from large databases. They bring together techniques for data reduction and synopsis construction.


Summarisation techniques refer to the process of transforming data to a suitable form for stream data analysis. This can be done by summarising the whole dataset or choosing a subset of the incoming stream to be analysed. When summarising the whole dataset, techniques such as sampling, sketching, and load shedding are used. For choosing a subset of the incoming stream, synopsis data structures and aggregation techniques are used. An excellent review of data reduction techniques is presented in (Barbara & others, 1997). We present the basics of these techniques with examples of their applications in the context of data stream analysis.

Sampling – The idea of representing a large dataset by a small random sample of the data elements goes back to the end of the nineteenth century and has led to the development of a large body of sampling techniques. Sampling is the process of statistically selecting the elements of the incoming stream that would be analysed. The problem with using sampling in the context of data stream analysis is the unknown dataset size and fluctuating data rates. Designing sampling-based algorithms for computing approximate answers that are provably close to the exact answer is an important and active area of research.
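A classic example of such a technique is reservoir sampling, which maintains a uniform random sample of fixed size k even though the length of the stream is unknown in advance; the short Python sketch below is illustrative only.

# Reservoir sampling: a uniform random sample of fixed size k over a stream of unknown length.
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)            # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)          # item i is kept with probability k/(i+1)
            if j < k:
                reservoir[j] = item           # evict a uniformly chosen earlier sample
    return reservoir

print(reservoir_sample(range(1_000_000), 5))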

Sketching involves building a summary of a data stream using a small amount of memory. It is the process of vertically sampling the incoming stream. Sketching has been applied in comparing different data streams and in aggregate queries. The major drawback of sketching is low accuracy.

Load shedding refers to the process of eliminating a batch of subsequent elements (randomly or semantically) from being analysed. It has the same problems as sampling. Load shedding is not a preferred approach with mining algorithms, especially in time series analysis because it drops chunks of data streams that might represent a pattern of interest. Still, it has been successfully used in sliding window aggregate queries.

Synopsis data structures embody the idea of small-space, approximate solutions to massive data set problems. Creating synopsis of data refers to the process of applying summarisation techniques that are capable of summarising the incoming stream for further analysis. Wavelet analysis, histograms, and frequency moments have been proposed as synopsis data structures.

Wavelets are one of the often-used techniques for providing a summary representation of the data. Wavelet coefficients are projections of the given signal (set of data values) onto an orthogonal set of basis vectors.

Histograms approximate the data in one or more attributes of a relation by grouping attribute values into "buckets" (subsets) and approximating the true attribute values and their frequencies based on summary statistics maintained in each bucket.

Aggregation is the representation of a number of elements by one aggregated element using a statistical measure such as the mean or the variance. It is often considered a data-rate adaptation technique in resource-aware mining. The problem with aggregation is that it does not perform well with highly fluctuating data distributions.

Sliding Window is considered an advanced technique for producing approximate answers to a data stream query. The idea behind sliding window is to perform detailed analysis over the most recent data items and over summarised versions of the old ones. This idea has been adopted in many techniques implemented in the comprehensive data stream mining system MAIDS (Dong, Han, Lakshmanan, Pei, Wang, & Yu, 2003). Imposing sliding windows on data streams is a natural method for approximation that has several attractive properties. It is well-defined and easily understood. It is deterministic, so there is no danger that unfortunate random choices will produce a bad approximation. Most importantly, it emphasises recent data, which, in the majority of real-world applications, is more important and relevant than old data.
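As a minimal illustration of the sliding-window idea (not tied to MAIDS or any particular system), the Python sketch below maintains the mean of the most recent items only:

# Sliding-window aggregation: detailed statistics are kept only for the most recent items.
from collections import deque

class SlidingWindowMean:
    def __init__(self, size):
        self.window = deque(maxlen=size)   # old items automatically "slide out"
        self.total = 0.0

    def add(self, value):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]   # subtract the value about to be evicted
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)

w = SlidingWindowMean(3)
print([round(w.add(x), 2) for x in [1, 2, 3, 4, 5]])   # -> [1.0, 1.5, 2.0, 3.0, 4.0]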


2.5.2.2 Stream mining techniques

Mining data streams has been a very attractive field of research in the data mining community for the last few years. A number of algorithms have been proposed to deal with the high-speed nature of data streams, using a variety of techniques. In the following, we present related work on mining data streams, covering clustering, classification, and frequency counting techniques.

2.5.2.2.1 Clustering

Clustering of stream data has been one of the most widely studied data mining tasks in this growing field of research. The centre of attention for many researchers has been the k-median problem, initially posed by Weber (Wesolowsky, 1993). The objective is to minimise the average distance from data points to their closest cluster centres.

A large body of algorithms has been proposed to deal with this problem. (Guha, Mishra, Motwani, & O'Callaghan, 2000; Guha, Koudas, & Shim, 2001) proposed an algorithm that makes a single pass over the data stream and uses small space. (Babcock, Datar, Motwani, & O'Callaghan, 2003) have used exponential histogram (EH) data structure to improve the algorithm presented in (Guha, Mishra, Motwani, & O'Callaghan, 2000). (Charikar, O'Callaghan, & Panigrahy, 2003) have proposed another k-median algorithm that overcomes the problem of increasing approximation factors in the algorithm presented in (Guha, Mishra, Motwani, & O'Callaghan, 2000).

Another algorithm that has captured the attention of many scientists is the k-means clustering algorithm. This algorithm has also been studied analytically by (Domingos & Hulten, Mining High-Speed Data Streams, 2000; Domingos & Hulten, 2001). They have proposed a general method for scaling up machine learning algorithms named Very Fast Machine Learning (VFML). They have applied this method to k-means clustering (i.e., VFKM) and decision tree classification techniques (i.e., VFDT). (Ordonez, 2003) has proposed an improved incremental k-means algorithm for clustering binary data streams. (O'Callaghan, Mishra, Meyerson, Guha, & Motwani, 2002) proposed the STREAM and LOCALSEARCH algorithms for high-quality data stream clustering. (Aggarwal, Han, Wang, & Yu, 2003) have proposed a framework for clustering evolving data streams called CluStream. In (Aggarwal, Han, Wang, & Yu, 2004), the authors have proposed HPStream, a projected clustering algorithm for high-dimensional data streams, which outperforms CluStream. Stanford's STREAM project (Charikar, O'Callaghan, & Panigrahy, 2003) has studied approximate k-median clustering with a guaranteed probabilistic bound.

2.5.2.2.2 Classification

Several authors have studied the idea of implementing a decision tree technique for the classification of stream data. (Ding, Ding, & Perrizo, 2002) have developed decision tree learning based on the Peano count tree data structure. (Domingos & Hulten, 2000; Domingos, Hulten, & Spencer, 2001) have studied the problem of maintaining decision trees over data streams. In (Domingos & Hulten, 2000), they have developed the VFDT system, a decision tree learning system based on Hoeffding trees. (Ganti, Gehrke, & Ramakrishnan, 2002) have developed two algorithms, GEMM and FOCUS, for model maintenance and change detection between two data sets in terms of the data mining results they induce. The algorithms have been applied to decision tree models and frequent itemset models. Techniques such as decision trees are useful for one-pass mining of data streams, but they cannot be easily used in the context of an on-demand classifier in an evolving environment.
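The key statistical tool behind Hoeffding trees is the Hoeffding bound: after n independent observations of a random variable with range R, the observed mean deviates from the true mean by more than

\[
\epsilon \;=\; \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}
\]

with probability at most δ. VFDT uses this bound to decide, with confidence 1 - δ, when enough examples have been seen to reliably select a split attribute.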

The concept drift problem in stream data classification has been addressed by several authors. (Wang, Fan, Yu, & Han, 2003) have proposed a general framework for mining concept-drifting data streams; the proposed technique uses weighted classifier ensembles to mine data streams. (Last, 2002) has proposed an online classification system which dynamically adjusts


the size of the training window and the number of new examples between model re-constructions to the current rate of concept drift. (Aggarwal, Han, Wang, & Yu, On Demand Classification of Data Streams, 2004) have presented a different view on the data stream classification problem, in which simultaneous training and testing streams are used for dynamic classification of data sets.

2.5.2.2.3 Frequency counting

In contrast to clustering and classification, frequency counting has not attracted much attention among researchers in the field. Counting frequent items or itemsets is one of the issues considered in frequency counting. (Cormode & Muthukrishnan, 2003) have developed an algorithm for counting frequent items. The algorithm maintains a small-space data structure that monitors the transactions on the relation and, when required, quickly outputs all hot items without rescanning the relation in the database. (Giannella, Han, Pei, Yan, & Yu, 2003) have developed a frequent itemset mining algorithm over data streams. They have proposed the use of tilted windows to calculate frequent patterns for the most recent transactions. (Manku & Motwani, 2002) have proposed and implemented an incremental algorithm for approximate frequency counting in data streams; the algorithm uses all the historical data to calculate the frequent patterns.
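To illustrate the small-space counting idea (using the classic Misra-Gries summary rather than the specific algorithms cited above), the following Python sketch keeps at most k-1 counters and still retains every item whose frequency exceeds n/k:

# Misra-Gries summary: approximate frequent items over a stream using at most k-1 counters.
def misra_gries(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            for key in list(counters):          # decrement all counters instead of adding one
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters                             # every item with frequency > n/k survives

print(misra_gries("abacabadabacaba", 3))        # 'a' clearly dominates this toy stream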

2.5.2.3 Conclusions

The spread of the data stream phenomenon in real-life applications has greatly influenced the development of stream mining algorithms. Mining data streams has raised a number of research challenges for the data mining community. Due to resource and time constraints, many summarisation and approximation techniques have been adopted from the fields of statistics and computational theory. Based on these foundations, a number of clustering, classification, and frequency counting techniques have been developed.

Mining data streams is an immature, growing field of study. There are many open issues that need to be addressed. The development of systems that will fully address these issues is crucial for accelerating scientific discoveries in the fields of physics and astronomy, as well as in business and finance-related domains.

2.5.3 Qualitative multi-attribute models

Qualitative decision support models belong to a broad category of decision support systems (DSS), known as model-driven DSS (Power, 2002). Model-driven DSS emphasise access to and manipulation of statistical, financial, optimisation and/or simulation models. Models use data and parameters provided by decision makers to aid decision makers in analysing a situation, for instance, assessing and evaluating decision alternatives and examining the effect of changes. Simple statistical and analytical tools provide the most elementary level of functionality, but, in general, model-driven DSS use complex financial, simulation, optimisation, or multi-criteria models to provide decision support.

For sentiment analysis in FIRST, we envision the use of multi-attribute (or multi-criteria) models, particularly their special form, qualitative multi-attribute models. In decision support, multi-attribute decision models (Bouyssou, Marchant, Pirlot, Tsoukias, & Vincke, 2006) are generally used to assess decision alternatives, taking into account multiple and possibly conflicting criteria. Each alternative is represented by a set of properties or features, which are first evaluated individually and then aggregated by the model into an overall utility: the higher the utility, the better the alternative. Usually, the model takes some form of a hierarchical structure which represents a decomposition of the overall decision problem into smaller and possibly more manageable sub-problems.


Figure 9 shows the general structure of hierarchical multi-attribute models. A model itself is represented by a triangle, where:

input data enters the model at the bottom of the triangle,

input data is aggregated in the model in a bottom-up way, following a hierarchical structure of variables (called attributes),

the aggregation itself is carried out by the so-called value functions (these are not explicitly shown in Figure 9),

output assessment of decision alternatives is provided by one or more output variables located at or near the top of the model.

Figure 9: General structure of a hierarchical multi-attribute model.

Input, intermediate, and output data represent "objects" or "decision alternatives" that are assessed by a particular model. In the context of sentiment analysis in FIRST, a typical "object" will be an individual document, and a typical evaluation result will be an assessment of its sentiment. A higher-level assessment of a stream of documents is also possible with qualitative models, for example for detecting illegal trading strategies and fraud.

In FIRST, we will primarily use a specific class of multi-attribute models, known as qualitative multi-attribute models (Žnidaršič, Bohanec, & Zupan, 2008). Unlike traditional methodologies (Bouyssou, Marchant, Pirlot, Tsoukias, & Vincke, 2006), qualitative models use qualitative (symbolic, discrete) attributes instead of numerical ones. Also, the aggregation of values in the model is specified with decision rules, which are typically defined in the form of 'if-then' rules or lookup tables. This makes qualitative models very suitable for less formalised decision problems in which approximate judgment prevails over precise numerical calculations. This approach has already proven useful in many real-world decision-making problems, including those in the EU projects SolEuNet (FP5-IST-1999-11495), ECOGEN (FP5-QLK5-2002-01666), SIGMEA (FP6-SSP1-2002-502981), Co-Extra (FP6-FOOD-2005-7158), and HEALTHREATS (FP6-STREP-150107).
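As a purely hypothetical illustration (not an actual FIRST or DEXi model), the Python sketch below aggregates two qualitative input attributes into an overall qualitative assessment using a lookup table of decision rules:

# A toy qualitative model in the spirit described above (attributes and rules are invented).

SENTIMENT_RULES = {   # lookup table aggregating two qualitative inputs bottom-up
    ("negative", "negative"): "negative",
    ("negative", "neutral"):  "negative",
    ("negative", "positive"): "neutral",
    ("neutral",  "negative"): "negative",
    ("neutral",  "neutral"):  "neutral",
    ("neutral",  "positive"): "positive",
    ("positive", "negative"): "neutral",
    ("positive", "neutral"):  "positive",
    ("positive", "positive"): "positive",
}

def assess_document(tone_of_title, tone_of_body):
    """Aggregate two qualitative input attributes into an overall qualitative assessment."""
    return SENTIMENT_RULES[(tone_of_title, tone_of_body)]

print(assess_document("neutral", "positive"))   # -> "positive"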

Multi-attribute models can be developed in at least three different ways:

1. The first way, which is the most common, is expert modelling, where the structure of the model, its variables, value functions, and decision rules are provided by an expert or a group of experts. A model is usually 'hand-crafted' in face-to-face meetings of experts and decision analysts, and supported by suitable model-development software. For



the development of qualitative models, we intend to use the freely available computer program DEXi (Bohanec, 2008) (Figure 10)1. An example of a complex qualitative model that has been developed in this way is a model for the economic and ecological assessment of maize cropping systems (Bohanec et al., 2008).

2. The second approach is to create decision rules from data, where a data table, for example, a table containing some simulation results, is converted into decision rules. Furthermore, such rules can be supplemented by expert-defined rules, so that the model is partly created from data and partly by an expert. This approach has been successfully used in the EU project SIGMEA to create a decision support system SMAC, dealing with coexistence issues of genetically-modified and conventional maize (Bohanec, Messéan, Angevin, & Žnidaršič, 2006).

3. The third approach is model revision (Žnidaršič & Bohanec, 2007). Here, an initial model, which is potentially inaccurate or incomplete, is developed by an expert. Then, this model is confronted with a data stream of examples. An automatic algorithm compares the model and examples, and modifies decision rules in the model so that it becomes more complete and consistent with the examples.

Figure 10: Screenshots of DEXi, a computer program for qualitative multi-attribute decision modelling.

Why are the qualitative multi-attribute models useful in FIRST? They have four properties of which at least the first three seem very interesting and convenient for sentiment analysis:

1. First and foremost, qualitative multi-attribute models complement machine-learned models very well (see Sections 2.5.1 and 2.5.2). Machine-learned models are induced in an automatic or semi-automatic way from data, but may have deficiencies because of flaws in the data. For instance, noise or some other data imperfection can cause learned models to be incomplete, inaccurate, or just plain wrong. Qualitative models offer a possibility to improve or extend the automatically learned models by adding new indicators (e.g., those that are not available in the data or those that turn out to be difficult for a machine-learning algorithm) and by formulating decision rules from expert knowledge. In this way it is possible to develop models that are partly machine-learned from data and partly formulated by an expert (Bohanec, Messéan, Angevin, & Žnidaršič, 2006).

1 The figure shows a part of a complex multi-attribute model for the assessment of cropping systems and a

visualisation of some results. The model was developed in the EU project SIGMEA (FP6-SSP1-2002-502981).


2. Qualitative models use a very comprehensible representation of knowledge (hierarchical structure of variables and decision rules). This supports and fosters communication between experts, as well as provides a clear representation of knowledge in the problem domain.

3. Once developed, qualitative models specify a working method of evaluation of ―objects‖, which can be easily embedded into software, e.g., for sentiment analysis of documents.

4. In decision analysis, a highly regarded feature of multi-attribute models is their ability to support the analysis of decision alternatives, for instance through sensitivity analysis, 'what-if' analysis, etc. It seems that in FIRST, where we will be dealing with automatic assessment of sentiment in document streams, this feature will be less important, but it is nevertheless worth mentioning and possibly exploring.

2.5.4 Visualisation

Visualisation is an extremely useful tool for providing overviews and insights into overwhelming amounts of data. A visualisation pipeline usually consists of two major phases. In the first phase, unstructured data is represented in the form of a model (e.g., through data mining and machine learning or through an information extraction process that involves natural language processing). In the second phase, the modelled knowledge is visualised to the user in a way that supports the decision-making process.

A topic space is a high-dimensional bag-of-words space in which documents are represented as points. To visualise a topic space, we need to project the documents onto a 2-dimensional canvas so that the distances between the points reflect the similarities between the corresponding documents. One such visualisation tool is Document Atlas (Fortuna, Mladenic, & Grobelnik, 2006), which is also integrated into the ontology-learning tool OntoGen (Fortuna, Grobelnik, & Mladenic, 2006). It was developed in the course of the European project SEKT. Another algorithm for topic space visualisation was proposed in the European project TAO (Paulovich, Nonato, & Minghim, 2006; Grcar, 2008). Note that topic spaces can provide different views on the same data. They can group documents according to word co-occurrences (default view), sentiment, geographical locations, industry sectors, and so on. In Figure 11, company descriptions provided by Yahoo! Finance are visualised in the form of a topic space. In the visualisation, each point represents a company. Points grouped together represent companies providing similar services and/or products. The strongest clusters are labelled (this was done manually and is not part of the visualisation).
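As a rough stand-in for such a topic-space layout (this is not the Document Atlas or OntoGen implementation), the Python sketch below represents a handful of invented documents as TF-IDF vectors and projects them onto two dimensions with scikit-learn:

# A simple stand-in for topic-space visualisation: TF-IDF vectors projected to 2-D.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "bank reports record quarterly profit",
    "insurance company raises dividend",
    "oil prices fall on weak demand",
    "gas producers cut output as prices drop",
]

tfidf = TfidfVectorizer().fit_transform(docs)            # documents as points in the topic space
xy = TruncatedSVD(n_components=2).fit_transform(tfidf)   # 2-D layout; distances approximate similarity

for doc, (x, y) in zip(docs, xy):
    print(f"({x:5.2f}, {y:5.2f})  {doc}")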


Figure 11: Topic space of company descriptions provided by Yahoo! Finance.

While topic spaces give insights into large collections of documents with respect to topic coverage, temporal visualisations provide valuable insights into how topics evolve through time. One such visualisation is ThemeRiver (Caruana, Gehrke, & Joachims, 2005). It visualises thematic variations over time for a given time period. The visualisation resembles a river of coloured "currents" representing different topics. A current narrows or widens to indicate a decrease or increase in the strength of the corresponding topic. The topics of interest are predefined by the analyst as a set of keywords. The strength of a topic is computed as the number of documents containing the corresponding keyword. Shaparenko et al. build on top of the ThemeRiver idea to analyse the dataset of NIPS1 publications (Shaparenko, Caruana, Gehrke, & Joachims, 2005). They identify topics automatically by employing the k-means clustering algorithm. In Figure 12, a temporal visualisation of business-related scientific publications is shown. To create the visualisation, a database of scientific publications was searched for publications containing the keyword "business". We can see that "Estonia market development" was a hot topic in 1994 (when Russian troops left Estonia2), while in 2006,

1 The Conference on Neural Information Processing Systems (NIPS) is a machine learning and computational neuroscience conference held every December in Vancouver, Canada.
2 See http://en.wikipedia.org/wiki/Estonia



―software‖, ―enterprise‖, and ―financial‖ play more important roles. This visualisation is taken from the IST World1 portal <http://www.ist-world.org>.

Figure 12: Canyon flow temporal visualisation of scientific publications through time.

One of the recent European projects dealing with visualisation is VIDI: Visualising the Impact of the Legislation by Analysing Public Discussions using Statistical Means (EP-08-01-14) (Grcar, Stojanovic, & others, 2009). VIDI is aimed at analysing the impact of enacted policies on public opinion. Through interactive visualisation of the respective discussion forums, VIDI will support decision makers in taking public opinion into account in subsequent revisions of the corresponding policy. Several visualisation components were developed and employed in the European project IST World1 (SSA 2004-IST-3-015823), where visualisation is used, on one hand, to give overviews and depict trends of topics and competences of research institutions, researchers, and projects, and, on the other, to show how research institutions collaborate.

When it comes to visualising document streams in FIRST, we are mostly interested in the algorithms belonging to the family of temporal pooling algorithms (Albrecht-Buehler, Watson, & Shamma, 2005). These techniques maintain a buffer (a pool) of data instances: new instances constantly flow into the buffer, while outdated instances flow out of the buffer. The content of the buffer is visualised to the user and the visualisation is at all times synchronised with the dynamic content of the buffer. In (Albrecht-Buehler, Watson, & Shamma, 2005), the authors discuss TextPool, a system for document stream visualisation based on temporal pooling. They extract salient terms from the buffer and construct the term co-occurrence graph. They employ a force-directed graph layout algorithm to visualise the graph to the user.
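The temporal pooling idea can be sketched in a few lines of Python; the example below is illustrative only and differs from TextPool in its details (it simply keeps the most recent documents in a fixed-size pool and recomputes term co-occurrence counts from the pool):

# Temporal pooling sketch: a fixed-size pool of recent documents and the term co-occurrence
# counts derived from it (illustrative only).
from collections import deque, Counter
from itertools import combinations

class TemporalPool:
    def __init__(self, size):
        self.pool = deque(maxlen=size)      # outdated documents flow out automatically

    def add(self, terms):
        self.pool.append(set(terms))

    def cooccurrence(self):
        counts = Counter()
        for terms in self.pool:             # rebuild the co-occurrence "graph" from the pool
            counts.update(combinations(sorted(terms), 2))
        return counts

pool = TemporalPool(size=3)
for doc in [["greece", "debt"], ["debt", "bond"], ["greece", "bond", "debt"], ["oil", "opec"]]:
    pool.add(doc)
print(pool.cooccurrence().most_common(2))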

Krstajić et al. (Krstajić, Mansmann, Stoffel, Atkinson, & Keim, 2010) recently presented a system for large-scale online visualisation of news collected by the European Media Monitor (EMM). EMM collects and preprocesses news from several news sources; most notably, it extracts named entities, which are then used in the visualisation phase. Each named entity corresponds to one topic in a ThemeRiver-like visualisation. They complement this visualisation with a named-entity co-occurrence graph visualisation. The system processes roughly 100,000 news items per day in an online fashion.

In Section 4.2, we present our preliminary work on a document stream visualisation pipeline. We mostly build on top of the algorithm for "static" document corpora visualisation presented by Paulovich et al. (Paulovich, Nonato, & Minghim, 2006). They utilise several methods consecutively to compute a topic space layout (Figure 11 was created by following their approach). More details on this method, as well as on our adaptation of it for the purpose of document stream visualisation, can be found in Section 4.2 and further on in Annex 5.

1 IST World portal: http://www.ist-world.org/ . IST World project homepage: http://ist-world.dfki.de/ .


2.6 General-purpose scaling techniques

In FIRST, the main objective is to develop a large-scale financial data analysis platform. By ―large-scale‖, we mainly refer to the ability to process massive data streams flowing into the system in real time. From this perspective, we are not interested in scaling up the number of components constituting the system, but rather the throughput of the system. This means that the system needs to be able to process data at a higher rate than that at which the data is flowing into the system.

In this section, we discuss two general-purpose techniques that can be used in a software system to increase its throughput. First, we discuss pipelining which refers to the parallel execution of software components in the ―horizontal‖ direction (i.e., one after another). Even though the components are executed one after another, the pipeline maximises the throughput by processing several data units at the same time (see Section 2.6.1). Alternatively or even in addition, parallelisation in the ―vertical‖ direction can be implemented so that two or more components, processing the same data unit, are executed simultaneously (see Section 2.6.2).

The following subsections discuss these techniques in greater detail. Instead of providing (rather complex) theoretical background, the main principles are demonstrated on several toy examples. This section represents the preliminary basis for the FIRST scaling strategy that will be devised in the scope of WP2 at Month 12 (Deliverable D2.3).

2.6.1 Processing pipelines

In this section, the basics of processing pipelines are described. First, the definition and motivation are presented. Measures for evaluating pipelines are presented next, along with some scenarios that should be avoided when constructing pipelines.

A processing pipeline is a set of data processing elements connected sequentially so that the output of an element is the input of the next one. These processing elements are executed simultaneously to maximise the throughput of the pipeline (Virant, 1991; Kleinrock, 1996).

Although the term "pipeline" comes from a rough analogy with physical pipelines (such as those transporting water, gas, or oil), the processing pipeline is very similar to the assembly line concept in manufacturing. Consider an assembly line with three stages (stages A, B, and C), each taking ten minutes. Each stage can only process one product at a time. When the first stage has finished processing a product, it passes it on to stage B, and a new product can enter the line at stage A. Similarly, when the two products move to their next consecutive stages, a new product enters the line at the first stage. When the first product leaves the assembly line at the end of stage C, 30 minutes have passed since its entry, but additional products will come off the assembly line at 10-minute intervals.

Figure 13: An illustration of an example pipeline with three stages.

Two measures are of interest when dealing with pipelines. The pipeline throughput is the amount of processed data that exits a pipeline during a certain time period – it is the measure of how often a single unit of data exits the pipeline. The pipeline latency is the amount of time that a single unit of data takes to travel through the pipeline.

The motivating example lacked important variables such as the arrival rate of data units at the start of the pipeline. Stages also seldom take the same time to process data, so queues are

Page 50: D2.1 Technical requirements and state-of-the-artfirst.ijs.si/FirstShowcase/Content/reports/D2.1.pdf · ―Problem analysis‖ and ―State of the art‖) and report template (Annex

D2.1

© FIRST consortium Page 50 of 90

formed at the start of each stage. Data arrival rate should be considered when constructing pipelines to avoid undesirable scenarios.

Consider the above example with the following latencies for the individual stages: 1 second, 2 seconds, and 1 second, respectively. The arrival rate of data units is 1 unit per second. By observing the first stage, it can be determined that the departure rate of data units from the first stage is also 1 unit per second (determined by calculating 1/L, where L is the latency of the stage). The departure rate of the first stage also equals the arrival rate of the second stage, since they are consecutive. The departure rate of the second stage is 0.5 units per second (determined by calculating 1/L = 1/2). Since the departure rate of the second stage is smaller than the arrival rate, the size of the queue at the second stage will eventually overflow. Note that the throughput of such a pipeline is 0.5 units per second (the departure rate of units for the entire pipeline). The latency of such a pipeline cannot be computed, since it increases with each consecutive unit arriving at the pipeline. If a limit is imposed on the queue size, not all of the units can be processed by such a pipeline. In order to avoid this, the arrival rate into the pipeline should not be higher than the throughput of the pipeline.
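More generally, assuming one processing unit per stage and stage latencies L_1, ..., L_k, the sustainable throughput of the pipeline and the stability condition for an arrival rate λ can be written as

\[
\text{throughput} \;=\; \frac{1}{\max_i L_i}, \qquad \lambda \;\le\; \frac{1}{\max_i L_i};
\]

in the example above, the maximum stage latency is 2 s, giving a throughput of 0.5 units per second.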

Units seldom arrive into the pipeline at constant time intervals. Consider units arriving into the pipeline in bursts of x units every 10 seconds. Since the throughput of this pipeline is 0.5 units per second, in order to avoid the problem of overflowing, a burst should not exceed 5 units per 10 seconds. This maximum can be determined by multiplying the throughput of the pipeline by the burst interval. Considering a burst size of 5 units per 10 seconds, it is possible to determine how long the units wait in line on average. Even though the throughput of the pipeline is 0.5 units per second, the total processing time for the 5 units is 12 seconds (the time before the pipeline fills up must also be accounted for). The times spent in the pipeline by the units in this particular case are 4 s, 6 s, 8 s, 10 s, and 12 s. The latency of this pipeline is therefore 8 s (the arithmetic mean). The times spent by the units waiting in queues are 0 s, 2 s, 4 s, 6 s, and 8 s. The average waiting time for a unit in the pipeline is therefore 4 s. The sum of the latencies of the stages is 4 s, which means that each unit on average spends half of the time waiting in queues and half of the time being processed. An illustration of the pipeline processing for the five units in the burst can be observed in Figure 14. In the figure, each cell represents one time unit (one second). Cells marked with a dash (-) represent the time units spent waiting in queues. Each row represents the perspective of one data unit. Looking at the first two rows, for example, it is possible to see that the second data unit is being processed in stage A at the same time the first data unit is being processed in stage B.

Unit 1: A B B C
Unit 2: - A - B B C
Unit 3: - - A - - B B C
Unit 4: - - - A - - - B B C
Unit 5: - - - - A - - - - B B C

Figure 14: A Gantt chart of the burst of five units in the pipeline.
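The per-unit times shown in Figure 14 can be reproduced with a short tandem-queue computation (a sketch assuming FIFO queues and one processing unit per stage):

# A small tandem-queue computation that reproduces the per-unit times of Figure 14.
# Each unit waits for the stage to become free, is processed, and moves on.
def pipeline_times(arrivals, latencies):
    finished = []
    stage_free_at = [0.0] * len(latencies)       # when each stage becomes free again
    for arrival in arrivals:
        t = arrival
        for s, lat in enumerate(latencies):
            t = max(t, stage_free_at[s]) + lat   # wait for the stage, then process
            stage_free_at[s] = t
        finished.append(t)
    return finished

arrivals = [0, 0, 0, 0, 0]                 # a burst of five units
latencies = [1, 2, 1]                      # stages A, B, C
done = pipeline_times(arrivals, latencies)
print([d - a for d, a in zip(done, arrivals)])   # -> [4, 6, 8, 10, 12], i.e., mean latency 8 s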

It has been shown that in order to maintain a stable pipeline (i.e., a pipeline where the queue does not overflow), the throughput should be higher than or equal to the arrival rate of the data units into the pipeline.

The speedup of a processing pipeline, in contrast to processing each unit of data separately in a non-pipelined process, can be demonstrated easily with the above example. In a non-pipelined process, a data unit takes 4 seconds to be processed. For five data units, 20 seconds are needed to process all the data; this was calculated by multiplying the number of data units by the time a single unit needs to pass through all stages (the sum of the stage latencies, 4 s). In a processing pipeline, the same amount of data takes 12 seconds. This was calculated by dividing the number of data units by the pipeline throughput (5 / 0.5 = 10 s) and adding


the time units needed for the pipeline to fill up. Each consecutive burst of 5 units will exit the pipeline in increments of 10 seconds. If the initial delay is disregarded, the speedup ratio of pipelining can be expressed as the product of the non-pipelined processing time per data unit and the pipeline throughput (here, 4 s × 0.5 units/s = 2).

2.6.2 Parallelisation

In the previous section, where pipelines were presented, it was stated that the stability of the pipeline depends on the throughput of the pipeline and the arrival rate of the data units into the pipeline. In this section, we propose two methods for pipeline parallelisation in order to increase the pipeline throughput and decrease the latency of the pipeline.

Pipelining itself is a form of parallelisation, as the stages execute at the same time but on different units of data. A further level of parallelisation can be achieved either by adding more processing units for the same stage (which reduces the load on the individual processing units) or by parallelising a certain stage or part of the pipeline (for this, the data units and the computational problem must be such that different stages can process different parts of the same data unit at the same time).

The first proposed solution is the parallelisation of processing stages of the pipeline by offering the same service on separate processing units. Consider the example from the previous section: a pipeline consisting of three stages (A, B, and C) with respective latencies of 1 s, 2 s, and 1 s. If the arrival rate of data units at the start of the pipeline is higher than the throughput of the pipeline (0.5 units per second), the throughput must be increased to maintain the stability of the pipeline (guaranteed processing for all data units). This can be achieved by parallelising stage B, so that two data units may be processed at the same time in this stage. An example of such a pipeline is shown in Figure 15.

Figure 15: An example pipeline with multiple processing units forming the same stage.

Supposing that there is always a large number of data units in the queue, the throughput of this parallelised stage is 1 unit per second. This optimisation does not decrease the latency of the stage, meaning that it only increases the throughput when the arrival rate for the stage is high enough. This can be demonstrated by an example: parallelising stage C would not increase the throughput of the pipeline or of the stage itself, because the arrival rate at the entry of stage C is not high enough to utilise all the processing units. Any surplus processing units would remain unused during the runtime of the processing pipeline. For an arrival rate of 1 unit per second, adding only one additional processing unit for stage B increases the throughput of the pipeline to its theoretical maximum (i.e., the arrival rate at the start of the pipeline). An example Gantt chart of such a pipeline with an arrival rate of 1 data unit per second is presented in Figure 16.


Unit 1: A B B C
Unit 2:   A B B C
Unit 3:     A B B C
Unit 4:       A B B C
Unit 5:         A B B C

Figure 16: A Gantt chart of a pipeline with two B processing units and the arrival rate of 1 data unit per second.

To properly adjust a pipeline to higher data arrival rates, the stages with the highest latency should be parallelised, but only to such a degree that all the parallelised processing units are utilised most of the time. The utilisation of each processing unit may be determined by the ratio of the data arrival rate to the maximum throughput of the stage. If this value is greater than one, the size of the queue will continuously grow, leading to an unstable pipeline. If the value is exactly one, the processing unit or units will be fully utilised at all times.
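One way to express this rule of thumb: if stage i has latency L_i and is replicated over k_i processing units, its utilisation under an arrival rate λ is

\[
\rho_i \;=\; \frac{\lambda}{k_i / L_i} \;=\; \frac{\lambda\,L_i}{k_i},
\]

and stability requires ρ_i ≤ 1, i.e., k_i ≥ ⌈λ L_i⌉. In the example above, stage B with λ = 1 and L_B = 2 s needs two processing units, while stage C already satisfies the condition with one.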

The second proposed solution to pipeline overflowing, i.e., parallelising a certain stage or part of the pipeline, depends on the type of data and the type of problem the pipeline is solving. The idea is to divide a data unit into several (smaller) units, process these units in parallel, and later join them into a single (intermediate) result for further processing. Consider the pipeline in the above example with two extra stages: one for splitting the data and one for joining it. An example of such a pipeline is shown in Figure 17.

Figure 17: An example pipeline where a single unit may be processed in a parallel fashion.

Let us assume for this example that the latencies of the splitting and joining operations are negligible. Consider the example in the previous section, where the arrival rate is a burst of five data units every ten seconds. A Gantt chart showing a total runtime of 10 seconds is shown in Figure 18.



Figure 18: A Gantt chart of the runtime of the pipeline shown in Figure 17.

The parallelisation shown in this example does not increase the throughput of the pipeline; it only decreases the latency and the waiting times of the data units. This sort of division of the stages should be used when a data unit does not require the utilisation of all processing stages. The splitting stage can determine which stages the data requires, which can in the end increase the throughput, but it does not guarantee the original ordering of the data units at the output (because the latencies depend on the data units themselves).


3 Requirements study

This section lists requirements from the technical point of view. Since the stream-mining components tend to be relatively "expensive" technologies, the requirements are mostly posed by the components and are to be respected by the hardware and software infrastructure hosting these components (i.e., bottom-up requirements). This section thus addresses requirements towards hardware (e.g., internet bandwidth, number of processing units, and storage capacity), towards software infrastructure (e.g., the ability to execute pipelines), towards data storage (e.g., uniform data access), and towards runtime environments (e.g., .NET Framework and Java Runtime Environment). Some requirements, however, are posed in a top-down manner. These ensure seamless communication between the components, sufficient pipeline throughput, and availability of the most recent ontology, and demand specific features from certain parts of the analytical pipeline (e.g., the data acquisition part of the pipeline needs to perform boilerplate removal).

In Section 3.1, we first give a high-level overview of the overall FIRST process. The overview is given for the reader‘s convenience, as the requirements, presented later on, refer to the specific stages in the pipeline as well as to the pipeline as a whole.

It is important to note that the requirements presented in Section 3.2 are aligned with the use case definition (D1.1) and requirement analysis (D1.2) thus ensuring the compatibility of the technical requirements and the requirements posed by the use cases.

3.1 Process overview

One of the goals in FIRST is to process massive non-structured and semi-structured data streams in a uniform manner. Once acquired, the data passes through multiple stages where it is processed, resulting in relevant knowledge being extracted. Each stage analyses and processes the received data, enriches it with annotations, and passes it to the next stage. In the final stage, the outcome is presented to the end-user. The whole process can be divided into 6 main stages as illustrated in Figure 19.

Figure 19: High-level FIRST process overview.

The depicted process covers all functional parts with their proper order and direction of data processing. Pipelining is the fundamental idea of near real-time massive stream processing in FIRST. Every stage of the pipeline is able to process data at the same time, ensuring the high throughput that is required for handling massive amounts of data.

The six stages shown in Figure 19 are the following:

• Data Acquisition: acquiring semi-structured and unstructured data, removing noise and boilerplate.

• Information Extraction: named entity extraction and natural language processing.

• Ontology Evolution: evolving ontologies with regard to the data stream being processed.

• Sentiment Analysis: analysing sentiment related to identified objects.

• Decision Support: providing means for deriving decisions based on modelled knowledge.

• Visualisation: end-user visualisation for financial decision support.


3.2 Technical requirements

The requirements presented in this section form a guideline for the project development from the technical point of view. Together with the requirements coming from the use-case scenarios, they represent a coherent description of features as they are perceived from both perspectives:

• the end-user point of view, coming from the definition of use cases (i.e., functional and non-functional requirements defined in D1.2);

• a strictly technical point of view (technological standards, architectural requirements, bottom-up analysis).

Table 2 lists the identified technical requirements, which constitute the baseline technological perspective of the system and form an important input for the architecture definition and system design. In this deliverable, we focus only on specifying the technical requirements; deliverable D2.2 will later analyse and assess them together with the use-case requirements to ensure their feasibility, coherence, and priority with regard to the overall system. At this point, the requirements presented in Table 2 are still subject to changes.

Each requirement below is given with its reference number, topic, required features, and the responsible WP or organisation (Resp.).

Hardware infrastructure

R1.1 Internet connection bandwidth (Resp.: JSI). The infrastructure will provide sufficient internet connection bandwidth. The internet connection bandwidth must not represent the bottleneck in the data acquisition process. JSI will provide hardware infrastructure with a theoretical bandwidth of 1 Gbit/s.

R1.2 Concurrent execution of processes – hardware infrastructure (Resp.: JSI). The hardware system hosting the FIRST information system will be able to execute several processes concurrently, as discussed in Section 2.6. JSI will provide hardware infrastructure with 4 processors, 12 cores each (resulting in the ability to run 48 concurrent processes).

R1.3 Memory and persistent storage (Resp.: JSI). The hardware system hosting the FIRST information system will provide sufficient memory capacity and sufficient persistent-storage capacity. JSI will provide hardware infrastructure with 256 GB of RAM, 220 GB of solid-state disk (SSD) capacity, and 25 TB of hard disk (HDD) capacity.

Software infrastructure

R2.1 API for external access (Resp.: WP2 / WP7). The FIRST information system will expose its core functionality to external applications through the FIRST API. The API must provide the functionality necessary for the implementation of the 3 devised use cases (see D1.1 and D1.2).

R2.2 Flexibility of the infrastructure (Resp.: WP2 / WP7). The FIRST information system will provide the means for implementing all the devised use cases on top of a common framework, minimising the effort for separate use case implementation. Core components will remain unchanged for the 3 FIRST use cases. Adaptation will be done by providing specific GUI components, knowledge bases, and decision support models. The architecture should be general enough for potential exploitation in different contexts.

R2.3 Concurrent execution of processes – software infrastructure (Resp.: WP2). The software infrastructure will be able to execute several processes concurrently, as discussed in Section 2.6 (under the assumption that the underlying hardware supports this feature). The actual behaviour of the processing pipelines must resemble the corresponding theoretical models for pipelining and parallelisation to at least 50% of the theoretical performance.

R2.4 Logging & monitoring (Resp.: WP2 / WP7). The software infrastructure will provide facilities for logging and monitoring for the purpose of problem detection and performance assessment. A log4j-like logging API (log4j is available at http://logging.apache.org/log4j/1.2/) will be provided to the hosted software components.

System integrity

R3.1 Stability (Resp.: WP2 / WP3 / WP4 / WP5 / WP6). The developed components will perform most of the tasks (i.e., all the envisaged tasks with the exception of some boundary cases) in the scope of the 3 use cases without crashing the system. The system, on the other hand, will assure flawless execution of the processing pipeline in the scope of the 3 use cases.

Analytical pipeline (see Section 2, Figure 1)

R4.1 Pipeline latency (Resp.: WP3 / WP4 / WP6). The latency of the implemented analytical pipeline will be small enough to accommodate the real-time requirements of the use cases. The pipeline latency will be properly monitored and, if necessary and feasible, "horizontal" pipelining will be replaced with "vertical" parallelisation in order to reduce latency (see Section 2.6 for more details). E.g., in the concrete visualisation scenario evaluated in Annex 5, the latency of the pipeline is 9.5 seconds.

R4.2 Pipeline throughput (Resp.: WP3 / WP4 / WP6). The throughput of the implemented pipeline will be sufficient for the purposes of the use cases. The pipeline throughput will be monitored and, if necessary, the bottleneck will be identified and eliminated. E.g., in the concrete visualisation scenario evaluated in Annex 5, the pipeline throughput is roughly 2.5 documents per second.

R4.3 Document format and encoding (Resp.: WP3). The data acquisition component will provide (but will not necessarily be limited to) HTML documents. The acquired HTML documents will be properly transcoded into UTF-8. All the subsequent components will be able to handle UTF-8.

R4.4 Interchange data format (Resp.: WP3 / WP4). Data will be passed from one component to another in a "standardised" interchange format, agreed upon within the respective WPs. Seamless interchange of data between the components will be achieved by employing a GATE-compatible "annotated document corpus" format.

Data storage (see Annex 6 for additional details)

R5.1 Data formats (Resp.: WP5). Various data formats will be supported by the system. The data storage will store RDF documents, text documents, decision rules, sentiment time series, and technical/economic indicator time series.

R5.2 Unified access (Resp.: WP5). The underlying persistence layer will expose an API to other components. The API will provide the necessary functionality for storing and retrieving data.

Semantic resources

R6.1 Ontology format (Resp.: WP3). The ontology format needs to be compatible with Protégé (available at http://protege.stanford.edu/). The ontology will thus be stored in the RDF format.

R6.2 Ontology availability (Resp.: WP3 / WP5). The most recent version of the ontology will be at all times available on the Web. Older versions of the ontology will be available from the knowledge store. The ontology will be versioned at regular time intervals.

R6.3 Ontology purpose (Resp.: WP3 / WP4). The ontology will be fit for the purpose of the information extraction (IE) tasks in WP4. Thus, the base topology/hierarchy of concepts should be static. Suitability of the ontology for the required IE tasks will be assessed through the evaluation of the information extraction components.

R6.4 Ontology evolution (Resp.: WP3 / WP4). The ontology will be constantly updated with respect to the incoming data stream. This basically results in the creation of new instances of existing concepts; the creation of new concepts is undesirable in the light of a rather static rule-based information extraction system. The number of recognised (relevant) entities, sentiment-bearing terms, and other relevant high-level features will gradually increase due to the ontology evolution process.

Specific technical requirements

R7.1 Data acquisition pipeline – functionality (Resp.: WP3). The data acquisition pipeline will acquire and "clean" the data, providing a unified data stream suitable for the analysis tasks. The data acquisition component will implement the following features (see Section 2.1): RSS acquisition, boilerplate removal, near-duplicate detection, language detection, and spam detection.

R7.2 Data acquisition pipeline – supported Web content formats (Resp.: WP3). The data acquisition pipeline will mainly be able to handle textual content within HTML documents. Images, Flash content, obfuscated PDFs, and content fragments loaded on certain user actions will not be acquired and parsed. If feasible, PDFs from several selected sources will be converted into text documents for further analysis.

R7.3 Information extraction components (Resp.: WP4). Information extraction will facilitate the sentiment analysis process. Furthermore, it will extract entities such as companies, securities, and countries (to be used as high-level features in the devised decision support models). The core functionalities over the preprocessed text are the following: tokenization, sentence splitting, POS tagging, lemmatization, and ontology-based entity extraction. These functionalities will be based on the software tools provided by the GATE framework.

R7.4 Sentiment analysis (Resp.: WP4). Sentiment analysis will provide high-level features for decision support models by extracting and aggregating sentiment attributed to the identified financial objects. The main functionalities over the preprocessed text are the following: sentiment extraction on the sentence level, sentiment classification on the document level, and sentiment aggregation.

R7.5 Decision support models – features (Resp.: WP6). Decision support models will employ features extracted in the information extraction process. We plan to employ machine-learning models and qualitative models that only work with high-level features. The information extraction pipeline will provide such high-level features to the decision support components.

R7.6 Decision support models – streams (Resp.: WP6). Decision support models will make use of streams of features provided by the information extraction components rather than operating on static datasets. This means that several decision support components will need to implement the online (incremental) variants of the data analysis algorithms (see Section 2.5.2).

Programming / runtime environments

R8.1 Programming / runtime environment – data acquisition pipeline (Resp.: WP2 / WP3 / WP7). The data acquisition pipeline will be implemented in .NET (C#). The software infrastructure (i.e., the FIRST software infrastructure and operating system) needs to be able to execute .NET components (.NET Framework 2.0+).

R8.2 Programming / runtime environment – information extraction pipeline (Resp.: WP2 / WP4 / WP7). The information extraction pipeline will be implemented in Java, using the GATE software library. The software infrastructure needs to be able to execute GATE-based Java components. Optionally, for sentiment aggregation and evaluation, parts might be implemented in Matlab; Matlab code, however, can be transformed into Java or .NET code.

R8.3 Runtime environment – knowledge base (Resp.: WP2 / WP5 / WP7). The knowledge base environment is to be determined at a later time. The current preference is the .NET Windows Communication Foundation (WCF) technology.

R8.4 Programming / runtime environment – decision support components (Resp.: WP2 / WP6 / WP7). Several decision support components will be implemented in .NET (C#). The software infrastructure needs to be able to execute .NET components (.NET Framework 2.0+).

Table 2: List of technical requirements.


4 Preliminary work

4.1 Sentiment analysis

We present a knowledge-based approach to measure investor sentiment directly at high frequency by automatic sentiment extraction, classification, and aggregation from highly ambiguous and unstructured financial weblog texts. We extract sentiment related to the expected future price of a financial instrument.

Our approach consists of the following steps: (1) information retrieval of weblog documents with respect to a specific financial instrument from the internet, (2) extraction of basic text properties such as author and date, (3) natural language pre-processing, (4) extraction of investor sentiment on the sentence level, (5) classification of the sentiment orientation, (6) aggregation of the sentiment to the document level and averaging over a set of documents for a sentiment index. Domain knowledge is modelled in an ontology containing financial instruments, indicators, and finance-specific words with semantic orientation (called orientation terms in the following).

Information retrieval is keyword-based, using the ontology's terms for a financial instrument. The Google Ajax API delivers up to 64 document results per day and per financial instrument. The search is restricted to specific blog websites. Each document is stripped of non-text elements, links, tags, comments, and advertisements.

The extraction of basic document properties is based on the HTML structure of a blog document and on HTML tag patterns. Thus, we extract the document date, document author, document title, and the document body text.

Natural language pre-processing of document texts is performed based on GATE's ANNIE IE system (Cunningham, Maynard, Bontcheva, & Tablan, 2002). After tokenisation, sentence splitting, and POS tagging, ontology concepts are identified. Orientation terms that appear in the same phrase are aggregated and treated as one in the subsequent steps. Indicators that appear in the same noun phrase and have the same correlation coefficient associated are also aggregated.
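For illustration, the following Java sketch shows one possible way to run such an ANNIE-based preprocessing step from embedded code. It assumes a GATE 6-era installation with the ANNIE plugin in its default location; the example document text is made up, and this is a minimal sketch rather than the project's actual preprocessing component.

    import gate.Corpus;
    import gate.Document;
    import gate.Factory;
    import gate.Gate;
    import gate.creole.SerialAnalyserController;
    import gate.util.persistence.PersistenceManager;
    import java.io.File;

    public class AnniePreprocessingSketch {
        public static void main(String[] args) throws Exception {
            Gate.init();  // initialise the GATE library (GATE home must be configured)
            File anniePlugin = new File(Gate.getPluginsHome(), "ANNIE");
            Gate.getCreoleRegister().registerDirectories(anniePlugin.toURI().toURL());

            // Load the saved ANNIE application: tokeniser, sentence splitter, POS tagger, ...
            SerialAnalyserController annie = (SerialAnalyserController)
                PersistenceManager.loadObjectFromFile(
                    new File(anniePlugin, "ANNIE_with_defaults.gapp"));

            Corpus corpus = Factory.newCorpus("weblog corpus");
            Document doc = Factory.newDocument(
                "The DAX is expected to close above its 50-day moving average.");
            corpus.add(doc);
            annie.setCorpus(corpus);
            annie.execute();  // annotations (Token, Sentence, ...) are attached to the document

            System.out.println(doc.getAnnotations().get("Token").size() + " tokens found");
        }
    }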

The extraction of investor sentiment on the sentence level is based on the identification of hand-crafted textual patterns similar to (Shaikh, Prendinger, & Mitsuru, 2007), but taking into account domain knowledge as modelled in the ontology. Sentiment is extracted with regard to a specific financial instrument or an indicator.

The classification of the sentiment orientation for sentences that contain a financial instrument is based on the orientation of the orientation term found in the sentence. For sentences containing an indicator, the correlation of the indicator with the financial instrument it is assumed to refer to is considered as well. If the sentence contains a negation, the sentiment orientation is inverted.

The aggregation of the sentiment to the document level is done by quantifying the classified sentences and takes into account the direction as well as the intensity of the sentence sentiment orientations.

We follow (Das & Chen, 2007) to create a sentiment index time series that reflects the sentiment of the market on a given day. For this we average over all sentiments from all documents of the given day regarding one specific financial instrument.
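As a minimal illustration of steps (5) and (6), the Java sketch below computes a document-level sentiment score and a daily sentiment index by averaging. The numeric sentence scores, their [-1, 1] range, and the class and method names are illustrative assumptions; the actual aggregation is described in the text above.

    import java.util.Arrays;
    import java.util.List;

    // Illustrative only: sentence scores are assumed to lie in [-1, 1],
    // negative for bearish and positive for bullish statements.
    public class SentimentIndexSketch {

        // Document-level sentiment: mean of the signed sentence scores,
        // so both direction and intensity contribute.
        static double documentSentiment(List<Double> sentenceScores) {
            return sentenceScores.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        }

        // Daily sentiment index for one financial instrument:
        // average over all documents published on that day.
        static double dailyIndex(List<List<Double>> documentsOfTheDay) {
            return documentsOfTheDay.stream()
                    .mapToDouble(SentimentIndexSketch::documentSentiment)
                    .average().orElse(0.0);
        }

        public static void main(String[] args) {
            List<List<Double>> day = Arrays.asList(
                Arrays.asList(0.8, 0.3),            // a mildly bullish blog post
                Arrays.asList(-0.6, -0.2, 0.1));    // a mostly bearish one
            System.out.printf("daily sentiment index = %.3f%n", dailyIndex(day));
        }
    }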

In an evaluation on a corpus of manually classified weblog documents we achieve an accuracy of 68%, which significantly outperforms state-of-the-art machine learning approaches on the same corpus.


4.2 Document stream visualisation

This section discusses an adaptation of a document space visualisation algorithm for document stream visualisation. A document space is a high-dimensional bag-of-words space in which documents are represented as feature vectors. To visualise a document space, feature vectors need to be projected onto a 2-dimensional canvas so that the distances between the planar points reflect the cosine similarities between the corresponding feature vectors.

When visualising static document spaces, the dataset can be fairly large (e.g., a couple of million documents); it is thus important that it can be processed in a time that is still acceptable for the application (e.g., a couple of hours). On the other hand, when dealing with streams, new documents constantly flow into the system, requiring the algorithm to update the visualisation in near-real time. In this case, we want to ensure that the throughput of the visualisation algorithm suffices for the stream's document rate.

The contribution of the work presented in the following sections is most notably a new algorithm for document stream visualisation. In addition, we implicitly show that a set of relatively simple improvements suffices for transforming an algorithm for static data processing into an algorithm for stream processing. The improvements rely on iterative optimisation methods and on parallelisation through pipelining. The former means that we use the solution computed at time t − 1 as the initial guess at time t, which results in faster convergence, while the latter refers to breaking up the algorithm into independent consecutive stages that can be executed in parallel.

We mostly build on top of the algorithm for "static" document corpora visualisation presented by Paulovich et al. (Paulovich, Nonato, & Minghim, 2006). They utilise several methods consecutively to compute a topic space layout. Details on this method, as well as on our adaptation of the method for the purpose of document stream visualisation, are given in the following subsections.

4.2.1 Document corpora visualisation pipeline

The static document corpus visualisation algorithm presented in (Paulovich, Nonato, & Minghim, 2006) utilises several methods to compute a layout. In this section, we make explicit that these methods can be perceived as a pipeline, which is an important reinterpretation when designing algorithms for large-scale processing of streams. Throughout the rest of this section, we present our own implementation of each of the pipeline stages. The visualisation pipeline is illustrated in Figure 20. In contrast to the work presented in Paulovich et al. (2006), we provide details on the document preprocessing, argue for a different way of selecting representative instances, and concretise the algorithms for the projection of representative instances, the neighbourhoods computation, and the least-squares interpolation, respectively.

Figure 20: Document space visualisation pipeline.


4.2.1.1 Document preprocessing

To preprocess the documents (i.e., convert them into a bag-of-words representation), we followed a typical text mining approach (Feldman & Sanger, 2006). The documents were tokenised, stop words were removed, and the tokens (i.e., words) were stemmed. Bigrams were considered in addition to unigrams. If a term appeared in the corpus fewer than 5 times, it was removed from the vocabulary. In the end, TF-IDF vectors were computed and normalised. From each vector, the lowest-weighted terms whose cumulative weight accounted for 20% of the overall cumulative weight were removed (i.e., their weights were reset to 0).
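The pruning step can be illustrated with the following Java sketch; the map-based vector representation and the helper names are ours, and the 20% share is passed in as a parameter.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch of the vector pruning step described above.
    public class TfIdfPruningSketch {

        // Zero out the lowest-weighted terms whose cumulative weight makes up
        // at most the given share (e.g., 0.2) of the vector's total weight.
        static Map<String, Double> prune(Map<String, Double> vector, double share) {
            double total = vector.values().stream().mapToDouble(Double::doubleValue).sum();
            List<Map.Entry<String, Double>> entries = new ArrayList<>(vector.entrySet());
            entries.sort(Map.Entry.comparingByValue()); // ascending by weight

            Map<String, Double> pruned = new HashMap<>(vector);
            double removed = 0.0;
            for (Map.Entry<String, Double> e : entries) {
                if (removed + e.getValue() > share * total) break; // stop before exceeding the share
                removed += e.getValue();
                pruned.put(e.getKey(), 0.0); // weight reset to 0, term kept in the vocabulary
            }
            return pruned;
        }

        public static void main(String[] args) {
            Map<String, Double> v = new HashMap<>();
            v.put("stock", 0.6); v.put("rise", 0.3); v.put("today", 0.08); v.put("the", 0.02);
            System.out.println(prune(v, 0.2)); // "the" and "today" are zeroed (0.02 + 0.08 <= 0.2)
        }
    }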

4.2.1.2 Clustering

To segment the document space, we implemented the k-means clustering algorithm (Hartigan & Wong, 1979). The purpose of the clustering step is to obtain ―representative‖ instances. In (Paulovich, Nonato, & Minghim, 2006), it is suggested to take the medoids of the clusters as the representative instances. However, we decided to take the centroids rather than the medoids. In the least-squares interpolation process (the final stage of the visualisation pipeline), each non-control point is required to be directly or indirectly linked to a control point. If the control points are represented by the centroids, each non-control point is guaranteed to have at least one non-orthogonal neighbour which is a control point. This prevents the situations in which a point or a clique of points is not linked to a control point and thus cannot be positioned. We believe that this change to the original algorithm results in visually more pleasing layouts (we do not provide experimental evidence to support or reject this claim at this point).

k-means clustering is an iterative process. In each iteration, the quality of the current partition is computed as the average cosine similarity between a document instance and the centroid to which the instance was assigned. If the increase in quality, from one iteration to another, is below a predefined threshold, the clustering process is stopped.

4.2.1.3 Stress majorisation

In the final stage of the pipeline, the least-squares solver interpolates between coordinates of the projected representative instances in order to determine planar locations of the other instances. Since the number of representative instances is relatively low, it is possible to employ computationally expensive methods to project them onto a planar canvas. We therefore resorted to the stress majorisation method which monotonically decreases the stress function in each iteration.

Stress majorisation can be reformulated as an iterative process (Gansner, Koren, & North, 2004). If the reduction in stress, from one iteration to another, is below a predefined threshold, the layout computation process is stopped.

4.2.1.4 Neighbourhoods computation

For the interpolation step, it is also necessary to determine the k nearest neighbours of each data instance. The basic idea of the algorithm is simple: for each data instance, (a) compute the similarities to all the other instances and (b) select the k nearest instances from the list.

Part (b) of the naive algorithm can be efficiently implemented by choosing one of the best performing selection algorithms (e.g., the median-of-medians algorithm), which are guaranteed to have O(n) worst-case time complexity (here, n denotes the number of instances). Provided that we need to execute this selection for each data instance, we get O(n²) combined time complexity. Efficient implementation of part (a) is more intriguing and is possible due to the fact that we use the cosine similarity measure to determine similarities between data instances. Computing cosine similarity between two instances is equivalent to computing the dot product of the two corresponding vectors (provided that the vectors are normalised). This computation can be done relatively efficiently by employing the algorithm presented in (Grcar, Podpecan, Jursic, & Lavrac, 2010).
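A naive version of this neighbourhood computation is sketched below in Java. Sparse vectors are represented as maps from term indices to weights, and a full sort is used for brevity where the text above suggests a linear-time selection algorithm; this is not the optimised algorithm of Grcar et al. (2010), only the basic idea.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    // Naive k-NN over normalised sparse vectors: cosine similarity reduces
    // to a sparse dot product; the k best matches are kept per instance.
    public class SparseKnnSketch {

        static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
            // iterate over the shorter vector for efficiency
            if (a.size() > b.size()) { Map<Integer, Double> t = a; a = b; b = t; }
            double sum = 0.0;
            for (Map.Entry<Integer, Double> e : a.entrySet()) {
                Double w = b.get(e.getKey());
                if (w != null) sum += e.getValue() * w;
            }
            return sum;
        }

        // indices of the k nearest neighbours of instance q (similarities discarded)
        static List<Integer> nearest(int q, List<Map<Integer, Double>> vectors, int k) {
            List<Integer> candidates = new ArrayList<>();
            for (int i = 0; i < vectors.size(); i++) if (i != q) candidates.add(i);
            candidates.sort(Comparator.comparingDouble(
                (Integer i) -> -dot(vectors.get(q), vectors.get(i)))); // descending similarity
            return candidates.subList(0, Math.min(k, candidates.size()));
        }
    }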

4.2.1.5 Least-squares interpolation

The final stage of the pipeline employs a least-squares solver to compute the layout of the non-control points by interpolating between the coordinates of the control points. To construct a system of linear equations required for the interpolation process, we need the coordinates of the control points (obtained by the stress majorisation algorithm) and the neighbourhoods of the document instances and centroids. The basic idea is that each point can then be described as the centre of its neighbours. This process is discussed in greater detail in (Sorkine & Cohen-Or, 2004; Grcar, Podpecan, Jursic, & Lavrac, 2010).
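To make the construction concrete, the system can be written down as follows. This is a sketch following the least-squares-mesh formulation of Sorkine & Cohen-Or (2004); uniform neighbour weights are assumed here, whereas an implementation may weight neighbours by their similarity. Let $\mathbf{x}_i$ denote the unknown planar position of instance $i$, $N(i)$ its set of $n_N$ nearest neighbours, and $\mathbf{p}_c$ the position of control point $c$ computed by stress majorisation:

    \mathbf{x}_i - \frac{1}{|N(i)|} \sum_{j \in N(i)} \mathbf{x}_j = \mathbf{0} \quad \text{for every instance } i,
    \qquad
    \mathbf{x}_c = \mathbf{p}_c \quad \text{for every control point } c.

The overdetermined system is solved in the least-squares sense, i.e., by minimising

    \sum_i \Big\| \mathbf{x}_i - \tfrac{1}{|N(i)|} \sum_{j \in N(i)} \mathbf{x}_j \Big\|^2 \; + \; \sum_c \big\| \mathbf{x}_c - \mathbf{p}_c \big\|^2 .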

In our document stream visualisation framework, the LSQR solver, developed by Paige and Saunders (Paige & Saunders, 1982), was used to solve the resulting system of linear equations. The solution is a set of planar points corresponding to the high-dimensional feature vectors (i.e., the final layout).

4.2.2 Visualisation of document streams

This section discusses the adaptations of the document corpus visualisation pipeline to document stream visualisation. All stages of the pipeline are modified in a way that allows fast sequential updates, thereby allowing us to efficiently process document streams. The online document stream visualisation pipeline is illustrated in Figure 21. The document stream flows into a buffer of limited capacity; outdated documents are gradually removed from the buffer following the FIFO (first in, first out) principle (the buffer thus implements a queue). The model required for the visualisation and the visualisation itself are at all times synchronised with the content of the buffer.

Before going into more detail on how the separate stages of the pipeline are implemented, let us establish the common notation required for the configuration of the pipeline (this notation is also used in Annex 5):

• Let nC denote the number of clusters computed in the k-means clustering process. This corresponds exactly to the number of representative instances (i.e., centroids) that are positioned using the stress majorisation procedure.

• Let ui be the number of documents that enter the buffer and vi the number of documents that are removed from the buffer at time step i. For the sake of simplicity, we will assume that u = ui = vi at each i.

• Let nN be the number of closest neighbours that are assigned to each instance. The neighbourhoods are used to construct the system of linear equations.

• Let nQi be the number of instances in the buffer at time step i. For the sake of simplicity, we will assume that nQ = nQi at each i.

In the next subsections, we provide online variants of the document preprocessor, k-means algorithm, stress majorisation optimisation method, k-nearest neighbours algorithm, and least-squares interpolation method.


Figure 21: Document stream visualisation pipeline.

4.2.2.1 Online document preprocessing

Online document preprocessing can be seen as a queue of term frequency (TF) bag-of-word vectors. When a number of vectors are removed from the queue, the vocabulary is updated accordingly: global document frequency (DF) values are decreased appropriately. If a global DF value reaches zero, the corresponding word is removed from the vocabulary. When a batch of new TF vectors is enqueued, on the other hand, global DF values are increased accordingly and new words (i.e., those not yet contained in the vocabulary) are added to the vocabulary.

At any time, any TF vector in the queue can be converted into its normalised TF-IDF representation by taking the global DF values into account. In this process, the original TF vector is not altered and remains at its original position in the queue. Note that a single TF vector can have many different TF-IDF representations, depending on the state of the vocabulary at the time a TF-IDF vector is computed (when the queue changes, the global DF values normally change; this results in different TF-IDF values in the affected vectors). In our preliminary implementation (see Annex 5), each TF-IDF vector is computed immediately after the corresponding TF vector is enqueued.
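A possible shape of such a queue is sketched below in Java. The class and method names are illustrative, and the logarithmic IDF formula is one common variant, not necessarily the one used in our implementation.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the DF-maintaining document queue described above.
    // A document is represented by its raw term-frequency (TF) map.
    public class OnlineTfIdfQueueSketch {
        private final Deque<Map<String, Integer>> queue = new ArrayDeque<>();
        private final Map<String, Integer> df = new HashMap<>(); // global document frequencies

        void enqueue(Map<String, Integer> tf) {
            queue.addLast(tf);
            for (String term : tf.keySet()) df.merge(term, 1, Integer::sum); // new words enter the vocabulary here
        }

        void dequeueOldest() {
            Map<String, Integer> tf = queue.removeFirst();
            for (String term : tf.keySet()) {
                int n = df.merge(term, -1, Integer::sum);
                if (n <= 0) df.remove(term); // word disappears from the vocabulary
            }
        }

        // TF-IDF of a queued document, computed against the *current* vocabulary;
        // the stored TF vector itself is never altered.
        Map<String, Double> tfIdf(Map<String, Integer> tf) {
            int docs = queue.size();
            Map<String, Double> out = new HashMap<>();
            double norm = 0.0;
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double w = e.getValue() * Math.log((double) docs / df.getOrDefault(e.getKey(), 1));
                out.put(e.getKey(), w);
                norm += w * w;
            }
            final double len = Math.sqrt(norm);
            if (len > 0) out.replaceAll((t, w) -> w / len); // cosine normalisation
            return out;
        }
    }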

4.2.2.2 Online k-means clustering

The online k-means clustering algorithm takes into account the centroids and the assignments of instances to the centroids from the preceding step. After the centroids are updated due to the removal of the outdated instances and the assignment of the newly arrived instances (this is a relatively fast operation), the online k-means clustering algorithm proceeds with the usual k-means loop. Assuming that the perturbation of the buffer is small and the set of data instances is much larger (u << nQ), the centroids are proven to be stable (Rakhlin & Caponnetto, 2007), which means that the k-means algorithm will converge rapidly on the perturbed set of data instances. Specifically, if the perturbation is limited by O(√nQ), where nQ is the number of data instances in the buffer, the online variant of the k-means algorithm is expected to converge rapidly.
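The warm-start update can be sketched as follows (Java, dense vectors for brevity; in the actual pipeline the vectors are sparse TF-IDF vectors, and the cluster bookkeeping shown here is an illustrative simplification of the online k-means component):

    // Sketch of the warm-started centroid update: outdated instances are
    // subtracted from, and new instances added to, the running cluster sums
    // before the usual k-means iterations continue from the previous state.
    public class OnlineKMeansSketch {

        static void updateClusterSums(double[][] sums, int[] counts,
                                      double[][] removed, int[] removedCluster,
                                      double[][] added, int[] addedCluster) {
            for (int i = 0; i < removed.length; i++) {
                int c = removedCluster[i];
                counts[c]--;
                for (int d = 0; d < removed[i].length; d++) sums[c][d] -= removed[i][d];
            }
            for (int i = 0; i < added.length; i++) {
                int c = addedCluster[i]; // e.g., the currently nearest centroid
                counts[c]++;
                for (int d = 0; d < added[i].length; d++) sums[c][d] += added[i][d];
            }
        }

        // Centroid of cluster c after the perturbation; the standard k-means
        // loop then refines assignments starting from these centroids.
        static double[] centroid(double[][] sums, int[] counts, int c) {
            double[] mean = new double[sums[c].length];
            for (int d = 0; d < mean.length; d++) mean[d] = counts[c] > 0 ? sums[c][d] / counts[c] : 0.0;
            return mean;
        }
    }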

4.2.2.3 Online stress majorisation

Taking into account the stability and rapid convergence of the online variant of the k-means clustering algorithm, it is easy to see that stress majorisation of the set of representative data instances (centroids) is also fast. Since the perturbation of particles (centroids) in our stress majorisation optimisation problem is small, the overall increase in stress is small as well, which guarantees that only a small number of recomputations of the particles' positions is needed.

4.2.2.4 Online neighbourhood computation

In the first step, the neighbourhoods are computed in the "offline" way by using the algorithm discussed in Section 3.4 in (Grcar, Podpecan, Jursic, & Lavrac, 2010): for each instance, a list of most similar neighbours is constructed. The online k-NN procedure starts by removing the outdated instances from the queue. The outdated instances are, on the one hand, the outdated bags-of-words and, on the other, the centroids that have been changed in the online k-means step. Next, the removed instances need to be removed from the lists of neighbours as well. The algorithm, discussed in greater detail in Grcar et al. (2010), uses several relatively simple heuristics to perform this step more efficiently. Finally, the newly arrived instances need to be enqueued and the lists need to be updated accordingly. Apart from the new bags-of-words obtained from the stream, the updated centroids are also enqueued. Again, several heuristics can be used to perform this step more efficiently.

This procedure results in updated neighbourhoods. The nN most similar neighbours of each instance are passed on to the next stage of the pipeline, where the system of linear equations is constructed.

4.2.2.5 Online coordinate interpolation

Modifying the coordinate interpolation step to work with streams is a relatively trivial task. We construct the system of linear equations in exactly the same way as in the original visualisation algorithm. In addition, we take the coordinates from the previous step into account when solving the system in the least-squares sense. In our preliminary implementation (see Annex 5), we employ the LSQR least-squares solver (Paige & Saunders, 1982) which is based on a conjugate gradient iterative method that starts with an initial guess for the solution and iteratively modifies the solution vector towards the optimal solution. In our online visualisation process, the coordinates of points at time step i + 1 are similar to those at time step i. This results from the fact that most of the data instances and similarities between them are unchanged and thus the instances tend to move only marginally from their previous positions. Since the coordinates correspond to the solution of the least-squares solver, the coordinates from the preceding step can be used as a good initial guess for the solution. The only set of instances to which we are unable to assign coordinates from the preceding step corresponds to the batch of documents that entered the system at step i. We simply initialise that part of the solution vector to zeros.
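Seeding the solver can be as simple as the following Java sketch, in which instances that were already present reuse their previous coordinates and newly arrived instances start at the origin; identifiers and data structures are illustrative.

    import java.util.Map;

    // Sketch of seeding the least-squares solver: instances already present at
    // the previous time step keep their old coordinates as the initial guess,
    // newly arrived instances start at the origin.
    public class WarmStartSketch {
        static double[][] initialGuess(String[] currentIds, Map<String, double[]> previousLayout) {
            double[][] guess = new double[currentIds.length][2];
            for (int i = 0; i < currentIds.length; i++) {
                double[] prev = previousLayout.get(currentIds[i]);
                guess[i] = (prev != null) ? prev.clone() : new double[] {0.0, 0.0};
            }
            return guess;
        }
    }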

4.2.3 Preliminary implementation

In Annex 5, we present a preliminary implementation of the discussed online algorithm for document stream visualisation through a distance-preserving MDS-like projection onto a 2D canvas. The algorithm can be executed as a 4-stage pipeline, which greatly increases the processing speed. We show that in a particular setting with limited buffer capacity and constant document batch size, the pipeline can efficiently handle 25% of the entire active blogosphere1, which should be sufficient for most real-life applications. Also important to note is that the achieved visualisation nicely transitions from one frame to another which enables the user to visually track a point (i.e., a document) gradually moving in the 2D space.

1 Technorati <http://technorati.com/> tracks approximately 100 million blogs; roughly 15 million of them are active. Around 1 million blog posts are published each day (i.e., around 10 each second).


5 Conclusions

This report discussed, on one hand, the state of the art related to the envisioned software components and, on the other, technical requirements such as hardware and software infrastructure requirements, data storage requirements, scaling requirements, and runtime environment requirements, posed by the envisioned software components towards the FIRST hardware and software infrastructure.

The state-of-the-art sections include overviews of methods, tools, and fields of science relevant for FIRST. For some specific technologies, concrete decisions were already made and preliminary implementations and experiments were already carried out (e.g., language detection, boilerplate removal, near-duplicate detection, and document-stream visualisation). Furthermore, specific programming and runtime environments and software tools were chosen for certain tasks in the project (e.g., .NET and Java will interoperate in the FIRST analytical pipeline, GATE will be used for the information extraction tasks, potentially supplemented with Matlab, and DEXi will be used for qualitative multi-attribute modelling). The report also put forward two general-purpose scaling techniques, pipelining and parallelisation. These two techniques will be applied in the FIRST scale-up cycle and form the basis for the definition of the FIRST scaling strategy.

The requirements collected in this report are not yet "engraved in stone". The reader should note that they are still subject to changes. They will be revised accordingly after the end-user requirements towards the envisioned technologies are fully determined (preliminarily reported in D1.2 at Month 6) and especially after the conceptual architecture requirements (D2.2) and the scaling strategy (D2.3) are delivered at Month 12.


References

Aggarwal, C., Han, J., Wang, J., & Yu, P. S. (2003). A Framework for Clustering Evolving Data

Streams. Proceedings of the Int. Conf. on Very Large Data Bases. Berlin.

Aggarwal, C., Han, J., Wang, J., & Yu, P. S. (2004). A Framework for Projected Clustering of

High Dimensional Data Streams. Proceedings of the Int. Conf. on Very Large Data Bases.

Toronto.

Aggarwal, C., Han, J., Wang, J., & Yu, P. S. (2004). On Demand Classification of Data Streams.

Proceedings of the Int. Conf. on Knowledge Discovery and Data Mining. Seattle.

Albrecht-Buehler, C., Watson, B., & Shamma, D. (2005). Visualising Live Text Streams Using

Motion and Temporal Pooling. IEEE Computer Graphics and Applications , 25 (3), 52–59.

Amardeilh, F., Vatant, B., Gibbins, N., & others. (2004). TAO Deliverable 1.2.2: SWS

Bootstrapping Methodology v2. TAO (IST-2004-026460).

Andreevskaia, A., & Bergler, S. (2006). Mining WordNet for Fuzzy Sentiment : Sentiment Tag

Extraction from WordNet Glosses. pp. 209-216.

Aue, A., & Gamon, M. (2005). Automatic identification of sentiment vocabulary: Exploiting low

association with known sentiment terms.

Babcock, B., Datar, M., Motwani, R., & O'Callaghan, L. (2003). Maintaining Variance and k-

Medians over Data Stream Windows. Proceedings of the 22nd Symposium on Principles of

Database Systems.

Baccianella, S., Esuli, A., & Sebastiani, F. (2010). SENTIWORDNET 3.0 : An Enhanced

Lexical Resource for Sentiment Analysis and Opinion Mining. pp. 2200-2204.

Barbara, D., & others. (1997). The New Jersey data reduction report. Technical Committee on

Data Engineering , 20, 3-45.

Bohanec, M. (2008). DEXi: Program for multi-attribute decision making, User's manual, Version

3.00. IJS Report DP-9989. Ljubljana: Jožef Stefan Institute.

Bohanec, M., Messéan, A., Angevin, A., & Žnidaršič, M. (2006). SMAC Advisor: A decision-

support tool on coexistence of genetically-modified and conventional maize. Proceedings of

Information Society IS 2006, (pp. 9-12). Ljubljana.

Bohanec, M., Messean, A., Scatasta, S., Angevin, F., Griffiths, B., Krogh, P., et al. (2008). A

qualitative multi-attribute model for economic and ecological assessment of genetically modified

crops. Ecological Modelling , 215 (1-3), 247-261.

Bontcheva, K., Tablan, V., Maynard, D., & Cunningham, H. (2004). Evolving GATE to Meet

New Challenges in Language Engineering. Natural Language Engineering , 10 (3/4), pp. 349-

373.

Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers.

Proceedings of the fifth annual workshop on Computational learning theory (pp. 144-152).

Pittsburgh, PA, USA: ACM.

Bouyssou, D., Marchant, T., Pirlot, M., Tsoukias, A., & Vincke, P. (2006). Evaluation and

decision models with multiple criteria: Stepping stones for the analyst. International Series in

Operations Research and Management Science , 86.

Buntine, W. (1996, April). A guide to the literature on learning probabilistic networks from data.

IEEE Transactions on Knowledge and Data Engineering , 8 (2), pp. 195 - 210.

Burges, C. (1998). A Tutorial on Support Vector Machines for Pattern. Data Mining and

Knowledge Discovery , 2, pp. 121–167.

Cadilhac, A., Benamara, F., & Aussenac-Gilles, N. (2010). Ontolexical resources for feature

based opinion mining : a case-study. Proceedings of the 6th Workshop on Ontologies and

Lexical Resources, 23rd International Conference on Computational Linguistics, (pp. 77-86).

Page 67: D2.1 Technical requirements and state-of-the-artfirst.ijs.si/FirstShowcase/Content/reports/D2.1.pdf · ―Problem analysis‖ and ―State of the art‖) and report template (Annex

D2.1

© FIRST consortium Page 67 of 90

Cai, D., Yu, S., Wen, J., & Ma, W. (2003). Extracting content structure for web pages based on

visual representation. Proceedings of the 5th Asia Pacific Web Conference.

Caruana, R., Gehrke, J., & Joachims, T. (2005). Identifying Temporal Patterns and Key Players

in Document Collections. Proceedings of the IEEE ICDM Workshop on Temporal Data Mining:

Algorithms, Theory and Applications (TDM-05), (pp. 165–174).

Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., et al. (2006). A

reference collection for web spam. Proceedings of SIGIR Forum’06.

Cavnar, W. B., & Trenkle, J. M. (1994). N-Gram-Based Text Categorization. Proceedings of

SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, (pp. 161-

175).

Chaovalit, P., & Zhou, L. (2005). Movie review mining: A comparison between supervised and

unsupervised classification approaches. Proceedings of the Hawaii International Conference on

System Sciences (HICSS).

Charikar, M. (2002). Similarity estimation techniques from rounding algorithms. Proceedings of

STOC 2002.

Charikar, M., O'Callaghan, L., & Panigrahy, R. (2003). Better streaming algorithms for

clustering problems. Proceedings of the 35th ACM Symposium on Theory of Computing.

Cooper, G., Heckerman, D., & Meek, C. (1997). A Bayesian Approach to Causal Discovery.

Microsoft Research.

Cormode, G., & Muthukrishnan, S. (2003). What's hot and what's not: tracking most frequent

items dynamically. Proceedings of PODS 2003, (pp. 296-306).

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning , 20 (3), 273-297.

Cowie, J., Ludovic, Y., & Zacharski, R. (1999). Language Recognition for Mono-and Multi-

lingual Documents. Proceedings of Vextal Conference. Venice.

Cunningham, H., Maynard, D., Bontcheva, K., & Tablan, V. (2002). GATE: A Framework and

Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of

the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02).

Philadelphia.

Cunningham, H., Maynard, D., Bontcheva, K., & Tablan, V. (2002). GATE: A Framework and

Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of

the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02).

Philadelphia.

Das, S. R., & Chen, M. Y. (2007, sep). Yahoo! for Amazon: Sentiment Extraction from Small

Talk on the Web. 53 (9), pp. 1375-1388.

Dave, K., Lawrence, S., & Pennock, D. M. (2003). Mining the Peanut Gallery : Opinion

Extraction and Semantic Classification of Product Reviews. Proceedings of the 12th

international conference on World Wide Web.

Dave, K., Lawrence, S., & Pennock, D. (2003). Mining the peanut gallery: opinion extraction

and semantic classification of product reviews. Proceedings of WWW 2003.

Ding, Q., Ding, Q., & Perrizo, W. (2002). Decision Tree Classification of Spatial Data Streams

Using Peano Count Trees. Proceedings of the ACM Symposium on Applied Computing. Madrid.

Domingos, P., & Hulten, G. (2001). A General Method for Scaling Up Machine Learning

Algorithms and its Application to Clustering. Proceedings of the Eighteenth International

Conference on Machine Learning. Williamstown: Morgan Kaufmann.

Domingos, P., & Hulten, G. (2000). Mining High-Speed Data Streams. Proceedings of the

Association for Computing Machinery Sixth International Conference on Knowledge Discovery

and Data Mining.

Domingos, P., Hulten, G., & Spencer, L. (2001). Mining time-changing data streams.

Proceedings of the ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, (pp.

97–106).

Page 68: D2.1 Technical requirements and state-of-the-artfirst.ijs.si/FirstShowcase/Content/reports/D2.1.pdf · ―Problem analysis‖ and ―State of the art‖) and report template (Annex

D2.1

© FIRST consortium Page 68 of 90

Dong, G., Han, J., Lakshmanan, L. V., Pei, J., Wang, H., & Yu, P. S. (2003). Online mining of

changes from data streams: Research problems and preliminary results. Proceedings of the 2003

ACM SIGMOD Workshop on Management and Processing of Data Streams. San Diego.

Duda, R., Hart, P., & Stork, D. (2001). Pattern Classification. New York: John Wiley & Sons,

Inc.

Erjavec, T. (2004). MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications,

Lexicons and Corpora. Proceedings of the Fourth Intl. Conf. on Language Resources and

Evaluation, LREC'04. Paris.

Esuli, A., Sebastiani, F., & Moruzzi, V. G. (2006). SENTIWORDNET : A Publicly Available

Lexical Resource for Opinion Mining. pp. 417-422.

Evert, S. (2008). A lightweight and efficient tool for cleaning web pages. Proceedings of the

Sixth International Language Resources and Evaluation (LREC’08). Marrakech: European

Language Resources Association (ELRA).

Feldman, R., & Sanger, J. (2006). The Text Mining Handbook: Advanced Approaches in

Analyzing Unstructured Data. Cambridge University Press.

Fernandez-Lopez, M., Gomez-Perez, A., & Juristo, N. (1997). Methontology: From Ontological

Art towards Ontological Engineering. Proceedings of the AAAI97 Spring Symposium, (pp. 33–

40). Stanford, USA.

Fette, I., Sadeh-Koniecpol, N., & Tomasic, A. (2007). Learning to Detect Phishing Emails.

Proceedings of WWW 2007.

Fortuna, B., Grobelnik, M., & Mladenic, D. (2006). Semi-automatic Data-driven Ontology

Construction System. Proceedings of the 9th International Multiconference Information Society

IS-2006. Ljubljana.

Fortuna, B., Mladenic, D., & Grobelnik, M. (2006). Visualization of Text Document Corpus.

Informatica , 29, 497–502.

Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (1994). Design Patterns. Elements of

Reusable Object-Oriented Software. Addison-Wesley.

Gannon, T., Madnick, S., Moulton, A., Sabbouh, M., Siegel, M., & Zhu, H. (2009, January).

Framework for the Analysis of the Adaptability, Extensibility, and Scalability of Semantic

Information Integration and the Context Mediation Approach. Retrieved from MIT Sloan School

of Management Working Paper: http://ssrn.com/abstract=1356653

Gansner, E. R., Koren, Y., & North, S. C. (2004). Graph Drawing by Stress Majorization. 239–

250.

Ganti, V., Gehrke, J., & Ramakrishnan, R. (2002). Mining Data Streams under Block Evolution.

SIGKDD Explorations , 3 (2).

Giannella, C., Han, J., Pei, J., Yan, X., & Yu, P. S. (2003). Mining Frequent Patterns in Data

Streams at Multiple Time Granularities. Next Generation Data Mining .

Gibson, J., Wellner, B., & Lubar, S. (2007). Adaptive Web-page Content Identification.

Proceedings of the 9th annual ACM international workshop on Web information and data

management, WIDM '07. New York.

Gómez-Pérez, A., & Manzano-Maho, D. (2003). OntoWeb Deliverable 1.5: A Survey of

Ontology Learning Methods and Techniques. OntoWeb (IST-2000-29243).

Grcar, M. (2008). TAO Deliverable 2.2.2: Ontology Learning Implementation v2. TAO (IST-

2004-026460).

Grcar, M., Grobelnik, M., & Mladenic, D. (2007). Using Text Mining and Link Analysis for

Software Mining. Proceeding of the ECML/PKDD’07 Workshop on Mining Complex Data.

Warsaw.

Grcar, M., Mladenic, D., Grobelnik, M., & others. (2007). TAO Deliverable 2.2.1: Ontology

Learning Implementation v1. TAO (IST-2004-026460).

Page 69: D2.1 Technical requirements and state-of-the-artfirst.ijs.si/FirstShowcase/Content/reports/D2.1.pdf · ―Problem analysis‖ and ―State of the art‖) and report template (Annex

D2.1

© FIRST consortium Page 69 of 90

Grcar, M., Podpecan, V., Jursic, M., & Lavrac, N. (2010). Efficient Visualization of Document

Streams. Proceedings of Discovery Science 2010 (pp. 174–188). Canberra: Springer-Verlag

Berlin Heidelberg.

Grcar, M., Stojanovic, N., & others. (2009). VIDI Deliverable D2.1: Architecture of the VIDI

Integrated Systems and Test Scenarios. VIDI (EP-08-01-14).

Grefenstette, G. (1995). Comparing Two Language Identification Schemes. Proceedings of

JADT-95, 3rd International Conference on the Statistical Analysis of Textual Data. Rome.

Guha, S., Koudas, N., & Shim, K. (2001). Data-streams and histograms. Proceedings of the of

33rd Annual ACM Symp. on Theory of Computing, (pp. 471–475).

Guha, S., Mishra, N., Motwani, R., & O'Callaghan, L. (2000). Clustering data streams.

Proceedings of the Annual Symposium on Foundations of Computer Science. IEEE.

Gyongyi, Z., & Garcia-Molina, H. (2004). Web Spam Taxonomy, Technical Report. Stanford

University.

Hall, M. A. (1999). Correlation based feature selection for machine learning. Hamilton:

University of Waikato, Dept. of Computer Science.

Har-Peled, S., Roth, D., & Zimak, D. (2003). Constraint Classification for Multiclass

Classification and Ranking. In S. Becker, S. Thrun, & K. Obermayer (Ed.), Advances in Neural

Information Processing Systems 15: Proceedings of the 2002 Conference (pp. 809-816). British

Columbia, Canada: MIT Press.

Hartigan, J. A., & Wong, M. A. (1979). Algorithm 136: A k-Means Clustering Algorithm.

Applied Statistics , 28, 100–108.

Hatzivassiloglou, V., & McKeown, K. (1997). Predicting the Semantic Orientation of

Adjectives. pp. 174-181.

Hu, M., & Liu, B. (2004). Mining and summarising customer reviews. Proceedings of

KDD’2004.

Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. pp. 168-177.

Ingle, N. C. (1976). A Language Identification Table. The Incorporated Linguist , 15 (4), 98–

101.

Jindal, N., & Liu, B. (2008). Opinion spam and analysis. Proceedings of WSDM 2008.

Jindal, N., Liu, B., & Lim, E.-P. (2010). Finding Unusual Review Patterns Using Unexpected

Rules. Proceedings of CIKM’2010.

Kanayama, H., Nasukawa, T., & Watanabe, H. (2004). Deeper sentiment analysis using machine

translation technology. Proceedings of the 20th international conference on Computational

Linguistics - COLING ’04 (p. 494). Morristown, NJ, USA: Association for Computational

Linguistics.

Kennedy, A., & Inkpen, D. (2006, may). Sentiment classification of movie reviews using

contextual valence shifters. 22 (2), pp. 110-125.

Kessler, J. S., Eckert, M., Clark, L., & Nicolov, N. (2010). The 2010 ICWSM JDPA Sentment

Corpus for the Automotive Domain.

Kleinrock, L. (1996). Queueing systems: problems and solutions. New York: John Wiley &

Sons.

Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate Detection using Shallow Text

Features. Proceedings of The Third ACM International Conference on Web Search and Data

Mining, WSDM 2010. New York.

Kotsiantis, S. (2007, October 3). Supervised Machine Learning: A Review of Classification.

Informatica , 31 (3), pp. 249-268.

Krstajić, M., Mansmann, F., Stoffel, A., Atkinson, M., & Keim, D. A. (2010). Processing Online

News Streams for Large-scale Semantic Analysis. Proceedings of DESWeb 2010.

Page 70: D2.1 Technical requirements and state-of-the-artfirst.ijs.si/FirstShowcase/Content/reports/D2.1.pdf · ―Problem analysis‖ and ―State of the art‖) and report template (Annex

D2.1

© FIRST consortium Page 70 of 90

Lai, C. L., Xu, K. Q., Lau, R. Y., Li, Y., & Jing, L. (2010). Toward a Language Modeling

Approach for Consumer Review Spam Detection. Proceedings of IEEE 7th International

Conference on E-Business Engineering, (pp. 1-8).

Lai, C. L., Xu, K. Q., Lau, R. Y., Li, Y., & Song, D. (2010). High-Order Concept Associations

Mining and Inferential Language Modeling for Online Review Spam Detection. Proceedings of

IEEE International Conference on Data Mining Workshops, (pp. 1120-1127).

Lakoff, G. (1973). Hedges: A study in meaning criteria and the logic of fuzzy concepts. 2 (4), pp.

458-508.

Last, M. (2002). Online Classification of Nonstationary Data Streams. Intelligent Data Analysis ,

6 (2), 129-147.

Lau, R. Y., Liao, S. S., & Xu, K. (2010). An Empirical Study of Online Consumer Review

Spam: A Design Science Approach. Proceedings of ICIS 2010.

Lehrer, A. (1974). Semantic Fields and Lexical Structure. North Holland, Amsterdam and New

York.

Lendasse, A., Lee, J., de Bodt, R., Wertz, V., & Verleysen, M. (2001). Dimension reduction of

technical indicators for the prediction of financial time series - Application to the BEL20 Market

Index. European Journal of Economic and Social Systems , 15, pp. 31-48.

Li, S., & Momoi, K. (2001). A Composite Approach to Language/Encoding Detection.

Proceedings of the Nineteenth International Unicode Conference. San Jose.

Li, W., Zhong, N., & Liu, C. (2006). Combining Multiple Email Filters Based on Multivariate

Statistical Analysis. Proceedings of ISMIS 2006.

Lim, E.-P., Nguyen, V.-A., Jindal, N., Liu, B., & Lauw, H. W. (2010). Detecting Product Review

Spammers using Rating Behaviors. Proceedings of CIKM 2010.

Liu, B. (2010). Sentiment Analysis and Subjectivity.

Loughran, T., & McDonald, B. (2010). When is a Liability not a Liability? Textual Analysis,

Dictionaries, and 10-Ks.

Maedche, A., & Staab, S. (2000). The TEXT-TO-ONTO Ontology Learning Environment. As

software demonstration at the Eight International Conference on Conceptual Structures ICCS-

2000.

Manku, G. S., & Motwani, R. (2002). Approximate frequency counts over data streams.

Proceedings of the 28th International Conference on Very Large Data Bases. Hong Kong.

Manku, G. S., Jain, A., & Sarma, A. D. (2007). Detecting Near-Duplicates for Web Crawling.

Proceedings of WWW 2007.

Mitchell, T. (1997). Machine Learning. McGraw-Hill.

Neal, R. (1992, July). Connectionist learning of belief networks. Artificial Intelligence , 56 (1),

pp. 71-113 .

Nonaka, I., & Takeuchi, H. (1995). The Knowledge Creating Company. Oxford University Press.

Novak, B. (2008). Odkrivanje tematik v zaporedju besedil in sledenje njihovim spremembam

[Topic Detection and Tracking in a Stream of Documents], B.Sc. thesis. Ljubljana: Faculty of

Computer and Information Science.

Ntoulas, A., Najork, M., Manasse, M., & Fetterly, D. (2006). Detecting Spam Web Pages

through Content Analysis. Proceedings of WWW 2006.

O'Callaghan, L., Mishra, N., Meyerson, A., Guha, S., & Motwani, R. (2002). Streaming-data

algorithms for high-quality clustering. Proceedings of IEEE International Conference on Data

Engineering.

Ordonez, C. (2003). Clustering Binary Data Streams with K-means. Proceedings of the ACM

DMKD 2003.

Paige, C. C., & Saunders, M. A. (1982). Algorithm 583: LSQR: Sparse Linear Equations and

Least Squares Problems. ACM Transactions on Mathematical Software , 8, 195–209.

Pang, B., & Lee, L. (2008, January). Opinion Mining and Sentiment Analysis. 2 (1-2), pp. 1-135.

Page 71: D2.1 Technical requirements and state-of-the-artfirst.ijs.si/FirstShowcase/Content/reports/D2.1.pdf · ―Problem analysis‖ and ―State of the art‖) and report template (Annex

D2.1

© FIRST consortium Page 71 of 90

Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. Proceedings of EMNLP 2002.

Pasternack, J., & Roth, D. (2009). Extracting Article Text from the Web with Maximum Subsequence Segmentation. Proceedings of WWW 2009.

Paulovich, F. V., Nonato, L. G., & Minghim, R. (2006). Visual Mapping of Text Collections through a Fast High Precision Projection Technique. Proceedings of the 10th Conference on Information Visualization, pp. 282–290.

Popescu, A.-M., & Etzioni, O. (2005). Extracting Product Features and Opinions from Reviews. Proceedings of EMNLP 2005.

Power, D. J. (2002). Decision support systems: concepts and resources for managers. Westport, Connecticut: Quorum Books.

Rakhlin, A., & Caponnetto, A. (2007). Stability of k-Means Clustering. Advances in Neural Information Processing Systems, 1121–1128.

Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A Bayesian Approach to Filtering Junk E-Mail. AAAI Technical Report WS-98-05.

Salton, G. (1991). Developments in Automatic Text Retrieval. Science, 253, 974–979.

Sarawagi, S. (2008). Information Extraction. Foundations and Trends in Databases, 1(3), 261–377.

Shaikh, M. A., Prendinger, H., & Mitsuru, I. (2007). Assessing Sentiment of Text by Semantic Dependency and Contextual Valence Analysis. In A. Paiva, R. Prada, & R. Picard (Eds.), Affective Computing and Intelligent Interaction (ACII 2007). Springer Berlin / Heidelberg.

Shaparenko, B., Caruana, R., Gehrke, J., & Joachims, T. (2005). Identifying Temporal Patterns and Key Players in Document Collections. Proceedings of TDM 2005, pp. 165–174.

Sorkine, O., & Cohen-Or, D. (2004). Least-Squares Meshes. Proceedings of Shape Modeling International, pp. 191–199.

Souter, C., Churcher, G., Hayes, J., & Hughes, J. (1994). Natural Language Identification Using Corpus-Based Models. Hermes Journal of Linguistics, 183–203.

Spousta, M., Marek, M., & Pecina, P. (2008). Victor: the Web-Page Cleaning Tool. Proceedings of the 4th Web as Corpus Workshop (WAC4), LREC 2008. Marrakech.

Sprague, R. H., & Carlson, E. D. (1982). Building effective decision support systems. Englewood Cliffs, N.J.: Prentice-Hall.

Stone, P. J. (1966). The General Inquirer: A Computer Approach to Content Analysis. The MIT Press.

Turban, E., Aronson, J., Liang, T.-P., & Sharda, R. (2010). Decision Support and Business Intelligence Systems. Prentice Hall.

Turney, P. D., & Littman, M. L. (2003). Measuring Praise and Criticism: Inference of Semantic Orientation from Association. ACM Transactions on Information Systems, 21(4), 315–346.

Vallés-Balaguer, E., Rosso, P., Locoro, A., & Mascardi, V. (2010). Análisis de Opiniones con Ontologías. January–June 2010 (41), pp. 29–37.

Virant, J. (1991). Modeliranje in simuliranje racunalniških sistemov. Radovljica: Didakta.

Wang, H., Fan, W., Yu, P., & Han, J. (2003). Mining Concept-Drifting Data Streams using Ensemble Classifiers. Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Washington DC.

Wang, X. (2006). On the Effects of Dimension Reduction Techniques on Some High-Dimensional Problems in Finance. Operations Research, 54(6).

Wesolowsky, G. (1993). The Weber problem: History and perspective. Location Science, 1, 5–23.

Wiebe, J., Wilson, T., Bruce, R., Bell, M., & Martin, M. (2004). Learning Subjective Language. Computational Linguistics, 30(3), 277–308.

Wilson, T., Wiebe, J., & Hoffmann, P. (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proceedings of HLT/EMNLP 2005, pp. 347–354.


Wimalasuriya, D. C., & Dou, D. (2010). Ontology-based information extraction: An introduction and a survey of current approaches. Journal of Information Science, 36(3), 306–323.

Wu, B., & Davison, B. D. (2006). Identifying link farm spam pages. Proceedings of WWW 2006.

Wu, B., Goel, V., & Davison, B. D. (2006). Topical TrustRank: using topicality to combat Web spam. Proceedings of WWW 2006.

Yi, J., Nasukawa, T., Bunescu, R. C., & Niblack, W. (2003). Sentiment Analyzer: Extracting Sentiments about a Given Topic using Natural Language Processing Techniques. Third IEEE International Conference on Data Mining (ICDM), pp. 427–434.

Yu, L., & Liu, H. (2005). Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. Proceedings of CORES 2005, the 4th International Conference on Computer Recognition Systems.

Zadeh, L. (1972). A fuzzy-set-theoretic interpretation of linguistic hedges. Journal of Cybernetics, 2, 4–34.

Zhang, Z., Zhang, C., & Ong, S. S. (2000). Building an Ontology for Financial Investment. Intelligent Data Engineering and Automated Learning – IDEAL 2000: Data Mining, Financial Engineering, and Intelligent Agents (pp. 379–395). Springer.

Zhao, L., & Li, C. (2009). Ontology Based Opinion Mining for Movie Reviews. In D. Karagiannis & Z. Jin (Eds.), Knowledge Science, Engineering and Management (KSEM 2009). Springer Berlin / Heidelberg.

Zhou, L., & Chaovalit, P. (2008). Ontology-Supported Polarity Mining. Journal of the American Society for Information Science and Technology, 59(1), 98–110.

Ziegler, P., & Dittrich, K. R. (2004). User-Specific Semantic Integration of Heterogeneous Data: The SIRUP Approach. First International IFIP Conference on Semantics of a Networked World (ICSNW 2004), LNCS 3226, pp. 44–64. Paris, France: Springer.

Žnidaršič, M., & Bohanec, M. (2007). Automatic revision of qualitative multi-attribute decision models. Foundations of Computing and Decision Sciences, 32(4), 315–326.

Žnidaršič, M., Bohanec, M., & Zupan, B. (2008). Modelling impacts of cropping systems: Demands and solutions for DEX methodology. European Journal of Operational Research, 189, 594–608.


Annex 1. Preliminary empirical evaluation of the implemented language detection technique

We develop a single framework for both language and code page detection. Text files containing the language corpora are read and tokenised, and the corresponding n-gram profiles are built. As already mentioned, n-grams are sequences of n letters obtained by slicing individual words in a sliding-window manner. A profile (model), built for each language corpus or test text, is a histogram of n-grams. Built profiles are easily written (serialised) to files for efficient reuse. Infrequent n-grams are discarded according to a specified cutoff parameter to reduce the model size. Some similarity measures (e.g. the 'out-of-place' measure) require profiles to be ranked, i.e. the frequencies of n-grams are converted into ranks.
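For illustration, profile construction along these lines can be sketched in C# as follows; this is a minimal sketch under assumed names (NGramProfiler, BuildProfile) and is not the actual FIRST implementation.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    public static class NGramProfiler
    {
        // Builds a ranked n-gram profile: the 'cutoff' most frequent n-grams,
        // mapped to their rank (1 = most frequent n-gram).
        public static Dictionary<string, int> BuildProfile(string text, int n, int cutoff)
        {
            var counts = new Dictionary<string, long>();
            // Tokenise into words (letter sequences) and slice each word in a sliding-window manner.
            foreach (Match word in Regex.Matches(text.ToLowerInvariant(), @"\p{L}+"))
            {
                string w = word.Value;
                for (int i = 0; i + n <= w.Length; i++)
                {
                    string gram = w.Substring(i, n);
                    counts[gram] = counts.TryGetValue(gram, out long c) ? c + 1 : 1;
                }
            }
            // Keep only the 'cutoff' most frequent n-grams and convert frequencies into ranks.
            var profile = new Dictionary<string, int>();
            int rank = 1;
            foreach (var kv in counts.OrderByDescending(kv => kv.Value).Take(cutoff))
                profile[kv.Key] = rank++;
            return profile;
        }
    }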

Code page detection is incorporated into language detection, as it uses the same n-gram technique. A language corpus can be encoded in any of the supported code pages; different encodings of the same language are simply treated by the framework as separate languages. We show that the n-gram technique and the similarity measures used for language detection also suit code page detection.

The evaluation platform performs tests over a wide range of parameter values with the aim of finding the optimal language model.

Annex a. Experiments

As the main goal of the performed experiments is to assess the accuracy of the implemented language and code page detection algorithms, we find it sufficient to focus the tests on a group of similar languages. We use a dataset of language corpora named 'Multext-East cesDoc'1 (Erjavec, 2004), comprising newspaper articles, spoken texts and one novel ("1984" by George Orwell) for each language. We include nine Eastern European languages in the test: Bulgarian, Czech, Estonian, Hungarian, Lithuanian, Romanian, Russian, Slovenian, and Serbian. In addition, we include English as the language to be analysed in FIRST. The language corpora were halved into training and test sets, which were used to build the profiles of the languages and the profiles of the test texts, respectively. Tests were run over a range of parameter values to determine the optimal settings.

Several parameters influence the accuracy. It is desirable to keep the model as small as possible without losing too many of its distinctive n-grams. Since n-grams ordered by frequency follow Zipf's distribution (Cavnar & Trenkle, 1994), it is theoretically justified to keep only a certain number of the most frequent n-grams; our experiments confirm this. We refer to this parameter as the cutoff parameter. The accuracy is also greatly affected by the similarity measure with which profiles are compared (Table 5). Among the tested methods, the out-of-place measure on n-gram ranks proves to be the most accurate. We also employ the Spearman similarity on ranks and the cosine similarity on frequencies, and show that both are outperformed by the out-of-place measure. The experiments further show that the accuracy is significantly lower for shorter texts (i.e. 100 letters and below), as is evident from Table 3 and Figure 22.
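As an illustration, the out-of-place measure over two ranked profiles can be computed roughly as sketched below; the penalty for n-grams missing from the language profile and all identifiers are assumptions of this sketch, not the actual implementation.

    using System;
    using System.Collections.Generic;

    public static class ProfileSimilarity
    {
        // Out-of-place distance between a test-text profile and a language profile,
        // both given as n-gram -> rank maps (1 = most frequent). Lower means more similar.
        public static long OutOfPlaceDistance(
            IDictionary<string, int> textProfile,
            IDictionary<string, int> languageProfile,
            int maxPenalty) // penalty for n-grams absent from the language profile (an assumption)
        {
            long distance = 0;
            foreach (var kv in textProfile)
            {
                distance += languageProfile.TryGetValue(kv.Key, out int languageRank)
                    ? Math.Abs(kv.Value - languageRank)
                    : maxPenalty;
            }
            return distance;
        }
    }

The detected language (or code page) is then the one whose profile yields the smallest distance to the profile of the test text.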

We experimented with n-grams of length 1 (unigrams) to 5, among which 2-grams (bigrams) offer the best accuracy. We varied the size of the profiles from 100 to 1000 most frequent n-grams (set with the cutoff parameter). Table 3 and Figure 23 show that a model of only the 300 most frequent n-grams is adequate even for shorter texts. The shortest test text is 100 letters long and the longest is 1000 letters long (Table 3 and Figure 22), which is about the size of an average paragraph in a document.

1 Available at http://nl.ijs.si/ME/ .


n-gram    test text length = 1000          test text length = 100
          cutoff = 300    no cutoff        cutoff = 300    no cutoff
1         99.59897        99.59897         94.25949        94.25949
2         100.00          100.00           99.20668        99.46826
3         100.00          100.00           99.17094        99.80703
4         99.98568        100.00           98.78786        99.62692
5         99.94271        100.00           98.04314        99.85849

Table 3: Language detection accuracy [%] for long and short test texts.

We made a separate evaluation of the code page detection algorithm using the same out-of-place similarity measure. Table 4 shows that the accuracy depends on the length of the test text and on the cutoff parameter even more than is the case with language detection.

n-gram    test text length = 1000          test text length = 100
          cutoff = 300    no cutoff        cutoff = 300    no cutoff
1         100.00          100.00           89.31661        89.31661
2         100.00          100.00           88.79127        89.31208
3         100.00          100.00           88.67352        88.79127
4         99.8639         100.00           88.14365        88.77315

Table 4: Code page detection accuracy [%] for long and short test texts.

Table 5 shows the results of the experiments with the two alternative similarity measures, i.e. the cosine and the Spearman similarity measure. While the most accurate similarity measure is the out-of-place measure on ranked 2-grams (used in all graphs and tables in this report), the other two measures have the advantage of being bounded to the interval [0, 1].

Similarity measure        Accuracy
Out-of-place (2-grams)    100.00
Cosine (5-grams)          82.49785
Spearman (3-grams)        79.20366

Table 5: Language detection accuracy [%] for three different similarity measures on long test texts (1000 letters) without cutoff.


Figure 22: Language detection accuracy of different n-grams for different lengths of test texts. Bigrams are the most accurate.

Figure 23: Language detection accuracy of different n-grams for different cutoffs.

[Charts for Figures 22 and 23: language detection accuracy [%] plotted against the test text length (100–1000 characters) for 1- to 5-gram profiles, and against the cutoff (200–1000 most frequent n-grams) for 2- to 5-gram profiles, respectively.]


Annex b. Conclusions

We presented one of the most common methods for language and code page detection and showed that the method is practically perfectly accurate for longer texts (more than 100 letters). The same model successfully detects both languages and encodings. Code page detection proved to be more sensitive to the length of the test text, because the distinctions between encodings are made by a few special characters. If these distinctive characters are not present in the text, there is normally no way to accurately detect the encoding.

We conducted tests with different parameter values on a group of similar languages in order to find a model that is as small (fast) and as accurate as possible. There is a tradeoff between the size of the model (the cutoff) and the accuracy, but it is not a significant one: taking more than the 300 most frequent n-grams showed no significant improvement. Bigrams (2-grams) performed best for language detection, closely followed by 3-grams.

All the texts used in the language detection tests were encoded in Unicode. To evaluate the code page detection algorithm, on the other hand, we used several different encodings of Slovenian texts (Unicode, IBM-852, ISO Latin 2, Windows 1250). In this setting, unigrams outperformed bigrams, but not significantly.

From our tests, we conclude that the single most appropriate model for both language and code page detection of longer texts is a bigram (2-gram) profile of length (cutoff) 300 combined with the 'out-of-place' profile similarity measure.


Annex 2. Preliminary empirical evaluation of the implemented boilerplate removal technique

Annex a. Feature selection

In order to make use of the structural information (i.e. block precedence), we treat the features of the previous and next blocks as additional features of the block under consideration. Figures 24 and 25 depict the actual features employed in the training of the decision tree, ordered by decreasing information gain. The feature names are prefixed with the letters p, c and n, indicating the block precedence: previous, current and next, respectively.
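For illustration, the combined feature map of a block could be assembled as in the following C# sketch; the Block type, the exact set of per-block features and the zero-filling of missing neighbours are assumptions made for this sketch.

    using System.Collections.Generic;

    // Hypothetical per-block feature container; the real feature set is larger.
    public class Block
    {
        public int Words;            // number of words in the block (Wrds)
        public double LinkDensity;   // share of words inside anchor tags (LnkDns)
        public double RelPos;        // relative position of the block in the document (RelPos)
    }

    public static class FeatureAssembly
    {
        // Builds the feature map of a block by also copying the features of its
        // previous and next neighbours, using the p./c./n. prefixes.
        public static Dictionary<string, double> Combine(Block prev, Block cur, Block next)
        {
            var features = new Dictionary<string, double>();
            Add(features, "p.", prev);
            Add(features, "c.", cur);
            Add(features, "n.", next);
            return features;
        }

        private static void Add(Dictionary<string, double> f, string prefix, Block b)
        {
            // Missing neighbours (first/last block) simply contribute zero-valued features.
            f[prefix + "Wrds"] = b?.Words ?? 0;
            f[prefix + "LnkDns"] = b?.LinkDensity ?? 0.0;
            f[prefix + "RelPos"] = b?.RelPos ?? 0.0;
        }
    }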

Figure 24: Features ordered by the information gain for the 2-class (boilerplate vs. full content) problem.


Figure 25: Features ordered by the information gain for the 6-class (all text classes) problem.

There are two primary text classes (fragments) which we wish to distinguish, namely boilerplate and full content. The method we implemented (Kohlschütter, Fankhauser, & Nejdl, 2010) shows that these two main text classes can be accurately separated by employing only a small subset of features. The full content class further comprises finer text classes: headline, full text, enumeration, navigation, disclaimer (copyright) notice, user comments, supplemental text, etc. These finer classes differ in subtler details, and presumably more features are necessary to achieve adequate accuracy. In practice, a simpler model (a smaller decision tree) could be used for the two major classes and a more complex one when we need to distinguish among several classes.

Annex b. Datasets

All of the boilerplate removal approaches described in the surveyed papers use a collection of labelled web pages from several sources as the training and testing dataset. The sources were credible news sites (e.g. those indexed by Google News).


The different text classes in the collected web pages were annotated manually, usually with the help of a browser-like annotation tool such as the one described in (Spousta, Marek, & Pecina, 2008).

All the examined techniques use their own in-house news article dataset for training. Some also use the more widely known CLEANEVAL dataset (available at http://cleaneval.sigwac.org.uk/), a benchmark corpus from the shared boilerplate removal competition. Because of its plain-text structure, it is better suited to evaluation than to training supervised methods.

The implemented shallow-text-features method (Kohlschütter, Fankhauser, & Nejdl, 2010) uses its own news article dataset (referred to as the GoogleNews dataset), available for download at http://www.l3s.de/~kohlschuetter/boilerplate/, for both training and testing the supervised (decision tree) model. The dataset comprises 621 news articles from 408 different web sites (sources). It is a randomly sampled subset of a much larger collection of 254,000 crawled articles from 7,854 web sites, obtained from the Google News search engine in 2008. Regional and topical diversity is ensured by choosing articles from six different English-language Google News portals and four categories. The fragments of the dataset are annotated with <span> tags into six different text classes, namely not content (i.e. boilerplate), headline, full text, supplemental, related content, and comments. Any part of a web page that is not annotated is considered not content. Kohlschütter et al. (2010) also employed the CLEANEVAL dataset to evaluate the simpler model on the boilerplate vs. full content problem and to test general domain independence.

Annex c. Implemented method

As mentioned, we chose to implement the boilerplate detection method based on shallow text features (Kohlschütter, Fankhauser, & Nejdl, 2010). In this section, we give an insight into the implementation itself.

Tokenising the HTML file, which is often improperly formed, poses the first part of the problem. A missing closing tag or angle bracket can cause severe misinterpretation of the content. Generally speaking, there are strong disagreements on whether to use the structural DOM approach to parsing HTML or to rely on regular expressions, and there are a few major approaches to managing (improper) HTML. In (Gibson, Wellner, & Lubar, 2007), raw HTML is sanitised and transformed into XHTML with dedicated tools; in the rare cases of failure, the document is discarded. In our implementation, we currently employ regular expressions to tokenise the HTML and recognise different token types. Further experiments with the DOM approach to parsing and various pre-processing tools are planned. Some of the popular tools/libraries for HTML pre-processing are Tag Soup, Beautiful Soup, Lynx and HtmlAgilityPack.
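A minimal regular-expression tokeniser in this spirit might look as follows in C#; the token classes and the pattern are simplified assumptions for illustration, not the production code.

    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public enum TokenType { Tag, Word, Symbol }

    public class Token
    {
        public Token(TokenType type, string text) { Type = type; Text = text; }
        public TokenType Type { get; }
        public string Text { get; }
    }

    public static class HtmlTokeniser
    {
        // Matches an HTML tag, a word (letters/digits), or a single symbol/punctuation mark.
        private static readonly Regex TokenPattern =
            new Regex(@"<[^>]*>|[\p{L}\p{N}]+|[^\s<]", RegexOptions.Compiled);

        public static IEnumerable<Token> Tokenise(string html)
        {
            foreach (Match m in TokenPattern.Matches(html))
            {
                string t = m.Value;
                TokenType type = t.StartsWith("<") ? TokenType.Tag
                               : char.IsLetterOrDigit(t[0]) ? TokenType.Word
                               : TokenType.Symbol;
                yield return new Token(type, t);
            }
        }
    }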

The created tokens (HTML tags, words (including numbers), symbols, punctuation marks) are further processed to create blocks. Special attention is paid to skipping the content inside comments and inside the <script> and <style> tags, which is not relevant to any text class other than boilerplate. Each block contains at least one word enclosed by HTML tags. The enclosing tags can be any tags except anchor tags, which are ignored so that the link density (anchor percentage) feature can be calculated. The features of a block are calculated just before the next block is created. An additional pass through the list of blocks may be necessary to calculate features relative to the number of blocks in the document (e.g. the relative block position).
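The basic block features can then be accumulated while scanning the tokens, roughly as sketched below; the type and member names are assumptions, and the real feature set is considerably larger.

    using System.Collections.Generic;

    // Simplified per-block feature accumulation: word count and link density
    // (the share of words that occur inside anchor tags).
    public class TextBlock
    {
        public int WordCount;
        public int AnchorWordCount;
        public double RelativePosition;   // filled in a second pass over all blocks

        public double LinkDensity =>
            WordCount == 0 ? 0.0 : (double)AnchorWordCount / WordCount;
    }

    public static class BlockBuilder
    {
        // Called for every word token encountered while scanning a block.
        public static void AddWord(TextBlock block, bool insideAnchor)
        {
            block.WordCount++;
            if (insideAnchor) block.AnchorWordCount++;
        }

        // Second pass: features relative to the number of blocks in the document.
        public static void ComputeRelativePositions(IList<TextBlock> blocks)
        {
            for (int i = 0; i < blocks.Count; i++)
                blocks[i].RelativePosition =
                    blocks.Count == 1 ? 0.0 : (double)i / (blocks.Count - 1);
        }
    }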

If the given collection of documents is annotated, <span> tags with specific CSS classes indicating text classes are treated as text class annotations. This does not necessarily imply the creation of a new block if the current block already belongs to that text class. The annotated documents are used only in the learning phase, when creating the learning dataset for the decision tree. In the case of non-annotated documents, all <span> tags are treated as block separators.

The learning dataset with the chosen block features is saved into a CSV file and then used to learn the decision tree. Following the paper (Kohlschütter, Fankhauser, & Nejdl, 2010), we used the Weka toolkit to build the decision tree and to generate its textual printout, which is later read back by our program. No decision tree learning algorithm was implemented by us, and no other classifier was experimented with, although SMO is mentioned in the paper.

Any given HTML document, from which we want to extract the content belonging to one or more text classes, is cleaned, tokenised and divided into blocks, and the features of every block are calculated. Each block is classified independently into one of the text classes by passing its features through the decision tree. The textual content of all blocks classified as the requested text class(es) is assembled into the result.

Annex d. Experiments and results

Considering the vast number of potential features, we put significant effort into finding the best subset. In line with most of the papers, we relied on the simplest form of feature selection: generating learning datasets with various feature combinations and training the tree on each of them. We defer more thorough feature selection to potential later improvements.

The evaluation was done by generating a learning CSV dataset from the collection of annotated documents (the GoogleNews dataset from (Kohlschütter, Fankhauser, & Nejdl, 2010), obtainable at http://www.l3s.de/~kohlschuetter/boilerplate/). As we are not aware of any Weka learning parameters that select only a subset of the attributes, a new dataset with a different attribute set had to be created for each tree to be learned.

The learning of the decision trees is carried out entirely by the Weka toolkit. The trees were built with reduced-error post-pruning and validated with 10-fold cross-validation. The accuracies of the decision tree models for the boilerplate vs. full content problem are given in Table 6, along with the features used and the tree complexities (number of leaves). In this two-class problem, everything but the boilerplate was considered main content. The first few levels of the corresponding decision tree are shown in Figure 26.

As expected, the accuracies are lower for the more demanding six-class problem. Table 7 contains the accuracies of several models together with the corresponding features. A more detailed overview of the text class separability for the overall most accurate model is given in Table 8.


Features                              Class          Precision  Recall  F1-score  FP rate  ROC AUC  # Leaves
All features (as depicted             boilerplate    93.3       94.5    93.9      4.6      96.8     601
in Figure 24)                         full content   96.5       95.7    96.1      6.4      96.8
Word count                            boilerplate    87.3       90.3    88.8      8.5      95.6     251
(p.Wrds, c.Wrds, n.Wrds)              full content   93.6       91.5    92.6      9.7      95.6
Word count and link density           boilerplate    92.4       93.0    92.7      4.9      96.9     489
(p.Wrds, p.LnkDns, c.Wrds,
c.LnkDns, n.Wrds, n.LnkDns)           full content   95.5       95.1    95.3      7.0      96.9

Table 6: Classification accuracies of the C4.5 decision tree for the 2-class (boilerplate vs. full content) problem.

Features                    Class         Precision  Recall  F1-score  FP rate  ROC AUC  # Leaves
All features (as depicted   boilerplate   91.9       95.2    93.5      5.4      96.6     682
in Figure 25)               headline      72.3       55.6    62.9      0.2      86.3
                            full text     88.8       95.6    92.1      11.7     94.2
                            supplement    47.7       24.5    32.4      0.4      73.6
                            related       27.7       2.6     4.7       0.0      77.2
                            comments      69.2       39.5    50.3      1.6      80.7

Table 7: Classification accuracies of the C4.5 decision tree for the 6-class (all text classes) problem.

actual \ predicted  boilerplate  headline  full text  supplement  related  comments
boilerplate         202003       475       6542       926         134      2115
headline            1446         2917      549        200         2        132
full text           5339         302       255434     697         34       5472
supplement          3240         191       2228       1977        18       412
related             2883         24        225        26          84       33
comments            4856         127       22708      323         31       18312

Table 8: Misclassifications for the 6-class problem (rows: actual class, columns: predicted class).


Figure 26: First few levels of a boilerplate removal decision tree.

Annex e. Conclusions

We presented our implementation of one of the most recent boilerplate removal methods and mentioned other similar ones. The method is fairly accurate and suitable for news article web sites in different languages and on different topics. Moreover, its well-annotated learning dataset allows the method to support various text classes. The multi-class problem can also be solved fairly accurately, given the proper features. Other advantages of the implementation are the simple, word-oriented features and the simple overall model (a decision tree), the ability to handle non-contiguous article chunks, the ease of adding new text-block features as needed, and language and topic independence.

The first candidate for further improvement is better preprocessing of malformed and dynamically generated HTML content, although this arguably does not belong to the boilerplate removal method itself. As for the accuracy, only about 5% remains to be gained in the simpler two-class problem; there is more room for improvement in the multi-class problem.

[Content of Figure 26: the root of the tree tests c.LnkDns <= 0.1562, followed by tests on c.Wrds, c.CapWrdRtio, n.Wrds, n.RelPos, p.Wrds and n.InP; the leaves are labelled boilerplate or content.]


Annex 3. Manually created resources for sentiment words

The General Inquirer (Stone, 1966) is a tool that maps each text file to counts over dictionary-supplied categories1. The Harvard-IV-4 Psychosociological Dictionary is part of the General Inquirer. The categories positive (containing 1,915 words of positive outlook) and negative (containing 2,291 words of negative outlook) are often used in opinion mining. The words have been manually classified into various types of positive or negative semantic orientation, including words having to do with agreement or disagreement. There are also many other (non-exclusive) categories, including 'negate' (217 words that refer to reversal or negation), 'Ovrst' (overstatement, 696 words indicating emphasis) and 'Undrst' (understatement, 319 words indicating de-emphasis).

Whissell's Dictionary of Affect in Language2 is a list of 8,742 words which have been rated by humans along the dimensions of pleasantness, activation, and imagery. In each case a three-point scale was used.

As part of their work on subjectivity classification, (Wilson, Wiebe, & Hoffmann, 2005) created a prior-polarity subjectivity lexicon of subjectivity clues (words and phrases)3. It contains over 8,000 clues collected from the General Inquirer, the list of (Hatzivassiloglou & McKeown, 1997) and additional words. The words have been labelled by human annotators as positive, negative or neutral, as well as strongly or weakly subjective.

(Kessler, Eckert, Clark, & Nicolov, 2010) created the ICWSM4 2010 JDPA corpus for the automotive domain5. The corpus consists of manually annotated blog posts. Sentiment expressions have been annotated, as have negators (expressions which invert the polarity of a sentiment expression or modifier), neutralisers (expressions that do not commit the speaker to the truth of the target sentiment expression or modifier), committers (expressions which shift the commitment of the speaker towards the truth of a sentiment expression or modifier) and intensifiers (expressions which shift the intensity of a sentiment expression or modifier).

(Loughran & McDonald, 2010) manually created lists of oriented words for the finance domain6. They extracted all words and word counts from all corporate 10-K reports filed during 1994-2008 and examined all words occurring in at least 5% of the documents, in order to consider their most likely usage in financial documents. The results are lists of words expressing negative, positive and uncertain sentiments, as well as litigious words (reflecting a propensity for legal contest) and strong and weak modal words. Words may occur in several lists. The lists contain all inflections of the words. The negative word list contains 2,337 words, the positive list 353 words.

1 Available from http://www.wjh.harvard.edu/~inquirer/

2 Available from ftp://ftp.hdcus.com/wdalx.exe

3 Available from http://www.cs.pitt.edu/mpqa/

4 International AAAI Conference on Weblogs and Social Media

5 Available from http://www.icwsm.org/data/

6 Available from http://www.nd.edu/~mcdonald/Word_Lists.html


Annex 4. Sentiment classification – problem definition

There is a wide variety of vocabulary used in the literature on the automatic extraction of sentiment. The same task is referred to as sentiment analysis, sentiment extraction, opinion mining, subjectivity analysis, emotional polarity computation and other terms. The same is true for the object of classification, the polarity of opinions: commonly used terms are sentiment orientation, polarity of opinion and semantic orientation. We follow (Pang & Lee, 2008) and (Liu, 2010) in regarding these terms as synonyms.

Annex a. Sentiment

The definition of polarity of sentiment in the literature is often vague. (Pang & Lee, 2008) cite the definition of Merriam-Webster's Online Dictionary on the synonyms of opinion: "opinion, view, belief, conviction, persuasion [and] sentiment mean a judgment one holds as true. Opinion implies a conclusion thought out yet open to dispute [...]. View suggests a subjective opinion [...]. Belief implies often deliberate acceptance and intellectual assent [...]. Conviction applies to a firmly and seriously held belief [...]. Persuasion suggests a belief grounded on assurance (as by evidence) of its truth [...]. Sentiment suggests a settled opinion reflective of one's feelings [...]."

(Pang & Lee, 2008) point out the difference between classifying the orientation of an opinion and classifying the strength of an opinion by noting that it is possible to feel quite strongly (high on the "strength" scale) that something is mediocre (middling on the "evaluation" scale). Most research concentrates on predicting only the orientation of an opinion, although some works also predict the strength.

(Hatzivassiloglou & McKeown, 1997) define semantic orientation or polarity of a word as "indicating the direction the word deviates from the norm for its semantic group or lexical field" (following (Lehrer, 1974)). The semantic polarity of a word constrains its use in a language.

(Andreevskaia & Bergler, 2006) view sentiment as a fuzzy category (Zadeh et al., 1987), where membership is gradual and some members are more central than others. Words that are less central are more likely to be ambiguous and will also be difficult to annotate for human annotators. By only including words with greater centrality, the accuracy of a system can be increased and a system can be evaluated more realistically.

(Zhou & Chaovalit, 2008) define the semantic orientation of a word as the measure of the "relative difference in semantic scores between a target word and the average of all words in a text".

(Turney & Littman, 2003) define semantic orientation as "the evaluative character of a word". They write that "Positive semantic orientation indicates praise (e.g., 'honest', 'intrepid') and negative semantic orientation indicates criticism (e.g., 'disturbing', 'superfluous'). Semantic orientation varies in both direction (positive or negative) and degree (mild to strong)."

For their paper on the extraction of sentiment from stock message boards, (Das & Chen, 2007) define sentiment as "the net of positive and negative opinion expressed about a stock on its message board". Opinion is here defined as the result of classifying a message as positive or negative. The classification is done independently by five classifiers, and the majority vote is taken.

(Liu, 2010) defines an opinion on a feature f as "a positive or negative view, attitude, emotion or appraisal on f from an opinion holder". An opinion holder is the person or organization that expresses the opinion. A feature is a component (part) or an attribute of an object that can be explicitly mentioned or implied in the sentence; the object itself is included as a special feature. Following Liu, the orientation of an opinion on a feature indicates whether the opinion is positive, negative or neutral. Opinions can be direct (about one object) or indirect (comparing one object to another). They can occur in text as explicit opinions ("The quality is good") or as implicit opinions implied in an objective sentence ("The earphone broke in two days"). A direct opinion is then represented as a quintuple of object, feature, orientation of the opinion, opinion holder and time.
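Liu's quintuple can be captured directly in a data structure; the following C# sketch is purely illustrative, and the type and member names are our own assumptions.

    using System;

    // A direct opinion in Liu's (2010) quintuple representation:
    // (object, feature, orientation, opinion holder, time).
    public enum OpinionOrientation { Positive, Negative, Neutral }

    public class DirectOpinion
    {
        public string TargetObject;            // e.g. a financial instrument or a company
        public string Feature;                 // a component or attribute; the object itself is a special feature
        public OpinionOrientation Orientation; // positive, negative or neutral
        public string OpinionHolder;           // the person or organization expressing the opinion
        public DateTime Time;                  // when the opinion was expressed
    }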

Annex b. Subproblems

Annex i. Sentiment retrieval

Before sentiments can be analysed, they have to be retrieved. The field of information retrieval (IR) is very broad and older than the internet (Manning et al., 2008). Nevertheless, the world wide web has become the main source of information, and thus we "restrict" ourselves to this medium. Furthermore, the focus of our work does not lie in information retrieval techniques but rather in the analysis of sentiments. For a comprehensive introduction to and overview of IR, Manning et al. (2008) is recommended.

Annex ii. Subjectivity classification

Subjectivity classification is the task of classifying a sentence or a document as opinionated (containing an opinion) or not opinionated. This is not the same as classifying sentences as carrying positive, negative or neutral opinions. A neutral opinion is an opinion that lies between positive and negative; it is possible to have a very strong opinion that something is mediocre (Pang & Lee, 2008). This is different from sentences that express no opinion at all but state a fact, like "Yesterday I went shopping".

Opinionated sentences are not necessarily only subjective sentences (Liu, 2010; Pang & Lee, 2008). For example, a piece of news can be good or bad without being subjective: "the stock price rose" is objective information that is generally considered good news in appropriate contexts. Objective information can often help to determine the sentiment expressed in a document, so such sentences are also considered opinionated sentences.

Subjectivity classification is often done in combination with sentiment classification, but it can also be done as a step before sentiment classification to prevent the classifier from considering irrelevant or misleading examples. Most researchers use machine learning classifiers for subjectivity classification.

We do not tackle the problem of subjectivity classification in this work; instead, we assume a sentence to be opinionated and relevant for sentiment classification if it is in the present tense (not past, not conditional) and mentions a financial instrument or an indicator. For more information on subjectivity classification, see (Wiebe, Wilson, Bruce, Bell, & Martin, 2004).

Annex iii. Topic relevance and topic shift

There are two main problems related to the focused topic when dealing with sentiment analysis: topic relevance and topic shift. O'Hare et al. (2009) give an overview of these problems. They call their approach topic-dependent sentiment analysis, which emphasises the fact that they tackle topic-related problems.

Topic relevance concerns the degree of relatedness of a sentiment to the topic of interest. Within our work, this relevance is assumed: the analysed sentiments are taken from relevant blog entries which deal with the financial object in question, and thus fulfil our criteria for topic relevance.

Topic shift means that there are "several topics discussed in one document" (O'Hare et al., 2009). Although it must be assumed that this is often the case, this aspect will also not be dealt with in our work.

Annex iv. Sentiment holder extraction

It is often the case that the author of a document expresses other people's sentiments within the document. Some work has tackled this problem (Kessler et al., 2010; Ruppenhofer et al., 2008; Kim and Hovy, 2004), though in our work we assume that all sentiments inside a document are sentiments of the author of the document.


Annex 5. Document stream visualization pipeline implementation and testing

We implemented the online document stream visualization pipeline in C# on top of LATINO1, a software library providing a range of data mining and machine learning algorithms with the emphasis on text mining, link analysis, and data visualization. The only part of the visualization pipeline that is implemented in C++ (and not in C#) is the least-squares solver.

To measure the throughput of the visualization pipeline, we processed the first 30,000 news documents (i.e., from 20.8.1996 to 4.9.1996) of the Reuters Corpus Volume 1 dataset2. Rather than checking whether the visualization pipeline is capable of processing the stream at its natural rate (i.e., roughly 1.4 news documents per minute), we measured the maximum possible throughput of the pipeline at constant u (document inflow batch size) and nQ (buffer capacity). In our experiments, the buffer capacity was set to nQ = 5,000, the document batch size to u = 10, the number of clusters and thus representative instances to nC = 100, and the size of the neighbourhoods to nN = 30.

Figure 27 shows the time that packets spent in the separate stages of the pipeline (in milliseconds) when streaming the news into the system chronologically. The timing started when a particular packet (i.e., a batch of documents and the corresponding data computed in the preceding stage) entered a stage and stopped when it had been processed3. We measured the actual (wall-clock) time rather than the processor time to get a good feel for the performance in real-life applications. Our experiments were conducted on a simple laptop computer with an Intel processor running at 2.4 GHz and 2 GB of memory. The purpose of the experiment was to empirically verify that each stage of the pipeline processes a packet in constant time provided that nQ is constant. However, the chart in Figure 27 is not entirely convincing, as the time spent in some of the stages seems to increase towards the end of the stream segment (e.g., the k-means clustering algorithm takes less than 3 seconds when 10,000 documents are processed and slightly over 4 seconds when 30,000 documents are processed). This phenomenon turned out to be due to temporal properties of the dataset. Specifically, for some reason which we do not explore in this work (e.g., "big" events, changes in publishing policies, different news vendors), the average length of the news documents in the buffer increased over a certain period of time. This resulted in an increase in the number of nonzero components in the corresponding TF-IDF vectors and caused dot product computations to slow down, as on average more scalar multiplications were required to compute a dot product. In other words, the positive trends in the consecutive timings of the k-means clustering and neighbourhood computation algorithms are coincidental and do not imply that the pipeline will eventually overflow.

1 LATINO (Link Analysis and Text Mining Toolbox) is open source (mostly under the LGPL license) and is available at http://latino.sourceforge.net/.

2 Available at http://trec.nist.gov/data/reuters/reuters.html

3 Even if the next pipeline stage was still busy processing the previous packet, the timing was stopped in this experimental setting.
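The slowdown described above follows from how the dot product of two sparse TF-IDF vectors is computed: its cost grows with the number of nonzero components. A minimal C# sketch of such a computation is given below; the dictionary-based sparse-vector representation is an assumption of the sketch, not LATINO's actual data structure.

    using System.Collections.Generic;

    public static class SparseOps
    {
        // Dot product of two sparse vectors given as termId -> weight maps.
        // The loop runs over the nonzero components of the shorter vector, so the
        // cost grows with the number of nonzero components (i.e. with document length).
        public static double Dot(Dictionary<int, double> a, Dictionary<int, double> b)
        {
            if (a.Count > b.Count) { var tmp = a; a = b; b = tmp; }
            double sum = 0.0;
            foreach (var kv in a)
            {
                if (b.TryGetValue(kv.Key, out double w))
                    sum += kv.Value * w;
            }
            return sum;
        }
    }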


Figure 27: Time spent in separate stages of the pipeline when streaming the news into the system in chronological order.

Figure 28: Time spent in separate stages of the pipeline when streaming the news into the system in random order.

To confirm this, we conducted another experiment in which we randomly shuffled the first 30,000 news documents and fed them into the system in that random order. Figure 28 shows the time spent in the separate stages of the pipeline in this setting. From the chart, it is possible to see that each of the pipeline stages is up to the task: after the number of instances in the buffer has reached nQ, the processing times clearly stay within reasonable bounds that do not increase over time, which implies constant processing time at each time step. The grey series in the chart represent the actual times, while the black series represent the moving average over 100 steps (i.e., over 1,000 documents).

In addition to measuring the processing times in the separate pipeline stages, we computed the delay between packets exiting the pipeline in a real pipeline-processing scenario. We simulated the pipeline processing by taking the separate processing times into account. Note that in order to actually run our algorithm in a true pipeline sense, we would need a machine that is able to run 5 processes in parallel (e.g., a computer with at least 5 cores). Let s1, s2, s3a, s3b, and s4 correspond to the separate stages of the pipeline, that is to document preprocessing, k-means clustering, stress majorization, neighbourhoods computation, and least-squares interpolation, respectively. The stages s3a and s3b both rely only on the output of s2 and can thus be executed in parallel; these two stages can be perceived as a single stage, s3, performing stress majorization and neighbourhoods computation in parallel, and the time a packet spends in s3 is equal to the longer of the two times spent in s3a and s3b. Figure 29 shows the delay between packets exiting the pipeline. From the chart, it is possible to see that after the buffer has been filled up, the delay between two packets (which corresponds to the delay between two consecutive updates of the visualization) is roughly 4 seconds on average. This means that we are able to process a stream with a rate of at most 2.5 documents per second. Note that this roughly corresponds to 25% of the entire blogosphere rate and should be sufficient for most real-life applications. Note also that each packet, i.e., each visualization update, is delayed by approximately 9.5 seconds on average from the time a document entered the pipeline to the time it exited and was reflected in the visualization. Furthermore, since nQ = 5,000, at 2.5 documents per second the visualization represents an overview of roughly half an hour's worth of documents and is suitable for real-time applications such as public sentiment surveillance in financial market decision-making.

Figure 29: The delay between packets exiting the pipeline.
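The simulated exit times underlying Figure 29 can be obtained from the measured per-stage processing times with a standard pipeline recurrence. The following C# sketch illustrates one way to do this, under the assumptions that each stage holds one packet at a time and that all packets are available from the start; it is an illustration, not the code used to produce the figure.

    using System;

    public static class PipelineSimulation
    {
        // t[k][i] is the measured processing time of packet k in stage i
        // (stages here: s1, s2, s3 = max(s3a, s3b), s4). Returns the time at which
        // each packet exits the last stage, assuming every stage holds one packet
        // at a time and all packets are queued from the start.
        public static double[] ExitTimes(double[][] t)
        {
            int stages = t[0].Length;
            var stageFree = new double[stages];   // time at which each stage becomes free
            var exit = new double[t.Length];
            for (int k = 0; k < t.Length; k++)
            {
                double ready = 0.0;               // packet k is available immediately
                for (int i = 0; i < stages; i++)
                {
                    double start = Math.Max(ready, stageFree[i]);
                    ready = start + t[k][i];      // packet k leaves stage i at this time
                    stageFree[i] = ready;
                }
                exit[k] = ready;
            }
            return exit;
        }
    }

The delay between two consecutive visualization updates then corresponds to the gap between consecutive exit times.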


Annex 6. Data storage requirements

As the knowledge base shall provide persistent storage for the extracted information, physical storage is required. The required size of the storage solution depends on the definition of the data items to be stored and on their cardinality.

The knowledge base structure is required to enable efficient storage and retrieval capabilities for data of different modalities. Regarding data retrieval, a uniform way of general-purpose querying shall be provided. As the integration of information needs to be tailored to the data items, structures and formats in question, as well as to the querying purposes, a precise definition of what shall be stored and in which combinations data shall be retrievable is required. This definition is to be provided by the other technical work packages and will be derived from the requirements of the specific use cases.

As the integrated knowledge base shall provide a uniform way of accessing data of different modalities, it has to bridge different query paradigms. While data that is stored in a relational database schema is typically queried using SQL, data represented in RDF graphs – such as ontologies – is typically queried using SPARQL1.

Addressing these requirements, work package 5 will develop and implement an appropriate solution for the knowledge base model. Deliverable D5.1 will specify the knowledge base model, the subsequent deliverable D5.2 will provide a prototype, and deliverable D5.3 will then implement the agreed scaling strategy.

1 SPARQL is a W3C recommendation, see http://www.w3.org/TR/rdf-sparql-query/