Le-Phuoc-Linked Stream Data Processing

  • View
    9

  • Download
    2

Embed Size (px)

DESCRIPTION

Le-Phuoc-Linked Stream Data Processing

Text of Le-Phuoc-Linked Stream Data Processing

  • Linked Stream Data Processing

    Danh Le-Phuoc, Josiane Xavier Parreira, and Manfred Hauswirth

    Digital Enterprise Research Institute,National University of Ireland, Galway{danh.lephuoc,josiane.parreira,manfred.hauswirth}@deri.org

    Abstract. Linked Stream Data has emerged as an eort to representdynamic, time-dependent data streams following the principles of LinkedData. Given the increasing number of available stream data sources likesensors and social network services, Linked Stream Data allows an easyand seamless integration, not only among heterogenous stream data, butalso between streams and Linked Data collections, enabling a new rangeof real-time applications.

    This tutorial gives an overview about Linked Stream Data process-ing. It describes the basic requirements for the processing, highlightingthe challenges that are faced, such as managing the temporal aspectsand memory overow. It presents the dierent architectures for LinkedStream Data processing engines, their advantages and disadvantages.The tutorial also reviews the state of the art Linked Stream Data pro-cessing systems, and provide a comparison among them regarding thedesign choices and overall performance. A short discussion of the currentchallenges in open problems is given at the end.

    Keywords: Linked Stream Data, Data Stream Management Systems,Linked Data, Sensors, query processing.

    1 Introduction

    We are witnessing a paradigm shift, where real-time, time-dependent data isbecoming ubiquitous. Sensor devices were never so popular. For example, mo-bile phones (accelerometer, compass, GPS, camera, etc.), weather observationstations (temperature, humidity, etc.), patient monitoring systems (heart rate,blood pressure, etc.), location tracking systems (GPS, RFID, etc.), buildingsmanagement systems (energy consumption, environmental conditions, etc.), andcars (engine monitoring, driver monitoring, etc.) are continuously producing anenormous amount of information in the form of data streams. Also on the Web,services like Twitter, Facebook and blogs, deliver streams of (typically unstruc-tured) real-time data on various topics.

    Integrating these new information sourcesnot only among themselves, butalso with other existing sourceswould enable a vast range of new, real-timeapplications in the areas of smart cities, green IT, e-health, to name a few.However, due to the heterogeneous nature of such diverse streams, harvestingthe data is still a dicult and labor-intensive task, which currently requires alot of hand-crafting.

    T. Eiter and T. Krennwallner (Eds.): Reasoning Web 2012, LNCS 7487, pp. 245289, 2012.c Springer-Verlag Berlin Heidelberg 2012

  • 246 D. Le-Phuoc, J.X. Parreira, and M. Hauswirth

    Recently, there have been eorts to lift stream data to a semantic level, e.g., bythe W3C Semantic Sensor Network Incubator Group1 and [25,103,126]. The goalis to make stream data available according to the Linked Data principles [22]aconcept that is known as Linked Stream Data [99]. As Linked Data facilitates thedata integration process among heterogenous collections, Linked Stream Datahas the same goal with respect to data streams. Moreover, it also bridges thegap between stream and more static data sources.

    Besides the existing work on Linked Data, Linked Stream Data also benetsfrom the research in the area of Data Streams Management Systems (DSMS).However, particular aspects of Linked Stream Data prevents existing work inthese two areas to be directly applied. One distinguishing aspect of streams thatthe Linked Data principles do not consider is their temporal nature. Usually,Linked Data is considered to change infrequently. Data is rst crawled and storedin a centralised repository before further processing. Updates on a dataset areusually limited to a small fraction of the dataset and occur infrequently, orthe whole dataset is replaced by a new version entirely. Query processing inLinked Data databases, as in traditional relational databases, is pull based andone-time, i.e., the data is read from the disk, the query is executed against itonce, and the output is a set of results for that point in time. In contrast, inLinked Stream Data, new data items are produced continuously, the data isoften valid only during a time window, and it is continually pushed to the queryprocessor. Queries are continuous, i.e., they are registered once and then areevaluated continuously over time against the changing dataset. The results of acontinuous query are updated as new data appears. Therefore, current LinkedData query processing engines are not suitable for handling Linked Stream Data.It is interesting to notice that in recent years, there has been work that points outthe dynamics of Linked Data collections [115,107]. Although at a much slowerpace compared to streams, it has been observed that centralised approaches willnot be suitable if freshness of the results is important, i.e., the query resultsare consistent with the actual live data under certain guarantees, and thusan element of live query execution will be needed [107]. Though this diersfrom stream data, some of properties and techniques for Linked Stream Dataprocessing may also be applicable to this area.

    Data Streams Management Systems, on the other hand, are designed to han-dle and scale with fast changing, temporal data, such as Linked Stream Data.However Linked Stream Data is usually represented as an extension of RDFthe most popular standard for Linked Data representation. This contrasts withthe relational storage model used in DSMS. It has been shown that in orderto eciently process RDF data using the relational model, the data needs tobe heavily replicated [125,91]. Replication fast changing RDF streams is pro-hibitive, therefore DSMS cant be directly used for storage and processing ofLinked Stream Data.

    In this tutorial we will give an overview on Linked Stream Data process-ing. We will highlight the basic requirements, the dierent solutions and the

    1 http://www.w3.org/2005/Incubator/ssn/

  • Linked Stream Data Processing 247

    advantages and disadvantages from each approach. We will also review existingLinked Stream Data processing system, and provide a comparison among them,regarding the design choices and overall performance.

    We will start by introducing a running example that will be used throughoutthe tutorial (Section 1.1). The example highlights the potential benets in inte-grating stream data with other sources, and it also shows what are the challengesfaced when designing a Linked Stream Data Processing system. And we men-tioned earlier, Linked Stream Data is closely related to Linked Data and streamprocessing. We assume that the attendees of this tutorial are familiar with theresearch in Linked Data, and in Section 2 we will focus providing a backgroundon the fundamentals of stream processing that also applies to Linked StreamData. We will show what are the basic models and techniques, how continu-ous semantics are represented, the diverse operators and processing optimisa-tion techniques, and how issues like time management and memory overow arehandled. In the Linked Stream Data processing (Section 3), we rst show theformalisation to represent the data and the continuous queries and the query op-erators needed. We then moved into the architecture and system design of LinkedStream Data processing engines, showing the dierent possible approaches anddesign choices. We highlight the state of the art systems in Linked Stream Dataprocessing, and show how each of them implement the dierent architectures.We also provide a performance comparison in terms of query processing timesand scalability. The end of this tutorial is dedicated to a short discussion of thecurrent challenges in open problems (Section 4).

    1.1 Running Example

    Inspired by the experiments of Live Social Semantics [4,109], we use the follow-ing running example through the rest of the tutorial. The scenario focuses on thedata integration problem between data streams given by a tracking system andstatic data sets. Similar to several real deployments in Live Social Semantics,the tracking system is used for gathering the relationship between real-worldidentiers and physical spaces of conference attendees. These tracking data canthen be correlated with non-stream datasets, like online information about theattendees (social network, online proles, publication record, etc). The benetsof correlating these two sources of information are manifold. For instance, con-ference rooms could be automatically assigned to the talks, based on the talkstopic and the number of people that might be interested in attending it (basedon their prole). Conference attendees could be notied about fellow co-authorsin the same location. A service that suggest which talks to attend, based onprole, citation record, and distance between talks locations can be designed.These are just a few examples.

    For the tracking service in our example, attendees of a conference wear RFIDtags that constantly stream the location in a building, i.e which room/sec-tion/area they currently are. Each reading streamed from the RFID tags toRFID readers has two properties, the tagid and the signal strength.

  • 248 D. Le-Phuoc, J.X. Parreira, and M. Hauswirth

    The static datasets include data about the conference attendees and metadataabout the conference building. Each attendee has a prole containing his personalinformation and the tagid given to him. The prole has also data links to theattendees publication records in DBLP. The data about the conference buildingincludes information such as location name, description, layout, and connectionbetween location/rooms.

    The data from streams and static datasets of this example can not be modelledby a relational data model because the data items involved are hosted in dierentstorages. Bec