18
ViSTA-TV: Video Stream Analytics for Viewers in the TV Industry FP7 STREP ICT-296126 | 296126 co-funded by the European Commission ICT-2011-SME-DCL | SME Initiative on Digital Content and Languages D4.1 Recommendations Engine Dummy Version Thomas Scharrenbach (UZH) Yamuna Krishnamurthy (TUDo) Christian Bockermann (TUDo) Project start date: June 1st, 2012 Project duration: 24 months Document identifier: ViSTA-TV/2012/D4.1 Date due: 30 Nov 2012 Submission date: 30 Nov 2012 Version: v1.0 Status: Final Distribution: PU www.vista-tv.eu ViSTA-TV Consortium

D4.1 Recommendations Engine Dummy

Embed Size (px)

Citation preview

Page 1: D4.1 Recommendations Engine Dummy

ViSTA-TV: Video Stream Analytics for Viewers in the TV Industry

FP7 STREP ICT-296126 | 296126 co-funded by the European Commission ICT-2011-SME-DCL | SME Initiative on Digital Content and Languages

D4.1 Recommendations Engine Dummy Version

Thomas Scharrenbach (UZH) Yamuna Krishnamurthy (TUDo) Christian Bockermann (TUDo)

Project start date: June 1st, 2012 Project duration: 24 months Document identifier: ViSTA-TV/2012/D4.1 Date due: 30 Nov 2012 Submission date: 30 Nov 2012

Version: v1.0 Status: Final Distribution: PU

www.vista-tv.eu

ViSTA-TV Consortium

Page 2: D4.1 Recommendations Engine Dummy

This document is part of a collaborative research project funded by the FP7 ICT Programme of the Commission of the European Communities, grant number 296126. The following partners are involved in the project:

University of Zurich (UZH) - Coordinator Dynamic and Distributed Information Systems Group (DDIS) Binzmühlstrasse 14 8050 Zürich, Switzerland Contact person: Abraham Bernstein E-mail: [email protected]

Techniche Universität Dortmund (TUDo) Computer Science VIII: Artificial Intelligence Unit D-44221 Dortmund, Germany Contact person: Katharina Morik E-mail: [email protected]

Rapid-I GmbH (RAPID-I) Stockumer Strasse 475 44227 Dortmund, Germany Contact person: Ingo Mierswa E-mail: [email protected]

Zattoo Europa AG (Zattoo) Eggbühlstrasse 28 CH-8050 Zürich, Switzerland Contact person: Bea Knecht E-mail: [email protected]

Vrije Universiteit Amsterdam (VUA) Web & Media Group, Department of Computer Science, Faculty of Sciences (FEW) De Boelelaan 1081a NL-1081 HV Amsterdam, The Netherlands Contact person: Guus Schreiber E-mail: [email protected]

The British Broadcasting Corporation (BBC) BBC Research & Development, Centre House 56 Wood Lane London, W12 7SB, United Kingdom Contact person: Chris Newell E-mail: [email protected]

Copyright © 2012 The ViSTA-TV Consortium

Page 3: D4.1 Recommendations Engine Dummy

D1.4 Data Security and Privacy 3

Executive Overview

This document describes the dummy version of the recommendations engine (RE) as part of the ViSTA-TV stream-processing framework. The RE comprises three parts: First, the complex event processing engine, second the show-user-model (SUM) engine and third the actual recommendations rules (RERULES) engine. All of the above are sub-topologies in the ViSTA-TV SOTRM stream-processing topology. In the sequel we give precise definitions of the data flow and the data format for input and output.

Page 4: D4.1 Recommendations Engine Dummy

ViSTA-TV 4

Contents

1   Introduction ...................................................................................................................................... 5  1.1   Scope of this Document ............................................................................................................ 5  1.2   Outline ....................................................................................................................................... 5  

2   The ViSTA-TV Project Setup ........................................................................................................... 6  2.1   ViSTA-TV Complex Event Processing and Temporal Facts. .................................................... 7  2.2   ViSTA-TV Recommendation Rules ........................................................................................... 8  2.3   ViSTA-TV Show-User-Model .................................................................................................... 8  2.4   Implementation Notes ............................................................................................................... 9  2.5   Licensing and Availability .......................................................................................................... 9  

3   Complex Event Processing Engine (CEP) .................................................................................... 11  3.1   Data Formats .......................................................................................................................... 11  

3.1.1   Input ................................................................................................................................. 11  3.1.2   Output ............................................................................................................................... 11  

3.2   Parameters ............................................................................................................................. 11  3.3   Implementation ....................................................................................................................... 11  

4   Show-User-Model Engine (SUM) .................................................................................................. 13  4.1   Data format ............................................................................................................................. 13  

4.1.1   Input ................................................................................................................................. 13  4.1.2   Output ............................................................................................................................... 13  

5   Recommendations Rules Engine (RERULES) .............................................................................. 14  5.1   Data format ............................................................................................................................. 14  

5.1.1   Input ................................................................................................................................. 14  5.1.2   Output ............................................................................................................................... 14  

5.2   Parameters ............................................................................................................................. 15  5.3   Implementation ....................................................................................................................... 15  

6   Testing ........................................................................................................................................... 15  7   Summary ....................................................................................................................................... 17  8   Acknowledgements ....................................................................................................................... 17  References .......................................................................................................................................... 17  

Page 5: D4.1 Recommendations Engine Dummy

D4.1 Recommendations Engine Dummy Version 5

1 Introduction

1.1 Scope of this Document The document describes the first version of the Recommendations Engine of the ViSTA-TV stream-processing engine. This engine is based on the STORM stream-processing framework and is described in document ViSTA-TV/2012/D3.1. 1 That document also describes how the components of the ViSTA-TV stream-processing engine are to be implemented in the STORM stream-processing framework. This document specifies specifically how the components of the Recommendations Engine are implemented.

The Recommendations Engine comprises the following components:

• Complex Event Processing Engine (CEP) • Show-User-Model Engine (SUM) • Recommendations Rules Engine (RERULES)

While the CEP and SUM engines provide further input for the RERULES engine, the latter provides an output to the end user apps (cf. Figure 2).

In its current version all components are capable of

• reading input tuples in the correct format, • declaring the correct format for output tuples, • writing output tuples in the correct format, • being configurable properly

1.2 Outline This document is structured as follows:

In Section 2 we give an overview over the ViSTA-TV project as was already done in the Description of Work (DOW). After that in Section 3 we describe the input and output formats as well as the configuration parameters of the CEP engine. In the remaining section we provide the same information for the SUM engine (Section 4) and the RERULES engine (Section 5). We finally give an overview how to test all components independent from the other components of the ViSTA-TV Stream Processing engine (Section 6). We conclude with a summary in Section 7.

1 http://storm-project.net/

Page 6: D4.1 Recommendations Engine Dummy

ViSTA-TV 6

2 The ViSTA-TV Project Setup Please note that much of the content of this section was re-used from the Description of Work document (DOW) to make this document self-contained.

In the past video content was consumed through broadcast TV delivered through the air or by cables. Broadcasters had information – both meta-data and the actual video stream – about their own programming as well as coarse usage information supplied by survey companies such as Nielsen or GfK.12 Cable providers viewed themselves as pure bundling and transportation channels whilst viewers used secondary sources to inform themselves about the desired programming sources. Fueled by Internet TV (IPTV) offers of telecommunication giants, cable providers, or pure online providers such as YouTube or Hulu as well as digital rental companies such as Netflix and iTunes, people increasingly watch video content delivered over IP networks. The technological issues arising from IPTV as well as the opportunities for interactive, personalized TV have been amply investigated in EU-projects such as NoTube.3 However, the use of behavioral information accruing during the process of viewing live IPTV in combination with the actual video streams and the electronic program guide (EPG) information as a foundation for a market research data marketplace has been largely overlooked.

Figure 1: The ViSTA-TV universe (images are owned by the respective companies).

The ViSTA-TV project proposes to view IPTV as a two-way channel, where the viewer can take advantage of the video streams whilst the ViSTA-TV platform employs behavioral information about the viewer gathered by IPTV transmitters to improve the experience for all participants in the TV supply chain (see Figure 1). Specifically, ViSTA-TV proposes to gather consumers' viewing behavior and the actual video streams from broadcasters and IPTV providers to combine them with enhanced EPG information as the input for a holistic live-stream analysis. The analysis – comprising of personalized feature construction, supplementary tag generation, and real-time recommendation generation – will provide the input for a triplication procedure that generates Data in the Resource Description Framework (RDF) – a data format for storing typed graphs – and enables the reuse of the resulting data for three different data pools:

• Linked Open Data as a Basis for Analytic Processing, • Viewing Behavior Data, and • recommendation Data.

2 http://www.gfk.com/

3 http://notube.tv/

!"#"$%&'()*+,&#$%&-(&,$

./&01+,'0$2330$$$

./&#,&#$4'/5(6,'0$

265,'70,'0$

8"7&-$"-,&)(,0$

9:!$.;/16$

!"#"$<"',*/10,$

$

=4>?@4'/5(6,'0$

8,)/++,&@6"7/&$%&-(&,$

!"#$%&$!'()*+,-.'

A,*"5(/'$6"#"$4'/5(6,'0$

>"-@B#',"+$>"-@B#',"+$

8"C$!"#"$

9(&D,6$:3,&!"#"$

E,#"6"#"$

8,)/++,&@$6"7/&0$

A,*"5(/'$!"#"$

A,*"5(/'$!"#"$

8"C$!"#"$

Page 7: D4.1 Recommendations Engine Dummy

D4.1 Recommendations Engine Dummy Version 7

Each of these data pools with its production pipeline will provide the foundation for a data marketplace with its own commercial as well as societal rationale such as bootstrapping innovative applications relying on TV data, selling services relying on detailed viewing behavior analysis beyond today's capabilities, or helping viewers finding the shows that best match their interest at any given point in time.

The heart of the ViSTA-TV platform will be an online analysis system that operates on streams of content data in real time. This system will be developed on the basis of the STORM stream- processing framework.1 We will extend existing methods for feature extraction from content streams, as well as methods for querying and matching events on streams of background information to make them stream-ready. Last but not least we will develop methods for real-time recommendation based on complex event processing and hybrid user/show profiles. For learning and evaluating the corresponding data models we will rely upon existing offline machine learning methods.

2.1 ViSTA-TV Complex Event Processing and Temporal Facts. The data enrichment process will generate event-streams of feature-value-changes that describe shows or simple usage events such as “user with id ‘1234’ watches Sports Today”. Apart from their direct usage these event-streams provide the basis for constructing more complex events as aggregated/constructed features (e.g., the complex event after-work-TV-night could be characterized as watching the sequence: ‘sports’ → ‘news’ → ‘weather’ → ‘crime’ show). The Complex Events Processing (CEP) engine performs the detection of such complex events. Furthermore, this CEP engine will also be able to construct temporal facts, i.e. things that are known to be true for a certain amount of time – in contrast to events that take place at a certain point in time and are over in the next instant.

The difference of events and temporal facts can be best illustrated by an example: The event that a thousand football fans switch to the match on channel no 5 may make valid the temporal fact that a lot of people interested in football are watching the match to become true. If such an amount of users switch again channels (or event turn off their IPTV consuming device) and only a handful of football fans remain watching the match, this may cause the above temporal fact to become invalid.4

In the ViSTA-TV Stream Processing engine, temporal facts and complex events serve as an input for the recommendations rules engine. To construct aggregated tags ViSTA-TV will develop a mechanism that efficiently detects complex event patterns on tag (or RDF-triple) streams. State-of-the-art event processing frameworks such as SASE+ [1] or TelegraphCQ [2] are defined on a relational data model. The relational model, however, makes it difficult to incorporate background knowledge, efficiently. Linked Data was designed to overcome this limitation of the relational models.

EP-SPARQL [3] was presented, recently, for defining and querying complex events on streams of Linked Data or, more specifically, RDF. Yet, EP-SPARQL does not make assumptions on real-time processing or complex time constraints as do automata based approaches like SASE+. An extension of the relational stream querying language CQL [4] is C-SPARQL [5]. Both approaches perform processing triples/tuples within a fixed window, which can be based on time-intervals, number of data triples, or partitions based on aggregates. C-SPARQL considers the triples in the window as a static knowledge base. An extension for making this approach work more dynamically was proposed, recently [6]. However, complex events whose components may go beyond the window limits (such as long-term TV usage behavior) are not yet taken into account.

Real-time complex event matching requires efficient and effective methods for pruning the current matching hypotheses. Current approaches prune by implementing memory eviction schemes such as first-in-first-out (essentially a time window) or least-recently-used (i.e., keeping only active patterns not the most useful one). These schemes essentially introduce a time-window though the backdoor. ViSTA-TV, in contrast, will employ the SASE+ automata based model and the EP-SPARQL querying capabilities but develop methods for estimating the probabilities of automata (representing complex event patterns) to match or not. Hence, this novel approach will base its eviction decision based on inherent data statistics, which is expected to minimize the number of false-positives and false-negatives. Consequently, ViSTA-TV will provide novel methods for matching time-window agnostic

4 Note that the temporal fact remains true whatsoever. Ending a temporal fact simply assigns it an

interval but does not falsify the fact.

Page 8: D4.1 Recommendations Engine Dummy

ViSTA-TV 8

complex events on streams of Linked Data in real-time, which will be used for feature/tag construction.

In the present version, however, the focus is laid on integrating the components of the recommendations engine into the overall framework of the ViSTA-TV Stream Processing Engine. Therefore, the actual implementation does not yet take into account actual extensions of SPARQL of implement a complex event processor. To match the goal of this deliverable the components of the CEP engine are implemented as dummies.

2.2 ViSTA-TV Recommendation Rules Recommender systems have long been in the focus of e-commerce – the famous Netflix challenge for movie recommendation is a good example here. The techniques for recommender systems are generally categorized as content-based, collaborative filtering, or hybrid methods. Content-based methods focus on recommending objects that are similar to the ones a user has previously chosen [7]. Methods based on collaborative filtering identify recommendable items based on user similarity [8–10]. Hybrid approaches combine content-based and collaborative methods and have been reported to raise the accuracy on recommendations [11].

Another distinction made in the recommender system literature is memory- versus model-based approaches. Memory based approaches work directly on the user/content data and produce recommendation based on the k-nearest neighbors. Hence, every new user or item needs to be compared to all existing users/items – a computational cost unacceptable for our setting. More recently model-based approaches, which try to model the relation between users and content mathematically, have been shown to not only speed up the recommendation process, but may even improve accuracy (e.g., matrix factorization techniques have outperformed neighboring based methods in the Netflix challenge [12]). Furthermore, tensor and matrix factorization techniques that also recently gained interest in social network analysis [13–15] are supporting this trend. Unfortunately, however, these model-based techniques are still computationally intensive and, therefore, do not fit the scenario of real-time recommendation of shows and movies as intended by ViSTA-TV.

ViSTA-TV will ensure real-time recommendations by addressing computational complexity of current approaches whilst improving recommendation quality through personalized models: While some work on simple events exists, none of the state-of-the art recommender systems operates on complex events. ViSTA-TV will make it possible to use complex events for the actual recommendations. This has not been investigated so far – in particular not for real-time processing. In addition, current recommendation approaches work with handcrafted and manually selected features to describe items. In contrast, ViSTA-TV will integrate dynamic feature extraction and selection within the recommendation process.

The current implementation of the recommendations rules engine will be a linear combination of the features that are fed into this engine. A recommendation will then be given according to the outcome of the application of the linear combination to a classification system such as a perceptron or a decision tree. The dummy engine will output recommendations via an XMPP server. Yet, the STORM framework provides the necessary flexibility to add further output mechanisms as required.

2.3 ViSTA-TV Show-User-Model To address the computational effort of current approaches we intend to take a two-pronged approach. First, TUDo and UZH will not use the recommendation-algorithms on the actual video/audio-streams but rather on the feature-event streams extracted in WP2 (see above). This will massively reduce the dimensionality of the streams. Second, we will use a personalized heterogeneous clustering approach developed by TUDo that exploits frequent term set clustering [16], [17]. To speed it up, TUDo will combine this approach with online item set mining [18]. The combination will result in a novel approach for reducing the computational complexity of recommendation systems, which has the added benefit of providing improved recommendations due to its underlying personalized clustering. Note that the description of these measures is presented in the document ViSTA-TV/2012/D4.4.

To address the large-scale cold-start problem of collaborative filtering new items are not being recommended as not enough users have rated them and they do not get ratings as they are not recommended. Both hybrid and model-based approaches have been shown to reduce the cold-start problem [19] – still recommender-systems have been mostly used where new content is the exception

Page 9: D4.1 Recommendations Engine Dummy

D4.1 Recommendations Engine Dummy Version 9

and not the rule. In the broadcasting TV world of ViSTA-TV many shows on TV are first time transmissions. Hence, TUDo and UZH will explore a novel hybrid 2-way clustering approach: based on available content descriptions (WP1) and external data (WP6) (e.g. this movie is a TV-premiere, but it is already rated as cinema-movie) on one side as well as user viewing information on the other side clusters of shows and users are discovered in a mutually reinforcing process. Once these clusters are available the recommendation can be viewed as a prediction task [20] using techniques such as Bayesian classifiers, decision trees or artificial neural networks. Since the recommendations are now based not only on single shows but also on show-clusters new shows can “simply” be assigned to a cluster based on its content characteristics and then recommended along with other similar shows overcoming the cold-start problem.

We refer to this two-way hierarchical clustering approach as finding out the show-user-model (SUM). While the clustering clearly is an offline task the application of its learned model will be an integral part of the ViSTA-TV Stream Processing Engine. Some significant amount of time before a show will be broadcasted the SUM engine will query the data warehouse which shows will be actually broadcasted. This information, along with its application to the show-user-model will serve as an input to the CEP engine as well as the RERULES engine (cf. Figure 2). Note that feeding these engines with the data can be performed some time before the actual broadcasting takes place. The SUM engine can hence be seen as an initializer of the CEP and the RERULES engine allowing them to prepare and optimize their processing for upcoming shows and the stream of feature events before they take place.

Note that the SUM engine does currently not have any input from any other stream-processing component in the ViSTA-TV stream-processing engine. Fortunately, the STORM framework underlying the ViSTA-TV stream-processing engine provides means for easily adding input from other stream-processing components, e.g. the data enrichment engine, should that become a necessity.

2.4 Implementation Notes Note that all components are implemented w.r.t. the ViSTA-TV stream processing core engine as described in document ViSTA-TV/2012/D3.1. All implementations are provided in Java 1.6 and use Maven2 for configuring the build process and defining dependencies to other packages.

All engines described in this document are implemented as a module within the overall Maven project:

Main project: GroupId: eu.vista-tv.streamengine ArtifactId: vistatv-streamenging Version: 0.0.1

Module: GroupId: eu.vista-tv.streamengine ArtifactId: vistatv-streamengine-recommendations Version: 0.0.1

2.5 Licensing and Availability The ViSTA-TV Recommendations engine was authored by the University of Zurich and the Technical University Dortmund. All software is licensed under the following licenses:

• GNU AFFERO GENERAL PUBLIC LICENSE, Version 3, 19 November 2007.5

The recommendations module of the ViSTA-TV stream-processing engine is published at the Maven server of the University of Zurich, Department of Informatics.6

5 http://www.gnu.org/licenses/agpl-3.0.html

6 https://maven.ifi.uzh.ch/

Page 10: D4.1 Recommendations Engine Dummy

ViSTA-TV 10

Figure 2: The WP4 components of the ViSTA-TV Stream Processing Engine (inside rectangle) and peripherals (outside rectangle). Note that the external data is pulled from the data warehouse (dashed lines). Light (blue) arrows indicate data flow between STORM bolts and spouts whereas black arrows indicate a data flow outside STORM.

IPTV Providers

XMPP Server

App1 App2 App3

Data Warehouse

ViST

A-T

V St

ream

pro

cess

ing

Engi

ne

Rec

omm

enda

tion

Rul

es E

ngin

e

Output

Rules Processor

Input SUM

Input CEP Input

DEE

Com

plex

Eve

nt P

roce

ssin

g (C

EP

) Eng

ine

Output

Rules Processor

Input SUM

Input DEE

Dat

a En

richm

ent

(DE

E) E

ngin

e

Output

Sho

w-U

ser-

Mod

el

(SU

M) E

ngin

e

Output

Rules Processor

Input Raw Data

Input Meta-Data

DEE Processor

Page 11: D4.1 Recommendations Engine Dummy

D4.1 Recommendations Engine Dummy Version 11

3 Complex Event Processing Engine (CEP) The complex event-processing engine performs continuous processing of atomic events streaming into the engine. SPARQL operators (e.g. conjunction, disjunction, etc.) and extensions thereof (e.g. sequencing, Allen’s interval semantics, etc.) combine atomic events to complex events, both of which can, in turn, validate or invalidate temporal facts. These temporal facts, in turn, will be sent to the rules engine.

3.1 Data Formats

3.1.1 Input

There are two kinds of input to the CEP engine: First from the data enrichment (DEE) engine and second from the show-user-model (SUM) engine.

Input from Data Enrichment The input from the DEE engine will be provided in the form of triples ⟨time, key, value⟩ where time > 0, and key ∈ ΣDEE. The vocabulary ΣDEE is a finite set of strings and comprises the dictionary of the feature names as generated by the DEE engine. Since the DEE engine is an online processing component, the input for the CEP engine is implemented as a STORM Bolt.

Input from Show-User-Model The input from the SUME will be provided in the form of tuples as described in Section 4.1.2

3.1.2 Output

The CEP engine will output temporal triples. These are time-interval annotated RDF statements, i.e. 5-tuples of the form f = ⟨t_start, t_end, subject, predicate, object⟩ where

• t_start ∈ ℕ+ is the time at which f starts. • t_end ∈ ℕ+ ⋃ { ∞} is the time at which f ends; ∞ specifies a yet undetermined end time. • subject, predicate, object are RDF terms [21].

3.2 Parameters Parameters to the CEP engine engine are:

• A list of filenames containing a SPARQL query each. Future versions will have the extensions of SPARQL queries that are developed in the ViSTA-TV project as an input..

3.3 Implementation All relevant classes are contained in the Java package ch.uzh.ifi.ddis.vistatv.streamengine.cep. The following tables provide an overview over

• the STORM bolts for the CEP engine (Table 1), • the implementations of the data formats (Table 2), and • some utility functions and constants (Table 3).

The actual layout of the STORM bolts can be taken from Figure 2.

Page 12: D4.1 Recommendations Engine Dummy

ViSTA-TV 12

Table 1: Components for building the STORM sub-topology, i.e. the bolts.

Class Usage

CEPSubTopology A STORM sub-topology that comprises input bolts, processing bolts and output bolts. Currently there exists but a single processing bolt that verifies that the input data arrives in the correct format. Future versions will create a number of bolts according to the queries presented by the configuration.

CEPFeatureInputBolt Handles input from the DEE engine and conveys it to the respective processing bolts.

SUMInputBolt Handles input from the SUM engine and conveys it to the respective processing bolts.

CEPResultBolt Output of the CEP engine. Writes temporal triples to the STORM collector. In the current version it outputs random temporal triples only.

CEPDummyBolt The dummy implementation of a processing bolt.

Table 2: Data formats used for encoding data items for input and output as well as for configuring the sub-topology of the CEP engine.

Class Usage

CEPNode Interface for expressing RDF nodes. May be moved to the core module to simplify data exchange between sub-topologies.

TemporalTriple A 5-tuple ⟨t_start, t_end, subject, predicate, object⟩ for representing temporal facts and events.

TEFSPARQLQuery Interface that shall hold (extensions of) SPARQL queries.

TemporalTripleDataFormat Specialization of the SimpleTupleDataFormat of the core module.

Page 13: D4.1 Recommendations Engine Dummy

D4.1 Recommendations Engine Dummy Version 13

Class Usage

TemporalTripleDataFormat Specialization of the SimpleTupleDataFormat of the core module.

CEPConstants String and numeric constants that are commonly used as a dictionary.

CEPConfigUtils Several utility functions which are too general for being part of any particular class. Methods from this class may be transferred to the core module.

Table 3: Utility functions and constants of the CEP engine.

Class Usage

DummySUMSpout This class is a simple ISpout implementation, which emits random data in the format described in Section 3.1.2

TimeIdKeyValueTuple The dataformat implementation which provides the encoded SUM data

Table 4: Classes used for providing the SUM connector spout.

4 Show-User-Model Engine (SUM) The SUM engine provides user↔program mappings, which can be used as an initial recommendation to the CEPE. The engine can be triggered by some event, for example system time, which indicates the beginning of a new show or detection of an event like a user logging in. It can then poll the data warehouse for existing user→show groups or show→user groups mappings discovered by either hierarchical clustering or frequent item set mining and provide this information to the CEPEs. The SUM engine will be implemented using the TUDo streams framework and will provide an output through a STORM spout.

4.1 Data format

4.1.1 Input

The SUM engine gets its data from the DEE engine and the data warehouse. The input from the DEE engine will be provided in the form of triples ⟨time, key, value⟩ where time > 0, and key ∈ ΣDEE. The vocabulary ΣDEE is a finite set of strings and comprises the dictionary of the feature names as generated by the DEE engine.

The SUM can also receive inputs from other sources such as time and event triggers.

4.1.2 Output

The SUM engine will provide the output through a STORM Spout. The output provided follows the format of the DEE output and is a quadruple (time, id, key, value) where time is timestamp, id is either „user-show“ or „show-user“ depending on the type of mapping that is emitted.

The key attribute holds the source of the mapping, whereas the value contains an array of the target of the mapping emitted. The following table shows this format:

The Spout will deliver these as serialized triples <timestamp, id, key, value>.

Page 14: D4.1 Recommendations Engine Dummy

ViSTA-TV 14

Output Type Description

time Long The timestamp associated with the emitted data.

id String This field indicates whether a „user → show“ or a „show→ user“ mapping is provided by the tuple.

key String This field contains the source of the mapping, i.e. either a user ID (if the id value is „user→ show“) or a show ID if id is „show→ user“.

value String Array This field provides an array of ID values, which are either user IDs or show IDs, depending on the value of the id field.

Table 5: Date enrichment engine output format.

5 Recommendations Rules Engine (RERULES) The recommendations rules engine takes features as an input, turns non-binary features into binary features and computes a decision whether to recommend a show or not according to the combinations of these weights. Any feature that appears in the input and any feature that appears in the output may group recommendations rules. That way we may implement recommendations clustered by shows, user groups, etc.

Note that this document does not describe the data models for the RERULES engine. These can be found in document ViSTA-TV/2012/D4.4.

5.1 Data format

5.1.1 Input

Inputs to the RERULES engine are

• Features from the DEE engine as triples. • Temporal facts from the CEP engine. • SUM-clusters from the SUM engine.

5.1.2 Output

The output of the RERULES system is XMPP messages.7 These will be sent via Bidirectional-streams Over Synchronous HTTP (BOSH).8 The original transmission protocol is based on TCP with open-ended XML streams over long-lived TCP connections). Quoting the BOSH specification best summarizes the advantages of BOSH:

“This specification defines a transport protocol that emulates the semantics of a long-lived, bidirectional TCP connection between two entities (such as a client and a server) by efficiently using multiple synchronous HTTP request/response pairs without requiring the use of frequent polling or chunked responses.”

7 http://xmpp.org/

8 http://xmpp.org/extensions/xep-0124.html

Page 15: D4.1 Recommendations Engine Dummy

D4.1 Recommendations Engine Dummy Version 15

Recommendation messages comprise the following information:

1. Timestamp (long) 2. Channel Id (string) 3. Show Id (string) 4. User/Cluster Id (String)

The semantics of a recommendation message is that at the specified timestamp the particular show on a particular channel is recommended to a single user or a group thereof. Both the ViSTA Stream Processing engine as well as the Apps know the list of user ids and group ids beforehand.

Messages will be broadcasted to clients via the XMPP protocol extension XEP-0045: Multi-User Chat (MUC).9 There may be more than one MUC, for example, one MUC per channel will be an option to be considered for the next version at milestone MS3. There will be a N:M relationship between MUCs and XMPP clients.

In its current version, the RERULES engine does not provide XMPP server capabilities. Instead, there exists a STORM bolt, which connects to an XMPP server as a client and sends XMPP messages to that server. The XMPP server is therefore currently not part of the ViSTA-TV stream-processing engine. This allows a decoupling of functionality. If, however, it will become necessary to include the XMPP server into the stream-processing engine, this can be easily accomplished by embedding the Apache Vysper XMPP server into a STORM output bolt.

5.2 Parameters The RERULES engine requires a mapping from feature names to the indices of the binary feature vector as well as the weights vector.

5.3 Implementation All relevant classes are contained in the Java package ch.uzh.ifi.ddis.vistatv.streamengine.rec. The following tables provide an overview over

• the STORM bolts for the RERULES engine (Table 6), • the implementations of the data formats (Table 7), and • some utility functions and constants.

The actual layout of the STORM bolts can be taken from Figure 2.

6 Testing All components of the recommendations module can be tested using JUnit 4.8 or higher. In the current version, the test is contained in the class TestRecommendationsRuleEngine in package ch.uzh.ifi.ddis.vistatv.streamengine.

The tests are self-contained and use an embedded XMPP server, namely Apache Vysper.10 Further test were carried out successfully with the OpenFire XMPP server.11

Mock objects simulate the input of the DEE engine. These mock objects generate data in a STORM spout, which is connected to the input bolts of the CEP engine and the RERULES engine in the abovementioned test class. The output of the DEE mock objects is produced in the class DataEnrichmentMockTupleGenerator of the abovementioned test package.

9 http://xmpp.org/extensions/xep-0045.html

10 http://mina.apache.org/vysper/

11 http://www.igniterealtime.org/projects/openfire/

Page 16: D4.1 Recommendations Engine Dummy

ViSTA-TV 16

Note that the interfaces and abstract classes used for generalizing the process of using mock objects for testing have been transferred to the core module of the ViSAT-TV stream-processing engine. For further reference see document ViSTA-TV/2012/D3.1.

Class Usage

RecSubTopology A STORM sub-topology that comprises input bolts, processing bolts and output bolts. Currently there exists but a single processing bolt that verifies that the input data arrives in the correct format. Future versions will create a number of bolts according to the queries presented by the configuration.

AbstractRecInputBolt Provides some commonly used methods, e.g. for declaring STORM fields.

RecInputDerBolt<T> Handles input from the DEE engine and conveys it to the respective processing bolts. The parameter T determines the kind of feature that is being sent.

RecInputCepBolt Handles input from the CEP engine and conveys it to the respective processing bolts.

RecOutputBoltXMPP Output of the RERULES engine. Sends recommendations to an XMPP server outside the ViSTA-TV Stream processing engine.

RecDummyBolt The dummy implementation of a processing bolt.

Table 6: The STORM bolts of the RERULES engine.

Class Usage

Recommendation Record holding information about the time, the channel, the show and the user/group of users for which a recommendation is given.

TemporalTriple A 5-tuple ⟨t_start, t_end, subject, predicate, object⟩ for representing temporal facts and events.

TEFSPARQLQuery Interface that shall hold (extensions of) SPARQL queries.

TemporalTripleDataFormat Specialization of the SimpleTupleDataFormat of the core module.

Table 7:Data formats used for encoding data items for input and output as well as for configuring the sub-topology of the RERULES engine.

Page 17: D4.1 Recommendations Engine Dummy

D4.1 Recommendations Engine Dummy Version 17

7 Summary The ViSTA-TV recommendations engine comprises the show-user-model engine, the complex-event-processing engine and the recommendations rules engine. For all of the above we provided a comprehensive description of their data formats as well as an implementation. The whole engine can be run on its own with random dummy data as a proof of concept. Recommendations will currently be output via XMPP but may be extended to other formats as the needs arise.

Acknowledgements The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2011 under grant agreement n° 296126.

References

[1] Y. Diao, N. Immerman, and D. Gyllstrom, “Sase+: An agile language for kleene closure over event streams,” 2008.

[2] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. A. Shah, “TelegraphCQ  : Continuous Dataflow Processing for an Uncertain World +,” in CIDR 2003, First Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 5-8, 2003, Online Proceedings, 2003.

[3] D. Anicic, P. Fodor, S. Rudolph, and N. Stojanovic, “EP-SPARQL: a unified language for event processing and stream reasoning,” in Proceedings of the 20th international conference on World wide web - WWW ’11, 2011, pp. 635–644.

[4] A. Arasu, S. Babu, and J. Widom, “The CQL continuous query language: semantic foundations and query execution,” The VLDB Journal, vol. 15, no. 2, pp. 121–142, Jul. 2005.

[5] D. F. Barbieri, D. Braga, S. Ceri, E. Della Valle, and M. Grossniklaus, “C-SPARQL: SPARQL for continuous querying,” in Proceedings of the 18th international conference on World wide web, 2009, pp. 1061–1062.

[6] D. Le-phuoc, M. Dao-tran, J. X. Parreira, and M. Hauswirth, “A Native and Adaptive Approach for Unified Processing of Linked Streams and Linked Data,” in ISWC 2011, 2011, no. 24761.

[7] R. J. Mooney and L. Roy, “Content-Based Book Recommending Using Learning for Text Categorization,” in Proceedings of the Fifth ACM Conference on Digital Libraries, San Antonio, TX, June 2000, 2000, no. June, pp. 195–240.

[8] T. Hofmann, “Latent semantic models for collaborative filtering,” ACM Transactions on Information Systems, vol. 22, no. 1, pp. 89–115, Jan. 2004.

[9] O. Flasch, A. Kaspari, K. Morik, and M. Wurst, “Aspect-based tagging for collaborative media organization,” in From Web to Social Web: …, 2007.

[10] I. Mierswa, K. Morik, and M. Wurst, “Collaborative Use of Features in a Distributed System for the Organization of Music Collections,” in Intelligent Music Information Systems: Tools and Methodologies, S. Shen and L. Cui, Eds. Idea Group Publishing, 2007, pp. 147–176.

Page 18: D4.1 Recommendations Engine Dummy

ViSTA-TV 18

[11] P. Melville, R. J. Mooney, and R. Nagarajan, “Content-Boosted Collaborative Filtering for Improved Recommendations,” in Proceedings of the Eighteenth National Conference on Artificial Intelligence(AAAI-2002), Edmonton, Canada, July 2002, 2002, no. July, pp. 187–192.

[12] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, pp. 42–49, 2009.

[13] E. Acar, S. Çamtepe, and B. Yener, “Collective Sampling and Analysis of High Order Tensors for Chatroom Communications,” in Intelligence and Security Informatics, vol. 3975, S. Mehrotra, D. Zeng, H. Chen, B. Thuraisingham, and F.-Y. Wang, Eds. Springer Berlin / Heidelberg, 2006, pp. 213–224.

[14] A. Banerjee, S. Basu, and S. Merugu, “Multi-way clustering on relation graphs,” SDM. SIAM, 2007.

[15] Y. Lin, J. Sun, and P. Castro, “Metafac: community discovery via relational hypergraph factorization,” in KDD ’09 Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009, pp. 527–536.

[16] K. Morik, A. Kaspari, M. Wurst, and M. Skirzynski, “Multi-objective frequent termset clustering,” Knowledge and Information Systems, vol. 30, no. 3, pp. 715–738, Jul. 2011.

[17] B. Fung, K. Wang, and M. Ester, “Hierarchical document clustering using frequent itemsets,” in Proc. SIAM International Conference on Data Mining 2003 (SDM 2003), 2003.

[18] G. Cormode and M. Hadjieleftheriou, “Methods for finding frequent items in data streams,” The VLDB Journal, vol. 19, no. 1, pp. 3–20, Dec. 2010.

[19] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock, “Methods and Metrics for Cold-Start Recommendations,” in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002)., 2002.

[20] D. Billsus and M. J. Pazzani, “Learning Collaborative Information Filters,” in ICML ’98 Proceedings of the Fifteenth International Conference on Machine Learning, 1998, pp. 46 – 54.

[21] G. Klyne and J. J. Carroll, “Resource Description Framework (RDF): Concepts and Abstract Syntax,” 2004.

[22] S. Harris and A. Seaborne, SPARQL 1.1 Query Language. W3C, 2011.