Heaven: Supporting Systematic Comparative Research of RDF Stream Processing Engines. Master thesis by: Riccardo Tommasini (799120). Advisor: Prof. Emanuele Della Valle. Co-Advisors: Daniele Dell’Aglio and Marco Balduini. Scuola di Ingegneria Industriale e dell’Informazione, Computer Science and Engineering. Academic Year 2013 – 2014

  • Heaven: Supporting Systematic Comparative Research of RDF Stream Processing Engines

    Master thesis by: Riccardo Tommasini (799120)

    Advisor: Prof. Emanuele Della Valle

    Co-Advisors: Daniele Dell’Aglio and Marco Balduini

    Scuola di Ingegneria Industriale e dell’Informazione

    Computer Science and Engineering

    Academic Year 2013 – 2014

  • Master Degree Thesis Riccardo Tommasini

    Agenda

    Motivations

    Research Question

    Development

    Evaluation

    Conclusion

  • Master Degree Thesis Riccardo Tommasini

    Stream Reasoning

    Reasoning upon rapidly changing information flows

    - Emanuele Della Valle, Stefano Ceri, 2009

  • Master Degree Thesis Riccardo Tommasini

    Motivations

    Computer Science research mainly focuses on proposing new systems and models, and lacks empirical evaluations of the existing ones.

    - Walter F. Tichy, 25 August 1994

  • Master Degree Thesis Riccardo Tommasini

    State of the art in RSP Comparison

    RSP Engines:

    C-SPARQL Engine

    CQELS

    SPARQLstream

    EP-SPARQL

    INSTANS

    SparkWAVE

    DynamiTE

    Trowl

  • Master Degree Thesis Riccardo Tommasini

    Agenda - Motivations

    Motivations

    Research Question

    Development

    Evaluation

    Conclusion

  • Master Degree Thesis Riccardo Tommasini

    Problem Setting - Comparative Method

    In Social Science

    Comparative Analysis is Case Driven

    Cases are seen as a combination of properties

    Similarities and differences are examined with shared methods

    Baselines define analysis guidelines

  • Master Degree Thesis Riccardo Tommasini

    Problem Setting - Test Stand

    Evaluate engines with Test Stands

    In Aerospace engineering

    Experimental Environment

    Reproducibility, Repeatability, Comparability

    Evaluation of running systems

  • Master Degree Thesis Riccardo Tommasini

    RSP Engine <T, Q>

    RDF Stream Processing Engine

    heterogeneous data streams

    data stream integration through the RDF data model

    continuously infers implied triples w.r.t. ontology T

    continuous query (Q) answering

  • Master Degree Thesis Riccardo Tommasini

    RSP Engine - Complexity

    RDF Stream Model + Execution Semantics + Inference Rule

  • Master Degree Thesis Riccardo Tommasini

    State of the art of RSP Engine Benchmarking

    Table columns: Benchmark, Data Streams & Ontologies, Queries, Metrics, Test Stand, Baselines, Method

    SRBench - Metrics: Feasibility

    LSBench - Metrics: Feasibility, Throughput

    CSRBench - Metrics: Feasibility, Throughput, Correctness

  • Master Degree Thesis Riccardo Tommasini

    Research Question

    Can an engine test stand, together with existing queries, datasets and metrics, enable a Systematic Comparative Research Approach of RSP Engines?

    Contribution

    We developed and released as open source Heaven, a framework to enable a Systematic Comparative Research Approach (SCRA) of RDF Stream Processing (RSP) Engines

  • Master Degree Thesis Riccardo Tommasini

    Agenda - Research Question

    Motivations

    Research Question

    Development

    Evaluation

    Conclusion

  • Master Degree Thesis Riccardo Tommasini

    Heaven - Test Stand Requirements

    Independence from Engine, Query, Dataset and Ontologies allows exploiting the existing solutions presented before

    Moreover

    Do not influence the experiment

    Extendible Design

    Extendible Measurement Set

  • Master Degree Thesis Riccardo Tommasini

    Heaven - Test Stand

    [Figure: Test Stand architecture. Experiment parameters: E, D, T, Q. Components: Streamer (D), RSPEngine<T, Q> with an input Interface and an output Interface, ResultCollector. Controls: Start, Stop.]

  • Master Degree Thesis Riccardo Tommasini

    Heaven - Test Stand

    [Figure: Test Stand workflow. Components: TestStand, Streamer, RSPEngine, ResultCollector, Disk, Analyser. An Experiment runs between Start and Stop; labels: MB.]
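The workflow on the two Test Stand slides can be illustrated with a minimal Python sketch. It is not Heaven's actual code: the class and method names (`HeavenTestStand`, `Streamer`, `ResultCollector`, `EchoEngine`, `process`, `collect`) are hypothetical, and `time.perf_counter` and `tracemalloc` are used here only as stand-ins for whatever latency and memory measurements the framework really records.

```python
import time
import tracemalloc


class EchoEngine:
    """Hypothetical stand-in for an RSP engine: returns its input unchanged."""

    def process(self, event):
        return list(event)


class Streamer:
    """Feeds the events of a dataset D to the engine, one at a time."""

    def __init__(self, events):
        self.events = events

    def __iter__(self):
        return iter(self.events)


class ResultCollector:
    """Stores one measurement record per processed event."""

    def __init__(self):
        self.records = []

    def collect(self, output, latency_ms, memory_mb):
        self.records.append(
            {"triples_out": len(output), "latency_ms": latency_ms, "memory_mb": memory_mb}
        )


class HeavenTestStand:
    """Runs an experiment: start, stream every event through the engine, stop."""

    def __init__(self, engine, streamer, collector):
        self.engine, self.streamer, self.collector = engine, streamer, collector

    def run(self):
        tracemalloc.start()  # Start of the experiment
        try:
            for event in self.streamer:
                t0 = time.perf_counter()
                output = self.engine.process(event)
                latency_ms = (time.perf_counter() - t0) * 1000.0
                current_bytes, _peak = tracemalloc.get_traced_memory()
                self.collector.collect(output, latency_ms, current_bytes / 1e6)
        finally:
            tracemalloc.stop()  # Stop of the experiment
        return self.collector.records


# Two toy events: the first carries one triple, the second carries two.
events = [[("s1", "p", "o1")], [("s2", "p", "o2"), ("s3", "p", "o3")]]
records = HeavenTestStand(EchoEngine(), Streamer(events), ResultCollector()).run()
```

Because the stand only calls `process`, any engine exposing that single method could be plugged in without touching the streamer or the collector, which is the kind of engine independence the requirements slide asks for.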

  • Master Degree Thesis Riccardo Tommasini

    Agenda - Development

    Motivations

    Research Question

    Development

    Evaluation

    Conclusion

  • Master Degree Thesis Riccardo Tommasini

    Do

    Heaven extends the traditional top-down analysis, enabling the comparative methods

    How to start the research?

    We evaluate four naive RSP Engines, called Baselines, included in the framework

  • Master Degree Thesis Riccardo Tommasini

    Heaven - Baseline Engines

    [Figure: the baseline architectures. Naive: RDF Stream, DSMS, Reasoner, RDF Stream. Incremental: RDF Stream, DSMS, Incremental Reasoner, RDF Stream, with additions (+) and deletions (-) relative to the active window. Legend: Input Triple, Inferred Triple.]
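The difference between the Naive and the Incremental baselines can be sketched as follows. This is a toy illustration under strong assumptions, not the reasoners shipped with Heaven: the ontology is a hypothetical dictionary of subclass edges, every inference has a single premise, and the incremental version simply counts how many window triples support each derived triple. The names `SUBCLASS_OF`, `infer`, `naive_materialise` and `IncrementalMaterialiser` are invented for the example.

```python
from collections import Counter

# Hypothetical toy ontology T: direct subclass edges (child -> parent).
SUBCLASS_OF = {"GraduateStudent": "Student", "Student": "Person"}


def infer(triple):
    """Return the triple itself plus every type triple implied by the subclass chain."""
    s, p, o = triple
    out = [triple]
    while p == "type" and o in SUBCLASS_OF:
        o = SUBCLASS_OF[o]
        out.append((s, p, o))
    return out


def naive_materialise(window):
    """Naive approach: recompute the full materialisation of the active window."""
    result = set()
    for t in window:
        result.update(infer(t))
    return result


class IncrementalMaterialiser:
    """Incremental approach: apply only the additions (+) and deletions (-)."""

    def __init__(self):
        self.support = Counter()  # derived triple -> number of supporting inputs

    def update(self, added, removed):
        for t in added:
            self.support.update(infer(t))
        for t in removed:
            self.support.subtract(infer(t))
        self.support = +self.support  # drop triples whose support fell to zero
        return set(self.support)


first_window = [("a", "type", "GraduateStudent")]
second_window = [("b", "type", "Student")]

inc = IncrementalMaterialiser()
inc_first = inc.update(added=first_window, removed=[])
inc_second = inc.update(added=second_window, removed=first_window)
```

In this sketch the incremental cost grows with the size of the additions and deletions, while the naive cost grows with the size of the whole window.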

  • Master Degree Thesis Riccardo Tommasini

    Heaven - Data

    The RDF2RDFStream Module

    adapts LUBM data to a streaming scenario

    generates many RDF Streams, controlling the number of contemporary triples

    [Figures: Constant Flow Rate and Step Flow Rate, each plotting contemporary triples against time]
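The two flow rates can be illustrated with a short Python sketch that groups a static list of triples into timestamped events. It is only an illustration of the idea, not the RDF2RDFStream module: the function names (`constant_rate`, `step_rate`, `to_stream`) and the 100 ms period between events are assumptions made for the example.

```python
from itertools import islice


def constant_rate(n):
    """Rate function: every event carries n contemporary triples."""
    return lambda step: n


def step_rate(initial, increment, every):
    """Rate function: add `increment` triples to the event size every `every` events."""
    return lambda step: initial + increment * (step // every)


def to_stream(triples, rate, period_ms=100):
    """Group a static sequence of triples into (timestamp, batch) events."""
    it = iter(triples)
    step = 0
    while True:
        batch = list(islice(it, rate(step)))
        if not batch:
            return  # the static dataset is exhausted
        yield step * period_ms, batch
        step += 1


triples = [(f"s{i}", "p", "o") for i in range(10)]
constant_sizes = [len(batch) for _, batch in to_stream(triples, constant_rate(2))]
step_sizes = [len(batch) for _, batch in to_stream(triples, step_rate(1, 1, 2))]
timestamps = [ts for ts, _ in to_stream(triples, constant_rate(2))]
```

With ten input triples the constant stream yields five events of two triples each, while the step stream grows from one to three triples per event and ends with a partial batch once the input runs out.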

  • Master Degree Thesis Riccardo Tommasini

    Doing - Queries

    Variations of the full materialisation query

    Window Dimension [ms] = S × Slide Parameter, with Slide Parameter = 100 [ms] and S a natural number

    S = 1: Tumbling Window

    S > 1: Sliding Window
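The relation between window dimension, slide parameter and S can be made concrete with a small Python sketch. The function name `windows` and the choice to report a window only once it is completely filled are assumptions of the example; real engines may also report partially filled windows.

```python
def windows(slide_ms, s, horizon_ms):
    """Yield (start, end) bounds of windows of width s * slide_ms that advance by slide_ms."""
    if s < 1:
        raise ValueError("S must be a natural number >= 1")
    width = s * slide_ms
    end = width
    while end <= horizon_ms:
        yield end - width, end
        end += slide_ms


tumbling = list(windows(slide_ms=100, s=1, horizon_ms=300))
sliding = list(windows(slide_ms=100, s=2, horizon_ms=300))
```

With S = 1 consecutive windows do not overlap, while with S = 2 each window shares half of its content with the previous one.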

  • Master Degree Thesis Riccardo Tommasini

    Experiments

    15 SOAK Tests, 10 times for each baseline: 168 hours of execution

    6 STEP Tests, 10 times for each baseline: 150 hours of execution

    [Figures: two plots of contemporary triples against time]

  • Master Degree Thesis Riccardo Tommasini

    Heaven - Analyser

    We exploit a layered investigation method, which answers different possible questions about RSP Engine analysis

    L0 - How to choose an engine?

    L1 - What distinguishes an engine?

    L2 - When to choose an engine?

    L3 - Why choose this engine?

  • Master Degree Thesis Riccardo Tommasini

    Doing - Analyser L0 - Dashboard

    [Figure: dashboard of four panels, each reporting Memory (MB) and Latency (ms), for increasing Window Dimension (ms)]

  • Master Degree Thesis Riccardo Tommasini

    Doing - Analyser L1 - Statistical Comparison

    6.3 SOAK Test Evaluation Results

    In each sub-table, rows give the window size in number of triples and columns give the number of slots.

    (a) Incremental

    Window \ Slots      1      10     100    1000   10000
    1                   G
    10                  G      ≃
    100                 G      ≃      ≃
    1000                G      ≃      ≃      ≃
    10000               NA     T      S      T      T

    (b) Triple

    Window \ Slots      1      10     100    1000   10000
    1                   I
    10                  I      I
    100                 N      I      I
    1000                N      I      I      I
    10000               NA     I      I      I      I

    (c) Naive

    Window \ Slots      1      10     100    1000   10000
    1                   ≃
    10                  ≃      ≃
    100                 G      ≃      T
    1000                G      ≃      T      T
    10000               NA     ≃      ≃      T      T

    (d) Graph

    Window \ Slots      1      10     100    1000   10000
    1                   I
    10                  I      I
    100                 ≃      I      I
    1000                N      I      I      I
    10000               NA     I      I      I      I

    Table 6.7 Analyser Investigation Stack - Level 1 - SOAK Test average latency comparison through a qualitative approach. The following convention indicates that the baseline has not reached the Steady State Condition: G, T, N, I. (a), (c): latency comparison between Graph-based (G) and Triple-based (T) models for the Incremental and Naive approaches; (b), (d): latency comparison between Incremental (I) and Naive (N) approaches for the Triple-based and Graph-based models.

    representation, but Heaven also allows more detailed analysis with quantitative comparisons, as shown in Section 5.4. To read the tables properly, note that they report a baseline as better than another when the difference in terms of latency or memory is greater than 5%; otherwise we consider the two terms of comparison equal and use the symbol ≃. Moreover, we indicate that the better solution has not reached the Steady State Condition with the underlined symbols G, T, N, I.

    When N > 1, the results in Table 6.7.a and 6.7.c show that using a Triple-based RDF stream is faster than using a Graph-based one. In particular, for the case N=1000 when the window contains 1000 triples (i.e., each CTEvent contains only one triple), the Naive Triple-based approach is about 10% faster than the Naive Graph-based one, while the Incremental Graph-based is even about 20% faster. These findings confirm [Hp.2], while the cases where N=10 do not confirm the hypothesis, because the results can be considered equal (result differences are smaller than 5%). A possible explanation is that

    In each sub-table, rows give the window size in number of triples and columns give the number of slots.

    (a) Incremental

    Window \ Slots      1      10     100    1000   10000
    1                   T
    10                  G      T
    100                 G      T      G
    1000                G      G      G      T
    10000               NA     G      G      G      G

    (b) Triple

    Window \ Slots      1      10     100    1000   10000
    1                   N
    10                  I      N
    100                 N      N      I
    1000                N      I      I      I
    10000               NA     I      I      I      I

    (c) Naive

    Window \ Slots      1      10     100    1000   10000
    1                   G
    10                  G      T
    100                 G      G      T
    1000                G      G      G      T
    10000               NA     G      G      T      T

    (d) Graph

    Window \ Slots      1      10     100    1000   10000
    1                   N
    10                  N      N
    100                 ≃      N      I
    1000                ≃      I      I      I
    10000               NA     N      I      I      I

    Table 6.8 Analyser Investigation Stack - Level 1 - SOAK Test average memory comparison through a qualitative approach. The following convention indicates that the baseline has not reached the Steady State Condition: G, T, N, I. (a), (c): memory comparison between Graph-based (G) and Triple-based (T) models for the Incremental and Naive approaches; (b), (d): memory comparison between Incremental (I) and Naive (N) approaches for the Triple-based and Graph-based models.

    the dimension of the graph cannot be considered small w.r.t. the window when N=10.

    When N=1 (i.e., the window contains only one CTEvent), instead, the results in Table 6.7.b and Table 6.7.d show that for large events the Naive approach is faster than the Incremental one, as we stated when we formulated [Hp.1]. When the CTEvent contains only a few triples, however, the Incremental approach is faster. This is not intuitive, because to formulate [Hp.1] we considered the dimension of the changes as a percentage.

    The results in Table 6.7.b and 6.7.d support [Hp.1]: when the number of changing triples in + (Section 4.2) is a small fraction of those in the window, an Incremental approach is faster than the Naive one. The exception is the case N=1, but it can be seen as a limit case, where the reasoner is asked to deduce all the implicit triples implied by the only explicit triple in the window.
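The rule used to fill Tables 6.7 and 6.8 (one baseline is reported as better only when the difference exceeds 5%, otherwise the cell is ≃) can be written as a small Python function. This is a sketch under two assumptions that the text does not spell out: lower latency or memory is better, and the difference is measured relative to the larger of the two values. The function name `compare` is hypothetical.

```python
def compare(label_a, value_a, label_b, value_b, threshold=0.05):
    """Return the label of the better (lower) value, or "≃" when within the threshold."""
    worse = max(value_a, value_b)
    # Assumption: the relative difference is taken with respect to the larger value.
    if worse == 0 or abs(value_a - value_b) / worse <= threshold:
        return "≃"
    return label_a if value_a < value_b else label_b


clear_win = compare("T", 90.0, "G", 100.0)    # 10% apart: one baseline is better
near_tie = compare("T", 97.0, "G", 100.0)     # 3% apart: considered equal
other_side = compare("I", 120.0, "N", 100.0)  # the second baseline is better
```

Each cell of the qualitative tables would then hold the result of one such call on the average latency or memory of the two baselines being compared.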

    Legend: I: Incremental, N: Naive

    Window Dimension = Slide Parameter × S

  • Master Degree Thesis Riccardo Tommasini

    Doing - Analyser L2 - Pattern Identification

    [Table 6.11: representation in the time domain of memory for GN (a, Graph Naive) and GI (b, Graph Incremental). Each panel is a grid of plots indexed by window size in number of triples (1, 10, 100, 1000, 10000) and by number of slots (1, 10, 100, 1000, 10000).]

  • Master Degree Thesis Riccardo Tommasini

    Doing - Analyser L3 - Visual Comparison

  • Master Degree Thesis Riccardo Tommasini

    Agenda - Evaluation

    Motivations

    Research Question

    Development

    Evaluation

    Conclusion

  • Master Degree Thesis Riccardo Tommasini

    Done

    Can an engine test stand, together with existing queries, datasets and metrics, enable SCRA of RSP Engines?

    My contributions are

    Test Stand

    Baselines

    Method

    Analysis

  • Master Degree Thesis Riccardo Tommasini

    Future Works

    SCRA of RSP Engines is just at the beginning. Further developments of Heaven are possible.

    Benchmark Suite

    Heaven as a Service

    Research on Baselines

    Research on Existing RSP Engines

  • Master Degree Thesis Riccardo Tommasini

    Agenda - Conclusion

    Motivations

    Research Question

    Development

    Evaluation

    Conclusion

  • Master Degree Thesis Riccardo Tommasini

    Thank You!

  • Master Degree Thesis Riccardo Tommasini

    Questions?
