
Database foundations for scalable RDF processing

Katja Hose1, Ralf Schenkel1,2, Martin Theobald1, and Gerhard Weikum1

1 Max-Planck-Institut für Informatik, Saarbrücken, Germany
2 Saarland University, Saarbrücken, Germany

Abstract. As more and more data is provided in RDF format, storing huge amounts of RDF data and efficiently processing queries on such data is becoming increasingly important. The first part of the lecture will introduce state-of-the-art techniques for scalably storing and querying RDF with relational systems, including alternatives for storing RDF, efficient index structures, and query optimization techniques. As centralized RDF repositories have limitations in scalability and failure tolerance, decentralized architectures have been proposed. The second part of the lecture will highlight system architectures and strategies for distributed RDF processing. We cover search engines as well as federated query processing, highlight differences to classic federated database systems, and discuss efficient techniques for distributed query processing in general and for RDF data in particular. Moreover, for the last part of this chapter, we argue that extracting knowledge from the Web is an excellent showcase – and potentially one of the biggest challenges – for the scalable management of uncertain data we have seen so far. The third part of the lecture is thus intended to provide a close-up on current approaches and platforms to make reasoning (e.g., in the form of probabilistic inference) with uncertain RDF data scalable to billions of triples.

1 RDF in centralized relational databases

The increasing availability and use of RDF-based information in the last decade has led to an increasing need for systems that can store RDF and, more importantly, efficiently evaluate complex queries over large bodies of RDF data. The database community has developed a large number of systems to satisfy this need, partly reusing and adapting well-established techniques from relational databases [122]. The majority of these systems can be grouped into one of the following three classes:

1. Triple stores that store RDF triples in a single relational table, usually with additional indexes and statistics,

2. Vertically partitioned tables that maintain one table for each property, and

3. Schema-specific solutions that store RDF in a number of property tables where several properties are jointly represented.

In the following sections, we will describe each of these classes in detail, focusing on two important aspects of these systems: storage and indexing, i.e., how RDF triples are mapped to relational tables and which additional support structures are created; and query processing, i.e., how SPARQL queries are mapped to SQL, which additional operators are introduced, and how efficient execution plans for queries are determined. In addition to these purely relational solutions, a number of specialized RDF systems have been proposed that build on non-relational technologies; we will briefly discuss some of these systems. Note that we will focus on SPARQL3 processing, which is not aware of any underlying RDF/S or OWL schema and cannot exploit any information about subclasses; this is usually done in an additional layer on top.

We will explain especially the different storage variants with the running example from Figure 1, some simple RDF facts from a university scenario. Here, each line corresponds to a fact (triple, statement), with a subject (usually a resource), a property (or predicate), and an object (which can be a resource or a constant). Even though resources are represented by URIs in RDF, we use string constants here for simplicity. A collection of RDF facts can also be represented as a graph. Here, resources (and constants) are nodes, and for each fact <s,p,o>, an edge from s to o is added with label p. Figure 2 shows the graph representation for the RDF example from Figure 1.

<Katja,teaches,Databases>

<Katja,works_for,MPI Informatics>

<Katja,PhD_from,TU Ilmenau>

<Martin,teaches,Databases>

<Martin,works_for,MPI Informatics>

<Martin,PhD_from,Saarland University>

<Ralf,teaches,Information Retrieval>

<Ralf,PhD_from,Saarland University>

<Ralf,works_for,Saarland University>

<Ralf,works_for,MPI Informatics>

<Saarland University,located_in,Germany>

<MPI Informatics,located_in,Germany>

Fig. 1. Running example for RDF data

1.1 Triple Stores

Triple stores keep RDF triples in a simple relational table with three or four attributes. This very generic solution with low implementation overhead has been very popular, and a large number of systems based on this principle are available. Prominent examples include 3store [56] and Virtuoso [41] from the Semantic Web community, and RDF-3X [101] and HexaStore [155] that were developed by database groups.

3 http://www.w3.org/TR/rdf-sparql-query/


Fig. 2. Graph representation for the RDF example from Figure 1

Storage. RDF facts are mapped to a generic three-attribute table of the form (subject, property, object), also known as a triple table; for simplicity, we will abbreviate the attributes by S, P, and O. To save space (and to make access structures more efficient), most systems convert resource identifiers, properties, and constants to numeric ids before storing them in the relation, for example by hashing. The resulting map is usually stored in an additional table, sometimes separately for resource ids and constants. If a system stores data from more than one source (or more than one RDF graph), the relation is often extended by a fourth numeric attribute, the graph id (abbreviated as G), that uniquely identifies the source of a triple. In this case, the relation is also called a quadruple table.
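As a minimal relational sketch of this layout (table and attribute names are illustrative, not those of any particular system), the dictionary and the triple table could be declared as follows:

-- dictionary mapping URIs and constants to numeric ids
CREATE TABLE dictionary (
  id    BIGINT PRIMARY KEY,
  value VARCHAR NOT NULL UNIQUE
);

-- triple table over numeric ids; a quadruple table adds a fourth column g
CREATE TABLE triple (
  s BIGINT NOT NULL,
  p BIGINT NOT NULL,
  o BIGINT NOT NULL
);

Loading a fact then means looking up (or inserting) its three components in dictionary and inserting the resulting id triple into triple.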

Figure 3 shows the resulting three-attribute relation for the example from Figure 1.

For efficient query processing, indexes on (a subset of) all combinations of S, P, and O are maintained. This allows all matches for a triple pattern of a SPARQL query to be retrieved efficiently. We will often refer to indexes with the sequence of the abbreviations of the indexed attributes (such as SPO). Since each index has approximately the size of the relation, the number of combinations for which indexes are kept is usually limited, or indexes are stored in a compressed way.
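In a relational backend, such permutation indexes could simply be realized as composite indexes over the sketched triple table; a hedged example for three of the six orders:

-- composite indexes covering common access patterns (illustrative)
CREATE INDEX idx_spo ON triple (s, p, o);
CREATE INDEX idx_pos ON triple (p, o, s);
CREATE INDEX idx_osp ON triple (o, s, p);
-- the remaining orders (sop, pso, ops) are created analogously, space permitting

Each triple pattern of a query can then be answered by a range scan on the index whose prefix matches the pattern's constants.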

Virtuoso [40] comes with a space-optimized way of mapping resources, predicates, and constants to numeric ids (IRI ID). These strings are mapped to numeric ids only if they are long (which means at least 9 bytes long); otherwise, they are stored as text in the quadruple relation (for short objects, this saves space over a solution that maps everything to ids).

subject property object

Katja teaches Databases

Katja works for MPI Informatics

Katja PhD from TU Ilmenau

Martin teaches Databases

Martin works for MPI Informatics

Martin PhD from Saarland University

Ralf teaches Information Retrieval

Ralf PhD from Saarland University

Ralf works for Saarland University

Ralf works for MPI Informatics

Saarland University located in Germany

MPI Informatics located in Germany

Fig. 3. Triple store representation for the running example from Figure 1

It uses a standard quadruple table (G,S,P,O) with a primary key index on all four attributes together. In addition, it uses a bitmap index on (O,G,P,S): this index maintains, for each combination of (O,G,P) in the data, a bit vector. Each subject is assigned a bit position, and for a quadruple (g,s,p,o) in the data, the bit position for s is set to 1 in the bit vector for (o,g,p). Virtuoso stores distinct values within a page only once and eliminates common prefixes of strings. An additional compression of each page with gzip yields a compression from 8K to 3K for most pages.

RDF-3X uses a standard triple table and does not explicitly support multiple graphs. However, RDF-3X never actually materializes this table, but instead keeps clustered B+ tree indexes on all six combinations of (S,P,O). Additionally, RDF-3X includes aggregated indexes for each possible pair of (S,P,O) and each order, resulting in six additional indexes. The index on (S,O), for example, stores for each pair of subject and object that occurs in the data the number of triples with this subject and this object; we will refer to this index as SO*. Such an index allows queries like select ?s ?o where {?s ?p ?o} to be answered efficiently. We could process this query by scanning the SOP index, but we do not need the exact bindings for ?p to generate the result, so we would read many index entries that do not add new results. All we need is, for each binding of ?s and ?o, the number of triples with this subject and this object, so that we can generate the right number of results for this binding (including duplicates). The SO* index provides exactly this information. Finally, RDF-3X maintains aggregated indexes for each single attribute, again plus triple counts.
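To illustrate the idea behind the aggregated indexes (this is only a relational sketch, not RDF-3X's actual storage format), SO* can be thought of as a precomputed count table over the triple table from above, and the query select ?s ?o where {?s ?p ?o} then reduces to a single scan of it:

-- aggregated SO* index as a count table (illustrative)
CREATE TABLE idx_so AS
SELECT s, o, COUNT(*) AS cnt
FROM triple
GROUP BY s, o;

-- answer select ?s ?o where {?s ?p ?o}: each (s,o) pair is emitted cnt times
SELECT s, o, cnt FROM idx_so;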

To reduce the space requirements of these indexes, RDF-3X stores the leaves of the indexes in pages and compresses them. Since these leaves contain triples that often share some attribute values and all attributes are numeric, it uses delta encoding for compression. This, together with encoding strings into comparably short numbers, helps to keep the overall size of the database comparable to or even slightly smaller than the size of the uncompressed RDF triples in textual representation. The original RDF-3X paper [101] includes a discussion of space-time tradeoffs for compression and shows that, for example, compression with LZ77 generates more compact indexes, but requires significantly more time to decompress.

Query Processing and Optimization. Query execution on a quadruple store is done in two steps: converting the SPARQL query into an equivalent SQL query, and creating and executing a query plan for this SQL query.

Step 1. The conversion of a SPARQL query to an equivalent SQL query on the triple/quadruple table is a rather straightforward process; we will explain it now for triple tables. For each triple pattern in the SPARQL query, a copy of the triple relation is added to the query. Whenever a common variable is used in two patterns, a join between the corresponding relation instances is created on the attributes where the variable occurs. Any constants are directly mapped to conditions on the corresponding relation's attribute.

As an example, consider the SPARQL query

SELECT ?a ?b WHERE

{?a works_for ?u.

?b works_for ?u.

?a phd_from ?u. }

which selects people who work at the same place where they got their PhD, together with their coworkers. This is mapped to the SQL query

SELECT t1.s, t2.s FROM triple t1, triple t2, triple t3

WHERE t1.p=’works_for’

AND t2.p=’works_for’

AND t3.p=’phd_from’

AND t1.o=t2.o

AND t1.o=t3.o

AND t1.s=t3.s

Note that in a real system, the string constants would usually be mapped to the numeric id space first. Further note that we can optimize away one join condition here (t2.o=t3.o) since it is redundant: it is implied by the other two equalities.

Step 2. Now that we have a standard SQL query, it is tempting to simply rely on existing relational backends for optimizing and processing this query. This is actually done in many systems, and even those systems which implement their own backend use the same operators used in relational databases. Converting the SQL query into an equivalent abstract operator tree, for example an expression in relational algebra, is again straightforward.

Once this is done, the next step is creating an efficient physical execution plan, i.e., deciding how the abstract operators (joins, projection, selection) are mapped to physical implementations in a way that the resulting execution is as cheap (in terms of I/O and CPU usage) and as fast (in terms of processing time) as possible. The choice of implementations is rather large; for example, a join operator can be implemented with merge joins, hash joins, or nested loop joins. Additionally, a number of specialized joins exist (such as outer joins and semi joins) that can further improve efficiency. Important physical operators in many systems are index lookups and scans, which exploit the numerous indexes that systems keep. Often, each triple pattern in the original SPARQL query corresponds to an index scan in the corresponding index if that index is available; for example, the triple pattern ?a works_for ?b could be mapped to a scan of the PSO index, if that exists. If the optimal index does not exist, scans of less specific indexes can be used, but some information from that index must be skipped. For example, if the system provides only an O index, a pattern ?a works_for MPI Informatics can be mapped to a scan of the O index, starting at MPI Informatics and skipping all entries that do not match the predicate constraint.
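In terms of the sketched triple table, the two cases roughly correspond to the following queries, where :id_works_for and :id_mpi stand for the dictionary ids of the constants (again an illustration, not a particular system's physical plan):

-- ?a works_for ?b with a PSO index: a pure prefix range scan on p
SELECT s, o FROM triple WHERE p = :id_works_for;

-- ?a works_for MPI Informatics with only an O-prefixed index:
-- scan all entries with this object and skip those with a different predicate
SELECT s FROM triple WHERE o = :id_mpi AND p = :id_works_for;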

Finding the most efficient plan now includes considering possible variants of physical plans (such as different join implementations, different join orders, etc.) and selecting the cheapest one. This, in turn, requires that the execution cost for each plan is estimated. It turns out that off-the-shelf techniques implemented in current relational databases, for example attribute-level histograms to represent value distributions, are of limited help: these techniques were not built for dealing with a single, large table. The main problem is that they ignore correlation of attributes, since statistics are available only separately for each attribute. Estimates (for example how many results a join will have, or how many results a selection will have) are therefore often way off, which can lead to arbitrarily bad execution plans. Multi-dimensional histograms, on the other hand, could capture this correlation, but can easily get too large for large-scale RDF data.

RDF-3X [100, 101] comes with specialized data structures for maintaining statistics. It uses histograms that can handle any triple pattern and any join, but assume independence of different patterns, and it comes with optimizations for frequent join paths. To further speed up processing, it applies sideways information passing between operators [100]. It also includes techniques to deal with unselective queries which return a large fraction of the database.

Virtuoso [40] exploits the bit vectors in its indexes for simple joins, which can be expressed as a conjunction of these sparse bit vectors. As an example, consider the SPARQL query

select ?a

where {?a works_for Saarland University.

?a works_for MPI Informatics.}

To execute this, it is sufficient to load the bit vectors for both triple patterns and intersect them. For cost estimation, Virtuoso does not rely on per-attribute histograms, but uses query-time sampling: if a triple pattern has constants for p and o and the graph is fixed, it loads the first page of the bit vector for that pattern from the index, and extrapolates selectivity from the selectivity of this small sample.

Further solutions to the problem of selectivity estimation for graph queries were proposed in [90, 91] outside the context of an RDF system; Stocker et al. [135] consider the problem of query optimization with graph patterns.

1.2 Vertically Partitioned Tables

The vast majority of triple patterns in queries from real applications have fixed properties. To exploit this fact for storing RDF, one table with two attributes, one for storing subjects and one for storing objects, is created for each property in the data; if quadruples are stored, a third attribute for the graph id is added. An RDF triple is now stored in the table for its property. As in the triple table solution, string literals are usually encoded as numeric ids. Figure 4 shows how our example data from Figure 1 is represented with vertically partitioned tables.

teaches

subject object

Katja Databases

Martin Databases

Ralf Information Retrieval

works for

subject object

Katja MPI Informatics

Martin MPI Informatics

Ralf MPI Informatics

Ralf Saarland University

PhD from

subject object

Katja TU Ilmenau

Martin Saarland University

Ralf Saarland University

located in

subject object

Saarland University Germany

MPI Informatics Germany

Fig. 4. Representation of the running example from Figure 1 with vertically partitioned tables

Since the tables have only two columns, this idea can be pushed further by not storing them in a traditional relational system (a row store), but in a column store. A column store does not store tables as collections of rows, but as collections of columns, where each entry of a column comes with a unique ID that allows the rows to be reconstructed at query time. This has the great advantage that all entries within a column have the same type and can therefore be compressed very efficiently. The idea of using column stores for RDF was initially proposed by Abadi et al. [2]; Sidirourgos et al. [130] pointed out advantages and disadvantages of this technique.

Regarding query processing, it is evident that triple patterns with a fixed property can be evaluated very efficiently, by simply scanning the table for this property (or, in a column store, accessing the columns of this table). Query optimization can also be easier as per-table statistics can be maintained. On the other hand, triple patterns with a property wildcard are very expensive since they need to access all two-column tables and form the union of the results.
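For example, asking for people who work at the institution they received their PhD from (the core of the query from Section 1.1) becomes a single join between two property-specific tables; a sketch over the tables of Figure 4, with underscores in the table names for readability:

-- people working at the institution they received their PhD from
SELECT w.subject
FROM works_for w, PhD_from p
WHERE w.subject = p.subject
  AND w.object = p.object;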

1.3 Property Tables

In many RDF data collections, a large number of subjects have the same or at least largely overlapping sets of properties, and many of these properties will be accessed together in queries (like in our example above that asked for people that did their PhD at the same place where they are working now). Combining all properties of a subject in the same table makes processing such queries much faster since there is no need for a join to combine the different properties. Property tables do exactly this: groups of subjects with similar properties are represented by a single table where each attribute corresponds to a property. A set of facts for one of these subjects is then stored as one row in that table, where one column represents the subjects, and the other columns store objects for the properties that correspond to that column, or NULL if no such property exists. The most prominent example for this storage structure is Jena [23,158]. Chong et al. [27] proposed property tables as an external view (which can be materialized) to simplify access to triple stores. Levandoski et al. [80] demonstrate that property tables can outperform triple stores and vertically partitioned tables for RDF data collections with regular structure, such as DBLP or DBpedia.

This table layout comes with two problems to solve. First, there should not be too many NULL values since they increase storage space, so storing the whole set of facts in a single table is not a viable solution. Instead, the set of subjects to store in one table can be determined by clustering subjects by the set of their properties, or subjects of the same type can be stored in the same table if schema information is available. Second, multi-valued properties, i.e., properties that can have more than one object for the same subject, cannot be stored in this way without breaking the relational paradigm. In our example, people can work for more than one institution at the same time. To solve this, one can create multiple attributes for the same property, but this works only if the maximal number of different objects for the same property and the same subject is rather small. Alternatively, one can store facts with these properties in a standard triple table.

Figure 5 shows the representation of the example from Figure 1 with property tables. We grouped information about people in the People table and information about institutions in the Institutions table. As the works_for property can have multiple objects per subject, we store facts with this property in the Remainder triple table.
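With this layout, the question used above (people who work where they obtained their PhD) needs no join for the single-valued PhD_from property, but still one join with the Remainder table for the multi-valued works_for property; a sketch, with column names following Figure 5 and underscores added for readability:

-- people working at the institution they received their PhD from
SELECT p.subject
FROM People p, Remainder r
WHERE r.predicate = 'works_for'
  AND r.subject = p.subject
  AND r.object = p.PhD_from;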

1.4 Specialized Systems

Beyond the three classes of systems we presented so far, there are a number of systems that do not fit into these categories. We will briefly sketch these systems here without giving much detail, and refer the reader to the referenced original papers for more information.

Atre et al. [9] recently proposed to store RDF data in matrix form. Compressed bit vectors help to make query processing efficient. Zhou and Wu [160] propose to split RDF data into XML trees, rewrite SPARQL as XPath and XQuery expressions, and implement an RDF system on top of an XML database. Fletcher and Beck [42] propose to index not triples, but atoms, and introduce the Three-Way Triple Tree, a disk-based index.

People

subject teaches PhD from

Katja Databases TU Ilmenau

Martin Databases Saarland University

Ralf Information Retrieval Saarland University

Institutions

subject located in

Saarland University Germany

MPI Informatics Germany

Remainder

subject predicate object

Katja works for MPI Informatics

Martin works for MPI Informatics

Ralf works for Saarland University

Ralf works for MPI Informatics

Fig. 5. Representation of the running example from Figure 1 with property tables

A number of proposals aim at representing and indexing RDF as graphs. Baolin and Bo [85] combine in their HPRD system a triple index with a path index and a content index. Liu and Hu [84] propose to use a dedicated path index to improve the efficiency of RDF query processing. GRIN [149] explicitly uses a graph index for RDF processing. Matono et al. [92] propose to use a path index based on suffix arrays. Bröcheler et al. propose to use DOGMA, a disk-based graph index, for RDF processing.

2 RDF in distributed setups

Managing and querying RDF data efficiently in a centralized setup is an important issue. However, with the ever-growing amount of RDF data published on the Web, we also have to pay attention to relationships between web-accessible knowledge bases. Such relationships arise when knowledge bases store semantically similar data that overlaps. For example, a source storing extracted information from Wikipedia (DBpedia [10]) and a source providing geographical information about places (GeoNames4) might both provide information about the same city, e.g., Berlin.

Such relationships are often expressed explicitly in the form of RDF links, i.e., subject and object URIs refer to different namespaces and therefore establish a semantic connection between entities contained in different knowledge bases. Consider again the above mentioned sources (DBpedia and GeoNames) [15]: both provide information about the same entity (e.g., Berlin) but use different identifiers. Thus, the following RDF triple links the respective URIs and expresses that both sources refer to the same entity: (<http://dbpedia.org/resource/Berlin>, <owl:sameAs>, <http://sws.geonames.org/2950159/>)
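As a hedged illustration of how such a link can be exploited, the following SPARQL query asks for everything both sources state about Berlin, joining the two descriptions via the owl:sameAs link; evaluating it requires access to both knowledge bases, which is exactly the distributed processing problem discussed in the rest of this section:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?p1 ?o1 ?p2 ?o2 WHERE {
  <http://dbpedia.org/resource/Berlin> ?p1 ?o1 .
  <http://dbpedia.org/resource/Berlin> owl:sameAs ?same .
  ?same ?p2 ?o2 .
}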

The exploitation of these relationships and links offers users the possibility to obtain a wide variety of query answers that could not be computed considering

4 http://www.geonames.org/ontology/

a single knowledge base but require the combination of knowledge provided by multiple sources. Computing such query answers requires sophisticated reasoning and query optimization techniques that also take the distribution into account. An important issue in this context that needs to be considered is the interface that is available to access a knowledge base [63]; some sources provide SPARQL endpoints that can answer SPARQL queries, whereas other sources offer RDF dumps – an RDF dump corresponds to a large RDF document containing the RDF graph representing a source's complete dataset.

The Linking Open Data initiative5 is trying to foster the process of establishing links between web-accessible RDF data sources – resulting in Linked Open Data [15]; a detailed introduction to Linked Data is provided in [11]. The Linked Open Data cloud resembles the structure of the World Wide Web (web pages connected via hyperlinks) and relies on the so-called Linked Data principles [14]:

– Use URIs as names for things.
– Use HTTP URIs so that people can look up those names.
– When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
– Include links to other URIs, so that they can discover more things.

For a user there are several ways to access knowledge bases and exploit links between them. A basic solution is browsing [38,57,148]: the user begins with one data source and progressively traverses the Web by following RDF links to other sources. Browsers allow users to navigate through the sources and are therefore well-suited for non-experts. However, for complex information needs, formulating queries in a structured query language and executing them on the RDF sources is much more efficient. Thus, in the remainder of this section we will discuss the main approaches for distributed query processing on RDF data sources – a topic which has also been considered in recent surveys [51,63]. We begin with an overview of search engines in Section 2.1 and proceed with approaches adhering to the principles of data warehouses (Section 2.2) and federated systems (Section 2.3). We proceed with approaches that discover new sources during query processing (Section 2.4) and end with approaches applying the P2P principle (Section 2.5).

2.1 Search engines

Before discussing how traditional approaches known from distributed database systems can be applied in the RDF and Linked Data context, let us discuss an alternative to find linked data on the Web: search engines.

The main idea is to crawl RDF data from the Web and create a centralized index based on the crawled data. The original data is not stored permanently but dropped once the index is created. Since the data provided by the original sources changes, indexes need to be recreated from time to time. In order to

5 http://esw.w3.org/SweolG/TaskForces/CommunityProjects/LinkingOpenData

answer a user query, all that needs to be done is to perform a lookup operation on the index and to determine the set of relevant sources. In most cases user queries consist of keywords for which the search engine tries to find matching data. The results found in this way and the relevant sources are output to the user with some additional information about the results.

The literature proposes several central index search engines; some of them are discussed in the following. We can distinguish local-copy approaches [26, 34, 58] that collect local copies of the crawled data and index-only approaches [37,108] that only hold local indexes for the data found on the Web. Another distinguishing characteristic is what search engines find: RDF documents [37,108] or RDF objects/entities [26,34,58].

Swoogle. One of the most popular search engines for the Semantic Web was Swoogle [37]. It was designed as a system to automatically discover Semantic Web documents, index their metadata, and answer queries about them. Using this engine, users can find ontologies, documents, terms, and other data published on the Web.

Swoogle uses an SQL database to store metadata about the documents, i.e., information about encoding, language, statistics, ontology annotations, relationships among documents (e.g., one ontology imports another one), etc. A crawler-based approach is used to discover RDF documents, metadata is extracted, and relationships between documents are computed. Swoogle uses two kinds of crawlers: a Google crawler using the Google web service to find relevant Semantic Web documents and a crawler based on Jena2 [157], which identifies Semantic Web content in a document, analyzes the content, and discovers new documents through semantic relations.

Discovered documents are indexed by an information retrieval system, which can use either character N-grams or URIrefs as keywords to find relevant documents and to compute the similarity among a set of documents. Swoogle computes ranks for documents in a similar way as the PageRank [109] algorithm used by Google.

Thus, a user can formulate a query using keywords and Swoogle will report back a list of documents matching those keywords in ranked order. Optionally, a user might also add content-based constraints to a general SQL query on the underlying database, e.g., the type of the document, the number of defined classes, language, encoding, etc.

Semantic Web Search Engine (SWSE). In contrast to other, document-centric search engines, which given a keyword look for relevant sources and documents, SWSE [58] is entity-centric, i.e., given a keyword it looks for relevant entities.

SWSE uses a hybrid approach towards data integration: first, it uses YARS2 [60] as an internal RDF store to manage the crawled data (data warehousing) and second, it applies virtual integration using wrappers for external sources. In order to increase linkage between data from different sources, SWSE

applies entity consolidation. The goal is to identify URIs that actually refer to the same real-world entity by finding matches among the values of inverse functional properties. In addition, existing RDF entities are linked to Web documents (e.g., HTML) using an inverted index over the text of documents and specific properties such as foaf:name to identify entities.

YARS2 uses three different index types to index the data: (i) a keyword index using Apache Lucene6 as an inverted index – this index maps terms occurring in an RDF object of a triple to the subject, (ii) quad7 indexes in six different orderings of triple components – distributed over a set of machines according to a hash function, and (iii) join indexes to speed up queries containing combinations of values or paths in the graph.

In the first step of searching, the user defines a keyword query. The result of the search is a list of all entities matching the keyword together with a summary description for the entities. The results are ranked by an algorithm similar to PageRank combining ranks from the RDF graph with ranks from the data source graph [65]. To refine the search and filter results, the user can specify a specific type/class, e.g., person, document, etc. By choosing a specific result entity, the user can obtain additional information. These additional pieces of information might originate from different sources. Users can then continue their exploration by following semantic links that might also lead to documents related to the queried entity.

WATSON. WATSON [34] is a tool and an infrastructure that automatically collects, analyzes, and indexes ontologies and semantic data available online in order to provide efficient access to this knowledge for Semantic Web users and applications.

The first step, of course, is crawling. WATSON tries to discover locations of semantic documents and collects them when found. The crawling process exploits well-known search engines and libraries, such as Swoogle and Google. The retrieved ontologies are inspected for links to other ontologies, e.g., exploiting owl:import, rdfs:seeAlso, dereferenceable URIs, etc.

Afterwards, the semantic content is validated, indexed, and metadata is generated. WATSON collects metadata such as the language, information about contained entities (classes, properties, literals), etc. By exploiting semantic relations (e.g., owl:import), implicit links between ontologies can be computed in order to detect and remove duplicate information, so that storing redundant information and presenting duplicated results to the user can be avoided.

WATSON supports keyword-based queries similar to Swoogle to retrieve and access semantic content including a particular search phrase (multiple keywords are supported). Keywords are matched against the local names, labels, comments, and/or literals occurring in ontologies. WATSON returns URIs of matching entities, which can serve as an entry point for iterative search and exploration.

6 http://lucene.apache.org/java/docs/fileformats.html
7 A quad is a triple extended by a fourth value indicating the origin.

Sindice. Sindice [108] crawls RDF documents (files and SPARQL endpoints) from the Semantic Web and uses three indexes for resource URIs, Inverse Functional Properties (IFPs)8, and keywords. Consequently, the user interface allows users to search for documents based on keywords, URIs, or IFPs. To process the query, the query only needs to be passed to the relevant index, results need to be gathered, and the output (HTML page) needs to be created.

The indexes correspond to inverted indexes of occurrences in documents. Thus, the URI index has one entry for each URI. The entry contains a list of document URLs that mention the corresponding URI. The structure of the keyword index is the same; the only difference is that it does not consider URIs but tokens (extracted from literals in the documents). The IFP index uses the uniquely identifying pair (property, value).

When looking for the URI of a specific entity (e.g., Berlin), Sindice provides the user with several documents that mention the searched URI. For each result some further information is given (human description, date of last update) to enable users to choose the most suitable source. The results are ranked in order of relevance, which is determined based on the TF/IDF relevance metric [43] from information retrieval, i.e., sources that share rare terms (URIs, IFPs, keywords) are preferred. In addition, the ranking prefers sources whose URLs are similar to the queried URIs, i.e., containing the same hostnames.

Falcons. In contrast to other search engines, Falcons [26] is a keyword-based search engine that focuses on finding linked objects instead of RDF documents or ontologies. In this sense, it is similar to SWSE.

Just like other systems, Falcons crawls RDF documents, parses them using Jena9, and follows URIs discovered in the documents for further crawling. To store the RDF triples, Falcons creates quadruples and uses a quadruple store (MySQL).

To provide detailed information about objects, the system constructs a virtual document containing literals associated with the object, e.g., human-readable names and descriptions (rdfs:label, rdfs:comment).

Falcons creates an inverted index based on the terms in the virtual documents and uses this index later on for keyword-based search. A second inverted index is built based on the objects' classes – it is used to perform filtering based on the objects' classes/types. For a query containing both keywords and class restrictions, Falcons computes the intersection between the result sets returned by both indexes.

Thus, given a keyword query, Falcons uses the index to find virtual documents (and therefore objects) containing the keywords. Falcons supports class-based (typing) query refinement by employing class-inclusion reasoning; this is the main difference to SWSE, which does not allow such refinements and reasoning.

8 http://www.w3.org/TR/owl-ref/#InverseFunctionalProperty-def: If a property is declared to be inverse-functional, then the object of a property statement uniquely determines the subject (some individual).

9 http://jena.sourceforge.net

The result objects found in this way are ranked by considering both their relevance to the query (similarity between the virtual documents and the keyword query) and their popularity (a measure based on the number of documents referring to the object).

Each presented result object is accompanied by a snippet that shows associated literals and linked objects matched with the query; in addition, detailed RDF descriptions loaded from the quadruple store are shown.

More systems. There are many more search engines, such as OntoSearch2 [111], which stores a copy of an ontology in a tractable description logic and supports SPARQL as a query language to find, for instance, all the instances of a given class or the relations occurring between two instances. Several other systems aim at providing efficient access to ontologies and semantic data available online. For example, OntoKhoj [112] is an ontology portal that crawls, classifies, ranks, and searches ontologies. For ranking it uses an algorithm similar to PageRank. Oyster [110] is different in the sense that it focuses on ontology sharing: users register ontologies and their metadata, which can afterwards be accessed over a peer-to-peer network of local registries. Finally, let us mention OntoSelect [19], an ontology library that focuses on providing natural language based access to ontologies.

2.2 Data warehousing

So far, we have discussed several techniques to search for documents and entities using search engines. However, the problem of answering queries based on data provided by multiple sources has been known in the database community since the 80s. Thus, from a database point of view the scenario we face with distributed RDF processing is similar to data integration.

Figure 6 shows a categorization of data integration systems from a classical database point of view. The same categorization can be applied to the problem of processing Linked Data because, despite some differences, similar problems have to be solved and similar techniques have already been applied by different systems for Linked Data processing.

The first and most important characteristic that distinguishes data integration systems is whether they copy the data into a central database, or storage system respectively, or leave the data at the sources and only work with statistical information about the sources, e.g., indexes. Approaches falling into the former category are generally referred to as materialized data integration systems, with data warehouses as a typical implementation. Approaches of the second category are referred to as virtual data integration systems as they only virtually integrate the data without making copies.

The process of integrating data into a data warehouse is referred to as the ETL process (Extract-Transform-Load). First, the data is extracted from the sources, then it is transformed, and finally loaded into the data warehouse. This workflow can also be applied in the Semantic Web context.

Extract. As many knowledge bases are available as dumps for download, it is possible to download a collection of interesting linked datasets and import them into a data warehouse. This warehouse resembles a centralized RDF storage system that can be queried and optimized using the techniques discussed in Section 1. If a data source is not available for download, the data can be crawled by looking up URIs or accessing a SPARQL endpoint. As the data originates from different sources, the system should keep track of provenance, e.g., by using named graphs [22] or quads.

Transform. Data warehouses in the database world usually provide the data in an aggregated format, e.g., in a data warehouse storing information about sales we do not keep detailed information about all transactions (as provided by the sources) but, for instance, the volume of sales per day (computing aggregated values based on the data provided by the sources). For linked data, this roughly corresponds to running additional analyses on the crawled data for duplicate detection/removal or entity consolidation as applied, for instance, by WATSON and SWSE.

Load. Finally, the transformed data is loaded into the data warehouse. Depending on aspects such as the update frequency of the sources and the size of the imported datasets, it might be difficult to keep the data warehouse up-to-date. In any case the data needs to be reloaded or recrawled so that the data warehouse can be updated accordingly. It is up to the user, or the user application respectively, to decide if and to what extent out-of-date data is acceptable.

Reconsidering the search engines we have already discussed above, we see that some of these approaches actually fall into this category: the search engines using a central repository to store a local copy of the RDF data they crawled from the web, e.g., SWSE [58], WATSON [34], and Falcons [26].

In summary, the data warehouse approach has some advantages and disadvantages. The biggest advantage, of course, is that all information is available locally, which allows for efficient query processing, optimization, and therefore low query response times. On the other hand, there is no guarantee that the data loaded into the data warehouse is up-to-date, and we have to update it from time to time. From the perspective of a single user or application, the data warehouse contains a lot of data that is not queried and unnecessarily increases storage space consumption. So, the data warehouse solution is only suitable if we have a sufficiently high number of queries and diverse applications.

2.3 Federated systems

As mentioned above, database literature proposes two main classes of data integration approaches (Figure 6): materialized and virtual data integration. In contrast to materialized data integration approaches, virtual data integration systems do not work with copies of the original data but only virtually integrate

Fig. 6. Classification of data integration approaches

the data. If we are only interested in answering unstructured queries based on keywords, the search engine approach, as a variant of virtual integration systems, is an appropriate solution. We have already discussed several examples of Semantic Web search engines in Section 2.1. Some of them are pure search engines that only work with indexes and output pointers to relevant data at the sources, whereas other Semantic Web search engines rather correspond to data warehouses.

When we are interested in answering structured queries on the most recent version of the data, we need to consider approaches that integrate sources in a way that preserves the sources' autonomy but at the same time allows for evaluating queries based on the data of multiple sources: we distinguish between mediator-based systems and federated systems.

Mediator-based systems. Classic mediator-based systems provide a service that integrates data from a selection of independent sources. In such a scenario, sources are often unaware that they are participating in an integration system. A mediator provides a common interface to the user that is used to formulate queries. Based on the global catalog (statistics about the data available at the sources), the mediator takes care of rewriting and optimizing the query, i.e., the mediator determines which sources are relevant with respect to the given query and creates subqueries for the relevant sources. In case the sources manage their local data using data models or query languages different from what the mediator uses, so-called wrappers rewrite the subqueries in a way that makes them “understandable” for the sources, e.g., SQL to XQuery. Wrappers also take care of transforming the data that the sources produce as answers to the query into the mediator's model. In practice, the task of the mediator/wrapper architecture is to overcome heterogeneity on several levels: data model, query language, etc.

Federated systems. The second variant of virtual integration systems from a database point of view is federated systems. A federation is a consolidation of multiple sources providing a common interface and therefore very similar to the mediator-based approach. The main difference to mediator-based systems is that sources are aware that they are part of the federation because they actively have to support the data model and the query language that the federation agreed upon. In addition, sources might of course support other query languages.

However, from a user's point-of-view there is no difference between both architectures as both provide transparent access to the data. Likewise, in general the Semantic Web community does not distinguish between these two architectures either, so that both approaches are referred to as federated systems, virtual integration, and/or federated query processing [51, 54, 63]. Thus, in the following we adopt the notion of a federation from the Semantic Web community and do not distinguish between federated and mediated architectures nor between federators and mediators.

There are some peculiarities unique to the web of linked data that we need to deal with in a distributed setup. Some sources, for instance, do not provide a query interface that a federated architecture could be built upon, i.e., some sources only provide information via dereferenceable HTTP URIs, some sources are only available as dumps, and some sources provide access via SPARQL endpoints.

For the first two cases, we can crawl the data and load it into a central repository (data warehouse) or multiple repositories that can be combined in a federation – comparable to using wrappers for different data sources. If the data is already available via SPARQL endpoints, we only need to register them in the federation and their data can be accessed by the federation to answer future queries. In addition, if we adopt SPARQL as the common protocol and global query language, then no wrappers are necessary for SPARQL endpoints because they already support the global query language and data format.
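For endpoints supporting the SPARQL 1.1 federation extension, such a distributed query can even be stated explicitly; the sketch below (the endpoint URLs are placeholders for illustration) retrieves the owl:sameAs link from one endpoint and the linked entity's description from another:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?same ?p ?o WHERE {
  SERVICE <http://dbpedia.org/sparql> {
    <http://dbpedia.org/resource/Berlin> owl:sameAs ?same .
  }
  SERVICE <http://example.org/geonames/sparql> {
    ?same ?p ?o .
  }
}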

Another interesting difference between classic federated database systems and federated RDF systems is that for systems using the relational data model, answering a query involves only a few joins between different datasets defined on attributes of the relations. For RDF data, this is more complicated because the triple format requires many more (self) joins. Moreover, some aspects of RDF, e.g., explicit links (owl:sameAs), also need to be considered during query optimization and execution.

Query processing in distributed database systems roughly adheres to the steps highlighted in Figure 7: query parsing, query transformation, data localization, global query optimization, local query optimization, local query execution, and post-processing. In the following we discuss each of these steps in detail; we first describe each step with respect to classic distributed databases and then compare it to distributed RDF systems.

Fig. 7. Steps of distributed query processing

Query parsing. After the query has been formulated by the user, it has to be parsed, i.e., the query formulated in a declarative query language is represented using an internal format that facilitates optimization. For example, a query formulated in SQL is transformed into an algebra operator tree.

Likewise, SPARQL queries can be parsed into directed query graphs: SQGM (SPARQL Query Graph Model) [62]. In any case, after parsing we have a graph consisting of operators that need to be evaluated in order to answer the query – a connection between operators indicates the data flow, i.e., the exchange of intermediate results, between them.

Query transformation. In this step, simple and straightforward optimizations are applied to the initial query that do not require any sophisticated considerations but mostly rely on heuristics and straightforward checks. One aspect is to perform a semantic analysis, e.g., to check whether relations and attributes referred to in the query do actually exist in the schema. The schema information necessary for this step is stored in the global catalog. In addition, query predicates are transformed into a canonical format (normalization) to facilitate further optimization, e.g., to identify contradicting query predicates that would result in empty result sets. Moreover, simple algebraic rewriting is applied using heuristics to eliminate bad query plans, e.g., replacing cross products followed by selections with join operators, removing redundant predicates, simplifying expressions, unnesting subqueries and views, etc.

Similar transformations can be done for SPARQL queries. For example, we can check for contradictions of conditions in the query that would lead to an empty result set. We can also apply some basic optimizations such as removing redundant predicates and simplifying expressions. In addition, we can also check if the predicates referenced in the query do actually exist, i.e., if there

is any source providing corresponding triples. However, this is only useful for distributed RDF systems that do not traverse links to discover new sources during query processing. As discussed below, the information necessary to perform these checks is part of the global catalog.

Data localization. In the classic data localization step, the optimizer replaces references to global relations with references to the sources' fragment relations that contribute to the global relation. The optimizer simply needs to access the global catalog and make use of reconstruction expressions: algebraic expressions that, when executed, reconstruct the global relations by combining the sources' fragments in an appropriate manner. In consideration of predicates and joins contained in the query, a subset of fragments and therefore sources can be eliminated from the query because the data they provide cannot contribute to the query result.

In case of Linked Data and systems for distributed RDF processing, this step entails the identification of sources relevant to a given query. Whereas we have reconstruction expressions for relational databases, it is more complicated for RDF data because there are no strict rules or restrictions on which source uses which vocabulary. Thus, what an appropriate optimizer for distributed RDF systems needs to do in this step is to go through the list of basic graph patterns contained in the query and identify relevant sources. The information necessary to perform this task is contained in the global catalog, which we will discuss in detail below.

Global query optimization. In principle, the goal of this step is to find an efficient query execution plan. In classic distributed systems, the optimizer can optimize for total query execution time or for response time. The former is a measure for the total resource consumption in the network, i.e., it sums up the duration of all operations necessary to answer the query over all nodes altogether. Response time takes parallel execution into account and measures the time until the results are presented to the user. Optimizing response time can exploit all different flavors of parallelization only when the mediator has some “control” over the sources, i.e., considering direct interaction, communication, and data exchange between any two nodes for optimization.

In federated RDF systems, however, sources are still more autonomous so that some of the options available to classic optimizers cannot be considered, e.g., we cannot decide on a query plan that requires one source to send an intermediate result directly to another source that uses the received data to compute a local join. Thus, techniques commonly referred to as data, query, or hybrid shipping [76] are not applicable in the RDF context. Likewise, the option of pipelining10 results between operators executed at different sources is hardly applicable.

10 Result items (tuples) are propagated through a hierarchy of operators (e.g., joins) so that when a tuple is output by an operator on a lower level, it immediately serves as input to the upper operator – before all input tuples have been processed by the lower operator.

In general, however, it is efficient in terms of load at the mediator and communication costs when joins can directly be executed at the sources. The problem is that with the restrictions mentioned above, joins in distributed RDF systems can only be executed in a completely decentralized fashion at a source s1 if there is no additional data at any other source s2 that could produce additional results when joined with part of the data provided by s1. Thus, even though s2 can process the join locally, it has to return, in addition to the join result, all data that could be joined with other sources' data. Hence, in most cases joins have to be computed in a centralized fashion by the mediator [51, 139]. In that case, the optimizer applies techniques for local query optimization as discussed in Section 1.

Another option to process joins efficiently is to use a semijoin [76] algorithm to compute a join between an intermediate result at the mediator and a data source [51]. The mediator needs to compute a projection on the join variables and extract all variable bindings present in the intermediate result. For all these variable bindings, the mediator needs to query the remote data source with respect to the join definition and retrieve the result, which consists of all triples matching the join condition. So, the mediator retrieves all relevant data to perform the join locally. Unfortunately, SPARQL does not support the inclusion of variable bindings in a query, so that the only alternative is to include variable bindings as filter expressions in a query, which for a large number of bindings might blow up the size of the query message. The alternative of sending separate messages for each binding [115] results in a high number of messages and therefore increases network load and query execution costs.
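As a sketch of this workaround (the bindings are taken from the running example and are purely illustrative), suppose the mediator has determined that ?u can only be bound to MPI Informatics or Saarland University in the intermediate result; the subquery shipped to the remote source could then look as follows (the later SPARQL 1.1 standard adds a VALUES clause for exactly this purpose):

# semijoin via FILTER: ship the known bindings of ?u to the remote source
SELECT ?a ?u WHERE {
  ?a works_for ?u .
  FILTER (?u = "MPI Informatics" || ?u = "Saarland University")
}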

A heuristic to minimize query execution costs is to minimize the size of intermediate results. Hence, optimizers for federated RDF systems also try to find a plan that minimizes the size of intermediate results. Therefore, the optimizer considers additional statistics about selectivities and cardinalities provided by the global catalog (see below).

Another standard technique in every optimizer is to apply heuristics to reorder query operators in a way that intermediate results are small, i.e., pushing highly selective operators downwards in the query graph so that they are executed first. Considering RDF data, this means pushing down value constraints – ideally into subqueries executed at the sources to reduce the amount of exchanged data.

Another important and related issue for standard optimizers is join order optimization. The goal is to find the order of joins that minimizes execution costs. This process heavily relies on statistics (global catalog) to estimate the size of intermediate results. Obviously, it is beneficial to execute joins first that produce small outputs so that all other joins can be computed more efficiently. For optimization in distributed RDF systems, join order optimization corresponds to determining the order of evaluating basic graph patterns. To estimate the size of intermediate results, the optimizer needs detailed statistics that allow for cardinality estimation. Histograms are very common in relational distributed systems and can also be useful in the context of federated RDF systems. A simple heuristic that can be applied without detailed statistics is variable counting [136], which estimates the selectivity of basic graph patterns depending on the type and number of unbound components.
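
A minimal sketch of such a variable-counting estimator is shown below; the concrete weights are illustrative assumptions, not the values used in [136].

def variable_counting_selectivity(subject, predicate, obj):
    """Estimate the selectivity of a triple pattern purely from which
    components are unbound (None). Lower values mean 'more selective'.
    The weights are made up for illustration only."""
    selectivity = 1.0
    if subject is None:      # unbound subject
        selectivity *= 0.5
    if predicate is None:    # unbound predicate (usually least selective)
        selectivity *= 0.9
    if obj is None:          # unbound object
        selectivity *= 0.7
    return selectivity

# A pattern with a bound subject is ranked as more selective than one
# where only the predicate is bound:
print(variable_counting_selectivity("ex:alice", "ex:knows", None))   # 0.7
print(variable_counting_selectivity(None, "ex:knows", None))         # 0.35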

In consideration of all these different possibilities for query optimization, the optimizer has to explore and evaluate a potentially high number of alternative query plans that all compute the same result. For exploration, a common technique is the application of dynamic programming for plan enumeration, which enables an exhaustive search of all query plans. In order to decide on the benefit of each plan, the optimizer has to determine its costs using an appropriate cost model [51, 105, 139]. For this purpose, the optimizer again needs statistics about the data as well as selectivity and cardinality estimates for intermediate results. In addition, it is worthwhile to consider network latencies as well.

The current standard for distributed RDF optimizers is to optimize a query and to determine a query plan for each query individually at runtime. This can be improved by adopting more techniques from classic distributed database systems that pre-optimize (partial) execution plans for frequently issued queries or subqueries contained in many queries, e.g., two-step-plans [76].

As a high number of sources provide potentially relevant data for a given query, approximation by reducing the number of queried sources is an appropriate technique to reduce the overall execution costs. This can be achieved by ranking sources based on triple and join pattern cardinality estimations and pruning all but the top-ranked sources from consideration [59].

Global catalog and indexes. As we have seen above, the global catalog plays an important role for query optimization. It might contain information about specific sources such as vocabulary, supported predicates, network latencies, etc. In addition, for the distributed setup and especially for cardinality estimation, the catalog has to contain statistics about the data, e.g., in the form of indexes. In contrast to indexes that are applied in the context of centralized systems (complete indexes), these indexes cannot provide the same level of detail because of the sheer amount of data aggregated from all sources. Therefore, indexes and statistics about the data suitable for distributed optimization have to abstract the data in a way that allows for index size compression. The goal is to find a suitable trade-off between the level of detail and memory consumption – the general rule is that the more detailed an index is, the more accurate the cardinality estimates but the higher the memory consumption.

A widely used approach for indexing is schema-level indexing, i.e., indexing predicate URIs and the types of instances. Whereas types and predicates that occur only rarely represent good discriminators to detect relevant sources for a given query, frequently used types and predicates that almost every source provides, e.g., rdfs:label, are of little use for optimization. The disadvantage is that such an index cannot be used for basic graph patterns with variables in the predicate position. An alternative to indexing predicates and types detached from the structure of the overall RDF graph is indexing paths of predicates [139]. Moreover, when considering the RDF data of a source as a graph, it might also be useful to index frequent subgraphs [51, 89, 146].

Another kind of index is the inverted URI index, which covers the data on instance level by indexing all URIs occurring in a data source. This kind of index allows the query processor to identify all sources that contain a given URI and thus potentially contribute bindings for a triple pattern containing that URI.

There are also indexes covering data on both instance and schema level [59]. They make use of a data structure known from classic relational databases: histograms. As these data structures have originally been developed to summarize information about numerical data, hash functions are applied to all elements of a triple so that triples are transformed into numerical space. One-dimensional histograms index each component (subject, predicate, object) separately. But when considering each of these three dimensions separately, the optimizer loses a great potential for optimization because the combination of instances is much more selective and therefore more useful for query optimization. Thus, multidimensional histograms, i.e., three-dimensional histograms, are applied. As an alternative to histograms, other structures that efficiently summarize multidimensional data can be applied, e.g., QTrees [59].
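
The following sketch (illustrative only; the actual structures in [59] are more elaborate) hashes the three triple components into a coarse three-dimensional grid of counts, which can then be used for cardinality estimation.

from collections import defaultdict

BUCKETS = 8  # number of buckets per dimension (an arbitrary choice)

def bucket(term):
    """Map a URI/literal into numerical space via hashing."""
    return hash(term) % BUCKETS

class TripleHistogram:
    """A coarse 3-dimensional histogram over (subject, predicate, object)."""
    def __init__(self):
        self.counts = defaultdict(int)

    def add(self, s, p, o):
        self.counts[(bucket(s), bucket(p), bucket(o))] += 1

    def estimate(self, s=None, p=None, o=None):
        """Estimate how many triples may match a pattern; None = unbound."""
        total = 0
        for (bs, bp, bo), c in self.counts.items():
            if (s is None or bucket(s) == bs) and \
               (p is None or bucket(p) == bp) and \
               (o is None or bucket(o) == bo):
                total += c
        return total

hist = TripleHistogram()
hist.add("ex:alice", "ex:knows", "ex:bob")
hist.add("ex:alice", "ex:worksFor", "ex:acme")
print(hist.estimate(s="ex:alice"))          # upper bound on matching triples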

Local query optimization and execution. When the global optimizer has decided on a specific query execution plan, subqueries are extracted and sent to the sources for local execution. The sources apply the same optimization techniques as for local queries, so that a received subquery is treated like a query issued directly at the source. Consequently, it is optimized and executed using the techniques discussed in Section 1.

Post-processing. In this last step the partial results received from the sources are combined into the final result. For simple distributed queries, post-processing might simply consist of a union operation. For more complex queries, however, post-processing is much more involved and costly because all operations that could not be executed at the sources have to be processed by the mediator after retrieving the data from the sources. As discussed above, this is particularly true for joins for which multiple sources provide relevant data. In addition, it might be necessary to remove duplicates in the result set, which is typically the very last post-processing operation.

To conclude the section about federated systems and federated RDF processing, we discuss some systems that have been proposed in this context.

SemWIQ. The Semantic Web Integrator and Query Engine [79], SemWIQ in short, uses an architecture based on the mediator/wrapper approach, i.e., wrappers are used to enable the participation of sources using other data models, e.g., relational databases. All registered data sources must either be connected by a SPARQL-capable wrapper or support the SPARQL protocol directly.

Queries are formulated in SPARQL; the parser computes a canonical query plan, which is optimized by the federator/mediator. The optimization process only considers very basic statistics, such as a list of classes and the number of instances a data source provides for each class as well as a list of predicates and their occurrences. For query optimization, the federator analyzes the query and scans the catalog for relevant registered data sources. The resulting plan is executed by sending subqueries to the sources (via wrappers or SPARQL endpoints).

The system requires that every data item has an asserted type, i.e., for a query it requires type information for each subject variable in order to be able to retrieve instances. As a consequence, there are some restrictions with respect to query formulation, e.g., all subjects must be variables and for each subject variable its type must be explicitly or implicitly defined. The optimizer then uses this information to look for registered data sources providing instances of the queried types. More sophisticated optimization techniques, such as the push-down of joins, have been proposed as future work.

DARQ. DARQ [115] (Distributed ARQ) is a federated system of SPARQL endpoints that allows distributed query processing on the set of registered endpoints. It also adopts a mediator-based approach and assumes that sources that do not support SPARQL themselves are connected to the federation using wrappers.

Data sources are described using service descriptions (represented in RDF). These descriptions contain capabilities, i.e., constraints expressed with regular SPARQL filter expressions. These constraints can express, for instance, that a data source only stores data about specific types of resources. Some sources have limited access patterns, e.g., they allow lookups on personal data only when the user specifies the name of the person of interest; DARQ supports this by defining patterns in the service descriptions that must be included in the query. To provide the query optimizer with statistics, service descriptions contain the total number of triples provided by a data source and optionally information for each capability, e.g., the number of triples with a specific predicate and the selectivities (bound subject/object) of a triple pattern with a specific predicate.

Queries are formulated in SPARQL, parsed, and handed to the query planning and optimization components. DARQ uses a cost-based query optimization technique relying on the statistical information provided as service descriptions and capabilities.

A SPARQL query contains one or more filtered basic graph patterns (sets of triple patterns). DARQ performs query planning for each basic graph pattern separately. By comparing the triple patterns of the query against the capabilities of the service descriptions, the system can detect a set of relevant sources for each pattern. As this matching procedure is based on predicates, DARQ only supports queries with bound predicates.

After having determined the relevant sources, subqueries are created, one for each match between a basic graph pattern and a data source. Thus, the system might create multiple subqueries that are to be sent to the same data source. In that case, the subqueries can be combined into one message.

Based on these subqueries, the query optimizer considers limitations on access patterns and tries to find a feasible and efficient query execution plan. For logical query optimization, heuristics are used that try to simplify the query and reduce intermediate result sizes, e.g., pushing value constraints into subqueries, which corresponds to pushing down selections in classic query optimization. In order to decide on a specific join implementation (nested-loop join or bind join), the optimizer estimates the result size of joins based on the statistics given in the service descriptions. The optimizer chooses the implementation with the least estimated transfer costs (computed based on the result size estimates).
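
A simplified view of this decision (the cost formulas are illustrative approximations, not DARQ's exact cost model) could compare the data transferred by the two strategies:

def transfer_cost_nested_loop(result_size_left, result_size_right):
    """Ship both inputs to the mediator and join there (simplified)."""
    return result_size_left + result_size_right

def transfer_cost_bind_join(result_size_left, selectivity):
    """Ship the left bindings to the source and get back only matching
    tuples; 'selectivity' approximates matches per shipped binding."""
    shipped_bindings = result_size_left
    returned_tuples = result_size_left * selectivity
    return shipped_bindings + returned_tuples

left, right, sel = 100, 1_000_000, 2.0
costs = {
    "nested loop": transfer_cost_nested_loop(left, right),
    "bind join": transfer_cost_bind_join(left, sel),
}
print(min(costs, key=costs.get))   # -> 'bind join' for a small left input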

In the end, subqueries are executed at the sources and remaining operations are executed by the federator.

Hermes. Hermes [147] is a system based on a federated architecture that has a slightly different focus than other systems discussed so far. Queries are not formulated using SPARQL. Instead, the user enters a keyword query and Hermes tries to translate the keyword query into a structured query (SPARQL). The query is decomposed into subqueries, which are executed at the sources.

In order to achieve this, a number of indexes are created: a keyword index (indexing terms extracted from the labels of data graph elements), a structure index (information about schema graphs, i.e., relations between classes derived from the connections given in the data graph), and a mapping index (information about mappings on data- and schema-level, pairwise mappings between elements).

After having received the input keywords, relevant sources are determined using the keyword index. The retrieved keyword elements are combined with schema graphs received from the structure index to find substructures that connect all keyword elements. A ranking function is used to rank the computed query graphs so that the user can choose some of them for execution.

The selected query is decomposed into subqueries, each of which is answered by a different data source. Optimization uses the same techniques as DARQ [115]. Before sending the subqueries to the sources, they are mapped to the query format supported by the receiving data source. After execution, the results received from the sources are combined.

Other systems. Virtuoso [41], a native quad store, provides the option to consider remote sources for query execution. The system dereferences URIs and holds the retrieved data in a cache for future queries.

Dartgrid [25] is a system for SPARQL queries over multiple relational databases. One of the main components is the semantic registration service, which maintains mappings from the schemas of registered data sources to the internal ontology. A query interface based on forms and ontologies helps users to construct semantic queries formulated in SPARQL. Queries are translated using the mapping information into SQL queries that can be executed on the sources.

The networked graphs approach [126] allows users to reuse and integrate RDF content from other sources. Users can define RDF graphs by extensionally listing content and by using views on other graphs. These views can be used to include parts of other graphs. Networked graphs are designed for distributed settings and are exchangeable between sources. However, the paper focuses more on semantics and reasoning than on aspects of query execution.

[139] presents an approach for querying distributed RDF data sources and introduces index structures, a cost model, and an algorithm for query answering. The approach supports distributed SeRQL path queries over multiple Sesame RDF repositories using a special index structure to determine the relevant sources for a query.

2.4 Discovering new sources during query processing

In contrast to the classic understanding of federated databases, processing Linked Data raises additional challenges that we have not discussed in detail above; e.g., when URIs are dereferenced and the returned documents are considered as new virtual "data sources", not all sources are known in advance and available for indexing. Furthermore, approaches that index a static set of sources, i.e., all approaches we have discussed so far, have to recreate their indexes from time to time in order to reflect recent updates to the sources, which means that some of the information provided by the index might be out-of-date.

Pure. Some strategies for query processing over Linked Data rely on the principle of following links between sources [61]. An advantage in comparison to federated architectures is that sources that are not accessible via SPARQL endpoints can be considered. Another advantage is that users can retain complete control over the data they provide.

The query is executed without a previous query planning or optimization step. First, the system retrieves data from the sources mentioned in the query. The query is then partially evaluated on the retrieved data so that relevant source URIs and links can be identified. The system uses these URIs to retrieve more data. It iteratively evaluates and discovers further data until all sources found to be relevant have been processed.
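
The following sketch illustrates this intertwined evaluate-and-traverse loop; the helper functions (dereference, extract_links, evaluate_partially) are hypothetical placeholders for the actual machinery described in [61].

def link_traversal_query(query, seed_uris, dereference, extract_links,
                         evaluate_partially, max_sources=100):
    """Iteratively dereference URIs, evaluate the query on the data seen so
    far, and follow newly discovered, query-relevant links (a sketch)."""
    seen, frontier, triples = set(), list(seed_uris), []
    while frontier and len(seen) < max_sources:
        uri = frontier.pop()
        if uri in seen:
            continue
        seen.add(uri)
        triples.extend(dereference(uri))            # fetch the document
        bindings = evaluate_partially(query, triples)
        for link in extract_links(bindings, triples):
            if link not in seen:
                frontier.append(link)               # newly discovered source
    return evaluate_partially(query, triples)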

The peculiarity of this approach is the intertwining of the two phases: query evaluation and link traversal. Previous work [17, 94] kept these two phases separate by first retrieving the data and then evaluating the query on the retrieved data.

Hybrid. It is also possible to combine federated query processing with active discovery of unknown but potentially relevant sources [78]. The main assumption is that knowledge about some sources is available for query planning and optimization. During query execution, sources are retrieved, new sources are iteratively discovered, and the query plan is reoptimized and partially executed until all relevant sources have been processed.

2.5 Systems based on the P2P paradigm

As we have seen, distributed processing of RDF data can be realized by adopting the federated database system architecture and developing efficient solutions for problems that come along with the characteristics of Linked Data. But there is another important class of distributed systems that Linked Data processing can also benefit from: peer-to-peer (P2P) systems.

P2P systems are networks of autonomous peers that are connected through logical links, which express that a pair of peers "know" each other, i.e., they can exchange messages and data and are referred to as neighbors. Two peers without a link do not know each other and can therefore only contact each other if they find a path of links via intermediate peers that connects them.

In general, sources/peers in a P2P network have a higher degree of autonomy in comparison to sources in distributed database systems. In pure P2P networks there is no central component, i.e., no federator or mediator, that could be used for query planning and optimization. Instead, the behavior of the whole system is the consequence of local interactions between peers.

Whereas sources in the context of distributed databases are considered to be rather stable and available, P2P systems assume a higher degree of dynamic behavior, i.e., they assume that peers might join and leave the network at any time. Even if a peer leaves the network, the system should still be able to answer queries – although the data provided by that peer is (momentarily) unavailable. Moreover, each peer in the network might issue queries and participate in answering queries. There are different classes of P2P systems: centralized P2P, pure P2P, hybrid P2P, and structured P2P.

Centralized P2P systems. In centralized P2P systems, like Napster11, there is a central component providing a centralized index that is used to locate data relevant to a given query. So, a user query issued at a peer is sent to the central component, which uses the index to find peers providing the queried data. This information is sent back to the querying peer, which then directly communicates with other peers to access the queried data.

Pure/unstructured P2P systems. As the centralized server represents a bottleneck and a single point of failure, pure P2P systems strictly avoid peers with special roles. Instead, all peers are considered equal and query processing is realized by flooding, i.e., a peer with a local user query forwards the query to all its neighbors, which proceed in the same way so that the query finally reaches all peers in the network. The answers to the query are routed back to the query initiator, which can then directly communicate with peers providing relevant data. In analogy to structured P2P systems, which we will discuss below, pure P2P systems are often referred to as unstructured P2P systems.
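
A minimal sketch of flooding with a time-to-live (TTL) counter – a common refinement to bound the number of forwarded messages, added here as an assumption – could look as follows (peer objects and their match_locally method are hypothetical):

def flood(peer, query, ttl, visited=None):
    """Forward a query to all neighbors until the TTL is exhausted and
    collect local answers along the way (a simplified sketch)."""
    if visited is None:
        visited = set()
    if peer.id in visited or ttl < 0:
        return []
    visited.add(peer.id)
    answers = peer.match_locally(query)          # hypothetical local lookup
    for neighbor in peer.neighbors:
        answers += flood(neighbor, query, ttl - 1, visited)
    return answers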

Hybrid P2P systems. The problem with flooding is that query processing consumes much bandwidth and weak/slow peers represent the bottleneck. So, hybrid P2P systems use super-peers to counteract this problem. Super-peers are strong peers that form an unstructured P2P network. Weaker peers are connected to a super-peer as leaf nodes and form a centralized P2P system.

11 http://www.napster.com

Structured P2P systems. All the P2P systems discussed so far assume that the data shared in the network remains at the peers that "own" it. However, in structured P2P systems, a global rule is used to redistribute the data among the peers in the system. In most cases a hash function is used for this purpose so that each peer is responsible for a specific hash range and therefore for all the data with hash values in that range. Peers are arranged in a logical overlay structure, e.g., a logical ring [137], that is used to organize the peers in a way that enables efficient lookup. For query processing, peers use the globally known rule according to which the data has been distributed in the first place, e.g., peers compute hash values for the queried data and use the overlay network to locate the peers responsible for the corresponding hash ranges.
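
The sketch below illustrates the basic placement rule on a Chord-like identifier ring; the peer identifiers and the size of the hash space are illustrative assumptions.

import hashlib

HASH_SPACE = 2 ** 16                          # illustrative identifier space
PEERS = sorted([1234, 20000, 41000, 60000])   # hypothetical peer IDs on the ring

def ring_hash(value):
    """Map a value (e.g., a URI) onto the identifier ring."""
    digest = hashlib.sha1(value.encode()).hexdigest()
    return int(digest, 16) % HASH_SPACE

def responsible_peer(key):
    """Chord-style successor rule: the first peer whose ID is >= the key
    is responsible; wrap around to the first peer otherwise."""
    for peer_id in PEERS:
        if peer_id >= key:
            return peer_id
    return PEERS[0]

# Store and later look up a data item by its hash value.
key = ring_hash("http://example.org/resource/Berlin")
print(key, "is stored at peer", responsible_peer(key))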

After having introduced the basic types of P2P systems, let us now discuss some approaches that apply these concepts to the Semantic Web and RDF data.

Edutella. An early approach that combined the two paradigms RDF and P2P is Edutella [98]. Edutella assumes that all resources maintained in the network can be described with metadata in RDF format. All functionality in the Edutella network is mediated through RDF statements and queries on them.

Peers participating in an Edutella network might use different schemas, so that Edutella applies the mediator-wrapper approach based on a common data model and a common query exchange format to overcome heterogeneity. A peer that wants to participate in the network registers at a so-called mediator peer. Peers register the metadata schemas they support and in this way indicate which queries they can answer. Queries are sent through the Edutella network to the subset of peers that have registered with the corresponding schema. The resulting RDF statements are sent back to the requesting peer. To broaden the search space, mediators provide a service that translates queries over one schema into queries over another schema.

Because of the mediators that mediate between clusters of peers supporting different schemas, the overall architecture can be considered a hybrid P2P system.

GridVine. GridVine [5] uses P-Grid [4] to organize peers in a structured P2P overlay network, which is used for communication and interaction between peers. Data is indexed and stored in the standard way of structured P2P systems, i.e., each peer maintains a local database at the semantic layer to store the triples whose keys are contained in the key range the peer is responsible for. GridVine also supports sharing schemas by associating schemas with unique keys and storing them in the overlay network.

GridVine supports queries based on basic graph patterns and exploits pairwise mappings (OWL statements relating similar classes and properties) between different schemas to overcome schema heterogeneity and evaluate queries against schemas they were originally not formulated against. Mappings are also stored in the network.

In order to locate peers providing relevant data for a given query, or a basic graph pattern respectively, each triple is indexed and stored three times – once for each component of the triple.

RDFPeers. RDFPeers [20] is similar to GridVine but uses a different semantic overlay network for storing, indexing, and querying RDF data. To efficiently answer multi-attribute and range queries, RDFPeers relies on a multi-attribute addressable network (MAAN) [21], which extends Chord [137], a structured P2P overlay.

Each triple is stored three times by applying hash functions to its subject, predicate, and object. The system's query processing capabilities are very similar to the ones of GridVine. It supports triple pattern queries, disjunctive and range queries, and conjunctive multi-predicate queries using RDQL.
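
Building on the placement sketch above, the following (again hypothetical) helper shows how a triple pattern can be routed: any constant component can be hashed to find a peer that stores the triple, since every triple is indexed under all three components.

def route_triple_pattern(pattern, ring_hash, responsible_peer):
    """Pick the first bound component (None = variable) of a triple pattern,
    hash it onto the ring, and return the responsible peer. The two helper
    functions are those from the placement sketch above."""
    for component in pattern:
        if component is not None:
            return responsible_peer(ring_hash(component))
    raise ValueError("a pattern with three variables must be broadcast")

# (?x, foaf:knows, ?y): route via the bound predicate, e.g.
# route_triple_pattern((None, "foaf:knows", None), ring_hash, responsible_peer)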

QC and SBV. Another approach for evaluating conjunctive queries of triple patterns over RDF data using structured overlay networks is proposed in [83]. It uses Chord [137] and also stores each triple three times (subject, predicate, object).

This approach proposes two algorithms for query processing: query chain (QC) and spread by value (SBV). The main idea of the query chain algorithm is that intermediate results flow through the nodes of the chain. First, responsible nodes are determined for each triple pattern contained in the query – exploiting constants contained in the patterns and using the overlay structure to identify relevant peers. The responsible peers found in this way form a chain and exchange messages to answer the query.

The spread by value algorithm constructs multiple chains for each query that can be processed in parallel. Query processing starts at a node responsible for the first query pattern, which uses values of matching triples to forward the query to nodes providing data for these values.

Other approaches. YARS2 [60] uses the P2P paradigm (structured P2P) in a different way than the approaches discussed so far. As mentioned above in Section 2.1, it comes in combination with the search engine SWSE. YARS2 does not distribute the data itself in the structured overlay network but an index structure. More specifically, YARS2 maintains indexes in six different orderings of triple components and distributes these indexes according to a hash function.

KAONp2p [55] also suggests a P2P-like architecture for query answering over distributed ontologies. Queries are evaluated against resources, which are integrated using a virtual ontology that logically imports all relevant ontologies distributed over the network.

3 Scalable Reasoning with Uncertain RDF Data

As we have seen in the previous sections, state-of-the-art SPARQL engines for RDF data (see, e.g., [3, 99]) primarily focus on conjunctive queries on top of a relational encoding of RDF [1] data. They often employ a so-called "triple-store" technique by indexing or slicing the data according to various permutations of the basic subject-predicate-object (SPO) pattern. These engines generally follow a deterministic data and query processing model and do not have a notion of uncertain reasoning or probabilistic inference.

In this chapter, we aim to devise possible directions for scalable reasoning with uncertain RDF knowledge bases, following ideas from probabilistic databases, logic programming, and recent imperative programming platforms for inference in probabilistic graphical models. We focus on RDF as our basic data model and SPARQL as the default query language for RDF. Consequently, the forms of uncertain reasoning we consider here focus on SQL-style queries in probabilistic databases, Datalog-style reasoning using Horn clauses as rules, as well as some probabilistic extensions to logic programming.

By default, an RDF graph itself is schemaless. RDFS [1] thus introduces basic facilities for imposing schema information in terms of a class hierarchy. Intuitively, entities that are instances of the same class, i.e., entities of the same RDF type, share common properties. In RDFS, instances of classes, class memberships, and class properties are transitively inherited to resources in subclasses via the type, subClassOf and subPropertyOf relations, respectively. More generally, rule-based reasoning using Datalog-style Horn clauses subsumes type, subClassOf and subPropertyOf inferences in RDF/S and also captures some of the constraints expressible in OWL [7] (e.g., the transitive or functional properties of predicates). Transitivity, for example, can very easily be expressed via first-order predicate logic (specifically with rules in the form of Horn clauses), which requires a form of logical reasoning in order to check consistency or to compute entailment. Considering relations as logical predicates and facts as literals, we thus also briefly investigate the relationship to logic programming and its probabilistic extensions in this chapter.
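
A minimal sketch of these RDFS-style inferences (transitive subClassOf closure plus type propagation), using an illustrative in-memory triple set:

RDF_TYPE, SUBCLASS = "rdf:type", "rdfs:subClassOf"

def rdfs_closure(triples):
    """Apply the rules
         (c1 subClassOf c2), (c2 subClassOf c3) -> (c1 subClassOf c3)
         (x type c1), (c1 subClassOf c2)        -> (x type c2)
       until a fixpoint is reached (naive bottom-up evaluation)."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        sub = {(s, o) for s, p, o in inferred if p == SUBCLASS}
        typ = {(s, o) for s, p, o in inferred if p == RDF_TYPE}
        new = {(a, SUBCLASS, c) for a, b in sub for b2, c in sub if b == b2}
        new |= {(x, RDF_TYPE, c) for x, b in typ for b2, c in sub if b == b2}
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred

facts = {("ex:Student", SUBCLASS, "ex:Person"),
         ("ex:Person", SUBCLASS, "ex:Agent"),
         ("ex:alice", RDF_TYPE, "ex:Student")}
print(("ex:alice", RDF_TYPE, "ex:Agent") in rdfs_closure(facts))  # True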

As a (more popular) counterpart to probabilistic logic programming, probabilistic databases [32] investigate query processing techniques for structured (SQL) queries over relational data with fixed schemata. While the general idea of including probabilistic models into databases is not new and goes back more than 30 years (see, e.g., [24]), in particular recent works in this context have provided us with a rich body of literature and have led to the development of a plethora of systems, which all aim at the scalable management of uncertain relational data. Ultimately, these approaches however need to face the very same scalability and complexity issues as any approach dealing with inference in probabilistic graphical models.

As this part of the lecture focuses on database-style infrastructures for the management of uncertain RDF data, we do not go into details on probabilistic extensions to OWL and its variants based on description logics (OWL-DL). For probabilistic description logics [87] (PDL), which have found wide-spread acceptance in the semantic web community, we refer the reader to [138], which provides a very good overview of PDL and related approaches. PDL generalizes the description logic SHOIN(D) and can thus express, for example, functional rules.

Reasoners such as PRONTO12 can be used to decide consistency or to compute entailment with probabilistic bounds.

3.1 Probabilistic Databases

Approaches for managing uncertainty in the context of probabilistic databases [8, 13, 30, 31, 132] focus on relational data with fixed schemata, and they often employ strong independence assumptions among data objects (the "base tuples"). SQL is used as the default query language for running queries, defining views, or triggering updates to the data.

Most probabilistic databases adopt the possible worlds [6] model as basis for their data model and for defining the semantics of queries. Intuitively, every tuple in a probabilistic database corresponds to a binary random variable which may exist in the database with some probability. A probabilistic database thus encodes a large number of possible instances of deterministic databases, where each deterministic instance contains a different combination (i.e., possible world) of tuples. Each such possible world has a probability between 0 and 1. The probabilities of all worlds form a distribution and sum up to 1. The marginal probability of a tuple can be obtained by summing up the probabilities of all worlds which contain that tuple, which leaves confidence computations in probabilistic databases #P-complete [30, 120] in the general case. The semantics of queries is then formally defined by running the query against each of these instances individually, and by encoding the results obtained from each of the individual instances back into the probabilistic database. While often the actual data computation can be carried out directly along with the SQL operators, different inference techniques for confidence computations may have to be implemented as separate function calls (e.g., as stored procedures [96]).
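
As a small worked example (with made-up tuples and probabilities), the sketch below enumerates all possible worlds of a tuple-independent database and recovers the marginal probability of a Boolean query answer by summing the probabilities of the worlds in which it holds:

from itertools import product

# Tuple-independent base tuples with their marginal probabilities (made up).
tuples = {("alice", "works_at", "acme"): 0.8,
          ("acme", "located_in", "berlin"): 0.6}

def possible_worlds(db):
    """Yield (world, probability) pairs for every subset of the base tuples."""
    items = list(db.items())
    for present in product([True, False], repeat=len(items)):
        world = {t for (t, _), keep in zip(items, present) if keep}
        prob = 1.0
        for (t, p), keep in zip(items, present):
            prob *= p if keep else (1.0 - p)
        yield world, prob

def query(world):
    """Boolean query: does alice work at a company located in berlin?"""
    return any(s == "alice" and p == "works_at" and
               (o, "located_in", "berlin") in world
               for (s, p, o) in world)

marginal = sum(prob for world, prob in possible_worlds(tuples) if query(world))
print(round(marginal, 2))   # 0.48 = 0.8 * 0.6, as expected under independence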

In addition to this basic uncertainty model, the ULDB [13] data model (for "Uncertainty and Lineage Database") provides a lineage-based representation formalism for probabilistic data, which has been shown to be closed and complete for any combination of SQL-style relational operators and across arbitrary levels of materialized views. Here, the lineage (aka. "history" in [132] or "repair-key" operator in [8]) of a derived tuple (or an entire view) is captured as a Boolean formula which recursively unrolls the logical dependencies from the derived tuple back to the base tuples. The lineage of a derived tuple may never be cyclic, but it may impose a DAG structure over the derivation of a tuple. Thus, being a form of a directed (but acyclic) graphical model, probabilistic inference in ULDBs remains #P-complete for general SQL queries.

On the other hand, considering restricted classes of SQL queries and corresponding query plans, where exact confidence computations remain tractable, has led to a notion of "safe plans" in [30]. Intuitively, an entire query plan is safe if all query operators take only independent subgoals as their input, which guarantees a hierarchical derivation structure (i.e., tree-shaped lineage) of all tuples involved in a query result. Following this idea on a more fine-grained (i.e., on a per-tuple rather than a per-plan) level, [129] considers a class of so-called "read-once" functions, where the Boolean lineage formula can be factorized (in polynomial time) into a form where every variable appears at most once. Moreover, efficient top-k query processing [116, 134] and unified ranking approaches [81] have investigated different semantics of ranking queries over uncertain data. Recently, the modeling of correlated tuples [127] with the explicit usage of probabilistic graphical models such as Bayesian Nets [18, 151] and Markov Random Fields [128] has found increasing attention also in the database community. In [70], the authors define a class of Markov networks where query evaluation with soft constraints can be done in polynomial time, while the case with hard constraints is considered separately in [31]. Also in the context of these graphical models, lineage remains the key for a closed representation formalism [71].

12 http://pellet.owldl.com/pronto/

While most probabilistic database systems provide extensions to the DDL (data definition language) part of the SQL standard to define dependencies at the schema level, only fairly few works explicitly tackle the DML (data manipulation language) part of SQL, including data modifications such as updates or deletes [66, 125]. Most probabilistic database approaches focus on SELECT-PROJECT-JOIN (SPJ) query patterns, where query processing bears a number of interesting analogies to inference in probabilistic graphical models.

MystiQ. The MystiQ system [16, 30, 117] developed at the University of Washington provides support for a wide range of SQL queries over uncertain and inconsistent data sources. On the DML part, MystiQ introduces the notion of an approximate match operator between a query attribute and a data attribute, which can be evaluated by data-type-specific built-in functions. On the DDL part, MystiQ supports so-called predicate functions, which define how probabilities should be generated for an approximate match on such an attribute. Moreover, in the spirit of functional dependencies in deterministic databases, MystiQ allows for the specification of global constraints among attributes at the schema level. Unlike classic functional dependencies, these constraints can be specified to be either strict (e.g., a person may have exactly one date-of-birth) or soft (e.g., most people have different names). A violation of a constraint mutually affects the confidences of all tuples involved in the conflict. In further works, the authors present an efficient top-k algorithm (coined "multi-simulations") for a class of SELECT DISTINCT queries which exhibit a DNF structure as lineage. Multi-simulation works by running multiple Monte Carlo simulations [73] in parallel until the lower confidence bounds of the top-k answers to the query have sufficiently converged, in order to distinguish them from the upper confidence bounds of the non-top-k answers.

Trio. Based on the ULDB data model, the Trio [96] system developed at Stanford University provides an integrated approach for the management of data, uncertainty, and lineage. Trio is implemented on top of a conventional database system (PostgreSQL) and employs an SQL-based rewriting layer that translates the Trio query language (TriQL) into a series of relational queries and calls to stored procedures. As the core of its data model, Trio adopts the notion of X-tuples [123] for mutually exclusive tuple alternatives, Boolean lineage formulas, maybe annotations (which indicate the possible absence of the tuple in the uncertain database), and confidence values that may be attached to each tuple alternative. In Trio (and ULDBs), lineage enables the complete decoupling of data and confidence computations, which may yield significant efficiency benefits for query processing. Later extensions to Trio have investigated in more detail how to exploit lineage for probabilistic confidence computations [124] and data updates [125].

MayBMS. The MayBMS [8, 66] system, initially developed at Saarland University and then at the Cornell database group, is designed as a completely native extension to PostgreSQL. Based on the encoding scheme of conditional tables (C-tables) [67, 123], the current MayBMS system employs a compact form of schema decompositions (coined "U-relations"), which results in a succinct encoding of an uncertain relation with independent attributes. Another focal point of the MayBMS project is the investigation of foundations for query languages in probabilistic databases in analogy to relational algebra and SQL [75]. Ongoing research issues in MayBMS include query optimization, an update language, concurrency control and recovery, and the design of generic APIs for uncertain data. MayBMS is the basis for the SPROUT system discussed in Section 4.

Orion. Inspired by large-scale sensor networks, the Orion [131, 132] system developed at Purdue University also investigates continuous probability density functions (PDFs) in combination with SQL-style relational query processing techniques. The current Orion 2.0 prototype has built-in support for a number of continuous (e.g., Gaussian, Uniform) and discrete (e.g., Binomial, Poisson) distributions, which are treated symbolically at query processing time. If the resulting distribution of an SQL-style operation cannot be represented by a standard distribution anymore, Orion switches to an approximate distribution using histograms and sampling. Further features of Orion include the handling of correlations among attributes, which are captured as explicit joint distributions among the correlated attributes, and the handling of missing (i.e., incomplete) data, which is implemented by allowing for partial PDFs whose confidence distributions may sum up to less than 1. Using a form of query history, the Orion data model is closed under common relational operations and consistent with the possible-worlds semantics.

PrDB. One of the most active current probabilistic database projects is the PrDB [128] system developed at the University of Maryland. Unlike other probabilistic database approaches, PrDB employs undirected probabilistic graphical models, specifically Markov Random Fields, as basis for handling correlated base tuples. The graphical model is stored directly in the underlying database system in the form of factor tables, which capture correlations among tuples and serve as input for the probabilistic inference algorithms. Due to the generality of this data model, PrDB incorporates a large variety of techniques and optimizations for both exact and approximate inference, including variable elimination [36], the reuse of shared factors, and Gibbs sampling [47] in the general case. In addition, bisimulation [72] is shown to significantly speed up inference for DAG-shaped queries in this context. Also here, lineage, in the form of Boolean formulas that capture the logical dependencies of derived data objects (tuples) back to the base tuples, is the key for a closed and complete representation model. Moreover, the efficient processing of lineage for probabilistic inference under this data model has been studied in [71].

3.2 Logic Programming & Rule-based Reasoning

The semantic web has led to the development of a plethora of rule-based and description-logic-based reasoning engines, including reasoners like Sesame, Jena, IRIS, Bossam, Prova, and many more13. Besides classical logic programming frameworks based on Prolog and Datalog, these engines specifically focus on different ontological reasoning concepts based on the RDF/S standards and the DL (based on description logic), RL (supporting first-order Horn rules) and EL (supporting rules with restricted existential quantifiers) fragments of OWL.

Besides the different grounding techniques and the varying expressiveness of the ontological concepts these reasoners support, an important semantic distinction can be made in the way these engines handle negation. Negation-as-failure [28] is the most common semantics for handling negation in rule antecedents (bodies), which is also the default semantics used in most Prolog and Datalog engines. Intuitively, a negated literal in the body of a rule is grounded if no proof for the literal can be derived from the knowledge base. As a form of closed-world assumption, this semantics can lead to non-monotonic inferences when new information (i.e., facts or rules) is entered into the knowledge base, such that the negation of the literal no longer holds. To tackle this issue, the well-founded negation and the stable-model semantics have been proposed to handle negation in rule antecedents. We refer the reader to [45, 102, 143] for details. However, care is advisable when reasoning with recursive rules; already plain Datalog without negation has been shown to be EXPTIME-complete for non-linear, recursive programs (i.e., for predicates with more than one argument and rules with more than one recursive predicate in their antecedents) and still PSPACE-complete for linear, recursive programs, respectively [142].
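
The following toy example (in Python rather than an actual Datalog engine, with made-up facts) illustrates the non-monotonic behavior of negation-as-failure: a conclusion derived from the absence of a fact is retracted once that fact is added.

def flies(facts):
    """Rule: flies(X) :- bird(X), not penguin(X).   (negation-as-failure:
    'not penguin(X)' holds if penguin(X) cannot be derived from the facts)"""
    birds = {x for (pred, x) in facts if pred == "bird"}
    penguins = {x for (pred, x) in facts if pred == "penguin"}
    return birds - penguins

kb = {("bird", "tweety")}
print(flies(kb))                        # {'tweety'}  -- derived by default
kb.add(("penguin", "tweety"))
print(flies(kb))                        # set()       -- conclusion retracted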

To evaluate the performance of these engines, a number of benchmarks have been defined, out of which the most prominent ones are probably [145] for RDF/S, as well as the Lehigh University Benchmark (LUBM) [52] and the more recent OpenRuleBench [82] initiative. In the following subsections, we provide a brief overview of the top-performing engines evaluated in OpenRuleBench: OntoBroker, XSB, Yap, and DLV.

13 See http://www.w3.org/2007/OWL/wiki/Implementations for a current overview.

OntoBroker14 is designed as a Java-based, object-oriented database system. It was originally developed at AIFB Karlsruhe and has meanwhile turned into a commercial product. It follows a bottom-up, deductive grounding technique and supports Magic-Sets-based rule rewriting [12], as well as a cost-based query optimizer (similar to a relational database system) – a feature that many other rule engines lack. OntoBroker supports the well-founded negation.

XSB15 is an open-source Prolog engine implemented in native C. In addition to a top-down processing of Prolog or function-free Datalog programs, it can also be used in a deductive (i.e., bottom-up) database fashion, using advanced grounding techniques based on tabling (aka. "memoing") [153]. Through tabling, XSB is able to terminate even for cases when many Prolog engines (based on top-down SLD resolution [77]) run into cycles. Unlike most Prolog engines, XSB supports the well-founded negation and (via a plug-in) also the stable-model semantics.

Yap16 is a highly optimized Prolog engine developed at the Center for Research in Advanced Computing Systems and the University of Porto. Like XSB, it supports advanced grounding techniques based on tabling, but (unlike XSB) it can also create indices for faster data access on-the-fly when it determines that a particular index may speed up the access to a large amount of data. It however supports only negation-as-failure, but neither the well-founded negation nor the stable-model semantics.

DLV17 is a bottom-up rule system implemented in C++. It is unique in that it allows for a form of disjunctive Datalog programming with disjunctive rule consequents (heads). Moreover, it is the only system along these lines which supports the stable-model semantics. As an additional feature, DLV has built-in support for propositional reasoning over the grounded model using Max-SAT solving and similar techniques. Like most of these engines, it is a pure query engine, i.e., it does not support incremental updates to the knowledge base.

3.3 Combining First-order Logic and Probabilistic Inference

In the following, we assume a knowledge base to consist of a finite set of first-order logical formulas and a finite set of (potentially typed) entities. We focus only on reasoning techniques which work by grounding (i.e., by instantiating) the first-order rules, and which again result in a finite set of propositional formulas. This class of rules conforms to a subset of first-order logic that is generally referred to as the Bernays-Schönfinkel-Ramsey class, which is decidable and can be evaluated by grounding the first-order formulas. In some settings, we might want to distinguish between soft rules, which may be violated and thus typically have a confidence weight associated with them, and hard constraints, which may not be violated. A wide variety of grounding techniques exist, each leading to a different reasoning semantics and potentially different answers to queries. What all grounding techniques have in common is that they bind the variables in the rules with the entities contained in the knowledge base in order to obtain a (grounded) set of propositional formulas. In the following, we will consider the actual grounding procedure mostly as a black-box function.

14 http://www.ontoprise.de/
15 http://xsb.sourceforge.net/
16 http://www.dcc.fc.up.pt/~vsc/Yap/
17 http://www.dbai.tuwien.ac.at/proj/dlv/

We formally call a knowledge base inconsistent if the conjunction of all propositional statements that can be derived from it (e.g., via grounding the rules) evaluates to false. Moreover, we call a knowledge base unsatisfiable if there exists no truth assignment to variables such that the hard constraints are satisfied.
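
As a black-box illustration of such a grounding step (the rule encoding and entity set are made up; real engines use far more efficient strategies), the sketch below naively instantiates a Horn rule over all entities of a knowledge base:

from itertools import product

ENTITIES = ["Albert_Einstein", "Ulm", "Germany"]          # illustrative

# Horn rule: bornIn(X, Y) AND locatedIn(Y, Z) -> bornIn(X, Z)
RULE_BODY = [("bornIn", "X", "Y"), ("locatedIn", "Y", "Z")]
RULE_HEAD = ("bornIn", "X", "Z")

def ground(rule_body, rule_head, entities):
    """Naively bind every variable to every entity and return the resulting
    propositional (grounded) clauses as (body atoms, head atom) pairs."""
    variables = sorted({a for atom in rule_body + [rule_head]
                        for a in atom[1:] if a.isupper() and len(a) == 1})
    clauses = []
    for binding in product(entities, repeat=len(variables)):
        env = dict(zip(variables, binding))
        subst = lambda atom: (atom[0],) + tuple(env.get(a, a) for a in atom[1:])
        clauses.append(([subst(a) for a in rule_body], subst(rule_head)))
    return clauses

print(len(ground(RULE_BODY, RULE_HEAD, ENTITIES)))   # 27 grounded clauses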

Propositional Reasoning. Classic (deterministic) approaches to handling inconsistencies in a set of propositional formulas are based on the Boolean satisfiability problem, generally known as SAT. Although the general SAT problem is NP-complete, many real-world problems have actually been shown to be "easy" to solve even for thousands of Boolean variables and many tens of thousands of constraints. Moreover, in recent years the field of SAT solving has made great progress in developing strategies that allow for tackling even non-trivial instances of the SAT problem very efficiently. Introducing soft rules, on the other hand, leads to a weighted form of the satisfiability problem, generally known as the maximum satisfiability problem (Max-SAT). Here the goal is to find a truth assignment to variables which maximizes the aggregated weights (typically the sum) of the formulas which are satisfied by this assignment. Finding the optimum solution in Max-SAT solving however is also NP-complete. More specifically, the maximum satisfiability problem over Horn clauses (coined Max-Horn-SAT) has been studied in detail in [69]. In [48], the authors provide a 3/4 approximation algorithm for the weighted Max-SAT problem over Boolean formulas in conjunctive normal form (CNF). None of these classic (deterministic) Max-SAT solvers however considers a distinction between soft and hard constraints. Thus, recently a family of stochastic Max-SAT solvers has been introduced with WalkSAT [154] and MaxWalkSAT [74], which apply different strategies in order to explore multiple possible worlds to more accurately approximate the optimum solution. A counterpart to Max-SAT solving from a probabilistic perspective is Maximum-a-Posteriori estimation (or "MAP-inference") [133], which selects the most likely mode, i.e., the most likely assignments to variables, according to their posterior distribution. Recently, for example in the context of information extraction, grounding a set of first-order formulas and post-processing the propositional formulas by a Max-SAT solver has been applied successfully in the SOFIE [141] and PROSPERA [97] projects, in order to automatically populate the YAGO [140] knowledge base18.
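
The sketch below gives the flavor of such a stochastic weighted Max-SAT search (loosely in the spirit of MaxWalkSAT, but greatly simplified; the clause encoding and parameters are illustrative):

import random

# Weighted clauses in CNF: (weight, [(variable, required_truth_value), ...])
clauses = [(5.0, [("rain", False), ("wet", True)]),     # rain -> wet
           (2.0, [("rain", True)]),                     # evidence: rain
           (1.0, [("wet", False)])]                     # weak prior: not wet

def satisfied_weight(assignment):
    return sum(w for w, lits in clauses
               if any(assignment[v] == val for v, val in lits))

def max_walk_sat(variables, flips=1000, p_random=0.3, seed=42):
    """Stochastic local search: repeatedly pick an unsatisfied clause and flip
    either a random variable in it or the variable that helps most."""
    random.seed(seed)
    assignment = {v: random.choice([True, False]) for v in variables}
    best = dict(assignment)
    for _ in range(flips):
        unsat = [lits for w, lits in clauses
                 if not any(assignment[v] == val for v, val in lits)]
        if not unsat:
            break
        lits = random.choice(unsat)
        if random.random() < p_random:
            var = random.choice(lits)[0]
        else:  # greedy: flip the variable yielding the highest total weight
            var = max((v for v, _ in lits), key=lambda v: satisfied_weight(
                {**assignment, v: not assignment[v]}))
        assignment[var] = not assignment[var]
        if satisfied_weight(assignment) > satisfied_weight(best):
            best = dict(assignment)
    return best

print(max_walk_sat(["rain", "wet"]))   # e.g. {'rain': True, 'wet': True}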

Markov Logic Networks. Statistical relational learning (SRL) [46] has been gaining increasing momentum in the machine learning, database, and semantic web communities, with Markov Logic Networks [118] probably being the most generic approach for combining first-order logic and probabilistic graphical models into a unified representation and reasoning framework. Intuitively, Markov Logic works by grounding a set of first-order logical rules against a knowledge base, and by sampling states ("worlds") over a Markov network that represents the grounded (i.e., propositional) formulas. Inference in probabilistic graphical models in general is #P-complete. Therefore, Markov Chain Monte Carlo (MCMC) [47, 114, 133] denotes a family of efficient sampling algorithms for probabilistic inference in these graphical models, with Gibbs sampling [47] being one of the most widely used sampling techniques, which is also employed in Markov Logic.

18 http://www.mpi-inf.mpg.de/yago-naga/

Markov Logic however does not easily scale to very large knowledge bases. Grounding a first-order Markov network works by binding all entities (constants) to variables in first-order predicates that match the type signature of the predicate. For binary predicates, this results in grounded networks which are often nearly quadratic in the number of entities in the knowledge base. Scaling Markov Logic to large knowledge bases with millions of entities (and hundreds of millions of facts) thus remains anything but straightforward. Recently, the Tuffy [103] engine (see Section 4) has been addressing the issue of scaling up Markov Logic by coupling Alchemy19 with a relational back-end, and by replacing the grounding procedure of Alchemy with a more efficient bottom-up grounding technique.

MAP-Inference. More recently, stochastic approaches to inference over a combination of deterministic (hard) and probabilistic (soft) dependencies have also been investigated in the context of Markov Logic. Maximum-a-Posteriori (MAP) inference [133] (based on the stochastic weighted Max-SAT solver MaxWalkSAT [74]) and MC-SAT [114] (based on slice sampling [33]) are two approximation algorithms for propositional and probabilistic inference in Markov Logic, respectively. Using a log-linear model for generating the factors of grounded formulas, MAP-inference can be shown to directly correspond to an execution of MaxWalkSAT over a Markov Logic network [119].

MC-SAT. Hard constraints may introduce isolated regions of states which cannot easily be overcome by a Gibbs sampler (i.e., by just flipping one variable at a time). MC-SAT [113] thus introduces auxiliary variables which provide the sampler with the ability to "jump" into another (otherwise disconnected) region with some probability. Experimentally, MC-SAT has been shown to outperform Gibbs sampling and simulated tempering by a significant margin, particularly when deterministic dependencies are present. However, allowing arbitrary constraints as hard rules may lead to the formulation of unsatisfiable constraints, which either renders the knowledge base inconsistent (if there is no solution at all) or empty (if the only solution is to set all facts to false). Satisfiability checks, which include checking whether a derived fact is false in all possible worlds and thus has a probability of exactly 0, cannot be approximated and thus remain an NP-hard problem.

19 http://alchemy.cs.washington.edu/

Constrained Conditional Models. Another framework which combines (first-order) logical constraints and probabilistic inference is given by Constrained Conditional Models [88] (CCMs). Intuitively, constraints between input variables (observations) and output variables (labels) are encoded into linear weight vectors, which can be solved by Integer Linear Programming. Working with CCMs involves both learning weights for the model and efficient inference. CCMs allow for encoding Markov Random Fields, Hidden Markov Models, and Conditional Random Fields [121]. They have found strong applications in natural language processing, including tasks like co-reference resolution, semantic role labeling, and information extraction.

Probabilistic Datalog. Early probabilistic extensions to Datalog have been studied already in [44] and have later been refined to capture a number of OWL concepts [104]. Although this approach already introduced a notion of lineage (coined "intensional query semantics"), the probabilistic computations are restricted to a class of rules which is guaranteed to provide independent subgoals (similar to the notion of safe plans or read-once functions in [30, 129]), where confidence computations can be propagated "upwards" along the lineage tree using the inclusion-exclusion principle (aka. the "sieve formula").

3.4 Programming Platforms for Probabilistic Inference

The "declarative-imperative" [64], a term coined in the context of the Berkeley Orders of Magnitude (BOOM) project20, brings two seemingly contradicting paradigms in data management to the point: how can we combine the power of an imperative programming language with the convenience of a declarative query language? In the following, we briefly highlight two imperative programming platforms for probabilistic inference: FACTORIE and Infer.NET.

FACTORIE21 is a toolkit for deployable probabilistic modeling developed by the machine learning group at the University of Massachusetts Amherst [93]. It is based on the idea of using an imperative programming language (Scala) to define templates which generate factors between random variables, an approach coined imperatively defined factor graphs. Intuitively, when instantiated, these templates form a factor graph, where all factors that have been instantiated from the same template also share the same parameters that were used to define the template. For inference, FACTORIE provides a variety of techniques based on MCMC, including Gibbs sampling. FACTORIE has been successfully applied to various inference tasks in natural language processing and information integration. Recently, FACTORIE has also been coupled with a relational back-end and thus potentially scales to probabilistic database settings with billions of variables [156].

20 http://boom.cs.berkeley.edu/

21 http://code.google.com/p/factorie/

Infer.NET22, developed by Microsoft Research in Cambridge, provides a rich programming language for modeling Bayesian inference tasks in graphical models and comes with an out-of-the-box selection of inference algorithms. It provides a built-in API for defining random variables (binary/multivariate-discrete or continuous), factors, message-passing operators, and other algebraic operators. It has been used in many machine-learning settings, with tasks involving classification or clustering, and in a wide variety of domains, including information retrieval, bio-informatics, epidemiology, vision, and many others.

3.5 Distributed Probabilistic Inference

Distribution bears the highest potential to scale up rule-based reasoning and probabilistic inference, but it is still fairly unexplored in the context of uncertain reasoning and probabilistic data management. Although distribution of course cannot tackle the asymptotic runtime issues inherently involved in these algorithms, it bears two key advantages:

• Storing a large data or knowledge base with billions of uncertain data objects in a distributed environment immediately allows for increased main-memory locality of the data, which is key for both efficient rule-based reasoning and probabilistic inference, both of which involve a majority of fine-grained, random-access-style data accesses.

• Running queries over a cluster of machines bears great potential for high-performance parallel computations, but clearly also poses major algorithmic challenges in terms of synchronizing these computations and preserving approximation guarantees (e.g., convergence guarantees for the MCMC-based sampling techniques).

MCDB. The MCDB [68, 159] project at IBM Almaden focuses on supporting Monte Carlo techniques for complex data analysis tasks directly within a database system. MCDB is one of the few database approaches to probabilistic data management that has specifically been adapted to Hadoop23, a massively parallel, Map-Reduce-like [35] computing environment. MCDB focuses on analytical tasks over a broad range of user-defined stochastic models, e.g., risk analysis with complex analytical queries including grouping and aggregations [53]. Another recent application domain of MCDB is declarative information extraction [95].

Message Passing & Distributed Inference. Its iterative nature makes the Map-Reduce paradigm not well suited for inference tasks, which inherently involve many fine-grained updates between states of objects that may be distributed across a compute cluster. For probabilistic inference, two main paradigms for distribution co-exist: the shared-memory and the distributed-memory model. In the shared-memory model, every processor has access to all the memory in the cluster, whereas in the distributed-memory model, every processor only has a limited amount of local memory, and each processor can pass "messages" to other processors in the cluster. An alternative to the classic Message Passing Interface24 (MPI) is the Internet Communications Engine25 (ICE). Both are shipped as C++ libraries.

22 http://research.microsoft.com/en-us/um/cambridge/projects/infernet/
23 http://hadoop.apache.org/

With the ResidualSplash [49] algorithm, the authors present a parallel belief propagation algorithm under the shared-memory model, which is shown to achieve optimal runtime compared to a theoretical lower bound for parallel inference on chain graphical models. In their later DBRSplash [50] algorithm, the authors drop the shared-memory model and consider parallel inference techniques in generic probabilistic graphical models, which are captured as distributed factor graphs. In this setting, a factor graph is distributed into a number of (disjoint or slightly overlapping) partitions, such that the number of partitions matches the number of processors available in the compute cluster. The objective of the partitioning function is to minimize the communication cost among nodes while ensuring load balance. Since computing an optimal partitioning under these constraints is NP-hard, an efficient (linear-time) approximation algorithm is devised as basis for the data partitioning. As for inference, a belief propagation algorithm is employed, with a local priority queue for incoming update messages at each processor. DBRSplash even reports a super-linear performance scale-up compared to a centralized setting. Moreover, GraphLab26 [86] is a framework for deploying parallel (provably correct) machine learning algorithms. Unlike MapReduce, it focuses on more asynchronous communication protocols with different levels of sequential-consistency guarantees. In the full consistency model, during the execution of a function f(v) on a vertex v, no other function is allowed to read or write data to any node in the scope (neighborhood) S(v). In the edge consistency model, no other function is allowed to read or write data to an edge associated with v while a function f(v) is executed. Finally, in the weakest form of consistency model, the vertex consistency model, no other function is allowed to read or write data to the vertex v itself while a function f(v) is executed.

4 New trends: BayesStore, SPROUT, Tuffy, URDF

SQL-style query processing over uncertain relational data and, respectively, rule-based reasoning with uncertain RDF data are emerging topics in the database, knowledge management, and semantic web communities. Recent trends along these lines include moving away from strictly relational data, lifting inference to higher-order constraints, and scaling up inference for graphical models which combine first-order logic and probabilistic graphical models (in particular Markov Logic). Extracting structured data from the Web is an excellent showcase – and likely one of the biggest challenges – for the scalable management of uncertain data we have seen so far. In the following, we thus briefly highlight a few projects, each of which devises exciting directions that could help make the management of uncertain RDF data scalable to billions of triples.

24 http://www.mcs.anl.gov/research/projects/mpich2/

25 http://www.zeroc.com/ice.html

26 http://www.graphlab.ml.cmu.edu/

BayesStore. Based on a native extension to the PostgreSQL database system, the BayesStore [151] project at Berkeley aims to bridge the gap between statistical models induced by the data and the uncertainty model supported by the probabilistic database. BayesStore supports statistical models, evidence data, and inference algorithms directly as first-class citizens inside a database management system (DBMS). It combines a probabilistic database system (PDBMS) for relational data with a declarative first-order extension to Bayesian Nets, which allows for capturing complex correlation patterns in the PDBMS in a compact way. Using a combination of propositional and first-order factors, BayesStore supports probabilistic inference for all common database operations, including selection, projection, and joins. Recent extensions to BayesStore include the investigation of probabilistic-declarative information extraction techniques by using Conditional Random Fields (CRFs) for extraction and by implementing the Viterbi algorithm for efficient inference in the CRF directly via SQL [150].
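To make the last point concrete, the following Python sketch spells out the Viterbi dynamic program for a tiny linear-chain labeling task (the labels, emission scores, and transition weights are invented for illustration); BayesStore expresses essentially this recursion in SQL over CRF factor tables, whereas here it is written out directly:

```python
import numpy as np

labels = ["O", "CITY"]
# Hypothetical log-scores for a three-token sequence (e.g., "lives in Berlin"):
# emission[t][y] scores label y at position t, transition[y1][y2] scores y1 -> y2.
emission = np.log(np.array([[0.8, 0.2],
                            [0.7, 0.3],
                            [0.1, 0.9]]))
transition = np.log(np.array([[0.7, 0.3],
                              [0.4, 0.6]]))

def viterbi(emission, transition):
    """Dynamic program for the highest-scoring label sequence of a linear chain."""
    T, L = emission.shape
    score = np.full((T, L), -np.inf)
    back = np.zeros((T, L), dtype=int)
    score[0] = emission[0]
    for t in range(1, T):
        for y in range(L):
            cand = score[t - 1] + transition[:, y] + emission[t, y]
            back[t, y] = int(np.argmax(cand))
            score[t, y] = cand[back[t, y]]
    path = [int(np.argmax(score[-1]))]        # backtrack from the best final label
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return list(reversed(path))

print([labels[y] for y in viterbi(emission, transition)])
```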

SPROUT. Using the MayBMS probabilistic database system (see Section 3.1) as a basis, the SPROUT (for "Scalable PROcessing on Uncertain Tables") [106] project at Oxford University aims at the scalable processing of queries over uncertain data. A particular focus of the project lies on tractable classes of queries for which exact probabilistic inference can be done in polynomial time. Similarly to the notion of safe plans [30], a probabilistic query is tractable if it has a hierarchical structure and (in the case of SPROUT) can be decomposed into a binary decision diagram in polynomial time. [107], on the other hand, investigates decision diagrams for approximate inference for queries where exact probabilistic inference is intractable.
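For intuition on why such tractable queries admit polynomial-time confidence computation, consider a read-once lineage formula whose subformulas range over independent tuples: its probability factorizes and can be evaluated bottom-up, as in the following Python sketch (the lineage and tuple probabilities are hypothetical, and the sketch deliberately omits SPROUT's decision-diagram machinery):

```python
def or_independent(probs):
    """P(x1 OR ... OR xn) for independent events x1, ..., xn."""
    q = 1.0
    for p in probs:
        q *= (1.0 - p)
    return 1.0 - q

def and_independent(probs):
    """P(x1 AND ... AND xn) for independent events x1, ..., xn."""
    q = 1.0
    for p in probs:
        q *= p
    return q

# Hypothetical read-once lineage of one answer: (r1 OR r2) AND (s1 OR s2 OR s3),
# where the r-tuples and s-tuples stem from independent base relations.
p_r = [0.9, 0.5]
p_s = [0.2, 0.4, 0.7]
confidence = and_independent([or_independent(p_r), or_independent(p_s)])
print(f"answer confidence: {confidence:.4f}")
```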

Tuffy. Based on the observation that Markov Logic does not easily scale to real-world datasets with millions of data objects, the Tuffy [103] project at the University of Wisconsin-Madison investigates pushing Markov Logic Networks (MLNs) directly into a relational database system (RDBMS). In contrast to the open-world grounding strategy followed by the MLN implementation Alchemy, the authors pursue a more efficient bottom-up, closed-world grounding strategy for first-order rules through iterative SQL statements, which is fully supported by the DBMS optimizer. Focusing on MAP inference, Tuffy provides a partitioning strategy for the search (inference) phase by splitting the grounded network into a number of independent components, which allows for parallel inference and an exponentially faster search compared to running the search on the global problem. For a given classification benchmark, Tuffy (consuming only 15 MB of RAM) was reported to produce much better result quality within minutes than Alchemy (using 2.8 GB of RAM) even after days of running.
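The bottom-up, closed-world grounding step can be pictured as a relational join between the evidence tables of the atoms in a rule body. The following Python sketch grounds one hypothetical weighted rule over toy relations; Tuffy performs the corresponding joins as SQL inside the RDBMS rather than in application code:

```python
# Hypothetical weighted MLN rule:  1.5 : smokes(x) AND friends(x, y) -> smokes(y)
# This sketch mimics Tuffy's closed-world grounding with an in-memory join.

smokes = {"anna", "bob"}                         # known/candidate smokes() atoms
friends = {("anna", "bob"), ("bob", "carol")}    # known friends() atoms
rule_weight = 1.5

ground_clauses = []
for x in smokes:                                 # join smokes(x) with friends(x, y)
    for (fx, fy) in friends:
        if fx == x:
            # Ground clause: NOT smokes(x) OR NOT friends(x, y) OR smokes(y)
            ground_clauses.append(
                (rule_weight, [("smokes", x, False),
                               ("friends", (x, fy), False),
                               ("smokes", fy, True)]))

for weight, clause in ground_clauses:
    print(weight, clause)
```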

URDF. In probabilistic databases, SQL is employed as a means for formulating queries and for defining views. While SQL queries and schema-level constraints generally yield "hard" Boolean constraints among tuples, they lack the notion of "soft" dependencies among data items. The URDF [144] project developed at the Max Planck Institute for Informatics specifically focuses on weighted Horn clauses as soft rules and mutual-exclusion constraints as hard rules. Unlike Markov Logic, URDF follows a Datalog-style, deductive grounding strategy for soft rules in the form of Horn clauses, which typically results in a much smaller grounded network size than for Markov Logic. While URDF originally employed a Max-SAT solver, which had been tailored to a specific class of mutual-exclusion hard rules, URDF currently also employs probabilistic models based on deductive grounding and lineage. Moreover, URDF explores several directions for managing large amounts of web-extracted RDF data in a declarative way, including temporal reasoning extensions [39, 152], as well as inductively learning soft inference rules automatically from a given knowledge base. As exact inference for this class of queries is intractable, URDF investigates various approximation techniques based on MCMC (Gibbs sampling) for approximate inference.
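As a rough illustration of such sampling-based confidence estimation (a self-contained Python sketch over a hypothetical grounded network of weighted clauses, not URDF's actual implementation), a Gibbs sampler repeatedly resamples each ground fact from its conditional distribution and estimates marginal probabilities from the visited states:

```python
import math
import random

# Hypothetical grounded network of weighted (soft) clauses over two candidate facts.
# A clause is a disjunction of literals (variable, required truth value).
clauses = [
    (2.0, [("bornIn_Einstein_Ulm", True)]),
    (1.5, [("bornIn_Einstein_Ulm", False), ("citizenOf_Einstein_Germany", True)]),
    (0.8, [("citizenOf_Einstein_Germany", False)]),
]
variables = ["bornIn_Einstein_Ulm", "citizenOf_Einstein_Germany"]

def satisfied_weight(state):
    """Total weight of clauses satisfied by the current truth assignment."""
    return sum(w for w, lits in clauses
               if any(state[v] == val for v, val in lits))

def gibbs(num_samples=5000, burn_in=500):
    state = {v: random.random() < 0.5 for v in variables}
    counts = {v: 0 for v in variables}
    for i in range(burn_in + num_samples):
        for v in variables:
            # Resample v from its conditional distribution given all other facts.
            scores = {}
            for value in (True, False):
                state[v] = value
                scores[value] = math.exp(satisfied_weight(state))
            state[v] = random.random() < scores[True] / (scores[True] + scores[False])
        if i >= burn_in:
            for v in variables:
                counts[v] += state[v]
    return {v: counts[v] / num_samples for v in variables}

print(gibbs())
```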

References

1. RDF Primer & RDF Schema (W3C Rec. 2004-02-10). http://www.w3.org/TR/rdf-primer/, http://www.w3.org/TR/rdf-schema/.

2. D. J. Abadi, A. Marcus, S. Madden, and K. Hollenbach. SW-Store: a vertically partitioned DBMS for Semantic Web data management. VLDB J., 18(2):385–406, 2009.

3. D. J. Abadi, A. Marcus, S. Madden, and K. J. Hollenbach. Scalable semantic web data management using vertical partitioning. In C. Koch, J. Gehrke, M. N. Garofalakis, D. Srivastava, K. Aberer, A. Deshpande, D. Florescu, C. Y. Chan, V. Ganti, C.-C. Kanne, W. Klas, and E. J. Neuhold, editors, VLDB, pages 411–422. ACM, 2007.

4. K. Aberer, P. Cudré-Mauroux, A. Datta, Z. Despotovic, M. Hauswirth, M. Punceva, and R. Schmidt. P-Grid: a self-organizing structured P2P system. SIGMOD Rec., 32:29–33, September 2003.

5. K. Aberer, P. Cudré-Mauroux, M. Hauswirth, and T. Van Pelt. GridVine: Building Internet-Scale Semantic Overlay Networks. In S. A. McIlraith, D. Plexousakis, and F. van Harmelen, editors, The Semantic Web – ISWC 2004, volume 3298 of Lecture Notes in Computer Science, pages 107–121. Springer Berlin / Heidelberg, 2004.

6. S. Abiteboul, P. Kanellakis, and G. Grahne. On the representation and querying of sets of possible worlds. Theor. Comput. Sci., 78(1):159–187, 1991.

7. G. Antoniou and F. van Harmelen. A Semantic Web Primer (Cooperative Information Systems). The MIT Press, April 2004.

8. L. Antova, C. Koch, and D. Olteanu. MayBMS: Managing incomplete information with probabilistic world-set decompositions. In ICDE, pages 1479–1480, 2007.

9. M. Atre, V. Chaoji, M. J. Zaki, and J. A. Hendler. Matrix "bit" loaded: a scalable lightweight join query processor for RDF data. In M. Rappa, P. Jones, J. Freire, and S. Chakrabarti, editors, WWW, pages 41–50. ACM, 2010.

10. S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: a nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC'07/ASWC'07, pages 722–735, 2007.

11. S. Auer, A.-D. N. Ngomo, and J. Lehmann. Introduction to linked data. In Reasoning Web. Semantic Technologies for the Web of Data, volume 6848 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2011.

12. C. Beeri and R. Ramakrishnan. On the power of magic. J. Log. Program., 10(1/2/3/4):255–299, 1991.

13. O. Benjelloun, A. D. Sarma, A. Y. Halevy, and J. Widom. ULDBs: Databases with uncertainty and lineage. In VLDB, pages 953–964, 2006.

14. T. Berners-Lee. Linked Data - Design Issues, 2006. http://www.w3.org/DesignIssues/LinkedData.html.

15. C. Bizer, T. Heath, and T. Berners-Lee. Linked Data – The Story So Far. Int. J. Semantic Web Inf. Syst., 5(3):1–22, 2009.

16. J. Boulos, N. Dalvi, B. Mandhani, S. Mathur, C. Re, and D. Suciu. MystiQ: a system for finding more answers by using probabilities. In SIGMOD, pages 891–893, 2005.

17. P. Bouquet, C. Ghidini, and L. Serafini. Querying the Web of Data: A Formal Approach. In A. Gómez-Pérez, Y. Yu, and Y. Ding, editors, The Semantic Web, volume 5926 of Lecture Notes in Computer Science, pages 291–305. Springer Berlin / Heidelberg, 2009.

18. H. C. Bravo and R. Ramakrishnan. Optimizing MPF queries: decision support and probabilistic inference. In SIGMOD, pages 701–712, 2007.

19. P. Buitelaar, T. Eigner, and T. Declerck. OntoSelect: A Dynamic Ontology Library with Support for Ontology Selection. In Proceedings of the Demo Session at the International Semantic Web Conference, 2004.

20. M. Cai and M. Frank. RDFPeers: a scalable distributed RDF repository based on a structured peer-to-peer network. In Proceedings of the 13th international conference on World Wide Web, WWW '04, pages 650–657, 2004.

21. M. Cai, M. Frank, J. Chen, and P. Szekely. MAAN: A Multi-Attribute Addressable Network for Grid Information Services. In Proceedings of the 4th International Workshop on Grid Computing, GRID '03, pages 184–, 2003.

22. J. J. Carroll, C. Bizer, P. Hayes, and P. Stickler. Named graphs. Journal of Web Semantics, 3:247–267, December 2005.

23. J. J. Carroll, I. Dickinson, C. Dollin, D. Reynolds, A. Seaborne, and K. Wilkinson. Jena: implementing the Semantic Web recommendations. In S. I. Feldman, M. Uretsky, M. Najork, and C. E. Wills, editors, WWW (Alternate Track Papers & Posters), pages 74–83. ACM, 2004.

24. R. Cavallo and M. Pittarelli. The theory of probabilistic databases. In VLDB, pages 71–81. Morgan Kaufmann, 1987.

25. H. Chen, Y. Wang, H. Wang, Y. Mao, J. Tang, C. Zhou, A. Yin, and Z. Wu. Towards a Semantic Web of relational databases: A practical semantic toolkit and an in-use case from traditional Chinese medicine. In Fifth International Semantic Web Conference (ISWC), pages 750–763. Springer, 2006.

26. G. Cheng and Y. Qu. Searching Linked Objects with Falcons: Approach, Implementation and Evaluation. Int. J. Semantic Web Inf. Syst., 5(3):49–70, 2009.

27. E. I. Chong, S. Das, G. Eadon, and J. Srinivasan. An efficient SQL-based RDF querying scheme. In K. Böhm, C. S. Jensen, L. M. Haas, M. L. Kersten, P.-A. Larson, and B. C. Ooi, editors, VLDB, pages 1216–1227. ACM, 2005.

28. K. L. Clark. Negation as failure. In Logic and Data Bases, pages 293–322. Plenum Press, 1978.

29. I. F. Cruz, V. Kashyap, S. Decker, and R. Eckstein, editors. Proceedings of SWDB'03, The first International Workshop on Semantic Web and Databases, Co-located with VLDB 2003, Humboldt-Universität, Berlin, Germany, September 7-8, 2003, 2003.

30. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, pages 864–875, 2004.

31. N. Dalvi and D. Suciu. The dichotomy of conjunctive queries on probabilistic structures. In PODS Conference, pages 293–302, 2007.

32. N. N. Dalvi, C. Re, and D. Suciu. Probabilistic databases: diamonds in the dirt. Commun. ACM, 52(7):86–94, 2009.

33. P. Damlen, J. Wakefield, and S. Walker. Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(2):331–344, 1999.

34. M. d'Aquin, C. Baldassarre, L. Gridinoc, S. Angeletou, M. Sabou, and E. Motta. Characterizing Knowledge on the Semantic Web with Watson. In EON, pages 1–10, 2007.

35. J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51:107–113, January 2008.

36. R. Dechter. Bucket elimination: A unifying framework for reasoning. Artif. Intell., 113(1-2):41–85, 1999.

37. L. Ding, T. Finin, A. Joshi, R. Pan, R. S. Cost, Y. Peng, P. Reddivari, V. Doshi, and J. Sachs. Swoogle: a search and metadata engine for the semantic web. In CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 652–659, 2004.

38. Y. Ding, Y. Sun, B. Chen, K. Börner, L. Ding, D. Wild, M. Wu, D. DiFranzo, A. G. Fuenzalida, D. Li, S. Milojevic, S. Chen, M. Sankaranarayanan, and I. Toma. Semantic web portal: a platform for better browsing and visualizing semantic data. In Proceedings of the 6th international conference on Active media technology, AMT'10, pages 448–460, 2010.

39. M. Dylla, M. Sozio, and M. Theobald. Resolving temporal conflicts in inconsistent RDF knowledge bases. In BTW, pages 474–493, 2011.

40. O. Erling and I. Mikhailov. Towards web-scale RDF. http://virtuoso.openlinksw.com/whitepapers/Web-Scale

41. O. Erling and I. Mikhailov. RDF Support in the Virtuoso DBMS. In T. Pellegrini, S. Auer, K. Tochtermann, and S. Schaffert, editors, Networked Knowledge – Networked Media, volume 221 of Studies in Computational Intelligence, pages 7–24. Springer Berlin / Heidelberg, 2009.

42. G. H. L. Fletcher and P. W. Beck. Scalable indexing of RDF graphs for efficient join processing. In D. W.-L. Cheung, I.-Y. Song, W. W. Chu, X. Hu, and J. J. Lin, editors, CIKM, pages 1513–1516. ACM, 2009.

43. W. B. Frakes and R. A. Baeza-Yates, editors. Information Retrieval: Data Structures & Algorithms. Prentice-Hall, 1992.

44. N. Fuhr. Probabilistic Datalog - a logic for powerful retrieval methods. In SIGIR, pages 282–290, 1995.

45. M. Gelfond and V. Lifschitz. The stable model semantics for logic programming. In Logic Programming, pages 1070–1080. MIT Press, 1988.

46. L. Getoor and B. Taskar. An Introduction to Statistical Relational Learning. MIT Press, 2007.

47. W. Gilks, S. Richardson, and D. J. S. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall, 1996.

48. M. X. Goemans and D. P. Williamson. New 3/4-approximation algorithms for the maximum satisfiability problem. SIAM J. Discrete Math., 7(4):656–666, 1994.

49. J. E. Gonzalez, Y. Low, and C. Guestrin. Residual splash for optimally parallelizing belief propagation. In Artificial Intelligence and Statistics (AISTATS), pages 177–184, 2009.

50. J. E. Gonzalez, Y. Low, C. Guestrin, and D. O'Hallaron. Distributed parallel inference on large factor graphs. In Uncertainty in Artificial Intelligence (UAI), pages 203–212, 2009.

51. O. Görlitz and S. Staab. Federated Data Management and Query Optimization for Linked Open Data, chapter 5, pages 109–137. Springer, 2011.

52. Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems. J. Web Sem., 3(2-3):158–182, 2005.

53. P. J. Haas, C. M. Jermaine, S. Arumugam, F. Xu, L. L. Perez, and R. Jampani. MCDB-R: Risk analysis in the database. PVLDB, 3(1):782–793, 2010.

54. P. Haase, T. Mathaß, and M. Ziller. An evaluation of approaches to federated query processing over linked data. In Proceedings of the 6th International Conference on Semantic Systems, I-SEMANTICS '10, pages 5:1–5:9, 2010.

55. P. Haase and Y. Wang. A decentralized infrastructure for query answering over distributed ontologies. In Proceedings of the 2007 ACM symposium on Applied computing, SAC '07, pages 1351–1356, 2007.

56. S. Harris and N. Gibbins. 3store: Efficient bulk RDF storage. In R. Volz, S. Decker, and I. F. Cruz, editors, PSSS, volume 89 of CEUR Workshop Proceedings. CEUR-WS.org, 2003.

57. A. Harth. VisiNav: Visual Web Data Search and Navigation. In Proceedings of the 20th International Conference on Database and Expert Systems Applications, DEXA '09, pages 214–228, 2009.

58. A. Harth, A. Hogan, R. Delbru, J. Umbrich, S. O'Riain, and S. Decker. SWSE: Answers Before Links! In Semantic Web Challenge, 2007.

59. A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data Summaries for On-Demand Queries over Linked Data. In WWW'10, pages 411–420, 2010.

60. A. Harth, J. Umbrich, A. Hogan, and S. Decker. YARS2: a federated repository for querying graph structured data from the web. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC'07/ASWC'07, pages 211–224, 2007.

61. O. Hartig, C. Bizer, and J.-C. Freytag. Executing SPARQL Queries over the Web of Linked Data. In The Semantic Web - ISWC 2009, volume 5823, pages 293–309. Springer Berlin / Heidelberg, 2009.

62. O. Hartig and R. Heese. The SPARQL query graph model for query optimization. In Proceedings of the 4th European conference on The Semantic Web: Research and Applications, ESWC '07, pages 564–578, 2007.

63. O. Hartig and A. Langegger. A Database Perspective on Consuming Linked Data on the Web. Datenbank-Spektrum, 10(2):57–66, 2010.

64. J. M. Hellerstein. The declarative imperative: experiences and conjectures in distributed logic. SIGMOD Record, 39(1):5–19, 2010.

65. A. Hogan, A. Harth, and S. Decker. ReConRank: A Scalable Ranking Method for Semantic Web Data with Context. In 2nd Workshop on Scalable Semantic Web Knowledge Base Systems, 2006.

66. J. Huang, L. Antova, C. Koch, and D. Olteanu. MayBMS: a probabilistic database management system. In SIGMOD, pages 1071–1074, 2009.

67. T. Imielinski and W. Lipski Jr. Incomplete information in relational databases. J. ACM, 31(4):761–791, 1984.

68. R. Jampani, F. Xu, M. Wu, L. L. Perez, C. M. Jermaine, and P. J. Haas. MCDB: a Monte Carlo approach to managing uncertain data. In J. T.-L. Wang, editor, SIGMOD, pages 687–700. ACM, 2008.

69. B. Jaumard and B. Simeone. On the complexity of the maximum satisfiability problem for Horn formulas. Information Processing Letters, 26(1):1–4, 1987.

70. A. Jha, V. Rastogi, and D. Suciu. Query evaluation with soft-key constraints. In PODS, pages 119–128, 2008.

71. B. Kanagal and A. Deshpande. Lineage processing over correlated probabilistic databases. In SIGMOD, pages 675–686, 2010.

72. P. C. Kanellakis and S. A. Smolka. CCS expressions, finite state processes, and three problems of equivalence. Inf. Comput., 86:43–68, May 1990.

73. R. M. Karp and M. Luby. Monte-Carlo algorithms for enumeration and reliability problems. In FOCS, pages 56–64, 1983.

74. H. Kautz, B. Selman, and Y. Jiang. A general stochastic approach to solving problems with hard and soft constraints. In The Satisfiability Problem: Theory and Applications, pages 573–586. American Mathematical Society, 1996.

75. C. Koch. A compositional query algebra for second-order logic and uncertain databases. In ICDT, pages 127–140, 2009.

76. D. Kossmann. The state of the art in distributed query processing. ACM Comput. Surv., 32:422–469, December 2000.

77. R. A. Kowalski and D. Kuehner. Linear resolution with selection function. Artif. Intell., 2(3/4):227–260, 1971.

78. G. Ladwig and T. Tran. Linked Data Query Processing Strategies. In International Semantic Web Conference (ISWC'10), pages 453–469, 2010.

79. A. Langegger, W. Wöß, and M. Blöchl. A semantic web middleware for virtual data integration on the web. In Proceedings of the 5th European semantic web conference on The semantic web: research and applications, ESWC'08, pages 493–507, 2008.

80. J. J. Levandoski and M. F. Mokbel. RDF data-centric storage. In ICWS, pages 911–918. IEEE, 2009.

81. J. Li, B. Saha, and A. Deshpande. A unified approach to ranking in probabilistic databases. PVLDB, 2(1):502–513, 2009.

82. S. Liang, P. Fodor, H. Wan, and M. Kifer. OpenRuleBench: an analysis of the performance of rule engines. In WWW, pages 601–610. ACM, 2009.

83. E. Liarou, S. Idreos, and M. Koubarakis. Evaluating Conjunctive Triple Pattern Queries over Large Structured Overlay Networks. In International Semantic Web Conference (ISWC'06), pages 399–413. Springer Heidelberg, 2006.

84. B. Liu and B. Hu. Path queries based RDF index. In SKG, page 91. IEEE Computer Society, 2005.

85. B. Liu and B. Hu. HPRD: A high performance RDF database. In K. Li, C. R. Jesshope, H. Jin, and J.-L. Gaudiot, editors, NPC, volume 4672 of Lecture Notes in Computer Science, pages 364–374. Springer, 2007.

86. Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010.

87. T. Lukasiewicz. Probabilistic description logic programs. Int. J. Approx. Reasoning, 45(2):288–307, 2007.

88. M. Chang, L. Ratinov, N. Rizzolo, and D. Roth. Learning and inference with constraints. In AAAI, 2008.

89. A. Maduko, K. Anyanwu, A. Sheth, and P. Schliekelman. Graph summaries for subgraph frequency estimation. In Proceedings of the 5th European semantic web conference on The semantic web: research and applications, ESWC'08, pages 508–523, 2008.

90. A. Maduko, K. Anyanwu, A. P. Sheth, and P. Schliekelman. Estimating the cardinality of RDF graph patterns. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 1233–1234. ACM, 2007.

91. A. Maduko, K. Anyanwu, A. P. Sheth, and P. Schliekelman. Graph summaries for subgraph frequency estimation. In S. Bechhofer, M. Hauswirth, J. Hoffmann, and M. Koubarakis, editors, ESWC, volume 5021 of Lecture Notes in Computer Science, pages 508–523. Springer, 2008.

92. A. Matono, T. Amagasa, M. Yoshikawa, and S. Uemura. An indexing scheme for RDF and RDF schema based on suffix arrays. In Cruz et al. [29], pages 151–168.

93. A. McCallum, K. Schultz, and S. Singh. FACTORIE: Probabilistic programming via imperatively defined factor graphs. In NIPS, 2009.

94. A. O. Mendelzon and T. Milo. Formal models of Web queries. In Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, PODS '97, pages 134–143, 1997.

95. E. Michelakis, R. Krishnamurthy, P. J. Haas, and S. Vaithyanathan. Uncertainty management in rule-based information extraction systems. In SIGMOD, pages 101–114, 2009.

96. M. Mutsuzaki, M. Theobald, A. de Keijzer, J. Widom, P. Agrawal, O. Benjelloun, A. D. Sarma, R. Murthy, and T. Sugihara. Trio-One: Layering uncertainty and lineage on a conventional DBMS (demo). In CIDR, pages 269–274, 2007.

97. N. Nakashole, M. Theobald, and G. Weikum. Scalable knowledge harvesting with high precision and high recall. In WSDM, pages 227–236, 2011.

98. W. Nejdl, B. Wolf, C. Qu, S. Decker, M. Sintek, A. Naeve, M. Nilsson, M. Palmer, and T. Risch. EDUTELLA: a P2P networking infrastructure based on RDF. In WWW '02: Proceedings of the 11th international conference on World Wide Web, pages 604–615, New York, NY, USA, 2002. ACM.

99. T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. PVLDB, 1(1):647–659, 2008.

100. T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In U. Çetintemel, S. B. Zdonik, D. Kossmann, and N. Tatbul, editors, SIGMOD Conference, pages 627–640. ACM, 2009.

101. T. Neumann and G. Weikum. The RDF-3X engine for scalable management of RDF data. VLDB J., 19(1):91–113, 2010.

102. I. Niemelä and P. Simons. Smodels - an implementation of the stable model and well-founded semantics for normal logic programs. In Logic Programming and Nonmonotonic Reasoning. Springer, 1997.

103. F. Niu, C. Re, A. Doan, and J. Shavlik. Tuffy: scaling up statistical inference in Markov logic networks using an RDBMS. Technical report, University of Wisconsin-Madison, 2010.

104. H. Nottelmann and N. Fuhr. Adding probabilities and rules to OWL lite subsets based on probabilistic Datalog. Int. Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 14(1):17–41, 2006.

105. P. Obermeier and L. Nixon. A Cost Model for Querying Distributed RDF-Repositories with SPARQL. In Workshop on Advancing Reasoning on the Web: Scalability and Commonsense, 2008.

106. D. Olteanu, J. Huang, and C. Koch. SPROUT: Lazy vs. eager query plans for tuple-independent probabilistic databases. In ICDE, pages 640–651. IEEE, 2009.

107. D. Olteanu, J. Huang, and C. Koch. Approximate confidence computation in probabilistic databases. In ICDE, pages 145–156, 2010.

108. E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. Metadata Semant. Ontologies, 3:37–52, November 2008.

109. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab, November 1999.

110. R. Palma and P. Haase. Oyster - Sharing and Re-using Ontologies in a Peer-to-Peer Community. In The Semantic Web – ISWC 2005, Lecture Notes in Computer Science, pages 1059–1062. Springer Berlin / Heidelberg, 2005.

111. J. Z. Pan, E. Thomas, and D. Sleeman. OntoSearch2: Searching and querying web ontologies. In Proc. of the IADIS International Conference, pages 211–218, 2006.

112. C. Patel, K. Supekar, Y. Lee, and E. K. Park. OntoKhoj: a semantic web portal for ontology searching, ranking and classification. In Proceedings of the 5th ACM international workshop on Web information and data management, WIDM '03, pages 58–61, 2003.

113. H. Poon and P. Domingos. Sound and efficient inference with probabilistic and deterministic dependencies. In AAAI. AAAI Press, 2006.

114. H. Poon, P. Domingos, and M. Sumner. A general method for reducing the complexity of relational inference and its application to MCMC. In AAAI, pages 1075–1080, 2008.

115. B. Quilitz and U. Leser. Querying distributed RDF data sources with SPARQL. In Proceedings of the 5th European semantic web conference on The semantic web: research and applications, ESWC'08, pages 524–538, 2008.

116. C. Re, N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probabilistic data. In ICDE, pages 886–895, 2007.

117. C. Re and D. Suciu. Managing probabilistic data with MystiQ: The can-do, the could-do, and the can't-do. In SUM, pages 5–18, 2008.

118. M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1-2), 2006.

119. S. Riedel. Cutting plane MAP inference for Markov Logic. In International Workshop on Statistical Relational Learning (SRL), 2009.

120. D. Roth. On the hardness of approximate reasoning. Artif. Intell., 82:273–302, April 1996.

121. D. Roth and W. Yih. Integer linear programming inference for conditional random fields. In Proc. of the International Conference on Machine Learning (ICML), pages 737–744, 2005.

122. S. Sakr and G. Al-Naymat. Relational processing of RDF queries: a survey. SIGMOD Record, 38(4):23–28, 2009.

123. A. D. Sarma, O. Benjelloun, A. Y. Halevy, and J. Widom. Working models for uncertain data. In ICDE, page 7, 2006.

124. A. D. Sarma, M. Theobald, and J. Widom. Exploiting lineage for confidence computation in uncertain and probabilistic databases. In ICDE, pages 1023–1032, 2008.

125. A. D. Sarma, M. Theobald, and J. Widom. LIVE: A lineage-supported versioned DBMS. In SSDBM, pages 416–433, 2010.

126. S. Schenk and S. Staab. Networked graphs: a declarative mechanism for SPARQL rules, SPARQL views and RDF data integration on the Web. In Proceedings of the 17th international conference on World Wide Web, WWW '08, pages 585–594, 2008.

127. P. Sen and A. Deshpande. Representing and querying correlated tuples in probabilistic databases. In ICDE, pages 596–605, 2007.

128. P. Sen, A. Deshpande, and L. Getoor. PrDB: managing and exploiting rich correlations in probabilistic databases. VLDB J., 18(5):1065–1090, 2009.

129. P. Sen, A. Deshpande, and L. Getoor. Read-once functions and query evaluation in probabilistic databases. PVLDB, 3(1):1068–1079, 2010.

130. L. Sidirourgos, R. Goncalves, M. L. Kersten, N. Nes, and S. Manegold. Column-store support for RDF data management: not all swans are white. PVLDB, 1(2):1553–1563, 2008.

131. S. Singh, C. Mayfield, S. Mittal, S. Prabhakar, S. E. Hambrusch, and R. Shah. Orion 2.0: native support for uncertain data. In SIGMOD, pages 1239–1242, 2008.

132. S. Singh, C. Mayfield, R. Shah, S. Prabhakar, S. E. Hambrusch, J. Neville, and R. Cheng. Database support for probabilistic attributes and tuples. In ICDE, pages 1053–1061, 2008.

133. P. Singla and P. Domingos. Memory-efficient inference in relational domains. In AAAI, 2006.

134. M. A. Soliman, I. F. Ilyas, and K. C. Chang. URank: formulation and efficient evaluation of top-k queries in uncertain databases. In SIGMOD, pages 1082–1084, 2007.

135. M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation. In J. Huai, R. Chen, H.-W. Hon, Y. Liu, W.-Y. Ma, A. Tomkins, and X. Zhang, editors, WWW, pages 595–604. ACM, 2008.

136. M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation. In Proceedings of the 17th international conference on World Wide Web, WWW '08, pages 595–604, 2008.

137. I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek, and H. Balakrishnan. Chord: a scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Trans. Netw., 11:17–32, February 2003.

138. U. Straccia. Managing Uncertainty and Vagueness in Description Logics, Logic Programs and Description Logic Programs. In Reasoning Web, pages 54–103. Springer-Verlag, Berlin, Heidelberg, 2008.

139. H. Stuckenschmidt, R. Vdovjak, G.-J. Houben, and J. Broekstra. Index structures and algorithms for querying distributed RDF repositories. In Proceedings of the 13th international conference on World Wide Web, WWW '04, pages 631–639, 2004.

140. F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: a core of semantic knowledge. In WWW, pages 697–706, 2007.

141. F. M. Suchanek, M. Sozio, and G. Weikum. SOFIE: a self-organizing framework for information extraction. In WWW, pages 631–640, 2009.

142. E. Dantsin, T. Eiter, G. Gottlob, and A. Voronkov. Complexity and expressive power of logic programming, 1999.

143. G. Terracina, N. Leone, V. Lio, and C. Panetta. Experimenting with recursive queries in database and logic programming systems. Theory Pract. Log. Program., 8:129–165, March 2008.

144. M. Theobald, M. Sozio, F. Suchanek, and N. Nakashole. URDF: Efficient reasoning in uncertain RDF knowledge bases with soft and hard rules. Technical Report MPII20105-002, Max Planck Institute for Informatics (MPI-INF), 2010.

145. Y. Theoharis, V. Christophides, and G. Karvounarakis. Benchmarking database representations of RDF/S stores. In Y. Gil, E. Motta, V. R. Benjamins, and M. A. Musen, editors, International Semantic Web Conference, volume 3729 of Lecture Notes in Computer Science, pages 685–701. Springer, 2005.

146. T. Tran, P. Haase, and R. Studer. Semantic Search – Using Graph-Structured Semantic Models for Supporting the Search Process. In Proceedings of the 17th International Conference on Conceptual Structures: Leveraging Semantic Technologies, ICCS '09, pages 48–65, 2009.

147. T. Tran, H. Wang, and P. Haase. Hermes: Data Web search on a pay-as-you-go integration infrastructure. Web Semant., 7:189–203, September 2009.

148. G. Tummarello, R. Cyganiak, M. Catasta, S. Danielczyk, R. Delbru, and S. Decker. Sig.ma: live views on the web of data. In Proceedings of the 19th international conference on World Wide Web, WWW '10, pages 1301–1304, 2010.

149. O. Udrea, A. Pugliese, and V. S. Subrahmanian. GRIN: A graph based RDF index. In AAAI, pages 1465–1470. AAAI Press, 2007.

150. D. Z. Wang, E. Michelakis, M. J. Franklin, M. N. Garofalakis, and J. M. Hellerstein. Probabilistic declarative information extraction. In ICDE, pages 173–176, 2010.

151. D. Z. Wang, E. Michelakis, M. N. Garofalakis, and J. M. Hellerstein. BayesStore: managing large, uncertain data repositories with probabilistic graphical models. PVLDB, 1(1):340–351, 2008.

152. Y. Wang, M. Yahya, and M. Theobald. Time-aware reasoning in uncertain knowledge bases. In Workshop on Management of Uncertain Data (MUD), 2010.

153. D. S. Warren. Memoing for logic programs. Commun. ACM, 35:93–111, 1992.

154. W. Wei, J. Erenrich, and B. Selman. Towards efficient sampling: Exploiting random walk strategies. In AAAI, pages 670–676, 2004.

155. C. Weiss, P. Karras, and A. Bernstein. Hexastore: sextuple indexing for Semantic Web data management. PVLDB, 1(1):1008–1019, 2008.

156. M. L. Wick, A. McCallum, and G. Miklau. Scalable probabilistic databases with factor graphs and MCMC. PVLDB, 3(1):794–804, 2010.

157. K. Wilkinson, C. Sayers, H. Kuno, and D. Reynolds. Efficient RDF Storage and Retrieval in Jena2. In First International Workshop on Semantic Web and Databases (SWDB '03), pages 131–150, 2003.

158. K. Wilkinson, C. Sayers, H. A. Kuno, and D. Reynolds. Efficient RDF storage and retrieval in Jena2. In Cruz et al. [29], pages 131–150.

159. F. Xu, K. S. Beyer, V. Ercegovac, P. J. Haas, and E. J. Shekita. E = MC3: managing uncertain enterprise data in a cluster-computing environment. In SIGMOD, pages 441–454, 2009.

160. M. Zhou and Y. Wu. XML-based RDF data management for efficient query processing. In X. L. Dong and F. Naumann, editors, WebDB, 2010.