Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools
Mohammad Farhan Husain, Dr. Latifur Khan, Dr. Bhavani Thuraisingham
Department of Computer Science, University of Texas at Dallas
Outline
- Semantic Web Technologies & Cloud Computing Frameworks
- Goal & Motivation
- Current Approaches
- System Architecture & Storage Schema
- SPARQL Query by MapReduce
- Query Plan Generation
- Experiment
- Future Works
Semantic Web Technologies
- Data in machine-understandable format
- Infer new knowledge
- Standards
  - Data representation: RDF triples
    - Example: Subject = http://test.com/s1, Predicate = foaf:name, Object = “John Smith”
  - Ontology: OWL, DAML
  - Query language: SPARQL
Cloud Computing Frameworks
- Proprietary
  - Amazon S3
  - Amazon EC2
  - Force.com
- Open source tools
  - Hadoop: Apache’s open source implementation of Google’s proprietary GFS file system
  - MapReduce: functional programming paradigm using key-value pairs
Goal
- To build efficient storage using Hadoop for large amounts of data (e.g. a billion triples)
- To build an efficient query mechanism
- Publish as an open source project: http://code.google.com/p/hadooprdf/
- Integrate with Jena as a Jena Model
Motivation
- Current Semantic Web frameworks do not scale to a large number of triples, e.g.
  - Jena In-Memory, Jena RDB, Jena SDB
  - AllegroGraph
  - Virtuoso Universal Server
  - BigOWLIM
- There is a lack of distributed frameworks and persistent storage
- Hadoop uses low-end hardware, providing a distributed framework with high fault tolerance and reliability
Current Approaches
- State-of-the-art approach
  - Store RDF data in HDFS and query through MapReduce programming (our approach)
- Traditional approach
  - Store data in HDFS and process queries outside of Hadoop
  - Done in the BIOMANTA [1] project (details of querying could not be found)
1. http://biomanta.org/
System Architecture
[Figure: system architecture diagram. Recoverable labels:]
- Components: LUBM Data Generator; Preprocessor (N-Triples Converter, Predicate Based Splitter, Object Type Based Splitter); Hadoop Distributed File System / Hadoop Cluster; MapReduce Framework (Query Rewriter, Query Plan Generator, Plan Executor)
- Arrow labels: RDF/XML; Preprocessed Data; 1. Query; 2. Jobs; 3. Answer
Storage Schema
- Data in N-Triples
- Using namespaces
  - Example: http://utdallas.edu/res1 -> utd:resource1
- Predicate based Splits (PS)
  - Split data according to Predicates
- Predicate Object based Splits (POS)
  - Split further according to rdf:type of Objects

Example
Input triples:
  D0U0:GraduateStudent20 rdf:type lehigh:GraduateStudent
  lehigh:University0 rdf:type lehigh:University
  D0U0:GraduateStudent20 lehigh:memberOf lehigh:University0
After PS:
  File: rdf_type
    D0U0:GraduateStudent20 lehigh:GraduateStudent
    lehigh:University0 lehigh:University
  File: lehigh_memberOf
    D0U0:GraduateStudent20 lehigh:University0
After POS:
  File: rdf_type_GraduateStudent
    D0U0:GraduateStudent20
  File: rdf_type_University
    lehigh:University0
  File: lehigh_memberOf_University
    D0U0:GraduateStudent20 lehigh:University0
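
The two splitting steps can be sketched in Python as follows. This is a minimal illustration rather than the project's implementation: the function names (`predicate_split`, `predicate_object_split`), the in-memory dictionaries standing in for HDFS files, and the rule that an object with no known rdf:type stays in its predicate file are all assumptions made for the example.

```python
from collections import defaultdict

def local_name(term):
    """Drop the namespace prefix, e.g. 'lehigh:University' -> 'University'."""
    return term.split(":")[-1]

def predicate_split(triples):
    """PS: one 'file' per predicate; the predicate itself is no longer stored."""
    files = defaultdict(list)
    for s, p, o in triples:
        files[p.replace(":", "_")].append((s, o))
    return dict(files)

def predicate_object_split(ps_files):
    """POS: split each predicate file further by the rdf:type of the object."""
    # Type of every subject that has an rdf:type triple (assumed lookup table).
    types = {s: local_name(o) for s, o in ps_files.get("rdf_type", [])}
    files = defaultdict(list)
    for name, rows in ps_files.items():
        for s, o in rows:
            if name == "rdf_type":
                # The object is encoded in the file name, so only the subject is kept.
                files[f"{name}_{local_name(o)}"].append((s,))
            elif o in types:
                files[f"{name}_{types[o]}"].append((s, o))
            else:
                files[name].append((s, o))  # object type unknown: leave in place
    return dict(files)

triples = [
    ("D0U0:GraduateStudent20", "rdf:type", "lehigh:GraduateStudent"),
    ("lehigh:University0", "rdf:type", "lehigh:University"),
    ("D0U0:GraduateStudent20", "lehigh:memberOf", "lehigh:University0"),
]
ps = predicate_split(triples)
pos = predicate_object_split(ps)
```

Dropping the predicate (and, for rdf:type files, the object) from every row is where the storage saving comes from.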
Space Gain
Data size at various steps for LUBM1000:

Steps                          Number of Files   Size (GB)   Space Gain
N-Triples                      20020             24          --
Predicate Split (PS)           17                7.1         70.42%
Predicate Object Split (POS)   41                6.6         72.5%
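
The space gain figures appear to be the relative size reduction with respect to the 24 GB N-Triples data. A short check, assuming that definition (the helper name `space_gain` is illustrative):

```python
def space_gain(original_gb, split_gb):
    """Percentage of space saved relative to the original N-Triples size."""
    return 100.0 * (original_gb - split_gb) / original_gb

ps_gain = round(space_gain(24, 7.1), 2)   # Predicate Split
pos_gain = round(space_gain(24, 6.6), 2)  # Predicate Object Split
print(ps_gain, pos_gain)
```

Both values agree with the table: 70.42% for PS and 72.5% for POS.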
SPARQL Query
- SPARQL: SPARQL Protocol And RDF Query Language
- Example:
  SELECT ?x ?y WHERE {
    ?z foaf:name ?x .
    ?z foaf:age ?y
  }
[Figure: the example query, sample data and the query result]
SPARQL Query by MapReduce
Example query:
  SELECT ?p WHERE {
    ?x rdf:type lehigh:Department .
    ?p lehigh:worksFor ?x .
    ?x subOrganizationOf http://University0.edu
  }
Rewritten query:
  SELECT ?p WHERE {
    ?p lehigh:worksFor_Department ?x .
    ?x subOrganizationOf http://University0.edu
  }
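
One way to read this rewrite: because the POS files already encode the object type in the file name, an rdf:type pattern can be folded into any pattern that uses the typed variable as its object. The Python sketch below illustrates that idea only. The `rewrite` function and its folding rule are assumptions inferred from this single example, not the Query Rewriter's actual logic.

```python
RDF_TYPE = "rdf:type"

def rewrite(patterns):
    """Fold '?v rdf:type T' into patterns whose object is ?v (assumed rule)."""
    # Declared type of each variable, without the namespace prefix.
    types = {s: o.split(":")[-1]
             for s, p, o in patterns
             if p == RDF_TYPE and s.startswith("?")}
    folded, rewritten = set(), []
    for s, p, o in patterns:
        if p == RDF_TYPE:
            continue
        if o in types:
            rewritten.append((s, f"{p}_{types[o]}", o))
            folded.add(o)
        else:
            rewritten.append((s, p, o))
    # Keep any type pattern that could not be folded into another pattern.
    kept = [(s, p, o) for s, p, o in patterns
            if p == RDF_TYPE and s not in folded]
    return kept + rewritten

query = [
    ("?x", "rdf:type", "lehigh:Department"),
    ("?p", "lehigh:worksFor", "?x"),
    ("?x", "subOrganizationOf", "http://University0.edu"),
]
rewritten_query = rewrite(query)
```

The rewritten query has one triple pattern fewer, so the later join has one input fewer.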
Inside Hadoop MapReduce Job
[Figure: data flow of the join through one MapReduce job]
INPUT
  File: subOrganizationOf_University
    Department1 http://University0.edu
    Department2 http://University1.edu
  File: worksFor_Department
    Professor1 Department1
    Professor2 Department2
MAP
  Filtering: Object == http://University0.edu
  Map outputs:
    Department1 SO#http://University0.edu
    Department1 WF#Professor1
    Department2 WF#Professor2
SHUFFLE & SORT
  Department1 SO#http://University0.edu WF#Professor1
  Department2 WF#Professor2
REDUCE
OUTPUT
  WF#Professor1
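
The job above can be imitated in plain Python to make the key-value flow concrete. This sketch runs in a single process with no Hadoop involved. The function names and the reducer's rule (emit the WF# values only for keys that also carry an SO# value) are assumptions based on the figure.

```python
from collections import defaultdict

SUB_ORG = [("Department1", "http://University0.edu"),
           ("Department2", "http://University1.edu")]
WORKS_FOR = [("Professor1", "Department1"),
             ("Professor2", "Department2")]
TARGET = "http://University0.edu"

def map_sub_org(subject, obj):
    # Filter on the bound object, then key by the join variable (?x = department).
    if obj == TARGET:
        yield subject, "SO#" + obj

def map_works_for(subject, obj):
    # The department is the object here, so it becomes the key.
    yield obj, "WF#" + subject

def shuffle(pairs):
    """Group values by key, as the shuffle-and-sort phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in sorted(groups.items())}

def reduce_join(key, values):
    # A department joins only if it survived the subOrganizationOf filter.
    if any(v.startswith("SO#") for v in values):
        for v in values:
            if v.startswith("WF#"):
                yield v[len("WF#"):]

def run_job():
    pairs = [kv for row in SUB_ORG for kv in map_sub_org(*row)]
    pairs += [kv for row in WORKS_FOR for kv in map_works_for(*row)]
    grouped = shuffle(pairs)
    return [out for key, values in grouped.items()
            for out in reduce_join(key, values)]

answer = run_job()
```

The SO# and WF# prefixes let the reducer tell which input file a value came from, since values from both files arrive under the same key.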
Query Plan Generation
Challenge:
- One Hadoop job may not be sufficient to answer a query
- In a single Hadoop job, a single triple pattern cannot take part in joins on more than one variable simultaneously
Solution:
- Algorithm for query plan generation
  - A query plan is a sequence of Hadoop jobs which answers the query
- Exploit the fact that in a single Hadoop job, a single triple pattern can take part in more than one join on a single variable simultaneously
Example
Example query:
  SELECT ?X ?Y ?Z WHERE {
    ?X pred1 obj1 .
    subj2 ?Z obj2 .
    subj3 ?X ?Z .
    ?Y pred4 obj4 .
    ?Y pred5 ?X
  }
Simplified view (variables in each triple pattern):
  1. X
  2. Z
  3. XZ
  4. Y
  5. XY
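
The simplified view and the edges of the join graph can be derived mechanically from the triple patterns. A small Python sketch follows; the helper names `variables` and `join_graph` and the tuple encoding of patterns are illustrative assumptions.

```python
from itertools import combinations

# Triple patterns of the example query, numbered as in the simplified view.
PATTERNS = {
    1: ("?X", "pred1", "obj1"),
    2: ("subj2", "?Z", "obj2"),
    3: ("subj3", "?X", "?Z"),
    4: ("?Y", "pred4", "obj4"),
    5: ("?Y", "pred5", "?X"),
}

def variables(pattern):
    """Variables of a triple pattern, without the leading '?'."""
    return {term[1:] for term in pattern if term.startswith("?")}

def join_graph(patterns):
    """One edge (i, j, var) for every variable shared by patterns i and j."""
    edges = []
    for (i, a), (j, b) in combinations(sorted(patterns.items()), 2):
        for var in sorted(variables(a) & variables(b)):
            edges.append((i, j, var))
    return edges

simplified = {i: "".join(sorted(variables(p))) for i, p in PATTERNS.items()}
edges = join_graph(PATTERNS)
```

For this query the graph has five edges: three on X (between patterns 1, 3 and 5), one on Z (2 and 3) and one on Y (4 and 5).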
Join Graph & Hadoop Jobs
[Figure: four panels (Join Graph, Valid Job 1, Valid Job 2, Invalid Job). Each panel shows triple-pattern nodes 1 to 5 connected by join edges labeled X, X, X, Y and Z.]
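Possible Query Plans
A. job1: (x, xz, xy) = yz; job2: (yz, y) = z; job3: (z, z) = done
[Figure: Join Graph with nodes 1 to 5 and edges X, X, X, Y, Z. Job 1 runs on the full join graph; Job 2 on nodes 2, {1,3,5} and 4 with edges Z and Y; Job 3 on nodes 2 and {1,3,4,5} with edge Z; Result: {1,2,3,4,5}.]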
Possible Query Plans
B. job1: (y, xy) = x and (z, xz) = x; job2: (x, x, x) = done
[Figure: Join Graph with nodes 1 to 5 and edges X, X, X, Y, Z. Job 1 runs on the full join graph; Job 2 on nodes {2,3}, 1 and {4,5} with edges X, X, X; Result: {1,2,3,4,5}.]
Query Plan Generation
- Goal: generate a minimum cost job plan
- Backtracking approach
  - Exhaustively generates all possible plans
  - Uses a two-coloring scheme on a graph to find jobs, with colors WHITE and BLACK; two WHITE nodes cannot be adjacent
  - User-defined cost model
  - Chooses the best plan according to the cost model
Some Definitions
- Triple Pattern, TP: A triple pattern is an ordered collection of subject, predicate and object which appears in a SPARQL query WHERE clause. The subject, predicate and object can each be either a variable (unbound) or a concrete value (bound).
- Triple Pattern Join, TPJ: A triple pattern join is a join between two TPs on a variable.
- MapReduceJoin, MRJ: A MapReduceJoin is a join between two or more triple patterns on a variable.

Some Definitions (contd.)
- Job, JB: A job JB is a Hadoop job where one or more MRJs are done. JB has a set of input files and a set of output files.
- Conflicting MapReduceJoins, CMRJ: Conflicting MapReduceJoins are a pair of MRJs on different variables sharing a triple pattern.
- Non-Conflicting MapReduceJoins, NCMRJ: Non-conflicting MapReduceJoins are a pair of MRJs either not sharing any triple pattern, or sharing a triple pattern with both MRJs on the same variable.
Example
LUBM Query:
  SELECT ?X WHERE {
    1  ?X rdf:type ub:Chair .
    2  ?Y rdf:type ub:Department .
    3  ?X ub:worksFor ?Y .
    4  ?Y ub:subOrganizationOf <http://www.U0.edu>
  }
Example (contd.)
Triple Pattern Graph and Join Graph for the LUBM Query
[Figure: Triple Pattern Graph (TPG) #1, Join Graph (JG) #1, Join Graph (JG) #2, Triple Pattern Graph (TPG) #2]

Example (contd.)
- The figure shows the TPG and JG for the query.
- On the left is the TPG, where each node represents a triple pattern in the query; nodes are named in the order the patterns appear.
- In the middle is the JG. Each node in the JG represents an edge in the TPG.
- For the query, an FQP can have two jobs:
  - The first deals with the NCMRJ between triple patterns 2, 3 and 4
  - The second deals with the NCMRJ between triple pattern 1 and the output of the first join
- An IQP would have a first job with CMRJs between triple patterns 1, 3 and 4, and a second job with an MRJ between triple pattern 2 and the output of the first join.
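
The difference between the two plans comes down to whether the MRJs placed in one job conflict. Here is a minimal Python check of that condition on the LUBM query. The encoding of an MRJ as a (variable, set of triple patterns) pair and the names `mrj`, `conflicting` and `job_is_valid` are assumptions made for illustration.

```python
from itertools import combinations

# Variables of the four LUBM triple patterns, numbered as in the query.
TP_VARS = {1: {"X"}, 2: {"Y"}, 3: {"X", "Y"}, 4: {"Y"}}

def mrj(variable, patterns):
    """Hypothetical MRJ record: the join variable and the patterns it joins."""
    assert all(variable in TP_VARS[p] for p in patterns)
    return (variable, frozenset(patterns))

def conflicting(a, b):
    """Two MRJs conflict if they share a triple pattern but use different variables."""
    (var_a, tps_a), (var_b, tps_b) = a, b
    return bool(tps_a & tps_b) and var_a != var_b

def job_is_valid(mrjs):
    """A job may only contain pairwise non-conflicting MRJs."""
    return not any(conflicting(a, b) for a, b in combinations(mrjs, 2))

first_job_ok = job_is_valid([mrj("Y", {2, 3, 4})])                    # joins 2, 3, 4 on Y
first_job_bad = job_is_valid([mrj("X", {1, 3}), mrj("Y", {3, 4})])    # pattern 3 used twice
```

In the second case triple pattern 3 would have to be joined on X and on Y within the same job, which is exactly what a single Hadoop job cannot do.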
Query Plan Generation: Backtracking
[Figures on two slides; not recoverable from the extracted text]

Query Plan Generation: Backtracking
Drawbacks of the backtracking approach:
- Computationally intractable
- Search space is exponential in size
Steps a Hadoop Job Goes Through
1. The executable file (containing the MapReduce code) is transferred from the client machine to the JobTracker.
2. The JobTracker decides which TaskTrackers will execute the job.
3. The executable file is distributed to the TaskTrackers over the network.
4. Map processes start by reading data from HDFS.
5. Map outputs are written to discs.
6. Map outputs are read from discs, shuffled (transferred over the network to the TaskTrackers that will run Reduce processes), sorted and written to discs.
7. Reduce processes start by reading the input from the discs.
8. Reduce outputs are written to discs.
MapReduce Data Flow
[Figure: MapReduce data flow. Source: http://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow]
Observations & an Approximate Solution
Observations:
- Fixed overheads of a Hadoop job
  - Multiple read-writes to disc
  - Data transfer over the network multiple times
- Even a “Hello World” MapReduce job takes a couple of seconds because of the fixed overheads
Approximate solution:
- Minimize the number of jobs
- This is a good approximation since the overhead of each job (e.g. jar file distribution, multiple disc read-writes, multiple network data transfers) and of job switching is huge
Greedy Algorithm: Terms
- Joining variable: a variable that is common to two or more triples
  - Ex: x, y, xy, xz, za -> x, y, z are joining variables; a is not
- Complete elimination: a join operation that eliminates a joining variable
  - y can be completely eliminated if we join (xy, y)
- Partial elimination: a join that partially eliminates a joining variable
  - After complete elimination of y, x can be partially eliminated by joining (xz, x)

Greedy Algorithm: Terms (contd.)
- E-count: number of joining variables in the resultant triple after a complete elimination
  - In the example x, y, z, xy, xz:
    - E-count of x is 2 (resultant triple: yz)
    - E-count of y is 1 (resultant triple: x)
    - E-count of z is 1 (resultant triple: x)
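
These terms translate directly into a few lines of Python. In this sketch each triple is written as a string of single-letter variable names, as on the slides. The function names are illustrative, and the E-count is computed as the number of the query's joining variables left in the resultant triple, which is one reading of the definition above.

```python
from collections import Counter

def joining_variables(triples):
    """Variables that occur in two or more triples."""
    counts = Counter(v for t in triples for v in set(t))
    return {v for v, n in counts.items() if n >= 2}

def e_count(var, triples):
    """Joining variables left in the triple produced by completely eliminating var."""
    joining = joining_variables(triples)
    # Complete elimination joins every triple that contains var.
    resultant = set().union(*(set(t) for t in triples if var in t)) - {var}
    return len(resultant & joining)

terms_example = joining_variables(["x", "y", "xy", "xz", "za"])
e_counts = {v: e_count(v, ["x", "y", "z", "xy", "xz"]) for v in "xyz"}
```

The results match the slides: x, y and z are joining variables in the first example, and the E-counts in the second are 2, 1 and 1.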
Greedy Algorithm: Proposition
Maximum number of jobs required for any SPARQL query:
  K,                              if K <= 1
  min(ceil(1.71 * log2(K)), N),   if K > 1
where K is the number of triples in the query and N is the total number of joining variables.
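
The bound is easy to evaluate numerically. A short Python helper, with the name `max_jobs` chosen for illustration:

```python
import math

def max_jobs(k, n):
    """Upper bound on the number of jobs for k triples and n joining variables."""
    if k <= 1:
        return k
    return min(math.ceil(1.71 * math.log2(k)), n)

bound_100 = max_jobs(100, 50)  # 100 triples, plenty of joining variables
bound_5 = max_jobs(5, 3)       # here N is the tighter limit
```

For K = 100 the logarithmic term is about 11.36, so the bound is 12 whenever N is at least 12.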
Greedy Algorithm: Proof
- If we make just one join with each joining variable, then all joins can be done in N jobs (one join per job).
- Special case:
  - Suppose each joining variable is common to exactly two triples
  - Example: ab, bc, cd, de, ef, ... (like a chain)
  - In each job, we can make K/2 joins, which reduces the number of triples to half (i.e., K/2)
  - So each job halves the number of triples
  - Therefore, the total number of jobs required is log2(K) < 1.71 * log2(K)
Greedy Algorithm: Proof (Continued)
General case:
- Suppose we sort the variables in decreasing order of their frequency in different triples.
- Let v_i have frequency f_i. Therefore, f_i <= f_{i-1} <= f_{i-2} <= ... <= f_1.
- Note that if f_1 = 2, this reduces to the special case. Therefore f_1 > 2 in the general case; also, f_N >= 2.
- Now we keep joining on v_1, v_2, ..., v_N as long as there is no conflict.

Greedy Algorithm: Proof (Continued)
- Suppose L triples could not be reduced because each of them is left alone with one or more joining variables that are conflicting (e.g. try reducing xy, yz, zx).
- Therefore, M >= L joins have been performed, producing M triples (M + L triples remain in total).
- Since each join involved at least 2 triples:
    2M + L <= K
    2(L + e) + L <= K          (letting M = L + e, e >= 0)
    3L + 2e <= K
    2L + (4/3)e <= (2/3)K      (multiplying both sides by 2/3)

Greedy Algorithm: Proof (Continued)
- Since e <= (4/3)e, the number of remaining triples satisfies M + L = 2L + e <= (2/3)K.
- So each job reduces the number of triples to at most 2/3 of its previous value.
- Therefore:
    K * (2/3)^Q >= 1 >= K * (2/3)^(Q+1)
    (3/2)^Q <= K <= (3/2)^(Q+1)
    Q <= log_{3/2}(K) = 1.71 * log2(K) <= Q + 1
- In most real-world scenarios, we can assume that 100 triples in a query is extremely rare.
- So the maximum number of jobs required in this case is 12.
Greedy Algorithm
- Early elimination heuristic:
  - Make as many complete eliminations in each job as possible
  - This leaves the fewest variables to join in the next job
  - Must first choose the join that has the least e-count (least number of joining variables in the resultant triple)

Greedy Algorithm
- Step I: remove non-joining variables
- Step II: sort the variables according to e-count
- Step III: choose a variable for elimination as long as complete or partial elimination is possible; these joins make a job
- Step IV: continue to Step II if more triples are available
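
Steps I to IV can be sketched in Python as follows. This is an illustration under several assumptions, not the authors' implementation: triples are strings of single-letter variable names, ties in e-count are broken alphabetically, a triple may be used by only one join per job, and a join counts as a complete elimination only when every triple holding the variable takes part in it (otherwise the variable is kept, i.e. a partial elimination).

```python
from collections import Counter

def joining_variables(triples):
    counts = Counter(v for t in triples for v in t)
    return {v for v, n in counts.items() if n >= 2}

def e_count(var, triples):
    joining = joining_variables(triples)
    resultant = set().union(*(t for t in triples if var in t)) - {var}
    return len(resultant & joining)

def greedy_plan(patterns):
    """Return a list of jobs; each job is the list of variables joined in it."""
    # Step I: keep only joining variables in every triple.
    joining = joining_variables([frozenset(p) for p in patterns])
    triples = [frozenset(p) & joining for p in patterns]
    jobs = []
    while len(triples) > 1:
        # Step II: order variables by e-count (name breaks ties).
        order = sorted(joining_variables(triples),
                       key=lambda v: (e_count(v, triples), v))
        used, results, job = set(), [], []
        # Step III: take every join that is still possible in this job.
        for var in order:
            holders = [i for i, t in enumerate(triples) if var in t]
            free = [i for i in holders if i not in used]
            if len(free) < 2:
                continue
            merged = frozenset().union(*(triples[i] for i in free))
            if len(free) == len(holders):
                merged -= {var}  # complete elimination
            used.update(free)
            results.append(merged)
            job.append(var)
        if not job:
            break  # nothing left to join (e.g. disconnected triples)
        # Step IV: the next job sees the untouched triples plus the join outputs.
        triples = [t for i, t in enumerate(triples) if i not in used] + results
        jobs.append(job)
    return jobs

plan = greedy_plan(["X", "Z", "XZ", "Y", "XY"])
```

On the five-pattern example query used earlier, this yields two jobs, first joining on Y and Z and then on X, which corresponds to query plan B.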
Experiment
- Dataset and queries
- Cluster description
- Comparison with Jena In-Memory, SDB and BigOWLIM frameworks
- Experiments with number of Reducers
- Algorithm runtimes: Greedy vs. Exhaustive
- Some query results
Dataset And Queries
- LUBM
  - Dataset generator
  - 14 benchmark queries
  - Generates data of some imaginary universities
  - Used for query execution performance comparison by many researchers
Our Clusters
- 10-node cluster in SAIAL lab
  - 4 GB main memory
  - Intel Pentium IV 3.0 GHz processor
  - 640 GB hard drive
- OpenCirrus HP labs test bed
Comparison: LUBM Query 2
Comparison: LUBM Query 9
Comparison: LUBM Query 12
Experiment with Number of Reducers
Greedy vs. Exhaustive Plan Generation
Some Query Results
[Figure: query results chart; axis labels: Seconds, Million Triples]
Future Works
- Enable the plan generation algorithm to handle queries with complex structures
- Ontology-driven file partitioning for faster query answering
- Balanced partitioning for data sets with skewed distribution
- Materialization with a limited number of jobs for inference
- Experiment with a non-homogeneous cluster
Publications
- Mohammad Husain, Latifur Khan, Murat Kantarcioglu, Bhavani M. Thuraisingham: Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools, IEEE International Conference on Cloud Computing, 2010 (acceptance rate 20%)
- Mohammad Husain, Pankil Doshi, Latifur Khan, Bhavani M. Thuraisingham: Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce, International Conference on Cloud Computing Technology and Science, Beijing, China, 2009
- Mohammad Husain, Mohammad M. Masud, James McGlothlin, Latifur Khan, Bhavani Thuraisingham: Greedy Based Query Processing for Large RDF Graphs Using Cloud Computing, IEEE Transactions on Knowledge and Data Engineering Special Issue on Cloud Computing (submitted)
- Mohammad Farhan Husain, Tahseen Al-Khateeb, Mohmmad Alam, Latifur Khan: Ontology based Policy Interoperability in Geo-Spatial Domain, CSI Journal (to appear)
- Mohammad Farhan Husain, Mohmmad Alam, Tahseen Al-Khateeb, Latifur Khan: Ontology based policy interoperability in geo-spatial domain, ICDE Workshops 2008
- Chuanjun Li, Latifur Khan, Bhavani M. Thuraisingham, M. Husain, Shaofei Chen, Fang Qiu: Geospatial Data Mining for National Security: Land Cover Classification and Semantic Grouping, Intelligence and Security Informatics, 2007
Questions/Discussion