Data model for analysis of scholarly documents in the MapReduce paradigm
Adam Kawa, Lukasz Bolikowski, Artur Czeczko, Piotr Jan Dendek, Dominika Tkaczyk
Centre for Open Science (CeON), ICM UW
Warsaw, July 6, 2012
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 1 / 19
Agenda
1 Problem definition
2 Requirements specification
3 Exemplary solutions based on Apache Hadoop Ecosystem tools
The data in our possession
Vast collections of scholarly documents to store
10 million full texts (PDF, plain text)
17 million document metadata records (described in the XML-based BWMeta format)
4 TB of data (10 TB including data archives)
The tasks that we perform
Big knowledge to extract and discover
17 million document metadata records (XML)
contain title, subtitles, abstract, keywords, references, contributors and their affiliations, publishing magazine, ...
input for many state-of-the-art machine learning algorithms
relatively simple ones: searching documents with a given title, finding scientific teams, ...
quite complex ones: author name disambiguation, bibliometrics, classification code assignment, ...
The requirements that we have specified
Multiple demands regarding storage and processing of large amounts of data:
scalability and parallelism: easily handle tens of terabytes of data and parallelize the computation effectively
flexible data model: the possibility to add or update data, and to enhance its content with implicit information discovered by our algorithms
latency requirements: support batch offline processing as well as random, real-time read/write requests
availability of many clients: accessible to programmers and researchers with diverse language preferences and expertise
reliability and cost-effectiveness: ideally open-source software that does not require expensive hardware
Document-related data as linked data
Information about document-related resources can be described simply as a directed labeled graph
entities (e.g. documents, contributors, references) are nodes in the graph
relationships between entities are directed labeled edges in the graph
Linked graph as a collection of RDF triples
A directed labeled graph can be represented simply as a collection of RDF triples
a triple consists of a subject, a predicate and an object
a triple represents a statement that a resource (subject) holds a value (object) for some attribute (predicate) of that resource
a triple can represent any statement about any resource
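The mapping from a labeled graph to triples can be sketched in a few lines of Python. This is a minimal illustration only: the identifiers (doc:1, person:7) and predicate names (hasTitle, hasContributor, cites) are hypothetical and are not taken from the BWMeta format.

```python
from collections import namedtuple

# One RDF statement: a labeled edge from subject to object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical identifiers and predicate names, for illustration only.
triples = [
    Triple("doc:1", "hasTitle", "Some title"),
    Triple("doc:1", "hasContributor", "person:7"),
    Triple("person:7", "hasName", "J. Smith"),
    Triple("doc:1", "cites", "doc:2"),
]

def outgoing_edges(graph, node):
    """Return (label, target) pairs for all edges leaving `node`."""
    return [(t.predicate, t.object) for t in graph if t.subject == node]

edges = outgoing_edges(triples, "doc:1")
```

Each triple is one directed edge, so the whole graph is just the list; no separate node table is needed.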
Hadoop as a solution for scalability/performance issues
Apache Hadoop is the most commonly used open-source solution for storing and processing big data in a reliable, high-performance and cost-effective way.
Scalable storage
Parallel processing
Subprojects and many Hadoop-related projects
HDFS: distributed file system that provides high-throughput access to large data
MapReduce: framework for distributed processing of large data sets (Java and, via Streaming, e.g. JavaScript, Python, Perl, Ruby)
HBase: scalable, distributed data store with a flexible schema, random read/write access and fast scans
Pig/Hive: higher-level abstractions on top of MapReduce (simple data manipulation languages)
Apache Hadoop Ecosystem tools as RDF triple stores
SHARD [3]: a Hadoop-backed RDF triple store
stores triples in flat files in HDFS
data cannot be modified randomly
less efficient for queries that require the inspection of only a small number of triples
PigSPARQL [6]: translates SPARQL queries to Pig Latin programs and runs them on a Hadoop cluster
stores RDF triples with the same predicate in separate flat files in HDFS
H2RDF [5]: an RDF store that combines MapReduce with HBase
stores triples in HBase using three flat-wide tables
Jena-HBase [4]: an HBase-backed RDF triple store
provides six different pluggable HBase storage layouts
HBase as storage layer for RDF triples
Storing RDF triples in Apache HBase has several advantages
flexible data model: columns can be dynamically added and removed; multiple versions of data can be kept in a particular cell; data is serialized to a byte array
random read and write: more suitable for semi-structured RDF data than HDFS, where files cannot be modified randomly and usually a whole file must be read sequentially to find a subset of records
availability of many clients
interactive clients: native Java API, REST or Apache Thrift
batch clients: MapReduce (Java), Pig (Pig Latin) and Hive (HiveQL)
automatically sorted records: quick lookups and partial scans; joins as fast (linear) merge-joins
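Why sorted records make partial scans cheap can be shown with a small in-memory simulation. The sketch below does not use the HBase client API; it only mimics a table whose row keys are kept in sorted order, and the "subject|predicate|object" key format is an assumption made for this example.

```python
import bisect

# Row keys kept in sorted order, as in an HBase table.
# The "subject|predicate|object" key format is hypothetical.
keys = sorted([
    "doc:1|cites|doc:2",
    "doc:1|hasTitle|Some title",
    "doc:2|hasTitle|Other title",
    "person:7|hasName|J. Smith",
])

def prefix_scan(prefix):
    """Return all row keys starting with `prefix` using two binary searches."""
    start = bisect.bisect_left(keys, prefix)
    # "\uffff" sorts after every character used in these keys.
    stop = bisect.bisect_left(keys, prefix + "\uffff")
    return keys[start:stop]

doc1_rows = prefix_scan("doc:1|")
```

Only the rows inside the key range are touched, which is the property a partial scan over a sorted table relies on.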
Exemplary HBase schema - Flat-wide layout
Advantages
no prior knowledge about data is required
colocation of all information about a resource within a single row
support of multi-valued properties
support of reified statements (statements about statements)
Disadvantages
unlimited number of columns
increased storage space
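The schema figure for this layout is not reproduced in the text, so the following is only a sketch under a common assumption: the subject is the row key, each predicate becomes a column, and the cell holds the list of objects. The sample identifiers are hypothetical.

```python
from collections import defaultdict

def to_flat_wide(triples):
    """Group triples into rows: row key = subject, column = predicate."""
    table = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        # A list per cell keeps multi-valued properties together.
        table[subject][predicate].append(obj)
    return table

# Hypothetical sample data.
triples = [
    ("doc:1", "hasTitle", "Some title"),
    ("doc:1", "hasContributor", "person:7"),
    ("doc:1", "hasContributor", "person:8"),
    ("doc:2", "hasTitle", "Other title"),
]
table = to_flat_wide(triples)
row = table["doc:1"]  # everything known about doc:1 in a single lookup
```

The set of columns grows with every new predicate, which corresponds to the "unlimited number of columns" drawback listed above.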
Exemplary HBase schema - Vertically Partitioned layout [1]
Advantages
support of multi-valued properties
support of reified statements (statements about statements)
storage space savings when compared to the previous layout
first-step (predicate-bound) pairwise joins as fast merge-joins
Disadvantages
increased number of joins
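A rough sketch of the idea, assuming one table per predicate that holds (subject, object) pairs sorted by subject, as in vertical partitioning [1]. Because both inputs are sorted on the join key, a predicate-bound join is a single linear merge. The function names and the sample data are hypothetical.

```python
from collections import defaultdict

def partition_by_predicate(triples):
    """Build one subject-sorted (subject, object) table per predicate."""
    tables = defaultdict(list)
    for subject, predicate, obj in triples:
        tables[predicate].append((subject, obj))
    for pairs in tables.values():
        pairs.sort()
    return tables

def merge_join(left, right):
    """Join two subject-sorted tables on subject in one linear pass."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        ls, rs = left[i][0], right[j][0]
        if ls < rs:
            i += 1
        elif ls > rs:
            j += 1
        else:
            # Emit every right-hand row that shares this subject.
            k = j
            while k < len(right) and right[k][0] == ls:
                result.append((ls, left[i][1], right[k][1]))
                k += 1
            i += 1
    return result

# Hypothetical sample data.
tables = partition_by_predicate([
    ("doc:1", "hasTitle", "T1"),
    ("doc:2", "hasTitle", "T2"),
    ("doc:1", "hasContributor", "person:7"),
    ("doc:1", "hasContributor", "person:8"),
    ("doc:3", "hasContributor", "person:9"),
])
joined = merge_join(tables["hasTitle"], tables["hasContributor"])
```

Reassembling a full record needs one such join per additional predicate, which is the "increased number of joins" cost.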
Exemplary HBase schema - Hexastore layout [2]
Advantages
support of multi-valued properties
support of reified statements (statements about statements)
first-step pairwise joins as fast merge-joins
Disadvantages
increased number of joins
increased storage space
more complicated update operation
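The hexastore idea [2] of indexing every triple under all six orderings of subject, predicate and object can be sketched as follows. This is an in-memory illustration with hypothetical names, not the HBase schema from the slide; the lookup uses a linear filter for brevity where a real store would use a range scan over the sorted index.

```python
from itertools import permutations

def build_hexastore(triples):
    """Index each triple under all six orderings of (s, p, o)."""
    indexes = {}
    for order in permutations("spo"):
        entries = []
        for s, p, o in triples:
            parts = {"s": s, "p": p, "o": o}
            entries.append(tuple(parts[c] for c in order))
        indexes["".join(order)] = sorted(entries)
    return indexes

def lookup(indexes, order, prefix):
    """Return entries of the `order` index that start with `prefix`."""
    n = len(prefix)
    return [entry for entry in indexes[order] if entry[:n] == prefix]

# Hypothetical sample data.
triples = [
    ("doc:1", "cites", "doc:2"),
    ("doc:3", "cites", "doc:2"),
    ("doc:1", "hasTitle", "Some title"),
]
hexastore = build_hexastore(triples)
# "Which documents cite doc:2?" is a prefix lookup in the p-o-s index.
citing = lookup(hexastore, "pos", ("cites", "doc:2"))
```

Every triple is stored six times, and an update has to touch all six indexes, which matches the storage and update drawbacks listed above.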
HBase schema - other layouts
Some derivative and hybrid layouts exist that combine the advantages of the original layouts
a combination of the vertically partitioned and the hexastore layouts [4]
a combination of the flat-wide and the vertically partitioned layouts [4]
Challenges
a large number of join operations
relatively expensive
practically cannot be avoided (at least for more complex queries)
but specialized join techniques can be used, e.g. multi-join, merge-sort join, replicated join, skewed join
lack of native support for cross-row atomicity (e.g. in the form of transactions)
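Of the join techniques named above, the replicated join is the simplest to sketch: the smaller relation is assumed to fit in memory and is copied to every worker, so the larger relation can be streamed past it without a shuffle. The function and the sample records below are hypothetical and run locally rather than on a cluster.

```python
def replicated_join(large, small):
    """Join (key, value) records; `small` is assumed to fit in memory."""
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    # Stream the large relation and probe the in-memory table.
    for key, value in large:
        for match in lookup.get(key, []):
            yield key, value, match

# Hypothetical sample data: (person, document) and (person, name) records.
contributions = [
    ("person:7", "doc:1"),
    ("person:8", "doc:1"),
    ("person:7", "doc:2"),
    ("person:9", "doc:3"),
]
names = [("person:7", "J. Smith"), ("person:8", "A. Jones")]
joined = list(replicated_join(contributions, names))
```

The technique only pays off when one side is small; for two large relations a merge or skewed join is the more likely choice.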
Possible performance optimization techniques
property tables: properties that are often queried together are stored in the same record for quick access [8, 9]
materialized path expressions: precalculation and materialization of the most commonly used paths through an RDF graph [1, 2]
graph-oriented partitioning scheme [7]
take advantage of the spatial locality inherent in graph pattern matching
higher replication of data that is on the border of any particular partition (however, problematic for a graph that is modified)
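As an illustration of materialized path expressions, the sketch below precomputes a two-step path (document to contributor to affiliation) and stores it as a direct edge, so that a later query needs no join. The predicate names, including the derived hasContributorAffiliation, are hypothetical.

```python
def materialize_path(triples, first, second, new_predicate):
    """Precompute s -first-> x -second-> o as direct (s, new_predicate, o) triples."""
    targets = {}
    for s, p, o in triples:
        if p == second:
            targets.setdefault(s, []).append(o)
    return [
        (s, new_predicate, target)
        for s, p, x in triples
        if p == first
        for target in targets.get(x, [])
    ]

# Hypothetical sample data.
triples = [
    ("doc:1", "hasContributor", "person:7"),
    ("doc:1", "hasContributor", "person:8"),
    ("person:7", "hasAffiliation", "org:1"),
    ("person:8", "hasAffiliation", "org:2"),
    ("doc:2", "hasContributor", "person:7"),
]
shortcuts = materialize_path(
    triples, "hasContributor", "hasAffiliation", "hasContributorAffiliation"
)
```

The derived triples must be refreshed whenever either underlying edge changes, which is the usual price of materialization.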
The ways of processing data from HBase
Various tools are integrated with HBase and can read data from and write data to HBase tables
Java MapReduce
possibility to use our legacy Java code in map and reduce methods
delivers better performance than Apache Pig
Apache Pig
provides common data operations (e.g. filters, unions, joins, ordering) and nested types (e.g. tuples, bags, maps)
supports multiple specialized join implementations
possibility to run MapReduce jobs directly from Pig Latin scripts
can be embedded in Python code
Interactive clients (e.g. Java API, REST or Apache Thrift)
interactive access to a relatively small subset of our data by sending API calls on demand, e.g. from a web-based client
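The shape of a batch job over triples can be sketched without a cluster. The example below simulates the map, shuffle/sort and reduce phases locally in Python to count documents per contributor; it does not read from HBase, and the mapper, reducer and predicate name are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def mapper(triple):
    """Emit (contributor, 1) for every hasContributor triple."""
    subject, predicate, obj = triple
    if predicate == "hasContributor":
        yield obj, 1

def reducer(key, values):
    """Sum the counts emitted for one contributor."""
    yield key, sum(values)

def run_job(records):
    """Run map, sort (standing in for shuffle) and reduce in one process."""
    mapped = [pair for record in records for pair in mapper(record)]
    mapped.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output

# Hypothetical sample data.
triples = [
    ("doc:1", "hasContributor", "person:7"),
    ("doc:1", "hasTitle", "Some title"),
    ("doc:2", "hasContributor", "person:7"),
    ("doc:2", "hasContributor", "person:8"),
]
counts = run_job(triples)
```

On a real cluster the same mapper and reducer logic would be distributed across nodes, with the framework performing the sort and grouping between the two phases.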
Case study: author name disambiguation algorithm
The most complex algorithm that we have run over Apache HBase so far is the author name disambiguation algorithm.
Thanks!
More information about CeON: http://ceon.pl/en/research
© 2012 Adam Kawa. This document is available under the Creative Commons Attribution 3.0 Poland license.
The license text is available at: http://creativecommons.org/licenses/by/3.0/pl/
[1] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable Semantic Web Data Management Using Vertical Partitioning. In VLDB, pages 411–422, 2007.
[2] C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple Indexing for Semantic Web Data Management. In VLDB, pages 1008–1019, 2008.
[3] K. Rohloff and R. Schantz. High-performance, massively scalable distributed systems using the MapReduce software framework: The SHARD triple-store. International Workshop on Programming Support Innovations for Emerging Distributed Applications, 2010.
[4] V. Khadilkar, M. Kantarcioglu, P. Castagna, and B. Thuraisingham. Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store. Technical report, 2012. http://www.utdallas.edu/ vvk072000/Research/Jena-HBase-Ext/tech-report.pdf
[5] N. Papailiou, I. Konstantinou, D. Tsoumakos, and N. Koziris. H2RDF: Adaptive Query Processing on RDF Data in the Cloud. In Proceedings of the 21st International Conference on World Wide Web (WWW demo track), Lyon, France, 2012.
[6] A. Schatzle, M. Przyjaciel-Zablocki, and G. Lausen. PigSPARQL: Mapping SPARQL to Pig Latin. 3rd International Workshop on Semantic Web Information Management (SWIM 2011), in conjunction with the 2011 ACM International Conference on Management of Data (SIGMOD 2011), Athens, Greece.
[7] J. Huang, D. Abadi, and K. Ren. Scalable SPARQL Querying of Large RDF Graphs. VLDB Endowment, Volume 4 (VLDB 2011).
[8] K. Wilkinson, C. Sayers, H. Kuno, and D. Reynolds. Efficient RDF Storage and Retrieval in Jena2. In SWDB, pages 131–150.
[9] K. Wilkinson. Jena property table implementation. In SSWS, 2006.