HadoopDB in Action: Building Real World Applications
Tilani Gunawardena
Introduction
Architecture and Design
Example Application
Demonstration Scenario
Road Map
Managing and analysing massive data
◦ Provides high performance
◦ Scales over clusters of thousands of heterogeneous machines
◦ Versatile - adapts to analytical queries of varying complexity
How does one build real world applications with HadoopDB?
Introduction
Database Connector - connects Hadoop with the single-node database systems.
Data Loader - partitions data and manages parallel loading of data into the database systems.
Catalog - tracks locations of different data chunks, including those replicated across multiple nodes.
SQL-MapReduce-SQL (SMS) planner - extends Hive to provide a SQL interface to HadoopDB.
Architecture And Design
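The division of labour between the Data Loader and the Catalog can be illustrated with a small sketch. This is not HadoopDB's actual code or API: the node names, the replication factor, and the helper functions `chunk_of`, `build_catalog` and `load` are all hypothetical, and hash partitioning with round-robin replica placement is only one plausible scheme.

```python
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names
REPLICAS = 2                            # hypothetical replication factor


def chunk_of(key, num_chunks):
    """Map a partitioning key to a chunk id using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_chunks


def build_catalog(num_chunks):
    """Catalog role: record which nodes hold each chunk (round-robin replicas)."""
    return {
        chunk: [NODES[(chunk + r) % len(NODES)] for r in range(REPLICAS)]
        for chunk in range(num_chunks)
    }


def load(rows, key_field, num_chunks):
    """Loader role: group rows by chunk before bulk-loading each chunk."""
    chunks = defaultdict(list)
    for row in rows:
        chunks[chunk_of(str(row[key_field]), num_chunks)].append(row)
    return dict(chunks)


rows = [{"order_id": i, "total": 10 * i} for i in range(1, 9)]
catalog = build_catalog(num_chunks=4)
chunks = load(rows, "order_id", num_chunks=4)

# Show where each non-empty chunk lives and which rows it contains.
for chunk_id, chunk_rows in sorted(chunks.items()):
    print(chunk_id, catalog[chunk_id], [r["order_id"] for r in chunk_rows])
```

Because every chunk is listed on more than one node, a query planner consulting such a catalog could fall back to a replica when a node is unavailable.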
Supports any JDBC-compliant database server as an underlying DBMS layer
Applications built on top of HadoopDB generally use the 3-tier architecture
◦ data tier
◦ business logic tier
◦ presentation tier
From the application's perspective, HadoopDB is a black box
HadoopDB
A semantic web/biological data analysis application.
A business data warehousing application.
Example Application
Semantic Web - an effort by the W3C to enable integration and sharing of data across different applications
RDF - a directed, labeled graph data format for representing information on the Web
SPARQL - an RDF query language
Semantic Web - Biological Data Analysis
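To make the RDF model concrete, the sketch below stores a few triples in a single relational table and answers one triple pattern with plain SQL. The `ex:` identifiers and the data are hypothetical, the one-table layout is an assumption chosen for brevity, and SQLite merely stands in for a single-node database; none of this is taken from the slides.

```python
import sqlite3

# Hypothetical (subject, predicate, object) triples, i.e. labeled graph edges.
triples = [
    ("ex:protein1", "ex:organism", "ex:Human"),
    ("ex:protein1", "ex:name", "Alpha"),
    ("ex:protein2", "ex:organism", "ex:Mouse"),
    ("ex:protein3", "ex:organism", "ex:Human"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)

# The triple pattern (?s ex:organism ex:Human) becomes a simple selection.
human = [
    row[0]
    for row in conn.execute(
        "SELECT subject FROM triples "
        "WHERE predicate = ? AND object = ? ORDER BY subject",
        ("ex:organism", "ex:Human"),
    )
]
print(human)  # ['ex:protein1', 'ex:protein3']
```

Patterns that share a variable would turn into self-joins on this table, which is one reason the relational layout chosen for RDF data matters for query performance.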
Find all proteins whose existence in the 'Human' organism is uncertain
SPARQL query :
Demonstrate
◦ how the data administrator should prepare the dataset.
Analyst - is shielded from the complexity of the actual implementation of the RDF storage layer.
Natural target application for HadoopDB.
Common business data warehousing workloads are read-mostly and involve analytical queries over a complex schema.
To achieve good query performance, the dataset requires significant preparation through data partitioning and replication to optimize for join queries.
Data & Queries - TPC-H benchmark
Business Data Warehousing
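Why partitioning helps join queries can be seen in a minimal sketch: if two tables are partitioned on the same join key, matching rows land in the same partition and each node can join locally, with no data shuffled between nodes. The table contents, the simplified column names, and the partition count below are hypothetical; the slides do not specify the partitioning scheme used in the demonstration.

```python
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical number of nodes


def partition(rows, key):
    """Assign each row to a partition by its join key."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key] % NUM_PARTITIONS].append(row)
    return parts


# Simplified, made-up customer and order rows.
customers = [
    {"custkey": k, "segment": "BUILDING" if k % 2 else "AUTOMOBILE"}
    for k in range(1, 7)
]
orders = [{"orderkey": 100 + i, "custkey": (i % 6) + 1} for i in range(12)]

# Both tables are partitioned on custkey, so they are co-partitioned.
cust_parts = partition(customers, "custkey")
order_parts = partition(orders, "custkey")


def local_join(p):
    """Join orders with customers using only the rows in partition p."""
    by_key = {c["custkey"]: c for c in cust_parts[p]}
    return [
        (o["orderkey"], by_key[o["custkey"]]["segment"])
        for o in order_parts[p]
    ]


joined = [pair for p in range(NUM_PARTITIONS) for pair in local_join(p)]
print(len(joined))  # 12: every order found its customer in its own partition
```

Small tables that cannot be co-partitioned this way are natural candidates for replication to every node instead.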
Find the 10 highest-revenue unshipped orders
Query:
The audience is invited to query both data sets through HadoopDB
Data sets are located in a remote cluster
Multi-user interaction - two client machines connect to the cluster
Demonstration scenario
User selects a dataset
Semantic Web - Biological Data Analysis
◦ An animation of the behind-the-scenes data preparation and loading is presented
◦ Details on the tools used for data conversion from RDF to relational form
Business Data Warehousing
◦ The animation provides details on the partitioning scheme, the interaction between the loader and catalog components, and a summary of the configuration parameters
User selects and parametrizes a query to execute
◦ User can then monitor the progress of query execution
In addition, the demonstration shows HadoopDB's fault tolerance by introducing a node failure.
For a subset of the predefined queries, as the query executes in the background, an animation of the flow of data and control through the HadoopDB system is simultaneously presented, highlighting which parts of the query execution are run in parallel.
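The split between parallel and non-parallel parts of a query can be sketched as follows: each node runs the same SQL aggregate over its local data (the parallelizable part), and a reduce step merges the partial results. This is only an illustration of the general idea, not the plan HadoopDB generates; the `lineitem` table, its rows, and the helper functions are hypothetical, SQLite stands in for the per-node database, and the nodes are visited sequentially here for simplicity.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create an in-memory database standing in for one node's local DBMS."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lineitem (orderkey INTEGER, price REAL)")
    conn.executemany("INSERT INTO lineitem VALUES (?, ?)", rows)
    return conn


# Order 2 has rows on both nodes, so a merge step is required.
nodes = [
    make_node([(1, 10.0), (2, 5.0), (1, 2.5)]),
    make_node([(2, 20.0), (3, 7.0)]),
]

LOCAL_SQL = "SELECT orderkey, SUM(price) FROM lineitem GROUP BY orderkey"


def map_phase(conn):
    """Pushed-down SQL: runs independently on each node."""
    return conn.execute(LOCAL_SQL).fetchall()


def reduce_phase(partials):
    """Merge per-node partial sums into global totals."""
    totals = defaultdict(float)
    for rows in partials:
        for orderkey, revenue in rows:
            totals[orderkey] += revenue
    return dict(totals)


totals = reduce_phase(map_phase(conn) for conn in nodes)
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top)  # [(2, 25.0), (1, 12.5)]
```

If the data were partitioned on `orderkey`, each order's rows would sit on a single node and the merge step would reduce to concatenating the per-node results.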
Thank You!