Agenda
● Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop (FDW/Big data wrappers)
● PXF - Design & Architecture
● Demo
● Benefits of using PXF with FDW
● Q&A
Agenda
➢ Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop (FDW/Big data wrappers)
● PXF - Design & Architecture
● Demo
● Benefits of using PXF with FDW
● Q&A
What is Hadoop/Big Data
Apache Hadoop is an open source framework for distributed processing of large data sets across clusters of computers.
● Commodity hardware
● Scale out
● Fault tolerance
● Support for multiple file formats
[Diagram: Hadoop ecosystem layers — the Hadoop Distributed File System (HDFS) as the clustered file system; MapReduce and HBase for distributed data processing; Hive and Pig as top-level abstractions; ETL tools, BI tools, and RDBMS as top-level interfaces]
Agenda
● Introduction to Hadoop Ecosystem
➢ Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop (FDW/Big data wrappers)
● PXF - Design & Architecture
● Demo
● Benefits of using PXF with FDW
● Q&A
Motivations: SQL on Hadoop
[Diagram: an RDBMS on one side, the various formats and storages supported on HDFS on the other, and a question mark for the bridge between them]
● ANSI SQL
● Cost-based optimizer
● Transactions
● Indexes
Foreign Tables!
Agenda
● Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
➢ Current state of SQL on Hadoop - FDW/Big data wrappers
● PXF - Design & Architecture
● Demo
● Benefits of using PXF with FDW
● Q&A
Foreign Data Wrappers (FDW)
Foreign tables and foreign data wrappers are the Postgres way to read external data.
1. Create FDW (compiled C functions in the handler)
2. Declare the extension (FDW)
3. Create server that uses the wrapper
4. Create table that uses the server
CREATE FOREIGN DATA WRAPPER hadoop_fdw
  HANDLER hadoop_fdw_handler
  NO VALIDATOR;

CREATE EXTENSION hadoop_fdw;

CREATE SERVER hadoop_server
  FOREIGN DATA WRAPPER hadoop_fdw
  OPTIONS (address '127.0.0.1', port '10000');

CREATE FOREIGN TABLE retail_history (
  name  text,
  price double precision
) SERVER hadoop_server
  OPTIONS (table 'example.retail_history');
Foreign Data Wrappers - Implementation
Creating a new foreign data wrapper consists of implementing the FDW API as C-language functions.
Scanning a foreign table requires implementation of the following:
● GetForeignRelSize - Estimate of the relation size
● GetForeignPaths - Get access paths for the foreign data
● GetForeignPlan - Plan the foreign paths of this table
● BeginForeignScan - Start scan. Open connections, etc
● IterateForeignScan - Perform scan and return tuples
● EndForeignScan - End scan. Close connection, etc
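The scan callbacks above cooperate as a simple open/iterate/close protocol. The real API is a set of C callbacks registered by the wrapper's handler function; as an illustration only, here is a hypothetical Python mock of that lifecycle (names and shapes are ours, not the C API's):

```python
# Hypothetical mock of the FDW scan lifecycle. In a real wrapper these are
# C callbacks invoked by the Postgres executor; this sketch only shows the
# order in which they fire and what each one is responsible for.

class MockHadoopFdw:
    def __init__(self, rows):
        self.rows = rows          # stand-in for the external data source
        self.connection = None
        self._iter = None

    def get_foreign_rel_size(self):
        # GetForeignRelSize: estimate of the relation size for the planner
        return len(self.rows)

    def begin_foreign_scan(self):
        # BeginForeignScan: open connections, initialize scan state
        self.connection = "open"
        self._iter = iter(self.rows)

    def iterate_foreign_scan(self):
        # IterateForeignScan: return one tuple per call, None when exhausted
        return next(self._iter, None)

    def end_foreign_scan(self):
        # EndForeignScan: close connections, release resources
        self.connection = None

def scan(fdw):
    """Drive a full scan the way the executor would."""
    fdw.begin_foreign_scan()
    tuples = []
    while (t := fdw.iterate_foreign_scan()) is not None:
        tuples.append(t)
    fdw.end_foreign_scan()
    return tuples

rows = [("widget", 9.99), ("gadget", 24.50)]
print(scan(MockHadoopFdw(rows)))  # all external rows, in scan order
```

GetForeignPaths and GetForeignPlan have no runtime analogue here: they run at plan time, before the scan begins.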
Big Data Wrappers (Multicorn, BigSQL EnterpriseDB)
1. Create a Hive table corresponding to the HDFS file/HBase table
2. Create extension, server & foreign table with the schema and necessary options
3. Query the foreign table
4. The query connects to HiveServer via a Thrift client
5. The Hive server executes MapReduce jobs
6. Results are mapped to a Postgres table
Agenda
● Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop - FDW/Big data wrappers
➢ PXF - Design & Architecture
● Demo
● Benefits of using PXF with FDW
● Q&A
HAWQ Extension Framework - PXF
● HAWQ is an MPP SQL engine on HDFS (evolved from Greenplum Database)
● PXF is an extensible framework that allows HAWQ to query external data.
● PXF includes built-in connectors for accessing data in HDFS files, Hive & HBase tables.
● Users can create custom connectors to other parallel data stores or processing engines.
PXF - Communication
[Diagram: for external tables, HAWQ segments talk HTTP (port 51200) to the PXF webapp, a REST API running in Apache Tomcat; the webapp reaches HDFS, Hive, and HBase through their Java APIs. Native tables are read directly by the segments via libhdfs3, written in C]
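Since the segment-to-PXF protocol is plain HTTP, a request can be assembled with nothing but string formatting. The endpoint path and query parameters below are hypothetical (the REST layout is PXF-version specific); only the host, port 51200, and the existence of a getFragments() call come from this deck:

```python
# Illustrative sketch of addressing the PXF webapp over HTTP.
# ASSUMPTION: the "/pxf/v14/Fragmenter/getFragments" path and the
# path/profile query parameters are made up for this example.
from urllib.parse import urlencode

PXF_PORT = 51200  # PXF webapp port from the deck

def fragments_url(namenode_host, data_path, profile):
    query = urlencode({"path": data_path, "profile": profile})
    return (f"http://{namenode_host}:{PXF_PORT}"
            f"/pxf/v14/Fragmenter/getFragments?{query}")

print(fragments_url("127.0.0.1", "example.retail_history", "Hive"))
```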
Architecture - Deployment
[Diagram: the HAWQ master node is co-located with the NameNode (NN), the HBase Master, and a PXF instance; each DataNode (DN1-DN4) runs a PXF instance alongside a HAWQ segment and an HBase RegionServer]
● PXF needs to be installed on all DataNodes
● PXF is recommended to be installed on the NameNode
Design - Components (PXF)
● Fragmenter - Gets the locations of fragments for an external table; implicitly provides stats to the query optimizer
● Accessor - Understands and reads/writes the fragment, returning records
● Resolver - Converts records to a HAWQ-consumable format (data types)
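The three roles compose into a pipeline: fragments are enumerated, each fragment is read into raw records, and each record is resolved into typed tuples. Real PXF plugins are Java classes; the following Python analogue is purely illustrative (class names match the roles, but every signature here is an assumption):

```python
# Hypothetical Python analogue of the three PXF plugin roles.

class Fragmenter:
    """Return the locations of fragments for an external table."""
    def __init__(self, path, block_hosts):
        self.path, self.block_hosts = path, block_hosts
    def get_fragments(self):
        return [{"source": self.path, "index": i, "hosts": hosts}
                for i, hosts in enumerate(self.block_hosts)]

class Accessor:
    """Read a single fragment and return its raw records."""
    def __init__(self, storage):
        self.storage = storage    # stand-in for HDFS/Hive/HBase
    def read(self, fragment):
        return self.storage[fragment["index"]]

class Resolver:
    """Convert raw records into typed tuples the database can consume."""
    def resolve(self, record):
        name, price = record.split(",")
        return (name, float(price))

# Pipeline: Fragmenter -> Accessor -> Resolver
storage = {0: ["widget,9.99"], 1: ["gadget,24.50"]}
frags = Fragmenter("/data/retail", [["dn1"], ["dn2"]]).get_fragments()
acc, res = Accessor(storage), Resolver()
rows = [res.resolve(r) for f in frags for r in acc.read(f)]
print(rows)  # [('widget', 9.99), ('gadget', 24.5)]
```

Keeping the three concerns separate is what makes the framework extensible: a custom data store needs only a new Fragmenter/Accessor/Resolver triple, not changes to the database side.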
DDL Comparison

FDW:
CREATE EXTENSION hadoop_fdw;

CREATE SERVER hadoop_server
  FOREIGN DATA WRAPPER hadoop_fdw
  OPTIONS (address '127.0.0.1', port '10000');

CREATE FOREIGN TABLE retail_history (
  name  text,
  price double precision
) SERVER hadoop_server
  OPTIONS (table 'example.retail_history');

PXF:
CREATE PROTOCOL PXF;

CREATE EXTERNAL TABLE retail_history (
  name  text,
  price double precision
) LOCATION('pxf://127.0.0.1:51200/example.retail_history?PROFILE=HIVE')
  FORMAT 'CUSTOM' (formatter='pxfwritable_import');

* Corresponding clauses play similar roles (shown in matching colors on the slide)
Architecture - Data Flow: Query (HDFS)

select * from ext_table, with LOCATION 'pxf://<namenode>:<port>/path/to/data'

[Diagram, reconstructed as steps between the HAWQ master node (co-located with the NN and a PXF instance) and the segments (DN1-DN3, each with PXF and a HAWQ segment):]
1. The master calls getFragments() over REST (handled by the PXF Fragmenter)
2. PXF returns the fragments as JSON
3. The master computes the split mapping (fragment -> segment)
4. The query is dispatched to segments 1, 2, 3… over the interconnect
5. Each segment calls Read() over REST (handled by the PXF Accessor and Resolver)
6. Records are streamed back to the segments
7. The segments return records to the master
8. The master returns the query result
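The split mapping in the flow above assigns every fragment to exactly one segment. As a sketch only: the heuristic below prefers a segment co-located with the fragment's HDFS block and falls back to round-robin. This locality preference is an assumption for illustration; the real assignment logic lives inside HAWQ.

```python
# Hypothetical sketch of the fragment -> segment split mapping.
# ASSUMPTION: locality-first with round-robin fallback; the actual
# HAWQ planner logic is not reproduced here.

def split_mapping(fragments, segment_hosts):
    """Map each fragment index to a segment index."""
    mapping = {}
    for i, frag in enumerate(fragments):
        local = [s for s, host in enumerate(segment_hosts)
                 if host in frag["hosts"]]
        # prefer a segment on the same host as the block, else round-robin
        mapping[i] = local[0] if local else i % len(segment_hosts)
    return mapping

fragments = [{"hosts": ["dn1"]}, {"hosts": ["dn3"]}, {"hosts": ["dn9"]}]
segment_hosts = ["dn1", "dn2", "dn3"]
print(split_mapping(fragments, segment_hosts))  # {0: 0, 1: 2, 2: 2}
```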
PXF Plugins, Profiles
● Built-in with HAWQ (Profiles)
  ● HDFS: HDFSTextSimple (R/W), HDFSTextMulti (R), Avro (R)
  ● Hive (R): Hive, HiveRC, HiveText
  ● HBase (R): HBase
● Community (https://bintray.com/big-data/maven/pxf-plugins/view)
  ● JSON HAWQ-178
  ● Cassandra
  ● Accumulo
  ● ...
Agenda
● Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop - FDW/Big data wrappers
● PXF - Design & Architecture
➢ Demo
● Benefits of using PXF with FDW
● Q&A
PXF as Big Data Wrapper Abstraction
● Implement FDW callback functions that will interact with PXF.
● Use the enhanced libcurl library - libchurl
[Diagram: the FDW talks HTTP (port 51200) to the PXF webapp, a REST API running in Apache Tomcat, which reaches the underlying stores through their Java APIs]
Agenda
● Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop - FDW/Big data wrappers
● PXF - Design & Architecture
● Demo
➢ Benefits of using PXF with FDW
● Q&A
Benefits of using PXF with FDW
● FDW isolated from the underlying Hadoop ecosystem APIs
● Direct access to HDFS data
● Access Hive data without the overhead of the underlying execution framework
● Access HBase data without a mapped Hive table
● Supports single-node & parallel execution
● Extensibility/ease of building extensions
● Support for multiple versions of underlying distributions
● Built-in filter push-down and support for stats
Resources
● Github
https://github.com/apache/incubator-hawq/tree/master/pxf
● Documentation
http://hawq.docs.pivotal.io/docs-hawq/topics/PivotalExtensionFrameworkPXF.html
● Wiki
https://cwiki.apache.org/confluence/display/HAWQ/PXF