Apache Hadoop
• A framework for Data Intensive and Distributed Applications.
• Inspired by Google’s MapReduce and Google File System Papers
[Diagram: HDFS layer – a Name Node coordinating Data Nodes 1-3 (data storage); MapReduce layer – a Job Tracker coordinating Task Trackers 1-3.]
Data Storage
Hadoop:
• Data Archival
• Open Data Formats
• Healthy Ecosystem
Data storage is costly. Deleting data may be costlier!
Data Analysis
• Structured Data Stores
• Semi-Structured Data Stores
• Ad-hoc Structured Data
• Unstructured Data
Introducing Sqoop
• Easily Import Data into Hadoop
• Generate Datatypes for use in MapReduce Applications
• Integrate with Hive and HBase
• Easily Export Data from Hadoop
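To make this concrete, here is a minimal import sketch; the JDBC URL, credentials, and EMPLOYEES table are hypothetical placeholders, not values from the slides:

$ # Import a relational table into HDFS as delimited text files
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --username dbuser --password secret --table EMPLOYEES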
Sqoop
Motivation
Without Sqoop
• Requires direct access to data from within Hadoop
• Loss of efficiency due to network overhead
• Impedance mismatch: MapReduce requires fast data access.
• Can overwhelm external systems
Using Sqoop
• Data Locality
• Efficient operation
• Integration with Hadoop-based systems – Hive, HBase
• Optimized transfer speeds based on native tools
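As one example of the native-tools path, Sqoop's --direct mode hands the transfer to the database's own bulk utilities (e.g. mysqldump for MySQL); the connection details here are hypothetical:

$ # Bypass generic JDBC and use MySQL's native dump tool
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --table ORDERS --direct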
Key Features
• Command Line Interface
– Scriptable
• Integrates with Hadoop Ecosystem
– Hive, HBase, Oozie
• Automatic code generation
– Use your data in MapReduce workflows
• Connector-based architecture
– Support for connector-specific optimizations
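A sketch of the scriptable CLI driving the Hive integration; the connection string and table name are again placeholders:

$ # Import a table and create a matching Hive table in one command
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --table EMPLOYEES --hive-import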
Design Overview
[Diagram: Sqoop (1) performs a metadata lookup against the external datastore, (2) generates the SqoopRecord code, and (3) submits the MapReduce job; parallel Map tasks then move records between the datastore and HDFS.]
Design Overview
Map-Only Implementation
• InputFormat:
– Selects Input Source
– Defines Splits
– Creates Record Readers
• OutputFormat:
– Selects Destination
– Creates Record Writers
[Diagram: the InputFormat defines Splits, each read by a RecordReader feeding a Map task; map output passes through the OutputFormat to the destination.]
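To relate this to the CLI, a hedged example: the split column and mapper count below are illustrative choices showing how splits become parallel map tasks:

$ # Partition the table on its key column into 4 splits, one mapper each
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --table ORDERS --split-by order_id --num-mappers 4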
Metadata Management
• Sqoop Record
– Dynamically generated
– Independently packaged
• May be used without Sqoop
– Maintains type mapping
– Different Serial Formats
• Text
• Binary
• Avro Data File
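A sketch of both ideas: sqoop codegen emits the SqoopRecord class for standalone use, and --as-avrodatafile selects the Avro serial format (the connection and table names are placeholders):

$ # Generate the SqoopRecord class without running an import
$ sqoop codegen --connect jdbc:mysql://db.example.com/corp --table EMPLOYEES
$ # Import the same table as Avro data files
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --table EMPLOYEES --as-avrodatafile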
Import Operation
• Generate SqoopRecord
– Or use provided SqoopRecord
• Create Input Splits
• Spin up Mappers to consume splits
• Direct output to HDFS or HBase
– Compression and file type controlled by user input
• Populate Hive Metastore
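A sketch of the user-controlled compression and file type options; the codec and table name are assumptions:

$ # Write the import as gzip-compressed SequenceFiles
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --table EVENTS --as-sequencefile --compress \
    --compression-codec org.apache.hadoop.io.compress.GzipCodec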
Export Operation
• Generate SqoopRecord
– Or use provided SqoopRecord
• Spin up Mappers to consume input files
• Each Mapper writes straight to external store
– Optionally stage data before final export
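A hedged sketch of a staged export: the staging table must already exist with the same structure as the target, and all names here are hypothetical:

$ # Load rows into RESULTS_STAGE first, then move them to RESULTS
$ sqoop export --connect jdbc:mysql://db.example.com/corp \
    --table RESULTS --staging-table RESULTS_STAGE \
    --export-dir /user/hadoop/results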
Typical Workflow
• Data imported from external systems
– Periodic / Incremental imports for new data
• Hadoop Analytics Processing
– Hive / HBase tables
– MapReduce Processing
• Processed Data exported to external systems
– Periodic / Incremental exports for new data
• Workflow automation using Oozie
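For the periodic/incremental steps, Sqoop's incremental append mode is one option; the check column and last value below are illustrative:

$ # Import only rows with id greater than the previously seen maximum
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --table EVENTS --incremental append \
    --check-column id --last-value 42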
Connectors
• Drop-in Sqoop Extension
• Specializes in connectivity with a particular system
• Provides optimal data transfer mechanism
• Based on Connector Mechanism of Sqoop
– Varying degree of control
Couchbase Plugin
• Based on the Couchbase Tap Interface
• Allows import and export of the entire database, or of future key mutations
[Diagram: Couchbase and HDFS. (1) Data imported via the Tap mechanism, (2) Hadoop processing, (3) data exported back to Couchbase.]
Couchbase Import
$ sqoop import --connect http://localhost:8091/pools --table DUMP
$ sqoop import --connect http://localhost:8091/pools --table BACKFILL_5
$ sqoop export --connect http://localhost:8091/pools \
    --table DUMP --export-dir DUMP
• For Imports, table must be:
– DUMP: All keys currently in Couchbase
– BACKFILL_n: All key mutations for n minutes
• For Exports, the table option is ignored
• Specified --username maps to bucket
– By default set to the "default" bucket
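A sketch of the bucket mapping; the bucket name "users" is a hypothetical example:

$ # Import from the "users" bucket instead of the default one
$ sqoop import --connect http://localhost:8091/pools \
    --table DUMP --username users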
Thank You!
• Couchbase: www.couchbase.com
• Hadoop: hadoop.apache.org
• Sqoop: incubator.apache.org/projects/sqoop.html
• Cloudera: www.cloudera.com