Apache Hadoop
• A framework for Data Intensive and Distributed Applications.
• Inspired by Google’s MapReduce and Google File System Papers
[Diagram: HDFS layer – a Name Node coordinating Data Nodes 1-3 (data storage); MapReduce layer – a Job Tracker coordinating Task Trackers 1-3.]
Data Storage
Hadoop:
• Data Archival
• Open Data Formats
• Healthy Ecosystem
Data storage is costly. Deleting data may be costlier!
Data Analysis
• Structured Data Stores
• Semi-Structured Data Stores
• Ad-hoc Structured Data
• Unstructured Data
Introducing Sqoop
• Easily Import Data into Hadoop
• Generate Datatypes for use in MapReduce Applications
• Integrate with Hive and HBase
• Easily Export Data from Hadoop
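To make this concrete, here is a minimal import sketch; the JDBC URL, credentials, and EMPLOYEES table are hypothetical placeholders, not values from the slides:

$ # Import a relational table into HDFS as delimited text files
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --username dbuser --password secret --table EMPLOYEES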
Sqoop
Motivation
Without Sqoop
• Requires direct access to data from within Hadoop
• Loss of efficiency due to network overhead
• Impedance mismatch: MapReduce requires fast data access.
• Can overwhelm external systems
Using Sqoop
• Data Locality
• Efficient operation
• Integration with Hadoop-based systems – Hive, HBase
• Optimized transfer speeds based on native tools
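As one example of the native-tools path, Sqoop's --direct mode hands the transfer to the database's own bulk utilities (e.g. mysqldump for MySQL); the connection details here are hypothetical:

$ # Bypass generic JDBC and use MySQL's native dump tool
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --table ORDERS --direct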
Key Features
• Command Line Interface
– Scriptable
• Integrates with Hadoop Ecosystem
– Hive, HBase, Oozie
• Automatic code generation
– Use your data in MapReduce workflows
• Connector-based architecture
– Support for connector-specific optimizations
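A sketch of the scriptable CLI driving the Hive integration; the connection string and table name are again placeholders:

$ # Import a table and create a matching Hive table in one command
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --table EMPLOYEES --hive-import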
Design Overview
[Diagram: Sqoop (1) performs a metadata lookup against the external datastore, (2) generates the SqoopRecord code, and (3) submits the MapReduce job; parallel Map tasks then move records between the datastore and HDFS.]
Design Overview
Map-Only Implementation
• InputFormat:
– Selects Input Source
– Defines Splits
– Creates Record Readers
• OutputFormat:
– Selects Destination
– Creates Record Writers
[Diagram: the InputFormat defines Splits, each read by a RecordReader feeding a Map task; map output passes through the OutputFormat to the destination.]
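To relate this to the CLI, a hedged example: the split column and mapper count below are illustrative choices showing how splits become parallel map tasks:

$ # Partition the table on its key column into 4 splits, one mapper each
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --table ORDERS --split-by order_id --num-mappers 4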
Metadata Management
• Sqoop Record
– Dynamically generated
– Independently packaged
• May be used without Sqoop
– Maintains type mapping
– Different Serial Formats
• Text
• Binary
• Avro Data File
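A sketch of both ideas: sqoop codegen emits the SqoopRecord class for standalone use, and --as-avrodatafile selects the Avro serial format (the connection and table names are placeholders):

$ # Generate the SqoopRecord class without running an import
$ sqoop codegen --connect jdbc:mysql://db.example.com/corp --table EMPLOYEES
$ # Import the same table as Avro data files
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --table EMPLOYEES --as-avrodatafile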
Import Operation
• Generate SqoopRecord
– Or use provided SqoopRecord
• Create Input Splits
• Spin up Mappers to consume splits
• Direct output to HDFS or HBase
– Compression and file type controlled by user input
• Populate Hive Metastore
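A sketch of the user-controlled compression and file type options; the codec and table name are assumptions:

$ # Write the import as gzip-compressed SequenceFiles
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --table EVENTS --as-sequencefile --compress \
    --compression-codec org.apache.hadoop.io.compress.GzipCodec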
Export Operation
• Generate SqoopRecord
– Or use provided SqoopRecord
• Spin up Mappers to consume input files
• Each Mapper writes straight to external store
– Optionally stage data before final export
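A hedged sketch of a staged export: the staging table must already exist with the same structure as the target, and all names here are hypothetical:

$ # Load rows into RESULTS_STAGE first, then move them to RESULTS
$ sqoop export --connect jdbc:mysql://db.example.com/corp \
    --table RESULTS --staging-table RESULTS_STAGE \
    --export-dir /user/hadoop/results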
Typical Workflow
• Data imported from external systems
– Periodic / Incremental imports for new data
• Hadoop Analytics Processing
– Hive / HBase tables
– MapReduce Processing
• Processed Data exported to external systems
– Periodic / Incremental exports for new data
• Workflow automation using Oozie
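For the periodic/incremental steps, Sqoop's incremental append mode is one option; the check column and last value below are illustrative:

$ # Import only rows with id greater than the previously seen maximum
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --table EVENTS --incremental append \
    --check-column id --last-value 42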
Connectors
• Drop-in Sqoop Extension
• Specializes in connectivity with a particular system
• Provides optimal data transfer mechanism
• Based on Connector Mechanism of Sqoop
– Varying degree of control
Couchbase Plugin
• Based on the Couchbase Tap Interface
• Allows import and export of the entire database, or of future key mutations
[Diagram: Couchbase and HDFS. (1) Data imported via the Tap mechanism, (2) Hadoop processing, (3) data exported back to Couchbase.]
Couchbase Import
$ sqoop import --connect http://localhost:8091/pools --table DUMP
$ sqoop import --connect http://localhost:8091/pools --table BACKFILL_5
$ sqoop export --connect http://localhost:8091/pools \
    --table DUMP --export-dir DUMP
• For Imports, table must be:
– DUMP: All keys currently in Couchbase
– BACKFILL_n: All key mutations for n minutes
• For Exports, the table option is ignored
• Specified --username maps to bucket
– By default set to the "default" bucket
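A sketch of the bucket mapping; the bucket name "users" is a hypothetical example:

$ # Import from the "users" bucket instead of the default one
$ sqoop import --connect http://localhost:8091/pools \
    --table DUMP --username users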
Thank You!
• Couchbase: www.couchbase.com
• Hadoop: hadoop.apache.org
• Sqoop: incubator.apache.org/projects/sqoop.html
• Cloudera: www.cloudera.com