Copyright (c) 2014 Steve O'Hearn http://www.databasetraining.com
1. Understand why you may want to establish data connections between the Oracle RDBMS and Hadoop
2. Review various techniques and tools for establishing data connections between the Oracle RDBMS and Hadoop’s HDFS
3. Understand the purpose, benefits, and limitations of the various techniques and tools
What is the Oracle RDBMS? What is Hadoop?
Source: Oracle Database Concepts, 12c Release 1 (12.1)
- A framework of tools: open source, written in Java, developed as Apache Software Foundation projects
- Several tools: HDFS (storage) and MapReduce (analysis), plus HBase, Hive, Pig, Sqoop, Flume, and more
- Communication via network sockets
HDFS: Hadoop Distributed File System
- Stores text files and binary files
- Very large data blocks: 64MB minimum, 1GB or higher possible; typical is 128MB or 256MB
- Replication: 3 copies by default
- Namenode and Datanodes
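To make the block and replication numbers concrete, here is a back-of-the-envelope sketch (my own arithmetic, not from the slides; the class and method names are illustrative):

```java
// Ceiling division: how many HDFS blocks a file of a given size occupies,
// and how many physical block replicas the cluster stores at the
// default replication factor of 3.
public class HdfsBlockMath {
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long fileSize = 1024L * 1024 * 1024;   // a 1 GiB file
        long blockSize = 128L * 1024 * 1024;   // typical 128MB block
        long blocks = blockCount(fileSize, blockSize);
        // 8 blocks of the file, 24 replicas stored cluster-wide
        System.out.println(blocks + " blocks, " + (blocks * 3) + " replicas");
    }
}
```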
MapReduce: the analytical engine of Hadoop
- JobTracker and TaskTracker
- Interprets data at runtime instead of using predefined schemas
[Diagram: the Hadoop stack — Pig and Hive atop MapReduce and HBase, over the HDFS API and HDFS, which stores text files, binary files, and other formats. NOTE: There are file systems other than HDFS.]
            Oracle RDBMS              Hadoop
Integrity   High                      Low
Schema      Structured                Unstructured
Use         Frequent reads and writes Write once, read many
Style       Interactive and batch     Batch
1. Understand why you may want to establish data connections between the Oracle RDBMS and Hadoop
2. Review various techniques and tools for establishing data connections between the Oracle RDBMS and Hadoop’s HDFS
3. Understand the purpose, benefits, and limitations of the various techniques and tools
Sample scenario:
- Move data from Oracle into Hadoop
- Perform MapReduce on datasets that include Oracle data
- Move MapReduce results back into Oracle for analysis, reporting, etc.
Other uses:
- Oracle queries that join with Hadoop datasets
- Scheduled batch MapReduce results to be warehoused
Oracle and Hadoop together form a comprehensive platform for managing all forms of data, both structured and unstructured.
Hadoop provides "big data" processing. Oracle provides analytic capabilities not found in Hadoop. (NOTE: This is changing.)
1. Understand why you may want to establish data connections between the Oracle RDBMS and Hadoop
2. Review various techniques and tools for establishing data connections between the Oracle RDBMS and Hadoop’s HDFS
3. Understand the purpose, benefits, and limitations of the various techniques and tools
[Diagram: techniques for moving data between Oracle and Hadoop. PUSH = Oracle to Hadoop; PULL = Hadoop to Oracle.]
PUSH (Oracle to Hadoop):
- SELECT via SQL, PL/SQL, and Java / program units (such as stored procedures, etc.)
- Custom Java with JDBC, DBInputFormat, FileSystem API, Avro
- Sqoop
PULL (Hadoop to Oracle):
- Custom Java with JDBC, DBOutputFormat, FileSystem API, Avro
- Sqoop
- Oracle Loader for Hadoop
- Oracle SQL Connector for HDFS
SQL's SELECT statement (PUSH)
- Use Java to spool and control the output
- Use string concatenation to create delimited text HDFS files
- Use the Java Avro API to create serialized binary HDFS files
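The string-concatenation step can be sketched as follows (a minimal hypothetical helper, not code from the talk): each fetched row's column values are joined into one delimited line, which would then be written to an HDFS text file.

```java
// Turn one row's column values into a comma-delimited line suitable for
// an HDFS text file. Values containing the delimiter are quoted so the
// line can still be parsed back into the original fields.
public class DelimitedLine {
    static String toDelimitedLine(String[] values, char delim) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) sb.append(delim);
            String v = values[i];
            if (v.indexOf(delim) >= 0) {
                sb.append('"').append(v).append('"');
            } else {
                sb.append(v);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] row = { "101", "startup complete", "24-APR-2014" };
        System.out.println(toDelimitedLine(row, ','));
    }
}
```

A real spooler would loop over a JDBC ResultSet, call this per row, and write each line through the HDFS FileSystem API.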
Custom Java (PUSH and PULL)
- To connect to the RDBMS: JDBC
- To interact with the RDBMS: DBInputFormat (reading from a database) and DBOutputFormat (dumping to a database)
  - These generate SQL; best for smaller amounts of data
  - Package: org.apache.hadoop.mapreduce.lib.db
- To interact with HDFS files: the FileSystem API, and the Avro API for binary files
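As a rough illustration of how DBInputFormat divides a query among mappers, here is a sketch of one common split strategy, MySQL-style LIMIT/OFFSET (Oracle splits use ROWNUM instead; the helper below is hypothetical, not the Hadoop API itself):

```java
// Each map task reads a disjoint slice of the query's rows. One common
// split strategy appends LIMIT/OFFSET so split N covers rows
// [N * rowsPerSplit, (N + 1) * rowsPerSplit).
public class SplitQueryDemo {
    static String splitQuery(String baseQuery, long rowsPerSplit, int splitIndex) {
        return baseQuery + " LIMIT " + rowsPerSplit
                + " OFFSET " + (rowsPerSplit * splitIndex);
    }

    public static void main(String[] args) {
        String base = "SELECT id, note FROM logs ORDER BY id";
        for (int i = 0; i < 3; i++) {
            System.out.println(splitQuery(base, 1000, i));
        }
    }
}
```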
Sqoop = "SQL to Hadoop" (PUSH and PULL)
- Command-line tool
- Works with any JDBC-compliant RDBMS
- Works with any external system that supports bulk data transfer into Hadoop (HDFS, HBase, Hive)
- Strength: transfer of bulk data between Hadoop and RDBMS environments
- Read / write / update / insert / delete; stored procedures (warning: parallel processing)
- Open source, written in Java
- Apache Top-Level Project (graduated from incubator level March 2012)
- Bundled with the Oracle Big Data Appliance and CDH (Cloudera Distribution including Apache Hadoop)
- Also available from the Apache Software Foundation: http://sqoop.apache.org
- Latest version of Sqoop2: 1.99.3 (as of 4/15/14)
- Wiki: https://cwiki.apache.org/confluence/display/SQOOP
Sqoop file formats:
- Text: human-readable
- Binary: precision and compression; examples are SequenceFile (Java-specific) and Avro
NOTE: Sqoop cannot currently load SequenceFile or Avro into Hive.
- Interacts with structured data stores outside of HDFS
- Moves data from structured data stores into HBase
- Moves analytic results out of Hadoop to a structured data store
How Sqoop imports work:
- Interrogates the RDBMS data dictionary for the target schema
- Uses MapReduce to import data into Hadoop; parallel operation and fault tolerance are both configurable
- Maps Oracle SQL data types to Java data types (VARCHAR2 = String, INTEGER = Integer, etc.)
- Generates a Java class for the structured schema: a bean with "get" methods plus write methods:

  public void readFields(ResultSet __dbResults) throws SQLException;
  public void write(PreparedStatement __dbStmt) throws SQLException;
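The generated class can be pictured roughly like this (a hand-written sketch with hypothetical class and column names, not actual Sqoop output):

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch of a Sqoop-style generated record class: a bean with getters
// and setters plus the two JDBC methods, wired to the table's columns.
public class LogRecord {
    private int logfileId;
    private String logNote;

    public int getLogfileId() { return logfileId; }
    public String getLogNote() { return logNote; }
    public void setLogfileId(int id) { this.logfileId = id; }
    public void setLogNote(String note) { this.logNote = note; }

    // Populate the bean from the current row of a JDBC result set.
    public void readFields(ResultSet dbResults) throws SQLException {
        this.logfileId = dbResults.getInt("LOGFILE_ID");
        this.logNote = dbResults.getString("LOG_NOTE");
    }

    // Bind the bean's fields to an INSERT statement's parameters.
    public void write(PreparedStatement dbStmt) throws SQLException {
        dbStmt.setInt(1, logfileId);
        dbStmt.setString(2, logNote);
    }
}
```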
$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information
$ sqoop list-databases --connect "jdbc:mysql://localhost" --username steve --password changeme
14/04/24 15:35:21 INFO manager.SqlManager: Using default fetchSize of 1000
netcents2dt_siteexam_moduleteam_wiki
$
- Generic JDBC Connector
- Connectors for major RDBMS: Oracle, MySQL, SQL Server, DB2, PostgreSQL
- Third party: Teradata, Netezza, Couchbase
- Third-party connectors may support direct import into third-party data stores
Sqoop documentation:
- User Guides: Installation; Upgrade; Five Minute Demo; Command Line Client
- Developer Guides: Building Sqoop2; Development Environment Setup; Java Client API Guide; Developing Connector; REST API Guide
Oracle Loader for Hadoop (PULL)
- Essentially "SQL*Loader" for Hadoop
- A Java MapReduce application
- Runs as a Hadoop utility with a configuration file
- Extensible (attention Java programmers)
- Command line or standalone process
- Online and offline modes
- Requires an existing target table (staging table!)
- Loads data only; cannot edit Hadoop data
- Pre-partitions data if necessary
- Can pre-sort data by primary key or user-specified columns before loading
- Leverages Hadoop's parallel processing
[Diagram: Oracle Loader for Hadoop data flow — HDFS files feed parallel mappers, data is presorted, a reducer assembles rows, and JDBC writes them into the RDBMS.]
Oracle Loader advantages:
- Regular expressions (vs. as-is delimited file import)
- Faster throughput (vs. Sqoop JDBC)
- Data dictionary interrogation during load
- Support for runtime rollback (Sqoop generates INSERT statements with no rollback support)
Sqoop advantages:
- One system for bi-directional transfer support
Oracle SQL Connector for HDFS (PULL)
- Essentially the "external table" feature for Hadoop
- Text files only; no binary file support
- Treats HDFS as an external table
- Read only (no import / transfer)
- No indexing
- No INSERT, UPDATE, or DELETE
- As-is data import
- Full table scan
CREATE TABLE CUSTOMER_LOGFILES
  ( LOGFILE_ID  NUMBER(20)
  , LOG_NOTE    VARCHAR2(120)
  , LOG_DATE    DATE )
  ORGANIZATION EXTERNAL
  ( TYPE ORACLE_LOADER
    DEFAULT DIRECTORY log_file_dir
    ACCESS PARAMETERS
    ( records delimited by newline
      badfile log_file_dir:'log_bad'
      logfile log_file_dir:'log_records'
      fields terminated by ','
      missing field values are null
      ( LOGFILE_ID
      , LOG_NOTE
      , LOG_DATE char date_format date mask "dd-mon-yyyy"
      )
    )
    LOCATION ( 'log_data_1.csv', 'log_data_2.csv' )
  )
  PARALLEL
  REJECT LIMIT UNLIMITED;
- Oracle to CDH via Sqoop
- Freeware plug-in to CDH (Cloudera Distribution including Apache Hadoop)
- Java command-line utility
- Saves Hive HQL output to an Oracle database
There is no one best solution.
- Apache Sqoop and Java APIs: bi-directional; read / write / insert / update / delete; limited by JDBC and the available connectors; requires knowledge of Java
- Oracle Loader for Hadoop: offers preprocessing and speed; unidirectional
- Oracle SQL Connector for HDFS: integrates with existing SQL calls; limited to HDFS text files
- Third-party tools (Cloudera, Hortonworks, etc.) are adding features to Hadoop that may reduce demand for moving data back to Oracle
Steve O'Hearn
[email protected]
DatabaseTraining.com and Corbinian.com