
SQL to Hadoop and back again, Part 3: Direct transfer and live data exchange

Martin C. Brown ([email protected])

    Director of Documentation

    22 October 2013

Big data is a term that has been used regularly now for almost a decade, and it, along with technologies like NoSQL, is seen as a replacement for the long-successful RDBMS solutions that use SQL. Today, DB2, Oracle, Microsoft SQL Server, MySQL, and PostgreSQL dominate the SQL space and still make up a considerable proportion of the overall market. In this final article of the series, we will look at more automated solutions for migrating data to and from Hadoop. In the previous articles, we concentrated on methods that take exports, or otherwise formatted and extracted data, from your SQL source, load that into Hadoop in some way, then process or parse it. But if you want to analyze big data, you probably don't want to wait while exporting the data. Here, we're going to look at some methods and tools that enable a live transfer of data between your SQL and Hadoop environments.

    View more content in this series

Using Sqoop

Like some solutions we've seen earlier, Sqoop is all about taking data, usually wholesale, from a database and inserting it into Hadoop in the format required for your desired use. For example, Sqoop can take raw tabular data (a whole database, table, view, or query) and insert it into Hadoop in CSV or tab-delimited format, or it can import the data into a format suitable for use with Hive or HBase.

The elegance of Sqoop is that it handles the entire extraction, transfer, data-type translation, and insertion process for you, in either direction. Sqoop also handles, provided you have organized your data appropriately, the incremental transfer of information. This means you can perform a load and, 24 hours later, perform an additional load that imports only the rows that have changed.

InfoSphere BigInsights

InfoSphere BigInsights makes integrating between Hadoop and SQL databases much simpler, as it provides the necessary tools and mechanics to export and import data between different databases. Using InfoSphere BigInsights, you can define database sources, views, queries, and other selection criteria, and then automatically convert that into a variety of formats before importing that collection directly into Hadoop (see Resources for more information).


For example, you can create a query that extracts the data and populates a JSON array with the record data. Once exported, a job can be created to process and crunch the data before either displaying it, or importing the processed data and exporting the data back into DB2.

Download BigInsights Quick Start Edition, a complimentary, downloadable version of InfoSphere BigInsights.

    Sqoop is an efficient method of swapping data, since it uses multithreaded transfers to extract,

    convert, and insert the information among databases. This approach can be more efficient for

    data transfer than the export/import methods previously shown. The limitation of Sqoop is that it

    automates aspects of data exchange that, if made configurable, could be better tailored to your

    data and expected uses.
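
Sqoop's parallelism is also under your control. As a minimal sketch (the mapper count of 8 is purely illustrative, not a recommendation), the -m (or --num-mappers) option sets how many parallel map tasks Sqoop uses to pull the data; more mappers usually means a faster transfer at the cost of a heavier concurrent load on the source database:

$ sqoop import --connect jdbc:mysql://192.168.0.240/hadoop --username root \
  --table chicago -m 8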

    Importing to Hadoop using Sqoop

Sqoop works very simply: it takes all the data in a table (effectively SELECT * FROM tablename), or the output of a query you supply, and submits this as a MapReduce load job that writes the content out into HDFS within Hadoop.

The basic Sqoop tool accepts a command, import, followed by a series of options that define the JDBC interface along with configuration information, such as the JDBC driver, authentication information, and table data. For example, here's how to import the Chicago bus data from a MySQL source:

$ sqoop import --connect jdbc:mysql://192.168.0.240/hadoop --username root --table chicago

Sqoop likes to use the primary key of the table as an identifier for the information, because each row of data will be inserted into HDFS as a CSV row in a file. The primary key is also the better method to use for append-only data, such as logs. Using the primary key is also handy when performing incremental imports, because we can use it to identify which rows have already been imported.

    The output of the command actually goes a long way to describe the underlying process (see

    Listing 1).

Listing 1. Output of the sqoop import command

$ sqoop import --connect jdbc:mysql://192.168.0.240/hadoop

    --username root --table chicago

    13/08/20 18:45:46 INFO manager.MySQLManager: Preparing to use a MySQL

    streaming resultset.

    13/08/20 18:45:46 INFO tool.CodeGenTool: Beginning code generation

    13/08/20 18:45:47 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `chicago` AS t LIMIT 1

    13/08/20 18:45:47 INFO manager.SqlManager: Executing SQL statement:

    SELECT t.* FROM `chicago` AS t LIMIT 1

    13/08/20 18:45:47 INFO orm.CompilationManager: HADOOP_MAPRED_HOME

    is /usr/lib/hadoop-mapreduce

    13/08/20 18:45:47 INFO orm.CompilationManager: Found hadoop core jar at:

    /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-core.jar

    Note: /tmp/sqoop-cloudera/compile/2a66b88e152785acb3688bb530daa957/chicago.java

    uses or overrides a deprecated API.

    Note: Recompile with -Xlint:deprecation for details.

    13/08/20 18:45:49 INFO orm.CompilationManager: Writing jar file:

    /tmp/sqoop-cloudera/compile/2a66b88e152785acb3688bb530daa957/chicago.jar

    13/08/20 18:45:49 WARN manager.MySQLManager: It looks like you are


    importing from mysql.

    13/08/20 18:45:49 WARN manager.MySQLManager: This transfer can be faster!

    Use the --direct

    13/08/20 18:45:49 WARN manager.MySQLManager: option to exercise a

    MySQL-specific fast path.

    13/08/20 18:45:49 INFO manager.MySQLManager: Setting zero DATETIME

    behavior to convertToNull (mysql)

13/08/20 18:45:49 ERROR tool.ImportTool: Error during import: No primary key
could be found for table chicago. Please specify one with --split-by or perform
a sequential import with '-m 1'.

    [cloudera@localhost ~]$ sqoop import --connect jdbc:mysql://192.168.0.240/hadoop

    --username root --table chicago

    13/08/20 18:48:55 INFO manager.MySQLManager: Preparing to use a MySQL

    streaming resultset.

    13/08/20 18:48:55 INFO tool.CodeGenTool: Beginning code generation

    13/08/20 18:48:55 INFO manager.SqlManager: Executing SQL statement:

    SELECT t.* FROM `chicago` AS t LIMIT 1

    13/08/20 18:48:55 INFO manager.SqlManager: Executing SQL statement:

    SELECT t.* FROM `chicago` AS t LIMIT 1

    13/08/20 18:48:55 INFO orm.CompilationManager: HADOOP_MAPRED_HOME

    is /usr/lib/hadoop-mapreduce

    13/08/20 18:48:55 INFO orm.CompilationManager: Found hadoop core jar at:

    /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-core.jar

Note: /tmp/sqoop-cloudera/compile/3002dc39075aa6746a99e5a4b27240ac/chicago.java uses or overrides a deprecated API.

    Note: Recompile with -Xlint:deprecation for details.

    13/08/20 18:48:58 INFO orm.CompilationManager: Writing jar file:

    /tmp/sqoop-cloudera/compile/3002dc39075aa6746a99e5a4b27240ac/chicago.jar

    13/08/20 18:48:58 WARN manager.MySQLManager: It looks like you are importing

    from mysql.

    13/08/20 18:48:58 WARN manager.MySQLManager: This transfer can be faster!

    Use the --direct

    13/08/20 18:48:58 WARN manager.MySQLManager: option to exercise a

    MySQL-specific fast path.

    13/08/20 18:48:58 INFO manager.MySQLManager: Setting zero DATETIME

    behavior to convertToNull (mysql)

    13/08/20 18:48:58 INFO mapreduce.ImportJobBase: Beginning import of chicago

    13/08/20 18:48:59 WARN mapred.JobClient: Use GenericOptionsParser for parsing

    the arguments. Applications should implement Tool for the same.

    13/08/20 18:49:00 INFO db.DataDrivenDBInputFormat: BoundingValsQuery:

    SELECT MIN(`id`), MAX(`id`) FROM `chicago`

    13/08/20 18:49:00 INFO mapred.JobClient: Running job:

    job_201308151105_0012

    13/08/20 18:49:01 INFO mapred.JobClient: map 0% reduce 0%

    13/08/20 18:49:51 INFO mapred.JobClient: map 100% reduce 0%

    13/08/20 18:49:53 INFO mapred.JobClient: Job complete:

    job_201308151105_0012

    13/08/20 18:49:53 INFO mapred.JobClient: Counters: 23

    13/08/20 18:49:53 INFO mapred.JobClient: File System Counters

    13/08/20 18:49:53 INFO mapred.JobClient: FILE: Number of bytes read=0

    13/08/20 18:49:53 INFO mapred.JobClient: FILE: Number of bytes

    written=695444

    13/08/20 18:49:53 INFO mapred.JobClient: FILE: Number of read operations=0

    13/08/20 18:49:53 INFO mapred.JobClient: FILE: Number of large read

operations=0
13/08/20 18:49:53 INFO mapred.JobClient: FILE: Number of write operations=0

    13/08/20 18:49:53 INFO mapred.JobClient: HDFS: Number of bytes read=433

    13/08/20 18:49:53 INFO mapred.JobClient: HDFS: Number of bytes written=97157691

    13/08/20 18:49:53 INFO mapred.JobClient: HDFS: Number of read operations=4

    13/08/20 18:49:53 INFO mapred.JobClient: HDFS: Number of large

    read operations=0

    13/08/20 18:49:53 INFO mapred.JobClient: HDFS: Number of write operations=4

    13/08/20 18:49:53 INFO mapred.JobClient: Job Counters

    13/08/20 18:49:53 INFO mapred.JobClient: Launched map tasks=4

    13/08/20 18:49:53 INFO mapred.JobClient: Total time spent by all maps in

    occupied slots (ms)=173233

    13/08/20 18:49:53 INFO mapred.JobClient: Total time spent by all reduces in


    occupied slots (ms)=0

    13/08/20 18:49:53 INFO mapred.JobClient: Total time spent by all maps waiting

    after reserving slots (ms)=0

    13/08/20 18:49:53 INFO mapred.JobClient: Total time spent by all reduces

    waiting after reserving slots (ms)=0

    13/08/20 18:49:53 INFO mapred.JobClient: Map-Reduce Framework

    13/08/20 18:49:53 INFO mapred.JobClient: Map input records=2168224

    13/08/20 18:49:53 INFO mapred.JobClient: Map output records=2168224

13/08/20 18:49:53 INFO mapred.JobClient: Input split bytes=433
13/08/20 18:49:53 INFO mapred.JobClient: Spilled Records=0

    13/08/20 18:49:53 INFO mapred.JobClient: CPU time spent (ms)=24790

    13/08/20 18:49:53 INFO mapred.JobClient: Physical memory (bytes)

    snapshot=415637504

    13/08/20 18:49:53 INFO mapred.JobClient: Virtual memory (bytes)

    snapshot=2777317376

    13/08/20 18:49:53 INFO mapred.JobClient: Total committed heap usage

    (bytes)=251133952

    13/08/20 18:49:53 INFO mapreduce.ImportJobBase: Transferred 92.6568 MB

    in 54.4288 seconds (1.7023 MB/sec)

    13/08/20 18:49:53 INFO mapreduce.ImportJobBase: Retrieved 2168224 records.
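
Notice that the first run in Listing 1 fails because Sqoop cannot find a primary key for the chicago table; the error message itself suggests the two workarounds. As a minimal sketch (the id column is the same one used for the incremental imports later in this article), you would either name a split column explicitly or fall back to a single, sequential mapper:

$ sqoop import --connect jdbc:mysql://192.168.0.240/hadoop --username root \
  --table chicago --split-by id

$ sqoop import --connect jdbc:mysql://192.168.0.240/hadoop --username root \
  --table chicago -m 1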

Once transferred, the data is stored, by default, as comma-separated values and added to a directory named after the imported table, with the data split into files of a set size (see Listing 2).

    Listing 2. Storing the data

    $ hdfs dfs -ls chicago

    Found 6 items

    -rw-r--r-- 3 cloudera cloudera 0 2013-08-20 18:49 chicago/_SUCCESS

    drwxr-xr-x - cloudera cloudera 0 2013-08-20 18:49 chicago/_logs

    -rw-r--r-- 3 cloudera cloudera 23904178 2013-08-20 18:49 chicago/part-m-00000

    -rw-r--r-- 3 cloudera cloudera 24104937 2013-08-20 18:49 chicago/part-m-00001

    -rw-r--r-- 3 cloudera cloudera 24566127 2013-08-20 18:49 chicago/part-m-00002

    -rw-r--r-- 3 cloudera cloudera 24582449 2013-08-20 18:49 chicago/part-m-00003

To change the directory where the information is stored, use --target-dir to specify the directory location within HDFS.
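
For example, a minimal sketch (the /data/chicago path is purely illustrative):

$ sqoop import --connect jdbc:mysql://192.168.0.240/hadoop --username root \
  --table chicago --target-dir /data/chicago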

    The file format can be explicitly modified using command-line arguments, but the options are

    limited. For example, you can't migrate tabular data into a JSON record with Sqoop.
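
The delimiters of the generated text files can be changed, though. A minimal sketch that swaps the default comma for a tab (the choice of tab is just an example):

$ sqoop import --connect jdbc:mysql://192.168.0.240/hadoop --username root \
  --table chicago --fields-terminated-by '\t'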

A more complex option is to use the SequenceFile format, which translates the raw data into a binary format that can be reconstituted within the Java environment of Hadoop as a Java class, with each column of the table data becoming a property of each instantiated class record. Alternatively, you can use Sqoop to import data directly into an HBase- or Hive-compatible table.
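
A hedged sketch of those two routes (the Hive table name is an assumption; check your distribution's documentation for the exact behavior of these options):

$ sqoop import --connect jdbc:mysql://192.168.0.240/hadoop --username root \
  --table chicago --as-sequencefile

$ sqoop import --connect jdbc:mysql://192.168.0.240/hadoop --username root \
  --table chicago --hive-import --hive-table chicago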

    Importing using a query

    Wholesale table transfers are useful, but one of the primary benefits of the SQL environment is the

    ability to join and reformat the input into a more meaningful stream of columnar data.

    By using a query, you can extract entire tables, table fragments, or complex table joins. I tend to

    use queries when the source data is from multiple SQL tables and I want to crunch the data as a

    single source table within Hadoop.


To use a query, the --query argument must be specified on the command line. The query must include a WHERE clause that includes the variable $CONDITIONS; this is automatically populated by Sqoop to be used when splitting the source content (see Listing 3).

Listing 3. Using the --query argument

$ sqoop import --connect jdbc:mysql://192.168.0.240/hadoop --username root \
  --query "SELECT log.id,log.daterec,sensor.logtype,sensor.value FROM log
  JOIN sensor ON (sensor.logid = log.id) WHERE \$CONDITIONS"

The basic process is the same; we're simply limiting the data being exchanged. Internally, Sqoop merely executes the query and takes the tabular output.
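
Two practical points about query imports are worth noting; both are assumptions based on standard Sqoop 1.x behavior rather than anything shown above. First, when the query is wrapped in double quotes, the variable should be written as \$CONDITIONS so the shell does not expand it. Second, a free-form query import also needs a --target-dir, plus either a --split-by column or a single mapper. A fuller sketch of Listing 3 along those lines (the target directory is illustrative):

$ sqoop import --connect jdbc:mysql://192.168.0.240/hadoop --username root \
  --query "SELECT log.id,log.daterec,sensor.logtype,sensor.value FROM log
  JOIN sensor ON (sensor.logid = log.id) WHERE \$CONDITIONS" \
  --target-dir /data/logsensor -m 1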

A good rule of thumb here is that the size of the data being transferred is (comparatively) unimportant. Also bear in mind that during processing within Hadoop, you will only have access to the information in the files that are imported; you won't be able to run a join or other lookup to find information that would normally exist in a multi-table SQL environment. Therefore, information that you might ordinarily deduplicate (for example, a repeated string, ID, or date identifier column) can safely be repeated on every row.

    Incremental imports

The incremental import is an attempt by Sqoop to handle the fact that source data is unlikely to be static. The process is not automatic, and you must be prepared to keep a record of the last data that was imported.

The incremental system operates in one of two ways, using either a lastmodified approach or an append approach:

• The lastmodified approach requires changing your SQL table structure and application, as it performs a comparison on a date that is then used to determine which records have changed since the last import was made. This is best used for SQL-side data that is updated, but you must adapt your application and database structure to include a column that contains the date and time when the record was inserted or updated. Most databases include a timestamp data type for exactly this purpose (see the sketch after this list).

• The append approach uses a simpler check field. This can be used in a number of ways, but the most obvious is one where an auto-incremented column is used to hold data; it is, therefore, better suited to data that is permanently appended rather than created and updated. Another option is to use a column that is updated to a new value for each insert or update, but this requires more hoop-jumping than the auto_increment value.
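
A minimal sketch of the lastmodified approach, assuming MySQL and an illustrative last_update column added to the chicago table (the column name, the schema change, and the timestamp value are all assumptions, not something used elsewhere in this article):

$ mysql -u root -e "ALTER TABLE hadoop.chicago ADD COLUMN last_update TIMESTAMP
  DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP"

$ sqoop import --connect jdbc:mysql://192.168.0.240/hadoop --username root \
  --table chicago --incremental lastmodified \
  --check-column last_update --last-value "2013-08-20 18:49:53"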

For either system, the fundamental approach is the same: tell Sqoop which column contains the data to be checked and the check value, then import as normal. For example, to import all the data since the original Chicago bus data import, we specify the auto_increment ID of the last row imported (see Listing 4).


Listing 4. Specifying the auto_increment ID

$ sqoop import --connect jdbc:mysql://192.168.0.240/hadoop --username root \

    --table chicago --check-column id --incremental append --last-value=2168224

    ...

    13/08/20 19:39:01 INFO tool.ImportTool: Incremental import complete!

    To run another incremental import of all data following this import, supply the

following arguments:
13/08/20 19:39:01 INFO tool.ImportTool: --incremental append

    13/08/20 19:39:01 INFO tool.ImportTool: --check-column id

    13/08/20 19:39:01 INFO tool.ImportTool: --last-value 4336573

    13/08/20 19:39:01 INFO tool.ImportTool: (Consider saving this with

    'sqoop job --create')

    One useful feature of the incremental process is that the job outputs the command-line values you

    would need to use for the next import, but it's easier to save this as a job (see Listing 5).

Listing 5. Saving output as a job

$ sqoop job --create nextimport -- import --incremental append \
  --check-column id --last-value 4336573 \
  --connect jdbc:mysql://192.168.0.240/hadoop \
  --username root --table chicago

    Now you can run the next import by running $ sqoop job --exec nextimport.

The incremental import process is great for those jobs that aggregate data from your SQL store into Hadoop over time while deleting the active data in the SQL store. The same basic premise can be used to perform near-live updates of information into Hadoop from an SQL source simply by transferring the information across at regular intervals.
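
As an illustration of that near-live pattern, and purely as an assumption about how you might schedule it (the crontab entry and the sqoop path are not part of the original setup), the saved nextimport job could be run at the top of every hour:

# illustrative crontab entry: run the saved Sqoop job hourly
0 * * * * /usr/bin/sqoop job --exec nextimport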

Exporting to SQL using Sqoop

The export process simply converts the data in Hadoop back into a table. Sqoop achieves this by reading the data out of HDFS and inserting it into the target table, optionally via a staging table. The target table has to exist, and its structure has to match the information being exported from Hadoop (see Listing 6).
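
Because the columns of the chicago table aren't reproduced here, a minimal sketch of creating a matching target table in MySQL is to clone the structure of the source table (assuming the export is going back into the same schema as the original import):

$ mysql -u root -e "CREATE TABLE hadoop.chicago2 LIKE hadoop.chicago"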

Listing 6. The sqoop export command

$ sqoop export --connect jdbc:mysql://192.168.0.240/hadoop --username root \
  --table chicago2 --export-dir=chicago

    13/08/20 20:08:44 INFO manager.MySQLManager: Preparing to use a MySQL

    streaming resultset.

    13/08/20 20:08:44 INFO tool.CodeGenTool: Beginning code generation

    13/08/20 20:08:46 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `chicago2` AS t LIMIT 1

    13/08/20 20:08:47 INFO manager.SqlManager: Executing SQL statement:

    SELECT t.* FROM `chicago2` AS t LIMIT 1

    13/08/20 20:08:47 INFO orm.CompilationManager: HADOOP_MAPRED_HOME

    is /usr/lib/hadoop-mapreduce

    13/08/20 20:08:47 INFO orm.CompilationManager: Found hadoop core jar at:

    /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-core.jar

    Note: /tmp/sqoop-cloudera/compile/5f6d818f5d78c0e4349b5fc3924f87da/

    chicago2.java uses or overrides a deprecated API.

    Note: Recompile with -Xlint:deprecation for details.

    13/08/20 20:08:49 INFO orm.CompilationManager: Writing jar file:

    /tmp/sqoop-cloudera/compile/5f6d818f5d78c0e4349b5fc3924f87da/chicago2.jar

    13/08/20 20:08:50 INFO mapreduce.ExportJobBase: Beginning export of chicago2
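
If you want the load to go through an explicit staging table, the --staging-table and --clear-staging-table options cover that. A hedged sketch, assuming a pre-created chicago2_stage table with the same structure as chicago2:

$ sqoop export --connect jdbc:mysql://192.168.0.240/hadoop --username root \
  --table chicago2 --export-dir=chicago \
  --staging-table chicago2_stage --clear-staging-table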


Sqoop 2

With Sqoop 2, you predefine jobs and structures, and how the interface of the data exists between the two systems, then use this definition to run explicit transfers of the data. Sqoop 2 is still in its early stages (the first stable version was made available in March 2013), and convenience features like Hive and HBase support are not complete. The real benefit will come with a forthcoming UI, which is expected toward the end of 2013.

    Incorporating Hadoop in your application

For a more regular, measured transfer of data, the best approach is to make the data exchange part of your application framework. This is the only way to be sure that the exchange of information is wired correctly into your application workflow. One solution is to use Hive, since it allows SQL queries to be used directly on the two systems.
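
One hedged sketch of that Hive route is to define an external Hive table directly over the CSV files that the earlier Sqoop import wrote into HDFS, so the same data can be queried from Hive without making another copy. The column names and types below are purely illustrative, since the real schema of the chicago table isn't shown in this article, and the location assumes the default import directory seen in Listing 2 (in practice you might first move just the part-m files into a clean directory):

$ hive -e "CREATE EXTERNAL TABLE chicago_csv (id INT, route STRING, daterec STRING, riders INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/cloudera/chicago'"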

However, this kind of integration is risky because of the potential for bad data. The main benefit of using SQL is the transactional nature that guarantees data has been written to disk. Exporting the data (with Sqoop, CSV, and so on) is much safer and works much better with the append-only nature of HDFS and Hadoop in general.

The exact method is dependent on your application, but writing the same data to both systems is not without its pitfalls. In particular:

• Writing data to two databases is complicated. To guarantee that no data is lost, you would need to confirm both transactions and reject the write, so that the application can resubmit it, in the event that either the SQL or Hadoop write had failed.

• Updates are more complicated in Hadoop. HDFS is append-only by nature, so you would have to handle the process by writing a correction record into HDFS, then using MapReduce to compress the insert and update operations during processing.

• Deletes are fundamentally the same as updates: we just kill the data during MapReduce processing.

• Any changes (updates or deletes) may cause problems if you are doing streaming updates and compression of the data in Hadoop.

• If you are reading the data back into SQL in a summarized form for near-line processing, you'll need to plan the process (as described in Part 2) very carefully.

The export, process, import data life cycle between SQL and Hadoop is generally much easier to handle (disk space issues aside). If you need near-live transfer, then the Sqoop incremental transfer is much more efficient.

    Conclusion

Live data transfer between SQL and Hadoop is not a sensible option, but with Sqoop we can do the next best thing by using incremental updates to load the most recent data into Hadoop on a regular basis. The alternative, live updates through your application, is so risky from a data quality and reliability point of view that it should be avoided. Regular swapping of information between SQL and Hadoop is safer and allows the data to be managed more effectively during the transfer.


    Throughout this "SQL to Hadoop and back again" series, the focus has been on trying to

    understand that life cycle and how the transfer and exchange of information operates. The format

    of the data is relatively simple, but knowing how and when to effectively exchange the information

    and process it in a way that matches your data needs is the challenge. There are lots of solutions

    out there to move the data, but it is still up to your application to understand the best way to make

    use of that exchange.


Resources

Learn

Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based offering that extends the value of open source Hadoop with features like Big SQL, text analytics, and BigSheets.

Follow these self-paced tutorials (PDF) to learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application, develop Big SQL queries to analyze big data, and create an extractor to derive insights from text documents with InfoSphere BigInsights.

    Find resources to help you get started with InfoSphere Streams, IBM's high-performance

    computing platform that enables user-developed applications to rapidly ingest, analyze, and

    correlate information as it arrives from thousands of real-time sources.

    Stay current with developerWorks technical events and webcasts.

    Follow developerWorks on Twitter.

    Get products and technologies

Get Hadoop 0.20.1 from Apache.org.

    Get Hadoop MapReduce.

    Get Hadoop HDFS.

    Download InfoSphere BigInsights Quick Start Edition, available as a native software

    installation or as a VMware image.

Download InfoSphere Streams, available as a native software installation or as a VMware image.

Use InfoSphere Streams on IBM SmartCloud Enterprise.

    Build your next development project with IBM trial software, available for download directly

    from developerWorks.

    Discuss

    Ask questions and get answers in the InfoSphere BigInsights forum.

    Ask questions and get answers in the InfoSphere Streams forum.

Check out the developerWorks blogs and get involved in the developerWorks community.

IBM big data and analytics on Facebook.


    About the author

    Martin C. Brown

A professional writer for over 15 years, Martin (MC) Brown is the author of and contributor to more than 26 books covering an array of topics, including the recently published Getting Started with CouchDB. His expertise spans myriad development languages and platforms: Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Microsoft WP, Mac OS, and more. He currently works as the director of documentation for Continuent.

© Copyright IBM Corporation 2013 (www.ibm.com/legal/copytrade.shtml)

Trademarks (www.ibm.com/developerworks/ibm/trademarks/)
