Informatica PowerExchange for Hadoop (Version 9.1.0 HotFix 3) User Guide



Informatica PowerExchange for Hadoop (Version 9.1.0 HotFix 3)

User Guide


Informatica PowerExchange for Hadoop User Guide

Version 9.1.0 HotFix 3
December 2011

Copyright (c) 2011 Informatica. All rights reserved.

This software and documentation contain proprietary information of Informatica Corporation and are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. This Software may be protected by U.S. and/or international Patents and other Patents Pending.

Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as provided in DFARS 227.7202-1(a) and 227.7702-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable.

The information in this product or documentation is subject to change without notice. If you find any problems in this product or documentation, please report them to us in writing.

Informatica, Informatica Platform, Informatica Data Services, PowerCenter, PowerCenterRT, PowerCenter Connect, PowerCenter Data Analyzer, PowerExchange, PowerMart, Metadata Manager, Informatica Data Quality, Informatica Data Explorer, Informatica B2B Data Transformation, Informatica B2B Data Exchange, Informatica On Demand, Informatica Identity Resolution, Informatica Application Information Lifecycle Management, Informatica Complex Event Processing, Ultra Messaging and Informatica Master Data Management are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners.

Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright DataDirect Technologies. All rights reserved. Copyright © Sun Microsystems. All rights reserved. Copyright © RSA Security Inc. All rights reserved. Copyright © Ordinal Technology Corp. All rights reserved. Copyright © Aandacht c.v. All rights reserved. Copyright Genivia, Inc. All rights reserved. Copyright Isomorphic Software. All rights reserved. Copyright © Meta Integration Technology, Inc. All rights reserved. Copyright © Intalio. All rights reserved. Copyright © Oracle. All rights reserved. Copyright © Adobe Systems Incorporated. All rights reserved. Copyright © DataArt, Inc. All rights reserved. Copyright © ComponentSource. All rights reserved. Copyright © Microsoft Corporation. All rights reserved. Copyright © Rogue Wave Software, Inc. All rights reserved. Copyright © Teradata Corporation. All rights reserved. Copyright © Yahoo! Inc. All rights reserved. Copyright © Glyph & Cog, LLC. All rights reserved. Copyright © Thinkmap, Inc. All rights reserved. Copyright © Clearpace Software Limited. All rights reserved. Copyright © Information Builders, Inc. All rights reserved. Copyright © OSS Nokalva, Inc. All rights reserved. Copyright Edifecs, Inc. All rights reserved. Copyright Cleo Communications, Inc. All rights reserved. Copyright © International Organization for Standardization 1986. All rights reserved. Copyright © ej-technologies GmbH. All rights reserved. Copyright © Jaspersoft Corporation. All rights reserved.

This product includes software developed by the Apache Software Foundation (http://www.apache.org/), and other software which is licensed under the Apache License, Version 2.0 (the "License"). You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This product includes software which was developed by Mozilla (http://www.mozilla.org/), software copyright The JBoss Group, LLC, all rights reserved; software copyright © 1999-2006 by Bruno Lowagie and Paulo Soares and other software which is licensed under the GNU Lesser General Public License Agreement, which may be found at http://www.gnu.org/licenses/lgpl.html. The materials are provided free of charge by Informatica, "as-is", without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose.

The product includes ACE(TM) and TAO(TM) software copyrighted by Douglas C. Schmidt and his research group at Washington University, University of California, Irvine, and Vanderbilt University, Copyright (c) 1993-2006, all rights reserved.

This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit (copyright The OpenSSL Project. All Rights Reserved) and redistribution of this software is subject to terms available at http://www.openssl.org and http://www.openssl.org/source/license.html.

This product includes Curl software which is Copyright 1996-2007, Daniel Stenberg, <[email protected]>. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://curl.haxx.se/docs/copyright.html. Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

The product includes software copyright 2001-2005 (c) MetaStuff, Ltd. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://www.dom4j.org/license.html.

The product includes software copyright © 2004-2007, The Dojo Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://dojotoolkit.org/license.

This product includes ICU software which is copyright International Business Machines Corporation and others. All rights reserved. Permissions and limitations regarding this software are subject to terms available at http://source.icu-project.org/repos/icu/icu/trunk/license.html.

This product includes software copyright © 1996-2006 Per Bothner. All rights reserved. Your right to use such materials is set forth in the license which may be found at http://www.gnu.org/software/kawa/Software-License.html.

This product includes OSSP UUID software which is Copyright © 2002 Ralf S. Engelschall, Copyright © 2002 The OSSP Project, Copyright © 2002 Cable & Wireless Deutschland. Permissions and limitations regarding this software are subject to terms available at http://www.opensource.org/licenses/mit-license.php.

This product includes software developed by Boost (http://www.boost.org/) or under the Boost software license. Permissions and limitations regarding this software are subject to terms available at http://www.boost.org/LICENSE_1_0.txt.

This product includes software copyright © 1997-2007 University of Cambridge. Permissions and limitations regarding this software are subject to terms available at http://www.pcre.org/license.txt.

This product includes software copyright © 2007 The Eclipse Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://www.eclipse.org/org/documents/epl-v10.php.

This product includes software licensed under the terms at http://www.tcl.tk/software/tcltk/license.html, http://www.bosrup.com/web/overlib/?License, http://www.stlport.org/doc/license.html, http://www.asm.ow2.org/license.html, http://www.cryptix.org/LICENSE.TXT, http://hsqldb.org/web/hsqlLicense.html, http://httpunit.sourceforge.net/doc/license.html, http://jung.sourceforge.net/license.txt, http://www.gzip.org/zlib/zlib_license.html, http://www.openldap.org/software/release/license.html, http://www.libssh2.org, http://slf4j.org/license.html, http://www.sente.ch/software/OpenSourceLicense.html, http://fusesource.com/downloads/license-agreements/fuse-message-broker-v-5-3-license-agreement; http://antlr.org/license.html; http://aopalliance.sourceforge.net/; http://www.bouncycastle.org/licence.html; http://www.jgraph.com/jgraphdownload.html; http://www.jcraft.com/jsch/LICENSE.txt; http://jotm.objectweb.org/bsd_license.html; http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231; http://developer.apple.com/library/mac/#samplecode/HelpHook/Listings/HelpHook_java.html; http://nanoxml.sourceforge.net/orig/copyright.html; http://www.json.org/license.html; http://forge.ow2.org/projects/javaservice/, http://www.postgresql.org/about/licence.html, http://www.sqlite.org/copyright.html, http://www.jaxen.org/faq.html, and http://www.jdom.org/docs/faq.html.


This product includes software licensed under the Academic Free License (http://www.opensource.org/licenses/afl-3.0.php), the Common Development and Distribution License (http://www.opensource.org/licenses/cddl1.php), the Common Public License (http://www.opensource.org/licenses/cpl1.0.php), the Sun Binary Code License Agreement Supplemental License Terms, the BSD License (http://www.opensource.org/licenses/bsd-license.php), the MIT License (http://www.opensource.org/licenses/mit-license.php) and the Artistic License (http://www.opensource.org/licenses/artistic-license-1.0).

This product includes software copyright © 2003-2006 Joe Walnes, 2006-2007 XStream Committers. All rights reserved. Permissions and limitations regarding this software are subject to terms available at http://xstream.codehaus.org/license.html. This product includes software developed by the Indiana University Extreme! Lab. For further information please visit http://www.extreme.indiana.edu/.

This Software is protected by U.S. Patent Numbers 5,794,246; 6,014,670; 6,016,501; 6,029,178; 6,032,158; 6,035,307; 6,044,374; 6,092,086; 6,208,990; 6,339,775; 6,640,226; 6,789,096; 6,820,077; 6,823,373; 6,850,947; 6,895,471; 7,117,215; 7,162,643; 7,254,590; 7,281,001; 7,421,458; 7,496,588; 7,523,121; 7,584,422; 7,720,842; 7,721,270; and 7,774,791, international Patents and other Patents Pending.

DISCLAIMER: Informatica Corporation provides this documentation "as is" without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of noninfringement, merchantability, or use for a particular purpose. Informatica Corporation does not warrant that this software or documentation is error free. The information provided in this software or documentation may include technical inaccuracies or typographical errors. The information in this software and documentation is subject to change at any time without notice.

NOTICES

This Informatica product (the "Software") includes certain drivers (the "DataDirect Drivers") from DataDirect Technologies, an operating company of Progress Software Corporation ("DataDirect") which are subject to the following terms and conditions:

1. THE DATADIRECT DRIVERS ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.

2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS, WHETHER OR NOT INFORMED OF THE POSSIBILITIES OF DAMAGES IN ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF ACTION, INCLUDING, WITHOUT LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY, NEGLIGENCE, STRICT LIABILITY, MISREPRESENTATION AND OTHER TORTS.

Part Number: PWX-HDU-91000-HF3-0001


Table of Contents

Preface
    Informatica Resources
        Informatica Customer Portal
        Informatica Documentation
        Informatica Web Site
        Informatica How-To Library
        Informatica Knowledge Base
        Informatica Multimedia Knowledge Base
        Informatica Global Customer Support

Chapter 1: Understanding PowerExchange for Hadoop
    PowerExchange for Hadoop Overview
    Understanding Hadoop
    PowerCenter and Hadoop Integration

Chapter 2: PowerExchange for Hadoop Configuration
    PowerExchange for Hadoop Configuration Overview
    Registering the Plug-in

Chapter 3: PowerExchange for Hadoop Sources and Targets
    PowerExchange for Hadoop Sources and Targets Overview

Chapter 4: PowerExchange for Hadoop Sessions
    PowerExchange for Hadoop Sessions Overview
    PowerExchange for Hadoop Connections
    Sessions with Hadoop Sources
        Staging HDFS Source Data
        Hadoop Source Connections
        Session Properties for a Hadoop Source
    Sessions with Hadoop Targets
        Hadoop Target Connections
        Hadoop Target Partitioning
        Session Properties for a Hadoop Target

Index


Preface

The PowerExchange for Hadoop User Guide provides information about how to build mappings to extract data from Hadoop and load data into Hadoop.

It is written for database administrators and developers. This guide assumes you have knowledge of relational database concepts and database engines, flat files, PowerCenter, Hadoop, the Hadoop Distributed File System (HDFS), and Apache Hive. You must also be familiar with the interface requirements for other supporting applications.

Informatica Resources

Informatica Customer Portal

As an Informatica customer, you can access the Informatica Customer Portal site at http://mysupport.informatica.com. The site contains product information, user group information, newsletters, access to the Informatica customer support case management system (ATLAS), the Informatica How-To Library, the Informatica Knowledge Base, the Informatica Multimedia Knowledge Base, Informatica Product Documentation, and access to the Informatica user community.

Informatica Documentation

The Informatica Documentation team takes every effort to create accurate, usable documentation. If you have questions, comments, or ideas about this documentation, contact the Informatica Documentation team through email at [email protected]. We will use your feedback to improve our documentation. Let us know if we can contact you regarding your comments.

The Documentation team updates documentation as needed. To get the latest documentation for your product, navigate to Product Documentation from http://mysupport.informatica.com.

Informatica Web Site

You can access the Informatica corporate web site at http://www.informatica.com. The site contains information about Informatica, its background, upcoming events, and sales offices. You will also find product and partner information. The services area of the site includes important information about technical support, training and education, and implementation services.


Informatica How-To Library

As an Informatica customer, you can access the Informatica How-To Library at http://mysupport.informatica.com. The How-To Library is a collection of resources to help you learn more about Informatica products and features. It includes articles and interactive demonstrations that provide solutions to common problems, compare features and behaviors, and guide you through performing specific real-world tasks.

Informatica Knowledge Base

As an Informatica customer, you can access the Informatica Knowledge Base at http://mysupport.informatica.com. Use the Knowledge Base to search for documented solutions to known technical issues about Informatica products. You can also find answers to frequently asked questions, technical white papers, and technical tips. If you have questions, comments, or ideas about the Knowledge Base, contact the Informatica Knowledge Base team through email at [email protected].

Informatica Multimedia Knowledge Base

As an Informatica customer, you can access the Informatica Multimedia Knowledge Base at http://mysupport.informatica.com. The Multimedia Knowledge Base is a collection of instructional multimedia files that help you learn about common concepts and guide you through performing specific tasks. If you have questions, comments, or ideas about the Multimedia Knowledge Base, contact the Informatica Knowledge Base team through email at [email protected].

Informatica Global Customer Support

You can contact a Customer Support Center by telephone or through the Online Support. Online Support requires a user name and password. You can request a user name and password at http://mysupport.informatica.com.

Use the following telephone numbers to contact Informatica Global Customer Support:

North America / South America

Toll Free
Brazil: 0800 891 0202
Mexico: 001 888 209 8853
North America: +1 877 463 2435

Europe / Middle East / Africa

Toll Free
France: 0805 804632
Germany: 0800 5891281
Italy: 800 915 985
Netherlands: 0800 2300001
Portugal: 800 208 360
Spain: 900 813 166
Switzerland: 0800 463 200
United Kingdom: 0800 023 4632

Standard Rate
Belgium: +31 30 6022 797
France: +33 1 4138 9226
Germany: +49 1805 702 702
Netherlands: +31 306 022 797
United Kingdom: +44 1628 511445

Asia / Australia

Toll Free
Australia: 1 800 151 830
New Zealand: 09 9 128 901

Standard Rate
India: +91 80 4112 5738


Chapter 1: Understanding PowerExchange for Hadoop

This chapter includes the following topics:

- PowerExchange for Hadoop Overview
- Understanding Hadoop
- PowerCenter and Hadoop Integration

PowerExchange for Hadoop Overview

PowerExchange for Hadoop integrates PowerCenter with Hadoop to extract and load data.

You can connect a flat file source to Hadoop to extract data from the Hadoop Distributed File System (HDFS). You can connect a flat file target to Hadoop to load data to HDFS. You can also load data to the Hive data warehouse system.

Understanding Hadoop

Hadoop provides a framework for distributed processing of large data sets across multiple computers. It depends on applications rather than hardware to deliver high availability.

Hadoop applications use HDFS as the primary storage system. HDFS replicates data blocks and distributes them across nodes in a cluster.

Hive is a data warehouse system for Hadoop. You can use Hive to add structure to data sets stored in file systems that are compatible with Hadoop.

PowerCenter and Hadoop Integration

PowerExchange for Hadoop accesses Hadoop to extract data from HDFS or load data to HDFS or Hive.


To extract data from HDFS, a PowerExchange for Hadoop mapping contains a flat file source. To load data to HDFS or Hive, a PowerExchange for Hadoop mapping contains a flat file target.

In the Workflow Manager, you specify the HDFS flat file reader to extract data from HDFS. You specify the HDFS flat file writer to load data to HDFS or Hive. You select a Hadoop HDFS connection object to access the HDFS database tier of Hadoop.

The Integration Service communicates with Hadoop through the Java Native Interface (JNI). JNI is a programming framework that enables Java code running in a Java Virtual Machine (JVM) to call, and be called by, native applications and libraries.


Chapter 2: PowerExchange for Hadoop Configuration

This chapter includes the following topics:

- PowerExchange for Hadoop Configuration Overview
- Registering the Plug-in

PowerExchange for Hadoop Configuration Overview

PowerExchange for Hadoop installs with PowerCenter. To configure PowerExchange for Hadoop, complete the following steps:

1. Install or upgrade PowerCenter.

2. If you are upgrading from PowerCenter 9.1.0, manually register the PowerExchange for Hadoop plug-in.

Registering the Plug-in

A plug-in is an XML file that defines the functionality of PowerExchange for Hadoop. To register the plug-in, the repository must be running in exclusive mode. Use Informatica Administrator (the Administrator tool) or the pmrep RegisterPlugin command to register the plug-in.

You only need to register the PowerExchange for Hadoop plug-in if you are upgrading from PowerCenter 9.1.0.

The plug-in file for PowerExchange for Hadoop is pmhdfs.xml. When you install the Repository component, the installer copies pmhdfs.xml to the following directory:

<Informatica Installation Directory>/server/bin/native
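For example, you might register the plug-in from the command line with pmrep. The following sketch assumes a repository named MyRepository in the domain MyDomain and an administrator account; substitute your own repository, domain, and credentials, and make sure the repository is running in exclusive mode first:

pmrep connect -r MyRepository -d MyDomain -n Administrator -x MyPassword
pmrep registerplugin -i <Informatica Installation Directory>/server/bin/native/pmhdfs.xml

If an earlier version of the plug-in is already registered, add the -e option to the registerplugin command to update the existing registration.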

Note: If you do not have the correct privileges to register the plug-in, contact the user who manages the PowerCenter Repository Service.


Chapter 3: PowerExchange for Hadoop Sources and Targets

This chapter includes the following topic:

- PowerExchange for Hadoop Sources and Targets Overview

PowerExchange for Hadoop Sources and Targets Overview

You include a flat file source definition in a mapping to extract Hadoop data. You include a delimited flat file target definition in a mapping to load data into HDFS or to a Hive table.

You can import a flat file definition or manually create one. To load data to a Hadoop target, the flat file definition must be delimited.


Chapter 4: PowerExchange for Hadoop Sessions

This chapter includes the following topics:

- PowerExchange for Hadoop Sessions Overview
- PowerExchange for Hadoop Connections
- Sessions with Hadoop Sources
- Sessions with Hadoop Targets

PowerExchange for Hadoop Sessions Overview

After you create a PowerExchange for Hadoop mapping in the Designer, you create a PowerExchange for Hadoop session in the Workflow Manager to read, transform, and write Hadoop data.

Before you create a session, configure a Hadoop HDFS application connection to connect to the HDFS host. When the Integration Service extracts or loads Hadoop data, it connects to the Hadoop cluster through the HDFS host that runs the name node service for the cluster.

If the mapping contains a flat file source, you can configure the session to extract data from HDFS. If the mapping contains a flat file target, you can configure the session to load data to HDFS or a Hive table.

When the Integration Service loads data to a Hive table, it first loads data to HDFS. The Integration Service then generates an SQL statement to create the Hive table and load the data from HDFS to the table.
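Conceptually, the generated SQL resembles the following HiveQL. The table name, columns, and HDFS path here are illustrative; the actual statement is built from the flat file target definition and session properties:

CREATE TABLE customer_target (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/hadoop/customer_output' INTO TABLE customer_target;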


PowerExchange for Hadoop Connections

Use a Hadoop HDFS application connection object for each Hadoop source or target that you want to access.

The following table describes the properties that you configure for a Hadoop HDFS application connection:

- Name. Connection name used by the Workflow Manager. The connection name cannot contain spaces or other special characters, except for the underscore character.
- User Name. Name of the user in the Hadoop group that is used to access the HDFS host.
- Password. Password to access the HDFS host. Reserved for future use.
- Host Name. Name of the HDFS host that runs the name node service for the Hadoop cluster.
- HDFS Port. Port number of the name node.
- Hive Driver Name. Name of the Hive driver. By default, the driver name is org.apache.hadoop.hive.jdbc.HiveDriver.
- Hive URL. URL to the Hive host. Specify the URL in the following format: jdbc:hive://hostname:portnumber/default
- Hive User Name. Hive user name. Reserved for future use.
- Hive Password. Password for the Hive user. Reserved for future use.
- Hadoop Distribution. Name of the Hadoop distribution. You can choose Apache, Cloudera, or Greenplum. Default is Apache.
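For example, a connection to a cluster whose name node runs on the host hadoop-nn might use the following values. The host name is illustrative; 8020 is a commonly used name node port, and 10000 is the default Hive server port:

Host Name: hadoop-nn
HDFS Port: 8020
Hive Driver Name: org.apache.hadoop.hive.jdbc.HiveDriver
Hive URL: jdbc:hive://hadoop-nn:10000/default
Hadoop Distribution: Apache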

Sessions with Hadoop Sources

You can configure a session to extract data from HDFS.

When you configure a session for a Hadoop source, you select the HDFS Flat File reader file type and a Hadoop HDFS application connection object. You can stage the source data and configure partitioning.

Staging HDFS Source Data

You can optionally stage an HDFS source. The Integration Service stages source files on the local machine and then loads data from the staged file or files into the target.

Stage an HDFS source when you want the Integration Service to read the source files and then close the connection before continuing to process the data.

Configure staging for an HDFS source by setting HDFS flat file reader properties in a session.


You can configure the following types of staging:

Direct

Use direct staging when you want to read data from a single source file. The Integration Service reads data from the source file and stages the data on the local machine before passing it to downstream transformations.

For example, you stage a file named source.csv from the Hadoop source location to the following location:

c:\staged_files\source_stage.csv

The Integration Service stages source.csv as source_stage.csv in the c:\staged_files directory. Then, the Integration Service loads data from the source_stage.csv file into the Hadoop target file as specified by the output file path in the HDFS flat file writer properties.

Indirect

Use indirect staging when you want to read data from multiple source files. The Integration Service reads data from multiple files in the source. It creates an indirect file that contains the names of the source files. It then stages the indirect file and the files read from the source in the local staging directory before passing the data to downstream transformations.

For example, you stage the files named source1.csv and source2.csv from the Hadoop source location to the following location:

c:\staged_files\source_stage_list.txt

The Integration Service creates an indirect file named source_stage_list.txt that contains the following entries:

source1.csv
source2.csv

The Integration Service stages the indirect file and the source files. In the c:\staged_files directory, you would see the following files:

source_stage_list.txt
source1.csv
source2.csv

Then, the Integration Service loads data from the staged source files into the Hadoop target path as specified by the output file path in the HDFS flat file writer properties.

To configure staging, set the following session properties:

- Is Staged. Enabled.
- File Path. For direct staging, enter the name and path of the source file. For indirect staging, enter the directory path to the source files.
- Staged File Name. Name and path on the local machine used to stage the source data.
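For example, to configure indirect staging for the source files described earlier, you might set the properties as follows (the HDFS directory is illustrative):

Is Staged: Enabled
File Path: /user/hadoop/input/
Staged File Name: c:\staged_files\source_stage_list.txt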

Hadoop Source Connections

To extract data from a Hadoop source, you configure the session properties to select the HDFS Flat File reader file type and a Hadoop HDFS application connection object.

In the Sources Node section of the Mapping tab in the session properties, select HDFS Flat File as the reader type for the source. Then, select a Hadoop HDFS application connection object for the source.


Session Properties for a Hadoop Source

You can configure a session for a Hadoop source to set staging and partitioning properties.

The following table describes the properties that you can configure for a Hadoop source:

- Is Staged. Before reading the source file, the Integration Service stages the remote file or files locally.
- Staged File Name. The local file name and directory where the Integration Service stages files. If you use direct staging, the Integration Service stages the source file using this file name. If you use indirect staging, the Integration Service uses the file name to create the indirect file.
- Concurrent read partitioning. Order in which multiple partitions read input rows from a source file. You can choose one of the following options:
  - Optimize throughput. The Integration Service does not preserve row order when multiple partitions read from a single file source. Use this option if the order in which multiple partitions read from a file source is not important.
  - Keep relative input row order. The Integration Service preserves the input row order for the rows read by each partition. Use this option to preserve the sort order of the input rows read by each partition.
  - Keep absolute input row order. The Integration Service preserves the input row order for all rows read by all partitions. Use this option to preserve the sort order of the input rows each time the session runs. In a pass-through mapping with passive transformations, the rows are written to the target in the same order as the input rows.
- File Path. The Hadoop directory path of the flat file source. The path can be relative or absolute. If relative, it is relative to the home directory of the Hadoop user. For example, you might specify the following value in the File Path property: /home/foo/

Sessions with Hadoop Targets

You can configure a session to load data to HDFS or a Hive table.

When you configure a session for a Hadoop target, you select the HDFS Flat File writer file type and a Hadoop HDFS application connection object.

When you configure the session to load data to HDFS, you can configure partitioning, file header, and output options. When you configure the session to load data to a Hive table, you can configure partitioning and output options.

When the Integration Service loads data to a Hive table, it generates a relational table in the Hive database. You can overwrite the Hive table data when you run the session again.

Hadoop Target Connections

To load data to a Hadoop target, you configure the session properties to select the HDFS Flat File writer file type and a Hadoop HDFS application connection object.

In the Targets Node section of the Mapping tab in the session properties, select HDFS Flat File as the writer type for the target. Then, select a Hadoop HDFS application connection object for the target.


Hadoop Target Partitioning

When you configure a session to load data to a Hadoop target, you can write the target output to a separate file for each partition or to a merge file that contains the target output for all partitions.

You can select the following merge types for Hadoop target partitioning:

No Merge

The Integration Service generates one target file for each partition. If you stage the files, the Integration Service transfers the target files to the remote location at the end of the session. If you do not stage the files, the Integration Service generates the target files at the remote location.

Sequential Merge

The Integration Service creates one output file for each partition. At the end of the session, the Integration Service merges the individual output files into a merge file, deletes the individual output files, and transfers the merge file to the remote location.

If you set the merge type to sequential, you need to define the merge file path and the output file path in the session properties. The merge file path determines the final Hadoop target location where the Integration Service creates the merge file. The Integration Service creates the merge file from the intermediate merge file output in the location defined for the output file path.
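For example, with a sequential merge you might set the output file path to an intermediate directory and the merge file path to the final target file (both paths are illustrative):

Output File Path: /user/hadoop/intermediate/
Merge File Path: /user/hadoop/final/orders_merged.csv

The Integration Service writes one file for each partition under /user/hadoop/intermediate/ and then merges them into /user/hadoop/final/orders_merged.csv.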

Session Properties for a Hadoop Target

You can configure a session for a Hadoop target to load data to HDFS or a Hive table. When you load data to a Hadoop target, you can set partitioning properties and file paths. When you load data to HDFS, you can also set header options.

The following table describes the properties that you can configure for a Hadoop target:

- Merge Type. Type of merge that the Integration Service performs on the data for partitioned targets. You can choose one of the following merge types: No Merge or Sequential Merge.
- Append if Exists. Appends data to a file. If the merge file path refers to a directory, this option applies to the files in the directory.
- Header Options. Creates a header row in the flat file when loading data to HDFS.
- Auto generate partition file names. Generates partition file names.
- Merge File Path. If you choose a sequential merge type, defines the final Hadoop target location where the Integration Service creates the merge file. The Integration Service creates the merge file from the intermediate merge file output in the location defined in the output file path.
- Generate And Load Hive Table. Generates a relational table in the Hive database. The Integration Service loads data into the Hive table from the HDFS flat file target.
- Overwrite Hive Table. Overwrites the data in the Hive table.
- Hive Table Name. Hive table name. Default is the target instance name.


- Externally Managed Hive Table. Loads Hive table data to the location defined in the output file path.
- Output File Path. Defines the absolute or relative directory path or file path on the HDFS host where the Integration Service writes the HDFS data. A relative path is relative to the home directory of the Hadoop user. If you choose a sequential merge type, defines where the Integration Service writes intermediate output before it writes to the final Hadoop target location as defined by the merge file path. If you choose to generate partition file names, this path can be a directory path.
- Reject File Path. The path to the reject file. By default, the Integration Service writes all reject files to the service process variable directory, $PMBadFileDir.
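For example, to load data to the Hive table used in the earlier HiveQL sketch, you might set the following properties (values are illustrative):

Generate And Load Hive Table: Enabled
Hive Table Name: customer_target
Output File Path: /user/hadoop/customer_output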


Index

H
HDFS sources
    staging
HDFS targets
    pipeline partitioning

P
pipeline partitioning
    description for Hadoop
plug-ins
    registering for PowerExchange for Hadoop
PowerExchange for Hadoop
    configuring
    overview
    PowerCenter integration
    registering the plug-in

S
sessions
    configuring Hadoop sources
    configuring Hadoop targets
    PowerExchange for Hadoop
sources
    PowerExchange for Hadoop
staging
    HDFS source data

T
targets
    PowerExchange for Hadoop