gabriele-baldassarre
Introduction to Talend Open Studio for Data Integration, focusing on job architecture, metadata, workspaces, connection types and common-use components. Includes a rich Tips & Tricks section.
Talend Open Studio Fundamentals
gabrielebaldassarre.com
What is Talend for Data Integration?
❏ Eclipse-based visual programming IDE for ETL
applications
❏ Java code generator
❏ 600+ connectors for open and proprietary data systems
❏ Easily embeddable in custom applications
❏ Cross-platform
❏ Central metadata repository
❏ Available in both open source and premium flavours
What does ETL stand for?
It summarizes every operation that loads, retrieves,
digests, consumes, transforms and shapes data:
❏ Extract - get the data from different sources.
From flat files, RDBMS, Big Data systems, web services, business...
❏ Transform - convert it into a form suitable for the destination
data system.
Aggregate, transform, combine, reshape, clean, filter, improve quality...
❏ Load - move it to the target destination in a suitable way.
Write the data in the target format.
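The three steps can be pictured with a tiny plain-Java sketch. This is not Talend-generated code: the class name, the sample records and the in-memory target are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of Extract / Transform / Load in plain Java.
class EtlSketch {

    // Extract: get raw records from a source (here, hard-coded "name;age" lines).
    static List<String> extract() {
        return List.of("alice;30", "bob;", "carol;41");
    }

    // Transform: parse each line, drop rows without an age, upper-case the name.
    static List<Map.Entry<String, Integer>> transform(List<String> lines) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(";", -1);
            if (fields.length < 2 || fields[1].isEmpty()) {
                continue; // filter out incomplete rows
            }
            rows.add(Map.entry(fields[0].toUpperCase(), Integer.parseInt(fields[1])));
        }
        return rows;
    }

    // Load: write the rows to the target (here, an in-memory map).
    static Map<String, Integer> load(List<Map.Entry<String, Integer>> rows) {
        Map<String, Integer> target = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> row : rows) {
            target.put(row.getKey(), row.getValue());
        }
        return target;
    }

    public static void main(String[] args) {
        System.out.println(load(transform(extract())));
    }
}
```

In a real job each of these methods would roughly correspond to one or more components on the canvas.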
Talend Open Studio
❏ It’s the open source, free to use, community-supported
version of Talend for Data Integration;
❏ Often abbreviated to “TOS”, to distinguish it from the premium
version (“TIS”);
❏ Feature-light, but still completely usable:
❏ Same set of connectors and components as the premium
version;
❏ It lacks teamwork and enterprise capabilities such as
SVN, scheduling, process orchestration and the monitoring
console.
Hands on!
❏ Download Talend Open Studio for Data Integration
❏ https://www.talend.com/download/data-integration
❏ Download the user manual as well
❏ Install it!
❏ Optional:
❏ Prepare a quick MySQL stack for a ready-to-use
database and other conveniences
❏ https://github.com/r8/vagrant-lamp is worth a try
Say hello to TOS!
TOS Interface: Designer
The Designer is the “canvas” where you’re going to “draw” your ETL job, graphically connecting components to each other using different kinds of connections.
TOS Interface: Components Palette
The Palette on the right hosts the complete set of 600+ available components, both custom and built-in.
Use the search field to quickly filter the palette views and find the component you need at a glance.
TOS Interface: Opened Jobs
Currently open jobs are shown as tabs at the top.
TOS Interface: Repository Pane
The Repository pane hosts all the metadata, like DB connections credentials, external delimited file schemas, parameters and the whole set of ETL jobs themselves.
TOS Interface: Parameters Pane
The Parameters pane hosts the settings of the selected component, the job settings and parameters, the debug status and the diagnostic tab.
TOS Interface: Perspectives
Different Perspectives are available in the top-right corner.
Both TOS and standard Eclipse perspectives are available here.
Workspaces
A Workspace is a container of Projects which share the
same TOS version and the same components palette.
As in Eclipse, you can choose which one to use when the
program starts.
❏ In TOS, it’s a folder on the local drive.
Projects
❏ A Project is a set of jobs and their related metadata;
❏ It lives in a subfolder of the Workspace;
❏ Both TOS and Eclipse Preferences are Project-based
❏ In other words, different projects in the same Workspace
have different settings;
❏ Internally, it’s a mix of XML, .item and .properties files
in classic Eclipse fashion.
Metadata: General Principles
❏ TOS requires a preliminary definition and
description of jobs using metadata.
The Repository holds this information.
❏ There are 8 types of metadata,
although custom components can
define their own. We’ll look at the most
important ones in detail:
❏ Business Models, Job Designs, Contexts,
Code, Metadata.
Metadata: Business Models
❏ It stores diagrams used to
conveniently describe business models
and to link them to the ETL;
❏ It offers a small set of UML-style
drawing capabilities;
❏ It’s not widely used, but it has proven
useful for quickly sketching out
transformation goals and for
self-documenting the ETL.
Metadata: Jobs
❏ It’s the beating heart of the TOS Repository:
the jobs themselves;
❏ Here you’ll store all the metadata you
need to describe the jobs graphically
❏ Components used, connectors, signals,
parameters, colors and presentation
details are hosted here.
❏ You can (you should!) organize them in
a folder tree for better clarity.
Metadata: Contexts
❏ It stores context groups, which are
parameter sets that can be used by
any job in the current Project.
❏ A group is a set of initialized Java
variables of one of the allowed types in
the global scope.
❏ Groups are for presentation only: there
is no limit on how many context variables
you use in jobs, or on how you use them.
Metadata: Code
❏ It stores routines written in Java;
❏ These routines are typically a set of
static methods inside a class.
❏ If your routine is going to be too
complex, consider writing a custom
component instead.
❏ Consider using Maven and Git while
creating a routine for better reliability.
❏ https://github.com/theclue/talend-routine-collection
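As a rough idea of what “a set of static methods inside a class” looks like, here is a minimal sketch. The class and method names are hypothetical. In TOS, user routines typically live in a package named routines; the package declaration is left out here so the sketch runs standalone.

```java
// Hypothetical user routine: a plain class exposing static helper methods.
class StringRoutines {

    // Returns an empty string instead of null, so expressions can chain safely.
    public static String nullToEmpty(String value) {
        return value == null ? "" : value;
    }

    // Trims the value and collapses runs of whitespace into a single space.
    public static String normalizeSpaces(String value) {
        return nullToEmpty(value).trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(normalizeSpaces("  Talend   Open  Studio "));
    }
}
```

Once saved in the Repository, such a method could be called from any expression field, for example StringRoutines.normalizeSpaces(row1.name), where row1.name is an assumed column.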
Metadata: ...Metadata?
❏ It stores a heterogeneous set of
reusable, atomic elements for jobs.
❏ They include database parameters and
credentials, external file schemas, web
service interfaces, business
application connections and so on.
❏ User components often add their own
metadata types to the list, but this
often breaks compatibility.
Anatomy of a Job
❏ A Job is a visual set of components graphically
connected using different connections;
❏ From the visual canvas and the connection topology,
TOS in turn generates Java code;
❏ This code is procedural by design and not really
object-oriented:
❏ It’s fast…
❏ ...but debugging is a pain in the neck for the experienced
programmer.
Anatomy of a job
❏ Drag and drop components from the Palette to the canvas,
then visually connect them to each other.
❏ You cannot make closed paths in your jobs!
❏ It’ll become clear later why.
Anatomy of a job: Subjobs
❏ A set of connected components is part of a subjob if they are
all enclosed by a light-blue background;
❏ You can have as many subjobs as you need in a given job.
Anatomy of a job: Starting Point
❏ The starting point component of a subjob is the one with a
green background;
❏ Parallel execution is achieved using unconnected subjobs, but
you won’t be able to predict the execution order!
Anatomy of a job: Main Connections
❏ The Main connections are those that dictate the data flow;
❏ They carry vectors of data (one vector per row/tuple);
❏ When you have a split, the order of the outputs dictates which
branch comes first. You may change it from the contextual menu.
Anatomy of a job: Lookup Connections
❏ Lookup connections, as the name suggests, make data
available for fast lookup (i.e. join or match operations).
❏ Typically, lookup data vectors are stored in memory during
job processing. So watch out for memory shortages!
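The memory cost of a lookup is easier to see in code. The sketch below is a plain-Java hash join, not what TOS actually generates; the class name, keys and values are assumptions. The whole lookup sits in a map while the main rows stream through one at a time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: why a lookup flow costs memory.
class LookupSketch {

    // Joins each main row {customerId, amount} against an in-memory lookup
    // (customerId -> name). Unmatched rows are dropped, like an inner join.
    static List<String> join(List<String[]> mainRows, Map<String, String> lookup) {
        List<String> out = new ArrayList<>();
        for (String[] row : mainRows) {          // main flow: one row at a time
            String name = lookup.get(row[0]);    // match by key in the in-memory map
            if (name != null) {
                out.add(name + ":" + row[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The whole lookup is loaded before the main flow starts.
        Map<String, String> lookup = new HashMap<>();
        lookup.put("c1", "Alice");
        lookup.put("c2", "Bob");

        List<String[]> mainRows = new ArrayList<>();
        mainRows.add(new String[] {"c1", "10"});
        mainRows.add(new String[] {"c3", "99"});

        System.out.println(join(mainRows, lookup));
    }
}
```

The map grows with the size of the lookup source, which is why a very large lookup table can exhaust the heap even when the main flow is small.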
Anatomy of a job: Endpoints
❏ Endpoints are components that have no outgoing
connections.
❏ A given subjob can have as many endpoints as needed (think
about what happens after a split operation like the one above).
Signals and Data Connections
❏ There are three types of connections in standard TOS:
❏ Row
❏ Trigger
❏ Iterator
❏ You may select which connection to use from the
contextual menu of any component instance.
Row
❏ Rows are connections that carry data, one tuple at
a time;
❏ Their content is defined by a Schema;
❏ They are used to connect components;
❏ Components connected this way will end up in the same
subjob;
❏ Main, Lookup, Filter, Merge are all data connections;
❏ Custom components can define their own Data
Connections.
Schema
❏ The Schema is an important core concept in TOS design;
❏ Each Row connection must have a non-null schema
declaration, which defines the fields of the vector of
data flowing into or out of a given component;
❏ Several primitive Java types are supported.
Triggers
❏ Triggers, as the name suggests, don’t carry data,
but are actually signals.
❏ They are usually used to connect subjobs.
❏ They come in two main flavours, depending on their
scope: Sub Job Triggers and Component Triggers.
❏ They’re typically Go/No-Go events that trigger the execution
of one or more subjobs;
Sub Job Triggers
❏ Sub Job Triggers are the most
widely used in practice;
❏ They are used to connect the
starting points of subjobs;
❏ When connected this way,
subjobs will execute sequentially,
forcing an execution order;
❏ You’ll end up having only one
starting point for the whole chain.
Run If Triggers
❏ The Run If Trigger is a special type of trigger that is fired
only if the embedded expression evaluates to true.
❏ The expression must be written in Java and have a
boolean outcome.
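A Run If condition is just a Java boolean expression. The sketch below wraps one in a method so it can run on its own. The map stands in for the job’s global map, and the key tFileInputDelimited_1_NB_LINE is an assumed name for a component’s row counter, used here for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Run If condition. In a real job only the boolean
// expression is typed into the trigger; the map is provided by the job itself.
class RunIfSketch {

    // Example condition: fire the trigger only if at least one row was read.
    // The key name is an assumption made for this sketch.
    static boolean shouldRun(Map<String, Object> globalMap) {
        Object count = globalMap.get("tFileInputDelimited_1_NB_LINE");
        return count != null && ((Integer) count) > 0;
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("tFileInputDelimited_1_NB_LINE", 42);
        System.out.println(shouldRun(globalMap));
    }
}
```

The null check matters: if the upstream component never ran, the key is missing and an unchecked cast-and-compare would throw.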
Iterators
❏ Iterators stand midway between Data
Connections and Triggers;
❏ They don’t carry data like Rows…
❏ ...but they’re not fired only once like Triggers.
❏ Think of them as Triggers which are fired once for
each incoming row.
❏ They are connected to starting points, like Sub Job
Triggers, but originate from standard components, like
Row Connections.
Component Parameters
❏ When you select a component instance, the parameter
pane shows the relevant fields for you to fill in;
❏ Several types of parameters are allowed: dropdowns,
radio buttons, schemas, text fields...
❏ Text fields will often end up writing their value into the
generated Java code as-is, so be sure to write them
properly:
❏ Enclose strings in double quotes;
❏ Be sure to match the expected type, or cast
otherwise.
Components and Repository
❏ Very often, Components allow you to select relevant
metadata from the Repository;
❏ Doing so lets you keep parameters “in sync” between
jobs and component instances;
❏ However, this is not mandatory and at any time you
can detach the component from the Repository.
❏ This puts the component in the “built in” state, which
means that its parameters are locally defined and won’t
be updated anymore when the Repository is.
The Context
❏ The Context holds parameters defined at compile time.
❏ Those parameters are grouped in Context Groups and
defined in the Repository as primitive Java types.
❏ They then end up as public attributes of the
context object inside the code.
❏ For example, a parameter named “foo” will be referenced
using the syntax context.foo in code and parameter
fields.
❏ Just like parameters, a “built in” Context can be defined
too, scoped to the local job only.
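The context.foo syntax can be pictured with a small stand-in. This is a simplified sketch, not the class TOS generates; the parameter names inputDir and maxRows and their values are assumptions.

```java
// Hypothetical sketch of how context parameters surface in code.
class ContextSketch {

    // Stand-in for the context object: each parameter is a public field.
    static class Context {
        public String inputDir = "/tmp/in";   // assumed parameter name and value
        public Integer maxRows = 100;         // assumed parameter name and value
    }

    static final Context context = new Context();

    // What a text field containing  context.inputDir + "/customers.csv"
    // would evaluate to once it is pasted into the generated code.
    static String inputFile() {
        return context.inputDir + "/customers.csv";
    }

    public static void main(String[] args) {
        System.out.println(inputFile());
    }
}
```

Note the double quotes around the literal part of the expression: since field values are written into the Java code as-is, an unquoted path would not compile.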
The Global Map
❏ The Global Map holds parameters defined at runtime.
❏ Those parameters live in a pure Java space.
❏ It’s a key-value Map used to store generic Objects:
❏ globalMap.put("key", object) to store an object
❏ globalMap.get("key") to get an Object
❏ Since it’s a Java Map of Objects, you must explicitly
cast to the proper type when getting the object back.
❏ It proves very handy when used in conjunction with
Iterators, as they cannot carry data on their own.
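A minimal sketch of the put/get/cast pattern follows. Here globalMap is simulated with a plain HashMap (in a real job it already exists), and the keys currentFile and rowCount are made-up names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: globalMap is simulated here for illustration.
class GlobalMapSketch {

    static final Map<String, Object> globalMap = new HashMap<>();

    // Store values at runtime, e.g. from a custom code component.
    static void remember(String fileName, int rowCount) {
        globalMap.put("currentFile", fileName);   // assumed key name
        globalMap.put("rowCount", rowCount);      // int is boxed to Integer
    }

    // Read them back: get() returns Object, so an explicit cast is required.
    static String describe() {
        String file = (String) globalMap.get("currentFile");
        Integer rows = (Integer) globalMap.get("rowCount");
        return file + " (" + rows + " rows)";
    }

    public static void main(String[] args) {
        remember("/tmp/in/a.csv", 3);
        System.out.println(describe());
    }
}
```

Casting to the wrong type compiles fine but fails at runtime with a ClassCastException, so it pays to keep key names and types consistent across the job.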
Talend Open Studio Common-use Components
Which component to use…?
❏ TOS comes with more than 600 general-use items;
❏ This is because it must ensure connectivity with tons of
different data sources (e.g. RDBMS, appliances…);
❏ Clearing away the clutter, you’ll end up with a very
small subset of life-saving components. We can group
the most important ones into families and look at them in detail:
❏ Database, File, Custom Code, Processing, Orchestration
File Components
❏ These components are used for input and
output from/to local files;
❏ Notable features include the archiving
capabilities and a complete set of file
system management operations, like copy, delete
or directory listing;
❏ Under Linux, you can use a named pipe for
streaming data into TOS directly from a
calling shell.
Database Components
❏ These components are used for performing
operations on RDBMS;
❏ Notable features include the components
for SCD and cloud support (e.g. AWS
Redshift);
❏ Unfortunately, for licensing reasons, you often
have to download the JDBC driver from
the RDBMS vendor yourself in order to
use it in TOS; quite annoying!
Custom Code Components
❏ These components allow you to write
Java code directly into your Job;
❏ Although quite hard to manage, these are
real life-savers in lots of different situations;
❏ A typical use case is when you want to import
and use an external Java library or method.
❏ Several components are available for
different scopes, e.g. generating data flows,
processing rows, etc.
Processing Components
❏ These are probably the most important
components of all;
❏ They include sort, filter, aggregation, join,
sampling, XML traversing;
❏ But the most important component ever is
the tMap;
❏ It’s a general-purpose multi-input,
multi-output mapper component.
❏ We’ll look at it in detail...
tMap in a typical Job
❏ Basically speaking,
think of a set of
joins, a set of splits
and transformations
in the middle.
❏ That’s why it has a
special user interface.
Say hello to tMap
Here come the Input Data Connections with their own Schemas. Only one is the Main connection; the others are all Lookup connections. Here you’ll set the join conditions. Clicking the wrench reveals more options, like the join type and how to load the lookup tables.
On the right pane we have the Output Data Connections, each with its own Schema. Again, the wrench reveals more options, for example whether the connection must catch rows where the join has failed, and more.
Each output field is a Java expression. This means you can call methods and user routines in it, combine expressions and more. Click on it to open the powerful Expression Wizard.
As a convenience, you have the Var pane for adding temporary variables. Use it if your inner transformations cannot be easily handled in a single-line Java expression.
The Schema Editor is for both input and output connections. Use it to check and set the data type, the length and the nullable flag for each field.
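To give a feel for the kind of Java that goes into output fields and the Var pane, here is a standalone sketch. The Row class and its field names are assumptions standing in for an input connection called row1; inside tMap you would type only the expression after each return.

```java
// Hypothetical sketch of tMap-style expressions, wrapped in methods so it runs standalone.
class TMapExpressionSketch {

    // Stand-in for an input row; nullable columns use wrapper types.
    static class Row {
        String firstName;
        String lastName;
        Double price;
        Integer quantity;
    }

    // Var pane: an intermediate value that several outputs could reuse.
    static double total(Row row1) {
        return (row1.price == null || row1.quantity == null)
                ? 0.0
                : row1.price * row1.quantity;
    }

    // Output field: a single Java expression with null handling.
    static String fullName(Row row1) {
        return (row1.firstName == null ? "" : row1.firstName.trim())
                + " "
                + (row1.lastName == null ? "" : row1.lastName.toUpperCase());
    }

    public static void main(String[] args) {
        Row row1 = new Row();
        row1.firstName = " Ada ";
        row1.lastName = "Lovelace";
        row1.price = 2.5;
        row1.quantity = 4;
        System.out.println(fullName(row1) + " -> " + total(row1));
    }
}
```

When an expression grows beyond a ternary or two like these, moving part of it to a Var or to a user routine usually keeps the mapping readable.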
Orchestration Components
❏ These components, as the name states, are
used to “bring order” inside and outside the
jobs;
❏ They allow you to call a TOS job from
another, to put a job in a wait state and more.
❏ Here you will find two components to switch
between Row and Iterator Connections;
❏ A typical use case is when you want to trigger an
event for each row in the incoming connection.
Other useful components
❏ tPreJob and tPostJob are two special starting
points that are triggered, respectively, before
and after all other subjobs in the current job;
❏ tLogRow logs the content of a given Row
connection to the console;
❏ tHashInput and tHashOutput are useful to
define reusable buffers of data inside a job;
❏ tLibraryLoad imports external jars into
the classpath of the current job.
Talend Open Studio Tips and Tricks
Tips and Tricks
❏ Use Repository metadata when possible:
it’ll make your design more robust.
❏ Generic Schema metadata, as the name
suggests, are useful to define schemas that
should not depend on a format or platform,
unlike file schemas or database table
schemas.
❏ Always document your jobs: the documentation
can then be exported to a ready-to-use document!
Tips and Tricks
❏ Clicking “Sync Schema” propagates the
current schema forward, changing any
schema to “built in” along the way.
❏ Built in Schemas won’t get updated when
the Repository changes!
❏ If you have large lookup, sort or aggregate
operations, you may need to raise the amount
of RAM allotted to the JVM in Job Parameters.
❏ You may get a Java heap error otherwise.
Tips and Tricks
❏ Every transformation is a Java expression
in Talend!
❏ Handle null values properly to avoid Java
NullPointerExceptions;
❏ Use primitive wrappers when possible (i.e.
‘Integer’ instead of ‘int’);
❏ Use methods, not operators (i.e. .equals() and
.concat()).
❏ Perform filtering as soon as possible to
reduce memory consumption.
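The null-handling tips above can be sketched in a few lines of plain Java. The class and method names are hypothetical; the point is the pattern, which applies to any expression field.

```java
// Hypothetical sketch of null-safe expressions.
class NullSafeSketch {

    // A wrapper type (Integer) can hold null, so a missing value can be
    // detected and replaced instead of failing on unboxing.
    static int orZero(Integer value) {
        return value == null ? 0 : value;
    }

    // Calling equals() on the constant never throws, even when the input
    // is null, and it compares content rather than object identity.
    static boolean isItaly(String country) {
        return "IT".equals(country);
    }

    public static void main(String[] args) {
        System.out.println(orZero(null) + " " + isItaly(null) + " " + isItaly("IT"));
    }
}
```

Writing country.equals("IT") instead would throw a NullPointerException on a null input, and country == "IT" would compare references rather than content.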
Getting Help
❏ Talend Forge: forum, custom components, tutorials,
bug trackers, example jobs
❏ Stack Overflow
❏ http://stackoverflow.com/questions/tagged/talend
❏ Books from Packt Publishing
❏ “Getting started with Talend Open Studio for Data
Integration” by Jonathan Bowen;
❏ “Talend Open Studio Cookbook” by Rick D. Barton.
Contacts
❏ Tutorials
❏ Custom components
❏ Ready-made jobs
❏ Use Cases
http://gabrielebaldassarre.com
Need help? Questions? Consulting needs?
http://gabrielebaldassarre.com/contacts/
@cerealping