Talend Open Studio Fundamentals #1: Workspaces, Jobs, Metadata and Tips & Tricks

Talend Open Studio Fundamentals

gabrielebaldassarre.com

What is Talend for Data Integration?

❏ Eclipse-based visual programming IDE for ETL applications

❏ Java code generator

❏ 600+ connectors for open and proprietary data systems

❏ Easily embeddable in custom applications

❏ Cross-platform

❏ Central metadata repository

❏ Available in both open source and premium flavours

What does ETL stand for?

It summarizes every operation that loads, retrieves,

digests, consumes, transforms and shapes data:

❏ Extract - get the data from different sources.

From flat files, RDBMS, Big Data systems, web services, business...

❏ Transform - convert it into a form suitable for the destination data system.

Aggregate, transform, combine, reshape, clean, filter, improve quality...

❏ Load - move it to the target destination in a suitable way.

Write the data in the target format.
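
As a rough illustration of the three steps, the sketch below runs a tiny ETL in plain Java. The class name, the sample data and the cleaning rule are all invented for the example; this is not code generated by Talend.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EtlSketch {
    // Extract: an in-memory list stands in for a flat file or a database table.
    static List<String> extract() {
        return Arrays.asList(" alice ", "", "bob");
    }

    // Transform: trim each value, drop the empty ones, upper-case the rest.
    static List<String> transform(List<String> rows) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            String cleaned = row.trim();
            if (!cleaned.isEmpty()) {
                out.add(cleaned.toUpperCase());
            }
        }
        return out;
    }

    // Load: write the rows in the target format (here, one row per line).
    static String load(List<String> rows) {
        return String.join("\n", rows);
    }

    public static void main(String[] args) {
        System.out.println(load(transform(extract())));
    }
}
```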

Talend Open Studio

❏ It’s the open source, free to use, community-supported

version of Talend for Data Integration;

❏ Often abbreviated as “TOS”, to distinguish it from the premium version (“TIS”);

❏ Feature-light, but still completely usable:

❏ Same set of connectors and components as the premium version;

❏ It lacks teamwork and enterprise capabilities such as SVN, scheduling, process orchestration and a monitoring console.

Hands on!

❏ Download Talend Open Studio for Data Integration

❏ https://www.talend.com/download/data-integration

❏ Download the user manual as well

❏ Install it!

❏ Optional:

❏ Prepare a quick MySQL stack for a ready-to-start

database and other commodities

❏ https://github.com/r8/vagrant-lamp is worth a try

Say hello to TOS!

TOS Interface: Designer

The Designer is the “canvas” where you “draw” your ETL job, graphically connecting components to each other using different kinds of connectors.

TOS Interface: Components Palette

The Palette on the right hosts the complete set of 600+ available components, both custom and built-in.

Use the search field to quickly filter the palette views and find the component you need at a glance.

TOS Interface: Opened Jobs

Currently Opened jobs are tabbed on top...

TOS Interface: Repository Pane

The Repository pane hosts all the metadata, like DB connections credentials, external delimited file schemas, parameters and the whole set of ETL jobs themselves.

TOS Interface: Parameters Pane

The Parameters pane hosts all the select-component settings, job settings and parameters, debug status and the diagnostic tab.

TOS Interface: Perspectives

...and different Perspectives are available in the top-right corner.

Both TOS and standard Eclipse perspectives are available here.

Workspaces

A Workspace is a container of Projects that share the same TOS version and the same components palette.

Like Eclipse, you can choose which one to use when the

program starts.

❏ In TOS, it’s a folder in the local drive.

Projects

❏ A Project is a set of jobs and involved metadata;

❏ It’s stored in a subfolder of the Workspace;

❏ Both TOS and Eclipse Preferences are Project-based;

❏ In other words, different projects in the same Workspace have different settings;

❏ Internally, it’s a mix of XML, .item and .properties files in classic Eclipse fashion.

Metadata: General Principles

❏ TOS requires a preliminary definition and description of jobs using metadata. The Repository holds this information.

❏ There are 8 types of metadata, although custom components can define their own. We’ll look at the most important ones in detail:

❏ Business Models, Job Designs, Contexts, Code, Metadata.

Metadata: Business Models

❏ It stores diagrams used to describe business models and to link them to ETL jobs;

❏ It offers a small set of UML-style drawing capabilities;

❏ It’s not widely used, but it has proven useful for quickly sketching transformation goals and for self-documenting ETL.

Metadata: Jobs

❏ It’s the beating heart of the TOS Repository: the jobs themselves;

❏ Here you store all the metadata needed to describe the jobs graphically;

❏ Components used, connectors, signals, parameters, colors and presentation details are all kept here.

❏ You can (and should!) organize them in a folder tree for better clarity.

Metadata: Contexts

❏ It stores context groups, which are sets of parameters that can be used by any job in the current Project.

❏ A group is a set of initialized Java variables, each of one of the allowed types, in the global scope.

❏ Groups are for presentation only: there is no limit on how many context variables you use in jobs, or how you use them.

Metadata: Code

❏ It stores routines written in Java;

❏ These routines are typically a set of

static methods inside a class.

❏ If your routine is getting too complex, consider writing a custom component instead.

❏ Consider using Maven and Git when creating a routine, for better reliability.

❏ https://github.com/theclue/talend-routine-collection
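
To make the “set of static methods inside a class” idea concrete, here is a minimal sketch of what a routine might look like. The class and method names are hypothetical, and the package declaration that a real routine would carry is left out so the example stays standalone.

```java
class StringRoutines {
    // Returns the fallback when the input is null or blank, otherwise the trimmed input.
    public static String defaultIfBlank(String value, String fallback) {
        if (value == null || value.trim().isEmpty()) {
            return fallback;
        }
        return value.trim();
    }

    public static void main(String[] args) {
        System.out.println(defaultIfBlank("   ", "n/a"));
        System.out.println(defaultIfBlank(" TOS ", "n/a"));
    }
}
```

Because the method is static and has no side effects, it can be called from any Java expression in a job, for example StringRoutines.defaultIfBlank(someValue, "n/a").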

Metadata: ...Metadata?

❏ It stores a heterogeneous set of

reusable, atomic elements for jobs.

❏ They include database parameters and

credentials, external files schema, web

service interfaces, business

applications connections and so on.

❏ Custom components often add their own metadata types to the list, but this frequently breaks compatibility.

Anatomy of a Job

❏ A Job is a visual set of components graphically

connected using different connections;

❏ From the visual canvas and the connection topology,

TOS in turn generates Java code;

❏ This code is procedural by design and not really object-oriented:

❏ It’s fast…

❏ ...but debugging it is a pain in the neck, even for the experienced programmer.

Anatomy of a job

❏ Drag and drop components from the Palette onto the canvas, then visually connect them to each other.

❏ You cannot make closed paths in your jobs!

❏ It’ll become clear later why.

Anatomy of a job: Subjobs

❏ A set of connected components is part of a subjob if they are

all enclosed by a light-blue background;

❏ You can have as many subjobs as you need in a given job.

Anatomy of a job: Starting Point

❏ The starting point component of a subjob is the one with a

green background;

❏ Parallel execution is made using unconnected subjobs, but

you won’t be able to predict the execution order!

Anatomy of a job: Main Connections

❏ The Main connections are those that dictate the data flow;

❏ They carry vectors of data (one vector per row/tuple);

❏ When you have a split, the order of the outputs dictates which one is served first. You can change it from the contextual menu.

Anatomy of a job: Lookup Connections

❏ Lookup connections, as the name suggests, make data available for fast lookup (i.e. join or match operations).

❏ Typically, lookup data vectors are stored in-memory during

job processing. So watch out for memory shortage!
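
The memory warning is easier to see with a sketch of how an in-memory lookup behaves. The code below is a plain-Java illustration under assumed names and data, not what TOS generates: the whole lookup side is held in a HashMap, so its size bounds the memory the job needs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LookupSketch {
    // Joins main rows (customerId, amount) against a lookup map (customerId -> name).
    static List<String> join(List<String[]> mainRows, Map<String, String> lookup) {
        List<String> out = new ArrayList<>();
        for (String[] row : mainRows) {
            String name = lookup.get(row[0]); // constant-time match, no re-read of the source
            if (name != null) {               // inner join: unmatched rows are dropped
                out.add(name + ";" + row[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The entire lookup table lives in memory for the duration of the join.
        Map<String, String> lookup = new HashMap<>();
        lookup.put("c1", "Alice");
        lookup.put("c2", "Bob");

        List<String[]> mainRows = new ArrayList<>();
        mainRows.add(new String[] {"c1", "10"});
        mainRows.add(new String[] {"c3", "99"});

        System.out.println(join(mainRows, lookup));
    }
}
```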

Anatomy of a job: Endpoints

❏ Endpoints are components that have no outgoing connection.

❏ A given subjob can have as many endpoints as needed (think of what happens after a split operation like the one above).

Signals and Data Connections

❏ There are three types of connections in standard TOS:

❏ Row

❏ Trigger

❏ Iterator

❏ You may select which connection to use from the

contextual menu of any component instance.

Row

❏ Rows are connections that carry data, one tuple at a time;

❏ Their content is defined by a Schema;

❏ They are used to connect components;

❏ Components connected this way will end up in the same

subjob;

❏ Main, Lookup, Filter, Merge are all data connections;

❏ Custom components can define their own Data

Connection.

Schema

❏ Schema is an important inner concept in TOS design;

❏ Each Row connection must have a non-null schema declaration, which defines the dimensionality of the vector of data going into and out of a given component;

❏ Several primitive Java types are supported.
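
One way to picture a schema is as a plain Java class with one typed field per column. The class and column names below are hypothetical, and the classes TOS actually generates are more elaborate; the sketch only shows the idea of a fixed, typed row layout.

```java
import java.util.Date;

class CustomerRow {
    // One public field per schema column; wrapper types can hold null for nullable columns.
    public Integer id;
    public String name;
    public Date signupDate;
    public Double balance;

    public static void main(String[] args) {
        CustomerRow row = new CustomerRow();
        row.id = 1;
        row.name = "Alice";
        // Columns that were never set stay null rather than defaulting to 0.
        System.out.println(row.id + " " + row.name + " " + row.balance);
    }
}
```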

Triggers

❏ Triggers, as the name suggests, don’t carry data: they are signals.

❏ They are usually used to connect subjobs.

❏ They come in two main flavours, depending on their scope: Sub Job Triggers and Component Triggers.

❏ They’re typically Go/No-Go events that trigger the execution of one or more subjobs;

Sub Job Triggers

❏ Sub Job Triggers are the most

widely used in practice;

❏ They are used to connect the

starting points of subjobs;

❏ When connected this way,

subjobs will execute sequentially,

forcing an execution order;

❏ You’ll end up having only one

starting point for the whole chain.

Run If Triggers

❏ Run If is a special type of trigger that fires only if the embedded expression evaluates to true.

❏ The expression must be written in Java and have a

boolean outcome.
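
As a sketch of what such an expression can look like, the method below wraps a boolean Java expression of the kind you might type into a Run If trigger. It reads a value from a key-value map of runtime objects; the key name "rowCount" and the class name are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

class RunIfSketch {
    // Stand-in for the trigger condition: it must yield a boolean.
    static boolean shouldRun(Map<String, Object> globalMap) {
        Integer rowCount = (Integer) globalMap.get("rowCount"); // hypothetical key
        // Guard against null first, so a missing value means "do not run".
        return rowCount != null && rowCount > 0;
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        System.out.println(shouldRun(globalMap));
        globalMap.put("rowCount", 42);
        System.out.println(shouldRun(globalMap));
    }
}
```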

Iterators

❏ Iterators sit midway between Data Connections and Triggers;

❏ They don’t carry data like Rows…

❏ ...but they’re not fired only once like Triggers.

❏ Think of them as Triggers that fire once for each incoming row.

❏ They are connected to starting points, like Sub Job Triggers, but originate from standard components, like Row Connections.

Component Parameters

❏ When you select a component instance, the parameter pane shows the relevant fields for you to fill in;

❏ Several types of parameters are allowed: dropdowns, radio buttons, schemas, text fields...

❏ Text fields often end up writing their value into the generated Java code as-is, so be sure to write them properly:

❏ Enclose strings in double quotes;

❏ Be sure to match the expected type, or cast otherwise.
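
The sketch below shows why the quoting matters. Each return statement contains, verbatim, what would have been typed into a text field; the class, method names and values are invented for the illustration.

```java
class ParameterSketch {
    static String fileName() {
        // Field content: "/tmp/" + "input.csv"
        // With the quotes, this is a valid Java String expression.
        return "/tmp/" + "input.csv";
    }

    static int limit() {
        // The target expects an int but the value is available as text,
        // so it is converted explicitly instead of being pasted as a String.
        return Integer.parseInt("100");
    }

    public static void main(String[] args) {
        System.out.println(fileName());
        System.out.println(limit());
    }
}
```

Typing /tmp/input.csv without quotes would be pasted into the code as-is and fail to compile, since it is not a Java expression.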

Components and Repository

❏ Very often, Components allow you to select relevant metadata from the Repository;

❏ Doing so, you will be able to keep parameters between

jobs and component instances “in sync”;

❏ However, this is not mandatory and at any time you

can detach the component from the Repository.

❏ This puts the component in the “built-in” state, which means that its parameters are defined locally and won’t be updated anymore when the Repository is.

The Context

❏ The Context holds parameters defined at compile time.

❏ Those parameters are grouped in Context Groups and defined in the Repository as primitive Java types.

❏ They then end up as public attributes of the context object inside the code.

❏ For example, a parameter named “foo” is referenced using the syntax context.foo in code and parameter fields.

❏ Just like parameters, a “built-in” Context can be defined too, scoped to the local job only.
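
A minimal sketch of the idea, assuming a hand-written stand-in for the context object (the real generated class is richer, and the parameter batchSize is invented here alongside the “foo” example from the slide):

```java
class ContextSketch {
    // Hypothetical stand-in for the context object: one public attribute per parameter.
    static class Context {
        public String foo = "default value";
        public Integer batchSize = 500;
    }

    static String describe(Context context) {
        // Same syntax you would type into a parameter field: context.foo
        return "foo=" + context.foo + ", batchSize=" + context.batchSize;
    }

    public static void main(String[] args) {
        System.out.println(describe(new Context()));
    }
}
```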

The Global Map

❏ The Global Map holds parameters defined at runtime

❏ Those parameters live in a pure Java space.

❏ It’s a Key-Value Map used to store generic Objects:

❏ globalMap.put("key", object) to store an object

❏ globalMap.get("key") to get an object back

❏ Since it’s a Java Map whose values are plain Objects, you must explicitly cast to the proper type when getting the object back.

❏ It proves very handy when used in conjunction with Iterators, as they cannot carry data on their own.
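
The put/get/cast cycle, and the way it complements an iteration, can be sketched as follows. The map here is an ordinary HashMap created by the example and the key "currentFile" is hypothetical; in a job the map would already exist.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GlobalMapSketch {
    static List<String> processAll(List<String> fileNames) {
        Map<String, Object> globalMap = new HashMap<>();
        List<String> log = new ArrayList<>();
        for (String fileName : fileNames) {
            // The iterating side publishes the current value under an agreed key...
            globalMap.put("currentFile", fileName); // hypothetical key
            // ...and the triggered side reads it back, casting explicitly from Object.
            String current = (String) globalMap.get("currentFile");
            log.add("processing " + current);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(processAll(Arrays.asList("a.csv", "b.csv")));
    }
}
```

A wrong cast, for instance (Integer) on a String value, compiles fine and only fails at runtime with a ClassCastException, which is why the type stored under each key is worth documenting.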

Talend Open Studio Common-use Components


Which component to use…?

❏ TOS comes with more than 600 general-use items;

❏ This is because it must ensure connectivity with tons of different data sources (e.g. RDBMS, appliances…);

❏ Once you clear away the clutter, you end up with a very small subset of life-saving components. We can group the most important ones in families and look at them in detail:

❏ Database, File, Custom Code, Processing, Orchestration

File Components

❏ These components are used for input and

output from/to local files;

❏ Notable features include archiving capabilities and a complete set of file system management operations, like copy, delete or directory listing;

❏ Under Linux, you can use named pipes to stream data into TOS directly from a calling shell.

Database Components

❏ These components are used for performing

operations on RDBMS;

❏ Notable features include the components for SCD and cloud support (e.g. AWS Redshift);

❏ Unfortunately, for licensing reasons, you often have to download the JDBC driver from the RDBMS vendor yourself in order to use it in TOS; quite annoying!

Custom Code Components

❏ These components allow you to write Java code directly into your Job;

❏ Although quite hard to manage, they are real life-savers in lots of different situations;

❏ A typical use case is when you want to import and use an external Java library or method.

❏ Several components are available for different scopes, e.g. generating data flows, processing rows, etc.

Processing Components

❏ These are probably the most important components of all;

❏ They include sort, filter, aggregation, join, sampling, XML traversing;

❏ But the most important component ever is the tMap;

❏ It’s a general-purpose multi-input, multi-output mapper component.

❏ We’ll look at it in detail...

tMap in a typical Job

❏ Basically speaking,

think about a set of

joins, a set of splits

and transformations

set in the middle.

❏ That’s why it has a

special user interface.

Say hello to tMap

Here come the Input Data Connections with their own Schemas. Only one is the Main connection; the others are all Lookup connections. This is where you set the join conditions. Clicking the wrench reveals more options, like the join type and how to load the lookup tables.

Say hello to tMap

On the right pane we have the Output Data Connections, each with its own Schema. Again, the wrench reveals more options, for example whether the connection must catch rows where the join has failed.

Say hello to tMap

Each output field is a Java expression. This means you can call methods on it, call user routines, combine expressions and more. Click on it to open the powerful Expression Wizard.
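
For instance, a single-line expression can trim, null-check and concatenate two input columns. In the sketch below, the body of fullName is the kind of expression an output field could hold; the Row class, the name row1 and the column names are assumptions for the example.

```java
class TMapExpressionSketch {
    // Hypothetical input row with two nullable String columns.
    static class Row {
        public String firstName;
        public String lastName;
    }

    static String fullName(Row row1) {
        // Null-safe: each column falls back to an empty string when missing.
        return (row1.firstName == null ? "" : row1.firstName.trim() + " ")
             + (row1.lastName == null ? "" : row1.lastName.trim().toUpperCase());
    }

    public static void main(String[] args) {
        Row row1 = new Row();
        row1.firstName = " Ada ";
        row1.lastName = "lovelace";
        System.out.println(fullName(row1));
    }
}
```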

Say hello to tMap

As a convenience, you have the Var pane for adding temporary variables. Use it if your inner transformations cannot be easily handled in a single-line Java expression.

Say hello to tMap

The Schema Editor is for both input and output connections. Check and set here the data types, the length, the nullable flag for each field.

Orchestration Components

❏ These components, as the name states, are used to “bring order” inside and outside the jobs;

❏ They allow you to call a TOS job from another one, to put a job in a wait state and more.

❏ Here you will find two components to switch between Row and Iterator Connections;

❏ A typical use case is when you want to trigger an event for each row in the incoming connection.

Other useful components

❏ tPreJob and tPostJob are two special starting

points that are respectively triggered before

and after all other subjobs in the current job;

❏ tLogRow is to log the content of a given Row

connection into the console;

❏ tHashInput and tHashOutput are useful to

define reusable buffers of data inside a job;

❏ tLibraryLoad is to import external jars into

the classpath of the current job.

Talend Open Studio Tips and Tricks


Tips and Tricks

❏ Use Repository metadata when possible:

it’ll make your design more robust.

❏ Generic Schema metadata, as the name suggests, is useful for defining schemas that you don’t want tied to a format or platform, unlike file schemas or database table schemas.

❏ Always document your jobs: the documentation can then be exported to a ready-to-use document!

Tips and Tricks

❏ Clicking “Sync Schema” propagates the current schema forward, changing any schema to “built-in” along the way.

❏ Built in Schemas won’t get updated when

Repository changes!

❏ If you have large lookup, sort or aggregate operations, you may need to raise the amount of RAM devoted to the JVM in the Job Parameters.

❏ You may get a Java heap space error otherwise.
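
The standard JVM options -Xms and -Xmx set the initial and maximum heap size. To check which limit a job is really running with, a small snippet like the one below (class name invented for the example) could be dropped into a custom code component or run on its own.

```java
class HeapCheckSketch {
    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Try running with, for example, java -Xmx1024m HeapCheckSketch
        System.out.println("Max heap (MB): " + maxHeapMb());
    }
}
```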

Tips and Tricks

❏ Every transformation is a Java expression in Talend!

❏ Handle null values properly to avoid Java NullPointerExceptions;

❏ Use primitive wrappers when possible (e.g. ‘Integer’ instead of ‘int’);

❏ Use methods, not operators (e.g. .equals() and .concat()).

❏ Perform filtering as soon as possible to

reduce the memory consumption.
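
The null-handling advice can be sketched in a few lines of plain Java. The method names and sample values are invented for the example.

```java
class NullSafetySketch {
    static boolean isActive(String status) {
        // Constant first: "ACTIVE".equals(null) is simply false, no exception.
        return "ACTIVE".equals(status);
    }

    static int orZero(Integer amount) {
        // A wrapper can hold null (e.g. a NULL database column);
        // unboxing it without this check would throw a NullPointerException.
        return amount == null ? 0 : amount;
    }

    static boolean sameText(String a, String b) {
        // equals() compares content, whereas == on Strings compares references.
        return a != null && a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(isActive(null));
        System.out.println(orZero(null));
        System.out.println(sameText(new String("x"), "x"));
    }
}
```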

Getting Help

❏ Talend Forge: forum, custom components, tutorials,

bug trackers, example jobs

❏ Stack Overflow

❏ http://stackoverflow.com/questions/tagged/talend

❏ Books from Packt Publishing

❏ “Getting Started with Talend Open Studio for Data Integration” by Jonathan Bowen;

❏ “Talend Open Studio Cookbook” by Rick D. Barton.


Contacts

❏ Tutorials

❏ Custom components

❏ Ready-made jobs

❏ Use Cases

http://gabrielebaldassarre.com

Need help? Questions? Consulting needs?

http://gabrielebaldassarre.com/contacts/

@cerealping