Talend Open Studio Fundamentals gabrielebaldassarre.com

Talend Open Studio Fundamentals #1: Workspaces, Jobs, Metadata and Tips & Tricks

Introduction to Talend Open Studio for Data Integration, focusing on job architecture, metadata, workspaces, connection types and commonly used components. Rich Tips & Tricks sections.

Talend Open Studio Fundamentals

gabrielebaldassarre.com

What is Talend for Data Integration?

❏ Eclipse-based visual programming IDE for ETL applications

❏ Java code generator

❏ 600+ connectors for open and proprietary data systems

❏ Easily embeddable in custom applications

❏ Cross-platform

❏ Central metadata repository

❏ Available in both open source and premium flavours

What does ETL stand for?

It summarizes every operation that loads, retrieves, digests, consumes, transforms and shapes data:

❏ Extract - get the data from different sources.

From flat files, RDBMS, Big Data systems, web services, business...

❏ Transform - convert it into a form suitable for the destination data system.

Aggregate, transform, combine, reshape, clean, filter, improve quality...

❏ Load - move it to the target destination in a suitable way.

Write the data in the target format.
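The three steps can be pictured as three small functions chained together. This is only a minimal sketch in plain Java, not code generated by TOS: the class name, the semicolon-delimited source and the comma-separated target are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy ETL pipeline; every name here is hypothetical. */
class EtlSketch {

    /** Extract: split raw delimited text into fields, one String[] per row. */
    static List<String[]> extract(List<String> rawLines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : rawLines) {
            rows.add(line.split(";"));
        }
        return rows;
    }

    /** Transform: drop rows without a country and normalise the remaining fields. */
    static List<String[]> transform(List<String[]> rows) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            if (row.length < 2 || row[1].trim().isEmpty()) {
                continue; // filter: incomplete row
            }
            out.add(new String[] { row[0].trim(), row[1].trim().toUpperCase(Locale.ROOT) });
        }
        return out;
    }

    /** Load: write the rows in the target format (here, comma-separated text). */
    static String load(List<String[]> rows) {
        StringBuilder target = new StringBuilder();
        for (String[] row : rows) {
            target.append(String.join(",", row)).append('\n');
        }
        return target.toString();
    }

    public static void main(String[] args) {
        List<String> source = List.of("alice; it", "bob;", "carol;fr ");
        System.out.print(load(transform(extract(source))));
    }
}
```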

Talend Open Studio

❏ It’s the open source, free-to-use, community-supported version of Talend for Data Integration;

❏ Often abbreviated as “TOS”, to distinguish it from the premium version (“TIS”);

❏ Feature-light, but still completely usable:

    ❏ Same set of connectors and components as the premium version;

    ❏ It lacks team-working and Enterprise capabilities like SVN, scheduling, process orchestration and a monitoring console.

Hands on!

❏ Download Talend Open Studio for Data Integration

❏ https://www.talend.com/download/data-integration

❏ Download the user manual as well

❏ Install it!

❏ Optional:

    ❏ Prepare a quick MySQL stack for a ready-to-start database and other commodities

    ❏ https://github.com/r8/vagrant-lamp is worth a try

Say hello to TOS!

TOS Interface: Designer

The Designer is the “canvas” where you’re going to “draw” your ETL job, graphically connecting components to each other using different kinds of connections.

TOS Interface: Components Palette

The Palette on the right hosts the complete set of 600+ available components, both custom and built-in.

Use the search field to quickly filter the palette views and find the component you need at a glance.

TOS Interface: Opened Jobs

Currently opened jobs are shown as tabs on top...

TOS Interface: Repository Pane

The Repository pane hosts all the metadata, such as DB connection credentials, external delimited file schemas and parameters, as well as the whole set of ETL jobs themselves.

TOS Interface: Parameters Pane

The Parameters pane hosts the settings of the selected component, job settings and parameters, debug status and the diagnostic tab.

TOS Interface: Perspectives

...and different Perspectives are available in the top-right corner.

Both TOS and standard Eclipse perspectives are available here.

Workspaces

A Workspace is a container of Projects which share the same TOS version and the same components palette.

Like Eclipse, you can choose which one to use when the program starts.

❏ In TOS, it’s a folder on the local drive.

Projects

❏ A Project is a set of jobs and their related metadata;

❏ It’s defined in a subfolder of the Workspace;

❏ Both TOS and Eclipse Preferences are Project-based

    ❏ In other words, different projects in the same Workspace have different settings;

❏ Internally, it’s a mix of XML, .item and .properties files in classic Eclipse fashion.

Metadata: General Principles

❏ TOS requires a preliminary definition and description of jobs using metadata. The Repository holds this information.

❏ There are 8 types of metadata, although custom components can define their own. We’ll look at the most important ones in detail:

    ❏ Business Models, Job Designs, Contexts, Code, Metadata.

Metadata: Business Models

❏ It stores diagrams used to conveniently describe business models and to embed them with ETL;

❏ It offers a small set of drawing capabilities in UML fashion;

❏ It’s not widely used, but it has proven useful for quickly sketching out transformation goals and for self-documenting ETL.

Metadata: Jobs

❏ It’s the heart of the TOS Repository: the jobs themselves;

❏ Here you’ll store all the metadata you need to graphically describe the jobs

    ❏ Components used, connections, signals, parameters, colors and presentation details are hosted here.

❏ You can (you should!) organize them in a tree for better clarity.

Metadata: Contexts

❏ It stores context groups, which are parameter sets that can be used by any job in the current Project.

❏ A group is a set of initialized Java variables of one of the allowed types in the global scope.

❏ Groups are for presentation only: there are no limitations on how many context variables you use in jobs, or on how you use them.

Metadata: Code

❏ It stores routines written in Java;

❏ These routines are typically a set of static methods inside a class.

❏ If your routine is becoming too complex, consider writing a custom component instead.

❏ Consider using Maven and Git when creating a routine, for better reliability.

    ❏ https://github.com/theclue/talend-routine-collection
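As a rough illustration of "a set of static methods inside a class", a routine could look like the sketch below. The class and method names are hypothetical and are not taken from TOS or from the repository linked above.

```java
/** Hypothetical user routine: a plain class exposing static helper methods. */
class StringRoutines {

    /** Returns the trimmed input, or an empty string when the input is null. */
    public static String trimToEmpty(String value) {
        return value == null ? "" : value.trim();
    }

    /** Returns true when the input is null or contains only whitespace. */
    public static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("[" + trimToEmpty("  Talend ") + "]");
        System.out.println(isBlank("   "));
    }
}
```

Because the methods are static and stateless, they can be called from any Java expression in a job, for example StringRoutines.trimToEmpty(someField), where someField stands for whatever column is being processed.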

Metadata: ...Metadata?

❏ It stores a heterogeneous set of reusable, atomic elements for jobs.

❏ They include database parameters and credentials, external file schemas, web service interfaces, business application connections and so on.

❏ User components often add their own metadata types to the list, but this often breaks compatibility.

Anatomy of a Job

❏ A Job is a visual set of components graphically connected using different connections;

❏ From the visual canvas and the connection topology, TOS in turn generates Java code;

❏ This code is procedural by design and not really object-oriented:

    ❏ It’s fast…

    ❏ ...but debugging it is a pain in the neck for the experienced programmer.

Anatomy of a job

❏ Drag and drop components from the Palette to the canvas, then visually connect them to each other.

❏ You cannot make closed paths in your jobs!

    ❏ It’ll become clear later why.

Anatomy of a job: Subjobs

❏ A set of connected components is part of a subjob if they are all enclosed by a light-blue background;

❏ You can have as many subjobs as you need in a given job.

Anatomy of a job: Starting Point

❏ The starting point component of a subjob is the one with a green background;

❏ Parallel execution is made using unconnected subjobs, but you won’t be able to predict the execution order!

Anatomy of a job: Main Connections

❏ The Main connections are those that dictate the data flow;

❏ They carry vectors of data (one vector per row/tuple);

❏ When you have a split, the order dictates which branch comes first. You may change it from the contextual menu.

Anatomy of a job: Lookup Connections

❏ Lookup connections, as the name suggests, make data available for fast lookup (i.e. join or match operations).

❏ Typically, lookup data vectors are stored in memory during job processing. So watch out for memory shortage!
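The memory concern is easier to see with a small sketch. Below, the whole lookup table sits in a HashMap while the main rows stream past it one at a time. This is an illustration in plain Java under assumed names and data, not the code TOS generates.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LookupSketch {

    /** Joins each main row (customerId, amount) with an in-memory lookup of customer names. */
    static List<String> join(List<String[]> mainRows, Map<String, String> lookup) {
        List<String> joined = new ArrayList<>();
        for (String[] row : mainRows) {           // main flow: one row at a time
            String name = lookup.get(row[0]);     // lookup flow: whole table held in memory
            if (name != null) {                   // inner join: unmatched rows are dropped
                joined.add(name + ":" + row[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> customers = new HashMap<>();
        customers.put("1", "Alice");
        customers.put("2", "Bob");
        List<String[]> orders = new ArrayList<>();
        orders.add(new String[] { "1", "10.50" });
        orders.add(new String[] { "3", "7.00" });
        orders.add(new String[] { "2", "3.20" });
        System.out.println(join(orders, customers));
    }
}
```

The map grows with the size of the lookup source, not with the main flow, which is why a large lookup table is the part that can exhaust the heap.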

Anatomy of a job: Endpoints

❏ Endpoints are components that have no outgoing connection.

❏ A given subjob can have as many endpoints as needed (think about what’s going on after a split operation like the one above).

Signals and Data Connections

❏ There are three types of connections in standard TOS:

    ❏ Row

    ❏ Trigger

    ❏ Iterator

❏ You may select which connection to use from the contextual menu of any component instance.

Row

❏ Rows are connections that carry data, one tuple at a time;

❏ Their content is defined by a Schema;

❏ They are used to connect components;

❏ Components connected this way will end up in the same subjob;

❏ Main, Lookup, Filter, Merge are all data connections;

❏ Custom components can define their own Data Connection.

Schema

❏ Schema is an important inner concept in TOS design;

❏ Each Row connection must have a non-null schema declaration, which defines the dimensionality of the vector of data going into or out of a given component;

❏ Several primitive Java types are supported.
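One way to picture a schema is as a small Java class with one typed field per column. The sketch below is an assumption made for illustration: the class, its columns and their types are hypothetical and do not reproduce the classes TOS actually generates.

```java
import java.util.Date;

/** Hypothetical row type: one field per schema column. */
class CustomerRow {
    Integer id;       // nullable column, hence the wrapper type
    String name;
    double balance;   // non-nullable column, primitive type
    Date createdAt;

    CustomerRow(Integer id, String name, double balance, Date createdAt) {
        this.id = id;
        this.name = name;
        this.balance = balance;
        this.createdAt = createdAt;
    }

    /** Renders the first three columns, pipe-separated. */
    String describe() {
        return id + "|" + name + "|" + balance;
    }

    public static void main(String[] args) {
        CustomerRow row = new CustomerRow(1, "Alice", 12.5, new Date(0L));
        System.out.println(row.describe());
    }
}
```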

Triggers

❏ Triggers, as the name suggests, don’t carry data: they are actually signals.

❏ They are usually used to connect subjobs.

❏ They come in two main flavours, depending on their scope: Sub Job Triggers and Component Triggers.

❏ They’re typically Go/No-Go events that trigger the execution of one or more subjobs;

Sub Job Triggers

❏ Sub Job Triggers are the most widely used in practice;

❏ They are used to connect the starting points of subjobs;

❏ When connected this way, subjobs will execute sequentially, forcing an execution order;

❏ You’ll end up having only one starting point for the whole chain.

Run If Triggers

❏ Run If Trigger is a special type of trigger that is fired only if the embedded expression evaluates to true.

❏ The expression must be written in Java and have a boolean outcome.
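A Run If condition is simply a Java expression whose type is boolean. The sketch below wraps such an expression in a method so that it can run on its own. The map stands in for the job's runtime key-value store, and the key "inputLineCount" is a hypothetical name chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

class RunIfSketch {

    /** The kind of boolean Java expression a Run If trigger could hold. */
    static boolean shouldRun(Map<String, Object> globalMap) {
        Integer lines = (Integer) globalMap.get("inputLineCount"); // hypothetical key
        return lines != null && lines > 0;
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        System.out.println(shouldRun(globalMap)); // key missing: the trigger would not fire
        globalMap.put("inputLineCount", 42);
        System.out.println(shouldRun(globalMap)); // positive count: the trigger would fire
    }
}
```

The null check matters: casting a missing value to Integer and comparing it directly with 0 would throw a NullPointerException instead of evaluating to false.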

Iterators

❏ Iterators stand in the middle between Data Connections and Triggers;

    ❏ They don’t carry data like Rows…

    ❏ ...but they’re not fired only once like Triggers.

❏ Think of them as Triggers which are fired once for each incoming row.

❏ They are connected to starting points, like Sub Job Triggers, but originate from standard components, like Row Connections.

Component Parameters

❏ When you select a component instance, the parameter pane will show the relevant fields for you to fill in;

❏ Several types of parameters are allowed: dropdowns, radio buttons, schemas, text fields...

❏ Text fields will often end up writing their value into the generated Java code as-is, so be sure to write them properly:

    ❏ Enclose strings in double quotes;

    ❏ Be sure to match the expected type, or cast otherwise.
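Since the field content is pasted into Java code, it helps to think of each text field as an argument in a method call. The following sketch uses a hypothetical method and hypothetical field values to show a quoted string and a type conversion.

```java
class ParameterFieldSketch {

    /** Hypothetical stand-in for generated code that receives two field values. */
    static String configure(String fileName, int limit) {
        return fileName + " limit=" + limit;
    }

    public static void main(String[] args) {
        String baseDir = "/tmp/data"; // hypothetical variable available to the expression
        // Field 1 holds: baseDir + "/input.csv"     (the literal part is double-quoted)
        // Field 2 holds: Integer.parseInt("100")    (a String converted to the expected int)
        System.out.println(configure(baseDir + "/input.csv", Integer.parseInt("100")));
    }
}
```

Typing /input.csv without the quotes in the first field would not compile, because the text would be read as Java code rather than as a string.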

Components and Repository

❏ Very often, Components allow you to select relevant metadata from the Repository;

❏ Doing so, you will be able to keep parameters “in sync” between jobs and component instances;

❏ However, this is not mandatory, and at any time you can detach the component from the Repository.

❏ This puts the component in the “built in” state, which means that its parameters are locally defined and won’t be updated anymore when the Repository is.

The Context

❏ The Context holds parameters defined at compile time

❏ Those parameters are grouped in Context Groups and defined in the Repository as primitive Java types.

❏ Then, they will end up as public attributes of the context object inside the code.

    ❏ For example, a parameter named “foo” will be referenced using the syntax context.foo in code and parameter fields.

❏ Just like parameters, a “built in” Context can be defined, too, to scope it to the local job only.
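The context.foo syntax follows directly from the "public attributes" design. Here is a minimal sketch, assuming a hand-written Context class as a stand-in for the real context object; the batchSize variable and the default values are invented for the example.

```java
/** Hypothetical stand-in for the context object: one public field per context variable. */
class ContextSketch {

    static class Context {
        public String foo = "default";   // context variable "foo" with its default value
        public Integer batchSize = 500;  // a second, made-up context variable
    }

    /** Reads context variables the same way a job expression would: context.name */
    static String describe(Context context) {
        return "foo=" + context.foo + ", batchSize=" + context.batchSize;
    }

    public static void main(String[] args) {
        Context context = new Context();
        context.foo = "production"; // e.g. a value supplied by another context group
        System.out.println(describe(context));
    }
}
```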

The Global Map

❏ The Global Map holds parameters defined at runtime

❏ Those parameters live in a pure Java space.

❏ It’s a key-value Map used to store generic Objects:

    ❏ globalMap.put("key", Object) to store an object

    ❏ globalMap.get("key") to get an Object

❏ Since it’s a Java Map of Objects (Map<String, Object>), you must explicitly cast to the proper type when getting the object back.

❏ It has proven very handy when used in conjunction with Iterators, as they cannot carry data alone.
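The put/get/cast cycle, and the way it complements an Iterator, can be sketched in a few lines. The loop below plays the role of an Iterator connection firing once per file; the key "currentFile" and the file names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class GlobalMapSketch {

    /** Reads back a value stored as Object; the explicit cast restores its type. */
    static String currentFile(Map<String, Object> globalMap) {
        return (String) globalMap.get("currentFile"); // hypothetical key
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        String[] files = { "a.csv", "b.csv" };
        for (String file : files) {              // stands in for an Iterator firing per item
            globalMap.put("currentFile", file);  // the iteration publishes its current value
            System.out.println("processing " + currentFile(globalMap));
        }
    }
}
```

The connection itself carries nothing, so the triggered subjob has to read the current item from the map.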

Talend Open Studio Common-use Components

gabrielebaldassarre.com

Which component to use…?

❏ TOS comes with more than 600 general-use components;

❏ This is because it must ensure connectivity with tons of different data sources (e.g. RDBMS, appliances…);

❏ Clearing out the clutter, you’ll end up with a very small subset of life-saving components. We can group the most important ones in families and look at them in detail:

    ❏ Database, File, Custom Code, Processing, Orchestration

File Components

❏ These components are used for input and output from/to local files;

❏ Notable features include archiving capabilities and a complete set of file system management operations, like copy, delete or directory listing;

❏ Under Linux, you can use named pipes to stream data into TOS directly from a calling shell.

Database Components

❏ These components are used for performing operations on RDBMS;

❏ Notable features include the components for SCD (Slowly Changing Dimensions) and cloud support (e.g. AWS Redshift);

❏ Unfortunately, due to licensing issues, you often have to download the JDBC driver from the RDBMS vendor yourself in order to use it in TOS; quite annoying!

Custom Code Components

❏ These components allow you to write Java code directly into your Job;

❏ Although quite hard to manage, these are real life-savers in a lot of different situations;

❏ A typical use case is when you want to import and use an external Java library or method.

❏ Several components are available for different scopes, e.g. generating data flows, processing rows, etc.

Processing Components

❏ These are probably the most important components of all;

❏ They include sort, filter, aggregation, join, sampling, XML traversing;

❏ But the most important component ever is the tMap;

    ❏ It’s a general-purpose multi-input, multi-output mapper component.

    ❏ We’ll look at it in detail...

tMap in a typical Job

❏ Basically speaking, think of a set of joins, a set of splits and a set of transformations in the middle.

❏ That’s why it has a special user interface.

Say hello to tMap

Say hello to tMap

Here come the Input Data Connections with their own Schemas. Only one is the Main connection; the others are all Lookup connections. Here you’ll set the join conditions. Clicking the wrench reveals more options, like the join type and how to load the lookup tables.

Say hello to tMap

On the right pane we have the Output Data Connections, each of them with its own Schema, too. Again, the wrench reveals more options, for example whether the connection must catch rows where the join has failed, and more...

Say hello to tMap

Each output field is a Java expression. This means you can call methods on it, use user routines, combine expressions and more. Click on it to open the powerful Expression Wizard.
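To give a feel for what "a Java expression" means here, the sketch below wraps one in a method so that it can be run standalone. The Row class, its columns and the name row1 are hypothetical; only the body of the return statement would go into an output field.

```java
import java.util.Locale;

class TMapExpressionSketch {

    /** Hypothetical input row with two nullable columns. */
    static class Row {
        String firstName;
        String lastName;

        Row(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    /** What an output field expression might look like: a single Java expression. */
    static String fullName(Row row1) {
        return (row1.firstName == null ? "" : row1.firstName.trim() + " ")
                + (row1.lastName == null ? "" : row1.lastName.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(fullName(new Row(" Ada ", "lovelace")));
        System.out.println(fullName(new Row(null, "lovelace")));
    }
}
```

Once an expression needs this many null checks, moving parts of it into a temporary variable or a user routine usually keeps the mapping easier to read.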

Say hello to tMap

As a convenience, you have the Var pane for adding temporary variables. Use it if your inner transformations cannot be easily handled in a single-line Java expression.

Say hello to tMap

The Schema Editor is for both input and output connections. Check and set here the data type, the length and the nullable flag for each field.

Orchestration Components

❏ These components, as the name states, are used to “make order” inside and outside the jobs;

❏ They allow you to call a TOS job from another, to put a job in a wait state and more.

❏ Here you will find two components to switch between Row and Iterator Connections;

    ❏ A typical use case is when you want to trigger an event for each row in the incoming connection.

Other useful components

❏ tPreJob and tPostJob are two special starting points that are triggered, respectively, before and after all other subjobs in the current job;

❏ tLogRow logs the content of a given Row connection to the console;

❏ tHashInput and tHashOutput are useful to define reusable buffers of data inside a job;

❏ tLibraryLoad imports external jars into the classpath of the current job.

Talend Open Studio Tips and Tricks

gabrielebaldassarre.com

Tips and Tricks

❏ Use Repository metadata when possible: it’ll make your design more robust.

❏ Generic Schema metadata, as the name suggests, is useful to define schemas that you don’t want to be format- and platform-dependent, like file schemas or database table schemas.

❏ Always document your jobs: the documentation can then be exported to a ready-to-use document!

Tips and Tricks

❏ Clicking “Sync Schema” will propagate the current schema forward, changing any schema to “built in” along the way.

    ❏ Built-in Schemas won’t get updated when the Repository changes!

❏ If you have large lookup, sort or aggregate operations, you may need to raise the amount of RAM devoted to the JVM in Job Parameters.

    ❏ You may get a Java heap space error otherwise.
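The JVM's maximum heap is controlled by the standard -Xmx option (for example -Xmx2048M). To check which limit a running JVM actually has, a few lines of plain Java are enough. The class below is a hypothetical helper, not part of TOS.

```java
class HeapSketch {

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it with and without an explicit -Xmx value shows whether the setting was picked up.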

Tips and Tricks

❏ Every transformation is a Java expression in Talend!

    ❏ Handle null values properly to avoid Java NullPointerExceptions;

    ❏ Use primitive wrappers when possible (e.g. ‘Integer’ instead of ‘int’);

    ❏ Use methods, not operators (e.g. .equals() and .concat()).

❏ Perform filtering as soon as possible to reduce memory consumption.
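The three Java tips can be illustrated with a short, self-contained sketch. The class and method names are hypothetical; the behaviour shown is standard Java.

```java
class NullSafetySketch {

    /** Literal-first equals() never throws, even when the value is null. */
    static boolean isActive(String status) {
        return "ACTIVE".equals(status);
    }

    /** A wrapper type can represent a missing value; unboxing null would throw. */
    static int orZero(Integer quantity) {
        return quantity == null ? 0 : quantity;
    }

    /** == on Strings compares references, so use equals() to compare content. */
    static boolean sameText(String a, String b) {
        return a != null && a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(isActive(null));                // no NullPointerException
        System.out.println(orZero(null));                  // missing value mapped to 0
        System.out.println(sameText(new String("x"), "x")); // content comparison
    }
}
```

With a primitive int column, a missing value cannot be represented at all, which is why the wrapper type is the safer choice for nullable fields.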

Getting Help

❏ Talend Forge: forum, custom components, tutorials,

bug trackers, example jobs

❏ http://stackoverflow.com/questions/tagged/talend

❏ Stack Overflow

❏ http://stackoverflow.com/questions/tagged/talend

❏ Books from Packt Publishing

    ❏ “Getting Started with Talend Open Studio for Data Integration” by Jonathan Bowen;

    ❏ “Talend Open Studio Cookbook” by Rick D. Barton.

Contacts

❏ Tutorials

❏ Custom components

❏ Ready-made jobs

❏ Use Cases

http://gabrielebaldassarre.com

Need help? Questions? Consulting needs?

http://gabrielebaldassarre.com/contacts/

@cerealping