Schema Testing

Schema Testing for DWH

http://blog.iadvise.eu/tag/etl/

Talend: Schema compatibility check

Posted on October 8, 2014 by Jessica Smets

Most of the time when talking about Talend jobs, people think of standard ETL (Extract, Transform, Load). But in some cases you need to check the incoming data before loading it into the target, rather than just transforming it. We refer to this process as E-DQ-L (Extract, Data Quality, Load).

One of the things that you might want to check before loading is schema compatibility. For example: you expect to get a String that is 5 characters long. If you, for any reason, receive a String that is longer than 5 characters, it will generate an error. Or perhaps you expect a percentage (as a BigDecimal like 0.19), but you receive it as a string (“19%”). This will result in a failing job with an error saying “Type mismatch: cannot convert from dataType to otherDataType”.
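To make the percent example concrete, here is a minimal Java sketch of the conversion that would be needed before such a value fits a BigDecimal column. The class and method names (PercentParser, parsePercent) are made up for this illustration and are not part of Talend.

```java
import java.math.BigDecimal;

/** Hypothetical helper (not a Talend class): turns "19%" into the fraction 0.19. */
final class PercentParser {

    /**
     * Accepts text such as "19%" or "0.19".
     * Returns null when the text cannot be interpreted, so the caller can reject the row.
     */
    static BigDecimal parsePercent(String raw) {
        if (raw == null) {
            return null;
        }
        String text = raw.trim();
        try {
            if (text.endsWith("%")) {
                String digits = text.substring(0, text.length() - 1).trim();
                // movePointLeft(2) divides by 100 without any rounding
                return new BigDecimal(digits).movePointLeft(2);
            }
            return new BigDecimal(text);
        } catch (NumberFormatException e) {
            // Not a number at all: signal "incompatible" instead of failing the job
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePercent("19%"));
        System.out.println(parsePercent("nineteen"));
    }
}
```

Returning null instead of throwing is a design choice of this sketch: it lets the calling code decide whether to reject the record or stop the job.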

Before I continue this blog I would like to emphasize that all the solutions

below are possible with the Data Integration version of Talend, except for the

last one. The last option requires a Talend Data Quality license.

Let’s create an example case: We want to extract data on a regular basis from a third-party source which we cannot fully trust in terms of schema settings. We know how many columns to expect and we have a rough idea of what they contain, but we do not trust the source to always deliver compatible data. We want to load the records that are valid and separately store the ‘corrupt’ data for logging purposes. I’ve gathered several solutions for this problem:

1. Use rejected flow on an input-component

One thing you can do is reject the records as soon as you import them. Disable “die on error” on the basic settings tab of your input-component, then right-click it and select “Reject”. The rows will be rejected based on the schema of the file. In the example below we defined the phone number as an integer and, as you can see, 1 record is being rejected. This is because the phone number contains characters and therefore cannot be read as an integer. If you did not disable the “die on error”-option, this component would make the job fail.
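The idea behind the reject flow can be sketched in plain Java. This is not the code Talend generates; it is a small illustration that assumes a made-up input format of “name;phone” lines and uses hypothetical names (RejectFlowSketch, split).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: splits incoming lines into a main flow and a reject flow. */
final class RejectFlowSketch {

    /** Holds both output flows. */
    static final class Result {
        final List<String[]> accepted = new ArrayList<>();
        final List<String> rejected = new ArrayList<>();
    }

    /** Expects "name;phone" lines; the phone column must parse as an integer. */
    static Result split(List<String> lines) {
        Result result = new Result();
        for (String line : lines) {
            String[] fields = line.split(";", -1);
            if (fields.length != 2) {
                result.rejected.add(line + " -> expected 2 columns, got " + fields.length);
                continue;
            }
            try {
                Integer.parseInt(fields[1].trim());
                result.accepted.add(fields);
            } catch (NumberFormatException e) {
                // Same situation as in the example: characters in the phone number
                result.rejected.add(line + " -> phone is not an integer");
            }
        }
        return result;
    }
}
```

Each rejected entry keeps the original line together with a reason, which is the kind of information you want when storing corrupt records for logging.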

2. In case of the target being a database: use rejected links

You can also choose to directly input the data into your database, but to

reject any rows that would create an error. You can then create a separate

flow to determine what to do with these rejected records.

In your database output component (for example tOracleOutput) change the

following:

Basic settings: Uncheck “Die on error”

Advanced settings: Uncheck “Use batch size”

Now, right-click on your component, select “Row->Reject” and connect it to an output-component. The output you’ll receive will be the rejected rows and the error that would have been generated if you had tried inserting them, as you can see in the picture below.

3. Use a tFilterRow-component

You can make the data go through a filter-component before inserting it into your target. You can (manually) decide what’s allowed to go through. This can be useful when your destination is not a database, in which case option 2 is most likely not available.

A tFilterRow-component can also output the rejected rows, including the reason why they got rejected. You can enable this by right-clicking on your filter and selecting “Row->Reject”. An example of rejected rows by the filter:

Note – You can also use self-defined routines in the tFilterRow-component by checking “Use advanced mode”. This can be useful when you want to check whether or not a conversion is possible. For example: you could define a routine called “isInteger” that returns true if the conversion is valid and false if it’s impossible.
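Such a routine could look roughly like the sketch below. The class name SchemaChecks is an assumption for this example; in Talend Studio a routine is normally created under Code > Routines as a public class, which is left out here to keep the snippet self-contained.

```java
/** Sketch of a user routine for checking whether a text value can become an Integer. */
final class SchemaChecks {

    /** Returns true if the value can be converted to an Integer, false otherwise. */
    static boolean isInteger(String value) {
        if (value == null) {
            return false;
        }
        try {
            Integer.parseInt(value.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isInteger("5551234"));
        System.out.println(isInteger("555-ABC"));
    }
}
```

In the advanced mode of the filter, the condition would then be an expression along the lines of SchemaChecks.isInteger(input_row.phone), assuming a column named phone; check the exact row prefix your version of the component expects.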

4. Use a tSchemaComplianceCheck-component

Another way of making sure that your schema is compatible is by using the

tSchemaComplianceCheck-component. Unfortunately, this component is only

integrated in the Data Quality version of Talend.
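To give an idea of what a compliance check involves, here is a rough plain-Java sketch that validates type and length per column. It is not how tSchemaComplianceCheck is implemented; ComplianceSketch, ColumnRule and the supported type names are assumptions made for this example.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

/** Illustration of a per-column schema check; unrelated to Talend's own implementation. */
final class ComplianceSketch {

    /** Hypothetical rule for one column. */
    static final class ColumnRule {
        final String name;
        final String type;    // "String", "Integer" or "BigDecimal" in this sketch
        final int maxLength;  // applied to the raw text; 0 or less means no limit

        ColumnRule(String name, String type, int maxLength) {
            this.name = name;
            this.type = type;
            this.maxLength = maxLength;
        }
    }

    /** Returns one message per violation; an empty list means the row complies. */
    static List<String> check(String[] row, List<ColumnRule> rules) {
        List<String> errors = new ArrayList<>();
        if (row.length != rules.size()) {
            errors.add("expected " + rules.size() + " columns, got " + row.length);
            return errors;
        }
        for (int i = 0; i < row.length; i++) {
            ColumnRule rule = rules.get(i);
            String value = row[i];
            if (value == null) {
                errors.add(rule.name + ": null value");
                continue;
            }
            if (rule.maxLength > 0 && value.length() > rule.maxLength) {
                errors.add(rule.name + ": exceeds max length " + rule.maxLength);
            }
            try {
                if ("Integer".equals(rule.type)) {
                    Integer.parseInt(value);
                } else if ("BigDecimal".equals(rule.type)) {
                    new BigDecimal(value);
                }
            } catch (NumberFormatException e) {
                errors.add(rule.name + ": wrong type, expected " + rule.type);
            }
        }
        return errors;
    }
}
```

With a rule set of a 5-character String and a BigDecimal, the row {"ABCDEF", "19%"} from the introduction would yield two messages: one for the length and one for the type.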

It’s a very easy component to use. The only thing you have to do is connect the incoming data to the tSchemaComplianceCheck-component and then continue its flow to the destination. You can get the rejected rows the same way as previously (by right-clicking on it and then selecting “Row->Reject”).

The rejected rows and their error message look like this:

That’s it for now. There are probably many other ways of checking schema compatibility. Feel free to comment if you know any. Thank you for reading!

Posted in Talend | Tagged ETL, Talend

Talend: tips and tricks part 2

Posted on August 26, 2014 by Jessica Smets

In the first part of these entries we discussed how to test your expressions,

the importance of optimizing the appearance of a tLogRow component and

how to handle windows and views within Talend. This time around, we will be

talking about the different ways to get components into your job, how to

trace your dataflow and how to easily sync columns. As last time, this post

will be useful for both starting and experienced users.

4. Getting components into your job

There are many ways to get components into your job. Most people search the palette (either with the search function or by manually exploring the folders) and drag and drop the components into their job. You can achieve the same thing by simply clicking on a random place in your job and then typing the name of the component. Obviously this is only recommended once you’re familiar with the different components and their names.

When working with metadata, you can use certain shortcuts to save a bit of time. Usually people just click on the metadata and then drop it onto their job. This will pop up a window allowing you to choose which type of component you want to use. Holding the Control-key while dragging the component will directly create an Output-component. Holding Control+Shift will result in an Input-component.

5. Syncing columns

Occasionally, you may have to change the schema of a certain component in

the middle of development. This might affect other components in your job.

In some cases, Talend asks if you want to propagate the changes you’ve

made (to the other components).

You may accidentally close this window, click “No” or not get this message at all, resulting in the following error: “The schema from the input link “youroutputlink” is different from the schema defined in the component”.

When this happens, you can go to the basic settings of the component that

has the error and click on “Sync columns”. The error should now be gone.

6. Tracing your dataflow (Debug Run)

Lastly, I would like to say a few words about the debug run. In some cases we want to closely watch our dataflow in order to get a better understanding of what exactly is happening. You can achieve this by running your job in debug mode. To do so, open the Run-window, click on the “Debug Run” tab on the left side of the window and start the job by clicking on “Traces Debug”.

The moment you open the “Debug run” tab, you’ll immediately see extra

icons in your job. These magnifying glass icons indicate that details will be

shown when you debug-run your job. The result should look something like

this:

You can Pause and Resume the run at any time. You can also add

breakpoints if you like. Do this by right-clicking on a dataflow and then

selecting “Show Breakpoint Setup”.

This brings you to the “Breakpoint” tab of the data flow you clicked on. You can also go there by clicking on the specific flow and manually selecting “Breakpoint”. Let’s add a breakpoint to pause our run whenever we come across a record with “Bloom” as last name. Firstly, make sure to check the “Activate conditional breakpoint” option. After that, click on the plus-icon underneath the conditions. Then select the InputColumn we want to put our condition on, in our case “Last_name”, and add a value (“Bloom” in this example). The default Operation is “Equals”, which is the one we want. You can also specify a different Operation if you need to, but this is unnecessary in this case.

You can add multiple breakpoints if you like. Whenever you debug run your

job now, it will stop at a record where the Last_name is “Bloom” (if any

exist).

That’s it for now. Thank you for reading!

Posted in data integration, ETL, Talend, Tips and Tricks | Tagged ETL, Talend

Talend: tips and tricks part 1

Posted on August 4, 2014 by Jessica Smets

This blog contains some convenient tips and tricks that will make working with the open source tool Talend for data integration a lot more efficient. This blogpost will be especially useful for people who are just discovering this amazing tool, yet I am sure that people who have been using it for a while will also find it very helpful. This series of tips will be spread over multiple blog entries, so make sure to check back often for future tips!

1. Testing expressions in the tMap component

Using the tMap component, you have the possibility to test your expressions. This way you can easily see whether or not the result is what you expected it to be. You can also use this to determine whether or not your expression will raise an error. Let’s create an example.

We’ve got details of employees as input for our tMap. We would like the first

name to be shown in uppercase. First of all, go into the expression builder by

clicking the ellipsis next to your expression.

To convert the first name to uppercase, we have to use the StringHandling function “UPCASE”. This will result in the following expression:

StringHandling.UPCASE(employee.First_name)

After you’re done filling in test values, click on the “Test!” button and wait

for the result. If everything goes as expected, you should see your first name

in uppercase on the right side of the window.
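If you want to try the same transformation outside the expression builder, a plain-Java stand-in is enough. The sketch below does not use Talend’s StringHandling class; UpcaseSketch and upcase are made-up names, and returning null for a null input is an assumption of this example.

```java
import java.util.Locale;

/** Stand-in for an uppercase conversion, usable without Talend on the classpath. */
final class UpcaseSketch {

    /** Returns the value in uppercase, or null when the input is null. */
    static String upcase(String value) {
        // Locale.ROOT keeps the result independent of the machine's language settings
        return value == null ? null : value.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(upcase("Jessica"));
    }
}
```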

2. Optimizing the appearance of the tLogRow component output

tLogRow is one of the most frequently used components. It is recommended

that you learn how to optimize its use. Firstly, make sure that you always

have the right appearance selected for your output. You can find this

property in the basic settings of your tLogRow-component.

There are three types of Modes that you can choose between:

Basic

Basic will generate a new line for each record, with the fields separated by the “Field Separator” you’ve chosen (see image above). When using basic mode, I highly recommend checking the “Print header” option when working with multiple column records or multiple outputs, purely for visibility reasons.

Table (print values in cells of a table)

The table mode shows the records and their headers in a table-format,

including the name of the component that generated this output (in our

case: “tLogRow_1”). This emphasizes the importance of properly naming

everything, especially when you have multiple components that generate

output. In this case, it would have been better to rename our component to

“EMPLOYEES”. Personally, I prefer this mode.

Vertical (each row is a key value/list)

Vertical mode will show a table for each one of your records.

The output mode you decide to use depends on what you’re trying to

visualize. For example, when your goal is to show a single string, I would

recommend using the basic mode. But when you have multiple table outputs

(for example: departments, customers and employees in a single output),

I’m certain the table mode would be the best option.

Sometimes your data is spread over multiple lines, resulting in an unclear output, as shown in the image below.

To force the output to put all the data on one single line, you can uncheck

the “Wrap” option. This option is located underneath your output and will

enable a horizontal scrollbar.

Do you also want to be able to get data regarding tweets using Talend, as

shown in the image above? Read my previous blogpost and find out how!

3. Resetting windows and maximizing/minimizing them

Sometimes you accidentally close a window and have a hard time finding a way to get it back. You can very easily reset your environment by clicking on “Window” – “Reset Perspective”.

You can see all of the views by clicking on “Window” – “Show View” – “Talend”. Some of the views are not shown by default, such as “Modules”. Modules can be used to import .jar-files without having to restart your studio, which will most likely save you some time.

Lastly, because Talend is Eclipse-based, you have the possibility to maximize and minimize windows. I personally use this function when examining the output of a tLogRow-component that contains a lot of data. You can achieve this by either double-clicking on the window or by right-clicking on it and selecting “Minimize”/“Maximize”.

That’s it for now. I hope you enjoyed reading this blog and make sure to

return soon for future blogs!

Posted in data integration, ETL, Talend, Tips and Tricks | Tagged ETL, Talend

Use of contexts within Talend

Posted on May 27, 2014 by Dieter Van Ransbeek

When developing jobs in Talend, it’s sometimes necessary to run them on different environments. For other business cases, you need to pass values between multiple sub-jobs in a project. To solve these kinds of issues, Talend introduced the notion of “contexts”.

In this blogpost we elaborate on the usage of contexts for easily switching

between a development and a production environment by storing the

connection data in context variables. This allows you to determine on which

environment the job should run, at runtime, without having to recompile or

modify your project.
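Conceptually, a context group is a set of named variables with one value per context. The plain-Java sketch below only illustrates that idea and is not how Talend stores contexts; the class ContextSketch, the variable names (HR_Server, HR_Port, HR_Sid) and the host names are all made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration of context variables: one value per variable per context. */
final class ContextSketch {

    private final Map<String, Map<String, String>> contexts = new HashMap<>();

    /** Stores the value of a variable for the given context. */
    void define(String context, String variable, String value) {
        contexts.computeIfAbsent(context, key -> new HashMap<>()).put(variable, value);
    }

    /** Looks up a variable and fails loudly when the context or variable is missing. */
    String get(String context, String variable) {
        Map<String, String> values = contexts.get(context);
        if (values == null || !values.containsKey(variable)) {
            throw new IllegalArgumentException(
                "No value for " + variable + " in context " + context);
        }
        return values.get(variable);
    }

    /** Builds an Oracle thin-driver JDBC URL from the variables of the chosen context. */
    String jdbcUrl(String context) {
        return "jdbc:oracle:thin:@" + get(context, "HR_Server")
            + ":" + get(context, "HR_Port")
            + ":" + get(context, "HR_Sid");
    }

    public static void main(String[] args) {
        ContextSketch ctx = new ContextSketch();
        ctx.define("Development", "HR_Server", "dev-db.example.com");
        ctx.define("Development", "HR_Port", "1521");
        ctx.define("Development", "HR_Sid", "HRDEV");
        // The context name is the only thing that changes between environments
        System.out.println(ctx.jdbcUrl("Development"));
    }
}
```

The point of the design is that the job logic only refers to variable names; which values are used is decided by the context name picked at runtime.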

To start using contexts in Talend you have two options:

1) you can create a new context group and its corresponding context variables manually, or

2) you can export an existing connection as a context.

In this example we’ll go over exporting an existing Oracle connection as a

context.

Double click an existing database connection to edit it and click Next.

Click Export as context

NOTE There are some connections that don’t allow you to export them as a

context. In that case you’ll have to create the context group and its variables

manually, add the group/variables to your job, and use the variables in the

properties of the components of your job.

After you’ve clicked the Export as context button you’ll see the Create/Edit

context group screen. Enter a name, purpose and description and click Next.

Now you’ll see all the context variables that belong to this context group.

Notice that Talend has already created all the context variables that are

needed for the HR connection. If you want to change their names you can

simply click them and they become editable.

Click the Values as table tab.

In the Values as table tab you can edit the values of the context variables by

simply clicking the value and changing it. To add a new context, click the

context symbol in the upper right corner.

The window that pops up is used to manage contexts. To create a new context, click New, enter the name of the context, in our example Production, and click Ok. To rename the Default context, select it, click Edit, enter Development and click Ok. When you’re done editing, click Ok.

After the window closes, you’ll see that an extra column appeared. Enter the

connection data of the production environment in the Production column and

click Finish.

In the connection window it’s possible to check the connection again, but this time you’ll be prompted to choose which connection you want to check.

Verify that both the connections work and click Finish.

Now that we’ve exported the connection as a context, it’s possible to use it

in a job. Create a new job, use the connection that has been exported as a

context and connect it to a tLogRow component. Your job should look

something like this

When using a connection that has been exported as a context in a job, you

have to include the context variables in order for your job to be able to run.

Go to the context tab and click the context button in the bottom left.

NOTE When using one of the newer versions, Talend proposes to add missing context variables whenever you try to run a job. Because of this, you don’t need to add them manually as described in this example.

Select the context group that contains the context variables, in our case the

HR context group.

Select the contexts you want to include and click OK

NOTE A context group can also be added to a job by simply selecting the

context from the repository, dragging it towards the context tab of the job,

and dropping it there.

Once you’ve added the context group to the job, it’s possible to run the job

for both the development and production environment by selecting the

context in the dropdown menu of the Run tab.

Posted in data integration, ETL, Talend | Tagged Contexts, ETL, Talend

Connecting to Salesforce and Mailchimp using Talend

Posted on November 25, 2013 by liesbethvanraemdonck

A lot of companies use Salesforce to manage their customers and contacts. In addition, Mailchimp can be used for sending out mailings to these connections. Mailchimp also captures information about what people did with these mails. This can be useful information for your CRM. A while ago, I was asked to make a list of everyone who had opened their mails in Mailchimp. Let me show you how easy it is to do something like that with Talend.

In Talend:

we can get a list of email addresses from Mailchimp of receivers that opened a mail

and we can ask Salesforce for the email addresses and names of all our connections

and we can also use a mapping component to join these lists.
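The join in the last step boils down to a lookup by email address. Here is a small plain-Java sketch of that idea, independent of the mapping component; OpenedMailJoin and its method are hypothetical names, and it assumes the Salesforce emails used as map keys are already lowercase.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustration of joining the Mailchimp list with the Salesforce contacts by email. */
final class OpenedMailJoin {

    /**
     * Returns "name <email>" for every contact whose email appears in the opened list.
     * The map is expected to use lowercase email addresses as keys.
     */
    static List<String> join(Map<String, String> contactNameByEmail, List<String> openedEmails) {
        List<String> result = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String email : openedEmails) {
            String key = email.trim().toLowerCase(Locale.ROOT);
            String name = contactNameByEmail.get(key);
            // seen.add returns false for duplicates, so each contact is listed once
            if (name != null && seen.add(key)) {
                result.add(name + " <" + key + ">");
            }
        }
        return result;
    }
}
```

Addresses that opened a mail but are unknown in Salesforce are simply skipped here, which corresponds to an inner join.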

Talend has a standard interface with Salesforce. And Mailchimp offers lots of

RESTful web services, which we can make use of in our Talend job.

1. Connecting to Salesforce  

Right click “Salesforce” under the Metadata and choose “Create Salesforce

Connection”.

After choosing a name for our connection, all we need to fill in is the username and password for our Salesforce-connection. The rest is already filled in for us.

To enable the “Finish” button, we need to check our properties first, using

the button “Check login”.

Under Metadata, we can now browse through all our Salesforce-data.

Now you’re probably wondering how to use this data in your ETL-flow. Well, that’s even easier!

Simply drag one of the tables (with the blue icons) into your job and choose the “tSalesforceInput” component from its 3 suggestions.

After specifying the necessary mappings you should get something like this:

We’ve used Contact and Account data of Salesforce for this.

In the next part, let’s check out how we generated the list of email

addresses.

2. Connecting to Mailchimp

Accessing your Mailchimp-data is a bit harder. We need two components from the Talend-palette: the ‘tRest’ component, because we need to use a RESTful web service for requesting our data from Mailchimp, and the ‘tExtractJSONFields’ component for interpreting the data we receive back.

After dragging the tRest component to your job, choose ‘POST’ as the ‘method’ and fill in the URL corresponding to the report you wish to receive. If you want to receive your report in XML-format instead of JSON, just add “.xml” at the end of the URL.

Here we needed the Mailchimp report that gives us information on opened emails.

If you are interested in other kinds of reports, you can find the list here:

http://apidocs.mailchimp.com/api/2.0/#lists-methods

Every request needs certain parameters. We can specify them in the HTTP body field, like this:

"{\"apikey\": \"your api key will be here\",\"cid\": \"put a campaign id here\"}"
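Writing the escaped quotes by hand is error-prone, so a small helper can build the body string instead. This is only a sketch: MailchimpBody is a made-up name, and the escaping covers backslashes and double quotes only, not control characters.

```java
/** Hypothetical helper that builds the JSON request body shown above. */
final class MailchimpBody {

    /** Escapes backslashes and double quotes so a value can sit inside a JSON string. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Returns the body with the API key as first parameter and the campaign id as second. */
    static String build(String apiKey, String campaignId) {
        return "{\"apikey\": \"" + escape(apiKey) + "\", \"cid\": \"" + escape(campaignId) + "\"}";
    }

    public static void main(String[] args) {
        // Placeholder values; a real key and campaign id come from your own account
        System.out.println(build("your-api-key", "your-campaign-id"));
    }
}
```

In a job, the two arguments could come from context variables so that the key does not have to be hard-coded in the component.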

The API-key will always be needed as the first parameter. You can find it in

Mailchimp under your ‘Account Settings’  – ‘Extras’ .

The second component we need is called ‘tExtractJSONFields’. After dragging it to our job, we link our first component to it.

We can use ‘Edit schema’ to define the data we want to extract.

Finally, all we need to do is specify the location of the data we are interested in, for example the ‘email’-field inside the ‘member’-field.

Now that we’re able to access our data from Mailchimp, let’s take a look at how we used it for generating the list of email addresses.

First we asked Mailchimp for all our Campaigns, then we used the ‘tFlowToIterate’-component so we could ask Mailchimp for the email addresses, once for every campaign in the list:

Finally, all we had to do was put these two jobs together and press ‘run’.

So, I hope you’ll enjoy it as much as I did!