
Talend

Basics Training Course

Authors: Anjali Garg (912328), Praveen Tyagi (491801)

Course Content

1. Talend Architecture
2. Talend DI Open Studio Introduction
3. Creating Repository, Project and a Job
4. Talend DI Basic Components
5. Parallelism
6. Error and Logs Handling
7. Scheduling and Execution
8. Version management
9. Case Studies
10. Queries

Course Objective

To provide an overview of Talend Open Studio DI and its ETL product components, features and capabilities
To provide the steps to create a Job
To show how to run and schedule a Job
To cover version management in Talend
To give hands-on experience

Talend Overview

Free open source integration
Founded in 2005 by Bertrand Diard and Fabrice Bonan
First commercial open source vendor of data integration software
The company's first product, Talend Open Studio for Data Integration, was launched in October 2006
It uses an open architecture to address data and application integration challenges
The vendor's customer base for this product set is estimated at more than 3,000 companies
Java is the development language of Talend's products and services

Talend Products
Talend Open Studio
Free, open source integration software for Big Data, Data Integration, Data Quality, MDM, etc.

Talend Enterprise Edition
The Enterprise version additionally has a built-in Subversion plug-in, as well as support for joblets.

Talend Architecture & Supporting Platforms
Talend Architecture Components


Three different types of functional blocks:
At least one Studio, where data integration processes are designed.
The Administration Center, which enables the management and administration of all projects.
One or more execution servers. Talend Jobs are deployed to the Job servers through the Administration Center's Job Conductor, to be executed at a scheduled time or date, or on an event.

Talend Supporting Platforms
Talend Data Integration supports the following third-party components, products and operating systems. Support varies across products.

Supported Database and Storage Connectivity
Amazon RDS, Amazon Redshift, Amazon S3, AS400, DB2, Derby DB, Exasol, eXist-db, Firebird, Google Storage, Greenplum, H2, HSQLDB, Informix, Ingres, InterBase, JavaDB, JDBC, MaxDB, Microsoft OLE-DB, Microsoft SQL Server, MySQL, Netezza, Oracle, ParAccel, PostgreSQL, PostgresPlus, SAS, SQLite, Sybase, Teradata, VectorWise, Vertica

Supported SaaS and 3rd Party Applications
Alfresco, Centric CRM, Marketo, Microsoft CRM and AX, NetSuite, Openbravo, SAGE X3, Salesforce.com, SAP, SugarCRM, Vtiger CRM

Supported Operating Systems
CentOS Linux, OS X, Red Hat Enterprise Linux, Solaris, SUSE Linux, Ubuntu Linux, Microsoft Windows

Talend DI Open Studio Introduction
Intro - Talend DI Open Studio

Powerful and versatile open source solution for data integration
Generates Java code that accesses the database through ODBC/JDBC drivers
Its GUI gives access to a metadata repository and to a graphical designer
The product is based on Eclipse RCP
Users design individual Jobs using graphical components, from a set of over 400, for transformation, connectivity, or other operations
Jobs can be executed from within the Studio or as standalone scripts

Intro - Talend DI Open Studio

Talend Open Studio for Data Integration is used for:
Real-time or batch exchanges of data
ETL (Extract/Transform/Load)
Data migration

Important Concepts

Repository: The storage location DI uses to gather data related to all of the technical items used to design Jobs.

Project: A structured collection of technical items and their associated metadata.

Workspace: The directory where all project folders are stored.

Job: A graphical design of one or more components connected together, which lets you set up and run dataflow management processes.

Component: A preconfigured connector used to perform a specific data integration operation.

GUI

Business Model

Allows systems, connections, processes and requirements to be designed using standardized workflow notation, through an intuitive graphical library of shapes and links.

Allows data integration project stakeholders to graphically represent their needs.

Helps IT operations staff understand these expressed needs and translate them into technical processes (Jobs).

Designing Business Models is part of an enterprise's best practices.

Business Model - An example

Job Design

Job Design is the runnable layer of a business model.
It is a graphical design of one or more components connected together.
A Job Design implements a data flow and consists of several components that form the building blocks of an application.
Change the default settings of components, or create new components or families of components, to match exact needs.
Set connections and relationships between components to define the sequence and nature of actions.
Create and add items to the repository for reuse and sharing.

Context Management

A Job can define several contexts (for example dev and prod)
The prompt functionality applies to the different context variables (for example pathDir, pathFile)

Selecting the execution context
When Talend Open Studio starts:

During deployment:

Key points:
The context is included in the generated code and cannot be modified once the Job is deployed.
When exporting the Job, you can choose the execution context of the Job and of the sub-jobs launched by a tRunJob.

Basic Components

Big Data Components
Database Components
Processing Components
File Components
Misc Components
Log and Error Components

Big Data Components

Database Components

Oracle Components

Processing Components

Misc Components

File Components

Log & Errors

Job Creation Steps
1. Setup/Create Project
Locate the Talend Open Studio for Data Integration executable (TOS_DI-win-x86_64.exe) and double-click it to start the Studio.

2. Open Project

3. Create Job

4. Provide Job Name

5. Add Component

6. Setting Properties
Basic Settings tab

Double-click the component to open the Component view beneath the design workspace.

7. Set Schema

Click the Edit Schema button to create a built-in schema by adding columns and describing their content, according to the input file definition.

8. Advanced Setting Tab

Some components, especially File and Database components, provide numerous advanced options.

9. Add Transform Component

The tMap component allows the following types of operations:
Data transformation on any type of field
Field concatenation and interchange
Field filtering using constraints
Data rejection
tMap uses incoming connections to pre-fill the input schema with data in the Map Editor.
Note that there can be only one Main incoming row. All other incoming rows are of Lookup type.

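The tMap operations listed above (concatenation, filtering with a constraint, and rejecting rows) can be sketched in plain Java. This is only an illustration of the logic, not the code Talend generates; the class name `TMapFilterSketch`, the row layout {title, firstName, lastName} and the "Mr" filter are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: mimics a tMap with one output and one reject flow.
class TMapFilterSketch {

    // Holds the two flows leaving the sketch: matching rows and rejected rows.
    static final class Result {
        final List<String> output = new ArrayList<>();
        final List<String> rejects = new ArrayList<>();
    }

    // Each input row is assumed to be {title, firstName, lastName}.
    static Result map(List<String[]> rows) {
        Result result = new Result();
        for (String[] row : rows) {
            // Field concatenation: build one output field from two input fields.
            String fullName = row[1] + " " + row[2];
            // Filter constraint (hypothetical): keep only rows whose title is "Mr".
            if ("Mr".equals(row[0])) {
                result.output.add(fullName);
            } else {
                // Rows that fail the constraint go to the reject flow.
                result.rejects.add(fullName);
            }
        }
        return result;
    }
}
```

In the Studio, the same constraint would be typed into the tMap expression filter rather than written as a loop.
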
10. Map Editor
The Map Editor is made of several panels:

Input panel: the top left panel of the editor. It offers a graphical representation of all (main and lookup) incoming data flows. The data are gathered in the columns of the input tables.

Variable panel: the central panel of the Map Editor. It centralizes redundant information by mapping it to variables, and lets you carry out transformations.

Output panel: the top right panel of the editor. It maps data and fields from the Input tables and Variables to the appropriate Output rows.

Schema editor: the bottom panel, which offers a schema view of all columns of the input and output tables.

Expression editor: the editing tool for all expression keys of Input/Output data, variable expressions and filtering conditions.

10. Map Editor Contd

Expression editor

Filter input flow

11. Connection Types
A Job or a subjob is composed of a group of components logically linked to one another via connections.

Main
Passes data flows from one component to the other, iterating on each row and reading input data according to the component's properties setting (schema).

Rejects
Gathers the data that does NOT match the filter or is not valid for the expected output.

Lookup
Connects a sub-flow component to a main flow component.

11. Connection Types Contd
OnComponentOK
Triggers the target component once the execution of the source component completes without an error.

OnComponentError
Triggers the subjob or component as soon as an error is encountered in the primary Job.

OnSubjobOK
Triggers the next subjob on the condition that the main subjob completed without error.

OnSubjobError
Triggers the next subjob if the first (main) subjob does not complete correctly.

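The OnSubjobOK and OnSubjobError triggers described above behave much like the two branches of a try/catch. The sketch below shows that control flow in plain Java, assuming a subjob can be modelled as a `Runnable`; the class name `TriggerSketch` and the trace messages are made up for illustration and do not come from Talend.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: models subjob triggers as try/catch branches.
class TriggerSketch {

    // Runs the main subjob and records which trigger link would fire.
    static List<String> run(Runnable mainSubjob) {
        List<String> trace = new ArrayList<>();
        try {
            mainSubjob.run();
            trace.add("main subjob finished");
            // Reached only when the main subjob completed without error.
            trace.add("OnSubjobOK -> next subjob");
        } catch (RuntimeException e) {
            trace.add("main subjob failed: " + e.getMessage());
            // Reached only when the main subjob did not complete correctly.
            trace.add("OnSubjobError -> error-handling subjob");
        }
        return trace;
    }
}
```
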
11. Connection Types - Example

12. Add Output Component
The tFileOutputDelimited component writes rows to a file in delimited format.

13. Run a Job

14. Build Job

The Build Job feature allows a Job to be deployed and executed on any server, independently of Talend Studio.

It adds all of the files required to execute the Job to an archive, including the .bat and .sh launchers along with any context-parameter files or other related files.

By default, when a Job is built, all the required jars are included in the .bat or .sh command.

15. Build Job - Jar/sh/batch file

Job Designer: tMap and Lookup

CUSTOMER WITH STATES

Database Connection - Step 1

Select the Create connection option

Database Connection - Step 2

Provide a connection name

Database Connection - Step 3

Test the connection

Joins and Transformation Example

Person File (PersonFile.txt)

Id;Title;FirstName;LastName;AddressId
1;Mr;Austin;Patel;4
2;Ms;Sophia;Watson;2
3;Mr;Ewan;Parker;3
4;Ms;Evie;Cunningham;2
5;Mr;Dexter;Booth;

Address File (AddressFile.txt)

Id;Street;Town;County;Postcode
1;19 West Close;Shefford;West Sussex;WE24 8ST
2;140 Great Square;Slough;Cornwall;CO43 3RN
3;178 North Close;Biggleswade;Hereford and Worcester;HE91 7RE
4;89 Windsor Street;Warrington;Gloucestershire;GL38 5OU
5;153 Dee Avenue;Shefford;County Down;CO63 5UN

Joins and Transformation Example Contd

Id;Title;FirstName;LastName;AddressId;Street;Town;County;Postcode
1;Mr;Austin;Patel;4;89 Windsor Street;Warrington;Gloucestershire;GL38 5OU
2;Ms;Sophia;Watson;2;140 Great Square;Slough;Cornwall;CO43 3RN
3;Mr;Ewan;Parker;3;178 North Close;Biggleswade;Hereford and Worcester;HE91 7RE
4;Ms;Evie;Cunningham;2;140 Great Square;Slough;Cornwall;CO43 3RN
5;Mr;Dexter;Booth;;;;;

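The result above is a left outer join: every person row is kept, and the address fields stay empty when no address matches the AddressId. As a rough sketch of what the tMap lookup does with these two files, the plain Java below reproduces the same output lines. It is not Talend-generated code; the class name `JoinSketch` is invented, and the input lists are assumed to hold the data rows without the header line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: left outer join of person rows with address rows.
class JoinSketch {

    // Rows are assumed to be well-formed semicolon-delimited lines without headers.
    static List<String> join(List<String> persons, List<String> addresses) {
        // Load the lookup flow into a map keyed by the address Id (first field).
        Map<String, String> addressById = new LinkedHashMap<>();
        for (String line : addresses) {
            int sep = line.indexOf(';');
            addressById.put(line.substring(0, sep), line.substring(sep + 1));
        }
        List<String> joined = new ArrayList<>();
        for (String line : persons) {
            // The -1 limit keeps a trailing empty AddressId field.
            String[] fields = line.split(";", -1);
            String addressId = fields[4];
            // No match: emit four empty address fields (Street;Town;County;Postcode).
            String address = addressById.getOrDefault(addressId, ";;;");
            joined.add(line + ";" + address);
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> persons = List.of("1;Mr;Austin;Patel;4", "5;Mr;Dexter;Booth;");
        List<String> addresses = List.of("4;89 Windsor Street;Warrington;Gloucestershire;GL38 5OU");
        join(persons, addresses).forEach(System.out::println);
    }
}
```
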
Routines
Routines are fairly complex Java functions, generally used to factor out common code. They therefore optimize data processing and improve Job capacities.
Types of Routines:
System Routines: Classified according to the type of data they process, such as numerical, string, date, etc.
User Routines: Routines that the user creates or adapts from existing routines.

System Routines
Numeric Routines:
The Numeric category contains several routines, notably sequence, random and decimal (convertImpliedDecimalFormat).

Relational Routines:
Let you check conditions based on Boolean values.

System Routines Contd
TalendString Routines

User Routines

User routines: routines created by the user or adapted from existing routines.

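A user routine is, in essence, a Java class with static helper methods that Job expressions can call. The sketch below shows what a small one might look like; the class name `StringCleanup` and its method are hypothetical. In the Studio a routine is normally a public class stored under the routines node of the Repository; that packaging is left out here so the snippet stays self-contained.

```java
// Illustrative sketch only: a hypothetical user routine with one static helper.
class StringCleanup {

    /**
     * Returns the trimmed value, or defaultValue when the value is null or blank.
     */
    static String defaultIfBlank(String value, String defaultValue) {
        if (value == null || value.trim().isEmpty()) {
            return defaultValue;
        }
        return value.trim();
    }
}
```

Such a helper could then be called from a component expression, for example `StringCleanup.defaultIfBlank(row1.Title, "N/A")`, where `row1.Title` stands for a column of an incoming row.
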
Parallelism
Job Parallelization
Talend allows SubJobs to run in parallel (multi-threading) in two ways:

Setting the Job's Multi-threading property to TRUE.

Using a tParallelize component.

The tParallelize component is only available in the Enterprise Edition of Talend.

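To make the idea of parallel SubJobs concrete, the sketch below runs several independent tasks on a thread pool and waits for all of them. It uses only the standard `java.util.concurrent` API and is not how Talend implements multi-threading internally; the class name `ParallelSubjobsSketch` and the modelling of a SubJob as a `Callable<String>` are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: runs independent "subjobs" concurrently.
class ParallelSubjobsSketch {

    // Runs every subjob on its own thread and returns the results in input order.
    static List<String> runInParallel(List<Callable<String>> subjobs) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, subjobs.size()));
        try {
            // invokeAll blocks until all subjobs have completed.
            List<Future<String>> futures = pool.invokeAll(subjobs);
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for subjobs", e);
        } catch (ExecutionException e) {
            // A failing subjob surfaces here, wrapped with its original cause.
            throw new IllegalStateException("a subjob failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Parallel execution only helps when the SubJobs do not depend on each other's output; dependent SubJobs still need trigger links to fix their order.
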
Error and Log Handling
Error and Log Components
The Logs & Errors family groups together the components dedicated to catching log information and handling Job errors.
Examples: tAssert, tAssertCatcher, tLogRow, etc.

tAssert: Works alongside tAssertCatcher to evaluate the status of a Job execution. It concludes with a Boolean result based on an assertive statement related to the execution, and feeds the result to tAssertCatcher for proper Job status presentation.

tAssertCatcher: Based on its predefined schema, fetches the execution status information from the repository, the Job execution and tAssert.

tLogRow: Displays data or results in the Run console.

Management of Logs/Preferences
In the Properties and Job Designs views, preferences are entered automatically.

This is Built-in mode, which makes preferences hard to maintain.
It is better to create the metadata in the Repository and reference it in each Job.

Error handling

Each component has its own error handling routine (OnComponentError).

Customizing error logs

tLogCatcher schema
Default schema

tStatCatcher

Monitoring the performance of each component

Scheduling and Execution
Schedule job using Crontab

To execute the Job on the command line, run the command below:

sh ./JoinExample_run.sh

To schedule the Job on Linux, open the crontab for editing:

crontab -e

Then add an entry that runs the Job's launch script at the required times.

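A crontab entry has five time fields (minute, hour, day of month, month, day of week) followed by the command to run. As a small sketch of that layout, the Java helper below builds a once-a-day entry for a Job launch script. The class name `CrontabEntrySketch` is invented, and the script and log paths in the usage note are placeholders, not paths from this course.

```java
// Illustrative sketch only: formats a daily crontab line for a Job launch script.
class CrontabEntrySketch {

    // Builds "minute hour * * * sh script >> log 2>&1" for a run once per day.
    static String dailyEntry(int hour, int minute, String scriptPath, String logPath) {
        if (hour < 0 || hour > 23 || minute < 0 || minute > 59) {
            throw new IllegalArgumentException("hour must be 0-23 and minute 0-59");
        }
        // Crontab field order: minute, hour, day of month, month, day of week, command.
        return String.format("%d %d * * * sh %s >> %s 2>&1",
                minute, hour, scriptPath, logPath);
    }
}
```

For example, `dailyEntry(2, 30, "/opt/jobs/JoinExample/JoinExample_run.sh", "/var/log/JoinExample.log")` yields a line that runs the script every day at 02:30 and appends its output to the log file.
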
Scheduling
(Within Talend DI Open Studio)
The Scheduler view in DI helps schedule a task that periodically launches a Job via a task scheduling (crontab) program.

It can generate a crontab file that holds cron-compatible entries (the data required to launch the Job). These entries allow a Job to be launched periodically via the crontab program.

This Job launching feature is based on the crontab command, found in Unix and Unix-like operating systems. It can also be installed on any Windows system.

Job Execution & Scheduling
(Outside of Talend DI Open Studio)
Talend Open Studio (TOS) can export Jobs in a number of export types.

Talend export types are:

Autonomous Job
Axis WebService (WAR)
Axis WebService (ZIP)
JBoss ESB
Petals ESB
OSGI Bundle For ESB

Talend can generate either a Unix shell script or a Windows batch file for launching a Job. The Unix shell script is also suitable for Linux and OS X.

Version management
Version Management
When a Job is created in Talend Studio, its default version is 0.1, where 0 stands for the major version and 1 for the minor version.

Creating a new version

Close the Job if it is open in the design workspace.

In the Repository tree view, right-click your Job and select Edit properties from the drop-down list to open the [Edit properties] dialog box.

Next to the Version field, click the M button to increment the major version or the m button to increment the minor version.

Click Finish to validate the modification.

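The major.minor numbering can be sketched as two small string operations. The class name `JobVersionSketch` is invented, and one behaviour is an assumption rather than something stated in this course: the sketch resets the minor version to 0 when the major version is incremented, as is conventional for major.minor schemes.

```java
// Illustrative sketch only: bumps a "major.minor" version string.
class JobVersionSketch {

    // Mirrors the m button: 0.1 becomes 0.2.
    static String bumpMinor(String version) {
        String[] parts = version.split("\\.");
        return parts[0] + "." + (Integer.parseInt(parts[1]) + 1);
    }

    // Mirrors the M button. Resetting the minor part to 0 is an assumption.
    static String bumpMajor(String version) {
        String[] parts = version.split("\\.");
        return (Integer.parseInt(parts[0]) + 1) + ".0";
    }
}
```
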
Case Studies
Case Study

Demo of how to create a simple DI Job

Please find the attached Case Studies.

Queries???

Thank You!!!
