
Talend Open Studio and R integration

21-June-2015

Target Corp LIM Upgrade


TARGET Relationship

Akshayaa.V
Talend Open Studio and R

akshayaa.v@tcs.com
Confidentiality and Non-Disclosure Notice

Confidentiality Statement
The information contained in this document is confidential and proprietary to TCS.
This information may not be disclosed, duplicated or used for any other purposes.
The information contained in this document may not be released in whole or in part
outside TCS for any purpose without the express written permission of TATA
Consultancy Services.

Code of Conduct

Tata Code of Conduct


We, in our dealings, are self-regulated by a Code of Conduct as enshrined in the
Tata Code of Conduct. We request your support in helping us adhere to the Code in
letter and spirit.
We request that any violation or potential violation of the Code by any person be
promptly brought to the notice of the Local Ethics Counsellor or the Principal Ethics
Counsellor or the CEO of TCS. All communication received in this regard will be
treated and kept as confidential.

Table of Contents

Confidentiality and Non-Disclosure Notice
Code of Conduct
Table of Contents
Overview
Objective
1. Why to integrate R and Talend?
2. R and Talend Interface
2.1. Talend component tExecuteRScript
2.2. How to Use the Component
3. Sample Scenario (Classification of Dataset)
3.1. Using tExecuteRScript Component
3.2. How to get back outcomes on the Talend

Overview
This document covers a method to integrate Talend Open Studio with the R language. It shows how to use R to build a
simple predictive model from data supplied by Talend and how to return the results to Talend.

Objective
1. To illustrate the purpose of integrating R and Talend
2. To give an overview of the tExecuteRScript component
3. To explain R and Talend interfacing through an example

1. Why to integrate R and Talend?

R is a de facto standard for statisticians, with a huge number of external packages for practically any kind of
analysis one could imagine, but even simple data operations must be hand-coded. R is a very expressive and
extensible data language, but one would perhaps prefer to spend time reasoning about the predictive model
rather than writing code to get the data out of the database.

This is particularly true in data exploitation scenarios, but also in rapid prototyping and, generally speaking, in the
whole business world.

If that is not enough, R is basically a data language plus a command-line executor. This is historically common for
statistical software (just think of SAS), so it's not a flaw in itself. But in a real-life Business Intelligence life cycle, you
probably have a corporate standard, a service bus, a protocol for data transfer and so on. A better interface with R
is therefore advisable. This is possible using a custom optional component made by me for Talend.

There are plenty of scenarios in which one would benefit from a crossover between Talend Open Studio and R. The
former is well suited to even complex ETL tasks, which by their very nature involve massive data I/O,
manipulation, federation and governance, but it completely lacks any serious statistical tooling.

2. R and Talend Interface

2.1. Talend component tExecuteRScript

This Talend Open Studio component provides a complete environment for executing code written for the popular
statistical platform R and retrieving the results. It's built around the JRI interface of the rJava package. Although this
package offers 100% compatibility in code execution, it's quite rudimentary in its I/O features and is limited to
retrieving only String, Int and Double arrays. This limitation is imposed by the inner architecture of R and
doesn't depend on the component itself.

Some features of the component:


i. Low level connection to an existing R installation
ii. Support for external .R file load and inline R code
iii. Two logging possibilities (Verbose/Silent)
iv. Result mapping to convert R symbols into a standard Talend row connection
v. Log redirection to tLogCatcher elements, if available
vi. Autocast of output, if possible
vii. Written in true OOP using the robust Talend Bridge framework

To establish the interface between R and Talend, this component needs to be installed. It is a low-level interface
to an existing R installation and should be able to execute almost any arbitrary R code, but the I/O interface is quite
crude: at the moment there is just a unidirectional data link, which goes from R to Talend and supports only a
subset of R types.

2.2. How to Use the Component

To use it, write the code you want R to execute in the box. Please note that it must be correctly
quoted and escaped. Alternatively, you can source code from an external .R file. This is by far the fastest route,
especially if you have a lot of code, since it doesn't need to be quoted or escaped. On the other hand, inline code can
be manipulated (parameterized) at runtime directly on the Talend side, while external .R scripts cannot. So the
final choice depends on your needs. In either case, it's always possible to pass command-line parameters to R
using the proper String (which must be quoted and escaped too). These parameters apply to both execution
scenarios just described.

At the very end of the tab, you'll find a parameter that lets you choose whether log messages from the component
are sent to tLogCatcher instances. If it is disabled, they are printed to stdout/stderr only.

In the advanced parameters, you can choose between two output redirection strategies. With the Standard client,
output coming from R is redirected to Talend's logging facilities, while with the Silent client the R console is muted.
Irrespective of the choice, R autoprinting is disabled, so you must explicitly put a print statement in your R
code to output something to the console.
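
The effect of disabled autoprinting can be reproduced in plain R, because source() also skips autoprinting by default. The sketch below is only an analogy and does not show the component's own mechanism; the script contents are hypothetical.

```r
# Hypothetical three-line script: an assignment, a bare expression, and an explicit print().
script <- tempfile(fileext = ".R")
writeLines(c("m <- mean(c(2.5, 3.5, 6))",
             "m",          # bare expression: relies on autoprinting
             "print(m)"),  # explicit print: always written to the console
           script)

# source() does not autoprint by default, so only the print() call produces output.
shown <- capture.output(source(script))
print(shown)
```

Only one line of output is captured, which comes from the explicit print() call.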

Although sometimes you just need to execute R code, more often you need to get the computation
results back on the Talend side. Within the limitations imposed by JRI, this is done from the advanced
parameters tab.

For each column of an output schema, you can map an R expression returning an array of a fixed kind (only
String, Double and Integer are supported). If the expression doesn't return an array, or the array cannot be cast to
String, Double or Integer on the R side, you'll probably get a NullPointerException somewhere. As this
originates from R, not Talend, you'll probably need to look at your R code and expressions to fix it (for example, factors must
be converted to character/String, Booleans to integers and so on).
Talend schema column conversion, on the other hand, is automatic where allowed. For example, a Double array coming from R
can be stored in a Float column of the output schema. This is perfectly allowed and doesn't cause any error.
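
The R-side conversions mentioned above can be sketched as a small helper. This is a minimal illustration in base R; the function name to_supported_vector is hypothetical and is not part of the component.

```r
# Hypothetical helper (not part of tExecuteRScript): coerce a vector to one of
# the three kinds the text says can be retrieved (character, integer, double).
to_supported_vector <- function(x) {
  if (is.factor(x))  return(as.character(x))  # factors -> character
  if (is.logical(x)) return(as.integer(x))    # TRUE/FALSE -> 1/0
  if (is.character(x) || is.integer(x) || is.double(x)) return(x)
  stop("unsupported type: ", class(x)[1])     # e.g. lists cannot be mapped
}

species <- to_supported_vector(factor(c("setosa", "virginica")))
flags   <- to_supported_vector(c(TRUE, FALSE))
```

An expression such as to_supported_vector(my_column), where my_column is a placeholder name, could then be used wherever a column mapping expects a plain array.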

3. Sample Scenario (Classification of Dataset)

3.1. Using tExecuteRScript Component

A classification model is fitted to a dataset (the iris dataset). A random forest model is built without a test
phase, with the iris species as the outcome (predicted) variable.

Find below the job layout,

In the above scenario, there are two subjobs. In the first one, the iris dataset is loaded from an external file.
The dataset is then split into two subsamples: the first (120 rows) is the training set, while the second (30
rows) is the prediction set. To simulate a prediction, the outcome variable is dropped from the prediction
dataset. Finally, the subsamples are saved to two CSV files with headers. In the second subjob, R is
called. The component simply sources the .R file.
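
In the job, the split is performed by Talend components, not by R. For readers who want to produce equivalent input files without Talend, the first subjob can be approximated in base R as below. The random row selection, the output directory and the file names train.csv and predict.csv are assumptions, since the source does not state them.

```r
# Sketch only: approximates the first subjob using R's built-in iris data.
set.seed(1)                              # arbitrary seed, for repeatability
train_rows  <- sample(nrow(iris), 120)   # assumption: rows are chosen at random
train_set   <- iris[train_rows, ]        # 120-row training set
predict_set <- iris[-train_rows, names(iris) != "Species"]  # 30 rows, outcome dropped

out_dir <- tempdir()                     # hypothetical output location
write.csv(train_set,   file.path(out_dir, "train.csv"),   row.names = FALSE)
write.csv(predict_set, file.path(out_dir, "predict.csv"), row.names = FALSE)
```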
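Find below the sample R script sent to the component:

setwd("C:/Users/Akshayaa/Desktop")
library(randomForest)
predict <- read.csv("predict.csv")                       # file name assumed
train <- read.csv("train.csv", stringsAsFactors = TRUE)  # file name assumed
r <- randomForest(Species ~ ., data = train)             # outcome column name assumed
outcome <- predict(r, predict)                           # predicted species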

After the working directory is set and the required library is loaded, the freshly created CSV files are imported.
Then a trivial model is fitted and an outcome array is filled with the predicted species.

3.2. How to get back outcomes on the Talend

Find below the advanced parameters pane for tExecuteRScript:

This pane lets you define an output schema and write a set of R expressions that feed that schema. In this
specific example, what we are actually saying is: (R) convert the outcome array elements to characters and then
(Talend) store them as Strings in the column species. Finally, these results can be used in the
ETL job through a standard outgoing Talend data connection.
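
The reason for converting to characters can be seen in base R: a factor stores integer level codes, and only as.character() returns the labels. The sketch below uses a hand-made factor as a stand-in for the model's predictions; the values are illustrative and are not results from the job.

```r
# Stand-in for the predicted species; real predictions would come from the model.
outcome <- factor(c("setosa", "virginica", "setosa"))

species_labels <- as.character(outcome)  # the labels, suitable for a String column
level_codes    <- as.integer(outcome)    # the internal level codes, not the labels

print(species_labels)
print(level_codes)
```

Mapping the factor to an Integer column would therefore carry the level codes rather than the species names.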

Thank You
Contact

For more information, contact gsl.cdsfiodg@tcs.com (Email Id of ISU)

About Tata Consultancy Services (TCS)

Tata Consultancy Services is an IT services, consulting and business solutions organization that delivers real results to
global business, ensuring a level of certainty no other firm can match. TCS offers a consulting-led, integrated portfolio
of IT and IT-enabled infrastructure, engineering and assurance services. This is delivered through its unique Global
Network Delivery Model™, recognized as the benchmark of excellence in software development. A part of the Tata
Group, India's largest industrial conglomerate, TCS has a global footprint and is listed on the National Stock Exchange
and Bombay Stock Exchange in India.

For more information, visit us at www.tcs.com.

IT Services/Business Solutions/Consulting

All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content
/ information contained here is correct at the time of publishing. No material from here may be copied, modified,
reproduced, republished, uploaded, transmitted, posted or distributed in any form without prior written permission
from TCS. Unauthorized use of the content / information appearing here may violate copyright, trademark and other
applicable laws, and could result in criminal or civil penalties. Copyright 2011 Tata Consultancy Services Limited
