1. APT_CONFIG_FILE is the environment variable that tells DataStage which configuration
   file to use (a project can have many configuration files). In fact, this is what
   is generally used in production. However, if this environment variable is not defined,
   how does DataStage determine which file to use?
   1. If the APT_CONFIG_FILE environment variable is not defined, DataStage
      looks for the default configuration file (config.apt) in the following locations:
      1. The current working directory.
      2. INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the
         top-level directory of the DataStage installation.
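The lookup order above can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not DataStage's own code; the function names are hypothetical, and the sketch assumes that $APT_ORCHHOME is available in the environment.

```python
import os


def candidate_config_files(env, cwd):
    """Return the configuration file paths to try, in the order described above."""
    explicit = env.get("APT_CONFIG_FILE")
    if explicit:
        # An explicitly set APT_CONFIG_FILE is used and no default is searched.
        return [explicit]
    # Default file name config.apt: current working directory first ...
    candidates = [os.path.join(cwd, "config.apt")]
    # ... then INSTALL_DIR/etc, where INSTALL_DIR is $APT_ORCHHOME.
    orch_home = env.get("APT_ORCHHOME")
    if orch_home:
        candidates.append(os.path.join(orch_home, "etc", "config.apt"))
    return candidates


def resolve_config_file(env=None, cwd=None):
    """Return the first candidate that exists on disk, or None if none exists."""
    env = os.environ if env is None else env
    cwd = os.getcwd() if cwd is None else cwd
    for path in candidate_config_files(env, cwd):
        if os.path.isfile(path):
            return path
    return None
```

Calling resolve_config_file() with no arguments inspects the real environment and working directory of the calling process.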
2. What are the different options a logical node can have in the configuration file?
   1. fastname: The fastname is the physical node name that stages use to open
      connections for high-volume data transfers. Its value is usually the
      network name of the machine. Typically, you can get this name with the Unix
      command uname -n.
   2. pools: The names of the pools to which the node is assigned. Based on the
      characteristics of the processing nodes, you can group nodes into sets of pools.
      1. A pool can be associated with many nodes, and a node can be part of many
         pools.
      2. A node belongs to the default pool unless you explicitly specify a pool list
         for it and omit the default pool name ("") from the list.
      3. A parallel job, or a specific stage in the parallel job, can be constrained to run
         on a pool (a set of processing nodes).
         1. If both the job and a stage within the job are constrained to run on
            specific processing nodes, the stage runs on the nodes that are
            common to the stage and the job.
   3. resource: Written as resource resource_type location [{pools disk_pool_name}] or
      resource resource_type value. resource_type can be canonicalhostname
      (the quoted Ethernet name of a node in a cluster that is not connected to the
      conductor node by the high-speed network), disk (a directory to which persistent
      data is read and written), scratchdisk (the quoted absolute path name of a directory
      on a file system where intermediate data is temporarily stored; it is local to
      the processing node), or an RDBMS-specific resource (e.g. DB2, INFORMIX,
      ORACLE).

3. How does DataStage decide on which processing nodes a stage should run?
   1. If a job or stage is not constrained to run on specific nodes, the parallel engine
      executes a parallel stage on all nodes defined in the default node pool (the default
      behavior).
   2. If the job or stage is constrained, the constrained processing nodes are chosen when
      executing the parallel stage (refer to 2.2.3 for more detail).
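The selection rules above, together with the pool rules from question 2, can be modelled with a small Python sketch. It is a simplified model for illustration only; the function names, node names and pool names ("etl", "sort") are hypothetical, and the real parallel engine takes more factors into account.

```python
DEFAULT_POOL = ""  # the default node pool is named by the empty string


def nodes_in_pool(node_pools, pool):
    """Return the names of nodes whose pool list contains `pool`, sorted."""
    return sorted(name for name, pools in node_pools.items() if pool in pools)


def nodes_for_stage(node_pools, job_pool=None, stage_pool=None):
    """Pick the nodes a stage runs on, following the rules described above."""
    if job_pool is None and stage_pool is None:
        # Unconstrained: every node in the default pool.
        return nodes_in_pool(node_pools, DEFAULT_POOL)
    selected = set(node_pools)
    if job_pool is not None:
        selected &= set(nodes_in_pool(node_pools, job_pool))
    if stage_pool is not None:
        # Job and stage both constrained: only the nodes common to both remain.
        selected &= set(nodes_in_pool(node_pools, stage_pool))
    return sorted(selected)


# Hypothetical layout; node3 omits "" and so is not in the default pool.
node_pools = {
    "node1": ["", "etl"],
    "node2": ["", "etl", "sort"],
    "node3": ["sort"],
}
```

With this layout, an unconstrained stage runs on node1 and node2, while a stage constrained to "sort" inside a job constrained to "etl" runs only on node2.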
4. When configuring an MPP system, you specify the physical nodes on which the
   parallel engine will run your parallel jobs. The node from which you start a job is
   called the conductor node. For other nodes, you do not need to specify the physical
   node. Also, you need to copy the (.apt) configuration file only to the nodes from which
   you start parallel engine applications. It is possible that the conductor node is not
   connected to the high-speed network switches, while the other nodes are connected to
   each other through very high-speed network switches. How do you configure your system
   to achieve optimized parallelism?
   1. Make sure that none of the stages are specified to run on the conductor node.
   2. Use the conductor node only to start the execution of parallel jobs.
   3. Make sure that the conductor node is not part of the default pool.
5. Although parallelization increases the throughput and speed of a process, why is
   maximum parallelization not necessarily the optimal parallelization?
   1. DataStage creates one process for every stage on each processing node. Hence, if
      the hardware resources are not available to support maximum parallelization, the
      performance of the overall system goes down. For example, suppose we have an SMP
      system with three CPUs and a parallel job with 4 stages, and we define 3 logical
      nodes (one for each CPU). DataStage will then start
      3*4 = 12 processes, all of which have to be managed by a single operating system.
      Significant time will be spent on context switching and process scheduling.
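The arithmetic in this example can be written as a short Python sketch. It follows the simple model used in the answer (one process per stage per logical node) and deliberately ignores operator combination, which can reduce the real count; the function name is hypothetical.

```python
def process_count(logical_nodes, stages):
    """Processes started under the 'one process per stage per node' model."""
    return logical_nodes * stages


cpus = 3
logical_nodes = 3   # one logical node per CPU, as in the example above
stages = 4

total = process_count(logical_nodes, stages)
per_cpu = total / cpus  # how many processes compete for each CPU on average

print(f"{total} processes, about {per_cpu:.0f} per CPU")
```

Doubling the logical nodes to 6 on the same three CPUs would give 24 processes, or 8 per CPU, which illustrates why adding nodes beyond what the hardware supports tends to hurt rather than help.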
6. Since we can have different logical processing nodes, it is possible that some nodes
   will be more suitable for some stages while other nodes will be more suitable for other
   stages. So, how do we decide which node is suitable for which stage?
   1. If a stage performs a memory-intensive task, it should run on a node
      that has more memory and scratch disk space available to it. For example, sorting
      data is memory intensive and can spill to scratch disk, so it should run on such
      nodes.
   2. If a stage depends on licensed software (e.g. SAS stage, RDBMS-related
      stages), you need to associate that stage with the processing
      node that is physically mapped to the machine on which the licensed software
      is installed. (Assumption: the machine on which the licensed software is installed
      is connected to the other machines through a high-speed network.)
   3. If a job contains stages that exchange large amounts of data, they should
      be assigned to nodes where the stages communicate most efficiently, either through
      shared memory (SMP) or a high-speed link (MPP).
7. Nodes are essentially a set of machines (especially in MPP systems). You start the
   execution of parallel jobs from the conductor node. The conductor node creates a shell
   on the remote machines (depending on the processing nodes) and copies the same
   environment to them. However, it is possible to create a startup script that selectively
   changes the environment on a specific node. This script has the default name startup.apt.
   However, as with the main configuration file, we can have many startup scripts.
   The appropriate script can be picked using the environment variable
   APT_STARTUP_SCRIPT. What is the use of the APT_NO_STARTUP_SCRIPT environment
   variable?
   1. Using the APT_NO_STARTUP_SCRIPT environment variable, you can instruct the
      parallel engine not to run the startup script on the remote shell.
8. What general guidelines should one follow when creating a configuration file so that
   optimal parallelization can be achieved?
   1. Consider avoiding the disk or disks that your input files reside on.
   2. Ensure that the different file systems mentioned as the disk and scratchdisk
      resources hit disjoint sets of spindles, even if they're located on a RAID
      (Redundant Array of Inexpensive Disks) system.
   3. Know what is real and what is NFS:
      1. Real disks are directly attached, or are reachable over a SAN (storage-area
         network: dedicated, just for storage, low-level protocols).
      2. Never use NFS file systems for scratchdisk resources. Remember that
         scratchdisk resources are also used for temporary storage of files and data
         during processing.
      3. If you use NFS file system space for disk resources, you need to
         know what you are doing. For example, your final result files may need to
         be written out onto the NFS disk area, but that doesn't mean the
         intermediate data sets created and used temporarily in a multi-job
         sequence should use this NFS disk area. It is better to set up a "final" disk
         pool and constrain the result sequential file or data set to reside there, but let
         intermediate storage go to local or SAN resources, not NFS.
   4. Know which data points are striped (RAID) and which are not. Where possible,
      avoid striping across data points that are already striped at the spindle level.

DataStage EE environment variables


The default environment variable settings are provided during the DataStage installation
(common to all users).
Users have a few options for overriding the default settings with the DataStage client
applications:
With DataStage Administrator - project-wide defaults for general environment variables, set
per project in the Projects tab under Properties -> General Tab -> Environment...
With DataStage Designer - settings at the job level in Job Properties
With DataStage Director - settings per run, which override all other settings and are very
useful for testing and debugging.
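The layering of these settings can be pictured with a small Python sketch. It is only a model of the precedence: the text states that per-run settings override everything else, while the ordering of job-level settings over project-level defaults is an assumption made here for illustration. The function name and the sample paths are hypothetical.

```python
from collections import ChainMap


def effective_environment(install_defaults, project=None, job=None, run=None):
    """Combine the four levels; ChainMap looks up keys in the leftmost map first."""
    # Most specific level first: per run, then job, then project, then installation.
    return ChainMap(run or {}, job or {}, project or {}, install_defaults)


env = effective_environment(
    install_defaults={"APT_CONFIG_FILE": "/cfg/default.apt", "APT_DUMP_SCORE": "False"},
    project={"APT_CONFIG_FILE": "/cfg/project.apt"},
    run={"APT_DUMP_SCORE": "True"},  # e.g. switched on for a single test run
)

print(env["APT_CONFIG_FILE"])  # value from the project level
print(env["APT_DUMP_SCORE"])   # value from the per-run level
```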

The DataStage environment variables are grouped, and each variable falls into one of the
categories. The default values set up during installation are reasonable, and in most cases
there is no need to modify them.
[Figure: Setting environment variables for parallel execution in DataStage Administrator]

Environment variables overview


Listed below are only the environment variables that are candidates for adjustment in
real-life project deployments. Please refer to the DataStage help for details on variables
not listed here.
General variables
LD_LIBRARY_PATH - specifies the location of dynamic libraries on Unix
PATH - Unix shell search path
TMPDIR - temporary directory
Parallel properties
APT_CONFIG_FILE - the parallel job configuration file. It points to the active
configuration file on the server. Please refer to Datastage EE configuration guide for
more details on creating a config file.
APT_DISABLE_COMBINATION - prevents operators (stages) from being
combined into one process. Used mainly for benchmarks.

APT_ORCHHOME - home path for parallel content.


APT_STRING_PADCHAR - defines a pad character which is used when a varchar
is converted to a fixed length string
Operator specific

The operator-specific variables under parallel properties are stage-specific settings and are
usually set during installation. The settings apply to the supported parallel database engines
(DB2, Oracle, SAS and Teradata).
APT_DBNAME - default DB2 database name to use
APT_RDBMS_COMMIT_ROWS - RDBMS commit interval
Reporting

The reporting variables control logging options and take True/False values only.
APT_DUMP_SCORE - shows operators, datasets, nodes, partitions,
combinations and processes used in a job.
APT_RECORD_COUNTS - helps detect and analyze load imbalance. It prints the
number of records consumed by getRecord() and produced by putRecord()
OSH_PRINT_SCHEMAS - shows unformatted metadata for all stages (interface
schema) and datasets (record schema). OSH_PRINT_SCHEMAS environment variable
should be set to verify that runtime schemas match the job design column
definitions (especially from Oracle).
OSH_DUMP - shows an OSH script and produces a verbose description of a step
before executing it
APT_NO_JOBMON - disables performance statistics and process metadata
reporting in Designer.
Compiler
APT_COMPILER - path to the C++ compiler needed to compile transformer
stages

Datastage EE configuration file

The DataStage EE configuration file is a master control file (a text file that sits on the
server side) for Enterprise Edition jobs. It describes the parallel system resources and
architecture. The configuration file provides the hardware configuration for supporting
architectures such as SMP (a single machine with multiple CPUs, shared memory and disk), grid,
cluster or MPP (multiple CPUs, multiple nodes and dedicated memory per node).

The configuration file defines all processing and storage resources and can be edited with any
text editor or within DataStage Manager.
The main benefit of the configuration file is that it separates software and hardware
configuration from job design. It allows hardware and software resources to change without
changing the job design. DataStage EE jobs can point to different configuration files by using
job parameters, which means that a job can use different hardware architectures without
being recompiled.
The DataStage EE configuration file is specified at runtime by the $APT_CONFIG_FILE
variable.
Configuration file structure

The DataStage EE configuration file defines the number of nodes, assigns resources to each
node and provides advanced resource optimizations and configuration.
The configuration file structure and key instructions:

node - a node is a logical processing unit. Each node in a configuration file is
distinguished by a virtual name and defines a number and speed of CPUs,
memory availability, page and swap space, network connectivity details, etc.

fastname - defines the node's hostname or IP address

pools - defines resource allocation. Pools can overlap across nodes or can be
independent.

resource (resources) - names of disk directories accessible to each node.
The resource keyword is followed by the type of resource that a given
resource is restricted to, for instance resource disk, resource scratchdisk,
resource sort, resource bigdata

Sample configuration files


Configuration file for a simple SMP

A basic configuration file for a single-machine, two-node server (2 CPUs) is shown below. The
file defines 2 nodes (dev1 and dev2) on a single etltools-dev server (an IP address may be
provided instead of a hostname) with 3 disk resources (d1 and d2 for data, and temp as
scratch space).
{
    node "dev1"
    {
        fastname "etltools-dev"
        pools ""
        resource disk "/data/etltools-tutorial/d1" { }
        resource disk "/data/etltools-tutorial/d2" { }
        resource scratchdisk "/data/etltools-tutorial/temp" { }
    }
    node "dev2"
    {
        fastname "etltools-dev"
        pools ""
        resource disk "/data/etltools-tutorial/d1" { }
        resource scratchdisk "/data/etltools-tutorial/temp" { }
    }
}
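Because every node on an SMP server shares the same host, a file like the one above is repetitive and easy to get wrong by hand (a missing brace is enough to break it). The following Python sketch generates such a file from a few parameters. The function name is hypothetical, the sketch assumes the pools keyword and gives every node the same disks, and the output should still be validated with the engine's own tools.

```python
def render_smp_config(host, node_names, disks, scratch):
    """Render a configuration file in which all nodes share one host."""
    lines = ["{"]
    for name in node_names:
        lines.append(f'    node "{name}"')
        lines.append("    {")
        lines.append(f'        fastname "{host}"')
        lines.append('        pools ""')  # every node stays in the default pool
        for disk in disks:
            lines.append(f'        resource disk "{disk}" {{ }}')
        lines.append(f'        resource scratchdisk "{scratch}" {{ }}')
        lines.append("    }")
    lines.append("}")
    return "\n".join(lines) + "\n"


text = render_smp_config(
    host="etltools-dev",
    node_names=["dev1", "dev2"],
    disks=["/data/etltools-tutorial/d1"],
    scratch="/data/etltools-tutorial/temp",
)
print(text)
```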

Configuration file for a cluster / MPP / grid

A sample configuration file for a cluster or grid of 4 machines is shown below.
The configuration defines 4 nodes (prod[1-4], on hosts etltools-prod[1-4]), node pools
(n[1-4] and s[1-4]), the resource pools bigdata and sort, and a temporary space.
{
    node "prod1"
    {
        fastname "etltools-prod1"
        pools "" "n1" "s1" "tutorial2" "sort"
        resource disk "/data/prod1/d1" {}
        resource disk "/data/prod1/d2" {pools "bigdata"}
        resource scratchdisk "/etltools-tutorial/temp" {pools "sort"}
    }
    node "prod2"
    {
        fastname "etltools-prod2"
        pools "" "n2" "s2" "tutorial1"
        resource disk "/data/prod2/d1" {}
        resource disk "/data/prod2/d2" {pools "bigdata"}
        resource scratchdisk "/etltools-tutorial/temp" {}
    }
    node "prod3"
    {
        fastname "etltools-prod3"
        pools "" "n3" "s3" "tutorial1"
        resource disk "/data/prod3/d1" {}
        resource scratchdisk "/etltools-tutorial/temp" {}
    }
    node "prod4"
    {
        fastname "etltools-prod4"
        pools "n4" "s4" "tutorial1"
        resource disk "/data/prod4/d1" {}
        resource scratchdisk "/etltools-tutorial/temp" {}
    }
}

Validate configuration file

The easiest way to validate the configuration file is to export APT_CONFIG_FILE variable
pointing to the newly created configuration file and then issue the following command:
orchadmin check
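When the check needs to be scripted, for example to validate several candidate files in a row, the same two steps can be wrapped in Python. This is a sketch under a few assumptions: orchadmin is on the PATH, the parallel engine environment is already set up in the calling process, and a zero exit status indicates a successful check. The function names are hypothetical.

```python
import os
import subprocess


def build_check_invocation(config_path, base_env=None):
    """Return the command and environment for checking one configuration file."""
    env = dict(os.environ if base_env is None else base_env)
    env["APT_CONFIG_FILE"] = config_path  # same effect as exporting it in the shell
    return ["orchadmin", "check"], env


def check_config(config_path):
    """Run the check and return (ok, combined output)."""
    cmd, env = build_check_invocation(config_path)
    result = subprocess.run(cmd, env=env, capture_output=True, text=True)
    # Assumption: exit status 0 means the configuration passed the check.
    return result.returncode == 0, result.stdout + result.stderr
```

A call such as check_config("/path/to/new_config.apt") then returns a success flag together with whatever orchadmin printed.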
