
1) What do you mean by specifying layouts in Ab Initio?

Ans: The layout determines whether a component runs in serial or parallel mode. If you specify the path of a serial directory, the component runs as a single stream; if you specify the path of a multifile directory, the component runs in parallel. The path you specify also serves as the working directory of the graph, where all intermediate files are stored.
Layout can be specified as
1) propagate from neighbors
2) URL
3) custom
4) host
Before you can run an Ab Initio graph, you must specify layouts to describe the following to the
Co>Operating System:

The location of files
The number and locations of the partitions of multifiles
The number of, and the locations in which, the partitions of program components execute
Layout is one of the following:

A URL that specifies the location of a serial file
A URL that specifies the location of the control partition of a multifile
A list of URLs that specifies the locations of:
o The partitions of an ad hoc multifile
o The working directories of a program component

2) What is skew?
Ans: Skew describes the unbalanced distribution of data across partitions. You can tune performance by controlling skew, max-core, and other settings; there are many ways to do this.
3) How to read only 10 records from i/p file?
Ans: 1) There is a component called Sample in the Sort component folder. If you use it after the input file, you can specify how many records to pass through.
2) In the Input Table component, on the Parameters tab, you can specify how many records to read.
3) There is also a Leading Records component (in 2.11, at least) that allows you to specify the number of records to read from a serial or MFS file.
4) Another way is the Read Raw component, available in 2.11 or higher, although you will have to describe and process the record structure yourself because it works with raw data.
4) How do you go from 4-way to 8-way in a graph?
Ans: Put a partition component (with a gather where needed) between the two sections: the upstream components run in the 4-way MFS layout and the partitioner runs in the 8-way layout, so the data is repartitioned from 4 ways to 8.

5) In Ab Initio, what are upstream and downstream?


Ans: Upstream and downstream are used in conjunction with the EME for dependency and impact analysis of the graphs we have developed and saved into the repository. Basically, it helps track changes between versions, including changes to individual components and to the variables within them.
6) To extract files from both Oracle DB and from Mainframes (DB2): is it possible to extract the data directly from the DBs, or do I need to convert them into flat files and load?
Ans: You can extract directly from the database in each case. You just have to make sure you have a config (.dbc) file set up for DB2 and for Oracle, and that all your login variables are set in your Run Settings. To my knowledge, you can also unload data directly from DB2, Oracle, or Informix using the Unload DB Table component.
7) Does anybody know how many columns a lookup file can have? What is the maximum data we can have in a lookup file? I am doing a code review for my application, and I see 8 to 10 columns in each lookup file with a large amount of data.
Ans: 1) There is no set limit on the number of columns a lookup file can contain. There is, however, a limit on the size of the data file. If you believe that 8-10 columns are too large, you might be correct. If the lookup contains anything over 750,000-1M records, I would highly recommend using a join instead. The lookup file will die if it gets too large, and you will have to code a join.
2) Lookups get cached into memory during graph execution; it is always a good idea to keep the data in a lookup to the bare minimum required.
Don't keep any columns you don't need or you don't access from the lookup in Lookup file.
If the graph is partitioned then try to use lookup_local wherever possible. For this your
partition key and lookup key must match or lookup key should be leading subset of
partition key.
Rule of thumb: Trim any fields from the data, which you don't use in the downstream
processing.
3) The limit for a lookup file is 2GB. Whether or not it is sensible to use a lookup of that
sort of size depends on what it's being used for.
9) How can I stop an executing graph in the middle under certain conditions, and how do I restart it?
Ans: 1) Doing a kill -9 PID1 PID2 will only kill the Ab Initio processes running on the host node; we may still have Ab Initio processes running on different agent nodes. During runtime, Ab Initio creates a recovery file named <graph_name>.rec in the host directory specified in Run -> Settings in the GDE. If no host directory is specified, the file is created in the default $HOME of the user specified in Run -> Settings. This recovery file contains pointers to the various temporary files created dynamically during runtime. To kill an Ab Initio job and all its associated processes running across all the nodes, execute the following two commands, in this order:
1. m_kill -9 <recovery_file>
2. m_rollback -d <recovery_file>

If the graph execution has to be stopped depending on certain conditions, use the force_error() function.
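As an illustration, here is a minimal sketch of calling force_error() from a transform rule (the field name and message are hypothetical):

out.amount :: if (in.amount >= 0) in.amount
              else force_error("negative amount encountered");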
10) What are the functions used for system date?
a. today() :: Returns the internal representation of the current date on each call.
b. today1() :: Returns the internal representation of the current date on the first call.
Note: DML represents dates internally as integer values specifying days relative to January 1, 1900.
c. now() :: Returns the current local date and time.
d. now1() :: The first time a component calls now1, the function returns the value returned from the system function localtime. The second and subsequent times a component calls now1, it returns the same value it returned on the first call.
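For example, a hedged sketch of using these functions in a transform (the output field names are hypothetical):

out.load_date :: today(); /* current date, in DML's internal date form */
out.load_ts :: now();     /* current local date and time */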
11) How to convert a string into date format?
Ans: The string needs to be cast to the date format first. So if you have an input field that is the string 20031130 and your output field is a date (YYYY-MM-DD), then use:
out.fieldname :: (date("YYYY-MM-DD")) in.fieldname;
Note: if the input field contains NULL data this fails, so use the is_valid() or is_defined() functions to check the validity of the input data.
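Combining the cast with such a validity check, a minimal sketch (the default date is an illustrative assumption):

out.fieldname :: if (is_valid((date("YYYY-MM-DD")) in.fieldname))
                    (date("YYYY-MM-DD")) in.fieldname
                 else (date("YYYY-MM-DD")) "1900-01-01";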
12) What is the relation between EME, GDE and Co-operating system?
Ans. EME stands for Enterprise Meta>Environment, GDE for Graphical Development Environment, and the Co>Operating System is the Ab Initio server. The relation among them is as follows. The Co>Operating System is installed on a particular OS platform, which is called the native OS. The EME is a repository, much like the repository in Informatica; it holds the metadata, transformations, DB config files, and source and target information. The GDE is the end-user environment where we develop graphs (the equivalent of mappings in Informatica); the designer uses the GDE to design graphs and saves them to the EME or to a sandbox. The sandbox is on the user side, while the EME is on the server side.
13) What is the use of Aggregate when we have Rollup? We know the Rollup component in Ab Initio is used to summarize groups of data records, so where would we use Aggregate?
Ans: Aggregate and Rollup can both summarize data, but Rollup is much more convenient to use and makes it much clearer how a particular summarization is performed. Rollup also offers additional functionality, such as input and output filtering of records.
Aggregate and Rollup perform the same action, but Rollup exposes intermediate results in main memory, whereas Aggregate does not support intermediate results.
14) What kinds of layouts does Ab Initio support?

Ans: Basically, Ab Initio supports serial and parallel layouts. A graph can have both at the same time. The parallel one depends on the degree of data parallelism: if the multifile system is 4-way parallel, a component in the graph can run 4-way parallel if its layout is defined to match that degree of parallelism.
15) How can you run a graph infinitely?
Ans: To run a graph infinitely, the end script in the graph should call the .ksh file of the graph.
Thus if the name of the graph is abc.mp then in the end script of the graph there should be a
call to abc.ksh. Like this the graph will run infinitely.
16) Do you know what a local lookup is?
Ans: If your lookup file is a multifile partitioned/sorted on a particular key, then the lookup_local function can be used instead of the lookup function call. It is local to a particular partition, depending on the key.
A Lookup File consists of data records that can be held in main memory. This lets the transform function retrieve the records much faster than retrieving them from disk, and allows the transform component to process the records of multiple files quickly.
17) What is the difference between look-up file and look-up, with a relevant example?
Ans: Generally, a Lookup File represents one or more serial files (flat files) whose data is small enough to be held in memory. This allows transform functions to retrieve records much more quickly than they could from disk.
A lookup is the component of an Ab Initio graph where we store data and retrieve it using a key parameter.
A lookup file is the physical file where the data for the lookup is stored.
18) How many components in your most complicated graph?
It depends on the types of components you use. Usually, avoid using too many complicated transform functions in one graph.
19) Explain what is lookup?
A lookup is basically a specific dataset which is keyed. It can be used for mapping values according to the data present in a particular file (serial or multifile). The dataset can be static as well as dynamic (for example, when the lookup file is generated in a previous phase and used as a lookup in the current phase). Sometimes hash joins can be replaced by a Reformat plus a lookup, if one of the inputs to the join has far fewer records and a slim record length.
Ab Initio has built-in functions to retrieve values from the lookup using the key.
20) What is a ramp limit?
The limit parameter contains an integer that represents a number of reject events. The ramp parameter contains a real number that represents a rate of reject events relative to the number of records processed:
number of bad records allowed = limit + (number of records * ramp)
ramp is basically a fractional rate (from 0 to 1). Together, these two provide the threshold for bad records.
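For example, with limit = 10 and ramp = 0.01, a run that has processed 100,000 records tolerates 10 + 100,000 * 0.01 = 1,010 bad records before the component aborts.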
21) Have you worked with packages?

A multistage transform component uses packages by default. However, a user can create his own set of functions in a transform package and include it in other transforms.
22) Have you used rollup component? Describe how.
If you want to group records on particular field values, Rollup is the best way to do it. Rollup is a multi-stage transform and contains the following mandatory functions:
1. initialize
2. rollup
3. finalize
You also need to declare a temporary variable if you want, say, counts per group.
For each group, Rollup first calls the initialize function once, then calls the rollup function for each record in the group, and finally calls the finalize function once after the last rollup call, as sketched below.
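A minimal sketch of such an expanded-mode rollup package (a hedged example: the input is assumed to have a field named key, and cnt is an illustrative count):

type temporary_type =
record
  decimal(8) cnt; /* running count for the current group */
end;

temp :: initialize(in) =
begin
  temp.cnt :: 0;
end;

temp :: rollup(temp, in) =
begin
  temp.cnt :: temp.cnt + 1; /* called once per record in the group */
end;

out :: finalize(temp, in) =
begin
  out.key :: in.key;
  out.cnt :: temp.cnt; /* emitted once per group */
end;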
23) How do you add default rules in transformer?
Add Default Rules opens the Add Default Rules dialog. Select one of the following: Match Names, which generates a set of rules that copies input fields to output fields with the same name, or Use Wildcard (.*) Rule, which generates one wildcard rule that copies input fields to output fields with the same name.
1) If it is not already displayed, display the Transform Editor grid.
2) Click the Business Rules tab if it is not already displayed.
3) Select Edit > Add Default Rules.
In the case of Reformat, if the destination field names are the same as (or a subset of) the source field names, there is no need to write anything in the reformat xfr, unless you want to apply a real transform beyond reducing the set of fields or splitting the flow into a number of flows.
24) What is the difference between partitioning with key and round robin?
Partition by Key (hash partition): a partitioning technique used when the keys are diverse. If some key values occur in very large volumes, there can be large data skew, but this method is used most often for parallel data processing.
Round-robin partition: a partitioning technique that distributes the data uniformly across the destination partitions. The skew is zero when the number of records is divisible by the number of partitions. A real-life example is how a pack of 52 cards is dealt to 4 players in round-robin fashion.
25) How do you improve the performance of a graph?
There are many ways the performance of the graph can be improved.
1) Use a limited number of components in a particular phase
2) Use optimum value of max core values for sort and join components
3) Minimise the number of sort components
4) Minimise sorted join component and if possible replace them by in-memory join/hash join
5) Use only required fields in the sort, reformat, join components
6) Use phasing/flow buffers in case of merge, sorted joins
7) If the two inputs are huge then use sorted join, otherwise use hash join with proper driving
port

8) For large dataset don't use broadcast as partitioner


9) Minimise the use of regular expression functions like re_index in transform functions
10) Avoid repartitioning data unnecessarily
Try to run the graph in MFS for as long as possible. For this, the input files should be partitioned, and if possible the output files should be partitioned as well.
26) How do you truncate a table?
From Ab Initio, use the Run SQL component with the DDL "truncate table <table_name>", or use the Truncate Table component.
27) Have you ever encountered an error called "depth not equal"?
When two components are linked together and their layouts don't match, this problem can occur during compilation of the graph. The solution is to use a partitioning component in between wherever the layout changes.
28) What is the function you would use to transfer a string into a decimal?
In this case no specific function is required if the string and the decimal are the same size; a decimal cast with the size in the transform function will suffice. For example, if the source field is defined as string(8) and the destination as decimal(8) (say the field name is field1):
out.field1 :: (decimal(8)) in.field1;
If the destination field is smaller than the input, string_substring can be used. Say the destination field is decimal(5):
out.field1 :: (decimal(5)) string_lrtrim(string_substring(in.field1, 1, 5)); /* string_lrtrim trims leading and trailing spaces */
29) What is an outer join?
An outer join is used when you want to select all the records from a port, whether or not they satisfy the join criteria.
30) What are Cartesian joins?
Joins two tables without a join key. Key should be {}.
31) What is the difference between a DB config and a CFG file?
A .dbc file has the information Ab Initio requires to connect to the database to extract or load tables or views. A .cfg file is the table configuration file created by db_config when using components like Load DB Table.
33) Explain the difference between the truncate and "delete" commands.
TRUNCATE is a DDL command whereas DELETE is a DML command. A rollback cannot be performed after a TRUNCATE statement, whereas a DELETE can be rolled back. A WHERE clause cannot be used with TRUNCATE, whereas it can be used with DELETE.
34) How can we create a job sequencer in Ab Initio, i.e., run a number of graphs in sequence?
As such, there is no job sequencer supported by Ab Initio up to the versions GDE 1.13.3 and Co>Op 2.12.1, but we can sequence the jobs by creating wrapper scripts in UNIX, i.e., a Korn shell script which calls the graphs in sequence.
Scheduling of the jobs can also be done with a scheduling tool such as Control-M, in which the graphs' corresponding scripts and wrapper scripts are placed in execution order, and the execution of the graphs can be monitored. There is no sequencer concept in Ab Initio itself. Suppose you have graphs A, B, C, where A's output is input to B and B's output is input to C. Then you write a wrapper script that calls these jobs, like this:
a.ksh
b.ksh
c.ksh
You can use the next_in_sequence() function, which returns a sequence of integers.
35) How to take the input data from an excel sheet?
There is a Read Excel component that reads Excel files either from the host or from a local drive; the DML will be a default one.
Alternatively, save it as a CSV (delimited) file and read it through an input component.
36) What is the function you would use to transfer a string into a decimal?
use ""reinterpret_as" function to convert string to decimal,or decimal to string.
syntax: To convert decimal into string
reinterpret_as(ebcdic string(13),(ebcdic decimal(13))(in.cust_amount))
37) How to run the graph without GDE?

In the run directory a graph can be deployed as a .ksh file. Now, this .ksh file can be run at the
command prompt as:
ksh <script_name> <parameters if any>
38) How to work with parameterized graphs?
One of the main purposes of parameterized graphs is that if we need to run the same graph n times for different files, we set up graph parameters like $INPUT_FILE, $OUTPUT_FILE, etc., and supply their values under Edit > Parameters. These parameters are substituted at run time. We can define different kinds of parameters: positional, keyword, local, etc.
The idea here is, instead of maintaining different versions of the same graph, we can maintain
one version for different files.
Have you worked with packages?
Packages are nothing but reusable blocks of objects such as transforms, user-defined functions, and DMLs. These packages must be included in the transform where you use them. For example, consider a user-defined function like:
/*string_trim.xfr*/
out :: trim(input_string) =
begin
let string(35) trimmed_string = string_lrtrim(input_string);
out :: trimmed_string;
end
Now, the above xfr can be included in the transform where you call the function, as:
include "~/xfr/string_trim.xfr";
But this must be included ABOVE your transform function.
For more details see the help file in "packages".
What is an outer join?
If you want to see all the records of one input file regardless of whether there is a matching record in the other file, that is an outer join.
What is driving port?
In a join, it is sometimes advantageous to set the Sorted-Input parameter to "In memory: Input need not be sorted". This helps when we are sure one of the input ports has far fewer records than the other, so that port's data can be held in memory. In that case, we set the other port as the driving port.
Say port in0 has 1,000 records and in1 has 1 million records; we then set in1 as the driving port, for which the value would be 1. By default, the driving port value is 0 (for in0).
Depending on the requirement, it is sometimes more advisable to create a lookup instead, but that depends on the requirement and design.
What is a wrapper script? Can anyone explain elaborately?
Writing a wrapper script helps you run the graphs in the sequence you want.
Example: you need to run 3 graphs, with the condition that after the first graph runs successfully you take the feed it generated and use it in the next graph, and so on. After each graph finishes, you check that it ran successfully and then run the next ksh, and so on. That is, you write a UNIX script that runs the ksh of each graph in turn, as sketched below.
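A minimal ksh sketch of such a wrapper (the graph script names a.ksh, b.ksh, and c.ksh are illustrative):

#!/bin/ksh
# Run three deployed graphs in sequence; stop the chain if any step fails.
./a.ksh || { print "a.ksh failed"; exit 1; }
./b.ksh || { print "b.ksh failed"; exit 1; }
./c.ksh || { print "c.ksh failed"; exit 1; }
print "all graphs completed successfully"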
What is Conditional DML? Can anyone please explain with an example?
The DML that is used as a condition is known as conditional DML.
Suppose we have data that includes the Header, Main data and Trailer as given below:
10 This data contains employee info.
20 emp_id,emp_name, salary
30 count
So, the DML for the above structure would be:
record
  decimal(",") id;
  if (id == 10)
  begin
    string(",") info;
  end
  else if (id == 20)
  begin
    string(",") emp_id;
    string(",") name;
    string(",") salary;
  end
  else if (id == 30)
  begin
    decimal(",") count;
  end
end;
Could anybody provide the major UNIX commands for the Ab Initio multifile system?
m_mkfs - creates a multifile system
m_ls - lists multifiles
m_rm - removes a multifile
m_cp - copies a multifile
What is meant by vector field? Explain with an exam...
A vector is a sequence of the same type of elements. The element type may be any type
including a vector or record type.
It is a field which tells us how many times a particular field is repeated. For example, take this input:

Cust_id  purchase_amount  purchase_date
101      1000             29.08.06
101      500              30.08.06
102      1050             31.08.06
103      1140             01.09.06
103      1000             02.09.06
103      500              30.09.06

and this output:

Cust_id  total_purchase_amount  no_purchase  purchase_date(1)  purchase_date(2)
101      1500                   2            29.08.06          30.08.06
102      1050                   1            31.08.06
103      2640                   3            01.09.06          02.09.06 (and so on)

Here no_purchase is the vector field, which represents the number of times a customer has made purchases.
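A hedged sketch of a DML record format for such an output record, where the vector's length depends on the no_purchase field (the delimiters and date format are illustrative assumptions):

record
  decimal(",") cust_id;
  decimal(",") total_purchase_amount;
  decimal(",") no_purchase;
  date("DD.MM.YY") purchase_date[no_purchase]; /* vector sized by no_purchase */
end;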
What does dependency analysis mean in Ab Initio?
Dependency analysis answers questions about data lineage: where the data comes from, which applications produce it, which depend on it, and so on.

For data parallelism, we can use partition components. For component parallelism, we can
use replicate component. Like this which component(s) can we use for pipeline parallelism?
Pipeline parallelism is when a connected sequence of components on the same branch of a graph executes concurrently.
Components like Reformat, where we distribute the input flow to multiple output flows using output_index based on some selection criteria and process those output flows simultaneously, create pipeline parallelism.
But components like Sort, where the entire input must be read before a single record is written to output, cannot achieve pipeline parallelism.
flow:
input file ------> reformat -----> rollup ------> filter by expression -----> o/p file
(e.g., 50 records seen on the first flow, 25 on the next, 10 on the last)
Plainly speaking, whenever you run a graph you can observe the number of records processed on the flows while upstream components are still running; this is the best example of pipeline parallelism.
What is .abinitiorc and what it contains?
.abinitiorc is the configuration file for Ab Initio, found in the user's home directory. It generally contains the Ab Initio home path and login information, such as IDs, encrypted passwords, and the login method for the hosts the graph connects to at execution time.
The .abinitiorc file can also contain configuration variables such as AB_WORK_DIR and AB_DATA_DIR; a system-wide copy of the file can be found in $AB_HOME/Config.
What do you mean by .profile in Abinitio and what...?
.profile is a file that is executed automatically when a particular user logs in.
You can change your .profile to include any commands you want to execute whenever you log in. You can even put commands in your .profile that override settings made in /etc/profile (which is set up by the system administrator).
You can set the following in your .profile:
- environment settings
- aliases
- path variables
- the name and size of your history file
- primary and secondary command prompts, and many more.

What is semi-join?
In Ab Initio there are three types of join: 1. inner join, 2. outer join, and 3. semi join.
For an inner join, the record-required parameter is true for all in ports.
For an outer join, it is false for all in ports.
If you want a semi join, set record-required to true for the required port and false for the other ports.
What is data mapping and data modeling?
Data mapping deals with the transformation of the extracted data at FIELD level, i.e., the transformation of a source field to a target field is specified by the mapping defined on the target field. The data mapping is specified during the cleansing of the data to be loaded.
For example:
source:
string(35) name = "Siva Krishna                 ";
target:
string("01") nm = NULL(""); /* maximum length is string(35) */
Then we can have a mapping like: straight move, trimming the leading or trailing spaces.
What is driving port? When do you use it?
The driving port in a join supplies the data that drives the join. That is, every record from the driving port is compared against the data from the non-driving port.
We set the driving port to the larger dataset, so that the smaller, non-driving data can be kept in main memory to speed up the operation.
What is $mpjret? Where it is used in ab-initio?
$mpjret is the return value of the shell command "mp run" that executes an Ab Initio graph.
What is data cleaning? How is it done?
Simply put, it is purifying the data.
Data cleansing is the act of detecting and removing and/or correcting a database's dirty data (i.e., data that is incorrect, out of date, redundant, incomplete, or formatted incorrectly).
1. What is the Difference between GDE and Co>Operating system?

The GDE (Graphical Development Environment) is a GUI used to develop graphs in a simple manner.
The Co>Operating System is a distributed operating system layer that runs as a back-end server.
The current version of the GDE is 1.15 and of the Co>Operating System is 2.14.
2. Which process do you follow to develop a graph?

Getting the requirements
Preparing the mapping documents (a mapping document maps input fields to output fields using some functional logic)
Then, using the design documents, implementing the graph with the proper components.

3. Which components have you worked with?

Reformat
Rollup
Join
Sort
Replicate
Partition by expression and key
Redefine
Multi update
Lookup
Intermediate

4. Explain the Reformat component.


Reformat can change the record format by dropping, adding, or combining fields.
Ports:

Input
Output
Reject
Log
Error

Specific Parameters:

Select
Output Index

5. What is the difference between output Index and Select parameters in reformat?
Both select and output_index are used to filter data, but with the select parameter you cannot get the deselected records, whereas with the output_index parameter you can filter the data and also route the deselected records to another output port.
6. What is the difference between the Reformat and Redefine Format components?

Reformat can change the record format by dropping, adding, or modifying fields.
Redefine Format copies records from input to output without changing the record values.
7. Explain about Join component?

Reads the data from two or more inputs, combines the records with matching keys, and sends them to the output port.

Specific parameters:

Dedup: Set true to remove duplicates before joining

Driving port: the driving port is the largest input; the remaining inputs are read directly into memory. (Available only when Sorted Input is set to "In memory: Input need not be sorted".)

Join type:

1. Inner join
2. Full outer join
3. Explicit join

Record-required parameter: available when the join type is set to Explicit. For a left outer join, set it to true for input 0 and false for input 1; for a right outer join, set it to false for input 0 and true for input 1.

Key: Matching keys

Override key: sets alternative names for particular key fields

Max memory: Maximum number of bytes used before the join writes temporary files to disk (available only when Sorted Input is set to "Input must be sorted"). The default is 8 MB.

Select: To filter the data

Max-core: Maximum number of bytes used before the join writes temporary files to disk (available only when Sorted Input is set to "In memory: Input need not be sorted"). The default is 64 MB.

Sorted Input:

When set to "Input must be sorted", the join accepts only sorted input; when set to "In memory: Input need not be sorted", it accepts unsorted data.
Specific ports:
unusedn: we can retrieve the unmatched records through the unused ports.
9. Can we make an explicit join for more than two inputs?
Yes, we can join more than two inputs.
Ex:

For three inputs, if you want a left outer join, set the record-required parameter to true for input 0 and false for inputs 1 and 2.
For three inputs, if you want a right outer join, set the record-required parameter to false for inputs 0 and 1 and true for input 2.
10. What is the difference between merge and join?
Both components join data based on keys. With Join we combine two (or more) input flows, whereas with Merge we combine the partitions of already-partitioned, sorted data.
11. Explain about sort component?
The Sort component sorts and merges data.
Parameters:
Key
Max-core (the default is 100 MB)
12. How to determine the Maximum usage of memory of a component?
The maximum available value of max-core is 2^31 - 1 bytes.
13. Explain Partition by Key and Partition by Expression
Partition by Key: distributes records to the output flow partitions according to their key values.
Partition by Expression: distributes records to the output flow partitions according to an expression.
14. What are the different types of partition components?

Partition by key
Partition by Expression
Partition by round robin
Partition by range
Broadcast

15. Difference between broadcast and replicate?


Broadcast: combines the records it receives into a single flow and writes a copy of that flow to each partition of the output flow. Broadcast supports data parallelism.
Replicate: combines the records it receives into a single flow and writes a copy of that flow to each output flow. Replicate supports component parallelism.
16. What is difference between Concatenate and merge?
Concatenate: appends the multiple flow partitions one after another.
Merge: combines multiple flow partitions that have been sorted by a key.
16. What are the different de partition components?

Merge
Interleave (Combines in round robin fashion)
Concatenate

Gather (Combines the data arbitrarily)

16. What is the difference between Reformat and Filter by Expression?


In both components we can filter the data based on a select expression, but in Reformat we can't get the deselected records on a separate port, whereas Filter by Expression has a separate deselect port.
18. Explain the difference between Aggregator and rollup?
Both components are used for summarization, but Aggregate doesn't have the built-in functions. In Rollup we have built-in functions like SUM(), AVG(), COUNT(), MIN(), MAX(), FIRST(), LAST(), and PRODUCT().
19. Explain the difference between rollup and scan?
The Rollup component gives total control over summarization; the Scan component produces only intermediate (cumulative) summary records.

20. What are the aggregator functions in rollup?

temporary_type (declares the temporary variable)
initialize (initializes the needed values)
rollup (does the summarization)
finalize (assigns the final value)

21. What are the different types of sort components?

Sort
Sort with groups

Checkpointed sort

Partition by key and sort

23. What is a multifile and how can we create one through the command line?
Ab Initio multifiles are a partition of a large serial file into a tree structure, run in parallel.
We can create a multifile from the command line using the m_mkfs command followed by the URL of that particular file.
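For instance, a hedged sketch of creating a 2-way multifile system from the command line (host names and paths are hypothetical):

m_mkfs //host1/u/data/mfs2 //host1/vol1/mfs2_p0 //host2/vol1/mfs2_p1

The first URL names the multifile system's control directory; the remaining URLs name its partition directories.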
24. What is the difference between phase and check point?

Phases are used to break up a graph into blocks for performance tuning.
Check point is used for recovery

25. Explain about different types of parallelisms supported by Ab Initio?


Ab Initio supports three types of parallelisms:

Component parallelism
Pipeline parallelism

Data parallelism

Component parallelism:
Component parallelism occurs when program components execute simultaneously on different
branches of a graph.

Pipeline parallelism:
Pipeline parallelism occurs when a connected sequence of program components on the same
branch of a graph execute simultaneously.

Data parallelism:
Data parallelism occurs when you separate data into multiple divisions, allowing multiple
copies of program components to operate on the data in all the divisions simultaneously.

26. Explain the flow types in Ab Initio.

Straight flow
Fan-in flow
Fan-out flow
All-to-all flow

Straight flow: This flow connects the two components with the same depth of parallelism

Fan-in flow: A fan-in flow connects a component with a greater depth of parallelism to one with a lesser depth; in other words, it follows a many-to-one pattern.

Fan-out flow:
A fan-out flow connects a component with a lesser number of partitions to one with a greater number of partitions; in other words, it follows a one-to-many pattern.

All to All flow:

An all-to-all flow is used:

To connect components with different numbers of partitions, when the result of dividing the greater number of partitions by the lesser number is not an integer
For repartitioning of data using components with the same or different numbers of partitions (see Repartitioning)

28. Have you worked on conditional components?


You can make any component or sub graph conditional by specifying a conditional expression
that the GDE evaluates at runtime to determine whether or not the component runs.
If the conditional expression evaluates to true, the GDE runs the subgraph or component. If the
conditional expression evaluates to false, the GDE either disables the component and any flows
connected to its ports, or replaces it with a flow, depending on your choice on the Properties
dialog: Condition tab.
29. What is a subgraph?
A subgraph is a graph fragment. Just like graphs, subgraphs contain components and flows. A
subgraph groups together components that perform a subtask in a graph. The subgraph creates
a reusable component that performs the subtask.
30. What sort of functions have you worked with?

Enquiry and error functions


String functions

Lookup functions

Date functions

31. Which enquiry and error functions have you used?

is_defined (tests whether the expression is not NULL)
Syntax: is_defined(expr)

is_null (tests whether the expression is NULL)
Syntax: is_null(expr)

is_error (tests whether an error will occur while evaluating the expression)
Syntax: is_error(expr)

is_valid (tests whether the expression is valid or not)
Syntax: is_valid(expr)

force_error (causes an error and returns a message)
Syntax: force_error(string msg)

32. What string functions have you worked with?

Decimal_lpad:
Decimal_lrpad

String_compare

String_substring

String_concat

String_Index

String_length

String_lpad

String_lrpad

(Note: please go through the help document for the description)


33. How can we generate a sequence of numbers in Ab Initio?
There is a function called next_in_sequence to generate the numbers 1 to n.

Syntax: int next_in_sequence()
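For example (a sketch; the output field name is hypothetical):

out.seq_no :: next_in_sequence(); /* yields 1, 2, 3, ... within each partition */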


34. How can we get the log information in AbInitio?
Using the write_to_log function, we can write to the log port of a component.
Syntax: write_to_log(string event_type, string event_text)
35. What is the use of Lookup file component?
Lookup File represents one or more serial files or a multifile. The amount of data is small
enough to be held in main memory. This allows a transform function to retrieve records much
more quickly than it could retrieve them if they were stored on disk.
Lookup File associates key values with corresponding data values to index records and retrieve
them.
Parameters for Lookup:

Key
Record format

How to Use Lookup File


Unlike other dataset components, Lookup File is not connected to other components in graphs.
In other words, it has no ports. However, its contents are accessible from other components in
the same or later phases.
You use the Lookup File in other components by calling one of the following DML functions in
any transform function or expression parameter: lookup, lookup_count, or lookup_next.
The first argument to these lookup functions is the name of the Lookup File. The remaining
arguments are values to be matched against the fields named by the key parameter. The
lookup functions return a record that matches the key values and has the format given by the
RecordFormat parameter. For details, see the Data Manipulation Language Reference.
A file you want to use as a Lookup File must fit into memory. If a file is too large to fit into
memory, use Input File followed by Match Sorted or Join instead.
Information about Lookup Files is stored in a catalog, which allows you to share them with
other graphs.
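As an illustration, a hedged sketch of such a call inside a transform (the Lookup File name "StateCodes" and the field names are hypothetical):

out.state_name :: lookup("StateCodes", in.state_cd).state_name;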
36. Have you worked with lookup functions?
I have worked with the following functions:
Lookup
Lookup_count
Lookup_Local
Lookup_next
(Note please go through help document for the description)

37. How do you convert an output file or intermediate file to a lookup file?

By selecting the Add to Catalog check box.


39. Explain the performance tuning in your current project?
There are many ways the performance of the graph can be improved.
1) Use a limited number of components in a particular phase
2) Use optimum value of max core values for sort and join components
3) Minimize the number of sort components
4) Minimize sorted join component and if possible replace them by in-memory join/hash join
5) Use only required fields in the sort, reformat, join components
6) Use phasing/flow buffers in case of merge, sorted joins
7) If the two inputs are huge then use sorted join, otherwise use hash join with proper driving
port
8) For large dataset don't use broadcast as partitioner
9) Minimise the use of regular expression functions like re_index in transform functions
10) Avoid repartitioning data unnecessarily
40. What is a DB config file and how do you create it?
The DB config file has the information Ab Initio requires to connect to the database.
Creation: in the Input or Output Table component, select DBConfig file > New, then give the DB name, DB node, database version, user id, and password, and click Create.
41. How do you migrate your project from one environment to another?
We have two options:
Check-In
Check-Out
(Note :Please go through the help document for more Information)

42. How can you do version control in Ab Initio?

Once check-in has been done, the graph is automatically updated to a new version.
Whenever you check out a graph, you give the tag information on the Tag tab (it represents the version).
If you want to view all versions, give the following command on the command line: AIR_OBJECT_VERSION_VERBOSE.
43. How can we debug an Ab Initio graph?
Ans: Using file watchers we can debug the graph. A watcher adds an intermediate file on the flow, so you can view the data that passes through the flow when you run the graph.
Two types of watchers are there:

Non-phased
Phased

Non-phased: without a phase break.

Phased: with a phase break.


44. How do we add a watcher to the flow?
Add watchers on flows by doing the following:
1. Turn on debugging mode if it is not on.
2. Select the flows on which you want to place watchers.
3. Do one of the following:
o On the menu bar of the GDE, choose Debugger > Add Watcher to Flow.
o On the GDE Debugging toolbar, click the Add Watcher to Flow button.
o Right-click the flow and choose Add Watcher from the shortcut menu.

Watchers appear on the selected flows.


The actions in step 3 will remove watchers if there are watchers on all selected flows.
When you run the graph the watchers turn blue, and you can view the data that has passed
through the flows.
45. How do you run a graph from the command line?
Ans: We can deploy the graph as a .ksh file and use that file to run the graph from the command line.
46. What is a sandbox?
A sandbox is a collection of graphs and related files that are stored in a single directory tree,
and treated as a group for purposes of version control, navigation, and migration.
A sandbox contains the following subdirectories:

DML (Holds the Record format Information)


XFR (Holds the Transformation logic files)

DB (Holds the database connection information)

MP (Holds the graphs)

RUN (Holds the ksh files)

47. What happens when you create a sandbox in Ab Initio?

When you create a sandbox, the tree structure (dml, xfr, db, mp, and run folders), parameters, and environment variables are created automatically. Along with these, the ABPROJECTSETUP.KSH file is created in the sandbox.
48. What sort of error messages have you encountered in your project?

Bad value found error


Null value assignment

Depth is not equal

Too many files open or max core error

49. When can we get the "depth is not equal" message?

When the depth of parallelism (the number of partitions of a layout) is mismatched between upstream and downstream components.
50. When do we get the "too many files open" error?
When the max-core value is too low while executing a component, this error can occur; set an appropriate max-core value for that component.
51. How does job recovery work in Ab Initio?
Job recovery can be done in the following ways:

If you set checkpoint phases, a .rec file is created automatically. If a failure occurs, rerunning the graph automatically recovers the data up to the last checkpoint.
If you want to run from the beginning, you need to perform a manual rollback from the command line; the command is m_rollback.
53. What is a local variable?


A local variable is a variable declared within a transform function. You can use local variables
to simplify the structure of rules or to hold values used by multiple rules.
Declaration:
Here is the syntax of a local variable declaration:
let type_specifier variable_name [not NULL] [ = expression ] ;
NOTE: The declaration of a local variable must occur before the statements and rules in a
transform function.
let: Keyword for declaring a variable.
type_specifier: The type of the variable.
variable_name: The name of the variable.
[not NULL]: Optional. Keywords indicating that the variable cannot take on the value of null. These must appear after the variable name and before the initial value. NOTE: If you create a local variable without the not NULL keywords, and do not assign an initial value, the local variable initially takes on the value of null.
[= expression]: Optional. An expression that provides an initial value for the variable.
; : A semicolon must end a variable declaration.

For example, the following local variable definitions define two variables, x and y. The value
for x depends on the value of the amount field of the variable in, and the value of y depends
on the value of x:
let int x = in.amount + 5;
let double y = 100.0 / x;
54. What is a global variable?
Within a package, you can create a global variable and use it in all the transform functions present in that package, but you must declare the global variable outside the transform functions.
Declaration:
let type_specifier variable_name [not NULL] [ = expression ] ;
let: Keyword for declaring a variable.
type_specifier: The type of the variable.
variable_name: The name of the variable.
[not NULL]: Optional. Keywords indicating that the variable cannot take on the value of null. These must appear after the variable name and before the initial value. NOTE: If you create a global variable without the not NULL keywords, and do not assign an initial value, the global variable initially takes on the value of null.
[= expression]: Optional. An expression that provides an initial value for the variable.
A semicolon must end a variable declaration.
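A minimal sketch of declaring and using a global variable in a package (the names rec_count, number_record, and seq are illustrative):

let decimal(8) rec_count = 0; /* global: declared outside any transform function */

out :: number_record(in) =
begin
  rec_count = rec_count + 1; /* statement updating the global */
  out.seq :: rec_count;
end;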

55. Have you ever used any m_ commands?

Yes, I have used commands like m_rollback, m_cleanup, and m_dump.
56. What is the difference between m_rollback and m_cleanup?
m_rollback rolls back a partially completed graph to its beginning state. m_cleanup cleans up
files left over from unsuccessfully executed graphs and manually recovered graphs.
57. How to use m_cleanup?
To find temporary files and directories before cleaning them up, you use the m_cleanup
command. You can run this utility with or without arguments:

m_cleanup prints usage for the command.


m_cleanup -help prints usage for the command.
m_cleanup -j job_log_file [job_log_file... ] lists the temporary files and directories
listed in the log file specified by job_log_file. To specify multiple files, separate each
filename with a space.

Log files have either a .hlg or .nlg suffix. A log file ending in .hlg is on the control, or host,
machine of a graph. A log file ending in .nlg is on a processing machine of a graph.
The job_log_file can be an absolute or relative pathname. Paths have the following syntax:
o On the control machine AB_WORK_DIR/host/job_id/job_id.hlg
o On a processing machine AB_WORK_DIR/vnode/job_id-XXX/job_id.nlg, where
the XXX on a processing machine path is an internal ID assigned to each
machine by the Co>Operating System.
58. How can I generate DML for a database table from the command line?
Using the m_db command-line utility we can generate the DML.
Syntax is
m_db gendml dbc_file [options] -table tablename
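For example (the .dbc file name and table name are hypothetical):

m_db gendml mydb.dbc -table customers > customers.dml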
59. Can we do check-in and check-out through the command line?
Yes, we can do check-in and check-out using air commands such as AIR_OBJECT_IMPORT and AIR_OBJECT_EXPORT.
60. What sort of issues have you solved in production support?

Data quality issues


Max core issues.

1) What is EME & EME DataStore?


Ans) EME is short for Enterprise Meta>Environment. The EME is a high-performance, object-oriented storage system that manages Ab Initio applications (including data formats and business rules) and related information. It provides an integrated and consolidated view of your business. It is used for the purposes of VERSION CONTROL, NAVIGATION & MIGRATION.
An EME datastore is a specific instance of the EME: the term denotes the specific EME storage
that you are currently connected to through the GDE, there can be many such datastore
instances resident in an environment in which the EME has been installed. But you can only be
connected to one datastore at a time: this is determined by your GDE's current EME datastore
settings.
2) What is Sandbox?
Ans) A sandbox is a collection of graphs and related files that are stored in a single directory
tree, and treated as a group for purposes of version control, navigation, and migration. A
sandbox can be a file system copy of a datastore project.
3) What is Co-Operating System?

Ans) The Co>Operating System is core software that unites a network of computing resources (CPUs, storage disks, programs, datasets) into a production-quality data processing system.
The Co>Operating System is layered on top of the native operating systems of a collection of
computers. It provides a distributed model for process execution, file management, process
monitoring, checkpointing, and debugging.
The Graphical Development Environment (GDE) provides a graphical user interface into the
services of the Co>Operating System.
4) What are the differences between the various GDE connection methods?
Ans) There are a number of communication methods used to communicate between the GDE
and the Co>Operating System, including:

Ab Initio Server/REXEC:
Ab Initio Server/TELNET:
DCOM:
REXEC:
RSH:
TELNET:
SSH(/Ab Initio)

When using the GDE to connect to the Co>Operating system, the normal process for a
connection differs depending upon which communication method is selected. In broad
terms, two things tend to happen: files are transferred from the GDE to the target host (or
from the host to the GDE), and processes are started/executed on the host.
When using telnet, rexec and rsh, the basic steps are as follows.
A. The GDE transfers the execution script to the server via FTP.
B. The GDE connects to the server by means of the selected method.
C. The GDE executes that script on the server by means of the connection set up in
step B.
The process is different for connection methods that use the Ab Initio Server, however. These methods include Ab Initio Server/Telnet and Ab Initio Server/Rexec, as well as SSH and DCOM. The use of the Ab Initio Control Server replaces the need for FTP and adds enhanced server-side services. When the Ab Initio Control Server is involved, the basic steps are as follows:

The GDE connects to the server by means of the selected method.
This connection initiates startup of the Ab Initio Control Server.
The GDE initiates a connection to the Control Server.
All file transfer occurs across the same Control Server connection.
Script execution is accomplished through a new connection using the selected connection method.

5) What is Meta data?


Ans) Metadata is data about the data; it gives a description of the data. Metadata is associated with graphs: it includes the information needed to build a graph, such as record formats, key specifiers, and transform functions.
6) What is the configuration file in Ab-initio?
Ans) The Co>Operating System accepts either of two names for the per-user Ab Initio
configuration file. In addition to .abinitiorc, the Co>Operating System now also accepts
abinitio.abrc in order to conform to Windows file name conventions. Other supported
platforms also recognize the new name. Only one configuration file is permitted, however.
Using both .abinitiorc and abinitio.abrc results in an error.
7) What are different file extensions in Ab-initio?
Ans)
.cfg  Database table configuration files for use with 2.1 database components
.dat  Data files (either serial files or multifiles)
.dbc  Database configuration files
.dml  Data Manipulation Language files or record format definitions
.mdc  Dataset or custom dataset components
.mp   Stored Ab Initio graphs or graph components
.mpc  Program components or custom components
.xfr  Transform function definitions or packages
.aih  Host settings
.aip  Project settings

8) What does GDE do automatically?


Ans) The GDE provides default settings and behaviors for several features:
Buffering and deadlock avoidance
Record format propagation
Flow and layout defaults
9) What kind of flat file formats supports by Ab Initio Graphical Design Interface (GDE)?
Ans) The Ab Initio Graphical Design Interface (GDE) supports these flat file formats: All file
types use the .dat extension.

Serial Files
Multifiles

Ad-hoc Multifile

Serial Files:
A serial file is a flat, non-parallel file, also known as one-way parallel. You create serial files using a Universal Resource Locator (URL) on the component's Description tab. The URL starts with file:
Multifiles:
A multifile is a parallel file consisting of individual files called partitions, often stored on different disks or computers. A multifile has a control file containing URLs that point to one or more data files. You can divide data across partition files using these methods: random or round-robin partitioning, partitioning based on ranges or functions, and replication or broadcast, in which each partition is an identical copy of the serial data. You create multifiles using a URL on the component's Description tab.
Ad-hoc Multifile:
An ad-hoc multifile is also a parallel file. Unlike a multifile, however, the content of an ad-hoc multifile is not stored in multiple directories; in a custom layout, the partitions are serial files. You create an ad-hoc multifile using partitions on the component's Description tab.
10) What is dbc file contains?
Ans) File with a .dbc extension which provides the GDE with the information it needs to
connect to a database. A configuration file contains the following information:

The name and version number of the database to which you want to connect
The name of the computer on which the database instance or server to which you want
to connect runs, or on which the database remote access software is installed

The name of the database instance, server, or provider to which you want to connect

11) What are the default parameters in sandbox?


Ans) The default sandbox parameters in a GDE-created sandbox are these six:

PROJECT_DIR: absolute path to the sandbox directory
DML: relative sandbox path to the dml subdirectory
XFR: relative sandbox path to the xfr subdirectory
RUN: relative sandbox path to the run subdirectory
DB: relative sandbox path to the db subdirectory
MP: relative sandbox path to the mp subdirectory

These six parameters are automatically created (and assigned their correct value) whenever
you create a sandbox.
12) What is the difference b/w sandbox parameters & graph parameters?
Ans) The difference between sandbox parameters and graph parameters is:

Graph parameters are visible only to the particular graph to which they belong
Sandbox parameters are visible to all the graphs stored in a particular sandbox

13) What is standalone Sandbox?


Ans) A sandbox that is not associated with a project is simply a special directory.
14) What is the difference b/w EME & Sandbox?
Ans) The big difference between the contents of a sandbox and its corresponding project in the
EME is that the project contains, for each file, each and every version that has ever been
checked in by anybody. The sandbox, on the other hand, contains only the latest version of
each file checked out into that sandbox.
A sandbox can be associated with only one project. However, there is no limit (other than the
physical one of disk space) to the number of sandboxes that a user can have. Although a given
sandbox can be associated with only one project, a given project can have any number of
sandboxes.
15) What are formal graph parameters?
Ans) A formal graph parameter is a parameter you substitute for a path and/or filename when
you create a graph. This allows you to specify the value of that parameter at runtime.
16) What is the order of evaluation of parameters?
Ans) When you run a graph, parameters are evaluated in the following order:

The host setup script is run.


Common (that is, included) sandbox parameters are evaluated.

Sandbox parameters are evaluated.

The project-start.ksh script is run.

Formal parameters are evaluated.

Graph parameters are evaluated.

The graph Start Script is run.

17) What is Transform function?


Ans) A transform function (or transform) is the logic that drives data transformation; most commonly, transform functions express record-reformatting logic. In general, however, you can also use transform functions in data cleansing, record merging, and record aggregation.
To be more specific, a transform function is a collection of business rules, local variables, and
statements. The transform expresses the connections between the rules, variables, and
statements, as well as the connections between these elements and the input and output
fields.

Transform functions are always associated with transform components; these are components
that have a transform parameter: Aggregate, Denormalize Sorted, Fuse, Join, Match Sorted,
MultiReformat, Normalize, Reformat, Rollup, and Scan components.
18) What is Prioritizing rule?
Ans) You can control the order of evaluation of rules in a transform function by assigning priority numbers to the rules. The rules are attempted in order of priority, starting with the assignment with the lowest-numbered priority, proceeding to assignments with higher-numbered priorities, and finally to any assignment for which no priority has been given.
19) What are local variables?
Ans) A local variable is a named storage location in an expression or transform function. You
declare a local variable within the transform function in which you want to use it. The local
variable is reinitialized each time the transform function is called, and it persists for one single
evaluation of the transform function.
20) What Is a Package?
Ans) A package is a named collection of related DML objects. A package can hold types,
transform functions, and variables, as well as other packages. Packages provide a means of
locating in one place DML objects that are needed more than once in a given graph, or needed
by multiple developers. Packages allow developers to avoid redundant code; this makes
maintenance of DML objects more efficient.
Packages are very useful in these types of situations:

The record formats of multiple ports use common record formats and/or type specifiers
Multiple components use common transforms

21) Explain Multi-Stage transform Components?


Ans) The multi-stage transform components require packages because, unlike other transform
components, they are driven by more than single transform functions. These components each
take a package as a parameter and, in order to process data, look for particular variables,
functions, and types in that package. For example, a multi-stage component might look for a
type named temporary_type, a transform function named finalize, or a variable named
count_items.
22) What is a Phase?
Ans) A phase is a stage of a graph that runs to completion before the start of the next stage. By
dividing a graph into phases, you can save resources, avoid deadlock, and safeguard against
failures. To protect a graph, all phases are checkpoints by default.
23) What is a Checkpoint?
Ans) A checkpoint is a phase that acts as an intermediate stopping point in a graph and saves
status information to allow you to recover from failures. By assigning phases with checkpoints
to a graph, you can recover completed stages of the graph if failure occurs.

24) How will use the subgraph of graph A in the Graph B?


Ans) When you build a subgraph, it becomes a part of the graph in which you build it. If you
want to use it in other graphs, or in other places in the original graph, save it in the
Component Organizer of the GDE.
25) Is there a way to make my graph conditional, so that certain components may not run?
Ans) You can enter a Condition statement on the Condition tab of graph components. This is an
expression that evaluates to the string value for true or false (see details below). The GDE then
evaluates the expression at runtime. If the expression evaluates to true, the component or
subgraph is executed. If it is false, then the component or subgraph is not executed, and is
either removed completely or replaced with a flow between two user-designated ports.
The correct syntax for if statements in the Korn shell is as follows:
$(if [[ condition ]]; then then_statement; else else_statement; fi)
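For example, a minimal sketch (the parameter name RUN_OPTIONAL is hypothetical):
$(if [[ $RUN_OPTIONAL == "Y" ]]; then echo "true"; else echo "false"; fi)
Here the component runs only when the sandbox or graph parameter RUN_OPTIONAL has been
set to Y, since the expression then evaluates to the string "true".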
26) How can you improve the GDE's performance when it's running slow?
Ans) If the GDE is performing slowly, you can improve performance with one or more of these
methods:

Turn off Undo by choosing File > Autosave/Undo on the GDE menu bar and clearing the
selection of Undo/Redo Enabled.
Turn off Propagation by choosing Edit > Propagation on the GDE menu bar and clearing
the selection of Record Format and Layout.
Increase the Tracking Interval by choosing Run > Default Settings on the GDE menu bar,
clicking the Code Generation tab, and increasing the Tracking Interval to 60 seconds.

27) What is lookup file?


Ans) Lookup File represents one or more serial files or a multifile. The amount of data is small
enough to be held in main memory. This allows a transform function to retrieve records much
more quickly than it could retrieve them if they were stored on disk. Lookup File associates key
values with corresponding data values to index records and retrieve them.
28) What is Two-stage routing?
Ans) When an all-to-all flow connects components with layouts containing a large number of
partitions, the Co>Operating System uses many networking resources. If the number of
partitions in the source and destination components is N, an all-to-all flow uses resources
proportional to N*N (N squared).
To save network resources, you can mark an all-to-all flow as using two-stage routing. With
two-stage routing, the all-to-all flow uses only about 2*N*sqrt(N) resources.
For example, an all-to-all flow with 25 partitions uses 25*25 = 625 resources, but with two-stage
routing uses only 2*25*5 = 250 resources.
29) What kinds of parallelism does Ab Initio support?

Ans) There are three types of parallelism employed by the Co>Operating System:

Component Parallelism
Pipeline Parallelism

Data Parallelism

30) What is Component Parallelism?


Ans) Component parallelism occurs when program components execute simultaneously on
different branches of a graph.
Component parallelism scales with the number of branches of a graph: the more branches a
graph has, the greater the component parallelism. If a graph has only one branch, component
parallelism cannot occur.
31) What is Pipeline Parallelism?
Ans) Pipeline parallelism occurs when a connected sequence of program components on the
same branch of a graph execute simultaneously.
32) What is Data Parallelism?
Ans) Data parallelism occurs when you separate data into multiple divisions, allowing multiple
copies of program components to operate on the data in all the divisions simultaneously.
33) What are Multifiles and Multifile Systems & Multi directories?
Ans) Ab Initio multifiles are parallel files composed of individual files, typically located on
different disks and usually, but not necessarily, on different systems. These individual files are
the partitions of the multifile.
Ab Initio multifiles reside in parallel directories called multidirectories, which are organized
into multifile systems. An Ab Initio multifile system consists of multiple replications of a
directory tree structure containing multidirectories and multifiles. Each replication constitutes
a partition of the multifile system.
Each partition holds a subset of the data contained in the multifile system, and the system has
one additional partition that contains control information. The partitions containing data are
the data partitions of the system, and the additional partition is the control partition. The
control partition contains no user data, only the information the Co>Operating System needs to
manage the multifile system.
34) How to create multifile system?
Ans) To create a multifile system, issue the m_mkfs command, using as arguments the URLs of
the partitions of the multifile system you want to create. The first URL creates the control
partition, and each subsequent URL creates the next partition of the multifile system.
Similarly use m_mkdir for multi directories.
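For example, a sketch that creates a 4-way multifile system (all of the paths are hypothetical):
m_mkfs /u/mfs/mfs4way /disk1/mfs4way /disk2/mfs4way /disk3/mfs4way /disk4/mfs4way
The first path becomes the control partition and the remaining four become the data partitions.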
35) What is Layout?

Ans) A layout is one of the following:

A URL that specifies the location of a serial file


A URL that specifies the location of the control partition of a multifile

A list of URLs that specifies the locations of:

The partitions of an ad hoc multifile

The working directories of a program component

36) What Is Dependency Analysis?


Ans) Using the EME, you can conduct project analyses of the dependencies within and between
graphs. The EME examines the project and develops an analytical survey of it in its entirety,
tracing how data is transformed and transferred, field by field, from component to component.
37) What are the different kinds of Analysis?
Ans) In the Checkin Wizard you choose an Analysis Level, what to analyze, and an analysis
scope.
Analysis Level
The choices, and the Checkin Wizard action for each:

None - Turns off all translation and dependency analysis during checkin.

Translation Only - Translates graphs from GDE format to datastore format, but does not do
error checking and does not store results in the datastore.
Tip: We recommend that at minimum you do translation only, since it is required for analysis,
which you can run anytime.

Translation with Checking - Translates graphs from GDE to datastore format and checks for
errors that will interfere with dependency analysis. See Checked-for Errors.

Full Dependency Analysis (Default) - Performs full dependency analysis on the graph and saves
the results in the datastore.
Tip: We recommend that you do not do analysis now, as it can greatly prolong checkin.

What to Analyze
The What to Analyze group of checkboxes allows you to specify which files will be subjected to
the level of analysis you specified in Analysis Level. The four choices, and what the Checkin
Wizard analyzes for each:

All Files - All files in the project.

All Unanalyzed Files - All files in the project that have changed, or those that are dependent
on or are required by files that have changed since the last time they were analyzed,
regardless of whether or not the files were checked in by you.

Only My Checked-In Files - Only the files checked in by you. This group can include files you
checked in earlier which are still on the analysis queue and have not yet been analyzed.

Only the File Specified (Default) - Only the specified file(s).

Analysis Scope
The Analysis Scope group of checkboxes allows you to specify how far the specified level of
analysis will be extended to files dependent on those being analyzed, both in the current
project and in other projects. The three choices:

Dependent Files from All Projects (Default) - Files in other projects common to (included in)
the one you are checking in, if they are dependent on the files being analyzed.

Dependent Files from Specified Project - Only the dependent files that are in the same project
as the file(s) being analyzed.

No Dependent Files - No dependent files.

38) What is switch parameter?


Ans) A switch parameter has a fixed set of string values which you define when you create the
parameter. The purpose of a switch parameter is to allow you to change your sandbox's
context: its value determines the values of various other parameters that you make dependent
on that switch. For each switch value, each of the dependent parameters has a dependent
value. Changing the switch's value thus changes the values of all its dependent parameters.
39) What are the types of project parameters?
There are four types of project parameters:

Standard Parameters
Switch Parameters

Dependent Parameters

Common Project Parameters

40) What is max-core parameter?

Ans) The value for the max-core parameter determines the maximum amount of memory, in
bytes, that the component can use. If the component is running in parallel, the value of
max-core represents the maximum memory usage per partition, not the sum for all partitions.
If you set the max-core value too low, the component runs more slowly than expected. If you
set the max-core value too high, the component might use too many machine resources, slow
the process drastically, and cause hard-to-diagnose failures.
41) What is ordered flow?
Ans) The Ordered attribute is a port attribute. It determines whether the order in which you
attach flows to a port, from top to bottom, is significant to the definition and purpose of the
component. If a port is ordered, the order in which flows are attached determines the result of
the processing the component does: if you change the order in which you attach the flows, you
create a different result.
Note: GDE indicates the difference between a port that is ordered and one that is not by
drawing them differently. If you inspect the ordered port of Concatenate in the graph, you see
a line dividing the port between the two flows; that line is not present in the port of Gather,
which is not ordered.
42) What will be the record order in the flows?
Ans) Components maintain the ordering of the input data records unless their explicit purpose
is to reorder records. For most components, if record x appears before record y in an input
flow partition, and if record x and record y are both in the same output flow partition, then
record x appears before record y in that output flow partition.
For example, if you supply sorted input to a Partition component, it produces sorted output
partitions.
Exceptions are:

The components that explicitly reorder records, such as Sort, Sort within Groups, and
Partition by Key and Sort.
The components that have fan-in flows, such as the Departition components. They each
define their own record order.

43) What is the logging parameter?


Ans) The transform components and some other components have a logging parameter. This
parameter specifies whether or not you want the component to generate log records for
certain events. The value of the logging parameter is True or False. The default is False.
If you set the logging parameter to True, you must also connect the component's log port to a
component that collects the log records.
44) Explain about multistage transform components?
Ans) A multistage transform is a Transform Component that modifies records in up to five
stages: input selection, temporary initialization, processing, finalization, and output selection.

Each stage is written as a DML transform function. The multistage transform components are:
Denormalize, Normalize, Rollup & Scan
45) Explain about compress components?
Ans) There are a number of components that compress and uncompress data.

Deflate (compress) and Inflate (Uncompress) work on all platforms.


Compress and Uncompress work on only UNIX platforms.
GZip(compress) is deprecated. It will be removed in a future release. GUnzip
(Uncompress) uncompresses data and works on all platforms.

Components:
46) Difference b/w Replicate & Broadcast?
Ans) Broadcast arbitrarily combines all the data records it receives into a single flow and writes
a copy of that flow to each of its output flow partitions.
Replicate arbitrarily combines all the data records it receives into a single flow and writes a
copy of that flow to each of its output flows.
Use Replicate to support component parallelism, for example when you want to perform
more than one operation on a flow of data records coming from an active component.
Use Broadcast to increase data parallelism when you have connected a single fan-out flow to
the out port or to increase component parallelism when you have connected multiple straight
flows to the out port.
47) Explain about FUSE?
Ans) Fuse applies a transform function to corresponding records of each input flow. The first
time the transform function executes, it uses the first record of each flow. The second time the
transform function executes, it uses the second record of each flow, and so on. Fuse sends the
result of the transform function to the out port.
The component works as follows. The component tries to read from each of its input flows.

If all of its input flows are finished, Fuse exits.


Otherwise, Fuse reads one record from each still-unfinished input port and a NULL from
each finished input port.

48) Explain about JOIN?


Ans) Join reads data from two or more input ports, combines records with matching keys
according to the transform you specify, and sends the transformed records to the output port.
Additional ports allow you to collect rejected and unused records. There can be as many as 20
input ports.
Types of join:

Inner join sets the record-required parameters for all ports to True. Inner join is the
default. The GDE does not display the record-required parameters because they all
have the same value.
Outer join sets the record-required parameters for all ports to False. The GDE does
not display the record-required parameters because they all have the same value.
Explicit allows you to set the record-required parameter for each port individually.
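For illustration, a minimal DML join transform (the field names cust_id, name, and amount are
hypothetical; in0 and in1 follow the component's standard input port naming):
out :: join(in0, in1) =
begin
  out.cust_id :: in0.cust_id;
  out.name :: in0.name;
  out.amount :: in1.amount;
end;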

49) What is the use of the override-key parameter & where is it used?
Ans) The override-key parameter is used in the Join component to specify alternative name(s)
for the key field(s) of a particular in port.
50) What are the different options available in reject threshold?
Ans) There are 3 options available, they are

Never abort
Abort on first reject
Use limit/ramp

51) Explain about limit & ramp?


Ans) Limit is a number representing the acceptable total of reject events. Default is 0.

Ramp is a decimal representing the Rate of reject events in the number of records
processed.
The component stops the execution of the graph when the number of reject events
exceeds the result of the following formula:
limit + (ramp * number_of_records_processed_so_far)
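For example (illustrative numbers): with limit = 10 and ramp = 0.01, after 1,000 records have
been processed the graph tolerates 10 + (0.01 * 1000) = 20 reject events; one more reject at
that point aborts the graph.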

52) Explain Join with DB component?


Ans) Join with DB joins records from the flow or flows connected to its input port with records
read directly from a database (via an SQL statement), and outputs new records containing data
based on, or calculated from, the joined records.
execute_on_miss parameter:
A statement executed when no rows are returned by the select_sql statement. The statement
should be an INSERT (or possibly an UPDATE); after it is executed, the select_sql is executed a
second time. If no results are generated on the second attempt, the input record is rejected. A
database commit is by default performed after each execution of execute_on_miss, but this
can be altered by setting the commit number parameter.
53) What is the difference b/w JOIN & JOIN with DB?
Ans) The main difference is that in the Join with DB component we join the incoming feed file
with the TABLE by writing an SQL statement, whereas in a normal Join we don't have SQL
statements.

Instead of using a statement in SQL, you can now extract the to-be-joined data from the
database by calling a stored procedure, specified in the sql_select parameter. The syntax for
calling a stored procedure using Oracle or DB2 is as follows:

{call | exec | execute} [:a = ] [schema.][package.]stored_procedure(:b, :c, ...)


where: :a, :b, and :c are input/output arguments

54) What is the use of META-PIVOT component?


Ans) The Meta Pivot component allows you to split records by data fields (columns). The
component converts each input record into a series of separate output records. There is one
separate output record for each field of data in the original input record. Each output record
contains the name and value of a single data field from the original input record.
55) Explain REFORMAT?
Ans) Reformat changes the record format of data records by dropping fields, or by using DML
expressions to add fields, combine fields, or transform the data in the records.
output-index:

Either the name of a file containing a transform function, or a transform string. The
Reformat component calls the specified transform function for each input record. The
transform function uses the value of the input record to direct that input record to a
particular output port. The expected output of the transform function is the index of
an output port (zero-based). The Reformat component directs the input record to the
identified output port and executes the transform function, if any, associated with that
port.
When you specify a value for output-index, each input record goes to exactly one
transform/output port pair. For example, suppose there are 100 input records and two
output ports. Each output port receives between 0 and 100 records. According to the
transform function you specify for output-index, the split can be 50/50, 60/40, 0/100,
99/1, or any other combination that adds up to 100.
If you do not specify a value for output-index, Reformat sends every input record to
every transform/output port pair. For example, if Reformat has two output ports and
there are no rejects, 100 input records results in 100 output records on each port for a
total of 200 output records.
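As an illustration, a minimal output-index transform (the field amount is hypothetical, and the
function header follows the usual convention for this parameter); it returns the zero-based
index of the output port each record should go to:
out :: output_index(in) =
begin
  out :: if (in.amount > 1000) 0 else 1;
end;
Records with amount over 1000 go to out port 0; everything else goes to out port 1.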

56) What is the difference between the Reformat and Redefine Format components?
Ans) The difference between Reformat and Redefine Format is that Reformat can actually
change the bytes in the data while Redefine Format simply changes the record format on the
data as it flows through, leaving the data unchanged.
The Reformat component can change the record format of data records by dropping fields, or
by using DML expressions to add fields, combine fields, or transform the data in the records.
The Redefine Format component copies data records from its input to its output without
changing the values in the data records. You use Redefine Format to change or rename fields in
a record format without changing the values in the records. In this way, it is similar to the DML

built-in function, reinterpret_as. Typically this component has different DML on its input and
output ports, and allows the unmodified data to be interpreted in a different form.
57) Explain Multi Reformat?
Ans) Multi Reformat changes the record format of data records flowing between one to 20
pairs of in and out ports by dropping fields, or by using DML expressions to add fields, combine
fields, or transform the data in the records. A typical use for Multi Reformat is to put it
immediately before a custom component that takes multiple inputs.
The component operates separately on the data flowing between each pair of its inn-outn
ports. The count parameter specifies the total number of port pairs. Each inn-outn port pair
has its own associated transformn to reformat the data flowing between those ports.
58) What is ABLOCAL() and how can I use it to resolve failures when unloading in parallel?
Ans) Some complex SQL statements contain grammar that is not recognized by the Ab Initio
parser when unloading in parallel. You can use the ABLOCAL() construct in this case to prevent
the Input Table component from parsing the SQL (it will get passed through to the database). It
also specifies which table to use for the parallel clause.
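A commonly cited usage pattern (the table and column names are hypothetical):
select cust_id, amount from transactions where ABLOCAL(transactions)
Roughly speaking, the ABLOCAL(transactions) construct is replaced with the appropriate
per-partition clause, and the rest of the statement is passed through to the database unparsed.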
59) What is the difference b/w Update Table & Multi Update Table?
Ans) The main difference is that commit number & commit table are mandatory parameters in
Multi Update Table, whereas in Update Table they are optional.
Update Table modifies only a single table in the database, whereas Multi Update Table can
modify more than one table, which is why the commit table & commit number are required in
Multi Update Table.
API Mode Execution (same in both components):
The statements are applied to the incoming records as follows. For each record:

The statement referenced by updateSqlFile is attempted first. If the statement can be
successfully applied to the current record, it is executed, and the statement referenced by
insertSqlFile is skipped.

If the updateSqlFile statement cannot be applied to the current record, the statement
referenced by insertSqlFile is attempted.

60) Difference b/w NORMALIZE & DE NORMALIZE?


Ans) Normalize generates multiple output data records from each of its input records. You can
directly specify the number of output records for each input record, or the number of output
records can depend on some calculation.

Denormalize consolidates groups of related data records into a single output record
with a vector field for each group, and optionally computes summary fields in the
output record for each group.

Both these components are multi-stage transform components, so they require a package as a
parameter (see question 21 for how multi-stage components use packages).
61) How can I generate DML for a database table from the command line?
Ans) The Ab Initio command-line utility, m_db, with the gendml argument, generates
appropriate metadata for a database table or expression. The syntax for the utility is

m_db gendml dbc_file [options] -table tablename


m_db gendml dbc_file [options] -select 'sql-select-statement'
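For example (the .dbc file name and table name are hypothetical):
m_db gendml mydb.dbc -table customers > customers.dml
This writes the generated record format for the customers table to customers.dml.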

62) What are Departitioning components?


Ans) Departition components combine multiple flow partitions of data records into a single flow
as follows:

Concatenate appends multiple flow partitions of data records one after another.
Gather combines data records from multiple flow partitions arbitrarily.

Interleave combines blocks of data records from multiple flow partitions in round-robin
fashion.

Merge combines data records from multiple flow partitions that have been sorted
according to the same key specifier and maintains the sort order.

64) What are Partitioning components?


Ans) Partition components distribute data records to multiple flow partitions to support data
parallelism, as follows:

Broadcast arbitrarily combines all the data records it receives into a single flow and
writes a copy of that flow to each of its output flow partitions.
Partition by Expression distributes data records to its output flow partitions according
to a specified DML expression.

Partition by Key distributes data records to its output flow partitions according to key
values.

Partition by Percentage distributes a specified percent of the total number of input


data records to each output flow.

Partition by Range distributes data records to its output flow partitions according to
the ranges of key values specified for each partition.

Partition by Round-robin distributes data records evenly to each output flow.

Partition with Load Balance distributes data records to its output flow partitions,
writing more records to the flow partitions that consume records faster.

65) What are FTP components?


Ans) FTP (file transfer protocol) components transfer data records as follows:

FTP From transfers files of data records from a computer that is not running the
Co>Operating System to a computer that is running the Co>Operating System.
FTP To transfers files of data records to a computer that is not running the
Co>Operating System from a computer that is running the Co>Operating System.

66) How can I terminate a graph based on a condition?


Ans) You can use a Reformat component with a force_error() function to test for a condition
and terminate the graph if that condition is met.
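A minimal sketch (the field balance and the message text are hypothetical):
out :: reformat(in) =
begin
  if (in.balance < 0)
    force_error("negative balance encountered");
  out.* :: in.*;
end;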
67) Explain about ROLLUP?
Ans) Rollup evaluates a group of input records that have the same key and then generates data
records that either summarize each group or select certain information from each group. There
are two ways to use a Rollup component:
Template Mode:
This mode uses a transform with aggregation functions like SUM, MAX, MIN, COUNT, AVG.
Expanded Mode:
This mode uses a package transform.
68) Explain SCAN component?
Ans) For every input record, Scan generates an output record that includes a running,
cumulative summary for the group of data records that the input record belongs to. For example,
the output records might include successive year-to-date totals for groups of data records.
69) Explain SCAN with ROLLUP?
Ans) For every input record, Scan with Rollup sends an output record to its out port that
includes a running, cumulative summary for the input group that input record belongs to. In
addition, after reading all input records for a particular input group, Scan with Rollup sends a
summary record to its rollup port for that input group. For example, suppose transaction
records are keyed on the stores in which they occur. Each record sent to the out port might
include the year-to-date transaction total for the store in which the transaction occurred.
Each record sent to the rollup port would include the year's total for transactions at one
store, and there would be one record for each store.
70) Explain SORT & SORT within GROUPS?

Ans) Sort sorts and merges data records.


Sort within Groups assumes input records are sorted according to the major-key
parameter.

Sort within Groups reads data records from all the flows connected to the in port until
it either reaches the end of a group or reaches the number of bytes specified in the
max-core parameter.

When Sort within Groups reaches the end of a group, it does the following:
a. Sorts the records in the group according to the minor-key parameter
b. Writes the results to the out port
c. Repeats this procedure with the next group
NOTE: When connecting a fan-in or all-to-all flow to the in port of a Sort, you do not need to
use a Gather because Sort can gather internally on its in port.
71) How you validate the records?
Ans) Validate Records uses the is_valid function to check each field of the input data records to
determine if the value in the field is:

Consistent with the data type specified for the field in the input record format
Meaningful in the context of the kind of information it represents

72) What is the difference between m_rollback and m_cleanup? When would you use them?
Ans) m_rollback rolls back a partially completed graph to its beginning state. m_cleanup
cleans up files left over from unsuccessfully executed graphs and manually recovered graphs.
The Co>Operating System automatically creates a recovery (.rec) file and other temporary files
and directories in the course of executing a graph. When a graph terminates abnormally, it
leaves the temporary files and directories on disk. At this point there are several alternatives
possible:
Roll back to the last checkpoint.
The Co>Operating System rolls back the graph automatically, if possible. You can roll back the
graph manually by explicitly using the m_rollback command without the -d option. After a
rollback, some temporary files and directories remain on disk. To remove them, follow one of
the other three alternatives.
Rerun the graph.
If the graph is not already rolled back, rerunning the graph first rolls back the graph to the last
checkpoint. The graph then starts re-executing. If the re-execution is successful, it removes all
temporary files and directories.

Roll back and clean up using m_rollback -d.


Clean up using the m_cleanup utilities.

For old job files, you can use the m_cleanup utility to list the temporary files and directories,
and m_cleanup_rm to delete them. You can also use m_cleanup_du to display the amount of
space these files use. Because recovery files and temporary files are automatically created in
the course of a run, remember not to delete these files for jobs that are still running.
73) What does the error message "straight flows may only connect ports having equal
depths" mean?
Ans) The "straight flows may only connect ports having equal depths" error message appears
when you connect two components running at different levels of parallelism (or depth) with a
straight flow (one that does not have an arrow symbol on it). For example, you get this error
message if you connect a Join running 10 ways parallel to a serial output file, or if you connect
a serial Join to a 4-way multifile.
74) What is AB_WORK_DIR and what do you need to know about it?
Ans) AB_WORK_DIR is a configuration variable whose value is a working space for graph
execution. You can view its value by using m_env describe.
75) What does the error message "too many open files" mean, and how do you fix it?
Ans) The "too many open files" error message occurs most commonly because the value of the
max-core parameter of the Sort component is set too low. In these cases, increasing the value
of the max-core parameter solves the problem.
76) What does the error message "Failed to allocate <n> bytes" mean and how do you fix it?
Ans) The "failed to allocate" type of error message is generated when an Ab Initio process has
exceeded its limit for some type of memory allocation.

Reduce the value of max-core in order to reduce the amount of memory allocated to a
component before temporary files are used. When the amount of memory specified by
max-core is used up by a component, the component starts writing temporary files to
hold the data being processed.
Be aware that while reducing the value of max-core may solve the problem of running
out of swap space, it may have an adverse effect on the graph's performance and will
increase the number of temporary files.
Increase available swap space, for example, by waiting until other memory intensive
jobs have completed.

77) What do you need to do to configure to run my graph across two or more machines?
Ans) In order to execute a graph across multiple machines, you need to carry out the following
steps:

Make sure that all the machines involved have compatible Co>Operating Systems
installed.
Set up the configuration files (.abinitiorc files) so that the different Co>Operating
Systems can communicate with each other.
Set up the environment variables and make sure that they are propagated properly
from one machine to another, when appropriate.

Set up the graph so that it can run across the machines as desired.

78) What communication ports does the GDE use when communicating with the
Co>Operating System?
Ans) The communication ports used depend upon the communication protocol selected. In
short, the GDE uses:

DCOM: 135 & **


REXEC: 512 & 21/20

RSH: 514 & 21/20

TELNET: 23 & 21/20

SSH(/AI): 22

AI/REXEC: 512

AI/TELNET: 23 & **

The ** above refer to the dynamically determined port that the control server sets up for
the file transfer.
79) If you use the layout Database: default in your database component, which working
directory does the Co>Operating System use?
Ans) The $AB_WORK_DIR directory is the working directory for database layouts.
$AB_DATA_DIR provides disk storage for the temporary files.
80) What are vectors? Why would you use them?
Ans) Vectors are arrays of elements. An element can be a single field or an entire record. They
are often used to provide a logical grouping of information. Many programming languages use
the concept of an array. In broad terms, an array is a collection of elements that are logically
grouped for ease of access.
81) How can I quickly test my DML expressions?
Ans) You can use the m_eval utility to quickly test the expressions that you intend to use in
your graphs.
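For example:
m_eval '1 + 2'
m_eval 'string_concat("abc", "def")'
Each call prints the value of the expression, so you can check a function or cast before
embedding it in a transform.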
82) What is the layout for watcher files?
Ans) The debugger places watcher files in the layout of the component downstream of the
watcher.
83) How do you remove watcher files?

Ans) To delete all watcher datasets in the host directory (for all graphs), you can either use the
GDE menu option, Debugger > Clean-out Watcher Datasets or invoke the following command:
m_rm -f -rmdata GDE-WATCHER-xxx
84) How can I determine which version of the GDE and Co>Operating System I am using?
Ans) To determine your GDE version, on the GDE menu bar choose Help > About Ab Initio.
For the Co>Operating System, use either of the following commands:

m_env -version
m_env -v

85) Should you use a Reformat component with a lookup file or a Join component in graph?
Ans) First of all, there are situations in which you cannot use a Reformat with Lookup instead of
a Join. For example, you cannot do a Full Outer Join using a Reformat and Lookup. The answer
below assumes that in your particular case either Reformat with Lookup or Join can be used in
principle, and that the question is about performance benefits of one over the other. When the
lookup file (in case of lookup) or the nondriving input (in case of a Join) fits into the available
memory, the Join and the lookup offer very similar performance.
86) How can you increase the time-out value for starting an Ab Initio process?
Ans) You can increase time-out values with the Ab Initio environment variables
AB_STARTUP_TIMEOUT and AB_RTEL_TIMEOUT_SECONDS.
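For example, in your host setup or start script (the value 300 is illustrative, and we assume
here that both variables take a number of seconds):
export AB_STARTUP_TIMEOUT=300
export AB_RTEL_TIMEOUT_SECONDS=300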
87) Give the file management commands?
Ans)
To create a multifile system: m_mkfs [ options ] control_url partition_url [ partition_url ... ]
To delete a multifile system: m_rmfs path
To create a multidirectory: m_mkdir [ -m[ode] mode ] [ -mvfile ] [ -max-segment-size bytes ] path
To delete a multidirectory: m_rmdir url [url ...]
To copy: m_cp
To move: m_mv
To list files: m_ls
Disk usage: m_du
Disk free: m_df
Count: m_wc

88) What are data-sized vectors? How do you work with them?
Ans) Data-sized vectors are vectors that have no set length of elements but, rather, are
variably sized based upon the number of elements in each data record. For example, if an
input dataset has three records, each with a vector, the first record's vector might have 5
elements, the second 1 element, and the third record, 7.
89) What is the difference b/w today (now) and today1 (now1)?
Ans) The today (now) function calls the operating system for the current date on each call.

In contrast, the function today1 (now1) calls the operating system for the current date
only on the first call in a job, returning the same value on subsequent calls. The
difference between the two functions is particularly noticeable on jobs that start
before and end after midnight.

90) Explain is_valid function?


Ans) is_valid Tests whether a value is valid.
The is_valid function returns:

The value 1 if expr is a valid data item.


The value 0 otherwise (for example, if expr is NULL or is not a valid value of its type).

91) Explain is_defined function?


Ans) is_defined Tests whether an expression is NOT NULL.
The is_defined function returns:

The value 1 if expr evaluates to a non-NULL value.


The value 0 otherwise.

The inverse of is_defined function is is_null function (is_failure).


92) Explain read raw component?
Ans) The Read Raw component reads a flow of data whose structure requires it to be parsed
programmatically rather than with declarative DML type declarations. Typically, the data
written to the output port can be readily described with DML types.
93) How you will get the sequence of numbers in abinitio?
Ans) By using the next_in_sequence() function. It returns a sequence of integers on successive
calls, starting with 1.
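For example, a sketch of a Reformat transform that numbers its records (the field seq_no is
hypothetical):
out :: reformat(in) =
begin
  out.seq_no :: next_in_sequence();
  out.* :: in.*;
end;
Note that in a parallel layout each partition has its own sequence, so the numbers are unique
only within a partition.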
94) How will you get the degree of parallelism?

Ans) By using the number_of_partitions() function, which returns the number of partitions. The
number of partitions is also known as the degree of parallelism. It returns -1 if not called from
within a component.
95) Explain first_defined function?
Ans) first_defined returns the first defined (not NULL) argument of two arguments. Note that
the Oracle NVL function is very similar to this function.
Syntax: first_defined (a, b)
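For example (the field name is hypothetical):
out.city :: first_defined(in.city, "UNKNOWN");
This assigns in.city when it is not NULL, and "UNKNOWN" otherwise.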

1) What is the difference between a file and a table in Ab Initio?

Ans) A table holds relational data, i.e. it has a relational structure.
A file holds data in a non-relational structure.
2) How do you connect EME to the Ab Initio Server?
Ans) There are several ways of connecting to the EME:
1. Set AB_AIR_ROOT
2. From the GDE you can connect to the EME datastore
3. Log in to the EME web interface:
http://serverhost[:serverport]/abinitio
4. Use the air command
3) What function would you use to convert a string into a decimal?
Ans) For converting a string to a decimal we need to typecast it using the following syntax:
out.decimal_field :: ( decimal( size_of_decimal ) ) string_field;
The above statement converts the string to decimal and populates it to the decimal field in
output.
6) How do we handle DML that changes dynamically?
Ans) There are many ways to handle DMLs which change dynamically within a single file.
Some of the suitable methods are to use a conditional DML, or to call the vector functionality
while calling the DMLs.
8) What is .abinitiorc and what does it contain?
Ans) .abinitiorc is the config file for Ab Initio. It is found in the user's home directory.
Generally it contains the Ab Initio home path and login information, such as user ID, encrypted
password, and login method, for the hosts the graph connects to at execution time.
What are .air-project-parameters and .air-sandbox-overrides? What is the relation between them?
Ans) .air-project-parameters contains the parameter definitions of all the parameters within a
sandbox. This file is maintained by the GDE and the Ab Initio environment scripts.
.air-sandbox-overrides exists only if you are using version 1.11 or a later version of the GDE.
It contains the user's private values for any parameters in .air-project-parameters that have
the Private Value flag set. It has the same format as the .air-project-parameters file.
When you edit a value (in the GDE) for a parameter that has the Private Value flag checked,
the value is stored in the .air-sandbox-overrides file rather than the .air-project-parameters
file.
What is the ABLOCAL expression and where do you use it in Ab Initio?
Ans) We use ABLOCAL(expression) to increase SQL query performance by supplying the name
of the large table in the expression. This way we make it the driving table.
Name the air commands in Ab Initio?
Ans) Here are a few of the commands we use:
1) air object ls <EME Path for the object /Projects/edf/.. > --- This is used to see the listing of
objects in a directory inside the project.
2) air object rm <EME Path for the object /Projects/edf/.. > -- This is used to remove an object
from the repository. Please be careful with this.
3) air object cat <EME Path for the object /Projects/edf/.. > --- This is used to see the object
which is present in the EME.
4) air object versions -verbose <EME Path for the object /Projects/edf/.. > --- Gives the Version History of the
object.
5) air project show <EME Path for the project /Projects/edf/.. > --- Gives the whole info about the
project. What all types of files can be checked-in etc.
6) air project modify <EME Path for the project /Projects/edf/.. > -extension <something like *.dat within
single quotes> <content-type> --- This is to modify the
project settings. Ex: If you need to checkin *.java files
into the EME, you may need to add the extension first.
7) air lock show -project <EME Path for the project /Projects/edf/.. > --- shows all the files that are locked
in the given project
8) air lock show -user <UNIX User ID> -- shows all the
files locked by a user in various projects.
9) air sandbox status <file name with the relative path> --- shows the status of the file in the
sandbox with respect to the EME (Current, Stale, and Modified are a few of the statuses)
What is dedup unique-only?
Ans) unique-only is a value of the keep parameter of the Dedup Sorted component (choice,
required), which specifies which records the component keeps to write to the out port. Choose
one of the following options:
first - Keeps the first record of a group
last - Keeps the last record of a group
unique-only - Keeps only records with unique key values
The component writes the remaining records of each group to the dup port.
Default is first.
14) What is Ad hoc multifile? How is it used?
ANSWER:
Here is a description of Ad hoc multifile:
Ad hoc Multifiles treat several serial files having the same record format as a single graph
component.
Frequently, the input of a graph consists of a set of serial files, all of which have to be
processed as a unit. An Ad hoc multifile is a multifile created 'on the fly' out of a set of serial
files, without needing to define a multifile system to contain it. This enables you to represent
the needed set of serial files with a single input file component in the graph. Moreover, the set
of files used by the component can be determined at runtime. This lets the user customize
which set of files the graph uses as input without having to change the graph itself, even after
it goes into production.
Ad hoc Multifiles can be used as output, intermediate, and lookup files as well as input files.
The simplest way to define an Ad hoc multifile is to list the files explicitly as follows:
1. Insert an input file component in your graph.
2. Open the properties dialog. Select Description tab.
3. Select Partitions in the Data Location of the Description tab
4. Click Edit to open the Define multifile Partitions dialog box.
5. Click New and enter the first file name. Click New again and enter the second file name and
so on.
6. Click OK.
If you have added 'n' files, then the input file now acts something like a file in an n-way multifile
system, whose data partitions are the n files you listed. It is possible for components to run in
the layout of the input file component. However, there is no way to run commands such as
m_ls or m_dump on the files, because they do not comprise a real multifile system.
There are other ways to define an Ad hoc multifile than listing the input files explicitly:

1. Listing files using wildcards - If the input file names have a common pattern then you can
use a wild card for all the files. E.g. $AI_SERIAL/ad_hoc_input_*.dat. All the files that are
found at the runtime matching the wild card pattern will be taken for the Ad hoc multifile.
2. Listing files in a variable. You can create a runtime parameter for the graph and inside the
parameter you can list all the files separated by spaces.
3. Listing files using a command - E.g. $(ls $AI_SERIAL/ad_hoc_input_*.dat), which produces
the list of files to be used for the ad hoc multifile. This method gives maximum flexibility in
choosing the input files, since you can use complex commands also that involves owner of file
or date time stamp.
15) How can I tune a graph so it does not excessively consume my CPU?
ANSWER:
Options:
1. Reduce the DOP ( degree of paralleism ) for components.
Example:
1. Change from a 4-way parallel to a 2-way parallel.
2. Examine each transformation for inefficiencies.
Example:
1. If a transformation uses many local variables, make these variables global.
2. If the same function call is performed more than once, call it once and store its value in a
global variable.
3. When reading data, reduce the amount of data that needs to be carried forward to the next
component.
16) I'm having trouble finding information about the AB_JOB variable. Where and how can I
set this variable?
ANSWER:
You can change the value of the AB_JOB variable in the start script of a given graph. This will
enable you to run the same graph multiple times at the same time (thus parallel). However,
make sure you append some unique identifier such as timestamp or sequential number to the
end of each AB_JOB variable name you assign. You will also need to vary the file names of any
outputs to keep the graphs from stepping on each others outputs. I have used this technique
to create a "utility" graph as a container for a start script that runs another graph multiple
times depending on the local variable input to the "utility" graph. Be careful you don't max out
the capacity of the server you are running on.
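A minimal start-script sketch of the technique described above:
export AB_JOB="${AB_JOB}_$(date '+%Y%m%d%H%M%S')"
This appends a timestamp so each concurrent run gets a distinct job name (and therefore a
distinct recovery file).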
17) I have a job that will do the following: FTPs files from a remote server; reformats data in
those files and updates the database; deletes the temporary files.
How do we trap errors generated by Ab Initio when an ftp fails? If I have to re-run / re-start
a graph again, what are the points to be considered? Does *.rec file have anything to do
with it?
ANSWER:
Ab Initio has very good restartability and recovery features built into it. In your situation you
can do the tasks you mentioned in one graph with phase breaks.

FTP in phase 1 and your transformation in next phase and then DB update in another phase
(This is just an example this may not best of doing it as best design depends on various other
factors)
If the graph fails during FTP then your graph fails in Phase 0, you can restart the graph, if your
graph fails in Phase 1 then AB_JOB.rec file exists and when you restart your graph you would
see a message saying recovery file exists, do you want to start your graph from last successful
check point or restart from beginning. Same thing if it fails in Phase 2.
Phases are expensive from Disk I/O perspective, so have to be careful in doing too much
phasing.
Coming back to error trapping each component has reject, error, log ports, reject captures
rejected records, error captures corresponding error and log captures the execution statistics
of the component. You can control reject status of each component by setting reject threshold
to either "Never Abort", "Abort on first reject" or setting "ramp/limit"
Recovery files keep track of crucial information for recovering the graph from a failed status,
such as which node each component is executing on. It is a bad idea to just remove the *.rec
files; you always want to roll back the recovery files cleanly so that temporary files created
during graph execution won't hang around, occupy disk space, and create issues.
Always use m_rollback -d
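For example (the recovery file name is hypothetical):
m_rollback -d my_graph.rec
This rolls the failed job back and deletes its recovery and temporary files.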
18) What is parallelism in Ab Initio?
ANSWER:
1) Component parallelism:
A graph with multiple processes running simultaneously on separate data uses component
parallelism.
2) Data parallelism:
A graph that deals with data divided into segments and operates on each segment
simultaneously uses data parallelism. Nearly all commercial data processing tasks can use data
parallelism. To support this form of parallelism, Ab Initio software provides Partition
Components to segment data, and Departition Components to merge segmented data back
together.
3) Pipeline parallelism:
A graph with multiple components running simultaneously on the same data uses pipeline
parallelism.
Each component in the pipeline continuously reads from upstream components, processes data,
and writes to downstream components. Since a downstream component can process records
previously written by an upstream component, both components can operate in parallel.
NOTE: To limit the number of components running simultaneously, set phases in the graph.
20) What is a sandbox?
ANSWER:
A sandbox is a directory structure in which each directory level is assigned a variable name; it
is used to manage check-in and checkout of repository-based objects such as graphs.
fin -------> top level directory ( $AI_PROJECT )
|
|---- dml -------> second level directory ( $AI_DML )
|
|---- xfr -------> second level directory ( $AI_XFR )
|
|---- run -------> second level directory ( $AI_RUN )
You'll require a sandbox when you use EME (repository s/w) to maintain release control.
Within EME for the same project an identical structure will exist.
The above-mentioned structure will exist under the os (eg unix), for instance for the project
called fin, and is usually name of the top-level directory.
In EME, a similar structure will exist for the project: fin.
When you checkout or check-in a whole project or an object belonging to a project, the
information is exchanged between these two structures.
For instance, if you checkout a dml called fin.dml for the project called fin, you need a
sandbox with the same structure as the EME project called fin. Once you've created that, as
shown above, fin.dml or a copy of it will come out from EME and be placed in the dml directory
of your sandbox.
21) How can I read data which contains variable length records with different record
structures and no delimiters?
ANSWER:
a) Try using the Read Raw component; it should do exactly what you are looking for.
b) Use the dml format:
record
string(integer(4)) my_field_name;
end
22) How do I create subgraphs in Ab Initio?
ANSWER:
First, highlight all of the components you would like to have in the sub graph, click on edit,
then click on sub graph, and finally click on create.
23) Suppose that you are changing fin.dml and you said "checkout". Exactly how do you do it?
Also, can you quote an example of where you use sandbox parameters and how exactly you
create them? Do you keep two copies of those sandbox parameters, as we keep for our graphs
and other files?
ANSWER:
Checkin and checkout from EME
Checkin (sandbox) ---------------> EME
Checkout (sandbox) <------------- EME
1. AbInitio gives command line interfaces via air command to perform
Checkin and checkout

2. Checkin and checkout must be performed via sandbox


3. The GDE gives the option to perform checkin and checkout via the Project ---->
Check In and Project ----> Check Out options
4. You create a sandbox from GDE via Project -----> Create sandbox
option
5. When creating a sandbox you specify a directory name ( try it
out; don't be afraid)
6. EME contains one or many Projects ( a project is a collection of
graphs and related files and a parameter called Project parameter file )
7. The project parameter file, when resides within EME, is called
Project Parameter
8. The project parameter file, resides within one's sandbox, is called sandbox parameter.
Therefore, sandbox parameter is a copy of project parameter and
is local to sandbox owner.
9. When project parameters change, it'll be reflected in your
sandbox parameters , if you have checked out a graph and therefore, a copy of latest
project parameter, after that change had taken place.
10. You edit sandbox parameter via Project ----->edit sandbox option
11. You edit project parameters via Project -----> Administrative -------> edit project
12. When checking out an object, use Project--------> checkout option.
Navigate down to the Project of your choice Navigate down to required directory
( eg.mp, dml or xfr etc ) Select the object required Then specify a sandbox name ( ie.
the top level directory of the directory structure called sandbox) You will be prompted to
confirm the checkout
13. Sometimes, when you checkout an object, you get a number of other objects checked
out for you automatically, and this happens due to dependency.
Example:
Checkout a graph ( .mp file )
In addition, you might get a .dml or .xfr file
You will also certainly get a .ksh file for the graph
27) I was trying to use a User Defined Function (int_to_date) inside a Rollup, to typecast
date and time values originally stored as integers back to date forms and then concatenate
the same.
The code I wrote is as below.
record
datetime("YYYY-MM-DD HH24:MI:SS")("\001") output_date_format;
end out::int_to_date(record
big endian integer(4) input_date_part;
end in0, record
big endian integer(4) input_time_part;
end in1) begin
let datetime("YYYY-MM-DD HH24:MI:SS")("\001") v_output_format =(datetime("YYYY-MM-DD
HH24:MI:SS"))string_concat((string("|"))(date("YYYY-MM-DD"))in0.input_date_part,(string("|"))
(datetime("HH24:MI:SS"))decimal_lpad(((string("|"))(decimal("|")) in1.input_time_part),6));
out.output_date_format :: v_output_format;
end;

out::rollup(in) begin
let datetime("YYYY-MM-DD HH24:MI:SS")("\001") rfmt_dt;
rfmt_dt = int_to_date(in.reg_date, in.reg_time);
out.datetime_output :: rfmt_dt;
out.* :: in.*;
end;
However I got an error during run time.
The Error Message looked like:
While compiling finalize:
While compiling the statement:
rfmt_dt = int_to_date(in.reg_date, in.reg_time);
Error: While compiling transform int_to_date:
Output object "out.output_date_format" unknown.?
28) I have a small problem understanding an issue with Reformat.
I could not figure out why this Reformat component runs forever; I believe it is in an endless
loop somehow.
Reformat component has following input and output DML:
record
begin
string(",") code, code2;
intger(2) count ;
end("\n")
Note : here variable "code" is never null nor blank.
sample data is
string_1,name,location,firstname,lastname,middlename,0
string_2,job,location,firstjob,lastjob,0
string_3,design,color,paint,architect,0
out::reformat(in) =
begin
let string(integer(2)) temp_code2 = in.code2;
let string(integer(2)) temp_code22 = " ";
let integer(2) i=0;
while (string_index(temp_code2, ",") != 0 || temp_code2 != "")
begin
temp_code22 = string_concat(in.code,",", string_substring(temp_code2,
1,string_index(temp_code2,",")));
temp_code2 = string_substring(temp_code2, string_index(temp_code2, ","),
string_length(temp_code2));
i=i+1;
end
out.code :: in.code;
out.code2 :: string_lrtrim(temp_code22);
out.count:: i;
end;

my expected output is
string_1,string_1,name,string_1,location,string_1,firstname,string_1,lastname,string_1,middlen
ame,5
string_2,string_2,job,string_2,location,string_2,firstjob,string_2,lastjob,4
string_3,string_3,design,string_3,color,string_3,paint,string_3,architect,4
ANSWER:
record
begin
string(",") code, code2;
integer(2) count ;
end("\n")
Note the corrected record format above (integer, not intger); as posted, the original DML does
not validate in Ab Initio.
29) In my graph I am creating a file with account data. For a given account there can be
multiple rows of data. I have to split this file into 4 (specifically) files which are nearly
equal in size. The trick is to keep the accounts confined to one file. In other words
account data should not span across these files. How do I do it?
Also if the records are less than 4 (different accounts) I should be able to create empty
files. But I need at least 4 files.
FYI: The requirement to have 4 files is because I need to start 4 parallel processes for load
balancing the subsequent processes.
ANSWER:
a)
I could not get your requirement very clearly, as you want to split the files into 4 equal parts as
well as keep the same account numbers in the same file. Can you explain what you will do in
the case of 5 account numbers having 20 records each? As far as splitting is concerned, a very
crude solution would be as follows (a rough shell sketch follows the steps).
In the end script do the following:
1. Find the size of the file and store it in a variable (say v_size)
2. v_qtr_size=`expr $v_size / 4`
3. split -b $v_qtr_size <filename>
4. Rename the split files as per your requirement. Note the split files have a specific pattern
in their name
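A rough end-script sketch of those steps (the file names are hypothetical):
v_size=$(wc -c < accounts.dat)
v_qtr_size=$(expr $v_size / 4)
split -b $v_qtr_size accounts.dat accounts_part_
Keep in mind that a raw byte split can cut a record (and an account group) in half, which is why
the key-based approach in (b) below is usually safer.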
b)
Your requirement is such that it essentially depends on the skewness of your data across
accounts. If you want to keep same accounts in same partition, then partition the data by key
(account) with the out port connected to 4 way parallel layout. But this does not guarantee
equal load in all partitions unless the data has little skewness.
But I can suggest an alternative approach which, though cumbersome, might give you a result
close to your requirement.
Replicate your original dataset into two flows; take one of them and roll up on account_no to
find the record count per account_no. Sort this result on record count so that the account_no
with the minimum count is at the top and the one with the maximum count is at the bottom. Then
apply a partition by round-robin and separate out the four partitions (partitions 0, 1, 2 & 3).

Now take the first partition and join it with your main dataset (the one you replicated
earlier) on account_no, and write the matching records (out port) into the first file. Take
the unused records of the main flow of that join and join them with the second partition
(partition 1) on account_no, writing the matching records (out port) to the second file.
Similarly, take the unused records of the previous join and join them with the third partition
(partition 2) on account_no; write the matching records (out port) to the third file and the
unused records of the main flow to the fourth file.
This way you get four files, nearly equal in size, with no account spread across files.
30) I have a graph parameter state_cd whose value is set by an if statement. I would like to
use this parameter in the SQL statements kept in the AI_SQL directory. I have 20 SQL
statements for 20 table codes, and I use the corresponding SQL statement based on the table
code passed as a parameter to the graph.
eg: SQLs in the AI_SQL directory:
1. Select a,b from abc where abc.state_cd in ${STATE_CD}
2. Select x,y from xyz where xyz.state_cd in ${STATE_CD}
${STATE_CD} is a graph parameter with the value "(IL,CO,MI)".

The problem is that ${STATE_CD} is not getting interpolated when I echo the Select statement,
hence the query fails.

ans: Anand, use eval, or export the parameter so the Input Table components can resolve it.
Or define ${STATE_CD} in your start script; that's better (a sketch follows).
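For example, a minimal start-script sketch (the quoting of the state codes is illustrative; a
SQL IN list over a character column normally needs quoted literals):
# start script: export the parameter so it is visible to everything the graph runs
export STATE_CD="('IL','CO','MI')"
# sanity check -- the echo should now show the list substituted into the SQL
echo "Select a,b from abc where abc.state_cd in ${STATE_CD}"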
31) Explain kill() and its possible return values.
There are four possible results from this call:
1) kill() returns 0. This means a process exists with the given PID, and the system would
allow you to send signals to it. It is system-dependent whether the process could be a zombie.
2) kill() returns -1, errno == ESRCH. Either no process exists with the given PID, or security
enhancements are causing the system to deny its existence. (On some systems, the process could
be a zombie.)
3) kill() returns -1, errno == EPERM. The system would not allow you to kill the specified
process. This means that either the process exists (again, it could be a zombie) or draconian
security enhancements are present (e.g. your process is not allowed to send signals to
*anybody*).
4) kill() returns -1 with some other value of errno: you are in trouble!
The most-used technique is to assume that success, or failure with EPERM, implies that the
process exists, and any other error implies that it doesn't.
An alternative exists if you are writing specifically for a system (or set of systems) that
provides a /proc filesystem: checking for the existence of /proc/PID may work.
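A small C sketch of that most-used technique: sending the null signal (0) probes a PID without
affecting the process. The PID used in main is purely hypothetical.
/* probe a PID with signal 0 and interpret the result */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Returns 1 if the process probably exists, 0 if it probably doesn't. */
int process_exists(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return 1;            /* process exists and is signalable */
    if (errno == EPERM)
        return 1;            /* exists, but we may not signal it */
    return 0;                /* ESRCH or anything else: assume gone */
}

int main(void)
{
    pid_t pid = 12345;       /* hypothetical PID */
    printf("PID %ld %s\n", (long)pid,
           process_exists(pid) ? "exists" : "does not exist");
    return 0;
}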
What is the difference between m_rollback and m_cleanup and when would I use them?
Answer
m_rollback has the same effect as an automatic rollback using the jobname.rec file: it rolls
back a job to the last completed checkpoint, or to the beginning if the job has not completed
any checkpoints. The m_cleanup commands are used when the jobname.rec file doesn't exist and
you want to remove temporary files and directories left by failed jobs.

Details
In the course of running a job, the Co>Operating System creates a jobname.rec file in the
working directory on the run host.

NOTE: The script takes jobname from the value of the AB_JOB environment variable. If
you have not specified a value for AB_JOB, the GDE supplies the filename of the graph as the
default value for AB_JOB when it generates the script.
The jobname.rec file contains a set of pointers to the internal job-specific files written by the
launcher, some of which the Co>Operating System uses to recover a job after a failure. The
Co>Operating System also creates temporary files and directories in various locations. When a
job fails, it typically leaves the jobname.rec file, the temporary files and directories, and
many of the internal job-specific files on disk. (When a job succeeds, these files are
automatically removed, so you don't have to worry about them.)
If your job fails, determine the cause and fix the problem. Then either:
1. Restart the job. If it succeeds, the jobname.rec file and all the temporary files and
directories are cleaned up automatically.
2. Run m_rollback -d to clean up the files left behind by the failed job, as in the example
below.
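For example, assuming the failed job left a recovery file named my_graph.rec in the run
directory (the name is illustrative):
# roll the job back to its last completed checkpoint (or its start) and
# remove the recovery file and the temporary files it points to
m_rollback -d my_graph.rec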
What value should I set for the max-core parameter?
Short answer
The max-core parameter is found in the SORT, JOIN, and ROLLUP components, among others.
There is no single, optimal value for the max-core parameter, because a "good" value depends
on your particular graph and the environment where it runs.
Details
The SORT component works in memory, and the ROLLUP and JOIN components have the option
to do so. These components have a parameter called max-core, which determines the
maximum amount of memory they will consume per partition before they spill to disk. When
the value of max-core is exceeded in any of the in-memory components, all inputs are dropped
to disk. This can have a dramatic impact on performance; but this does not mean that it is
always better to increase the value of max-core.
The higher you set the value of max-core, the more memory the component can use. Using
more memory generally improves performance up to a point. Beyond this point, performance
will not improve and might even decrease. If the value of max-core is set too high, operating
system swapping can occur and the graph might fail if memory on the machine is exhausted.
When setting the value for max-core, you can use the suffixes k, m, and g (upper case is also
supported) to indicate powers of 1024. For max-core, the suffix k (kilobytes) means precisely
1024 bytes, not 1000. Similarly, the suffix m (megabytes) means precisely 1048576 (1024^2)
bytes, and g (gigabytes) means precisely 1024^3 bytes. Note that the maximum value for
max-core is 2g-1.
SORT component

For the SORT component, 100 MB is the default value for max-core. This default is used to
cover a wide variety of situations and might not be ideal for your particular circumstances.
Increasing the value of max-core will not increase performance unless the full dataset can be
held in memory, or the data volume is so large that a reduction in the number of temporary
files improves performance. You can estimate the number of temporary files by multiplying the
data volume being sorted by three and dividing by the value of max-core (because data is
written to disk in blocks that are one third the size of the max-core setting). This number
should be less than 1000. For example, suppose you are sorting 1 GB of data with the default
max-core setting of 100 MB and the process is running in serial. The number of temporary files
that will be created is:
3 x 1000 MB / 100 MB = 30 files
You should decrease the value of a SORT component's max-core if an in-memory ROLLUP or
JOIN component in the same phase would benefit from additional memory. The net
performance gain will be greater.
If you get a "Too many open files" error message, your SORT component's max-core might be
set too low. If this is the case, SORT can also fill AB_WORK_DIR (usually set to /var/abinitio at
installation), which will cause all graphs to fail with a message about semaphores. This
directory is where recovery information and inode information for named pipes are stored and
is typically mounted on a small filesystem.

NOTE: We recommend setting the value of max-core as a $ reference to a parameter (for example,
$AI_SORT_MAX_CORE) so you can easily adjust the value at runtime if required.
In-memory ROLLUP or JOIN
It is difficult to be precise about the amount of memory an in-memory ROLLUP or JOIN will
consume. An in-memory JOIN tries to hold all its nondriving inputs in memory, so make the
largest input by volume the driving one. A ROLLUP component must hold the size of the keys,
plus the size of the temporaries, plus the size of any input fields required in finalize to
produce the output. In practice, in most ROLLUP components, this is just the size of the
output. In addition, some space is needed for the hash table.
You should always set the max-core parameter in in-memory ROLLUP and JOIN components
with a parameter, like AI_GRAPH_MAX_CORE. The default can be set to the appropriate value
and changed at runtime if required. You can create additional parameters such as
AI_GRAPH_MAX_CORE_HALF and AI_GRAPH_MAX_CORE_QUARTER to divide up the available
max-core among different in-memory components in a phase. If two in-memory components
each need most or all of AI_GRAPH_MAX_CORE, you should put them in separate phases,
provided you have the disk space to hold the data at the phase break.
A second use of phasing is to control the allocation of memory among in-memory components.
Because there is a limited amount of memory available, you can use phasing to make sure each
in-memory component gets a sufficient amount. Typically, only one to four in-memory
components should occupy the same phase, depending on memory availability and demands.
To compute a value for AI_GRAPH_MAX_CORE, take the total memory on the machine and
subtract memory used by lookups and competing processes, including other graphs, running at
the same time on the machine. This is the available memory. Divide this by twice the number of
partitions to get AI_GRAPH_MAX_CORE: max-core is measured per partition, and the factor of two
gives a contingency safety factor. So:
AI_GRAPH_MAX_CORE = (total memory - memory used elsewhere) / (2 * number of partitions)
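A worked example as a small shell sketch; all the numbers are hypothetical, purely to
illustrate the arithmetic:
# 16 GB machine, 2 GB used by lookups and other concurrent jobs, 4-way graph
TOTAL_MEM_MB=16384
OTHER_MEM_MB=2048
PARTITIONS=4
MAX_CORE_MB=`expr \( $TOTAL_MEM_MB - $OTHER_MEM_MB \) / \( 2 \* $PARTITIONS \)`
echo "AI_GRAPH_MAX_CORE=${MAX_CORE_MB}m"   # prints AI_GRAPH_MAX_CORE=1792m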
