Objectives
After completing this module, you should be able to:
Indexes in Teradata
Indexes are used to access rows from a table without having to search the whole
table. In the Teradata RDBMS, an index is made up of one or more columns in a
table. Once Teradata indexes are selected, they are maintained by the system.
While other vendors may require data partitioning or index maintenance, these
tasks are unnecessary with Teradata.
In the Teradata RDBMS, there are two types of indexes: Primary Indexes and Secondary Indexes.
You specify which column(s) are used as the Primary Index when you create a
table. Secondary Index column(s) can be specified when you create a table or at
any time during the life of the table.
Data Distribution
When the Primary Index for a table is well chosen, the table rows are evenly
distributed across the AMPs for the best performance. The way to guarantee even distribution is to choose a Primary Index whose values are unique.
Unevenly distributed data, also called "skewed data," causes slower response
time as the system waits for the AMP(s) with the most data to finish their
processing. The slowest AMP becomes a bottleneck.
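The effect of skew can be illustrated with a small simulation. This is only a sketch: the CRC32 function stands in for Teradata's hashing algorithm (which is not described here), and the four-AMP system, the employee numbers, and the three department numbers are all hypothetical.

```python
import zlib
from collections import Counter

NUM_AMPS = 4  # hypothetical system size


def amp_for(value, num_amps=NUM_AMPS):
    """Stand-in for the real hashing algorithm: CRC32 of the value, modulo the AMP count."""
    return zlib.crc32(str(value).encode("utf-8")) % num_amps


def rows_per_amp(index_values, num_amps=NUM_AMPS):
    """Count how many rows land on each AMP for the given Primary Index values."""
    counts = Counter(amp_for(v, num_amps) for v in index_values)
    return [counts.get(amp, 0) for amp in range(num_amps)]


# 10,000 hypothetical employee rows.
employee_numbers = list(range(1, 10001))                # unique values
department_numbers = [n % 3 for n in employee_numbers]  # only 3 distinct values

unique_dist = rows_per_amp(employee_numbers)
skewed_dist = rows_per_amp(department_numbers)

print("rows per AMP, unique index values:", unique_dist)
print("rows per AMP, 3 distinct values:  ", skewed_dist)
# Parallel work finishes only when the busiest AMP finishes.
print("busiest AMP holds", max(skewed_dist), "rows in the skewed case")
```

With only three distinct index values, at most three of the four AMPs can hold rows, so at least one AMP sits idle while another holds a third or more of the table.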
The system automatically distributes the data across the AMPs based on the Primary Index, which is one of two types:
Unique Primary Index (UPI) - For a given row, the combination of the
data values in the columns of a Unique Primary Index is not duplicated
in other rows within the table. This uniqueness guarantees uniform data
distribution and direct access. For example, in the case where old
employee numbers are sometimes recycled, the combination of the Last
Name and Employee Number columns would be a UPI.
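Non-Unique Primary Index (NUPI) - For a given row, the combination of the
data values in the columns of a Non-Unique Primary Index can be
duplicated in other rows within the table. A NUPI can cause skewed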
data, but in specific instances can still be a good Primary Index choice.
For example, either the Department Number column or the Hire Date
column might be a good choice for a NUPI if you will be accessing the
table most often via these columns.
If the Primary Index is unique, you could have one row with a null value. If you
have multiple rows with a null value, the Primary Index must be Non-Unique.
The Primary Index determines two things:
Data distribution
Data access
Rows are distributed (or redistributed) during the following operations:
Loading data into a table (one or more rows, using a data loading utility)
Inserting or updating rows (one or more rows, using SQL)
Changing the system configuration (redistribution of data, caused by
reconfigurations to add or delete AMPs)
When loading data or inserting rows, the data being affected by the load or insert is not
available to other users until the transaction is complete. During a reconfiguration, no
data is accessible to users until the system is operational in its new configuration.
Row Distribution Process
The process the system uses for inserting a row on an AMP is described below:
1. The system uses the Primary Index value in each row as input to the hashing
algorithm.
2. The output of the hashing algorithm is the row hash value (in this example, 646).
3. The system looks at the hash map, which identifies the specific AMP where the
row should be stored (in this example, AMP 3).
4. The row is stored on the target AMP.
o UPI: The system automatically checks for duplicate UPI values when rows
are loaded or inserted. If a row already exists with the UPI value, the new
row is not added.
o NUPI: The system does not check for duplicate NUPI values. If a row
already exists with the NUPI value, the new row is added to the same
AMP.
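The four steps above can be sketched as follows. This is a simplified model, not Teradata's implementation: CRC32 stands in for the hashing algorithm, the bucket is taken with a modulo, and the sizes (16 hash buckets, 4 AMPs) and the `Table` class are hypothetical.

```python
import zlib

NUM_BUCKETS = 16  # hypothetical; a real hash map has far more buckets
NUM_AMPS = 4      # hypothetical system size

# Hash map: position = hash bucket number, value = AMP number (round-robin here).
HASH_MAP = [bucket % NUM_AMPS for bucket in range(NUM_BUCKETS)]


def row_hash(pi_value):
    """Steps 1-2: the Primary Index value goes in, a row hash value comes out."""
    return zlib.crc32(repr(pi_value).encode("utf-8"))


def target_amp(pi_value):
    """Step 3: look up the owning AMP in the hash map."""
    bucket = row_hash(pi_value) % NUM_BUCKETS
    return HASH_MAP[bucket]


class Table:
    def __init__(self, unique_pi):
        self.unique_pi = unique_pi
        self.amps = {amp: [] for amp in range(NUM_AMPS)}

    def insert(self, pi_value, row):
        """Step 4: store the row on the target AMP; return False if rejected."""
        stored = self.amps[target_amp(pi_value)]
        if self.unique_pi and any(pi == pi_value for pi, _ in stored):
            return False  # UPI: a row with this index value already exists
        stored.append((pi_value, row))  # NUPI: duplicates share the same AMP
        return True


upi_table = Table(unique_pi=True)
first_insert = upi_table.insert(1001, "Smith")
second_insert = upi_table.insert(1001, "Jones")  # same UPI value

nupi_table = Table(unique_pi=False)
nupi_table.insert(10, "Smith")
nupi_table.insert(10, "Jones")                   # same NUPI value
rows_on_same_amp = len(nupi_table.amps[target_amp(10)])
print(first_insert, second_insert, rows_on_same_amp)
```

Because the same index value always hashes to the same AMP, the UPI duplicate check only has to look at one AMP.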
Hash Map
A hash map is an array that associates hash bucket numbers with specific AMPs. While it
has a limited number of hash buckets, there are enough hash buckets to minimize the
number of hash collisions (when the hashing algorithm calculates the same row hash
value for two different rows).
The hash map is a GDO (globally distributed object), which is a file that is copied and
distributed to every node in the system. If an AMP is executing a request that requires
information in a GDO, it can access the copy of the GDO on its node.
Teradata Indexes - Workshop
To differentiate each row in a table, every row is assigned a unique Row ID. The Row ID
is the combination of the row hash value and a uniqueness value.
When each row is inserted, the AMP adds the row ID, stored as a prefix of the row. The
first row inserted with a particular row hash value is assigned a uniqueness value of 1.
The uniqueness value is incremented by 1 for any additional rows inserted with the same
row hash value.
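The Row ID assignment described above can be sketched in a few lines. The `Amp` class and the sample row hash values are illustrative assumptions; the sketch models only the counter, not how the row is physically stored.

```python
from collections import defaultdict


class Amp:
    """Hypothetical AMP that assigns Row IDs as (row hash, uniqueness value)."""

    def __init__(self):
        self._last_uniqueness = defaultdict(int)  # row hash -> last value used
        self.rows = []                            # list of (row_id, data)

    def insert(self, row_hash, data):
        # The first row for a row hash gets 1; each later row gets the next integer.
        self._last_uniqueness[row_hash] += 1
        row_id = (row_hash, self._last_uniqueness[row_hash])
        self.rows.append((row_id, data))
        return row_id


amp = Amp()
first = amp.insert(646, "row A")
second = amp.insert(646, "row B")  # same row hash as the first row
other = amp.insert(123, "row C")
print(first, second, other)        # (646, 1) (646, 2) (123, 1)
```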
Duplicate Rows
A duplicate row is a row in a table whose column values are identical to another
row in the same table. In other words, the entire row is the same, not just an
index. Although duplicate rows are not allowed in the relational model (because
every Primary Key must be unique), Teradata does allow duplicate rows
because the capability is a part of the ANSI standard.
Because duplicate rows are allowed in Teradata, how does it affect the UPI,
which, by definition, is unique? When you create a table, you specify one of the following table types:
MULTISET tables: May contain duplicate rows. Teradata will not check
for duplicate rows.
SET tables: The default. Teradata checks for and does not permit
duplicate rows. If a SET table is created with a Unique Primary Index,
the check for duplicate rows is replaced by a check for duplicate index
values.
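The difference between the checks can be sketched as a single decision function. The function name, the dictionary row format, and the sample rows are hypothetical. The sketch also assumes that a Unique Primary Index is enforced on a MULTISET table as well.

```python
def can_insert(existing_rows, new_row, table_kind, upi_columns=()):
    """Return True if new_row may be added to the table.

    existing_rows: list of dicts, one per row already stored.
    table_kind:    "SET" or "MULTISET".
    upi_columns:   column names of the Unique Primary Index, if the table has one.
    """
    if upi_columns:
        # With a UPI, only the index values are compared, not the whole row.
        key = tuple(new_row[c] for c in upi_columns)
        return all(tuple(r[c] for c in upi_columns) != key for r in existing_rows)
    if table_kind == "SET":
        # Without a UPI, a SET table compares the entire row.
        return new_row not in existing_rows
    return True  # MULTISET without a UPI: no duplicate checking at all


rows = [{"emp_no": 1, "last_name": "Adams"}]
duplicate = {"emp_no": 1, "last_name": "Adams"}
same_key = {"emp_no": 1, "last_name": "Baker"}

print(can_insert(rows, duplicate, "MULTISET"))                      # True
print(can_insert(rows, duplicate, "SET"))                           # False
print(can_insert(rows, same_key, "SET"))                            # True
print(can_insert(rows, same_key, "SET", upi_columns=("emp_no",)))   # False
```

The last two calls show why the index check can replace the row check: two rows with the same unique index value are rejected whether or not their other columns differ.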
Hashing Process
1.
2.
3.
4.
5.
6. The row data is sent over the BYNET to the PE, and the PE sends the
answer set on to the client application.
Use in value access: Retrievals, updates, and deletes that specify the
Primary Index are much faster than those that do not. Because a Primary
Index is a known access path to the data, it is best to choose column(s)
that will be frequently used for access. For example, the following SQL
statement would directly access a row based on the equality WHERE
clause:
SELECT * FROM employee WHERE employee_ID = 'ABC456789';
A NUPI may be a better choice if access is usually based on a different, mostly
unique column. For example, the Mail Room may use the table to track package
delivery. In that case, a column containing room numbers or mail stops may not
be unique if employees share offices, but it is still a better choice for access.
Use in join access: SQL requests that join tables perform
best when the join is done on a Primary Index. Consider Primary Key
and Foreign Key columns as potential candidates for Primary Indexes.
For example, if the Employee table and the Payroll table are related by
the Employee ID column, then the Employee ID column could be a good
Primary Index choice for one or both of the tables.
Non-volatile values: Look for columns where the values do not change
frequently. For example, in an Invoicing table, the outstanding balance
column for all customers probably has few duplicates, but probably
changes too frequently to make a good Primary Index. A customer ID,
statement number, or other more stable columns may be better choices.
When choosing a Primary Index, try to find the column(s) that best fit these
criteria and the business need.
Questions
What do you think are key considerations in choosing a Primary Index? (Choose three.)
A. Column(s) containing unique (or nearly unique) values for uniform distribution.
B. Column(s) with values in sequential order for best load and access performance.
C. Column(s) frequently used in queries to access data or to join tables.
D. Column(s) with values that are stable (do not change frequently), to minimize
redistribution of table rows.
With a Partitioned Primary Index (PPI), the order in which the rows are stored
on the AMP is affected. With the traditional Non-Partitioned Primary Index
(NPPI), the rows are stored in row hash order.
4 AMPs with Orders Table Defined with NPPI
Using PPI, the rows are stored first by partition and then by row hash. In our
example, there are four partitions. Within the partitions, the rows are stored in
row hash order.
4 AMPs with Orders Table Defined with PPI on O_Date
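The two storage orders can be compared with a small sorting sketch for the rows on one AMP. The row hash values, the order dates, and the choice of one partition per calendar month are illustrative assumptions, not taken from the figures.

```python
from datetime import date

# Hypothetical Orders rows held by a single AMP.
rows = [
    {"o_number": 1, "o_date": date(2024, 1, 5), "row_hash": 40},
    {"o_number": 2, "o_date": date(2024, 2, 9), "row_hash": 10},
    {"o_number": 3, "o_date": date(2024, 1, 20), "row_hash": 30},
    {"o_number": 4, "o_date": date(2024, 2, 14), "row_hash": 20},
]


def partition_number(o_date):
    """Assumed partitioning rule: one partition per calendar month."""
    return o_date.month


# NPPI: rows are ordered by row hash only.
nppi_order = sorted(rows, key=lambda r: r["row_hash"])

# PPI: rows are ordered by partition first, then by row hash within the partition.
ppi_order = sorted(rows, key=lambda r: (partition_number(r["o_date"]), r["row_hash"]))

print([r["o_number"] for r in nppi_order])  # [2, 4, 3, 1]
print([r["o_number"] for r in ppi_order])   # [3, 1, 2, 4]
```

In the PPI ordering, all January orders sit together ahead of all February orders, while the NPPI ordering interleaves the two months.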
Unlike Primary Indexes, Secondary Indexes are stored in separate subtables.
The subtables take extra disk space and must be maintained whenever base rows
change. The system handles this maintenance automatically, but Secondary
Indexes still consume some system resources.
Question
In what instances would it be a good idea to define a secondary index for a table? (This
information will be covered in this module, but here is a preview.)
1. The Primary Index exists for even data distribution and data access, but a
Secondary Index is defined to efficiently generate monthly reports based on a
different set of columns.
2. The Product table is accessed by the retailer (who accesses data based on the
retailer's product code column), and by a vendor (who accesses the same data based
on the vendor's product code column).
3. The table already has a Unique Primary Index, but a second column must also
have unique values. The column is specified as a Unique Secondary Index (USI)
to enforce uniqueness on the second column.
4. All of the above.
Rule 1: Optional SI
While a Primary Index is required, a Secondary Index is optional. If one path to
the data is sufficient, no Secondary Index need be defined.
You can define 0 to 32 Secondary Indexes on a table for multiple data access
paths. Different groups of users may want to access the data in various ways.
You can define a Secondary Index for each heavily used access path.
As with the Primary Index, the Secondary Index column may contain NULL
values.
When a user submits an SQL request using the table name and a Unique
Secondary Index, the request becomes a one- or two-AMP operation, as explained
below.
USI Access
1. The SQL is submitted, specifying a USI (in this case, a customer number
of 56).
2. The hashing algorithm calculates a row hash value (in this case, 602).
3. The hash map points to the AMP containing the subtable row
corresponding to the row hash value (in this case, AMP 2).
4. The subtable indicates where the base row resides (in this case, row 778 on
AMP 4).
5. The message goes back over the BYNET to the AMP with the row and the
AMP accesses the data row (in this case, AMP 4).
6. The row is sent over the BYNET to the PE, and the PE sends the answer
set on to the client application.
As shown in the example above, accessing data with a USI is typically a two-AMP
operation. However, because the subtable row and the base table row are hashed
separately, both could end up on the same AMP. In that case, the USI request
would be a one-AMP operation.
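The USI access path can be modelled as two hash lookups. This is a sketch under stated assumptions: CRC32 stands in for the hashing algorithm, the subtable stores the Primary Index value as a stand-in for the base row's Row ID, and the `System` class, column names, and sample rows are hypothetical.

```python
import zlib

NUM_AMPS = 4  # hypothetical system size


def amp_of(value):
    """Stand-in for hashing algorithm plus hash map: value -> AMP number."""
    return zlib.crc32(repr(value).encode("utf-8")) % NUM_AMPS


class System:
    def __init__(self):
        self.base = [dict() for _ in range(NUM_AMPS)]          # PI value -> row
        self.usi_subtable = [dict() for _ in range(NUM_AMPS)]  # USI value -> PI value

    def insert(self, row, pi_col, usi_col):
        # The base row is placed by hashing the PI value,
        # the subtable row by hashing the USI value.
        self.base[amp_of(row[pi_col])][row[pi_col]] = row
        self.usi_subtable[amp_of(row[usi_col])][row[usi_col]] = row[pi_col]

    def select_by_usi(self, usi_value):
        """Return (row, set of AMPs touched)."""
        index_amp = amp_of(usi_value)                     # first AMP: subtable row
        pi_value = self.usi_subtable[index_amp].get(usi_value)
        if pi_value is None:
            return None, {index_amp}
        base_amp = amp_of(pi_value)                       # second AMP: base row
        return self.base[base_amp][pi_value], {index_amp, base_amp}


system = System()
system.insert({"emp_no": 778, "cust_no": 56, "name": "Adams"}, "emp_no", "cust_no")
system.insert({"emp_no": 779, "cust_no": 57, "name": "Baker"}, "emp_no", "cust_no")

row, amps_touched = system.select_by_usi(56)
print(row["name"], "found using", len(amps_touched), "AMP(s)")
```

The set of AMPs touched has one member when both hashes happen to select the same AMP and two members otherwise; it never grows with the size of the system.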
When a user submits an SQL request using the table name and a Non-Unique
Secondary Index, the request becomes an all-AMP operation, as explained
below.
NUSI Access
1. The SQL is submitted, specifying a NUSI (in this case, a last name of
"Adams").
2. The hashing algorithm calculates a row hash value for the NUSI (in this
case, 567).
3. All AMPs are activated to find the hash value of the NUSI in their index
subtables. The AMPs whose subtables contain that value become the
participating AMPs in this request (in this case, AMP1 and AMP2). The
other AMPs discard the message.
4. Each participating AMP locates the row IDs (row hash value plus
uniqueness value) of the base rows corresponding to the hash value (in
this case, the base rows corresponding to hash value 567 are 640, 222,
and 115).
5. The participating AMPs access the base table rows, which are located on
the same AMP as the NUSI subtable (in this case, one row from AMP 1
and two rows from AMP 2).
6. The qualifying rows are sent over the BYNET to the PE, and the PE
sends the answer set on to the client application (in this case, three
rows).
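The NUSI steps above can be sketched as a broadcast to every AMP, each of which consults its own local subtable. The `Amp` class is hypothetical, the rows are placed on AMPs by hand to mirror the example, and the row IDs are reduced to single numbers for brevity.

```python
from collections import defaultdict


class Amp:
    """Hypothetical AMP holding base rows plus a NUSI subtable for those same rows."""

    def __init__(self, number):
        self.number = number
        self.base_rows = {}                     # row ID -> row
        self.nusi_subtable = defaultdict(list)  # NUSI value -> row IDs on this AMP

    def store(self, row_id, row, nusi_column):
        self.base_rows[row_id] = row
        self.nusi_subtable[row[nusi_column]].append(row_id)

    def lookup(self, nusi_value):
        # Each AMP searches only its own subtable and reads only its own base rows.
        row_ids = self.nusi_subtable.get(nusi_value, [])
        return [self.base_rows[row_id] for row_id in row_ids]


def select_by_nusi(amps, nusi_value):
    """All-AMP operation: every AMP is asked, only some participate."""
    participating, answer_set = [], []
    for amp in amps:
        rows = amp.lookup(nusi_value)
        if rows:
            participating.append(amp.number)
            answer_set.extend(rows)
    return participating, answer_set


amps = [Amp(n) for n in range(1, 5)]
amps[0].store(640, {"last_name": "Adams", "first_name": "Ann"}, "last_name")
amps[1].store(222, {"last_name": "Adams", "first_name": "Bob"}, "last_name")
amps[1].store(115, {"last_name": "Adams", "first_name": "Cy"}, "last_name")
amps[2].store(301, {"last_name": "Smith", "first_name": "Di"}, "last_name")

participating, answer_set = select_by_nusi(amps, "Adams")
print(participating, len(answer_set))  # [1, 2] 3
```

Because each AMP indexes only the base rows it stores, no second hop between AMPs is needed once a participating AMP has found its subtable entry.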
For all requests, you must specify a value for each column in the index or
Teradata will do a full table scan. A full table scan is an all-AMP operation, and
each data row is accessed only once. As long as the choice of Primary Index has
caused the table rows to distribute evenly across all of the AMPs, the parallel
processing of the AMPs working simultaneously can accomplish the full table
scan quickly.
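The rule above can be sketched as a simple check on the WHERE clause, followed by a scan in which every AMP reads its own rows once. The function names, the two-column index on last name and employee number, and the sample rows are hypothetical. Real access-path selection is performed by the Teradata optimizer and is not modelled here.

```python
def choose_access_path(index_columns, equality_values):
    """Index access needs a value for every index column; otherwise scan."""
    missing = [c for c in index_columns if c not in equality_values]
    return "index access" if not missing else "full table scan"


def full_table_scan(amps, predicate):
    """amps: one list of rows per AMP. Each AMP reads each of its rows once."""
    rows_read_per_amp = [len(rows) for rows in amps]
    matches = [row for rows in amps for row in rows if predicate(row)]
    return matches, rows_read_per_amp


index_columns = ("last_name", "emp_no")
full_key = choose_access_path(index_columns, {"last_name": "Adams", "emp_no": 7})
partial_key = choose_access_path(index_columns, {"last_name": "Adams"})
print(full_key, "/", partial_key)  # index access / full table scan

amps = [
    [{"last_name": "Adams", "emp_no": 7}, {"last_name": "Baker", "emp_no": 8}],
    [{"last_name": "Adams", "emp_no": 9}, {"last_name": "Cole", "emp_no": 10}],
]
matches, rows_read_per_amp = full_table_scan(amps, lambda r: r["last_name"] == "Adams")
# With AMPs working in parallel, elapsed time tracks the busiest AMP.
print(len(matches), rows_read_per_amp, max(rows_read_per_amp))  # 2 [2, 2] 2
```

A partial index value cannot be hashed to a single AMP, which is why the request falls back to reading every row.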
While full table scans are impractical and even disallowed on some commercial
database systems, Teradata routinely permits ad hoc queries with full table
scans.
Keys vs. Indexes
While most commercial database systems use the Primary Key as a way to
retrieve data, a Teradata system does not. In a Teradata system, you use the
Primary Key only when designing a database, as a mechanism for maintaining
referential integrity according to relational theory. The Teradata RDBMS itself
does not require keys in order to manage the data, and can function fully with no
awareness of Primary Keys.
The Teradata parallel architecture uses Primary Indexes to distribute and access
the data rows. A Primary Index is always required when creating a Teradata
table.
A Primary Index may include the same columns as the Primary Key, but does
not have to. In some cases, you may want the Primary Key and Primary Index to
be different. For example, a credit card account number may be a good Primary
Key, but customers may prefer to use a different kind of identification to access
their accounts.
Primary Key              | Foreign Key                              | Primary Index                                      | Secondary Index
One PK                   | Multiple FKs                             | One PI                                             | 0 to 32 SIs
Unique values            | Unique or nonunique                      | Unique or nonunique                                | Unique or nonunique
No NULLs                 | NULLs allowed                            | NULLs allowed                                      | NULLs allowed
                         |                                          | Values may be changed (redistributes row)          | Values may be changed
Column should not change | Column may change                        | Column cannot be changed (drop and recreate table) | Index may be changed (drop and recreate index)
No column limit          | No column limit                          | 16-column limit                                    | 16-column limit
n/a                      | FK must exist as PK in the related table | n/a                                                | n/a
Unique Primary Index (If the DBA did not specify the Primary Index
in the CREATE TABLE statement.)
Unique Secondary Index (If columns other than the Primary Index are
chosen)