Presented by : S ROY
In this case study we take up integer stream processing on the GPU. With the new GeForce 8 Series GPU, several new extensions and functions have been introduced to GPU programming. The new integer processing features include not only arithmetic operations but also bitwise logical operations (such as AND and OR) and right/left shift operations. Array parameters and the new texture-buffer object provide a flexible way of referring to integer-indexed tables.
Contd.
With the new "transform feedback mode," it is now possible to store our results without the need to render to textures or pixel buffers. Several block-cipher modes of operation are also considered here.
To use transform feedback, the GL target parameter is changed to GL_TRANSFORM_FEEDBACK_BUFFER_NV. We need to specify the output attributes and whether each of them is output into a separate buffer object or they are all output interleaved into a single buffer object. The output buffer must be bound through special new API calls. Rasterization can also optionally be disabled.
Two features are used. First, when declaring a register, we can either specify its type, such as FLOAT or INT, or leave it typeless. Second, we can refer to tables using an integer index, through array parameters or one of the newly introduced texture-buffer objects.
Contd.
The AES algorithm is currently the standard block-cipher algorithm, having replaced the Data Encryption Standard (DES). A rough summary of the requirements made by NIST for the new AES is the following:
Symmetric-key cipher; block cipher; support for 128-bit block sizes; support for 128-, 192-, and 256-bit key lengths.
The AES cipher operates as follows:
Contd.
The encryption step uses a key to convert the data into unreadable ciphertext, and the decryption step then uses the same key to convert the ciphertext back into the original data. This type of key is a symmetric key; other algorithms require different keys for encryption and decryption.
Contd.
The precise steps involved in the algorithm: in cryptography, algorithms such as AES are called product ciphers. For this class of ciphers, encryption is done in rounds, where each round's processing is accomplished using the same logic.
Contd.
These product ciphers, including AES, change the cipher key at each round. The round keys are determined by a key schedule, which is generated from the cipher key given by the user.
The code given throughout this chapter uses C-style macros and comments to improve readability.
Head of the AES Cipher Vertex Program
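As an illustration of how a key schedule is generated from the user's cipher key, here is a minimal CPU-side sketch in Python for the 128-bit key case. It is not the code from the slides: the function names (`gf_mul`, `build_sbox`, `expand_key_128`) are hypothetical, and the S-box is computed at start-up rather than stored as a table, purely to keep the example self-contained.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Derive the AES S-box: field inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        # Brute-force multiplicative inverse; 0 maps to 0 by convention.
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def expand_key_128(key):
    """Expand a 16-byte cipher key into 11 round keys of 16 bytes each."""
    if len(key) != 16:
        raise ValueError("this sketch only handles 128-bit keys")
    words = [list(key[i:i + 4]) for i in range(0, 16, 4)]
    rcon = 1
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]          # RotWord
            temp = [SBOX[b] for b in temp]      # SubWord
            temp[0] ^= rcon                     # add the round constant
            rcon = gf_mul(rcon, 2)
        words.append([a ^ b for a, b in zip(words[i - 4], temp)])
    return [bytes(sum(words[r * 4:r * 4 + 4], [])) for r in range(11)]
```

Calling `expand_key_128(key)` returns a list whose first entry is the cipher key itself and whose remaining ten entries are the per-round keys.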
Contd.
In this application we expand the cipher key using the CPU and store the key schedule in the GPU program-local parameters.
AES encryption operates over a two-dimensional array of bytes, called the state.
During the input step, we slice our data into sequential blocks of 16 bytes and unpack it into 4x4 arrays that we push onto the GPU's registers.
Finally, during the output step, we pack these 4x4 arrays back into sequential blocks of 16 bytes and stream the results back to the transform feedback buffer.
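The input and output steps can be sketched on the CPU as follows. This is only an illustration in Python, not the GPU program from the slides; the helper names are hypothetical, and the column-major layout (byte `r + 4*c` goes to row `r`, column `c`) is assumed from the AES specification rather than from the presentation.

```python
def unpack_blocks(data):
    """Slice data into 16-byte blocks and unpack each into a 4x4 state."""
    if len(data) % 16 != 0:
        raise ValueError("data length must be a multiple of 16 bytes")
    states = []
    for offset in range(0, len(data), 16):
        block = data[offset:offset + 16]
        # state[r][c] holds byte r + 4*c of the block (column-major).
        states.append([[block[r + 4 * c] for c in range(4)] for r in range(4)])
    return states


def pack_blocks(states):
    """Pack 4x4 states back into a flat byte string, column by column."""
    out = bytearray()
    for state in states:
        for c in range(4):
            for r in range(4):
                out.append(state[r][c])
    return bytes(out)
```

Packing the result of `unpack_blocks` reproduces the original byte string, which is the property the output step relies on.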
Contd.
Initialization
During the initialization stage, we do an AddRoundKey operation, which is an XOR operation on the state by the round key, as determined by the key schedule.
Rounds
A round for the AES algorithm consists of four operations: the SubBytes operation, the ShiftRows operation, the MixColumns operation, and the previously mentioned AddRoundKey operation.
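Because AddRoundKey is a plain byte-wise XOR, it is short to sketch. The Python function below is an illustration only (the name `add_round_key` is hypothetical) and treats the state and the round key as flat 16-byte strings.

```python
def add_round_key(state, round_key):
    """XOR a 16-byte state with a 16-byte round key."""
    if len(state) != 16 or len(round_key) != 16:
        raise ValueError("state and round key must both be 16 bytes")
    return bytes(s ^ k for s, k in zip(state, round_key))
```

Since XOR is its own inverse, applying the same round key twice returns the original state, which is why decryption can reuse this operation unchanged.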
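Of the four operations, ShiftRows is the simplest to show in isolation: row r of the state is rotated left by r positions. The Python sketch below is illustrative rather than taken from the slides; the name `shift_rows` is hypothetical, and the state is assumed to be a flat 16-byte block in column-major order, so row `r`, column `c` lives at index `r + 4*c`.

```python
def shift_rows(state):
    """Rotate row r of a column-major 16-byte state left by r positions."""
    if len(state) != 16:
        raise ValueError("state must be 16 bytes")
    out = bytearray(16)
    for r in range(4):
        for c in range(4):
            # The byte landing in column c comes from column (c + r) mod 4.
            out[r + 4 * c] = state[r + 4 * ((c + r) % 4)]
    return bytes(out)
```

Row 0 is left untouched, so the first byte of every column stays in place.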
Contd.
The MixColumns Operation
The next step is the MixColumns operation, which has the purpose of scrambling the data of each column.
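To make the column scrambling concrete, here is a minimal Python sketch of MixColumns using only shifts and XORs, the kind of integer operations the new GPU features expose. It is not the presentation's GPU code; `xtime`, `mix_column`, and `mix_columns` are hypothetical names, and the state is assumed to be a flat 16-byte block with each column stored as four consecutive bytes.

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8) modulo the AES polynomial."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


def mix_column(col):
    """Multiply one 4-byte column by the fixed AES matrix (2 3 1 1 / ...)."""
    a0, a1, a2, a3 = col
    return bytes([
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ])


def mix_columns(state):
    """Apply mix_column to each of the four columns of a 16-byte state."""
    if len(state) != 16:
        raise ValueError("state must be 16 bytes")
    return b"".join(mix_column(state[i:i + 4]) for i in range(0, 16, 4))
```

Multiplication by 3 is written as `xtime(x) ^ x`, so the whole operation needs no lookup table.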
Contd.
The AddRoundKey Operation
This operation determines the current round key from the key schedule. As an optimization, we can also combine the MixColumns and AddRoundKey operations into a single subroutine.
Performance
Tests were performed on a test machine with the following specifications:
CPU: Pentium 4, 3 GHz, 2 MB Level 2 cache
Memory: 1 GB
Video: GeForce 8800 GTS 640 MB
System: Linux 2.6, Driver 97.46
Results were obtained by processing a plaintext of 128 MB filled with random numbers and averaging measurements from ten runs. The throughput for the vertex program is 53 MB/sec, whereas for the fragment program the throughput is 95 MB/sec with a batch size of 1 MB.