
AES Implementation in Verilog

ABHISHEK KONERU Siddharth Chatterjee Sushobhan Nayak

February 18, 2011

Original Encryption Algorithm and its Modification

In our implementation of the AES cipher, we have essentially followed the approach of [2]. In the write-up below, we first state the algorithm described in [1] and then present the equivalent algorithm from [2]. We then describe the efficient pipelined use of the modified algorithm for maximum throughput with minimum use of space. The following figure presents the pseudocode of both algorithms. In what follows, we describe the Verilog module used to implement each function in the modified algorithm.

Figure 1: Comparison of both algos

1.1 State representation

The data to be encrypted and the key to be used are both 128-bit streams. [1] represents each of them as a 4-by-4 matrix of bytes, as shown below. In [1], an input stream is arranged column-wise (see Figure 2). However, our implementation arranges the state row-wise, so a row2column() module is employed to perform this conversion before the key and the data are processed.

Figure 2: Comparison of representation
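The difference between the two layouts can be illustrated with a small software model. This is only a sketch: the function names are hypothetical, and it assumes that the conversion performed by row2column() amounts to a transpose of the 4-by-4 byte matrix, which the text above suggests but does not state explicitly.

```python
def column_major_state(block):
    """Layout of [1]: byte i of the input goes to row i % 4, column i // 4."""
    if len(block) != 16:
        raise ValueError("expected a 16-byte (128-bit) block")
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def row_major_state(block):
    """Row-wise layout: byte i of the input goes to row i // 4, column i % 4."""
    if len(block) != 16:
        raise ValueError("expected a 16-byte (128-bit) block")
    return [[block[4 * r + c] for c in range(4)] for r in range(4)]


def transpose_state(state):
    """Assumed behaviour of the conversion: swap rows and columns."""
    return [[state[c][r] for c in range(4)] for r in range(4)]


block = bytes(range(16))
print(column_major_state(block)[0])  # first row in the layout of [1]
print(row_major_state(block)[0])     # first row in the row-wise layout
```

Transposing one layout yields the other, so the conversion costs only wiring in hardware.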

1.2 AddRoundKey()

This function is a simple bit-wise XOR of the round key and the current state, so it has been implemented directly in the main AES module.

1.3 SubByte() & ShiftRow()

Both functions have been implemented in a single module. As described in Section 5.1.1 of [1], SubByte() can essentially be implemented through a look-up table. In our design, module sbox() achieves this through a combinatorial circuit. ShiftRows() is essentially a cyclic shift of the rows of the state matrix, so both functions have been implemented in a single module, subRow(), following Figure 3.

Figure 3: Row operation
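As a rough software model of the combined substitution and row shift, the sketch below derives the S-box from its definition in Section 5.1.1 of [1] (multiplicative inverse in GF(2^8) followed by the affine transformation) instead of hard-coding the table. The helper names gf_mul, sbox_entry and sub_row are hypothetical, and the state is assumed to be stored as a list of four rows.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return result


def rotl8(x, n):
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_entry(a):
    """Inverse in GF(2^8) (0 maps to 0), then the affine transformation."""
    inv = next((x for x in range(1, 256) if gf_mul(a, x) == 1), 0)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


SBOX = [sbox_entry(a) for a in range(256)]


def sub_row(state):
    """Substitute every byte and rotate row r left by r positions in one pass."""
    return [[SBOX[state[r][(c + r) % 4]] for c in range(4)] for r in range(4)]
```

In hardware the row rotation is pure wiring, which is why folding it into the same module as the S-box look-up costs nothing extra.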

1.4 MixColumns()

This function has been implemented with the help of module mixColumns(), which encompasses modules mult() and xtime(). Please refer to Figures 3, 4 and 5 of [2] for the design diagram. xtime() implements the xtime() function defined in [1], Section 4.2.1. As the transformation matrix for this column operation multiplies each element of the state matrix by 01, 02 or 03 in $GF(2^8)$, following [2] we implemented a combinatorial circuit, mult(), which takes an input byte and, based on a select signal, outputs the multiplied result. It is easy to see that Figure 3 in [2] corresponds to the following set of equations from [1]. As described in Section 4.2.1 of [1], multiplication by any constant can be expressed through xtime() and XOR operations. So the invMux() module used in decryption, which uses the constants 0e, 0d, 09 and 0b, can also be easily generated.

Figure 4: Column operation
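The column operation can be modelled in a few lines. The sketch below is a software illustration, not the Verilog design itself: mult() takes the multiplier (1, 2 or 3) as its select input, which is an assumption about how the select signal is encoded, and mix_column() is a hypothetical name for the per-column computation.

```python
def xtime(b):
    """Multiply a byte by {02} in GF(2^8), as in Section 4.2.1 of [1]."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B  # reduce modulo x^8 + x^4 + x^3 + x + 1
    return b


def mult(b, sel):
    """Return b multiplied by the selected constant: 1, 2 or 3."""
    if sel == 1:
        return b
    if sel == 2:
        return xtime(b)
    if sel == 3:
        return xtime(b) ^ b
    raise ValueError("select must be 1, 2 or 3")


# Rows of the MixColumns() matrix from [1].
MIX = [[2, 3, 1, 1],
       [1, 2, 3, 1],
       [1, 1, 2, 3],
       [3, 1, 1, 2]]


def mix_column(col):
    """Apply the MixColumns() matrix to one 4-byte column."""
    out = []
    for row in MIX:
        acc = 0
        for coef, byte in zip(row, col):
            acc ^= mult(byte, coef)
        out.append(acc)
    return out


print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

The xtime() values for {57} and {ae} given in Section 4.2.1 of [1] make a convenient check of the reduction step.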

1.5 Inverse Cipher

The equivalent inverse cipher described in Section 5.3.5 of [1] follows exactly the same procedure for decryption, albeit with each function replaced by its inverse. So the inverse of each function was implemented, and invMixColumn() was applied to the key schedule as described in [1].

Design Decisions and Implementation Issues

For details of the pipelined implementation, please refer to Section III of [2]. To avoid repetition, we do not describe them again and focus only on the design issues we faced:

2.1 Pipelining

We considered two possible ways of pipelining. One is to use one pipeline stage for each of the 11 stages of the encryption process, thereby increasing the throughput. The other is the 3-stage pipeline described in [2]. An 11-stage pipeline would obviously have 3 to 4 times the throughput of a 3-stage one, but it requires eleven 128-bit registers, which takes up too much space and risks exceeding the space available in the FPGA. So we stuck with the implementation of the paper. The following are the design issues we faced:

As described in the following subsection, we run the key scheduler first to generate the whole 1408 bits needed for the entire process. Once the schedule is ready, it sets signal keyGen low, signalling the control unit that it can go ahead with the encryption process. The control signal has three output bits: control[0] decides whether new data should be let into the pipeline or the next stage of the old data is to be selected; control[1] is set high when the data is ready at the output; control[2] chooses between the output of the MixColumn() function and the data not processed through it, which is crucial in the last stage of encryption of a 128-bit data stream.

So, when keyGen is set low, control[0] is set high and is held for the next three clock cycles to allow three new 128-bit data streams to enter the pipeline. Once, say, data1 is in, it takes 3 clock cycles to return to Register0. So, after the 30th clock cycle, the output for data1 is ready. On the 30th cycle, control[1] is set high, so that we can get the output in the 31st clock cycle. It is held for the next two cycles to allow the next two data streams to be output. So, in 33 cycles, we get 3 outputs, leading to a throughput of $T_p = \frac{3 \times 128}{33 \times \text{clockCycle}}$, which in our case, with a minimum clock period of 4.97 ns, amounts to 2.34 Gbps.

At each stage, the key schedule to be used has to be decided. Since there are 11 stages, we used a 4-bit stateInfo to keep track of the stage of encryption for a single 128-bit data block. When new data enters the pipeline, the last four bits of Register0, which we have implemented as a 132-bit register, are set to zero to indicate that we are in the first stage. This value is incremented each time the output of the AddRoundKey() function is put in Register1, so that when the same data returns to Register0 after processing, we can choose the next key schedule just by looking at those 4 bits.
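The throughput figure and the 132-bit register layout can be checked with a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, and "the last four bits" of Register0 is taken to mean the four least significant bits.

```python
def throughput_bps(blocks, block_bits, cycles, clock_period_s):
    """Bits produced per second when `blocks` outputs appear every `cycles` clocks."""
    return blocks * block_bits / (cycles * clock_period_s)


def pack_register0(data128, state_info):
    """Model of the 132-bit register: 128 data bits followed by 4 stage bits."""
    return (data128 << 4) | (state_info & 0xF)


def unpack_register0(reg132):
    """Split the register back into the data block and the stage counter."""
    return reg132 >> 4, reg132 & 0xF


tp = throughput_bps(blocks=3, block_bits=128, cycles=33, clock_period_s=4.97e-9)
print(f"{tp / 1e9:.2f} Gbps")

# A block that has passed through AddRoundKey() once carries stage counter 1,
# which then selects the next 128-bit slice of the stored key schedule.
data, stage = unpack_register0(pack_register0(0x1234, 1))
print(hex(data), stage)
```

Because the stage counter travels with the data, no separate bookkeeping is needed to match the three blocks in flight with their round keys.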

2.2 keyExpansion() Routine

Based on speed and space considerations, there are several ways to go about performing key expansion. The encryption process takes 11 key matrices, amounting to a total of 1408 bits. A purely combinatorial implementation would use too many XOR gates (around 1200). We considered three sequential key generation techniques:

Sequential/Dynamic: As only 128 bits are used at each stage of the encryption process, the key schedule for each stage can be generated dynamically. The implementation is shown in Figure 5. With each clock cycle, the previous value is loaded into the register. Another 128-bit register, R0, has to be maintained to hold the original key so that when the present cycle of encryption ends, the original key can be loaded for the next run. As is evident from our pipelined design, the key for stage i has to be held for three consecutive clock cycles (as there are three stages in the pipeline, the same key acts on three different inputs in three clock cycles), and we employed a counter to achieve this. When one encryption cycle ends, new data is loaded conditioned on the control[0] signal; the same signal is used to load the key with the contents of R0, enabling us to synchronize data and key selection.

Figure 5: Row operation

Sequential/Static: While the above method is very efficient in terms of space and speed, it has the obvious disadvantage that it cannot be used for the decryption process. Also, assuming only one key schedule is to be used for a single encryption process on any file, and if processor/memory space is not much of a concern, it is wasteful of energy to regenerate the key schedule throughout the entire process when it can easily be generated in one run and then saved for use in subsequent stages. So we also implemented this approach, as we have implemented the decryption process too. The process is started with a reset signal, whereupon the key scheduler runs and generates a 1408-bit key schedule to be used in the 11 stages of one encryption of a 128-bit data input. Once the key is generated, the module outputs a keyGen signal, giving the control module the go-ahead to start the encryption process by setting control[0] to one to accept data.

Sequential/RAM: If space is very much an issue and we need to implement the decryption process nonetheless, and if we are willing to compromise on speed and throughput, we can run the Sequential/Dynamic scheduler and then use the block RAM in the FPGA (which can read/write 128 bits in a single clock cycle) to store the schedule for further use.
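The step shared by all three techniques is deriving one 128-bit round key from the previous one. The sketch below models that step in software using the key expansion of [1], then stores the 11 keys as the Sequential/Static approach does. Function names are hypothetical, keys are held as four 32-bit words, and the S-box is computed from its definition to keep the example self-contained.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a = ((a << 1) ^ 0x11B) if a & 0x80 else (a << 1)
        b >>= 1
    return result


def build_sbox():
    """S-box of [1]: inverse in GF(2^8), then the affine transformation."""
    sbox = []
    for a in range(256):
        inv = next((x for x in range(1, 256) if gf_mul(a, x) == 1), 0)
        s = inv
        for n in range(1, 5):
            s ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def sub_word(w):
    """Apply the S-box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")


def rot_word(w):
    """Rotate a 32-bit word left by one byte."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def next_round_key(prev, rcon):
    """Derive the next round key (four words) from the previous one."""
    w0 = prev[0] ^ sub_word(rot_word(prev[3])) ^ (rcon << 24)
    w1 = prev[1] ^ w0
    w2 = prev[2] ^ w1
    w3 = prev[3] ^ w2
    return [w0, w1, w2, w3]


def key_schedule(key):
    """Generate all 11 round keys once, as in the Sequential/Static scheme."""
    keys = [list(key)]
    rcon = 0x01
    for _ in range(10):
        keys.append(next_round_key(keys[-1], rcon))
        rcon = gf_mul(rcon, 0x02)
    return keys


# Example key from Appendix A.1 of [1].
schedule = key_schedule([0x2B7E1516, 0x28AED2A6, 0xABF71588, 0x09CF4F3C])
print(" ".join(f"{w:08x}" for w in schedule[1]))
```

Eleven keys of 128 bits each give the 1408 bits mentioned above. The Sequential/Dynamic variant would keep only the latest output of next_round_key() in a register instead of the whole list.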

2.3 I/O module

The I/O module we used is shown in Figure 8. It is intended for a generic device containing a memory that can read or write 32 bits in a single clock cycle. A 128-bit block is read 32 bits at a time and stored in the first register in 4 clock cycles, controlled by InCounter. Three registers are used so that when the pipeline needs three consecutive inputs, they can be provided. It takes 4 × 3 = 12 clock cycles to fill up the three registers. Given that the pipeline requires data every 33 clock cycles, there is more than enough time to keep the data ready when the need arises. The same holds for the output.
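The input buffering arithmetic can be sketched as follows. The names are hypothetical, and the word order (first word read becomes the most significant 32 bits) is an assumption that the text does not specify.

```python
WORD_BITS = 32
BLOCK_BITS = 128
WORDS_PER_BLOCK = BLOCK_BITS // WORD_BITS  # 4 memory reads per block


def assemble_block(words):
    """Shift four 32-bit words into one 128-bit value, one word per clock cycle."""
    if len(words) != WORDS_PER_BLOCK:
        raise ValueError("expected exactly four 32-bit words")
    block = 0
    for word in words:
        block = (block << WORD_BITS) | (word & 0xFFFFFFFF)
    return block


def fill_cycles(num_registers=3):
    """Clock cycles needed to fill all input registers from a 32-bit memory."""
    return WORDS_PER_BLOCK * num_registers


block = assemble_block([0x00112233, 0x44556677, 0x8899AABB, 0xCCDDEEFF])
print(f"{block:032x}", fill_cycles())
```

With 12 cycles needed per refill and 33 cycles between requests from the pipeline, the buffer has 21 cycles of slack.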

2.4 Sbox

[2] uses the ROM in the FPGA to implement the Sbox look-up table. While this reduces the number of FPGA slices used, it also brings down the speed that a combinatorial implementation would otherwise achieve. We decided to go for higher speed and hence implemented a combinatorial module of 16 Sboxes.

Decryption routine

It is easy to notice that the equivalent inverse cipher described in Section 5.3.5 of [1] mirrors the operations of the encryption process we have implemented. So it can be used in a very similar way, except that we replace each function with its inverse and introduce a minor change in key lookup. The issues are:

AddRoundKey() is its own inverse, as it is simply an XOR operation.

InvSubByte() and InvSubRow() are easily reproduced by defining an inverse Sbox, which is a look-up table, and rotating the bytes in Register0 appropriately.

Figure 6: Row operation

InvMixColumn() can be easily implemented if we recognize that any multiplication by a constant in $GF(2^8)$ can be implemented through xtime() and XOR operations. The inverse matrix and the generated numbers are shown in Figure 6. We use the following equations, where $s$ is one byte long and belongs to $GF(2^8)$:

$$\{02\} \bullet s = \mathrm{xtime}(s)$$
$$\{04\} \bullet s = \mathrm{xtime}(\{02\} \bullet s)$$
$$\{08\} \bullet s = \mathrm{xtime}(\{04\} \bullet s)$$
$$\{0e\} \bullet s = (\{02\} \oplus \{04\} \oplus \{08\}) \bullet s = (\{02\} \bullet s) \oplus (\{04\} \bullet s) \oplus (\{08\} \bullet s)$$
$$\{0d\} \bullet s = (\{01\} \oplus \{04\} \oplus \{08\}) \bullet s = s \oplus (\{04\} \bullet s) \oplus (\{08\} \bullet s)$$
$$\{0b\} \bullet s = (\{01\} \oplus \{02\} \oplus \{08\}) \bullet s = s \oplus (\{02\} \bullet s) \oplus (\{08\} \bullet s)$$
$$\{09\} \bullet s = (\{01\} \oplus \{08\}) \bullet s = s \oplus (\{08\} \bullet s)$$

So, following the encryption process, the invMult() unit is shown in Figure 7. This module replaces the mult() module in MixColumns() to produce InvMixColumn(), with proper care taken for the enable signals.

Figure 7: Row operation

The key schedules used are taken from the generated key array. We look at stateInfo, select the appropriate key, pass it through InvMixColumn() and then, conditioned on stateInfo, pass on either the original key or the output of InvMixColumn() (the first and last stages use the unprocessed key).
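The xtime()-based products used for InvMixColumn() above can be modelled directly. The sketch below is a software illustration in which inv_mult() takes the constant itself as its select input; that encoding, and the name inv_mix_column(), are assumptions made for the example.

```python
def xtime(b):
    """Multiply a byte by {02} in GF(2^8)."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


def inv_mult(s, coef):
    """Multiply s by {0e}, {0d}, {0b} or {09} using only xtime() and XOR."""
    s2 = xtime(s)
    s4 = xtime(s2)
    s8 = xtime(s4)
    if coef == 0x0E:
        return s2 ^ s4 ^ s8
    if coef == 0x0D:
        return s ^ s4 ^ s8
    if coef == 0x0B:
        return s ^ s2 ^ s8
    if coef == 0x09:
        return s ^ s8
    raise ValueError("coefficient must be 0x0e, 0x0d, 0x0b or 0x09")


# Rows of the InvMixColumns() matrix from [1].
INV_MIX = [[0x0E, 0x0B, 0x0D, 0x09],
           [0x09, 0x0E, 0x0B, 0x0D],
           [0x0D, 0x09, 0x0E, 0x0B],
           [0x0B, 0x0D, 0x09, 0x0E]]


def inv_mix_column(col):
    """Apply the inverse column matrix to one 4-byte column."""
    out = []
    for row in INV_MIX:
        acc = 0
        for coef, byte in zip(row, col):
            acc ^= inv_mult(byte, coef)
        out.append(acc)
    return out


print([hex(b) for b in inv_mix_column([0x8E, 0x4D, 0xA1, 0xBC])])
```

The three xtime() results are shared by all four products, so a hardware unit needs only three chained xtime() blocks per byte plus XOR gates.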

Specifications on Synthesis
1. Platform used: Xilinx Virtex 5, XUPV5-LX110T.
2. Maximum clock speed achieved was 340.022 MHz, corresponding to a minimum clock period of 2.941 ns.
3. Maximum throughput achieved was 3.8 Gbps. As expected, due to the combinatorial implementation of the Sbox, the throughput almost doubled.
4. An estimated 530 registers and flip-flops were used, with some 450 XOR gates.

Figure 8: Row operation

References
1. National Institute of Standards and Technology, Advanced Encryption Standard, Federal Information Processing Standards 197, November 2001.
2. Nadia Nedjah, Luiza de Macedo Mourelle, Marco Paulo Cardoso, A Compact Piplined Hardware Implementation of the AES-128 Cipher, Third International Conference on Information Technology: New Generations (ITNG'06), pp. 216-221, 2006.
