High throughput FPGA implementation of Advanced Encryption Standard algorithm

(1)

DOI:_{10.12928/TELKOMNIKA.v15i1.4713} ₄₉₄

High throughput FPGA implementation of Advanced

Encryption Standard algorithm

Soufiane Oukili1, Seddik Bri2

Materials and Instrumentation (MIN), High School of Technology, Moulay Ismail University, Meknes, Morocco.

Corresponding author, e-mail: [email protected], [email protected]

Abstract

Owing to the worldwide increase of electronic communications and transactions, security and encryption speed have become essential elements of all systems and applications. Advanced Encryption Standard (AES) is one of the most widely secure and used symmetric encryption algorithms today. In this paper, we present high throughput full-pipelining implementation of AES algorithm, using the least amount of possible hardware. Substitution box (S-box) is the only non-linear step in this algorithm. It is the most complicated and costly part of the system. Consideration on high delay and hardware cost required for this transformation, we proposed pipelined S-box implementation by using composite field, to deal with the critical path and the occupied memory. In addition, efficient key expansion architecture suitable for our proposed AES design is also presented. The implementation has been successfully done by Virtex-5 XC5VLX85 and Virtex-6 XC6VLX75T Field Programmable Gate Array (FPGA) devices using Xilinx ISE 14.7. Our high throughput AES design achieves a data encryption rate of 108.69 Gbps and uses only 6361 slices. Compared to the best previous works, this implementation improves data throughput by 5.6% and reduces the occupied memory by 77.69%.

Keywords: security, Advanced Encryption Standard, high throughput, full-pipelining, S-box, FPGA

1. Introduction

Cryptography is the main aspect of secure data transmission over unreliable network. It is a challenging issue of data communications today that touches many areas including secure communication channel, strong data encryption technique and trusted third party to maintain the database. Cryptography encompasses many problems: encryption, authentication and key distribution to name a few. An encryption algorithm, or cipher, is a means of transforming plaintext into ciphertext under the control of a secret or public key. Secret key algorithm (symmetric cryptography) uses the same cryptographic keys for both encryption of plaintext and decryption of ciphertext. Public key algorithm (asymmetric cryptography) uses pairs of keys: public key for encryption with private key for decryption [1,2]. The Data Encryption Standard (DES) was the first modern symmetric key algorithm used for encryption and decryption of digital data. It has been developed in the 1970s at IBM and adopted as a Federal Information Processing Standard (FIPS) by the National Institute of Standards and Technology (NIST) in 1977 [3]. In 1998, the NIST announced a competition for a new encryption algorithm that would be used for protecting sensitive, non-classified, U.S. government information. This algorithm would replace DES, which was not resistant to known attacks because of the short key length. After all reviews, NIST chose an algorithm known as Rijndael. It was developed by two Belgian cryptographers: Dr. Joan Daemen and Dr. Vincent Rijmen. In November 2001, Advanced Encryption Standard (standardized version of Rijndael) became a FIPS standard (FIPS-197) [4, 5].

There are software and hardware approaches to implement cryptographic AES algorithm. As compared to software implementation, hardware implementation provides greater physical security and higher speed [6]. Low power, high throughput and compactness have always been topic of interest for hardware design and implementation. Because of the increasing requirements for high-speed, high-volume secure communications combined with physical security, the main goal of this paper is to implement high throughput AES design using as less hardware as possible.

(2)

The S-box substitution is at the core of any AES implementation. It is the only non-linear and complex step in each round of encryption algorithm. The most traditionally implementation used is where all predefined 256 8-bit values of S-box table are stored in various kinds of memories such as ROMs, BRAMs, and LUTs [7,8]. This method can reduce the area cost. However it suffers from unbreakable delay of memories that leads to a reduction in throughput. Another method is based on the calculation of the S-box functions using composite field arithmetic operations. Where it is possible to use pipelining technique to decrease the critical delay and increase the throughput [9,10].

In this paper, we present efficient high throughput hardware architecture design and implementation of 128-bit key AES, using full-pipelining technique. Consequently, each round unit is divided into more sub-stages. By incrementing the number of these sub-stages, the critical path and clock pulse width of system can be decreased and as a result the throughput is increased. In addition, we have proposed pipeline S-box based on composite field, in which the field operations are implemented in lower order fields and by lower cost subfield operations. It is used to avoid the unbreakable delay of memories and to achieve any further increase in processing speed. Moreover, efficient key expansion architecture suitable with the full-pipelined AES round units is presented. Our proposed design is implemented on Xilinx Virtex-5 and Virtex-6 FPGA technology. The FPGAs offer the advantage of hardware speed and software flexibility and programmability.

This paper is structured as follows. Section 2 presents a brief background of the AES algorithm. Section 3 gives the relevant works of various authors reported in the literature. Our proposed AES architecture is presented in Section 4. Section 5 provides results and comparison between our implementations and different reported implementations. Finally, conclusion and references are given respectively.

2. Background of AES Algorithm

AES algorithm is a symmetric block cipher, in which both the sender and the receiver use a same key for encryption and decryption. The data block length is fixed to be 128 bits and the key length can be 128, 192, or 256 bits. The AES is an iterative algorithm. Each iteration can be called a round, and the total number of rounds is dependent on the key length. The key length is represented by Nk = 4, 6, or 8, which reflects the number of 32-bit words (number of columns) in the cipher key. The number of rounds is represented by Nr, where Nr = 10 when Nk =4, Nr = 12 when Nk = 6, and Nr = 14 when Nk = 8. The output of each round serves as input of next stage. For each round, 128 bit input data and 128/192/256 bit key is required. For this paper, 128 bit key is chosen, which requires 10 rounds of encryption.

The 128-bit data block is arranged in a 4 × 4 array of bytes called the State, with four rows and four columns consisting of 16 bytes in total. Each round is composed of four different byte-oriented transformations: SubByte, ShiftRow, MixColumn and AddRoundKey except for the last round in which MixColumn transformation is not performed. Apart from this, there is an initial round at the start that consists of only AddRoundKey transformation [2]. Figure 1 shows the 128 bit-key AES algorithm.

a. SubByte: operates in each byte of the State independently. Each byte is substituted by the corresponding byte in the S-box. S-box is one of the basic components of any symmetric key algorithm, which exhibits the property of confusion. This property is provided to increase the difficulty in finding the key from the known cipher text. S-box takes M inputs and transforms them to deliver N bits at the output. Fixed S-boxes are used in AES algorithm, which are designed using multiplicative inverse over GF (28) and combining the inverse function with an invertible affine transformation. These properties make it efficient over cryptanalysis by providing non-linear properties.

b. Shift row: takes the data in the State matrix and circularly shifts each data block left by its row index.

c. MixColumn: in this operation, each column of the State is considered as polynomials over GF (28). Then, this vector is multiplied by a fixed polynomial.

d. AddRoundKey: takes in a unique round key form the key expansion component and simply performs a bit by bit XOR with each of the bits in the State matrix.

The decryption structure of AES algorithm can be derived by inverting the encryption one directly and their rounds require four inverse operations: InvSubByte, InvShiftRow,

(3)

InvMixColumn, and AddRoundKey.

AES algorithm takes the original main key, and performs a key expansion routine to generate the round keys. In AES-128 bits key, it generates a total of 11 KeyRound of 16 bytes in order to be employed respectively in rounds of AES, taking into account that the first KeyRound is the initial key. Key expansion is also an iterative algorithm with same round number as the AES. The output of each one is the input of the next one. In each round, the first four bytes of the input KeyRound constitute the word w0, the next four bytes the word w1, and so on. The bytes of the final word are left rotated by one position, and then each byte passes thought substitution S-box. The result is XORed with a round constant RCon(i). Finally, the columns are added together to generate a new 128 bit round key. Figure 2 shows one round of key expansion module. The key expansion is designed to be resistant to known cryptanalytic attacks. The inclusion of a round-dependent round constant eliminates the symmetry, or similarity, between the ways in which round keys are generated in different rounds.

Figure 1. 128-bit key AES Algorithm

Figure 2. Round i of Key Expansion Module

3. Previous Works

The AES algorithm is implemented using different methods and contributions which can be categorized as follows. In the first one, loop unrolling and iterative techniques are used to increase throughput, increase throughput to area ratio, and decrease area cost, respectively. These methods are studied in [11,13, 18, 24, 27, 29, 30, 33]. The second category includes the designs which use pipelining and full-pipelining techniques to increase operational frequency and throughput [8,17, 22, 23, 25-29, 31-33, 35-37, 39-42].

S-box is the main block in AES. Several different methods have been presented in the literature to implement it. The first one is considered in [13-16, 18, 20, 22, 23, 25, 26, 29, 30, 33, 35, 38] where various kind of memories such as ROMs, BRAMs, and LUTs are used to store all predefined 256 8-bit values. This method can reduce the area cost; however it suffers from unbreakable delay of memories that leads to a reduction in throughput. In the next method, S-box is implemented based on composite field where it is possible to use pipelining and sub-pipelining techniques to decrease the critical delay path and increase the throughput. However,

(4)

hardware complexity is its disadvantage [13, 17, 24, 28, 31, 32, 36, 37, 41-43].

Proposed AES implementation based on the combination of MixColumn and SubByte transformations into a single table consisting of 256 1-bytes columns [12]. Several area optimization options for the AES-128 implementation are presented [13]. Architecture in Refs. [14] and [23], uses memory modules (i.e., Dual-Port RAMs) of FPGAs for storing all the results of the fixed operations (i.e., Look-Up Table) and Digital Clock Manager (DCM) to optimize the execution time, reduce design area and facilitates implementation in FPGA. Refs. 32-bit AES implementation [15-16]. Cell array reconfigurable architecture is proposed in [17] which provides a high-efficiency, low-hardware overhead and high reliability in AES by combining the use of three hardware languages (Handel-C, VHDL, and JBits). An efficient optimized area and speed FPGA implementation for the AES is proposed [18]. The iterative looping method is adopted with multistage sub-pipelining architecture. Describe FPGA implementation of AES based on a novel design of S-box built using reduced residue of prime numbers [20]. The novel S-box look up table (LUT) entries forms a set of reduced residue of prime number, which forms a mathematical field. Presents an efficient hardware realization of AES algorithm using state-of-the-art reconfigurable hardware [21]. An FPGA memory based pipelined implementation of AES- XTS is presented [22]. The design is being operated by the use of two cascaded digital clock managers and an enhanced T- Table approach. A new methodology is presented [26] where authors employed dynamic and partial reconfiguration with parallelism and pipelining to implement AES. In the other words, they used VHDL for implementation of AES elements and Handel-C for implementation of the communication system. Authors propose a high throughput AES architecture, which is based on the CAM-based SubBytes scheme [27]. By properly pipelining, the proposed CAM-based SubBytes needs a smaller gate delay than the ROM-based design. Two implementations of the AES algorithm are introduced, the first design is based on the basic architecture of the AES and the second one is based on the sub-pipelined architecture of the AES [28]. A new efficient architecture for high-speed AES encryptor is presented [31]. This technique is implemented using composite field arithmetic byte substitution, where higher efficiency is achieved by merging and location rearrangement of different operations required in the steps of encryption. In addition, multistage sub-pipelined architecture is used to allow having higher efficiency in terms of throughput and area. Composite field arithmetic in normal bases is employed to propose a low-cost S-box for the AES [32]. Moreover, authors present efficient key expansion architecture convenient for 6 sub-pipelined round units. Authors present design, implementation and comparison of highly efficient architectures for AES on FPGAS: Iterative architecture and pipelined architecture [33]. The first design is optimized for area and the second one is optimized for speed. An extension of a general-purpose processor with a crypto coprocessor for encryption and decryption of information is described [35]. High throughput digital design of the 128-bit AES algorithm based on the 2-slow retiming technique on FPGA is presented [36]. Equivalent pipelined AES architecture working on CTR mode is presented to provide high throughput through inserting some registers in appropriate points making the delay shortest, when implementing the byte transformation in one clock period [37]. AES implementation [38] is mainly targeted for low-cost embedded applications. It introduces parallel operation in the folded architecture to obtain better throughput. Authors map operations of AES from Galois Field (28) to GF (24) to obtain an area-efficient masked AES implementation [39]. Fully pipelined crypto processor is presented [40], where AES is integrated with a 32-bit general purpose 5-stage pipelined MIPS processor. Three high-throughput AES implementations in ECB mode and one ultra-high throughput AES implementation in CTR mode [42]. They performed area-delay efficient multiplier and multiplicative inverter over GF(28). Moreover, loop-unrolling, fully pipelining, and fully sub-pipelining techniques are also used and the registers are placed in optimal placement. Compact and high-speed hardware architectures and logic optimization methods for the AES algorithm [43]. It proposes a methodology to optimize the S-Box by introducing a new composite field.

4. Proposed High Throughput AES Algorithm 4.1. Proposed AES Architecture

In this paper, the proposed AES design aimed to increase the throughput and use hardware resource as less as possible. Thus, full-pipelining technique is adopted. The pipelining strategy consists in parallelizing the data inputs and outputs with the processing. Consequently,

(5)

the design is divided into stages. By incrementing the number of these stages, the critical path and clock pulse width of system can be decreased and as a result the speed is increased. The optimum number of pipeline stages and the best placement strategy for pipelining registers are two main factors to achieve an area-throughput efficient design. Our full-pipelining architecture is like further fine grain pipelining where registers are introduced between rounds and between the operations of rounds.

stored and the input byte would be wired to the ROM’s address bus. Using of LUTs in high speed applications has two main problems. The first one is the high volume of required gates for implementation. The second one is inherent and unbreakable existent delay in LUT structure. This delay is longer than the total delay of the rest transformations in each round unit and prohibits them from being divided into more than two sub-stages to achieve any further speedup. Therefore, S-box using composite field is adopted according to [43], which the field operation is implemented in lower order fields and by lower cost subfield operations. This S-box structure has the advantage of having small area occupancy, in addition to be capable of being pipelined. As mentioned before, pipeline technique modifies the critical path by increasing possible frequency of clock cycle. It consists in parallelizing the data inputs and outputs with the processing by inserting registers. To achieve optimum number and the best placement strategy for pipelining registers in our proposed S-box design, we first placed pipeline registers in all possible placements in the architecture. Then, the pipeline stage which contains the lowest reduction of frequency is removed until the optimum number and placements of pipeline registers are achieved. Proposed full-pipelining general block and round of AES are illustrated in Figure 3(a) and Figure 3(b) respectively. Proposed 7-stage pipelining S-box implementation using composite field is shown in Figure 4. The proposed AES round is divided into 10 stages. Therefore, the proposed AES algorithm is divided into 100 stages.

(a) (b)

(6)

Figure 4. Proposed 7-Stage S-box Architecture Using Composite Field

4.2. Proposed Key Expansion Module

Key expansion, also as one of the crucial steps in the realization of AES cipher core, decides the throughput of the cipher process. The round keys that are used in each encryption round unit, can be either generated beforehand and stored in memory or generated on the fly. If we want to pipeline round keys generation to preserve the advantage of high throughput of our cipher process, it must be considered that the key expansion architecture is divided the same as the number of existent stages in encryption round unit or less to provide KeyRound to the corresponding encryption round.

Our proposed key expansion module is divided into 9 sub-stages. Knowing that the encryption round unit contains 10. The S-box design used is the one previously proposed, 7-stage based on composite field. Figure 5 presents proposed key expansion round.

(7)

5. Results and Comparisons

Our proposed AES implementation was accomplished on Virtex-5 XC5VLX85 and Virtex-6 XC6VLX200 FPGAs, which are widely used in recent related works. For synthesis and simulation, we employed Xilinx ISE Design Suite 14.7. The design was coded using VHDL language. It takes 100 clock pulse cycles to cipher the first data block, then the cipher blocks are recovered at each clock cycle. The implementation on virtex-5 occupied 7385 (56%) slices and achieves a frequency and throughput of 638.162 MHz, 81.68 Gbps. On virtex-6, it occupied 6361 (16%) slices and achieves a frequency and throughput of 849.185 MHz, 108.69 Gbps.

There are several hardware implementations for the AES algorithm that aim to achieve the most efficient architecture, by improving high throughput and area-efficient. Table 1 shows the performance figures for some recent FPGA-based designs up to our best knowledge. It provides values of hardware utilization, throughput, maximum frequency and throughput to area for several related works. In all reported results, we employ well- known equations (1) and (2), to calculate the throughput and the efficiency, respectively.

(1) (2) From Table 1, the highest throughput and the highest frequency reported in the literature to our knowledge by virtex5 FPGA are 73.737 Gbps and 576.07 MHz respectively, using 22994 slices [37]. By comparing these results with our implementation on virtex5, we see that ours gives 1.107 times more throughput and only 0.321 times slices used. Therefore, it improves data throughput by 10.78%. Furthermore, it is 3.44 times more efficient.

Table 1.MaximumFrequency, Throughput and Hardware UtilizationResults

In Table 2, we give the results implemented by Virtex-6 FPGA. These works have the highest frequency and throughput among the others. For a fair comparison, we implemented our

Design Device Slices Critical

delay (ns)

Max-freq. (MHz)

Through put (Gbps)

Efficiency (Mbps/slice) Kamal and Youssef [13] Virtex-2

XC2V1000BG575

2732 - 98.95 0.029 -

Adib and Raissouni [14] Spartan3 XC3S500E-4 326+3 BRAMs - 168.765 0.27 0.83 Chi-Wu et al. [15] Spartan-3 XC3S200 148+11 BRAMs - 287 0.647 4.37 Chi-Jeng et al. [16] Virtex-2 Pro XC2VP2 156+3 BRAMs - 306 0.876 5.61 Li et al. [17] Virtex-5 XC5VFX70T 2030+28 BRAMs - 91.57 0.97 0.481 Mazen et al. [18] Virtex-5 XC5VLX50 303+10 BRAMs 2.35 425 1.33 4.38 Leelavathi et al. [19] XC3S500E-4FG320 2439 - 222.41 2.846 1.166 Rais and Qasim [20] Virtex-5 XC5VLX50 1745 - 242.15 3.09 - Rais and Qasim [21] Virtex-5 XC5VLX50 399 - 339.09 4.34 - Shakil et al. [22] Virtex-5 XC5VLX50 573+8 BRAMs - - 5.25 9.16 Sireesha and Madhava

[23]

Spartan-3 XC3S50 0E 326+3 BRAMs - 50 6.4 - Jyrwa and Paily [24] Virtex-2 Pro XC2VP30 6211+1 BRAM - 142.5 18.2 0.2347

Gielata et al. [25] Virtex-4 XC4VLX200 1209 - 165 21.2 - José et al. [26] Virtex-2 XC2V6000 3576+80 BRAMs 5.1 194.7 24.92 6.97 Fan and Hwang [27] Virtex-2 XC2V3000 139357 4.5 222.2 28.40 0.20 Rizk and Morsy [28] Virtex-4 4VLX60FF668 18855+200 BRAMs - - 28.510 1.512

Banu et al. [29] Virtex-5 XC5VLX110T - 4 250 31.25 -

Kaur et al. [30] Virtex-2 XC2VP30 1127 4 247.365 31.66 -

Issam et al. [31] XC2V6000 10662 - 305.1 39.05 3.6

Samiee et al. [32] Virtex-2 Pro XC2VP20 7865 2.9 341.53 43.71 5.55 Nalini et al. [33] Virtex-2 Pro XC2VP30 12556+100 BRAMs 2.6 373 47.74 3.8 Naiem et al. [34] Virtex-2 Pro XC2VP30 1835+40 BRAMs 2.4 405.227 51.87 28.27 Mostafa and Ghada [35] Virtex-5 XC5VLX50T 1656 1.7 557 70 -

Reza et al. [36] Virtex-4 XC4VLX200 3425 1.73 576.037 73.73 21.53 Shanxin et al. [37] Virtex-5 XC5VLX85 22994 1.7 576.07 73.73 3.21

(8)

design on Virtex-6 FPGA too. Our implementation achieves 1.056 times more throughput, 4.74 times more efficient and only 0.223 times slices used than the implementation in [42], which has the highest throughput reported. Hence, our implementation improve data throughput by 5.6%.

By considering all the reported results in this section, we can see that our proposed AES implementations provide better performance in terms of throughput, occupied memory and efficiency compared to all the reported ones, known to date, which is mostly due to the following reasons:

a. Using full-pipelining technique for high throughput implementations b. Pipelined S-box implementation using composite field.

c. Optimum number of pipeline stages and the best placements for its registers

Table 2. AES Implementations on Virtex-6

6. Conclusion

This paper presents high throughput efficient AES implementation. The design uses 128-bit key for encryption. Full-pipelining technique is performed to obtain high throughput than basic structure. Also, we have employed pipelining S-box implementation using composite field that decreases the path delay. Therefore, the throughput is increased. Furthermore, efficient key expansion architecture suitable with the full-pipelined round units is introduced. Our implementation takes 100 clock cycles to load the first cipher block. Then, it will appear on consecutive clock cycles. The implementation is done by virtex-5 and virtex-6 FPGAs. It can encrypt data blocks at a rate of 81.68 Gbps and 108.69 Gbps respectively. Our proposed design improves in throughput all the reported architectures of AES algorithm, with consuming low area.

References

[1] Stalling W. Editor. Cryptography and Network Security Principles and Practices. Prentice Hall, 4th edition. 2005.

[2] Kahate A. Editor. Cryptography and Network Security. Tata McGraw Hill, 2nd edition. 2007.

[3] National Institute of Standards and Technology. Federal Information Processing Standards Publication 46-3: Data Encryption Standard. 1999.

[4] National Institute of Standards and Technology. Federal Information Processing Standards Publication 197: Advanced Encryption Standard; 2001.

[5] Joan D, Vincent R. AES Proposal: Rijndael. National Institute of Standards and Technology. 1999. http://csrc.nist.gov/archive/aes/rijndael/Rijndael-ammended.pdf

[6] Yoo SM, Kotturi D, Pan DW, Blizzard J. An AES crypto chip using a high-speed parallel pipelined architecture. Microprocessor and Microsystem. 2005; 29(7): 317-326.

http://dx.doi.org/10.1016/j.micpro.2004.12.001

[7] Burns F, Murphy J, Koelmans A. Efficient advanced encryption standard implementation using lookup and normal basis. IET Computers & Digital Techniques. 2009; 3(3): 270-280. http://dx.doi.org/10.1049/iet-cdt.2008.0049

Design Device Slices Critical delay (ns) Max-freq. (MHz) Through put (Gbps) Efficiency (Mbps/slic e) Rahimunnisa et al.

[38]

Virtex-6 XC6VLX75T

2056+48 BRAMs

1.9 505.5 37.1 15.56 Wang and Ha [39] Virtex-6

XC6VLX240T

9071+400 BRAMs

3.1 319.29 40.86 4.51 Anwar et al. [40] Virtex-6 ML605 2547+204

BRAMs

1.8 553 58 -

Rahimunnisa et al. [41]

Virtex-6 XC6VLX75T

2597 2.2 450.04 5

59.59 22.94 Abolfazl and Saeed

[42]

Virtex-6 XC6VLX240T

28520 1.2 803.98 8

102.91 3.6

Our design Virtex-6

XC6VLX240T

6361 1.17 849.18

(9)

[8] Sumathi M, Nirmala D, Rajkumar RI. Study of Data Security Algorithms using Verilog HDL. International Journal of Electrical and Computer Engineering (IJECE). 2015; 5(5): 1092-1101.

[9] Wong MM, Wong MLD. A High Throughput Low Power Compact AES S-box Implementation using Composite Field Arithmetic and Algebraic Normal Form Representation. 2nd Asia Symposium on Quality Electronic Design. Penang. 2010: 318-323.http://dx.doi.org/10.1109/ASQED.2010.5548317 [10] Rachh RR, AnandaMohan PV, Anami BS. High Speed S-box architecture for Advanced Encryption

Standard. IEEE 5th International Conference on Internet Multimedia Systems Architecture and Application. Bangalore. 2011: 1-6. http://dx.doi.org/10.1109/IMSAA.2011.6156342

[11] El Adib S, Raissouni N, Chahboun A, Azyat A, Lahraoua M, Ben Achhab N, Abbous A, Benarchid O. A Reconfigurable Cryptography Coprocessor RCC for Advanced Encryption Standard AES/Rijndael. International Journal of Information & Network Security (IJINS). 2012; 1(2): 119-126.

[12] El Adib S, Raissouni N. AES Encryption Algorithm Hardware Implementation: Throughput and Area Comparison of 128, 192 and 256-bits Key. International Journal of Reconfigurable and Embedded Systems (IJRES). 2012; 1(2): 67-74.

[13] Kamal AA, Youssef AM. An area optimized implementation of the advanced encryption standard. International Conference on Microelectronics. Sharjah. 2008: 159-162.

[14] El Adib S, Raissouni N. AES encryption algorithm hardware implementation architecture: resource and execution time optimization. International Journal of Information and Network Security (IJINS). 2012; 1(2): 110-118.

[15] Chi-Wu H, Chi-Jeng C, Mao-Yuan L, Hung-Yun T. Compact FPGA implementation of 32-bits AES algorithm using block RAM. IEEE Region 10 Conference. Taipei. 2007: 1-4. http://dx.doi.org/10.1109/TENCON.2007.4429126

[16] Chi-Jeng C, Chi-Wu H, Kuo-Huang C, Yi-Cheng C, Chung-Cheng H. High throughput 32-bit AES implementation in FPGA. IEEE Asia Pacific Conference on Circuits and Systems. Macao. 2008: 1806-1809. http://dx.doi.org/10.1109/APCCAS.2008.4746393

[17] Li H, Ding J, Pan Y. Cell array reconfigurable architecture for high-efficiency AES system.

Microelectronics Reliability. 2012; 52(11): 2829-2836.

http://dx.doi.org/doi:10.1016/j.microrel.2012.04.020

[18] Mazen EM, Salma H, Mohamed AEG. Real-time efficient FPGA implementation of AES algorithm. IEEE 26th International SOC Conference. Erlangen. 2013: 203-208. http://dx.doi.org/10.1109/SOCC.2013.6749688

[19] Leelavathi G, Prakasha S, Shaila K, Venugopal KR, Patnaik LM. Design and implementation of advanced encryption algorithm with FPGA and ASIC. International Journal of Research in Engineering & Advanced Technology. 2013; 1(3): 1-8.

[20] Rais MH, Qasim SM. Fpga implementation of Rijndael algorithm using reduced residue of prime numbers. 4th International Design and Test Workshop. Riyadh. 2009: 1-4. http://dx.doi.org/10.1109/IDT.2009.5404130

[21] Rais MH, Qasim SM. Efficient hardware realization of advanced encryption standard algorithm using virtex-5 FPGA. International Journal of Computer Science and Network Security. 2009; 9(9): 59-63. [22] Shakil A, Khairulmizam S, Abdul RR, Fakhrul ZR. An effective storage encryption solution. Indian

Journal of Science and Technology. 2013; 6(4): 4384-4389.

[23] Sireesha K, Madhava SR. A novel approach of area optimized and pipelined FPGA implementation of AES encryption and decryption. International Journal Scientific and Research Publications. 2013; 3(9): 1-5.

[24] Jyrwa B, Paily R. An Area-Throughput Efficient FPGA implementation of Block Cipher AES algorithm, International Conference on Advances in Computing, Control, and Telecommunication Technologies. Kerala. 2009: 328-332. http://dx.doi.org/10.1109/ACT.2009.88

[25] Gielata A, Russek P, Wiatr K. AES hardware implementation in FPGA for algorithm acceleration purpose. International Conference on Signals and Electronic Systems. Krakow. 2008: 137-140. http://dx.doi.org/10.1109/ICSES.2008.4673377

[26] José MG, Miguel AV, Juan MS, Juan AG. A new methodology to implement the AES algorithm using partial and dynamic reconfiguration. Integration, the VLSI Journal. 2010; 43(1): 72-80. http://dx.doi.org/doi:10.1016/j.vlsi.2009.05.003

[27] Fan CP, Hwang JK. FPGA implementations of high throughput sequential and fully pipelined AES algorithm. International Journal of Electrical Engineering. 2008; 15(6): 447-455.

[28] Rizk MRM, Morsy M. Optimized area and optimized speed hardware implementations of AES on FPGA. 2nd International Design and Test Workshop. Cairo. 2007: 207-217. http://dx.doi.org/10.1109/IDT.2007.4437462

[29] Banu JS, Vanitha M, Vaideeswaran J, Subha S. Loop parallelization and pipelining implementation of AES algorithm using OpenMP and FPGA. International Conference on Emerging Trends in Computing, Communication and Nanotechnology. Tamilnadu. 2013: 481-485. http://dx.doi.org/10.1109/ICE-CCN.2013.6528547

(10)

[30] Kaur A, Bhardwaj P, Kumar N. Fpga implementation of efficient hardware for the Advanced Encryption Standard. International Journal of Innovative Technology and Exploring Engineering. 2013; 2(3):186-189.

[31] Issam H, Kamal E, Ezz E. High-speed aes encryptor with efficient merging techniques. IEEE Embedded Systems Letters. 2010; 2(3): 67-71. http://dx.doi.org/10.1109/LES.2010.2052401

[32] Samiee H, Atani RE, Amindavar H. A novel area-throughput optimized architecture for the AES algorithm. International Conference on Electronic Devices, Systems and Applications. Kuala Lumpur. 2011: 29-32. http://dx.doi.org/10.1109/ICEDSA.2011.5959055

[33] Nalini I, Anandmohan PV, Poornaiah DV, Kulkarni VD. Efficient Hardware Architectures for AES on FPGA. In Das VV, Thankachan N. Editors. Computational Intelligence and Information Technology. New York: Springer-Verlag; 2011: 249-257.

[34] Naiem GF, Elramly S, Hasan BE, Shehata K. An efficient implementation of CBC mode Rijndeal AES on an FPGA. National Radio Science Conference. Tanta. 2008: 1-8. http://dx.doi.org/10.1109/NRSC.2008.4542373

[35] Mostafa IS, Ghada YA. Fpga implementation and performance evaluation of a high throughput crypto coprocessor. Journal of Parallel and Distributed Computing. 2011; 71(8): 1075-1084. http://dx.doi.org/doi:10.1016/j.jpdc.2011.04.006

[36] Reza RF, Bahram R, Sayed MS. FPGA based fast and high-throughput 2-slow retiming 128-bit AES encryption algorithm. Microelectronics Journal. 2014; 45(8): 1014-1025. http://dx.doi.org/doi:10.1016/j.mejo.2014.05.004

[37] Shanxin Q, Guochu S, Yihong H, Zhigang G, Zongjue Q. High Throughput, Pipelined Implementation of AES on FPGA. International Symposium on Information Engineering and Electronic Commerce. Ternopil. 2009: 542-545. http://dx.doi.org/10.1109/IEEC.2009.120

[38] Rahimunnisa K, Karthigaikumar P, Rasheed S, Jayakumar J, SureshKumar S. FPGA implementation of AES algorithm for high throughput using folded parallel architecture. Security and Communication Networks. 2014; 7(11): 2225-2236. http://dx.doi.org/10.1002/sec.651

[39] Wang Y, Ha Y. FPGA-based 40.9-gbits/s masked AES with area optimization for storage area network. IEEE Transactions on Circuits and Systems- II: Express Briefs. 2013; 60(1): 36-40. http://dx.doi.org/10.1109/TCSII.2012.2234891

[40] Anwar H, Daneshtalab M, Ebrahimi M, Plosila J, Tenhunen H. FPGA implementation of AES-based crypto processor. IEEE 20th International Conference on Electronics, Circuits, and Systems. Abu Dhabi. 2013: 369-372. http://dx.doi.org/10.1109/ICECS.2013.6815431

[41] Rahimunnisa K, Karthigaikumar P, Christy NA, Kumar SS, Jayakumar J. Psp: parallel sub-pipelined architecture for high throughput AES on FPGA and ASIC. Central European Journal of Computer Science. 2013; 3(4): 173-186. http://dx.doi.org/10.2478/s13537-013-0112-2

[42] Abolfazl S, Saeed S. An ultra-high throughput and fully pipelined implementation of AES algorithm on FPGA. Microprocessors and Microsystems. 2015; 39(7): 480-493. http://dx.doi.org/doi:10.1016/j.micpro.2015.07.005

[43] Satoh A, Morioka S, Takano K, Munetoh S. A Compact Rijndael Hardware Architecture with S-Box Optimization, In: Boyd C. Editor. Advances in Cryptology. New York: Springer-Verlag; 2001: 239-254. http://dx.doi.org/10.1007/3-540-45682-1_15

(1)

Designing a high speed and low area S-box substitution is one of the most critical problems in the research process, because it is the only non-linear transformation in the four transforms of AES arithmetic and is the key point to improve the throughput and decrease the resource used. The most of traditionally hardware implementation of the S-box is to have the pre-computed values stored in a ROM based LUT. In this implementation, all 256 values are stored and the input byte would be wired to the ROM’s address bus. Using of LUTs in high speed applications has two main problems. The first one is the high volume of required gates for implementation. The second one is inherent and unbreakable existent delay in LUT structure. This delay is longer than the total delay of the rest transformations in each round unit and prohibits them from being divided into more than two sub-stages to achieve any further speedup. Therefore, S-box using composite field is adopted according to [43], which the field operation is implemented in lower order fields and by lower cost subfield operations. This S-box structure has the advantage of having small area occupancy, in addition to be capable of being pipelined. As mentioned before, pipeline technique modifies the critical path by increasing possible frequency of clock cycle. It consists in parallelizing the data inputs and outputs with the processing by inserting registers. To achieve optimum number and the best placement strategy for pipelining registers in our proposed S-box design, we first placed pipeline registers in all possible placements in the architecture. Then, the pipeline stage which contains the lowest reduction of frequency is removed until the optimum number and placements of pipeline registers are achieved. Proposed full-pipelining general block and round of AES are illustrated in Figure 3(a) and Figure 3(b) respectively. Proposed 7-stage pipelining S-box implementation using composite field is shown in Figure 4. The proposed AES round is divided into 10 stages. Therefore, the proposed AES algorithm is divided into 100 stages.

(a) (b)

(2)

Figure 4. Proposed 7-Stage S-box Architecture Using Composite Field

4.2. Proposed Key Expansion Module

(3)

5. Results and Comparisons

Table 1.MaximumFrequency, Throughput and Hardware UtilizationResults

In Table 2, we give the results implemented by Virtex-6 FPGA. These works have the highest frequency and throughput among the others. For a fair comparison, we implemented our

Design Device Slices Critical

delay (ns)

Max-freq. (MHz)

Through put (Gbps)

Efficiency (Mbps/slice) Kamal and Youssef [13] Virtex-2

XC2V1000BG575

2732 - 98.95 0.029 -

Adib and Raissouni [14] Spartan3 XC3S500E-4 326+3 BRAMs - 168.765 0.27 0.83

Chi-Wu et al. [15] Spartan-3 XC3S200 148+11 BRAMs - 287 0.647 4.37

Chi-Jeng et al. [16] Virtex-2 Pro XC2VP2 156+3 BRAMs - 306 0.876 5.61

Li et al. [17] Virtex-5 XC5VFX70T 2030+28 BRAMs - 91.57 0.97 0.481

Mazen et al. [18] Virtex-5 XC5VLX50 303+10 BRAMs 2.35 425 1.33 4.38

Leelavathi et al. [19] XC3S500E-4FG320 2439 - 222.41 2.846 1.166

Rais and Qasim [20] Virtex-5 XC5VLX50 1745 - 242.15 3.09 -

Rais and Qasim [21] Virtex-5 XC5VLX50 399 - 339.09 4.34 -

Shakil et al. [22] Virtex-5 XC5VLX50 573+8 BRAMs - - 5.25 9.16

Sireesha and Madhava [23]

Spartan-3 XC3S50 0E 326+3 BRAMs - 50 6.4 -

Jyrwa and Paily [24] Virtex-2 Pro XC2VP30 6211+1 BRAM - 142.5 18.2 0.2347

Gielata et al. [25] Virtex-4 XC4VLX200 1209 - 165 21.2 -

José et al. [26] Virtex-2 XC2V6000 3576+80 BRAMs 5.1 194.7 24.92 6.97

Fan and Hwang [27] Virtex-2 XC2V3000 139357 4.5 222.2 28.40 0.20

Rizk and Morsy [28] Virtex-4 4VLX60FF668 18855+200 BRAMs - - 28.510 1.512

Banu et al. [29] Virtex-5 XC5VLX110T - 4 250 31.25 -

Kaur et al. [30] Virtex-2 XC2VP30 1127 4 247.365 31.66 -

Issam et al. [31] XC2V6000 10662 - 305.1 39.05 3.6

Samiee et al. [32] Virtex-2 Pro XC2VP20 7865 2.9 341.53 43.71 5.55

Nalini et al. [33] Virtex-2 Pro XC2VP30 12556+100 BRAMs 2.6 373 47.74 3.8

Naiem et al. [34] Virtex-2 Pro XC2VP30 1835+40 BRAMs 2.4 405.227 51.87 28.27

Mostafa and Ghada [35] Virtex-5 XC5VLX50T 1656 1.7 557 70 -

Reza et al. [36] Virtex-4 XC4VLX200 3425 1.73 576.037 73.73 21.53

Shanxin et al. [37] Virtex-5 XC5VLX85 22994 1.7 576.07 73.73 3.21

(4)

a. Using full-pipelining technique for high throughput implementations b. Pipelined S-box implementation using composite field.

c. Optimum number of pipeline stages and the best placements for its registers

Table 2. AES Implementations on Virtex-6

6. Conclusion

References

[1] Stalling W. Editor. Cryptography and Network Security Principles and Practices. Prentice Hall, 4th edition. 2005.

[2] Kahate A. Editor. Cryptography and Network Security. Tata McGraw Hill, 2nd edition. 2007.

[3] National Institute of Standards and Technology. Federal Information Processing Standards Publication 46-3: Data Encryption Standard. 1999.

[4] National Institute of Standards and Technology. Federal Information Processing Standards Publication 197: Advanced Encryption Standard; 2001.

[5] Joan D, Vincent R. AES Proposal: Rijndael. National Institute of Standards and Technology. 1999. http://csrc.nist.gov/archive/aes/rijndael/Rijndael-ammended.pdf

[6] Yoo SM, Kotturi D, Pan DW, Blizzard J. An AES crypto chip using a high-speed parallel pipelined architecture. Microprocessor and Microsystem. 2005; 29(7): 317-326.

http://dx.doi.org/10.1016/j.micpro.2004.12.001

Design Device Slices Critical

delay (ns) Max-freq. (MHz) Through put (Gbps) Efficiency (Mbps/slic e) Rahimunnisa et al.

[38]

Virtex-6 XC6VLX75T

2056+48 BRAMs

1.9 505.5 37.1 15.56

Wang and Ha [39] Virtex-6 XC6VLX240T

9071+400 BRAMs

3.1 319.29 40.86 4.51

Anwar et al. [40] Virtex-6 ML605 2547+204 BRAMs

1.8 553 58 -

Rahimunnisa et al. [41]

Virtex-6 XC6VLX75T

2597 2.2 450.04

59.59 22.94 Abolfazl and Saeed

[42]

Virtex-6 XC6VLX240T

28520 1.2 803.98

102.91 3.6

Our design Virtex-6

XC6VLX240T

6361 1.17 849.18

(5)

[8] Sumathi M, Nirmala D, Rajkumar RI. Study of Data Security Algorithms using Verilog HDL.

International Journal of Electrical and Computer Engineering (IJECE). 2015; 5(5): 1092-1101.

Standard. IEEE 5th International Conference on Internet Multimedia Systems Architecture and Application. Bangalore. 2011: 1-6. http://dx.doi.org/10.1109/IMSAA.2011.6156342

[11] El Adib S, Raissouni N, Chahboun A, Azyat A, Lahraoua M, Ben Achhab N, Abbous A, Benarchid O. A Reconfigurable Cryptography Coprocessor RCC for Advanced Encryption Standard AES/Rijndael.

International Journal of Information & Network Security (IJINS). 2012; 1(2): 119-126.

[13] Kamal AA, Youssef AM. An area optimized implementation of the advanced encryption standard. International Conference on Microelectronics. Sharjah. 2008: 159-162.

2012; 1(2): 110-118.

[15] Chi-Wu H, Chi-Jeng C, Mao-Yuan L, Hung-Yun T. Compact FPGA implementation of 32-bits AES

algorithm using block RAM. IEEE Region 10 Conference. Taipei. 2007: 1-4.

http://dx.doi.org/10.1109/TENCON.2007.4429126

[16] Chi-Jeng C, Chi-Wu H, Kuo-Huang C, Yi-Cheng C, Chung-Cheng H. High throughput 32-bit AES implementation in FPGA. IEEE Asia Pacific Conference on Circuits and Systems. Macao. 2008: 1806-1809. http://dx.doi.org/10.1109/APCCAS.2008.4746393

[17] Li H, Ding J, Pan Y. Cell array reconfigurable architecture for high-efficiency AES system.

Microelectronics Reliability. 2012; 52(11): 2829-2836.

http://dx.doi.org/doi:10.1016/j.microrel.2012.04.020

[18] Mazen EM, Salma H, Mohamed AEG. Real-time efficient FPGA implementation of AES algorithm. IEEE 26th International SOC Conference. Erlangen. 2013: 203-208. http://dx.doi.org/10.1109/SOCC.2013.6749688

[19] Leelavathi G, Prakasha S, Shaila K, Venugopal KR, Patnaik LM. Design and implementation of advanced encryption algorithm with FPGA and ASIC. International Journal of Research in Engineering & Advanced Technology. 2013; 1(3): 1-8.

[20] Rais MH, Qasim SM. Fpga implementation of Rijndael algorithm using reduced residue of prime

numbers. 4th International Design and Test Workshop. Riyadh. 2009: 1-4.

http://dx.doi.org/10.1109/IDT.2009.5404130

[21] Rais MH, Qasim SM. Efficient hardware realization of advanced encryption standard algorithm using virtex-5 FPGA. International Journal of Computer Science and Network Security. 2009; 9(9): 59-63. [22] Shakil A, Khairulmizam S, Abdul RR, Fakhrul ZR. An effective storage encryption solution. Indian

Journal of Science and Technology. 2013; 6(4): 4384-4389.

[23] Sireesha K, Madhava SR. A novel approach of area optimized and pipelined FPGA implementation of AES encryption and decryption. International Journal Scientific and Research Publications. 2013; 3(9): 1-5.

[24] Jyrwa B, Paily R. An Area-Throughput Efficient FPGA implementation of Block Cipher AES algorithm, International Conference on Advances in Computing, Control, and Telecommunication Technologies. Kerala. 2009: 328-332. http://dx.doi.org/10.1109/ACT.2009.88

[25] Gielata A, Russek P, Wiatr K. AES hardware implementation in FPGA for algorithm acceleration purpose. International Conference on Signals and Electronic Systems. Krakow. 2008: 137-140. http://dx.doi.org/10.1109/ICSES.2008.4673377

[27] Fan CP, Hwang JK. FPGA implementations of high throughput sequential and fully pipelined AES algorithm. International Journal of Electrical Engineering. 2008; 15(6): 447-455.

[28] Rizk MRM, Morsy M. Optimized area and optimized speed hardware implementations of AES on FPGA. 2nd International Design and Test Workshop. Cairo. 2007: 207-217. http://dx.doi.org/10.1109/IDT.2007.4437462

[29] Banu JS, Vanitha M, Vaideeswaran J, Subha S. Loop parallelization and pipelining implementation of AES algorithm using OpenMP and FPGA. International Conference on Emerging Trends in Computing, Communication and Nanotechnology. Tamilnadu. 2013: 481-485. http://dx.doi.org/10.1109/ICE-CCN.2013.6528547

(6)

[31] Issam H, Kamal E, Ezz E. High-speed aes encryptor with efficient merging techniques. IEEE Embedded Systems Letters. 2010; 2(3): 67-71. http://dx.doi.org/10.1109/LES.2010.2052401

[32] Samiee H, Atani RE, Amindavar H. A novel area-throughput optimized architecture for the AES algorithm. International Conference on Electronic Devices, Systems and Applications. Kuala Lumpur. 2011: 29-32. http://dx.doi.org/10.1109/ICEDSA.2011.5959055

[33] Nalini I, Anandmohan PV, Poornaiah DV, Kulkarni VD. Efficient Hardware Architectures for AES on FPGA. In Das VV, Thankachan N. Editors. Computational Intelligence and Information Technology. New York: Springer-Verlag; 2011: 249-257.

[34] Naiem GF, Elramly S, Hasan BE, Shehata K. An efficient implementation of CBC mode Rijndeal AES on an FPGA. National Radio Science Conference. Tanta. 2008: 1-8. http://dx.doi.org/10.1109/NRSC.2008.4542373

[35] Mostafa IS, Ghada YA. Fpga implementation and performance evaluation of a high throughput crypto coprocessor. Journal of Parallel and Distributed Computing. 2011; 71(8): 1075-1084. http://dx.doi.org/doi:10.1016/j.jpdc.2011.04.006

[36] Reza RF, Bahram R, Sayed MS. FPGA based fast and high-throughput 2-slow retiming 128-bit AES encryption algorithm. Microelectronics Journal. 2014; 45(8): 1014-1025. http://dx.doi.org/doi:10.1016/j.mejo.2014.05.004

[37] Shanxin Q, Guochu S, Yihong H, Zhigang G, Zongjue Q. High Throughput, Pipelined Implementation of AES on FPGA. International Symposium on Information Engineering and Electronic Commerce. Ternopil. 2009: 542-545. http://dx.doi.org/10.1109/IEEC.2009.120

[38] Rahimunnisa K, Karthigaikumar P, Rasheed S, Jayakumar J, SureshKumar S. FPGA implementation of AES algorithm for high throughput using folded parallel architecture. Security and Communication Networks. 2014; 7(11): 2225-2236. http://dx.doi.org/10.1002/sec.651

[39] Wang Y, Ha Y. FPGA-based 40.9-gbits/s masked AES with area optimization for storage area network. IEEE Transactions on Circuits and Systems- II: Express Briefs. 2013; 60(1): 36-40. http://dx.doi.org/10.1109/TCSII.2012.2234891

[40] Anwar H, Daneshtalab M, Ebrahimi M, Plosila J, Tenhunen H. FPGA implementation of AES-based crypto processor. IEEE 20th International Conference on Electronics, Circuits, and Systems. Abu Dhabi. 2013: 369-372. http://dx.doi.org/10.1109/ICECS.2013.6815431

[41] Rahimunnisa K, Karthigaikumar P, Christy NA, Kumar SS, Jayakumar J. Psp: parallel sub-pipelined architecture for high throughput AES on FPGA and ASIC. Central European Journal of Computer Science. 2013; 3(4): 173-186. http://dx.doi.org/10.2478/s13537-013-0112-2

[42] Abolfazl S, Saeed S. An ultra-high throughput and fully pipelined implementation of AES algorithm on FPGA. Microprocessors and Microsystems. 2015; 39(7): 480-493. http://dx.doi.org/doi:10.1016/j.micpro.2015.07.005

[43] Satoh A, Morioka S, Takano K, Munetoh S. A Compact Rijndael Hardware Architecture with S-Box Optimization, In: Boyd C. Editor. Advances in Cryptology. New York: Springer-Verlag; 2001: 239-254. http://dx.doi.org/10.1007/3-540-45682-1_15

High throughput FPGA implementation of Advanced Encryption Standard algorithm