These combine the four-input LUT outputs. These gates can be
cascaded in a chain to provide wide AND functionality across
slices. The output from the cascaded AND gates can then be
combined with the dedicated ORCY to produce a Sum of Products
(SOP) function.

Figure 1: Generic FPGA Architecture (Input/Output Blocks (IOB),
Configurable Logic Blocks (CLB), Switch Matrices (SM), Wire Segments)

3. THE CONVENTIONAL APPROACH
The straightforward design approach is to code the design logic
in an HDL (Hardware Description Language) and then let the
synthesis tool do the job. The drawback of this approach is that
synthesis tools are not intelligent enough and map all of the logic
to a LUT-based architecture, which results in larger chip area and
longer path delays. Hence, the design becomes bigger and runs at a
slower clock rate. We explain this approach with the help of an
example. Suppose we have to design an 8-input AND gate. We can
simply code it using an HDL statement; the following is an example
in Verilog.
assign out = a[0] & a[1] & a[2] & a[3] & a[4] &
             a[5] & a[6] & a[7];

Where a is the 8-bit input and out is the output of the AND gate.
This statement ANDs the 8 input bits of variable a and outputs
the result on out. In the synthesized hardware, two 4-input LUTs
perform the AND operation on two 4-bit groups of the input, and the
two resulting bits are then ANDed using a third 4-input LUT.

Figure 4: 8 input AND Function - LUT Based Architecture

In a Xilinx Spartan-3 FPGA, a LUT4 has a gate delay of 0.479 ns and
a net delay of 0.976 ns; the overall critical path delay of this
circuit is 9.215 ns.

Figure 2: Xilinx’s CLB (Slices 0 to 3 with their Logic Cells (LC))
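The two-level LUT mapping of the 8-input AND can also be written out explicitly. The following is a minimal sketch rather than the paper's own listing: the module name and_8_lut_view and the intermediate nets lo and hi are hypothetical, and a synthesis tool is free to restructure the logic, so the code only illustrates where the two LUT stages come from.

```verilog
// Illustrative sketch; module and net names are assumptions, not from the paper.
module and_8_lut_view (
    input  wire [7:0] a,
    output wire       out
);
    // Stage 1: two partial ANDs, each sized to fit one 4-input LUT.
    wire lo = &a[3:0];
    wire hi = &a[7:4];

    // Stage 2: a third LUT combines the two partial results, so the
    // critical path crosses two LUTs and the net between them.
    assign out = lo & hi;
endmodule
```

Functionally this is the same as the single assign statement above; it only exposes the intermediate signals that separate the first LUT stage from the second.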
Figure 3: Simplified Slice Structure (two 4-input LUTs, each
followed by carry and control logic and a flip-flop)

4. THE OPTIMIZED APPROACH
In the previous example there are two stages of LUTs, so the
critical path to the output includes two stages of delay. We can
avoid the second LUT stage by using dedicated hardware within a
slice. By utilizing a dedicated AND gate (MULT_AND) or a dedicated
multiplexer (MUXCY), we can achieve the same functionality with a
shorter path delay. The following is example code using a MULT_AND
gate in place of the third LUT.
assign temp1 = a[0] & a[1] & a[2] & a[3];
assign temp2 = a[4] & a[5] & a[6] & a[7];
MULT_AND mult_and_inst (   // instance name assumed; not preserved in the source
    .LO(out),              // assumed: LO is the output port of the MULT_AND primitive
    .I0(temp1),
    .I1(temp2)
);
Figure 5 describes the resulting hardware. The MULT_AND [...]

Figure 5: 8 input AND Function - Using MULT_AND

MUXCY can be used directly for the same circuit functionality. The
following is example code using MUXCY in place of MULT_AND.

[...]
);

Figure 6 describes the resulting hardware. The MUXCY in a Spartan-3
FPGA has a gate delay of 0.983 ns and a net delay of 0.681 ns; the
overall critical path delay of the circuit is 8.743 ns.

Table 1 compares the timing results of the 8-input AND gate
implementation for the conventional approach and the optimized
approach using the MULT_AND gate. Results are shown for commonly
used Xilinx FPGAs. For simplicity, and to show the timing effects
of the two approaches more clearly, input and output buffer delays
have been omitted. The last column of the table shows the percent
improvement in critical path delay for each device.

5. SOME TECHNIQUES FOR WIDE INPUT GATES

5.1 Wide input AND Operation
MUXCY gates can combine the outputs of 4-input LUTs across slices
and can be cascaded into a chain to provide wide AND functionality
[6]. Figure 7 describes the 16-input AND gate implementation. The
technique uses the 4-input LUT to provide the SELECT signal for the
MUXCY. The SELECT signal is a simple AND of the 4 inputs. The VCC
at the bottom of the chain reaches the output only when all of the
input signals are at logic high. This use of carry logic performs
AND functions at high speed and saves hardware resources.

Figure 7: 16-bit AND Gate Implementation (LUT output:
out = i1 & i2 & i3 & i4; vcc enters at the bottom of the MUXCY
chain in Slice 0; chain output AND_OUT)

5.2 Sum of Product Function (SOP)
The output of the cascaded AND gates (Figure 7) can be combined
with the dedicated ORCY gate to produce a Sum of Products (SOP)
function [6]. Several slices can be used to build the sum of
products, depending on the width of the desired data. Figure 8
describes the SOP of 64-bit wide inputs using 4 cascaded 16-bit AND
operations.
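The carry-chain AND of Section 5.1 and the ORCY-based SOP of Section 5.2 can be sketched structurally as follows. This is a minimal sketch, not the paper's listing. It assumes the Xilinx UNISIM primitives MUXCY (ports O, CI, DI, S) and ORCY (ports O, CI, I) are available to the tool, and all module, instance, and signal names (and16_carry, sop64, sel, carry, term, or_chain) are hypothetical. Whether each partial AND is packed into a single LUT next to its MUXCY depends on the device and on placement constraints, which are not shown.

```verilog
// Hedged sketch: assumes the Xilinx UNISIM primitives MUXCY and ORCY.
// Module, instance, and signal names are illustrative only.

// 16-input AND: four 4-input partial ANDs drive the selects of a MUXCY chain.
module and16_carry (
    input  wire [15:0] i,
    output wire        and_out
);
    wire [3:0] sel;    // LUT outputs, one 4-input AND each
    wire [4:0] carry;  // carry[0] is the VCC fed in at the bottom of the chain

    assign carry[0] = 1'b1;

    genvar k;
    generate
        for (k = 0; k < 4; k = k + 1) begin : stage
            // SELECT signal: AND of four inputs, sized for one 4-input LUT.
            assign sel[k] = &i[4*k +: 4];

            // S = 1 passes the carry from below; S = 0 forces 0 onto the chain.
            MUXCY muxcy_inst (
                .O  (carry[k+1]),
                .CI (carry[k]),
                .DI (1'b0),
                .S  (sel[k])
            );
        end
    endgenerate

    // VCC reaches the top only if every select is high.
    assign and_out = carry[4];
endmodule

// 64-input SOP: four 16-input AND terms ORed through a chain of ORCY gates.
module sop64 (
    input  wire [63:0] i,
    output wire        sop_out
);
    wire [3:0] term;      // outputs of the four carry-chain ANDs
    wire [4:0] or_chain;  // or_chain[0] is the 0 fed into the first ORCY

    assign or_chain[0] = 1'b0;

    genvar t;
    generate
        for (t = 0; t < 4; t = t + 1) begin : product
            and16_carry and16_inst (
                .i       (i[16*t +: 16]),
                .and_out (term[t])
            );

            // ORCY computes O = CI | I: this term ORed with the terms so far.
            ORCY orcy_inst (
                .O  (or_chain[t+1]),
                .CI (term[t]),
                .I  (or_chain[t])
            );
        end
    endgenerate

    assign sop_out = or_chain[4];
endmodule
```

In this sketch sop_out is high when at least one of the four 16-bit groups of i is all ones, which corresponds to the arrangement of four cascaded 16-bit AND operations described for Figure 8.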
Figure 8: SOP of 64-bit wide inputs using four cascaded 16-bit AND
operations (LUT and MUXCY chains in Slices 0 to 3 of each CLB, vcc
at the bottom of each chain, ORCY gates producing SOP OUT)

Figure 9: MUXF5 and MUXFX Multiplexers [9]
  Slice S0 (F6): MUXF6 combines the two MUXF5 outputs from Slices S0 and S1
  Slice S1 (F7): MUXF7 combines the two MUXF6 outputs from Slices S0 and S2
  Slice S2 (F6): MUXF6 combines the two MUXF5 outputs from Slices S2 and S3
  Slice S3 (F8): MUXF8 combines the two MUXF7 outputs (two CLBs)

Figure 10: 4-input LUT as a 2:1 MUX
6. CONCLUSIONS
In this paper we have presented some useful techniques to utilize
FPGA hardware resources effectively and efficiently. By applying
the discussed techniques, not only can the utilized FPGA area be
minimized, but the critical path lengths of designs can also be
reduced. Consequently, designs can run at higher clock rates and
more logic can be added to the chip. The dedicated hardware
resources of Xilinx FPGAs are discussed in order to reduce the
reliance of designs on LUT-based architectures, which helps to
reduce area consumption and yields more timing-efficient
architectures. Normally, the utilized chip area of an FPGA is
calculated in terms of CLB count. However, conventional mapping