Abstract: The utilization of block RAMs (BRAMs) is a critical performance factor for multiported memory designs on field-programmable gate arrays (FPGAs). Not only does the excessive demand on BRAMs block the usage of BRAMs from other parts of a design, but the complex routing between BRAMs and logic also limits the operating frequency. This paper first introduces a brand new perspective and a more efficient way of using a conventional two reads one write (2R1W) memory as a 2R1W/4R memory. By exploiting the 2R1W/4R as the building block, this paper introduces a hierarchical design of 4R1W memory that requires 25% fewer BRAMs than the previous approach of duplicating the 2R1W module. Memories with more read/write ports can be extended from the proposed 2R1W/4R memory and the hierarchical 4R1W memory. Compared with previous XOR-based and live value table-based approaches, the proposed designs can, respectively, reduce up to 53% and 69% of BRAM usage for 4R2W memory designs with 8K-depth. For complex multiported designs, the proposed BRAM-efficient approaches can achieve higher clock frequencies by alleviating the complex routing in an FPGA. For 4R3W memory with 8K-depth, the proposed design can save 53% of BRAMs and enhance the operating frequency by 20%.

Index Terms: Block RAM (BRAM), field-programmable gate array (FPGA), multiported memory, performance.

I. INTRODUCTION

... efficient memory usage in a design. Compared with the storage module synthesized by slices, BRAMs are more area and power efficient while at the same time achieving higher operating frequencies. An FPGA usually deploys multiple BRAMs with the same specification. For example, the Xilinx Virtex-7 XC7V585 FPGA contains 795 36-Kb BRAMs, and each BRAM can be configured as two-port mode or dual-port mode [4]. Designers can utilize these memory blocks to implement the in-system storage module of a design.

Multiported memories, which allow multiple concurrent reads and writes, are frequently used in various digital designs on FPGAs to achieve high memory bandwidth. For example, the register file of an FPGA-based scalar MIPS-like soft processor [3] requires one write port and two read ports. Processors that issue multiple instructions require even more access ports. The shared cache system among multiple soft processors on an FPGA should support multiple concurrent accesses. A routing table in a network switching function would also need to enable multiple accesses in order to support multiple requests from different ingress ports. Time-multiplexing and task scheduling are alternative solutions to support multiple accesses. However, these schemes would
Fig. 3. XOR-based memory design that can support two simultaneous writes W0 and W1 and one read R0. A write will store an encoded value of both the new and stale data. A read can recover the most recent value of the target address by applying the XOR operation again.
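The encode-on-write, decode-on-read behavior described in the Fig. 3 caption can be sketched in software. The Python model below is a minimal illustration, assuming the usual two-bank arrangement in which each write port owns one bank and stores its new data XORed with the value held at the same address in the other bank. The class and method names (`XorMemory2W1R`, `write0`, `write1`, `read`) are hypothetical and are not taken from the paper. The sketch models functional behavior only: it ignores BRAM timing, the bank replication a hardware design needs to give every port its own physical access, and two writes to the same address in the same cycle.

```python
class XorMemory2W1R:
    """Functional model of a 2-write/1-read XOR-based memory (illustrative only)."""

    def __init__(self, depth):
        # One bank per write port; both start cleared so that
        # bank0[a] ^ bank1[a] == 0 for every address a.
        self.bank0 = [0] * depth
        self.bank1 = [0] * depth

    def write0(self, addr, data):
        # Port W0 encodes the new data with the stale value held in W1's bank.
        self.bank0[addr] = data ^ self.bank1[addr]

    def write1(self, addr, data):
        # Port W1 encodes the new data with the stale value held in W0's bank.
        self.bank1[addr] = data ^ self.bank0[addr]

    def read(self, addr):
        # XORing both banks cancels the stale term and leaves the latest write.
        return self.bank0[addr] ^ self.bank1[addr]


if __name__ == "__main__":
    mem = XorMemory2W1R(depth=8)
    mem.write0(0, 0xAB)   # W0 writes address 0
    mem.write1(1, 0x5C)   # W1 writes address 1
    mem.write1(0, 0x11)   # W1 later overwrites address 0
    assert mem.read(0) == 0x11
    assert mem.read(1) == 0x5C
```

Because neither port ever writes the other port's bank, the two writes need no arbitration between them; the cost is that every write must first read the other bank, which is where the extra BRAM copies in a hardware XOR-based design come from.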
Fig. 12. Example of a 2R2W memory that integrates BDX and BDRT. (a) Initial state of the memory module. W0 is going to address 0 while W1 is going to address 1. R0 reads address 2, and R1 reads address 3. (b) Memory state after completing W0, W1, R0, and R1.
Fig. 13. 4RnW memory that integrates HBDX and BDRT, and applies 2R1W/4R modules as building blocks.
TABLE II
DIFFERENT DESIGNS OF MULTIPORTED MEMORY
TABLE III
PERFORMANCE AND COST OF DESIGNS FOR 4R2W MEMORY WITH 8K-DEPTH AND 16K-DEPTH
All the requests will be completed within one cycle. The timing analysis is performed between the input registers and output registers of the multiported memory.

B. Designs for 4R2W Memory

Table III compares the 4R2W multiported designs with 8K and 16K memory depth. It lists the performance and cost of these designs, including the numbers of BRAMs, operating frequencies, and slice utilization. Each entry in a multiported memory is 32 bits wide. Based on the results in Table III, the multiported designs can be generally categorized into two types based on how the multiple read ports are supported. The first type enables multiple read ports with mapping tables [1], [3], such as LVT-based and BDRT-based designs. The second type enables multiple read ports without mapping tables [2], such as the XOR-based designs. Since the mapping table helps track the correct location of the most recent data, designs with mapping tables usually require fewer BRAMs to implement a multiported memory. However, these designs run at lower operating frequencies due to the need for table lookups and more complex control mechanisms. On the other hand, designs with replication techniques (no mapping tables) can achieve higher operating frequencies, but demand more BRAMs on an FPGA. Note that both types of designs, as discussed in Section III-B, still use table lookups to implement multiple write ports.

For 4R2W designs with 8K-depth, compared with the XOR-based designs (including XOR-based, XOR-based_BDX, and XOR-based_HBDX) that require no mapping tables, LVT-based designs (including LVT-based, LVT-based_BDX, and LVT-based_HBDX) have lower frequencies due to the more complex logic and circuit routing. The proposed BDX and HBDX modules can help reduce the number of BRAMs. Using the proposed BDX and HBDX as the building blocks in the LVT-based and XOR-based designs, the number of BRAMs can be reduced by 30%–53% with a minor increase in slice utilization. The operating frequencies, however, are degraded by around 5% after adopting the BDX and HBDX modules. This is mainly because BDX and HBDX implement more complex logic. The connections between BRAMs and output multiplexers also lengthen the critical path of the designs.

For BDRT-based designs, using 2R1W/4R as the building blocks will reduce the usage of BRAMs by 33.3% compared with the previous designs with 4R1W building blocks. Compared with the LVT-based design, the BRAM usage of BDRT-based designs can be reduced by up to 69%. However, the operating frequencies of BDRT-based designs become 16%–19% slower. This is because the 4R2W design using 2R1W/4R has to check the possible conflicts for all six memory requests (four reads and two writes). The multiported memory designs with original 4R1W building modules only need to check the two writes since every 4R1W building block can serve four reads and one write. The extra bank buffer of BDRT-based designs also requires more slices to implement the remap table, and further limits the maximum operating frequency.

For 4R2W designs with 16K-depth, replacing the basic memory modules with BDX and HBDX modules in XOR-based designs can reduce the number of BRAMs by 30%–42.5%. The operating frequencies are degraded by 11%–17% due to the longer critical paths posed by BDX and HBDX.

Interestingly, LVT-based_BDX and LVT-based_HBDX reduce BRAM usage by 37.5%–53% while at the same time achieving a 10% faster clock frequency than the original LVT-based design [1], [3]. This is mainly because the excessive number of slices required by the original LVT-based design has considerably complicated the routing on the target FPGA. The complex routing becomes a limiting factor for the maximum operating frequency. BDX and HBDX modules, although more complex than the previous basic memory modules, actually help reduce the number of BRAMs in a multiported memory design on an FPGA. The fewer BRAMs alleviate the routing congestion and result in faster clock rates.

C. Designs for 4R3W Memory

Table IV extends the comparisons to 4R3W designs with 8K and 16K memory depth. To support three write ports, the XOR-based design needs to occupy an enormous number of BRAMs on the target FPGA. Complex connections from BRAMs to
148 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 1, JANUARY 2017
TABLE IV
PERFORMANCE AND COST OF DESIGNS FOR 4R3W MEMORY WITH 8K-DEPTH AND 16K-DEPTH
the output ports, in the XOR-based design, have become a serious limiting factor to the maximum operating frequency. For 4R3W with 8K-depth, using BDX and HBDX in the XOR-based design can greatly reduce the number of BRAMs by 37.5% and 48%, respectively. The fewer BRAMs also make the routing less complex and therefore help enhance the operating frequency by 13%. Similar results can be found for LVT-based 4R3W designs with 8K-depth. Reducing the usage of BRAMs alleviates the complex routing in the original LVT-based design. The proposed BDX and HBDX modules can reduce BRAM usage by 37.5%–53% in LVT-based designs, and further enhance the operating frequency by 20%.

For 4R3W designs with 8K-depth, compared with LVT-based_BDX and LVT-based_HBDX, the BDRT-based design can further reduce the BRAM usage while keeping a comparable operating frequency. However, the slice utilization increases rapidly after applying the 2R1W/4R as the basic building block. The BDRT-based_2R1W/4R occupies almost all the available slices and is barely routable in the target FPGA. This is because the design with 2R1W/4R needs three extra banks to resolve the possible write conflicts. The BDRT-based design deploys two banks while the BDRT-based_2R1W/4R requires a total of five banks. For BDRT-based_2R1W/4R, each entry in the remap table now requires 3 bits, instead of 2 bits in the BDRT-based design. Compared with the BDRT-based design, which adopts the original 4R1W as the basic building block, the BDRT-based_2R1W/4R reduces the number of BRAMs by 37.5%, but results in a 56% lower operating frequency.

The designs for 4R3W with 16K-depth occupy more slices and more BRAMs than designs with 8K-depth. The operating frequencies of the designs drop due to the more congested usage of slices and complex routing among BRAMs and logic. For XOR-based designs of 4R3W with 16K-depth, the BDX and HBDX reduce the usage of BRAMs by 37.5% and 48%, respectively. BDX and HBDX also help achieve a 17% higher frequency than the original XOR-based design due to less complex routing.

For designs of 4R3W with 8K-depth, LVT-based_BDX and LVT-based_HBDX require fewer BRAMs compared with LVT-based. However, for LVT-based_HBDX, the lower usage of BRAMs does not provide as much frequency benefit as in previous cases. LVT-based_HBDX results in an 18% slower operating frequency than the LVT-based design that requires more BRAMs. This is mainly because the already congested usage of slices in the 4R3W design (almost 70% for LVT-based_HBDX) has limited the opportunity for timing refinement. The more sophisticated decision-making mechanism in HBDX further aggravates the timing issue. Although using fewer BRAMs, LVT-based_HBDX designs for 4R3W with 16K-depth still result in a longer critical path than the original LVT-based design.

Note that the design of BDRT-based_2R1W/4R cannot be properly synthesized for the 4R3W memory with 16K depth because the number of slices required by the design exceeds the available slices on the target FPGA.

D. Impact of Write Ports and Memory Depth

As shown in Tables III and IV, increasing write ports would significantly increase the slice utilization. This situation is even more apparent for designs that enable multiple read ports with mapping tables, such as LVT-based and BDRT-based designs. This is because the mapping tables are synthesized from the registers and connections in slices of FPGAs. Having large mapping tables not only occupies slices, but also introduces more complex decision logic and routing. Extra consumption of slices also blocks the usage of slices by other logic modules, and consequently affects the overall timing of the design. The experimental results also reveal that when the slice utilization approaches the capacity of the target FPGA, the timing degradation can be very significant.

Increasing the memory depth has a linear impact on the usage of BRAMs. According to the experimental results, growing the memory depth from 8K to 16K requires twice the number of BRAMs. The slice utilization also increases approximately twofold. But again, the operating
LAI AND LIN: EFFICIENT DESIGNS OF MULTIPORTED MEMORY ON FPGA 149
TABLE V
PERFORMANCE IMPACT OF DIFFERENT BANK ORGANIZATIONS FOR BDRT-BASED_2R1W/4R DESIGN OF 4R3W MEMORY WITH 8K DEPTH
TABLE VI
THROUGHPUT OF XOR-BASED 4R2W MEMORY AND TIME MULTIPLEXING (TMX) BASED 4R2W MEMORY
frequency will drop considerably when the design approaches the slice capacity of the target FPGA and causes routing congestion.

E. Impact of Bank Organizations

As noted in the previous section, BDRT-based_2R1W/4R requires more than 100% slice utilization on the target FPGA, and cannot be implemented. But BDRT-based_2R1W/4R does provide the advantage of lower BRAM usage. This section takes BDRT-based_2R1W/4R as a design example, and explores the potential design refinement that could be achieved by optimizing the bank organizations.

Table V lists two different bank organizations for the BDRT-based_2R1W/4R design of 4R3W memory with 8K-depth. The nominal implementation used in the previous sections of this paper adopts a two-data-bank organization, shown in the second row of Table V. The third row of Table V lists the results of a four-data-bank design. Together with the three bank buffers, the four-data-bank design involves a total of seven banks. Both the two-data-bank and four-data-bank designs need three extra bank buffers, and require 3 bits for each entry in the remap table. However, when changing the design from two-data-bank to four-data-bank, the depth of each bank is reduced from 4K to 2K. The depth of the remap table also shrinks from 8K+4K×3 to 8K+2K×3. Compared with the original two-data-bank design, the four-data-bank design demonstrates a 16% BRAM reduction, a 21% higher frequency, and a 27% lower slice utilization. This result clearly demonstrates a great opportunity to attain further performance enhancement by tuning the bank organization. This will be a major study in our future research.

F. Summary of Multiported Designs

This section summarizes important observations and design concerns of the proposed multiported memory. First, the integrated multiported memory design can provide higher throughput than the time-multiplexing (TMX) based designs. Table VI compares the throughput of different designs for 4R2W memory. XOR-based (4R2W) is the XOR-based 4R2W design proposed in this paper. TMX(2R1W) uses the replication-based 2R1W design [1], [3] and applies the time-multiplexing scheme to achieve 4R2W. Here we assume the time-multiplexing scheme induces zero latency overhead and can run at the same clock frequency as the original 2R1W design. Therefore, for TMX(2R1W), it takes two cycles to serve four reads and two writes. As shown in Table VI, the XOR-based designs run at slower clock rates than the TMX(2R1W) designs. However, XOR-based designs can achieve higher total throughput than the time-multiplexing designs.

Second, to attain the most benefit, users need to properly choose between designs that support multiple reads with and without mapping tables. As demonstrated previously, multiported designs with mapping tables can utilize the BRAMs more efficiently, but suffer from lower operating frequencies due to more complex routing. The timing issue is further aggravated when the size and complexity of the multiported design approach the capacity of the target FPGA. Therefore, users should prefer designs with mapping tables when the target FPGA contains abundant slices and relatively scarce BRAMs. If the number of slices becomes a design constraint, users should favor designs without mapping tables in order to avoid the possible performance degradation due to insufficient slices and congested connections.

Third, a proper bank organization can enable further performance enhancement. As shown in Section IV-E, different bank organizations result in disparate performance. Refining the bank organization can attain performance enhancement and is worth further exploration.

Fourth, the current designs still require table lookups to support multiple writes. A design with mapping tables would limit the maximum operating frequency. We are currently developing a technique to support multiple writes without mapping tables. This technique would avoid the issue of congested routing and potentially enhance the overall performance of the multiported designs.

V. CONCLUSION

This paper proposes efficient BRAM-based multiported memory designs on FPGAs. The existing design methods require significant amounts of BRAMs to implement a memory module that supports multiple read and write ports. Occupying too many BRAMs for multiported memory could seriously restrict the usage of BRAMs for other parts of a design. This paper proposes techniques that can attain efficient multiported memory designs. This paper introduces a novel 2R1W/4R memory. By exploiting the 2R1W/4R as the building block, this paper proposes a hierarchical design of 4R1W memory that requires 33% fewer BRAMs than the previous designs based on replication. Memories with more read/write ports can be extended from the proposed 2R1W/4R memory and the hierarchical 4R1W memory. Compared with XOR-based and LVT-based approaches, the proposed designs can, respectively, reduce up to 53% and 69% of BRAM
usage for 4R2W memory designs with 8K-depth. For complex multiported designs, the proposed BRAM-efficient approaches can achieve higher clock frequencies by alleviating the complex routing in an FPGA. For 4R3W memory with 8K-depth, the proposed design can save 53% of BRAMs while at the same time enhancing the operating frequency by 20%. This paper also demonstrates the importance of applying an appropriate bank organization in a memory design. It is shown that a multiported design with proper bank organization could achieve a 16% BRAM reduction, a 21% higher frequency, and a 27% lower slice utilization. The results present great potential for future design refinement that could be achieved by optimizing the bank organizations.

REFERENCES

[1] C. E. LaForest and J. G. Steffan, “Efficient multi-ported memories for FPGAs,” in Proc. 18th Annu. ACM/SIGDA Int. Symp. Field Program. Gate Arrays, 2010, pp. 41–50.
[2] C. E. LaForest, M. G. Liu, E. Rapati, and J. G. Steffan, “Multi-ported memories for FPGAs via XOR,” in Proc. 20th Annu. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2012, pp. 209–218.
[3] C. E. Laforest, Z. Li, T. O’Rourke, M. G. Liu, and J. G. Steffan, “Composing multi-ported memories on FPGAs,” ACM Trans. Reconfigurable Technol. Syst., vol. 7, no. 3, Aug. 2014, Art. no. 16.
[4] Xilinx. 7 Series FPGAs Configurable Logic Block User Guide, accessed on May 30, 2016. [Online]. Available: http://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf
[5] Xilinx. Zynq-7000 All Programmable SoC Overview, accessed on May 30, 2016. [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf
[6] G. A. Malazgirt, H. E. Yantir, A. Yurdakul, and S. Niar, “Application specific multi-port memory customization in FPGAs,” in Proc. IEEE Int. Conf. Field Program. Logic Appl. (FPL), Sep. 2014, pp. 1–4.
[7] H. E. Yantir and A. Yurdakul, “An efficient heterogeneous register file implementation for FPGAs,” in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), May 2014, pp. 293–298.
[8] H. E. Yantir, S. Bayar, and A. Yurdakul, “Efficient implementations of multi-pumped multi-port register files in FPGAs,” in Proc. Euromicro Conf. Digit. Syst. Design (DSD), Sep. 2013, pp. 185–192.
[9] J.-L. Lin and B.-C. C. Lai, “BRAM efficient multi-ported memory on FPGA,” in Proc. Int. Symp. VLSI Design, Autom. Test (VLSI-DAT), Apr. 2015, pp. 1–4.

Bo-Cheng Charles Lai (M’09) received the B.S. degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1999, and the M.S. and Ph.D. degrees in electrical engineering from the University of California at Los Angeles (UCLA), Los Angeles, CA, USA, in 2003 and 2007, respectively.
He was with Broadcom Corporation, Irvine, CA, USA, from 2007 to 2009. He joined National Chiao Tung University in 2009, where he is currently an Associate Professor with the Department of Electronics Engineering. His current research interests include parallel computing, multicore architecture, low power designs, and embedded systems.
Dr. Lai received the scholarship from the John Deere Foundation in 2003. During his study at UCLA, he won the Design Automation Conference student design contest in 2003 and 2005.

Jiun-Liang Lin received the M.S. degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2015.
He is currently a Digital Design Engineer with Mediatek Inc., Hsinchu. His current research interests include designs and optimizations of multiported memory and field-programmable gate array.