
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 25, NO. 1, JANUARY 2017

Efficient Designs of Multiported Memory on FPGA


Bo-Cheng Charles Lai, Member, IEEE, and Jiun-Liang Lin

Abstract— The utilization of block RAMs (BRAMs) is a critical performance factor for multiported memory designs on field-programmable gate arrays (FPGAs). Not only does the excessive demand on BRAMs block the usage of BRAMs by other parts of a design, but the complex routing between BRAMs and logic also limits the operating frequency. This paper first introduces a brand new perspective and a more efficient way of using a conventional two reads one write (2R1W) memory as a 2R1W/4R memory. By exploiting the 2R1W/4R as the building block, this paper introduces a hierarchical design of 4R1W memory that requires 25% fewer BRAMs than the previous approach of duplicating the 2R1W module. Memories with more read/write ports can be extended from the proposed 2R1W/4R memory and the hierarchical 4R1W memory. Compared with previous XOR-based and live value table-based approaches, the proposed designs can, respectively, reduce up to 53% and 69% of BRAM usage for 4R2W memory designs with 8K-depth. For complex multiported designs, the proposed BRAM-efficient approaches can achieve higher clock frequencies by alleviating the complex routing in an FPGA. For 4R3W memory with 8K-depth, the proposed design can save 53% of BRAMs and enhance the operating frequency by 20%.

Index Terms— Block RAM (BRAM), field-programmable gate array (FPGA), multiported memory, performance.

Manuscript received September 26, 2015; revised January 21, 2016 and March 21, 2016; accepted April 26, 2016. Date of publication June 8, 2016; date of current version December 26, 2016. This work was supported by the Ministry of Science and Technology, Taiwan, under Grant MOST 104-2622-8-009-001.

B.-C. C. Lai is with the Department of Electronics Engineering, National Chiao Tung University, Hsinchu 30010, Taiwan (e-mail: bclai@mail.nctu.edu.tw).

J.-L. Lin is with MediaTek Inc., Hsinchu 30076, Taiwan (e-mail: qazhphphphp3@gmail.com).

Digital Object Identifier 10.1109/TVLSI.2016.2568579

I. INTRODUCTION

FIELD-PROGRAMMABLE gate arrays (FPGAs) have been broadly used in fast prototyping of complex digital systems. FPGAs contain programmable logic arrays, usually referred to as slices [4]. Slices can be configured into different logic functions, and flexible routing channels support data transfers between logic slices. In addition to implementing logic operations, the slices can, if needed, also be used as storage elements, such as flip-flops, register files, or other memory modules. Due to the increasing complexity of digital systems, there is a growing demand for in-system memory modules. Synthesizing a large number of memory modules would consume a significant amount of slices, and would therefore result in an inefficient design. The excessive usage of slices could also pose a limiting factor to the maximum size of a system that can be prototyped on an FPGA.

To support in-system memory more efficiently, modern FPGAs deploy block RAMs (BRAMs), hardcore memory blocks integrated within an FPGA to support efficient memory usage in a design. Compared with a storage module synthesized from slices, BRAMs are more area and power efficient while at the same time achieving higher operating frequencies. An FPGA usually deploys multiple BRAMs with the same specification. For example, a Xilinx Virtex-7 XC7V585 FPGA contains 795 36-kb BRAMs, and each BRAM can be configured in two-port mode or dual-port mode [4]. Designers can utilize these memory blocks to implement the in-system storage modules of a design.

Multiported memories, which allow multiple concurrent reads and writes, are frequently used in various digital designs on FPGAs to achieve high memory bandwidth. For example, the register file of an FPGA-based scalar MIPS-like soft processor [3] requires one write port and two read ports. Processors that issue multiple instructions require even more access ports. The shared cache system among multiple soft processors on an FPGA should support multiple concurrent accesses. A routing table in a network switching function would also need to enable multiple accesses in order to serve requests from different ingress ports. Time-multiplexing and task scheduling are alternative ways to support multiple accesses. However, these schemes usually lead to more complex designs with many corner cases, and therefore require extra verification effort. Having a generic multiported memory module can greatly ease the design when a system needs to provide concurrent data accesses. This concern is especially important because FPGAs are often used for quick prototyping and functional verification.

Although current FPGA design tools can automatically synthesize a multiported memory by configuring slices, this approach has been demonstrated to be considerably inefficient in terms of the utilization of slices [1]. The increasing logic depth also becomes a limiting factor to the maximum operating frequency. Designs of multiported memories that leverage BRAMs [1]–[3] have been proposed to attain better utilization of FPGA resources as well as better system performance. BRAMs in an FPGA support two access ports, each of which can be used as either a read or a write port. With this basic specification, extra effort is required if designers would like to implement a storage module that supports more concurrent access ports than the existing BRAMs provide. Compared with multiported memory designs that only utilize slices, the previous approaches based on BRAMs [1]–[3] have demonstrated less total equivalent area while achieving higher frequencies.

The limited quantity and capacity of BRAMs have posed severe concerns to designers when implementing multiported memories on an FPGA. Using an excessive amount of BRAMs for multiported memory could seriously restrict the usage of BRAMs by other parts of a design.
Furthermore, insufficient BRAMs would force designers to synthesize the storage modules with slices, which further consumes a vast amount of slices and limits the maximum operating frequency of the design. For example, given BRAMs with a size of 36 kb [5], the design methodologies proposed in [2] and [3] require a total of 80 BRAMs to implement a 4R2W (four read ports and two write ports) memory with a 32-bit data width and 8K-depth. The same memory module demands 64 BRAMs for the designs in [1] and [3]. Our previous work [9] introduced a two reads one write (2R1W) module based on XOR-encoded values and demonstrated better performance with fewer BRAMs compared with [1]–[3]. However, these designs still occupy more than half of the available BRAMs on some modern FPGAs, such as the Zynq-7015 and 7020 [5], [9]. The issue is aggravated when the design requires a huge internal storage capacity. Having a more efficient design of multiported memories, therefore, remains an imperative design concern for modern digital systems on FPGAs.

This paper aims at efficient designs of multiported memory on FPGAs. The main contributions of this paper are as follows. First, this paper introduces a brand new perspective of using a 2R1W module as either a 2R1W or a 4R module, denoted as a 2R1W/4R memory. Second, by exploiting the versatile usage mode of the proposed 2R1W/4R memory, this paper proposes a hierarchical design of 4R1W memory. This hierarchical 4R1W design requires 25% fewer BRAMs than the approach of duplicating the 2R1W module used in [9]. Memories with more read/write ports can be extended from the proposed 2R1W/4R memory and the hierarchical 4R1W memory. Third, the experiments in this paper extensively explore various design parameters, including the numbers of read and write ports and different depths of multiported memory modules. Compared with XOR-based and live value table (LVT)-based designs from previous approaches, the proposed designs can, respectively, reduce up to 53% and 69% of BRAM usage for 4R2W memory designs with 8K-depth. For complex multiported designs, the proposed BRAM-efficient approaches can achieve higher clock frequencies by alleviating the complex routing in an FPGA. For 4R3W memory with 8K-depth, the proposed design can save 53% of BRAMs while at the same time enhancing the operating frequency by 20%. This paper also explores the impact of different bank organizations in a design. It shows that, for a 4R3W memory with 32-bit data width and 8K-depth, a proper bank organization can achieve a 16% BRAM reduction, a 21% higher frequency, and a 27% lower slice utilization. The results present great potential for future design refinement that could be achieved by optimizing the bank organizations.

This paper is organized as follows. Section II discusses the previous design approaches of multiported memory on FPGAs. Section III proposes the 2R1W/4R memory and the hierarchical 4R1W memory that achieve more efficient multiported memories on FPGAs. Section IV shows the experimental results and comparisons between different works. Section V draws conclusions.

II. RELATED WORKS

TABLE I
TECHNIQUES OF MULTIPORTED MEMORY ON FPGA

To implement a multiported memory on an FPGA, two types of design techniques are required, namely, increasing read ports and increasing write ports. Table I lists the techniques proposed by previous works for multiported memories on FPGAs. The approach of replication [1], [3] enables multiple read ports by replicating the data on multiple BRAMs. This technique uses control logic of low complexity, but requires excessive usage of BRAMs. LVT, which is implemented by synthesizing slices on the FPGA, enables multiple write ports by duplicating BRAMs and tracking which BRAM stores the latest value of an address. The other approach to increase write ports is referred to as XOR-based [2]. Different from LVT, which uses a table to track the location of the latest value, the XOR-based design duplicates BRAMs and encodes the stored data with XOR operations. The target data can be retrieved by applying the XOR again. In general, the XOR-based approach can achieve a higher operating frequency, but requires more BRAMs than the LVT approach.

Note that this paper focuses on architectural solutions to achieve multiple accesses for a general memory that takes requests in the current cycle and returns results in the next cycle. Users of the multiported memory can be completely ignorant of the details of the memory design. There are other works focusing on enabling multiple accesses for specific types of storage elements, such as register files [6]–[8]. They enable concurrent reads with an approach similar to replication, but avoid write conflicts by renaming the registers with software approaches, such as a compiler or assembler. These approaches, which tackle specific storage functions and involve effort from users, are not in the scope of this paper.

The following sections provide more in-depth discussions about the implementations and design concerns of these techniques. To facilitate a more general discussion, the following paragraphs use a memory bank to refer to a standalone memory module used as a building block to implement a memory system. A memory system usually consists of multiple banks. The memory space, also referred to as the memory depth, is distributed across the banks. When designing a memory system on FPGAs, a BRAM can be used to support the complete memory space. BRAMs can also be deployed as banks to enable a larger memory space or higher access bandwidth.

A. Techniques to Increase Read Ports
Fig. 1. (a) Replication technique that replicates BRAMs to support m read ports. (b) Example of a memory design that can support two concurrent reads with two BRAMs M0 and M1. The read R1 is accessing M0 at address 2, while the read R2 is accessing M1 at address 3.

Replication is a widely adopted technique to increase read ports [1], [3]. This technique enables multiple read ports by replicating the data to multiple BRAMs. When there are multiple reads, each of these reads will be directed to a distinct BRAM to access the target data without conflicting with the other reads. The data between these BRAMs should be updated simultaneously. When there is a write to an address, this write needs to be routed to every BRAM and update the data at the corresponding address in each BRAM.

Fig. 1(a) illustrates how multiple read ports are enabled by the replication technique. Assume a multiported memory design supporting m concurrent reads. Each Mi, where i is from 0 to m−1, is a BRAM module. The data in M0 are replicated to the other BRAMs (M1 to Mm−1) in order to support multiple concurrent reads R0 to Rm−1. When there is a write W0, the write request is routed to the write ports of all the BRAMs and updates the values at the corresponding address of each BRAM simultaneously. Fig. 1(b) shows an example of a memory that can support two concurrent reads. In this example, there are two data stored in the BRAMs, where data A are at address 2 and data B are at address 3. When the memory receives two read requests R0 and R1 accessing addresses 2 and 3, respectively, each read request can access one of the BRAMs and avoid conflicting with the other. In this way, the memory with two BRAMs M0 and M1 can support two simultaneous reads.

The main advantage of replication is its simplicity without requiring complex control logic; however, it needs m replicas of the memory module, where m is the number of read ports supported by the target multiported memory design.
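The behavior of the replication scheme can be summarized with a short software model. The following Python sketch is our illustration rather than code from the paper (names such as ReplicatedMemory are ours): a write is broadcast to every replica, and each read port is statically tied to its own replica.

    class ReplicatedMemory:
        """Behavioral sketch of an mR1W replicated memory."""

        def __init__(self, num_read_ports, depth):
            # One replica of the full memory space per read port.
            self.replicas = [[0] * depth for _ in range(num_read_ports)]

        def write(self, addr, value):
            # The single write port broadcasts to every replica so that
            # all read ports observe the same contents.
            for replica in self.replicas:
                replica[addr] = value

        def read(self, port, addr):
            # Read port i is hard-wired to replica i, so the m reads can
            # never conflict with each other.
            return self.replicas[port][addr]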
B. Techniques to Increase Write Ports

There are two types of techniques that have been used to increase write ports in a multiported memory design on FPGAs. The first technique applies the LVT. This technique utilizes multiple replicated BRAMs to enable multiple concurrent writes. Different write ports are connected to different BRAMs. A table, called the LVT, is used to track the correct BRAM that stores the most up-to-date data of an address. The second technique uses an XOR-based scheme to enable multiple write ports. The XOR-based technique encodes the stored data by using XOR operations, and retrieves the correct data by applying the XOR again. The following paragraphs elaborate the details of these techniques.

Fig. 2. (a) LVT technique that supports n write ports. The design replicates n BRAMs and uses an LVT to track the most up-to-date value of each address. (b) Example of a memory design that can support two concurrent writes with the LVT technique. This figure shows the initial state of the memory module before the two writes W0 and W1. W0 is going to address 5 in BRAM M0 while W1 is going to address 7 in BRAM M1. (c) For the same example, this figure shows the memory state after completing the W0 and W1 requests.

LVT is a technique proposed in [1] and [3] to support multiple write ports on FPGAs. Fig. 2(a) illustrates the data access flow of the LVT design that enables n concurrent write requests. The memory is replicated n times with BRAMs M0 to Mn−1. In Fig. 2, each BRAM Mi, where i is from 0 to n − 1, can support one write port. Since multiple writes would update different memory addresses, an additional block, the LVT, is implemented to keep track of the location of the latest value of a memory address. Fig. 2(b) illustrates an example with two write requests W0 and W1. W0 writes data A to address 5, while W1 writes data B to address 7. In this example, W0 stores data A to BRAM M0, and W1 stores data B to BRAM M1. Each BRAM is associated with an identification number defined by the designers. This paper simply uses the serial number of a BRAM as its identification number. The writes W0 and W1 will then, respectively, update the LVT with the identification numbers of the BRAMs that keep the latest data. As shown in Fig. 2(c), the identification numbers for M0 and M1 (values 0 and 1) are updated at entries 5 and 7, respectively, in the LVT. After the LVT is updated, a following read request R0 will query the LVT to select the correct BRAM that stores the latest value of the target data.
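A behavioral sketch of the LVT scheme follows; it is our illustration under the assumptions above (one bank per write port, serial numbers as identification numbers), not the circuit of [1] and [3].

    class LVTMemory:
        """Behavioral sketch of a memory with n write ports using an LVT."""

        def __init__(self, num_write_ports, depth):
            self.banks = [[0] * depth for _ in range(num_write_ports)]
            self.lvt = [0] * depth  # per-address id of the last-written bank

        def write(self, port, addr, value):
            # Write port i stores only into bank i, then records in the LVT
            # that bank i now holds the latest value of this address.
            self.banks[port][addr] = value
            self.lvt[addr] = port

        def read(self, addr):
            # A read first queries the LVT and then fetches the data from
            # the bank that performed the most recent write to this address.
            return self.banks[self.lvt[addr]][addr]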
Fig. 3. XOR-based memory design that can support two simultaneous writes W0 and W1 and one read R0. A write will store an encoded value of both the new and stale data. A read can recover the most recent value of the target address by applying the XOR operation again.

The XOR-based approach proposed in [2] is a way to increase write ports without a table. Fig. 3 illustrates an XOR-based memory design that can support two simultaneous writes W0 and W1 and one read R0. The design contains a total of four BRAMs, where each BRAM is assumed to support one read and one write. The XOR-based design encodes the stored data by using XOR operations. For example, when W0 stores a new value Anew, the Anew will be XOR-ed with the stale value Astale at the same memory address from the two bottom BRAMs. The XOR-ed value of the two instances (Anew ⊕ Astale) will be stored to the two top BRAMs in Fig. 3. Similarly, the write W1 will XOR the values Bnew and Bstale and store the encoded value to the two bottom BRAMs in Fig. 3.

With the encoded values, the read R0 can recover the most recent value Anew by XOR-ing the two values from the two BRAMs on the right of Fig. 3. The value-recovering operation is also shown in (1). The XOR-based approach can achieve multiple writes without requiring a table to track the position of the most recent value. However, this approach needs n² memory modules to enable a design with n writes and one read

R0 = (Anew ⊕ Astale) ⊕ Astale = Anew. (1)
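The encoding in (1) can be exercised with a small behavioral model. The sketch below is our illustration, which collapses the replicated BRAMs of Fig. 3 into one logical table per write port: a write stores its new value XOR-ed with the stale values held by the other ports, and a read recovers the plain value by XOR-ing all tables.

    from functools import reduce

    class XorWriteMemory:
        """Behavioral sketch of an XOR-based memory with n write ports."""

        def __init__(self, num_write_ports, depth):
            # One logical table per write port; the FPGA design replicates
            # each table across BRAMs to obtain enough physical read ports.
            self.tables = [[0] * depth for _ in range(num_write_ports)]

        def write(self, port, addr, value):
            # Encode: XOR the new value with the stale values kept by the
            # other write ports at the same address.
            stale = reduce(lambda a, b: a ^ b,
                           (t[addr] for i, t in enumerate(self.tables)
                            if i != port), 0)
            self.tables[port][addr] = value ^ stale

        def read(self, addr):
            # Decode: XOR-ing all tables cancels the stale terms and leaves
            # the most recently written value, exactly as in (1).
            return reduce(lambda a, b: a ^ b,
                          (t[addr] for t in self.tables), 0)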
C. Integrating the Read/Write Techniques

A general multiported memory needs to support multiple read ports and multiple write ports simultaneously. The respective techniques for reads and writes discussed in the previous sections therefore should be integrated in a design to support both multiple read ports and multiple write ports. This section uses two examples to discuss how these techniques can be integrated to enable a multiported memory design.

Fig. 4. (a) Example of a 2R2W memory that integrates both replication and LVT techniques. (b) Initial state of the multiported memory. W0 is going to address 2 in BRAM2 M20 while W1 is going to address 3 in BRAM2 M21. R0 reads address 0 and R1 reads address 1. (c) Memory state after completing all the reads and writes (W0, W1, R0, and R1).

Fig. 4 illustrates the integrated design proposed in [1] and [3]. The design combines the techniques of replication and LVT to support two reads (R0 and R1) and two writes (W0 and W1). The multiple writes are handled by the LVT, while the multiple reads can be serviced by the replicated BRAMs. The basic building block shown in Fig. 4 is a BRAM2 module M2i. By composing two basic BRAMs, each BRAM2 module applies the replication technique introduced in Fig. 1 to support 2R1W. Fig. 4(a) shows the architecture of this design. Each write is associated with one of the BRAM2 modules. For example, W0 is connected to M20 while W1 is connected to M21. When there is a read request, the read will query the LVT to locate the correct BRAM2 module that stores the latest value and retrieve the target data from that BRAM2 module.

Fig. 4(b) and (c) use an example to show the data flow of this integrated design. In Fig. 4(b), the write W0 updates the value C at address 2 while W1 stores the value D to address 3. The reads R0 and R1 are retrieving the data from address 0 (value A) and address 1 (value B), respectively. These four accesses (W0, W1, R0, and R1) are sent to the memory simultaneously. The reads try to identify the correct BRAM2 modules that have the most recent values of the target data. After querying the LVT, the most recent values for R0 (address 0) and R1 (address 1) are both located at M20. Recall that each BRAM2 is a 2R1W module. The two reads (R0 and R1), in this example, will both access M20. Meanwhile, the two writes W0 and W1 will, respectively, store the data to M20 and M21, and update the LVT. Fig. 4(c) shows the memory state after completing the requests of all the reads and writes.

Fig. 5 shows the approach adopted in [2] that combines replication and the XOR-based technique to support two reads (R0 and R1) and two writes (W0 and W1). Note that each building block in Fig. 5 is also a BRAM2 module that can support 2R1W. Fig. 5(a) shows the initial state of the 2R2W memory. Fig. 5(b) shows the data flow of two reads and two writes. R0 reads both the values Astale and A ⊕ Astale from address 2 at two BRAM2 modules, and recovers A by XOR-ing these two values.
Fig. 5. Example of a 2R2W memory that integrates both replication and XOR-based techniques. (a) Initial state of the memory module. W0 is going to address 0 while W1 is going to address 1. R0 reads address 2 and R1 reads address 3. (b) Memory state after completing all the reads and writes (W0, W1, R0, and R1).

R1 can recover B with a similar flow. The operations of the reads are shown in the following equations:

R0 = (A ⊕ Astale) ⊕ Astale = A (2)
R1 = (B ⊕ Bstale) ⊕ Bstale = B. (3)

At the same time, W0 reads the value Cstale from the three BRAMs at the bottom of Fig. 5(b), and updates the encoded value Cnew ⊕ Cstale to address 0 of the three BRAMs on the top of Fig. 5(b). W1 can update the value at address 1 with a similar flow.

Recall that each BRAM2 is a 2R1W module. In this 2R2W design, each write needs to first read all of the stale values from the BRAM2 modules and update the XOR-ed values. Therefore, the design in Fig. 5 needs a total of six BRAM2 modules to provide sufficient internal read ports and support all the data accesses. Each read port of the multiported memory needs to occupy two internal read ports of a BRAM2 module. Overall, an mRnW replication-XOR-based memory needs n × (n − 1 + m) BRAM2 modules. For example, the 2R2W design in Fig. 5 (m = 2, n = 2) needs 2 × (2 − 1 + 2) = 6 BRAM2 modules.

III. PROPOSED DESIGN

This section proposes efficient solutions to implement multiported memories on FPGAs. Unlike the replication method in [1] and [3], the approach proposed in this paper supports multiple reads with XOR operations, while multiple writes can be enabled using additional BRAMs. A remap table is added to track the location of the correct data. The main memory architecture is similar to that of our previous work introduced in [9]. On top of the main architecture, this paper introduces a brand new perspective of using a 2R1W module as either a 2R1W or a 4R module, denoted as a 2R1W/4R memory. By applying the 2R1W/4R, this paper exploits the versatile usage mode and proposes a hierarchical XOR-based design of 4R1W memory that requires fewer BRAMs than previous designs. Memories with more read/write ports can be supported by extending the proposed 2R1W/4R memory and the hierarchical 4R1W memory.

A. Techniques to Increase Read Ports

1) Bank Division With XOR Design Scheme: Bank division with XOR (BDX) is an approach to increase read ports proposed in [9]. Unlike the method used in [1] and [3], BDX avoids replicating the storage elements of the whole memory space. With BDX, multiple reads can be supported by using XOR operations. Note that BDX is different from the XOR-based design in [2]. The XOR-based approach in [2] uses XOR operations to increase write ports by storing encoded data to maintain the data coherence between memory modules. BDX uses XOR operations to increase read ports by retrieving the target data from the encoded value.

Fig. 6. Example of a 2R1W memory implemented with the BDX technique. (a) Supporting multiple reads with XOR operations. (b) Supporting a write request in the two-cycle pipeline architecture.

Fig. 6 illustrates an example of a 2R1W memory implemented with the BDX scheme. As shown in Fig. 6(a), the memory space is distributed to four data banks (banks 0–3). One XOR-bank is added to keep the XOR values of the data banks. Each of these memory banks is assumed to be a 2RW module that can support two reads, or two writes, or one read and one write. This is the dual-port mode supported by the BRAMs in the Virtex-7 FPGA. Compared with the dual-port mode of BRAMs, the BDX scheme in Fig. 6 enables one more read port. Therefore, the BDX design becomes a 2R1W memory module which can support two reads and one write concurrently. The XOR-bank stores the XOR-ed value of all the data at the same offset in every data bank. The XOR operation for this example is shown in (4). Pn represents the value stored at offset n of the XOR-bank.
D(m,n) represents the data at offset n of bank m. Since there are four data banks in the example shown in Fig. 6(a), the values at the same offset (offset n) of each bank (banks 0–3) will be XOR-ed and stored at the same offset (offset n) of the XOR-bank

Pn = D(0,n) ⊕ D(1,n) ⊕ D(2,n) ⊕ D(3,n). (4)

As shown in Fig. 6(a), two reads R0 and R1 are both going to bank 1. In this example, one read (R0) will access the data directly from bank 1 (R0 = D(1,0)). Due to the bank conflict, the other read R1 cannot access bank 1. Instead, the target data of R1 can be recovered by XOR-ing the values at the same offset of the other data banks as well as the XOR-bank. As shown in the following equation, the target data at offset n of data bank 1 can be recovered by XOR-ing the values at the same offset (offset n) from the other banks (banks 0, 2, and 3), as well as the XOR-bank:

D(1,n) = D(0,n) ⊕ D(2,n) ⊕ D(3,n) ⊕ Pn. (5)

The write request in this example is implemented in a two-cycle pipeline architecture. As illustrated in Fig. 6(b), in the first cycle W0 writes the data D(1,n) directly to bank 1. In the same cycle, the data at the same offset of the other banks [D(0,n), D(2,n), and D(3,n)] together with D(1,n) are read and XOR-ed. In the second cycle, the XOR-ed value is written back to the XOR-bank at offset n. With the pipeline architecture, this design can process one write request every cycle.
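The BDX read and write paths can be captured in a few lines of Python. The sketch below is our illustration under stated assumptions (an interleaved address-to-bank mapping, and the two-cycle write of Fig. 6(b) collapsed into one step); names such as BDXMemory are ours.

    class BDXMemory:
        """Behavioral sketch of a BDX 2R1W memory (Fig. 6)."""

        def __init__(self, num_banks, bank_depth):
            self.num_banks = num_banks
            self.banks = [[0] * bank_depth for _ in range(num_banks)]
            self.xor_bank = [0] * bank_depth  # holds Pn of (4)

        def _locate(self, addr):
            # Assumed interleaved mapping of the flat space onto the banks.
            return addr % self.num_banks, addr // self.num_banks

        def read2(self, addr0, addr1):
            (b0, off0), (b1, off1) = self._locate(addr0), self._locate(addr1)
            v0 = self.banks[b0][off0]      # direct access for the first read
            if b1 != b0:
                v1 = self.banks[b1][off1]  # no conflict: direct access
            else:
                # Bank conflict: recover the value via the XOR path of (5).
                v1 = self.xor_bank[off1]
                for b in range(self.num_banks):
                    if b != b1:
                        v1 ^= self.banks[b][off1]
            return v0, v1

        def write(self, addr, value):
            # Hardware performs this as the two pipelined cycles of
            # Fig. 6(b); the model performs both steps at once.
            bank, off = self._locate(addr)
            self.banks[bank][off] = value
            p = 0
            for b in range(self.num_banks):
                p ^= self.banks[b][off]
            self.xor_bank[off] = p         # write-back of the XOR-ed value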
The 2R1W memory introduced previously only needs one additional XOR-bank that requires the same storage size as each data bank. In this case, assume all the data banks are of equal size. The number of memory entries in the XOR-bank is MemoryDepth/#DataBanks, where MemoryDepth represents the total number of memory entries in the memory space and #DataBanks denotes the number of data banks in the multiported memory.

Fig. 7. Example of an mR1W memory implemented with multiple 2R1W modules.

To support more than two read ports, BDX needs to deploy multiple 2R1W memories. Fig. 7 shows an example of an mR1W memory. Each 2R1W module supports two out of the m reads. In this way, all the m reads can be serviced concurrently by multiple 2R1W modules. Note that when implementing a multiported memory on FPGAs, replicating memory modules could significantly increase the usage of the limited BRAMs on an FPGA. To address this issue, this paper proposes a new architecture to increase read ports, referred to as hierarchical bank division with XOR (HBDX). HBDX applies a new perspective of using a 2R1W module as a 2R1W/4R module. Section III-A2 will discuss the 2R1W/4R module, while Section III-A3 will introduce the design of HBDX.

Fig. 8. Two modes of the 2R1W/4R module. The module is implemented with four data banks and one XOR-bank. (a) 2R1W mode. (b) 4R mode.

2) 2R1W/4R (An Efficient Two-Mode Memory): To implement HBDX in an efficient way, this paper introduces a brand new perspective of using a 2R1W module as either a 2R1W or a 4R module. This new way of using the 2R1W module is denoted as 2R1W/4R. This hybrid module can support either 2R and 1W, or 4R. Note that the 2R1W/4R module uses exactly the same design as the 2R1W module introduced in Fig. 6. Fig. 8 illustrates how the two modes work. Fig. 8(a) shows the 2R1W mode. When there is a write request W0, this design can support up to two conflicting reads. The write request W0 stores the data directly to the target data bank, and reads all the data at the same offset from the other data banks (Rupdate) to update the XOR-bank. Fig. 8(b) shows the 4R mode. This mode only works when there is no write request. In this case, the design can support up to four conflicting reads. Consider one of the worst cases, when all four reads (R0 to R3) are going to bank 0. As illustrated in Fig. 8(b), R0 and R2 access bank 0 directly. At the same time, R1 and R3 can retrieve the target data by XOR-ing the values at the same offset of the other data banks as well as the XOR-bank.
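Because the 4R mode reuses the same banks and the same XOR path, the two usage modes can be expressed on top of the BDX sketch above with little extra code; this is again our illustration, not the paper's circuit, and the per-port accounting is summarized in the comments.

    class TwoModeMemory(BDXMemory):
        """2R1W/4R usage of the BDX structure (Fig. 8)."""

        def read4(self, addr0, addr1, addr2, addr3):
            # 4R mode: legal only in a cycle without a write request. Each
            # dual-ported bank offers two direct reads, and the XOR-recovery
            # path of (5) absorbs the rest, so even four reads that all hit
            # the same bank can be served.
            v0, v1 = self.read2(addr0, addr1)  # direct port + XOR path
            v2, v3 = self.read2(addr2, addr3)  # second port + XOR path
            return v0, v1, v2, v3

        def read2_and_write(self, addr0, addr1, waddr, wvalue):
            # 2R1W mode: up to two conflicting reads plus one write; the
            # reads return the values held before the write of this cycle.
            values = self.read2(addr0, addr1)
            self.write(waddr, wvalue)
            return values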
3) HBDX Designs With the 2R1W/4R Module: Fig. 7 illustrates a design scheme that can support more read ports by replicating the 2R1W module. However, this design scheme could significantly increase the usage of the limited BRAMs on an FPGA. To achieve a more BRAM-efficient design, this paper proposes HBDX, which adopts a hierarchical structure that organizes the 2R1W modules to achieve 4R1W without replicating the 2R1W module. To further enhance the design, the HBDX in this section leverages the 2R1W/4R scheme introduced in the previous section as the basic building module to implement a 4R1W module.

Fig. 9. HBDX 4R1W implemented with 2R1W/4R modules.

Fig. 9 illustrates a 4R1W memory design using the HBDX scheme. In this 4R1W design, each basic building block is a 2R1W module of the BDX scheme introduced in Fig. 6. According to the 2R1W/4R memory proposed in the previous section, this 2R1W module can be used as either a 2R1W or a 4R module. The HBDX in Fig. 9 utilizes this versatile usage mode to achieve a more efficient 4R1W design.
Consider one of the worst cases, when all the 4R and 1W are going to bank 0. Two read requests, R0 and R1 in this example, would read directly from bank 0. The other two read requests, R2 and R3, would read the same offset from the other data banks (banks 1–3) and the XOR-bank to recover the target data. The write request W0 can be supported with a pipeline architecture similar to the one shown in the example of Fig. 6(b). In this case, W0 stores the data directly to bank 0. The data at the same offset of the other data banks are read and XOR-ed in the same cycle. The XOR-ed value is stored back to the XOR-bank in the next cycle. In this example, bank 0 is used in the 2R1W mode, while the other banks are used in the 4R mode.
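The top-level read routing of HBDX can be illustrated with a simplified scheduler. The sketch below is our own simplification of the policy just described (it omits the fine-grained port accounting inside each 2R1W/4R module): at most two of the four reads access a given top-level bank directly, and the overflow reads are recovered through the top-level XOR-bank.

    def schedule_hbdx_reads(read_banks):
        """Assign each of the four reads a direct port or the XOR-recovery
        path, given the top-level bank each read targets (illustrative)."""
        direct_used = {}
        plan = []
        for bank in read_banks:
            used = direct_used.get(bank, 0)
            if used < 2:
                # Even the bank being written (2R1W mode) still offers two
                # direct read ports, so two direct reads per bank are safe.
                direct_used[bank] = used + 1
                plan.append((bank, "direct"))
            else:
                # Overflow reads XOR the other banks and the XOR-bank, as
                # R2 and R3 do in the worst case of Fig. 9.
                plan.append((bank, "xor-recover"))
        return plan

    # Worst case above, all four reads targeting bank 0:
    # schedule_hbdx_reads([0, 0, 0, 0]) ->
    # [(0, 'direct'), (0, 'direct'), (0, 'xor-recover'), (0, 'xor-recover')]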
The HBDX-based 4R1W memory has been demonstrated to be more BRAM efficient than the previous approach of simply duplicating the 2R1W module [9]. Section IV will more comprehensively discuss the design tradeoffs between the number of BRAMs and the operating frequency on an FPGA.

B. Techniques to Increase Write Ports

Bank division with remap table (BDRT) is an approach to increase write ports proposed in [9]. Unlike the LVT design used in [1] and [3], BDRT avoids replicating the whole memory space and supports multiple writes using additional BRAMs and a remap table to track the location of the latest data. Fig. 10 shows an example of the design for a 1R2W memory. This example consists of two data banks (banks 0 and 1), one bank buffer, and a remap table. The Null entries in a memory bank are the entries that do not store any valid data. When receiving W0 and W1, these requests will first look up the remap table to identify the correct BRAM that stores the latest data. According to the remap table, W0 and W1 are, respectively, going to address 0 and address 1 in bank 0. In this case, W0 will store the value directly to address 0 in bank 0. W1 will be directed to the bank buffer, whose offset 1 is a Null entry. At the same time, the remap table needs to be updated to reflect the modified location of W1, as shown in Fig. 10(b). Address 1 of the remap table stores the value 2, which is the identification number of the bank buffer. This state of the remap table shows that the data of address 1 are now stored in the bank buffer.

Fig. 10. Example of a 1R2W memory implemented with the BDRT technique. (a) According to the remap table, both W0 and W1 are going to bank 0. The Null entries in BRAMs are the entries that do not store any valid data. (b) Final state of the multiported memory after completing the two writes W0 and W1.

Compared with the LVT approach used in previous works, BDRT requires smaller storage space. The extra cost of BDRT is the additional registers used to implement the remap table. The required number of these extra registers is shown in the following equation, where MemoryDepth is the total number of memory entries in the memory space and #DataBanks denotes the number of data banks in the multiported memory:

# of registers for remap table = (log2(#DataBanks + 1) − 1) × MemoryDepth. (6)
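A behavioral sketch of the 1R2W BDRT design of Fig. 10 is given below. It is our illustration under stated assumptions: addresses are split across the two data banks by their high-order bits, the single bank buffer carries identification number 2, and the buffer entry at the conflicting offset is assumed to be Null, as in the figure.

    class BDRTMemory:
        """Behavioral sketch of a 1R2W BDRT memory: two data banks,
        one bank buffer, and a remap table."""

        def __init__(self, depth):
            self.bank_depth = depth // 2
            # Banks 0 and 1 are data banks; bank 2 is the bank buffer.
            self.banks = [[None] * self.bank_depth for _ in range(3)]
            # remap[addr] = identification number of the bank that holds
            # the latest value of addr (its home bank by default).
            self.remap = [addr // self.bank_depth for addr in range(depth)]

        def _home(self, addr):
            return addr // self.bank_depth

        def _offset(self, addr):
            return addr % self.bank_depth

        def write2(self, a0, v0, a1, v1):
            # The first write always goes to its home bank.
            self.banks[self._home(a0)][self._offset(a0)] = v0
            self.remap[a0] = self._home(a0)
            if self._home(a1) != self._home(a0):
                self.banks[self._home(a1)][self._offset(a1)] = v1
                self.remap[a1] = self._home(a1)
            else:
                # Bank conflict: redirect the second write to the bank
                # buffer and record the new location in the remap table.
                self.banks[2][self._offset(a1)] = v1
                self.remap[a1] = 2

        def read(self, addr):
            # The remap table names the bank holding the latest value.
            return self.banks[self.remap[addr]][self._offset(addr)]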
C. Integrating the Read/Write Techniques

Fig. 11. Architecture that integrates HBDX and BDRT to achieve an mRnW memory.

This section shows the design of a multiported memory that integrates HBDX and BDRT. Fig. 11 shows an architecture that uses HBDX and BDRT to implement an mRnW memory. This memory architecture is divided into k data banks. Based on BDRT, a total of n − 1 bank buffers are required to support all the writes. A hash mechanism is added to distribute the writes to the banks.

1) Example of a 2R2W Memory: Fig. 12 shows an example of a 2R2W memory that applies this architecture. In Fig. 12(a), assume there are two write requests and two read requests. The writes W0 and W1 are going to addresses 0 and 1, respectively. The reads R0 and R1 are retrieving the data from addresses 2 and 3, respectively.

Fig. 12. Example of a 2R2W memory that integrates BDX and BDRT. (a) Initial state of the memory module. W0 is going to address 0 while W1 is going to address 1. R0 reads address 2, and R1 reads address 3. (b) Memory state after completing W0, W1, R0, and R1.

Each bank is a 2R1W module that can support two read ports and one write port. The 2R1W module can be implemented with BDX as illustrated in Fig. 6. A read request will access all the banks and choose the data returned by the bank that stores the latest value of the target data, based on the record in the remap table. The two write requests, W0 and W1, will look up the remap table for the correct banks to store the values. In this example, the addresses of the two writes, addresses 0 and 1, both belong to bank 0. In this case, W0 will store the value directly to bank 0. W1 will be directed to a bank buffer that has a Null entry at the same offset as W1 has on bank 0. At the same time, the remap table needs to be updated to reflect the modified location of W1, as shown in Fig. 12(b).

2) Extending to a 4RnW Memory With the 2R1W/4R Module: This section further extends the design to a 4RnW memory. According to the architecture presented in Fig. 11, a 4RnW memory requires 4R1W modules as building blocks. A 4R1W module can be implemented simply by the replication technique introduced in [1], [3], and [9].

This paper proposes a more efficient way of implementing a 4RnW memory by using the 2R1W/4R module introduced in this paper. Although it supports two different usage modes, a 2R1W/4R module is essentially a 2R1W memory block in terms of design cost and performance. As will be shown in Section IV, compared with the 4R1W module built from replication, the 2R1W/4R module on an FPGA requires fewer BRAMs and attains higher operating frequencies.

Fig. 13. 4RnW memory that integrates HBDX and BDRT, and applies 2R1W/4R modules as building blocks.

Fig. 13 shows a 4RnW memory implemented with k data banks and using the 2R1W/4R module as the building block. The four reads can be supported by exploiting the 4R mode of the 2R1W/4R module. Note that a 2R1W/4R module cannot support any write requests when it is servicing more than two reads. To support n write requests, this design needs n extra bank buffers to resolve conflicting write requests. More detailed results and comparisons will be discussed in Section IV.
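The per-cycle decision can be illustrated with a simplified arbitration sketch. This is our own simplification, not the paper's exact control circuit: a bank that must serve more than two of the four reads runs in 4R mode and cannot accept a write, each bank takes at most one write per cycle, and every displaced write falls back to one of the bank buffers, with the remap table (not shown) recording the new locations.

    from collections import Counter

    def plan_writes(read_addrs, write_addrs, bank_of):
        """Decide where each write of a 4RnW cycle is stored (illustrative).
        bank_of maps an address to the bank currently holding it."""
        reads_per_bank = Counter(bank_of(a) for a in read_addrs)
        written = set()  # banks that already accepted a write this cycle
        plan = []
        for i, addr in enumerate(write_addrs):
            bank = bank_of(addr)
            if reads_per_bank[bank] > 2 or bank in written:
                # Target bank is in 4R mode, or its write port is taken:
                # redirect this write to bank buffer i.
                plan.append((addr, ("buffer", i)))
            else:
                written.add(bank)
                plan.append((addr, ("bank", bank)))
        return plan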
IV. EXPERIMENTAL RESULTS

This section compares the cost and performance of the proposed multiported memory designs with approaches from previous works. Section IV-A introduces the FPGA platform used in this paper and the experimental setup. Section IV-B uses the 4R2W memory as the design target to compare the maximum operating frequencies and slice utilization of different design approaches. Section IV-C further extends the discussion to designs for a 4R3W multiported memory. Section IV-D discusses the impact of increasing write ports and memory depth, and Section IV-E explores the impact of different bank organizations and presents the great potential of future design refinement that could be achieved by optimizing the bank organizations. Section IV-F summarizes important observations and design concerns of the proposed multiported memory.

TABLE II
DIFFERENT DESIGNS OF MULTIPORTED MEMORY

A. Experimental Environment

To have fair comparisons, we implemented the works of [1]–[3] on the same FPGA platform. All the multiported memory designs are listed in Table II. The target FPGA platform is the Virtex-7 XC7V585 FPGA, which contains 795 BRAMs and 91 050 slices [4]. Each BRAM is 32 bits wide by 1K deep, and can be configured in two-port mode (supporting one read and one write) or dual-port mode (supporting two reads, or two writes, or one read and one write). A slice contains four 6-input look-up tables, eight registers, and multiplexers. The design environment is based on Xilinx ISE 14.2. The synthesis tool is set to favor clock frequency. The data width is 32 bits for all the multiported designs in this paper. The designs of the multiported memory assume no stalls.
All the requests will be completed within one cycle. The timing analysis is performed between the input registers and the output registers of the multiported memory.

TABLE III
PERFORMANCE AND COST OF DESIGNS FOR 4R2W MEMORY WITH 8K-DEPTH AND 16K-DEPTH

B. Designs for 4R2W Memory

Table III compares the 4R2W multiported designs with 8K and 16K memory depth. Table III lists the performance and cost of these designs, including the numbers of BRAMs, the operating frequencies, and the slice utilization. Each entry in a multiported memory is 32 bits. Based on the results in Table III, the multiported designs can be generally categorized into two types based on how the multiple read ports are supported. The first type enables multiple read ports with mapping tables [1], [3], such as the LVT-based and BDRT-based designs. The second type enables multiple read ports without mapping tables [2], such as the XOR-based designs. Since the mapping table helps track the correct location of the most recent data, designs with mapping tables usually require fewer BRAMs to implement a multiported memory. However, these designs suffer lower operating frequencies due to the need for table lookups and more complex control mechanisms. On the other hand, designs with replication techniques (no mapping tables) can achieve higher operating frequencies, but demand more BRAMs on an FPGA. Note that both types of designs, as discussed in Section III-B, still use table lookups to implement multiple write ports.

For 4R2W designs with 8K-depth, compared with the XOR-based designs (including XOR-based, XOR-based_BDX, and XOR-based_HBDX) that require no mapping tables, the LVT-based designs (including LVT-based, LVT-based_BDX, and LVT-based_HBDX) have lower frequencies due to the more complex logic and circuit routing. The proposed BDX and HBDX modules can help reduce the number of BRAMs. Using the proposed BDX and HBDX as the building blocks in the LVT-based and XOR-based designs, the number of BRAMs can be reduced by 30%–53% with a minor increase in slice utilization. The operating frequencies, however, are degraded by around 5% after adopting the BDX and HBDX modules. This is mainly because BDX and HBDX implement more complex logic. The connections between the BRAMs and the output multiplexers also increase the critical path of the designs.

For BDRT-based designs, using 2R1W/4R as the building block reduces the usage of BRAMs by 33.3% compared with the previous designs with 4R1W building blocks. Compared with the LVT-based design, the BRAM usage of the BDRT-based designs can be reduced by up to 69%. However, the operating frequencies of the BDRT-based designs become 16%–19% slower. This is because the 4R2W design using 2R1W/4R has to check the possible conflicts among all six memory requests (four reads and two writes). The multiported memory designs with the original 4R1W building modules only need to check the two writes, since every 4R1W building block can serve four reads and one write. The extra bank buffer of the BDRT-based designs also requires more slices to implement the remap table, and further limits the maximum operating frequency.

For 4R2W designs with 16K-depth, replacing the basic memory modules with BDX and HBDX modules in the XOR-based designs can reduce the number of BRAMs by 30%–42.5%. The operating frequencies are degraded by 11%–17% due to the longer critical paths posed by BDX and HBDX.

There is an interesting observation in that LVT-based_BDX and LVT-based_HBDX reduce BRAM usage by 37.5%–53% while at the same time achieving a 10% faster clock frequency compared with the original LVT-based design [1], [3]. This is mainly because the excessive number of slices required by the original LVT-based design considerably complicates the routing on the target FPGA. The complex routing becomes a limiting factor to the maximum operating frequency. The BDX and HBDX modules, although more complex than the previous basic memory modules, can actually help reduce the number of BRAMs in a multiported memory design on an FPGA. The fewer BRAMs alleviate the routing congestion and result in faster clock rates.
C. Designs for 4R3W Memory

TABLE IV
PERFORMANCE AND COST OF DESIGNS FOR 4R3W MEMORY WITH 8K-DEPTH AND 16K-DEPTH

Table IV extends the comparisons to 4R3W designs with 8K and 16K memory depth. To support three write ports, the XOR-based design needs to occupy an enormous number of BRAMs on the target FPGA. The complex connections from the BRAMs to the output ports in the XOR-based design have become a serious limiting factor to the maximum operating frequency. For 4R3W with 8K-depth, using BDX and HBDX in the XOR-based design can greatly reduce the number of BRAMs, by 37.5% and 48%, respectively. The fewer BRAMs also make the routing less complex and therefore help enhance the operating frequency by 13%. Similar results can be found for the LVT-based 4R3W designs with 8K-depth. Reducing the usage of BRAMs alleviates the complex routing in the original LVT-based design. The proposed BDX and HBDX modules can reduce BRAM usage by 37.5%–53% in the LVT-based designs, and further enhance the operating frequency by 20%.

For 4R3W designs with 8K-depth, compared with LVT-based_BDX and LVT-based_HBDX, the BDRT-based design can further reduce the BRAM usage while keeping a comparable operating frequency. However, the slice utilization increases rapidly after applying the 2R1W/4R as the basic building block. The BDRT-based_2R1W/4R occupies almost all the available slices and is barely routable on the target FPGA. This is because the design with 2R1W/4R needs three extra banks to resolve the possible write conflicts. The BDRT-based design deploys two banks, while the BDRT-based_2R1W/4R requires a total of five banks. For BDRT-based_2R1W/4R, each entry in the remap table now requires 3 bits, instead of the 2 bits in the BDRT-based design. Compared with the BDRT-based design, which adopts the original 4R1W as the basic building block, the BDRT-based_2R1W/4R reduces the number of BRAMs by 37.5%, but results in a 56% lower operating frequency.

The designs for 4R3W with 16K-depth occupy more slices and more BRAMs than the designs with 8K-depth. The operating frequencies of the designs drop due to the more congested usage of slices and the complex routing among BRAMs and logic. For the XOR-based designs of 4R3W with 16K-depth, BDX and HBDX reduce the usage of BRAMs by 37.5% and 48%, respectively. BDX and HBDX also help achieve a 17% higher frequency than the original XOR-based design due to less complex routing.

For designs of 4R3W with 8K-depth, LVT-based_BDX and LVT-based_HBDX require fewer BRAMs compared with LVT-based. However, for LVT-based_HBDX, the lower usage of BRAMs does not provide as much frequency benefit as in the previous cases. LVT-based_HBDX results in an 18% slower operating frequency than the LVT-based design that requires more BRAMs. This is mainly because the already congested usage of slices in the 4R3W design (almost 70% for LVT-based_HBDX) limits the opportunity for timing refinement. The more sophisticated decision-making mechanism in HBDX further aggravates the timing issue. Although it uses fewer BRAMs, the LVT-based_HBDX design for 4R3W with 16K-depth still results in a longer critical path than the original LVT-based design.

Note that the BDRT-based_2R1W/4R design cannot be properly synthesized for the 4R3W memory with 16K-depth because the number of slices required by the design exceeds the available slices on the target FPGA.

D. Impact of Write Ports and Memory Depth

As shown in Tables III and IV, increasing write ports significantly increases the slice utilization. This situation is even more apparent for designs that enable multiple read ports with mapping tables, such as the LVT-based and BDRT-based designs. This is because the mapping tables are synthesized from the registers and connections in the slices of the FPGA. Large mapping tables not only occupy slices, but also introduce more complex decision logic and routing. Extra consumption of slices also blocks the usage of slices by other logic modules, and consequently impacts the overall timing of the design. The experimental results also reveal that when the slice utilization approaches the capacity of the target FPGA, the timing degradation can be very significant.

Increasing the memory depth has a linear impact on the usage of BRAMs. According to the experimental results, growing the memory depth from 8K to 16K requires twice the number of BRAMs. The slice utilization is also increased by approximately two times.
But again, the operating frequency will drop considerably when the design approaches the slice capacity of the target FPGA and causes routing congestion.

E. Impact of Bank Organizations

As noted in the previous section, BDRT-based_2R1W/4R requires more than 100% slice utilization on the target FPGA, and cannot be implemented. But BDRT-based_2R1W/4R does provide the advantage of lower BRAM usage. This section takes BDRT-based_2R1W/4R as a design example, and explores the potential of design refinement that could be achieved by optimizing the bank organizations.

TABLE V
PERFORMANCE IMPACT OF DIFFERENT BANK ORGANIZATIONS FOR THE BDRT-BASED_2R1W/4R DESIGN OF 4R3W MEMORY WITH 8K-DEPTH

Table V lists two different bank organizations for the BDRT-based_2R1W/4R design of a 4R3W memory with 8K-depth. The nominal implementation used in the previous sections of this paper adopts a two-data-bank organization, shown in the second row of Table V. The third row of Table V lists the results of a four-data-bank design. Together with the three bank buffers, the four-data-bank design involves a total of seven banks. Both the two-data-bank and four-data-bank designs need three extra bank buffers, and require 3 bits for each entry in the remap table. However, when changing the design from two data banks to four data banks, the depth of each bank is reduced from 4K to 2K. The depth of the remap table is also shrunk from 8K+4K×3 to 8K+2K×3. Compared with the original two-data-bank design, the four-data-bank design demonstrates a 16% BRAM reduction, a 21% higher frequency, and a 27% lower slice utilization. This result clearly demonstrates a great opportunity to attain further performance enhancement by tuning the bank organization. This will be a major topic in our future research.

F. Summary of Multiported Designs

TABLE VI
THROUGHPUT OF XOR-BASED 4R2W MEMORY AND TIME-MULTIPLEXING (TMX)-BASED 4R2W MEMORY

This section summarizes important observations and design concerns of the proposed multiported memory. First, the integrated multiported memory design can provide higher throughput than time-multiplexing (TMX)-based designs. Table VI compares the throughput of different designs for a 4R2W memory. XOR-based (4R2W) is the XOR-based 4R2W design proposed in this paper. TMX(2R1W) uses the replication-based 2R1W design [1], [3] and applies a time-multiplexing scheme to achieve 4R2W. Here we assume that the time-multiplexing scheme induces zero latency overhead and can run at the same clock frequency as the original 2R1W design. Therefore, for TMX(2R1W), it takes two cycles to serve four reads and two writes. As shown in Table VI, the XOR-based designs run at slower clock rates than the TMX(2R1W) designs. However, the XOR-based designs can achieve higher total throughput than the time-multiplexing designs.
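The throughput comparison behind this observation reduces to a one-line calculation, sketched below with hypothetical clock rates chosen only for illustration (Table VI holds the measured values): a design that serves all six requests every cycle can outperform a TMX design that needs two cycles per batch, even when the latter runs at a faster clock.

    def accesses_per_second(requests_per_batch, cycles_per_batch, clock_mhz):
        """Sustained accesses per second of a multiported memory."""
        return requests_per_batch / cycles_per_batch * clock_mhz * 1e6

    # Hypothetical frequencies, for illustration only.
    xor_4r2w = accesses_per_second(6, 1, clock_mhz=150.0)  # 9.0e8 accesses/s
    tmx_2r1w = accesses_per_second(6, 2, clock_mhz=200.0)  # 6.0e8 accesses/s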
Second, to attain the most benefit, users need to properly choose between designs that support multiple reads with and without mapping tables. As demonstrated previously, multiported designs with mapping tables can utilize the BRAMs more efficiently, but suffer from lower operating frequencies due to more complex routing. The timing issue is further aggravated when the size and complexity of the multiported design approach the capacity of the target FPGA. Therefore, users would prefer designs with mapping tables when the target FPGA contains abundant slices and relatively scarce BRAMs. If the number of slices becomes a design constraint, users should favor designs without mapping tables in order to avoid the possible performance degradation due to insufficient slices and congested connections.

Third, a proper bank organization can enable further performance enhancement. As shown in Section IV-E, different bank organizations result in disparate performance. Refining the bank organization can attain performance enhancement and is worth further exploration.

Fourth, the current designs still require table lookups to support multiple writes. A design with mapping tables would limit the maximum operating frequency. We are currently developing a technique to support multiple writes without mapping tables. This technique would avoid the issue of congested routing and potentially enhance the overall performance of the multiported designs.

V. CONCLUSION

This paper proposes efficient BRAM-based multiported memory designs on FPGAs. The existing design methods require significant amounts of BRAMs to implement a memory module that supports multiple read and write ports. Occupying too many BRAMs for the multiported memory could seriously restrict the usage of BRAMs by other parts of a design. This paper proposes techniques that can attain efficient multiported memory designs. This paper introduces a novel 2R1W/4R memory. By exploiting the 2R1W/4R as the building block, this paper proposes a hierarchical design of 4R1W memory that requires 33% fewer BRAMs than the previous designs based on replication. Memories with more read/write ports can be extended from the proposed 2R1W/4R memory and the hierarchical 4R1W memory. Compared with the XOR-based and LVT-based approaches, the proposed designs can, respectively, reduce up to 53% and 69% of BRAM usage for 4R2W memory designs with 8K-depth.
For complex multiported designs, the proposed BRAM-efficient approaches can achieve higher clock frequencies by alleviating the complex routing in an FPGA. For 4R3W memory with 8K-depth, the proposed design can save 53% of BRAMs while at the same time enhancing the operating frequency by 20%. This paper also demonstrates the importance of applying an appropriate bank organization in a memory design. It is shown that a multiported design with a proper bank organization could achieve a 16% BRAM reduction, a 21% higher frequency, and a 27% lower slice utilization. The results present great potential for future design refinement that could be achieved by optimizing the bank organizations.

REFERENCES

[1] C. E. LaForest and J. G. Steffan, "Efficient multi-ported memories for FPGAs," in Proc. 18th Annu. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2010, pp. 41–50.
[2] C. E. LaForest, M. G. Liu, E. Rapati, and J. G. Steffan, "Multi-ported memories for FPGAs via XOR," in Proc. 20th Annu. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2012, pp. 209–218.
[3] C. E. LaForest, Z. Li, T. O'Rourke, M. G. Liu, and J. G. Steffan, "Composing multi-ported memories on FPGAs," ACM Trans. Reconfigurable Technol. Syst., vol. 7, no. 3, Aug. 2014, Art. no. 16.
[4] Xilinx. 7 Series FPGAs Configurable Logic Block User Guide, accessed on May 30, 2016. [Online]. Available: http://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf
[5] Xilinx. Zynq-7000 All Programmable SoC Overview, accessed on May 30, 2016. [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf
[6] G. A. Malazgirt, H. E. Yantir, A. Yurdakul, and S. Niar, "Application specific multi-port memory customization in FPGAs," in Proc. IEEE Int. Conf. Field Program. Logic Appl. (FPL), Sep. 2014, pp. 1–4.
[7] H. E. Yantir and A. Yurdakul, "An efficient heterogeneous register file implementation for FPGAs," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), May 2014, pp. 293–298.
[8] H. E. Yantir, S. Bayar, and A. Yurdakul, "Efficient implementations of multi-pumped multi-port register files in FPGAs," in Proc. Euromicro Conf. Digit. Syst. Design (DSD), Sep. 2013, pp. 185–192.
[9] J.-L. Lin and B.-C. C. Lai, "BRAM efficient multi-ported memory on FPGA," in Proc. Int. Symp. VLSI Design, Autom. Test (VLSI-DAT), Apr. 2015, pp. 1–4.

Bo-Cheng Charles Lai (M'09) received the B.S. degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1999, and the M.S. and Ph.D. degrees in electrical engineering from the University of California at Los Angeles (UCLA), Los Angeles, CA, USA, in 2003 and 2007, respectively.

He was with Broadcom Corporation, Irvine, CA, USA, from 2007 to 2009. He joined National Chiao Tung University in 2009, where he is currently an Associate Professor with the Department of Electronics Engineering. His current research interests include parallel computing, multicore architecture, low power designs, and embedded systems.

Dr. Lai received a scholarship from the John Deere Foundation in 2003. During his studies at UCLA, he won the Design Automation Conference student design contest in 2003 and 2005.

Jiun-Liang Lin received the M.S. degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2015.

He is currently a Digital Design Engineer with MediaTek Inc., Hsinchu. His current research interests include designs and optimizations of multiported memory and field-programmable gate arrays.