Abstract: This project aims to determine whether accelerators based on FPGAs are worthwhile for DNA assembly. It involves reprogramming an already existing algorithm, called Ray, to be run either on such an accelerator or on a CPU so that both can be compared. This has been achieved using the OpenCL language. The focus is put on modifying and optimizing the original algorithm to better suit the new parallelization tool. Upon running the new program on some datasets, it becomes clear that FPGAs are a very capable platform that can fare better than the traditional approach, both in raw performance and in energy consumption.

This work was supported, in part, by the Natural Sciences and Engineering Research Council of Canada, the Fonds de recherche du Quebec - Nature et technologies and by the Microsystems Strategic Alliance of Quebec. The authors are with the Department of Electrical and Computer Engineering, Laval University, 2325 Rue de l'Universite, Quebec, Qc, G1V 0A6, Canada. carl.poirier.2@ulaval.ca, benoit.gosselin@gel.ulaval.ca, paul.fortier@gel.ulaval.ca

I. INTRODUCTION

De novo DNA assembly has been done using different algorithms throughout time. A recent method, which will be the subject of this article, consists of using a De Bruijn graph in which we place DNA fragments. A De Bruijn graph is an oriented graph that allows representing overlaps of length k-1 between words of length k, called k-mers, in a given alphabet [3]. The number of times each k-mer has been seen is saved, which we call coverage. It is then possible to search for paths in the graph that represent some part of the original genomic sequence, which we call contigs.

The goal of this project is to complete DNA assembly using a De Bruijn graph in a reasonable amount of time on devices other than supercomputers. Up until recently, CPUs have provided the main calculation power of these, but now accelerators have taken the lead. OpenCL will thus be used for parallelization. The algorithm on which this project is based is called Ray. It has been developed by Sebastien Boisvert from Laval University and it uses OpenMPI for inter-node parallelization. Our version is called OCLRay because of the new parallelization tool.

Ray is a proposition of a new algorithm for assembling results from different sequencing technologies taking the form of short reads. This algorithm is split into many different parts. First, the graph is filled with the k-mers from the reads. Then, there is a purge step which consists of removing edges leading to dead-ends. Next is a statistical count of coverage. This allows determining appropriate vertices for annotating the reads and determining the seeds, which is the next step. This is followed by annihilating the spurious ones and finally, extending them [2] and writing the results.

In this article, the considered OpenCL version is 1.2, published in November 2011 [7]. OpenCL is an open and royalty-free standard allowing high-performance programming by exploiting the parallelism of the hardware architecture. It specifies the interface to expose to users, but not the implementation; each hardware vendor is free at this level. It is thus easy to target many different architectures with the same code.

On the programming side, it is a language based on C99. In it, the parallelism is described explicitly according to a hierarchy of work-groups and work-items, mapped to compute units and processing elements in hardware. Each task must at its core be dissected into many small, similar and parallel steps.

II. ACCELERATORS

A. CPU

The CPU is used as the host in a compatible OpenCL system. However, it can at the same time act as an accelerator, where each core is a compute unit. The vector instruction extensions can also be used for SIMD processing.

Compiling OpenCL code for the CPU can be done on-the-fly without any apparent delay to the user. For this reason, Altera suggests doing so using the option -march=emulator instead of compiling for its own accelerators during prototyping.

B. FPGA

Originally, FPGA programming was done using a hardware description language such as VHDL or Verilog. In OpenCL, it is the compiler and the optimizer which take on the duty of generating an architecture adapted to and optimized for the instructions to execute, which is easy with the explicit parallelism. To top it off, the PCI-E connectivity to the host and the DDR3 memory controller, including DMA, are handled automatically by the SDK. In the end, the OpenCL SDK typically promises better performance than a hand-written architecture in a shorter development time, as well as a portable solution that can be migrated to newer FPGAs automatically [1].

What the Altera OpenCL SDK does is first generate a pipeline to obtain a throughput of up to one work-item per clock edge, independently of the number of instructions to execute on each of them. It is then possible to make this pipeline larger by processing many work-items at the same time in a SIMD fashion, or to unroll loops for even more parallelism. Finally, the whole pipeline can be duplicated to increase the number of compute units. All these techniques allow for a better throughput, with the first two being preferred because of memory access patterns and resource utilization.
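The work-group/work-item hierarchy described above can be illustrated with a short sketch. This is not code from OCLRay: a real OpenCL kernel needs a device and runtime to execute, so the NDRange dispatch is emulated here in plain C for a hypothetical vector-add task, with illustrative sizes.

```c
#include <assert.h>
#include <stddef.h>

#define GLOBAL_SIZE 16   /* total work-items in the NDRange */
#define LOCAL_SIZE   4   /* work-items per work-group       */

/* The "kernel": one work-item processes one element.
 * In OpenCL C, gid would come from get_global_id(0). */
static void vector_add(size_t gid, const int *a, const int *b, int *out) {
    out[gid] = a[gid] + b[gid];
}

/* Emulated NDRange: each work-group stands for a compute unit's share of
 * the range, each work-item for one processing element's step. On a real
 * device, these iterations would run in parallel. */
static void run_ndrange(const int *a, const int *b, int *out) {
    for (size_t group = 0; group < GLOBAL_SIZE / LOCAL_SIZE; ++group)
        for (size_t local = 0; local < LOCAL_SIZE; ++local)
            vector_add(group * LOCAL_SIZE + local, a, b, out);
}
```

The point of the decomposition is that every work-item executes the same small body on a different index, which is exactly the shape the FPGA pipeline and SIMD vectorization described in section II-B can exploit.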
thousands of instructions, representing the parallelization in hardware is not practical.

    slot = hash;                // starting slot
    perturb = hash;             // initial perturbation
    while (slot.isFull() && slot.item != itemToFind) {
        slot = (5 * slot) + 1 + perturb;
        perturb >>= 5;
    }

Fig. 2. Collision resolution scheme used for the hash table.

IV. TWEAKING THE ALGORITHM

Besides porting the algorithm to OpenCL, many changes have been made to ensure optimal performance on the accelerators. Some other changes are not squarely aimed at performance, but at memory usage. Here we describe the most important changes.
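The collision resolution scheme of Figure 2 can be made concrete with a minimal, self-contained C sketch. It follows the CPython dictionary approach the scheme is modeled on; the table size, sentinel value and function names are illustrative, not OCLRay's.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 16            /* power of two, so '& MASK' replaces '%' */
#define MASK (TABLE_SIZE - 1)
#define PERTURB_SHIFT 5
#define EMPTY 0                  /* sentinel: key 0 means the slot is free */

static uint64_t table[TABLE_SIZE];

/* Probe until 'key' or an empty slot is found; returns the slot index. */
static unsigned probe(uint64_t hash, uint64_t key) {
    uint64_t perturb = hash;               /* initial perturbation */
    uint64_t i = hash & MASK;              /* starting slot */
    while (table[i] != EMPTY && table[i] != key) {
        i = (5 * i + 1 + perturb) & MASK;  /* next pseudo-random slot */
        perturb >>= PERTURB_SHIFT;         /* mix in the high hash bits */
    }
    return (unsigned)i;
}

static void insert(uint64_t hash, uint64_t key) {
    table[probe(hash, key)] = key;
}
```

Two details make this cheap and safe. The masking works because the table size is a power of two, and once perturb has shifted down to zero the recurrence i = (5i + 1) mod 2^n is a full-period sequence, so the probe is guaranteed to reach a free slot if one exists.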
A. Decreasing Memory Usage

In some other short read assemblers such as Ray and Velvet, the annotation of a read mentions the position of the vertex in it. Here, we proceed in another fashion. Since the nucleotides are stored consecutively, we can simply modify the start of the read so that it points to the first unique vertex. The storage used for the offset position is then saved. Anyway, the start of the read is not useful in the next steps.

Also, Ray uses a Bloom filter to filter out k-mers appearing only once. These are definitely errors created during the sequencing. OCLRay cascades two filters to eliminate the ones appearing only twice as well.

B. Ensuring Adequate Performance

Ray takes as a command-line argument the desired length of k-mers. OCLRay keeps the same interface, but since this value does not change during the execution, it can be used to compile and load the OpenCL kernel. Thus, one kernel for each allowed k-mer length is generated beforehand. Having this value as a constant allows the required memory for storing a k-mer to be sized appropriately, without overhead for larger k-mers. At the same time, it avoids some mathematical operations on pointers for memory accesses. Second, loops having a number of iterations dependent on the k-mer length are avoided.

Another optimization is rounding the graph size up to the next power of 2. This prevents the modulo operation required for calculating the index of a vertex in the hash table, following the hash function. It can be replaced by an AND operation, as illustrated in the following equation:

    idx_{vertex} = hash \bmod size_{table} = hash \mathbin{\&} (size_{table} - 1)    (1)

A division operation used for determining the appropriate memory space, as explained in III-B, can also be avoided by replacing it with a binary logarithm, a binary scan and a binary shift, as illustrated in the following equation:

    idx_{buffer} = hash / size_{table} = hash \gg (sizeof_{bits}(size_{table}) - BSR(size_{table}))    (2)

On an x86 Haswell CPU from Intel, a division operation takes 95 cycles whereas a binary AND takes one cycle. Similarly, a bit scan reverse (BSR) takes three cycles, a subtraction takes one cycle and a shift takes one as well [6]. As for an FPGA, as mentioned in section II-B, division and remainder are operations to avoid.

As for resolving collisions in the hash table, we use the same method as the Python dictionary, which is also stored as a hash table [8]. It consists of modifying the hash with a perturbation in a way that indices in the hash table are generated pseudo-randomly. The pseudo-code to do so is presented in Figure 2. The calculation of a new position is then very light to execute compared to what Ray suggests. This nets a slight performance gain as well as some resource savings on the FPGA. Speaking of memory accesses, visiting slots in a pseudo-random order has an adverse effect on performance; DDR3 memory performs better for sequential accesses. The end solution consists of stretching the table in a second dimension to give it a width of many elements. During a search in this structure, all the elements in the same row are verified, which results in many consecutive accesses [9]. The width used in OCLRay is four.

In order to obtain an efficient pipeline in the FPGA, inner loops must be avoided. To do so for the search of a vertex in the graph, the collision resolution loop is completely unrolled. It thus needs a finite number of iterations, which becomes possible by imposing a maximum number of collisions for a same vertex. It has been calculated that in an open-addressed hash table utilizing pseudo-random collision resolution, the expected probe length to find an element is dictated by the following formula [5]:

    k = -\frac{1}{\alpha} \ln(1 - \alpha)    (3)

where \alpha is the occupancy rate of the table, between 0 and 1. Thus, for a maximum rate of 75%, we obtain an average number of accesses of 1.848. During the creation of the graph, a maximum of 7 allowed collisions was chosen, so two full rows have to be verified.

V. RESULTS

The first tests to be run are raw performance tests, pitting the Intel Core i7-4770 against the Altera Stratix V, a high-end FPGA with 50 Mb of on-chip memory. For the results in Figure 3, OCLRay is run on a dataset consisting of Salmonella enterica, a run with the identifier SRR749060 in DNA databases. On the x axis, the five steps parallelized with OpenCL are presented. The y axis is the time in seconds it takes for the kernel to run.

Fig. 3. FPGA and CPU kernel run times according to the algorithm step.

Fig. 4. FPGA and CPU performance per watt according to the algorithm step.

It is quite clear that the FPGA is very competitive, performance-wise, with regards to the CPU. In Table I, the FPGA kernel run times are normalized with respect to the CPU. It is interesting to see that the relatively simple kernels that cannot be vectorized, namely the count and annihilate kernels, do not perform very well on the FPGA. On the other hand, the purge kernel is completely vectorizable and while the annotate and extend
kernels are not, they are very complex, meaning there are lots of instructions to execute for each work-item. This is because both include one main loop that has many iterations, so the FPGA pipeline throughput really shines here. For the whole algorithm, the FPGA is 6.89 times as fast as the CPU.

TABLE I
FPGA KERNEL RUN TIMES, NORMALIZED.

    Purge     Count     Annotate   Annihilate   Extend    Total
    0.02696   4.25641   0.07568    0.33919      0.08168   0.14524

Power consumption has been estimated at 28 W using a Kill-A-Watt power meter for the whole FPGA board under load. This is done by plugging the meter into the wall outlet and subtracting the idle power consumption from the load measurement. In the same manner, the whole computer using the CPU as the accelerator consumes 111 W, 78 W more than in an idle state. These numbers, representing the power draw induced by the workload, are used for calculating the results presented in Figure 4. The values for the FPGA, normalized according to the CPU, are presented in Table II. We can see that the FPGA takes 13.15 times less energy than the CPU. It is however important to note here that buffer transfer times between the host memory and global device memory have not been included in the calculations. This is because they have not been optimized yet. The results might thus change slightly later on.

TABLE II
ENERGY SAVINGS FACTOR BY USING THE PCIE-385N INSTEAD OF THE CORE I7-4770.

    Purge     Count    Annotate   Annihilate   Extend    Total
    37.0940   0.2349   13.2143    2.9482       12.2425   13.1468

VI. CONCLUSIONS

Overall, it is clear that FPGAs should be used to speed up DNA assembly, but also to decrease power usage while doing so. This particular algorithm shows that FPGAs are potent accelerators that will work well for a range of applications, as shown by the very different algorithm steps here. Future work should focus on systems using uniform memory access, such as the SoCs from Altera, for which memory transfers would not be needed, and for which the hard CPU cores could take on the few serial tasks required. This would perform better than using atomics in the OpenCL kernels.

ACKNOWLEDGMENT

Thanks to Sebastien Boisvert for having open-sourced Ray and helping clarify some parts of the algorithm. Thanks to CMC Microsystems for providing design and prototyping tools.

REFERENCES

[1] Altera. Implementing FPGA design with the OpenCL standard. http://www.altera.com/literature/wp/wp-01173-opencl.pdf, 2014.
[2] Sebastien Boisvert, Francois Laviolette, and Jacques Corbeil. Ray: Simultaneous assembly of reads from a mix of high-throughput sequencing technologies. Journal of Computational Biology, 17(11), 2010.
[3] Nicolaas Govert de Bruijn. A combinatorial problem. Koninklijke Nederlandse Akademie v. Wetenschappen, 49:758-764, 1946.
[4] Dmitry Denisenko. Lucas-Kanade optical flow from C to OpenCL on CV SoC. In CMC Microsystems Altera Training on OpenCL, 2014.
[5] Gaston H. Gonnet. Expected length of the longest probe sequence in hash code searching. Department of Computer Science, University of Waterloo, December 1978.
[6] Torbjorn Granlund. Instruction latencies and throughput for AMD and Intel x86 processors. https://gmplib.org/~tege/x86-timing.pdf, July 2014.
[7] Khronos Group. OpenCL 1.2 reference pages. http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/, November 2011.
[8] Andy Oram and Greg Wilson. Beautiful Code. O'Reilly Media, 2007.
[9] Xilinx. How to get more than two orders of magnitude better power/performance from key-value stores using FPGA. In IEEE Communications Society Tutorials, 2014.