
Interdisciplinary Information Sciences Vol. 15, No. 1 (2009) 51-66
©Graduate School of Information Sciences, Tohoku University
ISSN 1340-9050 print / 1347-6157 online
DOI 10.4036/iis.2009.51

Characteristics of an On-Chip Cache on NEC SX Vector Architecture


Akihiro MUSA 1,3, Yoshiei SATO 1, Ryusuke EGAWA 1,2, Hiroyuki TAKIZAWA 1,2, Koki OKABE 2, and Hiroaki KOBAYASHI 1,2

1 Graduate School of Information Sciences, Tohoku University, Sendai 980-8578, Japan
2 Cyberscience Center, Tohoku University, Sendai 980-8578, Japan
3 NEC Corporation, Tokyo 108-8001, Japan
Received May 6, 2008; final version accepted December 10, 2008

Thanks to their highly effective memory bandwidth, vector systems can achieve high computation efficiency for computation-intensive scientific applications. However, they have been encountering the memory wall problem, and the effective memory bandwidth rate has decreased: the bytes per flop rates of recent vector systems have dropped from 4 (SX-7 and SX-8) to 2 (SX-8R) and 2.5 (SX-9). The situation is getting worse as more functional units and/or cores are brought onto a single chip, because the pin bandwidth is limited and does not scale. To address this problem, we propose an on-chip cache, called vector cache, to maintain the effective memory bandwidth rate of future vector supercomputers. The vector cache employs a bypass mechanism between the main memory and the register files under software control. We evaluate the performance of the vector cache on the NEC SX vector processor architecture at bytes per flop rates of 2 B/FLOP and 1 B/FLOP, to clarify the basic characteristics of the vector cache. For the evaluation, we use the NEC SX-7 simulator extended with the vector cache mechanism. The benchmark programs are two DAXPY-like loops and five leading scientific applications. The results indicate that the vector cache boosts the computational efficiencies of the 2 B/FLOP and 1 B/FLOP systems up to the level of the 4 B/FLOP system. In particular, when cache hit rates exceed 50%, the 2 B/FLOP system can achieve a performance comparable to the 4 B/FLOP system, because the vector cache with the bypass mechanism can provide data from the main memory and the cache simultaneously. In addition, from the viewpoint of cache design, we investigate the impact of cache associativity on the cache hit rate, and the relationship between cache latency and performance. The results suggest that the associativity hardly affects the cache hit rate, and that the effect of the cache latency depends on the vector loop length of applications. A shorter cache latency contributes to the performance improvement of applications with shorter loop lengths, even in the case of the 4 B/FLOP system. For longer loop lengths of 256 or more, the latency can effectively be hidden, and the performance is not sensitive to the cache latency. Finally, we discuss the effects of selective caching using the bypass mechanism and of loop unrolling on the vector cache performance for the scientific applications. Selective caching is effective for efficient use of the limited cache capacity. Loop unrolling is also effective for improving performance, resulting in a synergistic effect with caching. However, there are exceptional cases: loop unrolling can worsen the cache hit rate because the working set of the unrolled loop outgrows the cache, and the increase in the cache miss rate then cancels the gain obtained by unrolling.

KEYWORDS: Vector processing, Vector cache, Performance characterization, Memory system, Scientific application

1. Introduction
Vector supercomputers provide high computation efficiency for computation-intensive scientific applications. In our previous work (Kobayashi (2006), Musa et al. (2006)), we have shown that NEC SX vector supercomputers can achieve high sustained performance in several leading applications, owing to their high bytes per flop rates compared with scalar systems. However, as recent advances in VLSI technology have also been accelerating processor speeds, vector supercomputers have been encountering the memory wall problem. As a result, it is getting harder for vector supercomputers to keep a memory bandwidth balanced with the improvement of their flop/s performance. On the NEC SX systems, the bytes per flop rate has decreased from 8 B/FLOP to 2 B/FLOP during the last twenty years.
E-mail: musa@sc.isc.tohoku.ac.jp, yoshiei@sc.isc.tohoku.ac.jp, egawa@isc.tohoku.ac.jp, tacky@isc.tohoku.ac.jp, okabe@isc.tohoku.ac.jp, koba@isc.tohoku.ac.jp


We have proposed an on-chip cache, called vector cache, to maintain the effective memory bandwidth rate of future vector supercomputers (Kobayashi et al. (2007)), and provided an early evaluation of the vector cache using the NEC SX simulator (Musa et al. (2007)). The vector cache employs a bypass mechanism between a vector register and the main memory under software control, and load/store data are selectively cached. The bypass mechanism and the cache work complementarily to provide data to the register file, resulting in a high effective memory bandwidth rate. However, we did not investigate the fundamental configuration of the vector cache, i.e., cache associativity and cache latency. Moreover, our previous work did not discuss the relationship between loop unrolling and caching. The main contribution of this paper is to clarify the characteristics of the on-chip vector cache on vector supercomputers. We investigate the relationship among cache associativity, cache latency and performance using five practical applications selected from the areas of advanced computational sciences. In addition, we discuss in detail selective caching and loop unrolling under the vector cache to improve the performance.

The rest of the paper is organized as follows. Section 2 presents related work. Section 3 describes the vector architecture with an on-chip vector cache discussed in this paper. Section 4 provides our experimental methodology and the benchmark programs for performance evaluation. Section 5 presents experimental results when executing the kernel loops and five real applications on our architecture; through these results, we clarify the characteristics of the vector cache. In Section 6, we examine the effect of selective caching on the performance, in which only the data with high locality of reference are cached, and discuss the relationship between loop unrolling and the vector cache. Finally, Section 7 summarizes the paper with some future work.

2. Related Work
Vector caches were studied by a number of researchers in the 1990s using trace-driven simulators of conventional vector architectures. Gee et al. evaluated the cache performance of the Cray X-MP and Ardent Titan vector architectures (Gee and Smith (1992), Gee and Smith (1994)). Their caches employ a fully associative cache and an n-way set-associative cache; the line sizes range from 16 to 128 bytes, and the maximum cache size is 4 MB. Their benchmark programs are selected from the Livermore Loops, the NAS kernels, the Los Alamos benchmark set, and real applications in chemistry, fluid dynamics and linear analysis. They showed that vector references contain somewhat less temporal locality, but large amounts of spatial locality, compared with instruction and scalar references, and that the cache improves the computational performance of vector processors. Kontothanassis et al. evaluated the cache performance of the Cray C90 architecture using the NAS parallel benchmarks (Kontothanassis et al. (1994)). Their cache is a direct-mapped cache with a 128-byte line size, and the maximum cache size is 8 MB. They showed that the cache reduces traffic to off-chip memory, and that a DRAM memory system with the cache is competitive with an SRAM memory system. The modern vector supercomputer Cray X1 has a 2 MB vector cache, called Ecache, organized as a 2-way set-associative write-back cache with a 32-byte line size and the least recently used (LRU) replacement policy (Abts et al. (2003)). The performance of the Cray X1 has been evaluated using real scientific applications by several researchers (Dunigan Jr. et al. (2005), Oliker et al. (2004), Shan and Strohmaier (2004)). These studies compared the performance of the Cray X1 with that of other platforms: IBM Power, SGI Altix and NEC SX. However, they did not quantitatively discuss the effect of the cache on its performance. The aim of our research is to quantitatively investigate the effect of the vector cache on the modern vector supercomputer NEC SX architecture. We clarify the basic characteristics of the vector cache and discuss its effective usage.

3. On-Chip Cache Memory for Vector Architecture


We design an on-chip vector cache for a vector processor architecture as shown in Figure 1. Our vector processor has a superscalar unit, a vector unit and an address control unit. The vector unit contains five types of vector arithmetic pipes (Mask, Logical, Multiply, Add/Shift and Divide), vector registers and a vector cache. Four parallel vector pipes are provided for each type, for a total of 20 pipes in the vector unit.

Figure 1. Vector architecture with vector cache and memory system block diagram.


The address control unit queues and issues vector instructions to the vector unit. The vector instructions comprise vector memory instructions and vector computational instructions; each instruction concurrently operates on up to 256 double-precision floating-point data. The main memory unit employs an interleaved memory system. The main memory is divided into 32 parts, each of which operates independently and supplies data to the vector processor; thus, the main memory and the vector processor are interconnected through 32 memory ports. Moreover, we implement the vector cache in each memory part: the vector cache consists of 32 sub-caches, and each sub-cache is a non-blocking cache with a bypass mechanism between the vector register file and the main memory. The bypass mechanism is software-controlled, and vector load/store instructions have a flag for cache control, i.e., cache on/off. When the flag indicates cache on, the data supplied by the vector load/store instruction are cached. When the flag is cache off, the vector load/store instruction provides data via the bypass mechanism and the data are not cached. Each sub-cache is a set-associative write-through cache with the LRU replacement policy. The line size is 8 bytes, which is the unit of memory access.
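To make this organization concrete, the sketch below computes which sub-cache and which cache set a byte address would map to. It is a minimal sketch under two assumptions that the paper does not state: consecutive 8-byte words are interleaved round-robin across the 32 memory ports, and each sub-cache uses simple modulo set indexing (shown here for the 8 MB, 2-way configuration of Table 1).

program map_address
  implicit none
  integer(kind=8), parameter :: NPORTS = 32        ! memory ports = sub-caches
  integer(kind=8), parameter :: LINE   = 8         ! line size in bytes (Table 1)
  integer(kind=8), parameter :: NWAY   = 2         ! associativity (2-, 4- or 8-way)
  integer(kind=8), parameter :: SUBSZ  = 256*1024  ! sub-cache size for the 8 MB total
  integer(kind=8) :: addr, word, nsets, port, set

  nsets = SUBSZ / (LINE*NWAY)       ! sets per sub-cache
  addr  = 123456789_8               ! an arbitrary byte address
  word  = addr / LINE               ! index of the 8-byte word
  port  = mod(word, NPORTS)         ! sub-cache (memory port) serving this word
  set   = mod(word/NPORTS, nsets)   ! set within that sub-cache
  print *, 'sub-cache =', port, '  set =', set
end program map_address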

4. Experimental Environment
4.1 Methodology

We use a trace-driven simulator that models the behavior of the proposed vector architecture at the register transfer level. It is based on the NEC SX simulator, which accurately models a single processor of the SX architecture: the vector unit, the scalar unit and the memory system. The simulator takes a system parameter file and a trace file as input, and its output contains the instruction cycle counts of a benchmark program and cache hit information. The system parameter file has the configuration parameters of the vector architecture and the setting parameters, i.e., the cache size, the associativity, the cache latency and the memory bandwidth. The trace file contains the instruction sequence of a benchmark program and directives for cache control (cache on/off). These directives are used in selective caching and set the cache control flags in the vector load/store instructions. A benchmark program is compiled by the NEC FORTRAN compiler FORTRAN90/SX, which supports ANSI/ISO Fortran95 in addition to automatic vectorization and automatic parallelization. The executable program runs on the SX trace generator to produce the trace file. In this work, we use two kernel loops and the original source codes of five applications, and compile them with the highest optimization option (-C hopt) and subroutine inlining. To evaluate on-chip vector caching for future vector processors with a higher flop/s rate but a relatively lower off-chip memory bandwidth, we evaluate the effect of the cache on the vector processor while limiting its memory bandwidth per flop/s rate from 4 B/FLOP down to 1 B/FLOP. Here, we adopt the NEC SX-7 architecture (Kitagawa et al. (2003)) for our vector processor. The setting parameters are shown in Table 1.(1)
Table 1. Summary of setting parameters.

Base System Architecture: NEC SX-7
Main Memory: DDR-SDRAM
Vector Cache: SRAM
  Total Size (Sub-cache Size): 256 KB to 8 MB (8 KB to 256 KB)
  Cache Policy: LRU, Write-through
  Associativity: 2-way, 4-way, 8-way
  Cache Latency: 15%, 50%, 85%, 100% (rate of main memory latency)
  Cache Bank Cycle: 5% of memory cycle
  Line Size: 8 B
Memory-Cache bandwidth per flop/s: 1 B/FLOP, 2 B/FLOP, 4 B/FLOP
Cache-Register bandwidth per flop/s: 4 B/FLOP

4.2 Benchmark Programs

To clarify the basic characteristics and validity of the vector cache, we select basic kernel loops and five leading applications. In this section, we describe our benchmark programs in detail. The following DAXPY-like kernel loops are used to evaluate the performance of the vector processor with the cache:

  A(i) = X(i) + Y(i)           (1)
  A(i) = X(i) * Y(i) + Z(i)    (2)
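In Fortran form, the two kernels correspond to the loops below (a sketch: the double-precision arrays and the loop length n are assumed from the context, with n varied per experiment):

program kernels
  implicit none
  integer, parameter :: n = 25600
  integer :: i
  double precision :: A(n), X(n), Y(n), Z(n)
  call random_number(X); call random_number(Y); call random_number(Z)

  ! Kernel (1): A(i) = X(i) + Y(i)
  do i = 1, n
     A(i) = X(i) + Y(i)
  end do

  ! Kernel (2): A(i) = X(i)*Y(i) + Z(i)
  do i = 1, n
     A(i) = X(i)*Y(i) + Z(i)
  end do
  print *, A(1), A(n)
end program kernels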
(1) The absolute values of memory-related parameters such as latency and bank cycle cannot be disclosed in this paper because of their confidentiality.


As leading applications, we have selected five applications from three scientific computing areas. The applications have been developed by researchers at Tohoku University for the SX-7 system and are representative of their research areas; their methods are standard in the individual areas (Kobayashi (2006)). Table 2 summarizes the evaluated applications.
Table 2. Summary of benchmark programs.

Area: Electromagnetic Analysis
  GPR simulation: Simulation of Array Antenna Ground Penetrating Radar. Method: FDTD. Subdivision: 50 × 750 × 750. Memory size: 8.9 GB.
  APFA simulation: Simulation of Anti-Podal Fermi Antenna. Method: FDTD. Subdivision: 612 × 105 × 505. Memory size: 12 GB.
Area: CFD/Heat Analysis
  PRF simulation: Simulation of Premixed Reactive Flow in Combustion. Method: DNS. Subdivision: 513 × 513. Memory size: 1.4 GB.
  SFHT simulation: Simulation of Separated Flow and Heat Transfer. Method: SMAC. Subdivision: 711 × 91 × 221. Memory size: 6.6 GB.
Area: Seismology
  PBM simulation: Simulation of Plate Boundary Model on Seismic Slow Slip. Method: Friction Law. Subdivision: 32400 × 32400. Memory size: 8 GB.

The GPR simulation code evaluates the performance of SAR-GPR (Synthetic Aperture Radar-Ground Penetrating Radar) in the detection of buried anti-personnel mines under conditions of inhomogeneous subsurface media (Kobayashi et al. (2003)). The simulation method is based on the three-dimensional FDTD (Finite Difference Time Domain) method (Kunz and Luebbers (1993)). The vector operation ratio is 99.7% and the vector loop length is over 500. The APFA simulation is applied to the design of a high-gain antenna, the Anti-Podal Fermi Antenna (APFA) (Takagi et al. (2004)). The simulation calculates the directional characteristics of the APFA using the FDTD method and the Fourier transform. The vector operation ratio is 99.9% and the vector loop length is 255. The PRF simulation provides numerical simulations of two-dimensional Premixed Reactive Flow (PRF) in combustion for the design of aircraft engines (Tsuboi and Masuya (2003)). The simulation uses the sixth-order compact finite difference scheme and the third-order Runge-Kutta method for time advancement. The vector operation ratio is 99.3% and the vector loop length is 513. The SFHT simulation realizes direct numerical simulations of three-dimensional laminar Separated Flow and Heat Transfer (SFHT) on the surfaces of a plane (Nakajima et al. (2005)). The finite-difference forms are the fifth-order upwind difference scheme for space derivatives and the Crank-Nicolson method for the time derivative; the resulting finite-difference equations are solved using the SMAC method. The vector operation ratio is 99.4% and the vector loop length is 349. The PBM simulation uses three-dimensional numerical Plate Boundary Models (PBM) to explain an observed variation in the propagation speed of postseismic slip (Ariyoshi et al. (2007)). This is a quasi-static simulation in an elastic half-space including a rate- and state-dependent friction law. The vector operation ratio is 99.5% and the vector loop length is 32400.

5. Performance Evaluation of Vector Cache


We evaluate the performance of the proposed vector processor architecture using the benchmark programs and clarify the basic characteristics of the vector cache.

5.1 Impact of Memory Bandwidth on Effective Performance

To evaluate the impact of the memory bandwidth on the performance of the vector processor without the vector cache, we vary the memory bandwidth per flop/s rate from 4 B/FLOP down to 1 B/FLOP. Figure 2 shows the computation efficiency, i.e., the ratio of sustained performance to peak performance, in the execution of the benchmark programs. Here, the 4 B/FLOP case equals the memory bandwidth per flop/s rate of the SX-8 and earlier systems, 2 B/FLOP is the same as the Cray X1 and BlackWidow (Abts et al. (2007)), and 1 B/FLOP is the same level as commodity-based high-performance scalar systems such as the NEC TX7 series (Senta et al. (2003)) and the SGI Altix series (SGI (2007)). Figure 2 indicates that the computational performance is seriously affected by the B/FLOP rate. In particular, the performances of GPR and PBM, which are memory-intensive programs, are degraded to a half and a quarter when the memory bandwidth per flop/s rate is reduced from 4 B/FLOP to 2 B/FLOP and 1 B/FLOP, respectively. Even though APFA is computation-intensive, its performance on the 1 B/FLOP system decreases to 69% of that on the 4 B/FLOP system. Therefore, the vector supercomputer needs a memory bandwidth per flop/s rate of 4 B/FLOP.


Figure 2. Effect of memory bandwidth per flop/s rate.

5.2 Relationship between Efficiency and Cache Hit Rate on Kernel Loops

We simulate the execution of the kernel loop (1) on the vector processor with the vector cache. Since this loop has no temporal locality, we store part of the X(i) and Y(i) data in the vector cache in advance of the loop execution. We select the data for caching in the following two ways to change the range of cache hit rates, and examine their effects on performance. One way is to store both X(i) and Y(i) with the same index i in the cache, with the load instructions of both X(i) and Y(i) set to cache on; this is Case 1. The other is to cache only X(i), with the load instructions of X(i) and Y(i) set to cache on and cache off, respectively; this is Case 2. Here, the range of the index i is 1 ≤ i ≤ 25600. Figure 3 shows the relationship between the cache hit rate and the relative memory bandwidth, which is obtained by normalizing the effective memory bandwidth of the systems with the vector cache by that of the 4 B/FLOP system without the cache. Figure 4 shows the relationship between the cache hit rate and the computation efficiency. Figure 3 indicates that the relative memory bandwidth of each case increases with the cache hit rate; the vector cache is therefore a promising way to cover a lack of memory bandwidth. Similarly, the computation efficiency of each case in Figure 4 improves as the cache hit rate increases.

Figure 3. Relationship between cache hit rate and relative memory bandwidth on Kernel loop (1).

Figures 3 and 4 show the correlation between the relative memory bandwidth and the computation efficiency. On both the 2 B/FLOP and 1 B/FLOP systems, the performance of Case 2 is greater than that of Case 1. In particular, Case 2 with a 50% cache hit rate on the 2 B/FLOP system achieves the same performance as the 4 B/FLOP system. On the 4 B/FLOP system, each of X(i) and Y(i) is provided at a 2 B/FLOP bandwidth rate on average. When the cache hit rate is 50% in Case 2, all data of X(i) are supplied directly from the vector cache and all data of Y(i) are supplied from the memory at the 2 B/FLOP bandwidth rate through the bypass mechanism. Consequently, each of X(i) and Y(i) arrives at the vector registers at the 2 B/FLOP bandwidth rate, and the total amount of data provided to the vector registers per unit time is equal to that of the 4 B/FLOP system. However, the 1 B/FLOP system with vector caching at the 50% hit rate does not reach the performance of the 4 B/FLOP system: because Y(i) is provided at a 1 B/FLOP bandwidth rate, the total bandwidth rate at the vector registers does not reach 4 B/FLOP. On the other hand, in Case 1, the data of X(i) and Y(i) with the same index i are supplied from either the vector cache or the memory. While being supplied from the memory, X(i) and Y(i) have to share the memory bandwidth. Therefore, the 1 B/FLOP and 2 B/FLOP systems need a cache hit rate of 100% to achieve a performance comparable to that of the 4 B/FLOP system.


Figure 4. Relationship between cache hit rate and computation efficiency on Kernel loop (1).

Similarly, we examine the performance of Kernel (2) on the 2 B/FLOP system in the following three cases. In Case 1, all of X(i), Y(i) and Z(i) are cached with the same index i in advance. In Case 2, both X(i) and Y(i) are provided from the vector cache, so the maximum cache hit rate is 66%. In Case 3, only X(i) is in the cache, so the maximum cache hit rate is 33%. Figure 5 shows the relationship between the cache hit rate and the relative memory bandwidth for Kernel (2). On the 4 B/FLOP system, each of X(i), Y(i) and Z(i) is provided at a 4/3 B/FLOP bandwidth rate on average. The performance of Case 2 is comparable to that of the 4 B/FLOP system when the cache hit rate is 66%, because the bandwidth per flop/s rate of each array is then 4/3 B/FLOP on average. However, in Case 3, both Y(i) and Z(i) are provided from the memory and have to share the 2 B/FLOP bandwidth rate at the vector registers. As a result, the relative memory bandwidth never reaches that of the 4 B/FLOP system.

Figure 5. Relationship between cache hit rate and relative memory bandwidth on Kernel loop (2).
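The behavior in Figures 3 to 5 is captured by a simple first-order model for the case where, as in Case 2, the cached and bypassed data belong to different operand streams that are fetched concurrently. If a fraction h of the data (the hit rate) is served by the cache-register path at B_cache = 4 B/FLOP while the remaining fraction 1 - h streams from memory at B_mem, the register-side bandwidth is limited by the slower of the two concurrent transfers:

\[
B_{\mathrm{reg}}(h) \;=\; \min\!\left( \frac{B_{\mathrm{cache}}}{h},\; \frac{B_{\mathrm{mem}}}{1-h} \right).
\]

For Case 2 on the 2 B/FLOP system, h = 0.5 gives min(8, 4) = 4 B/FLOP, matching the 4 B/FLOP system, whereas the same hit rate on the 1 B/FLOP system gives only min(8, 2) = 2 B/FLOP. The model does not apply to Case 1, where hits and misses alternate within the same streams instead of overlapping. (This is our simplified reading of the data paths, not a formula given in the paper.)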

These results indicate that the vector cache has the potential to cover the shortage of memory bandwidth in future SX systems. In addition, the discussions of Figures 3 and 5 show that not all data need to be cached: the key data that determine the performance should be cached to make good use of the limited on-chip capacity.

5.3 Relationship between Efficiency and Cache Hit Rate on Five Scientific Applications

We simulate the execution of the five scientific applications with the vector cache, varying the memory bandwidth per flop/s rate between 1 B/FLOP and 2 B/FLOP. Here, all vector load/store instructions are set to cache on. Figure 6 shows the relationship between the cache hit rate of the 8 MB vector cache and the relative memory bandwidth. The figure indicates that the vector cache can improve the effective memory bandwidth, depending on the cache hit rate of the application. Cache hit rates vary from 13% in SFHT to 96% in APFA, because they depend on the locality of reference in the individual applications. In APFA, the relative memory bandwidth is 0.96 on the 2 B/FLOP system with an almost 100% hit rate in the 8 MB cache, yielding an effective memory bandwidth almost equal to that of the 4 B/FLOP system; in this case, most data are provided from the cache.


Figure 6. Relationship between cache hit rate and relative memory bandwidth on five applications.

PBM on the 2 B/FLOP system reaches the effective memory bandwidth of the 4 B/FLOP system at a cache hit rate of 50%. Figure 7 shows the main routine of PBM. The inner loop, with index j, is vectorized, and the arrays gd_dip and wary are stored in the vector cache by vector load instructions. However, gd_dip is spilled from the cache because it needs 7.6 GB, whereas wary can be held in the cache because it needs only 250 KB and is defined in the preceding loop. Therefore, the cache hit rate of this loop is 50%. All data of wary are then supplied directly from the vector cache at the 2 B/FLOP rate, and all data of gd_dip are supplied from the memory at the 2 B/FLOP rate through the cache, so the total amount of data provided to the vector registers per unit time is equal to that of the 4 B/FLOP system.

01 do i=1,ncells
02   do j=1,ncells
03     wf_dip(i) = wf_dip(i) + gd_dip(j,i)*wary(j)
04   end do
05 end do

Figure 7. Main routine of PBM.
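The 50% hit rate of this loop can be checked with back-of-envelope footprint arithmetic, assuming 8-byte elements and ncells = 32400 as in Table 2 (the paper quotes 7.6 GB and 250 KB):

\[
\texttt{gd\_dip}:\; 32400^2 \times 8\,\mathrm{B} \approx 7.8\,\mathrm{GB} \gg 8\,\mathrm{MB}, \qquad
\texttt{wary}:\; 32400 \times 8\,\mathrm{B} \approx 253\,\mathrm{KB} \ll 8\,\mathrm{MB}.
\]

Each iteration of the inner loop loads one element of gd_dip and one of wary, so with wary resident and gd_dip always missing, about every second vector load hits.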

The relative memory bandwidths of the other applications, SFHT, GPR and PRF, are over 0.6 on the 2 B/FLOP system and 0.3 on the 1 B/FLOP system. Since the relative memory bandwidth without the cache is 0.5 on the 2 B/FLOP system and 0.25 on the 1 B/FLOP system, the improvement with the 8 MB cache exceeds 20%. Figure 8 shows the computation efficiency of the five scientific applications on the 4 B/FLOP system without the cache and on the 2 B/FLOP and 1 B/FLOP systems with and without the 8 MB cache. The figure indicates that the efficiency of the 2 B/FLOP and 1 B/FLOP systems is increased by the cache. In particular, the 2 B/FLOP system with the cache achieves the same efficiency as the 4 B/FLOP system in APFA and PBM. Figure 9 shows the relationship between the cache hit rate of the 8 MB cache and the recovery rate of performance, i.e., how far the cache recovers the 2 B/FLOP and 1 B/FLOP systems toward the 4 B/FLOP system. The recovery rate goes up as the cache hit rate increases: the vector cache raises the recovery rate by 18 to 99% on the 2 B/FLOP system and by 9 to 96% on the 1 B/FLOP system, depending on the data access locality of each application. These results indicate that the vector cache has the potential to cover the shortage of memory bandwidth of future SX systems on real scientific applications.

5.4 Relationship between Associativity and Cache Hit Rate

In this section, we examine the effect of the associativity, ranging from 2 to 8 ways, on the performance. Figure 10 shows the cache hit rate for each set-associative cache, covering cache sizes from 256 KB to 8 MB. Four applications, APFA, PRF, SFHT and PBM, have approximately constant cache hit rates across the three associativity cases. In GPR, the cache hit rate varies slightly with the associativity. The cache hit rate depends on the placement of the arrays at memory addresses, the memory access patterns of the programs, and the associativity. The basic computational structure of GPR consists of triple-nested loops. The loops access many arrays with a stride of 584 bytes, and the cached data are frequently replaced. Therefore, the cache hit rate of GPR varies with the cache capacity and the associativity.

Figure 8. Computation efficiency of five applications with/without the 8 MB vector cache.

Figure 9. Recovery rate of performance on the 8 MB cache.

In SFHT, the memory access patterns have 8-byte stride accesses. In the 2 MB case, cached data that would be reused in the 2-way set-associative cache are evicted by other data; this is caused by the placement of the arrays at memory addresses. The other programs have constant cache hit rates across the three associativity cases: their memory access patterns are 8-byte or 16-byte stride accesses, and the placement of the arrays in memory does not cause conflicts among reused data. These results show that the associativity hardly affects the cache hit rate, and that the cache capacity is the main influence on the cache hit rate across the five applications.

5.5 Effects of Vector Cache Latency on Performance

In general, the latency of an on-chip cache is considerably shorter than that of the off-chip memory. We investigate the effects of the cache access latency on the performance of the five applications. Figure 11 shows the relationship between the computation efficiency of the applications and four cache latencies, 15, 50, 85 and 100% of the memory access latency, on the 2 B/FLOP system with the 2 MB cache. The five applications have almost the same efficiency regardless of the cache latency. This is because the vector architecture can hide memory access times by pipelined vector operations when the vector operation ratio and the vector loop length of an application are large enough. In addition, the vector cache consists of 32 sub-caches between the vector register file and the main memory, one per memory port. The sub-caches connected to the 32 memory ports provide multiple words at a time, and the cache latency of the second and later memory accesses is hidden within the sub-cache system. For comparison, we examine the performance at shorter loop lengths using Kernel loops (1) and (2) on the 2 B/FLOP system while changing the cache access latency. Figure 12 shows the relationship between the computation efficiency of the kernel loops and the four cache latency cases. Here, we examine three loop lengths at a 100% cache hit rate: 1 ≤ i ≤ 64, 1 ≤ i ≤ 128, and 1 ≤ i ≤ 256. For the shorter loop lengths, Figure 12 indicates that the vector cache latency is an important factor in the performance. In particular, at loop length 64, the 15% latency case has twice the efficiency of the 100% case in Kernel (1), and 12% higher efficiency in Kernel (2).
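A first-order timing model makes this loop-length dependence explicit (our sketch, not the simulator's actual cost model). If a vector memory instruction of length n pays an access latency L once and then streams one element per cycle through each of p parallel pipes, its time is roughly

\[
T(n) \;\approx\; L + \frac{n}{p}, \qquad \text{latency share} \;=\; \frac{L}{L + n/p},
\]

so at n = 64 the latency term dominates and cutting L to 15% of the memory latency pays off directly, while at n = 256 or more the streaming term dominates and the latency is effectively hidden, consistent with Figures 11 and 12.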

Figure 10. Cache hit rate vs. associativity on five applications.

Figure 11. Computation efficiency of five applications on four cache latency cases.

However, the longer the loop length, the smaller the effect of the cache latency. Figure 13 shows the computation efficiency of the kernel loops on the 4 B/FLOP and 2 B/FLOP systems with and without the vector cache. At both loop lengths 64 and 128 of Kernel (1), the efficiency of the 4 B/FLOP system with the cache is higher than without it, and the loop length 128 case of the 4 B/FLOP system with the cache achieves the highest efficiency. Meanwhile, in Kernel (2), the loop length 64 case of the 4 B/FLOP system with the cache has the highest efficiency. This is because the ratio of memory access time to processing time becomes the smallest owing to the decrease in memory latency provided by the cache, and in these kernel loops the memory access time at the shorter loop lengths cannot be entirely hidden by pipelined vector operations alone.

Figure 12. Computation efficiency of two kernel loops on four cache latency cases.
Figure 13. Computation efficiency of two kernel loops on the 4 B/FLOP and 2 B/FLOP systems.

Therefore, the performance at shorter loop lengths is sensitive to the vector cache latency, and the vector cache can boost the performance of applications with short vector lengths even on the 4 B/FLOP system. Moreover, the cache latency is not hidden when bank conflicts occur; in this work, bank conflicts hardly occur in the five applications. Quantitative discussion of the influence of bank conflicts is left as future work.

6. Optimizations for Vector Caching: Selective Caching and Loop Unrolling


In this section, we discuss optimization techniques for the vector cache that reduce cache miss rates in scientific applications. In particular, we show the relationship between loop unrolling and caching, since many compilers automatically apply loop unrolling, the most basic loop optimization.

6.1 Effects of Selective Caching

The vector cache has a bypass mechanism, so data can be cached selectively: selective caching. The simulation methods of four of the applications, GPR, APFA, PRF and SFHT, are based on difference schemes, the Finite Difference Time Domain (FDTD) method and Direct Numerical Simulation of flow and heat (DNS), in which a subset of the arrays has high locality in many cases. As the on-chip cache size is limited, selective caching may be an effective approach to efficient use of the on-chip cache. We evaluate selective caching on two applications, GPR and PRF, which are typical examples of FDTD and DNS, respectively. Figure 14 shows one of the kernel loops of GPR; the loop is a main routine of the FDTD method, and many routines of GPR are similar to it. The innermost loop is indexed with j. Usually, FORTRAN compilers would interchange loops j and i; however, the FORTRAN90/SX compiler does not interchange them here, because it decides that the computational performance cannot be improved owing to the short length of loop i. Loop i has a length of 50, and loop j has a length of 750. As the size of each array used in GPR is 330 MB, the cache cannot hold all the arrays. However, the difference scheme shown in Figure 14 generally has temporal locality in accessing the arrays H_x, H_y and H_z, and we select these arrays for selective caching. Figure 15 shows the effect of selective caching on the 2 B/FLOP system with the vector cache in GPR. Figure 16 shows the recovery rate of performance from the 2 B/FLOP system without the cache toward the 4 B/FLOP system under selective caching in GPR. We discuss cache sizes from 256 KB up to 8 MB. Here, ALL in the figures indicates that all the arrays in the loop are cached, and Selective indicates that some of the arrays are selectively cached. Cache Usage Rate indicates the ratio of the number of cache hit references to the total memory references.


01    DO 10 k=0,Nz
02    DO 10 i=0,Nx
03    DO 10 j=0,Ny
04      E_x(i,j,k) = C_x_a(i,j,k)*E_x(i,j,k)
05   &    + C_x_b(i,j,k) * ((H_z(i,j,k) - H_z(i,j-1,k))/dy
06   &    - (H_y(i,j,k) - H_y(i,j,k-1))/dz - E_x_Current(i,j,k))
07      E_z(i,j,k) = C_z_a(i,j,k)*E_z(i,j,k)
08   &    + C_z_b(i,j,k) * ((H_y(i,j,k) - H_y(i-1,j,k))/dx
09   &    - (H_x(i,j,k) - H_x(i,j-1,k))/dy - E_z_Current(i,j,k))
10      E_y(i,j,k) = C_y_a(i,j,k)*E_y(i,j,k)
11   &    + C_y_b(i,j,k) * ((H_x(i,j,k) - H_x(i,j,k-1))/dz
12   &    - (H_z(i,j,k) - H_z(i-1,j,k))/dx - E_y_Current(i,j,k))
13 10 CONTINUE

Figure 14. A kernel loop of GPR.
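Selective caching for this loop means setting the cache-on flag only on the loads of H_x, H_y and H_z, while all other accesses bypass the cache. The directive spelling below is hypothetical, intended only to illustrate how the cache-control flags of Section 4.1 might be attached to the loop; the actual FORTRAN90/SX directive syntax is not shown in the paper:

!     hypothetical cache-control directives (not actual FORTRAN90/SX syntax)
!HYP$ CACHE_ON  (H_x, H_y, H_z)
!HYP$ CACHE_OFF (E_x, E_y, E_z, C_x_a, C_x_b, C_y_a, C_y_b, C_z_a, C_z_b)
!     ... the loop of Figure 14 follows unchanged ...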

Figure 15. Efficiency of selective caching and cache hit rate on the 2 B/FLOP system (GPR).

Figure 16. Recovery rate of performance and cache hit rate on selective caching (GPR).

In the 256 KB cache case, the efficiencies of ALL and Selective are 33.3% and 34.1%, and the cache usage rates are 9.6% and 16.6%, respectively. Under Selective, the arrays H_y(i,j,k) and H_y(i-1,j,k) (line 08 in Figure 14), H_x(i,j,k) (line 11) and H_z(i-1,j,k) (line 12) can load data from the cache. Under ALL, H_y(i-1,j,k) (line 08) and H_z(i-1,j,k) (line 12) cause cache misses, because the reused data of these arrays are replaced by other array data.



Figures 15 and 16 indicate that the performance and the cache usage rate increase with the cache size. With the 4 MB cache, the cache hit rate of Selective is 24% and its efficiency is 3% higher than that of ALL, while the recovery rate of performance is a mere 34%. GPR cannot achieve high cache usage rates and recovery rates by selective caching, because this loop contains many arrays without locality: C_x_a, E_x, E_x_Current, and so on.

01 DO KK = 2,NJ
02   DO I = 1,NI
03     wDY1YC3(I,KK-1,L) = wDY1YC3(I,KK-1,L) * wDY1YC2(I,KK-1,L)
04     wDY1YC2(I,KK,L) = wDY1YC2(I,KK,L) - wDY1YC1(I,KK,L) * wDY1YC3(I,KK-1,L)
05     wDY1YC2(I,KK,L) = 1.D0 / wDY1YC2(I,KK,L)
06     wDY1YC(I,KK,L) = (wPHIY12(I,KK) - wDY1YC1(I,KK,L) * wDY1YC1(I,KK-1,L)) * wDY1YC2(I,KK,L)
07   END DO
08 END DO

Figure 17. High-cost kernel loop of PRF.

Figure 17 shows one of the kernel loops of PRF; the loop is a high-cost routine. The size of each array in PRF is 18 MB, and many loops are doubly nested. The arrays are thus treated as two-dimensional arrays, and the size of the cached data per array is only 2 MB. wDY1YC1, wDY1YC2 and wDY1YC3 in Figure 17 are defined in the preceding loop. In addition, wDY1YC1 and wDY1YC2 have spatial locality, and we select these arrays for selective caching. Here, wDY1YC3(I,KK-1,L) (line 04 in Figure 17), wDY1YC2(I,KK,L) (line 05), and wDY1YC1(I,KK,L) and wDY1YC2(I,KK,L) (line 06) can reuse data in the register files, so these references do not access the cache.

Figure 18. Efficiency and cache hit rate of selective caching on performance (PRF).

Figures 18 and 19 show the efficiency and the recovery rate of performance from the 2 B/FLOP system toward the 4 B/FLOP system under selective caching in PRF. Figure 18 indicates that the cache hit rate and efficiency of Selective are the same as those of ALL up to the 2 MB cache size: wDY1YC2(I,KK-1,L) (line 03) and wDY1YC1(I,KK-1,L) (line 06) are the cache-hit arrays in each case, and the cache hit rate is 33%. From the 4 MB cache size upward, Selective achieves a 66% cache hit rate, resulting in an improvement in efficiency of more than 30%; moreover, Figure 19 indicates that the recovery rate reaches 78%. The results for GPR and PRF show that selective caching improves the performance of the applications. Compared with GPR, PRF has higher values of both the cache usage rate and the improved efficiency, because the ratio of arrays with locality per loop is higher in PRF than in GPR. In GPR the reused data are defined only within the loop itself, so cold misses always occur in the loop: for example, H_y(i,j,k) (line 08 in Figure 14) hits the data loaded by H_y(i,j,k) (line 06), but the load at line 06 itself cannot hit, being the first access (a cold miss). In contrast, the cached data of PRF are produced in the immediately preceding loop, so such first-access misses do not occur in PRF.

Figure 19. Recovery rate of performance and cache hit rate on selective caching (PRF).

6.2 Effects of Loop Unrolling and Caching

Loop unrolling is effective for higher utilization of the vector pipelines; however, it also needs more cache capacity to hold all the arrays of the unrolled loops. We therefore discuss how loop unrolling and vector caching interact in their effect on performance.

01 do i=1,ncells,4
02   do j=1,ncells
03     wf_dip(i)   = wf_dip(i)   + gd_dip(j,i)*wary(j)
04     wf_dip(i+1) = wf_dip(i+1) + gd_dip(j,i+1)*wary(j)
05     wf_dip(i+2) = wf_dip(i+2) + gd_dip(j,i+2)*wary(j)
06     wf_dip(i+3) = wf_dip(i+3) + gd_dip(j,i+3)*wary(j)
07   end do
08 end do

Figure 20. Outer-unrolled loop of PBM.

Figure 20 shows the code of Figure 7 outer-unrolled four times. In the outer-unrolled case, the array wary at outer loop index i (line 03 in Figure 20) reuses the data cached at index i-1, and the wary references at lines 04, 05 and 06 reuse the data of wary (line 03) in the register files; hence, memory references are reduced. However, the cache capacity needed for gd_dip increases with the number of array references in the innermost loop. Figure 21 shows the efficiency and the cache hit rate of the outer-unrolled code of PBM on the 2 B/FLOP system; 2B/F indicates the case without the cache, whose efficiency increases steadily with the degree of unrolling. The cache hit rate of the 0.5 MB cache is 49% without unrolling and becomes poor as the loop unrolling proceeds, because the cached data of wary are evicted from the cache by gd_dip. Thus the efficiency of the 0.5 MB cache reaches 49% without unrolling and is similar to that of the 2 B/FLOP system with unrolling: here a conflicting effect between caching and unrolling appears. Moreover, the cache hit rates of the 8 MB cache decrease as the degree of unrolling increases. The cached data remain in the cache in this case, but the hit rate decreases because the number of wary (line 03) references decreases. Nevertheless, the efficiency stays approximately constant at 48 or 49%, because unrolling compensates for the diminishing effect of caching; in this case the effect of caching is comparable to that of unrolling.

Figure 22 shows the code obtained by unrolling the loop of Figure 17 two times. Outer unrolling reduces the memory references of the arrays wDY1YC2(I,KK,L) (line 07 in Figure 22) and wDY1YC1(I,KK,L) (line 10). Figure 23 shows the efficiency and the cache hit rate of outer unrolling for PRF on the 2 B/FLOP system. The cache hit rates of both the 0.5 MB and 8 MB caches gradually decline as the degree of unrolling increases, because of the increase in the cache miss rate. In this case, the highest efficiency is 33% on the 8 MB cache with an unrolling degree of 4, since the gain from loop unrolling covers the loss due to the increased miss rate; this is higher than the 30.7% of the selective caching case (cf. Figure 18). Thus this case shows a synergistic effect of caching and unrolling.

Figure 21. Efficiency and cache hit rate of outer unrolling on performance (PBM).
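The declining hit-rate curve of the 8 MB cache in Figure 21 follows from a simple load count (a sketch assuming wary stays resident and always hits, gd_dip streams and always misses, and register reuse does not touch the cache). With unroll degree u, each inner iteration issues u loads of gd_dip and one load of wary, so

\[
h(u) \;\approx\; \frac{1}{u+1},
\]

about 50% for the rolled loop (measured: 49%), 33% for u = 2, and 20% for u = 4.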

01 DO KK = 2,NJ,2
02   DO I = 1,NI
03     wDY1YC3(I,KK-1,L) = wDY1YC3(I,KK-1,L) * wDY1YC2(I,KK-1,L)
04     wDY1YC2(I,KK,L)   = wDY1YC2(I,KK,L) - wDY1YC(I,KK,L) * wDY1YC3(I,KK-1,L)
05     wDY1YC2(I,KK,L)   = 1.D0 / wDY1YC2(I,KK,L)
06     wDY1YC(I,KK,L)    = (wPHIY12(I,KK) - wDY1YC1(I,KK,L) * wDY1YC1(I,KK-1,L)) * wDY1YC2(I,KK,L)
07     wDY1YC3(I,KK,L)   = wDY1YC3(I,KK,L) * wDY1YC2(I,KK,L)
08     wDY1YC2(I,KK+1,L) = wDY1YC2(I,KK+1,L) - wDY1YC(I,KK+1,L) * wDY1YC3(I,KK,L)
09     wDY1YC2(I,KK+1,L) = 1.D0 / wDY1YC2(I,KK+1,L)
10     wDY1YC(I,KK+1,L)  = (wPHIY12(I,KK+1) - wDY1YC1(I,KK+1,L) * wDY1YC1(I,KK,L)) * wDY1YC2(I,KK+1,L)
11   END DO
12 END DO

Figure 22. Outer-unrolled loop of PRF.

Figure 23. Efficiency and cache hit rate of outer unrolling on performance (PRF).

These results indicate that loop unrolling can have both conflicting and synergistic effects with caching. Loop unrolling is one of the most basic optimizations and is beneficial in various applications; therefore, caching has to be used carefully, as a complementary tuning option to loop unrolling.

7. Conclusions
This paper has presented the characteristics of an on-chip vector cache with a bypass mechanism on the vector supercomputer NEC SX architecture. We have demonstrated the relationship between the cache hit rate and the performance using two DAXPY-like loops, and have shown the same tendency in five scientific applications.


The vector cache can compensate for the lack of memory bandwidth and boosts the computation efficiency of the 2 B/FLOP and 1 B/FLOP systems. In particular, when cache hit rates are 50% or more, the 2 B/FLOP system can achieve a performance comparable to the 4 B/FLOP system: the vector cache with the bypass mechanism can provide data from both the memory and the cache concurrently, so the sustained memory bandwidth to the registers is increased. Therefore, the vector cache has the potential to cover the shortage of memory bandwidth in future SX systems. We have also discussed the relationship between performance and cache design parameters such as cache associativity and cache latency. We have examined the effect of the cache associativity from 2-way to 8-way: when the associativity is two or more, the cache hit rate is not sensitive to the associativity in the five applications. In addition, we have examined the effect of the cache latency on the performance when setting it to 15, 50, 85 and 100% of the memory latency. The computational efficiencies of the five applications are constant across these latency changes when the vector loop lengths of the applications are 256 or more; in these cases, the latency is hidden by pipelined vector operations. However, for shorter vector loop lengths, the cache latency affects the performance: the 15% latency case of Kernel (1) has twice the efficiency of the 100% case on the 2 B/FLOP system, and even the 4 B/FLOP system is boosted by the short latency of the cache. These results indicate that the performance of the cache depends little on the cache associativity, whereas the cache latency affects the performance of applications with short vector loop lengths. Finally, we have discussed selective caching and the relationship between loop unrolling and caching. We have shown that selective caching, controlled by means of the bypass mechanism, is effective for efficient use of the limited on-chip cache. We have examined two cases, in which the ratio of arrays with locality per loop is higher and lower; in each case, selective caching achieves higher performance than caching all the data. In addition, loop unrolling is useful for improving performance, and caching is complementary to its effect. In future work, we will evaluate other scientific applications using the vector cache and quantitatively discuss the relationship between bank conflicts and the cache latency. We will also investigate vector cache mechanisms for multi-core vector processors.

Acknowledgments
The authors gratefully thank Dr. M. Sato, Dr. A. Hasegawa, Professor G. Masuya, Professor T. Ota and Professor K. Sawaya of Tohoku University for their advice on the applications. We also thank Matsuo Aoyama and Hirofumi Akiba of NEC Software Tohoku for their assistance with the experiments.
REFERENCES
[1] Abts, D., et al., The Cray BlackWidow: A Highly Scalable Vector Multiprocessor, In Proceedings of the ACM/IEEE SC2007 Conference (2007).
[2] Abts, D., Scott, S., and Lilja, D. J., So Many States, So Little Time: Verifying Memory Coherence in the Cray X1, In International Parallel and Distributed Processing Symposium (2003, April).
[3] Ariyoshi, K., Matsuzawa, T., and Hasegawa, A., The key frictional parameters controlling spatial variation in the speed of postseismic slip propagation on a subduction plate boundary, Earth Planet. Sci. Lett., 256: 136-146 (2007).
[4] Dunigan Jr., T. H., Vetter, J. S., White III, J. B., and Worley, P. H., Performance Evaluation of the Cray X1 Distributed Shared-Memory Architecture, IEEE Micro, 25: 30-40 (2005).
[5] Gee, J. D., and Smith, A. J., Vector Processor Caches, Technical Report UCB/CSD-92-707, University of California (1992, October).
[6] Gee, J. D., and Smith, A. J., The Effectiveness of Caches for Vector Processors, In The 8th International Conference on Supercomputing 1994, pp. 333-343 (1994, July).
[7] Kitagawa, K., Tagaya, S., Hagihara, Y., and Kanoh, Y., A Hardware Overview of SX-6 and SX-7 Supercomputer, NEC Research & Development, 44: 2-7 (2003).
[8] Kobayashi, H., Implication of Memory Performance in Vector-Parallel and Scalar-Parallel HEC, In M. Resch et al. (Eds.), High Performance Computing on Vector Systems 2006, pp. 21-50, Springer-Verlag (2006).
[9] Kobayashi, H., Musa, A., Sato, Y., Takizawa, H., and Okabe, K., The Potential of On-chip Memory Systems for Future Vector Architectures, In M. Resch et al. (Eds.), High Performance Computing on Vector Systems 2007, pp. 247-264, Springer-Verlag (2007).
[10] Kobayashi, T., Feng, X., and Sato, M., FDTD simulation on array antenna SAR-GPR for land mine detection, In Proceedings of SSR2003: 1st International Symposium on Systems and Human Science, Osaka, Japan, pp. 279-283 (2003, November).
[11] Kontothanassis, L. I., Sugumar, R. A., Faanes, G. J., Smith, J. E., and Scott, M. L., Cache Performance in Vector Supercomputers, In Proceedings of the 1994 Conference on Supercomputing, Washington D.C., pp. 255-264 (1994).
[12] Kunz, K. S., and Luebbers, R. J., The Finite Difference Time Domain Method for Electromagnetics, Boca Raton, FL: CRC Press (1993).
[13] Musa, A., Sato, Y., Egawa, R., Takizawa, H., Okabe, K., and Kobayashi, H., An On-Chip Cache Design for Vector Processors, In Proceedings of the 8th MEDEA Workshop on Memory Performance: Dealing with Applications, Systems and Architecture, Romania, pp. 17-23 (2007).
[14] Musa, A., Takizawa, H., Okabe, K., Soga, T., and Kobayashi, H., Implications of Memory Performance for Highly Efficient Supercomputing of Scientific Applications, In Proceedings of the 4th International Symposium on Parallel and Distributed Processing and Application 2006, Italy, pp. 845-858 (2006).
[15] Nakajima, M., Yanaoka, H., Yoshikawa, H., and Ota, T., Numerical Simulation of Three-Dimensional Separated Flow and Heat Transfer around Staggered Surface-Mounted Rectangular Blocks in a Channel, Numerical Heat Transfer (Part A), 47: 691-708 (2005).
[16] Oliker, L., et al., A Performance Evaluation of the Cray X1 for Scientific Applications, In VECPAR'04: 6th International Meeting on High Performance Computing for Computational Science, Valencia, Spain (2004, June).
[17] Senta, T., Takahashi, A., Kadoi, T., Itoh, T., Kawaguchi, S., and Kishida, Y., Itanium2 32-way Server System Architecture, NEC Research & Development, 44: 8-12 (2003).
[18] SGI, SGI Altix 4700 Servers and Supercomputers, Technical Report, SGI Inc., http://www.sgi.com/pdfs/3867.pdf (2007).
[19] Shan, H., and Strohmaier, E., Performance Characteristics of the Cray X1 and Their Implications for Application Performance Tuning, In 18th Annual International Conference on Supercomputing, Saint-Malo, France, pp. 175-183 (2004).
[20] Takagi, Y., Sato, H., Wagatsuma, Y., Mizuno, K., and Sawaya, K., Study of High Gain and Broadband Antipodal Fermi Antenna with Corrugation, In 2004 International Symposium on Antennas and Propagation, Vol. 1, pp. 69-72 (2004).
[21] Tsuboi, K., and Masuya, G., Direct Numerical Simulations for Instabilities of Premixed Planar Flames, In The Fourth Asia-Pacific Conference on Combustion, Nanjing, China, pp. 136-139 (2003, November).
