Simplified Multi-Ported Cache in High Performance Processor
Hao Zhang, Dongrui Fan
Key Laboratory of Computer System and Architecture
Institute of Computing Technology
Chinese Academy of Sciences
zhanghao@ict.ac fandr@ict.ac
Abstract
The memory bandwidth demands of modern microprocessors require the use of a multi-ported cache to achieve peak performance. However, multi-ported caches are costly to implement. In this paper we propose a technique for using a simplified dual-ported cache instead, mostly composed of single-ported SRAMs, without an apparent decrease in processor performance. We evaluate this technique using realistic applications that include the operating system. Our technique, using a simplified multi-ported banked cache, reduces the delay of the select logic in the LSQ by 16.9% and achieves 98.1% of the performance of an ideal dual-ported cache.
1. Introduction
It is well known that the on-chip memory system of a microprocessor is a key determinant of microprocessor performance [1]. Improvements in microprocessor performance continue to surpass the performance gains of their memory subsystems. Higher clock rates and increasing numbers of instructions issued and executed in parallel account for much of this improvement [2]. By exploiting instruction level parallelism (ILP), processors are capable of issuing multiple instructions per cycle, which places a greater demand on the memory system to serve multiple requests per cycle. As microprocessor designers push for more performance, the trend to aggressively exploit more and more ILP will continue. With about a third of a program's instruction mix being memory references [12], an average of 2 loads/stores per cycle is necessary to sustain a 4-wide issue rate. With such demands, multi-ported cache implementations are clearly necessary. Since the cache is on the critical path of the pipeline and its complexity is high [16], there is a need to explore low-cost techniques for designing simple multi-ported caches that meet the need for sustainable cache bandwidth.
Currently, multiple cache ports are implemented in one of four ways [2]: by conventional and costly ideal multi-porting, by time-division multiplexing, by replicating multiple single-ported copies of the cache, or (with lower performance and possibly lower cost) by interleaving the cache into multiple independently-addressed banks. Conceptually, ideal multi-porting requires that all n ports of an n-ported cache be able to operate independently, allowing up to n cache accesses per cycle to any addresses. However, ideal multi-porting is generally considered too costly and impractical for commercial implementation in anything larger than a register file. Current commercial multi-ported implementations therefore use one of the remaining three techniques. For the time-division technique, as clock speeds double every technology generation and cache access latencies grow beyond one cycle, this method is no longer used by mainstream microprocessors. For multiple-copy replication, to keep both copies coherent, every store operation must be sent to both cache ports simultaneously, reducing the effectiveness and scalability of this approach relative to ideal multi-porting; another major cost of this approach is the die area needed for cache replication. For the banking technique, although dividing a cache into banks can be economical, the cost of the crossbar between the load/store units and the cache ports grows superlinearly as the banks (and ports) increase.
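As an illustration of the banking approach, a minimal sketch of bank selection might look as follows. The parameters are assumptions for illustration only (8 banks interleaved at cache-line granularity, 32-byte lines), not taken from any particular processor:

```python
# Sketch: bank selection in an interleaved (banked) cache.
# Illustrative parameters: 8 banks, 32-byte lines, banks interleaved
# at cache-line granularity.
LINE_SIZE = 32
NUM_BANKS = 8

def bank_of(addr: int) -> int:
    """Bank index = low-order bits of the cache-line number."""
    return (addr // LINE_SIZE) % NUM_BANKS

def can_issue_together(addr_a: int, addr_b: int) -> bool:
    """Two accesses proceed in the same cycle only if they hit different banks."""
    return bank_of(addr_a) != bank_of(addr_b)

# Adjacent lines (0x0000 and 0x0020) land in different banks and can issue
# together; lines exactly NUM_BANKS lines apart (0x0000 and 0x0100) wrap to
# the same bank and conflict.
```

This is the conflict behavior the banking discussion above refers to: a well-scheduled reference stream spreads accesses across banks, while an unlucky stream repeatedly collides in one bank.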
Although each technique has significant costs and drawbacks, we find that multi-banking holds the key to a low-cost cache memory design that can cope with increasing degrees of instruction level parallelism, and it is preferred by mainstream microprocessors [7][9].
A look at the memory reference stream reveals that the number of load accesses to the cache is about the same as the number of store accesses [12], since load instructions are roughly twice as many as store instructions, but each store accesses the cache twice, once for hit checking and once for writing its data to the cache. In this paper we propose and evaluate a technique for simplifying the multi-banking approach by separating loads from stores when accessing the cache. Loads and stores use dedicated ports without disturbing each other. The select function in the LSQ (Load/Store Queue) [17][18] is simplified, the delay of selecting bank-conflict-free operations is decreased substantially, and performance remains similar to that of a processor with an ideal multi-ported cache.
1-4244-0328-6/06/$20.00 ©2006 IEEE.
The rest of this paper is organized as follows: related work is given in Section 2; the architecture and environment used in this investigation are described in Section 3; Section 4 describes the simplified multi-ported cache; Section 5 contains the results, comparing our design with the original design and an ideal design; and Section 6 presents the conclusions drawn from the experimental results.
2. Related Work
All four approaches mentioned in the introduction have been implemented in processors, and other techniques have also received attention in the research community.
The time-division multiplexed technique, employed in the DEC Alpha 21264 [6], achieves dual-porting by running the cache SRAM at twice the speed of the processor clock. As data access parallelism moves beyond 2, extending this technique by running SRAMs n times as fast as the processor becomes infeasible as a multi-porting solution. The main issue with the time-division multiplexed cache technique is that the bit-line delay does not scale with technology as well as the clock cycle time does.
The data cache implementation in the DEC Alpha 21164 [8] provides an example of multi-porting through multiple-copy replication. The 21164 implements a two-ported cache by maintaining an identical copy of the data set in each cache.
A 2-bank (interleaved) data cache is found, for example, in the MIPS R10000 [9]. A simultaneously served pair of data references must address different banks. With a well-balanced and well-scheduled memory reference stream, this approach can boost data access parallelism and deliver high bandwidth. With the wrong memory reference stream, however, bank access conflicts can seriously degrade the delivered performance toward single-port, single-bank performance.
The AMD Opteron has separate L1 data and instruction caches, each 64 Kbytes. They are two-way set-associative, linearly indexed, and physically tagged, with a cache line size of 64 bytes [7]. The Opteron's cache has two 64-bit ports, so two accesses can occur each cycle, with any combination of loads and stores possible. The dual-port mechanism is implemented by banking: the cache consists of 8 individual banks, and two accesses can occur simultaneously if they are to different banks.
One way to increase memory bandwidth while still using a single-ported cache is to add a small buffer with a one-cycle access time to the processor's load/store execution unit [15]. This level-zero cache can be implemented as a small line buffer, which holds cache data inside the processor's load/store unit so that loads to recently accessed data can be satisfied from the line buffer instead of from the cache. Because of its complexity, this technique is not widely adopted by mainstream processors, to avoid affecting the critical path, and it is usually used to design low-power caches [11].
Our work is based on the banking technique, but simplifies the selecting logic in the LSQ, allowing one store and one load to be sent to a virtually dual-ported cache that is implemented with single-ported SRAM.
3. Proposed Architecture
This section describes the processor architecture that is used to produce the results presented in this paper. The details of our dynamic superscalar processor and memory model are presented below, followed by descriptions of the simulator and the benchmarks.
3.1. The Godson-2 Processor
The Godson project [3][4] is the first attempt to design high performance general-purpose microprocessors in China. The Godson-2 processor is a 64-bit, 4-issue, out-of-order execution RISC processor that implements a 64-bit MIPS-like instruction set. The adoption of aggressive out-of-order execution techniques (such as register renaming, branch prediction, and dynamic scheduling) and cache techniques (such as a non-blocking cache, load speculation, and dynamic memory disambiguation) helps the Godson-2 processor achieve high performance even at a moderate frequency. The Godson-2 processor has been physically implemented in a 6-metal 0.13 µm CMOS technology based on an automatic place-and-route flow with the help of some crafted library cells and macros. The chip area is 6,700 µm by 6,200 µm and the clock cycle at the typical corner is 1.3ns.
Figure 1 shows the memory hierarchy. This CPU has a 64KB data cache and a 64KB instruction cache, plus an off-chip 8MB L2 cache. The L1 data cache is write-allocate and write-back. To overlap the TLB and L1 cache accesses, the L1 cache is indexed by virtual address and tagged by physical address. More details are given in paper [4].
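For illustration, the address decomposition implied by the L1 parameters in Table 1 (64KB, 4-way set-associative, 32-byte lines) can be sketched as follows; the field widths below are derived from those parameters, not quoted from the paper:

```python
# Sketch: address decomposition for a 64KB, 4-way, 32-byte-line cache.
# Derivation: sets = 64KB / (4 ways * 32B) = 512 -> 9 index bits;
# line offset = log2(32) = 5 bits. The 14 index+offset bits come from
# the virtual address, so set selection can start in parallel with the
# TLB lookup; the tag is then compared against the physical address.
CACHE_SIZE = 64 * 1024
WAYS = 4
LINE_SIZE = 32

NUM_SETS = CACHE_SIZE // (WAYS * LINE_SIZE)   # 512 sets
OFFSET_BITS = LINE_SIZE.bit_length() - 1      # 5 bits
INDEX_BITS = NUM_SETS.bit_length() - 1        # 9 bits

def split(addr: int):
    """Return (tag, set_index, line_offset) for a byte address."""
    offset = addr & (LINE_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

Since index+offset (14 bits) exceeds a typical page offset, the index here is taken from the virtual address, which is exactly why the cache is virtually indexed and physically tagged.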
Figure 1. Memory hierarchy of the Godson-2 CPU
Table 1. Godson-2 processor parameters

Parameter                Value
Functional units         2 FP, 2 Int, 2 AGU
Pipeline depth           9 stages
Load & store queue       16 entries
Instruction window size  16-entry FP, 16-entry Int
Instruction cache        64KB, 4-way, 32 bytes/line
Data cache               64KB, 4-way, 32 bytes/line
L1 cache hit latency     3 cycles
L2 cache                 none on-chip (off-chip)
I/D TLB                  48/128 entries
Latency (to CPU)         L2: 6 cycles, Mem: 80 cycles
Branch predictor         Gshare, RAS, BTB
3.2. Cycle-by-Cycle Simulator
We developed our own full-system cycle-by-cycle simulator, which is used to build the processor prototype and to perform performance analysis. Table 1 shows the detailed configuration. Our experiments show that the simulator matches the real CPU chip quite well; the error is within 5% [5]. The configuration parameters are the same as in Table 1, except that the off-chip L2 cache is not simulated.
3.3. Benchmarks
To perform our evaluation, we run the SPEC CPU2000 benchmarks [14] on our simulator. The benchmarks are compiled with the peak settings, which perform many aggressive optimizations. The train data set is chosen because of the long simulation time. All the benchmarks run on our full-system simulator.
4. Simplified Multi-Ported Cache
To understand why we design a simplified dual-ported cache based on the banking technique, we must understand the nature of the memory reference stream as presented to the cache structure. We explore how the nature of the memory reference stream can be used to support our design, and techniques that can help simplify the select logic in the LSQ without apparently decreasing performance.
Figure 2. Loads and stores as a fraction of total instructions
4.1. Characteristics of the Memory Reference Stream
It is important to understand that typical software contains about 38 percent memory loads and stores [12]. As shown in Figure 2, a substantial part of the committed instructions in the SPEC CPU2000 benchmarks are loads and stores, and at the same time, the wide, out-of-order pipelines of modern processors also demand a multi-ported cache.
Although multiple cache ports are required to satisfy the bandwidth demands of modern superscalar processors, increasing the number of ports does not yield a linear increase in performance. In a four-issue dynamic superscalar processor with a 32 Kbyte primary data cache, adding a second cache port increases processor performance by an average of 25% [15]. As expected, there are diminishing returns from increasing the number of cache ports beyond two: increasing the number of ports to three and four increased processor performance by only four and one percent, respectively [15].
Figure 3. Ratio of loads to stores among memory instructions
Looking at the ratio between loads and stores, we can see in Figure 3 that loads are generally about twice as many as stores. A store operation accesses the cache twice: the first access checks the tags to find out whether the block it needs is in the cache; the second access, if the first one hit, writes the data into the cache. We can therefore conclude that loads access the cache about as many times as stores do.
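The counting argument above can be made concrete with an illustrative (hypothetical) instruction mix:

```python
# Illustrative mix, not measured data: loads are roughly twice as frequent
# as stores, but each store touches the cache twice (tag check, then data
# write), so the cache access counts balance out.
loads = 200      # hypothetical number of load instructions
stores = 100     # hypothetical number of store instructions (half of loads)

load_accesses = loads * 1     # a load reads tag and data in one access
store_accesses = stores * 2   # hit check + data write

assert load_accesses == store_accesses
```

This balance is what makes one dedicated load port plus one dedicated store port a reasonable split of the available bandwidth.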
4.2. Simplified Multi-Ported Cache Architecture
The original cache is a banked cache, allowing two accesses each cycle, with any combination of loads and stores possible. Two accesses can occur simultaneously if they are to different banks, and this is ensured by the select logic in the LSQ [7].
Since high performance processors with out-of-order pipelines demand multi-ported caches, and more than two ports do not pay off well, we concentrate on caches with two ports.
Because loads access the cache about as many times as stores, the simplified multi-ported cache technique separates loads from stores when accessing the cache. The cache provides two ports, one for loads and one for stores; loads and stores use their dedicated ports without disturbing each other. The select function is located in the LSQ, choosing one load and one store simultaneously.
Figure 4 shows the details of the simplified dual-ported cache. When a store accesses the cache the first time, for hit checking, only the tag SRAM is accessed, not the data SRAM; conversely, the second-time store accesses only the data SRAM, while a load accesses the tag SRAM and the data SRAM at the same time. We use a dual-ported tag SRAM and a banked single-ported data SRAM to serve one load and one store simultaneously. When the select logic in the LSQ chooses one load and one first-time store, they can be freely sent to their dedicated cache ports. But when one load and one second-time store are chosen, a bank conflict must be checked: if the two operations access the same bank, the second-time store is discarded; if they access different banks, they can be sent to the cache simultaneously. The conflict-checking logic sits after the select logic; that is, we choose one load and one store, and then check for a conflict.
Figure 4. Simplified dual-ported cache
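The selection and conflict-check flow described above can be sketched as follows. This is a minimal Python model with assumed data structures and bank parameters, not the actual Godson-2 logic:

```python
# Sketch of the simplified LSQ select: pick one load and one store per
# cycle, then run a separate conflict check after selection. A first-time
# store only probes the dual-ported tag SRAM, so it never conflicts with a
# load; a second-time store writes the banked single-ported data SRAM and
# must be deferred if it targets the same bank as the selected load.
from dataclasses import dataclass

LINE_SIZE = 32   # assumed line size
NUM_BANKS = 8    # assumed bank count

def bank_of(addr: int) -> int:
    return (addr // LINE_SIZE) % NUM_BANKS

@dataclass
class MemOp:
    kind: str                  # "load" or "store"
    addr: int
    second_pass: bool = False  # True for a store's data-write access

def select(lsq):
    """Pick the oldest load and the oldest store (entries assumed ready)."""
    load = next((op for op in lsq if op.kind == "load"), None)
    store = next((op for op in lsq if op.kind == "store"), None)
    return load, store

def conflict_check(load, store):
    """Return the operations actually issued this cycle."""
    if load and store and store.second_pass and \
            bank_of(load.addr) == bank_of(store.addr):
        return load, None      # same data bank: defer the store
    return load, store
```

For example, a load to 0x0000 paired with a second-pass store to 0x0100 maps to the same bank, so only the load issues; paired with a first-pass store (tag check only), both issue.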
5. Results
Selecting one load and one store reduces the complexity of the select logic in the LSQ, compared with selecting any two memory access operations regardless of whether they are loads or stores. This can be seen from the latency of the select logic. Table 2 gives the latency of the select logic in the LSQ with and without the simplified multi-ported cache optimization. Each LSQ entry includes 64 bits of data, a 64-bit address, and more than 20 bits of control. All the data were obtained with Design Compiler from Synopsys, Inc., in a 130nm process. The delay decreases from 1.24ns to 1.03ns, a reduction of 16.9%. The LSQ is on the critical path of the memory access pipeline, so reducing its latency is very important. If the LSQ is the pipeline stage with the longest latency, the frequency of the processor can be greatly enhanced, by 20.4% under ideal circumstances.
Table 2. Delay of select logic in LSQ

Select logic in LSQ   Original   Simplified
Delay (ns)            1.24       1.03

As described before, selecting a load and a second-time store simultaneously in the LSQ may cause a bank conflict. The experiment shows that this situation seldom happens: the frequency of such conflicts can be seen in Figure 5, averaging 4.5 times per 10,000 cycles. The influence of such conflicts can be ignored in our research.
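The 16.9% delay reduction and the 20.4% ideal frequency gain quoted above follow directly from the reported select-logic delays; a quick arithmetic check:

```python
# Check the figures derived from the reported delays: 1.24ns (original
# select logic) and 1.03ns (simplified select logic).
orig, simplified = 1.24, 1.03

delay_reduction = (orig - simplified) / orig   # fraction of the 1.24ns saved
freq_gain = orig / simplified - 1              # clock speedup if the LSQ sets the cycle time

assert round(delay_reduction * 100, 1) == 16.9
assert round(freq_gain * 100, 1) == 20.4
```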
Figure 5. Frequency of selecting a load and a second-time store
Although we simplified the select logic of the LSQ, processor performance is not apparently affected. Figure 6 gives the IPC of SPEC CPU2000 with the original select logic in the LSQ and with the simplified select logic. We can see that the change is not apparent.
Figure 7 gives the percentage of IPC decrease after using the simplified select logic in the LSQ. We can see that the performance of most of the programs decreases by no more than 0.5%, with an average of 0.3%.
We can see the comparison between the IPC of an ideal multi-ported cache and the IPC of our simplified dual-ported cache in Figure 8. Ideal multi-ported here means a dual-ported cache that serves two memory operations simultaneously, with any combination of loads and stores possible, ignoring bank conflicts. It is shown that the performance of our simplified dual-ported cache does not decrease much compared with the ideal multi-ported cache. For most of the SPEC CPU2000 programs, the performance is more than 98% of that of the processor with an ideal multi-ported cache, and achieves an average of 98.1%.
6. Conclusion
The memory bandwidth demands of modern microprocessors require the use of a multi-ported cache to achieve peak performance. However, multi-ported caches are costly to implement, and each of the four methods of implementing a multi-ported cache has its shortcomings. Moreover, adding more than two ports to the cache does not bring a good payback. Although each technique has significant costs and drawbacks, we find that multi-banking holds the key to a low-cost cache memory design that can cope with increasing degrees of instruction level parallelism, and it is preferred by mainstream microprocessors. In this paper we therefore propose a technique for using a dual-ported banked cache.
A look at the memory reference stream reveals that the number of load accesses to the cache is about the same as the number of store accesses, since load instructions are twice as many as store instructions and each store accesses the cache twice, once for hit checking and once for writing its data. We therefore propose and evaluate a technique that simplifies the dual-ported banking technique by separating loads from stores when accessing the cache; the simplified dual-ported banked cache can be implemented with a dual-ported tag SRAM and a single-ported banked data SRAM.
Our technique reduces the delay of the select logic in the LSQ by 16.9%, and achieves 98.1% of the performance of an ideal dual-ported cache, without an apparent decrease in processor performance.
7. References
[1].Wilson, K.M.; Olukotun, K.; “Designing High Bandwidth On-chip Caches”, ISCA'97, 1997 , pp.121-132.
[2].Rivers, J.A. et al., “On high-bandwidth data cache design for multi-issue processors”, MICRO'97, 1997, pp.46-56.
[3]. Weiwu Hu, Zhimin Tang, "Microarchitecture design of the Godson-1 processor", Chinese Journal of Computers, April 2003, pp.385-396.
[4]. Wei-Wu Hu, Fu-Xin Zhang, Zu-Song Li, "Microarchitecture of the Godson-2 Processor", Journal of Computer Science and Technology, Vol.20, No.2, March 2005.
[5]. Hou Rui, et al., "A Memory Bandwidth Effective Cache Store Miss Policy", Asia-Pacific Computer Systems Architecture Conference, 2005, pp.750-760.
[6]. Kessler, R., "The Alpha 21264 microprocessor", IEEE Micro, March/April 1999, pp.24-36.
[7].Chetana N. Keltcher et al.; “The AMD Opteron Processor for Multiprocessor Servers”; IEEE MICRO; Volume 23, Issue 2, 2003, pp.66-76.
[8]. John H. Edmondson, et al., "Internal Organization of the Alpha 21164, a 300-MHz 64-bit Quad-Issue CMOS RISC Microprocessor", Digital Technical Journal, Special 10th Anniversary Issue, Vol. 7, No. 1, 1995, pp.119-135.
[9]. Kenneth Yeager, "The MIPS R10000 superscalar microprocessor", IEEE Micro, 1996, pp.28-41.
[10]. Vikas Agarwal, et al., "Clock rate versus IPC: the end of the road for conventional microarchitectures", ISCA'00, 2000, pp.248-259.
[11]. Weiyu Tang et al., "Reducing power with an L0 instruction cache using history-based prediction", International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems, 2002, pp.11-18.
[12]. Jack Doweck, "Intel Smart Memory Access: Minimizing Latency on Intel Core Microarchitecture", Intel Technology Magazine, September 2006.
[13]. John L. Hennessy, David A. Patterson, Computer Architecture: A Quantitative Approach, 3rd edition, Elsevier Science Pte Ltd, 2003.
[14].J. L. Henning, “SPEC CPU 2000: Measuring CPU Performance in the new millennium”, IEEE Computer, July 2000.
[15]. Olukotun, K. et al., "Increasing Cache Port Efficiency for Dynamic Superscalar Microprocessors", ISCA'96, 1996, pp.147-147.
[16]. Agarwal, A., et al., "Exploring high bandwidth pipelined cache architecture for scaled technology", DATE'03, 2003, pp.778-783.
[17]. Baugh, L., Zilles, C., "Decomposing the load-store queue by function for power reduction and scalability", IBM Journal of Research and Development, 2006.
[18].Nicolaescu, D. et al., “Reducing data cache energy consumption via cached load/store queue”, ISLPED '03, 2003, pp.252 – 257.