like traditional I/O pads; the d2d vias have size and electrical characteristics similar to conventional vias that connect on-die metal routing layers. In face-to-face bonding, through-silicon vias (TSVs) are required to connect the C4 I/O to the active regions of the two die. Power is also delivered through the backside vias. Die #2 is thinned for improved electrical characteristics and physical construction of the TSVs for power delivery and I/O. Good discussions of the processing details can be found in [8][11][14][15][26][27].
Recently, 3D die stacking has been drawing a great deal of attention, primarily in embedded processor systems. Prior work examines system-on-chip opportunities [4][5][10][16][18][24], explores cache implementations [15][28][30], designs 3D adder circuits [14][21], and projects wire benefits in full microprocessors [1][4][5][17][29]. In order to transform 3D design research ideas into products, Technology Venture sponsors a dedicated forum for "3D Architectures for Semiconductor Integration and Packaging." At this forum [33] it is clear that the embedded industry considers emerging 3D technology a very attractive method for integrating small systems. Furthermore, existing 3D products from the Samsung [32] and Tezzaron [34] corporations demonstrate that the silicon processing and assembly of structures similar to Figure 1 are feasible in large-scale industrial production. Hence, this work focuses on the power, performance, and thermal issues of 3D stacking without delving into feasibility details.
This paper explores the performance advantages of eliminating wire using 3D on two fronts:
(1) Shorten wires dedicated to off-die interfaces connecting disparate die, such as off-die wires connecting CPU and memory. Section 3 evaluates the performance potential of stacking memory on logic (Memory+Logic) [5][7][12][13]. We quantify the performance and power benefits of stacking large SRAM or DRAM caches on a microprocessor. Our results show that dramatically increasing on-die storage increases performance and reduces required off-die bandwidth while simultaneously reducing power. A key difference between our work and previous studies is that the prior work assumes that all of main memory can be integrated into the 3D stack. We consider RMS applications that target systems with main memory requirements that cannot be incorporated in a two-die stack, and instead we use the 3D-integrated DRAM as additional high-density cache.
(2) The second approach is to shorten wires connecting blocks within a traditional planar microprocessor. In this approach it is possible to implement a traditional microarchitecture across two or more die to construct a 3D floorplan. Such Logic+Logic stacking takes advantage of increased transistor density to eliminate wire between blocks of the microarchitecture [1][17][25]. The result is shorter latencies between blocks, yielding higher performance and lower power. Section 4 takes a microprocessor from the Intel® Pentium® 4 family and converts it to a Logic+Logic 3D stacking to quantify the performance and power benefits of reduced wire delays in 3D.
While 3D provides power and performance advantages in both of the above approaches, the most significant concern for 3D design is that it may increase thermal hotspots. We evaluate the thermal impact of 3D design in the two scenarios and show that while 3D design does increase the temperature, the growth in temperature is negligible or can be overcome by an overall reduction in power consumption. Our results demonstrate that thermals are not an inexorable barrier to 3D design, as generally believed.
2. Modeling Environment
This section describes our 3D performance and thermal evaluation infrastructure. The Memory+Logic stacking evaluation presented in Section 3 requires us to evaluate the performance of adding large caches to a microprocessor. In order to evaluate large-cache benefits it is necessary to have long-running benchmarks with large data footprints to exercise the cache structures. On the other hand, evaluating Logic+Logic stacking of a microprocessor requires a detailed microarchitecture simulator that can model the interconnection delays of logic blocks accurately. Hence, the goals of the two infrastructures conflict, forcing us to use two different simulators, which are described in Section 2.1 and Section 2.2, respectively. For both scenarios we use a general thermal simulation infrastructure, which is described in Section 2.3.
2.1. Modeling Memory+Logic Performance
For evaluating Memory+Logic stacking we use a trace-driven multi-processor memory hierarchy simulator that can run billions of memory references to exercise large caches. This internally developed research tool is designed to model all aspects of the memory hierarchy, including DRAM caches with banks, RAS, CAS, page sizes, etc. The input to this simulator is a novel memory address trace generated from a multi-threaded application running on a full-system multi-processor simulator. The trace generator module runs alongside the full-system simulator and keeps track of dependencies between instructions. The trace generator outputs one trace record for each memory instruction executed by the full-system simulator. In addition to the usual trace fields such as cpu id, memory access address, and instruction pointer address, every trace record contains the unique identification number of an earlier trace record this record is dependent upon. The memory hierarchy simulator in turn honors all the dependencies specified in the trace.
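To make the trace format concrete, the sketch below shows one way such dependency-annotated records could be represented and replayed. The field names (record_id, cpu_id, addr, ip, dep_id), the latencies, and the replay policy are our own illustrative assumptions; the paper's internal tool is not public.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TraceRecord:
        record_id: int         # unique id of this memory instruction
        cpu_id: int            # which core executed it
        addr: int              # memory access address
        ip: int                # instruction pointer address
        is_write: bool
        dep_id: Optional[int]  # id of the earlier record this one depends on

    def replay(trace):
        """Replay records, honoring dependencies: a record may not issue
        before the record it depends on has completed (hypothetical policy)."""
        completion_cycle = {}  # record_id -> cycle when that access finished
        cycle = 0
        for rec in trace:
            if rec.dep_id is not None:
                # stall issue until the producer record has completed
                cycle = max(cycle, completion_cycle[rec.dep_id])
            latency = 1 if rec.is_write else 4   # placeholder hit latencies
            completion_cycle[rec.record_id] = cycle + latency
            cycle += 1                           # one issue slot per cycle
        return cycle

    # usage: a load followed by a dependent store
    trace = [TraceRecord(0, 0, 0x1000, 0x400, False, None),
             TraceRecord(1, 0, 0x1040, 0x404, True, 0)]
    print(replay(trace))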
…the three stacking options described above. The bars in Figure 5 show that for several of the RMS benchmarks (gauss, pcg, sMVM, sTrans, sUS, and svm) CPMA (cycles per memory access) decreases dramatically as the last-level cache increases from 4 to 64MB. The benchmarks that do not see improvement fit in the 4MB baseline and do not require more capacity. The secondary Y-axis plots the off-die bandwidth for all four configurations. The bandwidth lines in Figure 5 show a significant reduction in off-die bandwidth as the cache capacity increases. The larger caches are effective at converting off-die bus accesses to on-die cache hits. Increasing the last-level cache capacity from 4MB to 32MB, on average, reduces bus bandwidth requirements by 3x and CPMA by 13%, with a peak CPMA reduction of 50%. There is also a 66% reduction in average bus power, due to reduced bus activity. Assuming a bus power consumption rate of 20mW/Gb/s, 3D stacking of DRAM reduces bus power by 0.5W.
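As a back-of-envelope check on these numbers (our reconstruction, not arithmetic from the paper): a 3x drop in bus traffic is a ~66% activity reduction, and combining the quoted 0.5W saving with the 20mW/Gb/s rate implies an average baseline bus utilization well below the 16 GB/s peak of Table 3:

    P_bus = 20 mW/Gb/s x B_avg
    dP ~= (1 - 1/3) x P_base ~= 0.5 W  =>  P_base ~= 0.75 W
    B_avg ~= 0.75 W / (20 mW/Gb/s) ~= 37.5 Gb/s ~= 4.7 GB/s average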
The performance improvements and bandwidth reductions in Figure 5 are very good; however, in a 3D die stack the resulting thermals may not be acceptable. Figure 6(a) illustrates the power density map and Figure 6(b) the thermal map of the baseline microprocessor with 4MB of shared L2 cache, which occupies approximately 50% of the chip area. The power map clearly illustrates the difference in the heat generated within the cores relative to the cache. The total power corresponding to the power maps is from a 92W skew of the baseline processor. The greatest concentration of power is in the FP units, reservation stations, and the load/store unit, pointed to in Figure 6(b). Using our 3D thermal modeling tool, assuming standard desktop package cooling and an ambient temperature of 40ºC, the two hottest spots are at 88.4ºC and the coldest spot is at 59ºC for the reference planar design.
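The paper's thermal tool is described in Section 2.3 (not reproduced here). As a minimal illustration of how a power density map like Figure 6(a) is turned into a thermal map like Figure 6(b), the sketch below solves a steady-state heat balance on a 2D grid with a convective path to a heat sink; all geometry and conductance values are illustrative assumptions, not the paper's model.

    import numpy as np

    def thermal_map(power, t_ambient=40.0, k_lat=1.0, k_sink=0.1, iters=5000):
        """Tiny steady-state thermal solver (illustrative only).
        power  : 2D array, W dissipated per grid cell
        k_lat  : lateral conductance between neighboring cells (W/K)
        k_sink : conductance from each cell to the heat sink (W/K)
        Fixed-point iteration of the per-cell balance:
        P + k_lat*(sum of neighbor T - 4T) = k_sink*(T - t_ambient)"""
        T = np.full(power.shape, t_ambient)
        for _ in range(iters):
            Tp = np.pad(T, 1, mode='edge')   # adiabatic (replicated) edges
            neigh = Tp[:-2, 1:-1] + Tp[2:, 1:-1] + Tp[1:-1, :-2] + Tp[1:-1, 2:]
            T = (power + k_lat * neigh + k_sink * t_ambient) / (4 * k_lat + k_sink)
        return T

    # toy two-core floorplan: hot core blocks on a cooler cache background
    P = np.full((40, 60), 0.01)
    P[5:15, 5:20] = 0.2    # core #1
    P[25:35, 5:20] = 0.2   # core #2
    T = thermal_map(P)
    print(T.max(), T.min())  # hotspots over the cores, coolest over the cache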
Figure 7 shows the block diagrams, including power consumption, of (a) the baseline 4MB processor; (b) an additional 8MB of stacked SRAM for a total of 12MB of L2; (c) 32MB of stacked DRAM with the 4MB SRAM removed; and (d) 64MB of stacked DRAM. In our design, 4MB of SRAM consumes 7W, 32MB of DRAM consumes 3.1W, and 64MB of DRAM consumes 6.2W. This 3D DRAM is low power compared to DDR3 because the 3D die-to-die interconnect is much lower power than traditional off-die I/O. The RC of the all-copper die-to-die interconnect used to interface the DRAM to the processor is comparable to 1/3 the RC of a typical via stack from first metal to last metal. The power of each configuration in Figure 7 is a little different, making thermal comparisons challenging.
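Putting those figures side by side (our arithmetic from the quoted numbers, not stated in the paper): the stacked DRAM delivers roughly 18x lower power per MB than the SRAM, which is why configurations (c) and (d) can grow capacity 8-16x while consuming less cache power than the baseline.

    # Cache power figures quoted in the text (W)
    sram_4mb_w  = 7.0    # 4 MB of SRAM
    dram_32mb_w = 3.1    # 32 MB of stacked DRAM
    dram_64mb_w = 6.2    # 64 MB of stacked DRAM

    print("SRAM :", sram_4mb_w / 4,   "W/MB")   # ~1.75  W/MB
    print("DRAM :", dram_32mb_w / 32, "W/MB")   # ~0.097 W/MB (matches 6.2/64)
    print("ratio:", (sram_4mb_w / 4) / (dram_32mb_w / 32))  # ~18x per MB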
Table 3. Microarchitecture parameters

  Parameter                    Value
  Core parameters              Same as Intel® Core™ 2 Duo
  L1D cache                    32KB, 64B line, 8-way, 4 cyc
  Shared L2                    4MB, 64B line, 16-way, 16 cyc
  Stacked L2                   SRAM: 12MB, 24 cyc
                               DRAM: 4-64MB, 512B page, 16 address-interleaved banks, 64B sectors
  DDR main memory              16 banks, 4KB page, 192 cyc
  Bank delays                  Page open 50 cyc; Precharge 54 cyc; Read 50 cyc
  (stacked L2 & DDR memory)
  Off-die bus BW               16 GB/s
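To make these parameters concrete, a simulator configuration might encode Table 3 as follows; all field names are our own hypothetical choices, since the paper's internal tool is not public.

    # Hypothetical encoding of Table 3 for a memory-hierarchy simulator
    # (field names are ours, not the paper's).
    TABLE3 = {
        "core":       "Intel Core 2 Duo class",
        "l1d":        {"size_kb": 32, "line_b": 64, "ways": 8, "latency_cyc": 4},
        "shared_l2":  {"size_mb": 4, "line_b": 64, "ways": 16, "latency_cyc": 16},
        "stacked_l2": {
            "sram": {"size_mb": 12, "latency_cyc": 24},
            "dram": {"size_mb": (4, 64), "page_b": 512, "banks": 16, "sector_b": 64},
        },
        "ddr_memory": {"banks": 16, "page_kb": 4, "latency_cyc": 192},
        "bank_delays_cyc": {"page_open": 50, "precharge": 54, "read": 50},
        "offdie_bus_gb_s": 16,
    }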
Figure 6. Intel® Core™ 2 Duo planar floorplan: (a) power map; (b) thermal map
[Figure 6 annotations: temperature scale; hottest spots at 88.35ºC over the FP units (FP), reservation stations (RS), and load/store unit (LdSt); coolest area at 59ºC over the 4MB L2 cache; Core #1 and Core #2 labeled in panels (a) and (b); the edge temperature drop is due to an epoxy fillet around the die.]