IEEE Computer Society
Updated: 7 hours 29 min ago

PrePrint: Exploiting Locality to Improve Circuit-level Timing Speculation

7 hours 29 min ago
Circuit-level timing speculation has been proposed as a technique to reduce dependence on design margins, eliminating power and performance overheads. Recent work has proposed microarchitectural methods to dynamically detect and recover from timing errors in processor logic. This work has not evaluated or exploited the disparity of error rates at the level of static instructions. In this paper, we demonstrate pronounced locality in error rates at the level of static instructions. We propose timing error prediction to dynamically anticipate timing errors at the instruction level and reduce the costly recovery penalty. This allows us to achieve 43.6% power savings when compared to a baseline policy while incurring only a 6.9% performance penalty.
Categories: IEEE Members Only
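
The abstract does not say how the predictor is built. As a rough illustration of instruction-level timing error prediction, the sketch below assumes a table of saturating counters indexed by the program counter, similar in spirit to a branch predictor. The class name, table size, counter width, threshold, and the "padded" action are all hypothetical and are not taken from the paper; the trace supplies ground truth about whether each dynamic instance would have suffered a timing error.

```python
# Illustrative sketch only. The structure (PC-indexed saturating counters)
# and every parameter below are assumptions, not the paper's design.

class TimingErrorPredictor:
    def __init__(self, entries=1024, threshold=2, max_count=3):
        self.entries = entries
        self.threshold = threshold
        self.max_count = max_count
        self.table = [0] * entries  # one saturating counter per entry

    def _index(self, pc):
        # Assume 4-byte aligned instructions, so drop the low two bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return True if this static instruction is expected to err."""
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc, had_error):
        """Train the counter with the observed outcome."""
        i = self._index(pc)
        if had_error:
            self.table[i] = min(self.max_count, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def simulate(trace, predictor):
    """Replay (pc, had_error) pairs and count outcomes.

    'padded' counts instances that were predicted to err and are assumed
    to be handled cheaply (hypothetically, by giving them extra time).
    'recoveries' counts unpredicted errors that pay the full recovery cost.
    """
    stats = {"recoveries": 0, "padded": 0}
    for pc, had_error in trace:
        if predictor.predict(pc):
            stats["padded"] += 1
        elif had_error:
            stats["recoveries"] += 1
        predictor.update(pc, had_error)
    return stats
```

With a trace in which one static instruction errs on every execution, only the first few instances pay for recovery before the counter crosses the threshold, which is the locality effect the abstract describes.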

PrePrint: PRR-PRR Dynamic Relocation

7 hours 29 min ago
Partial bitstream relocation (PBR) on FPGAs has been gaining attention in recent years as a potentially promising technique to scale parallelism of accelerator architectures at run time, enhance fault tolerance, etc. PBR techniques to date have focused on reading inactive bitstreams stored in memory, on-chip or off-chip, whose contents are generated for a specific partial reconfiguration region (PRR) and modified on demand for configuration into a PRR at a different location. As an alternative, we propose a PRR-PRR relocation technique to generate source and destination addresses, read the bitstream from an active PRR (source) in a non-intrusive manner, and write it to the destination PRR. We describe two options for realizing this on Xilinx Virtex-4 FPGAs: (a) a hardware-based accelerated relocation circuit (ARC) and (b) a software solution executed on MicroBlaze. A comparative performance analysis highlighting the speed-up obtained using ARC is presented. For real test cases, the performance of our implementations is compared to the estimated performance of two state-of-the-art methods.
Categories: IEEE Members Only

PrePrint: A process-variation aware technique for tile-based, massive multi-core processors

7 hours 29 min ago
Process variations in advanced nodes introduce significant core-to-core performance differences in single-chip multi-core architectures. Isolating each core with its own frequency and voltage island helps improve the performance of the multi-core architecture by letting each core operate at its highest possible frequency rather than operating all the cores at the frequency of the slowest core. However, inter-core communication suffers from additional cross-clock-domain latencies that can offset the performance benefits. This work proposes the concept of the configurable, variable-size frequency and voltage domain, described in the context of a tile-based, massive multi-core architecture.
Categories: IEEE Members Only

PrePrint: Characterizing the Energy Consumption of Software Transactional Memory

7 hours 29 min ago
The well-known drawbacks imposed by lock-based synchronization have forced researchers to devise new alternatives for concurrent execution, of which transactional memory is a promising one. Extensive research has been carried out on Software Transactional Memory (STM), most of it concentrated on program performance, leaving unattended other metrics of great importance like energy consumption. This letter presents a thorough evaluation of energy consumption in a state-of-the-art STM. We show that energy and performance results do not always follow the same trend and, therefore, it might be appropriate to consider different strategies depending on the focus of the optimization. We also introduce a novel strategy based on dynamic voltage and frequency scaling for contention managers, revealing important energy and energy-delay product improvements in highly contended scenarios. This work is a first study towards a better understanding of the energy consumption behavior of STM systems, and could prompt STM designers to research new optimizations in this area, paving the way for an energy-aware transactional memory.
Categories: IEEE Members Only
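
The abstract mentions the energy-delay product (EDP) and notes that energy and performance do not always follow the same trend. The short sketch below shows how the three metrics can rank configurations differently. EDP is simply energy multiplied by execution time; the policy names and measurements are made-up placeholders for illustration and are not results from the letter.

```python
def energy_delay_product(energy_joules, delay_seconds):
    """EDP = energy x delay; lower is better."""
    return energy_joules * delay_seconds


def best_by(runs, metric):
    """Return the name of the run that minimises metric(energy, delay)."""
    return min(runs, key=lambda name: metric(*runs[name]))


# Hypothetical (energy in joules, delay in seconds) per contention policy.
# These numbers are invented purely to illustrate the ranking.
runs = {
    "policy_a": (12.0, 2.0),  # faster, but uses more energy
    "policy_b": (10.0, 3.0),  # slower, but uses less energy
}

fastest = best_by(runs, lambda e, d: d)
lowest_energy = best_by(runs, lambda e, d: e)
lowest_edp = best_by(runs, energy_delay_product)

print(fastest, lowest_energy, lowest_edp)
```

Here the fastest policy and the lowest-energy policy differ, and EDP acts as a tie-breaker that weighs both.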

PrePrint: Power Management of Datacenter Workloads Using Per-Core Power Gating

7 hours 29 min ago
While modern processors offer a wide spectrum of software-controlled power modes, most datacenters only rely on Dynamic Voltage and Frequency Scaling (DVFS, a.k.a. P-states) to achieve energy efficiency. This paper argues that, in the case of datacenter workloads, DVFS is not the only option for processor power management. We make the case for per-core power gating (PCPG) as an additional power management knob for multi-core processors. PCPG is the ability to cut the voltage supply to selected cores, thus reducing to almost zero the leakage power for the gated cores. Using a testbed based on a commercial 4-core chip and a set of real-world application traces from enterprise environments, we have evaluated the potential of PCPG. We show that PCPG can significantly reduce a processor's energy consumption (up to 40%) without significant performance overheads. When compared to DVFS, PCPG is highly effective, saving up to 30% more energy than DVFS. When DVFS and PCPG operate together they can save up to almost 60%.
Categories: IEEE Members Only
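
To make the distinction between the two knobs concrete, here is a deliberately simplified power model. It assumes the textbook approximation that dynamic power scales with C·V²·f for busy cores, and it assumes (as a simplification of this sketch, not of the paper) that leakage scales linearly with voltage for every powered core and is zero for gated cores. All constants are normalised, invented values. The model reports power only and ignores the longer runtime that a lower frequency would cause, so it does not reproduce the paper's energy figures.

```python
def chip_power(total_cores, busy, gated, volt=1.0, freq=1.0,
               c_eff=1.0, leak_per_core=0.5):
    """Toy chip power in normalised units (all constants are assumptions).

    busy cores pay dynamic power c_eff * V^2 * f;
    every powered (non-gated) core pays leakage leak_per_core * V;
    gated cores are assumed to contribute nothing.
    """
    powered = total_cores - gated
    if busy > powered:
        raise ValueError("busy cores cannot exceed powered cores")
    dynamic = busy * c_eff * volt ** 2 * freq
    leakage = powered * leak_per_core * volt
    return dynamic + leakage


# One busy core on a 4-core chip, under four hypothetical policies.
baseline = chip_power(4, busy=1, gated=0)
dvfs_only = chip_power(4, busy=1, gated=0, volt=0.8, freq=0.8)
pcpg_only = chip_power(4, busy=1, gated=3)
combined = chip_power(4, busy=1, gated=3, volt=0.8, freq=0.8)

for name, p in [("baseline", baseline), ("DVFS", dvfs_only),
                ("PCPG", pcpg_only), ("DVFS+PCPG", combined)]:
    print(f"{name}: {p:.3f}")
```

In this toy setting, gating the three idle cores removes their leakage entirely, while DVFS only scales it down, which is the intuition behind treating PCPG as a complementary knob.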

PrePrint: Operand Registers and Explicit Operand Forwarding

7 hours 29 min ago
Operand register files are small, inexpensive register files that are integrated with function units in the execute stage of the pipeline, effectively extending the pipeline operand registers into register files. Explicit operand forwarding lets software opportunistically orchestrate the routing of operands through the forwarding network to avoid writing ephemeral values to registers. Both mechanisms let software capture short-term reuse and locality close to the function units, improving energy efficiency by allowing a significant fraction of operands to be delivered from inexpensive registers that are integrated with the function units. An evaluation shows that capturing operand bandwidth close to the function units allows operand registers to reduce the energy consumed in the register files and forwarding network of an embedded processor by 61%, and allows explicit forwarding to reduce the energy consumed by 26%.
Categories: IEEE Members Only

PrePrint: Accurate Functional-First Multicore Simulators

7 hours 29 min ago
Fast and accurate simulation of multicore systems requires a parallelized simulator. This paper describes a novel method to build cycle-accurate-capable and parallelizable functional-first simulators of multicore targets.
Categories: IEEE Members Only

IEEE Computer Architecture Letters - January-June 2009 (Vol. 8, No. 1)

7 hours 29 min ago
IEEE Computer Architecture Letters
Categories: IEEE Members Only