Search results for: instruction fetch

Number of results: 42508

1995
Lizy Kurian John Vinod Reddy Paul T. Hulina Lee D. Coraor

Software-oriented techniques to hide memory latency in superscalar and superpipelined machines include loop unrolling, software pipelining, and software cache prefetching. Issuing the data fetch request prior to actual need for data allows overlap of accessing with useful computations. Loop unrolling and software pipelining do not necessitate microarchitecture or instruction set architecture ch...

1997
Gary S. Tyson Todd M. Austin

As processors continue to exploit more instruction level parallelism, a greater demand is placed on reducing the effects of memory access latency. In this paper, we introduce a novel modification of the processor pipeline called memory renaming. Memory renaming applies register access techniques to load instructions, reducing the effect of delays caused by the need to calculate effective addresses...

2010
Shubhajit Roy Chowdhury

The paper focuses on the use of field programmable gate arrays (FPGA) for signal processing applications. By allowing designers to create circuit architectures developed for the specific applications, high levels of performance can be achieved using FPGA for many digital signal processing (DSP) applications providing considerable improvements over conventional microprocessor and dedicated DSP p...

2013
Shahnawaz Talpur Yizhuo Wang Shahnawaz Farhan Khahro XiaoJun Wang Xu Chen Feng Shi

To achieve the highest performance amid the rapid advancement of multi-core technology, the large gap between faster processor speeds and memory must be minimized. The issue becomes more critical when a branch occurs with the penalty of a cache miss. Many researchers have proposed different branch prediction and instruction prefetching methods and algorithms, but CPU pipeline performance couldn’t be th...

2007
Peter E. Strazdins Bill Clarke Andrew Over

This paper presents a novel technique for cycle-accurate simulation of the Central Processing Unit (CPU) of a modern superscalar processor, the UltraSPARC III Cu processor. The technique is based on adding a module to an existing fetch-decode-execute style of CPU simulator, rather than the traditional method of fully modelling the CPU microarchitecture. It is also suitable for accurate SMP model...

1995
Maged M. Michael Michael L. Scott

Our research addresses the general topic of atomic update of shared data structures on large-scale shared-memory multiprocessors. In this paper we consider alternative implementations of the general-purpose single-address atomic primitives fetch_and_Φ, compare_and_swap, load_linked, and store_conditional. These primitives have proven popular on small-scale bus-based machines, but have yet to bec...

Journal: Universität Trier, Mathematik/Informatik, Forschungsbericht 1998
Christoph W. Kessler Helmut Seidl

ForkLight is an imperative, task-parallel programming language for massively parallel shared memory machines. It is based on ANSI C, follows the SPMD model of parallel program execution, provides a sequentially consistent shared memory, and supports dynamically nested parallelism. While no assumptions are made on uniformity of memory access time or instruction-level synchronicity of the underl...

2007
Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Mateo Valero

In this paper, we propose Runahead threads on Simultaneous Multithreading processors as a valuable solution for both exploiting the memory-level parallelism and reducing the resource contention. This approach transforms a memory-bounded eager resource thread into a speculative light thread, alleviating critical resource conflicts among multiple threads. Furthermore, it improves the thread-level p...

2000
Gabriel Loh Dana Henry

The increasing complexity of modern superscalar processors makes the evaluation of new designs more difficult. Current simulators such as Stanford’s SimOS [16] and the University of Wisconsin’s Simplescalar Toolset [2] perform detailed cycle-level simulation of the processor to obtain performance measurements at the cost of very slow simulation times. This report presents and analyzes an algori...

2004
Pradheep Elango Saisuresh Krishnakumaran Ramanathan Palaniappan

Large instruction window processors can achieve high performance by supplying more instructions during long latency load misses, thus effectively hiding these latencies. Continual Flow Pipeline (CFP) architectures provide high-performance by effectively increasing the number of actively executing instructions without increasing the size of the cycle-critical structures. A CFP consists of a Slic...
