Toward Efficient Fine-grained Dynamic Scheduling on Many-Core Architectures
Authors
Abstract
The recent evolution of many-core architectures has resulted in chips where the number of processing elements (PEs) is in the hundreds and continues to grow. In addition, many-core processors are increasingly characterized by the diversity of their resources and by the way the sharing of those resources is arbitrated. On such machines, task scheduling is of paramount importance to orchestrate a satisfactory distribution of tasks and an efficient utilization of resources, especially when fine-grained parallelism is desired or required. In the past, the primary focus of scheduling techniques has been on achieving load balance and reducing overhead with the aim of increasing total performance. This focus has resulted in a scheduling paradigm where Static Scheduling (SS) is preferred over Dynamic Scheduling (DS) for highly regular and embarrassingly parallel applications running on homogeneous architectures. We have revisited the task scheduling problem for these types of applications under the scenario imposed by many-core architectures to investigate whether there exist scenarios where DS is better than SS. Our main contribution is the idea that, for highly regular and embarrassingly parallel applications, DS is preferable to SS in some situations commonly found in many-core architectures. We present experimental evidence that shows how the performance of SS is degraded by the new environment on many-core chips. We analyze three reasons that contribute to the superiority of DS over SS on many-core architectures under the situations described: 1. A uniform mapping of work to processors that does not consider the granularity of tasks is not necessarily scalable under limited amounts of work. 2. The presence of shared resources (e.g. the crossbar switch) produces unexpected and stochastic variations in the duration of tasks, which SS is unable to manage properly. 3. Hardware features, such as in-memory atomic operations, greatly contribute to lowering the overhead of DS.
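The abstract's third point attributes the low overhead of DS to in-memory atomic operations. The sketch below is a rough illustration of that idea, not the paper's implementation: it contrasts a static block partition with dynamic self-scheduling in which idle workers claim the next chunk of task indices through a single fetch-and-add. All names (NUM_TASKS, CHUNK, simulated_task) are illustrative assumptions, and portable C++ std::atomic stands in for whatever native atomic units a many-core chip would provide.

// Minimal sketch (assumed names and parameters) of static vs. dynamic scheduling.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

constexpr std::size_t NUM_TASKS = 1 << 16;  // total number of fine-grained tasks
constexpr std::size_t CHUNK = 64;           // tasks claimed per atomic update
const unsigned NUM_WORKERS =
    std::max(1u, std::thread::hardware_concurrency());

// Stand-in for a short, regular task whose duration may still vary at run time
// (for example, because of contention on a shared resource such as a crossbar).
void simulated_task(std::size_t i) {
    volatile double x = static_cast<double>(i);
    for (int k = 0; k < 100; ++k) x = x * 1.000001 + 0.5;
}

// Static scheduling: each worker receives a fixed, contiguous block of tasks.
void run_static() {
    std::vector<std::thread> workers;
    const std::size_t block = (NUM_TASKS + NUM_WORKERS - 1) / NUM_WORKERS;
    for (unsigned w = 0; w < NUM_WORKERS; ++w) {
        workers.emplace_back([=] {
            const std::size_t begin = w * block;
            const std::size_t end = std::min(NUM_TASKS, begin + block);
            for (std::size_t i = begin; i < end; ++i) simulated_task(i);
        });
    }
    for (auto& t : workers) t.join();
}

// Dynamic scheduling: workers repeatedly claim the next chunk of task indices
// with one fetch_add, mimicking a low-overhead in-memory atomic operation.
void run_dynamic() {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < NUM_WORKERS; ++w) {
        workers.emplace_back([&] {
            for (;;) {
                const std::size_t begin = next.fetch_add(CHUNK);
                if (begin >= NUM_TASKS) break;
                const std::size_t end = std::min(NUM_TASKS, begin + CHUNK);
                for (std::size_t i = begin; i < end; ++i) simulated_task(i);
            }
        });
    }
    for (auto& t : workers) t.join();
}

int main() {
    auto time_ms = [](void (*fn)()) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        return std::chrono::duration<double, std::milli>(
                   std::chrono::steady_clock::now() - t0).count();
    };
    std::printf("static : %.2f ms\n", time_ms(run_static));
    std::printf("dynamic: %.2f ms\n", time_ms(run_dynamic));
}

With perfectly uniform task durations the two variants behave similarly, but once shared-resource contention makes durations vary, the dynamic variant keeps all workers busy at the cost of one atomic update per chunk of tasks.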
Related resources
Implementing Fine/Medium Grained TLP Support in a Many-Core Architecture
We believe that future many-core architectures should support a simple and scalable way to execute many threads that are generated by parallel programs. A good candidate to implement an efficient and scalable execution of threads is the DTA (Decoupled Threaded Architecture), which is designed to exploit fine/medium grained Thread Level Parallelism (TLP) by using a hardware scheduling unit and r...
Resource-agnostic programming for many-core microgrids
Many-core architectures are a commercial reality, but programming them efficiently is still a challenge, especially if the mix is heterogeneous. Here granularity must be addressed, i.e. when to make use of concurrency resources and when not to. We have designed a data-driven, fine-grained concurrent execution model (SVP) that captures concurrency in a resource-agnostic way. Our approach separat...
Architectural Support for Fine-Grained Parallelism on Multi-core Architectures
In order to harness the additional compute resources of future Multi-core Architectures (MCAs) with many cores, applications must expose their thread-level parallelism to the hardware. One common approach to doing this is to decompose a program into parallel “tasks” and allow an underlying software layer to schedule these tasks on different threads. Software task scheduling can provide good par...
ParaWeaver: Performance Evaluation on Programming Models for Fine Grained Threads
There is a trend towards multicore or manycore processors in computer architecture design. In addition, several parallel programming models have been introduced. Some extract concurrent threads implicitly whenever possible, resulting in fine grained threads. Others construct threads by explicit user specifications in the program, resulting in coarse grained threads. How these two mechanisms imp...
An Energy Efficient Real-Time Object Recognition Processor with Neuro-Fuzzy Controlled Workload-aware Task Pipelining
An energy efficient pipelined architecture is proposed for multi-core object recognition processor. The proposed neuro-fuzzy controller and intelligent estimation of the workload of input video stream enable seamless pipelined operation of the 3 object recognition tasks. The neuro-fuzzy controller extracts the fine-grained region-of-interest, and its task pipelining achieves 60.6fps, 5.8x highe...
Journal:
Volume/Issue:
Pages: -
Publication date: 2012