Mohammed A. Noaman Al-hayanni¹, Rishad Shafik¹, Ashur Rafiev², Fei Xia¹, Alex Yakovlev¹
(m.a.n.al-hayanni, rishad.shafik, ashur.rafiev, fei.xia, alex.yakovlev)@ncl.ac.uk
School of Electrical and Electronic Engineering¹ and Computer Science², Newcastle University, UK
Traditional speedup models, such as Amdahl's, facilitate the study of the impact of running parallel workloads on many-core systems. However, these models are typically based on software characteristics only and assume ideal hardware behavior. As such, the applicability of these models to energy- and/or performance-driven system optimization is limited by two factors. Firstly, speedup cannot be measured without instrumenting the original software code, and secondly, the parallelization factor of an application running on specific hardware is generally unknown.
In this paper, we propose a novel method whereby standard performance counters found in modern many-core platforms can be used to derive speedup without instrumenting applications for time measurements. We postulate that speedup can be accurately estimated as the ratio of the instructions per cycle of the parallel many-core system to that of a single-core system. This leads to the determination of the parallelization factor and the optimal system configuration for energy and/or performance. The method is extensively demonstrated through experiments on three different platforms, with core counts ranging from 4 to 61, running a parallel benchmark application on the Linux operating system. Speedup and parallelization estimations using the method and their cross-validations show negligible errors (up to 8%) in these systems.
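The estimate described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' tooling: the function names and the sample IPC values are hypothetical, and obtaining the parallelization factor by inverting Amdahl's law is an assumption about how the factor follows from the estimated speedup.

```python
def ipc_speedup(ipc_many_core, ipc_single_core):
    """Speedup estimated as the ratio of many-core IPC to single-core IPC."""
    if ipc_single_core <= 0:
        raise ValueError("single-core IPC must be positive")
    return ipc_many_core / ipc_single_core


def parallel_fraction(speedup, n_cores):
    """Invert Amdahl's law, S = 1 / ((1 - P) + P / N), to solve for P.

    Assumption: the workload follows Amdahl's law on n_cores cores.
    """
    if n_cores < 2:
        raise ValueError("need at least two cores to solve for P")
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_cores)


# Hypothetical counter-derived IPC values, for illustration only.
s = ipc_speedup(ipc_many_core=3.2, ipc_single_core=2.0)
p = parallel_fraction(s, n_cores=4)
print(f"speedup = {s:.2f}, parallel fraction = {p:.2f}")
```

Solving for P this way only makes sense for more than one core, which is why the sketch rejects n_cores below 2.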
The source code of the purpose-designed benchmark application is available at: pthreads.c. The code exposes several parameters, which are described in the table below.
Parameters | Description |
v | Turns on verbose timestamp reporting (time for each sequential and parallel section). By default, only the total time is reported. |
p | Specifies the parallel fraction P (float, 0 < P < 1, default is 0.5). |
j | Adjusts the parallel workload, such that Workload_par = total Workload * P * j and Workload_seq = total Workload * (1 - P). For example, for P = 0.5 we expect the same execution time for the parallel and sequential sections. |
w, r | The w (int, default is 10000) and r (int, default is 5) options specify the workload size. The application first runs the parallel section, with each thread doing Workload_par*4000 iterations of sqrt; then Workload_seq*4000 iterations of sqrt are done in a single (main) thread. This {parallel+sequential} execution is repeated r times in total. |
z | Pins the sequential execution (and thus the main thread) to the specified core (int, default is 0). |
c | Defines the per-thread core pinning (sequence of ints as string, no spaces, e.g. 0123 or 777, default is 01). The number of parallel threads is equal to the length of this string. For example, "-c 3276 -z 7" will spawn 4 threads pinning them to C3, C2, C7, and C6, while pinning sequential execution to C7. |
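As a reading aid for the table above, the following Python sketch shows how the options could translate into thread count, core pinning, and iteration counts. It is not taken from pthreads.c: the function names are hypothetical, and the value passed for j is an arbitrary illustration because the table states no default for it.

```python
def parse_pinning(c="01"):
    """Turn a -c string such as "3276" into one core index per thread."""
    return [int(ch) for ch in c]


def iterations(w=10000, p=0.5, j=1.0):
    """Return (per-thread parallel iterations, sequential iterations).

    Follows the table: Workload_par = w * P * j, Workload_seq = w * (1 - P),
    each multiplied by 4000 sqrt iterations. j=1.0 is an assumed value.
    """
    workload_par = w * p * j
    workload_seq = w * (1.0 - p)
    return workload_par * 4000, workload_seq * 4000


cores = parse_pinning("3276")  # four threads pinned to C3, C2, C7, C6
par_iters, seq_iters = iterations(w=10000, p=0.5, j=1.0)
print(len(cores), cores, par_iters, seq_iters)
```

Reading one character per core index limits this sketch to cores 0 to 9; how pthreads.c addresses higher-numbered cores is not described in the table.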
The outcomes collected from running the experimental benchmark application on the Intel platforms (Core-i7-4820k, Xeon E5-2630V2 and Xeon Phi 7120X) are available at: Outcomes.xlsx. The results are gathered in a single file containing three spreadsheets, one for each platform.
Column label | Description |
P | Parallel fraction. |
C | Number of pinned cores. |
Sequential, Parallel and Total | Experimental sequential, parallel and total execution times. |
SPTheory | Amdahl's speedup (theoretical). SP = 1 / ((1 - P) + P / N) |
SPTime | Experimental speedup based on execution time. SP = T1 / TN |
errExp | Experimental speedup error ratio in comparison with theoretical speedup. |
IR_sum | Total INSTR_RETIRED_ANY. |
IR_Max | Maximum INSTR_RETIRED_ANY. |
Delta | Total INSTR_RETIRED_ANY - Maximum INSTR_RETIRED_ANY. |
Delta/T | Delta / total execution time. |
UC_Max | Maximum CPU_CLK_UNHALTED_CORE. |
SpClock | Maximum unhalted clock speedup. |
errTheory | Maximum unhalted clock speedup error ratio in comparison with theoretical speedup. |
errExp | Maximum unhalted clock speedup error ratio in comparison with experimental speedup. |
IPC | System Instructions per Clock. |
IPCApp | Application Instructions per Clock. |
SPIPC | Speedup based on System Instructions per Clock. |
errTheory | System Instructions per Clock based speedup error ratio in comparison with theoretical speedup. |
errExp | System Instructions per Clock based speedup error ratio in comparison with experimental speedup. |
SPIPCApp | Speedup based on Application Instructions per Clock. |
errTheory | Application Instructions per Clock based speedup error ratio in comparison with theoretical speedup. |
errExp | Application Instructions per Clock based speedup error ratio in comparison with experimental speedup. |
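The theoretical and time-based speedup columns, and an error column derived from them, can be reproduced with simple arithmetic. The Python sketch below is only an illustration: the function names are hypothetical, the timings are made up, and the error ratio is assumed to be the absolute difference relative to the reference value, which the table does not spell out.

```python
def amdahl_speedup(p, n):
    """Theoretical speedup for parallel fraction p on n cores (SPTheory)."""
    return 1.0 / ((1.0 - p) + p / n)


def time_speedup(t1, tn):
    """Speedup from execution times on one core and on n cores (SPTime)."""
    return t1 / tn


def error_ratio(estimate, reference):
    """Assumed definition: |estimate - reference| / reference."""
    return abs(estimate - reference) / reference


# Made-up timings in seconds, for illustration only.
sp_theory = amdahl_speedup(p=0.5, n=4)
sp_time = time_speedup(t1=10.0, tn=6.4)
print(f"SPTheory = {sp_theory:.3f}, SPTime = {sp_time:.3f}, "
      f"error = {error_ratio(sp_time, sp_theory):.3%}")
```

The same error_ratio helper could be applied to the counter-based speedup columns, using either the theoretical or the experimental speedup as the reference.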
Last modified 06/12/2016 by IGC