Speedup and Parallelization Models for Many-Core Systems Using Performance Counters


Mohammed A. Noaman Al-hayanni1, Rishad Shafik1, Ashur Rafiev2, Fei Xia1, Alex Yakovlev1
(m.a.n.al-hayanni, rishad.shafik, ashur.rafiev, fei.xia, alex.yakovlev)@ncl.ac.uk
School of Electrical and Electronic Engineering1 and Computer Science2 - Newcastle University, UK

Abstract

Traditional speedup models, such as Amdahl's, facilitate the study of the impact of running parallel workloads on many-core systems. However, these models are typically based on software characteristics only, assuming ideal hardware behaviors. As such, the applicability of these models for energy and/or performance-driven system optimization is limited by two factors. Firstly, speedup cannot be measured without instrumenting the original software code, and secondly, the parallelization factor of an application running on specific hardware is generally unknown.
In this paper, we propose a novel method, whereby standard performance counters found in modern many-core platforms can be used to derive speedup without instrumenting applications for time measurements. We postulate that speedup can be accurately estimated as the ratio of the instructions per cycle (IPC) of a parallel many-core system to that of a single-core system. This leads to the determination of the parallelization factor and the optimal system configuration for energy and/or performance. The method is extensively demonstrated through experiments on three different platforms with core numbers ranging from 4 to 61, running a parallel benchmark application on the Linux operating system. Speedup and parallelization estimations using the method and their cross-validations show negligible errors (up to 8%) in these systems.
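The estimate described in the abstract can be sketched in C as follows. This is only an illustrative sketch, not code from the paper: the function names are hypothetical, the IPC values are assumed to have been obtained already from performance counters, and the parallel fraction is recovered by inverting Amdahl's law, which is an assumption about how the parallelization factor is derived.

```c
#include <math.h>

/* Hypothetical helper: speedup estimated as the ratio of the IPC measured
 * on the N-core run to the IPC measured on the single-core run. */
double speedup_from_ipc(double ipc_n_cores, double ipc_one_core)
{
    return ipc_n_cores / ipc_one_core;
}

/* Hypothetical helper: invert Amdahl's law, SP = 1 / ((1 - P) + P / N),
 * to recover the parallel fraction P from a speedup SP on N cores.
 * Returns NAN when N <= 1 or SP <= 0, where the inversion is undefined. */
double parallel_fraction(double speedup, int n_cores)
{
    if (n_cores <= 1 || speedup <= 0.0)
        return NAN;
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / (double)n_cores);
}
```

For instance, calling parallel_fraction(speedup_from_ipc(ipc_n, ipc_1), n) with IPC values from an n-core run and a single-core run gives an estimate of P under these assumptions.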

1. Benchmark Application.

The source code of the purpose-designed benchmark application is available at: pthreads.c. Several parameters of this code can be controlled; the table below describes all of them.

Parameter Description
v Turns on verbose timestamp reporting (the time for each sequential and parallel section). By default, only the total time is reported.
p Specifies the parallel fraction P (float, 0 < P < 1, default is 0.5).
j Adjusts the parallel workload, such that Workload_par = total Workload * P * j and Workload_seq = total Workload * (1 - P). For example, for P = 0.5 we expect the same execution time for the parallel and sequential sections.
w, r The w (int, default is 10000) and r (int, default is 5) options specify the workload size. The application first runs the parallel section, with each thread doing Workload_par*4000 iterations of sqrt; then Workload_seq*4000 iterations of sqrt are done in a single (main) thread. This {parallel+sequential} execution is repeated r times in total.
z Pins the sequential execution (and thus the main thread) to the specified core (int, default is 0).
c Defines the per-thread core pinning (sequence of ints as a string, no spaces, e.g. 0123 or 777, default is 01). The number of parallel threads is equal to the length of this string. For example, "-c 3276 -z 7" will spawn 4 threads, pinning them to C3, C2, C7, and C6, while pinning the sequential execution to C7.
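The parameter semantics above can be illustrated with a short C sketch. It is not taken from pthreads.c: the function names are hypothetical, and it assumes that each character of the pinning string is a single decimal digit naming one core (how core ids above 9 are encoded is not stated here). The actual thread creation and pinning calls are omitted.

```c
#include <stddef.h>
#include <string.h>

/* Work assigned to the parallel section: total * P * j (as in the table). */
double workload_par(double total, double p, double j)
{
    return total * p * j;
}

/* Work assigned to the sequential section: total * (1 - P). */
double workload_seq(double total, double p)
{
    return total * (1.0 - p);
}

/* Parse a pinning string such as "3276" into core ids, one per thread.
 * Assumption: one decimal digit per core id.
 * Returns the number of threads, or -1 if the string is empty, longer than
 * max_threads, or contains a non-digit character. */
int parse_pinning(const char *s, int *cores, int max_threads)
{
    size_t len = strlen(s);
    if (max_threads <= 0 || len == 0 || len > (size_t)max_threads)
        return -1;
    for (size_t i = 0; i < len; i++) {
        if (s[i] < '0' || s[i] > '9')
            return -1;
        cores[i] = s[i] - '0';
    }
    return (int)len;
}
```

With the example from the table, parse_pinning("3276", cores, 8) yields 4 threads mapped to cores 3, 2, 7, and 6.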

2. Results.

The outcomes collected from running the experimental benchmark application on Intel platforms (Core-i7-4820k, Xeon E5-2630V2 and Xeon Phi 7120X) are available at: Outcomes.xlsx. The results are collected in one file that includes three spreadsheets, one for each platform.

Column label Description
P Parallel fraction.
C Number of pinned cores.
Sequential, Parallel and Total Experimental sequential, parallel and total execution time.
SPTheory Amdahl's speedup (theoretical). SP = 1 / ((1 - P) + (P / N)), where N is the number of cores.
SPTime Experimental execution time based speedup. SP = T1/TN
errExp Experimental speedup error ratio in comparison with theoretical speedup.
IR_sum Total INSTR_RETIRED_ANY.
IR_Max Maximum INSTR_RETIRED_ANY.
Delta Total INSTR_RETIRED_ANY - Maximum INSTR_RETIRED_ANY.
Delta/T Delta / total execution time.
UC_Max Maximum CPU_CLK_UNHALTED_CORE.
SpClock Maximum unhalted clock speedup.
errTheory Maximum unhalted clock speedup error ratio in comparison with theoretical speedup.
errExp Maximum unhalted clock speedup error ratio in comparison with experimental speedup.
IPC System instructions per clock.
IPCApp Application instructions per clock.
SPIPC System instructions per clock based speedup.
errTheory System instructions per clock based speedup error ratio in comparison with theoretical speedup.
errExp System instructions per clock based speedup error ratio in comparison with experimental speedup.
SPIPCApp Application instructions per clock based speedup.
errTheory Application instructions per clock based speedup error ratio in comparison with theoretical speedup.
errExp Application instructions per clock based speedup error ratio in comparison with experimental speedup.
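
The theoretical and time-based speedup columns can be reproduced with a few lines of C. This is a minimal sketch with hypothetical function names. The error ratio is assumed here to be the absolute difference relative to the reference value; the exact definition used in the spreadsheets is not stated on this page.

```c
#include <math.h>

/* Amdahl's theoretical speedup for parallel fraction p on n cores:
 * SP = 1 / ((1 - p) + p / n). */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Execution time based speedup: SP = T1 / TN. */
double time_speedup(double t1, double tn)
{
    return t1 / tn;
}

/* Assumed error ratio: |estimate - reference| / reference. */
double error_ratio(double estimate, double reference)
{
    return fabs(estimate - reference) / reference;
}
```

For example, error_ratio(time_speedup(t1, tn), amdahl_speedup(p, n)) compares a measured speedup with the theoretical one under the assumed definition.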

Last modified 06/12/2016 by IGC