Speedup and Parallelization Models for Many-Core Systems Using Performance Counters

Mohammed A. Noaman Al-hayanni¹, Rishad Shafik¹, Ashur Rafiev², Fei Xia¹, Alex Yakovlev¹
(m.a.n.al-hayanni, rishad.shafik, ashur.rafiev, fei.xia, alex.yakovlev)@ncl.ac.uk
School of Electrical and Electronic Engineering¹ and Computer Science² - Newcastle University, UK

Abstract

Traditional speedup models, such as Amdahl's, facilitate the study of the impact of running parallel workloads on many-core systems. However, these models are typically based on software characteristics only and assume ideal hardware behavior. As such, their applicability to energy- and/or performance-driven system optimization is limited by two factors: first, speedup cannot be measured without instrumenting the original software code; second, the parallelization factor of an application running on specific hardware is generally unknown.
In this paper, we propose a novel method whereby standard performance counters found in modern many-core platforms can be used to derive speedup without instrumenting applications for time measurements. We postulate that speedup can be accurately estimated as the ratio of the instructions per cycle of a parallel many-core system to that of a single-core system. This leads to the determination of the parallelization factor and the optimal system configuration for energy and/or performance. The method is extensively demonstrated through experiments on three different platforms with core counts ranging from 4 to 61, running a parallel benchmark application on the Linux operating system. Speedup and parallelization estimates obtained with the method, and their cross-validations, show negligible errors (up to 8%) on these systems.

1. Benchmark Application.

The source code of the purpose-designed benchmark application is available at: pthreads.c. Several parameters of this code can be controlled; the table below describes all of them.

Parameters	Description
v	Turns on verbose timestamp reporting (time for each sequential and parallel section). By default, only the total time is reported.
p	Specifies the parallel fraction P (float, 0 < P < 1, default is 0.5).
j	Adjusts the parallel workload, such that Workload_par = total Workload * P * j and Workload_seq = total Workload * (1 - P). For example, for P = 0.5 we expect the same execution time for the parallel and sequential sections.
w, r	The w (int, default is 10000) and r (int, default is 5) options specify the workload size. The application first runs the parallel section, with each thread doing Workload_par * 4000 iterations of sqrt; then Workload_seq * 4000 iterations of sqrt are done in a single (main) thread. This {parallel + sequential} execution is repeated r times in total.
z	Pins the sequential execution (and thus the main thread) to the specified core (int, default is 0).
c	Defines the per-thread core pinning (sequence of ints as a string, no spaces, e.g. 0123 or 777, default is 01). The number of parallel threads is equal to the length of this string. For example, "-c 3276 -z 7" will spawn 4 threads, pinning them to C3, C2, C7, and C6, while pinning the sequential execution to C7.
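
As an illustration of how the p, j and c parameters could be interpreted, the following is a minimal sketch and is not taken from pthreads.c. The names split_workload and parse_core_string are hypothetical. The sketch assumes that the total workload is a single number and that every character of the -c string is one decimal digit naming one core, as in the examples above; the actual benchmark may encode core indices differently, for instance on platforms with more than ten cores.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the two workload sizes described in the table. */
struct workload {
    double par; /* Workload_par = total * P * j */
    double seq; /* Workload_seq = total * (1 - P) */
};

/* Split a total workload using the formulas given for the p and j options. */
struct workload split_workload(double total, double p, double j)
{
    struct workload w;
    w.par = total * p * j;
    w.seq = total * (1.0 - p);
    return w;
}

/*
 * Parse a pinning string such as "3276" into core indices.
 * Assumption: one decimal digit per thread, so only cores 0..9 can be named.
 * Returns the number of threads (the string length), or -1 if the string is
 * empty, longer than max_cores, or contains a non-digit character.
 */
int parse_core_string(const char *s, int *cores, size_t max_cores)
{
    size_t n = strlen(s);
    if (n == 0 || n > max_cores)
        return -1;
    for (size_t i = 0; i < n; i++) {
        if (s[i] < '0' || s[i] > '9')
            return -1;
        cores[i] = s[i] - '0';
    }
    return (int)n;
}
```

Under these assumptions, parsing "3276" yields four threads pinned to cores 3, 2, 7 and 6, matching the "-c 3276" example in the table, and a total workload of 10000 with P = 0.5 and j = 1 splits into equal parallel and sequential parts.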

2. Results.

The outcomes collected from running the experimental benchmark application on Intel platforms (Core-i7-4820k, Xeon E5-2630V2 and Xeon Phi 7120X) are available at: Outcomes.xlsx. The results are collected in a single file containing three spreadsheets, one for each platform.

Column label	Description
P	Parallel fraction.
C	Number of pinned cores.
Sequential, Parallel and Total	Experimental sequential, parallel and total execution times.
SPTheory	Amdahl's speedup (theoretical): SP = 1 / ((1 - P) + P / N).
SPTime	Experimental execution-time-based speedup: SP = T1 / TN.
errExp	Experimental speedup error ratio in comparison with theoretical speedup.
IR_sum	Total INSTR_RETIRED_ANY.
IR_Max	Maximum INSTR_RETIRED_ANY.
Delta	Total INSTR_RETIRED_ANY - Maximum INSTR_RETIRED_ANY.
Delta/T	Delta / total execution time.
UC_Max	Maximum CPU_CLK_UNHALTED_CORE.
SpClock	Maximum unhalted clock speedup.
errTheory	Maximum unhalted clock speedup error ratio in comparison with theoretical speedup.
errExp	Maximum unhalted clock speedup error ratio in comparison with experimental speedup.
IPC	System instructions per clock.
IPCApp	Application instructions per clock.
SPIPC	System instructions-per-clock-based speedup.
errTheory	System instructions-per-clock-based speedup error ratio in comparison with theoretical speedup.
errExp	System instructions-per-clock-based speedup error ratio in comparison with experimental speedup.
SPIPCApp	Application instructions-per-clock-based speedup.
errTheory	Application instructions-per-clock-based speedup error ratio in comparison with theoretical speedup.
errExp	Application instructions-per-clock-based speedup error ratio in comparison with experimental speedup.
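
The speedup and error columns above can be related to one another with a few small functions. This is a hedged sketch and does not reproduce the spreadsheet formulas. The function names are hypothetical. The error ratio is assumed to be the absolute difference divided by the reference speedup, and the IPC-based speedup follows the abstract's postulate that speedup is the ratio of the many-core IPC to the single-core IPC. How each IPC value is obtained from the counters is left to the caller.

```c
/* Amdahl's theoretical speedup for parallel fraction p on n cores (SPTheory). */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Execution-time-based speedup (SPTime): single-core time over n-core time. */
double time_speedup(double t1, double tn)
{
    return t1 / tn;
}

/* IPC-based speedup: IPC of the n-core run over IPC of the single-core run. */
double ipc_speedup(double ipc_n, double ipc_1)
{
    return ipc_n / ipc_1;
}

/*
 * Error ratio of an estimated speedup against a reference speedup.
 * Assumption: |estimate - reference| / reference; the spreadsheet may use
 * a different definition.
 */
double error_ratio(double estimate, double reference)
{
    double d = estimate - reference;
    if (d < 0.0)
        d = -d;
    return d / reference;
}
```

For example, with P = 0.5 and N = 4 the theoretical speedup is 1 / (0.5 + 0.125) = 1.6, and an estimated speedup of 1.5 would then have an error ratio of 0.0625 under the assumed definition.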

Last modified 06/12/2016 by IGC