8/2/2019 BI131
Bioingeniera 131 1
EPIC processors performance improvement
estimation in biomedical signals intensive
processing
Alejandro Furfaro (1), Nahuel Gonzalez (1), Martín Belzunce (1),
Marcelo Risk (1, 2)
(1) Facultad Regional Buenos Aires, Universidad Tecnológica Nacional,
ARGENTINA
(2) CONICET and Departamento de Computación, Facultad de Ciencias
Exactas y Naturales, Universidad de Buenos Aires, ARGENTINA
Abstract: Performance improvement in processors designed for
intensive calculation is based on the optimization of Instruction
Level Parallelism (ILP). To implement ILP, high performance
processors follow one of two models: superscalar or EPIC. The aim of
this work was to estimate the performance improvement in systems
based on processors designed according to the EPIC model, given its
greater dependency on the compiler for ILP optimization; the study
was then extended to the contribution of calculation functions
contained in libraries specially designed for EPIC processors.
Systems based on Itanium 2 and Xeon processors, which follow the EPIC
and superscalar models respectively, were employed. A total of ten
electrocardiogram (ECG) recordings of twenty-four hours each were
processed using digital filters and the Fast Fourier Transform;
source code versions using double and float type variables were
compiled with gcc (GNU C Compiler) and icc (Intel C Compiler), with
the optimization switch -O3. To compute the Fast Fourier Transform,
gsl (GNU Scientific Library), an open source library portable to
every processor, and Intel mkl (Math Kernel Library) were employed on
both processors under study. The intensive ECG data processing showed
advantages for the Itanium 2 processor, especially in calculations
with double data types. Finally, the greater dependency of the
Itanium 2 processor on compiler efficiency was verified, in
accordance with the basic principles of the EPIC model, as were the
substantial performance improvements introduced by the EPIC model
over the superscalar one.
Index Terms: VLIW, EPIC, ILP, PNI, SIMD.
I. INTRODUCTION
Instruction Level Parallelism (ILP) is a set of techniques applied in
microprocessor and compiler design that allows a single CPU to
execute several instructions at the same time [1]. Designers of high
performance processors focus their efforts on improving ILP because
it is one of the most important factors in increasing the performance
of numerically intensive algorithms. They achieve it by integrating
multiple Execution Units in the same CPU.
At the end of the 1970s, ILP was implemented through an architectural
model named VLIW (Very Long Instruction Word), which can be
considered the first documented EPIC (Explicitly Parallel Instruction
Computing) model. This model is currently implemented in the most
modern processors, such as the Itanium 2, which in addition has
several Multimedia Calculation units (for Digital Signal Processing)
and Floating Point Units, giving it a very good profile for high
performance scientific applications [9], [10], [11].
Then, at the end of the 1980s, some processors began to include more
than one execution unit in the same CPU, at least for integer
calculation. This made it necessary to analyze which program
instructions can be executed in parallel in the different Execution
Units of the processor without changing the result expected by the
application programmer. Interdependency between two consecutive
instructions was one of the major problems to solve for an efficient
ILP implementation.
Two microarchitectural models were established to implement this
analysis: 1) the superscalar model, which integrates into the
processor the logic needed to determine whether two instructions can
be executed in parallel as they enter the processor; and 2) the VLIW
model, which implements the dependency analysis in the compiler,
generating whenever possible very wide words that contain several
instructions that can be executed in parallel by the different
calculation units of the processor without disturbing the program
logic.
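The compiler-side dependency analysis of the VLIW model can be illustrated with a toy sketch. It is not the algorithm of any real compiler: the `Instr` record, the register names, and the greedy in-order packing are all assumptions made for illustration, and the bundle width of three is modeled on the three-instruction bundles of the Itanium architecture.

```python
from collections import namedtuple

# Hypothetical instruction record: a name, one destination, and its sources.
Instr = namedtuple("Instr", "name dest srcs")

def depends_on(later, earlier):
    """True when `later` must not be placed in the same word as `earlier`."""
    raw = earlier.dest in later.srcs   # read after write
    waw = earlier.dest == later.dest   # write after write
    war = later.dest in earlier.srcs   # write after read (kept conservative)
    return raw or waw or war

def pack_bundles(program, width=3):
    """Greedy in-order packing: start a new word when the current one is
    full or when the next instruction depends on one already in it."""
    bundles = [[]]
    for instr in program:
        current = bundles[-1]
        if len(current) == width or any(depends_on(instr, p) for p in current):
            bundles.append([])
            current = bundles[-1]
        current.append(instr)
    return [b for b in bundles if b]

# Hypothetical program computing (a + b) * c.
PROGRAM = [
    Instr("i1", "r1", ("a",)),        # r1 = load a
    Instr("i2", "r2", ("b",)),        # r2 = load b
    Instr("i3", "r3", ("r1", "r2")),  # r3 = r1 + r2
    Instr("i4", "r4", ("c",)),        # r4 = load c
    Instr("i5", "r5", ("r3", "r4")),  # r5 = r3 * r4
]

for bundle in pack_bundles(PROGRAM):
    print([instr.name for instr in bundle])
```

In this sketch the two independent loads share a word, while the addition and the multiplication each start a new one because they read results produced just before them. A superscalar processor would make the same kind of decision in hardware, at run time.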
VLIW removes complexity from the hardware design and significantly
reduces power consumption, a limiting factor in the manufacturing of
high performance integrated circuits.
Finally, in the 1990s, the development of integration technologies
established a standard: multiple execution units for integer
calculation, floating point calculation, and jump operations in the
same processor, which allows high ILP levels to be reached by
processing several instructions in one clock cycle.
The contribution of compilers designed specifically for EPIC
architecture processors to optimizing their performance, especially
in floating point arithmetic, has been verified [2], [3].
The aim of the present work was to estimate the performance
improvement in systems based on processors designed according to the
EPIC model, given its greater dependency on the compiler for ILP
optimization; the study was then extended to the contribution of
calculation functions contained in libraries specially designed for
EPIC processors.
II. MATERIAL AND METHODS
A. Data
This work was based on the intensive processing of ten
electrocardiogram (ECG) files, each from a 24-hour Holter study. Each
file contains 2 ECG channels acquired at 256 samples per second, for
an approximate length of 80 Mbytes per file [6].
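The file length quoted above can be checked with a short calculation. The sample width is not stated in the text; the sketch below assumes 16-bit (2-byte) samples, which is only an assumption.

```python
# Rough size estimate for one 24-hour, two-channel Holter recording.
CHANNELS = 2
SAMPLE_RATE_HZ = 256
DURATION_S = 24 * 60 * 60
BYTES_PER_SAMPLE = 2  # assumption: 16-bit samples (not stated in the paper)

def recording_size_bytes(channels=CHANNELS, rate_hz=SAMPLE_RATE_HZ,
                         duration_s=DURATION_S,
                         bytes_per_sample=BYTES_PER_SAMPLE):
    """Return the raw size in bytes of one uncompressed recording."""
    return channels * rate_hz * duration_s * bytes_per_sample

size = recording_size_bytes()
print(f"{size} bytes = {size / 2**20:.1f} MiB")
```

Under that assumption each file holds about 88 million bytes, the same order of magnitude as the approximately 80 Mbytes reported.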
B. Processing Systems
We used the following two systems: a) a server with two 1.5 GHz
Itanium 2 EPIC architecture processors, 400 MHz FSB, 6 Mbytes of L3
cache memory, Intel E8870 chipset, 8 Gbytes of DDR200 RAM, and an
Ultra 320 SCSI controller with three 140 Gbyte disks in RAID 5
configuration, running Red Hat Linux AS4, kernel 2.6.9-1; and b) a
server with one 2.8 GHz Xeon processor with Hyper-Threading
Technology, 512 KB of L2 cache memory, 533 MHz FSB, 1 GB of DDR RAM,
Intel SE7505VB2 motherboard, Intel E7505 chipset, and a 200 GB ATA
hard disk, running Linux Fedora Core 3, kernel 2.6.9-5.
C. Algorithms
An algorithm for R wave detection, subsequent RR interval
measurement, and frequency spectrum calculation in surface ECG
recordings was used [7]. This algorithm applies an equiripple FIR low
pass filter, with fc = 50 Hz and 19 taps, to both channels. A
resulting signal is then obtained with the following formula:
$$x[n] = ch_1[n]^2 + ch_2[n]^2 \tag{1}$$
This combined signal represents the information of both channels
simultaneously. If one channel fails (device failure, disconnection,
etc.), the signal retains the information of the other channel. The
R wave is located in this signal, and the RR interval is determined
from it.
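A minimal sketch of the combined signal, assuming formula (1) is the sample-by-sample sum of the squared channels. The sample values and the crude maximum search are hypothetical and stand in for the R wave detector of [7], which is not reproduced here.

```python
def combine_channels(ch1, ch2):
    """Combine two ECG channels sample by sample: x[n] = ch1[n]^2 + ch2[n]^2."""
    if len(ch1) != len(ch2):
        raise ValueError("channels must have the same length")
    return [a * a + b * b for a, b in zip(ch1, ch2)]

# Hypothetical samples: channel 2 is disconnected (flat zero line).
ch1 = [0.0, 0.1, 1.2, -0.3, 0.0]
ch2 = [0.0, 0.0, 0.0, 0.0, 0.0]

x = combine_channels(ch1, ch2)

# Crude peak pick, for illustration only: index of the largest sample.
r_index = max(range(len(x)), key=x.__getitem__)
print(r_index)
```

Even with one channel flat, the peak of the remaining channel is still visible in x[n], which is the robustness property described above.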
The spectrum of each channel and of the x[n] signal was then
calculated using the Fast Fourier Transform (FFT). The results were
stored in a CSV formatted file for each Holter recording.
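The spectrum calculation can be sketched with a textbook recursive radix-2 FFT. This is not the gsl or mkl implementation used in the study, only an illustrative stand-in; the 8 Hz test tone and the 256-sample window are assumptions chosen to match the 256 samples per second acquisition rate.

```python
import cmath
import math

def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; the length must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def magnitude_spectrum(samples, rate_hz):
    """Return (frequency in Hz, magnitude) pairs for non-negative frequencies."""
    n = len(samples)
    bins = fft(samples)
    return [(k * rate_hz / n, abs(bins[k])) for k in range(n // 2 + 1)]

RATE_HZ = 256
N = 256
# Hypothetical test signal: a pure 8 Hz sinusoid, one second long.
signal = [math.sin(2 * math.pi * 8 * i / RATE_HZ) for i in range(N)]

spectrum = magnitude_spectrum(signal, RATE_HZ)
peak_freq = max(spectrum, key=lambda pair: pair[1])[0]
print(peak_freq)
```

With one second of data at 256 samples per second the frequency resolution is 1 Hz, so the tone falls exactly on one bin.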
Two versions of the algorithms were implemented: one using only
single precision floating point variables (float type), and the other
using only double precision floating point variables (double type),
both according to the IEEE 754 standard.
All algorithms were compiled with two different tools: a) the GNU C
Compiler (gcc), standard in Linux distributions, version 3.4.2 on the
Xeon system and version 3.4.3 on the Itanium 2 system; and b) the
Intel C Compiler (icc) version 8.1, specifically designed for Intel
processors.
Both compilers were run with aggressive processor optimization
(switch -O3 on the command line).
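The float and double versions of the algorithms differ in precision as well as in speed. The sketch below is only an illustration of that difference: it emulates IEEE 754 single precision in Python by rounding every intermediate result through the `struct` module, and the repeated sum of 0.1 is a hypothetical workload, not part of the ECG algorithm.

```python
import struct

def to_float32(value):
    """Round a Python float (IEEE 754 double) to the nearest IEEE 754 single."""
    return struct.unpack("f", struct.pack("f", value))[0]

def accumulate(values, single_precision):
    """Sum the values, optionally rounding operands and partial sums to float."""
    total = 0.0
    for v in values:
        if single_precision:
            v = to_float32(v)
        total = total + v
        if single_precision:
            total = to_float32(total)
    return total

# Hypothetical workload: the exact sum would be 10000.
samples = [0.1] * 100_000

sum_double = accumulate(samples, single_precision=False)
sum_float = accumulate(samples, single_precision=True)

error_double = abs(sum_double - 10_000.0)
error_float = abs(sum_float - 10_000.0)
print(error_double, error_float)
```

The single precision error is several orders of magnitude larger, which is one reason to check, as the study does, that both versions produce equivalent results before comparing their processing times.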
The FFT calculation was made using two different libraries: a) the
GNU Scientific Library (gsl) version 1.7; and b) the Intel Math
Kernel Library (mkl) version 8.0.019, specifically designed for the
Itanium 2 processor on which our system is based.
The gsl library was compiled to match each version of the algorithms,
that is, using gcc or icc according to the test to be carried out.
The mkl library was used as provided by the manufacturer because its
source code is not available.
D. Performance estimation
To estimate the performance of both systems, time stamps with
millisecond resolution were taken at the different input and output
points of the algorithms. Given the volume of information to be
processed, this resolution was considered sufficient.
To evaluate the performance of the processors alone, processes that
only access the disk and store the content of the ECG files in memory
were developed, and specific time stamps were taken to measure the
time spent on this task. The values obtained were used to adjust the
results of the processing algorithms, removing the disk access
component from the measurement.
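The adjustment described above can be sketched as two timed passes, one that only loads the data and one that loads and processes it. The functions `load_samples` and `process` are hypothetical stand-ins for the file reading and signal processing stages; the study's own instrumentation is not reproduced here.

```python
import time

def timed_ms(func, *args):
    """Run func(*args) and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, (time.perf_counter() - start) * 1000.0

def load_samples(n):
    """Stand-in for reading an ECG file from disk into memory (hypothetical)."""
    return [float(i % 256) for i in range(n)]

def process(samples):
    """Stand-in for the filtering, detection and FFT stages (hypothetical)."""
    return sum(s * s for s in samples)

def net_processing_ms(n):
    """Return (adjusted, load-only, total) times in milliseconds."""
    # Pass 1: load only, to measure the storage component on its own.
    _, load_ms = timed_ms(load_samples, n)
    # Pass 2: load and process, the way the full program runs.
    _, total_ms = timed_ms(lambda: process(load_samples(n)))
    # Adjusted figure: remove the load component from the total.
    return total_ms - load_ms, load_ms, total_ms

net_ms, load_ms, total_ms = net_processing_ms(100_000)
print(f"load {load_ms:.3f} ms, total {total_ms:.3f} ms, net {net_ms:.3f} ms")
```

Because the two passes are measured separately, the adjusted value is an estimate; on short runs it can be dominated by timing noise, which is why the volume of data matters when choosing the resolution.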
The results were then analyzed with a Student's t test for paired
samples; the statistical significance level was set at 0.05.
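A minimal sketch of the paired t statistic, using only the Python standard library. The per-recording times below are hypothetical and are not the raw data of the study; the critical value 2.262 is the two-sided Student t value for a 0.05 significance level and 9 degrees of freedom (ten paired recordings).

```python
import math
import statistics

def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length, at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Hypothetical processing times in seconds for ten recordings.
gcc_times = [29.1, 28.8, 29.3, 29.0, 28.7, 29.2, 29.4, 28.9, 29.0, 28.6]
icc_times = [14.0, 14.2, 14.3, 13.9, 14.1, 14.4, 14.0, 14.2, 13.8, 14.1]

t_stat, dof = paired_t(gcc_times, icc_times)

T_CRITICAL_9DOF = 2.262  # two-sided, alpha = 0.05, 9 degrees of freedom
significant = abs(t_stat) > T_CRITICAL_9DOF
print(dof, significant)
```

The test is paired because each recording is processed under both conditions, so the comparison is made on the per-recording differences rather than on the two groups as a whole.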
III. RESULTS
The 10 recordings of 24 hours of ECG were processed, and the outputs
obtained in each case on the systems under study were compared. After
verifying that the quantitative results were the same regardless of
the library providing the FFT functions and of the compiler employed,
the processing time in each case was taken for evaluation, expressed
as mean and standard deviation (SD).
A summary of these results is shown in Figures 1 and 2. Tables I-A
and I-B show the results obtained for the different combinations of
compiler and scientific calculation library. In all cases, the switch
-O3 (aggressive optimization for the processor) was used with both
compilers.
A. Compiler incidence in each processor
For the Itanium 2 processor, comparing programs compiled with gcc
against the same programs compiled with icc gives the following
performance relationships, calculated as the average processing time
using gcc divided by the average processing time using icc (the pairs
in parentheses identify the entries of Tables I-A and I-B that form
each ratio): a) 2.1 times using gsl library functions and double type
variables (3/1); b) 2.0 times using mkl library functions and double
type variables (4/2); c) 2.8 times using gsl library functions and
float type variables (11/9); d) 3.2 times using mkl library functions
and float type variables (12/10).
For the Xeon processor, the same performance relationships gave the
following values: a) 1.0 times using gsl library functions and double
type variables (7/5); b) 1.5 times using mkl library functions and
double type variables (8/6); c) 1.4 times using gsl library functions
and float type variables (15/13); d) 1.1 times using mkl library
functions and float type variables (16/14). These values are shown in
Table II.
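The ratios above can be recomputed from the mean times of Tables I-A and I-B. The sketch below transcribes those means and rounds each gcc/icc ratio to one decimal; the dictionary layout and function name are choices made for illustration.

```python
# Mean processing times in seconds, transcribed from Tables I-A and I-B.
# Key: (processor, compiler, library, data type).
MEAN_TIME_S = {
    ("Itanium2", "icc", "gsl", "double"): 14.1,
    ("Itanium2", "icc", "mkl", "double"): 7.8,
    ("Itanium2", "gcc", "gsl", "double"): 29.0,
    ("Itanium2", "gcc", "mkl", "double"): 15.5,
    ("Xeon", "icc", "gsl", "double"): 39.2,
    ("Xeon", "icc", "mkl", "double"): 10.4,
    ("Xeon", "gcc", "gsl", "double"): 39.8,
    ("Xeon", "gcc", "mkl", "double"): 15.9,
    ("Itanium2", "icc", "gsl", "float"): 11.9,
    ("Itanium2", "icc", "mkl", "float"): 6.9,
    ("Itanium2", "gcc", "gsl", "float"): 33.9,
    ("Itanium2", "gcc", "mkl", "float"): 22.0,
    ("Xeon", "icc", "gsl", "float"): 14.2,
    ("Xeon", "icc", "mkl", "float"): 11.8,
    ("Xeon", "gcc", "gsl", "float"): 20.0,
    ("Xeon", "gcc", "mkl", "float"): 12.6,
}

def gcc_over_icc(processor, library, dtype):
    """Ratio of the gcc mean time to the icc mean time, to one decimal."""
    gcc = MEAN_TIME_S[(processor, "gcc", library, dtype)]
    icc = MEAN_TIME_S[(processor, "icc", library, dtype)]
    return round(gcc / icc, 1)

TABLE_II = {
    (p, lib, dt): gcc_over_icc(p, lib, dt)
    for p in ("Itanium2", "Xeon")
    for lib in ("gsl", "mkl")
    for dt in ("double", "float")
}

for key, ratio in TABLE_II.items():
    print(key, ratio)
```

A ratio above 1 means the icc build was faster than the gcc build for that configuration.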
B. Library used for solving FFT functions incidence
From Tables I-A and I-B, the performance relationships for the
Itanium 2 processor, calculated as the average processing time using
the FFT functions of the gsl library divided by the average
processing time using the same functions from the mkl library, for
the same compiler, are: a) 1.8 times compiling with icc and using
double type variables (1/2); b) 1.9 times compiling with gcc and
using double type variables (3/4); c) 1.7 times compiling with icc
and using float type variables (9/10); d) 1.5 times compiling with
gcc and using float type variables (11/12).
For the Xeon processor, the same performance relationships gave the
following values: a) 3.8 times compiling with icc and using double
type variables (5/6); b) 2.5 times compiling with gcc and using
double type variables (7/8); c) 1.2 times compiling with icc and
using float type variables (13/14); d) 1.6 times compiling with gcc
and using float type variables (15/16). These performance
relationships are shown in Table III.

TABLE I-A
Processing times for each processor and compiler in seconds
(mean ± SD), double type. Entry numbers are given in parentheses.
                     Lib gsl            Lib mkl
Itanium2 with icc    14.1 ± 0.2 (1)     7.8 ± 0.2 (2)
Itanium2 with gcc    29.0 ± 0.3 (3)     15.5 ± 0.2 (4)
Xeon with icc        39.2 ± 0.3 (5)     10.4 ± 0.1 (6)
Xeon with gcc        39.8 ± 0.4 (7)     15.9 ± 0.4 (8)

TABLE I-B
Processing times for each processor and compiler in seconds
(mean ± SD), float type. Entry numbers are given in parentheses.
                     Lib gsl            Lib mkl
Itanium2 with icc    11.9 ± 0.2 (9)     6.9 ± 0.2 (10)
Itanium2 with gcc    33.9 ± 0.3 (11)    22.0 ± 0.2 (12)
Xeon with icc        14.2 ± 0.1 (13)    11.8 ± 0.1 (14)
Xeon with gcc        20.0 ± 0.2 (15)    12.6 ± 0.1 (16)

TABLE II
GCC VERSUS ICC COMPILER PERFORMANCE RELATIONSHIPS.
Processor    Library    Double    Float
Itanium2     gsl        2.1       2.8
Itanium2     mkl        2.0       3.2
Xeon         gsl        1.0       1.4
Xeon         mkl        1.5       1.1

Figure 1. Execution times in seconds for the ten 24-hour ECG
recordings processed with Itanium 2, using double and float data
types respectively. [Panels: Itanium with double precision; Itanium
with float precision. Axes: ECG record (1 to 10) versus processing
time (s). Series: gcc compiler with GSL, icc compiler with GSL, gcc
compiler with MKL, icc compiler with MKL.]

Figure 2. Execution times in seconds for the ten 24-hour ECG
recordings processed with Xeon, using double and float data types
respectively. [Panels: Xeon with double precision; Xeon with float
precision. Axes: ECG record (1 to 10) versus processing time (s).
Series: gcc compiler with GSL, icc compiler with GSL, gcc compiler
with MKL, icc compiler with MKL.]
C. Processor specific compiler and library combination incidence
Again from the results in Tables I-A and I-B, the contribution to the
Itanium 2 system performance of using the icc compiler with FFT
functions from the mkl library, relative to using the gcc compiler
with FFT functions from gsl, gives the following performance
relationships: a) 3.7 times using double type variables (3/2); and
b) 4.9 times using float type variables (11/10). For the Xeon
processor, the same relationships give the following values: a) 3.8
times using double type variables (7/6); and b) 1.7 times using float
type variables (15/14). These performance relationships are shown in
Table IV.
D. Performance relationship between processors according to the
compiler library combination
Again from the values in Tables I-A and I-B, the performance
relationships between the Xeon and Itanium 2 processors were
calculated for the different combinations of compiler and library,
as the average processing time on the Xeon divided by the average
processing time on the Itanium 2. The following values were obtained:
a) 1.4 times using double type variables, compiling with gcc and
using FFT functions from the gsl library (7/3); b) 1.0 times using
double type variables, compiling with gcc and using FFT functions
from the mkl library (8/4); c) 2.8 times using double type variables,
compiling with icc and using FFT functions from the gsl library
(5/1); d) 1.3 times using double type variables, compiling with icc
and using FFT functions from the mkl library (6/2); e) 0.6 times
using float type variables, compiling with gcc and using FFT
functions from the gsl library (15/11); f) 0.6 times using float type
variables, compiling with gcc and using FFT functions from the mkl
library (16/12); g) 1.2 times using float type variables, compiling
with icc and using FFT functions from the gsl library (13/9); h) 1.7
times using float type variables, compiling with icc and using FFT
functions from the mkl library (14/10). These performance
relationships are shown in Table V.
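The entry pairs quoted in parentheses make these ratios easy to reproduce. The sketch below indexes the mean times by their entry number in Tables I-A and I-B and lists the configurations in which the ratio falls below 1, that is, where the Xeon was the faster system. The variable names are illustrative.

```python
# Mean times in seconds indexed by the entry numbers of Tables I-A and I-B.
ENTRY_TIME_S = {
    1: 14.1, 2: 7.8, 3: 29.0, 4: 15.5,     # Itanium 2, double
    5: 39.2, 6: 10.4, 7: 39.8, 8: 15.9,    # Xeon, double
    9: 11.9, 10: 6.9, 11: 33.9, 12: 22.0,  # Itanium 2, float
    13: 14.2, 14: 11.8, 15: 20.0, 16: 12.6,  # Xeon, float
}

# (compiler, library, data type) -> (Xeon entry, Itanium 2 entry).
TABLE_V_PAIRS = {
    ("gcc", "gsl", "double"): (7, 3),
    ("gcc", "mkl", "double"): (8, 4),
    ("icc", "gsl", "double"): (5, 1),
    ("icc", "mkl", "double"): (6, 2),
    ("gcc", "gsl", "float"): (15, 11),
    ("gcc", "mkl", "float"): (16, 12),
    ("icc", "gsl", "float"): (13, 9),
    ("icc", "mkl", "float"): (14, 10),
}

def entry_ratio(numerator, denominator):
    """Ratio of two table entries, rounded to one decimal."""
    return round(ENTRY_TIME_S[numerator] / ENTRY_TIME_S[denominator], 1)

table_v = {config: entry_ratio(*pair) for config, pair in TABLE_V_PAIRS.items()}

# Configurations where the Xeon time is lower than the Itanium 2 time.
xeon_faster = sorted(config for config, r in table_v.items() if r < 1.0)
print(xeon_faster)
```

Both configurations below 1 combine gcc with float variables, which is the case taken up in the discussion of compiler dependency.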
IV. DISCUSSION
A. Compiler incidence in each processor
The Itanium 2 processor architecture, based on the EPIC model, was
conceived to reach high ILP levels with relatively low hardware
complexity. Building on the VLIW model, EPIC established a set of
guidelines for implementing very high ILP processors, together with a
basic set of architectural functions. In particular, EPIC retains the
main VLIW characteristic: the compiler resolves the Execution Plan of
the algorithm, that is, the list of instructions to be executed,
their order, and which internal Execution Unit of the processor is
going to be used [3]. The execution plan is therefore static and is
defined at compile time.
Often, to optimize performance and take advantage of parallelism, the
compiler needs to alter the instruction sequence before it is sent to
the processor. The processor must therefore have a predictable
internal behavior, so that the instruction reordering done by the
compiler does not modify the essence of the algorithm and thereby
change the expected results [3].
This condition does not hold in superscalar processors, which
currently have a very complex out-of-order execution engine able to
work over a window that sometimes holds more than 126 instructions
[5]; the engine analyzes them dynamically and establishes which
instructions can be executed without altering the expected result of
the program. Obtaining an execution plan at compile time is
practically impossible for this kind of processor, since the compiler
cannot determine how the dynamic scheduling hardware will organize
the execution of instructions that have dependencies.
This situation can be clearly observed in IA-32 processors, which are
based on an Out of Order Execution core: the P6 microarchitecture
(Pentium Pro, Pentium II and Pentium III) [5] and, currently, the
NetBurst architecture of the Xeon processor used in this work. The
Out of Order Execution model also resolves latencies in accesses to
instruction operands when these are not stored in the L1 data cache
[4], [5]. These situations do not follow simple statistical
estimation criteria and therefore remain hidden from the compiler,
preventing it from generating an efficient Execution Plan as in the
VLIW model. The high performance of the Xeon comes at the cost of a
very complex hardware design.

TABLE III
GSL VERSUS MKL LIBRARY PERFORMANCE RELATIONSHIPS.
Processor    Compiler    Double    Float
Itanium2     icc         1.8       1.7
Itanium2     gcc         1.9       1.5
Xeon         icc         3.8       1.2
Xeon         gcc         2.5       1.6

TABLE IV
PERFORMANCE RELATIONSHIPS USING GCC AND GSL LIBRARY VERSUS ICC AND
MKL LIBRARY (gcc+gsl / icc+mkl).
Processor    Double    Float
Itanium2     3.7       4.9
Xeon         3.8       1.7

TABLE V
COMPILER LIBRARY COMBINATION PROCESSORS PERFORMANCE RELATIONSHIPS
(Xeon / Itanium2).
Compiler + library    Double    Float
gcc + gsl             1.4       0.6
gcc + mkl             1.0       0.6
icc + gsl             2.8       1.2
icc + mkl             1.3       1.7
Nevertheless, the ability of EPIC processors to transfer the
complexity of building the Execution Plan to the compiler must not be
overestimated, since in return it implies a dependency on the
compiler that is in many cases excessive; if proper development tools
are not used, the results will not match expectations. This is
evident in the results of Table V, which show that using a compiler
that targets multiple processors, such as gcc, makes the
Xeon/Itanium 2 performance relationship even smaller than 1 in some
cases.
This happens when the calculations use float data types, which are
32-bit variables and are not the data type that benefits most from a
native 64-bit architecture such as that of the Itanium 2 processor.
The same table shows that with a processor specific compiler such as
icc the results match expectations considerably better.
Even so, situations arise whose results cannot be predicted, because
the information required to resolve them is generated at execution
time. These are: 1) branch instructions whose condition cannot be
evaluated in advance; 2) code blocks that result from flow diagrams
with multiple execution branches; 3) concurrent access to a resource,
for example a memory address; and 4) data loads from memory that take
a nondeterministic access time, because it depends on the cache level
in which the data is stored or, worse still, on whether the data is
located in the system DRAM. For the first three cases, the EPIC
compiler employs the same speculative techniques as the hardware of
superscalar processors [3]. In general, algorithms build loops on
conditional branch instructions that always have the same
destination, except when the loop condition expires. Therefore, if
the branch corresponding to the true condition is assumed, the
assumption is always right except the one time the loop ends. In that
case, the processor hardware has to be able to resolve the situation,
at a cost in system performance. All this speculative execution
hardware is not necessary in EPIC processors.
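The loop argument above can be made concrete with a small count. The sketch assumes a counted loop closed by a single conditional branch and a fixed, static prediction; it models no particular processor.

```python
def count_mispredictions(iterations, predict_taken=True):
    """Count wrong guesses for the branch that closes a counted loop.

    The branch is taken (the loop repeats) on every pass except the last,
    and the prediction is the same fixed guess on every pass.
    """
    mispredictions = 0
    for i in range(iterations):
        taken = i < iterations - 1  # falls through only when the loop ends
        if taken != predict_taken:
            mispredictions += 1
    return mispredictions

print(count_mispredictions(1000, predict_taken=True))
print(count_mispredictions(1000, predict_taken=False))
```

Assuming the loop branch is taken is wrong exactly once, when the loop exits, whereas the opposite assumption is wrong on every pass but the last. This is why a static guess fixed at compile time works well for loops.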
For branch instructions, superscalar processors use a hardware block
named the Branch Target Buffer, which by means of speculative
execution logic maintains a cache of the most probable branch
destination addresses [4]. For these reasons, the performance
relationships in Table II show a different behavior on each
processor: using the icc compiler brings a marked improvement on the
Itanium 2 processor, while on the Xeon processor the improvement
exists but is less significant.
B. Library used for solving FFT functions incidence
Besides having been compiled originally with the icc compiler, the
value of the mkl library lies in the fact that its algorithms make
extremely efficient use of the processor resources [12]. Both
processors have several resources for floating point and multimedia
calculation: parallel execution units for these kinds of operations,
registers, and SIMD instructions [4], [5]. In the case of the Xeon
processor, these resources are grouped in the SSE3 multimedia
extensions [5].
The Itanium 2 processor, for its part, although it can execute code
compiled for the Xeon, also has its own multimedia calculation
resources [8], [13]. In the design of a scientific calculation
library, two approaches can be taken: a) take maximum advantage of
these resources, even resorting to assembly language optimizations,
sacrificing portability; or b) design generic algorithms that avoid
architecture dependent resources, sacrificing performance to ensure
code portability across platforms.
The first approach is that of the mkl library, which is designed by
the manufacturer of both processors used in this study [12]. The gsl
library, in contrast, was designed with portability as the priority
[14]. The performance relationships shown in Table III indicate that,
working with double data types, the contribution of the mkl library
is appreciably greater for the Xeon processor than for the Itanium 2
processor, because among its multiple multimedia resources the Xeon
processor has eight 128-bit registers that enable it to process two
double type values in a single instruction [5].
C. Processor specific compiler and library combination incidence
Table IV shows significant performance improvements for applications
developed using both compilers and libraries specifically designed
for each processor. For general purpose applications, employing
processor specific development tools is not the best policy in every
case. But in the design of specific applications, such as the
biomedical signal processing of the present work, where high
performance computing is a priority requirement, the whole system
must be designed taking the application requirements into account.
In such cases, using processor specific compilers and scientific
calculation libraries reduces the execution time of the same
algorithm almost four times with 64-bit floating point data (double
type) on both processors and, in the specific case of the Itanium 2
processor, 4.9 times with 32-bit floating point data (float type).
This difference can determine whether or not a system is able to
perform real time calculations. In real time biomedical signal
acquisition and processing for on-line decision making, for example
in an emergency, this contribution is crucial.
D. Performance relationship between processors according to the
compiler library combination
Comparing both processors on the basis of the performance
relationships in Table V, the Itanium 2 is more efficient working
with double data types, that is, 64-bit floating point, since this
processor has many 64-bit register files, which minimizes memory
accesses [8], [9], [13]. The table shows that the compiler makes the
main contribution to the performance of the Itanium 2 processor, as
is to be expected in an EPIC processor. On the Xeon processor,
however, it is the mkl library that improves the performance
relationship, because this library makes better use of its multiple
multimedia execution resources [4], [5], [12].
ACKNOWLEDGMENTS
The authors thank Intel Tecnología de Argentina SA for the donation
of an Itanium 2 system, granted for the research project "Analysis of
Heart Rate Variability, Arterial Pressure and Pulse in Normotensive
and Hypertensive Subjects".
Marcelo Risk is a researcher of CONICET (Consejo Nacional de
Investigaciones Científicas y Técnicas), Ministerio de Educación,
Ciencia y Tecnología, Argentina.
REFERENCES
1. B. Ramakrishna Rau, Joseph A. Fisher. Instruction-Level Parallel
Processing: History, Overview and Perspective. Computer Systems
Laboratory, HPL-92-132. October 1992.
2. Alejandro Furfaro, Mariano Llamedo Soria, Julián S. Bruno, Nahuel
Gonzalez, Marcelo R. Risk. Procesamiento Intensivo del ECG con
procesadores IA-32 e IA-64. Facultad Regional Buenos Aires,
Universidad Tecnológica Nacional.
3. Michael S. Schlansker, B. Ramakrishna Rau. EPIC: An Architecture
for Instruction-Level Parallel Processors. Compiler and Architecture
Research, HPL-1999-111. February 2000.
4. IA-32 Intel Architecture Optimization Reference Manual. Intel,
2005.
5. IA-32 Intel Architecture Software Developer's Manual. Volume 1:
Basic Architecture. Intel, 2005.
6. Sobh J, Risk MR, Barbieri R, Saul P. Database for ECG, arterial
blood pressure and respiration signal analysis: feature extraction,
spectral estimation, and parameter quantification. IEEE-EMBC and
CMBSC, vol 4, pp. 955-956, 1997.
7. Risk M, Sobh J, Barbieri R, Saul P. A simple algorithm for QRS
peak location: use on long term ECG recordings from the HMS-MIT-FFMS
database. IEEE-EMBC 1995.
8. Intel Itanium Architecture Software Developer's Manual. Volume 1:
Application Architecture. Revision 2.1. Intel, October 2002.
9. Intel Itanium Architecture. Reference Manual for Software
Optimization. Intel, November 2001.
10. McNairy C, Soltis D. Itanium 2 Processor Microarchitecture. IEEE
Micro, March-April, pp. 44-55, 2003.
11. Sharangpani H, Arora K. Itanium Processor Microarchitecture. IEEE
Micro, September-October, pp. 24-43, 2000.
12. Intel. Intel Math Kernel Library. Reference Manual. Document
Number: 630813-017.
13. Hewlett Packard White Paper. Inside the Intel Itanium 2
Processor: an Itanium Processor Family member for balanced
performance over a wide range of applications. July 2002.
14. Galassi Mark, Davies Jim, Theiler James, Gough Brian, Jungman
Gerard, Booth Michael, Rossi Fabrice. GNU Scientific Library
Reference Manual, Edition 1.7, for GSL version 1.7.