BI131

8/2/2019 BI131


Bioingeniería 131

EPIC processors performance improvement estimation in biomedical signals intensive processing

Alejandro Furfaro¹, Nahuel Gonzalez¹, Martín Belzunce¹, Marcelo Risk¹,²

¹ Facultad Regional Buenos Aires, Universidad Tecnológica Nacional, ARGENTINA

² CONICET and Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, ARGENTINA

Abstract: Performance improvement in processors designed for intensive calculation is based on optimizing Instruction Level Parallelism (ILP). To implement ILP, high performance processors follow one of two models: superscalar or EPIC. The aim of this work was to estimate the performance improvement in systems based on processors designed according to the EPIC model, given its greater dependency on the compiler for ILP optimization; the study was then extended to the contribution of calculation functions contained in libraries specially designed for EPIC processors. Systems based on Itanium 2 and Xeon processors, which follow the EPIC and superscalar models respectively, were employed. Ten electrocardiogram (ECG) recordings of twenty-four hours each were processed using digital filters and the Fast Fourier Transform (FFT). Source code versions using double and float variables were compiled with gcc (GNU C Compiler) and icc (Intel C Compiler), both with the O3 optimization switch. To compute the FFT, gsl (GNU Scientific Library), an open source library portable to every processor, and Intel mkl (Math Kernel Library) were employed on both processors under study. The intensive ECG data processing showed advantages for the Itanium 2 processor, especially in calculations with the double data type. Finally, the results verified the greater dependency of the Itanium 2 processor on compiler efficiency, in accordance with the basic principles of the EPIC model, and the substantial performance improvements of the EPIC model over the superscalar model.

Index Terms: VLIW, EPIC, ILP, PNI, SIMD.

I. INTRODUCTION

Instruction Level Parallelism (ILP) is a set of techniques applied in microprocessor and compiler design that allows a single CPU to execute several instructions at the same time [1]. Designers of high performance processors focus their efforts on improving ILP because it is one of the most important factors in increasing the performance of intensive numerical algorithms. They achieve it by integrating multiple Execution Units in the same CPU.

At the end of the 1970s, ILP was implemented through an architectural model named VLIW (Very Long Instruction Word), which can be considered the first documented EPIC (Explicitly Parallel Instruction Computing) model. This model is currently implemented in modern processors such as the Itanium 2, which in addition has several Multimedia Calculation units (for Digital Signal Processing) and Floating Point Units, giving it a very good profile for high performance scientific applications [9], [10], [11].

Then, at the end of the 1980s, some processors began to include more than one execution unit in the same CPU, at least for integer calculation. This made it necessary to analyze which program instructions can be executed in parallel in the processor's different Execution Units without changing the result expected by the application programmer. Interdependency between two consecutive instructions was one of the major problems to solve for an efficient ILP implementation.

Two microarchitectural models were established to implement this analysis: 1) the Superscalar model, which integrates into the processor the logic needed to determine, as instructions enter the processor, whether two of them can be executed in parallel; and 2) the VLIW model, which performs the dependency analysis in the compiler, generating whenever possible very wide words that contain several instructions which the processor's calculation units can execute in parallel without disturbing the program logic.

VLIW removes complexity from the hardware design and significantly reduces power consumption, a limiting factor in the manufacture of high performance integrated circuits.

Finally, in the 1990s, advances in integration technologies made it standard to implement multiple execution units for integer calculation, floating point calculation and jump operations in the same processor, which allows high ILP levels to be reached by processing several instructions in one clock cycle.

The contribution of compilers designed specifically for EPIC architecture processors to optimizing their performance, especially in floating point arithmetic, has been verified [2], [3].

The aim of the present work was to estimate the performance improvement in systems based on processors designed according to the EPIC model, given its greater dependency on the compiler for ILP optimization. The study was then extended to the contribution of calculation functions contained in libraries specially designed for EPIC processors.


II. MATERIAL AND METHODS

A. Data

This work was based on the intensive processing of ten electrocardiogram (ECG) files, each from a 24-hour Holter study. Each file contains 2 ECG channels acquired at 256 samples per second, which gives an approximate size of 80 Mbytes per file [6].

B. Processing Systems

We used the following two systems: a) a server with two 1.5 GHz Itanium 2 processors (EPIC architecture), 400 MHz FSB, 6 Mbytes of L3 cache memory, Intel E8870 chipset, 8 Gbytes of DDR200 RAM, and an Ultra 320 SCSI controller with three 140 Gbyte disks in RAID 5 configuration, running Red Hat Linux AS4 with kernel 2.6.9-1; and b) a server with one 2.8 GHz Xeon processor with Hyper-Threading Technology, 512 KB of L2 cache memory, 533 MHz FSB, 1 GB of DDR RAM, Intel SE7505VB2 motherboard, Intel E7505 chipset, and a 200 GB ATA hard disk, running Fedora Core 3 Linux with kernel 2.6.9-5.

C. Algorithms

An algorithm for R wave detection, followed by RR interval measurement and frequency spectrum calculation, was applied to the surface ECG recordings [7]. The algorithm first applies an equiripple FIR low-pass filter with fc = 50 Hz and 19 taps to both channels. A resulting signal is then obtained with the following formula:

$$x[n] = ch_1^2[n] + ch_2^2[n] \qquad (1)$$

The purpose of this combined signal is to represent the information of both channels simultaneously. If one channel fails (device failure, disconnection, etc.), the signal still retains the information of the other channel. The R wave is located in this signal, and the RR interval is determined from it.
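The filtering and channel combination steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: equation (1) is read as the sum of the squared channels, the 19 filter coefficients are a hypothetical moving average standing in for the equiripple design (whose coefficients are not given here), and the sample values are invented.

```python
def fir_filter(samples, taps):
    """Direct-form FIR filter: y[n] = sum over k of taps[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:  # samples before the start are treated as zero
                acc += h * samples[n - k]
        out.append(acc)
    return out


def combined_signal(ch1, ch2):
    """Combined signal, assuming equation (1) is x[n] = ch1[n]^2 + ch2[n]^2."""
    return [a * a + b * b for a, b in zip(ch1, ch2)]


# Hypothetical 19-tap moving average, NOT the equiripple coefficients of the paper.
TAPS = [1.0 / 19] * 19

# Invented samples: channel 2 simulates a disconnected lead.
ch1 = [0.0, 0.1, 1.2, 0.2, 0.0]
ch2 = [0.0, 0.0, 0.0, 0.0, 0.0]
x = combined_signal(ch1, ch2)
r_index = x.index(max(x))  # the peak survives although one channel is flat
```

With one channel flat, the peak of `x` still falls on the sample where the remaining channel peaks, which is the robustness property described above.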

The spectrum of each channel and of the x[n] signal was then calculated using the Fast Fourier Transform (FFT). The results were stored in a CSV file for each Holter recording.
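As a reminder of what the FFT step computes, the following is a minimal recursive radix-2 sketch in Python. It is a textbook formulation, not the gsl or mkl implementation used in this work; the function names and the test tone are illustrative assumptions.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of two)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + half] = even[k] - twiddle
    return out


def power_spectrum(x):
    """Squared magnitude of the non-negative frequency bins."""
    return [abs(c) ** 2 for c in fft(x)[: len(x) // 2 + 1]]


# Invented test tone: one cosine cycle over 8 samples, so the energy sits in bin 1.
tone = [math.cos(2 * math.pi * n / 8) for n in range(8)]
spectrum = power_spectrum(tone)
```

A production library replaces the recursion with iterative, cache-aware and vectorized kernels, which is precisely where processor-specific libraries can differ from portable ones.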

Two versions of the algorithms were implemented: one using only single precision floating point variables (float type), and the other using only double precision floating point variables (double type), both according to the IEEE 754 standard.

All algorithms were compiled with two different tools: a) the GNU C Compiler (gcc), standard in Linux distributions, version 3.4.2 on the Xeon system and version 3.4.3 on the Itanium 2 system; and b) the Intel C Compiler (icc) version 8.1, specifically designed for Intel processors.

Both compilers were run with aggressive processor optimization (O3 switch on the command line).

The FFT was calculated using two different libraries: a) the GNU Scientific Library (gsl) version 1.7; and b) the Intel Math Kernel Library (mkl) version 8.0.019, specifically designed for the Itanium 2 processor on which our system is based.

The gsl library was compiled to match each version of the algorithms, that is, with gcc or icc according to the test to be performed. The mkl library was used as provided by the manufacturer because its source code is not available.

D. Performance estimation

To estimate the performance of both systems, time stamps with millisecond resolution were taken at the different input and output points of the algorithms. Given the volume of information to be processed, this resolution was considered sufficient.

To evaluate the performance of the processors alone, separate processes were developed that only access the disk and store the content of the ECG files in memory, and specific time stamps were taken to measure the time this task requires. The values obtained were used to adjust the results of the processing algorithms by removing the disk access component from the measurement.
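The adjustment described above amounts to timing the full run and the load-only run separately and subtracting one from the other. A minimal Python sketch follows; the helper names are hypothetical and the timer is Python's `time.perf_counter`, not the time stamp mechanism used in the original programs.

```python
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def processing_only_ms(total_ms, load_ms):
    """Remove the disk-access component from a full-run measurement."""
    return total_ms - load_ms


# Invented figures: a full run of 1500 ms of which 400 ms were spent loading the file.
cpu_ms = processing_only_ms(1500.0, 400.0)
```

In practice the load-only measurement should be repeated under the same cache conditions as the full run, since a file already held in the operating system cache loads much faster than one read from disk.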

The results were then analyzed with a Student's t test for paired samples; the statistical significance level was set at 0.05.
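For reference, the paired t statistic is the mean of the per-recording differences divided by its standard error. The sketch below assumes two lists of processing times measured on the same recordings; the sample values are invented and are not data from this study.

```python
import math

# Two-sided critical value of Student's t for 9 degrees of freedom (ten pairs)
# at the 0.05 significance level.
T_CRIT_DF9 = 2.262


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length, at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(var / n), n - 1


# Invented example with three pairs.
t_stat, dof = paired_t([3.0, 5.0, 7.0], [1.0, 2.0, 3.0])
```

With ten recordings per configuration, a difference would be declared significant when the absolute value of t exceeds `T_CRIT_DF9`.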

III. RESULTS

The ten 24-hour ECG recordings were processed, and the outputs obtained in each case on the systems under study were compared. Once it had been verified that the quantitative results were the same regardless of the library providing the FFT functions and of the compiler employed, the processing time of each case was taken for evaluation, expressed as mean and standard deviation (SD).

These results are summarized in Figures 1 and 2. Tables I-A and I-B show the results obtained for the different combinations of compiler and scientific calculation library. In all cases the O3 switch (aggressive processor-specific optimization) was used with both compilers.

A. Compiler impact on each processor

For the Itanium 2 processor, the performance relationship between programs compiled with gcc and the same programs compiled with icc, calculated as the average processing time using gcc divided by the average processing time using icc, was: a) 2.1 times using gsl library functions and double variables (3/1); b) 2.0 times using mkl library functions and double variables (4/2); c) 2.8 times using gsl library functions and float variables (11/9); and d) 3.2 times using mkl library functions and float variables (12/10). The pairs in parentheses identify the numbered cells of Tables I-A and I-B used in each ratio.

For the Xeon processor, the same performance relationships yielded the following values: a) 1.0 times using gsl library functions and double variables (7/5); b) 1.5 times using mkl library functions and double variables (8/6); c) 1.4 times using gsl library functions and float variables (15/13); and d) 1.1 times using mkl library functions and float variables (16/14). These values are presented in Table II.


B. Impact of the library used for the FFT functions

From Tables I-A and I-B, the performance relationship for the Itanium 2 processor, calculated as the average processing time using the gsl FFT functions divided by the average processing time using the same functions from the mkl library, with the same compiler, was: a) 1.8 times compiling with icc and using double variables (1/2); b) 1.9 times compiling with gcc and using double variables (3/4); c) 1.7 times compiling with icc and using float variables (9/10); and d) 1.5 times compiling with gcc and using float variables (11/12).

For the Xeon processor, the same performance relationships gave the following values: a) 3.8 times compiling with icc and using double variables (5/6); b) 2.5 times compiling with gcc and using double variables (7/8); c) 1.2 times compiling with icc and using type

TABLE II
GCC VERSUS ICC COMPILER PERFORMANCE RELATIONSHIPS.

Processor   Library   Double   Float
Itanium2    gsl       2.1      2.8
Itanium2    mkl       2.0      3.2
Xeon        gsl       1.0      1.4
Xeon        mkl       1.5      1.1

[Figure 2: two panels, "Xeon with double precision" and "Xeon with float precision". Horizontal axis: ECG record (1 to 10). Vertical axis: processing time (s). Series: gcc compiler with GSL, icc compiler with GSL, gcc compiler with MKL, icc compiler with MKL.]

Figure 2. Execution time values, in seconds, for the ten 24-hour ECG recordings processed with the Xeon using the double and float data types respectively.

TABLE I-B
Processing times (mean ± SD) for each processor and compiler, in seconds, float type. Numbers in parentheses identify each cell.

                     Lib gsl             Lib mkl
Itanium2 with icc    11.9 ± 0.2 (9)      6.9 ± 0.2 (10)
Itanium2 with gcc    33.9 ± 0.3 (11)     22.0 ± 0.2 (12)
Xeon with icc        14.2 ± 0.1 (13)     11.8 ± 0.1 (14)
Xeon with gcc        20.0 ± 0.2 (15)     12.6 ± 0.1 (16)

TABLE I-A
Processing times (mean ± SD) for each processor and compiler, in seconds, double type. Numbers in parentheses identify each cell.

                     Lib gsl             Lib mkl
Itanium2 with icc    14.1 ± 0.2 (1)      7.8 ± 0.2 (2)
Itanium2 with gcc    29.0 ± 0.3 (3)      15.5 ± 0.2 (4)
Xeon with icc        39.2 ± 0.3 (5)      10.4 ± 0.1 (6)
Xeon with gcc        39.8 ± 0.4 (7)      15.9 ± 0.4 (8)

[Figure 1: two panels, "Itanium with double precision" and "Itanium with float precision". Horizontal axis: ECG record (1 to 10). Vertical axis: processing time (s). Series: gcc compiler with GSL, icc compiler with GSL, gcc compiler with MKL, icc compiler with MKL.]

Figure 1. Execution time values, in seconds, for the ten 24-hour ECG recordings processed with the Itanium 2 using the double and float data types respectively.


float variables (13/14); and d) 1.6 times compiling with gcc and using float variables (15/16). These performance relationships are shown in Table III.

C. Impact of the processor-specific compiler and library combination

Again from the results in Tables I-A and I-B, the performance of the Itanium 2 system using the icc compiler and the mkl FFT functions, relative to that obtained using the gcc compiler and the gsl FFT functions, gives the following performance relationships: a) 3.7 times using double variables (3/2); and b) 4.9 times using float variables (11/10). For the Xeon processor, the same relationships give: a) 3.8 times using double variables (7/6); and b) 1.7 times using float variables (15/14). These performance relationships are shown in Table IV.

D. Performance relationship between processors according to the compiler and library combination

Again from the results in Tables I-A and I-B, the performance relationships between the Xeon and Itanium 2 processors (Xeon time divided by Itanium 2 time) were calculated for the different compiler and library combinations, giving the following values: a) 1.4 times using double variables, compiling with gcc and using the gsl FFT functions (7/3); b) 1.0 times using double variables, compiling with gcc and using the mkl FFT functions (8/4); c) 2.8 times using double variables, compiling with icc and using the gsl FFT functions (5/1); d) 1.3 times using double variables, compiling with icc and using the mkl FFT functions (6/2); e) 0.6 times using float variables, compiling with gcc and using the gsl FFT functions (15/11); f) 0.6 times using float variables, compiling with gcc and using the mkl FFT functions (16/12); g) 1.2 times using float variables, compiling with icc and using the gsl FFT functions (13/9); and h) 1.7 times using float variables, compiling with icc and using the mkl FFT functions (14/10). These performance relationships are shown in Table V.
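All the performance relationships quoted in this section are quotients of two mean processing times from Tables I-A and I-B, rounded to one decimal. The short Python sketch below reproduces a few of them; the dictionary layout and the function name are illustrative choices, while the mean values are transcribed from the tables.

```python
# Mean processing times in seconds, transcribed from Tables I-A and I-B.
# Key: (processor, compiler, library, data type).
MEAN_S = {
    ("Itanium2", "icc", "gsl", "double"): 14.1,
    ("Itanium2", "icc", "mkl", "double"): 7.8,
    ("Itanium2", "gcc", "gsl", "double"): 29.0,
    ("Itanium2", "gcc", "mkl", "double"): 15.5,
    ("Xeon", "icc", "gsl", "double"): 39.2,
    ("Xeon", "icc", "mkl", "double"): 10.4,
    ("Xeon", "gcc", "gsl", "double"): 39.8,
    ("Xeon", "gcc", "mkl", "double"): 15.9,
    ("Itanium2", "icc", "gsl", "float"): 11.9,
    ("Itanium2", "icc", "mkl", "float"): 6.9,
    ("Itanium2", "gcc", "gsl", "float"): 33.9,
    ("Itanium2", "gcc", "mkl", "float"): 22.0,
    ("Xeon", "icc", "gsl", "float"): 14.2,
    ("Xeon", "icc", "mkl", "float"): 11.8,
    ("Xeon", "gcc", "gsl", "float"): 20.0,
    ("Xeon", "gcc", "mkl", "float"): 12.6,
}


def ratio(numerator, denominator):
    """Performance relationship: mean time of one case over another, one decimal."""
    return round(MEAN_S[numerator] / MEAN_S[denominator], 1)


# gcc versus icc on Itanium 2, gsl, double (cells 3/1).
gcc_vs_icc = ratio(("Itanium2", "gcc", "gsl", "double"),
                   ("Itanium2", "icc", "gsl", "double"))

# gcc+gsl versus icc+mkl on Itanium 2, float (cells 11/10).
tools = ratio(("Itanium2", "gcc", "gsl", "float"),
              ("Itanium2", "icc", "mkl", "float"))

# Xeon versus Itanium 2 with gcc+gsl, float (cells 15/11).
xeon_vs_itanium = ratio(("Xeon", "gcc", "gsl", "float"),
                        ("Itanium2", "gcc", "gsl", "float"))
```

A value above 1 means the case in the numerator took longer; a Xeon to Itanium 2 ratio below 1 therefore indicates that the Xeon was the faster of the two for that combination.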

IV. DISCUSSION

A. Compiler impact on each processor

The architecture of the Itanium 2 processor, based on the EPIC model, was conceived to reach high ILP levels with relatively low hardware complexity. Building on the VLIW model, EPIC established a set of guidelines for implementing very high ILP processors, together with a basic set of architectural functions. In particular, EPIC retains the main VLIW characteristic: the compiler resolves the Execution Plan of the algorithm, in other words the list of instructions to be executed, their order, and which internal Execution Unit of the processor each one will use [3]. The execution plan is therefore static and is defined at compile time.

Often, in order to optimize performance and take advantage of parallelism, the compiler needs to alter the instruction sequence before sending it to the processor. The processor must therefore have a predictable internal behavior, so that the instruction reordering done by the compiler does not modify the essence of the algorithm and thereby change the expected results [3].
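The idea of a static execution plan can be illustrated with a toy scheduler that packs independent instructions into fixed-width groups at "compile time". This is a deliberately simplified sketch and not the algorithm of any real EPIC compiler: it assumes that every register is written only once (so only read-after-write dependencies exist), that every instruction takes one cycle, and that `width` instructions can issue together; all names are hypothetical.

```python
def schedule(instructions, width=3):
    """Greedy static scheduling of (dest, sources) instructions into bundles.

    An instruction is placed in the earliest bundle that comes after the
    bundles producing all of its sources and that still has a free slot.
    """
    produced_in = {}  # register name -> index of the bundle that writes it
    bundles = []
    for dest, sources in instructions:
        earliest = 0
        for src in sources:
            if src in produced_in:  # registers never written are program inputs
                earliest = max(earliest, produced_in[src] + 1)
        slot = earliest
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(dest)
        produced_in[dest] = slot
    return bundles


# Invented program: c needs a and b; e needs c and d; x, y, z are inputs.
program = [
    ("a", ["x", "y"]),
    ("b", ["x", "z"]),
    ("c", ["a", "b"]),
    ("d", ["y"]),
    ("e", ["c", "d"]),
]
plan = schedule(program, width=3)
```

Here instruction `d` is moved ahead of `c` to fill a free slot in the first group, an example of the reordering mentioned above: the result is unchanged only because `d` does not depend on anything computed in between.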

This condition does not hold in superscalar processors, which currently have a very complex out-of-order execution engine able to work over a window that sometimes holds more than 126 instructions [5], analyzing them dynamically and establishing which instructions can be executed without damaging the expected result of the program. Obtaining an execution plan at compile time is practically impossible for this kind of processor, since the compiler cannot determine how the dynamic scheduling hardware will organize the execution of instructions that have dependencies.

This situation is clearly seen in IA-32 processors, which are based on an Out of Order Execution core: first the P6 microarchitecture (Pentium Pro, Pentium II and Pentium III) [5], and currently the NetBurst microarchitecture of the Xeon processor used in this work. The Out of Order

TABLE IV
PERFORMANCE RELATIONSHIPS USING GCC AND GSL LIBRARY VERSUS ICC AND MKL LIBRARY.

Relationship          Processor   double   float
gcc+gsl / icc+mkl     Itanium2    3.7      4.9
gcc+gsl / icc+mkl     Xeon        3.8      1.7

TABLE III
GSL VERSUS MKL LIBRARY PERFORMANCE RELATIONSHIPS.

Processor   compiler   double   float
Itanium2    icc        1.8      1.7
Itanium2    gcc        1.9      1.5
Xeon        icc        3.8      1.2
Xeon        gcc        2.5      1.6

TABLE V
COMPILER LIBRARY COMBINATION PROCESSORS PERFORMANCE RELATIONSHIPS.

Processor           compiler    double   float
Xeon / Itanium2     gcc + gsl   1.4      0.6
Xeon / Itanium2     gcc + mkl   1.0      0.6
Xeon / Itanium2     icc + gsl   2.8      1.2
Xeon / Itanium2     icc + mkl   1.3      1.7


Execution model also hides the latency of instruction operand accesses when the operands are not stored in the L1 data cache [4], [5]. These latencies do not follow simple statistical estimation criteria and therefore remain hidden from the compiler, which prevents it from generating an efficient Execution Plan as in the VLIW model. The high performance of the Xeon comes at the cost of a very complex hardware design.

Nevertheless, the ability of EPIC processors to transfer the complexity of building the Execution Plan to the compiler must not be overestimated, since in return it implies a dependency on the compiler that is in many cases excessive; if proper development tools are not used, the results will not match expectations. This is evident in Table V, where using a multi-target compiler such as gcc makes the Xeon to Itanium 2 performance relationship fall below 1 in some cases.

This occurs when the calculations use float data, which are 32-bit variables and therefore not the data type that benefits most from a native 64-bit architecture such as that of the Itanium 2 processor. The same table shows that with a specific compiler such as icc the results match expectations considerably better.

Even so, there are situations whose outcome cannot be predicted, because the information required to resolve them is generated at execution time. These are: 1) branch instructions whose condition cannot be evaluated in advance; 2) code blocks that result from flow diagrams with multiple execution branches; 3) concurrent access to a resource, for example a memory address; and 4) data loads from memory with nondeterministic access time, which depends on the cache level where the data is stored or, worse still, on whether the data is in the system DRAM. For the first three cases, the EPIC compiler employs the same speculative techniques as the hardware of superscalar processors [3]. In general, algorithms build loops on conditional branch instructions that always have the same destination except when the loop condition expires. Therefore, if the branch corresponding to the true condition is assumed to be taken, the assumption will always be right except for the single time the loop ends. In that case, the processor hardware has to resolve the situation, at a cost in system performance. All this speculative execution hardware is not necessary in EPIC processors.
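The loop argument above can be checked with a few lines of Python: if the branch that closes a loop is always assumed to be taken, the assumption fails exactly once, when the loop exits. The function names are hypothetical and the model ignores everything except the closing branch of a single loop.

```python
def loop_branch_outcomes(iterations):
    """Outcomes of the branch closing a loop: taken on every pass except the last."""
    if iterations < 1:
        raise ValueError("the loop body must run at least once")
    return [True] * (iterations - 1) + [False]


def static_taken_accuracy(outcomes):
    """Fraction of branches correctly guessed by always assuming 'taken'."""
    hits = sum(1 for taken in outcomes if taken)
    return hits / len(outcomes)


# A loop of 100 iterations is mispredicted only on its final pass.
accuracy = static_taken_accuracy(loop_branch_outcomes(100))
```

The longer the loop, the closer the accuracy gets to 1, which is why this fixed assumption works well for the long sample-by-sample loops typical of signal processing code.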

For branch instructions, superscalar processors use a hardware block named Branch Target Buffer which, by means of speculative execution logic, maintains a cache of the most probable branch destination addresses [4]. For these reasons, the performance relationships in Table II show a different behavior on each processor: the icc compiler brings a marked improvement on the Itanium 2, while on the Xeon the improvement, although present, is not as significant.

B. Impact of the library used for the FFT functions

The value of the mkl library, besides its having been compiled originally with icc, lies in the extremely efficient use its algorithms make of the processor resources [12]. Both processors have several resources for floating point and multimedia calculation, including parallel execution units for this kind of operation, registers, and SIMD instructions [4], [5]. In the case of the Xeon processor, these resources are grouped in the SSE3 multimedia extensions [5].

For its part, the Itanium 2 processor, although it can execute code compiled for the Xeon, also has its own multimedia calculation resources [8], [13]. A scientific calculation library can be designed following one of two approaches: a) take maximum advantage of these resources, even resorting to assembly language optimizations, at the expense of portability; or b) design generic algorithms that avoid architecture-dependent resources, sacrificing performance to ensure code portability across platforms.

The mkl library follows the first approach; it is designed by the manufacturer of both processors used in this study [12]. The gsl library, in contrast, has been designed to favor portability [14]. The performance relationships in Table III indicate that, with double data, the contribution of the mkl library is appreciably greater on the Xeon processor than on the Itanium 2 processor, because among its multimedia resources the Xeon has eight 128-bit registers that enable it to process two double values in a single instruction [5].

C. Impact of the processor-specific compiler and library combination

Table IV shows significant performance improvements for applications developed with both a compiler and a library specifically designed for each processor. For general purpose applications, employing processor-specific development tools is not always the best policy. However, in the design of a specific application, such as the biomedical signal processing of the present work, where high performance computing is a priority, the whole system must be designed around the application requirements.

In such cases, processor-specific compilers and scientific calculation libraries reduce the execution time of the same algorithm almost four times with 64-bit floating point data (double type) on both processors and, on the Itanium 2 processor specifically, 4.9 times with 32-bit floating point data (float type). This difference can determine whether or not a system is able to perform calculations in real time. In real-time acquisition and processing of biomedical signals for online decision making, for example in an emergency, this contribution is crucial.

D. Performance relationship between processors according to the compiler and library combination

Comparing both processors on the basis of the performance relationships in Table V, the Itanium 2 is more efficient with double data, that is, 64-bit floating point, since this processor has several 64-bit register files, which minimizes memory accesses [8], [9], [13]. The table also shows that the compiler is responsible for the main contribution to the performance of the Itanium 2 processor, as expected for an EPIC processor. On the Xeon processor, however, the mkl library improves the performance relationship, because its multiple multimedia execution resources are better used by this library [4], [5], [12].

ACKNOWLEDGMENTS

The authors thank Intel Tecnología de Argentina SA for the donation of an Itanium 2 system, granted for the research project "Analysis of Heart Rate Variability, Arterial Pressure and Pulse in Normotensive and Hypertensive Subjects".

Marcelo Risk is a researcher at CONICET (Consejo Nacional de Investigaciones Científicas y Técnicas), Ministerio de Educación, Ciencia y Tecnología, Argentina.

REFERENCES

1. Rau B. Ramakrishna, Fisher Joseph A. Instruction-Level Parallel Processing: History, Overview and Perspective. Computer Systems Laboratory, HPL-92-132. October 1992.
2. Furfaro Alejandro, Llamedo Soria Mariano, Bruno Julián S., Gonzalez Nahuel, Risk Marcelo R. Procesamiento Intensivo del ECG con procesadores IA-32 e IA-64. Facultad Regional Buenos Aires, Universidad Tecnológica Nacional.
3. Schlansker Michael S., Rau B. Ramakrishna. EPIC: An Architecture for Instruction-Level Parallel Processors. Compiler and Architecture Research, HPL-1999-111. February 2000.
4. Intel. IA-32 Intel Architecture Optimization Reference Manual. 2005.
5. Intel. IA-32 Intel Architecture Software Developer's Manual. Volume 1: Basic Architecture. 2005.
6. Sobh J, Risk MR, Barbieri R, Saul P. Database for ECG, arterial blood pressure and respiration signal analysis: feature extraction, spectral estimation, and parameter quantification. IEEE-EMBC and CMBSC, vol 4, pp. 955-956, 1997.
7. Risk M, Sobh J, Barbieri R, Saul P. A simple algorithm for QRS peak location: use on long term ECG recordings from the HMS-MIT-FFMS database. IEEE-EMBC 1995.
8. Intel. Intel Itanium Architecture Software Developer's Manual. Volume 1: Application Architecture. Revision 2.1. October 2002.
9. Intel. Intel Itanium Architecture. Reference Manual for Software Optimization. November 2001.
10. McNairy C, Soltis D. Itanium 2 Processor Microarchitecture. IEEE Micro, March-April, pp. 44-55, 2003.
11. Sharangpani H, Arora K. Itanium Processor Microarchitecture. IEEE Micro, September-October, pp. 24-43, 2000.
12. Intel. Intel Math Kernel Library. Reference Manual. Document Number: 630813-017.
13. Hewlett Packard White Paper. Inside the Intel Itanium 2 Processor: an Itanium Processor Family member for balanced performance over a wide range of applications. July 2002.
14. Galassi Mark, Davies Jim, Theiler James, Gough Brian, Jungman Gerard, Booth Michael, Rossi Fabrice. GNU Scientific Library Reference Manual, Edition 1.7, for GSL version 1.7.
