BI131

    Bioingeniería 131

    EPIC processors performance improvement estimation in biomedical signals intensive processing

    Alejandro Furfaro¹, Nahuel Gonzalez¹, Martín Belzunce¹, Marcelo Risk¹,²

    ¹ Facultad Regional Buenos Aires, Universidad Tecnológica Nacional, ARGENTINA

    ² CONICET y Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, ARGENTINA

    Abstract: Performance improvement in processors designed for intensive calculation is based on the optimization of Instruction Level Parallelism (ILP). To implement ILP, high performance processors follow one of two models: superscalar or EPIC. The aim of this work was to estimate the performance improvement in systems based on processors designed according to the EPIC model, given its greater dependency on the compiler for ILP optimization; the study was then extended to the contribution of calculation functions contained in libraries specially designed for EPIC processors. Systems based on Itanium 2 and Xeon processors, which follow the EPIC and superscalar models respectively, were employed. A total of ten electrocardiogram (ECG) recordings of twenty-four hours each were processed using digital filters and the Fast Fourier Transform; source code versions using double and float type variables were compiled with gcc (GNU C Compiler) and icc (Intel C Compiler), with the optimization switch -O3. To compute the Fast Fourier Transform, gsl (GNU Scientific Library), an open source library portable to every processor, and Intel mkl (Math Kernel Library) were employed on both processors under study. The intensive ECG data processing showed advantages for the Itanium 2 processor, especially in calculations with double data types. Finally, the greater dependency of the Itanium 2 processor on compiler efficiency was verified, in accordance with the basic principles of the EPIC model, as were the substantial performance improvements introduced by the EPIC model over the superscalar one.

    Index Terms: VLIW, EPIC, ILP, PNI, SIMD.

    I. INTRODUCTION

    Instruction Level Parallelism (ILP) is a set of techniques applied in microprocessor and compiler design that allows a single CPU to execute several instructions at the same time [1]. Designers of high performance processors focus their efforts on improving ILP because it is one of the most important factors in increasing the performance of intensive numerical algorithms. They achieve it by integrating multiple Execution Units into the same CPU.

    At the end of the 1970s, ILP was implemented through an architectural model named VLIW (Very Long Instruction Word), which can be considered the first documented EPIC (Explicitly Parallel Instruction Computing) model. This model is currently implemented in modern processors such as the Itanium 2, which in addition has several Multimedia Calculation units (for Digital Signal Processing) and Floating Point Units, giving it a very good profile for high performance scientific applications [9], [10], [11].

    Then, at the end of the 1980s, some processors began to include more than one execution unit in the same CPU, at least for integer calculation. This made it necessary to analyze which program instructions can be executed in parallel in the processor's different Execution Units without changing the result expected by the application programmer. Interdependency between two consecutive instructions was one of the major problems to solve for an efficient ILP implementation.

    To implement this analysis, two microarchitectural models were established: 1) the superscalar model, which integrates into the processor the logic needed to determine whether two instructions can be executed in parallel as they enter the processor; and 2) the VLIW model, which performs the dependency analysis in the compiler, generating whenever possible very wide words containing several instructions that the processor's different calculation units can execute in parallel without disturbing the program logic.
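The compiler-side dependency analysis of the VLIW model can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the algorithm of any real compiler: the `Instr` record, the `depends` and `bundle` helpers, the register names and the bundle width of three slots are all hypothetical choices made for this example. Instructions are packed in program order, and a new wide word is started whenever an instruction conflicts with one already in the current word.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    text: str                # human-readable form, used only for printing
    dest: str                # register written by the instruction
    srcs: Tuple[str, ...]    # registers read by the instruction


def depends(later: Instr, earlier: Instr) -> bool:
    """True if `later` must not share a wide word with `earlier` (conservative)."""
    raw = earlier.dest in later.srcs     # read after write
    waw = earlier.dest == later.dest     # write after write
    war = later.dest in earlier.srcs     # write after read
    return raw or waw or war


def bundle(program: List[Instr], width: int = 3) -> List[List[Instr]]:
    """Pack instructions, in program order, into words of independent operations."""
    bundles: List[List[Instr]] = []
    current: List[Instr] = []
    for ins in program:
        conflict = any(depends(ins, prev) for prev in current)
        if conflict or len(current) == width:
            bundles.append(current)      # close the current wide word
            current = []
        current.append(ins)
    if current:
        bundles.append(current)
    return bundles


program = [
    Instr("r1 = r2 + r3", "r1", ("r2", "r3")),
    Instr("r4 = r5 * r6", "r4", ("r5", "r6")),
    Instr("r7 = r1 + r4", "r7", ("r1", "r4")),   # needs r1 and r4 from above
    Instr("r8 = r2 - r5", "r8", ("r2", "r5")),
]
schedule = bundle(program)
for index, word in enumerate(schedule):
    print(index, [ins.text for ins in word])
```

In this sketch the first two instructions share a word, while the third has to wait for their results and opens a second word together with the fourth. A superscalar processor would reach a similar grouping at run time, in hardware.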

    VLIW removes complexity from the hardware design and significantly reduces power consumption, a limiting factor in the manufacturing of high performance integrated circuits.

    Finally, in the 1990s, the development of integration technologies established as a standard the implementation, in the same processor, of multiple execution units for integer calculation, floating point calculation, and jump operations, allowing high ILP levels to be reached by processing several instructions in one clock cycle.

    The contribution of compilers designed specifically for EPIC architecture processors to optimizing their performance, especially in floating point arithmetic, has been verified [2], [3].

    The aim of the present work was to estimate the performance improvement in systems based on processors designed according to the EPIC model, given its greater dependency on the compiler for ILP optimization; the study was then extended to the contribution of calculation functions contained in libraries specially designed for EPIC processors.

    II. MATERIAL AND METHODS

    A. Data

    This work was based on the intensive processing of ten electrocardiogram (ECG) files, each from a 24-hour Holter study; each file contains 2 ECG channels acquired at 256 samples per second, which amounts to approximately 80 Mbytes per file [6].
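The file size quoted above can be checked with a few lines of arithmetic. The sample width is not stated in the text, so the two bytes per sample used here (`BYTES_PER_SAMPLE`) is an assumption made only for this sketch.

```python
CHANNELS = 2
FS_HZ = 256                        # samples per second, per channel
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_SAMPLE = 2               # assumption: 16-bit samples, not stated in the text

samples = CHANNELS * FS_HZ * SECONDS_PER_DAY
size_bytes = samples * BYTES_PER_SAMPLE

print("samples per recording:", samples)
print("size in bytes:", size_bytes)
print("size in MiB:", size_bytes / 2**20)
```

Under that assumption a recording holds about 44 million samples and occupies roughly 88 million bytes, the same order of magnitude as the approximately 80 Mbytes reported.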

    B. Processing Systems

    We used the following two systems: a) a server with two 1.5 GHz Itanium 2 processors (EPIC architecture), 400 MHz FSB, 6 Mbytes of L3 cache memory, Intel E8870 chipset, 8 Gbytes of DDR200 RAM, and an Ultra 320 SCSI controller with three 140 Gbyte disks in a RAID 5 configuration, running Red Hat Linux AS4, kernel 2.6.9-1; and b) a server with one 2.8 GHz Xeon processor with Hyper-Threading Technology, 512 KB of L2 cache memory, 533 MHz FSB, 1 GB of DDR RAM, Intel SE7505VB2 motherboard, Intel E7505 chipset, and a 200 GB ATA hard disk, running Linux Fedora Core 3, kernel 2.6.9-5.

    C. Algorithms

    An algorithm for R wave detection, with subsequent RR interval measurement and frequency spectrum calculation, was applied to the surface ECG recordings [7]. The algorithm first applies an equiripple FIR low-pass filter, with fc = 50 Hz and 19 taps, to both channels. A combined signal is then obtained with the following formula:

    $$x[n] = ch_1^2[n] + ch_2^2[n] \tag{1}$$

    The purpose of this combined signal is to represent the information of both channels simultaneously. If one channel fails (device failure, disconnection, etc.), the signal still carries the information of the other channel. The R wave is located on this signal, and the RR interval is determined from it.
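The role of the combined signal can be sketched as follows, assuming the sum-of-squares reading of equation (1). The peak detector is a naive threshold and local-maximum search written only for this illustration; it is not the QRS detection algorithm of [7], and the names `combine_channels`, `detect_peaks`, `rr_intervals`, the threshold and the refractory period are hypothetical.

```python
def combine_channels(ch1, ch2):
    """Equation (1): sum of the squared channels, sample by sample."""
    return [a * a + b * b for a, b in zip(ch1, ch2)]


def detect_peaks(x, threshold, refractory):
    """Naive stand-in detector: local maxima above `threshold`,
    at least `refractory` samples apart."""
    peaks = []
    for n in range(1, len(x) - 1):
        is_local_max = x[n] > x[n - 1] and x[n] >= x[n + 1]
        if x[n] >= threshold and is_local_max:
            if not peaks or n - peaks[-1] >= refractory:
                peaks.append(n)
    return peaks


def rr_intervals(peaks, fs):
    """RR intervals in seconds from consecutive peak positions."""
    return [(b - a) / fs for a, b in zip(peaks, peaks[1:])]


fs = 256                       # sampling rate used in the recordings
length = 4 * fs                # four seconds of synthetic data
# Synthetic test signal: one unit spike per second on channel 1,
# channel 2 flat, as if its electrode were disconnected.
ch1 = [1.0 if n % fs == 100 else 0.0 for n in range(length)]
ch2 = [0.0] * length

x = combine_channels(ch1, ch2)
peaks = detect_peaks(x, threshold=0.5, refractory=fs // 5)
rr = rr_intervals(peaks, fs)
print(peaks, rr)
```

Even with one channel flat, the spikes of the other channel survive in x[n], so the RR intervals can still be measured.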

    Then the spectrum of each channel and of the x[n] signal was calculated using the Fast Fourier Transform (FFT). The results were stored in a CSV file for each Holter recording.
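The filtering and spectrum steps can be sketched in pure Python under two loudly stated substitutions. The low-pass filter below is a Hamming-windowed sinc design, not the equiripple design used in the study, and the FFT is a textbook recursive radix-2 routine, not the gsl or mkl implementation. The test tones at 10 Hz and 100 Hz and all function names are assumptions of this example.

```python
import cmath
import math


def lowpass_taps(num_taps, fc, fs):
    """Hamming-windowed sinc low-pass (a stand-in for the equiripple design)."""
    m = num_taps - 1
    taps = []
    for k in range(num_taps):
        t = 2.0 * fc / fs * (k - m / 2)
        sinc = 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / m)
        taps.append(2.0 * fc / fs * sinc * window)
    total = sum(taps)
    return [h / total for h in taps]          # normalize to unit gain at DC


def fir_filter(x, taps):
    """Direct-form convolution; returns only the fully overlapped outputs."""
    n_taps = len(taps)
    return [sum(taps[k] * x[n - k] for k in range(n_taps))
            for n in range(n_taps - 1, len(x))]


def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


fs, n_fft, n_taps = 256, 256, 19
# Synthetic input: a 10 Hz tone (pass band) plus a 100 Hz tone (stop band).
raw = [math.sin(2 * math.pi * 10 * n / fs) + math.sin(2 * math.pi * 100 * n / fs)
       for n in range(n_fft + n_taps - 1)]
filtered = fir_filter(raw, lowpass_taps(n_taps, fc=50, fs=fs))
magnitude = [abs(c) for c in fft(filtered)]   # bin k corresponds to k Hz here
print(magnitude[10], magnitude[100])
```

With 256 samples at 256 samples per second each FFT bin is 1 Hz wide, so the two tones fall exactly on bins 10 and 100, and the attenuation of the 100 Hz component by the 50 Hz low-pass filter is directly visible in the spectrum.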

    Two versions of the algorithms were implemented: one using only single precision floating point variables (float type), and the other using only double precision floating point variables (double type), both according to the IEEE 754 standard.

    All algorithms were compiled with two different tools: a) the GNU C Compiler (gcc), standard in Linux distributions, version 3.4.2 on the Xeon system and version 3.4.3 on the Itanium 2 system; and b) the Intel C Compiler (icc) version 8.1, specifically designed for Intel processors.

    Both compilers were run with aggressive processor optimization (switch -O3 on the command line).
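The numerical difference between the float and double versions described above can be illustrated with a small experiment. This is only a sketch: Python floats are IEEE 754 doubles, and single precision is emulated here by rounding through the standard `struct` module; the accumulation of 0.1 is an arbitrary example and not part of the ECG algorithms.

```python
import struct


def to_float32(value: float) -> float:
    """Round a Python float (IEEE 754 double) to the nearest IEEE 754 single."""
    return struct.unpack("f", struct.pack("f", value))[0]


def accumulate(step: float, count: int, single: bool) -> float:
    """Add `step` to an accumulator `count` times, optionally in single precision."""
    total = 0.0
    for _ in range(count):
        total += step
        if single:
            total = to_float32(total)     # emulate a float accumulator
    return total


count = 100_000
reference = 10_000.0                      # exact value of 0.1 * 100000
err_double = abs(accumulate(0.1, count, single=False) - reference)
err_single = abs(accumulate(to_float32(0.1), count, single=True) - reference)
print("double error:", err_double)
print("float error: ", err_single)
```

The single precision accumulator drifts visibly from the exact sum while the double precision one stays very close to it, which is why the two variants have to be treated as separate test cases.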

    FFT calculation was performed using two different libraries: a) the GNU Scientific Library (gsl) version 1.7; and b) the Intel Math Kernel Library (mkl) version 8.0.019, specifically designed for the Itanium 2 processor on which our system is based.

    The gsl library was compiled to match each version of the algorithms, that is, with gcc or icc according to the test to be performed. The mkl library was used as provided by the manufacturer, because its source code is not available.

    D. Performance estimation

    To estimate the performance of both systems, time stamps with millisecond resolution were taken at the different input and output points of the algorithms. Given the volume of information to be processed, this resolution was considered sufficient.

    In addition, to evaluate the performance of the processors alone, processes that only access the disk and store the content of the ECG files in memory were developed, and specific time stamps were taken to measure the time spent on this task. The values obtained were used to adjust the results of the processing algorithms, removing the disk access component from the measurement.
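The adjustment described above, in which the load-only time is subtracted from the total time, can be sketched as follows. The functions `load_only` and `load_and_process` are hypothetical stand-ins for reading an ECG file and for the full processing chain; the study took its time stamps inside C programs, so this Python timing code is an illustration of the idea only.

```python
import time


def elapsed_ms(func, *args):
    """Run func(*args) and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, (time.perf_counter() - start) * 1000.0


def processing_time_ms(total_ms, load_ms):
    """Remove the load component from a total measurement."""
    return total_ms - load_ms


def load_only(n):
    """Hypothetical stand-in for reading an ECG file into memory."""
    return [float(i % 97) for i in range(n)]


def load_and_process(n):
    """Hypothetical stand-in for the load followed by the signal processing."""
    data = load_only(n)
    return sum(v * v for v in data)


_, t_load = elapsed_ms(load_only, 200_000)
_, t_total = elapsed_ms(load_and_process, 200_000)
print("adjusted processing time (ms):", processing_time_ms(t_total, t_load))
```

Because the two runs are separate measurements, the adjusted value carries the noise of both; averaging over several recordings, as in the study, reduces that effect.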

    The results were then analyzed with a Student's t test for paired samples; the statistical significance level was set at 0.05.
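A paired Student's t test of this kind can be sketched as follows. The per-recording times below are invented for illustration and are not the measurements of this study; only the critical value 2.262 (two-sided, significance level 0.05, 9 degrees of freedom) is a standard tabulated constant.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two matched samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


T_CRIT_DF9 = 2.262   # two-sided critical value, alpha = 0.05, 9 degrees of freedom

# Hypothetical processing times in seconds for ten recordings (not real data).
gcc_times = [29.1, 28.8, 29.3, 29.0, 28.7, 29.2, 29.4, 28.9, 29.0, 28.6]
icc_times = [14.0, 14.2, 14.3, 13.9, 14.1, 14.4, 14.0, 14.2, 13.8, 14.1]

t_stat, dof = paired_t(gcc_times, icc_times)
significant = abs(t_stat) > T_CRIT_DF9
print(t_stat, dof, significant)
```

Pairing by recording removes the variation between recordings from the comparison, so only the difference attributable to the compiler or library remains in the statistic.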

    III. RESULTS

    After processing the ten 24-hour ECG recordings, comparing the outputs in each case on the system under study, and verifying that the quantitative results were the same regardless of the library providing the FFT functions and of the compiler employed, the processing time in each case was taken for evaluation, expressed as mean and standard deviation (SD).

    A summary of these results is shown in Figures 1 and 2. Tables I-A and I-B show the results obtained for the different combinations of compiler and scientific calculation library. In all cases, the switch -O3 (aggressive processor-specific optimization) was used with both compilers.

    A. Compiler incidence in each processor

    For the Itanium 2 processor, comparing programs compiled with gcc against the same programs compiled with icc gives the following performance ratios, calculated as the average processing time using gcc divided by the average processing time using icc: a) 2.1 using gsl library functions and double type variables (3/1); b) 2.0 using mkl library functions and double type variables (4/2); c) 2.8 using gsl library functions and float type variables (11/9); d) 3.2 using mkl library functions and float type variables (12/10). The numbers in parentheses identify the cells of Tables I-A and I-B that form each ratio.

    For the Xeon processor, the same performance ratios gave the following values: a) 1.0 using gsl library functions and double type variables (7/5); b) 1.5 using mkl library functions and double type variables (8/6); c) 1.4 using gsl library functions and float type variables (15/13); d) 1.1 using mkl library functions and float type variables (16/14). These values are presented in Table II.
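The ratios listed above can be reproduced from the mean processing times of Tables I-A and I-B. The dictionary layout and the helper name `gcc_over_icc` are choices made for this sketch; the numbers are the table means.

```python
# Mean processing times in seconds, copied from Tables I-A (double) and I-B (float).
# Key: (processor, compiler, library, data type).
MEAN_TIME = {
    ("Itanium2", "icc", "gsl", "double"): 14.1, ("Itanium2", "icc", "mkl", "double"): 7.8,
    ("Itanium2", "gcc", "gsl", "double"): 29.0, ("Itanium2", "gcc", "mkl", "double"): 15.5,
    ("Xeon", "icc", "gsl", "double"): 39.2,     ("Xeon", "icc", "mkl", "double"): 10.4,
    ("Xeon", "gcc", "gsl", "double"): 39.8,     ("Xeon", "gcc", "mkl", "double"): 15.9,
    ("Itanium2", "icc", "gsl", "float"): 11.9,  ("Itanium2", "icc", "mkl", "float"): 6.9,
    ("Itanium2", "gcc", "gsl", "float"): 33.9,  ("Itanium2", "gcc", "mkl", "float"): 22.0,
    ("Xeon", "icc", "gsl", "float"): 14.2,      ("Xeon", "icc", "mkl", "float"): 11.8,
    ("Xeon", "gcc", "gsl", "float"): 20.0,      ("Xeon", "gcc", "mkl", "float"): 12.6,
}


def gcc_over_icc(processor, library, dtype):
    """Mean time with gcc divided by mean time with icc, same library and type."""
    return (MEAN_TIME[(processor, "gcc", library, dtype)]
            / MEAN_TIME[(processor, "icc", library, dtype)])


table_ii = {
    (p, lib, t): round(gcc_over_icc(p, lib, t), 1)
    for p in ("Itanium2", "Xeon")
    for lib in ("gsl", "mkl")
    for t in ("double", "float")
}
for key, ratio in table_ii.items():
    print(key, ratio)
```

A ratio above 1 means the icc build was faster than the gcc build for that combination.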

    B. Library used for solving FFT functions incidence

    From Tables I-A and I-B, the performance ratio for the Itanium 2 processor, calculated as the average processing time using the FFT functions of the gsl library divided by the average processing time using the same functions from the mkl library, with the same compiler, is: a) 1.8 compiling with icc and using double type variables (1/2); b) 1.9 compiling with gcc and using double type variables (3/4); c) 1.7 compiling with icc and using float type variables (9/10); d) 1.5 compiling with gcc and using float type variables (11/12).

    For the Xeon processor, the same performance ratios gave the following values: a) 3.8 compiling with icc and using double type variables (5/6); b) 2.5 compiling with gcc and using double type variables (7/8); c) 1.2 compiling with icc and using float type variables (13/14); d) 1.6 compiling with gcc and using float type variables (15/16). These performance ratios are shown in Table III.

    TABLE I-A
    Processing times for each processor and compiler in seconds (mean ± SD), double type. Numbers in parentheses are the cell indices used in the text.

    Processor and compiler   Lib gsl            Lib mkl
    Itanium2 with icc        14.1 ± 0.2 (1)     7.8 ± 0.2 (2)
    Itanium2 with gcc        29.0 ± 0.3 (3)     15.5 ± 0.2 (4)
    Xeon with icc            39.2 ± 0.3 (5)     10.4 ± 0.1 (6)
    Xeon with gcc            39.8 ± 0.4 (7)     15.9 ± 0.4 (8)

    TABLE I-B
    Processing times for each processor and compiler in seconds (mean ± SD), float type. Numbers in parentheses are the cell indices used in the text.

    Processor and compiler   Lib gsl            Lib mkl
    Itanium2 with icc        11.9 ± 0.2 (9)     6.9 ± 0.2 (10)
    Itanium2 with gcc        33.9 ± 0.3 (11)    22.0 ± 0.2 (12)
    Xeon with icc            14.2 ± 0.1 (13)    11.8 ± 0.1 (14)
    Xeon with gcc            20.0 ± 0.2 (15)    12.6 ± 0.1 (16)

    TABLE II
    GCC VERSUS ICC COMPILER PERFORMANCE RATIOS.

    Processor   Library   Double   Float
    Itanium2    gsl       2.1      2.8
    Itanium2    mkl       2.0      3.2
    Xeon        gsl       1.0      1.4
    Xeon        mkl       1.5      1.1

    Figure 1. Execution times in seconds for the ten 24-hour ECG recordings processed with the Itanium 2, using the double and float data types respectively. [Two panels, "Itanium with double precision" and "Itanium with float precision": processing time (s) versus ECG record (1 to 10), with one series each for the gcc compiler with GSL, the icc compiler with GSL, the gcc compiler with MKL, and the icc compiler with MKL.]

    Figure 2. Execution times in seconds for the ten 24-hour ECG recordings processed with the Xeon, using the double and float data types respectively. [Two panels, "Xeon with double precision" and "Xeon with float precision": processing time (s) versus ECG record (1 to 10), with one series each for the gcc compiler with GSL, the icc compiler with GSL, the gcc compiler with MKL, and the icc compiler with MKL.]

    C. Processor specific compiler and library combination incidence

    Again starting from the results in Tables I-A and I-B, the gain in Itanium 2 system performance from using the icc compiler and the FFT functions of the mkl library, relative to using the gcc compiler and the FFT functions of gsl, gives the following performance ratios: a) 3.7 using double type variables (3/2); and b) 4.9 using float type variables (11/10). For the Xeon processor, the same ratios give the following values: a) 3.8 using double type variables (7/6); and b) 1.7 using float type variables (15/14). These performance ratios are shown in Table IV.

    D. Performance relationship between processors according to the compiler library combination

    Again from the values in Tables I-A and I-B, the performance ratios between the Xeon and Itanium 2 processors (Xeon processing time divided by Itanium 2 processing time) were calculated for the different combinations of compiler and library, giving the following values: a) 1.4 using double type variables, compiling with gcc and using the FFT functions of the gsl library (7/3); b) 1.0 using double type variables, compiling with gcc and using the FFT functions of the mkl library (8/4); c) 2.8 using double type variables, compiling with icc and using the FFT functions of the gsl library (5/1); d) 1.3 using double type variables, compiling with icc and using the FFT functions of the mkl library (6/2); e) 0.6 using float type variables, compiling with gcc and using the FFT functions of the gsl library (15/11); f) 0.6 using float type variables, compiling with gcc and using the FFT functions of the mkl library (16/12); g) 1.2 using float type variables, compiling with icc and using the FFT functions of the gsl library (13/9); h) 1.7 using float type variables, compiling with icc and using the FFT functions of the mkl library (14/10). These performance ratios are shown in Table V.
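The processor comparison above can be reproduced in the same way, dividing each Xeon mean time by the Itanium 2 mean time for the same compiler, library and data type. The row layout and the `faster` label are choices of this sketch; the times are the means of Tables I-A and I-B.

```python
# (compiler, library, data type, Xeon mean time s, Itanium 2 mean time s),
# copied from Tables I-A and I-B.
ROWS = [
    ("gcc", "gsl", "double", 39.8, 29.0),
    ("gcc", "mkl", "double", 15.9, 15.5),
    ("icc", "gsl", "double", 39.2, 14.1),
    ("icc", "mkl", "double", 10.4, 7.8),
    ("gcc", "gsl", "float", 20.0, 33.9),
    ("gcc", "mkl", "float", 12.6, 22.0),
    ("icc", "gsl", "float", 14.2, 11.9),
    ("icc", "mkl", "float", 11.8, 6.9),
]

table_v = {}
for compiler, library, dtype, xeon_s, itanium_s in ROWS:
    ratio = xeon_s / itanium_s
    faster = "Itanium2" if ratio > 1 else "Xeon"
    table_v[(compiler, library, dtype)] = round(ratio, 1)
    print(f"{compiler}+{library} {dtype}: {ratio:.1f} (faster: {faster})")
```

Ratios below 1 mark the combinations in which the Xeon finished first, which in these data happens only for float variables compiled with gcc.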

    IV. DISCUSSION

    A. Compiler incidence in each processor

    The Itanium 2 processor architecture, based on the EPIC model, was conceived to reach high ILP levels with relatively low hardware complexity. Building on the VLIW model, EPIC established a set of guidelines for implementing very high ILP processors, together with a basic set of architectural functions. In particular, EPIC retains the main VLIW characteristic: the compiler resolves the algorithm's Execution Plan, in other words the list of instructions to be executed, their order, and which of the processor's internal Execution Units will be used [3]. The execution plan is therefore static and is defined at compile time.

    Often, to optimize performance and take advantage of parallelism, the compiler needs to alter the instruction sequence before sending it to the processor. The processor must therefore have a predictable internal behavior, so that the instruction reordering done by the compiler does not alter the essence of the algorithm and thereby change the expected results [3].

    This condition does not hold in superscalar processors, which currently have a very complex out-of-order execution engine able to work over a window that sometimes holds more than 126 instructions [5], analyzing them dynamically and establishing which instructions can be executed without affecting the program's expected result. Obtaining an execution plan at compile time is practically impossible for this kind of processor, since the compiler cannot determine how the dynamic scheduling hardware will organize the execution of instructions that have dependencies.

    This situation can be clearly seen when working with IA-32 processors, which are based on an Out of Order Execution core: the P6 microarchitecture (Pentium Pro, Pentium II and Pentium III) [5] and, currently, the NetBurst architecture of the Xeon processor used in this work. The Out of Order Execution model also resolves the latencies of instruction operand accesses when the operands are not stored in the L1 data cache [4], [5]. These latencies do not follow simple statistical estimation criteria and therefore remain hidden from the compiler, preventing it from generating an efficient Execution Plan as in the VLIW model. The Xeon's high performance comes at the cost of a very complex hardware design.

    TABLE III
    GSL VERSUS MKL LIBRARY PERFORMANCE RATIOS.

    Processor   Compiler   Double   Float
    Itanium2    icc        1.8      1.7
    Itanium2    gcc        1.9      1.5
    Xeon        icc        3.8      1.2
    Xeon        gcc        2.5      1.6

    TABLE IV
    PERFORMANCE RATIOS USING GCC AND GSL LIBRARY VERSUS ICC AND MKL LIBRARY.

    Ratio               Processor   Double   Float
    gcc+gsl / icc+mkl   Itanium2    3.7      4.9
    gcc+gsl / icc+mkl   Xeon        3.8      1.7

    TABLE V
    PROCESSOR PERFORMANCE RATIOS FOR EACH COMPILER AND LIBRARY COMBINATION.

    Ratio             Compiler and library   Double   Float
    Xeon / Itanium2   gcc + gsl              1.4      0.6
    Xeon / Itanium2   gcc + mkl              1.0      0.6
    Xeon / Itanium2   icc + gsl              2.8      1.2
    Xeon / Itanium2   icc + mkl              1.3      1.7

    Nevertheless, the ability of EPIC processors to transfer the complexity of building the Execution Plan to the compiler must not be overestimated, since in return it implies a dependency on the compiler, in many cases an excessive one; if proper development tools are not used, the results will not match expectations. This is evident in the results of Table V, where using a multi-platform compiler such as gcc makes the Xeon to Itanium 2 performance ratio even smaller than 1 in some cases.

    This occurs with float data types, which are 32-bit variables and thus not the data type that benefits most from a native 64-bit architecture such as the Itanium 2. The same table shows that with a processor-specific compiler such as icc, the results match expectations considerably better.

    In any case, there are situations whose outcome cannot be predicted, because the information required to resolve them is generated at execution time. These are: 1) branch instructions whose condition cannot be evaluated in advance; 2) code blocks that result from flow diagrams with multiple execution branches; 3) concurrent accesses to a resource, for example a memory address; and 4) data loads from memory with a nondeterministic access time, because it depends on the cache level in which the data is stored or, worse still, on whether the data is located in the system's DRAM. For the first three cases, the EPIC compiler employs the same speculative techniques as the superscalar processor hardware [3]. In general, algorithms build loops from conditional branch instructions that always have the same destination, except when the loop condition expires. Therefore, if the branch is assumed to follow the path corresponding to the true condition, the assumption will always be right except for the one time the loop ends. In that case, the processor hardware must be able to resolve the situation, at a cost in system performance. None of this speculative execution hardware is necessary in EPIC processors.
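The loop argument above can be made concrete with a minimal sketch that counts how often a fixed "branch taken" assumption is wrong for a loop's closing branch. The function `run_loop` and the iteration count are hypothetical; no real predictor or compiler heuristic is modeled beyond the static assumption itself.

```python
def run_loop(iterations: int):
    """Count mispredictions of a static 'always taken' guess for a loop branch."""
    branches = 0
    mispredicted = 0
    i = 0
    while True:
        i += 1
        taken = i < iterations      # the branch jumps back until the final pass
        branches += 1
        if not taken:               # the static assumption was "taken"
            mispredicted += 1
            break
    return branches, mispredicted


branches, misses = run_loop(1000)
accuracy = 1 - misses / branches
print(branches, misses, accuracy)
```

For a loop of 1000 iterations the assumption fails exactly once, on the exit, so the cost of recovering from the wrong guess is paid a single time per loop.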

    For branch instructions, superscalar processors use a hardware block named the Branch Target Buffer which, by means of speculative execution logic, maintains a cache of the most probable branch destination addresses [4]. For these reasons, the performance ratios in Table II show a different behavior when working with the icc compiler on the Itanium 2 processor, while on the Xeon processor a performance improvement exists but is not as significant.

    B. Library used for solving FFT functions incidence

    As for the mkl library, besides being originally compiled with the icc compiler, its value lies in the fact that its algorithms make extremely efficient use of the processor's resources [12]. Both processors have several resources for floating point and multimedia calculation, including parallel execution units for these kinds of operations, registers, and SIMD instructions [4], [5]. In the case of the Xeon processor, these resources are grouped in the SSE3 multimedia extensions [5].

    For its part, the Itanium 2 processor, although it can execute code compiled for the Xeon, also has its own multimedia calculation resources [8], [13]. In designing a scientific calculation library, two approaches can be taken: a) take maximum advantage of these resources, even resorting to assembly language optimizations, sacrificing portability; or b) design generic algorithms that avoid resources dependent on a particular architecture, sacrificing performance to ensure code portability across platforms.

    The first approach is that of the mkl library, which is designed by the manufacturer of both processors used in this study [12]. The gsl library, in contrast, was designed with priority on portability [14]. The performance ratios in Table III indicate that, when working with double data types, the contribution of the mkl library is appreciably greater on the Xeon processor than on the Itanium 2, because among its multiple multimedia resources the Xeon has eight 128-bit registers that enable it to process two double type values in a single instruction [5].
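The register arithmetic behind that statement can be sketched as follows. This is a model, not SIMD code: `packed_add` merely imitates one packed instruction acting on every lane of a 128-bit register, and the helper names and the eight-element arrays are assumptions of the example.

```python
import struct

REGISTER_BITS = 128
LANES_DOUBLE = REGISTER_BITS // (8 * struct.calcsize("<d"))   # 64-bit elements
LANES_FLOAT = REGISTER_BITS // (8 * struct.calcsize("<f"))    # 32-bit elements


def packed_add(a, b):
    """Model of one packed add: all lanes are added by a single instruction."""
    return tuple(x + y for x, y in zip(a, b))


def add_arrays(xs, ys, lanes):
    """Add two arrays lane group by lane group, counting modeled instructions."""
    out, instructions = [], 0
    for i in range(0, len(xs), lanes):
        out.extend(packed_add(xs[i:i + lanes], ys[i:i + lanes]))
        instructions += 1
    return out, instructions


xs = [float(i) for i in range(8)]
ys = [10.0] * 8
_, n_scalar = add_arrays(xs, ys, lanes=1)
sums, n_double = add_arrays(xs, ys, lanes=LANES_DOUBLE)
_, n_float = add_arrays(xs, ys, lanes=LANES_FLOAT)
print(LANES_DOUBLE, LANES_FLOAT, n_scalar, n_double, n_float)
```

In this model eight additions take eight scalar instructions, four packed instructions with doubles, and two with floats, which is the kind of gain a library written around the SIMD resources can exploit.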

    C. Processor specific compiler and library combination incidence

    Table IV shows significant performance improvements for applications developed using both compilers and libraries specifically designed for each processor. For general purpose applications, employing processor-specific development tools is not always the best policy. But in the design of a specific application, such as the biomedical signal processing of the present work, where high performance computing is a priority requirement, the whole system must be designed around the application's requirements.

    In such cases, using processor-specific compilers and scientific calculation libraries reduces the execution time of the same algorithm almost fourfold with 64-bit floating point data (double type) on both processors and, in the specific case of the Itanium 2 processor, 4.9 times with 32-bit floating point data (float type), a difference that can determine whether or not a system is able to perform real-time calculations. In real-time biomedical signal acquisition and processing for online decision making, for example in an emergency, this contribution is crucial.

    D. Performance relationship between processors according to the compiler library combination

    Comparing both processors on the basis of the performance ratios in Table V, the Itanium 2 proves more efficient when working with double data types, that is, 64-bit floating point, since this processor has many 64-bit registers in its register files, which minimizes memory accesses [8], [9], [13]. The table shows that the compiler is responsible for the main contribution to the Itanium 2 processor's performance, as expected for an EPIC processor. On the Xeon processor, however, the mkl library improves the performance ratio, because the library makes better use of the processor's multiple multimedia execution resources [4], [5], [12].

    ACKNOWLEDGMENTS

    The authors thank Intel Tecnología de Argentina SA for the donation of an Itanium 2 system, granted for the research project "Analysis of Heart Rate Variability, Arterial Pressure and Pulse in Normotensive and Hypertensive Subjects".

    Marcelo Risk is an investigator of CONICET (Consejo Nacional de Investigaciones Científicas y Técnicas), Ministerio de Educación, Ciencia y Tecnología, Argentina.

    REFERENCES

    1. Ramakrishna Rau B, Fisher Joseph A. Instruction-Level Parallel Processing: History, Overview and Perspective. Computer Systems Laboratory, HPL-92-132. October 1992.

    2. Alejandro Furfaro, Mariano Llamedo Soria, Julián S. Bruno, Nahuel Gonzalez, Marcelo R. Risk. Procesamiento Intensivo del ECG con procesadores IA-32 e IA-64. Facultad Regional Buenos Aires, Universidad Tecnológica Nacional.

    3. Michael S. Schlansker, B. Ramakrishna Rau. EPIC: An Architecture for Instruction-Level Parallel Processors. Compiler and Architecture Research, HPL-1999-111. February 2000.

    4. IA-32 Intel Architecture Optimization Reference Manual. Intel, 2005.

    5. IA-32 Intel Architecture Software Developer's Manual. Volume 1: Basic Architecture. Intel, 2005.

    6. Sobh J, Risk MR, Barbieri R, Saul P. Database for ECG, arterial blood pressure and respiration signal analysis: feature extraction, spectral estimation, and parameter quantification. IEEE-EMBC and CMBSC, vol 4, pp. 955-956, 1997.

    7. Risk M, Sobh J, Barbieri R, Saul P. A simple algorithm for QRS peak location: use on long term ECG recordings from the HMS-MIT-FFMS database. IEEE-EMBC, 1995.

    8. Intel Itanium Architecture Software Developer's Manual. Volume 1: Application Architecture. Revision 2.1. Intel, October 2002.

    9. Intel Itanium Architecture. Reference Manual for Software Optimization. Intel, November 2001.

    10. McNairy C, Soltis D. Itanium 2 Processor Microarchitecture. IEEE Micro, March-April, pp. 44-55, 2003.

    11. Sharangpani H, Arora K. Itanium Processor Microarchitecture. IEEE Micro, September-October, pp. 24-43, 2000.

    12. Intel. Intel Math Kernel Library. Reference Manual. Document Number: 630813-017.

    13. Hewlett Packard White Paper. Inside the Intel Itanium 2 Processor: an Itanium Processor Family member for balanced performance over a wide range of applications. July 2002.

    14. Galassi Mark, Davies Jim, Theiler James, Gough Brian, Jungman Gerard, Booth Michael, Rossi Fabrice. GNU Scientific Library Reference Manual, Edition 1.7, for GSL version 1.7.