
Fachhochschule Kaiserslautern

ANGEWANDTE INFORMATIK

— Practical Work and Bachelor Thesis —

Performance Assessment of State-of-the-Art Computing Servers for Scientific Applications

by

Axel Busch

01. 10. 2009

Fachhochschule Kaiserslautern
Fachbereich Informatik und Mikrosystemtechnik

Degree program "Angewandte Informatik"

Supervisor: Prof. Dr. Jörg Hettel
Second examiner: Prof. Dr.-Ing. Wilhelm Meier


Abstract

The Intel® Nehalem microarchitecture arrived recently on the market and provides several interesting new properties and functionalities. The performance per clock and the energy efficiency were increased. Simultaneous Multithreading (SMT) was reintroduced, and a functionality called Turbo Boost was implemented for the first time. The performance gain was measured using the popular benchmark suite SPEC CPU2006. The use of this suite also allowed the energy efficiency of the new architecture to be evaluated. An interesting observation was the impact of SMT and the Turbo mode. More precisely, the measurements were done for three flavours of Nehalem: the low-power L5520, the mid-range E5540 and the most powerful of the Nehalem series, the X5570.

All the measurements were compared to the previous Harpertown platform using different benchmarks: SPEC CPU2006 as a synthetic benchmark, as well as real physics applications like the ALICE framework, used in CERN's ALICE experiment, a benchmark called "test40" from CERN's GEANT4 suite for monitoring the single-thread performance, and a multithreaded benchmark, based on Intel's Threading Building Blocks (TBB), to observe the multithreading capabilities of the Nehalem.

The gain with SMT on small benchmarks was quite impressive, so it was decided to have a closer look at the overall benefit it provides for larger physics applications. For these tests, a benchmark from the GEANT4 suite and the ALICE framework were taken; the multithreaded benchmark based on Intel's Threading Building Blocks (TBB) was also used. The benchmarks were controlled by a Linux kernel facility called CPUset, which allows threads to run on defined subsets of CPUs. This provides a detailed insight into how the applications scale when forced onto subsets of CPUs, as well as how much benefit in performance and power consumption one can expect from SMT in practice.

With the help of CPUset it was also possible to reevaluate the older SMT implementation of the 2005 Xeon, introduced as Irwindale, and to say more about its performance.

Restricting processes to CPUsets by hand takes time and requires several steps, so a framework was created to automate this functionality and make it usable without being familiar with all the intermediate steps.

Finally, the technology and performance of a Solid State Drive (SSD) used with CERN's ALICE framework was examined, giving a rough view of its performance and how it could be used in the future.

The practical work comprises chapters 1 to 4; the bachelor thesis is chapter 5.


Affidavit

I, Axel Busch, born on 30 September 1986 in Pirmasens, hereby declare on my word of honour that I have written my bachelor thesis entitled "Performance Assessment of State-of-the-Art Computing Servers for Scientific Applications" independently and without outside help, and that I have used no aids other than those stated in the thesis; and that I have marked verbatim quotations from the literature as well as the use of other authors' ideas at the corresponding places within the thesis. I am aware that a false declaration can have legal consequences.

Zweibrücken, 01. 10. 2009


Contents

Abstract
Affidavit
1. Introduction
   1.1. The openlab
   1.2. State-of-the-art of processor technology
   1.3. Nehalem architecture and openlab
2. Energy measurements
   2.0.1. Hardware configuration
   2.0.2. System software setup
   2.1. Reference method of CERN IT for energy measurements
        2.1.1. Measurement equipment
        2.1.2. ZES-Zimmer LMG450 power analyzer
        2.1.3. Calculation
   2.2. Test software
        2.2.1. CPUBurn
        2.2.2. LAPACK
   2.3. Test execution
        2.3.1. CPUBurn and LAPACK
        2.3.2. Power measurement
        2.3.3. Example measurement
        2.3.4. Issues during energy consumption measurements
        2.3.5. Results of the energy measurements
3. Performance measurements
   3.1. Setting up the SPEC CPU2006 benchmark suite
   3.2. Running the SPEC CPU2006 benchmark
        3.2.1. Preparations
        3.2.2. Linux tool Anacron (at)
        3.2.3. Running SPEC CPU2006
   3.3. Summary of the test runs
   3.4. Results
4. Comparison between Intel's Harpertown and the Nehalem platform
   4.1. Software
        4.1.1. Setup of the tbb benchmark
        4.1.2. Setup of test40 from the Geant4 suite
   4.2. Hardware
   4.3. Power measurements
   4.4. Test procedure
   4.5. Running the benchmarks
        4.5.1. The tbb benchmark
        4.5.2. test40
   4.6. Results of test40 & the tbb benchmark
5. Evaluation of SMT
   5.1. CPUset
        5.1.1. Types of CPUsets
        5.1.2. Script for creation and allocation of the CPUsets
        5.1.3. Worker script
        5.1.4. Start script
        5.1.5. Testing the CPUset scripts
   5.2. Test procedure
   5.3. Benchmarks using CPUset
        5.3.1. test40
        5.3.2. tbb benchmark
        5.3.3. ALICE framework
   5.4. Issues with results of the observed tests
        5.4.1. Results of test40 & the tbb benchmark
        5.4.2. ALICE framework
        5.4.3. Comparison between SMT on the Irwindale and on the Nehalem
   5.5. Solid State Drive
        5.5.1. ALICE framework using a SSD
   5.6. Benchmark framework
        5.6.1. Motivation
        5.6.2. Framework architecture
        5.6.3. Implementation of the architecture
        5.6.4. Implementation of the job script
        5.6.5. Running the benchmarks using the job script
   5.7. Running test40 & the tbb benchmark at the same time on Nehalem
        5.7.1. Results
6. Conclusion and outlook
7. References
8. Acknowledgements
List of Figures
A. Appendix
   A.1. Results of the energy measurements in detail
        A.1.1. Results of the Nehalem CPUs
        A.1.2. Results of the Core i7 965
   A.2. Results of the performance measurements in detail
        A.2.1. Results of the Nehalem CPUs [SPEC marks]
        A.2.2. Results of the Core i7 965 [SPEC marks]
   A.3. Results comparison between Nehalem and Harpertown using test40 and tbb
   A.4. Sysstat runtime and I/Os for the ALICE framework using a SSD
   A.5. Shell script for compiling and executing of CPUBurn and LAPACK
   A.6. CPUset scripts
        A.6.1. cr_specific_cpusets.sh
        A.6.2. worker.sh
        A.6.3. start.sh
        A.6.4. bench.sh
   A.7. Framework
        A.7.1. cr_specific_cpusets.sh
        A.7.2. start.sh


1. Introduction

CERN, the European Organization for Nuclear Research, located in Geneva, Switzerland, has finished building and repairing the new particle accelerator LHC¹, which is installed about 100 meters underground. The LHC straddles the border between France and Switzerland and is now in a test and checkout stage. With a length of 27 kilometers it is the largest accelerator ever built. It allows us to deepen our knowledge of the basic principles of our universe, for example by analyzing the physical conditions fractions of a second after the Big Bang.

The four big particle detectors Atlas, CMS, Alice and LHCb create data at a rate between 300 and 1200 MB/s, even after the data has been filtered.[1] This huge amount of data is gathered by the CERN Computer Center, where it has to be stored quickly and is analyzed. In this way, about 15 PB of data are created in one year (an equivalent would be a 20 km high pile of CDs). For caching all the data there are several disk servers with fast disk drives. Later, this data is stored on large tape libraries like Sun's STK SL8500, which reaches a maximum capacity of 5 PB. Remote facilities all over the world analyze the data using the Worldwide LHC Computing Grid (WLCG), but the CERN Computer Center itself also provides relatively high computing performance.

The processing of the data in particular needs a lot of computing power. The CERN Computer Center, which was built in 1972, is limited to 2.9 MW due to the building's design and the currently available cooling technology. To increase the efficiency of the Computer Center, the only option is to upgrade to newer, and thus more powerful and more energy-efficient, processors. The CPUs currently in use are mainly from Intel's "Harpertown" series, which is based on the previous Intel Core2 microarchitecture.

1.1. The openlab

Openlab² is a collaboration between CERN and several industrial partners and is also accountable for building the grid network system. Openlab has existed for six years now, and thanks to the excellent relationship with all partners it has established itself as a reference in terms of collaboration with industry. The knowledge of CERN and the partners produces results leading to innovations in many areas. Partners commit to the collaboration for a period of three years and provide salaries for young researchers and summer students at openlab, products and services, and engineering capacity. Contributors commit for one year with a lower annual level of funding.

¹ Large Hadron Collider
² http://openlab-mu-internal.web.cern.ch/openlab-mu-internal/

In 2009, openlab started the third phase of the program, openlab-III, with Siemens joining as a new partner. Four broad areas of activity are established in the CERN openlab:

• Automation and Controls Competence Centre (ACCC): Together with the new partner Siemens, CERN works on improving the security of Programmable Logic Controllers (PLCs) and on opening automation tools towards software engineering and the handling of large environments.

• Database Competence Centre (DCC): Together with Oracle, the DCC focuses on data distribution and replication, monitoring and infrastructure management, highly available database services, application design, automatic failover and standby databases.

• Networking Competence Centre (NCC): Together with HP ProCurve, openlab wants to observe and understand the behaviour of large computing networks with more than 1000 nodes in High Performance Computing (HPC). Grid monitoring and messaging projects are realized together with EDS (an HP company).

• Platform Competence Centre (PCC): The PCC focuses on PC-based computing hardware and the related software. Together with Intel, important fields such as thermal optimisation, application tuning and benchmarking are covered. It also has a strong emphasis on teaching.

The openlab's partners are Intel, HP, Oracle, and Siemens. EDS (an HP company) joined as a contributor.[2]

1.2. State-of-the-art of processor technology

45 years ago, Gordon Moore described a law about the long-term trend of the complexity of integrated circuits. Today, this law is called Moore's law.


He predicted a doubling of the complexity, that is, of the number of transistors, every year, and later corrected this to every two years. Today, Moore's law is usually quoted as a doubling every 18 months.[3]
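Written as a formula (my own formulation of the statement above, not notation from the thesis), with N₀ the initial transistor count and t the elapsed time in months:

    N(t) = N₀ · 2^(t/18)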

In 2005, Intel introduced the first consumer CPU containing two physical cores: the Pentium Extreme Edition 840 at 3.20 GHz. Before that, the additional transistors were used to add new functionalities and, thanks to the higher transistor density, to increase the CPU's clock frequency per core. But increasing the frequency too much results in huge heat dissipation that can lead to self-destruction due to overheating, and it also increases the energy consumption of the CPU. High energy consumption leads to loud fans that carry the heat out of the system, and it decreases the battery lifetime of mobile systems. Thus, energy-saving technologies were developed, like disabling unused parts and decreasing the clock frequency while the CPU is in an idle state.[4]

At the beginning of the dual-core era, the clock frequency per core actually decreased. Thus, an individual core needs less energy and the total energy consumption decreases, but the total performance increases. The additional transistors are now also used to include more than one core, as parallel working units, on one chip.

Often, a single core's pipeline is not fully loaded. Due to the latency of memory or disk accesses, the core wastes a lot of time waiting for these accesses. To avoid many waiting cycles, which decrease the CPU's throughput, multithreaded CPUs (Simultaneous Multithreading) were introduced.[4] Depending on the architecture, instructions can be executed alternately or simultaneously to bypass waiting cycles. In multi-core architectures this effect is even worse: the number of execution units increases, but the performance of the memory subsystem does not grow accordingly. Thus, several cache levels were introduced. For single- and dual-core CPUs, two cache levels were enough. Today, up to 4 physical or 8 logical cores are available on a single CPU, so a third level was introduced to allow communication between the individual cores, bypassing the slow memory subsystem to exchange shared data.

1.3. Nehalem architecture and openlab

Intel's current processor generation is nicknamed "Nehalem". In Intel's "tick-tock model"³ a tick represents a smaller and more efficient version of the microarchitecture, with an increased transistor density, while a tock is a new version of Intel's microarchitecture. Nehalem is a tock. In this architecture Intel reintroduced SMT (Simultaneous Multithreading), previously implemented in Intel's Pentium 4 as Hyper-Threading Technology (HT Technology). The aim of SMT is to increase the throughput of the CPU when more than one thread is running. Another new feature of Nehalem is the "Intel Turbo Boost Technology" (Turbo mode). It allows automatic overclocking of the CPU cores when the entire processor is below the specified Thermal Design Power (TDP). This is often the case when only one core is loaded and all the other cores are idle. That said, Turbo mode is also possible when all cores are running under some load. Nehalem can dynamically overclock the cores by up to two "speed bins". One speed bin corresponds to the QPI's base frequency of 133 MHz.

³ http://www.intel.com/technology/tick-tock/index.htm
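A quick arithmetic illustration of the figures above (my own example, not from the text): with two speed bins of 133 MHz each, an X5570 with a base clock of 2.93 GHz could reach roughly

    2.93 GHz + 2 · 0.133 GHz ≈ 3.20 GHz

when the thermal headroom allows it.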

Another new feature is the integrated memory controller. It allows the CPU to access the local memory with lower latency. As a result the CPU⁴ spends less time waiting for uncached data and can work more efficiently. The Nehalem memory controller can access up to 3 DDR3 channels. Additionally, this new architecture implements Intel's QuickPath Interconnect (QPI), a point-to-point connection between the chipset and the CPU. This increases scalability and throughput, and replaces Intel's Front Side Bus. QPI is also used for communication between the individual cores. The third improvement is the addition of an L3 cache with a capacity of 8 MB.[5]

Also new is the possibility to almost completely deactivate each core to save power. This is known as "C6 mode" and is used to decrease the consumption of individual cores when they are unused. Depending on the workload, it is possible to put up to 3 cores into C6 mode.

With all of these additions a new socket was deemed necessary. The current socket is the LGA1366, with a declared TDP of 130 W.

The close collaboration with Intel allows openlab to evaluate the new Nehalem CPUs long before they are available on the open market. The intention is to calculate how many processors are needed to sustain the amount of data the detectors create, and also to give the physicists a hint on how they should design their software to reach maximum performance. The CPU frequency will not increase significantly anymore, but at the same time new technologies like multi-core and SMT are coming up. So openlab analyzes the impact of these new technologies and looks for problems in applications that cost a lot of execution time. At the same time, the energy consumption and the resulting costs are calculated. Beyond that, the data can also be important for the FIO⁵ group at CERN for their decision on which kind of processors to buy in the next round.

⁴ CPU = processor
⁵ Fabric Infrastructure and Operations (http://it-div-fio.web.cern.ch/it-div-fio/)


2. Energy measurements

Multiple energy measurements are executed for each server configuration. The following BIOS configurations are used for each hardware configuration:

• SMT mode off and Turbo mode off

• SMT mode on and Turbo mode off

• SMT mode off and Turbo mode on

• SMT mode on and Turbo mode on

2.0.1. Hardware configuration

Three Nehalem processor flavours, the L5520 at 2.26 GHz, the E5540 at 2.53 GHz and the X5570 at 2.93 GHz, are installed in turn on a dual-socket Intel S5520UR motherboard. This motherboard supports up to 12 DDR3 DIMMs of memory, running at 800, 1066 or 1333 MHz.[6] With each of the three flavours the board is equipped with twelve 2 GB memory modules. It is also equipped with two 500 GB SATA drives. The memory runs at 1066 MHz.

A second test system is Intel's SX58SO desktop board for the Intel Core i7 965 at 3.20 GHz. This motherboard is a single-socket system, equipped with one Intel Core i7 CPU and supporting a maximum of 8 GB RAM (1066, 1333 or 1600 MHz modules).[7] For the tests it is equipped with 6 GB of memory running in 1333 MHz mode.

2.0.2. System software setup

Scientific Linux CERN 5.3 (SLC 5) is used on both machines for all tests. SLC 5 is based on Red Hat Enterprise Linux 5 (Server) within the framework of Scientific Linux. The standard SLC 5 kernel in version 2.6.18-128.1.1.el5 is used for all the tests and measurements.


Figure 2.1.: Nehalem system with Intel’s S5520UR Motherboard

2.1. Reference method of CERN IT for energy measurements

CERN is interested in the overall power consumption of a computer system with all its components, including the system's power supply. Highly efficient power supplies can reach an efficiency of about 91-97 %. The efficiency is higher under load (about 0.97) than in the idle state (about 0.91). The power supply loses energy while transforming the incoming AC voltage into the different voltages needed by the motherboard, CPU and all the other components in the system.
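A small numeric illustration using the efficiencies quoted above (my own arithmetic, not a measurement from the thesis): a server drawing 300 W from the mains at an idle efficiency of 0.91 delivers about 0.91 · 300 W ≈ 273 W to its components; the remaining ≈ 27 W are dissipated as heat in the power supply.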

2.1.1. Measurement equipment

Hardware

For the energy measurements, a high-precision LMG450 power analyzer from ZES-Zimmer Electronic Systems is deployed. The power analyzer can be connected via the RS232 interface to a laptop, allowing the current energy consumption to be read. Each measurement is carried out two times: first, the energy consumption is measured in an idle state, and then, in a separate run, it is measured while the server is under load. The results are then sent to a specified email account by a script running on the laptop. Finally, the average result is written to a spreadsheet.

[Figure 2.2.: Power test setup (the test system's power cable runs through a measurement adapter and power strip into the power analyzer, which a workstation reads out over a measurement cable and a network connection)]

2.1.2. ZES-Zimmer LMG450 power analyzer

The ZES-Zimmer LMG450 power analyzer allows the measurement of common power electronics as well as network analysis. It has an accuracy of 0.1 % and can measure four channels at the same time. For openlab only three values are of importance:

• Active Power (P): The active power is also often called "real" power and is measured in Watts (W). If the active power is integrated over time, the kilowatt-hours value is obtained.

• Apparent Power (S): Apparent power is the product of voltage (in Volts) and current (in Amperes) in the loop. It describes the consumption from the electrical circuit and is measured in VA.


• Power Factor: Here, the power factor expresses the efficiency of the power supply. The closer the power factor is to one, the better the efficiency (power factor = active power / apparent power).[8, 9]

Software

The power analyzer is connected to a laptop by the RS232 interface. On the laptop, some Python scripts written by Alex Iribarren (CERN IT/FIO) and a "runpower.sh" shell script are running.

The Python scripts are designed to interact with a PL-2303 USB-to-Serial Bridge Controller that translates the serial commands to USB commands. They send commands to the power analyzer and receive the measurements. The recorded measurements are limited to "real power", "apparent power" and "power factor". At the end of the measurement, the scripts create a CSV¹ file and append the average of the three recorded values. The measurement interval of the Python script is adjustable between 1 and 60 seconds.

The "runpower.sh" script is designed to provide a more user friendly interface,to send the results to the given mail address and copy the results to the networkdistributed file system AFS2.

2.1.3. Calculation

For its energy measurements, CERN IT has decided that a given server should be assumed to be in an idle state 20 % of the time and under load 80 % of the time. Therefore, this mix was used as the reference.
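Expressed as a formula (my own shorthand, with P_idle and P_load the average measured idle and load power):

    P_mix = 0.2 · P_idle + 0.8 · P_load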

2.2. Test software

The idle power consumption is measured while only the standard operating system is running, with no additional applications or components. To measure the servers under load, it has been ensured that the machines are truly fully loaded by using CPUBurn to stress the CPU cores.

Since more and more memory is installed in modern multicore servers, the energy consumption of the memory becomes increasingly significant. For this reason, LAPACK is used to generate a memory load in parallel to CPUBurn.

¹ Comma Separated Values
² Andrew File System


2.2.1. CPUBurn

CPUBurn was originally designed as a tool for overclockers to stress their overclocked CPUs and check whether they run stably. It can report if an error occurs while the benchmark is running. It runs Floating Point Unit (FPU) intensive operations to put the CPUs under full load, allowing the highest power consumption to be measured.[10]

2.2.2. LAPACK

LAPACK (Linear Algebra PACKage) is written in Fortran90 and used to load the memory system and the CPU. It provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems.[11] LAPACK's memory consumption depends on the size of the generated matrices.

2.3. Test execution

First, the power analyzer is connected into the mains circuit between the wall plug and the power supply of the server.

2.3.1. CPUBurn and LAPACK

CPUBurn runs on half of the CPU cores and LAPACK on the other half. How many instances of CPUBurn and LAPACK are started depends on the chosen BIOS configuration. With the reintroduction of SMT on Nehalem, it is necessary to pay attention to the LAPACK configuration: for each memory configuration, LAPACK has to be modified and recompiled to make sure that the memory subsystem is filled up without swapping to disk. Swapping memory would not match the test scenario, because in practice the calculations must not use swap memory, for performance reasons. If SMT is activated there are twice as many logical cores, so twice as many CPUBurn and LAPACK instances have to be started. In that case, it is necessary to recompile LAPACK with smaller matrices to make sure that the higher number of LAPACK instances does not overload the memory subsystem. Thus, the size of the matrices has to be chosen according to the memory configuration and the number of LAPACK instances; a sketch of this sizing rule follows below.
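A minimal sketch of that sizing rule (my own illustration; the variable names are hypothetical and not taken from the powerbench.sh script):

CORES=$(grep -c "^processor" /proc/cpuinfo)   # logical cores, doubled when SMT is on
INSTANCES=$((CORES / 2))                      # LAPACK runs on half of the cores
TOTAL_MB=24576                                # installed memory, here 24 GB
MB_PER_INSTANCE=$((TOTAL_MB / INSTANCES))     # target memory per LAPACK instance
echo "each LAPACK instance may use about ${MB_PER_INSTANCE} MB"

With SMT enabled, CORES doubles, so the memory target per instance halves and smaller matrices are required.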

Adjusting these settings by hand is error-prone, so a small shell script "powerbench.sh" is available that can be used to compile LAPACK and CPUBurn. The script identifies the number of available CPU cores by itself and starts CPUBurn on half of the cores and LAPACK on the other half. But the script does not adjust the matrix settings needed to fill up the memory with data. There are two options to get the correct settings:

• It is possible to store already compiled binaries so that they do not need to be recompiled each time. These binaries can be used as LAPACK; just a symbolic link has to be established:

ln -s lapack_1480mb lapack

Listing 2.1: LAPACK

In this case, a symbolic link named "lapack" is set to the file "lapack_1480mb". The "powerbench.sh" script will search for this file. If it is already there, it will be used for the calculations without recompiling, and about 1480 MB of memory per process and core will be used.

• If no binaries are available, a new source file has to be created. There are already several template files which can be used, for example "1000d_1480mb.f". It contains the algorithm of the calculation and information about the size of the matrix; only the size of the matrix is crucial here. The first lines of the file are responsible for the size of the matrices and thus for the memory consumption:

double precision a(13864,13863),b(13863),x(13863)
[..]
integer ipvt(13863)
lda = 13864
[..]
n = 13863

Listing 2.2: 1000d_1480mb.f

To calculate matrix dimensions for other memory configurations, the following consideration applies: if the size of the matrix is doubled, the amount of memory increases by a factor of four. So the following formula can be used to calculate the matrix size for the desired amount x of memory in MB (the constants are the values of the 1480 MB configuration): leading dimension = 13864 · √(x/1480), rounded down to an integer. The second value for the matrix is the first value decreased by one.
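A worked example (my own arithmetic): for the roughly 5780 MB per instance used in the load measurement example later in this chapter, the leading dimension would be about 13864 · √(5780/1480) ≈ 27398, so the declarations would use 27398 and 27397.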

After modifying the file, it has to be declared in "powerbench.sh":


g77 -llapack -o lapack 1000d_1480mb.f

Listing 2.3: powerbench.sh

If "LAPACK" and "powerbench.sh" are configured and modified everything isprepared for starting the test procedure. Start command:

./powerbench.sh

Listing 2.4: Starting powerbench.sh

The "powerbench.sh" script

This section gives an overview of how the powerbench.sh script works. Detecting the type and the number of cores:

CPU=`grep vendor_id -m 1 /proc/cpuinfo | awk '{print $3}'`
CORES=`grep -c "^processor" /proc/cpuinfo`

Listing 2.5: Detect CPUs

Compiling the benchmarks:

echo "Compiling benchmarks..."
gcc -s -nostdlib -m32 -o cpuburn $burn || fail "Error: Unable to compile cpuburn test"
g77 -llapack -o lapack 1000d_1000mb.f || fail "Error: Unable to compile lapack test"

Listing 2.6: Compile the benchmarks

Running the benchmarks:

echo "Running benchmarks..."
for i in `seq $CORES`;
do
    # Run a cpuburn and a lapack on alternating cores
    let even=$i%2
    if [[ $even -eq 0 ]]; then
        echo "Launching lapack"
        ./lapack >/dev/null || fail "Lapack died" &
    else
        echo "Launching cpuburn"
        ./cpuburn >/dev/null || fail "Cpuburn died" &
    fi
done


Listing 2.7: Running the benchmarks

Waiting for the power measurements:

echo "Please wait 5 minutes for power consumption to stabilize..."
sleep 5m
echo "Start reading the power consumption now"
sleep 30m
echo "Stop averaging now."

Listing 2.8: Waiting for power measurements

And finally killing all the running CPUBurn and LAPACK processes:

echo "Cleaning up..."
sleep 1m # Wait a bit so power consumption doesn't drop suddenly
pkill cpuburn
pkill lapack
echo "All done!"

Listing 2.9: Cleaning up

2.3.2. Power measurement

While "powerbench.sh" is running, the measurement of the energy consumption can be started, using the "runpower.sh" script on the laptop that is connected to the power analyzer. This script in turn calls a Python script that communicates with the power analyzer over the RS232 interface. Several parameters can be passed to the "runpower.sh" script:

./runpower.sh -m DESCRIPTION -t STATE -l DURATION -e [email protected]

Listing 2.10: Usage of runpower.sh

DESCRIPTION: a label used to distinguish the individual results later on.
STATE: "idle" if the measurement takes place while the server is idle, "load" if it takes place under load.
DURATION: the duration of the measurement, for example "20m" for 20 minutes or "60s" for 60 seconds.
-e [email protected]: defines the mail address the final results will be sent to.

Example:


./runpower.sh -m opladev27_cpu-X5570_smt-on_T-off -t idle -l 20m -e [email protected]

Listing 2.11: Example for runpower.sh

Here, an idle measurement is started on a server called opladev27 with the Intel X5570 CPU. In the BIOS configuration, SMT is switched on and Turbo mode is switched off. The measurement will be carried out for 20 minutes and the final results will be sent to the given address.

2.3.3. Example measurement

Idle measurement

No process that needs a significant amount of CPU time may be running. During the idle measurement, only the runpower.sh script runs on the laptop that is connected to the power analyzer:

sleep 5m; ./runpower.sh -m opladev27_cpu-X5570_smt-on_T-off -t idle -l 20m -e [email protected]

Listing 2.12: Example for runpower.sh

The command "sleep 5m" waits for 5 minutes to let the system's energy consumption stabilize. After this sleep time, the "runpower" script records the values from the power analyzer for 20 minutes. After that, it sends the results as an e-mail.

Load measurement

No process should run that needs a significant amount of CPU time. It must also not be forgotten to point the LAPACK symbolic link at the right file, so that the memory of the server is nearly fully loaded without swapping memory to disk. In the case of 24 GB of memory and 8 cores, 24 / (8/2) = 6 GB of memory are needed for each LAPACK instance (LAPACK runs on half of the CPU cores and CPUBurn on the other half). Thus, the symbolic link should point at a file that needs about 6 GB of memory. In this example it is the "lapack_5780mb" file, because it needs about 5780 MB per instance:

ln -s lapack_5780mb lapack

Listing 2.13: System link

A symbolic link named "lapack" is created to "lapack_5780mb". To carry out the measurement, the following command has to be executed:


./runpower.sh

Listing 2.14: Runpower.sh

Now, the laptop needs to be told to start the measurement, again with a 5-minute delay to let the server's power consumption stabilize:

sleep 5m; ./runpower.sh -m opladev27_cpu-X5570_smt-on_T-off -t load -l 20m -e [email protected]

Listing 2.15: Example for starting "runpower.sh"

2.3.4. Issues during energy consumption measurements

Three issues occurred while carrying out the measurements:

• The laptop used for power measuring often could not initialize the power analyzer a second time without unplugging it from USB, reloading the driver and replugging it.

• The measurements have to take place in the Computer Center, because the measurement process of the power analyzer has to be started physically from the laptop, and firewall restrictions forbid remote access.

• To change the BIOS settings physical access to the server is needed.

To improve the efficiency of the power measurements, it is essential to fix these three problems.

Initialization problem

First, the initialization problem has to be solved, because if the power analyzer has to be replugged after every single run, the Computer Center has to be accessed anyway.

One of the Python scripts communicating with the power analyzer had a problem establishing a connection:

if not self.writeConfirm("*RST"):
    # Bail out if we don't get a response
    raise ZesException, "No power meter found, check settings " + \
        "(57600 baud, 8N1, LF, Echo off, RTS/CTS)"

Listing 2.16: zes.py

If it is not possible to connect to the power analyzer, an exception is thrown. A workaround avoids this problem:

# Try up to seven times to reset the reader to default values
for i in range(1, 8):
    if self.writeConfirm("*RST"):
        break
    if i == 7:
        # Bail out if we don't get a response
        raise ZesException, "No power meter found, " + \
            "check settings (57600 baud, 8N1, LF, Echo off, RTS/CTS)"

Listing 2.17: zes.py

If all seven tries fail, the exception is still thrown, but during the tests the error did not occur again. Thus, seven tries seem to be enough to avoid this problem.

Reverse access to the laptop

An SSH³ daemon runs on the laptop, but access is restricted by the firewall. SSH supports "reverse port forwarding", which allows access to a service a firewall might be blocking.[12, 13]

The laptop establishes a reverse SSH tunnel (as root) to a server at the Computer Center (olbl0101).

ssh root@olbl0101 -R 10022:localhost:22

Listing 2.18: Laptop -> Server

Now, port 10022 on olbl0101 is bound to port 22 on the laptop with the connected power analyzer (see figure 2.3). It is then possible to connect from the office PC to olbl0101's port 10022, which is redirected to port 22 on the laptop. Once the connection to olbl0101 on port 10022 is established, you end up at the laptop's port 22.

ssh root@olbl0101
ssh root@localhost -p 10022

Listing 2.19: Office PC -> Server -> Laptop

³ Secure Shell


Thus, it is possible to remote-control the laptop at the Computer Center without staying there all the time.

[Figure 2.3.: Reverse access using SSH (the workstation connects over the CERN network to the Computer Center server's port 10022, which the reverse SSH tunnel forwards past the firewall, blocking port 22, to the laptop with the power analyzer)]

Using IPMI for changing the BIOS settings

Intelligent Platform Management Interface (IPMI) is a collection of interfaces for administrating a computer. With IPMI it is possible to communicate remotely with a machine. It is operating-system independent and allows the power to be switched on and off, even if the machine is powered off. Serial Over LAN (SOL) allows the serial port to be redirected to the network. Thus, it is possible to get access to the BIOS of the remote machine.

To use IPMI, the motherboard needs a second autonomous controller called the BMC (Baseboard Management Controller). This controller keeps working when the machine is powered off, listening on the LAN. The BMC has its own firmware, so it is independent of the system's BIOS, and it has its own IP address.[14, 15]

To use this feature, a SOL session has to be initialized on the remote machine. Thanks to the "ipmitool" utility for Linux, it is possible to activate it from the console:


ipmitool -I INTERFACE -H IPMI-ADDRESS -U USERNAME -P PASSWORD sol payload enable CHANNEL_NUMBER USERID

Example:
ipmitool -I lanplus -H opladev27-ipmi -U openlab -P 123456 sol payload enable 1 6

Listing 2.20: Initialization of a SOL session

To connect to the SOL session from another machine:

ipmitool -eE -I INTERFACE -H IPMI-ADDRESS -U USERNAME -P PASSWORD sol activate

Example:
ipmitool -eE -I lanplus -H opladev27-ipmi -U openlab -P 123456 sol activate

Listing 2.21: Connect to the SOL-session

SOL creates traffic, thus it has to be allocated as payload. The userID defined in the BMC for openlab is 6, and channel 1 is used. With different userIDs it is possible to restrict the privileges of different users. The "lanplus" interface communicates with the BMC over a LAN connection. It uses the RMCP+ protocol, which allows, for example, improved authentication, data integrity checks and data encryption. Furthermore, the sol activate command requires the "lanplus" interface to run.
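For completeness, an active SOL session can be closed again with the standard ipmitool subcommand (standard ipmitool usage, not shown in the thesis):

ipmitool -I lanplus -H opladev27-ipmi -U openlab -P 123456 sol deactivate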

[Figure 2.4.: IPMI scheme (a workstation/remote console reaches the BMC of the managed server over LAN/WAN)]

2.3.5. Results of the energy measurements

Figure 2.5 shows the CERN IT energy measurements. These measurements do not convey much information about the throughput efficiency of the system. It is therefore necessary to perform the performance measurements with SPEC CPU2006, in order to correlate the achieved SPEC marks with the energy consumption.

[Figure 2.5.: CERN IT power measurements: 20 % idle and 80 % load mix (power consumption in W of the L5520, E5540, X5570 and Core i7 965 for the four SMT on/off and Turbo on/off combinations)]


3. Performance measurements

The CPU performance measurements were conducted using the SPEC CPU2006 benchmark from the SPEC¹ corporation. SPEC CPU2006 is an industry-standard benchmark suite, designed to stress a system's processor and its cache and memory subsystem. The suite contains real user applications, and the source code is available. A High Energy Physics (HEP) working group has demonstrated a load correlation between the SPEC results and High Energy Physics applications. The C++ subset of the tests from the SPEC CPU2006 benchmark suite is used, as it covers the requirements of CERN.

Hardware and software configuration

To be able to correlate the power consumption with the performance results, it is mandatory to use the same hardware and software configuration for all measurements. Thus, the hardware and software configuration is the same as for the power measurements. SPEC CPU2006 is compiled with GCC 4.1.2, GCC 4.3.3 and Intel's ICC v11.0, each in 32- and 64-bit mode. The results show the differences between the compilers and the respective compiler versions. A completely new SPEC CPU2006 suite is set up for these tests.

3.1. Setting up the SPEC CPU2006 benchmark suite

The SPEC CPU2006 benchmark suite comes as a tar file that has to be unpacked. An additional tar file (configs.tar) was built to make it easier to install the SPEC suite without searching for all the files of a previous installation. This tar file also has to be unpacked.

For the installation, the provided "install.sh" script can be used; this requires being logged in as administrator. The "SPEC2006/run" folder has to be replaced with the "run" folder from "configs.tar". This new folder needs the Linux permissions 0755. Then the folder "SPEC2006/gcc_01/SPEC2006_v11/config" has to be replaced with the "config" folder of "configs.tar"; again the permissions should be 0755. From "configs.tar" the three files "run_spec_job.gcc-4.1.2", "run_spec_job.gcc-4.3.3" and "run_spec_job.intel-11.0" are copied to "SPEC2006/gcc_01/SPEC2006_v11/".

¹ Standard Performance Evaluation Corporation (http://www.spec.org)
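The steps above as a shell transcript (a sketch under the assumption that the archives are named SPEC2006.tar and configs.tar and unpack into SPEC2006/ and configs/; the exact archive names are not given in the text):

tar xf SPEC2006.tar && tar xf configs.tar
cd SPEC2006 && ./install.sh && cd ..      # run the provided installer as administrator
rm -rf SPEC2006/run
cp -r configs/run SPEC2006/run            # replace the run folder
chmod -R 0755 SPEC2006/run
rm -rf SPEC2006/gcc_01/SPEC2006_v11/config
cp -r configs/config SPEC2006/gcc_01/SPEC2006_v11/config
chmod -R 0755 SPEC2006/gcc_01/SPEC2006_v11/config
cp configs/run_spec_job.gcc-4.1.2 configs/run_spec_job.gcc-4.3.3 \
   configs/run_spec_job.intel-11.0 SPEC2006/gcc_01/SPEC2006_v11/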

The "SPEC2006/run" folder contains 3 important subfolders:

Figure 3.1.: Subfolders of "SPEC2006/run"

• "build_spec" contains 3 shell scripts for compiling SPEC CPU2006, adjusted for GCC 4.1.2, GCC 4.3.3 and the Intel compiler v11.0. A 4th file is a template for configuring another compiler version that is not included yet.

• "config" contains several files that describe settings for the different compilers, like compiler flags, optimization levels, CPU bit modes, and settings for each of the different benchmarks the suite contains. There are also md5 checksums; they make sure that the compiler version of the config file matches the compiler version that will be used for compiling the suite. If not, the SPEC suite has to be recompiled (described later).

• "jobs" contains shell scripts that initiate the "run_spec_job.*" scripts, which start the actual benchmark with the desired bit mode and a job description for a better distinction of the results. The scripts are called by "bench.sh", which is also located in the "run" folder.

Now, some settings have to be adjusted. In "SPEC2006/run/jobs/job_gcc-4.1.2.sh", "SPEC2006/run/jobs/job_gcc-4.3.3.sh" and "SPEC2006/run/jobs/job_intel-11.0.sh" a user-specific path has to be set:

#!/bin/bash

bit=$1
run=$2

. ./desc.sh
#### own path has to be set ####
cd /data2/abusch/SPEC2006/gcc_01/SPEC2006_v11/


Listing 3.1: SPEC2006/run/jobs/job_intel-11.0.sh

In "SPEC2006/run/build_spec/build_spec-gcc-4.1.2", "SPEC2006/run/build_spec/build_spec-gcc-4.3.3" and "SPEC2006/run/build_spec/build_spec-intel-11.0" the specific root path has to be set:

#!/bin/bash

#### root-path has to be set ####
rootpath=data2/abusch
compiler_path="/opt/intel/Compiler/11.0/081"

dir=$1
bit=$2

cd /${rootpath}/SPEC2006/${dir}/SPEC2006_v11
. ./shrc

# setup compiler environment
. /opt/intel/setup.sh
if [ "-${bit}-" == "-32-" ]; then
    . ${compiler_path}/bin/iccvars.sh ia32
else
    . ${compiler_path}/bin/iccvars.sh intel64
fi

# clean source trees
runspec --config=linux${bit}-intel-11.0_cern --action=clean all_cpp
# remove exe for the run
ext=$(grep ^ext config/linux${bit}-intel-11.0_cern.cfg | awk '{print $3}')
find benchspec/CPU2006/*/exe/ -type f | grep "\.${ext}\$" | xargs -itoto rm -f toto

runspec --config=linux${bit}-intel-11.0_cern --action=build all_cpp

Listing 3.2: SPEC2006/run/build_spec/build_spec-intel-11.0

"SPEC2006/run/desc.sh" has to be modified to set the specific CPU and memorysettings. This file is used for identify the single results later on. It is possible to

21

Page 29: Fachhochschule Kaiserslautern ANGEWANDTE INFORMATIK

CHAPTER 3. PERFORMANCE MEASUREMENTS

influence it by modifying the memory, SMT, turbo mode, SpeedStep and CPUname description:

# description of the running configuration
# for Nehalem-L5520_SMT-off_turbo-on-SpeedStep=on_memory-12x2GB
MEM_CONFIG="12x2GB"
SMT_CONFIG="off"
TURBO_CONFIG="on"
SPEED_STEP_CONFIG="on"
CPU_NAME="L5520"
DESC="cpu-${CPU_NAME}_SMT-${SMT_CONFIG}_turbo-${TURBO_CONFIG}_SpSt-${SPEED_STEP_CONFIG}_memory-${MEM_CONFIG}"

Listing 3.3: SPEC2006/run/desc.sh

After that, the configuration of SPEC CPU2006 is finished and the suite can be compiled:

./SPEC2006/run/build_spec/build_spec-gcc-4.1.2 PATH BIT-MODE
./SPEC2006/run/build_spec/build_spec-gcc-4.3.3 PATH BIT-MODE
./SPEC2006/run/build_spec/build_spec-intel-11.0 PATH BIT-MODE

Listing 3.4: Compilation of SPEC CPU2006

• PATH means the "gcc_01" path with the sources and execution files of "SPEC CPU2006"

• BIT-MODE means 32 or 64 CPU bit mode

• "build_spec" contains scripts for the different compilers, which allow the suite to be compiled with different compiler versions. These versions can be used to test different compilers for performance differences on the same hardware and software environment.

Example for compiling the suite with GCC² version 4.1.2 in 32-bit mode, with the source files located in folder "gcc_01":

./SPEC2006/run/build_spec/build_spec-gcc-4.1.2 gcc_01 32

Listing 3.5: Example for compiling SPEC CPU2006

Now, the SPEC CPU2006 benchmark suite is set up and ready to go.

² GCC - GNU Compiler Collection (http://gcc.gnu.org/)


3.2. Running the SPEC CPU2006 benchmark

When SMT is disabled, the benchmarks run for about 4 hours, and for about 6 hours when SMT is enabled. Due to the high number of tests to be carried out, it is necessary to think about a mechanism to start a complete test series automatically. A test series is a run with the different GCC versions and the Intel compiler, each of them tested in 32- and 64-bit mode.

The performance measurements were also carried out with different BIOS settings:

• SMT mode off and Turbo mode off

• SMT mode on and Turbo mode off

• SMT mode off and Turbo mode on

• SMT mode on and Turbo mode on

In total, four test series, each comprising six single SPEC jobs, have to be completed (32 and 64 bit for each of the three compilers).

3.2.1. Preparations

To make sure that the single SPEC CPU2006 test runs start automatically, it was necessary to create several scripts, which were already mentioned in the description of the installation of the suite. These scripts were written by Julien Leduc (CERN openlab).

Bench.sh script

First, there is the "bench.sh" script. It is used as a primitive scheduler: a lockfile prevents several runs from starting at the same time. For this, "bench.sh" creates a lockfile in "/tmp/".

lockfile="/tmp/bench"

Listing 3.6: Listing of bench.sh

If there is no lockfile yet, the script creates one and writes its own PID to this file. If a race condition between several bench instances occurs, the losing instance is rescheduled. The winner is executed, and at the end, after the run, the lockfile is deleted. If a lockfile is already in place, the benchmark is likewise rescheduled.

if [ ! -e ${lockfile} ]; then
    echo $$ > ${lockfile}
    sleep 2
    if [ "-$$-" != "-$(cat ${lockfile})-" ]; then
        reschedule
        exit 0
    else
        echo "executing $@"
        $@
        rm ${lockfile}
    fi
else
    reschedule
fi

Listing 3.7: Listing of "bench.sh"

The reschedule function calls the Linux tool "at". It reschedules the SPEC CPU2006 benchmark and tries to execute it after the period indicated in the script, here 30 minutes later:

period="30 minutes"

command_to_run="$@"

reschedule () {
    echo "rescheduling job ${command_to_run}"
    at now + ${period} <<EOF
$0 ${command_to_run}
EOF
    exit 0
}

Listing 3.8: Reschedule function of "bench.sh"

Job script

"SPEC2006/run/jobs/" contains several scripts to execute the right job and thecorrect GCC and ICC3 version. It just saves the given parameters and calls "desc.sh"to set the required description parameters.

³ ICC - Intel C++ Compiler


bit=$1
run=$2

. ./desc.sh

Listing 3.9: Listing of "job_intel-11.0.sh"

Then the script changes to the right directory and calls the appropriate script for the final execution:

cd /data2/abusch/SPEC2006/gcc_01/SPEC2006_v11/
./run_spec_job.intel-11.0 ${bit} run-${run}_intel-11.0_bit-${bit}_spec-allcpp_${DESC}

Listing 3.10: Listing of "job_intel-11.0.sh"

"Run SPEC" script

The run SPEC script is built to run as many instances of SPEC CPU2006 as CPU cores are available. First, it again saves several input parameters and sets up the compiler environment, here for the Intel compiler:

bit=$1
run_base_name=$2

# setup compiler environment
. /opt/intel/setup.sh
. /opt/intel/Compiler/11.0/069/bin/iccvars.sh intel64

. ./shrc

Listing 3.11: Listing of "run_spec_job.intel-11.0"

Now, the directory for the results is created and the benchmark is started as many times as CPU cores are available:

mkdir -p results/${run_base_name}

COUNT=`grep -c "^processor" /proc/cpuinfo`;
for jobid in `seq $COUNT`;
do
    runspec --config=linux${bit}-intel-11.0_cern.cfg --nobuild all_cpp 2>&1 > results/${run_base_name}/output_linux${bit}-intel-11.0_cern_${jobid}_job_${COUNT} &
done
wait

Listing 3.12: Listing of "run_spec_job.intel-11.0"

The instances are forked n times and sent to the background. It is then necessary to wait for the end of the individual processes, because the "bench.sh" script waits for the end of all SPEC CPU2006 instances before deleting the lockfile. Thus "run_spec_job.intel-11.0" also has to wait; otherwise it would not be possible for "bench.sh" to detect the end of the benchmark run, and the lockfile could not be deleted.

3.2.2. Linux tool Anacron (at)

Anacron is a periodic command scheduler that executes commands after a certain interval. This feature can be used to add benchmark jobs to a queue: all jobs can be added to the queue at the same time, and anacron executes them one by one. The "bench.sh" script is designed to support this feature. To attach a job to the queue, the following command is used:

at TIME <<EOF

Listing 3.13: Anacron

TIME can be "now" or a time in the future:

at now <<EOF
OR
at now + 5 minutes <<EOF

Listing 3.14: Anacron

Now, the execution command is defined:

at now + 5 minutes <<EOF
> sleep 5m;
> EOF

Listing 3.15: Example for Anacron

Five minutes from now, a sleep command will be executed.

atq

Listing 3.16: atq

"Atq" allows to get a list of all queued jobs and the associated jobID. The jobIDcan be used to remove jobs from the queue:


atrm JOBID

Listing 3.17: atrm

"at" helps to get detailed information about a specific job by the jobID:

at -c JOBID

Listing 3.18: at

3.2.3. Running SPEC CPU2006

First, "SPEC2006/run/desc.sh" has to be set to the right settings. To run SPEC CPU2006, the following command is used:

SPEC2006/run/bench.sh SPEC2006/run/jobs/JOB.SH BIT-MODE RUN

Listing 3.19: Starting the SPEC CPU2006 benchmarks

The results will be located in "SPEC2006/gcc_01/SPEC2006_v11/result" and "SPEC2006/gcc_01/SPEC2006_v11/results".

• JOB.SH means the modified job script or a prefabricated script from the "SPEC2006/run/jobs" folder.

• BIT-MODE means 32 or 64 CPU bit mode.

• RUN is an identifier to distinguish the single runs.

• The "results" folder contains different subfolders, named by the argument that was declared in the "job" scripts; there is one folder for each run. Each subfolder contains one log file for every single core on which the benchmarks are executed. The log files contain information about the run, like the initiated benchmarks, and finally the reference to a file that contains the benchmark score.

• The "result" folder contains log files with the ratings of the benchmarks. They are referenced by the log files of the "results" folder.


at now <<EOF
> ./bench.sh jobs/job_intel-11.0.sh 1 32
> EOF

Listing 3.20: Example for starting the SPEC CPU2006 Benchmarks

Now, the other jobs of the test series can be started one by one. The "bench.sh" script will manage the scheduling:

./bench.sh jobs/job_intel-11.0.sh 1 64
./bench.sh jobs/job_gcc-4.1.2.sh 1 32
./bench.sh jobs/job_gcc-4.1.2.sh 1 64
./bench.sh jobs/job_gcc-4.3.3.sh 1 32
./bench.sh jobs/job_gcc-4.3.3.sh 1 64

Listing 3.21: Example for starting the SPEC CPU2006 benchmarks

Now, the complete test series is queued, and no user interaction is required anymore.

3.3. Summary of the test runs

In "SPEC2006/gcc_01/SPEC2006_v11/" two result directories, "result/" and "results/", are created. The final scores of the individual benchmarks the suite contains are spread over these two folders. Calculating the final result by hand would take too much time and would be too error-prone. Thus, three scripts are available which allow the final results to be calculated.

The first script, "get_data.sh", fetches the data from the different servers. It is just a simple rsync:

#!/bin/bash
SERVERS="opladev27"

for server in ${SERVERS}; do
    rsync -avz -e "ssh -l abusch" ${server}:/data2/abusch/SPEC2006/gcc_01/SPEC2006_v11/result* ${server}
done

Listing 3.22: get_data.sh

Then the helper script "compute_results.sh" is called. For each file fetched by "get_data.sh", it executes "compute_runs.sh", which calculates the final result of the corresponding SPEC CPU2006 run.


#!/bin/bash

for file in ~/opladev*/results/*; do
    bash ./compute_runs.sh $file $(dirname ${file} | sed -e 's/results$/result/') $(echo $file | sed -e 's/.*\///') 2>/dev/null;
done

Listing 3.23: compute_results.sh

3.4. Results

"compute_results.sh" provides a raw version of the results:

Final result for run-10_gcc-4.1.2_bit-32_spec-allcpp_cpu-E5540_SMT-off_turbo-off_SpSt-off_memory-12x2GB: 99.77
Final result for run-10_gcc-4.1.2_bit-64_spec-allcpp_cpu-E5540_SMT-off_turbo-off_SpSt-off_memory-12x2GB: 116.01
Final result for run-10_gcc-4.3.3_bit-32_spec-allcpp_cpu-E5540_SMT-off_turbo-off_SpSt-off_memory-12x2GB: 101.41
Final result for run-10_gcc-4.3.3_bit-64_spec-allcpp_cpu-E5540_SMT-off_turbo-off_SpSt-off_memory-12x2GB: 117.42
Final result for run-10_intel-11.0_bit-32_spec-allcpp_cpu-E5540_SMT-off_turbo-off_SpSt-off_memory-12x2GB: 113.14
Final result for run-10_intel-11.0_bit-64_spec-allcpp_cpu-E5540_SMT-off_turbo-off_SpSt-off_memory-12x2GB: 119.47
[..]
Final result for run-18_gcc-4.1.2_bit-32_spec-allcpp_cpu-Core_i7_965_SMT-on_turbo-on_SpSt-on_memory-12x2GB: 80.51
Final result for run-18_gcc-4.1.2_bit-64_spec-allcpp_cpu-Core_i7_965_SMT-on_turbo-on_SpSt-on_memory-12x2GB: 92.59
Final result for run-18_gcc-4.3.3_bit-32_spec-allcpp_cpu-Core_i7_965_SMT-on_turbo-on_SpSt-on_memory-12x2GB: 80.24
Final result for run-18_gcc-4.3.3_bit-64_spec-allcpp_cpu-Core_i7_965_SMT-on_turbo-on_SpSt-on_memory-12x2GB: 91.64
Final result for run-18_intel-11.0_bit-32_spec-allcpp_cpu-Core_i7_965_SMT-on_turbo-on_SpSt-on_memory-12x2GB: 91.05
Final result for run-18_intel-11.0_bit-64_spec-allcpp_cpu-Core_i7_965_SMT-on_turbo-on_SpSt-on_memory-12x2GB: 93.26

Listing 3.24: Example-output of ./compute_results.sh


These results are then recorded in a spreadsheet.

Results - SPEC CPU2006

The results of the various performance measurements show that the highest SPEC CPU2006 marks on the different flavours of the Nehalem are obtained by enabling both SMT and Turbo mode. According to figures 3.3 and 3.5, the absolute SPEC marks are about 140 for the L5520, 150 for the E5540 and 170 for the X5570. When the performance of the Nehalem and the Harpertown was compared, the Nehalem reached a 58.4 % higher performance than a CPU from the previous generation within the same constrained power budget.

The results show that Turbo mode can provide a benefit from about 1 % (with GCC 4.3.3, 64 bit, L5520) to 12 % (with GCC 4.1.2, 64 bit, E5540). SMT achieves a gain from 19 % (with GCC 4.1.2, 32 bit, E5540) to 30 % (with GCC 4.1.2, 64 bit, L5520). Intel's ICC compiler provides a maximum performance gain of 15 % (with SMT enabled and Turbo mode disabled) over GCC 4.1.2 or GCC 4.3.3 in the 32 bit mode. In the 64 bit mode it offers at most 5 % additional performance (also with SMT enabled and Turbo mode disabled).

In previous years the energy costs of computing were rather insignificant, but in times of increasing energy prices they have become an important topic. Often the computer centers have also reached the limit of their total power budget, so increasing the total performance requires CPUs with a higher energy efficiency. The efficiency unit used here is SPEC marks per Watt; it is calculated for the different CPUs as the ratio of the SPEC mark reached with the GCC 4.1.2 compiler in 32 bit mode to the power consumption obtained with the CERN IT standard measurement. These conditions for the SPEC measurements were chosen according to the standard default compiler included in CERN SLC 5.3 and the standard compilation mode.

Under those conditions, the reference system using the Harpertown reaches 60.76 SPEC marks while consuming 200 Watt.
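As a quick cross-check, the efficiency of the reference system follows directly from these two numbers (the 200 Watt figure is rounded, which explains the small deviation from the 0.303 SPECs/Watt quoted below):

$\frac{60.76\ \text{SPEC}}{200\ \text{W}} \approx 0.30\ \text{SPEC/W}$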

According to figure 3.6, the L5520 CPU is the most efficient of the three Nehalem flavours and the Core i7 CPU. During the tests it reaches 0.411 SPECs/Watt. With 0.393 SPECs/Watt for the E5540 and 0.389 SPECs/Watt for the X5570, both are slightly less efficient than the L5520. The Core i7 965 reaches 0.295, and the reference system, equipped with the Harpertown 5410, reaches 0.303 SPECs/Watt. Thus, in our measurements, the Nehalem reaches about 36 % more efficiency (SPECs/Watt) compared to the Harpertown. In our study of the Nehalem CPUs, the Core i7 turns out to be the least efficient one. But the comparison is somewhat unfair, since it is a desktop CPU and its peripheral components are less optimized for power savings.


[Figure: bar chart of the SPEC marks (y-axis 0-220) for the combinations SMT-off/Turbo-off, SMT-off/Turbo-on, SMT-on/Turbo-off and SMT-on/Turbo-on; bars: L5520, E5540, X5570, Core i7 965]

Figure 3.2.: SPEC CPU2006 results of Nehalems and Core i7 965 for GCC 4.3.3, 32 Bit


[Figure: bar chart of the SPEC marks (y-axis 0-220) for the combinations SMT-off/Turbo-off, SMT-off/Turbo-on, SMT-on/Turbo-off and SMT-on/Turbo-on; bars: L5520, E5540, X5570, Core i7 965]

Figure 3.3.: SPEC CPU2006 results of Nehalems and Core i7 965 for GCC 4.3.3, 64 Bit

[Figure: bar chart of the SPEC marks (y-axis 0-220) for the combinations SMT-off/Turbo-off, SMT-off/Turbo-on, SMT-on/Turbo-off and SMT-on/Turbo-on; bars: L5520, E5540, X5570, Core i7 965]

Figure 3.4.: SPEC CPU2006 results of Nehalems and Core i7 965 for Intel ICC 11.0, 32 Bit


[Figure: bar chart of the SPEC marks (y-axis 0-220) for the combinations SMT-off/Turbo-off, SMT-off/Turbo-on, SMT-on/Turbo-off and SMT-on/Turbo-on; bars: L5520, E5540, X5570, Core i7 965]

Figure 3.5.: SPEC CPU2006 results of Nehalems and Core i7 965 for Intel ICC 11.0, 64 Bit

[Figure: bar chart "L5520-E5540-X5570-corei7 HEP performance per Watt" (y-axis SPEC/W, 0-0.45); bars: L5520, E5540, X5570, Core i7, reference H5410]

Figure 3.6.: Efficiency of Nehalem L5520, E5540, X5570, Core i7 965 and Harpertown H5410


4. Comparison between Intel's Harpertown and the Nehalem platform

In November 2008 openlab evaluated whether Intel's Atom processor is ready for high energy physics.[8] It was a comparison between Intel's Atom N330 @ 1.6 GHz, a dual core processor, and a dual socket Intel "Harpertown" system with E5410 @ 2.33 GHz CPUs.

4.1. Software

Future CPU generations are not expected to raise their frequency significantly, but the number of cores per chip and the use of SMT will increase. Thus, the openlab team wrote a test application based on Intel's Threading Building Blocks Template Library (TBB).[16] The TBB Template Library is a C++ template library from Intel for the development of parallel applications. TBB is compiler independent: it is a template library like the Standard Template Library (STL), and no compiler pragmas are needed to translate the code.

"tbb" is a benchmark based on the track fitter from the High Level Trigger inALICE. Openlab adapted it to run multithreaded on X86 using Intel’s TBB, henceits name.

The benchmark test40 is a test based on the Geant4 simulation toolkit. Geant4 is typically used in High Energy Physics (HEP) for simulating the passage of particles through matter.[17,18] As a benchmark candidate for the SPEC CPU2006 benchmark suite, test40 is used to test the new Nehalem CPU. In order to compare these new results with results that were recorded previously, GCC 4.3.0 is used for compiling the code of test40. This ensures the same conditions for the Nehalem system as for the previous system.

4.1.1. Setup of the tbb benchmark

The sources of the tbb benchmark come as a tar file that has to be unpacked. Some environment variables are set to launch it:


export PATH=/opt/gcc-4.3.0/bin:$PATH

Listing 4.1: Environment variables for the tbb benchmark

The "bin/" folder of GCC version 4.3.0 is prepended to the $PATH variable. Now "make" will use this GCC version when compiling the tbb benchmark. If the GCC "bin" path were not added to $PATH, the system-wide GCC version would be used for the benchmark compilation.

export LD_LIBRARY_PATH="/opt/gcc-4.3.0/lib64"

Listing 4.2: Environment variables for the tbb benchmark

The lib64 path has to be added to $LD_LIBRARY_PATH. Now GCC 4.3.0 can compile code in the 64 bit mode, and the tbb benchmark also runs in 64 bit.

. /opt/intel/tbb/2.1.014/bin/tbbvars.sh intel64

Listing 4.3: Environment variables for the tbb benchmark

The tbbvars.sh shell script sets some environment variables for running the tbb benchmark. This script is part of Intel's Threading Building Blocks library.

Once all the environment variables are set, the tbb benchmark is ready to be compiled:

make tbb

Listing 4.4: Compiling the tbb benchmark

The tbb benchmark is now set up and ready to run.
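For convenience, the preparation steps can be collected in one small helper script. The following is only a sketch that repeats the commands from above; the paths have to be adapted to the local installation:

#!/bin/bash
# Sketch: prepare the environment and build the tbb benchmark
export PATH=/opt/gcc-4.3.0/bin:$PATH
export LD_LIBRARY_PATH="/opt/gcc-4.3.0/lib64"
. /opt/intel/tbb/2.1.014/bin/tbbvars.sh intel64
make tbb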

4.1.2. Setup of test40 from the Geant4 suite

test40 also comes as a tar file and has to be unpacked. For test40 the same GCC version is used as for the tbb benchmark. Thus, again the environment variables are set:

export PATH=/opt/gcc-4.3.0/bin:$PATH

Listing 4.5: Preparations for running test40

The tar file contains a makefile. Thus, the setup of test40 can be done using the simple command:

make

Listing 4.6: Compiling test40

Now, test40 is also set up and ready to go.


4.2. Hardware

For the power measurements the same power analyzer, a ZES-Zimmer LMG450, is used, as well as the same scripts and the same way of recording the results. The test system is again the Nehalem system, equipped with the E5540 @ 2.53 GHz CPU flavour and running the same operating system, Scientific Linux CERN 5 (SLC5).

4.3. Power measurements

This time, the standard CERN IT method for power measurements is not used. Instead, the measurement interval of the powermeter's script is modified so that it records the current values every second. Only the default value in a configuration file is decreased from 10 to 1; no other script has to be modified.

[..]
parser.add_option("-d", "--delay", action="store", type="seconds", default=1, dest="delay",
    help="Integration time, defaults to 10 seconds. You can use the same format"+\
    " for parameters as in --time. Minimum is 1 second, maximum is 60 seconds")
[..]

Listing 4.7: Options.py

The power measurements are started while the benchmark is running. CERN IT's standard method would yield different values than measuring directly during the benchmark run: these two tests are real world applications and are not optimized to put all the execution units of the CPU under load. Thus, the measured power consumption is lower than it would be with the standard method. But openlab cares about the power actually absorbed by these benchmarks: both are real applications used on production systems, so the real power consumption is what matters.

4.4. Test procedure

Each of Intel's Nehalem chips provides 4 physical cores and, thanks to SMT, 8 logical cores. In the dual socket system used here, this results in a total of 16 logical cores. The already evaluated Harpertown system also provides 4 physical cores per chip, but no


further logical cores, because it does not support SMT. It is also a dual socket system and reaches a total of 8 cores.

Thus, the Harpertown system is observed with 1, 2, 4 and 8 threads or processes, and the new Nehalem system with 1, 2, 4, 8 and 16 threads or processes. Several values are recorded:

• Average runtime

• Runtime compared to the Harpertown system in percent per core

• Active power consumption in Watt

• Throughput relative to the Harpertown system in percent

• Throughput per Watt

4.5. Running the benchmarks

4.5.1. The tbb benchmark

Because the tbb benchmark is multithreaded, it is only necessary to start one process. It can be started using the following command:

./tbb #THREADS

Listing 4.8: Starting the tbb benchmark

./tbb 16

Listing 4.9: Example of starting the tbb benchmark

In this example, the tbb benchmark is started using 16 threads.

Output of the tbb benchmark

tbb provides some values at the end of its run, but only a single one is interesting here: the real fit time per track in µs.

Preparation time/track = 0.05 [us]
CPU fit time/track = 0.4435 [us]
Real fit time/track = 0.443626 [us]
Total fit time = 8.87 [sec]
Total fit real time = 8.8729 [sec]
track size = 1520
char size = 1


Listing 4.10: Example output of the tbb benchmark (./tbb 1)

4.5.2. test40

The runtime of test40 is measured with the Linux tool "time". It executes the given command with the appended arguments. When the executed program terminates, "time" writes the observed time statistics of that run to standard error.

Linux tool "time"

Here, "time" is used for measuring the runtime of "runme.sh":

time -p ./runme.sh

Listing 4.11: Command for running time

The -p option reformats the output to make it easier to use in other shell scripts, for example with the Linux tools "grep" or "awk" to process the data:

[opladev27] > time -p ./test40 < test40.in50
real 29.27
user 29.26
sys 0.00

Listing 4.12: Example output of time

Here, ./test40 needs 29.27 seconds of real time (time from start to finish of the call), 29.26 seconds of user time (amount of CPU time spent in user mode, without the kernel time) and about 0.00 seconds of system time (amount of CPU time spent in kernel functions during the process).
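If, for example, only the real time is of interest, the output can be filtered as mentioned above. A minimal sketch (the pipeline is illustrative, not the exact command used in the tests; note that "time" writes to standard error):

{ time -p ./test40 < test40.in50 > /dev/null ; } 2>&1 | awk '/^real/ {print $2}'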

Running test40

test40 is not multithreaded, thus it has to be forked as often as needed. A small shell script is written for that:

#!/bin/bash
COUNT=1
while [ $COUNT -le $1 ]
do
    time -p ./test40 < test40.in50 > /dev/null &
    COUNT=$[$COUNT+1]
done

Listing 4.13: runme.sh

This is a simple while loop that forks test40 with some test data (test40.in50) $1 times. $1 is the first argument delivered on the command line:

[opladev27] /data1/abusch/Test40/test40 > ./runme.sh 4
[opladev27] /data1/abusch/Test40/test40 > real 33.78
user 33.77
sys 0.00
real 33.67
user 33.66
sys 0.00
real 33.88
user 33.87
sys 0.00
real 33.77
user 33.76
sys 0.00

Listing 4.14: Execution and output of runme.sh

4.6. Results of test40 & the tbb benchmark

The benchmarks test40 and tbb were only tested with the Nehalem E5540 flavour.

The older Harpertown E5472 needs about 32 seconds for a single test40 run. For the same benchmark the E5540 needs 34 seconds (t). Therefore the throughput is only at 95 %. But the E5540 needs about 10.4 % fewer cycles to compute the same work:

$1 - \frac{t_{\mathrm{E5540}} \cdot f_{\mathrm{E5540}}}{t_{\mathrm{E5472}} \cdot f_{\mathrm{E5472}}} = 1 - \frac{34\,\mathrm{s} \cdot 2.53\,\mathrm{GHz}}{32\,\mathrm{s} \cdot 3.0\,\mathrm{GHz}} \approx 0.104$

It should not be forgotten that the E5472 has a clock frequency (f) of 3.0 GHz with only 4 GB RAM, while the E5540 runs at 2.53 GHz and has 24 GB RAM, which also uses energy.

If the energy consumption is taken into account, the E5540 provides between 20 % and 40 % more throughput per Watt than the E5472. The throughput per Watt goes up to 780 % (SMT on, Turbo mode on and 16 threads) of the performance per Watt of the E5472 running only one thread. If the E5472 uses 8 threads, the E5540 with 16 threads is about 18.75 % more efficient.

tbb provides similar results to those of test40. The real fit time per track for the E5540 is about 0.51 µs, whereas the E5472 takes only 0.47 µs. Concerning the energy consumption, there is a gain of 3 to 10 % in throughput per Watt. This


goes up to 542 % (SMT on, Turbo mode on and 16 threads) of the performance per Watt of the E5472 running only one thread. If the previous E5472 system computes 8 threads at the same time and the new E5540 calculates 16 threads, the E5540 is 1.84 % more efficient.


5. Evaluation of SMT

When SMT is enabled, the operating system recognizes more processors in the system than it actually has. One physical processor with one physical core is shown as two logical processors. The term "logical" is used because two logical processors are not the same as a dual core processor. In a dual core processor each core has its own set of execution units; the two logical cores on one physical core share the same execution units. This leads to an increased load of the CPU's pipeline and results in a better throughput.[19]

The new processors will mainly be busy with physics calculations. Thus, the tests for the performance measurements are operated with a chosen application, the same that will be used in practice later on: test40 of the Geant4 toolkit. Geant4 is a software package that is typically used in High Energy Physics (HEP) for simulating the passage of particles through matter. test40 was a benchmark candidate that was submitted for inclusion in SPEC CPU2006.

Since the tests conducted showed that SMT can provide an interesting gain in performance, a closer investigation of SMT is undertaken. The aim is to find out how SMT behaves with real applications and what kind of performance can be expected. The tests are again carried out with test40, the tbb benchmark and, as a real world application, the framework of the ALICE experiment.

5.1. CPUset

To evaluate SMT, it is necessary to make sure that a set of processes runs only on specific cores and that the Linux scheduler is forced to use only those; fortunately, CPUset provides this functionality. CPUsets are objects in the Linux kernel which allow the partitioning of machines in terms of CPUs. It is a virtualization layer and allows the creation of exclusive areas in which processes can be attached and are allowed to run. This function has been implemented in kernel 2.6 as a kernel patch.[20]

CPUsets can be created and manipulated through a pseudo filesystem interface. In a CPUset it is also possible to create sub CPUsets. Processes and memory

1 Geometry and Tracking (http://geant4.web.cern.ch/geant4/collaboration/)


(memory nodes) can then be associated. To establish a CPUset, a folder has to be created in /dev/. Then it can be mounted, and all needed files will be created automatically:

#creating a CPUset
mkdir /dev/cpuset
#mounting a CPUset
#mount type is cpuset and mount option also cpuset. Mountpoint is /dev/cpuset
mount -t cpuset -ocpuset cpuset /dev/cpuset
#allocating of CPUs
echo 0-1 > /dev/cpuset/cpus
#sets CPU placement for tasks to exclusive
echo 1 > /dev/cpuset/cpu_exclusive
#allocating memory resources (memory nodes)
echo 0 > /dev/cpuset/mems
#attaching tasks to the CPUset. In this case the PID of the current process
echo $$ > /dev/cpuset/tasks
#unmount a CPUset
umount /dev/cpuset/

Listing 5.1: Example for creating a CPUset

5.1.1. Types of CPUsets

Because many tests with the different benchmarks, like the tbb benchmark, test40 and the ALICE framework, are carried out, some shell scripts are written to make the handling easier. The first script creates the CPUset(s) and allocates the single CPU cores to the set. For singlethreaded applications, two different ways to allocate tasks to a particular CPU are possible:

• Create one sub CPUset for each CPU and attach the cores and tasks "by hand" to each set.

• Create one big CPUset, attach all the cores and tasks to that one CPUset and let the scheduler do the work. Moreover, some issues with the rescheduling of processes are avoided.

For multithreaded applications only the second option is convenient, because one Process ID (PID) can't be allocated to more than one CPUset. To cover both


requirements, the script is designed to create only one big CPUset and let the scheduler do the work.

5.1.2. Script for creation and allocation of the CPUsets

With the Nehalem it becomes more difficult to create and allocate the CPUsets correctly. Due to SMT, a distinction must be made between physical and logical cores: two logical cores on one physical core share the same execution units and caches. For the script it is essential to distinguish between physical cores and the logical cores placed on them.

core_id and physical_package_id are the two values that identify the individual cores. If both the core_id and the physical_package_id are equal, the two logical cores are located on the same physical core and share the same execution units. core_id and physical_package_id can be read out in "/sys/devices/system/cpu/". Each logical core has its own directory with several settings that can be changed, and with information about the core. In the "cpufreq/" folder the settings of the Enhanced Intel SpeedStep Technology can be changed.[21] Several "governors" that control the CPUfreq scaling are available and can be modified. Another folder is "topology/". It contains the core_id and the physical_package_id.
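The mapping can be made visible with a short loop over these sysfs entries; a sketch, assuming the layout described above:

#!/bin/bash
# Print physical_package_id and core_id for every logical core
for dir in /sys/devices/system/cpu/cpu[0-9]*; do
    cpu=$(basename ${dir})
    physID=$(cat ${dir}/topology/physical_package_id)
    coreID=$(cat ${dir}/topology/core_id)
    echo "${cpu}: physical_package_id=${physID} core_id=${coreID}"
done

Two output lines with identical physical_package_id and core_id values then identify two SMT siblings on one physical core.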

The physical_package_id is important if the server contains more than one physical CPU. The script's design is very flexible, so this fact is also taken into account.

At the beginning of the script the old CPUsets are deleted:

if [ -d /dev/cpuset ]
then
    declare -a SETS
    SETS=($(ls -d /dev/cpuset/*/))

    for ((i=0; i<${#SETS[*]}; i++)); do
        rmdir ${SETS[$i]}
    done
fi

Listing 5.2: cr_specific_cpusets.sh

If a CPUset is already mounted, it will be unmounted, and finally the main CPUset will be deleted:

2 EIST is a dynamic frequency scaling technology that allows the clock speed of the processor to be changed dynamically by software. This enables energy saving.


MOUNT=$(mount | cut -d' ' -f1 | grep cpuset)

if [ "$MOUNT" != "" ]
then
    umount /dev/cpuset
fi

rmdir /dev/cpuset

Listing 5.3: cr_specific_cpusets.sh

A new main CPUset is created and the arguments are transferred to more convenient variable names. The first two arguments represent the numbers of dedicated physical and shared logical cores the user wants to use. The rest are PIDs that are stored in an array:

mkdir /dev/cpuset
mount -t cpuset -ocpuset cpuset /dev/cpuset
PHYSICAL_CORES=$1
SMT_CORES=$2
PID_COUNTER=0

i=0
for arg in $@; do
    if [[ $i != 0 && $i != 1 ]]
    then
        PID[$PID_COUNTER]="${arg}"
        ((PID_COUNTER++))
    fi

    i=$(($i + 1))
done

Listing 5.4: cr_specific_cpusets.sh

The PIDs are needed later on to attach the processes to the CPUset. A process only runs inside the CPUset if its own PID or its parent's PID is attached to it.

The CPUset "cpuset1/" which as the major CPUset is created and some flags areset for it:

34 cpus=""

35 ht_count=0

36 mkdir /dev/cpuset/cpuset1/

44

Page 52: Fachhochschule Kaiserslautern ANGEWANDTE INFORMATIK

5.1. CPUSET

37 echo 1 > /dev/cpuset/cpuset1/cpu_exclusive

38 echo 0 > /dev/cpuset/cpuset1/mems

Listing 5.5: cr_specific_cpusets.sh

Then temporary CPUsets are created. These CPUsets are named as follows: the first number represents the physical_package_id, the second is the core_id and the third number indicates how many cores with the same physical_package_id and core_id have been seen so far. If more than one core has exactly the same IDs, the processor supports SMT and the script has to take this fact into account. The folders are only created due to an earlier implementation and were only needed for creating the final CPUset. They have no other function than to help finding out which logical cores are placed together on a physical core.
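On a dual socket Nehalem with SMT, for example, the first physical core of the first package would produce the following folders (the names follow the scheme just described):

/dev/cpuset/0_0_1    # package 0, core 0, first logical core
/dev/cpuset/0_0_2    # package 0, core 0, second logical core (SMT sibling)

The main loop over the sysfs entries then looks like this: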

for dir in /sys/devices/system/cpu/cpu[0-9]*; do
    physID=$(cat ${dir}/topology/physical_package_id)
    coreID=$(cat ${dir}/topology/core_id)

Listing 5.6: cr_specific_cpusets.sh

For each directory in "/sys/devices/system/cpu/" that starts with "cpu" followed by a number, the physical_package_id and the core_id are stored.

    num=$(( $(ls -d /dev/cpuset/${physID}_${coreID}_* 2>/dev/null | wc -l) + 1 ))

Listing 5.7: cr_specific_cpusets.sh

The number of already detected cores with the same core_id and physical_package_id is calculated.

    ph_cores=$(( $(ls -d /dev/cpuset/* | grep ._._1 2>/dev/null | wc -l) + 0 ))
    ht_cores=$(( $(ls -d /dev/cpuset/* | grep ${physID}_${coreID}_. 2>/dev/null | wc -l) + 0 ))

Listing 5.8: cr_specific_cpusets.sh

The numbers of dedicated physical cores and of shared logical cores (the latter labeled as SMT_CORES) are calculated.

    if [[ (("$ph_cores" == "$PHYSICAL_CORES") && ("$num" == "1")) || ("$ht_count" == "$SMT_CORES" && "$num" > "1") ]]
    then
        continue 1
    fi

    if [[ "$num" > "1" ]]
    then
        ht_count=$(($ht_count + 1))
    fi

Listing 5.9: cr_specific_cpusets.sh

If no more dedicated physical cores are needed, the current cycle of the main loop is skipped. If the core is not a dedicated physical core but dedicated physical cores are still needed, it is skipped as well. If it is a shared logical core that is not needed, the cycle is also skipped. In all other cases a new temporary CPUset is created and the core number is written to a string:

    mkdir /dev/cpuset/${physID}_${coreID}_${num}
    cpus="$cpus$(echo $dir|sed -e 's/^.*system\/cpu\/cpu//'),"
done

Listing 5.10: cr_specific_cpusets.sh

to be able to allocate it later on:

echo $cpus > /dev/cpuset/cpuset1/cpus

Listing 5.11: cr_specific_cpusets.sh

After that the given PIDs are attached to the final CPUset:

for ((i=0;i<${#PID[*]};i++)); do
    echo ${PID[${i}]} > /dev/cpuset/cpuset1/tasks
done

echo done.

Listing 5.12: cr_specific_cpusets.sh

5.1.3. Worker script

All these scripts are built to start benchmark jobs. test40 is not multithreaded; thus, it has to be forked as often as needed.

Forking processes takes time, so the single processes would not all start at the same moment. A signal is used to make sure that all the processes start their calculations at the same time.


Here, the signal handler is established and the "trap" is registered. The signal handler just starts the command that is delivered as an argument:

#!/bin/bash
#Signal handler
on_signal()
{
    exec /bin/bash -c "${ARGS}"
    exit 0
}
#End of the signal handler
#Registration of the signal "SIGUSR1"
trap 'on_signal' SIGUSR1

Listing 5.13: worker.sh

The arguments are parsed and a loop just waits for the trap:

#Parsing the arguments
ARGS=""
for arg in $@; do
    ARGS="${ARGS} ${arg}"
done
#End of parsing the arguments

#Busyloop that waits for the trap
until [ ]
do sleep 1
done
#End of busyloop

Listing 5.14: worker.sh

5.1.4. Start script

The script "start.sh" contains the workflow of creating CPUsets, allocating PIDsand finally starting the benchmark and the user don’t need to care about how thedifferent scripts are working together. "start.sh" expects at least 4 arguments. Theboth first arguments declare how many processes shall run on dedicated physi-cal respectively on shared logical cores. It is not possible to run more processeson shared logical cores than on dedicated physical cores, because a core only be-comes a shared logical core if both logical cores on the physical core are used atthe same time.


The third argument declares how many instances are needed. For a multithreaded application it should be "1"; otherwise it should be the number of instances needed to fill up all the defined cores. The remaining arguments are the command for executing the benchmark:

#Usage
if [[ "$2" == "" || "$1" == "" || "$3" == "" ]]
then
    echo "Usage: ./$(basename $0) #NATIVE_CORES #SMT_CORES #INSTANCES ./YOUR_APPLICATION.sh"
    exit 0
fi
#End of Usage

Listing 5.15: start.sh

"worker.sh" will be started as often as needed:

[..]
#Starting worker.sh for $COUNT times
for ((i=0;i<$COUNT;i++)); do
    /bin/bash -c "./worker.sh $ARGS1" &
    KILL="$KILL $!"
done
#End of starting worker.sh

Listing 5.16: start.sh

The required CPUset is created and allocated:

./cr_specific_cpusets.sh $PARAM
sleep 1

Listing 5.17: start.sh

Finally the signal "SIGUSR1" is sent to each process:

kill -SIGUSR1$KILL

Listing 5.18: start.sh


5.1.5. Testing the CPUset scripts

To test whether all the scripts work together correctly in terms of scheduling on the desired cores, the Linux tool "md5sum" together with the tool "top" can be consulted. md5sum loads the cores while calculating md5 checksums. In top it can be observed whether the right number of jobs is started and whether they are each scheduled on their own core. The following command starts "md5sum" two times:

for i in {1..2}; do md5sum /dev/urandom & done &> /dev/null

Listing 5.19: md5sum

The output is redirected to "/dev/null" because we do not need the result, but only want to load the CPUs. If the processes are scheduled on only one core, top shows:

Figure 5.1.: Scheduling jobs on only one core

If they are scheduled on different cores, it looks like this instead:

Figure 5.2.: Scheduling jobs on different cores

A good indicator to observe on which cores the processes are scheduled is to have a look at the core clock. If SpeedStep is activated, the core clock of idle cores will be decreased by the Linux governor for power saving. This can be checked using the following command:


grep -m 32 -e "^processor" -e "^cpu MHz" /proc/cpuinfo

Listing 5.20: core clock info

Again, if the processes are scheduled only on dedicated physical cores, the output looks like this:

[Screenshot: core clocks when jobs are scheduled only on dedicated physical cores]

and if they are scheduled on all the logical cores:

[Screenshot: core clocks when the scheduler also uses shared logical cores]

Another way to check the CPUset configuration is to call up the list of the tasks


that are allocated to the CPUset:

[opladev27] /data1/abusch > cat /dev/cpuset/cpuset1/tasks
6278
6566
[opladev27] /data1/abusch >

Listing 5.21: Allocated tasks

To check which process corresponds to a PID, this command is useful:

[opladev27] /data1/abusch > ps ax | grep 6278 | grep -v grep
6278 pts/3 Ss+ 0:00 /bin/bash
[opladev27] /data1/abusch > ps ax | grep 6566 | grep -v grep
6566 pts/3 R 3:00 md5sum /dev/urandom
[opladev27] /data1/abusch >

Listing 5.22: ps -ax

Bash and md5sum are attached to the task list of "cpuset1/". Thus, "/bin/bash" and "md5sum" are attached to the CPUset.

5.2. Test procedure

Altering BIOS settings is cumbersome, and CPUsets restrict all processes to particular cores by themselves anyway; moreover, the number of CPUset tests tends to increase quickly. Therefore SMT and Turbo mode are switched on all the time.

The Nehalem dual-socket system provides 8 physical cores, and when SMT is enabled, each physical core hosts two logical cores, providing a total of 16 logical cores. Thus 1 to 16 cores can be tested. For example with 8 cores, a number of different CPUset configurations are possible. Here are several examples:

• Eight threads using eight logical cores on eight physical cores

• Eight threads using eight logical cores on seven physical cores


• Eight threads using eight logical cores on six physical cores

• Eight threads using eight logical cores on five physical cores

• Eight threads using eight logical cores on four physical cores

If this were carried out for all possible configurations, there would be 44 tests to complete for test40, and again 44 tests for the tbb benchmark, according to the following formula:

$2 \cdot \sum_{i=1}^{n/2} \left( \left\lfloor \frac{i}{2} \right\rfloor + 1 \right) - \frac{\left\lfloor \frac{n}{2} \right\rfloor}{2}$, where n = number of logical cores; two logical cores on one physical core
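The formula can be verified with a short shell computation; a sketch for n = 16 logical cores, where the integer division of the shell implements the floor function:

#!/bin/bash
n=16
sum=0
for ((i=1; i<=n/2; i++)); do
    sum=$((sum + i/2 + 1))
done
echo $((2*sum - (n/2)/2))    # prints 44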

Due to process scheduling issues with odd numbers of CPU cores (for example 14 logical cores on 7 physical cores), it is not possible to measure all configurations.

These tests are carried out in order to understand in which area the gain of SMT shows up, and also how the hardware behaves in terms of energy consumption. Energy consumption is measured in the same way as in the comparison between Harpertown and Nehalem.


5.3. Benchmarks using CPUset

The script for creating a CPUset requires root privileges; otherwise it is not possible to create the folders and to mount the pseudo filesystem.

5.3.1. test40

test40 is a singlethreaded benchmark, thus it is forked n times and restricted to a CPUset with n cores attached. Therefore, the runtime of a single test40 benchmark does not decrease when more cores are used. Single test40 instances are independent and thus have no influence on each other. As a result, the gain with SMT can easily be quantified. Running test40 is similar to the tbb benchmark; here, care should be taken to use the correct number of forked processes:

./start.sh 1 1 2 ./runme.sh

Listing 5.23: Example of using CPUset with test40

5.3.2. tbb benchmark

Unlike test40, the tbb benchmark is a multithreaded benchmark, and thus it is not forked n times; instead, the tbb scheduler is advised to use n threads. Thus, the runtime of the tbb benchmark should decrease if more cores are used. This benchmark also shows the gain with SMT and the scalability of Intel's TBB.

./start.sh 4 4 1 ./tbb 8

Listing 5.24: Example of using CPUset with the tbb benchmark

In this example "start.sh" calls "cr_specific_cpusets.sh" and creates a CPUset witheight logical cores on four physical cores. After that it calls "worker.sh" that forksone instance of the tbb benchmark with 8 threads. Then it sends a signal to theprocess and the benchmark starts to run.

Issues with the tbb benchmark

During the tests an issue occurred with the tbb benchmark: if it ran without any CPUset and was called with only 1 or 2 threads (./tbb 2), the CPU load shown in "top" grew over 200 %, so the process was not restricted to two threads. Therefore, the source code of the benchmark had to be analyzed.

The source code of the benchmark revealed that there was no restriction to use only the given number of CPUs:


#ifdef TBB
    // task_scheduler_init init;
    tasks = atoi( argv[1] );
    task_scheduler_init init;
#endif

Listing 5.25: Original tbb benchmark source code

The modified version initializes the task scheduler with the given number of tasks:

#ifdef TBB
    tasks = atoi( argv[1] );
    task_scheduler_init init(tasks);
#endif

Listing 5.26: Modified tbb benchmark source code

After this change, the tbb benchmark is restricted to the given number of threads and schedules its tasks only onto these threads.

5.3.3. ALICE framework

The ALICE framework uses the Geant3 toolkit and is comprised of two basic parts: the first part is the simulation, for example the simulation of primary collisions and the generation of the emerging particles, and the second is the data reconstruction. This framework is singlethreaded; thus, n instances had to be forked to run it on n cores. The number of events used in the simulation part can be chosen, but it has a big influence on the total runtime of the benchmark. Here, 100 events are used. The benchmark needs several hours to run, thus the benchmark scripts of the SPEC CPU2006 benchmark are reused and modified to fulfill the requirements of the ALICE framework.

5.4. Issues with results of the observed tests

Both benchmarks uncovered a problem with the scheduling of the several processes or threads, because a few additional seconds were needed to run the benchmarks with CPUset compared to without.

The first attempt to fix this issue was compiling and setting up a newer kernel version. But with the new kernel version the issue became worse, and other scheduling problems occurred. Thus, other kernel compilation settings were tried, but without any positive effect.


The second attempt was to find out how the Linux CPUfreq governor behaved. It allows the CPU clock speed to be changed on the fly, which makes it possible to save energy. How fast the governor changes the CPU clock speed depends on the configuration. If it is set too slowly, it can influence the performance, because the CPU runs at less than the maximum speed. This is useful for laptops, but not in this situation. Several governors are available:

• Performance: In "/sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq" and "scaling_max_freq" the minimum and maximum CPU clock speeds the governor is allowed to use are defined. The performance governor always sets the CPU frequency to the maximum described in scaling_max_freq.

• Powersave: The powersave governor behaves in the opposite way to the performance governor. It always sets the frequency to the lowest value given by scaling_min_freq.

• Userspace: The user, or a program running with UserID (UID) "root", is allowed to set the frequency manually via the "sysfs" file "scaling_setspeed" in the CPU device directory "/sys/devices/system/cpu/cpu*/cpufreq/".

• Ondemand: The ondemand governor sets the clock speed depending on the current usage. If more performance is needed it scales the speed up, if not it scales it down. The delay for changing the clock speed can be configured.

• Conservative: The conservative governor behaves in the same way as the ondemand governor, but it does not jump directly to the maximum speed. There are several stages between minimum and maximum; depending on the load it changes the clock speed.[22]
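Which governors are available, and which one is currently active, can be read directly from sysfs (example commands, assuming the cpufreq subsystem is present):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor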

To minimize the impact of the CPUfreq governor, it is set to performance, so that the maximum clock speed is always used:

echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor


Listing 5.27: Modified cr_specific_cpusets.sh

If the governor for one core is changed, all the others are changed automatically as well. But this was not the solution, because the behavior of the benchmark results did not change.

If more than one core is used, the average runtime approximates the optimal runtime. A closer look at the log files disclosed that the job running on core0 has a longer runtime than the others. core0 runs more OS related processes, which leads to lower increases in both performance and energy consumption. The used CPUset script attaches the CPU cores starting with the low numbers (beginning with core0). To fix that issue, the script is modified to attach core15 first, to minimize the impact:

[..]
for dir2 in $(ls /sys/devices/system/cpu/ | grep -e ^cpu[0-9]*$ | sort -k 1.4gr); do
[..]

Listing 5.28: Modified cr_specific_cpusets.sh

Thus, the CPUs are sorted in decreasing order, the measurements become more reliable, and the issue is no longer visible, because it occurs only at a high number of cores and is eliminated there by the averaging.

The second issue is that during the measurements of the tbb benchmark and test40 some other problems occur. For example, for the configuration with 6 logical cores on 3 physical cores, it happens that the scheduler schedules two processes on the same core while one core is totally unused. The issue is reproducible for some specific configurations, but most work fine. Probably it is a bug in the scheduler. To get an idea of the gain of SMT it is not necessary to have all the measurements; thus, here, the issue is not important.

5.4.1. Results of test40 & the tbb benchmark

test40 results

The test40 scheduling graph in figure 5.3 shows the execution time according to the number of simultaneously scheduled processes and to the scheduling policy. The green line represents the execution time according to the scheduling policy that tries to maximize the number of dedicated physical cores. Here, n is the number of processes:


[Figure: "test40 execution time according to scheduling", time (s, 0-50) vs. # processes (2-16); curves: no SMT, SMT]

Figure 5.3.: Execution time of test40 on Nehalem X5570 using #processes

• 1 <= n <= 8: between 1 and 8 scheduled processes, each process is assigned to a dedicated physical core: n logical cores on n physical cores.

• 9 <= n <= 16: between 9 and 16 scheduled processes, there are not enough physical cores available for all the processes anymore: n logical cores on 8 physical cores.

The blue line, being the opposite scheduling policy, tries to minimize the number of physical cores used on the server: n processes use n logical cores on ⌊(n+1)/2⌋ physical cores. Those two scheduling policies are the two extremes, and all other scheduling schemes will give an execution time within these two boundaries. Thus, graphically, the other scheduling policies lie in the red area.

To deepen the analysis, the execution time was correlated with the power consumption during the runs, in order to deduce the overall amount of energy consumed by each run. This correlation is shown in figure 5.4, and it highlights 3 interesting scheduling policies:

1. "loose no SMT" scheduling policy, that tries to maximize the use of dedicatedphysical cores.


[Figure: "test40 energy/process", energy (J, 0-16000) vs. # processes (2-16); curves: loose no SMT, SMT, strict no SMT]

Figure 5.4.: Energy per process of test40 on Nehalem X5570 using #processes

2. "SMT" scheduling policy, that tries to minimize the number of dedicatedphysical cores.

3. "strict no SMT" scheduling policy, that uses only dedicated physical coresand requires an additional server to run further processes.

Figure 5.4 highlights a new area, shown in red, that corresponds to the energy consumed by any scheduling policy that tries to minimize the number of servers.

These first two graphs, figures 5.3 and 5.4, allow taking another point of view and analyzing the gains in terms of the throughput SMT offers, compared to a sole reliance on the dedicated physical cores. Indeed, to evaluate SMT, an even number of processes is required, in order to schedule a process on each of the two logical cores of a physical core. When using a strict no SMT scheduling policy, the same physical core only runs one process. The result was that a single core runs one process in 26 seconds with the strict no SMT policy, or two processes in 42 seconds using the SMT policy. Below is the resulting execution time taken to run two processes with both of the aforementioned policies on a dedicated physical core:

• strict no SMT policy, a single core requires 2 · 26 = 52 seconds


[Figure: "test40 SMT efficiency", % gain (0-30) vs. # physical cores (1-8); curves: SMT performance gain, SMT energy gain]

Figure 5.5.: Efficiency per thread of test40 on Nehalem X5570 using #processes

• SMT policy, the two processes run in 42 seconds
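From these two numbers, the throughput advantage of the SMT policy on one physical core follows directly:

$\frac{52\,\mathrm{s} - 42\,\mathrm{s}}{42\,\mathrm{s}} \approx 23.8\,\%$

i.e. roughly 24 % more throughput per physical core.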

This clearly displays the advantage of using SMT scheduling in terms of throughput. Moreover, the gain in throughput is larger than the consumption penalty seen in figure 5.4. Thus, SMT not only increases the overall compute efficiency, but also increases the energy efficiency. Figure 5.5 depicts those gains in terms of throughput and power efficiency, thanks to SMT, depending on the number of dedicated physical cores. According to this, the gain in performance with SMT is between 20 and 25 %, increasing linearly when using more and more physical cores. At the same time the energy gain is quite stable, at just over 15 % when using SMT. The drop observed at 8 physical cores is due to the fact that scheduling 16 processes with SMT implies using the last core, core0, which normally runs more OS related processes. This leads to lower increases in both performance and energy.

tbb benchmark results

Since tbb is a multithreaded process, all the comparisons for it are observed using the number of threads within a single process. This allows a more intuitive, direct comparison, since running a tbb process with any number of threads always carries out a fixed amount of work.


[Figure: "tbb execution time according to scheduling", time (s, 0-80) vs. # threads (2-16); curves: no SMT, SMT]

Figure 5.6.: Execution time of the tbb benchmark on Nehalem X5570 using #threads

Figure 5.6 shows the execution time according to the number of threads and the scheduling policy. The green line represents the execution time according to the scheduling policy that tries to maximize the number of dedicated physical cores. Let n be the number of threads:

• 1 <= n <= 8: between 1 and 8 threads, each thread is assigned to a dedicated physical core: n logical cores on n physical cores.

• 9 <= n <= 16: between 9 and 16 threads, there are not enough physical cores available for all the threads anymore: n logical cores on 8 physical cores.

The blue line is the opposite scheduling policy, which tries to minimize the number of physical cores used on the server: n threads use n logical cores on ⌊n/2⌋ physical cores. All other scheduling policies land in the red area. This plot also shows that the tbb benchmark scales well, because using twice the number of threads almost halves the execution time.

As already done for test40, the execution time is correlated with the power consumption during the runs, according to the number of threads and the scheduling policies.


[Figure: "tbb energy/process", energy (J, 0-14000) vs. # threads (2-16); curves: loose no SMT, SMT, strict no SMT]

Figure 5.7.: Energy per thread of the tbb benchmark on Nehalem X5570 using #threads

This is shown in figure 5.7, again with the three different scheduling policies: loose no SMT, SMT and strict no SMT. The first two graphs, figures 5.6 and 5.7, can also be analyzed in terms of computing throughput. Here, to compare different policies with a fixed number of dedicated cores, the execution times can be directly compared by using twice the number of threads for the strict no SMT scheduling policy. To run a single tbb benchmark process fully using n cores requires:

• strict no SMT policy, n cores run n threads

• SMT policy, n cores run 2 · n threads

If we dedicate two physical cores to one tbb process:

• strict no SMT policy, two threads run on two logical cores on two physical cores, require 38.9 seconds, and consume 2.033 Wh.

• SMT policy, four threads run on four logical cores on two physical cores, require 34.2 seconds, and consume 1.917 Wh.
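From these two measurements, the SMT gains for two dedicated physical cores follow directly:

$\frac{38.9 - 34.2}{38.9} \approx 12.1\,\%$ (execution time), $\frac{2.033 - 1.917}{2.033} \approx 5.7\,\%$ (energy)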


[Figure: "tbb efficiency", % gain (0-14) vs. # physical cores (1-8); curves: SMT performance gain, SMT energy gain]

Figure 5.8.: Efficiency per thread of the tbb benchmark on Nehalem X5570 using #threads

This again gives an advantage to the SMT scheduling policy in terms of throughput and energy efficiency. Figure 5.8 provides a full picture of the gains in terms of both compute efficiency and energy efficiency using the SMT scheduling policy over the no SMT scheduling policy. According to this, the gain in performance is stable at just over 12 %, while the energy gain decreases from 7 % to just under 1 %. SMT is therefore very interesting in terms of throughput, and never a penalizing factor for the energy consumption.

test40 & tbb tests conclusion

The two tests uncovered an interesting gain provided by SMT. To investigate this further, it was decided to carry out a deeper analysis with a real physics application: the framework of the ALICE experiment.

5.4.2. ALICE framework

The tests worked fine using eight threads or fewer, but if more are used, the average CPU load decreases (as will be explained later).


With the aid of sysstat it is observed that in the case of 12 processes, all of them carry out disk I/O operations. The result was a lot of I/O wait time in the CPUs. The Nehalem server only contains one SATA disk drive; with 12 processes of the ALICE framework accessing it at the same time, it was simply overloaded. Using two SATA disks or a Solid State Drive (SSD) to run the ALICE framework, however, shows much fewer I/O waits, and the CPU load is back to optimal.
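Such I/O waits can be observed, for example, with the iostat and sar tools from the sysstat package (illustrative invocations, not necessarily the exact commands used in the tests):

iostat -x 5    # extended per-device utilization, sampled every 5 seconds
sar -u 5       # CPU utilization including %iowait, sampled every 5 seconds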

Alice framework   #proc total   #physical cores   #logical cores   Runtime AVG (m)   CPU-time AVG (m)   Throughput
1 disk                 1              1                 1              200.270            200.270         100 %
                       2              2                 2              199.245            398.490         100.5 %
                       2              1                 2              330.300            330.300         117.5 %
                       4              4                 4              202.520            810.080         98.9 %
                       4              2                 4              315.660            631.320         121.2 %
                       8              8                 8              218.392           1747.136         91.0 %
                       8              4                 8              300.270           1201.080         125.0 %
                      12              6                12              369.130           2214.780         107.8 %
2 disks               12              6                12              340.185           2041.110         115.1 %

SMT also provides a stable gain for the work done by the ALICE framework. Thus, in conclusion, it increases the performance of a real application by at least 15 %.

5.4.3. Comparison between SMT on the Irwindale and on the Nehalem

Irwindale, a member of Intel's Xeon processor family, was introduced in 2005. It provided an early implementation of SMT: Intel Hyper-Threading Technology (HT Technology). Early tests showed that HT did not increase the performance and could even sometimes decrease it. To see whether the hardware or the software implementation was the problem, the aforementioned tests are repeated on a dual-socket Irwindale server with 2x 3.6 GHz CPUs. Thus, this system has two physical or four logical cores available.

Results of the SMT evaluation on the Irwindale

First, the SPEC CPU2006 and the tbb benchmark are executed on the Irwindale system. The settings of the tbb benchmark are slightly changed, in order to take

3"sysstat" is a powerful tool to measure activities of the server subsystems like CPU load, mem-ory usage, memory swapping, network usage or disk load.


the much longer runtime for the benchmarks on the Irwindale into account.

At first, all the instances of SPEC CPU2006 were only scheduled on two independent hardware threads; the other two hardware threads were left unused by the kernel. However, forcing the scheduler with CPUsets to use all the cores solves this issue. With the help of CPUsets it is shown that the previous SMT implementation on the Irwindale provides a benefit of about 17 % for SPEC CPU2006 (figure 5.9). The tbb benchmark provides a gain of about 29 % (figure 5.10). With two logical cores on two physical cores the benchmark needs 545.737 seconds. Using two logical cores on one physical core, the runtime increases to 769.933 seconds. To be able to compare these two values, the gain can be calculated as:

$\frac{545.737 \cdot 2 - 769.933}{545.737 \cdot 2} = 0.294$

The first runtime has to be doubled, because the tbb benchmark ran the first time on two physical cores and the second time on only one physical core.

The conclusion is that it was not the hardware implementation of Irwindale's SMT that caused the initial problems, but rather the way the old Linux scheduler schedules the processes on the individual cores.

[Figure: bar chart "SPEC CPU 2006 32/64-Bit, GCC 4.1.2 Irwindale 3.2 GHz" (y-axis SPEC, 0-20) for SMT-off/SMT-on in 32 and 64 bit]

Figure 5.9.: SPEC CPU2006 results of Irwindale 3.6 GHz for GCC 4.1.2


[Figure: "tbb execution time according to scheduling", time (s, 0-1200) vs. # threads (1-4); curves: no SMT, SMT]

Figure 5.10.: Execution time of the tbb benchmark on Irwindale 3.6 GHz using #threads

5.5. Solid State Drive

A Solid State Drive (SSD) is a device for data storage which typically uses NAND flash memory to store data persistently. The SSD controller chip simulates the behaviour of a normal hard disk to the system.[23]

With NAND flash, the single cells are arranged sequentially as a page, and several pages are arranged as a block. Thus, fewer data connections and less space are needed, but a single cell can't be accessed directly.[24] When write access to a single cell is needed, the complete page has to be written. If a cell is already filled with data and should be overwritten with other data, the page has to be erased first. Thus, writing to an already filled cell often needs two write accesses. It is only possible to erase a complete block, not a single page. Erasing a block takes time and decreases the lifetime of the cells, as explained later.
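To illustrate the effect with invented but plausible sizes: assuming a page size of 4 KB, rewriting even a single byte still forces the whole page to be written, and if the page first has to be erased, a whole block (for example 64 pages, i.e. 256 KB) is affected:

$\text{write amplification} \geq \frac{\text{page size}}{\text{modified data}} = \frac{4096\,\text{B}}{1\,\text{B}} = 4096$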

Two types of cells are used in NAND flash: Single-Level Cells (SLC) and Multi-Level Cells (MLC). The difference between the two cell types is the number of bits that can be stored physically in a single cell. The SLC can store one bit and the MLC can store two (or four) bits. Both cells need the same space on the chip. Thus, the Multi-Level Cell doubles the capacity for the same manufacturing cost as the


Single-Level Cell. To write to a MLC with 2 bits per cell, four voltage states are needed (for writing 00, 01, 10, 11), while a SLC needs only two (0, 1). Writing to a MLC therefore requires more time, because ensuring that the correct bits are written takes longer. Due to the simpler access strategy of the SLC, it has a longer cell lifetime than the MLC. The performance of erasing SLCs and MLCs is the same, but reading from a MLC takes twice as long, and writing can take four times as long as writing to a SLC.

Figure 5.11.: SLC (left) vs. MLC (right) [25]

Data is written to a cell by electron tunneling. When a high enough voltage is applied to the gate, an electron can tunnel through the oxide into the floating gate. If the voltage is removed, tunneling through the oxide is restricted and the electron remains in the floating gate. When the voltage is applied across the channel instead of the gate, the electron moves in the other direction. Thus, if no voltage is applied, no electron can enter or leave the floating gate, and the current configuration is stored persistently.[25]

Figure 5.12.: Mock-up of a NAND flash cell [25]

Write accesses damage the oxide: it becomes weaker, and finally it can't


prevent the electron from leaving the gate. A MLC can be written about 10,000 times; thanks to the simplicity of its design, a SLC reaches about 100,000 write accesses. If a file is modified or deleted, the complete page has to be erased and written again, even if only one bit has changed. Depending on the page and block size, this means a huge amplification of write accesses and therefore reduces the cell's lifetime. Due to the limited lifetime, the controller has to take care of preserving the cells and spreading the writes consistently over all cells, to extend the lifetime of each single cell and, as a result, of the device. Reading does not reduce the cell's ability to store data; the same cell can be read as often as needed. To avoid a huge amplification of cell erases, wear leveling algorithms are implemented in the controller logic. These algorithms spread all the write accesses over all cells. If a file is deleted, the cells containing the data are marked as deleted, and the next time something is written, other cells are used. Only when all cells have been used are the cells first marked as deleted used again for writing. Thanks to this algorithm, 10,000 respectively 100,000 writes per single cell mean a lot, and the lifetime of the SSD is extended significantly.[26, 27, 28]

5.5.1. ALICE framework using an SSD

When many instances of the ALICE framework are running in parallel, a lot of random read and write accesses happen. Random access is one of the SSD's strengths: due to the absence of mechanical parts, accessing a block does not depend on the position of a rotating disk.

The SSD is tested with different file systems and conditions:

• ext2 and each block initialized with 0

• ext3 and each block initialized with 0

• ext2 and each block initialized with random bits

• ext3 and each block initialized with random bits

When the blocks are pre-filled with random bits, the tests are expected to show lower performance, i.e. more I/O waits, than when the blocks are initialized with zeros.

For the conducted tests the Intel X25-E SATA Solid State Drive[29] is used. It is specified for use in a server environment and contains Intel NAND flash memory based on Single-Level Cells.
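The exact initialization commands are not reproduced here; a typical way to bring the drive into the two tested states, assuming the SSD appears as /dev/sdb (a hypothetical device name), would be:

# every block zeroed (the controller can treat the cells as empty):
dd if=/dev/zero of=/dev/sdb bs=1M
# every block filled with random bits (each cell must be erased before
# it can be written again):
dd if=/dev/urandom of=/dev/sdb bs=1M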


Results of the ALICE framework using an SSD

The tests are carried out with the four different settings explained before, plus two SATA hard disk setups for comparison:

Alice framework       #proc total   #physical cores   #logical cores   Runtime AVG (m)   CPU-time AVG (m)   Throughput
SATA, ext3, 1 hdd     12            6                 12               369.130           2214.780           100 %
SATA, ext3, 2 hdds    12            6                 12               340.185           2041.110           107.8 %
SSD, ext2, zeros      12            6                 12               324.726           1948.356           112.0 %
SSD, ext2, random     12            6                 12               333.564           2001.384           109.6 %
SSD, ext3, zeros      12            6                 12               335.165           2010.990           109.2 %
SSD, ext3, random     12            6                 12               350.893           2105.358           104.9 %

12 instances of the ALICE framework generate many disk I/Os. With this workload, one hard disk is simply overloaded; adding a second hard disk shows a significant gain in throughput. Due to its very short random access time, the SSD increases the throughput further. But the SSD technology has the disadvantage that the performance varies with the state of the NAND flash chips: if the SSD controller does not have to clear them, the performance is higher than if they are already filled with (in this case random) data and the controller has to clear the flash chips before it can start to write the actual data.

The file system also influences the performance. Ext3 uses a journal: each write access causes another write access to the journal. Thus, using ext3 leads to more write accesses than using ext2, and write accesses are quite expensive on an SSD and, beyond that, decrease the lifetime of the drive.
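For reference, a minimal sketch of how the two file system variants can be created (the device name is an assumption; ext3 is essentially ext2 plus a journal):

mkfs.ext2 /dev/sdb1     # ext2: no journal
mkfs.ext3 /dev/sdb1     # ext3: journaled, i.e. extra writes per data write
# an existing ext2 file system can also be converted by adding a journal:
tune2fs -j /dev/sdb1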

Again, the disk accesses are recorded using sysstat. Figure 5.13, showing the cumulated I/O waits (full version in Appendix A.1), reveals that ext3, initialized with random bits, wastes more time in I/O waits than ext2 initialized with zeros (about 7.5 %). It also shows that the journal overhead of writing to an SSD is 3.1 % if the disk was initialized with zeros and 5 % if random bits were used for the initialization. Even when the I/O waits and idle delays are removed, the results show that the CPU time of the ALICE framework is higher using ext3 than ext2. If the disk is initialized with random bits, the runtime is extended further. This shows that using the journal and erasing the flash cells of the SSD need additional CPU time.
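The concrete sysstat invocation is not shown in the text; disk transfers and the %iowait share can be recorded, for example, like this (interval and log file names are assumptions):

iostat -dk 60 > disk_io.log &     # kB read/written per device, every 60 s
sar -u 60 > cpu_util.log &        # CPU utilization including %iowait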

Another issue occurs with the Intel X25-E SSD: when the drive is filled to about 50 % of its capacity, the write performance drops considerably. However, the drive used here is an engineering sample, so the controller runs an early firmware, which could cause this problem. The ALICE framework creates about 1.2 GB of data per instance, so with 12 instances the 50 % capacity border is never exceeded. For practical use, though, this issue could be important.

In case of ext3 initialized with random bits, the total throughput the SSD provides is less than that of two SATA hard disks. Today, one SSD costs much more and offers far less capacity than two SATA drives. Thus, it is currently cheaper, easier and faster to use two hard disks instead of an SSD. But the SSD technology is still young and under development. Once the file systems are optimized for SSDs and the controller logic is improved, SSDs could become an alternative to SATA disk drives. With higher capacity, the lifetime of each cell becomes less important, because the wear leveling algorithm has many more flash chips over which to spread the write accesses. Thus, each single cell is stressed less and the total lifetime of the drive is extended.

[Figure: five log-scale panels comparing the SSD runs with ext2/ext3, initialized with zeros and with random bits: "Alice framework write" and "Alice framework read" (kB written/read), "Alice framework iowait" (cumulated I/O waits), and "Alice framework write/read removing iowait and idle delays".]

Figure 5.13.: ALICE framework disk I/Os, removed I/O waits and delays


5.6. Benchmark framework

5.6.1. Motivation

While evaluating SMT, many different benchmarks are used, each with its own way of being started and of computing results. One benchmark is multithreaded, the next is singlethreaded and has to be forked several times. Sometimes the runtime is measured with the Linux tool "time", sometimes the benchmark prints the runtime itself. Thus, each benchmark is started in a different way and the results are presented in different ways.

The second factor is that CPUsets should not only be a method to run benchmarks, but should also be usable in a production environment. This is not possible without a common framework for running tasks in CPUsets. Thus, a framework is created to provide a standard way of starting benchmarks, or any other processes, in CPUsets.

5.6.2. Framework architecture

The platform is split up into two parts: a common part with all the scripts that are needed for every benchmark, and a specific part that sets the required environment, e.g. variables, for a specific benchmark.

The specific job script sets the necessary environment for the benchmark, creates folders for its different instances and calls "start.sh", which creates the CPUsets and allocates the needed CPUs and PIDs to the sets. After that the "start.sh" script calls "worker.sh". Worker.sh finally runs the benchmark (again using a signal to get a synchronized start of all the benchmark's instances), wrapped in "/usr/bin/time" to record the duration. "/usr/bin/time" writes the runtime to "duration.log" and worker.sh writes the output of the benchmark to "output.log", which allows checking whether any errors occurred during the run. This mechanism works for all benchmarks, because setting the environment for specific processes on the one hand, and creating CPUsets, allocating CPUs and PIDs, forking the processes and writing down the results on the other hand, are now strictly separated.


[Figure: structure of the framework. The user calls the specific job script, which sets the environment and calls the common start.sh with specific arguments. start.sh creates the CPUsets and starts worker.sh, calling it n times (fork). worker.sh finally starts the benchmark via "/usr/bin/time ./benchmark"; /usr/bin/time records the runtime to duration.log and worker.sh records the benchmark output to output.log.]

Figure 5.14.: Framework for executing jobs in CPUsets

The scripts are cleaned up and reworked and some other mechanisms are added. Now, cr_specific_cpusets.sh behaves in different ways:

• The standard behavior, creating a CPUset "cpuset1" in /dev/cpuset/, is stillpossible.

• The argument "-c CPUSET_NAME CORE_IDs" effects that the script createsa CPUset with a user defined name and attaches the given core_IDs to theset.


• The argument "-a CPUSET_NAME PID" associates a PID to the CPUset.

• The argument "-d CPUSET_NAME" deletes a given CPUset.

These four new behaviors are needed for the next test: running test40 and the tbb benchmark at the same time on the two logical cores of one physical core, one logical core executing test40 and the other executing tbb.

5.6.3. Implementation of the architecture

Script for creation and allocation of the CPUsets

The main changes are that the different options and arguments are taken into account and that the order of execution (mounting the CPUset, deleting old CPUsets, ...) is adapted to the needs of the different arguments. The second major change is that the script no longer allocates the PIDs of the processes to the CPUsets; this function is moved to start.sh.

Start script

The main change to the start.sh script is that it now supports an argument which makes sure that the script does not create any CPUset itself. The difference to the "standard" behavior is that no new CPUset is created; instead, the CPUset passed as an argument is used:

start.sh -c CPUSET_NAME RESULT_DIRECTORY

Listing 5.29: Running the modified start.sh

As a second major modification, the start.sh script attaches its own PID to the CPUset. Thus, all PIDs of the child processes launched by start.sh are directly attached to the CPUset. Without this, problems can sometimes be observed when a process is not directly attached to the CPUset: its runtime increases a lot, because the process has to be moved from the main CPUset (in which it was allocated at creation) to the specific CPUset. Now no problems with memory movements or PID movements between different CPUsets can occur anymore, because at creation the processes are directly attached to the right CPUset.
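The mechanism behind this is the cpuset pseudo file system itself: a PID written to the tasks file of a set is moved into that set, and all children forked afterwards start there directly (a minimal sketch):

echo $$ > /dev/cpuset/cpuset1/tasks   # attach the current shell to cpuset1
./benchmark &                         # children are now created inside cpuset1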

Worker script

The fundamental change in the worker.sh script:

exec /bin/bash -c "export RESULT_DIRECTORY=${result_directory}; /usr/bin/time -a -o ${result_directory}/duration.log ${ARGS} > ${result_directory}/output.log"

Listing 5.30: Modification of worker.sh

This line causes each instance of the benchmark to write "duration.log" and "output.log" to its own result folder.

5.6.4. Implementation of the job script

As a representative example, the functionality of the job script is described for the tbb benchmark:

#!/bin/bash

REAL_CORES=$1
SMT_CORES=$2
NUMBER_OF_CORES=1

Listing 5.31: job_TBB.sh

In these lines the numbers of dedicated physical cores and shared logical cores are taken from the arguments. The number of instances is set to 1, because the tbb benchmark is multithreaded and thus only one process is needed.

The result directory and the folder containing the sources and the binary file for execution are defined, and the command to run the benchmark is created:

RESULT_DIRECTORY="/data1/abusch/TBB/$3"
SOURCE_DIRECTORY="/data1/abusch/TBB/hltsse_gcc_430/"
COMMAND="./tbb ""$(($REAL_CORES + $SMT_CORES))"

Listing 5.32: job_TBB.sh

The needed result directories are created. In this case it is always only one folder, because the tbb benchmark only needs one instance:


for i in $(seq 1 ${NUMBER_OF_CORES}); do
    mkdir -p ${RESULT_DIRECTORY}/$i
done

Listing 5.33: job_TBB.sh

The script enters the folder with the sources and the binaries and exports some required variables for running the tbb benchmark. Finally it calls the start.sh script, which runs the different scripts.

cd ${SOURCE_DIRECTORY}

export PATH=/opt/gcc-4.3.0/bin:$PATH
export LD_LIBRARY_PATH="/opt/gcc-4.3.0/lib64"
. /opt/intel/tbb/2.1.014/bin/tbbvars.sh intel64

start.sh ${REAL_CORES} ${SMT_CORES} ${RESULT_DIRECTORY} ${COMMAND}

Listing 5.34: job_TBB.sh

5.6.5. Running the benchmarks using the job script

At first, the environment for the benchmark framework is prepared:

• The common scripts start.sh, worker.sh, cr_specific_cpusets.sh and bench.sh are copied to a public folder, for example "/opt/cpuset_scripts/bin/". This is not necessary, but it is recommended: there should be only one copy on a machine, and it should not be located in any "/home/*" folder.

• The scripts need execution rights (e.g. 0755).

• cr_specific_cpusets.sh needs to run as root to be able to create the CPUsets. The older version of the starter scripts started the complete benchmark as root. But the framework is also intended for running standard applications in CPUsets, not only benchmarks. Thus, security is a greater concern and the processes should not run as root; only the cr_specific_cpusets.sh script should have access to root rights. Two lines have to be added to "/etc/sudoers" to allow this:


## Benchmarks
Cmnd_Alias BENCHMARKS = /opt/cpuset_launcher/bin/cr_specific_cpusets.sh

Listing 5.35: Modifications on /etc/sudoers

An alias BENCHMARKS is added that points to the cr_specific_cpusets.sh script.

abusch ALL = NOPASSWD: BENCHMARKS

Listing 5.36: Modifications on /etc/sudoers

The user, in this case abusch, is now allowed to run the alias BENCHMARKS as root.

• The scripts are added to the PATH variable, so the shell can find the scripts in /opt/cpuset_scripts/ when only start.sh is typed:

PATH=/opt/cpuset_scripts/bin/:$PATH

Listing 5.37: Modification on $PATH
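Whether the sudoers entry took effect can be verified before the first run; "sudo -l" lists the commands the current user may execute as root:

sudo -l
# expected to contain a line like:
#   (root) NOPASSWD: /opt/cpuset_launcher/bin/cr_specific_cpusets.sh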

Now everything is set up to run the benchmark:

./job_TBB.sh 1 1

Listing 5.38: Run the tbb benchmark with the job script

The tbb benchmark will run with 2 threads using two logical cores on one physical core. In this case the results are written to "/data1/abusch/TBB/1/" and "/data1/abusch/TBB/2/".

5.7. Running test40 & the tbb benchmark at the same time on Nehalem

A scenario is conceived where, at the same time, one instance of test40 runs on one logical core of a physical core and one instance of tbb runs on the other logical core of the same physical core. This should show whether the gain of SMT decreases when different benchmarks run on the same physical core, and it is a more realistic scenario, since not only one kind of application runs on a system, but multiple.
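Which two logical cores share one physical core can be read directly from sysfs; this is the same topology information the job script shown below derives from physical_package_id and core_id (the output is an example):

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
# e.g. "0,8": logical CPUs 0 and 8 are the two SMT threads of one core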


While writing the framework for benchmarking with CPUset this scenario istaken into account, and only an adapted job script and two environment scriptsfor test40 and the tbb benchmark are needed to cover the requirements:

#!/bin/bash

#Usage
usage="$(basename $0) #REAL-CORES #SMT-CORES RESULT-DIRECTORY"

if [[ $# < 3 ]]
then
    echo -e $usage
    exit 0
fi
#End of Usage

REAL_CORES=$1
SMT_CORES=$2
CPUSET="cpuset1"
RESULT_DIRECTORY="/data1/abusch/results/$3/"
SOURCE_DIRECTORY="/data1/abusch/benchmarks/"
COMMAND="./test40_tbb.sh"

cpus=$(cr_specific_cpusets.sh -n $REAL_CORES $SMT_CORES)
cr_specific_cpusets.sh -c $CPUSET $cpus
cpus=$(echo $cpus | sed -e 's/,/\n/g' | sed -e 's/^\([^-]*\)$/\1\n\1/' | sed -e 's/-/\n/' | xargs -L2 seq)
cpus_seta=""
cpus_setb=""
full=$(echo $cpus | wc -w)
half=$((echo $(echo $cpus | wc -w) / 2) | bc)
dir="/sys/devices/system/cpu"
k=0
for i in $cpus; do
    for j in $cpus; do
        if [[ $i == $j ]]
        then
            continue
        fi
        if [ "$(echo -e $(cat $dir/cpu$i/topology/physical_package_id && cat $dir/cpu$i/topology/core_id))" == "$(echo -e $(cat $dir/cpu$j/topology/physical_package_id && cat $dir/cpu$j/topology/core_id))" ]
        then
            if [[ $k -ge $half ]]
            then
                continue
            else
                cpus_seta="$cpus_seta$i,"
                cpus_setb="$cpus_setb$j,"
                k=$(($k + 1))
            fi
        fi

    done
done

cr_specific_cpusets.sh -c $CPUSET/test40 $cpus_seta
cr_specific_cpusets.sh -c $CPUSET/tbb $cpus_setb

mkdir -p ${RESULT_DIRECTORY}/1

cd ${SOURCE_DIRECTORY}

./job_test40_environment.sh $CPUSET/test40 $3 &
./job_tbb_environment.sh $CPUSET/tbb $3 &

Listing 5.39: Job script for the tbb benchmark and test40 at the same time

All available cores have to be split into two sets, so that the instances of test40 and the threads of the tbb benchmark are scheduled in their own CPUsets. Thus, the available cores are split and attached to the two newly created CPUsets. Finally, the benchmark-specific environment scripts are called; they behave like the standard job scripts, but to attach the PID no new CPUset is created; instead the given CPUset is taken:

start.sh -c ${CPUSET} ${RESULT_DIRECTORY}/tbb/ ${COMMAND}

Listing 5.40: Modifications on the tbb benchmark environment

start.sh -c ${CPUSET} ${RESULT_DIRECTORY}/test40/ ${COMMAND}

Listing 5.41: Modifications on the test40 environment

The start script is called with the -c argument, which makes sure that no new CPUset is created; the given set is used for attaching the PIDs.


Again the same tbb and test40 benchmarks are used to observe the impact of scheduling a mixture of both benchmarks on the logical cores. The measurements are always carried out twice: since test40 and tbb do not have the same runtime, it has to be ensured that the benchmark on the other logical core of the physical core (the one not in the focus of the measurement) runs long enough for the benchmark in focus. Afterwards the same procedure is carried out for the benchmark that was not in focus before.

5.7.1. Results

While running test40 and the tbb benchmark at the same time, test40 shows a speedup of about 2.61 % and the tbb benchmark a slowdown of about 1.28 % on average. The result using 16 cores is perhaps less representative, because one of the jobs is scheduled on core0. In total there is no significant influence. Thus, it is possible to run different benchmarks at the same time, and they do not interfere with each other significantly.

Benchmark            #threads total   #physical cores   #logical cores   Runtime AVG (s)   CPU-time AVG (s)   Throughput
test40 (only)        2                1                 2                41.770            41.770             100.00 %
tbb+test40: test40   2                1                 1                40.650            20.235             102.68 %
                     4                2                 2                40.685            40.685             102.60 %
                     8                4                 4                40.710            81.420             102.54 %
                     16               8                 8                45.928            183.712            (90.05 %)
tbb (only)           2                1                 2                68.373            68.373             100.00 %
tbb+test40: tbb      2                1                 1                132.660           66.330             102.99 %
                     4                2                 2                69.270            69.270             98.69 %
                     8                4                 4                36.070            72.140             94.49 %
                     16               8                 8                21.310            85.240             (75.33 %)

tbb is a multithreaded benchmark, thus the CPU time is taken as reference; for the singlethreaded benchmark test40 the average runtime is taken for comparisons. CPU-time = #logical_cores / (#physical_cores · 2) · runtime


6. Conclusion and outlook

Evaluating the Nehalem CPUs allowed insight into and analysis of both hardware and software and of how they work together. Changing hardware or operating system parameters directly showed an effect on the software performance of SPEC CPU2006 and also of real applications such as the ALICE framework. It also revealed the importance of parallel programming, because single-CPU performance no longer increases as much as it did in previous years.

For evaluating the CPUs it was necessary to execute the same commands for each benchmark run, so using a Linux shell was an advantage. It would also have been possible to type the commands by hand each time, but this is an error-prone process, cannot be automated and would have to be written down in a dedicated place to remember what was done. From starting the benchmarks to calculating the results, small shell scripts were written to automate the process. Finally a framework was developed that offers the possibility to add new benchmarks and execute them the same way every time, making the measurements reproducible.

A part of the evaluation process was the energy consumption and the energy penalty of new technologies such as SMT. In times of energy saving policies for computing centers, the energy efficiency of servers comes to the fore. Earlier, raw performance was more important, due to the lower total energy consumption of the systems and, of course, the lower energy costs. Today, performance per Watt is one of the main aspects for purchase decisions. Now new technologies are coming up, like SMT and the turbo mode. These two parameters influence performance, but also energy consumption. Thus, performance and energy consumption are not fixed anymore but depend on the chosen settings, and a decision has to be taken between a maximum of performance per time unit (using only physical cores) or a maximum of throughput (using SMT).

For openlab, a technical report about the results of the Nehalem evaluation was written; it met the requirements of such a paper and built a bridge between theoretical knowledge and its practical use. Many special applications from CERN were used and gave a deep insight into the procedures at CERN. Meetings and presentations were held for the openlab partner Intel.

The Nehalem microarchitecture showed an impressive gain in terms of performance and energy efficiency. The trend towards multi- and many-core architectures continues. Soon, the Intel Nehalem EX will be introduced; each single chip will provide 16 logical or 8 physical cores. In several years, more than 100 cores will be combined on one single chip. The first step is Intel's Larrabee processor, which is a hybrid between a CPU and a GPU and will provide far more than 16 cores. Thus, supporting multithreading becomes increasingly important for software.

Because many parallel applications access the other subsystems, like the disk, bottlenecks can easily appear in the system. A new possibility to increase the disk performance could be Solid State Drives. The performance and capacity of these drives are linked to Moore's law, because they also benefit from the increasing transistor density. Thus, every 18 months the capacity should double and the performance increase for the same price. Whether the controller logic of the SSDs can deliver all these promised advantages is not clear yet, but it will be interesting to observe in the future.


7. References

[1] http://openlab-mu-internal.web.cern.ch/openlab-mu-internal/Documents/2_Technical_Documents/Technical_Reports/2008/AH-SJ_The%20approach%20to%20energy%20efficient%20computing%20at%20CERN%20final.pdf
[2] http://openlab-mu-internal.web.cern.ch/openlab-mu-internal/
[3] http://de.wikipedia.org/wiki/Mooresches_Gesetz
[4] http://www.gi-ev.de/no_cache/service/informatiklexikon/informatiklexikon-detailansicht/meldung/multicore-architekturen-145.html
[5] http://www.tecchannel.de/pc_mobile/prozessoren/1752990/intels_nehalem_enthuellt_stark_verfeinerte_core_mikroarchitektur/index.html and http://www.tecchannel.de/pc_mobile/prozessoren/1776212/test_turbo_technologie_intel_core_i7_overclocking/
[6] http://www.intel.com/products/server/motherboards/s5520ur/s5520ur-overview.htm
[7] http://www.intel.com/products/desktop/motherboards/DX58SO/DX58SO-overview.htm
[8] Gyorgy Balazs, Sverre Jarp, Andrzej Nowak - "Is the Atom processor ready for High Energy Physics?", CERN 2008
[9] http://www.brultech.com/files/Difference%20Between%20Power%20and%20VA.pdf
[10] http://users.bigpond.net.au/CPUburn/
[11] http://de.wikipedia.org/wiki/LAPACK
[12] http://www.debuntu.org/2006/04/08/22-ssh-and-port-forwarding-or-how-to-get-through-a-firewall
[13] http://toic.org/2009/01/18/reverse-ssh-port-forwarding/
[14] http://www.intel.com/design/servers/ipmi/
[15] http://de.wikipedia.org/wiki/Intelligent_Platform_Management_Interface
[16] http://www.threadingbuildingblocks.org/
[17] http://de.wikipedia.org/wiki/Geant4
[18] http://geant4.web.cern.ch/geant4/collaboration/
[19] http://www.ddj.com/architect/215802084/
[20] http://www.bullopensource.org/cpuset/
[21] http://en.wikipedia.org/wiki/SpeedStep
[22] http://www.mjmwired.net/kernel/Documentation/cpu-freq/governors.txt
[23] http://en.wikipedia.org/wiki/Solid_State_Drive
[24] http://de.wikipedia.org/wiki/NAND-Flash
[25] http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=2
[26] http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=3
[27] Boi Feddern - "Entdeckungsreise - Solid State Disks mit bis zu 256 GByte" (c't magazin 11/2009)
[28] http://www.tecchannel.de/storage/komponenten/1767940/test_ssd_intel_flash_laufwerk_vergleich_solid_state_disk_benchmark_transferrate/index2.html
[29] http://download.intel.com/design/flash/nand/extreme/319984.pdf


8. Acknowledgements

First of all I thank Prof. Dr. Jörg Hettel, who helped me to get in touch with CERN and showed me this great place to spend my practice period.

I thank Sverre Jarp (CERN openlab CTO) and Julien Leduc (CERN openlab Fellow) for giving me the opportunity to write this bachelor thesis at CERN, for their professional advice and for providing me with information at any time.

I thank Julien Leduc also for supporting me in writing this thesis.


List of Figures

2.1. Nehalem system with Intel's S5520UR Motherboard
2.2. Power test setup
2.3. Reverse SSH
2.4. IPMI scheme
2.5. CERN IT power measurements: 20 % idle and 80 % load mix

3.1. Subfolders of "SPEC2006/run"
3.2. SPEC CPU2006 results of Nehalems and Core i7 965 for GCC 4.3.3, 32 Bit
3.3. SPEC CPU2006 results of Nehalems and Core i7 965 for GCC 4.3.3, 64 Bit
3.4. SPEC CPU2006 results of Nehalems and Core i7 965 for Intel ICC 11.0, 32 Bit
3.5. SPEC CPU2006 results of Nehalems and Core i7 965 for Intel ICC 11.0, 64 Bit
3.6. Efficiency of Nehalem L5520, E5540, X5570, Core i7 965 and Harpertown H5410

5.1. Scheduling jobs on only one core
5.2. Scheduling jobs on different cores
5.3. Execution time of test40 on Nehalem X5570 using #processes
5.4. Energy per process of test40 on Nehalem X5570 using #processes
5.5. Efficiency per thread of test40 on Nehalem X5570 using #processes
5.6. Execution time of the tbb benchmark on Nehalem X5570 using #threads
5.7. Energy per thread of the tbb benchmark on Nehalem X5570 using #threads
5.8. Efficiency per thread of the tbb benchmark on Nehalem X5570 using #threads
5.9. SPEC CPU2006 results of Irwindale 3.6 GHz for GCC 4.1.2
5.10. Execution time of the tbb benchmark on Irwindale 3.6 GHz using #threads
5.11. SLC (left) vs. MLC (right) [25]
5.12. Mock-up of a NAND flash cell [25]
5.13. ALICE framework disk I/Os, removed I/O waits and delays
5.14. Framework for executing jobs in CPUsets

A.1. ALICE framework disk I/Os, cumulated IO waits, removed I/O waits and delays
A.2. ALICE framework (12 instances) CPU time and I/Os using the SSD, ext2 initialized with zeros
A.3. ALICE framework (12 instances) CPU time and I/Os using the SSD, ext2 initialized with randoms
A.4. ALICE framework (12 instances) CPU time and I/Os using the SSD, ext3 initialized with zeros
A.5. ALICE framework (12 instances) CPU time and I/Os using the SSD, ext3 initialized with randoms


A. Appendix

A.1. Results of the energy measurements in detail

A.1.1. Results of the Nehalem CPUs

State: Idle

CPU and settings                        Memory [GB]   Active Power [W]   Apparent Power [VA]   Power Factor
L5520 SMT-off turbo-off                 24            119.029            133.191               0.894
L5520 SMT-off turbo-on                  24            120.732            134.408               0.898
L5520 SMT-on turbo-off                  24            129.721            143.291               0.905
L5520 SMT-on turbo-on                   24            130.916            144.482               0.906
E5540 SMT-off turbo-off                 24            117.232            131.366               0.892
E5540 SMT-off turbo-on                  24            117.221            131.29                0.893
E5540 SMT-on turbo-off                  24            129.266            142.823               0.905
E5540 SMT-on turbo-on                   24            129.422            142.96                0.905
E5540 SMT-off turbo-off SpeedStep=off   24            130.809            144.296               0.907
E5540 SMT-on turbo-off SpeedStep=off    24            148.379            160.966               0.922
X5570 SMT-off turbo-off                 24            122.413            136.423               0.897
X5570 SMT-off turbo-on                  24            118.701            132.742               0.894
X5570 SMT-on turbo-off                  24            133.288            146.609               0.909
X5570 SMT-on turbo-on                   24            133.136            146.288               0.910


State: Load

CPU and settings                        Memory [GB]   Active Power [W]   Apparent Power [VA]   Power Factor
L5520 SMT-off turbo-off                 24            252.591            262.944               0.961
L5520 SMT-off turbo-on                  24            262.687            272.719               0.963
L5520 SMT-on turbo-off                  24            275.113            285.066               0.965
L5520 SMT-on turbo-on                   24            273.54             283.524               0.965
E5540 SMT-off turbo-off                 24            292.483            302.031               0.968
E5540 SMT-off turbo-on                  24            309.231            318.48                0.971
E5540 SMT-on turbo-off                  24            316.216            325.459               0.972
E5540 SMT-on turbo-on                   24            322.952            332.142               0.972
E5540 SMT-off turbo-off SpeedStep=off   24            317.111            326.371               0.972
E5540 SMT-on turbo-off SpeedStep=off    24            319.316            328.556               0.972
X5570 SMT-off turbo-off                 24            325.632            334.837               0.972
X5570 SMT-off turbo-on                  24            359.091            367.979               0.976
X5570 SMT-on turbo-off                  24            356.214            365.092               0.976
X5570 SMT-on turbo-on                   24            372.953            381.667               0.977


A.1.2. Results of the Core i7 965

State: Idle

CPU and settings                  Memory [GB]   Active Power [W]   Apparent Power [VA]   Power Factor
Core i7 965 SMT-off turbo-off     6             109.039            110.822               0.984
Core i7 965 SMT-off turbo-on      6             109.366            111.231               0.983
Core i7 965 SMT-on turbo-off      6             110.797            112.590               0.984
Core i7 965 SMT-on turbo-on       6             111.295            113.089               0.984

State: Load

CPU and settings                  Memory [GB]   Active Power [W]   Apparent Power [VA]   Power Factor
Core i7 965 SMT-off turbo-off     6             220.078            221.981               0.991
Core i7 965 SMT-off turbo-on      6             238.422            240.228               0.992
Core i7 965 SMT-on turbo-off      6             235.737            237.567               0.992
Core i7 965 SMT-on turbo-on       6             259.089            260.795               0.993
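The power factor column in the tables above is simply the ratio of active to apparent power; for the first idle row of the Nehalem table:

\mathrm{PF} = \frac{P_{\mathrm{active}}}{S_{\mathrm{apparent}}} = \frac{119.029\,\mathrm{W}}{133.191\,\mathrm{VA}} \approx 0.894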

A.2. Results of the performance measurements in detail

A.2.1. Results of the Nehalem CPUs [SPEC marks]

E4 12x2G SMT-off Turbo-off
Mode     Compiler     L5520    E5540    X5570
32 bit   Gcc 4.1.2    92.57    100.06   114.43
32 bit   Gcc 4.3.3    93.90    96.56    116.08
32 bit   Intel 11.0   104.44   107.03   129.44
64 bit   Gcc 4.1.2    106.86   108.90   136.90
64 bit   Gcc 4.3.3    110.28   110.57   134.43
64 bit   Intel 11.0   112.14   111.78   137.40


E4 12x2G SMT-off Turbo-on
Mode     Compiler     L5520    E5540    X5570
32 bit   Gcc 4.1.2    94.25    103.46   118.21
32 bit   Gcc 4.3.3    96.24    106.45   119.12
32 bit   Intel 11.0   107.36   116.38   133.18
64 bit   Gcc 4.1.2    112.5    122.43   138.33
64 bit   Gcc 4.3.3    111.67   122.50   141.43
64 bit   Intel 11.0   114.57   124.23   142.50

E4 12x2G SMT-on Turbo-off
Mode     Compiler     L5520    E5540    X5570
32 bit   Gcc 4.1.2    116.19   119.26   146.92
32 bit   Gcc 4.3.3    117.44   120.23   144.42
32 bit   Intel 11.0   133.44   135.99   165.09
64 bit   Gcc 4.1.2    139.04   138.70   173.86
64 bit   Gcc 4.3.3    137.19   139.39   166.10
64 bit   Intel 11.0   140.09   141.71   174.64

E4 12x2G SMT-on Turbo-on
Mode     Compiler     L5520    E5540    X5570
32 bit   Gcc 4.1.2    117.97   128.70   151.44
32 bit   Gcc 4.3.3    117.93   132.64   150.12
32 bit   Intel 11.0   134.17   139.13   167.06
64 bit   Gcc 4.1.2    143.57   152.10   172.18
64 bit   Gcc 4.3.3    139.35   150.86   173.32
64 bit   Intel 11.0   141.49   144.58   175.47

A.2.2. Results of the Core i7 965 [SPEC marks]

3x2G SMT-off
Mode     Compiler     Core i7 965 Turbo-off   Core i7 965 Turbo-on
32 bit   Gcc 4.1.2    61.52                   63.37
32 bit   Gcc 4.3.3    62.17                   64.44
32 bit   Intel 11.0   68.67                   71.19
64 bit   Gcc 4.1.2    71.38                   73.05
64 bit   Gcc 4.3.3    71.41                   73.75
64 bit   Intel 11.0   73.05                   75.65


3x2G SMT-on
Mode     Compiler     Core i7 965 Turbo-off   Core i7 965 Turbo-on
32 bit   Gcc 4.1.2    77.33                   80.51
32 bit   Gcc 4.3.3    77.56                   80.24
32 bit   Intel 11.0   86.14                   91.05
64 bit   Gcc 4.1.2    90.90                   92.59
64 bit   Gcc 4.3.3    92.37                   91.64
64 bit   Intel 11.0   90.56                   93.26

91

Page 99: Fachhochschule Kaiserslautern ANGEWANDTE INFORMATIK

APPENDIX A. APPENDIX

A.3. Results comparison between Nehalem and Harpertown using test40 and tbb

test40 on Nehalem E5540 @ 2.53 GHz (SLC 5.3, GCC 4.3.0, 24 GB RAM).
Harpertown reference: runtime 32 s, power 210 W. Power values are accurate to +/- 5 W; workload and throughput are relative to one Harpertown process.

#proc   Runtime AVG (s)   % of Harpertown 1 proc   Power (W)   Workload   Throughput   Throughput per Watt

SMT=off, Turbo=off, SpeedStep=on
1       33.78             106 %                    162         100 %      95 %         123 %
2       33.78             106 %                    177         200 %      189 %        225 %
4       33.78             106 %                    201.5       400 %      379 %        395 %
8       33.95             106 %                    245         800 %      754 %        646 %

SMT=on, Turbo=off, SpeedStep=on
1       34.07             106 %                    165         100 %      94 %         120 %
2       34.21             107 %                    179         200 %      187 %        220 %
4       34.19             107 %                    203         400 %      374 %        387 %
8       34.57             108 %                    245         800 %      741 %        635 %
16      50.27             157 %                    273         1600 %     1019 %       783 %

SMT=off, Turbo=on, SpeedStep=on
1       31.9              100 %                    150         100 %      100 %        140 %
2       31.44             98 %                     189.5       200 %      204 %        226 %
4       31.27             98 %                     118.5       400 %      409 %        725 %
8       32.27             101 %                    264.5       800 %      793 %        630 %

SMT=on, Turbo=on, SpeedStep=on
1       32.05             100 %                    168         100 %      100 %        125 %
2       31.9              100 %                    193         200 %      201 %        218 %
4       31.64             99 %                     221.5       400 %      405 %        384 %
8       32.83             103 %                    254.5       800 %      780 %        643 %
16      48.52             152 %                    284         1600 %     1055 %       780 %

SMT=off, Turbo=off, SpeedStep=off
1       33.67             105 %                    151.5       100 %      95 %         132 %
2       43.33             135 %                    155.5       200 %      148 %        199 %
4       34.1              107 %                    182.5       400 %      375 %        432 %
8       34.67             108 %                    244         800 %      738 %        636 %

SMT=on, Turbo=off, SpeedStep=off
1       33.74             105 %                    153         100 %      95 %         130 %
2       34.02             106 %                    157         200 %      188 %        252 %
4       35.15             110 %                    165.5       400 %      364 %        462 %
8       35.13             110 %                    216         800 %      729 %        708 %
16      51.16             160 %                    273         1600 %     1001 %       770 %

tbb on Nehalem E5540 @ 2.53 GHz (SLC 5.3, GCC 4.3.0, 24 GB RAM).
Harpertown reference: runtime 0.47 µs, power 186 W. tbb runs as one multithreaded process, so the workload column stays at 100 %.

#threads   Runtime AVG (µs)   % of Harpertown 1 proc   Power (W)   Workload   Throughput   Throughput per Watt

SMT=off, Turbo=off, SpeedStep=on
1       0.51    109 %   164   100 %   92 %    105 %
2       0.257   55 %    180   100 %   183 %   189 %
4       0.131   28 %    209   100 %   359 %   319 %
8       0.066   14 %    260   100 %   712 %   509 %

SMT=on, Turbo=off, SpeedStep=on
1       0.509   108 %   166   100 %   92 %    103 %
2       0.258   55 %    183   100 %   182 %   185 %
4       0.146   31 %    204   100 %   322 %   294 %
8       0.07    15 %    260   100 %   671 %   480 %
16      0.057   12 %    287   100 %   825 %   534 %

SMT=off, Turbo=on, SpeedStep=on
1       0.473   101 %   173   100 %   99 %    107 %
2       0.239   51 %    195   100 %   197 %   188 %
4       0.136   29 %    216   100 %   346 %   298 %
8       0.072   15 %    275   100 %   653 %   442 %

SMT=on, Turbo=on, SpeedStep=on
1       0.467   99 %    170   100 %   101 %   110 %
2       0.272   58 %    195   100 %   173 %   165 %
4       0.138   29 %    225   100 %   341 %   282 %
8       0.067   14 %    273   100 %   701 %   478 %
16      0.056   12 %    288   100 %   839 %   542 %

SMT=off, Turbo=off, SpeedStep=off
1       0.509   108 %   153   100 %   92 %    112 %
2       0.402   86 %    163   100 %   117 %   133 %
4       0.202   43 %    179   100 %   233 %   242 %
8       0.068   14 %    260   100 %   691 %   494 %

SMT=on, Turbo=off, SpeedStep=off
1       0.8     170 %   154   100 %   59 %    71 %
2       0.405   86 %    160   100 %   116 %   135 %
4       0.205   44 %    183   100 %   229 %   233 %
8       0.089   19 %    255   100 %   528 %   385 %
16      0.057   12 %    287   100 %   825 %   534 %

A.4. Sysstat runtime and I/Os for the ALICE framework using an SSD

[Figure A.1.: ALICE framework disk I/Os, cumulated IO waits, removed I/O waits and delays. Five log-scale panels ("Alice framework write", "Alice framework read", "Alice framework iowait", "Alice framework write removing iowait and idle delays", "Alice framework read removing iowait and idle delays"), each comparing the SSD with ext2/ext3 initialized with zeros and with random bits.]

[Figure A.2.: ALICE framework (12 instances) CPU time and I/Os using the SSD, ext2 initialized with zeros. Per-CPU (CPU0 to CPU15) breakdown into user, system, iowait and idle time, plus kB read and kB written.]

[Figure A.3.: ALICE framework (12 instances) CPU time and I/Os using the SSD, ext2 initialized with randoms. Same panels as Figure A.2.]

[Figure A.4.: ALICE framework (12 instances) CPU time and I/Os using the SSD, ext3 initialized with zeros. Same panels as Figure A.2.]

[Figure A.5.: ALICE framework (12 instances) CPU time and I/Os using the SSD, ext3 initialized with randoms. Same panels as Figure A.2.]

Page 108: Fachhochschule Kaiserslautern ANGEWANDTE INFORMATIK


A.5. Shell script for compiling and executing CPUBurn and LAPACK

#!/bin/bash

function fail() {
    echo $@
    pkill cpuburn
    pkill lapack
    exit 1
}

CPU=`grep vendor_id -m 1 /proc/cpuinfo | awk '{print $3}'`
CORES=`grep -c "^processor" /proc/cpuinfo`
MEM=`free | grep Mem | awk '{print $2}'`

if [[ $CPU = "GenuineIntel" ]]; then
    echo "Detected Intel CPU."
    burn="burnP6.S"
elif [[ $CPU = "AuthenticAMD" ]]; then
    echo "Detected AMD CPU."
    burn="burnK7.S"
else
    echo "Unknown CPU, assuming Intel."
    burn="burnP6.S"
fi

echo "$CORES cores detected."

echo "Compiling benchmarks..."
gcc -s -nostdlib -m32 -o cpuburn $burn || fail "Error: Unable to compile cpuburn test"
g77 -llapack -o lapack 1000d_1000mb.f || fail "Error: Unable to compile lapack test"

echo "Running benchmarks..."
for i in `seq $CORES`;
do
    # Run a cpuburn and a lapack on alternating cores
    let even=$i%2
    if [[ $even -eq 0 ]]; then
        echo "Launching lapack"
        ./lapack >/dev/null || fail "Lapack died" &
    else
        echo "Launching cpuburn"
        ./cpuburn >/dev/null || fail "Cpuburn died" &
    fi
done

echo "Please wait 5 minutes for power consumption to stabilize..."
sleep 5m
echo "Start reading the power consumption now"
sleep 30m
echo "Stop averaging now."

echo "Cleaning up..."
sleep 1m # Wait a bit so power consumption doesn't drop suddenly
pkill cpuburn
pkill lapack
echo "All done!"

Listing A.1: Shell script for compiling and executing CPUBurn and LAPACK

A.6. CPUset scripts

A.6.1. cr_specific_cpusets.sh

#Deleting all old CPUsets
if [ -d /dev/cpuset ]
then
    declare -a SETS
    SETS=($(ls -d /dev/cpuset/*/))

    for ((i=0; i<${#SETS[*]}; i++)); do
        rmdir ${SETS[$i]}
    done
fi

MOUNT=$(mount | cut -d' ' -f1 | grep cpuset)

if [ "$MOUNT" != "" ]
then
    umount /dev/cpuset
fi

rmdir /dev/cpuset
#End of deleting old CPUsets

#Creating new CPUsets
mkdir /dev/cpuset
mount -t cpuset -ocpuset cpuset /dev/cpuset
PHYSICAL_CORES=$1
SMT_CORES=$2
PID_COUNTER=0

#Parsing the arguments
i=0
for arg in $@; do
    if [[ $i != 0 && $i != 1 ]]
    then
        PID[$PID_COUNTER]="${arg}"
        ((PID_COUNTER++))
    fi

    i=$(($i + 1))
done
#End of parsing arguments

cpus=""
ht_count=0
mkdir /dev/cpuset/cpuset1/
echo 1 > /dev/cpuset/cpuset1/cpu_exclusive
echo 0 > /dev/cpuset/cpuset1/mems

#Creates temporary cpusets
for dir in /sys/devices/system/cpu/cpu[0-9]*; do
    physID=$(cat ${dir}/topology/physical_package_id)
    coreID=$(cat ${dir}/topology/core_id)
    num=$(( $(ls -d /dev/cpuset/${physID}_${coreID}_* 2>/dev/null | wc -l) + 1 ))
    ph_cores=$(( $(ls -d /dev/cpuset/* | grep ._._1 2>/dev/null | wc -l) + 0 ))
    ht_cores=$(( $(ls -d /dev/cpuset/* | grep ${physID}_${coreID}_. 2>/dev/null | wc -l) + 0 ))

    if [[ (("$ph_cores" == "$PHYSICAL_CORES") && ("$num" == "1")) || ("$ht_count" == "$SMT_CORES" && "$num" > "1") ]]
    then
        continue 1
    fi

    if [[ "$num" > "1" ]]
    then
        ht_count=$(($ht_count + 1))
    fi

    mkdir /dev/cpuset/${physID}_${coreID}_${num}
    cpus="$cpus$(echo $dir|sed -e 's/^.*system\/cpu\/cpu//'),"
done
#End of creating temporary cpusets
#End of creating cpusets

#Allocating cpus and PIDs to the cpuset
echo $cpus > /dev/cpuset/cpuset1/cpus
for ((i=0;i<${#PID[*]};i++)); do
    echo ${PID[${i}]} > /dev/cpuset/cpuset1/tasks
done
#End of Allocating CPUs and PIDs to the cpuset
echo done.

Listing A.2: cr_specific_cpusets.sh

A.6.2. worker.sh

#!/bin/bash
#Signalhandler
on_signal()
{
    exec /bin/bash -c "${ARGS}"
    exit 0
}
#End of Signalhandler

#Registration of the Signal "SIGUSR1"
trap 'on_signal' SIGUSR1

#Parsing of Arguments
ARGS=""
for arg in $@; do
    ARGS="${ARGS} ${arg}"
done
#End of parsing arguments

#Busyloop that waits for the trap
until [ ]
do sleep 1
done
#End of busyloop

Listing A.3: worker.sh

A.6.3. start.sh

#!/bin/bash

#Usage
if [[ "$2" == "" || "$1" == "" || "$3" == "" ]]
then
    echo "Usage: ./$(basename $0) #NATIVE_CORES #SMT_CORES #INSTANCES ./YOUR_APPLICATION.sh"
    exit 0
fi
#End of Usage

#Declaration of Variables
ARGS1=""
ARGS2=""
COUNT=0
i=0
KILL=""
#End Declaration

#Parsing Arguments
for arg in $@; do

    if [[ $i != 0 && $i != 1 ]]
    then
        if [ "$i" == "2" ]
        then
            COUNT=${arg}
        else
            ARGS1="${ARGS1} ${arg}"
        fi
    else
        ARGS2="${ARGS2} ${arg}"
    fi

    i=$(($i + 1))
done
#End of parsing arguments

#Starting of worker.sh for $COUNT times
for ((i=0;i<$COUNT;i++)); do
    /bin/bash -c "./worker.sh $ARGS1" &
    KILL="$KILL $!"
done
#End of starting worker.sh

PARAM="$ARGS2 $KILL"

#Creating and allocating of cpusets and setting of PIDs to the cpuset
./cr_specific_cpusets.sh $PARAM
sleep 1
#End of creating and allocating of cpusets

#Sending the Signal "SIGUSR1" to get a synchronized start of every instance
kill -SIGUSR1 $KILL
#End of sending the signal

Listing A.4: start.sh

A.6.4. bench.sh

#!/bin/bash

# This small script acts as a scheduler and tries to reschedule
# the submitted job every ${period},
# allowing only one job to run on the same server

lockfile="/tmp/bench"
period="30 minutes"

command_to_run="$@"

reschedule () {
    echo "rescheduling job ${command_to_run}"
    at now + ${period} <<EOF
$0 ${command_to_run}
EOF
    exit 0
}

if [ ! -e ${lockfile} ]; then
    echo $$ > ${lockfile}
    sleep 2
    if [ "-$$-" != "-$(cat ${lockfile})-" ]; then
        reschedule
        exit 0
    else
        echo "executing $@"
        $@
        rm ${lockfile}
    fi
else
    reschedule
fi

Listing A.5: bench.sh


A.7. Framework

A.7.1. cr_specific_cpusets.sh

#!/bin/bash

#
# MUST BE RUN AS ROOT
#

#Parameter 1: The count of the native cores
#Parameter 2: The count of the SMT-Cores
#Parameter 3: PID that must be attached to the newly created cpuset

#Usage
usage="Usage:\n1) $(basename $0) #NATIVE_CORES #SMT_CORES PID\n2) $(basename $0) -n #NATIVE_CORES #SMT_CORES (Prints out the needed core-IDs)\n3) $(basename $0) -c CPUSET_NAME COREIDS (Creates CPUset with a name and attaches coreIDs)\n4) $(basename $0) -a CPUSET_NAME PID (Associates a PID to the cpuset)\n5) $(basename $0) -d CPUSET_NAME (Deletes a given CPUset)"

n=0
c=0
a=0
d=0
name="cpuset1"

if [[ $# < 2 ]]
then
    echo -e $usage
    exit 0
fi
#End of Usage

#Parsing the arguments
if [ $1 -eq $1 2> /dev/null ]
then
    PHYSICAL_CORES=$1
    SMT_CORES=$2
    PID_TO_ATTACH=$3
else
    case "$1" in
    -n)
        PHYSICAL_CORES=$2
        SMT_CORES=$3
        n=1
        ;;
    -c)
        name=$2
        cpus=$3
        c=1
        ;;
    -a)
        name=$2
        PID_TO_ATTACH=$3
        a=1
        ;;
    -d)
        d=1
        ;;
    esac
fi

#End of parsing arguments

MOUNT=$(mount | cut -d' ' -f1 | grep cpuset)

if [ "$MOUNT" == "" ]
then
    #mount CPUset if exists
    if [ -d /dev/cpuset ]
    then
        mount -t cpuset -ocpuset cpuset /dev/cpuset
    else
        #Creates new CPUset
        mkdir /dev/cpuset
        mount -t cpuset -ocpuset cpuset /dev/cpuset
        # End of creating new CPUset
    fi
fi

#Deleting the inner sub-CPUsets
if [[ $d == 1 ]]
then
    rmdir /dev/cpuset/$2/
else
    #Deleting all old CPUsets
    if [ $1 -eq $1 2> /dev/null ] && [ -d /dev/cpuset ]
    then
        declare -a SETS
        SETS=($(ls -d /dev/cpuset/$name/*/ 2> /dev/null))

        for ((i=0; i<${#SETS[*]}; i++)); do
            rmdir ${SETS[$i]} 2> /dev/null
        done
    fi
fi

if [[ $d == 1 ]]
then
    rmdir /dev/cpuset/$2/
else
    #Deleting all old CPUsets
    if [ $1 -eq $1 2> /dev/null ] && [ -d /dev/cpuset ]
    then
        declare -a SETS
        SETS=($(ls -d /dev/cpuset/*/))

        for ((i=0; i<${#SETS[*]}; i++)); do
            rmdir ${SETS[$i]} 2> /dev/null
        done
    fi
fi

#Deleting all temporary CPUsets
declare -a SETS
SETS=($(ls -d /dev/cpuset/*_*_*/ 2> /dev/null))

for ((i=0; i<${#SETS[*]}; i++)); do
    rmdir ${SETS[$i]} 2> /dev/null
done

if [ $1 -eq $1 2> /dev/null ] || [ $n -eq 1 ]
then
    #Creates temporary cpusets
    cpus=""
    ht_count=0
    for dir2 in $(ls /sys/devices/system/cpu/ | grep -e ^cpu[0-9]*$ | sort -k 1.4gr); do
        dir=$(echo "/sys/devices/system/cpu/")${dir2};
        physID=$(cat ${dir}/topology/physical_package_id)
        coreID=$(cat ${dir}/topology/core_id)
        num=$(( $(ls -d /dev/cpuset/${physID}_${coreID}_* 2>/dev/null | wc -l) + 1 ))
        ph_cores=$(( $(ls -d /dev/cpuset/* | grep ._._1 2>/dev/null | wc -l) + 0 ))
        ht_cores=$(( $(ls -d /dev/cpuset/* | grep ${physID}_${coreID}_. 2>/dev/null | wc -l) + 0 ))

        if [[ (("$ph_cores" == "$PHYSICAL_CORES") && ("$num" == "1")) || ("$ht_count" == "$SMT_CORES" && "$num" > "1") ]]
        then
            continue 1
        fi

        if [[ "$num" > "1" ]]
        then
            ht_count=$(($ht_count + 1))
        fi

        mkdir /dev/cpuset/${physID}_${coreID}_${num}
        cpus="$cpus$(echo $dir|sed -e 's/^.*system\/cpu\/cpu//'),"
    done
    #End of creating temporary cpusets
    #End of creating cpusets
fi

#Creates the CPUset
if [ $1 -eq $1 2> /dev/null ] || [ $c == 1 ]
then
    mkdir -p /dev/cpuset/$name/
    echo 1 > /dev/cpuset/$name/cpu_exclusive
    echo 0 > /dev/cpuset/$name/mems
    echo $cpus > /dev/cpuset/$name/cpus
fi

#Allocates cpus and PID to CPUset
if ([ $1 -eq $1 2> /dev/null ] && [ $# -eq 3 ]) || [ $a == 1 ]
then
    echo ${PID_TO_ATTACH} > /dev/cpuset/$name/tasks
fi

if [[ $n == 1 ]]
then
    echo $cpus
fi
#End of create the CPUset and allocate cpus and PID to it
sleep 1

exit 0

Listing A.6: cr_specific_cpusets.sh

A.7.2. worker.sh

#!/bin/bash
#Signalhandler
on_signal()
{
    exec /bin/bash -c "export RESULT_DIRECTORY=${result_directory}; /usr/bin/time -a -o ${result_directory}/duration.log ${ARGS} > ${result_directory}/output.log"
    exit 0
}
#End of Signalhandler

#Registration of the Signal "SIGUSR1"
trap 'on_signal' SIGUSR1

#Parsing of Arguments
result_directory=$1
ARGS=""
i=0
for arg in $@; do
    if [ "$i" != "0" ]
    then
        ARGS="${ARGS} ${arg}"
    fi
    i=$(($i + 1))
done
#End of parsing arguments

#Busyloop that waits for the trap
until [ ]
do sleep 1
done
#End of busyloop

Listing A.7: worker.sh

A.7.3. start.sh

#!/bin/bash

#Usage
if [[ "$2" == "" || "$1" == "" ]]
then
    echo "Usage: $(basename $0) #NATIVE_CORES #SMT_CORES RESULT_DIRECTORY YOUR_APPLICATION.sh"
    exit 0
fi
#End of Usage


#Declaration of Variables
COMMAND=""
RESULT_DIRECTORY=""
KILL=""
c=0
#End Declaration


#Parsing Arguments
if [ $1 -eq $1 2> /dev/null ]
then
    REAL_CORES=$1
    SMT_CORES=$2
    RESULT_DIRECTORY=$3
else
    CPUSET=$2
    RESULT_DIRECTORY=$3
    c=1
fi

i=0
j=2

#if [[ $c -eq 0 ]]
#then
#    j=$(($j + 1))
#fi

#echo $j

for arg in $@; do
    if [[ $i > $j ]]
    then
        COMMAND="${COMMAND} ${arg}"
    fi
    i=$(($i + 1))
done

#End of parsing arguments


if [[ $c == 0 ]]
then
    #Creates and allocates the cpuset and associates this script to the newly created cpuset (run as root)
    sudo cr_specific_cpusets.sh $REAL_CORES $SMT_CORES $$
    #End of creating and allocating of cpusets
else
    #Takes given CPUset and attaches script's PID to it
    sudo cr_specific_cpusets.sh -a $CPUSET $$
fi


#Starting of worker.sh in the RESULT_DIRECTORY structure
for dir in ${RESULT_DIRECTORY}/[0-9]*; do
    # start the benchmarks
    /bin/bash -c "worker.sh ${dir} ${COMMAND}" &
    KILL="$KILL $!"
done
#End of starting worker.sh


sleep 1
#Sending the Signal "SIGUSR1" to get a synchronized start of every instance
kill -SIGUSR1 ${KILL}
#End of sending the signal

wait

Listing A.8: start.sh
