Protecting AES on TriCore against Power Analysis Attacks

EMSEC


Robert Spielmann

Master’s Thesis. August 27, 2013.
Chair for Embedded Security – Prof. Dr.-Ing. Christof Paar
Advisor: Dr.-Ing. Timo Kasper

Abstract

Embedded systems are increasingly influencing modern everyday life. Mobile phones now have more computing power than a single computer did some years ago. Vehicles are equipped with a perpetually growing number of electronic units that control everything from electric windows down to engine management. It becomes evident that embedded security is required more than ever before. With our thesis we aim at laying the foundation for a solution to one small part of the issues arising from embedded computing. We create a number of AES implementations for the Infineon TriCore TC1797. The basic implementation features no protection against power analysis attacks. We analyze current masking schemes and create AES implementations protected with two of those schemes. We analyze the resistance of the protected implementations against first-order Correlation Power Analysis (CPA) and give an outlook towards potential areas of future work.

Acknowledgements

In October 2010, I set out on a journey towards a Master’s degree in Applied IT Security. At that time, I had already been working full-time for three and a half years and I did not consider it an option to give up on my day job. Therefore I decided to enrol in a distance learning programme, accepting the fact that I would have to cope with a high workload and little free time over the course of two to three years. Looking back, I come to realize that being successful at work and at university in parallel would have been impossible without the tremendous amount of help and support I received.

My employer, codecentric AG, played an essential part in my success. The company was generous enough to pay my tuition fees. In addition, both my managers and my coworkers trusted me to the highest possible extent. Based on this, I was able to achieve the optimal balance between project-related work and exams. During my work on this thesis, I was allowed to spend a number of weeks at the chair’s laboratory in Bochum, which was essential regarding the technical parts of this work.

The decision to write my thesis at the Chair for Embedded Security at Ruhr-Universität Bochum (EMSEC) was probably the best I could have made. Timo Kasper entrusted me with the non-trivial subject of this thesis. He advised and supported me during all phases of my work. Falk Schellenberg and David Oswald gave me valuable input and feedback, and helped me overcome a number of technical challenges. Christof Paar and Amir Moradi took the time to discuss some mathematical aspects. All EMSEC staff I met were always ready to help, even in extremely busy times.

For some specific topics, I turned to external entities. I wish to thank Matthieu Rivain for a very helpful e-mail exchange revolving around countermeasures. Louis Goubin and Vincent Rijmen kindly replied to mails as well. Elisabeth Oswald deserves thanks for allowing me to use some of the example data from the supplementary materials of [MOP07]. Werner Schindler discussed characteristics of random number generators with me in a spontaneous phone call. I found it highly satisfying to realize that members of the scientific community took me and my questions seriously and that they helped me without knowing me personally.

Looking at the private aspects of life, I am almost at a loss for words of appreciation. I am deeply indebted to my girlfriend who invested incredible amounts of energy and patience in order to keep me on track. Thank you for having my back no matter what!

Also in the private context, I thank my parents, my brothers, and all of my friends for being reliable partners in every condition of life. I had my head in the clouds for a long time. You kept faith with me even though I was often hard to reach. Thank you for believing in me!

I am happy to see that my journey towards the Master’s degree is now coming to a good ending. From my point of view, it was worth every single step. While I do not have a fitting quote at hand to conclude my writing, insiders know what I mean when I say this:

Praise the Sun!

Declaration

I hereby declare that this submission is my own work and that, to the best of my knowledge and belief, it contains no material previously published or written by another person nor material which to a substantial extent has been accepted for the award of any other degree or diploma of the university or other institute of higher learning, except where due acknowledgement has been made in the text.

Declaration (Erklärung)

I hereby affirm that I have written this thesis independently and have used no sources or aids other than those indicated, that all passages of the thesis taken verbatim or in substance from other sources are marked as such, and that this thesis has not been submitted in the same or a similar form to any other examination authority.

Robert Spielmann

Contents

1 Introduction
    1.1 Motivation
    1.2 Related Work
    1.3 Our Contribution
    1.4 Organization of this Thesis

2 Background: AES
    2.1 The AES Contest
    2.2 The Rijndael Algorithm
        2.2.1 Basic Properties
        2.2.2 Mathematical Foundation
        2.2.3 Encryption
        2.2.4 Decryption
        2.2.5 Implementation Outlook

3 Background: Power Analysis Attacks
    3.1 Simple Power Analysis
    3.2 Differential Power Analysis
    3.3 Correlation Power Analysis
    3.4 Other Types of Power Analysis Attacks
        3.4.1 Inferential Power Analysis
        3.4.2 Mutual Information Analysis
    3.5 Order of Power Analysis Attacks
    3.6 Selecting one Type of Power Analysis

4 Target Platform: TriCore TC1797
    4.1 The TriCore Concept
    4.2 Selecting a TriCore Product
    4.3 The TriCore Architecture
        4.3.1 Registers and Instructions
        4.3.2 Memory Layout and Addressing
        4.3.3 Calling Conventions
    4.4 Relevant Features of the TC1797
        4.4.1 CPU Cores
        4.4.2 Memory Sections and Caching
        4.4.3 Serial Communication
        4.4.4 Digital Input / Output (I/O)

5 Working Environment
    5.1 The EMSEC TriCore Board
    5.2 Communicating with the Board
        5.2.1 JTAG Hardware
        5.2.2 USB Connection to the Host Computer
    5.3 Measurement Setup
    5.4 Software Stack
        5.4.1 Development
        5.4.2 Measurement
        5.4.3 Evaluation and Attack

6 Software Countermeasures
    6.1 Different Approaches
        6.1.1 Masking
        6.1.2 Hiding
    6.2 Recently Proposed Masking Schemes
        6.2.1 Rivain-Prouff
        6.2.2 Rivain-Prouff without Mask Refreshing
        6.2.3 Kim-Hong-Lim
        6.2.4 Kim-Hong-Lim without Mask Refreshing
        6.2.5 Goubin-Martinelli
        6.2.6 Overall Masking of the AES Encryption
    6.3 Complexity and Resource Comparison
        6.3.1 Plain AES according to FIPS-197
        6.3.2 32-bit Optimized AES
        6.3.3 CPRR13 - Rivain-Prouff without Mask Refreshing
        6.3.4 KHL11 - Kim-Hong-Lim
        6.3.5 SKHL13 - Kim-Hong-Lim without Mask Refreshing
        6.3.6 Comparing the Estimates
    6.4 Selecting Candidates for Implementation

7 Implementation
    7.1 Random Number Generation
    7.2 Implementing AES
        7.2.1 Straightforward AES by the Book
        7.2.2 AES Optimized for the TriCore Architecture
        7.2.3 AES Protected with CPRR13
        7.2.4 AES Protected with SKHL13
        7.2.5 Verification Tests
    7.3 Findings on Execution Timing
        7.3.1 Caching
        7.3.2 Compiler Optimization
        7.3.3 Memory Management Tricks
    7.4 Key Takeaways

8 Side-Channel Analysis
    8.1 Requirements
        8.1.1 Technical Requirements
        8.1.2 Practical Requirements
    8.2 Attacking the Implementations
        8.2.1 Initial Experiments
        8.2.2 Common parameters for all attacks
        8.2.3 Unprotected 8-bit AES
        8.2.4 Unprotected 32-bit AES
        8.2.5 AES protected with SKHL13
        8.2.6 AES protected with CPRR13
    8.3 Results

9 Conclusion
    9.1 Breaking AES on TriCore using Power Analysis Attacks
    9.2 Protecting AES on TriCore against Power Analysis Attacks

10 Future Work
    10.1 TC1797 Profiling
    10.2 Optimizing the Implementations
    10.3 Additional Countermeasures
    10.4 Additional Attacks
    10.5 Extended Software Tooling
    10.6 Real-World Scenarios
    10.7 Newest Literature

A Acronyms

B Appendix
    B.1 Lookup Tables for the Kim-Hong-Lim Scheme

List of Figures

List of Tables

List of Algorithms

List of Listings

Bibliography

1 Introduction

In recent years the world has seen an increasing trend towards ubiquitous computing. Many areas of everyday life are now pervaded by embedded electronic devices. For example, many people have become accustomed to using smartphones and tablets. Such devices are built from complex and powerful hardware components. Software running on those devices provides the user with a broad range of features like telephony, photography, or internet access. At the same time many embedded devices lack even the most basic protection against eavesdropping and manipulation. Generally speaking, the trend towards pervasive computing carries along the need for ubiquitous cryptography. When applied correctly, cryptography can mitigate many of the risks modern technology bears.

Other industries aside from the mobile phone sector are affected by identical trends. The automotive industry serves as a good example. Some thirty years ago, a car used to be a mechanical object that could be repaired using mechanical tools. Starting in the early 1990s, an increasing number of electronic units were built into all kinds of models, ranging from the simple budget car to top-of-the-line luxury models. Some modern upper-class vehicles contain more than 100 Electronic Control Units (ECUs) [MNBSL10]. The ECUs control a broad range of infotainment, comfort, and safety systems. Infotainment features include navigation systems, radio units, and full media systems for passengers. In the area of safety we are used to systems like Antilock Braking System (ABS) and Electronic Stability Control (ESC). Comfort features like adaptive cruise control or Volvo’s Blind Spot Information System (BLIS)¹ round off the picture. Today, almost all engines are fully controlled and driven by electronic units, just like the brakes, the lights, the doors, and many other components of the vehicle. Most of the ECUs exchange data over bus systems because they cannot work fully by themselves. Some recent car models even communicate with the outside world by means of mobile Internet connections. It seems clear that an increasing need for protection is arising. Two main goals are of interest with regard to the automotive setting. Firstly, data being exchanged between components inside the car or with the environment outside the car should be protected against eavesdropping and manipulation. Secondly, manufacturers might be interested in tamper resistance to thwart the manipulation of firmware running on ECUs. More individual needs may already exist or be identified in the near future. The need for automotive-related security research is clearly underlined by projects like SCAAS (https://www.emsec.rub.de/research/projects/SCAAS/) and events like the annual escar (Embedded Security in Cars) conference.

¹ BLIS uses cameras inside the exterior rear-view mirrors to warn the driver of vehicles in the blind spot. A short explanatory video is available at http://www.volvocars.com/uk/top/my_volvo/videos/pages/volvo-blis.aspx.


1.1 Motivation

According to Infineon, TriCore products are used in every second car that is manufactured today. For example, the TC1797 is used in a broad range of engine control units manufactured by Bosch, which can be found in models like the Audi RS5, Land Rover Defender, and Mercedes-Benz G65 AMG. In fact, a simple Internet search featuring terms like “TriCore” and “ECU” directly reveals detailed instructions on how to perform chip tuning by means of ECU manipulation. Some search results even include images showing the contact points required to boot the respective TriCore product.

We mentioned earlier that the automotive industry might be in need of means to protect ECU firmware or inter-ECU communication. Employing cryptography could be one approach to providing the required security. The Advanced Encryption Standard (AES) is in widespread use for encryption and the TC1797 is one of the top TriCore products available as of today. Therefore we felt motivated to analyze the chances of mounting power analysis attacks against AES on the TC1797. We felt that the mere recovery of AES keys by means of power analysis would prove nothing new. Thus we decided to aim at creating AES implementations that are protected against first-order power analysis attacks.

1.2 Related Work

In this section we give a short overview of related work. We present details of the mentioned publications later in this thesis.

We know of exactly one scientific work that is directly related to the topic of our thesis. In June 2009 Andreas Hoheisel submitted his Master’s Thesis [Hoh09] in which he develops a side-channel resistant implementation of AES for a TriCore TC1796. While his implementation uses masking based on table recomputation, Hoheisel never performed any power consumption analysis but focused on potential timing attacks instead. Nevertheless, we are able to relate our work to selected parts of his thesis.

AES itself is defined by [FIP01]. The standard was published on November 26, 2001 by the National Institute of Standards and Technology (NIST). It specifies the Rijndael algorithm and names it “the AES algorithm”. Rijndael was developed by Joan Daemen and Vincent Rijmen. A book titled “The Design of Rijndael” [DR02] provides detailed insight into design decisions and inner workings of AES.

In the area of power analysis attacks, [KJJ98] is the seminal whitepaper in which the authors introduce the notion of both Differential Power Analysis (DPA) and Simple Power Analysis (SPA). A third form of power analysis, called CPA, was introduced later [BCO04]. We additionally refer to [MOP07] as a source of profound theoretical knowledge about attacks and countermeasures. We explain DPA, SPA, and CPA in detail in Chap. 3.

When it comes to practical countermeasures, four papers are of vital importance for our thesis. All of those papers deal with software countermeasures against power analysis attacks. Firstly, we refer to [RP10a] and [CPRR13]. The former introduces the masking scheme itself while the latter exhibits a security flaw in this proposal and provides a fix for that flaw. Secondly, we found that [KHL11] is a viable alternative to [RP10a]. Finally, we analyzed the ideas presented in [GM11] in order to have a greater number of potential implementation alternatives. We give precise explanations of the distinct countermeasures, along with additional references, in Chap. 6.

1.3 Our Contribution

To the best of our knowledge there has not been much public research involving the combination of Side-Channel Analysis (SCA) and products from the TriCore family. We only know of [Hoh09]. We intend to improve this situation by conducting basic work on the software development for, and special features of, one product from the TriCore family: the TC1797. We put forward the following two propositions in accordance with the subject of our thesis:

Proposition 1.3.1 The key of an unprotected AES implementation running on the TC1797 can be recovered using power analysis.

as well as the complementary

Proposition 1.3.2 It is possible to protect AES on the TC1797 against first-order power analysis attacks.

We intend to verify the substance of both propositions in the course of our work. We assume that the results we find will serve as suitable starting points for further research or industrial projects.

1.4 Organization of this Thesis

This thesis is divided into distinct chapters where each chapter deals with one part of our overall work. In Chapter 1 we describe our goal and our motivation. We give an overview of related work and define general terms that span across the whole thesis. In Chapter 2 we present the algorithms comprising AES and their mathematical foundation. In Chapter 3 we give an introduction to the concept of power analysis attacks. We present different approaches and select one of them for our work. In Chapter 4 we introduce the reader to the target platform we selected for this thesis. We describe its features and the design philosophy behind the platform’s architecture. In Chapter 5 we give a detailed description of our working environment. We cover all hardware and software components comprising our development and testing environment and conclude the chapter with a report about our measurement setup. In Chapter 6 we look at general principles of software countermeasures against side-channel attacks. Subsequently we present recent publications proposing such countermeasures. We compare the proposals and select candidates for implementation. In Chapter 7 we give an in-depth report of how we created unprotected and protected implementations of AES. We explain what has to be taken into account when writing code for a microcontroller and summarize the key takeaways at the end of the chapter. In Chapter 8 we show how we attacked our AES implementations and which observations we made during this practical security analysis. In Chapter 9 we explain the conclusions we drew from the results we found. In Chapter 10 we present our opinion regarding potentially interesting areas of further research. We also discuss some general ideas we have around SCA, attacks and protection, and software used in scientific settings.

2 Background: AES

The name “Advanced Encryption Standard”, in short AES, reveals little about the algorithm behind it. The original algorithm that became standardized as the AES is called Rijndael. It was named after its developers, Joan Daemen and Vincent Rijmen. In this chapter, we recall how the Rijndael algorithm became the current de-facto standard for symmetric encryption. Subsequently, we describe the algorithm and its mathematical foundation.

2.1 The AES Contest

The term “Advanced” in the name of the standard hints towards a predecessor, which in our case is represented by the Data Encryption Standard (DES). The DES was specified with the release of FIPS-46 by the National Bureau of Standards (NBS)¹ in 1977. The standard was revalidated in 1983, 1988, 1993, and finally in 1999. This final revalidation, FIPS 46-3 [FIP99], declared 3DES the preferred implementation and restricted plain DES to legacy systems.

Between 1997 and 1999, RSA Security put up a total of four challenges to break DES encryption using brute force attacks. While the first challenge took 96 days to solve, the fourth challenge was solved within less than 24 hours. Additionally, Eli Biham and Adi Shamir had published multiple papers on differential cryptanalysis. With the publication of [BS92], a theoretic attack with lower complexity than brute force existed. It became evident that the time had come to find a new algorithm to be used as both government and industry standard for data encryption.

In 1997, NIST started the AES contest. A total of 15 candidate algorithms were submitted. The contest consisted of multiple rounds and narrowed the candidates down to five finalists:

∙ Rijndael, designed by Vincent Rijmen and Joan Daemen.

∙ Twofish, designed by Bruce Schneier, Niels Ferguson, John Kelsey, Doug Whiting, David Wagner, and Chris Hall.

∙ Serpent, designed by Ross Anderson, Eli Biham, and Lars Knudsen.

∙ RC6, designed by Ron Rivest.

∙ MARS, designed by a team of IBM researchers, including Don Coppersmith.

¹ Known as NIST since 1988.

In October 2000, NIST announced Rijndael as the winner of the AES contest. Following a drafting and request-for-comments phase, NIST published FIPS-197 [FIP01] on November 26, 2001. The standard became effective on May 26, 2002. Since that day, Rijndael has been the de-facto standard for symmetric encryption. Its predecessor, DES, co-existed within clearly set boundaries until NIST withdrew FIPS 46-3 in 2005.

2.2 The Rijndael Algorithm

In this section, we dive into the internals of Rijndael. The algorithm specified as the AES fixes certain parameters like the key length, which makes it a slight modification of the original (more generic) Rijndael algorithm. Henceforth, we use the term AES:

Definition 2.2.1 Throughout this thesis, we use the term AES to indicate the algorithm standardized in FIPS-197, not the document itself.

For the scope of this thesis, we decided to work with a key length of 128 bits. In other words, this means that we focus on AES-128. Where applicable, we provide algorithm descriptions tailored to this specific configuration.

2.2.1 Basic Properties

AES is a block cipher based on a Substitution-Permutation Network (SPN). Input and output blocks have a fixed length of 128 bits. The internal state of the cipher consists of 128 bits as well, arranged as $N_b = 4$ columns, where each column contains four bytes (one 32-bit word). The key length can be chosen from 128, 192, and 256 bits. As mentioned before, we fix the key length at $n = 128$ bits or $N_k = 4$ 32-bit words. The combination of $N_b = 4$ and $N_k = 4$ determines that the number of rounds amounts to $N_r = 10$.

In terms of data, AES operates on bytes. One block of input, the state, and the output are each represented by 16 bytes (four 32-bit words). The same holds for the key in our case. Next, we describe the cipher.

2.2.2 Mathematical Foundation

Each byte manipulated in the AES during encryption or decryption represents an element of GF($2^8$). This finite field is constructed over GF(2) using the irreducible polynomial

\[ P(x) = x^8 + x^4 + x^3 + x + 1 . \tag{2.1} \]

Each element of GF($2^8$) is represented by a polynomial

\[ a(x) = \sum_{i=0}^{7} a_i x^i = a_7 x^7 + a_6 x^6 + \cdots + a_1 x + a_0 \]

where all coefficients $a_i \in$ GF(2). The degree of at most 7 and the binary coefficients make it easy to encode such a polynomial as a byte by mapping the coefficients to distinct bit positions. For example, the polynomial $a(x) = x^7 + x^5 + x^2 + 1$ has $a_7 = a_5 = a_2 = a_0 = 1$ and $a_6 = a_4 = a_3 = a_1 = 0$. This gives the binary notation $10100101_2$, which is equal to 0xA5 in hexadecimal notation. Whenever we refer to an element of GF($2^8$) in hexadecimal notation, we denote this by using curly braces, for example, {A5}.
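The mapping between polynomials and bytes can be illustrated with a short Python sketch. The function names `poly_to_byte` and `byte_to_poly` are our own illustrative choices and do not appear in FIPS-197; a polynomial is given here simply as the list of exponents of its non-zero terms.

```python
def poly_to_byte(exponents):
    """Encode a polynomial over GF(2) as a byte.

    `exponents` lists the powers of x whose coefficient is 1,
    e.g. [7, 5, 2, 0] for x^7 + x^5 + x^2 + 1.
    """
    value = 0
    for e in exponents:
        if not 0 <= e <= 7:
            raise ValueError("exponent does not fit into a single byte")
        value |= 1 << e  # coefficient a_e is stored in bit position e
    return value


def byte_to_poly(value):
    """Return the exponents of the non-zero terms, highest power first."""
    return [e for e in range(7, -1, -1) if (value >> e) & 1]


# The example from the text: x^7 + x^5 + x^2 + 1 corresponds to {A5}.
example = poly_to_byte([7, 5, 2, 0])
```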

Over GF($2^8$), addition and multiplication are defined. We recall the rules for both operations in the following.

Addition is performed coefficient-wise. The coefficients are added modulo 2 because they come from GF(2). Bitwise addition modulo 2 corresponds to calculating the Exclusive Or (XOR) of both values. This gives

\[ a(x) + b(x) = \sum_{i=0}^{7} a_i x^i + \sum_{i=0}^{7} b_i x^i = \sum_{i=0}^{7} (a_i \oplus b_i) x^i \]

where $\oplus$ denotes the XOR operation. In practice, the XOR is not computed one bit at a time but applied to whole bytes. Because the coefficients are from GF(2), addition is equivalent to subtraction. The neutral element is given by {00} ∈ GF($2^8$).

Multiplication is performed like regular polynomial multiplication followed by the reduction of the product modulo $P(x)$ as given in (2.1). While the product without modular reduction would still be equivalent to an element of GF($2^8$), it is technically important to perform the modular reduction because only then can the result still be represented as a single byte. The neutral element for multiplication is given by {01} ∈ GF($2^8$). A multiplicative inverse exists for all non-zero elements of GF($2^8$) because $P(x)$ is irreducible. The multiplicative inverse can be found in different ways, one of them being the Extended Euclidean Algorithm (EEA). We discuss alternatives to this approach in Chap. 6.
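The two field operations can be sketched in a few lines of Python. This is a minimal illustration under our own naming (`gf_add`, `gf_mul`, `gf_inv`), not the implementation used later in this thesis. Note that the inverse below is computed by exponentiation ($a^{254}$, which equals $a^{-1}$ because $a^{255} = 1$ for every non-zero $a$) rather than by the EEA mentioned above; we chose it only because it is short.

```python
AES_POLY = 0x11B  # P(x) = x^8 + x^4 + x^3 + x + 1, including the x^8 term


def gf_add(a, b):
    """Addition in GF(2^8): coefficient-wise addition modulo 2, i.e. XOR."""
    return a ^ b


def gf_mul(a, b):
    """Multiplication in GF(2^8): shift-and-add with reduction modulo P(x)."""
    result = 0
    while b:
        if b & 1:            # current coefficient of b is 1: add a * x^k
            result ^= a
        a <<= 1              # multiply a by x
        if a & 0x100:        # degree reached 8: reduce modulo P(x)
            a ^= AES_POLY
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse via a^254 (square-and-multiply).

    {00} has no inverse; this routine returns {00} for it, which matches
    the convention used by SubBytes.
    """
    result, base, exponent = 1, a, 254
    while exponent:
        if exponent & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        exponent >>= 1
    return result
```

The values {57} and {83} used for checking such a routine are the worked example from FIPS-197: their sum is {D4} and their product is {C1}.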

2.2.3 Encryption

In order to describe the cipher, we first reproduce the pseudo code from [FIP01]. We substitute concrete numbers for the symbolic parameters. Algorithm 2.2.1 shows the resulting steps for a single encryption. First, the state is initialized with the input. Next, an initial AddRoundKey operation is performed. Then, the same four operations are repeated nine times in the same order: SubBytes, ShiftRows, MixColumns, and AddRoundKey. The tenth and final round is almost the same, except that it omits the MixColumns operation. Finally, the state is copied into the output. In the following, we take a look at the four elementary operations defined in AES.
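To make the sequence of steps concrete, the following Python sketch performs one complete AES-128 encryption. It is a compact illustration only and is neither optimized nor protected against side-channel attacks. All function names are our own; the state is assumed to be a flat list of 16 bytes in column-major order (`state[r + 4*c]` holds row `r` of column `c`), and the S-box is generated by a brute-force inverse search for brevity. The individual operations and the key expansion are explained in the remainder of this section.

```python
def gf_mul(a, b):
    """Multiply two bytes as elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def _sbox_entry(x):
    # Brute-force inverse ({00} maps to {00}), then the affine transformation.
    inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
    rotl = lambda v, k: ((v << k) | (v >> (8 - k))) & 0xFF
    return inv ^ rotl(inv, 1) ^ rotl(inv, 2) ^ rotl(inv, 3) ^ rotl(inv, 4) ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def expand_key(key):
    """Return the 44 words of the key schedule, each as a list of four bytes."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = list(w[i - 1])
        if i % 4 == 0:
            temp = [SBOX[b] for b in temp[1:] + temp[:1]]  # SubWord(RotWord(temp))
            temp[0] ^= rcon
            rcon = gf_mul(rcon, 0x02)
        w.append([a ^ b for a, b in zip(w[i - 4], temp)])
    return w


def add_round_key(state, words):
    return [s ^ words[i // 4][i % 4] for i, s in enumerate(state)]


def sub_bytes(state):
    return [SBOX[s] for s in state]


def shift_rows(state):
    # Row r is rotated left by r positions.
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


def mix_columns(state):
    out = []
    for c in range(4):
        s0, s1, s2, s3 = state[4 * c:4 * c + 4]
        out += [
            gf_mul(2, s0) ^ gf_mul(3, s1) ^ s2 ^ s3,
            s0 ^ gf_mul(2, s1) ^ gf_mul(3, s2) ^ s3,
            s0 ^ s1 ^ gf_mul(2, s2) ^ gf_mul(3, s3),
            gf_mul(3, s0) ^ s1 ^ s2 ^ gf_mul(2, s3),
        ]
    return out


def encrypt_block(block, key):
    """Encrypt one 16-byte block with a 16-byte key."""
    w = expand_key(key)
    state = add_round_key(list(block), w[0:4])
    for rnd in range(1, 10):
        state = mix_columns(shift_rows(sub_bytes(state)))
        state = add_round_key(state, w[4 * rnd:4 * rnd + 4])
    state = shift_rows(sub_bytes(state))  # final round omits MixColumns
    return bytes(add_round_key(state, w[40:44]))
```

A sketch like this can be checked against the example vectors in the appendices of FIPS-197, for instance the key 000102...0f with the plaintext 00112233...eeff.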

SubBytes

SubBytes substitutes each individual byte of the state by a new value. Straightforward AES implementations usually resort to a static lookup table for performance reasons. In contrast, implementations resistant to side-channel attacks often compute the SubBytes operation dynamically. Therefore we describe the computation in detail.

Algorithm 2.2.1: AES-128 Cipher – Encryption
Data: bytes in[16], words w[44]
Result: bytes out[16]

begin
    byte state[16]
    state ← in

    AddRoundKey(state, w[0..3])
    for round ← 1 to 9 do
        SubBytes(state)
        ShiftRows(state)
        MixColumns(state)
        AddRoundKey(state, w[4 * round .. 4 * round + 3])

    SubBytes(state)
    ShiftRows(state)
    AddRoundKey(state, w[40..43])

    out ← state

Computing the output of SubBytes consists of two steps. First, the multiplicative inverse of the input is computed over GF($2^8$), where {00} is mapped to itself. Second, the resulting byte is subjected to an affine transformation. In matrix notation, this transformation is denoted as

\[
\begin{pmatrix} b'_0 \\ b'_1 \\ b'_2 \\ b'_3 \\ b'_4 \\ b'_5 \\ b'_6 \\ b'_7 \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1
\end{pmatrix}
\cdot
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{pmatrix}
+
\begin{pmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \end{pmatrix}
\tag{2.2}
\]

where $b_0$ is the Least Significant Bit (LSB) and $b_7$ is the Most Significant Bit (MSB). The result of the affine transformation serves as the substitute for the input value.
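The two-step computation can be sketched as follows. This is an unoptimized illustration with names of our own choosing (`gf_inv`, `affine`, `sub_byte`). The inverse is obtained by exponentiation ($a^{254}$), and the affine step evaluates row $i$ of the matrix in (2.2) as $b'_i = b_i \oplus b_{(i+4) \bmod 8} \oplus b_{(i+5) \bmod 8} \oplus b_{(i+6) \bmod 8} \oplus b_{(i+7) \bmod 8} \oplus c_i$, where $c$ = {63} is the constant vector.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Step 1: multiplicative inverse via a^254; {00} is mapped to itself."""
    result, base, exponent = 1, a, 254
    while exponent:
        if exponent & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        exponent >>= 1
    return result


def affine(b):
    """Step 2: the affine transformation (2.2), evaluated bit by bit."""
    out = 0
    for i in range(8):
        bit = (b >> i) & 1
        for offset in (4, 5, 6, 7):
            bit ^= (b >> ((i + offset) % 8)) & 1
        bit ^= (0x63 >> i) & 1  # constant vector, b_0 is the LSB
        out |= bit << i
    return out


def sub_byte(x):
    """SubBytes for a single byte."""
    return affine(gf_inv(x))


# A static lookup table, as used by straightforward implementations,
# can be generated from the dynamic computation.
SBOX = [sub_byte(x) for x in range(256)]
```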

ShiftRows

ShiftRows performs a cyclical byte-wise left shift on the rows of the state. The first row is not shifted. The second row is shifted by one byte, the third row is shifted by two bytes, and the fourth row is shifted by three bytes. This gives

\[
\begin{bmatrix}
s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} \\
s_{1,0} & s_{1,1} & s_{1,2} & s_{1,3} \\
s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} \\
s_{3,0} & s_{3,1} & s_{3,2} & s_{3,3}
\end{bmatrix}
\longrightarrow
\begin{bmatrix}
s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} \\
s_{1,1} & s_{1,2} & s_{1,3} & s_{1,0} \\
s_{2,2} & s_{2,3} & s_{2,0} & s_{2,1} \\
s_{3,3} & s_{3,0} & s_{3,1} & s_{3,2}
\end{bmatrix}
\]

where the matrix on the right represents the new state after ShiftRows has completed.
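As a minimal sketch, ShiftRows amounts to one list rotation per row. The representation of the state as a list of four rows is an assumption made for readability here; an implementation that stores the state column by column would permute indices instead.

```python
def shift_rows(state):
    """Rotate row r of the state cyclically to the left by r byte positions.

    `state` is a list of four rows, each a list of four bytes.
    """
    return [row[r:] + row[:r] for r, row in enumerate(state)]


# Label each byte with its original position: 0x12 stands for s_{1,2}.
state = [[0x10 * r + c for c in range(4)] for r in range(4)]
shifted = shift_rows(state)
```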

MixColumns

MixColumns operates on each column of the state. A column is interpreted as a four-term polynomial with coefficients from GF($2^8$). Each column is multiplied modulo $x^4 + 1$ with the fixed polynomial

\[ a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\} . \]

This multiplication can be written in matrix form as

\[
\begin{pmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{pmatrix}
=
\begin{pmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{pmatrix}
\cdot
\begin{pmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{pmatrix}
\tag{2.3}
\]

where $0 \le c < 4$ indexes the column being processed.
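Because the matrix in (2.3) contains only the constants {01}, {02}, and {03}, a column can be processed with a single helper that multiplies by {02}; multiplication by {03} is then a multiplication by {02} followed by an addition of the original value. The sketch below illustrates this under our own naming (`xtime`, `mul3`, `mix_column`) and is not taken from the implementations described later.

```python
def xtime(a):
    """Multiply a byte by {02} in GF(2^8): shift left, reduce modulo P(x)."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a


def mul3(a):
    """Multiply a byte by {03} = {02} + {01}."""
    return xtime(a) ^ a


def mix_column(column):
    """Apply the matrix from (2.3) to one column [s0, s1, s2, s3]."""
    s0, s1, s2, s3 = column
    return [
        xtime(s0) ^ mul3(s1) ^ s2 ^ s3,
        s0 ^ xtime(s1) ^ mul3(s2) ^ s3,
        s0 ^ s1 ^ xtime(s2) ^ mul3(s3),
        mul3(s0) ^ s1 ^ s2 ^ xtime(s3),
    ]
```

A column whose four bytes are identical is left unchanged, since each matrix row sums to {02} ⊕ {03} ⊕ {01} ⊕ {01} = {01}.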

AddRoundKey

Finally, AddRoundKey adds one word of the key schedule to each column of the state by means of an XOR operation. That is

\[
\begin{pmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{pmatrix}
=
\begin{pmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{pmatrix}
\oplus w[4 \cdot r + c]
\]

where $w$ is an array of four-byte words that is indexed based on the round $r$ and the column $c$ being processed. We explain the meaning of the array $w$ in the following.
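The indexing of the key schedule can be sketched as follows. The representation is our own assumption for illustration: the state is a list of four columns and every word of `w` is a list of four bytes, so that byte $j$ of word `w[4*r + c]` is added to $s_{j,c}$.

```python
def add_round_key(state, w, r):
    """XOR the four words of round key r onto the four columns of the state.

    state: list of four columns, each a list of four bytes
    w:     key schedule, a list of four-byte words (lists of four bytes)
    r:     round number, 0 for the initial AddRoundKey
    """
    return [
        [s ^ k for s, k in zip(state[c], w[4 * r + c])]
        for c in range(4)
    ]
```

Since XOR is its own inverse, applying the same round key twice restores the original state, which is why decryption can reuse this operation unchanged.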

Key Expansion

Up to now, we have not clarified what the key schedule is and how it is created. We see from Alg. 2.2.1 that one full encryption requires a total of 11 AddRoundKey invocations. The term “round key” refers to the use of one individual key in each round. Those individual keys comprise what is called the key schedule of the cipher. The round keys are derived from the cipher key based on a specific procedure called key expansion.

Algorithm 2.2.2: AES-128 Key Expansion
Data: bytes key[16]
Result: words w[44]

begin
    word temp

    w[0..3] ← key[0..15]

    i ← 4
    while i < 44 do
        temp ← w[i − 1]
        if i mod 4 = 0 then
            temp ← SubWord(RotWord(temp)) ⊕ Rcon[i/4]
        w[i] ← w[i − 4] ⊕ temp
        i ← i + 1

In our case, the cipher key consists of 128 bits, and we need a total of 11 round keys of the same individual length. We provide the key expansion specification, tailored to AES-128, as Alg. 2.2.2. Executing this algorithm yields the array $w$ we mentioned earlier.

We see from the algorithm that the first four 32-bit words of the key schedule are filled with the cipher key. Inside the loop, SubWord simply applies the S-box to each byte in the given word while RotWord cyclically rotates the word one byte to the left, i.e., $[a_0, a_1, a_2, a_3]$ becomes $[a_1, a_2, a_3, a_0]$. Rcon is an array of words whose first byte contains the value of $x^{i-1}$ in $GF(2^8)$, starting at $i = 1$, while the remaining three bytes are {00}. We see from the loop index $i$ that we require only the first ten values of Rcon. This fact strongly reduces the amount of memory required for the Rcon lookup table.
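The key expansion can be sketched in Python as shown below. The sketch is self-contained and therefore derives the S-box and the ten Rcon values on the fly instead of using lookup tables from an existing implementation. All function names are chosen for this example, and words are stored with the first key byte in the most significant position, which is an assumption about byte order made for illustration.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def sbox_entry(x):
    """S-box value: inversion in GF(2^8), then the affine transformation."""
    inv = 1
    for _ in range(254):          # x^254 equals x^-1; 0 stays 0
        inv = gf_mul(inv, x)
    out = inv
    for k in range(1, 5):         # XOR of the byte rotated left by 1..4 bits
        out ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
    return out ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]

# RCON[i] = x^(i-1) in GF(2^8) for i = 1..10; index 0 is unused.
RCON = [0] * 11
value = 1
for i in range(1, 11):
    RCON[i] = value
    value = gf_mul(value, 0x02)


def rot_word(word):
    """[a0, a1, a2, a3] becomes [a1, a2, a3, a0]."""
    return ((word << 8) | (word >> 24)) & 0xFFFFFFFF


def sub_word(word):
    """Apply the S-box to each of the four bytes of a word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")


def expand_key(key):
    """Derive the 44 words of the AES-128 key schedule from a 16-byte key."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    for i in range(4, 44):
        temp = w[i - 1]
        if i % 4 == 0:
            temp = sub_word(rot_word(temp)) ^ (RCON[i // 4] << 24)
        w.append(w[i - 4] ^ temp)
    return w


key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")  # FIPS-197 example key
w = expand_key(key)
print(f"{w[4]:08x} {w[43]:08x}")  # a0fafe17 b6630ca6
```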

2.2.4 Decryption

Decryption works by performing inverse operations in a modified order as depicted in Alg. 2.2.3. In the following, we take a look at the inverse operations InvShiftRows, InvSubBytes, and InvMixColumns.

InvShiftRows

InvShiftRows does exactly the opposite of ShiftRows by shifting rows of the state cyclically to the right. The first row is not shifted. The second row is shifted by one byte, the third row is shifted by two bytes, and the fourth row is shifted by three bytes. This gives

\[
\begin{bmatrix}
s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} \\
s_{1,0} & s_{1,1} & s_{1,2} & s_{1,3} \\
s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} \\
s_{3,0} & s_{3,1} & s_{3,2} & s_{3,3}
\end{bmatrix}
\begin{matrix} \\ \rightarrow 1 \\ \rightarrow 2 \\ \rightarrow 3 \end{matrix}
\qquad
\begin{bmatrix}
s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} \\
s_{1,3} & s_{1,0} & s_{1,1} & s_{1,2} \\
s_{2,2} & s_{2,3} & s_{2,0} & s_{2,1} \\
s_{3,1} & s_{3,2} & s_{3,3} & s_{3,0}
\end{bmatrix}
\]

where the matrix on the right represents the new state after InvShiftRows has completed.

Algorithm 2.2.3: AES-128 Cipher – Decryption
Data: bytes in[16], words w[44]
Result: bytes out[16]

begin
    byte state[16]
    state ← in

    AddRoundKey(state, w[40..43])
    for round ← 9 down to 1 do
        InvShiftRows(state)
        InvSubBytes(state)
        AddRoundKey(state, w[round * 4 .. (round + 1) * 4 − 1])
        InvMixColumns(state)

    InvShiftRows(state)
    InvSubBytes(state)
    AddRoundKey(state, w[0..3])

    out ← state

InvSubBytes

InvSubBytes applies the inverse S-box individually to all bytes of the state by computing the inverse of the affine function given in (2.2) followed by inversion in $GF(2^8)$. The inverse of the affine function is given in [Gla07] as

\[
b'_i = b_{(i+2) \bmod 8} \oplus b_{(i+5) \bmod 8} \oplus b_{(i+7) \bmod 8} \oplus d_i
\]

with $d = \{05\}$. We can write this in matrix form as

\[
\begin{pmatrix} b'_0 \\ b'_1 \\ b'_2 \\ b'_3 \\ b'_4 \\ b'_5 \\ b'_6 \\ b'_7 \end{pmatrix}
=
\begin{pmatrix}
0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 1 & 0
\end{pmatrix}
\cdot
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{pmatrix}
+
\begin{pmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix} .
\tag{2.4}
\]

The same remark as for SubBytes holds for InvSubBytes: The inverse S-box is usually implemented as a lookup table, but is often computed in a secure way in the context of side-channel countermeasures. Therefore, we provide the detailed description of the inverse affine transformation here.
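The inverse S-box computation can be sketched in the same way as the forward direction. The following Python example applies the inverse affine transformation (2.4) first and the field inversion second. The function names are illustrative, and the naive exponentiation is used only to keep the example short.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse computed as a^254; {00} is mapped to itself."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def inv_affine(b):
    """Inverse affine transformation (2.4) with the constant d = {05}."""
    out = 0
    for i in range(8):
        bit = (b >> ((i + 2) % 8)) ^ (b >> ((i + 5) % 8)) \
            ^ (b >> ((i + 7) % 8)) ^ (0x05 >> i)
        out |= (bit & 1) << i
    return out


def inv_sub_byte(y):
    """Inverse S-box value: inverse affine transformation, then inversion."""
    return gf_inv(inv_affine(y))


print(hex(inv_sub_byte(0x63)), hex(inv_sub_byte(0xED)))  # 0x0 0x53
```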

InvMixColumns

InvMixColumns is the inverse of MixColumns. The columns of the state are multiplied modulo $x^4 + 1$ by the fixed polynomial

\[
a^{-1}(x) = \{0b\}x^3 + \{0d\}x^2 + \{09\}x + \{0e\} .
\]

This multiplication can again be written in matrix form as

\[
\begin{pmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{pmatrix}
=
\begin{pmatrix}
0e & 0b & 0d & 09 \\
09 & 0e & 0b & 0d \\
0d & 09 & 0e & 0b \\
0b & 0d & 09 & 0e
\end{pmatrix}
\cdot
\begin{pmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{pmatrix}
\]

where $0 \le c < 4$ indexes the column being processed.
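A direct, unoptimized Python sketch of this matrix product is given below. It uses a generic multiplication in $GF(2^8)$ for the larger coefficients. The names `INV_MIX` and `inv_mix_column` are chosen for this example.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


# Coefficient matrix of InvMixColumns, one list per row.
INV_MIX = [
    [0x0E, 0x0B, 0x0D, 0x09],
    [0x09, 0x0E, 0x0B, 0x0D],
    [0x0D, 0x09, 0x0E, 0x0B],
    [0x0B, 0x0D, 0x09, 0x0E],
]


def inv_mix_column(column):
    """Multiply one state column by the InvMixColumns matrix."""
    out = []
    for row in INV_MIX:
        acc = 0
        for coefficient, byte in zip(row, column):
            acc ^= gf_mul(coefficient, byte)
        out.append(acc)
    return out


print([hex(b) for b in inv_mix_column([0x8E, 0x4D, 0xA1, 0xBC])])
```

Applied to the output of MixColumns for the column {db, 13, 53, 45}, the function returns the original column again.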

2.2.5 Implementation Outlook

The internal data structures of AES primarily facilitate software implementations of the cipher. The input, the state, the output, and the key schedule are easy to represent as arrays that contain bytes. AES can be implemented in a straightforward manner on 8-bit processors by simply converting the pseudo code from the specification into compilable code in a language like C.

In our case, the target system features a 32-bit Central Processing Unit (CPU). It is not efficient to process single bytes at a time on a 32-bit processor because doing so leaves 24 bits of the affected registers unused. It is highly beneficial to manipulate full 32-bit words whenever this is possible. To our advantage, many of the AES operations can be optimized in such a way that they use the full potential of the 32-bit architecture. Firstly, some operations like AddRoundKey can operate on full four-byte words at one time. Secondly, there are cases in which it is possible to combine up to four individual 8-bit operations in one 32-bit register by using special instructions. We describe the target platform in detail in Chap. 4, and discuss AES optimization opportunities and limits in Chap. 7.
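Both kinds of optimization can be illustrated with a short Python sketch. The first function XORs whole 32-bit words, as AddRoundKey permits. The second performs four multiplications by {02} at once inside a single word, using plain masking instead of any special instruction. It is a portable stand-in chosen for this example and makes no claim about the instructions available on the target platform.

```python
def pack(column):
    """Pack four state bytes into one 32-bit word (first byte most significant)."""
    return int.from_bytes(bytes(column), "big")


def add_round_key(state_words, round_key_words):
    """XOR four state columns with four round key words, one word at a time."""
    return [s ^ k for s, k in zip(state_words, round_key_words)]


def xtime4(word):
    """Multiply all four bytes of a word by {02} in GF(2^8) in parallel."""
    high_bits = word & 0x80808080           # bytes that overflow when doubled
    shifted = (word & 0x7F7F7F7F) << 1      # double each byte without carry-over
    return shifted ^ ((high_bits >> 7) * 0x1B)  # reduce the overflowing bytes


column = pack([0xDB, 0x13, 0x53, 0x45])
print(f"{xtime4(column):08x}")  # ad26a68a
```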

3 Background: Power Analysis Attacks

Cryptographic algorithms have always been under attack by adversaries who intend to reveal the secret message, the secret key, or both. Many attacks exist, ranging from plain brute force to the exploitation of statistical properties in cipher texts or simply design flaws in the targeted algorithm. Those attacks represent classic forms of cryptanalysis.

Other forms of cryptanalysis include linear cryptanalysis [MY92] where the goal is to find affine approximations of a cipher, and differential cryptanalysis [BS90] where the results of subtle changes to the input of a cipher are studied in order to discover potential weaknesses. Linear and differential cryptanalysis have evolved from plain attack forms into scientific methods which are applied during the design phase of new cryptographic algorithms.

One of the most prevalent cryptanalytic techniques today is called side-channel analysis. Instead of mounting a direct attack on a cryptographic algorithm, the attacker attempts to gain useful information by observing a physical implementation of the cipher. The desired information flows from the cryptographic device to the attacker through what is called a side channel. Common types of side channels include

∙ Sounds produced by mechanical devices like the Enigma, lock dials, or even computers [ST04],

∙ Time required to perform a cryptographic operation [Koc96],

∙ Electromagnetic Fields caused by cryptographic devices (see for example [Kas11]),

∙ Light emitted from devices or distinct parts (see for example [SNK+12]), and

∙ Power Consumption of a cryptographic device during the execution of a cryptographic algorithm [KJJ98].

Information gathered from side channels is usually stored and then analyzed in order to validate or invalidate one or more hypotheses put forward by the attacker. We are primarily interested in power analysis attacks, that is, attacks using the power consumption of a cryptographic device as the side channel. In the following, we describe three types of power analysis attacks. We subsequently select one of the three approaches for the rest of this thesis.

3.1 Simple Power Analysis

Simple Power Analysis (SPA) was introduced by Kocher et al. in [KJJ98]. In an SPA attack, recorded power traces are directly interpreted, for example by visual inspection.

Square and Multiply Exponentiation is a well-known example when it comes to SPA. If squarings and multiplications can be clearly distinguished in a power trace, an attacker can derive the secret exponent directly from the plotted trace. Block ciphers are potentially vulnerable to SPA as well. When [KJJ98] was published, DES was the standard for data encryption. The paper shows power traces of a DES encryption as examples of the information an attacker can uncover. In the case of AES, some SPA attacks have been launched against the key schedule [BS99, Man02].

SPA attacks become possible when the timing or the order of operations (i.e., conditional branches) depends on the data being processed. It is thus important to ensure, as far as possible, that an implementation of a cryptographic algorithm features a constant execution path, effectively suppressing most SPA-enabling characteristics.
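The square-and-multiply example can be made concrete with a short Python sketch. Instead of a power trace, the function records which operation it performs in each step. The assumption made for illustration is that an attacker can tell squarings ("S") from multiplications ("M") in a trace. Under that assumption the exponent follows directly from the operation sequence.

```python
def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary exponentiation that also logs its operations."""
    result = 1
    operations = []
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus
        operations.append("S")
        if bit == "1":                      # data-dependent branch
            result = (result * base) % modulus
            operations.append("M")
    return result, "".join(operations)


def recover_exponent(operations):
    """Read the exponent bits off the observed operation sequence."""
    bits = operations.replace("SM", "1").replace("S", "0")
    return int(bits, 2)


result, observed = square_and_multiply(7, 0b101101, 1009)
print(observed, recover_exponent(observed))  # SMSSMSMSSM 45
```

The conditional multiplication is exactly the kind of data-dependent branch that a constant execution path avoids.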

3.2 Differential Power Analysis

In Chapter 4 of [KJJ98], the authors note that “In addition to large-scale power variations due to the instruction sequence, there are effects correlated to data values being manipulated.” Even in the presence of noise and measurement errors, those smaller effects can be exploited using Differential Power Analysis (DPA).

Because we want to recover a secret AES key, we adapt the attack on DES proposed by Kocher et al. to AES. We choose a naive approach and try to attack the intermediate value after the first S-box in the first round. In order to do so, we first record 1000 power traces of an unprotected implementation, sending random plaintexts to the device. We store the plaintexts in a separate file. We know that the targeted intermediate depends only on the first plaintext byte and on the first key byte. We select the least significant bit of the intermediate as our target. The DPA selection function is then

\[
D(P, b, K)
\]

where the plaintext byte $P$ is known, the bit index $b$ is 0 because we target the least significant bit of the intermediate, and the value of the hypothetical key byte $K$ ranges from 0 to 255. We know that the intermediate value $I$ is given as

\[
I = S(P \oplus K)
\]

where $S(\ldots)$ denotes the S-box. The concrete selection function becomes

\[
D(P, b, K) = \mathrm{LSB}(I) = \mathrm{LSB}(S(P \oplus K)) .
\]

We calculate a running average of the traces where $D = 1$ and a running average of the traces where $D = 0$. After all traces have been processed, we subtract the two averages and get a differential trace $\Delta_D$. If the guessed key byte is correct, the differential trace will show a clearly visible spike while it will be mostly flat if the guess was incorrect. This is due to the fact that parts of the trace that are unrelated to $D$ get smaller with $1/\sqrt{m}$, $m$ being the number of measurements, while the parts of the trace that are directly related (correlated) to $D$ stay in place and add up with increasing $m$.
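The difference-of-means procedure can be tried out on simulated data. The following Python sketch does not use real measurements: it assumes a device that leaks the Hamming weight of the S-box output at one sample of an otherwise noisy trace, and all parameters (number of traces, trace length, noise level, key byte) are arbitrary choices made for illustration.

```python
import random


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def sbox_entry(x):
    """S-box value: inversion in GF(2^8), then the affine transformation."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    out = inv
    for k in range(1, 5):
        out ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
    return out ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]


def simulate_traces(key_byte, n_traces, n_samples=5, leak_at=2, sigma=0.5, seed=1):
    """Simulated traces: Gaussian noise plus a Hamming weight leak at one sample."""
    rng = random.Random(seed)
    plaintexts, traces = [], []
    for _ in range(n_traces):
        p = rng.randrange(256)
        trace = [rng.gauss(0.0, sigma) for _ in range(n_samples)]
        trace[leak_at] += bin(SBOX[p ^ key_byte]).count("1")
        plaintexts.append(p)
        traces.append(trace)
    return plaintexts, traces


def differential_trace(plaintexts, traces, key_guess):
    """Mean of the traces with D = 1 minus mean of the traces with D = 0."""
    n_samples = len(traces[0])
    sums = [[0.0] * n_samples, [0.0] * n_samples]
    counts = [0, 0]
    for p, trace in zip(plaintexts, traces):
        d = SBOX[p ^ key_guess] & 1         # selection function D(P, 0, K)
        counts[d] += 1
        for t, value in enumerate(trace):
            sums[d][t] += value
    return [sums[1][t] / counts[1] - sums[0][t] / counts[0] for t in range(n_samples)]


def dpa_attack(plaintexts, traces):
    """Return the key guess with the highest absolute peak, and all peaks."""
    peaks = [max(abs(x) for x in differential_trace(plaintexts, traces, k))
             for k in range(256)]
    return max(range(256), key=lambda k: peaks[k]), peaks


plaintexts, traces = simulate_traces(key_byte=43, n_traces=2000)
best_guess, peaks = dpa_attack(plaintexts, traces)
print(best_guess)
```

In this idealized setting the differential trace of the correct guess shows a spike of roughly one Hamming weight unit at the leaking sample. Real measurements are less forgiving.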

To give an impression of differential traces we have created three plots. We used the Matlab examples accompanying [MOP07] with kind permission from Elisabeth Oswald. The data consists of 200 power traces at 5000 samples per trace. The plots show the differential traces resulting from an attack on the first S-box in the first round of AES. A single-bit power model targeting the LSB of the S-box output was used to predict power consumption. The differential traces were computed as the difference between the sum of those power traces where the LSB of the predicted power consumption is 1 and the sum of those power traces where the LSB of the predicted power consumption is 0. Although this computation is simplified in comparison to the original formula given in [KJJ98], we keep the notation and call the resulting differential traces $\Delta_D$.

In the following, we describe the plots in order to clarify their meaning. Figure 3.1 shows the differential trace that results from the wrong key guess $K = 6$. It is easy to see that there are no significant spikes in the plot. Figure 3.2 shows the differential trace for the correct key guess $K = 43$. In contrast to Fig. 3.1, the spikes are immediately visible. To show a downside of DPA, we present Fig. 3.3. This differential trace results from the wrong key guess $K = 1$. The curve roughly spans the range from $-3000$ to $+3000$ while the first two curves reached maximal values around $\pm 400$ and $\pm 650$ respectively. This is bad for at least two reasons. Firstly, the overall height of the three plots we have created is not identical. This hinders comparison of the curves by visual inspection. One could try to remedy this problem by creating all plots with identical y axis limits. In this case the problem would remain the same because the $\pm 650$ spikes would hardly be visible relative to a $\pm 3000$ y axis. Secondly, an automated decision about the correct key hypothesis seems impossible in the case of our example. If the software that is supposed to make this decision simply looked for the highest peak, it would by no means select $K = 43$ as the best key candidate.

The idea of DPA can be summarized as the analysis of the difference in power consumption caused by changes to the plaintext or ciphertext, depending on the cipher round being attacked.

3.3 Correlation Power Analysis

DPA can be used to quickly mount an attack on any physical implementation. It requires comparatively little computational effort and the power model is simple because it usually takes exactly one bit of an intermediate value into account. DPA does in fact reveal the correlation between input, key, and power consumption, but the difference of means is not a very strong metric. There is however a more flexible and more powerful tool for the evaluation phase of a power analysis attack. In [BCO04] Brier, Clavier, and Olivier introduced an approach called Correlation Power Analysis (CPA). They propose to use an established and well-researched method to measure the linear correlation between two variables: the Pearson correlation coefficient (in short, “correlation coefficient”). Papers like [Man04] explicitly endorse the usage of the correlation coefficient as the mathematical tool in power analysis attacks.


Figure 3.1: Differential trace for the wrong guess $K = 6$ (x axis: Samples, 0 to 5,000; y axis: $\Delta_D$)

Figure 3.2: Differential trace for the correct guess $K = 43$ (x axis: Samples, 0 to 5,000; y axis: $\Delta_D$)

Figure 3.3: Differential trace for the wrong guess $K = 1$ (x axis: Samples, 0 to 5,000; y axis: $\Delta_D$)

The correlation $\rho$ of two variables $X$ and $Y$ is defined as

\[
\rho(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X) \cdot Var(Y)}}
\]

where $Cov(X, Y)$ is the covariance and $Var(X)$ resp. $Var(Y)$ is the variance of $X$ and $Y$. When we attack a physical implementation using power analysis, we do not look at random populations sharing a common distribution but instead we analyze samples. The Pearson correlation coefficient for samples, also called empirical correlation coefficient, is defined as

\[
r_{x,y} = \frac{\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y})}{\sqrt{\sum_{n=1}^{N} (x_n - \bar{x})^2} \, \sqrt{\sum_{n=1}^{N} (y_n - \bar{y})^2}}
\tag{3.1}
\]

where $x$ and $y$ denote the two sets of samples, $\bar{x}$ and $\bar{y}$ denote the mean of each set, and $N$ is the number of samples.
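Equation (3.1) translates directly into code. The following Python function is a plain sketch without any numerical refinements. It assumes that both sample sets have the same length and that neither of them is constant, since the denominator would otherwise be zero.

```python
from math import sqrt


def pearson(x, y):
    """Empirical correlation coefficient r_{x,y} as defined in (3.1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (sqrt(spread_x) * sqrt(spread_y))


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))    # perfectly linear, positive slope
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))    # perfectly linear, negative slope
print(pearson([1, 2, 3, 4], [1, -1, -1, 1]))  # no linear relation
```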

We must now clarify between which sets we want to calculate a linear correlation. A CPA attack generally consists of five important steps which are explained in [MOP07] and in university courses like “Implementation of Cryptographic Schemes 1” [Paa12]. We briefly recall the steps in the following. To perform an attack the attacker has to:

Select an intermediate value to be attacked. We can use any intermediate variable that is used during the execution of the algorithm. Some intermediates are suited better than others. The output of the first S-box in the first round of AES is often used as the targeted intermediate.

Measure the power consumption. The attacker makes the target device encrypt or decrypt $D$ different inputs. She measures the power consumption during those $D$ runs of the algorithm. The measurements are usually taken with a digital oscilloscope. The oscilloscope samples the power consumption at a fixed frequency, resulting in $T$ samples per power trace. Measuring the power consumption of $D$ runs of the algorithm at $T$ samples per trace results in a matrix $\mathbf{T}$ of size $D \times T$. The individual traces are usually not kept in memory but stored to the disk of a computer instead. The computer interfaces with the target device (it sends the input and receives the output) and controls the oscilloscope.

Calculate hypothetical intermediate values. The goal of the attack is to recover the secret key based on the power consumption measured during the $D$ runs of the algorithm from the previous step. The key is unknown to the attacker. It is thus necessary to create a set of hypotheses that can then be compared to the physical behavior of the device observed in terms of power consumption during the measurement phase. The first step consists in defining the possible key values. Attacking a full 128-bit AES key at once would be too resource intensive. Instead a single byte of the key is usually attacked. This limits the choices for the key byte $k$ to the range $(0, \ldots, 255)$. The next step is to calculate the hypothetical value of the intermediate selected at the beginning of the attack. The intermediate value depends on the key byte $k$ and on the corresponding byte $d$ of the input used in each individual run of the algorithm. As an example, if we selected the output of the S-box in the first round as the target for the attack, the intermediate value would be

\[
v = f(d, k) = S(d \oplus k)
\]

where $d$ is one byte of the input, $k$ is one hypothetical byte of the key, and $S(\ldots)$ denotes the S-box. If we further decided to attack the first byte of the key we would calculate $v$ for each individual first byte of the $D$ inputs and each possible choice for $k$. This results in a matrix $\mathbf{V}$ of size $D \times K$ that contains the hypothetical intermediate values, where $K$ is the number of key hypotheses ($K = 256$ for a single key byte). Formally we have

\[
\mathbf{V}_{i,j} = f(d_i, k_j) \qquad i = 1, \ldots, D \qquad j = 1, \ldots, K .
\]

The next step is to map the hypothetical intermediate values to hypothetical power consumption values.

Map hypothetical intermediate values to hypothetical power consumption. The hypothetical intermediate values cannot be compared to the measured power consumption because they lie in different domains. To make the comparison possible it is necessary to map each hypothetical intermediate value to the amount of power the device might consume while processing the respective value. Due to noise and other effects it is impossible to predict a precise power consumption value that would be sampled by the oscilloscope during the collection of power traces. Instead the power consumption of the device is described by a power model. A power model is always an approximation because it is nearly impossible to create a perfect model without intimate knowledge of the internal design of the device under attack. It is however possible to create power models that fit the observed behavior of the device very well. One very popular model is called the Hamming Weight Model. It assumes that the power consumption of a device depends on the Hamming weight of the data being processed. The (n-bit) Hamming weight $\mathrm{HW}(v)$ is defined as the number of bits equal to 1 in the binary representation of $v$. For example, $\mathrm{HW}(42_{16}) = \mathrm{HW}(01000010_2) = 2$. Other models range from a simple Bit Model taking only one bit of a hypothetical intermediate into account to more complex models like the Hamming Distance Model which calculates the Hamming distance, defined as $\mathrm{HD}(v_1, v_2) = \mathrm{HW}(v_1 \oplus v_2)$, for example between two subsequent values stored in a register of the device under attack. Detailed simulation and modeling techniques are described in Chapter 3 of [MOP07].

The attacker applies the selected power model to each individual entry of $\mathbf{V}$ and receives a matrix $\mathbf{H}$ of size $D \times K$ that contains hypothetical power consumption values.
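The two power models and the mapping from the matrix of intermediates to the matrix of hypothetical power values can be sketched in Python as follows. To keep the example short, the intermediate function used here is a plain XOR, a toy stand-in for $S(d \oplus k)$ that is not part of any real attack. The function names are illustrative.

```python
def hamming_weight(v):
    """Number of bits equal to 1 in the binary representation of v."""
    return bin(v).count("1")


def hamming_distance(v1, v2):
    """HD(v1, v2) = HW(v1 XOR v2)."""
    return hamming_weight(v1 ^ v2)


def hypothetical_power(V, model=hamming_weight):
    """Apply a power model to every entry of V, giving H of the same size."""
    return [[model(v) for v in row] for row in V]


# Toy stand-in for f(d, k): a plain XOR instead of S(d XOR k).
inputs = [0x00, 0x42, 0xFF]
V = [[d ^ k for k in range(256)] for d in inputs]
H = hypothetical_power(V)

print(hamming_weight(0x42), hamming_distance(0x0F, 0xF0))  # 2 8
```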

Compare the hypothetical power consumption with the power traces. As the final step of the attack, the attacker has to analyze the relation between the calculated hypothetical power consumption values and the measured power traces. This is where the correlation coefficient from (3.1) is applied as a measure. The linear correlation between hypothetical and observed power consumption is the criterion enabling the attacker to decide which key hypothesis was most likely the correct one at the end of the attack.

The result of this five-step process is a correlation trace that shows clearly distinguishable peaks both for the correct hypothesis and for the correct point in time. This type of trace results from difference of means evaluation as well, but using the Pearson correlation coefficient has two advantages. Firstly, the underlying mathematical theory is well established and well studied. Secondly, in contrast to the difference of means, all values of the Pearson correlation coefficient lie between $-1$ and $1$. Different attack results are therefore very easy to compare. To give an impression of correlation traces we have prepared three plots again. We plotted the correlation traces for the same key guesses that we used in the previous section. Figure 3.4 shows the correlation trace for the wrong guess $K = 6$. The trace is mostly flat. Figure 3.5 shows the correlation plot for the correct key $K = 43$, and Fig. 3.6 depicts the correlation for the wrong key $K = 1$. In the DPA case, the latter had the extremely unfortunate property that it made the comparison of differential traces nearly impossible due to its overall amplitude (see Fig. 3.3). In the CPA case we see that it simply leads to a flat correlation trace. Moreover, the distinct plots are easy to compare because the y axis ranges exactly from $-1$ to $1$ due to the nature of the Pearson correlation coefficient.
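The five steps can be walked through on simulated data. The Python sketch below is an illustration under simplifying assumptions and not the attack setup used in this thesis: the "device" leaks the Hamming weight of the first-round S-box output at a single sample, Gaussian noise stands in for all other effects, and the number of traces, the trace length, and the key byte are arbitrary.

```python
import random
from math import sqrt


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def sbox_entry(x):
    """S-box value: inversion in GF(2^8), then the affine transformation."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    out = inv
    for k in range(1, 5):
        out ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
    return out ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]


def hamming_weight(v):
    return bin(v).count("1")


def pearson(x, y):
    """Empirical correlation coefficient as in (3.1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sqrt(sum((a - mx) ** 2 for a in x)) * sqrt(sum((b - my) ** 2 for b in y)))


def simulate(key_byte, n_traces=300, n_samples=8, leak_at=3, seed=7):
    """Step 2 (simulated): D inputs and a D x T matrix of noisy traces."""
    rng = random.Random(seed)
    inputs = [rng.randrange(256) for _ in range(n_traces)]
    T = []
    for d in inputs:
        trace = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        trace[leak_at] += hamming_weight(SBOX[d ^ key_byte])
        T.append(trace)
    return inputs, T


def cpa_attack(inputs, T):
    """Steps 3 to 5: build H column by column and correlate it with T."""
    columns = [[trace[t] for trace in T] for t in range(len(T[0]))]
    R = []
    for k in range(256):
        h = [hamming_weight(SBOX[d ^ k]) for d in inputs]  # column k of H
        R.append([pearson(h, column) for column in columns])
    best = max(range(256), key=lambda k: max(abs(r) for r in R[k]))
    return best, R


inputs, T = simulate(key_byte=43)
best_guess, R = cpa_attack(inputs, T)
print(best_guess, round(max(abs(r) for r in R[best_guess]), 2))
```

Each row of `R` corresponds to one correlation trace. Because all entries lie between $-1$ and $1$, the rows for different key guesses can be compared directly.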

3.4 Other Types of Power Analysis Attacks

There are two more approaches to power analysis attacks that we wish to note. The first is called Inferential Power Analysis (IPA) while the other goes by the name of Mutual Information Analysis (MIA).

3.4.1 Inferential Power Analysis

Inferential Power Analysis (IPA) was presented by Fahn and Pearson [FP99] as an alternative to DPA. Both DPA and CPA require that the attacker has access to either plaintexts or ciphertexts. Attacks are based on the power consumption of computations that directly combine a small part of the secret key and some part of the input or output of the cipher. However, there are scenarios in which the attacker does not have access to inputs or outputs of the cipher, which makes it impossible to mount a DPA or CPA attack. In contrast, IPA requires no knowledge of the inputs or outputs of the cipher. Recorded power traces are first subjected to a profiling stage. The profiling stage consists of preprocessing steps that attempt to identify moments in which the key is involved in the cipher computation. The profiling stage is followed by the key extraction stage in which the secret key is recovered based on the results of the profiling stage. IPA allows an attacker to target the whole cipher, for example one of the inner rounds, because no control over plaintexts or ciphertexts is required. From the attacker’s point of view there is another major advantage to this type of attack. As soon as a profile for one cryptographic device has been created, the key extraction can be applied to any other device that follows the same design, for example identical smart card terminal models.


Figure 3.4: Correlation trace for the wrong guess $K = 6$ (x axis: Samples, 0 to 5,000; y axis: Correlation, $-1$ to $1$)

Figure 3.5: Correlation trace for the correct guess $K = 43$ (x axis: Samples, 0 to 5,000; y axis: Correlation, $-1$ to $1$)

Figure 3.6: Correlation trace for the wrong guess $K = 1$ (x axis: Samples, 0 to 5,000; y axis: Correlation, $-1$ to $1$)

3.4.2 Mutual Information Analysis

DPA and CPA employ statistical methods to accept or reject a given hypothesis. Gierlichs, Batina, Tuyls, and Preneel introduced Mutual Information Analysis (MIA) as an alternative approach [GBTP08]. In simple terms, mutual information describes the extent to which two random variables depend on each other. The mutual information $I(X; Y)$ is zero if $X$ and $Y$ are independent and goes up to the Shannon entropy $H(X)$ if $X$ is fully dependent on $Y$. The authors of [GBTP08] build a side-channel distinguisher based on two cascaded channels. The first channel exists between words $W$ processed inside a cryptographic device and leakage $L$ dependent on those words. The second channel leads from the leakage $L$ to the physical observation $O$. In our case, $O$ is represented by power consumption measurements. The attacker measures $O(t)$ in order to record power traces and then tries to deduce information about $W$ from the leakage information $L$ contained in $O$. To enable this, Gierlichs et al. describe methods of estimating the mutual information between $O$ and $L$ under a given key hypothesis. The result is similar to that of CPA in that it yields a clearly distinguishable peak in value for the correct key hypothesis. For the fully detailed distinction between MIA and other approaches, we refer to [GBTP08], especially to Section 5.
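The two boundary cases of mutual information mentioned above can be reproduced with a simple estimate based on empirical frequencies. The following Python sketch uses the identity $I(X; Y) = H(X) + H(Y) - H(X, Y)$ on discrete samples. It is meant as an illustration of the quantity itself and is not necessarily the estimation method described in [GBTP08].

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Shannon entropy in bits, estimated from the empirical frequencies."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two equally long sample lists."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


xs = [0, 1, 2, 3] * 4
print(mutual_information(xs, xs))                      # equals H(X) = 2 bits
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # independent: 0 bits
```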

The approach by Gierlichs and his colleagues has gained much attention in the scientific community. Amongst others, Prouff and Rivain [PR09] have picked up on the idea of MIA. MIA is also becoming a prominent tool for theoretical security evaluation, see for example [CPRR13].

3.5 Order of Power Analysis Attacks

In our description of DPA and CPA, we discussed how a single point in time within a power trace serves as the target of the attack. As an example, we selected the moment in time at which the result of the first S-box lookup in the first round of AES is processed by the cryptographic device we are attacking. We described how a selection function is used (DPA) and how power consumption is predicted (CPA). Kocher and his colleagues stated in their original paper that “More sophisticated selection functions may also be used. Of particular importance are high-order DPA functions that combine multiple samples from within a trace.” [KJJ98]. With regard to this statement, using a single sample from a power trace (and thereby a single intermediate value) means that a so-called $1^{\text{st}}$-order attack is being mounted. A $2^{\text{nd}}$-order attack would be based on a combination function that takes two samples from a power trace (two intermediate values) into account, and so forth. Higher-Order DPA (HODPA) and suitable countermeasures are subjects of active research [SVCO+10]. In the scope of our work, we are almost exclusively interested in $1^{\text{st}}$-order attacks and suitable countermeasures. Nevertheless, it is important to know that higher-order attacks exist: If an implementation is secure against $d^{\text{th}}$-order attacks, it might still be breakable by attacks of $(d + 1)^{\text{th}}$ or even higher order.
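The difference between the two orders can be illustrated with an idealized, noise-free example. The Python sketch below assumes a Boolean masking scheme in which a value $v$ is only ever processed as $v \oplus m$ with a uniformly random mask $m$, that both $v \oplus m$ and $m$ leak their Hamming weight at two different samples, and that the attacker combines the two samples by their absolute difference. These are assumptions made for illustration, not a description of a concrete implementation.

```python
from math import sqrt


def hamming_weight(v):
    return bin(v).count("1")


def pearson(x, y):
    """Empirical correlation coefficient as in (3.1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sqrt(sum((a - mx) ** 2 for a in x)) * sqrt(sum((b - my) ** 2 for b in y)))


# Enumerate every value and every mask once (noise-free, idealized leakage).
predicted, leak_masked, leak_mask = [], [], []
for v in range(256):
    for m in range(256):
        predicted.append(hamming_weight(v))        # attacker's prediction for v
        leak_masked.append(hamming_weight(v ^ m))  # sample 1: masked value
        leak_mask.append(hamming_weight(m))        # sample 2: the mask itself

# 1st order: a single sample shows no correlation with the prediction.
first_order = pearson(predicted, leak_masked)

# 2nd order: combining both samples brings the dependency back.
combined = [abs(a - b) for a, b in zip(leak_masked, leak_mask)]
second_order = pearson(predicted, combined)

print(round(first_order, 3), round(second_order, 3))
```

In this setting the single-sample correlation vanishes, while the combined samples correlate noticeably with the prediction.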


3.6 Selecting one Type of Power Analysis

The initial proposal of DPA proved to be a powerful tool that could be used to recover secret keys used in cryptographic devices. DPA reveals a correlation between the key, the chosen intermediate, and the power consumption of the cryptographic device. Nevertheless, the scientific community has improved the original idea by using the Pearson correlation coefficient as the mathematical tool of choice. Mangard argues that the Pearson correlation coefficient is favorable “because there exists a well-established theory on measuring correlations this way” [Man04]. Our experience is that this statement holds and that CPA is both more precise and more efficient than DPA. MIA is an interesting approach that exploits the full amount of information available from the observed power consumption, but apart from the papers we mentioned, we have not seen or heard of any practical MIA-based attacks. Thus we decided to use $1^{\text{st}}$-order CPA to conduct side-channel analysis of our AES implementations.

4 Target Platform: TriCore TC1797

Power analysis attacks aim at exploiting information gathered from the power consumption of some hardware executing one concrete implementation of a cryptographic algorithm. The scope of our thesis encompasses software implementations of AES which implies the need for a target platform on which we can run those implementations. In this chapter we present the target platform we selected for our work. We give an overview of the platform’s architecture and of those features bearing the highest relevance with regard to our goal of mounting and hindering power analysis attacks against AES.

4.1 The TriCore Concept

The term TriCore was conceived by Infineon Technologies for a family of powerful 32-bit microcontrollers. The central concept behind TriCore products is that they combine a Reduced Instruction Set Computer (RISC) architecture with Microcontroller Unit (MCU) and Digital Signal Processing (DSP) features on a single chip, a property that explains the origin of the name TriCore. The first generation of TriCore products was launched in 1999 under the name “AUtomotive unifieD processOr”, in short AUDO. Automotive use cases like engine management or safety systems are the main area targeted by TriCore products. Nevertheless, Infineon states that they are also suited for industrial scenarios like solar panel or wind turbine related requirements. For details we refer to [ITA12]. We wish to note that the term TriCore must not be confused with computer-related terms like DualCore or QuadCore. TriCore does not mean that multiple CPU cores operate in parallel.

4.2 Selecting a TriCore Product

For our work we selected the Infineon TriCore TC1797 as the target platform. The TC1797 resides in the high end section of the third AUDO generation, the “AUDO Future” product family. It is manufactured using a 130 nm process. The maximum clock frequency amounts to 180 MHz and there are 4 MB of on-chip program memory. The TC1797 offers 221 digital I/O lines and 48 Analog-to-Digital Converter (ADC) channels, along with 118 timed I/O channels for purposes like Pulse Width Modulation (PWM) and other applications that require timed execution of code. For external communication it can natively deal with up to four Controller Area Network (CAN) nodes. Additionally, it features two of each of the following interfaces:

∙ Asynchronous Serial Channel (ASC)

∙ Synchronous Serial Channel (SSC)

∙ Micro Second Channel (MSC)

∙ Micro Link Interface (MLI)

∙ FlexRay

This list of features makes it clear that the TC1797 is a very powerful and at the same time a highly complex device. It seems reasonable to ask why such a piece of hardware would be interesting with regard to side-channel resistance. As we mentioned in the introduction, TriCore products are used in a lot of modern vehicles. Topics like secure communication between automotive components and tamper resistance are attracting increasing amounts of interest from both vendors and the research community. This justifies the idea of evaluating side-channel attacks against software running on TriCore products.

4.3 The TriCore Architecture

Common features and properties of multiple TriCore microcontroller models are defined by the TriCore architecture. As of today two versions of the TriCore architecture exist: TriCore version 1.3 (referred to as v1.3 or TC1.3), and TriCore version 2.0 (referred to as v2.0 or TC2) [ITA02]. Individual models can differ in terms of periphery, amount of built-in memory, maximum CPU frequency, and other characteristics. Nevertheless, they must all fulfill the requirements of the architecture version they implement. The TC1797 implements v1.3.1 which we investigate in the following.

4.3.1 Registers and Instructions

The v1.3 architecture is described in detail in [ITA08a]. According to the specification, a TriCore microcontroller is a 32-bit system. All registers and instruction opcodes are 32 bits wide except for 16-bit instructions which can be used to reduce the overall code size. There are two register files containing 16 32-bit registers each. One register file contains data registers D0 through D15. D15 acts as an implicit data register. The other register file contains address registers A0 through A15. Not all address registers can be used freely because three of them fulfill a special purpose: A10 contains the Stack Pointer (SP), A11 contains the Return Address (RA), and A15 is an implicit address register. The registers used implicitly (D15, A15) are of special importance to 16-bit instructions. For those instructions the implicit register is hard coded into the opcode which reduces the overall code size.

Most operations are executed within one clock cycle. TriCore devices contain two four-stage pipelines and one separate pipeline for loop execution. Up to three instructions can be executed in a single clock cycle due to the fact that the pipelines run in parallel [ITA02, p. 40].


4.3.2 Memory Layout and Addressing

The architecture further specifies that memory is addressed using 32-bit addresses. This implies that up to 4 GBytes of memory can be addressed. The address space is logically divided into 16 segments of 256 MBytes each, denoted by the hexadecimal numbers $0_H$ through $F_H$. Depending on the segment used, the same physical memory may be accessed in different technical ways. For example, the segments $8_H$ and $A_H$ point to the same physical memory locations, but addressing via the segment $8_H$ allows cached access while addressing via $A_H$ allows non-cached access [ITA09b, ch. 8].
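The segment arithmetic can be illustrated with a minimal sketch. It assumes that the segment number is simply the top four bits of the 32-bit address and that the cached and non-cached views share the same offset within their segment. The function names and the example address are hypothetical and only serve as an illustration.

```python
SEGMENT_SHIFT = 28                  # the top 4 of 32 address bits select the segment
SEGMENT_SIZE = 1 << SEGMENT_SHIFT   # 256 MBytes per segment

CACHED_SEGMENT = 0x8      # cached view of the memory
NON_CACHED_SEGMENT = 0xA  # non-cached view of the same memory


def segment_of(address):
    """Return the segment number (0x0 to 0xF) of a 32-bit address."""
    return (address & 0xFFFFFFFF) >> SEGMENT_SHIFT


def offset_in_segment(address):
    """Return the offset of the address within its 256 MByte segment."""
    return address & (SEGMENT_SIZE - 1)


def non_cached_alias(address):
    """Translate an address in segment 8 to the same offset in segment A.

    Assumption: both views use identical offsets.
    """
    if segment_of(address) != CACHED_SEGMENT:
        raise ValueError("address does not lie in the cached segment 8")
    return (NON_CACHED_SEGMENT << SEGMENT_SHIFT) | offset_in_segment(address)


example_address = 0x80001234  # hypothetical address in segment 8
print(hex(non_cached_alias(example_address)))
```

Sixteen segments of 256 MBytes each add up to the full 4 GByte address space, which matches the 32-bit address width.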

4.3.3 Calling Conventions

The TC1797 is a RISC microcontroller. RISC architectures normally make use of a small instruction set and a large number of registers. The TC1797 features 16 data and 16 address registers. A task being executed on the TC1797 runs inside a so-called context that “is divided into the upper context and the lower context”. The upper context comprises the registers A[10] through A[15] as well as D[8] through D[15], the processor status word, and information about the previous context. We recall that A[10] holds the SP, A[11] holds the RA, and A[15] is used for implicit addresses just as D[15] is used for implicit data. The lower context consists of A[2] through A[7], D[0] through D[7], and the program counter stored in A[11].

The conventions for function calls are described in the TriCore Embedded Applications Binary Interface (EABI) documentation [ITA07]. The upper context is automatically saved when a function is called and restored when the function returns. There is no automatic handling of the lower context but it can be saved and restored manually using the SVLCX and RSLCX instructions.

Due to the TriCore architecture, function arguments are passed in registers and not via the stack. Non-pointer arguments are passed to the called function in the registers D[4] through D[7] while pointer arguments are passed in the registers A[4] through A[7]. If more than four arguments per type are required they are put onto the stack. Special cases like 64-bit pointers or struct arguments are not relevant for our work and thus skipped here. 32-bit scalar return values are generally passed back to the caller inside D[2] while 32-bit pointers are returned in A[2]. When the function returns, the previously saved upper context is restored.
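The register assignment described above can be modelled with a short sketch. It is a simplified model that only covers 32-bit scalar and pointer arguments. The function names and the example signature are hypothetical and are not taken from our implementation or from the EABI documentation.

```python
DATA_ARGUMENT_REGISTERS = ["D[4]", "D[5]", "D[6]", "D[7]"]
ADDRESS_ARGUMENT_REGISTERS = ["A[4]", "A[5]", "A[6]", "A[7]"]


def assign_arguments(arguments):
    """Map each argument to a register or to the stack.

    arguments is a list of (name, is_pointer) tuples in declaration order.
    Simplification: 64-bit and struct arguments are not modelled.
    """
    free_data = list(DATA_ARGUMENT_REGISTERS)
    free_address = list(ADDRESS_ARGUMENT_REGISTERS)
    locations = {}
    for name, is_pointer in arguments:
        free = free_address if is_pointer else free_data
        # Take the next free register of the matching type, else use the stack.
        locations[name] = free.pop(0) if free else "stack"
    return locations


def return_register(is_pointer):
    """32-bit pointers are returned in A[2], 32-bit scalars in D[2]."""
    return "A[2]" if is_pointer else "D[2]"


# Hypothetical signature:
#   uint32_t aes_encrypt(uint8_t *state, const uint8_t *round_keys, uint32_t rounds)
example = [("state", True), ("round_keys", True), ("rounds", False)]
print(assign_arguments(example))
print(return_register(is_pointer=False))
```

In this model a function with two pointer arguments and one scalar argument does not touch the stack at all, which is the common case for the routines of a block cipher.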

As a side note we wish to point out that the TriCore compiler is able to generate machine code that will always use the stack to handle parameters and return values. This feature is only provided for backward compatibility and the instruction set contains no PUSH and POP mnemonics. This fact underlines the strict adherence to the load/store architecture.

4.4 Relevant Features of the TC1797

While many shared properties of TriCore products are specified by the TriCore architecture, we only worked with the specific model called TC1797. The TC1797 implements version 1.3.1 of the TriCore architecture. In Sect. 4.2, we gave a general overview of the TC1797’s features. The product brochure [ITA12], the data sheet [ITA09a], and the user’s manual [ITA09b] give a good impression of how feature-rich and complex the TC1797 is. For our work we need only a small subset of the long list of features. We briefly name the relevant features in the following and refer to [ITA09b] for the full details.

4.4.1 CPU Cores

The TC1797 contains a 32-bit CPU that is supported by a 32-bit Peripheral Control Processor (PCP). The PCP is a fully self-contained processor with its own program and data memory. It is intended for complex management tasks regarding peripheral units. For our work only the CPU is of interest because we perform no complex peripheral tasks.

4.4.2 Memory Sections and Caching

Running a binary on a microcontroller unavoidably requires some memory to store code and data. The Program Memory Unit (PMU) of the TC1797 features 4 MB of program memory and 64 kB of data memory. We mentioned earlier that the address space is divided into 16 segments and that memory is addressed using 32-bit values. The overall address space therefore covers a range of 4 GB. Each of the 16 segments spans 256 MB.

Related to our work we are interested in accessing the program and data memories. Cached access to those memories is possible via segment 8 while non-cached access is allowed via segment A. We might also encounter cases in which we wish to read from or write to the Local Data RAM (LDRAM) which is part of the Data Memory Interface (DMI). The DMI handles data requests from the CPU and the Local Memory Bus (LMB). Access to the LDRAM is allowed via segment D.

The remaining segments are either reserved (0-7, 9, B) or irrelevant for our work. Knowledge about the segments is an essential prerequisite for the memory configuration each binary requires. We found that the HighTec toolchain offers default configurations, but they are tailored to specific products like the Infineon TriBoard. Therefore we had to customize the memory settings for our binaries. For the most part we chose segment A in order to enforce non-cached memory access. We come back to caching in Chap. 7. One more problem with the default configurations lies in the fact that they assume the existence of external memory. In our case no external memory is present. The TC1797 immediately raises a debug trap if the memory configuration is wrong. Thankfully, Falk Schellenberg of EMSEC provided us with a working memory map file.

4.4.3 Serial Communication

We need a way to exchange data like plaintexts and ciphertexts with the TC1797. For this communication we use one of two ASC interfaces provided by the TC1797. The hardware implementation of suitable peripheral components was already finished when we started working on our thesis. Therefore we did not need to learn many details about the ASC interface. Nevertheless it is interesting to know that the ASC features a built-in baud rate generator whose internal clock frequency depends on the ASC clock, which in turn depends on the core frequency. This imposes some limits on possible baud rates. We mention one such case in Chap. 8.

4.4.4 Digital I/O

Aside from the CPU, working memory maps, and serial I/O, we require one more mechanism: Digital I/O. We need to supply a trigger signal to the outside world for power trace recording. The TC1797 offers a total of 215 digital General Purpose I/O (GPIO) lines grouped into 16 ports. A digital I/O line suits the need for a trigger signal perfectly because the signal is binary. No states other than “on” (high voltage) and “off” (low voltage) are required for the oscilloscope trigger. We chose to use one pin of port 2 for the trigger signal.

5 Working Environment

We introduced different types of power analysis attacks in Chap. 3. In order to conduct a practical attack we must meet some technical requirements. We discussed the most basic requirement in Chap. 4 where we described which target platform we use for our work. In this chapter, we present the essential parts of our overall working environment. As a start we explain the kind of TriCore evaluation board we work with and the communication channels between the board and surrounding components. Subsequently we describe the measurement setup and the software we used to develop, test, and attack AES on the TC1797.

5.1 The EMSEC TriCore Board

Infineon offers a starter kit for the TC1797. The so-called TriBoard is a PCB equipped with a TC1797, some external RAM, a number of connectors including JTAG, four 80-pin male and four 80-pin female I/O connectors, and some additional components including bus transceivers and LEDs. The author of [Hoh09] used a TriBoard for his work but this evaluation board has two major downsides regarding power analysis attacks:

1. There is exactly one power supply for the entire board which in turn implies that there is exactly one electrical ground.

2. The TC1797 is soldered to the PCB. Thus the chip cannot be exchanged.

The problem with the central power supply and ground connection is obvious. When performing power analysis attacks, it is important to have the greatest possible amount of control over power supply and electrical ground connections. The TC1797 is a highly complex device featuring a total of 80 digital ground (VSS) pins complemented by an additional eight distinct ground pins serving as oscillator ground, ADC reference voltage and analog part ground, and Fast Analog-to-Digital Converter (FADC) reference voltage and logic ground. Selective clustering of ground connections increases the number of potential measurement points and in turn makes it easier to find the optimal approach to recording power traces.

The fact that the TC1797 is soldered to the board leads to more potential problems when it comes to experiments in a laboratory. Both constructive and destructive scenarios must be considered. From the constructive point of view one might wish to download the same binary to multiple TC1797 chips in order to use them in a team of researchers. It might also be necessary to try a different TriCore model for other experiments. From the destructive point of view, a chip or its periphery can theoretically break at any time during lab experiments. Both scenarios make it clear that physical exchangeability of the chip and full control over the Printed Circuit Board (PCB) seem desirable.

Figure 5.1: The TriCore SCA board developed at EMSEC with the TC1797 socketed in the middle, measurement probes on the left and at the bottom, and a JTAG cable on the right

Based on those two criteria the TriBoard is obviously not the optimal device to conduct power analysis experiments with TriCore products in a research environment. Thankfully, David Oswald and Falk Schellenberg of EMSEC designed a custom board targeted at all types of side-channel attacks against TriCore products. We use version 1.0 of this board for our attacks. The board is depicted in Fig. 5.1.

The two requirements we mentioned previously, namely fine-grained control over ground connections and physical exchangeability of the chip, are fulfilled by the EMSEC board. The board offers distinct ground connectors for all major ground pin groups used by the TC1797. Direct electrical connections to the ground plane were replaced with sockets. The ground connection can be closed using jumper plugs for regular operation of the board. For power consumption measurements it is easy to equip the socket with a resistor instead of a jumper plug. We give details on ground connections and related issues in Sect. 5.3 where we describe our measurement setup. Before we get to the measurement setup we describe the development and debugging process.

5.2 Communicating with the Board

The SCA board itself would be quite useless if it were unable to communicate with other devices, in our case primarily with the host computer. We need to download binaries to the TC1797 in order to execute and debug our implementations. For this purpose we use the Joint Test Action Group (JTAG) capabilities of the TC1797. The board features a JTAG connector to interface with the respective hardware which we describe later in this section. In addition to flashing and debugging we also need to send keys and plaintexts to the TriCore and retrieve ciphertexts in exchange. For this purpose we use the ASC0 interface of the TC1797. The board features an FTDI chip that performs RS-232 communication over Universal Serial Bus (USB). Figure 5.2 shows a schematic overview of the laboratory setup. We describe the distinct components in the following.

Figure 5.2: Schematic picture of the laboratory setup. Components: Host PC, TriCore board, PicoScope, UAD2. Connections: USB (power, RS-232), USB, USB, channels A and B, JTAG.

5.2.1 JTAG Hardware

In order to interface with the JTAG connector on the board we require hardware that can do so because the host computer does not offer a JTAG port. For this purpose we use the Universal Access Device 2 (UAD2) offered by PLS.1 The UAD2 is a small box with a serial and a JTAG connector on the one side and a USB connector on the other. The USB port allows for a connection to the host computer while the other ports are used to connect the UAD2 to the target device.

The UAD2 is powered externally and features a ground socket. The manual mandates that all devices are connected to a common ground domain using this socket in order to prevent damage caused by Electrostatic Discharge (ESD). The ground connection should always be established first. The UAD2 is not specifically designed for the TC1797 but supports a broader range of microcontrollers so that it can be reused across different projects. We use the UAD2 for flash programming and debugging.

Flash Programming

When a binary has been assembled it must be downloaded to the flash memory of the TC1797. During our work we are in a phase of “active development” in which frequent reprogramming of the target device is required. The UAD2 enables us to download code to the TriCore via the JTAG interface. Whenever new code is downloaded to the target, the flash memory is first erased. It is then reprogrammed using the provided binary. Each programming cycle ends with a checksum computation by which the integrity of the downloaded binary data is asserted.

1PLS Programmierbare Logik & Systeme GmbH, http://www.pls-mc.com.



Debugging

Active development also brings along the need for debugging. Even though live debugging should be the last resort, it is sometimes unavoidable. Additionally, some effects like pointer arithmetic bugs only surface while testing code on the target platform. The TC1797 offers On-Chip Debug Support (OCDS) which is described in [ITA09a, pp. 56-58] and in more detail in [ITA09b, ch. 15]. OCDS uses the JTAG port which in this case serves as the communication link between the TC1797 and external debugging hardware. There are two levels of OCDS: Level 1 (L1) offers software debugging features like memory access, breakpoints, register inspection, and single-step execution. Level 3 (L3) offers more complex features like tracing but requires a special emulator. We only need the features offered by OCDS L1. Combining the UAD2 and an accompanying software called Universal Debug Engine (UDE) enables us to set breakpoints and inspect registers and memory. In Sect. 5.4 we give more information on the software stack we use for our work.

5.2.2 USB Connection to the Host Computer

The board is connected to the host computer via USB. This connection powers the board and serves as the serial communication link. The host computer detects the FTDI chip as a plug-in device and creates a virtual serial port accordingly. Data can then be exchanged with the board and effectively with the TC1797 via this serial connection. This USB connection completes the two required communication lines between the TC1797 (or the entire board) and the host computer.

5.3 Measurement Setup

Power traces are usually recorded using a digital oscilloscope. We use a PicoScope 5203 as depicted in Fig. 5.3. The PicoScope 5203 is a USB oscilloscope with no built-in display. Sampled data is handled on the host computer. The oscilloscope can be controlled in two ways. Firstly, a Graphical User Interface (GUI) application is provided free of charge by PicoTech. This application allows a user to select settings like sampling mode, sampling resolution, and trigger thresholds. The GUI is a comfortable replacement for the internal screen of classic oscilloscopes, especially because it offers more complex features like serial data decoding and special measurement modes. We use v6.7.21.2 of the PicoScope software.

Secondly, the oscilloscope can be controlled by custom software through an Application Programming Interface (API). This is the preferred approach for the actual recording of sampled data because the GUI has no capabilities to do so. In the context of our attacks we need more functionality than visual inspection of a live signal. We must interact with the TriCore board to supply the cipher with inputs, retrieve the outputs, and at the same time record the power consumption. For this purpose we use a framework developed at EMSEC which we describe in Sect. 5.4.2.


Figure 5.3: The PicoScope 5203 with connectors on the front panel. Source: http://www.pc-oscilloscopes.com/images/ps5000-6low.jpg

From the technical point of view, the PicoScope 5203 features two channels named A and B. It can sample each channel at up to 500 MS/s. Samples are taken at an 8-bit resolution and represented as 16-bit values where the second byte includes special flags like an overflow indicator. We use Channel A to measure the power consumption while Channel B serves as the trigger that starts a sampling run. The measurement probe for Channel A is connected to a measurement resistor that we can freely insert into any of the ground sockets available on the TriCore board. As an additional tool we have a set of wires that joins multiple jumper plugs together so that we can combine multiple ground sockets through the central measurement resistor. The probe for Channel B is connected to an I/O pin of the TC1797. We configured the trigger in such a way that a rising edge at this pin indicates the start of a measurement run.

5.4 Software Stack

In order to develop software destined to be run on the TC1797 we need a number of tools. We also need some software for measurements and evaluation during our side-channel attacks. We describe the toolchain in the following.

5.4.1 Development

We develop software for the TC1797 using the Integrated Development Environment (IDE) provided by PLS. We use v3.4.7.3 which comes as Eclipse Helios SR1 augmented with some plugins. C and C++ development support is provided by the C/C++ Development Tooling (CDT) plugin. The Eclipse Modeling Framework (EMF) acts as the basis for features from the PLS development tools. Those are installed in version 1.1.9.201101141645 and constitute the main plugin.



The general integration of C and C++ compilers into Eclipse is managed by CDT. PLS provides a full GNU-based C and C++ compiler toolchain that has been customized to fit the needs of TriCore development. We use v4.6.2.0 of the toolchain. Binaries created are ELF compatible. Many well-known GNU tools like make are included in the toolchain. A version of objdump is also provided which is of great value for binary analysis.

The PLS development tools contribute a HighTec and a UDE perspective to Eclipse. The HighTec perspective works like a regular C/C++ development perspective. In addition, it provides a tailored project explorer and some modeling integration for target device memory configurations. The UDE perspective integrates the UDE workbench into Eclipse which can also be run as a standalone tool. We use v3.00.07 of UDE for all purposes of interaction with the TC1797 via the UAD2. From UDE we can download binaries to the TC1797 as well as debug them at runtime.

The last development tool we present comes directly from Infineon and is called Digital Application Virtual Engineer (DAVE). We use DAVE 2.2 r2. DAVE enables us to create a system configuration for the TC1797 and other TriCore models in a visual manner. A simple GUI shows a graphical representation of TriCore components like system timer, serial ports, and memory configuration. The desired settings can quickly be selected. When everything is configured, DAVE generates low-level C code which takes care of system initialization according to the settings selected in the DAVE GUI. The tool saves the developer a lot of time by providing a complete code skeleton which only has to be edited in order to add the individual application code.

5.4.2 Measurement

As soon as we have flashed the TC1797 we need to measure power traces for our attacks. We described our measurement setup in Sect. 5.3. We split the power consumption measurements into three phases.

In the first phase, we have to verify that the DPA board and the TC1797 are operational and that serial communication works both ways. We use HTerm 0.8.1beta for this purpose. HTerm is a terminal program that allows the user to open a serial connection and exchange data with the connected device. It features a highly useful automatic mode in which it repeatedly sends the same data to the target device. Sending the same data in a loop allows us to observe the resulting waveforms in the PicoScope GUI. We can deduce from what we see in the GUI whether or not the TC1797 is performing computations. Problems like hitting a debug trap would immediately become obvious due to missing serial communication responses or missing electrical activity of the TC1797. This very first functional assessment also gives us the chance to debug electrical issues that may not become obvious from looking at the DPA board.

In the second phase, we use the PicoScope GUI for preliminary analysis of the TC1797’s power consumption. During this phase, we first check that amplitudes look as expected while the TC1797 performs AES computations on pseudo-random inputs. Next, we determine the overall timing of encryption operations in order to deduce the number of samples required to record one power trace.
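The step from a measured duration to a number of samples is a simple multiplication. The following sketch shows it under assumed values: the duration of 20 µs is purely hypothetical and not a measured result, and the function name is ours. Integer arithmetic is used so that the result does not depend on floating-point rounding.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def samples_per_trace(duration_ns, sample_rate_hz):
    """Smallest number of samples that covers the given duration.

    duration_ns    -- duration of the observed operation in nanoseconds
    sample_rate_hz -- sampling rate of the oscilloscope in samples per second
    """
    product = duration_ns * sample_rate_hz
    # Ceiling division on integers: round up to the next full sample.
    return -(-product // NANOSECONDS_PER_SECOND)


# Hypothetical example: an operation of 20 microseconds sampled at 500 MS/s.
print(samples_per_trace(20_000, 500_000_000))
```

A lower sampling rate reduces the number of samples per trace proportionally, at the cost of time resolution.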


In the third phase, we perform the actual recording of power traces. For this purpose, we use a framework developed by David Oswald of EMSEC. The framework is the result of David’s diploma thesis [Osw09] and has continuously been extended since it was originally implemented. It performs serial communication with the TriCore in order to send plaintexts to the device and receive the resulting ciphertexts. At the same time it controls the PicoScope in headless mode, i.e., the GUI does not run during measurements. Plaintexts, ciphertexts, and the recorded power traces are stored on disk and can later be evaluated.

5.4.3 Evaluation and Attack

After a measurement run we have a lot of data that needs to be evaluated. We use a combination of different tools for this purpose. The first tool is again the framework created by David Oswald. For the pure evaluation of the recorded power traces we configure the AES key in the framework settings and run the evaluation application. The application calculates the mean, the variance, and correlation coefficients using a variety of models including the Hamming weight model, the Hamming distance model, and a bit model. It outputs new data files containing the results of the evaluation, e.g., correlation traces for the known key and the models used while processing the power traces.

The next step is to visualize the results of the evaluation. We use Matlab R2012a for this purpose. A collection of Matlab scripts, mostly developed by David Oswald and Falk Schellenberg, allows us to create four plots that are useful for an assessment of the recorded power traces:

1. A plot that overlays some of the recorded traces in order to check for alignment.

2. A plot that shows the mean of all recorded traces.

3. A plot that shows the variance of all recorded traces.

4. A plot that shows, for each model used in the evaluation, a correlation trace.

The alignment plot is useful for a quick assessment of the overall alignment amongst the power traces. Big shifts in the alignment are immediately visible in this plot.

The mean plot can be used to check that power consumption amplitudes are consistent across the measurement and that no detrimental effects have affected the measurements. For example, if the overall amplitudes decrease with increasing time, there is something wrong with the measurement setup. Big runtime differences between distinct runs of the algorithm also become visible in the mean plot because the plot looks “smeared” in those cases.

The variance plot gives an impression of the overall variance of the measured traces. It usually provides a good hint at sensitive points in time where an attack can be more successful than at other points in time during a run of the algorithm. If the variance plot is flat, the measurement was bad or the implementation is very well protected.
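The data behind the mean and variance plots can be sketched in a few lines. The sketch below is not the framework or the Matlab scripts mentioned above. It is a minimal illustration that assumes all traces have the same length and that uses the population variance. The function names are hypothetical.

```python
def mean_trace(traces):
    """Mean over all traces, computed separately for each sample point."""
    count = len(traces)
    return [sum(column) / count for column in zip(*traces)]


def variance_trace(traces):
    """Population variance over all traces for each sample point."""
    count = len(traces)
    means = mean_trace(traces)
    return [
        sum((value - mean) ** 2 for value in column) / count
        for column, mean in zip(zip(*traces), means)
    ]


# Two toy traces with three sample points each.
toy_traces = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
print(mean_trace(toy_traces))
print(variance_trace(toy_traces))
```

In the toy example the middle sample point has zero variance because both traces agree there, which corresponds to a flat region in the variance plot.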

Finally, the correlation plot gives an immediate impression about chances for a key recovery attack. Each individual plot relating to one model can be shown or hidden using the plot browser. The plot contains lines that mark the $\pm 4/\sqrt{D}$ limit [MOP07] where $D$ is the number of traces recorded. Visible peaks exceeding this limit indicate that the model to which they belong is a good candidate for an attack.
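As a small worked example of this limit, the following sketch computes $4/\sqrt{D}$ and lists the sample points of a correlation trace that exceed it in absolute value. The function names and the toy correlation values are hypothetical.

```python
import math


def significance_limit(num_traces):
    """Return the 4/sqrt(D) limit for D recorded traces."""
    return 4.0 / math.sqrt(num_traces)


def peaks_exceeding_limit(correlation_trace, num_traces):
    """Indices of sample points whose absolute correlation exceeds the limit."""
    limit = significance_limit(num_traces)
    return [
        index
        for index, rho in enumerate(correlation_trace)
        if abs(rho) > limit
    ]


# Toy correlation trace for a hypothetical measurement of 10000 traces.
toy_correlations = [0.01, -0.2, 0.03, 0.05]
print(significance_limit(10000))
print(peaks_exceeding_limit(toy_correlations, 10000))
```

Because the limit shrinks with the square root of $D$, quadrupling the number of traces halves the correlation that is still considered significant.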

For actual key recovery attacks we use a self-made Python script that makes use of NumPy2 for matrix manipulation and correlation coefficient calculations. We exploit the information provided by the correlation plot in Matlab to decide which model(s) we use and in which window of time we want to launch the attack. Selecting a timeframe instead of running an attack on the entire trace reduces the amount of memory and time required for the attack. Our Python script creates hypotheses and compares them to the power traces by means of the correlation coefficient. For each hypothesis we select the highest correlation coefficient occurring in the correlation trace. We subsequently repeat this in order to find the overall absolute maximum of all correlation coefficients. We select the hypothesis that caused this highest correlation as the most probable candidate of the key byte under attack. Our script prints a success message if the selected hypothesis matches the correct key byte and informs the user accordingly about failures. The relevant part of the output looks like:

Best guess based on max. correlation: 0xde
Correlation for correct key: 0.365890471692
Correlation for best guess: 0.365890471692

We know that the key byte under attack was not recovered correctly if the correlation values in the output differ, i.e., if the “best guess” correlation is higher than the correlation for the correct key byte.
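The selection procedure described above can be sketched as follows. This is not our attack script. It is a self-contained illustration that uses only the Python standard library instead of NumPy, targets the S-box output of a single key byte with the Hamming weight model, and runs on simulated traces whose middle sample point leaks the Hamming weight without noise. All function names, the seed, and the simulated key byte are assumptions made for the example.

```python
import random


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def rotl8(value, shift):
    """Rotate a byte left by the given number of bits."""
    return ((value << shift) | (value >> (8 - shift))) & 0xFF


def sbox_entry(x):
    """AES S-box: inversion in GF(2^8) (0 maps to 0), then the affine map."""
    inv = 0
    if x:
        inv = 1
        for _ in range(254):  # x^254 is the multiplicative inverse of x
            inv = gf_mul(inv, x)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]


def hamming_weight(value):
    return bin(value).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def cpa_attack(plaintext_bytes, traces):
    """Return the best key guess and the peak |correlation| of every guess."""
    columns = list(zip(*traces))  # one column per sample point
    peaks = []
    for guess in range(256):
        hypothesis = [hamming_weight(SBOX[p ^ guess]) for p in plaintext_bytes]
        peaks.append(max(abs(pearson(hypothesis, column)) for column in columns))
    best_guess = max(range(256), key=lambda guess: peaks[guess])
    return best_guess, peaks


# Simulated measurement: sample point 1 leaks the Hamming weight of the
# S-box output, the other two sample points contain Gaussian noise only.
rng = random.Random(2013)
TRUE_KEY_BYTE = 0xDE
plaintexts = [rng.randrange(256) for _ in range(200)]
traces = [
    [rng.gauss(0.0, 1.0),
     float(hamming_weight(SBOX[p ^ TRUE_KEY_BYTE])),
     rng.gauss(0.0, 1.0)]
    for p in plaintexts
]

best_guess, peaks = cpa_attack(plaintexts, traces)
print("Best guess based on max. correlation: 0x%02x" % best_guess)
print("Correlation for correct key:", peaks[TRUE_KEY_BYTE])
print("Correlation for best guess:", peaks[best_guess])
```

The sketch takes the maximum of the absolute correlation per hypothesis, which is one possible reading of the selection rule. Restricting `traces` to a window of sample points corresponds to the timeframe selection mentioned above.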

In addition to performing the attack our Python script saves the correlation traces as text files so that they can be fed to gnuplot. This enables us to create either a plot showing only one correlation trace or a plot showing all correlation traces with the correct one highlighted in a different color.

2NumPy is a scientific computing package for Python. See http://www.numpy.org/ for details.


6 Software Countermeasures

In the previous chapters we described what side-channel attacks are and how they can be mounted in practice. The overall goal of our thesis is to protect AES against power analysis attacks on the TriCore TC1797. We have no influence on the internal design of the TC1797 which rules out hardware countermeasures. Our only chance to protect the cipher’s computation on the target environment is thus to change its implementation in suitable ways. In this chapter we introduce the idea of software countermeasures. We subsequently survey different approaches that have recently been proposed. We apply a security fix to one of them and conclude the chapter by selecting a subset of the existing approaches for implementation.

6.1 Different Approaches

A lot of research effort has been spent on side-channel countermeasures since Kocher, Jaffe, and Jun introduced DPA in 1998 [KJJ98]. The first papers on potential SCA protection schemes were published while the AES contest was still running. At CHES 1999 Goubin and Patarin presented the following

Fundamental hypothesis: There exists an intermediate variable, that appears during the computation of the algorithm, such that knowing a few key bits (in practice less than 32 bits) allows us to decide whether two inputs (respectively two outputs) give or not the same value for this variable. [GP99]

This hypothesis usually holds true for unprotected implementations of block ciphers. SCA countermeasures aim at rendering the fundamental hypothesis false by a variety of means.

We saw in Chap. 3 that CPA attacks target intermediate variables which appear during the computation of a cryptographic algorithm. An oracle first predicts an intermediate value and then its hypothetical power consumption. The hypothesis is then compared to the physically observed power consumption in order to decide whether to accept or reject the hypothesis. The targeted intermediates are also called sensitive variables. In order to render an attack unsuccessful, the correlation between the observable power consumption and the hypothetical power consumption must be reduced as much as possible.

An attack can be counteracted optimally if not only the intermediate variables, but also the points in time when they are processed, become less predictable. The value of an intermediate variable is usually protected by taking an approach called Masking. Varying the point in time when an intermediate variable is processed becomes possible by means of a technique called Hiding. Hiding is also possible in the frequency domain but we focus on the time domain for the sake of brevity. In the following, we investigate Masking and Hiding in detail.

6.1.1 Masking

We begin with a look at Masking techniques. First we look into simple ways to conceal a value by combining it with a single random mask. Additionally we present the idea of splitting a sensitive variable into multiple parts which enables the construction of masking schemes that protect a cipher against 2nd- or higher-order DPA.

Boolean and Arithmetic Masking

The most intuitive approach to masking consists in adding a random mask $m$ to an intermediate variable $v$. This idea is called additive masking. In the case of AES, additive masking is performed in terms of polynomial addition in GF($2^8$). Technically, the addition operation corresponds to computing the XOR between the sensitive variable and the mask. Due to the XOR operation, additive masking is also called boolean masking1. Formally we have

$$v' = v \oplus m$$

where $v'$ denotes the new value of the intermediate variable and $\oplus$ denotes the XOR operation. Boolean masking is suitable for the protection of linear operations. For functions that are linear with respect to XOR, we have

$$f(v \oplus m) = f(v) \oplus f(m)$$

which implies that we can execute the function on the masked intermediate, but can remove the mask later in order to reconstruct the unmasked result of the computation.
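A minimal sketch illustrates this property. As the linear function we pick the multiplication by $x$ in GF($2^8$), which appears in MixColumns and is often called xtime. The choice of this function, the helper names, and the example values are assumptions made for the illustration. In a real implementation the mask would be drawn at random.

```python
def xtime(byte):
    """Multiply a byte by x (i.e. by 2) in GF(2^8) modulo the AES polynomial."""
    byte <<= 1
    if byte & 0x100:
        byte ^= 0x11B
    return byte & 0xFF


def unmask_after_linear(function, masked_result, mask):
    """Remove the mask from function(v XOR m) by XORing function(m)."""
    return masked_result ^ function(mask)


value = 0x53  # sensitive intermediate (example value)
mask = 0xA7   # fixed here for reproducibility, random in practice

masked_value = value ^ mask              # v' = v XOR m
masked_result = xtime(masked_value)      # f is applied to the masked value only
result = unmask_after_linear(xtime, masked_result, mask)

print(hex(result), hex(xtime(value)))    # both values are identical
```

The unmasked value is never processed by the function itself. Only the final correction with the function applied to the mask reveals the result.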

The requirement for linearity shows that additive masking is not completely sufficient to protect AES. The S-box is nonlinear which means that

$$S(v \oplus m) \neq S(v) \oplus S(m)$$

and therefore the S-box cannot be protected with boolean masking. This problem can be fixed in two different ways. The first way, initially proposed in [Mes00], is to recompute the S-box lookup table. A new table $S'$ is computed in such a way that each table entry is masked by an input and an output mask:

$$S'[x] = S[x \oplus r_{in}] \oplus r_{out}$$

where $r_{in}$ masks the input and $r_{out}$ masks the output of the S-box. Recomputing the S-box lookup table has one drawback, namely the requirement for an additional 256 bytes of memory. This might lead to problems in heavily restricted environments like smart cards.

1This name goes back to [Mes00].

The second approach to fixing the masking problem is thus to go without a lookup table and compute the S-box transformation instead. From the description of the S-box computation (Sect. 2.2.3) we can easily see that additive masking is not compatible with the S-box. The multiplicative inversion in GF($2^8$) is nonlinear:

$$(v \oplus m)^{-1} \neq v^{-1} \oplus m^{-1}$$

so that another approach is required. While GF($2^8$) addition is not compatible with multiplicative inversion, field multiplication is compatible because

$$(v \odot m)^{-1} = v^{-1} \odot m^{-1}$$

where $\odot$ denotes multiplication. The field inversion can therefore be masked with a multiplicative mask. This type of masking is called multiplicative masking or arithmetic masking2.
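Both ways around the nonlinearity of the S-box can be checked with a short sketch. It first recomputes a masked table $S'$ and then compares the behaviour of the field inversion under an additive and under a multiplicative mask. The sketch is an illustration only and not one of our protected implementations. The helper names and the fixed example masks are assumptions, and real masks must be random.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8); by convention 0 maps to 0."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):  # a^254 is the inverse of a
        result = gf_mul(result, a)
    return result


def rotl8(value, shift):
    return ((value << shift) | (value >> (8 - shift))) & 0xFF


def sbox_entry(x):
    """AES S-box: field inversion followed by the affine transformation."""
    inv = gf_inv(x)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]


def masked_sbox(r_in, r_out):
    """Recompute the table: S'[x] = S[x XOR r_in] XOR r_out."""
    return [SBOX[x ^ r_in] ^ r_out for x in range(256)]


v, r_in, r_out = 0x3C, 0x5A, 0xC6  # example values, masks are random in practice
table = masked_sbox(r_in, r_out)

# Looking up the masked input yields the S-box output under the output mask.
print(table[v ^ r_in] == SBOX[v] ^ r_out)

# Additive masks do not survive the inversion ...
print(gf_inv(0x01 ^ 0x02) == gf_inv(0x01) ^ gf_inv(0x02))

# ... but multiplicative masks do (for nonzero values).
m = 0x1D
print(gf_inv(gf_mul(v, m)) == gf_mul(gf_inv(v), gf_inv(m)))
```

The three lines printed are True, False, and True. The recomputed table occupies 256 bytes in addition to the original S-box, which is the memory overhead mentioned above.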

We wish to note that the security gained by adding masking to the implementation depends, amongst other factors, on the randomness of the masks. In addition to this, we have seen that AES cannot be fully protected by either one of boolean or arithmetic masking. Both masking types have to be combined in order to secure the whole cipher. It is obvious that some kind of conversion is required if both boolean and arithmetic masking are employed. The first conversion method was published in [AG01] which was subsequently followed by [Gou01], [GT02], [TSG02], [CT03], [NP04], [GPQ10], and [Deb12]. Many of those papers uncover flaws in previous publications. For example, [GT02] reveals the so-called zero value attack: Multiplicative masking only protects nonzero values. With a certain probability, two bytes being XORed have the same value, which results in zero. This zero value is in turn not protected by multiplicative masking. Research on efficient conversion between boolean and arithmetic masking is still being conducted. For example, Blandine Debraize recently revealed a flaw in the conversion method proposed in [CT03]. She proposed a fix and a completely new conversion method [Deb12].
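Both the compatibility of multiplicative masks with inversion and the zero value problem can be observed in a short Python sketch. The helper names and the example values 0x53 and 0x1D are assumptions chosen for illustration.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_inv(a):
    """Inverse via a^254; zero maps to zero."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

v, m = 0x53, 0x1D                      # example value and nonzero mask
lhs = gf_inv(gf_mul(v, m))             # (v * m)^-1
rhs = gf_mul(gf_inv(v), gf_inv(m))     # v^-1 * m^-1

# Zero value problem: a zero intermediate stays zero under every mask,
# whereas a nonzero intermediate takes all 255 nonzero values.
masked_zero = {gf_mul(0x00, mask) for mask in range(1, 256)}
masked_nonzero = {gf_mul(0x53, mask) for mask in range(1, 256)}
```

`lhs` and `rhs` coincide, while `masked_zero` contains only the value 0: the mask has no effect on a zero intermediate.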

In contrast to boolean and arithmetic masking, another idea is to split the sensitive variable into multiple parts, on which computations are then performed, followed by the reconstruction of the original value. We discuss this idea in the following.

Secret Sharing

In [Sha79] Adi Shamir presents a method that he calls a (𝑘, 𝑛) threshold scheme. We refer to the basic idea as secret sharing. The scheme is defined as the division of data 𝐷 into 𝑛 pieces 𝐷1, . . . , 𝐷𝑛 in such a way that two requirements are fulfilled:

1. Knowledge of any 𝑘 or more 𝐷𝑖 pieces makes 𝐷 easily computable;

2. Knowledge of any 𝑘 − 1 or fewer 𝐷𝑖 pieces leaves 𝐷 completely undetermined (in the sense that all its possible values are equally likely).

Shamir uses polynomial interpolation as the reconstruction method. In order to share the secret data 𝐷 he picks a random polynomial 𝑞(𝑥) = 𝑎0 + 𝑎1𝑥 + · · · + 𝑎_(𝑘−1)𝑥^(𝑘−1) where 𝑎0 = 𝐷 and then calculates 𝐷1 = 𝑞(1), . . . , 𝐷𝑖 = 𝑞(𝑖), . . . , 𝐷𝑛 = 𝑞(𝑛). Each individual pair (𝑖, 𝐷𝑖) then constitutes one share of the secret that can be distributed to some distinct party. Reconstruction of the polynomial 𝑞(𝑥) and thereby of the secret 𝑎0 is possible through polynomial interpolation based on 𝑘 distinct shares. Reconstructing the polynomial is impossible given any 𝑘 − 1 or fewer shares and is still possible given any 𝑘 + 1 or more distinct shares.

² This name also goes back to [Mes00].
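A (𝑘, 𝑛) threshold scheme of this kind can be sketched in Python as follows. The sketch assumes arithmetic modulo the small prime 257, which is chosen purely for illustration; the function names `share` and `reconstruct` are hypothetical.

```python
import secrets

P = 257  # small prime modulus, assumed for illustration only

def share(secret, k, n):
    """Split secret into n shares (i, q(i)); any k of them suffice."""
    coeffs = [secret] + [secrets.randbelow(P) for _ in range(k - 1)]
    shares = []
    for i in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):      # Horner evaluation of q(i)
            y = (y * i + c) % P
        shares.append((i, y))
    return shares

def reconstruct(shares):
    """Recover q(0) by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % P
                den = (den * (xi - xj)) % P
        # pow(den, P - 2, P) is the modular inverse of den (Fermat)
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = share(123, k=3, n=5)
recovered = reconstruct(shares[:3])     # any three shares work
```

Any three of the five shares reproduce the secret. Fewer than three shares are consistent with every possible secret and therefore reveal nothing.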

While there is a countermeasure that actually uses random polynomials for sharing and polynomial interpolation for reconstruction [GM11], we additionally wish to present a more general approach to secret sharing that was introduced in [RP10b]. In order to protect an implementation against 𝑑th-order DPA, a sensitive variable 𝑥 can be split into 𝑑 + 1 shares 𝑥0, . . . , 𝑥𝑑 along with a recombination operation ⊥. We reproduce the generic notation from [RP10b]:

𝑥 = 𝑥0 ⊥ 𝑥1 ⊥ . . . ⊥ 𝑥𝑑 .

Looking at polynomial addition in GF(2^8), we define that ⊥ is the XOR operation which we commonly denote as ⊕. Therefore we have in our case:

𝑥 = 𝑥0 ⊕ 𝑥1 ⊕ · · · ⊕ 𝑥𝑑 . (6.1)

The shares 𝑥1 through 𝑥𝑑 are picked at random and 𝑥0 is computed in such a way that (6.1) is fulfilled. Provided that the random shares are independent and uniformly distributed, the knowledge of any fewer than 𝑑 + 1 shares 𝑥𝑖 is insufficient to reconstruct 𝑥. When an intermediate value is shared in this way, we find cases for which we need alternative computational methods. Especially the multiplication of two shared elements from GF(2^8) requires a new approach in order to avoid leakage of the original secret. Later in this chapter we show how the individual operations can be carried out in a secure manner.
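The sharing according to (6.1) amounts to only a few lines of Python. The function names `share_xor` and `recombine` are hypothetical, and `secrets.randbelow` is an assumed source of uniformly distributed bytes.

```python
import secrets
from functools import reduce
from operator import xor

def share_xor(x, d):
    """Split the byte x into d + 1 shares whose XOR equals x."""
    rest = [secrets.randbelow(256) for _ in range(d)]   # x_1 .. x_d
    x0 = reduce(xor, rest, x)                           # x_0 fulfils (6.1)
    return [x0] + rest

def recombine(shares):
    """XOR all shares together to recover the original value."""
    return reduce(xor, shares, 0)

shares = share_xor(0xA7, d=1)    # two shares, as used for 1st-order protection
value = recombine(shares)
```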

Thinking in the direction of DPA attacks, we see one potential drawback to this way of masking the secret: It is unavoidable to use the original secret in the process of splitting it into shares. This is a paradoxical situation which is only rendered slightly less uncomfortable by the fact that the sharing happens before the intermediate becomes sensitive, i.e., before the actual encryption. Nevertheless it seems possible to imagine an attack on the splitting process itself, which would effectively render the secret sharing useless. We could not find a clear answer to this problem in recent papers. We therefore discussed the issue with Matthieu Rivain who confirmed that the problem exists but that the current literature provides no solution to it.

6.1.2 Hiding

The idea of masking sensitive intermediates aims at randomizing their value and thereby the power consumption that occurs when the intermediates are processed. Nevertheless the points in time where this processing happens remain unchanged under the influence of masking. Another technique called Hiding can be used to tackle this problem. Hiding in the time domain requires randomized instruction sequences or randomized data ordering. A basic idea for the former is to add random dummy operations or parts of a round operating on a dummy state at a memory location different from that of the real state. Herbst, Oswald, and Mangard give a detailed description of such approaches in [HOM06]. They even go so far as to make sure that the state and the dummy state have memory addresses with identical Hamming Weight so that access to one or the other cannot be distinguished in an attack. The authors define two randomization zones that cover the first two and the last two AES rounds.

One idea for randomized data ordering is to shuffle the S-box lookups or the sequence in which words of the round key are added to the state. Shuffling of the S-box lookups can be implemented using a Random Start Index (RSI) with 16 possible values or a Random Permutation (RP) with 16! possible permutations. Countermeasures based on shuffling are extensively studied in [VCMKS12].
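The two shuffling variants can be sketched in Python as follows. The table `TOY_SBOX` is a stand-in substitution table and explicitly not the AES S-box; it and the function names are assumptions that merely keep the example self-contained.

```python
import random

rng = random.SystemRandom()

# Stand-in substitution table (NOT the AES S-box), for illustration only.
TOY_SBOX = [(x * 7 + 3) % 256 for x in range(256)]

def subbytes_rsi(state, sbox):
    """Random Start Index: one of 16 rotations of the processing order."""
    start = rng.randrange(16)
    out = [0] * 16
    for k in range(16):
        i = (start + k) % 16
        out[i] = sbox[state[i]]
    return out

def subbytes_rp(state, sbox):
    """Random Permutation: one of 16! processing orders."""
    order = list(range(16))
    rng.shuffle(order)
    out = [0] * 16
    for i in order:
        out[i] = sbox[state[i]]
    return out

state = list(range(0, 160, 10))           # 16 example state bytes
shuffled_rsi = subbytes_rsi(state, TOY_SBOX)
shuffled_rp = subbytes_rp(state, TOY_SBOX)
```

Both functions return the same result as an in-order substitution. Only the point in time at which each byte is processed changes from one execution to the next.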

Lu, Pan, and den Hartog state in [GPQ10] that the first and last three rounds of AES offer viable targets for first-order DPA attacks. The randomization zones from [HOM06] would thus have to be extended based on those findings. We discussed the need for shuffling with Matthieu Rivain in the context of modern masking schemes. He stated his opinion that at least the first and last three rounds should be protected and that shuffling is “an efficient way to add confusion to the leakage”. He added that Hiding countermeasures (in addition to Masking) are essential in scenarios where only low amounts of noise affect the measurements.

6.2 Recently Proposed Masking Schemes

Because [Hoh09] deals with many countermeasures proposed up to 2009, we focused on papers published in 2010 and later. We found that research on higher-order masking schemes has become increasingly popular since then. Most papers on this type of masking scheme aim at being generic in terms of the order of an attack. We selected three papers for detailed analysis. We present the three schemes in the following. In contrast to the generic algorithms given in the papers we provide concrete algorithmic descriptions tailored to 1st-order CPA because this kind of attack is in the focus of this thesis.

Because all three schemes are meant to provide security against 𝑑th-order attacks, they all use the general idea of secret sharing. If a secret is split into 𝑑 + 1 shares it cannot be reconstructed from any 𝑑 or fewer shares. This effectively enables protection against 𝑑th-order attacks. The first two schemes share and reconstruct the secret by means of XOR operations. The third scheme uses Adi Shamir’s original idea [Sha79] for secret sharing and reconstruction.

More differences between the three schemes lie in the approach they take to calculate the S-box. The first scheme relies on secure exponentiation in order to calculate inverse elements in GF(2^8). The second scheme uses a subfield approach for efficient inversion. The third scheme is again based on secure exponentiation but, as mentioned before, uses random polynomials for sharing and polynomial interpolation for reconstruction of the secret. To that extent it is slightly easier to directly compare the first two schemes while the third one requires a different point of view. We present the three schemes in the following sections which we named based on the respective papers’ authors.

6.2.1 Rivain-Prouff

The first scheme we present was proposed by Matthieu Rivain and Emmanuel Prouff at CHES 2010. While the original conference contribution is available as [RP10b] we prefer to cite the full version [RP10a] because it contains the relevant security proofs. We call the scheme RP10 in the following. The authors base the secure computation of the S-box on multiplicative inversion in GF(2^8) by means of exponentiation of field elements to the power of 254. This approach exploits Fermat’s Little Theorem:

Theorem 6.2.1 If 𝑝 is prime, then for any integer 𝑎, the number 𝑎^𝑝 − 𝑎 is an integer multiple of 𝑝. Formally, 𝑎^𝑝 ≡ 𝑎 mod 𝑝.

It can be deduced from this that 𝑎^(𝑝−2) ≡ 𝑎^−1 mod 𝑝 for any 𝑎 that is not a multiple of 𝑝, in other words 𝑎^(𝑝−2) is the multiplicative inverse of 𝑎 mod 𝑝. For finite fields of size 𝑞 we consider the multiplicative group formed by the nonzero field elements. The multiplicative group has order 𝑞 − 1 and by Lagrange’s Theorem we have 𝑎^(𝑞−1) = 1. We know that every element of the multiplicative group has a multiplicative inverse so that we can deduce 𝑎^(𝑞−2) = 𝑎^−1. In the case of GF(2^8) we have 𝑞 − 2 = 254. Every nonzero element of GF(2^8) can thus be inverted mod 𝑃(𝑥) by raising it to the 254th power. Multiplicative inversion is undefined for zero which is why the AES specification defines that zero maps to itself under inversion [FIP01, p. 15].

There are different approaches to exponentiation. The most naive one consists in repeated multiplication by 𝑎(𝑥) which would require 253 field multiplications. Unfortunately the multiplication of two field elements is a very expensive operation. On the other hand, squaring a field element is a linear operation. A better approach would thus be to use an algorithm like square-and-multiply which requires seven squarings and six multiplications in the case of 𝑒 = 254 = (11111110)_2. But there is an even better approach. Rivain and Prouff found an optimal addition chain to compute the exponentiation. In [KHL11] this chain is written as

\[
x \xrightarrow{S} x^{2} \xrightarrow{M} x^{3} \xrightarrow{2S} x^{12} \xrightarrow{M} x^{15} \xrightarrow{4S} x^{240} \xrightarrow{M} x^{252} \xrightarrow{M} x^{254}
\]

where S denotes one squaring, 2S denotes two squarings, 4S denotes four squarings, and M denotes multiplication by 𝑥, 𝑥^3, 𝑥^12, and 𝑥^2, respectively. This chain still requires seven squarings but the number of field multiplications is reduced to four. Three share tuples are needed for intermediate values.
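The addition chain can be traced in Python on unshared values. The sketch below is an illustration only; the counters in `ops` and the helper names are assumptions introduced to make the operation count visible.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

ops = {"S": 0, "M": 0}   # counts squarings and general multiplications

def sq(a):
    ops["S"] += 1
    return gf_mul(a, a)

def mul(a, b):
    ops["M"] += 1
    return gf_mul(a, b)

def inv_chain(x):
    """Compute x^254 along the chain x, x^2, x^3, x^12, x^15, x^240, x^252, x^254."""
    x2 = sq(x)                       # S
    x3 = mul(x2, x)                  # M (by x)
    x12 = sq(sq(x3))                 # 2S
    x15 = mul(x12, x3)               # M (by x^3)
    x240 = sq(sq(sq(sq(x15))))       # 4S
    x252 = mul(x240, x12)            # M (by x^12)
    return mul(x252, x2)             # M (by x^2)

inverse = inv_chain(0x53)
```

After this single call the counters show seven squarings and four multiplications, and `inverse` is the field inverse of 0x53.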

In order to perform the exponentiation according to the addition chain we need to compute squarings and multiplications. Squaring is straightforward because it is linear in fields of characteristic 2, which makes it compatible with the shared representation of the intermediates. Moreover, each share can be squared separately without compromising the privacy³ of the computation because the individual share is masked. With regard to multiplication we face a different situation. The following equation shows the multiplication of two shared field elements:

𝑎 · 𝑏 = (𝑎0 + 𝑎1) · (𝑏0 + 𝑏1) = 𝑎0𝑏0 + 𝑎1𝑏0 + 𝑎0𝑏1 + 𝑎1𝑏1 .

We clearly see that all four intermediate results are unprotected, in other words not masked in any way, and thus the computation is not private in the face of an intrusive attacker. In [RP10a] Rivain and Prouff present an algorithm for secure multiplication of shared intermediates over any given binary field. They derive their algorithm from the secure bitwise AND scheme proposed in [ISW03] which we henceforth refer to as the ISW scheme. The ISW scheme was developed to keep the computation of a bitwise AND operation private even if an adversary gains knowledge about the internal state of a hardware circuit. The idea can be transferred to software. While Ishai et al. had only proven their scheme to be secure against 𝑑/2 order attacks, Rivain and Prouff show in their paper that the scheme is in fact 𝑑th-order secure. We give the 1st-order secure multiplication over GF(2^8) according to Rivain and Prouff as Alg. 6.2.1.

³ The term “privacy” is inspired by the wording used in [ISW03].

Algorithm 6.2.1: SecMult – 1st-order Secure Multiplication over GF(2^8)
Data: shares 𝑎0, 𝑎1 satisfying 𝑎 = 𝑎0 ⊕ 𝑎1, shares 𝑏0, 𝑏1 satisfying 𝑏 = 𝑏0 ⊕ 𝑏1
Result: shares 𝑐0, 𝑐1 satisfying 𝑐0 ⊕ 𝑐1 = 𝑐 = 𝑎 ⊙ 𝑏

1 begin
2   𝑟0 ← rand(8)
3   𝑟1 ← (𝑟0 ⊕ 𝑎0𝑏1) ⊕ 𝑎1𝑏0
4   𝑐0 ← 𝑎0𝑏0 ⊕ 𝑟0
5   𝑐1 ← 𝑎1𝑏1 ⊕ 𝑟1
6   return (𝑐0, 𝑐1)

We see from Alg. 6.2.1 that the multiplication intermediates are now masked. It is also easy to see that the result is correct. Upon recombination we would get

𝑐 = 𝑐0 + 𝑐1 = (𝑎0𝑏0 ⊕ 𝑟0) ⊕ (𝑎1𝑏1 ⊕ 𝑟1)
  = 𝑎0𝑏0 ⊕ 𝑎1𝑏1 ⊕ 𝑟0 ⊕ 𝑟1
  = 𝑎0𝑏0 ⊕ 𝑎1𝑏1 ⊕ 𝑎0𝑏1 ⊕ 𝑎1𝑏0
  = (𝑎0 + 𝑎1) · (𝑏0 + 𝑏1)
  = 𝑎 · 𝑏 .
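Algorithm 6.2.1 translates almost line by line into Python. The sketch below is for illustration and makes no claim about side-channel behaviour on real hardware; the names `sec_mult` and `share` are hypothetical and `secrets.randbelow(256)` stands in for rand(8).

```python
import secrets

def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def share(x):
    """Split x into two boolean shares (x0, x1) with x0 XOR x1 = x."""
    m = secrets.randbelow(256)
    return x ^ m, m

def sec_mult(a, b):
    """1st-order secure multiplication of two shared field elements."""
    a0, a1 = a
    b0, b1 = b
    r0 = secrets.randbelow(256)
    # Parenthesisation as in the algorithm: r0 is added before the second
    # cross product so that a0*b1 XOR a1*b0 never appears unmasked.
    r1 = (r0 ^ gf_mul(a0, b1)) ^ gf_mul(a1, b0)
    c0 = gf_mul(a0, b0) ^ r0
    c1 = gf_mul(a1, b1) ^ r1
    return c0, c1

c0, c1 = sec_mult(share(0x57), share(0x83))
product = c0 ^ c1
```

Recombining the output shares gives the same value as the unshared product of 0x57 and 0x83.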

Using the secure multiplication, we can now give the detailed secure exponentiation from [RP10a] as Alg. 6.2.2.

Rivain and Prouff argue in [RP10a] that the masks 𝑎1 and 𝑏1 must be refreshed during the algorithm in order to keep them mutually independent. The RefreshMasks procedure is described in Alg. 6.2.3.

Algorithm 6.2.2: SecExp254 – 1st-order Secure Exponentiation to the 254 over GF(2^8)
Data: shares 𝑥0, 𝑥1 satisfying 𝑥0 ⊕ 𝑥1 = 𝑥
Result: shares 𝑦0, 𝑦1 satisfying 𝑦0 ⊕ 𝑦1 = 𝑦 = 𝑥^254

1 begin
2   for 𝑖 = 0 to 1 do
3     𝑧𝑖 ← 𝑥𝑖^2   [⨁𝑖 𝑧𝑖 = 𝑥^2]
4   RefreshMasks((𝑧0, 𝑧1))
5   (𝑦0, 𝑦1) ← SecMult((𝑧0, 𝑧1), (𝑥0, 𝑥1))   [⨁𝑖 𝑦𝑖 = 𝑥^3]
6   for 𝑖 = 0 to 1 do
7     𝑤𝑖 ← 𝑦𝑖^4   [⨁𝑖 𝑤𝑖 = 𝑥^12]
8   RefreshMasks((𝑤0, 𝑤1))
9   (𝑦0, 𝑦1) ← SecMult((𝑦0, 𝑦1), (𝑤0, 𝑤1))   [⨁𝑖 𝑦𝑖 = 𝑥^15]
10  for 𝑖 = 0 to 1 do
11    𝑦𝑖 ← 𝑦𝑖^16   [⨁𝑖 𝑦𝑖 = 𝑥^240]
12  (𝑦0, 𝑦1) ← SecMult((𝑦0, 𝑦1), (𝑤0, 𝑤1))   [⨁𝑖 𝑦𝑖 = 𝑥^252]
13  (𝑦0, 𝑦1) ← SecMult((𝑦0, 𝑦1), (𝑧0, 𝑧1))   [⨁𝑖 𝑦𝑖 = 𝑥^254]
14  return (𝑦0, 𝑦1)

Algorithm 6.2.3: RefreshMasks – 1st-order mask refreshing
Data: shares 𝑥0, 𝑥1 satisfying 𝑥 = 𝑥0 ⊕ 𝑥1
Result: shares 𝑥0, 𝑥1 satisfying 𝑥 = 𝑥0 ⊕ 𝑥1

1 begin
2   𝑟 ← rand(8)
3   𝑥0 ← 𝑥0 ⊕ 𝑟
4   𝑥1 ← 𝑥1 ⊕ 𝑟
5   return (𝑥0, 𝑥1)

Using the secure exponentiation to the 254 we can compute the S-box. The overall result is correct because both squaring and multiplication yield correct results as shown earlier. During the S-box computation, each element of the AES state is inverted in GF(2^8) by raising it to the 254. We subsequently have to apply the affine transformation. For the encryption this transformation consists of a byte-wise matrix multiplication followed by the byte-wise addition of a constant value 𝑐 = 0x63. Because we have two shares of the state we apply the matrix multiplication to all elements of both shares but the constant is only added to the elements of one share of the state. Adding the constant to all elements of both shares would effectively eliminate the constant with regard to recombination of the shares:

(𝑎0 ⊕ 𝑐 ⊕ 𝑎1 ⊕ 𝑐) = 𝑎 ≠ (𝑎 ⊕ 𝑐) .

The inverse affine transformation required for decryption can be split up in an analogous way. The full computation of the S-box which is secure against 1st-order attacks is given in Alg. 6.2.4.

Algorithm 6.2.4: 1st-order Secure S-box according to Rivain and Prouff
Data: shares 𝑥0, 𝑥1 satisfying 𝑥 = 𝑥0 ⊕ 𝑥1
Result: shares 𝑦0, 𝑦1 satisfying 𝑦0 ⊕ 𝑦1 = 𝑦 = 𝑆(𝑥)

1 begin
2   (𝑦0, 𝑦1) ← SecExp254(𝑥0, 𝑥1)
3   for 𝑖 = 0 to 1 do
4     𝑦𝑖 ← 𝐴𝑓(𝑦𝑖)
5   𝑦0 ← 𝑦0 ⊕ 0x63
6   return (𝑦0, 𝑦1)
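Algorithms 6.2.1 through 6.2.4 can be combined into a functional Python sketch of the shared S-box. It only illustrates the data flow and is an assumption-laden example rather than a hardened implementation: all function names are hypothetical, `secrets.randbelow(256)` stands in for rand(8), and the second refresh is applied to the shares of 𝑥^12 before they are multiplied with the shares of 𝑥^3.

```python
import secrets

def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_sq_n(a, n):
    """Square a field element n times, i.e. compute a^(2^n)."""
    for _ in range(n):
        a = gf_mul(a, a)
    return a

def sec_mult(a, b):
    """Alg. 6.2.1 on share pairs a = (a0, a1), b = (b0, b1)."""
    r0 = secrets.randbelow(256)
    r1 = (r0 ^ gf_mul(a[0], b[1])) ^ gf_mul(a[1], b[0])
    return gf_mul(a[0], b[0]) ^ r0, gf_mul(a[1], b[1]) ^ r1

def refresh_masks(x):
    """Alg. 6.2.3: re-randomise both shares with one fresh byte."""
    r = secrets.randbelow(256)
    return x[0] ^ r, x[1] ^ r

def sec_exp254(x):
    """Alg. 6.2.2: shares of x^254 from shares of x."""
    z = refresh_masks((gf_sq_n(x[0], 1), gf_sq_n(x[1], 1)))   # x^2
    y = sec_mult(z, x)                                        # x^3
    w = refresh_masks((gf_sq_n(y[0], 2), gf_sq_n(y[1], 2)))   # x^12
    y = sec_mult(y, w)                                        # x^15
    y = (gf_sq_n(y[0], 4), gf_sq_n(y[1], 4))                  # x^240
    y = sec_mult(y, w)                                        # x^252
    return sec_mult(y, z)                                     # x^254

def affine_linear(y):
    """Linear part Af of the AES affine transformation (without 0x63)."""
    rot = lambda s: ((y << s) | (y >> (8 - s))) & 0xFF
    return y ^ rot(1) ^ rot(2) ^ rot(3) ^ rot(4)

def sec_sbox(x):
    """Alg. 6.2.4: the constant 0x63 is added to one share only."""
    y0, y1 = sec_exp254(x)
    return affine_linear(y0) ^ 0x63, affine_linear(y1)

m = secrets.randbelow(256)
y0, y1 = sec_sbox((0x53 ^ m, m))
s_out = y0 ^ y1
```

Recombining the two output shares yields the S-box output for the input 0x53, independent of the randomness drawn during the computation.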

6.2.2 Rivain-Prouff without Mask Refreshing

When they proposed the RP10 scheme, Rivain and Prouff argued that “for the 𝑑th-order security to hold, it is important that the masks (𝑎_𝑖)_(𝑖≥1) and (𝑏_𝑖)_(𝑖≥1) in input of the SecMult algorithm are mutually independent” [RP10a]. In order to achieve this mutual independence the authors proposed to refresh the masks in the second and fourth steps of Alg. 6.2.2 (lines 4 and 8). Their mask refreshing technique was subsequently used as a building block for new masking schemes [GM11, KHL11].

Rivain and Prouff also took part in the research leading up to [CPRR13] in which the authors point out that the mask refreshing technique proposed in [RP10a] causes a security flaw in the overall masking scheme. They state that “even if both the mask-refreshing procedure and the ISW multiplication are secure at order 𝑑, their composition is insecure and it is defeated by an attack of order [𝑑/2 + 1].”

We refrain from reproducing the full description of the flaw at this point but it is made clear in [CPRR13] that the mask refreshing procedure does in fact not fully guarantee the required mutual independence of the masks and the SecMult input. As a consequence, RP10 and all other schemes using mask refreshing in this way are not fully secure against 𝑑th-order attacks. The authors propose a different approach that requires no mask refreshing. Their idea is to securely evaluate the product of a value 𝑥 and the result of an F2-linear function 𝑔(𝑥) over GF(2^𝑛), for example 𝑔(𝑥) = 𝑥^2. All possible values of 𝑔(𝑥) and of the product 𝑥 · 𝑔(𝑥) are stored in lookup tables. Formally we have the new Alg. 6.2.5 for the share processing which we call SecProc.

Algorithm 6.2.6 shows how the new method is plugged into the secure exponentiation. We denote with SecProc8 the secure processing of shares over GF(2^8). Taking a naive approach we need five lookup tables:

T2 so that 𝑇2[𝑥] = 𝑥^2

T3 so that 𝑇3[𝑥] = 𝑥^3 = 𝑥 · 𝑔(𝑥), 𝑔(𝑥) = 𝑥^2

T4 so that 𝑇4[𝑥] = 𝑥^4

T5 so that 𝑇5[𝑥] = 𝑥^5 = 𝑥 · 𝑔(𝑥), 𝑔(𝑥) = 𝑥^4

T16 so that 𝑇16[𝑥] = 𝑥^16

where 𝑥 and all table entries are elements of GF(2^8). Each table requires 256 bytes of ROM. Note that different tradeoffs are possible: The tables 𝑇4 and 𝑇16 can be replaced by two or four subsequent 𝑇2 lookups respectively. For the rest of this thesis we refer to the repaired variant of RP10 as CPRR13.

Algorithm 6.2.5: SecProc – Secure evaluation of ℎ : 𝑥 ↦ 𝑥 · 𝑔(𝑥) over GF(2^𝑛)
Data: shares 𝑎0, 𝑎1 satisfying 𝑎0 ⊕ 𝑎1 = 𝑎, a lookup table ℎ : 𝑥 ↦ 𝑥 · 𝑔(𝑥)
Result: shares 𝑐0, 𝑐1 satisfying 𝑐0 ⊕ 𝑐1 = 𝑐 = 𝑎 · 𝑔(𝑎) for some F2-linear function 𝑔

1 begin
2   𝑟0 ← rand(𝑛)
3   𝑟′ ← rand(𝑛)
4   𝑡 ← 𝑟0
5   𝑡 ← 𝑡 ⊕ ℎ[𝑎0 ⊕ 𝑟′]
6   𝑡 ← 𝑡 ⊕ ℎ[𝑎1 ⊕ 𝑟′]
7   𝑡 ← 𝑡 ⊕ ℎ[(𝑎0 ⊕ 𝑟′) ⊕ 𝑎1]
8   𝑡 ← 𝑡 ⊕ ℎ[𝑟′]
9   𝑟1 ← 𝑡
10  𝑐0 ← ℎ[𝑎0] ⊕ 𝑟0
11  𝑐1 ← ℎ[𝑎1] ⊕ 𝑟1
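The share processing of Alg. 6.2.5 can be sketched in Python for 𝑛 = 8. The sketch is an illustration under assumptions: the tables `T3` and `T5` are generated at start-up instead of being stored in ROM, the name `sec_proc` is hypothetical, and `secrets.randbelow(256)` replaces rand(𝑛).

```python
import secrets

def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_sq(a):
    return gf_mul(a, a)

# h(x) = x * g(x) for the F2-linear functions g(x) = x^2 and g(x) = x^4
T3 = [gf_mul(x, gf_sq(x)) for x in range(256)]           # x^3
T5 = [gf_mul(x, gf_sq(gf_sq(x))) for x in range(256)]    # x^5

def sec_proc(a, h):
    """Shares of h(a0 XOR a1) computed without mask refreshing."""
    a0, a1 = a
    r0 = secrets.randbelow(256)
    rp = secrets.randbelow(256)      # r' in the algorithm
    t = r0
    t ^= h[a0 ^ rp]
    t ^= h[a1 ^ rp]
    t ^= h[(a0 ^ rp) ^ a1]           # a0 and a1 are never combined unmasked
    t ^= h[rp]
    r1 = t
    return h[a0] ^ r0, h[a1] ^ r1

m = secrets.randbelow(256)
c0, c1 = sec_proc((0x53 ^ m, m), T3)
cube = c0 ^ c1
```

The four table lookups in `t` sum up to the cross terms 𝑎0 · 𝑔(𝑎1) ⊕ 𝑎1 · 𝑔(𝑎0), which is why the recombined result equals ℎ(𝑎). This relies on 𝑔 being F2-linear.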

Algorithm 6.2.6: 1st-order Secure Exponentiation to the 254 over GF(2^8) without Mask Refreshing
Data: shares 𝑥0, 𝑥1 satisfying 𝑥0 ⊕ 𝑥1 = 𝑥
Result: shares 𝑦0, 𝑦1 satisfying 𝑦0 ⊕ 𝑦1 = 𝑦 = 𝑥^254

1 begin
2   (𝑦0, 𝑦1) ← SecProc8((𝑥0, 𝑥1), 𝑇3)   [⨁𝑖 𝑦𝑖 = 𝑥^3]
3   for 𝑖 = 0 to 1 do
4     𝑤𝑖 ← 𝑇4[𝑦𝑖]   [⨁𝑖 𝑤𝑖 = 𝑥^12]
5   (𝑦0, 𝑦1) ← SecProc8((𝑦0, 𝑦1), 𝑇5)   [⨁𝑖 𝑦𝑖 = 𝑥^15]
6   for 𝑖 = 0 to 1 do
7     𝑦𝑖 ← 𝑇16[𝑦𝑖]   [⨁𝑖 𝑦𝑖 = 𝑥^240]
8   (𝑦0, 𝑦1) ← SecMult((𝑦0, 𝑦1), (𝑤0, 𝑤1))   [⨁𝑖 𝑦𝑖 = 𝑥^252]
9   for 𝑖 = 0 to 1 do
10    𝑤𝑖 ← 𝑇2[𝑥𝑖]   [⨁𝑖 𝑤𝑖 = 𝑥^2]
11  (𝑦0, 𝑦1) ← SecMult((𝑦0, 𝑦1), (𝑤0, 𝑤1))   [⨁𝑖 𝑦𝑖 = 𝑥^254]

6.2.3 Kim-Hong-Lim

At CHES 2011 another higher-order masking scheme was proposed by HeeSeok Kim, Seokhie Hong, and Jongin Lim [KHL11]. The authors describe a software implementation of the hardware inverter architecture presented in [SMTM01] which in turn uses the composite field inversion method proposed by Guajardo and Paar [GP97]. We first describe the inversion method from [GP97] followed by an overview of the hardware design by Satoh et al. which then directly leads us to the analysis of the approach taken by Kim, Hong, and Lim. Henceforth we refer to this masking scheme as KHL11.

Composite Field Arithmetic

We have seen in the previous section that the complexity of computing a multiplicative inversion in GF(2^8) can be reduced by using an optimal addition chain. Nevertheless it is still a computationally intensive task that requires four field multiplications and seven squarings. In fact the inversion can be computed even faster if subfield arithmetic is used.

While AES is designed to make direct use of GF(2^8) constructed as GF(2)[𝑥]/𝑃(𝑥), 𝑃(𝑥) = 𝑥^8 + 𝑥^4 + 𝑥^3 + 𝑥 + 1, it is obvious that there are other fields isomorphic to GF(2^8). For example 256 = 16^2 so that it is easy to construct a field GF(16^2) ≅ GF(2^8). The same idea can be applied repeatedly to construct a tower of fields. Applying three extensions of degree two in a row to GF(2) leads to GF(((2^2)^2)^2) ≅ GF(2^8) because ((2^2)^2)^2 = 2^8 = 256.

As part of their research on elliptic curve cryptography, Guajardo and Paar [GP97] presented a method for inversion over composite fields of the form GF((2^𝑛)^𝑚). Such fields are isomorphic to GF(2^𝑛)[𝑥]/𝑃(𝑥) where 𝑃(𝑥) is a monic irreducible polynomial of degree 𝑚 over GF(2^𝑛). The idea is to push the inversion down to GF(2^𝑛) which yields a huge advantage in terms of efficiency. The following theorem describes how the inversion is computed.


Theorem 6.2.2 [Paa94] The multiplicative inverse of an element 𝐴 of the composite field GF((2^𝑛)^𝑚), 𝐴 ≠ 0, can be computed by

𝐴^−1 = (𝐴^𝑟)^−1 𝐴^(𝑟−1) mod 𝑃(𝑥)

where 𝐴^𝑟 ∈ GF(2^𝑛) and 𝑟 = (2^(𝑛𝑚) − 1)/(2^𝑛 − 1).

The central property here is that 𝐴^𝑟 ∈ GF(2^𝑛) which allows for calculation of the actual inversion in the subfield. In order to invert 𝐴 in GF((2^𝑛)^𝑚) we need to perform four steps. We first have to calculate 𝐴^(𝑟−1) in GF((2^𝑛)^𝑚) from which we get 𝐴^𝑟 through multiplication with 𝐴 as the second step. The result 𝐴^𝑟 is an element of GF(2^𝑛) which is then inverted over that very field as the third step. The fourth and final step is to calculate (𝐴^𝑟)^−1 𝐴^(𝑟−1).
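The four steps can be tried out for 𝑛 = 4 and 𝑚 = 2, i.e. 𝑟 = 17, with the following Python sketch. As a simplifying assumption it works directly in the AES representation of GF(2^8), where the subfield GF(2^4) consists of the 16 elements satisfying 𝑏^16 = 𝑏, instead of an explicit composite field representation. All function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_pow(a, e):
    """Square-and-multiply exponentiation in GF(2^8)."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def inv_via_subfield(a):
    """A^-1 = (A^17)^-1 * A^16 with the inversion done in GF(2^4)."""
    a16 = gf_pow(a, 16)            # step 1: A^(r-1)
    a17 = gf_mul(a16, a)           # step 2: A^r, an element of the subfield
    a17_inv = gf_pow(a17, 14)      # step 3: b^14 = b^-1 for b in GF(2^4)
    return gf_mul(a17_inv, a16)    # step 4: (A^r)^-1 * A^(r-1)

subfield = [b for b in range(256) if gf_pow(b, 16) == b]
inverse = inv_via_subfield(0x53)
```

The exponents add up as expected: 17 · 14 + 16 = 254, so the result agrees with raising the input to the power of 254.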

Based on this method for subfield inversion we can now look into the inverter design presented in [SMTM01].

A Composite Field Inverter Hardware Architecture

Optimizing the efficiency of the AES S-box has always been a popular research topic. Initially Rudra et al. proposed a way to improve the efficiency of AES encryption through composite field arithmetic [RDJ+01]. Later Satoh et al. proposed an even better optimization [SMTM01]. In addition to the theoretic optimization they presented a practical hardware architecture for a subfield inverter. We describe their ideas in the following.

Satoh and his colleagues first construct a composite field by means of repeated extensions starting with GF(2):

\[
\begin{cases}
GF(2^2) & : \; P_0(x) = x^2 + x + 1 \\
GF((2^2)^2) & : \; P_1(x) = x^2 + x + \varphi \\
GF(((2^2)^2)^2) & : \; P_2(x) = x^2 + x + \lambda
\end{cases}
\]

where 𝜑 = {10}_2 and 𝜆 = {1100}_2. They subsequently apply the inversion method proposed by Guajardo and Paar. Denoting the composite field as GF((2^4)^2) gives 𝑛 = 4 and 𝑚 = 2. Consequently, 𝑟 = (2^8 − 1)/(2^4 − 1) = 17 so that

𝐴^−1 = (𝐴^17)^−1 𝐴^16 mod 𝑃(𝑥) . (6.2)

At this point only one piece is missing before we can actually invert elements of GF(2^8): The composite field is isomorphic but not identical to GF(2^8) which means that we cannot directly apply the inversion to an element of GF(2^8). If we intend to invert an element 𝑥 ∈ GF(2^8) we have to map it to GF((2^4)^2), invert the result, and map the inverted value back to GF(2^8). We thus need two isomorphism functions:

𝛿 : GF(2^8) → GF((2^4)^2)
𝛿^−1 : GF((2^4)^2) → GF(2^8)


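[Figure 6.1: GF((2^4)^2) inverter according to Satoh et alii. Block diagram, from left to right: 𝑥 → 𝛿 → split into 𝑎ℎ and 𝑎𝑙 → adder, 𝑝^2, multiplier, ×𝜆, adder, 𝑝^−1, two multipliers → 𝑎′ℎ and 𝑎′𝑙 → 𝛿^−1 → 𝐴𝑓 → 𝑆(𝑥). All internal signals are 4 bits wide.]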

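Those functions are given in [SMTM01]. As in (2.2) and (2.4),⁴ we write them in matrix notation:

\[
\delta(x) =
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \\ a_7 \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 1 & 1 & 1 & 1 & 0 & 0 & 1 \\
0 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 1 & 0 & 1 & 0 & 1 \\
0 & 0 & 1 & 1 & 0 & 1 & 0 & 1 \\
0 & 1 & 1 & 1 & 1 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 1
\end{pmatrix}
\cdot
\begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{pmatrix}
\tag{6.3}
\]

as well as the inverse

\[
\delta^{-1}(a) =
\begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 1 & 0 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 0 & 0 & 1 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 & 1 & 1 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}
\cdot
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \\ a_7 \end{pmatrix}
\tag{6.4}
\]

where 𝑥 ∈ GF(2^8), 𝑎 ∈ GF((2^4)^2), and (𝑥0, . . . , 𝑥7) resp. (𝑎0, . . . , 𝑎7) denote bits 0 (the LSB) through 7 (the MSB) of 𝑥 and 𝑎. We now have everything we need to invert elements of GF(2^8).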

Figure 6.1 shows the inverter proposed in [SMTM01]. The boxes labeled with + and × represent addition and multiplication. The box labeled with 𝑝^2 means that the input value is squared and the box labeled with ×𝜆 means that the input is multiplied with 𝜆. The box labeled 𝑝^−1 denotes the inversion.

Going through the figure from left to right, we start with the input 𝑥 ∈ GF(2^8) which is passed to the 𝛿 function in order to map it to GF((2^4)^2). The result is 𝑎(𝑥) = 𝑎ℎ𝑥 + 𝑎𝑙 with 𝑎ℎ, 𝑎𝑙 ∈ GF(2^4). The coefficients 𝑎ℎ and 𝑎𝑙 are subsequently treated as individual four-bit values each representing one polynomial in GF(2^4).

⁴ See pages 8 and 11.


We now come back to (6.2) and the required steps for the full inversion. The first step is the computation of 𝐴^16. We have 𝐴(𝑥) = 𝑎ℎ𝑥 + 𝑎𝑙. We can calculate 𝐴^16:

\[
\begin{aligned}
A^{16} &= a_h^{16} x + a_h^{16}(\lambda + \lambda^{2} + \lambda^{4} + \lambda^{8}) + a_l^{16} \\
       &= a_h^{16} x + \{0001\}_2 \, a_h^{16} + a_l^{16} \\
       &\equiv a_h x + (a_h + a_l) \mod P_2(x)
\end{aligned}
\]

where the first line follows from repeatedly substituting 𝑥^2 ≡ 𝑥 + 𝜆 mod 𝑃2(𝑥) and the last step holds because 𝑎^16 = 𝑎 for every 𝑎 ∈ GF(2^4). The leftmost adder in Fig. 6.1 provides us with (𝑎ℎ + 𝑎𝑙) while 𝑎ℎ itself remains unchanged.

The second step is to compute 𝐴^17 = 𝐴 · 𝐴^16. We write down the following equations:

\[
\begin{aligned}
A \cdot A^{16} &= (a_h x + a_l)(a_h x + (a_h + a_l)) \\
               &= a_h^{2} x^{2} + a_h (a_h + a_l) x + a_h a_l x + (a_h + a_l) a_l \\
               &= a_h^{2} x^{2} + a_h^{2} x + (a_h + a_l) a_l \\
               &\equiv a_h^{2} (x + \lambda) + a_h^{2} x + (a_h + a_l) a_l \mod P_2(x) \\
               &= a_h^{2} \lambda + (a_h + a_l) a_l
\end{aligned}
\]

from which we see that 𝐴^17 is indeed an element of GF(2^4) because the upper coefficient is zero. The remaining inverter components depicted in the left half of the figure perform the computation of 𝐴^17 which is then fed into the component that computes 𝐴^−17. At this point it is important to note that inversion in GF(2^4) can be computed by raising an element to the power of 14.

After the inversion we have 𝐴^−17. As the final step of the overall GF((2^4)^2) inversion we need to multiply 𝐴^−17 with 𝐴^16. This final multiplication is depicted in Fig. 6.1 in terms of the rightmost two multipliers. The resulting outputs 𝑎′ℎ and 𝑎′𝑙 are then recombined into an 8-bit value and passed to the 𝛿^−1 function. The following affine transformation concludes the computation and we have 𝑆(𝑥).
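The inverter data path can be replayed in Python on the tower field itself. The sketch below is an illustration under assumptions: it encodes each field element with the high coefficient in the upper bits, so that 𝜑 = {10}_2 and 𝜆 = {1100}_2 become `0b10` and `0b1100`, it stays entirely inside GF(((2^2)^2)^2), and it leaves out 𝛿, 𝛿^−1 and the affine transformation. All function names are hypothetical.

```python
PHI = 0b10       # constant term of P1(x) = x^2 + x + phi
LAMBDA = 0b1100  # constant term of P2(x) = x^2 + x + lambda

def mul4(a, b):
    """GF(2^2) multiplication modulo P0(x) = x^2 + x + 1."""
    a1, a0, b1, b0 = a >> 1, a & 1, b >> 1, b & 1
    hi = (a1 & b1) ^ (a1 & b0) ^ (a0 & b1)
    lo = (a0 & b0) ^ (a1 & b1)
    return (hi << 1) | lo

def mul16(a, b):
    """GF((2^2)^2) multiplication modulo P1(x)."""
    ah, al, bh, bl = a >> 2, a & 3, b >> 2, b & 3
    hh = mul4(ah, bh)
    hi = hh ^ mul4(ah, bl) ^ mul4(al, bh)
    lo = mul4(al, bl) ^ mul4(hh, PHI)
    return (hi << 2) | lo

def mul256(a, b):
    """GF(((2^2)^2)^2) multiplication modulo P2(x)."""
    ah, al, bh, bl = a >> 4, a & 15, b >> 4, b & 15
    hh = mul16(ah, bh)
    hi = hh ^ mul16(ah, bl) ^ mul16(al, bh)
    lo = mul16(al, bl) ^ mul16(hh, LAMBDA)
    return (hi << 4) | lo

def inv16(a):
    """Inversion in GF(2^4) by raising a to the power of 14."""
    r = 1
    for _ in range(14):
        r = mul16(r, a)
    return r

def inv256(A):
    """Inverter data path: A^-1 = (A^17)^-1 * A^16."""
    ah, al = A >> 4, A & 15
    t = ah ^ al                                          # A^16 = ah*x + (ah + al)
    a17 = mul16(mul16(ah, ah), LAMBDA) ^ mul16(t, al)    # A^17 in GF(2^4)
    d = inv16(a17)                                       # A^-17
    return (mul16(d, ah) << 4) | mul16(d, t)             # a'_h and a'_l

example = inv256(0x5A)
```

The function `inv256` needs three multiplications, one squaring, one multiplication by 𝜆 and one inversion in GF(2^4), which matches the components shown in Fig. 6.1.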

Secure Inversion in Software

The inverter design published in [SMTM01] has two properties that do not perfectly match our goal of creating an AES software implementation secure against 1st-order SCA. Firstly, Satoh et al. proposed their design in an effort to enable efficient hardware implementation of the Rijndael S-box and inverse S-box. Secondly, side-channel security was not in the scope of this publication. This is where [KHL11] closes the gap related to the goal of our thesis. Kim, Hong, and Lim pick up on the existing inverter design and propose algorithms for secure implementation of the composite field arithmetic in software.

As far as the overall masking of AES is concerned, the ideas of Kim et al. differ only in the approach that is taken towards the computation of the S-box. The subfield approach leads to reduced complexity. In order to speed up the subfield operations even more, the authors resort to six lookup tables⁵, namely:

T1 so that 𝑇1[𝑋] = 𝑋^2; 𝑋, 𝑋^2 ∈ GF(2^4)

T2 so that 𝑇2[𝑋] = 𝑋^4; 𝑋, 𝑋^4 ∈ GF(2^4)

T3 so that 𝑇3[𝑋] = 𝜆𝑋^2; 𝑋, 𝜆𝑋^2 ∈ GF(2^4)

T4 so that 𝑇4[𝑋][𝑌] = 𝑋𝑌; 𝑋, 𝑌, 𝑋𝑌 ∈ GF(2^4)

T5 so that 𝑇5[𝑋] = 𝛿(𝑋); 𝑋 ∈ GF(2^8), 𝛿(𝑋) ∈ GF(((2^2)^2)^2)

T6 so that 𝑇6[𝑋] = 𝐴𝑓(𝛿^−1(𝑋)); 𝑋 ∈ GF(((2^2)^2)^2), 𝐴𝑓(𝛿^−1(𝑋)) ∈ GF(2^8)

⁵ We provide the tables as C code in Appendix B.1.

We begin by recalling that inversion in GF(2^4) can be performed by raising 𝑥 to the 14. Kim et al. use the addition chain

\[
x \xrightarrow{S} x^{2} \xrightarrow{M} x^{3} \xrightarrow{2S} x^{12} \xrightarrow{M} x^{14}
\]

to compute this exponentiation. Given the lookup tables and the addition chain we first tailor Algorithm 2 from [KHL11] to 𝑑 = 1 and give the result as Alg. 6.2.7. Based on this, we subsequently reproduce Algs. 2 and 3 from the same source and give them as Algs. 6.2.8 and 6.2.9. The latter maps perfectly to Fig. 6.1: Line 3 represents the 𝛿 function and the output splitting, lines 4-6 and 8 represent the preparations for the inversion over GF(2^4), line 9 describes the actual inversion, and lines 10-11 as well as 13-14 represent the final multiplication and the inverse isomorphism mapping including the constant addition we described on page 45. As an important hint for implementation, we wish to point out that the RefreshMasks procedure used here retrieves only 4-bit random values in contrast to the original RP10 version we gave as Alg. 6.2.3.

Algorithm 6.2.7: SecMult4 – 1st-order Secure Multiplication over GF(2^4)
Data: shares 𝑎0, 𝑎1 satisfying 𝑎 = 𝑎0 ⊕ 𝑎1, shares 𝑏0, 𝑏1 satisfying 𝑏 = 𝑏0 ⊕ 𝑏1
Result: shares 𝑐0, 𝑐1 satisfying 𝑐0 ⊕ 𝑐1 = 𝑐 = 𝑎 ⊙ 𝑏 ∈ GF(2^4)

1 begin
2   𝑟0 ← rand(4)
3   𝑟1 ← (𝑟0 ⊕ 𝑇4[𝑎0][𝑏1]) ⊕ 𝑇4[𝑎1][𝑏0]
4   𝑐0 ← 𝑇4[𝑎0][𝑏0] ⊕ 𝑟0
5   𝑐1 ← 𝑇4[𝑎1][𝑏1] ⊕ 𝑟1

Algorithm 6.2.8: SecInv4 – 1st-order Secure Inversion over GF(2^4)
Data: shares 𝑥0, 𝑥1 satisfying 𝑥 = 𝑥0 ⊕ 𝑥1
Result: shares 𝑦0, 𝑦1 satisfying 𝑦0 ⊕ 𝑦1 = 𝑦 = 𝑥^−1 = 𝑥^14 ∈ GF(2^4)

1 begin
2   for 𝑖 = 0 to 1 do
3     𝑤𝑖 ← 𝑇1[𝑥𝑖]   [⨁𝑖 𝑤𝑖 = 𝑥^2]
4   RefreshMasks((𝑤0, 𝑤1))
5   (𝑧0, 𝑧1) ← SecMult4((𝑤0, 𝑤1), (𝑥0, 𝑥1))   [⨁𝑖 𝑧𝑖 = 𝑥^3]
6   for 𝑖 = 0 to 1 do
7     𝑧𝑖 ← 𝑇2[𝑧𝑖]   [⨁𝑖 𝑧𝑖 = 𝑥^12]
8   (𝑦0, 𝑦1) ← SecMult4((𝑧0, 𝑧1), (𝑤0, 𝑤1))   [⨁𝑖 𝑦𝑖 = 𝑥^14]

Algorithm 6.2.9: 1st-order secure masking of the AES S-box
Data: shares 𝑥0, 𝑥1 satisfying 𝑥0 ⊕ 𝑥1 = 𝑥 ∈ GF(2^8)
Result: shares 𝑦0, 𝑦1 satisfying 𝑦0 ⊕ 𝑦1 = 𝑆(𝑥) ∈ GF(2^8)

1 begin
2   for 𝑖 = 0 to 1 do
3     (𝐻𝑖‖𝐿𝑖) ← 𝑇5[𝑥𝑖]   [𝐻𝑖, 𝐿𝑖 ∈ GF(2^4)]
4     𝑤𝑖 ← 𝑇3[𝐻𝑖]
5     𝑡𝑖 ← 𝐻𝑖 ⊕ 𝐿𝑖
6   (𝐿0, 𝐿1) ← SecMult4((𝑡0, 𝑡1), (𝐿0, 𝐿1))
7   for 𝑖 = 0 to 1 do
8     𝑤𝑖 ← 𝑤𝑖 ⊕ 𝐿𝑖
9   (𝑤0, 𝑤1) ← SecInv4((𝑤0, 𝑤1))
10  (𝐻0, 𝐻1) ← SecMult4((𝑤0, 𝑤1), (𝐻0, 𝐻1))
11  (𝐿0, 𝐿1) ← SecMult4((𝑤0, 𝑤1), (𝑡0, 𝑡1))
12  for 𝑖 = 0 to 1 do
13    𝑦𝑖 ← 𝑇6[𝐻𝑖‖𝐿𝑖]
14  𝑦0 ← 𝑦0 ⊕ 0x63

6.2.4 Kim-Hong-Lim without Mask Refreshing

The KHL11 scheme uses the same mask refreshing technique as the RP10 scheme which makes it potentially vulnerable as well. The secure inversion over GF(2^4) we presented earlier as Alg. 6.2.8 is affected by this problem. We take another look at the first three steps of the algorithm: The squaring table is used, the masks are refreshed, and SecMult4 is invoked. This way, shares (𝑧0, 𝑧1) of 𝑥^3 are computed. We learned from [CPRR13] that these exact three steps can be replaced by a SecProc invocation. We thus decided to fix KHL11 by changing the SecInv algorithm accordingly. We introduce SecProc4 to denote secure share processing over GF(2^4) and give the fixed SecInv function as Alg. 6.2.10. The cost of this fix comes as an additional 16-byte lookup table 𝑇7 so that 𝑇7[𝑥] = 𝑥^3 where 𝑥 and 𝑥^3 are elements of GF(2^4). For the rest of this thesis we refer to the fixed KHL11 as SKHL13.

Algorithm 6.2.10: 1st-order Secure Inversion over GF(2^4)
Data: shares 𝑥0, 𝑥1 satisfying 𝑥 = 𝑥0 ⊕ 𝑥1
Result: shares 𝑦0, 𝑦1 satisfying 𝑦0 ⊕ 𝑦1 = 𝑦 = 𝑥^14 = 𝑥^−1 ∈ GF(2^4)

1 begin
2   (𝑦0, 𝑦1) ← SecProc4((𝑥0, 𝑥1), 𝑇7)   [⨁𝑖 𝑦𝑖 = 𝑥^3]
3   for 𝑖 = 0 to 1 do
4     𝑧𝑖 ← 𝑇2[𝑦𝑖]   [⨁𝑖 𝑧𝑖 = 𝑥^12]
5   for 𝑖 = 0 to 1 do
6     𝑤𝑖 ← 𝑇1[𝑥𝑖]   [⨁𝑖 𝑤𝑖 = 𝑥^2]
7   (𝑦0, 𝑦1) ← SecMult4((𝑧0, 𝑧1), (𝑤0, 𝑤1))   [⨁𝑖 𝑦𝑖 = 𝑥^14]
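The repaired inversion over GF(2^4) can be sketched in Python as follows. The sketch makes several assumptions for the sake of a self-contained example: GF(2^4) is represented as GF(2)[𝑥]/(𝑥^4 + 𝑥 + 1) rather than as the tower field used by Satoh et al., the tables are generated at start-up, and `secrets.randbelow(16)` stands in for rand(4). The table names follow the text; the function names are hypothetical.

```python
import secrets

def mul16(a, b):
    """Multiply in GF(2^4), assumed here as GF(2)[x]/(x^4 + x + 1)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= 0x13
        b >>= 1
    return r

T1 = [mul16(x, x) for x in range(16)]                       # x^2
T2 = [T1[T1[x]] for x in range(16)]                         # x^4
T4 = [[mul16(x, y) for y in range(16)] for x in range(16)]  # x * y
T7 = [mul16(x, T1[x]) for x in range(16)]                   # x^3

def sec_mult4(a, b):
    """Alg. 6.2.7 using the multiplication table T4."""
    r0 = secrets.randbelow(16)
    r1 = (r0 ^ T4[a[0]][b[1]]) ^ T4[a[1]][b[0]]
    return T4[a[0]][b[0]] ^ r0, T4[a[1]][b[1]] ^ r1

def sec_proc4(a, h):
    """Alg. 6.2.5 for n = 4."""
    a0, a1 = a
    r0, rp = secrets.randbelow(16), secrets.randbelow(16)
    r1 = r0 ^ h[a0 ^ rp] ^ h[a1 ^ rp] ^ h[(a0 ^ rp) ^ a1] ^ h[rp]
    return h[a0] ^ r0, h[a1] ^ r1

def sec_inv4(x):
    """Alg. 6.2.10: shares of x^14 without mask refreshing."""
    y = sec_proc4(x, T7)              # shares of x^3
    z = (T2[y[0]], T2[y[1]])          # shares of x^12
    w = (T1[x[0]], T1[x[1]])          # shares of x^2
    return sec_mult4(z, w)            # shares of x^14

m = secrets.randbelow(16)
y0, y1 = sec_inv4((0x7 ^ m, m))
inverse = y0 ^ y1
```

Multiplying `inverse` with the input value 0x7 in GF(2^4) gives 1, independent of the masks drawn.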

6.2.5 Goubin-Martinelli

The two masking schemes we just presented employ a combination of two techniques in order to protect sensitive variables. The first technique can generally be called secret sharing because variables are split into multiple parts all of which are required to reconstruct the original value. In fact RP10/CPRR13 and KHL11/SKHL13 use a (𝑑 + 1, 𝑑 + 1) threshold scheme⁶ to share and recombine the sensitive variables. The second technique comes into play when computations are performed on the shared variables. Secure manipulation of shared variables is performed by means of Secure Multiparty Computation.

⁶ Here, 𝑑 is the DPA order.

At CHES 2011 Goubin and Martinelli presented a masking scheme [GM11] that resorts to the original idea of polynomial interpolation as proposed by Shamir [Sha79]. We discuss it here in order to have a broader horizon when looking at software countermeasures, but we keep the discussion shorter than in the previous sections.

The original idea of secret sharing is based on polynomial interpolation. For a set of 𝑘 tuples (𝑥𝑖, 𝑦𝑖) where all 𝑥𝑖 are distinct, there is exactly one polynomial 𝑝(𝑥) of degree at most 𝑘 − 1 for which 𝑝(𝑥𝑖) = 𝑦𝑖 for all 𝑖 (cf. [Sha79] and [Knu97]). A secret 𝑎0 can be protected by randomly generating a polynomial 𝑝(𝑥) = 𝑎_(𝑘−1)𝑥^(𝑘−1) + · · · + 𝑎0 of degree 𝑘 − 1 and then evaluating 𝐷1 = 𝑝(1), . . . , 𝐷𝑖 = 𝑝(𝑖), . . . , 𝐷𝑛 = 𝑝(𝑛) where 𝑛 ≥ 𝑘. Any arbitrarily chosen subset of 𝑘 distinct tuples (𝑖, 𝐷𝑖) can then be used to interpolate 𝑝(𝑥) and subsequently evaluate 𝑝(0) = 𝑎0 to reconstruct the secret. Any subset of fewer than 𝑘 distinct tuples (𝑖, 𝐷𝑖) is insufficient to reconstruct 𝑎0. We give the algorithmic notation of Shamir’s Secret Sharing as presented in [GM11] as Alg. 6.2.11. In this case we left the parameter 𝑑 unchanged instead of replacing it with a concrete value. The reconstruction of 𝑎0 works by computing

\[
a_0 = \sum_{i=0}^{d} y_i \cdot \beta_i
\quad \text{where} \quad
\beta_i = \prod_{j=0,\, j \neq i}^{d} \frac{-x_j}{x_i - x_j} .
\]

Algorithm 6.2.11: Shamir's Secret Sharing Scheme
Data: a secret $a_0$, random values $(x_i)_{i=0..d-1}$
Result: shares $(x_i, y_i)_{i=0..d-1}$

1 begin
2   for $i = 1$ to $d-1$ do
3     $a_i = \mathrm{rand}(n)$
4   for $i = 0$ to $d$ do
5     $y_i = a_{d-1}x_i^{d-1} + a_{d-2}x_i^{d-2} + \dots + a_1 x_i + a_0$
6   return $(x_i, y_i)_{i=0..d-1}$
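To make the sharing and the reconstruction formula concrete, the following is a minimal C sketch of Shamir's scheme over GF($2^8$). It is not taken from [GM11] or from our implementation. It assumes a polynomial of degree $d$ with $d+1$ shares, matching the reconstruction formula, and the function names gf_mul, gf_inv, shamir_share, and shamir_reconstruct are hypothetical. The field arithmetic is written for readability and is not hardened against side channels. Note that in a field of characteristic 2 we have $-x_j = x_j$ and subtraction equals XOR.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply two GF(2^8) elements modulo the AES polynomial x^8+x^4+x^3+x+1.
 * Shift-and-add version for illustration only (branches on data). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            r ^= a;
        uint8_t hi = a & 0x80;
        a = (uint8_t)(a << 1);
        if (hi)
            a ^= 0x1B;
        b >>= 1;
    }
    return r;
}

/* Inversion by exponentiation: a^(-1) = a^254 for a != 0 (yields 0 for a == 0). */
static uint8_t gf_inv(uint8_t a)
{
    uint8_t r = 1;
    for (int i = 0; i < 254; i++)
        r = gf_mul(r, a);
    return r;
}

/* Evaluate p(x) = coef[d] x^d + ... + coef[1] x + coef[0] at x[0..d] using
 * Horner's rule. coef[0] is the secret, coef[1..d] are assumed to be random. */
static void shamir_share(const uint8_t *coef, size_t d,
                         const uint8_t *x, uint8_t *y)
{
    for (size_t i = 0; i <= d; i++) {
        uint8_t acc = 0;
        for (size_t k = d + 1; k-- > 0; )
            acc = (uint8_t)(gf_mul(acc, x[i]) ^ coef[k]);
        y[i] = acc;
    }
}

/* a0 = sum_i y_i * beta_i with beta_i = prod_{j != i} x_j / (x_i ^ x_j).
 * The x[i] must be distinct and nonzero. */
static uint8_t shamir_reconstruct(const uint8_t *x, const uint8_t *y, size_t d)
{
    uint8_t a0 = 0;
    for (size_t i = 0; i <= d; i++) {
        uint8_t beta = 1;
        for (size_t j = 0; j <= d; j++) {
            if (j == i)
                continue;
            beta = gf_mul(beta, gf_mul(x[j], gf_inv((uint8_t)(x[i] ^ x[j]))));
        }
        a0 ^= gf_mul(y[i], beta);
    }
    return a0;
}
```

Sharing a secret with shamir_share and passing the resulting tuples to shamir_reconstruct returns the original coef[0].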

Algorithm 6.2.12: Shared Multiplication according to Goubin and Martinelli
Data: shared representation of $b$, $(x_i, y_i)_{i=0..d}$, and $u$, $(x_i, w_i)_{i=0..d}$
Result: shares $(x_i, y'_i)_{i=0..d}$ representing the product of $b$ and $u$

1 begin
2   for $j = 0$ to $d$ do
3     for $k = 0$ to $d$ do
4       $z_{j,k} \leftarrow y_j \cdot w_k$
5   for $i = 0$ to $d$ do
6     $(x_i, y'_i) \leftarrow \left(x_i, \left(\sum_{j=1}^{d} \sum_{0 \leq k < j} (z_{j,k} \oplus z_{k,j}) \cdot \beta_{j,k}(x_i)\right) + \sum_{j=0}^{d} z_{j,j} \cdot \beta_{j,i}(x_i)\right)$
7   return $(x_i, y'_i)_{i=0..d}$

In the case of AES the sensitive values in need of protection are elements of GF($2^8$), in other words polynomials of the form $p(x) = \sum_{i=0}^{7} a_i x^i$ where $a_i \in$ GF($2$). In the context of GM11 the secret $a_0$ and the random values $(x_i)_{i=0..d-1}$ are thus elements of GF($2^8$).

In order to invert a value in GF($2^8$) Goubin and Martinelli chose the same addition chain as in [RP10a] for the exponentiation to the 254. The chain requires squarings and multiplications. Algorithm 6.2.12 shows the secure multiplication proposed for the GM11 scheme. Squarings are computed simply as $(x'_i, y'_i) = (x_i^2, y_i^2)$ which introduces a problem because $x'_i \neq x_i$. Goubin and Martinelli delegate the responsibility for correction of this problem to their mask refreshing procedure. They leave the security proof of their scheme open.

6.2.6 Overall Masking of the AES Encryption

In the previous sections, we presented three schemes for masked S-box computation. However, we have not provided any details on how to mask the rest of the cipher. We give those details in the following. Additionally we refer to [RP10a, Section 3.2] where the authors explain the overall masking in great depth.

Key Expansion

The key expansion must be masked just like the atomic encryption operations. To achieve 1st-order security the key schedule is first split into two shares just like the internal state of the cipher. The SubWord invocation from the unprotected key expansion (see Alg. 2.2.2) is exchanged with a loop that invokes SecSbox four times. This differs from the unprotected key expansion because SecSbox takes the shared representation of one GF($2^8$) element, computes the S-box transformation, and returns a shared representation of the result. RotWord is not affected by the sharing. Rcon values are only added to one share of each round key. For the sake of completeness we give the algorithmic notation of 1st-order secure key expansion as Algorithm 6.2.13.

Algorithm 6.2.13: 1st-order Secure Key Expansion
Data: key shares $k_0, k_1$ satisfying $k_0 \oplus k_1 = k$
Result: shares $w_0, w_1$ satisfying $w_0 \oplus w_1 = w$

1 begin
2   for $j = 1$ to $4$ do
3     $(w_i)_{*,j} \leftarrow (k_i)_{*,j}$
4   for $j = 5$ to $44$ do
5     for $i = 0$ to $1$ do
6       $t_i \leftarrow (w_i)_{*,j-1}$
7     if $j \bmod 4 = 1$ then
8       for $l = 1$ to $4$ do
9         $((t_0)_l, (t_1)_l) \leftarrow$ SecSbox$((t_0)_l, (t_1)_l)$
10      for $i = 0$ to $1$ do
11        $t_i \leftarrow$ RotWord$(t_i)$
12      $t_0 \leftarrow t_0 \oplus Rcon_{(j-1)/4}$
13    for $i = 0$ to $1$ do
14      $(w_i)_{*,j} \leftarrow (w_i)_{*,j-4} \oplus t_i$
15  return $(w_0, w_1)$

Masking the State

The first step during encryption is to share the state. We require two shares of the state. We create those shares by executing Alg. 6.2.14.

Algorithm 6.2.14: Sharing the Initial State before Encryption
Data: initial state $s$
Result: state shares $s_0, s_1$ satisfying $s_0 \oplus s_1 = s$

1 begin
2   for $i = 0$ to $15$ do
3     $s_0[i] \leftarrow \mathrm{rand}(8)$
4     $s_1[i] \leftarrow s_0[i] \oplus s[i]$
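The sharing step translates almost directly into C. The following is a minimal sketch rather than our actual implementation. The function names share_state and unshare_state are hypothetical, and the 16 random bytes are assumed to be supplied by the caller instead of being produced by a call to rand(8).

```c
#include <stddef.h>
#include <stdint.h>

/* Split the state s into two shares with s0 ^ s1 == s.
 * rnd must hold 16 fresh, uniformly distributed random bytes (assumption:
 * the caller provides them, e.g. from a pre-filled array). */
static void share_state(const uint8_t s[16], const uint8_t rnd[16],
                        uint8_t s0[16], uint8_t s1[16])
{
    for (size_t i = 0; i < 16; i++) {
        s0[i] = rnd[i];
        s1[i] = (uint8_t)(s0[i] ^ s[i]);
    }
}

/* Recombine both shares into the unshared state. */
static void unshare_state(const uint8_t s0[16], const uint8_t s1[16],
                          uint8_t s[16])
{
    for (size_t i = 0; i < 16; i++)
        s[i] = (uint8_t)(s0[i] ^ s1[i]);
}
```

Sharing a state and immediately recombining it yields the original 16 bytes, independent of the random values used.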

AddRoundKey

The addition of round keys is automatically masked because the round keys have previously been shared as well as the state. The shares of each round key are simply added to the respective share of the state. We reproduce the completeness proof from [RP10a, Section 3.2]: Let $k^r$ denote one round key, $k^r_0, k^r_1$ the shares of that round key, and $s_0, s_1$ the shares of the state $s$. The round key addition is complete because

$$(s_0 \oplus k^r_0) \oplus (s_1 \oplus k^r_1) = (s_0 \oplus s_1) \oplus (k^r_0 \oplus k^r_1) = s \oplus k^r.$$
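As an illustration of the identity above, a shared AddRoundKey can be sketched in C as follows. The function name add_round_key_shared and the byte-wise layout are assumptions made for this example. An implementation targeting a 32-bit CPU would more likely operate on words.

```c
#include <stddef.h>
#include <stdint.h>

/* Add each round key share to the corresponding state share.
 * Afterwards (s0 ^ s1) equals the unshared state XOR the unshared round key. */
static void add_round_key_shared(uint8_t s0[16], uint8_t s1[16],
                                 const uint8_t k0[16], const uint8_t k1[16])
{
    for (size_t i = 0; i < 16; i++) {
        s0[i] ^= k0[i];
        s1[i] ^= k1[i];
    }
}
```

No share of the state is ever combined with a share of the other index, so neither the unmasked state nor the unmasked round key appears as an intermediate value.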

SubBytes

We have learned about techniques to perform shared S-box computations. Each computation results in two output shares, one for each share of the state. In order to substitute all entries in the state we apply the shared S-box computation to all 16 tuples of shares $((x_0)_{i,j}, (x_1)_{i,j})$.

ShiftRows

The ShiftRows operation is irrelevant with regard to masking because it only reorders bytes in both shares of the state. Because both shares of the state are reordered in the same way, it holds that $s = s_0 \oplus s_1$ before and after ShiftRows.

MixColumns

The MixColumns operation is linear with respect to XOR so that we can independently apply it to all columns of both shares of the state. The overall result stays intact upon recombination of the shares.
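The linearity argument can be checked with a small C sketch. It is an illustration, not our TriCore code. The names xtime, mix_column, and mix_columns_shared are hypothetical, the branch-free multiplication by {02} is one possible way to avoid data-dependent branches, and the state is assumed to be stored column by column as in FIPS-197.

```c
#include <stdint.h>

/* Multiplication by {02} in GF(2^8) without a data-dependent branch:
 * the mask is 0x1B if the top bit of a is set and 0 otherwise. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ (0x1B & (uint8_t)-(a >> 7)));
}

/* MixColumns for one column; {03}*a is computed as xtime(a) ^ a. */
static void mix_column(uint8_t c[4])
{
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    c[0] = (uint8_t)(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3);
    c[1] = (uint8_t)(a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3);
    c[2] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3));
    c[3] = (uint8_t)((xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3));
}

/* Apply MixColumns independently to every column of both shares. */
static void mix_columns_shared(uint8_t s0[16], uint8_t s1[16])
{
    for (int col = 0; col < 4; col++) {
        mix_column(&s0[4 * col]);
        mix_column(&s1[4 * col]);
    }
}
```

Because every output byte is an XOR of input bytes and of their doublings, and doubling itself is linear, recombining the two processed shares gives the same result as processing the recombined state.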

Recovering the Final Ciphertext

At the end of the encryption we still have two shares $s_0, s_1$ of the state but we need to reconstruct the actual ciphertext. To do so we compute $s = s_0 \oplus s_1$ which yields the final non-shared state and thereby the ciphertext.


6.3 Complexity and Resource Comparison

We have given the algorithms for AES, tailored to the fixed parameters for AES-128, in Chap. 2. We have also presented algorithms regarding software countermeasures, tailored to $d = 1$ where applicable, in Sect. 6.2. Now we make RP10/CPRR13 and KHL11/SKHL13 comparable by estimating the number of clock cycles required for a single encryption run on the TC1797. We also calculate how much memory, both ROM and RAM, is required for each variant and how much random data we need for the masked implementations. We exclude GM11 here for reasons we explain at the end of this chapter.

We give clock cycle counts per invocation of the respective function, for example SubBytes. For memory transfers we abbreviate a load operation as Ld and a store operation as St. We denote one clock cycle as 1 cc and assume that Ld and St each require 1 cc. We present a special feature of the TC1797 memory management at the end of this chapter. We use bytes as the unit for memory consumption and denote values below 1000 bytes as in 64 B. We denote 1024 bytes as 1 kB.

Please note that the estimates we put up cannot be mapped exactly to the real execution time of the final binary assembled for the TriCore. This is due to the fact that we do not estimate address arithmetic instructions and that we neglect the cost of function calls. We concentrate primarily on the number of clock cycles required to perform the actual computations.

Before we can estimate complexity in terms of clock cycles, we need to know how long TC1797 machine instructions take to complete. Infineon states in [ITA04] that

“TriCore has been designed with high performance in mind and therefore, with very few exceptions, ALL instructions take only one clock cycle to complete (the clock referred to here is the pipeline clock). The main exceptions which take more than one cycle are the multiplication and branch instructions, although there are also a small number of other seldom-used instructions which take more than one cycle.”

Based on this statement, we can convert the number of operations to the number of clock cycles at a rate of nearly 1:1. MUL instructions are the only exception because they have a result latency⁷ of 2 clock cycles.

6.3.1 Plain AES according to FIPS-197

We begin with a straightforward AES implementation according to FIPS-197. We assume that lookup tables are used for the S-box and that round keys and the state are organized in columns as described in the specification. Furthermore we assume that the key expansion is computed only once, in other words, full key unrolling is used.

⁷ From the manual: “The number of clock cycles from the cycle when the instruction is issued to the cycle when the result value is available to be used as an operand to a subsequent instruction or written into a GPR.”


KeyExpansion

We decided to use full key unrolling so that the key expansion happens exactly once after the TC1797 has booted. This way, the complexity of one encryption round is not affected by key computation. Therefore we neglect the complexity of the key expansion at this point.

SubBytes

The SubBytes operation simply iterates over the bytes in the state array and replaces them with values looked up in the S-box table. We estimate 2 Ld, 1 XOR and 1 St per byte which gives 64 cc for one round.

ShiftRows

For ShiftRows we can ignore the first row of the state because it is not modified. We estimate 12 Ld and 12 St for the reordering of the remaining bytes. This gives 24 cc per round.

MixColumns

The original MixColumns operation from the specification internally requires multiplication with {02} and {03} where the latter can be expressed as multiplication by {02} followed by the addition of the input value. We show in Sect. 7.2.1 how we ensured constant execution timing for the multiplication by {02}. We estimate that the multiplication requires 5 cc. To process one byte in a column we need 4 XOR and two multiplications by {02}, summing up to 14 cc. This gives 56 cc for one column. We need 1 Ld to load the column and 1 St to store it which gives 4 · (56 + 2) = 232 cc for the whole state.

AddRoundKey

We assume that the AddRoundKey operation requires 8 Ld, 4 XOR and 4 St operations which sums up to 16 cc per round.

Overall Complexity and Memory

Taking into account the initial AddRoundKey and the missing MixColumns operation in the final round we get a total of 3144 cc for one encryption. In terms of ROM we need 256 bytes each for the S-box and inverse S-box lookup tables and 10 bytes for Rcon values, summing up to 522 bytes of ROM. At runtime we need 176 bytes of RAM for the key schedule and 16 bytes for the state, summing up to 192 bytes of RAM.
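The total can be re-derived from the per-operation estimates. The following C sketch merely restates the arithmetic of this section. The constant names and the helper encryption_cycles are hypothetical and not part of our implementation.

```c
/* Per-invocation estimates in clock cycles, taken from the text above. */
enum {
    CC_SUB_BYTES     = 16 * (2 + 1 + 1),           /* 2 Ld, 1 XOR, 1 St per byte: 64 */
    CC_SHIFT_ROWS    = 12 + 12,                    /* 12 Ld, 12 St: 24 */
    CC_XTIME         = 5,                          /* multiplication by {02} */
    CC_MIX_BYTE      = 4 + 2 * CC_XTIME,           /* 4 XOR, 2 multiplications: 14 */
    CC_MIX_COLUMNS   = 4 * (4 * CC_MIX_BYTE + 2),  /* 4 columns, 1 Ld + 1 St each: 232 */
    CC_ADD_ROUND_KEY = 8 + 4 + 4                   /* 8 Ld, 4 XOR, 4 St: 16 */
};

/* AES-128: 11 AddRoundKey, 10 SubBytes, 10 ShiftRows, 9 MixColumns. */
static unsigned encryption_cycles(unsigned add_round_key, unsigned sub_bytes,
                                  unsigned shift_rows, unsigned mix_columns)
{
    return 11u * add_round_key + 10u * sub_bytes
         + 10u * shift_rows + 9u * mix_columns;
}
```

Evaluating encryption_cycles with the four constants gives 176 + 640 + 240 + 2088 = 3144 cc.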

6.3.2 32-bit Optimized AES

Compared to the naive approach following the specification, the MixColumns computation can be heavily optimized for the 32-bit TriCore architecture. We show in Sect. 7.2.2, especially by giving Alg. 7.2.1, how the optimized approach works. We estimate 19 cc for the optimized MixColumns. Effectively, the estimated overall number of clock cycles for one full encryption run goes down to 1227 cc.

6.3.3 CPRR13 - Rivain-Prouff without Mask Refreshing

Before we begin the complexity analysis of the masking schemes, we have to realize that the secret sharing approach requires the duplication of the round key and the state arrays. We keep in mind that ShiftRows, MixColumns, and AddRoundKey now have to be executed on each share of the state. In our case this means that those operations need to run twice because we chose to target 1st-order security, in other words we split the round keys and the state into two shares. The SubBytes operation works on both shares of the state so that it does not need to run twice per round.

We continue to neglect the complexity of the key schedule because it still does not affect the performance of encryption. For all operations other than the S-box computation we reuse the numbers we estimated for the 32-bit optimized implementation and simply multiply them by two. In the following we analyze the operations comprising the S-box computation.

SecProc8

The secure processing with lookup tables first requires the generation of two random bytes. We assume that we are provided with externally generated random data so that we need only 1 cc in order to load each random byte. Subsequently we need 4 XOR, 4 Table Lookup (TLU), and 4 more XOR operations. The assignment of $t$ to $r_1$ can be neglected because the value of $t$ can directly be used at the end of the algorithm without moving it to a new register or storing it in memory first. The algorithm is concluded by 2 TLU and 2 XOR operations. For the input arguments we need 2 Ld and accordingly 2 St are required for the output values. Taking the initialization of $t$ into account, we estimate a total of 23 cc for the SecProc8 computation.

SecMult

The secure multiplication first requires the generation (or retrieval in our case) of one random byte. Subsequently 4 XOR and 4 field multiplications must be computed. We assume that field multiplications are computed using log/alog tables because this is the most efficient way of doing so, albeit at the expense of two more lookup tables.

The multiplication of two field elements using log/alog tables requires 2 Ld for the arguments, 2 Ld for the log table lookup, 1 ADD, 1 MOD, and 1 Ld for the alog table lookup. We estimate 20 cc for the modular reduction because the compiler creates a series of 64-bit division instructions in this case. We need another 5 cc for the correct handling of arguments equal to zero. All in all we estimate 31 cc for one field multiplication. Summing up, we get to an estimated 129 cc for the SecMult operation.
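The operations counted above can be sketched in C as follows. This is an illustration under assumptions, not our production code. The tables are generated at runtime with the generator {03} instead of being stored in ROM, the zero handling uses a mask, and the names init_log_tables, gf_mul, and sec_mult are hypothetical. The function sec_mult follows the 1st-order structure of one random byte, 4 field multiplications, and 4 XOR.

```c
#include <stdint.h>

static uint8_t log_tab[256];
static uint8_t alog_tab[256];

/* Fill the tables for the generator {03}: alog_tab[i] = {03}^i and
 * log_tab[{03}^i] = i. log_tab[0] stays 0 and is never relied upon. */
static void init_log_tables(void)
{
    uint8_t x = 1;
    for (int i = 0; i < 255; i++) {
        alog_tab[i] = x;
        log_tab[x] = (uint8_t)i;
        /* x = x * {03} = (x * {02}) ^ x */
        x = (uint8_t)(x ^ (x << 1) ^ ((x & 0x80) ? 0x1B : 0));
    }
}

/* a * b = alog[(log a + log b) mod 255], forced to 0 if a or b is 0. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = alog_tab[(log_tab[a] + log_tab[b]) % 255];
    uint8_t nonzero = (uint8_t)-(uint8_t)((a != 0) & (b != 0)); /* 0xFF or 0x00 */
    return (uint8_t)(p & nonzero);
}

/* 1st-order secure multiplication of shared a and b with one random byte r.
 * The parentheses in c[1] mirror the intended order of the XOR operations;
 * an optimizing compiler is free to reorder them. */
static void sec_mult(const uint8_t a[2], const uint8_t b[2], uint8_t r,
                     uint8_t c[2])
{
    c[0] = (uint8_t)(gf_mul(a[0], b[0]) ^ r);
    c[1] = (uint8_t)(gf_mul(a[1], b[1])
                     ^ ((r ^ gf_mul(a[0], b[1])) ^ gf_mul(a[1], b[0])));
}
```

The % operator in gf_mul corresponds to the modular reduction for which we estimate 20 cc. Recombining c[0] and c[1] gives the product of the unshared operands because the four partial products sum to $(a_0 \oplus a_1) \cdot (b_0 \oplus b_1)$ and the random byte cancels out.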


SecExp254

The secure exponentiation to the 254 consists of two SecProc8 invocations, two SecMult invocations, and six TLU if we use the full set of lookup tables. We estimate a total of 310 cc for SecExp254. We require 6 random bytes per exponentiation.

Affine Transformation

In order to complete the S-box computation we have to apply the affine transformation. We can compute the affine transformation for one byte within roughly 40 cc.⁸ On a 32-bit CPU we can compute four affine transformations in parallel which virtually reduces the cost for one byte to 10 cc or, in other words, we can process a full word within the estimated 40 cc. The addition of the constant 0x63 can also be performed word-wise. The constant then becomes 0x63636363. Including load and store operations this constant addition then takes 4 cc.
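One possible way to process four bytes at once is sketched below in C. It is an assumption made for illustration and not necessarily the method of Sect. 7.2.3 or [Hoh09]. The names rotl_bytes, affine_linear_word, and affine_shared_word are hypothetical. The sketch uses the well-known formulation of the linear part as the XOR of a byte with its rotations by 1 to 4 bits and adds the constant 0x63636363 to one share only.

```c
#include <stdint.h>

/* Rotate each of the four bytes in w left by n bits (0 < n < 8). */
static uint32_t rotl_bytes(uint32_t w, unsigned n)
{
    uint32_t hi = 0x01010101u * ((0xFFu << n) & 0xFFu); /* bits kept after << n */
    uint32_t lo = 0x01010101u * (0xFFu >> (8 - n));     /* bits wrapped around  */
    return ((w << n) & hi) | ((w >> (8 - n)) & lo);
}

/* Linear part of the AES affine transformation on four bytes in parallel. */
static uint32_t affine_linear_word(uint32_t w)
{
    return w ^ rotl_bytes(w, 1) ^ rotl_bytes(w, 2)
             ^ rotl_bytes(w, 3) ^ rotl_bytes(w, 4);
}

/* Shared version: the linear part is applied to both shares, the constant
 * 0x63 (replicated to 0x63636363) is added to one share only. */
static void affine_shared_word(uint32_t *w0, uint32_t *w1)
{
    *w0 = affine_linear_word(*w0) ^ 0x63636363u;
    *w1 = affine_linear_word(*w1);
}
```

For the byte 0x01, whose field inverse is 0x01, the linear part yields 0x1F and the addition of 0x63 gives 0x7C, which is the S-box output for 0x01.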

SubBytes

In order to perform the full S-box substitution on both shares of the state, we need to run the secure exponentiation 16 times which takes 4960 cc and requires 96 random bytes. We subsequently have to apply the affine transformation, which operates on full words, to both shares of the state individually. This requires 320 cc. The computation ends with adding the constant we previously described to one share of the state, which takes 16 cc. All in all we get to an estimate of 5296 cc for the SubBytes operation.


For one full encryption we have to invoke AddRoundKey eleven times (11 · 32 cc), SubBytes ten times (10 · 5296 cc), ShiftRows ten times (10 · 48 cc), and MixColumns nine times (9 · 38 cc). We get to an effective overall estimate of 54134 cc for one encryption.

With regard to randomness, we need 16 bytes to share the initial state plus 960 random bytes for S-box computations. This sums up to 976 random bytes per encryption.

In terms of memory, we need 7 · 256 bytes for the tables (T2, T3, T4, T5, T16, log, alog) plus 10 bytes for Rcon values, summing up to 1802 bytes of ROM. If we trade two lookup tables for very few clock cycles the ROM consumption goes down to 1290 bytes. At runtime we need two shares of the key schedule and the state which requires 2 · (44 · 4 + 16) = 384 bytes of RAM.

6.3.4 KHL11 - Kim-Hong-Lim

For the KHL11 scheme we need to establish estimates for SubBytes while the other AES primitives still have the same complexity as before. To perform the estimation we first look at SecMult4 and SecInv.

⁸ See Sect. 7.2.3 and [Hoh09, p. 14] for implementation details.


SecMult4

The first step for secure multiplication of GF($2^4$) elements is to generate (in our case, retrieve) one 4-bit random value. Subsequently 4 XOR and 4 two-dimensional table lookups are performed. We estimate 2 cc per such table lookup. For the arguments and results we have 4 Ld and 2 St. All in all we estimate 19 cc for the secure multiplication. We need 4 random bits for one invocation of SecMult4.

SecInv

The secure inversion over GF($2^4$) requires 2 Ld, 4 table lookups, one invocation of RefreshMasks, two invocations of SecMult4, and 2 St. RefreshMasks requires 2 Ld and 2 St as well as the generation of one 4-bit random value followed by 2 XOR operations. We thus estimate 7 cc for the mask refreshing procedure. In total we estimate 53 cc for the secure inversion. We need 12 random bits per invocation of SecInv.

S-box computation

We now come to the analysis of the full S-box computation according to Alg. 6.2.9. We begin with 2 table lookups and 1 XOR for each input share. Additionally the result of the first table lookup must be split into two 4-bit values. We estimate that this splitting takes 2 clock cycles. Thus we have 10 cc for the initial step (lines 2 to 5). Next we have an invocation of SecMult4 which increases our estimate to 29 cc. Two XORs follow, then one SecInv invocation, and finally SecMult4 is invoked twice. We get a new intermediate estimate of 122 cc. The final operations consist of two 2-dimensional table lookups and one XOR. All in all we estimate 127 cc for one computation of the S-box. We need 24 random bits for each invocation.

SubBytes

In order to execute SubBytes we have to run the shared S-box computation 16 times. This gives an estimate of 2032 cc for one invocation of SubBytes. We require 384 random bits each time we execute SubBytes.


For one full encryption we have to invoke AddRoundKey eleven times (11 · 32 cc), SubBytes ten times (10 · 2032 cc), ShiftRows ten times (10 · 48 cc), and MixColumns nine times (9 · 38 cc). We get to an effective overall estimate of 21494 cc for one encryption.

With regard to random data we need 16 bytes to share the initial state and 3840 bits (480 bytes) for the secure S-box computations. This sums up to 496 bytes of randomness per encryption.

In terms of memory, we need 826 bytes of ROM for the tables: 16 bytes each for $T1$ through $T3$, 256 bytes each for $T4$ through $T6$, and 10 bytes for Rcon values. At runtime we need two shares of the key schedule and the state which requires 2 · (44 · 4 + 16) = 384 bytes of RAM.


6.3.5 SKHL13 - Kim-Hong-Lim without Mask Refreshing

For the SKHL13 scheme we need to analyze the new SecProc4 and the fixed SecInv computation. The complexity of the other operations remains unchanged.

SecProc4

One invocation of SecProc4 is exactly as complex as one invocation of SecProc8 because both operations differ only in the number of random bits they retrieve. In Sect. 6.3.3, we estimated 23 cc for secure share processing. We reuse this estimate here.

SecInv

The secure inversion over GF($2^4$) now requires 2 Ld, one invocation of SecProc4, 4 table lookups, one invocation of SecMult4, and 2 St. We thus estimate 50 cc for the secure inversion. We need 12 random bits per invocation of SecInv.

SubBytes

The S-box computation now requires 124 cc instead of the previous 127 cc. We have to run the shared S-box computation 16 times in order to execute the full SubBytes operation. Due to the slightly reduced complexity of SecInv this gives a new overall estimate of 1984 cc.


Compared to the original estimate of 21494 cc we come down to an effective overall estimate of 21014 cc for one encryption. With regard to memory, we need 16 more bytes of Read-Only Memory (ROM) for the additional lookup table. The number of random bits stays the same compared to KHL11.
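The overall estimates for the three masked variants follow from the per-operation numbers in Sect. 6.3.3 to 6.3.5. The C sketch below only restates that arithmetic. All constant and function names are hypothetical.

```c
/* Per-operation estimates in clock cycles from Sect. 6.3.3 to 6.3.5. */
enum {
    /* Linear layers of the 32-bit implementation, executed once per share. */
    CC_ARK = 2 * 16, CC_SR = 2 * 24, CC_MC = 2 * 19,

    /* CPRR13 */
    CC_SECPROC   = 23,
    CC_GF_MUL    = 31,
    CC_SECMULT   = 1 + 4 + 4 * CC_GF_MUL,                   /* 129 */
    CC_SECEXP254 = 2 * CC_SECPROC + 2 * CC_SECMULT + 6,     /* 310 */
    CC_SB_CPRR13 = 16 * CC_SECEXP254 + 2 * 4 * 40 + 4 * 4,  /* 5296 */

    /* KHL11 and SKHL13 */
    CC_SECMULT4      = 19,
    CC_REFRESH       = 7,
    CC_SECINV_KHL11  = 2 + 4 + CC_REFRESH + 2 * CC_SECMULT4 + 2, /* 53 */
    CC_SECINV_SKHL13 = 2 + CC_SECPROC + 4 + CC_SECMULT4 + 2      /* 50 */
};

/* One shared S-box: initial step, SecMult4, 2 XOR, SecInv, 2 SecMult4,
 * final two table lookups and one XOR. */
static unsigned khl_sbox_cycles(unsigned secinv)
{
    return 10u + CC_SECMULT4 + 2u + secinv + 2u * CC_SECMULT4 + 5u;
}

/* 11 AddRoundKey, 10 SubBytes, 10 ShiftRows, 9 MixColumns. */
static unsigned masked_encryption_cycles(unsigned sub_bytes)
{
    return 11u * CC_ARK + 10u * sub_bytes + 10u * CC_SR + 9u * CC_MC;
}
```

With these definitions, masked_encryption_cycles yields 54134 cc for CPRR13, 21494 cc for KHL11 (16 S-boxes of 127 cc each), and 21014 cc for SKHL13 (16 S-boxes of 124 cc each).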

6.3.6 Comparing the Estimates

We have estimated the complexity in clock cycles and the memory requirements for each of our AES variants. Additionally we have calculated the amount of random numbers required for the sharing of the key and the state as well as for the S-box computation. As a summary and for direct comparison we give the results again in Table 6.1. The table does not contain the 16 random bytes required by the masking schemes to share the key because we neglected the key expansion throughout the estimation process.

Table 6.1: Comparison of encryption complexity in clock cycles, number of required random bytes, and memory requirements for the discussed variants of AES

Variant       Clock Cycles   Random Bytes   ROM / bytes   RAM / bytes
8-bit AES             3144              -           522           192
32-bit AES            1227              -           522           192
CPRR13               54134            976          1802           384
KHL11                21494            496           826           384
SKHL13               21014            496           842           384

6.4 Selecting Candidates for Implementation

Before we finally decided which countermeasure to implement, we ruled out the GM11 scheme. If we believe the inventors, this scheme offers the best security among the candidates with regard to the number of traces required for an attack [GM11, p. 91]. At the same time the scheme is exorbitantly slow in comparison to RP10/CPRR13 [GM11, p. 86], not to mention KHL11/SKHL13. In addition the authors leave the formal security proof for the multiplication algorithm open. We found no paper providing this security proof as of this day. On the contrary, the authors of [CPR12] disprove its security. They also propose a fix that makes the scheme faster and secure against $d$th-order attacks, but their fix uses different computation methods than the original scheme. All in all we felt that it would be best to remove GM11 from the list of implementation candidates although we liked the approach Goubin and Martinelli took.

It remained to decide which of RP10/CPRR13 and KHL11/SKHL13 we should implement. At the time we made the initial implementation decision we were under the impression that the fix proposed in [CPRR13] was not applicable to KHL11 due to a misunderstanding we uncovered only much later. Looking at runtime performance and randomness requirements, KHL11 was and still is the clear favorite. Based on our estimation it would be roughly 17.5 times slower than the unprotected 32-bit optimized implementation and it requires only about half as much random data as the CPRR13 scheme. However we knew that [CPRR13] points out the mask refreshing flaw. This flaw is the reason we dropped the RP10 scheme and replaced it with CPRR13 because the latter dispenses with mask refreshing. The better performance of KHL11 and the better theoretical security of CPRR13 were clearly in conflict with each other, leading us to the initial decision to implement both CPRR13 (for the better theoretical security) and KHL11 (for the better performance).

When we found out about the misconception we mentioned earlier, we decided to keep both implementations but retrofit the mask refreshing fix to KHL11, thereby transforming it into an SKHL13 implementation. As a nice side effect we gain yet a little more performance because the fixed secure inversion is slightly faster than the original one. We had previously estimated 53 cc which then went down to 50 cc per invocation of SecInv. This way the SKHL13 implementation would theoretically be only roughly 17.1 times slower, instead of 17.5, than the unprotected 32-bit implementation. For the SCA attacks we present in Chap. 8 we decided to focus on SKHL13.

7 Implementation

In this chapter we present the steps we took to create AES implementations for the TC1797. First we describe a straightforward AES implementation exactly following the specification which we subsequently optimize so that it uses the 32-bit TriCore architecture as much as possible. Finally we present the decisions we made and the pitfalls we found when we created AES implementations protected with the countermeasures we selected in Chap. 6.

7.1 Random Number Generation

The masking schemes we selected for implementation require random numbers in order to share the cipher key (once) and the initial state (for each encryption) as well as for secure S-box computations. Designing a cryptographically secure Random Number Generator (RNG) is a challenging task in itself. Due to the limited scope of our thesis we decided to keep random number generation out of our efforts as much as possible. Andreas Hoheisel writes about random numbers in [Hoh09]. He mentions the RNG classes defined by Bundesamt für Sicherheit in der Informationstechnik (BSI) in [Sch99]. Hoheisel states that “the random number generator should at least fulfill the requirements of a K2 random number generator for the masked values” and that “the IAIK suggests the Grain-80 stream cipher, which can also be used as random number generator”.

A more recent RNG-related BSI publication [KS11] defines the class DRG.1 which supersedes and extends the class K2. Due to the broader scope of requirements defined for DRG.1 we stick to the main original K2 requirements: The generated random bits are distributed uniformly and they pass statistical tests defined in [Sch99]. The next higher class K3 mandates that “It is practically impossible for an adversary to work out or guess the numbers which precede or follow a random number subsequence $r_i, r_{i+1}, \dots, r_{i+j}$ or to work out or guess an internal state”. This is a complex requirement which is outside of our scope, given that this thesis does not focus on random number generation.

Amongst other approaches, random numbers can be generated by ciphers. Grain [HJM06] is a stream cipher that can be used as a Pseudo Random Number Generator (PRNG). It was designed for implementation in hardware and is very inefficient in software. We experimentally integrated Grain into one of our protected implementations and realized that generating the required number of random bits introduced a timing overhead of roughly 4 s per single encryption. This overhead is unacceptable. We thus decided to refrain from running Grain on the TC1797. We further decided to skip Grain on the host computer as well. Instead we used the rand() function provided by the host computer's operating system to generate pseudo-random bytes for the key schedule and for each encryption. We extended the measurement framework presented in [Osw09] in such a way that it sends the random bytes to the TC1797 via the serial interface. In the setup phase we no longer send just the key, but instead we add 16 random bytes to share the initial key on the TriCore. For each challenge we send the exact amount of random data required for the individual encryption run. The code running on the TC1797 contains arrays that hold the random numbers. We can thus simply retrieve bytes from said arrays whenever randomness is required.
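The division of labor between host and target can be sketched as follows. The code is a simplified illustration and not taken from the measurement framework of [Osw09] or from our TC1797 code. The names fill_random, rnd_pool, and rnd_next are hypothetical, and the serial transfer between both sides is omitted.

```c
#include <stddef.h>
#include <stdlib.h>

/* Host side: fill a buffer with pseudo-random bytes from rand().
 * rand() is not a cryptographically secure generator. */
static void fill_random(unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (unsigned char)(rand() & 0xFF);
}

/* Target side: an array of received random bytes and a read position. */
typedef struct {
    const unsigned char *data;
    size_t len;
    size_t pos;
} rnd_pool;

/* Retrieve the next random byte; returns 0 on success and -1 if the pool
 * is exhausted, so that no byte is ever used twice. */
static int rnd_next(rnd_pool *p, unsigned char *out)
{
    if (p->pos >= p->len)
        return -1;
    *out = p->data[p->pos++];
    return 0;
}
```

Keeping retrieval behind a single function of this kind is one way to make the source of randomness exchangeable without touching the cipher code.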

We are aware that the quality of the random data we use is hardly comparable to that of a real RNG. It might be interesting to test the protected implementations with different sources of randomness in the future. We kept the relevant parts of both our AES code and the framework as generic as possible. This makes it easy to feed in random bits from alternative sources. Generally speaking it seems desirable to get the required random bits from a hardware RNG which would have to be added to a TriCore-based system as a supporting component. The TriCore architecture and specifically the TC1797 offer neither hardware- nor software-based random number generation facilities. As a side note we wish to mention that the C standard library that comes with the HighTec TriCore toolchain does offer rand() and related functions. When we tested them on the TC1797 we were impressed by the fact that the random numbers returned from those functions are constant¹. We tried to reverse engineer the computations that are performed inside the distinct functions, but failed at finding a reason for the described behavior.

7.2 Implementing AES

AES has been implemented over and over in the past by students, researchers, and companies. There are many readily available implementations. Some of them are heavily optimized for certain platforms, others have been created to try various languages or to experiment with alternative but equivalent representations of Rijndael, so-called dual ciphers [BB02]. We decided to build our AES implementations completely from scratch because we initially had no experience with the behavior of either the customized TriCore gcc or the TC1797 itself as the target platform. Furthermore we decided to use C as the programming language, relying on gcc to create efficient machine instructions. Using a high-level language and exploiting the power of gcc seemed favorable compared to the alternative approach of writing assembler code using instructions with which we had no working experience and which are specific to the TriCore environment. After some initial experiments with the toolchain, the TC1797, and the measurement setup, we defined that compiler optimization should be turned off for all implementations. We present the results of different optimization settings in Sect. 7.3.

We took some more decisions that relate directly to the implementation of AES. First and foremost we defined that the key length is fixed at 128 bits, i.e., we focus on AES-128. This implies that we need to perform 10 rounds and that we need 11 round keys. We provide reduced pseudo code according to the specification where applicable. Secondly, we decided to implement both encryption and decryption although we only attack the encryption. This decision is based on the fact that complete implementations appear preferable especially in cases where the code might be used in subsequent projects. Thirdly, we decided to implement full key unrolling due to the fact that in our scenario the key remains constant across encryptions. This way we need to compute the key schedule only once after system startup in contrast to dynamic key computation in each round which would be redundant as soon as more than one plaintext is encrypted. Finally, we defined the mode of operation to be ECB which means that each block of plaintext is encrypted independently of previous or following blocks. We consciously made this decision, being aware that there are official recommendations for block cipher modes of operation [NIS01], because we intend to secure the simplest possible implementation of AES. This implies ECB as the mode of operation.

¹ This situation is expressed perfectly at http://xkcd.com/221/.

In order to achieve semantic meaning in our code we defined the types BYTE as unsigned char and WORD as unsigned int. This enabled us to express function parameters and return values in a concise manner, for example WORD RotWord(WORD w) as well as the key as an array of words and the state as an array of bytes.

Regarding the memory layout of the round keys we decided to use an array of 44 WORD values. We chose to index this array by manual pointer arithmetic instead of using multidimensional array syntax. For the internal state of the cipher we use an array of 16 BYTE values that is filled following the top-left to bottom-right matrix notation from the specification. This means that state[0] holds $s_{0,0}$, state[1] holds $s_{1,0}$, state[4] holds $s_{0,1}$, and state[15] holds $s_{3,3}$.
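The type definitions and the resulting index computations can be summarized in a few lines of C. The two typedefs are the ones described above. The helper functions state_get and round_key are hypothetical and only serve to make the layout explicit.

```c
typedef unsigned char BYTE;
typedef unsigned int  WORD;

/* Column-major state layout: s_{row,col} is stored at state[row + 4 * col]. */
static BYTE state_get(const BYTE state[16], unsigned row, unsigned col)
{
    return state[row + 4u * col];
}

/* The key schedule is a flat array of 44 words. Round key r (0..10) consists
 * of the four words starting at offset 4 * r. */
static WORD *round_key(WORD *key_schedule, unsigned round)
{
    return key_schedule + 4u * round;
}
```

With this layout, one column of the state occupies four consecutive bytes, which is what allows a column to be treated as a single 32-bit word.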

The basic decisions we just discussed influence all of the code we wrote. In the following, we describe the steps we took to implement AES and the lessons we learned during the implementation.

7.2.1 Straightforward AES by the Book

With this thesis we intend to show that AES on TriCore can be protected against power analysis attacks. In order to show this and in accordance with Proposition 1.3.1 (page 3) we must first demonstrate that some implementation of AES is actually vulnerable to such attacks when running on the TC1797. Based on literature like [MOP07] and personal discussions we felt that a naive² implementation would likely serve as the most vulnerable candidate. Thus we started by creating straightforward code, sticking closely to the pseudo code given in [FIP01]. In the following we describe how we implemented the distinct operations comprising the AES key schedule and the encryption. As mentioned earlier we also added the decryption function to all our implementations. However we refrain from detailing the respective implementation aspects here since they do not differ significantly from those of the encryption.

Key Schedule

Based on our decision to use full key unrolling we need to implement the key expansion routine before we can encrypt anything. We have Nk = 4 because we fixed the key length at 128 bits. Algorithm 2.2.2 shows the key expansion tailored to this parameter. The key expansion uses two more functions:

² Here, “naive” means unprotected, non-optimized.

SubWord takes a 32-bit word and applies the S-box to each byte on its own. Unfortunately we were not able to exploit the power of packed arithmetic on the TriCore, cf. [ITA02, pp. 33-34], because the instructions for packed data types do not support more complex operations like table lookups. We did nonetheless find that the TriCore architecture supports byte addressing [ITA08a, p. 2-7], which we subsequently used in order to avoid complex single-byte extraction and insertion operations on full words. Speaking in C terms, we simply reinterpreted the address of the word to be substituted as a pointer to BYTE, which we could then use with an index to retrieve and store individual bytes from and to the given word.

RotWord takes a 32-bit word and rotates it so that (𝑎_0 𝑎_1 𝑎_2 𝑎_3) becomes (𝑎_1 𝑎_2 𝑎_3 𝑎_0) in FIPS-197 notation. Because we are working on a little-endian system we denoted the rotation in C as (w << 24) | (w >> 8). The TriCore compiler automatically turns this into the DEXTR instruction [ITA08b, p. 3-108]. This instruction takes one output register D[c] and two input registers D[a], D[b] as well as a bit position as parameters. The two input registers {D[a], D[b]} are treated as one virtual 64-bit register where D[a] provides the most significant bits. The contents of this virtual register are then shifted to the left by the given number of bits. Finally the most significant 32 bits after the shifting operation are put into D[c] as the result.

For a practical example we assume that the input word resides in register d4 and that we want the result to be written into d3. The instruction allows D[a] and D[b] to be the same register, which is the perfect match for our use case. Written with the most significant bits at the left, the virtual register would contain [𝑎_3 𝑎_2 𝑎_1 𝑎_0 𝑎_3 𝑎_2 𝑎_1 𝑎_0]. We want to retrieve the segment [𝑎_0 𝑎_3 𝑎_2 𝑎_1]. Thus we have to extract 32 consecutive bits starting at bit position 8. The DEXTR instruction starts the extraction at (32-pos) so that the last parameter we need equals 24. We get DEXTR d3, d4, d4, 24 as the instruction for our example. This is exactly what the compiler generates for the word rotation.

As a special remark we wish to note that the HighTec toolchain provides a compiler builtin for rotate operations which also leads to DEXTR instructions in the assembly. This intrinsic is called __ROTATEL(a, b) where a is the value to be rotated and b is the number of bits to rotate by. It would thus be equivalent and more readable to write __ROTATEL(w, 24) in C where w is the word we want to rotate. Note, however, that using intrinsics of the TriCore compiler reduces the portability of the code.
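The two helper functions described above might look as follows in portable C. This is a sketch under stated assumptions rather than the thesis code: WORD is assumed to be 32 bits wide, and the table parameter of SubWord is hypothetical (it stands for the 256-entry AES S-box, which is not reproduced here).

```c
typedef unsigned char BYTE;
typedef unsigned int  WORD;  /* assumed to be 32 bits wide */

/* SubWord sketch: reinterpret the address of the word as a pointer to BYTE
   and substitute each byte in place. The table parameter is hypothetical
   and stands for the AES S-box. */
static WORD SubWord(WORD w, const BYTE *table) {
    BYTE *p = (BYTE *)&w;
    int i;
    for (i = 0; i < 4; i++) {
        p[i] = table[p[i]];
    }
    return w;
}

/* RotWord on a little-endian word: (a0 a1 a2 a3) becomes (a1 a2 a3 a0),
   where a0 is stored in the least significant byte. */
static WORD RotWord(WORD w) {
    return (w << 24) | (w >> 8);
}
```

On a compiler without the DEXTR pattern match, the shift-and-OR expression in RotWord is typically still recognized as a rotation, but that depends on the toolchain.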

With the aforementioned functions in place the key expansion was simple to implement because the pseudo code in the specification is almost ready to be used as C code.


    /* Multiply with {02} */
    BYTE o2(BYTE b) {
        BYTE r = b << 1;
        if(b & 0x80) {
            r ^= 0x1B;
        }

        return r;
    }

Listing 7.1: Naive multiplication with 𝑥 in GF(2^8)

Encryption

The encryption routine requires the four operations AddRoundKey, SubBytes, ShiftRows, and MixColumns. We described the mathematical aspects of those functions in Sect. 2.2.3. Implementing AddRoundKey³, SubBytes, and ShiftRows is trivial while MixColumns requires some attention. We recall that MixColumns involves multiplication of field elements with {02} = 𝑥 and {03} = 𝑥 + 1. We describe the implementation of multiplication with 𝑥 first and then derive the multiplication with 𝑥 + 1 from it.

Setting 𝑏(𝑥) = 𝑥 and recalling 𝑃(𝑥) = 𝑥^8 + 𝑥^4 + 𝑥^3 + 𝑥 + 1 we have

\[
\begin{aligned}
a(x) \cdot b(x) &= (a_7 x^7 + a_6 x^6 + a_5 x^5 + a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0) \cdot x \\
&= a_7 x^8 + a_6 x^7 + a_5 x^6 + a_4 x^5 + a_3 x^4 + a_2 x^3 + a_1 x^2 + a_0 x
\end{aligned}
\]

from which we see that we need to reduce the result modulo 𝑃(𝑥) if 𝑎_7 = 1, which gives

\[
a(x) \cdot b(x) \equiv a_6 x^7 + a_5 x^6 + a_4 x^5 + (a_7 + a_3) x^4 + (a_7 + a_2) x^3 + a_1 x^2 + (a_7 + a_0) x + a_7 \pmod{P(x)} \tag{7.1}
\]

for all elements of GF(2^8). From (7.1), we see that multiplication by 𝑥 is equivalent to a left shift by one bit in technical terms. We know that for field elements where 𝑎_7 = 1, the multiplication must be followed by a modular reduction. We also see from (7.1) that we need to add a constant bit pattern to the result of the multiplication in order to compute the modular reduction. The terms carrying 𝑎_7 sit at 𝑥^4, 𝑥^3, 𝑥, and 1, which corresponds to the bit pattern (00011011)_2. Based on the binary representation of a field element we can therefore deduce that this pattern equals 0x1B if 𝑎_7 = 1 and 0x00 otherwise.⁴
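As a quick check of (7.1), the two cases can be traced with the example values used in [FIP01, Sect. 4.2.1]. The first input has 𝑎_7 = 0 and needs no reduction, the second has 𝑎_7 = 1 and does:

```latex
\begin{align*}
\{57\} \cdot \{02\} &: \quad (01010111)_2 \ll 1 = (10101110)_2 = \{ae\}
  && (a_7 = 0,\ \text{no reduction}) \\
\{ae\} \cdot \{02\} &: \quad (10101110)_2 \ll 1 = (01011100)_2 = \{5c\},
  \quad \{5c\} \oplus \{1b\} = \{47\}
  && (a_7 = 1)
\end{align*}
```

In the second line the bit shifted out of the byte is exactly the 𝑎_7 𝑥^8 term that the reduction replaces by the pattern {1b}.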

We can now express the multiplication of a field element with 𝑥 in code. The simplest possible implementation is given in Listing 7.1. The input is first shifted left by one bit, followed by the conditional modular reduction. Finally, the resulting value is returned to the caller.

³ In the case of AddRoundKey, we made an exception regarding the naïveté of our implementation. Because we are working with a 32-bit CPU we decided to literally use words from both the round key and the state instead of breaking the round key addition down to single bytes. However, we did stick to the specification to this extent because the round key addition is described there in terms of words.

⁴ See also [FIP01, Sect. 4.2.1].


    /* Multiply with {03} */
    BYTE o3(BYTE b) {
        BYTE r = b ^ (b << 1);
        if(b & 0x80) {
            r ^= 0x1B;
        }

        return r;
    }

Listing 7.2: Naive multiplication with (𝑥 + 1) in GF(2^8)

    BYTE MC0(BYTE *col) {
        BYTE result = 0;

        result = o2(col[0]);
        result ^= o3(col[1]);
        result ^= col[2];
        result ^= col[3];

        return result;
    }

Listing 7.3: MixColumns code excerpt for the first column of the state

Multiplication of a field element with 𝑥 + 1 is very similar because the computation can be split. Setting 𝑏(𝑥) = (𝑥 + 1) we see that

𝑎(𝑥) · 𝑏(𝑥) = 𝑎(𝑥) · (𝑥 + 1) = 𝑎(𝑥) · 𝑥 + 𝑎(𝑥) (7.2)

which enables us to either reuse the function from Listing 7.1 or write a new function that computes the multiplication with 𝑥 + 1. We initially chose the latter approach for testing purposes and because we wanted to stay as close as possible to the specification, which resulted in the code shown in Listing 7.2.

Given the implementation of the multiplication steps, we were then able to complete the overall MixColumns implementation. Listing 7.3 gives an example of how the first byte of a state column is processed for MixColumns. We pass a pointer to BYTE to the function, which the caller has prepared so that it addresses the column to be manipulated. The MC0 function is called once for each column in the state, just like its relatives MC1, MC2, and MC3. In mathematical terms, the function MC0 computes the product of the first row of the constant matrix and the column vector as defined in (2.3) (see page 9).

Finally we used the four functions AddRoundKey, SubBytes, ShiftRows, and MixColumns as building blocks to complete the encryption code according to Alg. 2.2.1 (see page 8).

Timing Issues

When we ran the finished implementation using different plaintexts, we realized that the overall timing was far from constant. We need to ensure constant execution time of the cipher for all possible keys and inputs in order to mitigate potential timing attacks. We debugged the code and measured the runtime of individual functions in order to find the reason for the timing variations. The cause was quickly pinned down to MixColumns, more precisely to the implementation of the multiplication steps we described previously. Looking at Listings 7.1 and 7.2 once more, it became evident that the modular reduction was the source of our problem. Due to the conditional branch, the XOR operation is only executed if the MSB of the input byte b is 1. Only one half of all possible bytes fulfills this condition. Assuming that the inputs to the function are uniformly distributed, we have a probability of 𝑝 = 1/2 that the relevant bit is set. This means that on average the XOR operation will only be executed half of the time. The other half of the time the instruction is skipped, which leads to a shorter execution time. We initially tried to fix the timing by adding an “else” branch as shown in Listing 7.4. Unfortunately this idea turned out to be worthless because the TriCore compiler removes redundant code even if optimization is switched off. The XOR with zero is treated as redundant in this case because it does not change the value of r. Thus we were left with the same situation as before because the instructions in the compiled binary were exactly the same as with the first implementation.

    /* Multiply with {02} */
    BYTE o2(BYTE b) {
        BYTE r = b << 1;
        if(b & 0x80) {
            r ^= 0x1B;
        } else {
            r ^= 0x00;
        }

        return r;
    }

Listing 7.4: Naive attempt at fixing the multiplication with 𝑥 in GF(2^8)

We then decided to trade efficiency for constant runtime by using a multiplication instead of a conditional branch. We give the resulting code in Listing 7.5. The former MSB test is replaced with retrieval of the MSB by means of a shift to the right by seven bits, which results in either 1 or 0. This value is subsequently multiplied with 0x1B and the result is added bitwise to the shifted input. Instead of either two (the AND and the conditional jump) or three (the AND, the conditional jump which is not taken, and the XOR) instructions, we now have a constant number of three (the shift, the multiplication, and the XOR) instructions, which exhibits constant execution time independent of the input value.

    /* Multiply with {02} */
    BYTE o2(BYTE b) {
        return (b << 1) ^ ((b >> 7) * 0x1B);
    }

Listing 7.5: Fixed multiplication with 𝑥 in GF(2^8)

    o2:
        sh      %d15, %d4, -7
        mul     %d15, %d15, 27
        mov     %d2, 0
        insert  %d2, %d2, %d4, 1, 7
        xor     %d2, %d15
        ret

Listing 7.6: Assembly output for the multiplication with 𝑥 in GF(2^8)

While researching the topic of constant execution time we found that Andreas Hoheisel proposed an almost identical implementation of this computation in his Master’s Thesis [Hoh09]. He added a logical AND with 1 to the right-shifted input before the multiplication with 0x1B. This operation is not necessary because we use unsigned data types throughout our code. The compiler then creates machine instructions like LD.BU (Load Byte Unsigned, see [ITA08b, p. 3-180]) that perform zero extension upon moving the data from memory to a register. This eliminates the need to ensure zero bits by means of a logical AND with 1. Hoheisel also claims “that implementing xtime this way leaks timing information, due to the conditional multiplication with {1B}_{16}” [Hoh09, p. 12]. We dispute this because the multiplication is not conditional the way it is implemented. We cannot technically disprove his statement because we do not have access to his source code and because he used a different toolchain to create binaries.⁵ We can however check the assembly generated by the HighTec toolchain. We see from Listing 7.6 that the input resides in register d4. The value is shifted to the right by 7 bits and then stored in d15. Subsequently the value in d15 is multiplied by 0x1B (decimal 27) and written back into the same register. The next instruction clears d2 before the insert instruction copies the least significant 7 bits of the input to bit positions 1 through 7 of d2, which is effectively equivalent to performing a shift to the left by one bit on a byte. Finally the contents of d2 and d15 are XORed. The compiler has used a 16-bit instruction here because it is shorter in binary form. In this case the effect is that the first parameter, d2, is used as the target register. The assembler listing shows that no timing problem is hidden in the code from Listing 7.5: no instruction is skipped, regardless of the input. To that extent we have shown that Hoheisel’s claim does not hold, at least with regard to the HighTec toolchain.
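Functional equivalence of the branching and the branch-free multiplication can be checked exhaustively over all 256 inputs. The following sketch does this in portable C. The function names are assumptions made for this illustration, and the mask-based variant o2_mask is an additional alternative that does not appear in the thesis. Note that such a check only confirms equal results. It says nothing about timing, which still has to be verified on the generated assembly and by measurement.

```c
typedef unsigned char BYTE;

/* Branching variant, as in the naive implementation. */
static BYTE o2_branch(BYTE b) {
    BYTE r = (BYTE)(b << 1);
    if (b & 0x80) {
        r ^= 0x1B;
    }
    return r;
}

/* Branch-free variant: multiply the extracted MSB (0 or 1) by 0x1B. */
static BYTE o2_mul(BYTE b) {
    return (BYTE)((b << 1) ^ ((b >> 7) * 0x1B));
}

/* Alternative branch-free variant (an assumption, not from the thesis):
   -(b >> 7) is all ones if the MSB is set and zero otherwise. */
static BYTE o2_mask(BYTE b) {
    return (BYTE)((b << 1) ^ (0x1B & -(b >> 7)));
}

/* Returns 1 if all three variants agree on every possible input byte. */
static int o2_variants_agree(void) {
    int x;
    for (x = 0; x < 256; x++) {
        BYTE b = (BYTE)x;
        if (o2_branch(b) != o2_mul(b) || o2_branch(b) != o2_mask(b)) {
            return 0;
        }
    }
    return 1;
}
```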

Another approach to fixing the timing problem could have been to exploit a compiler feature called “if conversion”. This feature is enabled by all optimization levels starting with -O1. When it is active the compiler attempts to replace conditional branches with equivalent code that works without jumps. For example, a conditional assignment might be replaced by a SELECT instruction. However, we strongly recommend fixing such code manually where applicable instead of relying too much on the compiler.

It is not sufficient to simply assume that the code runs in constant time after the fix has been applied. Therefore we measured the running time of a full encryption with a constant key for 1000 pseudorandom plaintexts. In order to measure the running time, we set the trigger before the encryption and reset it after the encryption. We recorded the trigger channel and then evaluated the time between the rising and the falling edge of the trigger signal. We performed one measurement using the naive implementation and one measurement using the fixed implementation. Figure 7.1 shows the results. The light gray curve represents the timing we measured when the original multiplication by {02} was used while the black line shows the timing measured for the fixed implementation. For direct comparison we give the numeric measurement results in Table 7.1. Running the CPU at 10 MHz implies a period of 100 ns per clock cycle. The standard deviation of the fixed MixColumns multiplication is much lower than this period, which means that we can call the timing constant. We attribute the minimal non-zero deviation to the measurement process where we take 50 samples per clock cycle.

⁵ According to [Hoh09, p. 49] he used the Tasking VX toolchain, version 3.0r1.

[Figure 7.1: Encryption timings measured with naive and fixed MixColumns implementation. Horizontal axis: encryptions (0 to 1,000); vertical axis: time / ns (3.900 · 10^6 to 3.925 · 10^6); curves: naive, fixed.]

Table 7.1: Comparison of encryption timings with naive and fixed MixColumns multiplication

                          𝑡/ns            𝜎/ns            𝜎/cc
    Naive multiplication  3.912 × 10^6    4.154 × 10^3    41.54
    Fixed multiplication  3.901 × 10^6    4.763           0
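The conversion behind the last column of Table 7.1 can be written out as a short calculation. The symbols T_clk and Δt_sample are introduced here for illustration only; all numbers are taken from the table and the surrounding text.

```latex
\begin{align*}
T_{\mathrm{clk}} &= \frac{1}{10\,\mathrm{MHz}} = 100\,\mathrm{ns} \\
\frac{\sigma_{\mathrm{naive}}}{T_{\mathrm{clk}}} &= \frac{4154\,\mathrm{ns}}{100\,\mathrm{ns}} = 41.54\,\mathrm{cc} \\
\frac{\sigma_{\mathrm{fixed}}}{T_{\mathrm{clk}}} &= \frac{4.763\,\mathrm{ns}}{100\,\mathrm{ns}} \approx 0.05\,\mathrm{cc} \\
\Delta t_{\mathrm{sample}} &= \frac{T_{\mathrm{clk}}}{50} = 2\,\mathrm{ns}
\end{align*}
```

The residual deviation of the fixed implementation thus corresponds to between two and three sampling intervals, which is consistent with attributing it to the measurement process rather than to the code.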


Our naive AES implementation works correctly and exhibits constant execution time but it is not particularly efficient on a 32-bit CPU. MixColumns in particular is slow and not very elegant. In the following section we describe how we optimized our code so that it exploits the 32-bit architecture of the TC1797 as much as possible.

7.2.2 AES Optimized for the TriCore Architecture

Manipulating single bytes on a 32-bit CPU is technically possible. Nevertheless it hinders the performance of any algorithm that allows for the processing of more than one byte at a time. While the AES contest was still running, Brian Gladman attested that “Rijndael [...] allows very good optimisation on 32 bit processors” [Gla99]. Much research has since been conducted in an effort to create optimal 32-bit implementations. We briefly present two approaches and explain our reasoning about their applicability to our implementation.

Transposed State

The first approach we wish to present was proposed in [BBF+02]. The authors transpose the state and adapt the key scheduling accordingly. Thereby they achieve an increase in performance that is due to the parallel computation of multiple atomic operations like field multiplication. Concluding their paper, they compare their revised implementation to the optimized implementation presented by Gladman and come to the conclusion that their proposal “works much better than Gladman’s in decryption” but regarding the encryption, “the performances are instead more or less comparable”. We discarded the transposed state approach for two reasons. Firstly, we focus on encryption, so better performance of the decryption is outside of our scope. Secondly, implementing AES with a transposed state requires a complete redesign of the key schedule and of MixColumns. This would have forced us to throw the naive implementation away and start from scratch once again. We decided to refrain from doing so.

T-Tables

Another idea for enhanced performance on 32-bit CPUs was proposed by Daemen and Rijmen in [DR02]. The authors combine SubBytes, ShiftRows, and MixColumns, which leads them to the definition of the following four lookup tables:

\[
T_0[x] = \begin{bmatrix} 02 \cdot S(x) \\ 01 \cdot S(x) \\ 01 \cdot S(x) \\ 03 \cdot S(x) \end{bmatrix}, \quad
T_1[x] = \begin{bmatrix} 03 \cdot S(x) \\ 02 \cdot S(x) \\ 01 \cdot S(x) \\ 01 \cdot S(x) \end{bmatrix}, \quad
T_2[x] = \begin{bmatrix} 01 \cdot S(x) \\ 03 \cdot S(x) \\ 02 \cdot S(x) \\ 01 \cdot S(x) \end{bmatrix}, \quad
T_3[x] = \begin{bmatrix} 01 \cdot S(x) \\ 01 \cdot S(x) \\ 03 \cdot S(x) \\ 02 \cdot S(x) \end{bmatrix}
\]


where 𝑥 ∈ GF(2^8). Each table requires 1 kB of memory because it contains 256 32-bit words. The name “T-Tables” is directly derived from the table naming. The lookup tables reduce the complexity of a round transformation to four TLU and four XOR operations per column of the state:

\[
\begin{bmatrix} s'_{0,j} \\ s'_{1,j} \\ s'_{2,j} \\ s'_{3,j} \end{bmatrix}
= T_0[s_{0,j}] \oplus T_1[s_{1,(j+1 \bmod 4)}] \oplus T_2[s_{2,(j+2 \bmod 4)}] \oplus T_3[s_{3,(j+3 \bmod 4)}] \oplus rk[4r + j]
\]

for 0 ≤ 𝑗 < 4. Here 𝑠_{𝑖,𝑗} denotes an entry of the state matrix, 𝑟𝑘 denotes the round key array, and 𝑟 denotes the number of the round. The overall memory required for the tables can be reduced to 1 kB because the tables contain rotated entries. This enables developers to make a tradeoff between one additional rotation per column transformation and 3 kB of memory.
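The statement that the tables contain rotated entries can be illustrated with a few lines of C. This is a sketch only: the function names are hypothetical, the S-box output s = S(x) is passed in instead of being looked up, and the packing with the first column entry in the least significant byte is an assumption chosen to match the little-endian state layout used in this chapter.

```c
#include <stdint.h>

/* Multiplication by {02} in GF(2^8), branch-free. */
static uint8_t xtime(uint8_t b) {
    return (uint8_t)((b << 1) ^ ((b >> 7) * 0x1B));
}

/* Rotate a 32-bit word left by n bits, 0 < n < 32. */
static uint32_t rotl32(uint32_t w, unsigned n) {
    return (w << n) | (w >> (32 - n));
}

/* T0 entry for s = S(x): the column (02*s, s, s, 03*s), packed with its
   first entry in the least significant byte (assumed packing). */
static uint32_t t0_entry(uint8_t s) {
    uint8_t s2 = xtime(s);
    uint8_t s3 = (uint8_t)(s2 ^ s);
    return (uint32_t)s2 | ((uint32_t)s << 8) |
           ((uint32_t)s << 16) | ((uint32_t)s3 << 24);
}

/* T1, T2, and T3 are T0 with the column rotated down by 1, 2, and 3 bytes. */
static uint32_t t1_entry(uint8_t s) { return rotl32(t0_entry(s), 8); }
static uint32_t t2_entry(uint8_t s) { return rotl32(t0_entry(s), 16); }
static uint32_t t3_entry(uint8_t s) { return rotl32(t0_entry(s), 24); }
```

Storing only T0 and deriving the other entries by rotation at run time is the memory-saving variant mentioned above; storing all four tables avoids the rotations.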

Given that MixColumns is not executed in the final round, there are two choices: Either the regular S-box table has to be accessed for the final round, or unmodified S-box values must be extracted from the T-Tables by means of logic masking.

We ran the T-Table implementation published by Philip Erdelsky [Erd02] on the TriCore and found that it is indeed much faster than a regular implementation following the specification. However, the T-Table approach is completely incompatible with the SCA countermeasures we selected in Chap. 6. Thus we discarded this approach as well and decided to simply optimize our naive implementation as much as possible for the 32-bit TriCore architecture.

Optimizing AES Primitives for 32-bit

The AES implementation we presented in Sect. 7.2.1 already partially uses 32-bit capabilities but it is by no means fully optimized for a 32-bit CPU. In the following we reiterate the AES primitives and check them for optimization possibilities.

The key expansion already works on full words in the basic implementation. We have seen that SubWord substitutes single bytes in a word by definition and that we cannot use packed arithmetic. Additionally, the naive implementation uses a lookup table for the S-box, which rules out full-word operation as well. We have also found that our implementation of RotWord leads to a DEXTR instruction which rotates one complete word at a time without single-byte operations. Summing up, we do not see any potential for further optimization of the key expansion. The same holds for AddRoundKey, which already operates on full words.

SubBytes and ShiftRows operate on distinct bytes. For each S-box substitution a single byte is required as the table index and the result of the lookup is again a byte. The ShiftRows transformation exchanges bytes in the state across column boundaries. Due to the memory layout we selected for the state array we cannot address a row as a word. Thus SubBytes and ShiftRows do not offer any potential for 32-bit optimization.

The MixColumns implementation we described in Sect. 7.2.1 is very verbose and complex in terms of source code. Throughout many software projects we have made the experience that code verbosity and inefficiency are oftentimes directly correlated. This vague allegation is easy to substantiate by taking another look at the topmost MixColumns function that is called from the encryption routine. We give the function in Listing 7.7.

    void MixColumns(BYTE *state) {
        BYTE tmp[4][4];
        int i;
        BYTE *col;
        for(i = 0; i < 4; i++) {
            col = state + i*4;
            tmp[i][0] = MC0(col);
            tmp[i][1] = MC1(col);
            tmp[i][2] = MC2(col);
            tmp[i][3] = MC3(col);
        }

        memcpy(state, tmp, Nb*Nb);
    }

Listing 7.7: MixColumns function from the naive implementation

Before we optimize MixColumns, we estimate the complexity of the naive MixColumns in terms of clock cycles. This enables us to compare it to the optimized implementation. First of all, we can see that a total of four function calls is made for each column in the state, summing up to 16 calls for each invocation of MixColumns. Additionally we recall that each of the MC{0..3} functions makes two more calls to other functions, namely one to o2 and one to o3. We know from the assembly in Listing 7.6 that the fixed version of o2 executes five machine instructions excluding the ret. Each of these five instructions executes within a single clock cycle so that we can use this number as the estimated execution time.

As a start we clean up the code in order to get rid of the o3 call. This is not yet a 32-bit optimization. We replace each call to o3 with a call to o2 followed by an XOR of the original argument. We recall (7.2) to see that this change keeps the result intact. Next we roughly estimate that each individual MC{0..3} function performs two calls to o2 (10 cc) and four XOR operations (4 cc). For the whole column we need 1 Ld and 1 St. We get a total of 4 · 14 + 2 = 58 clock cycles for the mixing operation on one column. This sums up to 4 · 58 = 232 clock cycles per MixColumns invocation.

Our goal is to find a much more efficient implementation of MixColumns. In his revised and extended AES specification [Gla07] Gladman explains in detail how this optimization can be performed. We reproduce (2.3) as the starting point while recalling that the memory layout we chose for the state allows us to load a full column as one 32-bit word:

\[
\begin{pmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{pmatrix}
=
\begin{pmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{pmatrix}
\cdot
\begin{pmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{pmatrix}
\]

where 0 ≤ 𝑐 < 4 indexes the column being processed. Gladman first reorders this notation and gives four separate equations, each defining the computation for one byte in a column:

𝑠′0 = {02} · 𝑠0 ⊕ {03} · 𝑠1 ⊕ 𝑠2 ⊕ 𝑠3

𝑠′1 = {02} · 𝑠1 ⊕ {03} · 𝑠2 ⊕ 𝑠3 ⊕ 𝑠0

𝑠′2 = {02} · 𝑠2 ⊕ {03} · 𝑠3 ⊕ 𝑠0 ⊕ 𝑠1

𝑠′3 = {02} · 𝑠3 ⊕ {03} · 𝑠0 ⊕ 𝑠1 ⊕ 𝑠2 .

We already saw that multiplication by {03} can be written as multiplication by {02} followed by addition of the original value. Thus the previous equations become:

𝑠′0 = {02} · 𝑠0 ⊕ ({02} · 𝑠1 ⊕ 𝑠1)⊕ 𝑠2 ⊕ 𝑠3

𝑠′1 = {02} · 𝑠1 ⊕ ({02} · 𝑠2 ⊕ 𝑠2)⊕ 𝑠3 ⊕ 𝑠0

𝑠′2 = {02} · 𝑠2 ⊕ ({02} · 𝑠3 ⊕ 𝑠3)⊕ 𝑠0 ⊕ 𝑠1

𝑠′3 = {02} · 𝑠3 ⊕ ({02} · 𝑠0 ⊕ 𝑠0)⊕ 𝑠1 ⊕ 𝑠2

which we can write in an even more readable form as

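\[
\begin{aligned}
s'_0 &= \{02\} \cdot (s_0 \oplus s_1) \oplus s_1 \oplus s_2 \oplus s_3 \\
s'_1 &= \{02\} \cdot (s_1 \oplus s_2) \oplus s_2 \oplus s_3 \oplus s_0 \\
s'_2 &= \{02\} \cdot (s_2 \oplus s_3) \oplus s_3 \oplus s_0 \oplus s_1 \\
s'_3 &= \{02\} \cdot (s_3 \oplus s_0) \oplus s_0 \oplus s_1 \oplus s_2 .
\end{aligned}
\tag{7.3}
\]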

Let 𝑤 denote a word that consists of the bytes 𝑠_0 through 𝑠_3. We see from (7.3) that the multiplication with {02} receives the bitwise sum of the word 𝑤 and the word 𝑤 rotated by one byte as the input. The following bitwise additions require the word 𝑤 rotated by two and by three bytes. Recalling that our RotWord function performs one-byte rotation of a word by means of the DEXTR instruction, we quickly deduce that we can write very similar functions to achieve rotation of a word by two or three bytes. We pick up on Gladman’s function names and get Alg. 7.2.1.

Algorithm 7.2.1: Word-Wise MixColumns
Data: one column of the state as a word 𝑤
Result: the mixed column as a word 𝑤′

    begin
        𝑤′ = FFMulX(𝑤 ⊕ rot3(𝑤))
        𝑤′ = 𝑤′ ⊕ rot3(𝑤)
        𝑤′ = 𝑤′ ⊕ rot2(𝑤)
        𝑤′ = 𝑤′ ⊕ rot1(𝑤)
    end

Now we need the FFMulX function to complete the new implementation of MixColumns. This is the function that computes the multiplication with {02}. We have already seen from (7.3) that FFMulX works on a full word at once, that is, it multiplies four elements of GF(2^8) with {02} in parallel. We recall that we previously implemented the multiplication with {02}. We can keep the same approach here but we need to ensure that no overflow to the neighboring bytes occurs during the shift. We give one of Gladman’s proposals as Alg. 7.2.2 and explain it in the following.

Algorithm 7.2.2: Parallel Multiplication with {02} in GF(2^8)
Data: a word 𝑤
Result: a word 𝑤′ with each byte multiplied by {02}

    begin
        word 𝑡 = 𝑤 & 0x80808080
        𝑤′ = (𝑤 ⊕ 𝑡) << 1
        𝑤′ = 𝑤′ ⊕ ((𝑡 >> 3) | (𝑡 >> 4) | (𝑡 >> 6) | (𝑡 >> 7))
    end

The first step extracts the MSB of each byte in the word and stores the result in 𝑡. We need this value as the condition for modular reduction after the shift. The second step makes sure that all the MSBs are zero before the shift to the left is performed. This is the overflow protection we mentioned earlier. We do not lose any information by clearing those bits because they would vanish under the modular reduction in any case. The final step creates the reduction term 0x1B for all four bytes by shifting and ORing 𝑡. It is easy to check that this computation yields the correct result. If an MSB is zero, this shift-and-OR sequence leads to zero as well. Conversely, if an MSB was previously one, the one is inserted at the bit positions 0, 1, 3, and 4 of the same byte. This leads to the value (00011011)_2 = 0x1B. To finish the computation of FFMulX the resulting word carrying the required reduction terms is added bitwise to the shifted word. This concludes the parallel multiplication of four field elements with {02}.
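Algorithms 7.2.1 and 7.2.2 translate almost line by line into portable C. The following is a minimal sketch and not the thesis code. It assumes that 𝑠_0 is stored in the least significant byte of the word (the little-endian layout used in this chapter) and that rot1, rot2, and rot3 denote left rotations by one, two, and three bytes of that word; the helper name rotb and the function name MixColumn are introduced here for illustration.

```c
#include <stdint.h>

/* Parallel multiplication of four field elements by {02} (Alg. 7.2.2). */
static uint32_t FFMulX(uint32_t w) {
    uint32_t t = w & 0x80808080u;   /* MSB of each byte */
    uint32_t r = (w ^ t) << 1;      /* clear the MSBs, then shift */
    /* add 0x1B to every byte whose MSB was set */
    return r ^ ((t >> 3) | (t >> 4) | (t >> 6) | (t >> 7));
}

/* Rotate left by n bytes, n in {1, 2, 3} (hypothetical helper). */
static uint32_t rotb(uint32_t w, unsigned n) {
    return (w << (8 * n)) | (w >> (32 - 8 * n));
}

/* Word-wise MixColumns for one column (Alg. 7.2.1). */
static uint32_t MixColumn(uint32_t w) {
    uint32_t r3 = rotb(w, 3);
    return FFMulX(w ^ r3) ^ r3 ^ rotb(w, 2) ^ rotb(w, 1);
}
```

For the column (db, 13, 53, 45), which packs to the word 0x455313DB under the assumed layout, MixColumn returns 0xBCA14D8E, that is, the column (8e, 4d, a1, bc).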

Gladman explicitly mentions that the approach we adopted for the reduction terms in Alg. 7.2.2 is not necessarily optimal for the target platform. He proposes a couple of alternatives. We did however find that his first proposal is indeed very well suited for the TC1797.

To finish our MixColumns optimization we estimate the complexity in terms of clock cycles of the new implementation and compare it to the previous implementation. Looking at Alg. 7.2.1, we see that mixing one column now takes four XORs (4 cc), four rotations (4 cc), and one call to FFMulX. We estimate 11 clock cycles for the FFMulX computation. All in all we get an estimate of 19 clock cycles per column of the state and a total estimate of 76 clock cycles for one full MixColumns invocation. This is a reduction of ≈ 67% compared to the estimate of 232 clock cycles for the naive implementation.
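The two estimates can be put side by side as follows. The calculation assumes one clock cycle each for the load and the store of a column in the naive variant, which is what the stated total of 58 clock cycles implies.

```latex
\begin{align*}
\text{naive, per column:} \quad & 4 \cdot (10 + 4) + 1 + 1 = 58\ \mathrm{cc} \\
\text{naive, per invocation:} \quad & 4 \cdot 58 = 232\ \mathrm{cc} \\
\text{word-wise, per column:} \quad & 4 + 4 + 11 = 19\ \mathrm{cc} \\
\text{word-wise, per invocation:} \quad & 4 \cdot 19 = 76\ \mathrm{cc} \\
\text{relative saving:} \quad & 1 - \frac{76}{232} \approx 0.67
\end{align*}
```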

7.2.3 AES Protected with CPRR13

Following the 32-bit optimization of our naive AES implementation, we turned to CPRR13. Implementing the secure S-box was very straightforward because SecExp254 as depicted in Alg. 6.2.6 could almost directly be transformed into C code. The lookup tables were easy to generate using MatLab. After the secure exponentiation, the affine transformation has to be computed. In (2.2) (see page 8) we gave the affine transformation in matrix notation. Looking at the matrix, we see that each row is a rotated version of the previous row. We recall that we work with data in binary form. Therefore we can interpret one row of the matrix as a binary number. Multiplying a row of the matrix with the input value is then equivalent to a logical AND followed by bitwise addition of the binary digits in the result. This is in turn equivalent to computing the parity of the value resulting from the AND operation. Starting with the MSB of the result, we can iteratively compute a single digit of the overall result by performing the AND-parity combination, shifting the accumulator variable one bit to the left, and rotating the mask by one bit. Listing 7.8 shows the resulting C code. Here, the parity computation is a separate function which iterates over the bits of the input value. This is not a particularly efficient way of computing the parity.

    BYTE aff_trans(BYTE input) {
        BYTE m, t, b;
        int j;

        m = 0xF8;
        b = 0;
        for(j = 0; j <= 7; j++) {
            t = input & m;
            b <<= 1;
            b ^= parity(t);
            m = ((m & 1) << 7) | (m >> 1);
        }

        return b;
    }

Listing 7.8: Affine transformation in C

However, on the TriCore we have instructions that require only a single clock cycle to compute bitwise rotation or parity. As the first step, we decided to use the compiler builtins __ROTATEL [Hig10, p. 118] and __PARITY [Hig10, p. 119], which enabled us to implement the affine transformation in an efficient manner. In addition, we realized that four affine transformations can be computed in parallel on a 32-bit CPU. For this optimization, we had to refrain from using the __PARITY builtin because it leads to machine code that computes the full 32-bit parity of the input word. In contrast to this, we need the individual 8-bit parity of each byte in the word. This need is satisfied by the PARITY machine instruction [ITA08b, p. 3-368], so we decided to exchange the __PARITY intrinsic for inline assembly that performs a direct invocation of the PARITY instruction. The resulting code is shown in Listing 7.9, which includes the TriCore compiler builtin __ROTATEL and the inline assembly. After we had implemented the secure exponentiation and the affine transformation, we finished the implementation of the full S-box computation.
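For readers without access to the TriCore toolchain, the AND-parity construction can be reproduced in portable C. The sketch below is an illustration under stated assumptions and not the thesis code: the function names are hypothetical, parity_bytes is a portable stand-in for the per-byte result of the PARITY instruction, and, as in Listing 7.8, only the linear part of the affine transformation is computed (the constant 0x63 is not added).

```c
#include <stdint.h>

/* Parity of one byte: 1 if the number of set bits is odd. */
static uint8_t parity8(uint8_t v) {
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1;
}

/* Linear part of the AES affine transformation for a single byte. */
static uint8_t aff_lin8(uint8_t in) {
    uint8_t m = 0xF8, b = 0;
    int j;
    for (j = 0; j < 8; j++) {
        b = (uint8_t)((b << 1) ^ parity8(in & m));
        m = (uint8_t)(((m & 1) << 7) | (m >> 1));  /* rotate mask right by one */
    }
    return b;
}

/* Per-byte parity of a word: bit 0 of every byte receives the parity of
   that byte. Portable stand-in for the TriCore PARITY instruction. */
static uint32_t parity_bytes(uint32_t t) {
    t ^= t >> 4;
    t ^= t >> 2;
    t ^= t >> 1;
    return t & 0x01010101u;
}

/* Four affine transformations (linear part) in parallel. */
static uint32_t aff_lin32(uint32_t in) {
    uint32_t m = 0xF8F8F8F8u, b = 0;
    int i;
    for (i = 0; i < 8; i++) {
        b = (b << 1) ^ parity_bytes(in & m);
        m = (m << 31) | (m >> 1);  /* rotate left by 31, i.e. right by 1 */
    }
    return b;
}
```

Because every byte of the mask carries the same pattern, rotating the full 32-bit mask by one bit has the same effect as rotating each byte individually. As a spot check, the inverse of {53} is {ca}, and aff_lin8(0xCA) ^ 0x63 yields 0xED, the S-box entry for {53}.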

    void aff_trans(WORD in, WORD *out) {
        WORD m = 0xF8F8F8F8;
        WORD t, b;
        int i;

        WORD par;

        b = 0;
        for(i = 0; i <= 7; i++) {
            t = in & m;
            b <<= 1;
            __asm__("PARITY %0, %1" : "=r" (par) : "r" (t));
            b ^= par;
            m = __ROTATEL(m, 31); /* rotate the mask right by one bit */
        }

        *out = b;
    }

Listing 7.9: Affine transformation in C, optimized for the TC1797

Dynamic computation of the S-box is much more complex than a simple table lookup. We therefore decided to write test code that would iterate over all values from 0 through 255, share them, compute the S-box, and compare the recombined result to the corresponding entry in the S-box lookup table given in the AES specification. This way we were able to check our implementation for correctness. The test code helped in finding and fixing some bugs we had unwittingly added to our implementation. In Sect. 7.2.5 we provide some more information on the test bench we created based on NIST test vectors.

In order to plug the newly implemented S-box computation into AES, we first had to share both the key and the input. In order to store the shares, we had to make a decision about the memory layout we wanted to use. There are mainly two possible approaches. The first approach is to define structs in C that hold all shares of one intermediate. We tried this approach and found it convenient for function calls but inconvenient from a security perspective: loading one such structure from memory means that all shares of one secret value are processed at the same time, which breaks all rules of SCA countermeasure design. Moreover, the layout using structs is highly inconvenient with regard to 32-bit operations on full columns because there are essentially no columns. The second approach, which we chose to pursue in the end, is to leave the memory layout of the unprotected implementation unchanged and simply duplicate the respective memory areas. As such, we chose to store the shared state as a two-dimensional array holding 2 · 16 bytes and proceeded accordingly with the round key array. The sharing is then easy to implement by simply generating the required random data, storing it in one of the shares, and storing the XOR of the random value and the secret in the other share. This way it is guaranteed that, for example, state[0][0] ⊕ state[1][0] = 𝑠_{0,0}. The shared key expansion differs from the original specification only in so far as it computes two full round key arrays instead of just one. No changes are required for the computation.
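The second memory layout and the sharing step can be sketched as follows. The function names are assumptions made for this illustration, and rand() is used only as a placeholder: it is not a suitable source of masking randomness on a real device.

```c
#include <stdlib.h>

typedef unsigned char BYTE;

/* Split a 16-byte state into two Boolean shares so that
   shares[0][i] ^ shares[1][i] == secret[i] holds for every i.
   rand() is a placeholder only and NOT suitable for real masking. */
static void share_state(const BYTE secret[16], BYTE shares[2][16]) {
    int i;
    for (i = 0; i < 16; i++) {
        BYTE r = (BYTE)(rand() & 0xFF);
        shares[0][i] = r;
        shares[1][i] = (BYTE)(secret[i] ^ r);
    }
}

/* Recombine the two shares into the unmasked state. */
static void unshare_state(BYTE shares[2][16], BYTE out[16]) {
    int i;
    for (i = 0; i < 16; i++) {
        out[i] = (BYTE)(shares[0][i] ^ shares[1][i]);
    }
}
```

Because each share keeps the layout of the unprotected state, a linear function such as MixColumns can be applied to shares[0] and shares[1] separately without modification.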

After we had decided on the memory layout, we went ahead and integrated the S-box implementation into our sharing-compatible AES implementation. Given that the other transformations remained unchanged compared to the unprotected implementation, we simply reused the existing code. We provide the full encryption function as Listing 7.10 in order to illustrate how the new memory layout results in two calls per round to ShiftRows, MixColumns, and AddRoundKey, but only one call to SubBytes, and to show how the sharing and recombination of the input and the ciphertext work. One could argue that it is not very elegant to denote two separate calls to each of the affected functions. However, we believe that this is acceptable for two reasons. The first reason is that the actual function is reusable because the memory addresses to be manipulated are passed in as parameters. The second reason is that writing the code in this way gives us more flexibility during side-channel analysis because we can freely choose where to set and reset the trigger for the measurements. For even more fine-grained access to sensible trigger points, it sometimes makes sense to fully unroll the loop that represents the round iterations.

Our overall impression while implementing CPRR13 was that the algorithms were very easy to put into code. We faced no major issues during implementation and test, and debugging was simple because all computations are performed directly in GF(2^8). The situation was slightly different when we wrote the code for SKHL13. We present our findings in the following section.

7.2.4 AES Protected with SKHL13

In order to implement the SKHL13 scheme, we also started with the secure S-box computation because this is the central part of the scheme. As for CPRR13, the other round transformations remained unchanged, but had to be called twice per round so that each share of the state is processed. We kept the principle of writing a small test program that iterates over all possible inputs to the S-box in order to verify the correctness of the computation.
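Such an exhaustive S-box test could look like the sketch below. It is an illustration under assumptions: the two-share function signature shared_sbox_fn is hypothetical, the reference S-box is computed from the AES definition (inversion in GF(2^8) followed by the affine transformation), and the loop over all masks goes beyond the "all possible inputs" mentioned in the text. The function dummy_shared_sbox is an insecure stand-in that only exercises the harness.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t REF_SBOX[256];

/* Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1) r ^= a;
        uint8_t carry = a & 0x80;
        a = (uint8_t)(a << 1);
        if (carry) a ^= 0x1B;
        b >>= 1;
    }
    return r;
}

static uint8_t rotl8(uint8_t x, int s)
{
    return (uint8_t)((x << s) | (x >> (8 - s)));
}

/* Fill REF_SBOX: brute-force inverse (0 maps to 0), then affine transformation. */
static void init_ref_sbox(void)
{
    for (int x = 0; x < 256; x++) {
        uint8_t inv = 0;
        for (int c = 1; c < 256; c++) {
            if (gf256_mul((uint8_t)x, (uint8_t)c) == 1) { inv = (uint8_t)c; break; }
        }
        REF_SBOX[x] = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                                ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
    }
}

/* ASSUMPTION: hypothetical signature of a two-share S-box that reads and
 * overwrites both shares in place. */
typedef void (*shared_sbox_fn)(uint8_t share[2]);

/* Count (input, mask) pairs whose recombined output differs from REF_SBOX. */
static unsigned count_sbox_failures(shared_sbox_fn f)
{
    assert(f);
    unsigned failures = 0;
    for (int x = 0; x < 256; x++) {
        for (int m = 0; m < 256; m++) {
            uint8_t share[2] = { (uint8_t)m, (uint8_t)(m ^ x) };
            f(share);
            if ((uint8_t)(share[0] ^ share[1]) != REF_SBOX[x]) failures++;
        }
    }
    return failures;
}

/* INSECURE stand-in: recombines the shares, used only to test the harness. */
static void dummy_shared_sbox(uint8_t share[2])
{
    uint8_t y = REF_SBOX[share[0] ^ share[1]];
    share[1] = (uint8_t)(y ^ share[0]);
}
```

A real masked S-box would be passed to count_sbox_failures in place of the stand-in; a return value of zero means every input produced the expected output for every mask.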

As the first step, we needed the lookup tables described in [KHL11] and on page 51 of our thesis. Unfortunately, the authors of [KHL11] do not provide the tables in their paper and we were unable to make contact with any of the three individuals. We were effectively left with the task of creating the tables on our own. In contrast to CPRR13, we could not use MatLab to generate the tables in this case because MatLab does not allow for the definition of repeated field extensions (see footnote 6). Thus we decided to begin with a small and simple multiplication table for GF(2^2), which was quickly finished. We created this table because it would later accelerate coefficient multiplication for computations involving values from GF((2^2)^2). It is not required in the code.
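The GF(2^2) multiplication table can also be generated instead of written by hand. The sketch below assumes the encoding a1·x + a0 as a 2-bit value and reduction modulo x^2 + x + 1, which is the only irreducible polynomial of degree 2 over GF(2); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two elements of GF(2^2), encoded as 2-bit values a1*x + a0,
 * reduced with x^2 = x + 1. */
static uint8_t gf4_mul(uint8_t a, uint8_t b)
{
    assert(a < 4 && b < 4);
    uint8_t a1 = a >> 1, a0 = a & 1, b1 = b >> 1, b0 = b & 1;
    uint8_t hi = (uint8_t)((a1 & b0) ^ (a0 & b1) ^ (a1 & b1)); /* coefficient of x */
    uint8_t lo = (uint8_t)((a0 & b0) ^ (a1 & b1));             /* constant term */
    return (uint8_t)((hi << 1) | lo);
}

/* Fill the full 4x4 multiplication table. */
static void gf4_build_table(uint8_t table[4][4])
{
    for (int a = 0; a < 4; a++)
        for (int b = 0; b < 4; b++)
            table[a][b] = gf4_mul((uint8_t)a, (uint8_t)b);
}
```

For example, {10} · {11} = x · (x + 1) = x^2 + x = 1, so the table entry for (2, 3) is 1.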

Based on the GF(2^2) table we were able to quickly write down the table T1 (x^2) and then derive T2 (x^4) from it because this required nothing more than a lookup into T1. Next we needed the tables T3 (λ · x^2) and T4 (x · y). While we tried to calculate those tables manually as well, we quickly realized that this approach was highly error-prone. We miscalculated intermediates and overall results many times, often in such creative ways that we would only find the mistake much later in the process. Additionally we found it very tedious to write expressions like ({01}x + {11}) · ({10}x + {10}) by hand, not to mention the resulting terms before the modular reduction. Nevertheless we calculated all tables for GF((2^2)^2) by hand and transferred the results into our implementation. For the tables T5 (forward isomorphism mapping) and T6 (inverse isomorphism mapping and affine transformation) we felt that it would be best to write code that generates

6 We cannot fully exclude the possibility that we failed at finding a working solution.

void Encrypt(BYTE *in, BYTE *out, WORD w[2][44], int Nr) {

    int i;

    /* share the input into the state */
    BYTE state[2][16];
    for(i = 0; i < 16; i++) {
        state[0][i] = rand() & 0xFF;
        state[1][i] = state[0][i] ^ in[i];
    }

    /* initial round key addition */
    AddRoundKey(state[0], w[0], 0);
    AddRoundKey(state[1], w[1], 0);

    /* rounds */
    for(i = 1; i < Nr; i++) {
        SubBytes(state);

        ShiftRows(state[0]);
        ShiftRows(state[1]);

        MixColumns(state[0]);
        MixColumns(state[1]);

        AddRoundKey(state[0], w[0], i*Nb);
        AddRoundKey(state[1], w[1], i*Nb);
    }

    /* final round */
    SubBytes(state);

    ShiftRows(state[0]);
    ShiftRows(state[1]);

    AddRoundKey(state[0], w[0], i*Nb);
    AddRoundKey(state[1], w[1], i*Nb);

    /* recombine the shares into the ciphertext */
    for(i = 0; i < 16; i++) {
        out[i] = state[0][i] ^ state[1][i];
    }
}

Listing 7.10: Shared AES Encryption


those tables. In this case the only problem we faced was that the bit order used for the isomorphism matrices (cf. (6.3) and (6.4), page 49) easily leads to erroneous results. Nevertheless the code to generate those two tables was quickly finished and we were able to verify that the tables were correct.
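The bit-order pitfall can be made explicit in a generator for such matrix-based tables. The following sketch fixes one convention (row i produces output bit i, bit j of a row selects input bit j, bit 0 is the least significant bit) and states it in the code. Because the isomorphism matrices (6.3) and (6.4) are not reproduced here, the example uses the well-known AES affine matrix as a stand-in whose results can be checked against the S-box; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply an 8x8 bit matrix over GF(2) with a byte.
 * CONVENTION (assumption of this sketch): rows[i] yields output bit i,
 * bit j of rows[i] selects input bit j, and bit 0 is the LSB. */
static uint8_t gf2_matvec(const uint8_t rows[8], uint8_t x)
{
    assert(rows);
    uint8_t y = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t v = (uint8_t)(rows[i] & x);
        v ^= (uint8_t)(v >> 4);          /* fold down to the parity bit */
        v ^= (uint8_t)(v >> 2);
        v ^= (uint8_t)(v >> 1);
        y |= (uint8_t)((v & 1u) << i);
    }
    return y;
}

/* Build a 256-entry table for x -> M*x + add. */
static void build_linear_table(const uint8_t rows[8], uint8_t add, uint8_t table[256])
{
    for (int x = 0; x < 256; x++)
        table[x] = (uint8_t)(gf2_matvec(rows, (uint8_t)x) ^ add);
}

/* Stand-in matrix: the AES affine transformation in the convention above. */
static const uint8_t AES_AFFINE_ROWS[8] = {
    0xF1, 0xE3, 0xC7, 0x8F, 0x1F, 0x3E, 0x7C, 0xF8
};
```

With the additive constant 0x63, the table maps the field inverse of an S-box input to the S-box output; reading the same matrix with the opposite bit order would generally produce a different table, which is the kind of error described above.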

Given the required set of lookup tables, the remaining implementation of the S-box computation was completed as quickly as the one for CPRR13. Upon running our test code we realized that only a small subset of input values produced correct S-box results. We then reiterated the pen and paper approach multiple times in an attempt to find mistakes in our lookup table calculations. We managed to increase the number of correct S-box outputs, but there were still wrong results in between. In the end we decided to write additional table generation code which proved the bigger part of our manual computations correct, but exposed one fatal mistake we had made in exactly one of the tables. As soon as we used the tables generated by our helper program, we found that all possible S-box inputs finally yielded correct output values with regard to the S-box lookup table from the AES specification.
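A table generator of this kind might look like the following sketch. It is not the helper program from the thesis. It assumes that GF((2^2)^2) is built on top of GF(2^2) with the polynomial x^2 + x + φ and φ = {10}, a common choice in the composite-field literature; the polynomial actually used on page 51 may differ. The constant λ is passed in by the caller, and the value used in any test run is only a placeholder.

```c
#include <assert.h>
#include <stdint.h>

/* GF(2^2) multiplication modulo x^2 + x + 1; elements are 2-bit values. */
static uint8_t gf4_mul(uint8_t a, uint8_t b)
{
    uint8_t a1 = (a >> 1) & 1, a0 = a & 1, b1 = (b >> 1) & 1, b0 = b & 1;
    uint8_t hi = (uint8_t)((a1 & b0) ^ (a0 & b1) ^ (a1 & b1));
    uint8_t lo = (uint8_t)((a0 & b0) ^ (a1 & b1));
    return (uint8_t)((hi << 1) | lo);
}

/* ASSUMPTION: extension polynomial x^2 + x + PHI with PHI = {10}. */
#define PHI 0x2u

/* Multiply in GF((2^2)^2); elements are 4-bit values a1*x + a0 with
 * 2-bit coefficients, reduced with x^2 = x + PHI. */
static uint8_t gf16_mul(uint8_t a, uint8_t b)
{
    assert(a < 16 && b < 16);
    uint8_t a1 = a >> 2, a0 = a & 3, b1 = b >> 2, b0 = b & 3;
    uint8_t hh = gf4_mul(a1, b1);
    uint8_t hi = (uint8_t)(hh ^ gf4_mul(a1, b0) ^ gf4_mul(a0, b1));
    uint8_t lo = (uint8_t)(gf4_mul(a0, b0) ^ gf4_mul(hh, PHI));
    return (uint8_t)((hi << 2) | lo);
}

/* Generate T1 (x^2), T2 (x^4), T3 (lambda * x^2) and T4 (x * y). */
static void gf16_build_tables(uint8_t lambda, uint8_t t1[16], uint8_t t2[16],
                              uint8_t t3[16], uint8_t t4[16][16])
{
    assert(lambda < 16);
    for (int x = 0; x < 16; x++) {
        for (int y = 0; y < 16; y++)
            t4[x][y] = gf16_mul((uint8_t)x, (uint8_t)y);
        t1[x] = t4[x][x];
    }
    for (int x = 0; x < 16; x++) {
        t2[x] = t1[t1[x]];
        t3[x] = t4[lambda][t1[x]];
    }
}
```

Generated tables can be checked against field identities before they are trusted: in any field with 16 elements, x^16 = x holds, so applying T2 twice must return the input, and every nonzero element must have a multiplicative inverse in T4.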

As the last step we plugged the SKHL13 S-box implementation into the same AES codebase we had previously used for CPRR13. The overall verification tests as described in Sect. 7.2.5 were not immediately successful because we had again introduced some bugs. We found that debugging the code was much less straightforward than for CPRR13 because the SKHL13 algorithms offer more chances to introduce bugs, mostly through flawed lookup tables. Due to the isomorphism mapping at the very beginning of each S-box computation, it is challenging to debug the inner part of the subfield inversion if it is unclear whether the results of the isomorphism transformation are correct. The approach we took to finally eliminate all bugs in the S-box computation was to work from the inside towards the outside of the inversion. As a start we assumed that the isomorphism mapping worked correctly and focused our debugging efforts on the inversion steps, using the inputs we could see after the isomorphism mapping. As soon as we were sure that the inner parts worked correctly we checked the forward and backward isomorphism mappings. In the end we found that the tables were all correct and that the erroneous results were caused by mistakes in variable handling and a small bug related to pointer arithmetic.

While implementing SKHL13 we got the overall impression that the algorithms were mostly as easy to put into code as those required for CPRR13. However we found it harder to create the lookup tables. This was certainly partially due to our own mistakes, but also due to the fact that we had to write custom code and could not use existing software like MatLab. We also found that debugging the code was more of a challenge than in the context of CPRR13.

7.2.5 Verification Tests

In order to verify the overall correctness of our various AES implementations we wrote a test bench that runs a selection of the official NIST AES test vectors through the cipher. Table 7.2 lists the test vectors we used in our verification tests. As soon as an implementation passed all tests we flashed the code onto the TC1797. For further validation we generally sent at least one key and one test input to the TC1797 via the serial interface in order to check that the code we had previously tested on the host computer also worked as expected on the TriCore. We decided not to add serial communication to our test bench to keep it as simple as possible.

Table 7.2: NIST test vectors used for verification tests

Test Data Set    Number of Tests   Description
ECBGFSbox128     7                 Encryption with key set to zero
ECBKeySbox128    21                Encryption with plaintext set to zero
ECBMCT128        100               Monte Carlo tests
ECBVarKey128     128               Variable key, plaintext set to zero
ECBVarTxt128     128               Variable plaintext, key set to zero

7.3 Findings on Execution Timing

When we developed the very first straightforward AES implementation for the TC1797 and conducted experimental measurements, we noticed that the runtime of each distinct encryption was not constant when we supplied varying plaintexts to the cipher. Timing variations cause misalignment of the recorded power traces, effectively decreasing the efficiency of power analysis attacks. Moreover, variable execution timing can enable timing attacks. It is thus important to ensure that the cipher terminates in constant time regardless of the key and the plaintext. As we showed in Sect. 7.2.1, it is possible to fix some timing issues by avoiding branches in the code. However, we found two reasons for timing variations that cannot be circumvented simply by fixing the implementation. We describe those findings in the following.
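One typical place where a data-dependent branch appears in AES is the multiplication by {02} in GF(2^8), often called xtime. The sketch below contrasts a branching and a branch-free variant. Whether Sect. 7.2.1 uses exactly this form is an assumption; the function names are hypothetical. Note that a compiler may still transform either variant, so the generated assembly should be inspected.

```c
#include <assert.h>
#include <stdint.h>

/* Branching variant: the executed path depends on the secret MSB. */
static uint8_t xtime_branching(uint8_t b)
{
    if (b & 0x80)
        return (uint8_t)((b << 1) ^ 0x1B);
    return (uint8_t)(b << 1);
}

/* Branch-free variant: the MSB is expanded into a mask of 0x00 or 0xFF. */
static uint8_t xtime_branchless(uint8_t b)
{
    uint8_t mask = (uint8_t)(0u - (unsigned)(b >> 7));
    return (uint8_t)((b << 1) ^ (mask & 0x1B));
}

/* Both variants must agree on all 256 inputs. */
static void xtime_selfcheck(void)
{
    for (unsigned v = 0; v < 256; v++)
        assert(xtime_branching((uint8_t)v) == xtime_branchless((uint8_t)v));
}
```

The branch-free version executes the same instruction sequence for every input, which removes this particular source of timing variation at the source-code level.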

7.3.1 Caching

The first cause for non-constant runtime was easy to find. The TriCore features a data cache and an instruction cache. Those caches are used to reduce load times for repeated instructions and data. However it is impossible to predict which instruction or which data will be in the cache at which point in time. Because getting a value from the cache is faster than getting it from memory, the timing of a load operation depends on whether the cache is used and whether the required value is in the cache. The same holds for the instruction cache. We decided to disable both caches on the TC1797. The cache sizes can be configured individually. The size of the instruction cache ranges from 0 to 16 kB [ITA09b, Section 2.9.4] while the size of the data cache ranges from 0 to 4 kB [ITA09b, Section 2.10.4]. We decided to set both values to 0, thereby effectively disabling the caches. In addition, we configured all relevant memory areas to use Segment A instead of Segment 8 because the former guarantees non-cached data access. It goes without saying that this configuration is only viable in an experimental environment. In a production system like an ECU, the caches are essential for the overall system performance.

7.3.2 Compiler Optimization

The second cause for timing variations lies in optimizations applied by the compiler. The HighTec TriCore compiler is a customized gcc that offers the usual optimization levels 0 through 3. At level 0 the binary is not optimized at all while selecting optimization level 3 enables all optimization flags. The full reference on compiler options is provided in Chap. 18 of the Toolchain User’s Manual [Hig10]. As an example of how the compiler optimization influences the encryption execution timing we built binaries of the unprotected 32-bit implementation for all four optimization settings. We downloaded each individual binary to the TC1797 and measured the time it took until the encryption was fully computed. We ran 1000 encryptions per measurement. Table 7.3 shows the results. It is easy to see that higher optimization levels lead to shorter execution times. However, the timing that was originally constant at optimization level 0 does not stay constant with increasing optimization: the standard deviation σ remains below one clock cycle (cc) at levels 0 and 1 but grows to well above ten clock cycles at levels 2 and 3. In the following we give a short overview of the key algorithms each optimization level activates. Unfortunately, we did not have the time to perform a full analysis of the reasons for non-constant execution timing at optimization levels 2 and 3.

Table 7.3: Timing of the unprotected 32-bit optimized encryption depending on compiler optimization

Optimization Level   t/ns           σ/ns           σ/cc
0                    2.109 × 10^6   1.085          0
1                    1.237 × 10^6   0.887          0
2                    1.058 × 10^6   2.833 × 10^3   28.33
3                    0.612 × 10^6   1.750 × 10^3   17.5
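The relation between the σ/ns and σ/cc columns can be reproduced with a one-line conversion. The sketch assumes a core frequency of 10 MHz, that is, 100 ns per clock cycle, which is the ratio implied by the table itself; the function name is hypothetical.

```c
#include <assert.h>

/* Convert a duration in nanoseconds into clock cycles at frequency f_hz. */
static double ns_to_cycles(double t_ns, double f_hz)
{
    assert(f_hz > 0.0);
    return t_ns * f_hz * 1e-9;
}
```

Under this assumption, ns_to_cycles(2833.0, 10e6) gives about 28.33 cycles for level 2, and the mean execution time at level 0 corresponds to roughly 21 090 cycles per encryption.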

Level 1

At optimization level 1 the TriCore compiler “tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time” [Hig10, p. 167]. One of the key features at level 1 and above is that functions bearing the inline keyword are expanded inline, which saves clock cycles because call and return instructions including related register and context preparation and cleanup are avoided. Additionally the compiler attempts to optimize data transfers by looking for patterns, which enables it to reduce the number of load operations. Conditional jumps are replaced with non-branching instruction sequences where possible. Loops are optimized if possible by rearranging termination criteria checks and loops are unrolled where it seems sensible. At this point we omit the other optimization features activated at level 1.

Level 2

At optimization level 2 the compiler uses a total of 19 additional optimizations including but not limited to avoiding some more memory access operations by enforcing that data is loaded into a register before any computations are performed with it, and the reordering of code blocks for better placement in the binary and for more efficient invocation. Loops are optimized even more than at level 1 and recursive calls are replaced by jumps where applicable.

Level 3

At optimization level 3 the compiler tries to optimize register allocation for scheduled code, which is not relevant for the code we are writing. In addition, it automatically expands small functions inline as long as they are smaller than the limit prescribed by the option -finline-limit=<number>. This optimization can be beneficial because it might remove functions that were only written for better readability or as pure wrappers to subordinate function calls.

Selecting an Optimization Level

Based on the experience we gained from experimenting with different optimization settings we clearly recommend starting with optimization level zero, that is, turning off all compiler optimizations. Using no optimization allows for absolute certainty about the execution timing of the code. Furthermore it is always better to ensure efficiency of the implementation by looking for common bottlenecks like memory access or operations that do not exploit the full register width. The timing measurements we performed for the non-optimized 8-bit implementation (Table 7.1, row 2) and the 32-bit optimized implementation (Table 7.3, row 1), where both binaries were built without compiler optimization, clearly show that the developer can significantly improve the performance of the code by hand. Nevertheless, Table 7.3 also shows that manually optimized code can become even more efficient with the help of the compiler. In our opinion, compiler optimization should be used with care because minimized execution time can have side effects. The inline expansion of functions is one example of such side effects. Inline code runs faster because call and return sequences are avoided, but at the same time it makes the binary bigger because every invocation of the inlined function now requires its own sequence of opcodes in the assembled binary. Time-memory tradeoffs must be taken into consideration because the memory available for binary code is limited, as are all other resources on the target platform. Also, Table 7.3 provides evidence that too much optimization can lead to non-constant execution timing. While variable timing makes CPA attacks harder because it impairs the alignment of power traces, it might on the other hand enable timing-related attacks. Our overall recommendation is thus to start without any compiler optimization and then evaluate carefully which optimization level offers the best tradeoff between timing performance, memory usage, and security requirements.

#pragma section .zdata
const BYTE S[256] = {
    /* values */
};
#pragma section

Listing 7.11: Using the section pragma to move data to the LDRAM

7.3.3 Memory Management Tricks

The TC1797 features a DMI that contains an LDRAM area. If data caching is disabled, the LDRAM reaches a size of 128 kB. Access to the LDRAM is guaranteed to be fast and deterministic, which makes it attractive “for use by performance critical code sequences” [ITA09b, Section 2.10.3]. If the entire encryption is considered performance critical (remember the tradeoff discussion from the previous section), the LDRAM could be used to hold some or all of the lookup tables and other constant data. While performing some superficial experiments around this approach, we found that the overall time required to perform one full encryption went down by up to 20% compared to the case where the lookup tables resided in the regular ROM area. At the same time, the power consumption that became directly visible on the oscilloscope (for example during S-box lookups) decreased as well. Apart from the performance gain, we feel that it could be interesting to investigate the effect of the reduced power consumption on power analysis, for example on the correlation during an attack or on the number of traces required for successful key recovery. We did not perform an in-depth analysis of this phenomenon. We do however wish to give a short description of how data can be moved to the LDRAM from the developer’s perspective.

In a default setting, the predefined section .zdata contains initialized data that can be manipulated using absolute addressing. The memory configuration and subsequently the resulting map file make sure that the section .zdata resides in the address range of the LDRAM. Of course the data is not directly downloaded to the LDRAM because this would mean that it gets lost upon reboot of the TC1797. Instead, the startup code contains instructions that copy data from ROM to RAM by means of a so-called copy table. This table contains entries for all sections from which data needs to be copied to RAM. The .zdata section represents one of the entries in this copy table. We refer to [Hig10, Section 13] for details on the startup process.

In order to put data like a lookup table into a certain section, the developer can use the section pragma. This pragma can be used to specify a predefined section or even custom sections, along with a number of options. We provide Listing 7.11 as a simple example of how the S-box lookup table is moved to the LDRAM. The pragma directive can span across multiple constants, tables, or even function definitions. In our example it is opened by #pragma section .zdata and closed by #pragma section.

7.4 Key Takeaways

While writing code for the TC1797 we learned a great deal about how it generally behaves and what the TriCore compiler is able to do. We came in touch with compiler builtins like __ROTATEL that can be used to write semantic expressions in the source code without using explicit inline assembler. We collected some experience regarding compiler optimization and learned about many specific properties of the TriCore instruction set. Based on our previous experience with programming in C we were able to rapidly develop working software for the TC1797 and realized that the HighTec toolchain is of great value. We are under the impression that HighTec made the right decision when they selected gcc and related tools as the basis for their toolchain. In the following chapter we present the side-channel attacks we mounted against our AES implementations.

8 Side-Channel Analysis

In the previous chapters we described the process of selecting two masking schemes and the creation of various AES implementations. Although it seems obvious from the literature that the secret key of a straightforward AES implementation can be recovered using power analysis, this is not necessarily guaranteed for any arbitrary implementation running on any arbitrary platform. Thus we mounted a first-order power analysis attack against our 8-bit AES implementation running on the TC1797. Next we mounted the same attack on the allegedly protected implementations. In this chapter we describe the steps we took to mount the attacks and give an overview of the results we found.

8.1 Requirements

Before we can mount an attack, we need to make sure that certain requirements are met. We described the practical prerequisites for an attack in Chap. 3 and gave details on our working environment in Chap. 5. We give a short recap here, divided into technical and practical requirements.

8.1.1 Technical Requirements

The first thing we need technically is a target platform for the code we want to attack. In our case, this is the EMSEC evaluation board on which a TC1797 is mounted. Secondly, we need an oscilloscope to record power traces. Thirdly, a host computer is required that stores the recorded traces and takes care of the communication with the TC1797. Finally, we need two communication links between the host computer and the evaluation board. The first link is realized by a USB connection over which serial data is exchanged, and which additionally powers the board. The second link is required for the download of binaries and for debugging. For this purpose we use the UAD2 manufactured by pls. The oscilloscope and the UAD2 communicate with the host computer over USB as well. This setup, along with the software we mentioned in Sect. 5.4, fulfills the technical requirements for a side-channel attack.

8.1.2 Practical Requirements

Aside from the technical requirements, we must also adhere to well-defined practical methods. A power analysis attack generally consists of five steps, which we briefly recall here:

1. Select an intermediate value to be attacked.

2. Measure the power consumption.

3. Calculate hypothetical intermediate values.

4. Map hypothetical intermediate values to hypothetical power consumption.

5. Compare the hypothetical power consumption with the power traces.

For the first step we selected the S-box output in the first round as the most promising target. The S-box is a nonlinear component. Its diffusion property makes the correct hypothesis easier to distinguish from wrong ones compared to an attack on linear operations such as the key addition or ShiftRows. We note other intermediate targets explicitly where applicable.

For the four remaining steps, we decided to use the framework developed in [Osw09]. This framework takes care of the data acquisition by means of the PicoScope API. The oscilloscope channels to be recorded and the number of samples per measurement are easily configurable. The remaining steps we mentioned above are also performed by the framework in the evaluation phase. The only step that requires interaction is the mapping of hypothetical intermediates to hypothetical power consumption. Here, it is necessary to decide which power consumption models shall be considered during the evaluation. The framework offers all common models like a simple bit model or an 8-bit Hamming-Weight model, but also more complex ones like Hamming-Distance and register distance models. Individual models can easily be defined in C++ but we found no need to do so for this thesis. We used Hamming-Weight and bit models most of the time, although we tried some other models in experimental phases of our work.
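The power models named above are small functions that map a hypothetical intermediate value to a hypothetical power consumption. The following plain C sketch illustrates three of them; it does not reproduce the interface of the framework from [Osw09], and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 8-bit Hamming-Weight model: number of set bits in the intermediate. */
static unsigned hw8(uint8_t v)
{
    unsigned n = 0;
    while (v) {
        n += v & 1u;
        v >>= 1;
    }
    return n;
}

/* Bit model: value of one selected bit (bit 7 is the MSB). */
static unsigned bit_model(uint8_t v, unsigned bit)
{
    assert(bit < 8);
    return (v >> bit) & 1u;
}

/* Hamming-Distance model: number of bits that toggle between two values. */
static unsigned hd8(uint8_t before, uint8_t after)
{
    return hw8((uint8_t)(before ^ after));
}
```

For an attack on the first-round S-box output, the argument would be the S-box output for a plaintext byte XORed with a key byte hypothesis.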

8.2 Attacking the Implementations

In the previous section we summarized the requirements for a power analysis attack and briefly presented basic decisions we made. We begin this section with the description of a first-order power analysis attack on the unprotected implementation. Subsequently we describe how we attacked the protected implementations, with a focus on SKHL13 because we declared this scheme our favorite at the end of Chap. 6.

8.2.1 Initial Experiments

In the early phase of our work we conducted some experimental measurements in order to collect experience with the TC1797 and the PicoScope. We recorded and profiled power traces of the unprotected 8-bit implementation for those experiments. The early test measurements showed that full key extraction from the unprotected AES implementation was not possible with fewer than 50 000 traces. Normally one expects to break an unprotected implementation using around 1000 traces, which made us feel that we had to change something about our setup. We subsequently evaluated various measurement setups using resistances from 1 Ω up to 200 Ω and different choices of ground connectors ranging from only the central CPU ground to a combination of all accessible ground lines. We noticed that including the oscillator ground in the measurements quickly chokes the clock, effectively stalling the CPU, as soon as the measurement resistance exceeds R ≈ 50 Ω. Our experiments showed that a resistance of R = 150 Ω delivered optimal results. We thus decided to leave the oscillator ground out of the measurements.

In spite of all the different measurement setups, we never succeeded in decreasing the number of power traces required for a successful attack. We then discussed the effect of decoupling capacitors on the measured power traces with Falk Schellenberg and came to the supposition that the capacitors could have a negative influence on the number of required traces. We subsequently decided to remove the capacitors prescribed by the TC1797 PCB design guide [ITA11] from the board. Before the removal, we built three binaries from identical AES code, but configured them to use different CPU frequencies of 10 MHz, 50 MHz, and 180 MHz. We then entered an iterative process in which we repeated two steps:

1. Select and remove at most two capacitors from the board.

2. Download and run each of the three binaries in order to verify the functionality of the board and the TC1797.

For each test run, we additionally checked the power consumption waveform using the oscilloscope. We found that we could safely remove almost all decoupling capacitors from the board without breaking the overall system. We had to make only one exception for the capacitors supporting the flash memory. As soon as we removed them, the TC1797 quickly lost its programming. We thus reinstalled the flash memory capacitors on the board. After the capacitor modification, we conducted new experiments with the unprotected AES implementation. We found that full key extraction was now possible with at most 1000 traces, even without any alignment or filtering preprocessing. The presumption that the decoupling capacitors might hinder the attack proved to be correct.

8.2.2 Common parameters for all attacks

While we attacked many different implementations, some technical parameters were identical across all attacks. We describe those parameters and their values in the following.

For the measurement framework and the PicoScope, we set the sample rate to 500 MHz. For Channel A, which we used to record the power consumption, we set the range to ±100 mV, and the coupling to AC. For Channel B, which we used as the trigger channel, we set the range to ±5 V, and the coupling to DC. In addition, we configured the trigger threshold for Channel B to 2 V and selected rising edge as the trigger mode.

With regard to the TC1797, we selected a core frequency of 10 MHz. In combination with the sample rate we used for measurements, this means that we get 50 power consumption samples per clock cycle. We consciously set the CPU frequency to this comparatively low value in order to avoid problems with the serial communication and because we preferred a high number of samples per clock cycle over low execution times. The latter preference leads to longer power traces.


For all measurements, we used a resistance of 150 Ω, as mentioned in the previous section. We determined this value as the optimum by experimentation. Once we had decided on the value, we never changed it, so that different measurements would be comparable.

Other parameters like the number of samples to be recorded per power trace were different across implementations. This is obvious because each implementation exhibits different execution timing. We selected individual values for such parameters, tailored to the implementation under attack. In the following, we describe how we attacked the distinct implementations and which results we found when we analyzed the power traces.

8.2.3 Unprotected 8-bit AES

With the modified board, we felt that mounting a serious attack on the unprotected 8-bit implementation should pose no extraordinary challenge. We chose to attack the output of the S-box in the first encryption round. Because S-box lookups are performed in units of single bytes we initially selected an 8-bit Hamming-Weight model. We decided to record 10 000 traces, bearing in mind the common expectation that the key should be recoverable using around 1000 traces or fewer. Given that the correlation coefficient limit for an almost certain distinction between a correct and a wrong hypothesis equals 4/√D = 0.04 at D = 10 000 traces, we expected to get a clear correlation plot with easily visible peaks after the evaluation. We used the profiling features of David Oswald’s framework to analyze the recorded power traces. In profiling mode, the framework uses a known key and an oracle, and then computes the correlation between the measured power consumption and the hypothetical power consumption. Figure 8.1 shows the result: All 16 S-box lookups cause clearly visible correlation peaks in the case of the correct key.

This result shows that there is a correlation between the power consumption of the TC1797 and the intermediate values processed during and after the S-box lookups. However, profiling the power traces using the correct key does not necessarily imply that a CPA attack without prior knowledge of the key leads to a successful key recovery. We thus subsequently mounted a full CPA attack on each of the 16 bytes in the state. In this context, “full CPA” means that all potential key bytes are used in the attack. Therefore we used the full range (0 . . . 255) for the respective key byte and once again employed the first-round S-box oracle in combination with an 8-bit Hamming-Weight model. When we plotted the result of the full CPA on the second key byte, we realized that even though we can see correlation peaks in Fig. 8.1, the correlation of the correct key hypothesis only minimally exceeds the correlation of wrong hypotheses at around 3100 traces. Figure 8.2 shows the evolution of the correlation coefficients for all key hypotheses, plotted against the number of traces. It is clearly visible that the correlation for the correct hypothesis, plotted in black, is almost always lower than that for some of the wrong hypotheses. This hinders a clear decision concerning the correct key hypothesis.
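The principle of a full CPA on one key byte can be sketched in a few lines of C. This is an illustration under simplifying assumptions and not the framework used in the thesis: each trace is reduced to a single sample point, key hypotheses are ranked by the squared Pearson correlation (which avoids a square root and ignores the sign), and all names are hypothetical. Real traces contain many samples, so the ranking would be repeated per sample point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t SBOX[256];

static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1) r ^= a;
        uint8_t carry = a & 0x80;
        a = (uint8_t)(a << 1);
        if (carry) a ^= 0x1B;
        b >>= 1;
    }
    return r;
}

static uint8_t rotl8(uint8_t x, int s)
{
    return (uint8_t)((x << s) | (x >> (8 - s)));
}

/* Compute the AES S-box from its definition (inverse, then affine map). */
static void init_sbox(void)
{
    for (int x = 0; x < 256; x++) {
        uint8_t inv = 0;
        for (int c = 1; c < 256; c++) {
            if (gf256_mul((uint8_t)x, (uint8_t)c) == 1) { inv = (uint8_t)c; break; }
        }
        SBOX[x] = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                            ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
    }
}

static unsigned hw8(uint8_t v)
{
    unsigned n = 0;
    while (v) { n += v & 1u; v >>= 1; }
    return n;
}

/* Squared Pearson correlation between hypotheses h and samples t. */
static double corr_sq(const uint8_t *h, const double *t, size_t n)
{
    assert(n > 1);
    double sh = 0, st = 0, shh = 0, stt = 0, sht = 0, dn = (double)n;
    for (size_t i = 0; i < n; i++) {
        sh += h[i]; st += t[i];
        shh += (double)h[i] * h[i]; stt += t[i] * t[i]; sht += h[i] * t[i];
    }
    double cov = dn * sht - sh * st;
    double vh = dn * shh - sh * sh, vt = dn * stt - st * st;
    if (vh <= 0.0 || vt <= 0.0) return 0.0;   /* constant input: no correlation */
    return (cov * cov) / (vh * vt);
}

#define MAX_TRACES 4096   /* ASSUMPTION: arbitrary limit for this sketch */

/* Try all 256 key byte hypotheses and return the best-correlating one. */
static uint8_t cpa_best_key(const uint8_t *pt, const double *samples, size_t n)
{
    static uint8_t hyp[MAX_TRACES];
    assert(n > 1 && n <= MAX_TRACES);
    double best = -1.0;
    uint8_t best_key = 0;
    for (int k = 0; k < 256; k++) {
        for (size_t i = 0; i < n; i++)
            hyp[i] = (uint8_t)hw8(SBOX[pt[i] ^ k]);
        double s = corr_sq(hyp, samples, n);
        if (s > best) { best = s; best_key = (uint8_t)k; }
    }
    return best_key;
}
```

On noise-free simulated samples that equal the Hamming weight of the S-box output, the correct key byte is the unique maximum. On measured traces the margin between the correct and the wrong hypotheses depends on the noise and on how well the model fits, which is the effect discussed above.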

Based on this finding we turned to other power models. We decided to try simple bit models that target exactly one bit of an intermediate value at a time. In order to evaluate the broadest possible range of models, we used one bit model for each bit of the state. We then ran the profiling application and examined the results. It turned out


[Plot: correlation (−0.1 to 0.1) over time (30 µs to 100 µs)]

Figure 8.1: Correlation traces for each of the 16 S-box lookups in the first encryption round, using a Hamming-Weight model

[Plot: correlation (0 to 1) over number of traces (0 to 10 · 10^3)]

Figure 8.2: Correlation for all key hypotheses, targeting the first S-box with a Hamming-Weight model


[Plot: correlation (0 to 1) over number of traces (0 to 10 · 10^3)]

Figure 8.3: Correlation for all key hypotheses, targeting the first S-box with a bit model (MSB)

[Plot: correlation (0 to 1) over number of traces (0 to 10 · 10^3)]

Figure 8.4: Correlation for all key hypotheses, targeting the second S-box with a bit model (MSB)


[Plot: correlation (0 to 1) over number of traces (0 to 10 · 10^3)]

Figure 8.5: Correlation for all key hypotheses, targeting the third S-box with a bit model (MSB)

that most bit models exposed no relevant correlation. However there was one exception: Those bit models targeting the MSB of each interesting intermediate exhibited correlation coefficients ranging from ≈ 0.2 up to almost 1.0. We thus decided to run a full CPA again, targeting MSBs using the bit model. The results for the first S-box were close to perfect from the attacker’s point of view. Figure 8.3 shows the correlation for all key hypotheses. The black curve indicates the correct key hypothesis while the wrong hypotheses are plotted in light gray.

We did not get as clear results for the remaining 15 S-boxes, but they are attackable nevertheless. Figures 8.4 and 8.5 make it clear that the individual portion of the key can be recovered using at most 1000 power traces. We omit the full set of plots for the remaining S-boxes at this point for the sake of brevity.

8.2.4 Unprotected 32-bit AES

In between the unprotected and the allegedly secured implementations, we also performed a side-channel assessment of the 32-bit optimized implementation. The S-box was still easy to break. In contrast to the 8-bit implementation, however, where we had successfully conducted an attack on MixColumns during our initial experiments, we did not succeed in attacking MixColumns this time. We attribute this failure to the combination of two effects. Firstly, the 32-bit version of MixColumns uses the full capacity of the TC1797’s registers, which leads to a different power consumption than the handling of single bytes. Secondly, the Hamming-Weight models we used earlier predict the power consumption of 8-bit values. This prediction does not fit the power consumption of a fully used 32-bit register. It may seem obvious to use a 32-bit Hamming-Weight model to correct this issue, but in doing so, one would also have to predict 32 bits of the key at a time. It might be possible to mount an attack using three fixed key bytes while only one byte varies, but we decided to skip this approach. We felt that it would make more sense to focus on the protected implementations while sticking to the models we used for the unprotected implementation.
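The mismatch between an 8-bit model and a fully used 32-bit register can be made concrete with a small sketch. The helper names (`hw8`, `hw32`, `hw_unmodeled`) are illustrative and not taken from the thesis code. The Hamming weight of a 32-bit word is the sum of the Hamming weights of its four bytes, so an 8-bit prediction ignores the contribution of the other three bytes, each of which depends on a different key byte.

```c
#include <stdint.h>

/* Hamming weight of a single byte. */
static unsigned hw8(uint8_t v)
{
    unsigned count = 0;
    while (v) { count += v & 1u; v >>= 1; }
    return count;
}

/* Hamming weight of a 32-bit word as the sum over its four bytes. */
static unsigned hw32(uint32_t v)
{
    return hw8((uint8_t)(v & 0xFFu)) +
           hw8((uint8_t)((v >> 8) & 0xFFu)) +
           hw8((uint8_t)((v >> 16) & 0xFFu)) +
           hw8((uint8_t)(v >> 24));
}

/* Part of the 32-bit Hamming weight that an 8-bit model of the lowest
 * byte does not predict: the weight of the remaining three bytes. */
static unsigned hw_unmodeled(uint32_t word)
{
    return hw32(word) - hw8((uint8_t)(word & 0xFFu));
}
```

From the point of view of the 8-bit model, the value returned by `hw_unmodeled` acts like additional noise, which is one plausible reading of why the attack on the 32-bit MixColumns failed.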

8.2.5 AES protected with SKHL13

Next, we attacked the SKHL13 implementation using the same approach as for the unprotected implementation. We attacked the S-box computation in the first round using Hamming-Weight and bit models. We refrained from mounting all possible attacks on all possible intermediates in all possible rounds because doing so would have been excessively time-consuming. Nevertheless, we decided to increase the number of recorded power traces by one order of magnitude, that is, to record 100 000 traces. We would have preferred to record 1 000 000 traces, but a combination of technical constraints forced us to fix 100 000 as the upper limit. Firstly, we had decided to send random data to the TC1797 over the serial connection, which in itself poses no technical problem. Secondly, however, we had to deal with a connection speed of only 19 200 baud, which we could not increase because we had fixed the TC1797 core frequency at 10 MHz. Setting a higher baud rate with an acceptable amount of deviation would have forced us to increase the core frequency, which in turn would have changed all dependent parameters of our measurements. Therefore we decided to leave the CPU frequency and the baud rate unchanged. Instead, we reduced the amount of random data transmitted for each encryption because we targeted only the S-box in the first round. We then found by experimentation that we could record an estimated two traces per second. At this rate, recording 100 000 traces takes almost 14 hours, which is a reasonable time frame compared to almost 140 hours for one million traces.
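The time estimate follows directly from the measured acquisition rate. A minimal sketch of the arithmetic, with the hypothetical helper `acquisition_hours`, is given below; the rate of two traces per second is the experimental estimate quoted above.

```c
/* Estimated acquisition time in hours for a number of traces recorded
 * at a given rate (traces per second). */
static double acquisition_hours(unsigned long traces, double traces_per_second)
{
    return (double)traces / traces_per_second / 3600.0;
}
```

With these inputs, `acquisition_hours(100000UL, 2.0)` evaluates to roughly 13.9 hours and `acquisition_hours(1000000UL, 2.0)` to roughly 139 hours.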

After we had acquired the power traces, we used the framework to profile them using Hamming-Weight and bit models just like we did for the unprotected implementation. Figure 8.6 shows the correlation traces for the 16 S-box computations, based on HW models and bit models targeting the MSB of the chosen intermediates. Some minor spikes are still visible. In addition we provide the full DPA plot for the first S-box and all 256 candidate key bytes in Fig. 8.7. We see from the plot that the correlation for the correct key hypothesis is mostly hidden amongst that of wrong hypotheses. Nevertheless, we can spot three sensitive areas in the plot. The first area is located at roughly 14 000 traces, where the correlation of the correct hypothesis exceeds all others. The second critical area is located between 50 000 and 60 000 traces, where we see the same effect. The third area ranges roughly from 65 000 to 75 000 traces, where it looks very much as if we could have recovered the first key byte using any number of traces in this interval. A much better picture in terms of security is given by Fig. 8.8. This plot shows the result of a full CPA targeting the second S-box call in the first round. The correlation for the correct hypothesis never exceeds that of wrong hypotheses. We decided to stop mounting more attacks as soon as we found that the first S-box was still attackable.

Figure 8.6: SKHL13: Correlation traces for each of the 16 S-box computations in the first encryption round, using a Hamming-Weight model

The insecurity regarding the first key byte can have a variety of reasons. Firstly, the quality of the random data we used might have been too low. Secondly, applying the mask refreshing fix from [CPRR13] to the scheme might have introduced a different problem. Finally, we might have made a mistake in our code even though we verified that the encryption results were formally correct. Nevertheless, the number of traces required to recover the first key byte is almost two orders of magnitude larger than it was for the unprotected implementation.
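The visual comparison of the correct hypothesis against the wrong ones can also be expressed as a rank. The following sketch is an illustration and not part of the framework we used; `key_rank` and its parameters are hypothetical names. It counts how many wrong key hypotheses reach at least the absolute correlation of the correct one, so a result of 0 corresponds to the situation in which the correct curve exceeds all others.

```c
/* Absolute value without depending on the math library. */
static double abs_d(double v)
{
    return v < 0.0 ? -v : v;
}

/* Rank of the correct key byte among all 256 hypotheses.
 * corr[k] is the correlation obtained for key guess k at some fixed
 * number of traces. Returns the number of wrong guesses whose absolute
 * correlation is at least that of the correct guess (0 = best). */
static unsigned key_rank(const double corr[256], unsigned correct_key)
{
    unsigned k, rank = 0;
    double ref = abs_d(corr[correct_key]);
    for (k = 0; k < 256; k++) {
        if (k != correct_key && abs_d(corr[k]) >= ref) rank++;
    }
    return rank;
}
```

Evaluating this rank for an increasing number of traces would turn plots such as Fig. 8.7 and Fig. 8.8 into a single curve per S-box, which makes the sensitive intervals easier to compare.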


Figure 8.7: SKHL13: Correlation for all key hypotheses, targeting the first S-box in round 1 with a bit model (MSB). [Plot not reproduced; x-axis: Traces, 0 to 100·10^3; y-axis: Correlation, 0 to 0.1.]

Figure 8.8: SKHL13: Correlation for all key hypotheses, targeting the second S-box in round 1 with a bit model (MSB). [Plot not reproduced; x-axis: Traces, 0 to 100·10^3; y-axis: Correlation, 0 to 0.1.]


Figure 8.9: CPRR13: Correlation traces for each of the 16 S-box computations in the first encryption round, using a Hamming-Weight model

8.2.6 AES protected with CPRR13

Finally, we mounted the same attack again, this time targeting the CPRR13 implementation. Once again we attacked the S-box in the first round. We recorded 100 000 power traces and evaluated them with the same power models as before. Figure 8.9 shows the correlation traces for all 16 S-box computations. As in Fig. 8.6, we observe a few minor spikes. Figure 8.10 shows the result of a full CPA on the first key byte. As with SKHL13, we see some areas where the correlation of the correct key hypothesis looks exploitable. We identify one sensitive spot at around 15 000 traces and another between 70 000 and 75 000 traces. The result of a full CPA attack on the second key byte is depicted in Fig. 8.11. This plot looks much better in terms of security because the correlation of the correct key byte is always lower than that of some wrong hypotheses.


Figure 8.10: CPRR13: Correlation for all key hypotheses, targeting the first S-box in round 1 with a bit model (MSB). [Plot not reproduced; x-axis: Traces, 0 to 100·10^3; y-axis: Correlation, 0 to 0.1.]

Figure 8.11: CPRR13: Correlation for all key hypotheses, targeting the second S-box in round 1 with a bit model (MSB). [Plot not reproduced; x-axis: Traces, 0 to 100·10^3; y-axis: Correlation, 0 to 0.1.]


8.3 Results

In this chapter we showed how we mounted CPA attacks against our AES implementations. We summarize the results in the following.

As a start, we subjected the unprotected 8-bit implementation to an attack using 10 000 traces. We found that the Hamming-Weight model was not sufficiently well suited for successful recovery of the secret key. Using bit models targeting the MSB of each intermediate, we were able to recover the full key by attacking the S-box in the first round. We required fewer than 1000 traces for the full key recovery.

Next, we attacked the unprotected 32-bit implementation, again using 10 000 traces. We found that the bit models still worked as efficiently as in the previous attack. In contrast to the 8-bit implementation, experimental attacks on the output of MixColumns failed, which we attributed to the combination of full 32-bit register usage and 8-bit Hamming-Weight models.

Subsequently, we turned to the SKHL13 implementation. We increased the number of traces to 100 000. The S-box in the first round remained our attack target. We profiled the correlation using both Hamming-Weight and bit models and found that the peaks were less clear but still partially visible. Full DPA attacks on the first and second key byte showed that the first key byte could still be recovered using fewer than 100 000 traces while the second key byte could not be recovered using the full 100 000 traces.

Finally, we attacked our CPRR13 implementation. We saw some minimal correlation spikes in this case as well. Plotting the full DPA results, we found that the first S-box in the first round might still be attackable with fewer than 100 000 traces whereas the second S-box in the same round cannot be attacked using the full set of 100 000 traces.

Both masked implementations seem to suffer from marginal CPA vulnerabilities. We assume that the reasons for those findings lie somewhere between bad randomness and implementation bugs. We leave the analysis of the exact reasons open at this point.

9 Conclusion

We began our work on this thesis with little working knowledge about side-channel attacks. During the course of reading, writing, coding, testing, debugging, and attacking our implementations we learned a great deal about power analysis attacks in general and Correlation Power Analysis in particular. We gained detailed insight into mathematical and technological concepts as well as practical aspects. We are now familiar with the HighTec TriCore toolchain, the PicoScope, the EMSEC SCA framework, and characteristic properties of the Infineon TriCore TC1797. It remains to check whether we have achieved the goal of our thesis. In the following we look for an answer to this question.

9.1 Breaking AES on TriCore using Power Analysis Attacks

In the introduction to our thesis we put forward two propositions. We reproduce the first one here:

Proposition 9.1.1 The key of an unprotected AES implementation running on the TC1797 can be recovered using power analysis.

We showed in Sections 7.2.1 and 7.2.2 how we created unprotected 8-bit and 32-bit AES implementations. We ran both implementations on the TC1797 and subsequently subjected them to CPA attacks. We found that the unprotected implementations were easy to break using fewer than 1000 power traces. The first proposition holds.

9.2 Protecting AES on TriCore against Power Analysis Attacks

We used the first proposition primarily to substantiate the allegation that a naive software implementation of AES is vulnerable to Correlation Power Analysis. The major part of our thesis aims at confirming the second proposition:

Proposition 9.2.1 It is possible to protect AES on the TC1797 against first-order power analysis attacks.

We researched publications on masking schemes and identified two interesting candidates for implementation. CPRR13 and its security fix came into play when we already had a working implementation of RP10, which we subsequently modified to meet the new specification. KHL11 originally seemed like the better candidate from a performance point of view, but it contained the flawed mask refreshing procedure exhibited in [CPRR13]. When we learned about the flaw, we ported the corresponding fix to the KHL11 scheme. In doing so we effectively created a fork of this scheme. We named the new scheme SKHL13 based on our own surname and the current year. We implemented CPRR13 and SKHL13 and tested them on the TC1797. To verify their alleged protection against 1st-order CPA, we mounted attacks against both implementations. We saw in Chap. 8 that both schemes deliver a considerable degree of protection against 1st-order attacks. However, we cannot state that our implementations are fully protected against power analysis attacks. Further research is required in order to identify and remedy the issues we found when we attacked the protected implementations. Nevertheless, we saw that conducting an attack has become harder with regard to the required number of power traces. Thus, the second proposition holds to a limited extent: AES on the TC1797 can be protected against first-order power analysis attacks, although we have not found the perfect solution yet.

10 Future Work

I love deadlines. I love the whooshing sound they make as they fly by.

Douglas Adams

Regular human beings do not perfectly conform to the much-cited concept of an attacker with unlimited time and resources. During the course of working on our thesis we identified some issues that look interesting for further investigation. In the following we give an overview of what might be researched in the future.

10.1 TC1797 Profiling

Many papers and books we read, for example [MOP07], go into great detail regarding the cryptographic devices being attacked. Our approach was to measure the power consumption and profile the traces for correlation afterwards, using a variety of models. We feel that this is a valid approach to finding the best power model for the targeted system. Nevertheless, the reverse approach seems interesting as well. Mangard et al. profile the microcontroller using repeated identical instructions with varying data in order to find out what information is leaked through the power consumption. From the findings they derive which kind of power model suits the cryptographic device. Our experience was that the TC1797 is a very feature-rich and complex device and that its electrical characteristics are just as versatile. We believe that spending some time on detailed power consumption profiling could greatly improve the efficiency and the reliability of power analysis attacks.

10.2 Optimizing the Implementations

We found that all implementations we created required many more clock cycles than we had estimated, primarily because we had excluded address arithmetic and function calls from our estimation process. We also found that compiler optimization led to a considerable speedup, with the minor downside that the execution timing became variable with increasing optimization level. Due to the overall capabilities of the TC1797 we felt that it would not make sense to write an AES implementation in assembler. Nevertheless, we believe that there is room for optimization in the C code we created. Moreover, pipelining is an area we left completely untouched. In combination with compiler optimization it should be possible to achieve a further reduction of the number of clock cycles required for encryption computations.


10.3 Additional Countermeasures

We implemented two masking schemes, to one of which we applied a security fix that we knew from [CPRR13]. We found that our implementation of SKHL13 was not fully secure against 1st-order CPA. This finding might be due to implementation errors or bad random data. On the other hand, it might be necessary to apply additional hiding countermeasures as recommended by Matthieu Rivain. Implementing shuffling to harden the masking schemes further might be an interesting area of work.
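To indicate what such a shuffling countermeasure might look like, the following sketch processes the 16 state bytes in a random order chosen per encryption. It is an assumption-laden illustration and not part of our implementations: the names `shuffle_order`, `sub_bytes_shuffled`, and `demo_sbox` are hypothetical, `demo_sbox` merely stands in for a masked S-box routine, and `rand()` is a placeholder that would have to be replaced by a proper random number generator on the target.

```c
#include <stdint.h>
#include <stdlib.h>

/* Fill order[0..15] with a random permutation of 0..15 (Fisher-Yates).
 * rand() is only a placeholder; it is neither unbiased under the modulo
 * reduction nor suitable as a source of randomness for countermeasures. */
static void shuffle_order(uint8_t order[16])
{
    int i;
    for (i = 0; i < 16; i++) order[i] = (uint8_t)i;
    for (i = 15; i > 0; i--) {
        int j = rand() % (i + 1);
        uint8_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
}

/* Stand-in for a (masked) S-box routine; it simply inverts all bits. */
static uint8_t demo_sbox(uint8_t x)
{
    return (uint8_t)~x;
}

/* Apply the byte-wise substitution to the state in shuffled order, so
 * that the point in time at which a given byte is processed varies. */
static void sub_bytes_shuffled(uint8_t state[16], uint8_t (*sbox)(uint8_t))
{
    uint8_t order[16];
    int i;
    shuffle_order(order);
    for (i = 0; i < 16; i++) {
        state[order[i]] = sbox(state[order[i]]);
    }
}
```

The result of the substitution is independent of the chosen order; only the time at which each byte leaks changes, which is the effect a hiding countermeasure aims for.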

10.4 Additional Attacks

We mounted CPA attacks primarily at the S-box in the first round of AES. For the unprotected implementation we found that this was sufficient to recover the key using fewer than 1000 power traces. For the masked implementations we would have liked to mount more attacks on other rounds (2-3, 8-10), experiment with additional sources of randomness, and try more sophisticated power models. Moreover, we can only speculate about the chances of success in mounting higher-order attacks against our implementations. Higher-order attacks were out of the scope of our thesis, but we feel that it would be interesting to see if and how easily our implementations can be broken using higher-order CPA. As another field of future research it might be educational to try attacks based on Mutual Information Analysis, which is supposedly more powerful than CPA.
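As a pointer for such future work, a second-order CPA typically combines two points of each trace into one value before correlating. A commonly described choice is the centered product of the two samples. The sketch below is not something we implemented or evaluated; `centered_product` and its parameters are hypothetical names, and the selection of the two points in time is left open.

```c
#include <stddef.h>

/* Arithmetic mean of n samples; returns 0 for an empty input. */
static double mean(const double *v, size_t n)
{
    double sum = 0.0;
    size_t i;
    if (n == 0) return 0.0;
    for (i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Centered-product combining for a second-order attack.
 * a[i] and b[i] are the samples of trace i at two chosen points in time
 * (for example where the two shares of a masked value are processed).
 * out[i] receives (a[i] - mean(a)) * (b[i] - mean(b)). */
static void centered_product(const double *a, const double *b,
                             size_t n, double *out)
{
    double ma = mean(a, n), mb = mean(b, n);
    size_t i;
    for (i = 0; i < n; i++) out[i] = (a[i] - ma) * (b[i] - mb);
}
```

The combined values in `out` would then be correlated with a model of the unmasked intermediate in the same way as single samples are in a first-order CPA.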

10.5 Extended Software Tooling

We found the EMSEC SCA framework extremely valuable in all phases of our work. It is a reliable tool of the trade, and many special requirements have already been taken care of. Nevertheless, we frequently felt that spending many hours waiting for a result to drop out of the evaluation process was a bit boring and sometimes frustrating. During those phases we thought about possible enhancements to the software ecosystem. For example, we imagined the construction of an automated side-channel assessment system whose basic idea is strongly related to continuous integration systems used in software development. We also pictured an SCA framework that employs distributed computing for faster evaluation of the recorded power traces, independent of the question whether this would be possible from technical and mathematical perspectives. Maybe such software could even be augmented with realtime visualization like correlation plots evolving with the increasing number of evaluated traces. From a more general point of view, we strongly feel that there is a lot of potential in the area of side-channel visualization.
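One building block for correlation plots that evolve with the number of evaluated traces is an incremental formulation of the correlation coefficient. The following sketch is an assumption about how such a component could look and is unrelated to the internals of the EMSEC SCA framework; `corr_acc`, `corr_add`, and `corr_squared` are hypothetical names. It keeps running sums per hypothesis and sample point, so the current value can be read out after every trace without revisiting earlier ones. It returns the squared coefficient, which ranks hypotheses exactly like the absolute value.

```c
#include <stddef.h>

/* Running sums for one (hypothesis, sample point) pair. */
typedef struct {
    size_t n;
    double sx, sy, sxx, syy, sxy;
} corr_acc;

/* Reset the accumulator. */
static void corr_init(corr_acc *c)
{
    c->n = 0;
    c->sx = c->sy = c->sxx = c->syy = c->sxy = 0.0;
}

/* Add one observation: model prediction h and measured sample t. */
static void corr_add(corr_acc *c, double h, double t)
{
    c->n += 1;
    c->sx += h;
    c->sy += t;
    c->sxx += h * h;
    c->syy += t * t;
    c->sxy += h * t;
}

/* Squared Pearson correlation of everything added so far.
 * Returns 0 if either variable has no variance yet. */
static double corr_squared(const corr_acc *c)
{
    double n = (double)c->n;
    double cov = n * c->sxy - c->sx * c->sy;
    double vx = n * c->sxx - c->sx * c->sx;
    double vy = n * c->syy - c->sy * c->sy;
    if (vx <= 0.0 || vy <= 0.0) return 0.0;
    return (cov * cov) / (vx * vy);
}
```

Because the accumulators of disjoint trace subsets can be added field by field, the same structure would also lend itself to the distributed evaluation mentioned above. Plain running sums can lose precision for very long acquisitions, so a production version would likely need a numerically more careful update.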

10.6 Real-World Scenarios

Finally, we imagine that mounting a CPA attack in a real-world scenario could pose a valuable challenge. By real-world scenario we mean a setting in which the many advantages we had with regard to control over the code and the TC1797 are reduced or even missing completely. We worked in a highly optimized environment where nothing but our own code ran on the TC1797. Moreover, we had full control over the chip, the board, and the code running on the target system. In other words, we took a perfect white-box approach. We feel that changing perspectives towards a different setting, maybe in cooperation with other researchers or even product vendors, could be attractive.

10.7 Newest Literature

During the course of working on this thesis we were personally supplied with [CPRR13] by Matthieu Rivain when the paper was not yet officially published. We subsequently switched from RP10 to CPRR13 and applied the same mask refreshing fix to KHL11, thereby creating SKHL13. This makes it obvious that new research findings appear continuously and that ongoing work can be influenced to different extents at any time. Just one week before our deadline the annual Workshop on Cryptographic Hardware and Embedded Systems (CHES) took place. We wish to mention two potentially interesting papers that we consider well-suited for further research. The first paper is titled “Masking vs. Multiparty Computation: How Large is the Gap for AES?” [GSF13]. It provides a comparison of existing masking schemes including CPRR13 and KHL11 and introduces a new technique called Packed Secret Sharing. In addition, the authors discuss requirements for random number generators. The second paper is titled “Block Ciphers That Are Easier to Mask: How Far Can We Go?” [GGNPS13]. The authors pursue the goal of finding block ciphers to which masking is easier to apply compared to existing methods. They also analyze to which extent AES can be tuned in order to facilitate masking. We believe that both papers could be used as starting points for further research around AES implementations that are protected against side-channel attacks.

A Acronyms

ABS Antilock Braking System

ADC Analog-to-Digital Converter

AES Advanced Encryption Standard

API Application Programming Interface

ASC Asynchronous Serial Channel

AUDO AUtomotive unifieD processOr

BLIS Blind Spot Information System

BSI Bundesamt für Sicherheit in der Informationstechnik

CAN Controller Area Network

CDT C/C++ Development Tooling

CHES Workshop on Cryptographic Hardware and Embedded Systems

CPA Correlation Power Analysis

CPU Central Processing Unit

DAVE Digital Application Virtual Engineer

DES Data Encryption Standard

DMI Data Memory Interface

DPA Differential Power Analysis

DSP Digital Signal Processing

EABI Embedded Applications Binary Interface

ECB Electronic Code Book

ECU Electronic Control Unit

EEA Extended Euclidean Algorithm

ELF Executable and Linkable Format

EMF Eclipse Modeling Framework

EMSEC Chair for Embedded Security at Ruhr-Universität Bochum


ESC Electronic Stability Control

ESD Electrostatic Discharge

FADC Fast Analog-to-Digital Converter

GNU GNU’s Not Unix!

GPIO General Purpose I/O

GUI Graphical User Interface

HODPA Higher-Order DPA

IDE Integrated Development Environment

IPA Inferential Power Analysis

I/O Input / Output

JTAG Joint Test Action Group

LDRAM Local Data RAM

LED Light Emitting Diode

LMB Local Memory Bus

LSB Least Significant Bit

MCU Microcontroller Unit

MIA Mutual Information Analysis

MLI Micro Link Interface

MSB Most Significant Bit

MSC Micro Second Channel

NBS National Bureau of Standards

NIST National Institute of Standards and Technology

OCDS On-Chip Debug Support

PCB Printed Circuit Board

PCP Peripheral Control Processor

PMU Program Memory Unit

PRNG Pseudo Random Number Generator

PWM Pulse Width Modulation

RA Return Address

RAM Random Access Memory


RISC Reduced Instruction Set Computer

RNG Random Number Generator

ROM Read-Only Memory

RP Random Permutation

RSI Random Start Index

SCA Side-Channel Analysis

SP Stack Pointer

SPA Simple Power Analysis

SPN Substitution-Permutation Network

SSC Synchronous Serial Channel

TLU Table Lookup

UAD2 Universal Access Device 2

UDE Universal Debug Engine

USB Universal Serial Bus

XOR Exclusive Or

B Appendix

In this appendix we provide information for which we found no suitable place in the main body of the thesis.

B.1 Lookup Tables for the Kim-Hong-Lim Scheme

The authors of [KHL11] use six lookup tables in order to reduce the running time of S-box computations. They do not provide the tables in their paper, which increases the effort required to create an implementation. We computed the lookup tables when we implemented SKHL13. We added one more table that was required for the mask refreshing fix. We provide the lookup tables in the following.

typedef unsigned char BYTE;

/* squarings in GF((2^2)^2) */
const BYTE T1[] = {
    0x0, 0x1, 0x3, 0x2,
    0x6, 0x7, 0x5, 0x4,
    0xD, 0xC, 0xE, 0xF,
    0xB, 0xA, 0x8, 0x9
};

/* two squarings in GF((2^2)^2) */
const BYTE T2[] = {
    0x0, 0x1, 0x2, 0x3,
    0x5, 0x4, 0x7, 0x6,
    0xA, 0xB, 0x8, 0x9,
    0xF, 0xE, 0xD, 0xC
};

/* lambda*X^2 in GF((2^2)^2) */
const BYTE T3[] = {
    0x0, 0xC, 0x8, 0x4,
    0x9, 0x5, 0x1, 0xD,
    0x7, 0xB, 0xF, 0x3,
    0xE, 0x2, 0x6, 0xA
};

/* x^3 in GF((2^2)^2) */
const BYTE T7[] = {
    0x0, 0x1, 0x1, 0x1,
    0xE, 0xD, 0x8, 0xA,
    0xE, 0xA, 0xD, 0x8,
    0xE, 0x8, 0xA, 0xD
};

/* multiplication in GF((2^2)^2) */
const BYTE T4[16][16] = {
    {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
    {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xA, 0xB, 0xC, 0xD, 0xE, 0xF},
    {0x0, 0x2, 0x3, 0x1, 0x8, 0xA, 0xB, 0x9, 0xC, 0xE, 0xF, 0xD, 0x4, 0x6, 0x7, 0x5},
    {0x0, 0x3, 0x1, 0x2, 0xC, 0xF, 0xD, 0xE, 0x4, 0x7, 0x5, 0x6, 0x8, 0xB, 0x9, 0xA},
    {0x0, 0x4, 0x8, 0xC, 0x6, 0x2, 0xE, 0xA, 0xB, 0xF, 0x3, 0x7, 0xD, 0x9, 0x5, 0x1},
    {0x0, 0x5, 0xA, 0xF, 0x2, 0x7, 0x8, 0xD, 0x3, 0x6, 0x9, 0xC, 0x1, 0x4, 0xB, 0xE},
    {0x0, 0x6, 0xB, 0xD, 0xE, 0x8, 0x5, 0x3, 0x7, 0x1, 0xC, 0xA, 0x9, 0xF, 0x2, 0x4},
    {0x0, 0x7, 0x9, 0xE, 0xA, 0xD, 0x3, 0x4, 0xF, 0x8, 0x6, 0x1, 0x5, 0x2, 0xC, 0xB},
    {0x0, 0x8, 0xC, 0x4, 0xB, 0x3, 0x7, 0xF, 0xD, 0x5, 0x1, 0x9, 0x6, 0xE, 0xA, 0x2},
    {0x0, 0x9, 0xE, 0x7, 0xF, 0x6, 0x1, 0x8, 0x5, 0xC, 0xB, 0x2, 0xA, 0x3, 0x4, 0xD},
    {0x0, 0xA, 0xF, 0x5, 0x3, 0x9, 0xC, 0x6, 0x1, 0xB, 0xE, 0x4, 0x2, 0x8, 0xD, 0x7},
    {0x0, 0xB, 0xD, 0x6, 0x7, 0xC, 0xA, 0x1, 0x9, 0x2, 0x4, 0xF, 0xE, 0x5, 0x3, 0x8},
    {0x0, 0xC, 0x4, 0x8, 0xD, 0x1, 0x9, 0x5, 0x6, 0xA, 0x2, 0xE, 0xB, 0x7, 0xF, 0x3},
    {0x0, 0xD, 0x6, 0xB, 0x9, 0x4, 0xF, 0x2, 0xE, 0x3, 0x8, 0x5, 0x7, 0xA, 0x1, 0xC},
    {0x0, 0xE, 0x7, 0x9, 0x5, 0xB, 0x2, 0xC, 0xA, 0x4, 0xD, 0x3, 0xF, 0x1, 0x8, 0x6},
    {0x0, 0xF, 0x5, 0xA, 0x1, 0xE, 0x4, 0xB, 0x2, 0xD, 0x7, 0x8, 0x3, 0xC, 0x6, 0x9}
};

/* isomorphism GF(2^8) -> GF(((2^2)^2)^2) */
const BYTE T5[] = {
    0x00, 0x01, 0x5F, 0x5E, 0x7C, 0x7D, 0x23, 0x22, 0x74, 0x75, 0x2B, 0x2A, 0x08, 0x09, 0x57, 0x56,
    0x46, 0x47, 0x19, 0x18, 0x3A, 0x3B, 0x65, 0x64, 0x32, 0x33, 0x6D, 0x6C, 0x4E, 0x4F, 0x11, 0x10,
    0xB0, 0xB1, 0xEF, 0xEE, 0xCC, 0xCD, 0x93, 0x92, 0xC4, 0xC5, 0x9B, 0x9A, 0xB8, 0xB9, 0xE7, 0xE6,
    0xF6, 0xF7, 0xA9, 0xA8, 0x8A, 0x8B, 0xD5, 0xD4, 0x82, 0x83, 0xDD, 0xDC, 0xFE, 0xFF, 0xA1, 0xA0,
    0x4B, 0x4A, 0x14, 0x15, 0x37, 0x36, 0x68, 0x69, 0x3F, 0x3E, 0x60, 0x61, 0x43, 0x42, 0x1C, 0x1D,
    0x0D, 0x0C, 0x52, 0x53, 0x71, 0x70, 0x2E, 0x2F, 0x79, 0x78, 0x26, 0x27, 0x05, 0x04, 0x5A, 0x5B,
    0xFB, 0xFA, 0xA4, 0xA5, 0x87, 0x86, 0xD8, 0xD9, 0x8F, 0x8E, 0xD0, 0xD1, 0xF3, 0xF2, 0xAC, 0xAD,
    0xBD, 0xBC, 0xE2, 0xE3, 0xC1, 0xC0, 0x9E, 0x9F, 0xC9, 0xC8, 0x96, 0x97, 0xB5, 0xB4, 0xEA, 0xEB,
    0xFC, 0xFD, 0xA3, 0xA2, 0x80, 0x81, 0xDF, 0xDE, 0x88, 0x89, 0xD7, 0xD6, 0xF4, 0xF5, 0xAB, 0xAA,
    0xBA, 0xBB, 0xE5, 0xE4, 0xC6, 0xC7, 0x99, 0x98, 0xCE, 0xCF, 0x91, 0x90, 0xB2, 0xB3, 0xED, 0xEC,
    0x4C, 0x4D, 0x13, 0x12, 0x30, 0x31, 0x6F, 0x6E, 0x38, 0x39, 0x67, 0x66, 0x44, 0x45, 0x1B, 0x1A,
    0x0A, 0x0B, 0x55, 0x54, 0x76, 0x77, 0x29, 0x28, 0x7E, 0x7F, 0x21, 0x20, 0x02, 0x03, 0x5D, 0x5C,
    0xB7, 0xB6, 0xE8, 0xE9, 0xCB, 0xCA, 0x94, 0x95, 0xC3, 0xC2, 0x9C, 0x9D, 0xBF, 0xBE, 0xE0, 0xE1,
    0xF1, 0xF0, 0xAE, 0xAF, 0x8D, 0x8C, 0xD2, 0xD3, 0x85, 0x84, 0xDA, 0xDB, 0xF9, 0xF8, 0xA6, 0xA7,
    0x07, 0x06, 0x58, 0x59, 0x7B, 0x7A, 0x24, 0x25, 0x73, 0x72, 0x2C, 0x2D, 0x0F, 0x0E, 0x50, 0x51,
    0x41, 0x40, 0x1E, 0x1F, 0x3D, 0x3C, 0x62, 0x63, 0x35, 0x34, 0x6A, 0x6B, 0x49, 0x48, 0x16, 0x17
};

/* inv. isom. and aff. tr. */
const BYTE T6[] = {
    0x00, 0x1F, 0x19, 0x06, 0xAD, 0xB2, 0xB4, 0xAB, 0x84, 0x9B, 0x9D, 0x82, 0x29, 0x36, 0x30, 0x2F,
    0x54, 0x4B, 0x4D, 0x52, 0xF9, 0xE6, 0xE0, 0xFF, 0xD0, 0xCF, 0xC9, 0xD6, 0x7D, 0x62, 0x64, 0x7B,
    0x44, 0x5B, 0x5D, 0x42, 0xE9, 0xF6, 0xF0, 0xEF, 0xC0, 0xDF, 0xD9, 0xC6, 0x6D, 0x72, 0x74, 0x6B,
    0x10, 0x0F, 0x09, 0x16, 0xBD, 0xA2, 0xA4, 0xBB, 0x94, 0x8B, 0x8D, 0x92, 0x39, 0x26, 0x20, 0x3F,
    0x45, 0x5A, 0x5C, 0x43, 0xE8, 0xF7, 0xF1, 0xEE, 0xC1, 0xDE, 0xD8, 0xC7, 0x6C, 0x73, 0x75, 0x6A,
    0x11, 0x0E, 0x08, 0x17, 0xBC, 0xA3, 0xA5, 0xBA, 0x95, 0x8A, 0x8C, 0x93, 0x38, 0x27, 0x21, 0x3E,
    0x01, 0x1E, 0x18, 0x07, 0xAC, 0xB3, 0xB5, 0xAA, 0x85, 0x9A, 0x9C, 0x83, 0x28, 0x37, 0x31, 0x2E,
    0x55, 0x4A, 0x4C, 0x53, 0xF8, 0xE7, 0xE1, 0xFE, 0xD1, 0xCE, 0xC8, 0xD7, 0x7C, 0x63, 0x65, 0x7A,
    0xF3, 0xEC, 0xEA, 0xF5, 0x5E, 0x41, 0x47, 0x58, 0x77, 0x68, 0x6E, 0x71, 0xDA, 0xC5, 0xC3, 0xDC,
    0xA7, 0xB8, 0xBE, 0xA1, 0x0A, 0x15, 0x13, 0x0C, 0x23, 0x3C, 0x3A, 0x25, 0x8E, 0x91, 0x97, 0x88,
    0xB7, 0xA8, 0xAE, 0xB1, 0x1A, 0x05, 0x03, 0x1C, 0x33, 0x2C, 0x2A, 0x35, 0x9E, 0x81, 0x87, 0x98,
    0xE3, 0xFC, 0xFA, 0xE5, 0x4E, 0x51, 0x57, 0x48, 0x67, 0x78, 0x7E, 0x61, 0xCA, 0xD5, 0xD3, 0xCC,
    0xB6, 0xA9, 0xAF, 0xB0, 0x1B, 0x04, 0x02, 0x1D, 0x32, 0x2D, 0x2B, 0x34, 0x9F, 0x80, 0x86, 0x99,
    0xE2, 0xFD, 0xFB, 0xE4, 0x4F, 0x50, 0x56, 0x49, 0x66, 0x79, 0x7F, 0x60, 0xCB, 0xD4, 0xD2, 0xCD,
    0xF2, 0xED, 0xEB, 0xF4, 0x5F, 0x40, 0x46, 0x59, 0x76, 0x69, 0x6F, 0x70, 0xDB, 0xC4, 0xC2, 0xDD,
    0xA6, 0xB9, 0xBF, 0xA0, 0x0B, 0x14, 0x12, 0x0D, 0x22, 0x3D, 0x3B, 0x24, 0x8F, 0x90, 0x96, 0x89
};

Listing B.1: Lookup Tables in C for the Kim-Hong-Lim Masking Scheme
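The four small tables in Listing B.1 can be cross-checked against the multiplication table. The following sketch repeats T1, T2, T3, T7, and T4 from the listing and verifies that squaring, double squaring, the scaled square, and the cube all agree with T4. The function name `khl_tables_consistent` is hypothetical, and reading the constant lambda off as T3[1] is an inference from the table rather than a value stated in the text.

```c
typedef unsigned char BYTE;

static const BYTE T1[16] = {0x0,0x1,0x3,0x2,0x6,0x7,0x5,0x4,
                            0xD,0xC,0xE,0xF,0xB,0xA,0x8,0x9};
static const BYTE T2[16] = {0x0,0x1,0x2,0x3,0x5,0x4,0x7,0x6,
                            0xA,0xB,0x8,0x9,0xF,0xE,0xD,0xC};
static const BYTE T3[16] = {0x0,0xC,0x8,0x4,0x9,0x5,0x1,0xD,
                            0x7,0xB,0xF,0x3,0xE,0x2,0x6,0xA};
static const BYTE T7[16] = {0x0,0x1,0x1,0x1,0xE,0xD,0x8,0xA,
                            0xE,0xA,0xD,0x8,0xE,0x8,0xA,0xD};
static const BYTE T4[16][16] = {
    {0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0},
    {0x0,0x1,0x2,0x3,0x4,0x5,0x6,0x7,0x8,0x9,0xA,0xB,0xC,0xD,0xE,0xF},
    {0x0,0x2,0x3,0x1,0x8,0xA,0xB,0x9,0xC,0xE,0xF,0xD,0x4,0x6,0x7,0x5},
    {0x0,0x3,0x1,0x2,0xC,0xF,0xD,0xE,0x4,0x7,0x5,0x6,0x8,0xB,0x9,0xA},
    {0x0,0x4,0x8,0xC,0x6,0x2,0xE,0xA,0xB,0xF,0x3,0x7,0xD,0x9,0x5,0x1},
    {0x0,0x5,0xA,0xF,0x2,0x7,0x8,0xD,0x3,0x6,0x9,0xC,0x1,0x4,0xB,0xE},
    {0x0,0x6,0xB,0xD,0xE,0x8,0x5,0x3,0x7,0x1,0xC,0xA,0x9,0xF,0x2,0x4},
    {0x0,0x7,0x9,0xE,0xA,0xD,0x3,0x4,0xF,0x8,0x6,0x1,0x5,0x2,0xC,0xB},
    {0x0,0x8,0xC,0x4,0xB,0x3,0x7,0xF,0xD,0x5,0x1,0x9,0x6,0xE,0xA,0x2},
    {0x0,0x9,0xE,0x7,0xF,0x6,0x1,0x8,0x5,0xC,0xB,0x2,0xA,0x3,0x4,0xD},
    {0x0,0xA,0xF,0x5,0x3,0x9,0xC,0x6,0x1,0xB,0xE,0x4,0x2,0x8,0xD,0x7},
    {0x0,0xB,0xD,0x6,0x7,0xC,0xA,0x1,0x9,0x2,0x4,0xF,0xE,0x5,0x3,0x8},
    {0x0,0xC,0x4,0x8,0xD,0x1,0x9,0x5,0x6,0xA,0x2,0xE,0xB,0x7,0xF,0x3},
    {0x0,0xD,0x6,0xB,0x9,0x4,0xF,0x2,0xE,0x3,0x8,0x5,0x7,0xA,0x1,0xC},
    {0x0,0xE,0x7,0x9,0x5,0xB,0x2,0xC,0xA,0x4,0xD,0x3,0xF,0x1,0x8,0x6},
    {0x0,0xF,0x5,0xA,0x1,0xE,0x4,0xB,0x2,0xD,0x7,0x8,0x3,0xC,0x6,0x9}
};

/* Returns 1 if T1, T2, T3 and T7 agree with the multiplication table T4,
 * and 0 at the first mismatch. */
static int khl_tables_consistent(void)
{
    const BYTE lambda = T3[1];                       /* lambda * 1^2; inferred */
    unsigned x;
    for (x = 0; x < 16; x++) {
        if (T1[x] != T4[x][x]) return 0;             /* x^2           */
        if (T2[x] != T1[T1[x]]) return 0;            /* (x^2)^2       */
        if (T3[x] != T4[lambda][T1[x]]) return 0;    /* lambda * x^2  */
        if (T7[x] != T4[T1[x]][x]) return 0;         /* x^2 * x = x^3 */
    }
    return 1;
}
```

A check of this kind is cheap to run on the host before flashing the tables to the TC1797 and would catch transcription errors in any of the 4-bit tables.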

List of Figures

3.1 Differential trace for the wrong guess K = 6
3.2 Differential trace for the correct guess K = 43
3.3 Differential trace for the wrong guess K = 1
3.4 Correlation trace for the wrong guess K = 6
3.5 Correlation trace for the correct guess K = 43
3.6 Correlation trace for the wrong guess K = 1

5.1 The TriCore SCA board developed at EMSEC with the TC1797 socketed in the middle, measurement probes on the left and at the bottom, and a JTAG cable on the right
5.2 Schematic picture of the laboratory setup
5.3 The PicoScope 5203 with connectors on the front panel

6.1 GF((2^4)^2) inverter according to Satoh et al.

7.1 Encryption timings measured with naive and fixed MixColumns implementation

8.1 Correlation traces for each of the 16 S-box lookups in the first encryption round, using a Hamming-Weight model
8.2 Correlation for all key hypotheses, targeting the first S-box with a Hamming-Weight model
8.3 Correlation for all key hypotheses, targeting the first S-box with a bit model (MSB)
8.4 Correlation for all key hypotheses, targeting the second S-box with a bit model (MSB)
8.5 Correlation for all key hypotheses, targeting the third S-box with a bit model (MSB)
8.6 SKHL13: Correlation traces for each of the 16 S-box computations in the first encryption round, using a Hamming-Weight model
8.7 SKHL13: Correlation for all key hypotheses, targeting the first S-box in round 1 with a bit model (MSB)
8.8 SKHL13: Correlation for all key hypotheses, targeting the second S-box in round 1 with a bit model (MSB)
8.9 CPRR13: Correlation traces for each of the 16 S-box computations in the first encryption round, using a Hamming-Weight model
8.10 CPRR13: Correlation for all key hypotheses, targeting the first S-box in round 1 with a bit model (MSB)
8.11 CPRR13: Correlation for all key hypotheses, targeting the second S-box in round 1 with a bit model (MSB)

List of Tables

6.1 Comparison of encryption complexity in clock cycles, number of required random bytes, and memory requirements for the discussed variants of AES

7.1 Comparison of encryption timings with naive and fixed MixColumns multiplication
7.2 NIST test vectors used for verification tests
7.3 Timing of the unprotected 32-bit optimized encryption depending on compiler optimization

List of Algorithms

2.2.1 AES-128 Cipher – Encryption
2.2.2 AES-128 Key Expansion
2.2.3 AES-128 Cipher – Decryption

6.2.1 SecMult – 1st-order Secure Multiplication over GF(2^8)
6.2.2 SecExp254 – 1st-order Secure Exponentiation to the 254 over GF(2^8)
6.2.3 RefreshMasks – 1st-order mask refreshing
6.2.4 1st-order Secure S-box according to Rivain and Prouff
6.2.5 SecProc – Secure evaluation of h : x ↦ x · g(x) over GF(2^n)
6.2.6 1st-order Secure Exponentiation to the 254 over GF(2^8) without Mask Refreshing
6.2.7 SecMult4 – 1st-order Secure Multiplication over GF(2^4)
6.2.8 SecInv4 – 1st-order Secure Inversion over GF(2^4)
6.2.9 1st-order secure masking of the AES S-box
6.2.10 1st-order Secure Inversion over GF(2^4)
6.2.11 Shamir’s Secret Sharing Scheme
6.2.12 Shared Multiplication according to Goubin and Martinelli
6.2.13 1st-order Secure Key Expansion
6.2.14 Sharing the Initial State before Encryption

7.2.1 Word-Wise MixColumns
7.2.2 Parallel Multiplication with {02} in GF(2^8)

7.2.1 Word-Wise MixColumns . . . . . . . . . . . . . . . . . . . . . . . . . . . 777.2.2 Parallel Multiplication with {02} in GF(28) . . . . . . . . . . . . . . . . . 78

List of Listings

7.1 Naive multiplication with 𝑥 in GF(28) . . . . . . . . . . . . . . . . . . . . 697.2 Naive multiplication with (𝑥 + 1) in GF(28) . . . . . . . . . . . . . . . . . 707.3 MixColumns code excerpt for the first column of the state . . . . . . . . . 707.4 Naive attempt at fixing the multiplication with 𝑥 in GF(28) . . . . . . . . 717.5 Fixed multiplication with 𝑥 in GF(28) . . . . . . . . . . . . . . . . . . . . 727.6 Assembly output for the multiplication with 𝑥 in GF(28) . . . . . . . . . . 727.7 MixColumns function from the naive implementation . . . . . . . . . . . . 767.8 Affine transformation in C . . . . . . . . . . . . . . . . . . . . . . . . . . . 797.9 Affine transformation in C, optimized for the TC1797 . . . . . . . . . . . 807.10 Shared AES Encryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . 827.11 Using the section pragma to move data to the LDRAM . . . . . . . . . . 87

B.1 Lookup Tables in C for the Kim-Hong-Lim Masking Scheme

Bibliography

[AG01] Mehdi-Laurent Akkar and Christophe Giraud. An Implementation of DES and AES, Secure against Some Attacks. In Koç et al. [KNP01], pages 309–318.

[BB02] Elad Barkan and Eli Biham. In How Many Ways Can You Write Rijndael? In Yuliang Zheng, editor, ASIACRYPT, volume 2501 of Lecture Notes in Computer Science, pages 160–175. Springer, 2002.

[BBF+02] Guido Bertoni, Luca Breveglieri, Pasqualina Fragneto, Marco Macchetti, and Stefano Marchesin. Efficient Software Implementation of AES on 32-Bit Platforms. In Kaliski Jr. et al. [KKP03], pages 159–171.

[BC13] Guido Bertoni and Jean-Sébastien Coron, editors. Cryptographic Hardware and Embedded Systems - CHES 2013 - 15th International Workshop, Santa Barbara, CA, USA, August 20-23, 2013. Proceedings, volume 8086 of Lecture Notes in Computer Science. Springer, 2013.

[BCO04] Eric Brier, Christophe Clavier, and Francis Olivier. Correlation Power Analysis with a Leakage Model. In Joye and Quisquater [JQ04], pages 16–29.

[BS90] Eli Biham and Adi Shamir. Differential Cryptanalysis of DES-like Cryptosystems. In Alfred Menezes and Scott A. Vanstone, editors, CRYPTO, volume 537 of Lecture Notes in Computer Science, pages 2–21. Springer, 1990.

[BS92] Eli Biham and Adi Shamir. Differential Cryptanalysis of the Full 16-Round DES. In Ernest F. Brickell, editor, CRYPTO, volume 740 of Lecture Notes in Computer Science, pages 487–496. Springer, 1992.

[BS99] Eli Biham and Adi Shamir. Power Analysis of the Key Scheduling of the AES Candidates. 1999.

[CPR12] Jean-Sébastien Coron, Emmanuel Prouff, and Thomas Roche. On the Use of Shamir’s Secret Sharing against Side-Channel Analysis. In Stefan Mangard, editor, CARDIS, volume 7771 of Lecture Notes in Computer Science, pages 77–90. Springer, 2012.

[CPRR13] Jean-Sébastien Coron, Emmanuel Prouff, Matthieu Rivain, and Thomas Roche. Higher-Order Side Channel Security and Mask Refreshing. To appear in the proceedings of FSE, 2013.

[CT03] Jean-Sébastien Coron and Alexei Tchulkine. A New Algorithm for Switching from Arithmetic to Boolean Masking. In Colin D. Walter, Çetin Kaya Koç, and Christof Paar, editors, CHES, volume 2779 of Lecture Notes in Computer Science, pages 89–97. Springer, 2003.

[Deb12] Blandine Debraize. Efficient and Provably Secure Methods for Switching from Arithmetic to Boolean Masking. In Prouff and Schaumont [PS12], pages 107–121.

[DR02] Joan Daemen and Vincent Rijmen. The Design of Rijndael: AES - The Advanced Encryption Standard. Springer, Berlin, 2002.

[Erd02] Philip J. Erdelsky. Rijndael Encryption Algorithm, 2002. Available at http://www.efgh.com/software/rijndael.htm.

[FIP99] Announcing the Standard for DATA ENCRYPTION STANDARD (DES). Information Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, 25 October 1999.

[FIP01] Announcing the Advanced Encryption Standard (AES). Computer Security Division, Information Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, 26 November 2001.

[FP99] Paul N. Fahn and Peter K. Pearson. IPA: A New Class of Power Attacks. In Koç and Paar [KP99], pages 173–186.

[GBTP08] Benedikt Gierlichs, Lejla Batina, Pim Tuyls, and Bart Preneel. Mutual Information Analysis. In Elisabeth Oswald and Pankaj Rohatgi, editors, CHES, volume 5154 of Lecture Notes in Computer Science, pages 426–442. Springer, 2008.

[GGNPS13] Benoît Gérard, Vincent Grosso, María Naya-Plasencia, and François-Xavier Standaert. Block Ciphers That Are Easier to Mask: How Far Can We Go? In Bertoni and Coron [BC13], pages 383–399.

[Gla99] Brian Gladman. Implementation Experience with AES Candidate Algorithms. Technical report, Second AES Conference, 28th February 1999.

[Gla07] Brian Gladman. A Specification for Rijndael, the AES Algorithm, 1st August 2007. Version 3.16.

[GM11] Louis Goubin and Ange Martinelli. Protecting AES with Shamir’s Secret Sharing Scheme. In Preneel and Takagi [PT11], pages 79–94.

[Gou01] Louis Goubin. A Sound Method for Switching between Boolean and Arithmetic Masking. In Koç et al. [KNP01], pages 3–15.

[GP97] Jorge Guajardo and Christof Paar. Efficient algorithms for elliptic curve cryptosystems. In Burton S. Kaliski Jr., editor, CRYPTO, volume 1294 of Lecture Notes in Computer Science, pages 342–356. Springer, 1997.

[GP99] Louis Goubin and Jacques Patarin. DES and Differential Power Analysis (The "Duplication" Method). In Koç and Paar [KP99], pages 158–172.

[GPQ10] Laurie Genelle, Emmanuel Prouff, and Michaël Quisquater. Secure Multiplicative Masking of Power Functions. In Jianying Zhou and Moti Yung, editors, ACNS, volume 6123 of Lecture Notes in Computer Science, pages 200–217, 2010.

[GSF13] Vincent Grosso, François-Xavier Standaert, and Sebastian Faust. Masking vs. Multiparty Computation: How Large Is the Gap for AES? In Bertoni and Coron [BC13], pages 400–416.

[GT02] Jovan Dj. Golic and Christophe Tymen. Multiplicative Masking and Power Analysis of AES. In Kaliski Jr. et al. [KKP03], pages 198–212.

[Hig10] HighTec EDV-Systeme GmbH. User’s Guide, HighTec GNU Toolchain for TriCore, 2010. Version 1.20.

[HJM06] Martin Hell, Thomas Johansson, and Willi Meier. Grain - A Stream Cipher for Constrained Environments. 2006.

[Hoh09] Andreas Hoheisel. Side-Channel Analysis Resistant Implementation of AES on Automotive Processors. Master’s thesis, Ruhr-Universität Bochum, 2009.

[HOM06] Christoph Herbst, Elisabeth Oswald, and Stefan Mangard. An AES Smart Card Implementation Resistant to Power Analysis Attacks. In Jianying Zhou, Moti Yung, and Feng Bao, editors, ACNS, volume 3989 of Lecture Notes in Computer Science, pages 239–252, 2006.

[ISW03] Yuval Ishai, Amit Sahai, and David Wagner. Private Circuits: Securing Hardware against Probing Attacks. In Dan Boneh, editor, CRYPTO, volume 2729 of Lecture Notes in Computer Science, pages 463–481. Springer, 2003.

[ITA02] Infineon Technologies AG. Architecture Overview Handbook, TriCore™ 1.3 32-bit Unified Processor Core, May 2002. Version 1.3.3, available for download at http://www.infineon.com/tricore.

[ITA04] Infineon Technologies AG. TriCore® 1 Pipeline Behaviour & Instruction Execution Timing, Application Note, June 2004. Version 1.1, available for download at http://www.infineon.com/tricore.

[ITA07] Infineon Technologies AG. User’s Manual, TriCore 32-bit Unified Processor Core, Embedded Applications Binary Interface (EABI), February 2007. Version 2.3, available for download at http://www.infineon.com/tricore.

[ITA08a] Infineon Technologies AG. User’s Manual, TriCore® 1, 32-bit Unified Processor Core, Volume 1, Core Architecture, V1.3 & V1.3.1 Architecture, January 2008. Version 1.3.8, available for download at http://www.infineon.com/tricore.

[ITA08b] Infineon Technologies AG. User’s Manual, TriCore® 1, 32-bit Unified Processor Core, Volume 2, Instruction Set, V1.3 & V1.3.1 Architecture, January 2008. Version 1.3.8, available for download at http://www.infineon.com/tricore.

[ITA09a] Infineon Technologies AG. TC1797 32-Bit Single-Chip Microcontroller, Data Sheet, September 2009. Version 1.2, available for download at http://www.infineon.com/tricore.

[ITA09b] Infineon Technologies AG. TC1797 32-Bit Single-Chip Microcontroller, User’s Manual, May 2009. Version 1.1, available for download at http://www.infineon.com/tricore.

[ITA11] Infineon Technologies AG. Design Guideline for TC1797 Microcontroller Board Layout, December 2011. Version 2.5, available for download at http://www.infineon.com/tricore.

[ITA12] Infineon Technologies AG. Highly Integrated and Performance Optimized 32-bit Microcontrollers for Automotive and Industrial Applications, August 2012. Product brochure, available for download at http://www.infineon.com/tricore.

[JQ04] Marc Joye and Jean-Jacques Quisquater, editors. Cryptographic Hardware and Embedded Systems - CHES 2004: 6th International Workshop Cambridge, MA, USA, August 11-13, 2004. Proceedings, volume 3156 of Lecture Notes in Computer Science. Springer, 2004.

[Kas11] Timo Kasper. Security Analysis of Pervasive Wireless Devices - Physical and Protocol Attacks in Practice. PhD thesis, Ruhr-Universität Bochum, Bochum, Germany, September 2011.

[KHL11] HeeSeok Kim, Seokhie Hong, and Jongin Lim. A Fast and Provably Secure Higher-Order Masking of AES S-Box. In Preneel and Takagi [PT11], pages 95–107.

[KJJ98] Paul Kocher, Joshua Jaffe, and Benjamin Jun. Differential Power Analysis, 1998.

[KKP03] Burton S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors. Cryptographic Hardware and Embedded Systems - CHES 2002, 4th International Workshop, Redwood Shores, CA, USA, August 13-15, 2002, Revised Papers, volume 2523 of Lecture Notes in Computer Science. Springer, 2003.

[KNP01] Çetin Kaya Koç, David Naccache, and Christof Paar, editors. Cryptographic Hardware and Embedded Systems - CHES 2001, Third International Workshop, Paris, France, May 14-16, 2001, Proceedings, volume 2162 of Lecture Notes in Computer Science. Springer, 2001.

[Knu97] Donald E. Knuth. The art of computer programming, volume 2 (3rd ed.): seminumerical algorithms. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1997.

[Koc96] Paul Kocher. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems, 1996.

[KP99] Çetin Kaya Koç and Christof Paar, editors. Cryptographic Hardware and Embedded Systems, First International Workshop, CHES’99, Worcester, MA, USA, August 12-13, 1999, Proceedings, volume 1717 of Lecture Notes in Computer Science. Springer, 1999.

[KS11] Wolfgang Killmann and Werner Schindler. A proposal for: Functionality classes for random number generators. Bundesamt für Sicherheit in der Informationstechnik, 18 September 2011. Version 2.0.

[Man02] Stefan Mangard. A Simple Power-Analysis (SPA) Attack on Implementations of the AES Key Expansion. In Pil Joong Lee and Chae Hoon Lim, editors, ICISC, volume 2587 of Lecture Notes in Computer Science, pages 343–358. Springer, 2002.

[Man04] Stefan Mangard. Hardware Countermeasures against DPA - A Statistical Analysis of Their Effectiveness. In Tatsuaki Okamoto, editor, CT-RSA, volume 2964 of Lecture Notes in Computer Science, pages 222–235. Springer, 2004.

[Mes00] Thomas S. Messerges. Securing the AES Finalists Against Power Analysis Attacks. In Bruce Schneier, editor, FSE, volume 1978 of Lecture Notes in Computer Science, pages 150–164. Springer, 2000.

[MNBSL10] A. Monot, N. Navet, B. Bavoux, and F. Simonot-Lion. Multi-core scheduling in automotive ECUs. May 2010. Available at http://nicolas.navet.eu/publi/ertss_2010.pdf, slides available at http://nicolas.navet.eu/publi/ERTSS2010_MulticoreScheduling.pdf.

[MOP07] Stefan Mangard, Elisabeth Oswald, and Thomas Popp. Power Analysis Attacks - Revealing the Secrets of Smart Cards. Springer, 2007.

[MY92] Mitsuru Matsui and Atsuhiro Yamagishi. A New Method for Known Plaintext Attack of FEAL Cipher. In Rainer A. Rueppel, editor, EUROCRYPT, volume 658 of Lecture Notes in Computer Science, pages 81–91. Springer, 1992.

[NIS01] Recommendation for Block Cipher Modes of Operation. Technical report, Information Technology Laboratory (National Institute of Standards and Technology), Gaithersburg, MD, December 2001.

[NP04] Olaf Neiße and Jürgen Pulkus. Switching Blindings with a View Towards IDEA. In Joye and Quisquater [JQ04], pages 230–239.

[Osw09] David Oswald. Development of an Integrated Environment for Side Channel Analysis and Fault Injection. Diploma thesis, Ruhr-Universität Bochum, September 2009.

[Paa94] Christof Paar. Efficient VLSI Architectures for Bit-Parallel Computation in Galois Fields. PhD thesis, Universität Essen, Essen, Germany, June 1994.

[Paa12] Christof Paar. Implementation of Cryptographic Schemes 1. Version 1.8.1. Lecture Script, Ruhr-Universität Bochum, January 2012.

[PR09] Emmanuel Prouff and Matthieu Rivain. Theoretical and Practical Aspects of Mutual Information Based Side Channel Analysis. In Michel Abdalla, David Pointcheval, Pierre-Alain Fouque, and Damien Vergnaud, editors, ACNS, volume 5536 of Lecture Notes in Computer Science, pages 499–518, 2009.

[PS12] Emmanuel Prouff and Patrick Schaumont, editors. Cryptographic Hardware and Embedded Systems - CHES 2012 - 14th International Workshop, Leuven, Belgium, September 9-12, 2012. Proceedings, volume 7428 of Lecture Notes in Computer Science. Springer, 2012.

[PT11] Bart Preneel and Tsuyoshi Takagi, editors. Cryptographic Hardware and Embedded Systems - CHES 2011 - 13th International Workshop, Nara, Japan, September 28 - October 1, 2011. Proceedings, volume 6917 of Lecture Notes in Computer Science. Springer, 2011.

[RDJ+01] Atri Rudra, Pradeep K. Dubey, Charanjit S. Jutla, Vijay Kumar, Josyula R. Rao, and Pankaj Rohatgi. Efficient Rijndael Encryption Implementation with Composite Field Arithmetic. In Koç et al. [KNP01], pages 171–184.

[RP10a] Matthieu Rivain and Emmanuel Prouff. Provably Secure Higher-Order Masking of AES. Cryptology ePrint Archive, Report 2010/441, 2010. Available at http://eprint.iacr.org/2010/441.

[RP10b] Matthieu Rivain and Emmanuel Prouff. Provably Secure Higher-Order Masking of AES. In Stefan Mangard and François-Xavier Standaert, editors, CHES, volume 6225 of Lecture Notes in Computer Science, pages 413–427. Springer, 2010.

[Sch99] Dr. Werner Schindler. Functionality Classes and Evaluation Methodology for Deterministic Random Number Generators. Bundesamt für Sicherheit in der Informationstechnik, 2 December 1999. Version 2.0.

[Sha79] Adi Shamir. How to Share a Secret. Commun. ACM, 22(11):612–613, 1979.

[SMTM01] Akashi Satoh, Sumio Morioka, Kohji Takano, and Seiji Munetoh. A Compact Rijndael Hardware Architecture with S-Box Optimization. In Colin Boyd, editor, ASIACRYPT, volume 2248 of Lecture Notes in Computer Science, pages 239–254. Springer, 2001.

[SNK+12] Alexander Schlösser, Dmitry Nedospasov, Juliane Krämer, Susanna Orlic, and Jean-Pierre Seifert. Simple Photonic Emission Analysis of AES - Photonic Side Channel Analysis for the Rest of Us. In Prouff and Schaumont [PS12], pages 41–57.

[ST04] Adi Shamir and Eran Tromer. Acoustic cryptanalysis – On nosy people and noisy machines, May 2004. Available online at http://www.cs.tau.ac.il/~tromer/acoustic/.

[SVCO+10] François-Xavier Standaert, Nicolas Veyrat-Charvillon, Elisabeth Oswald, Benedikt Gierlichs, Marcel Medwed, Markus Kasper, and Stefan Mangard. The World Is Not Enough: Another Look on Second-Order DPA. In Masayuki Abe, editor, ASIACRYPT, volume 6477 of Lecture Notes in Computer Science, pages 112–129. Springer, 2010.

[TSG02] Elena Trichina, Domenico De Seta, and Lucia Germani. Simplified Adaptive Multiplicative Masking for AES. In Kaliski Jr. et al. [KKP03], pages 187–197.

[VCMKS12] Nicolas Veyrat-Charvillon, Marcel Medwed, Stéphanie Kerckhof, and François-Xavier Standaert. Shuffling against Side-Channel Attacks: A Comprehensive Study with Cautionary Note. In Xiaoyun Wang and Kazue Sako, editors, ASIACRYPT, volume 7658 of Lecture Notes in Computer Science, pages 740–757. Springer, 2012.
