
Choosing the Future of Lightweight Encryption Algorithms

João Carlos Santos Fernandes

Thesis to obtain the Master of Science Degree in

Information Systems and Computer Engineering

Supervisors: Prof. Ricardo Jorge Fernandes Chaves
Prof. Tiago Miguel Braga da Silva Dias

Examination Committee

Chairperson: Prof. José Luís Brinquete Borbinha

Supervisor: Prof. Ricardo Jorge Fernandes Chaves

Member of the Committee: Prof. Alberto Manuel Ramos da Cunha

October 2018

To my parents


Acknowledgments

I would like to thank my advisers, Prof. Ricardo Chaves and Prof. Tiago Dias, for all their guidance,
help and patience with me. Without them, it would not have been possible for this work to reach its
end. I also thank the PhD student Ricardo Macas for the assistance he provided in the
experimental part of the energy consumption calculation.

Thank You Very Much

I would also like to thank the most important people, not only over these 2 years but throughout
all 23 years of my life: my parents. Without them I would never have got to where I am today, nor
completed this master's degree at a university far from home. I also want to thank my sister and the
rest of my family for all the support they gave me while I was working on my thesis. I am especially
grateful to my girlfriend, who had to listen to me whenever I felt less motivated during this final
stage of my academic path, and who supported and motivated me tirelessly. Lastly, I thank my friends,
for they too supported me whenever I needed it.

To all,

Thank you very much


Resumo

Lightweight cryptography is a field that has grown considerably in recent years, driven by the
explosion of the Internet of Things (IoT). Its goal is to develop algorithms that operate with limited
resources (memory, processing power and energy). This thesis aims to deepen the state of the art by
analyzing and selecting a set of lightweight ciphers and optimizing them for a class of processors
widely used in the IoT (ARM Cortex-M3). The analysis considers different metrics, such as code size,
execution time and energy consumption. The selected ciphers were AES, CLEFIA, NOEKEON, PRESENT,
RECTANGLE, RoadRunneR, SPARX and SPECK. They were improved using techniques such as table-based
optimization, bit-slicing and code optimizations (for example, reordering of operations, function
inlining, unrolling, etc.). Table-based optimization improved the performance of AES and CLEFIA by
more than 10×. For NOEKEON, an implementation is proposed that improves performance by 3.2× while
reducing code size by 21%. The proposed RECTANGLE optimization is 1% faster and 10% smaller than the
optimized C version by its authors. SPARX is 2.72× faster and 1% smaller than the optimized C version
by its authors. The performance of SPECK was improved by 1.4×, at a cost of 5% in code size. Finally,
an analysis of the energy consumption of the ciphers is also presented, based on experimentally
obtained values, something not found in the state of the art.

Keywords: Lightweight Cryptography, Block Ciphers, Internet of Things (IoT), ARM Cortex-M3
Processor, Performance Optimization, Energy Consumption


Abstract

Lightweight Cryptography is a field that has been growing significantly in recent years, mainly
because of the explosion of the Internet of Things (IoT). It aims to develop algorithms that can operate
under limited resources (memory, processing power and energy). This thesis aims to complement the state
of the art by analyzing and selecting a set of lightweight ciphers and optimizing them for a class of
processors widely used in the IoT (ARM Cortex-M3). The analysis considered different metrics, such as
code size, execution time and energy consumption. The selected ciphers were AES, CLEFIA, NOEKEON,
PRESENT, RECTANGLE, RoadRunneR, SPARX and SPECK. Their performance was improved using techniques
such as table-based optimization, bit-slicing and code optimizations (e.g., rearrangement of operations,
function inlining, unrolling, etc.). The table-based optimizations sped up the execution of AES and
CLEFIA by more than 10×. For NOEKEON, the proposed optimization speeds up its performance by 3.2×
and reduces its code size by 21%. The proposed optimization of RECTANGLE reduces the execution
time by 1% and the code size by 10% when compared to the optimized C version by the cipher's authors.
For SPARX, the execution time was sped up by 2.72× and the code size is 1% lower than that of the
optimized C version by the cipher's authors. The SPECK execution time was improved by 1.4×, with
only a 5% increase in code size. Finally, an analysis of the energy consumption of the block ciphers was
made using experimentally obtained results, something that has not yet been done in the state of the art.

Keywords: Lightweight cryptography, Block Ciphers, Internet of Things (IoT), ARM Cortex-M3

Processor, Performance Optimization, Energy Consumption


Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
Acronyms

1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Thesis Outline

2 Background and State Of The Art
2.1 Constrained Processors
2.2 Cryptography
2.3 Algorithms
2.4 Metrics and Tools
2.5 Optimizations
2.5.1 Algorithm Optimizations
2.5.2 Code Optimizations
2.6 Summary

3 Proposed Improvements to Lightweight Encryption
3.1 Target Platform
3.2 Considered Ciphers
3.3 Ciphers Configuration Selection
3.3.1 Performance Evaluation for Different Key Sizes
3.3.2 Performance Evaluation for Different Block Sizes
3.4 Optimized Implementations
3.4.1 Algorithm Optimizations
3.4.2 Code Optimizations
3.5 Summary

4 Evaluation and Results
4.1 Reference Implementations
4.2 Algorithm Optimizations
4.3 Code Optimizations
4.4 Comparative Analysis
4.5 Energy Consumption
4.6 Summary

5 Conclusions
5.1 Achievements
5.2 Future Work

References

A Pseudo-Code of the Ciphers Encryption

B Implemented Versions

C Small, Fast and Balanced Versions
C.1 Small Implementations
C.2 Fast Implementations
C.3 Balanced Implementations

D ARM Cortex-M3
D.1 Registers
D.2 Operating Modes
D.3 Memory Mapping
D.4 MPU
D.5 Bus Interfaces
D.6 Bit Banding
D.7 Interrupt Controller
D.8 Instruction Set Architecture
D.9 Data Path and Pipeline
D.10 Debugging Support
D.11 Power Consumption


List of Tables

3.1 Block Ciphers Characteristics
4.1 Summary of Best Proposed Implementations Results
B.1 Optimizations included in the Implemented Cipher Versions


List of Figures

2.1 Caesar Cipher Illustration
2.2 One Time Pad
2.3 Asymmetric Encryption Example
2.4 SPN Design
2.5 FN Design
2.6 ARX Design
2.7 Illustration of CLEFIA round
2.8 Illustration of CLEFIA F Functions
2.9 Illustration of NOEKEON Round
2.10 Illustration of PRESENT Round
2.11 Illustration of RoadRunneR Round, F Function and SLK Structure
2.12 Illustration of SPARX Round
2.13 Illustration of the Speckey S-Box
2.14 SPARX block sizes, key sizes and rounds
2.15 SPECK block sizes, key sizes and rounds
2.16 Illustration of RECTANGLE operations applied to the cipher state
2.17 Illustration of SPECK Round
3.1 RECTANGLE Key Comparison - Code Size
3.2 RECTANGLE Key Comparison - Execution Time
3.3 RoadRunneR Key Comparison - Code Size
3.4 RoadRunneR Key Comparison - Execution Time
3.5 SPARX Block Comparison - Code Size
3.6 SPARX Block Comparison - Execution Time
3.7 SPECK Block Comparison - Code Size
3.8 SPECK Block Comparison - Execution Time
4.1 Code Size and Execution Time of the Reference Implementations
4.2 T-Box Optimizations Evaluation
4.3 AES T-Box Comparison
4.4 CLEFIA T-Box Comparison
4.5 AES Code Optimizations Results
4.6 AES Relative Gain/Loss
4.7 CLEFIA Code Optimizations Results
4.8 CLEFIA Relative Gain/Loss
4.9 NOEKEON Code Optimizations Results
4.10 NOEKEON Relative Gain/Loss
4.11 PRESENT Code Optimizations Results
4.12 PRESENT Relative Gain/Loss
4.13 RECTANGLE Code Optimizations Results
4.14 RECTANGLE Relative Gain/Loss
4.15 RoadRunneR Code Optimizations Results
4.16 RoadRunneR Relative Gain/Loss
4.17 SPARX Code Optimizations Results
4.18 SPARX Relative Gain/Loss
4.19 SPECK Code Optimizations Results
4.20 SPECK Relative Gain/Loss
4.21 Code Size Results of the different implementations
4.22 Execution Time Results of the different implementations
4.23 Efficiency Results of the different implementations
4.24 Proposed Key Schedule Implementations
4.25 Proposed Implementations vs State Of Art Results
4.26 Assembled Circuit for the Measurements
4.27 Energy Consumption per Block/Key Schedule (µJ)
D.1 ARM Cortex-M3 Structure
D.2 ARM Cortex-M3 Memory Map
D.3 Thumb vs Thumb-2 vs ARM ISA’s
D.4 ARM Cortex-M3 Datapath
D.5 ARM Cortex-M3 Pipeline


List of Acronyms

AES Advanced Encryption Standard.

ARM Advanced RISC Machine.

ARX Addition Rotation XOR.

CBC Cipher Block Chaining.

CTR Counter.

DES Data Encryption Standard.

FN Feistel Network.

GE Gate Equivalent.

GFN Generalized Feistel Network.

IoT Internet of Things.

ISA Instruction Set Architecture.

NIST National Institute of Standards and Technology.

OTP One-Time Pad.

RISC Reduced Instruction Set Computer.

RRR RoadRunneR.

SPN Substitution Permutation Network.

WSN’s Wireless Sensor Networks.

XOR Exclusive-Or.


Chapter 1

Introduction

1.1 Motivation

The Internet of Things (IoT) [1] refers to the networked interconnection of everyday objects, which are
often equipped with ubiquitous intelligence. In the IoT World everything is interconnected and exchanging
data. This creates opportunities for a more direct integration of the physical world into computer-based
systems, resulting in efficiency improvements, economic benefits, and reduced human effort.

Home appliances, surveillance cameras, traffic lights, vehicles, medical devices, watches, fitness
bands and different types of sensors (humidity, temperature, light, noise, etc.) are some current
examples of devices in the IoT World. The IoT is being applied in environments such as Smart Houses [2],
Smart Cities [3][4], Industry 4.0 [5], Smart Cars and Roads [6] and Smart Agriculture [7].

This phenomenon is growing every year. The number of IoT devices is expected to reach
26 billion by 2020 [8], representing a global market value of $7.1 trillion [9]. Projects like
Raspberry Pi [10], Arduino [11] and NodeMCU [12] are helping to grow the popularity of the IoT,
allowing people to easily develop their own cyber-physical devices.

Most IoT devices operate on limited resources because of their tight cost constraints, inherent to their
mass deployment, and the fact that they are mainly battery powered. Because they do not have an
unlimited power source, their energy consumption needs to be small. All of these factors lead to devices
with limited memory and computing power. Therefore, the data processing algorithms, communication
protocols and underlying technologies involved must be carefully chosen to meet the strict operating
requirements of these devices.

Another critical challenge of the IoT World is security [13][14][15][16]. IoT devices are permanently
connected to the Internet and can handle large amounts of sensitive data. Such information is quite
often private or safety critical, and therefore must be protected from malicious attackers. The
application of secure cryptographic components thus becomes imperative.

However, the well-known cryptographic primitives used in traditional systems are not necessarily
well suited for such constrained devices. They require too much memory and processing power
(to execute quickly), which leads to high energy consumption.

As a consequence, a new field of research has emerged in the cryptographic world: Lightweight
Cryptography [17]. It comprises the research and development of new cryptographic primitives
and algorithms that execute quickly, have small memory footprints and consume very little energy,
aiming at a good trade-off between security, cost (in terms of memory footprint and energy
consumption), and performance.

In recent years, several articles have been published addressing Lightweight Cryptography issues (e.g.
[17] [18] [19] [20] [21] [22]), such as the proposal of new lightweight ciphers. Most of these works
focus on hardware implementations [23] [24]. However, with the growth of the IoT and its open
source projects, the investigation of software-oriented lightweight ciphers suitable for implementation
on constrained devices has been gaining importance.

1.2 Objectives

The main goal of this research work is to study existing lightweight encryption algorithms in order to
identify the most suitable ones for software implementation on constrained devices. To this end, the
most recent and commonly used lightweight algorithms must be evaluated based on their characteristics,
such as complexity, memory requirements, and offered performance levels. This assessment also
encompasses porting the most relevant selected algorithms to an Arduino Due board [25], which is
powered by a well-known constrained processor, and evaluating them against other popular algorithms.

In summary, the main objectives of this work are:

• Select a set of the most promising software-oriented lightweight algorithms proposed by the
community, based on their characteristics, implementations and optimizations.

• Provide efficient implementations of such algorithms, in terms of performance, memory footprint

and energy consumption.

• Evaluate the developed implementations to find the best algorithm and compare it with the state of

the art.

1.3 Thesis Outline

The remainder of this document is organized as follows:

Chapter 2 introduces the dissertation topic and provides the background needed to understand this
thesis, namely constrained processors, cryptography and lightweight cryptography. It also provides an
overview of the current state of the art, divided into four topics: ciphers, metrics, tools, and
optimizations.

Chapter 3 describes the work proposed to address the gaps identified in the state-of-the-art
analysis. First, the target platform is presented. Then, the chosen ciphers are enumerated. After that,
the best configuration for each cipher is analyzed, taking the target platform into consideration.
Finally, the proposed optimizations are described, detailing how, why, and which state-of-the-art
optimizations were applied to each cipher.

Chapter 4 presents the results obtained for the proposed algorithm and code optimizations. First, it
gives an overview of the performance and code size of the reference implementations. Then, the results
of the proposed algorithm and code optimizations are analyzed for each cipher. Furthermore, it provides
a comparative analysis between the proposed optimized ciphers, their reference implementations and
other state-of-the-art ciphers. Finally, it presents an experimental energy consumption analysis.

Chapter 5 concludes the thesis by summarizing the achievements of this work and disclosing some

future work possibilities.

Four appendices, summarizing and complementing some of the information presented in this thesis, are
provided at the end of the document.


Chapter 2

Background and State Of The Art

This chapter presents the most relevant background information for this research work. First, a short
introduction to constrained processors is presented. After that, the fundamental concepts of
cryptography are reviewed. Then, the metrics and tools most commonly used to characterize and evaluate
ciphers and their encryption and decryption procedures are introduced. Finally, the state-of-the-art
techniques employed to optimize ciphers are reviewed.

2.1 Constrained Processors

The IoT is characterized by heterogeneous devices. Sensor nodes, home appliances, smartphones,
and vehicles are expected to be interconnected by both wireless and wired networks.

Some of the best-known examples of the IoT are Wireless Sensor Networks (WSN’s) [26], which are
groups of spatially dispersed, dedicated sensors for monitoring and recording the physical conditions
of the environment. WSN’s measure environmental conditions such as temperature, sound, pollution
levels, humidity, wind, and so on. They are the foundation for bringing concepts like Smart Cities [3][4]
and Smart Cars and Roads [6] into everyday life. Another example is intelligent personal assistants
such as Amazon Echo [27] and Google Home [28], which can control household devices like smart lamps,
smart cameras, sensors, vacuum cleaning robots, etc. Also, projects like Raspberry Pi [10], Arduino [11]
and NodeMCU [12] make it easier for people to develop their own cyber-physical devices.

The IoT comprises different types of devices, but all of them share one characteristic: they are powered
by processors that are considered constrained. Constrained processors have restricted memory (RAM and
ROM), processing power (small architectures and low clock frequencies) and energy capacity. They
operate on limited resources because of the tight cost constraints inherent to their mass deployment,
and because they must have low power consumption (since most of them are battery powered).

These processors span a wide spectrum of architectures: they can be 8-bit, 16-bit and, more
recently, 32-bit. They also typically use a Reduced Instruction Set Computer (RISC) Instruction
Set Architecture (ISA).

As mentioned before, 32-bit processors have more recently joined the IoT world. One of the
biggest drivers of this trend is the Advanced RISC Machine (ARM) company. ARM processors
[29] are very popular in smartphones, TV boxes, tablets and even some personal computers. They
are also becoming very popular in IoT devices [30], where a little more processing power is needed
but low energy consumption is still important. An example is the ARM Cortex-M family, whose
processors are optimized for deterministic real-time embedded processing and microcontroller
applications. They are the smallest and lowest-power Cortex processors [31].

2.2 Cryptography

“Historically, cryptography arose as a means to enable parties to maintain privacy of the
information they send to each other, even in the presence of an adversary with access to the
communication channel.” [32] More generally, cryptography is the design and analysis of algorithms,
or protocols, that prevent the public, or third parties, from reading private information.

In cryptography, a cipher is an algorithm providing encryption or decryption of data, which has a

series of well-defined steps that must be followed as a procedure to conceal or reveal the data using a

secret (the key).

One of the first well-known ciphers was the “Caesar Cipher” (depicted in Figure 2.1), which
Julius Caesar used to protect his private correspondence. This cipher consisted of shifting each letter
of the message by 3 positions. The encryption and decryption procedures can be described by two
mathematical functions (2.1 for encryption, 2.2 for decryption), where x represents the letter to be
encrypted (as its position in the alphabet) and n the shift applied to the letters, also known as the
key of the cipher.

Figure 2.1: Caesar Cipher Illustration

$E_n(x) = (x + n) \bmod 26$ (2.1)

$D_n(x) = (x - n) \bmod 26$ (2.2)
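Equations 2.1 and 2.2 can be illustrated with a minimal Python sketch. It assumes the 26-letter uppercase Latin alphabet, with x taken as the position of the letter (0 to 25); the function names `caesar_encrypt` and `caesar_decrypt` are chosen here for illustration only.

```python
import string

ALPHABET = string.ascii_uppercase  # 26 letters, so x ranges over 0..25


def caesar_encrypt(text: str, n: int = 3) -> str:
    """Apply E_n(x) = (x + n) mod 26 to every letter; other characters pass through."""
    out = []
    for ch in text.upper():
        if ch in ALPHABET:
            x = ALPHABET.index(ch)
            out.append(ALPHABET[(x + n) % 26])
        else:
            out.append(ch)
    return "".join(out)


def caesar_decrypt(text: str, n: int = 3) -> str:
    """Apply D_n(x) = (x - n) mod 26, which is encryption with the opposite shift."""
    return caesar_encrypt(text, -n)
```

With the key n = 3, `caesar_encrypt("HELLO")` gives `"KHOOR"`, and `caesar_decrypt("KHOOR")` recovers `"HELLO"`. Since there are only 26 possible keys, the cipher is trivially broken by trying all of them.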

With the appearance of Modern Cryptography [32], ciphers have evolved and are now mostly
performed by computing devices. This modern approach to encryption led to a great deal of research in
the field of cryptography, aiming at the development of stronger, more robust ciphers with good security
properties (resistance to attacks). For example, formerly both the cipher and the key were kept secret
to ensure that no untrusted party could access the protected information. Nowadays, the ciphers are
publicly known and the security lies in the robustness of the algorithm and the secrecy of the key.

One of the best-known encryption techniques is the One-Time Pad (OTP). The invention of the OTP is
generally credited to Gilbert S. Vernam and Joseph O. Mauborgne, but it was in fact described 35 years
before them by Frank Miller [33][34]. The OTP is a technique that theoretically cannot be broken [35],
but it requires pre-shared keys of the same size as the text to be encrypted. It uses Exclusive-Or (XOR)
operations to encrypt and decrypt the data: encryption is performed by XORing the bits of the data with
the bits of the key, as depicted in Figure 2.2. If the keystream bits are (a) perfectly random, and (b)
never reused, “the messages are rendered entirely secret, and are impossible to analyze without the
key” [35]. Achieving this requires a key as long as the text to be encrypted. However, the use of such
long keys is almost impossible in practical applications.
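The XOR-based OTP operation described above can be sketched in a few lines of Python. This is only an illustration: the function name `otp_xor` is hypothetical, and the key is drawn from the operating system's random source through the standard `secrets` module, which approximates but does not guarantee the "perfectly random" condition.

```python
import secrets


def otp_xor(data: bytes, key: bytes) -> bytes:
    """XOR every byte of the data with the matching key byte.

    The same function encrypts and decrypts, because (d ^ k) ^ k == d.
    """
    if len(key) != len(data):
        raise ValueError("an OTP key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


message = b"lightweight"
key = secrets.token_bytes(len(message))  # fresh key; must never be reused
ciphertext = otp_xor(message, key)
recovered = otp_xor(ciphertext, key)     # equals message
```

The length check makes the practical drawback visible: every message needs its own pre-shared key of the same size.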

Consequently, several different ciphers were created in an attempt to approach the security of the OTP.
They can be divided into two main groups: asymmetric ciphers and symmetric ciphers.

Asymmetric ciphers (depicted in Figure 2.3) are characterized by using two different keys (one public
and one private) to encrypt and decrypt the data, relying on mathematical problems (factorization of the
product of two large prime numbers, discrete logarithm, elliptic curves over finite fields). Examples of
such ciphers are RSA [36], Diffie-Hellman [37], and Elliptic Curve [38] encryption.

Figure 2.2: One Time Pad

Figure 2.3: Asymmetric Encryption Example

Symmetric ciphers are characterized by using the same key to encrypt and decrypt the data. They

can be divided in two types: Stream Ciphers and Block Ciphers.

Stream Ciphers are used to encrypt streams of data. The plaintext digits are combined with
a pseudorandom cipher digit stream (keystream), which is generated from the initial key. Typically,
keystream generators are easily broken [39]. Examples of these ciphers are RC4 [40] and A5 [41], the
latter being used in telecommunications such as 2G technology.

Block Ciphers are used to encrypt blocks of data: they take the data and split it
into fixed-size blocks. The initial key is expanded into multiple keys (subkeys) that are used in the
encryption/decryption process. The blocks are encrypted by applying different operations, together
with the subkeys, over multiple rounds. In each round, several operations are performed on the block
data and the subkey. These operations (typically substitutions, permutations or algebraic operations)
vary with the design of the cipher. Block cipher designs can be classified according to their structure.
The best-known approaches are:

• Substitution Permutation Network (SPN)

• Feistel Network (FN) and Generalized Feistel Network (GFN)

• Addition Rotation XOR (ARX)


A Substitution Permutation Network (SPN) (depicted in Figure 2.4) takes a block of the plaintext
and the key as inputs, and applies multiple rounds, each consisting of an XOR operation with the subkey
followed by a substitution stage and a permutation stage, to produce the output block of the ciphertext.
SPNs are composed of S-Boxes and P-Boxes, which implement the substitution and permutation logic. An
example of this approach is the Rijndael cipher, better known as the Advanced Encryption Standard (AES).
AES [42] is the current standard for symmetric cryptography, supporting a 128-bit block size and
128/192/256-bit keys, processed over 10, 12 or 14 rounds depending on the key size.
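The round structure described above (subkey XOR, substitution, permutation) can be made concrete with a toy SPN. The sketch below is not AES or any cipher studied in this thesis: the 16-bit block, the 4-bit S-box values, the bit permutation and the subkeys are all assumptions chosen for illustration, and the subkeys are supplied directly instead of being derived by a key schedule.

```python
# Toy 4-bit S-box (any bijection on 0..15 would do) and its inverse.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
INV_SBOX = [SBOX.index(v) for v in range(16)]


def pbox_target(i: int) -> int:
    """Destination of bit i in the toy P-box (a permutation of 0..15)."""
    return 15 if i == 15 else (4 * i) % 15


def substitute(state: int, box: list) -> int:
    """Substitution stage: pass each of the four nibbles through the S-box."""
    out = 0
    for nibble in range(4):
        out |= box[(state >> (4 * nibble)) & 0xF] << (4 * nibble)
    return out


def permute(state: int, inverse: bool = False) -> int:
    """Permutation stage: move every bit to its P-box destination (or back)."""
    out = 0
    for i in range(16):
        src, dst = (pbox_target(i), i) if inverse else (i, pbox_target(i))
        out |= ((state >> src) & 1) << dst
    return out


def spn_encrypt(block: int, subkeys: list) -> int:
    state = block
    for k in subkeys[:-1]:
        state = permute(substitute(state ^ k, SBOX))  # XOR, S-boxes, P-box
    return state ^ subkeys[-1]                        # final subkey addition


def spn_decrypt(block: int, subkeys: list) -> int:
    state = block ^ subkeys[-1]
    for k in reversed(subkeys[:-1]):
        state = substitute(permute(state, inverse=True), INV_SBOX) ^ k
    return state
```

Note that decryption needs the inverse S-box and inverse permutation, so an SPN generally requires separate encryption and decryption code paths.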

In a Feistel Network (FN) approach (depicted in Figure 2.5), the input block to be encrypted is split

into two equal-sized halves. The round function (F) is applied to one half using a subkey, and then

the output is XORed with the other half. The two halves are then swapped. The Generalized Feistel

Network (GFN) structure is similar to FN, but the block is split into n equal-sized slices that are then paired

two by two in order to reproduce the same structure of FN. An example of the FN approach is the Data

Encryption Standard (DES) cipher. DES [43] has a block size of 64 bits, using keys of 56 bits over 16

rounds. DES is now deprecated given its vulnerabilities [44].
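The Feistel structure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the round function `toy_f`, the 64-bit block width and the subkey values are placeholders invented for the example, and they are neither DES nor secure.

```python
MASK32 = 0xFFFFFFFF


def toy_f(half, subkey):
    """Placeholder round function (NOT secure); F does not need to be invertible."""
    x = (half + subkey) & MASK32
    rotated = ((x << 5) | (x >> 27)) & MASK32
    return rotated ^ subkey


def feistel(block, subkeys):
    """Run a 64-bit block through one Feistel round per subkey."""
    left, right = block >> 32, block & MASK32
    for k in subkeys:
        # F is applied to one half, XORed into the other, then the halves swap.
        left, right = right, left ^ toy_f(right, k)
    # Undo the final swap so that decryption can reuse the same routine.
    return (right << 32) | left


subkeys = [0x0F1E2D3C, 0x4B5A6978, 0x8796A5B4, 0xC3D2E1F0]
plaintext = 0x0123456789ABCDEF
ciphertext = feistel(plaintext, subkeys)
# Decryption is the same routine with the subkeys in reverse order.
decrypted = feistel(ciphertext, subkeys[::-1])
```

The last line shows the property that makes this structure attractive on constrained devices: no separate decryption routine is needed.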

In the Addition Rotation XOR (ARX) (depicted in Figure 2.6) design, the round function involves only

three operations: modular addition, rotation with fixed rotation amounts, and XOR (with the subkeys).

An example of this approach is the RC6 cipher. RC6 [45] was one of the finalists of the AES contest,

supporting keys of 128/192/256 bits and a block size of 128 bits, with 20 rounds.

Each design has its strengths and weaknesses. SPNs are more secure but also more difficult to implement and more costly in terms of computation. FNs are less secure than SPNs but easier to implement, and have the useful property that the encryption and decryption structures are the same, only requiring the subkeys to be reordered. The ARX approaches are the weakest because they only use basic operations (additions, rotations and XORs) to encrypt the data, but they are very easy to implement and less costly.

Figure 2.4: SPN Design

Figure 2.5: FN Design

Figure 2.6: ARX Design

Lightweight Cryptography

Cryptographic algorithms can potentially be extremely heavy, requiring a lot of computational power, memory and energy. With the growth of IoT and the use of constrained devices that also demand security, it became necessary to design new cryptographic algorithms that require fewer resources. This motivated a new paradigm in the cryptography community: Lightweight Cryptography.

The focus of Lightweight Cryptography is to design algorithms that require less memory, less pro-

cessing power and consume less energy [20].

In recent years, several works addressing this goal have been presented. In [17] the authors propose

a generalized approach to lightweight algorithms design. They also highlight some constraints and

recommendations for the implementation of lightweight algorithms. The research already done in this

field created some guidelines that lightweight encryption algorithms should follow:

• Lightweight algorithms need small parameter values: block size, key length (within reasonable

limits), and the algorithm’s internal state;

• Lightweight algorithms may use basic elements (arithmetic and logic operations, linear or non-

linear transformations, etc.), which are faster and lighter, but that can decrease their

cryptographic strength;

• Lightweight algorithms should use low-cost (in their implementation) but effective elements, such

as data-dependent bit permutations, shift registers, etc;

• Lightweight algorithms may use simplified layers of transformations, e.g. decreasing ROM require-

ments by using 4-bits to 4-bits S-boxes;

• Designing key schedules that can derive sub keys in-place allows lightweight algorithms to use

less memory;

• Lightweight algorithms should compute operations that allow good implementation trade-offs ac-

cording to the resources available on the target platform.

Although most of the research work published in this field addresses the design of new ciphers/algorithms, there are also some papers presenting comparisons and evaluations of ciphers. Some articles [46] [47] [48] [49] [50] [22] focus on comparing different lightweight ciphers to assess the ones requiring less area/memory and performing faster. Also, some works try to improve the performance of the ciphers with optimizations (e.g. [51]). In [52] [53] the focus is on comparing the energy consumed by lightweight ciphers. The experimental energy consumption values presented in the state of the art were obtained only for hardware implementations; the values for software implementations were only assessed using simulations. Other authors [18] [17] [20] [50] [21] try to find the best characteristics for lightweight algorithms. The implementation of lightweight ciphers on high-end computing systems is also a topic of research. In [54] [24] the performance of lightweight ciphers is evaluated on servers, since constrained devices usually exchange data with servers.

The implementation of lightweight algorithms also brings new security issues on the topic of side-channel attacks [55], which is also a hot research topic. For example, in [19] and [56] the authors present measures and design options to prevent this type of attack. However, this is not the focus of this research work.

2.3 Algorithms

In the past years, several new algorithms were presented by taking into account the requirements

of lightweight encryption. Also, well known algorithms like AES and DES have been optimized to fit in

the lightweight world. Lightweight implementations of the AES algorithm have shown good results as

presented in [46], [21] and [50]. DES optimizations for lightweight (DESL & DESXL) are also analyzed

in [46], [47]. Besides these well known and mature ciphers, several new lightweight ciphers have been

proposed, such as:

SPN Ciphers: PRESENT[23], BORON[57], GIFT[58], KLEIN[59], NOEKEON[60], RECTANGLE[61],

SIMON[62][63], LED[64], Fantomas/Robin[65], Mysterion[66], PRIDE[67], PRINCE[68], RobinStar[66].

FN/GFN Ciphers: HIGHT[69], LBlock[70], LiCi[71], XTEA[72], NUX[73], CLEFIA¹[74], TWINE[75],

RoadRunneR[76], Piccolo[77].

ARX Ciphers: RC6²[45], SPECK[62][63], SPARX³[78], LEA[79], Chaskey⁴[80].

Hybrid Ciphers⁵: Hummingbird[81], Hummingbird 2[82].

NLFSR Ciphers⁶: Halka[83], KATAN & KTANTAN[84].

Some of these ciphers already have known security issues, as discussed in [46]. For example,

KLEIN, HIGHT, LBlock, XTEA, TWINE, RC6, Hummingbird, Hummingbird-2 and KATAN & KTANTAN all

suffer from significant security vulnerabilities and should be avoided. BORON, GIFT, NOEKEON, RECT-

ANGLE, RoadRunneR, SIMON and SPECK need to be further analyzed for vulnerabilities to evaluate

their claimed level of security. AES, PRESENT, CLEFIA are the most studied ciphers and, therefore, the

most acceptable solutions.

A more detailed description of some of these ciphers follows.

AES

The Advanced Encryption Standard (AES) [42], also known as the Rijndael cipher, was the winner of the AES contest run by the National Institute of Standards and Technology (NIST) in 2001, which was looking for a new standard algorithm for encryption after DES had been broken. It is a cipher with a substitution permutation network (SPN) structure, oriented for both hardware and software implementations, that supports blocks of 128 bits, keys of 128/192/256 bits, and 10/12/14 rounds (depending on the key size).

¹ CLEFIA was not proposed as a lightweight cipher, but was standardized as one because of its highly efficient hardware and software implementations.

² RC6 was a finalist in the AES contest. It is not a lightweight cipher, but since it is an ARX cipher it can be efficiently implemented on constrained devices.

³ SPARX is an ARX-based SPN cipher, because its S-Boxes use an ARX structure.

⁴ Chaskey is an algorithm designed for the creation of Message Authentication Codes (MAC).

⁵ Hybrid Ciphers combine the three main types (SPN, FN and ARX).

⁶ NLFSR Ciphers are ciphers that utilize the building blocks of stream ciphers.

The key scheduler expands the initial key into multiple round keys (number of rounds plus one) of

128-bits. To calculate the round keys the initial key is submitted to rotations, substitutions and a XOR

operation with a predefined round constant (stored in a table).

AES operates on a 4 × 4 matrix of bytes, called state. For the encryption, the state is submitted to

multiple rounds where 4 operations are performed. These operations are Add Round Key, Substitution of

bytes (S-Layer), Shift Rows and Mix Columns (P-Layer). In the Add Round Key, the round key is XORed

with the state. The substitution layer is performed using an 8 × 8 bits S-Box, through all the state. The

permutation layer is done by rotating the rows (the last 3 rows) of the matrix and mixing the columns by

multiplying them with a matrix (M). Each column is treated as a polynomial over GF(2^8)⁷.
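As an illustration of the column mixing step, the sketch below multiplies one state column by the AES Mix Columns matrix using arithmetic in GF(2^8). The helper names `xtime` and `mix_column` are illustrative, and the input column is just an example value.

```python
def xtime(a):
    """Multiply a by x (i.e. by 2) in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a


def mix_column(col):
    """Multiply one 4-byte state column by the AES Mix Columns matrix M."""
    a0, a1, a2, a3 = col

    def mul2(v):
        return xtime(v)

    def mul3(v):
        return xtime(v) ^ v  # 3*v = 2*v + v, and addition in GF(2^8) is XOR

    return [
        mul2(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ mul2(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ mul2(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ mul2(a3),
    ]


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```

Each output byte costs several conditional shifts and XORs, which is part of what makes this layer comparatively expensive on small processors when no tables are used.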

Because of this structure, AES is a heavy cipher to perform on constrained devices. It requires a

lot of memory to store the S-Box plus the round constants. Also, it has a lot of heavy operations (e.g.

multiplications) that can lead to low performance. To improve it, many optimized implementations have

been proposed, the most popular ones being those using T-Boxes [42] and the Bit-Slice approach [85].

CLEFIA

CLEFIA [74], designed by Sony, was presented in 2007 and standardized in ISO/IEC 29192. It was

not presented as a lightweight cipher, but was later standardized as one because of its high efficiency

and performance in hardware.

The CLEFIA structure is a Generalized Feistel Network (GFN) that, although hardware oriented,

can also perform well in software. It has a block size of 128 bits and supports key lengths of 128/192/256

bits with 18, 22 and 26 rounds, respectively. Moreover, it uses a simpler key scheduler and small F

functions, with small S-Boxes and basic permutations. Also, the use of a Feistel type structure allows

the encryption/decryption code to be almost the same, which provides a good trade-off between size

and performance.

The key scheduler expands the initial key into multiple round keys, 4 whitening keys plus 2 times the

number of rounds. All keys are 32 bits wide. In the key scheduling phase, the initial key is submitted to

a XOR with round constants that are generated based on the key length and to a DoubleSwap function, a

function that swaps bit positions in place.

In CLEFIA (Figure 2.7) the blocks are split into 4 parts of 32 bits. Each part is submitted to different operations and then swaps places with another part, as in the Feistel Network structure. At the beginning and at the end, parts 1 and 3 are XORed with the whitening keys. Then, in each round, part 0 feeds the Feistel Function F0 and the result is XORed with part 1, while part 2 feeds the Feistel Function F1 and the result is XORed with part 3. The Feistel Functions (Figure 2.8) have a SPN structure: the part that feeds the function is XORed with a round key and then goes through a substitution layer that alternates between 2 different S-Boxes of 8 × 8 bits. At the end, a permutation layer mixes the columns, similarly to AES, where each byte of the 32-bit word is a column.
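The data flow of this four-branch structure can be sketched as follows. This is a structural illustration only: `toy_f0` and `toy_f1` are invented placeholders standing in for the real F0 and F1 (they contain no CLEFIA S-Boxes or diffusion matrices), and the key values are arbitrary rather than the output of the CLEFIA key scheduler.

```python
MASK32 = 0xFFFFFFFF


def toy_f0(rk, x):
    """Placeholder for F0 (NOT the real CLEFIA function)."""
    x ^= rk
    return ((x * 0x9E3779B1) & MASK32) ^ (x >> 15)


def toy_f1(rk, x):
    """Placeholder for F1 (NOT the real CLEFIA function)."""
    x ^= rk
    return (((x << 7) | (x >> 25)) & MASK32) ^ 0xA5A5A5A5


def gfn4_encrypt(parts, round_keys, whitening_keys):
    t0, t1, t2, t3 = parts
    t1 ^= whitening_keys[0]          # initial whitening on parts 1 and 3
    t3 ^= whitening_keys[1]
    rounds = len(round_keys) // 2    # two 32-bit round keys per round
    for r in range(rounds):
        t1 ^= toy_f0(round_keys[2 * r], t0)
        t3 ^= toy_f1(round_keys[2 * r + 1], t2)
        if r != rounds - 1:          # the parts are not rotated after the last round
            t0, t1, t2, t3 = t1, t2, t3, t0
    return [t0, t1 ^ whitening_keys[2], t2, t3 ^ whitening_keys[3]]


def gfn4_decrypt(parts, round_keys, whitening_keys):
    t0, t1, t2, t3 = parts
    t1 ^= whitening_keys[2]
    t3 ^= whitening_keys[3]
    rounds = len(round_keys) // 2
    for r in reversed(range(rounds)):
        t1 ^= toy_f0(round_keys[2 * r], t0)
        t3 ^= toy_f1(round_keys[2 * r + 1], t2)
        if r != 0:
            t0, t1, t2, t3 = t3, t0, t1, t2   # inverse rotation of the parts
    return [t0, t1 ^ whitening_keys[0], t2, t3 ^ whitening_keys[1]]


# 18 rounds (128-bit key) need 36 round keys and 4 whitening keys; values are arbitrary.
round_keys = [(0x01010101 * i) & MASK32 for i in range(1, 37)]
whitening_keys = [0x11111111, 0x22222222, 0x33333333, 0x44444444]
block = [0x00010203, 0x04050607, 0x08090A0B, 0x0C0D0E0F]
encrypted = gfn4_encrypt(block, round_keys, whitening_keys)
decrypted = gfn4_decrypt(encrypted, round_keys, whitening_keys)
```

Decryption reuses the same F functions and only changes the key order and the direction of the rotation, which is why the encryption and decryption code can be almost the same.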

⁷ Galois Field (GF) is a field that contains a finite number of elements. As with any field, a finite field is a set on which the operations of multiplication, addition, subtraction and division are defined and satisfy certain basic rules.

Figure 2.7: Illustration of CLEFIA round

Figure 2.8: Illustration of CLEFIA F Functions

NOEKEON

NOEKEON [60] is a lightweight cipher that was proposed by the same authors of AES in 2000. It was

declared broken [86] but later the authors explained that the attack proposed on it was not possible [87].

NOEKEON is a SPN with bit-slice design (32-bits oriented) that targets software/hardware imple-

mentations. It has 16 rounds, supports a block of 128 bits and a key size of 128 bits. The particularity of

NOEKEON is that it does not have a key scheduler, since it does not use round keys. NOEKEON can

work in two modes: Direct Mode and Indirect Mode. In the Direct Mode the initial key is the one used

in the encryption/decryption. In the Indirect Mode the key is encrypted using a function of NOEKEON

(Theta Function) with a null vector as a key. The advantage of the Indirect Mode over the Direct Mode is

that the secret key is not exposed.

The NOEKEON structure (Figure 2.9) is divided into 3 functions: the Gamma function, which is an S-Box

implemented in a bit-slice style; Theta, which is a linear mapping that takes the state and the key, XORs

them and performs permutations and rotations; and the Pi functions, which are rotations. For the encryption,

the block of data is arranged in 4 32-bit words, called state. In each round, the state is first submitted to

a XOR with a constant and then to the Theta function. After that it is XORed with another constant and

submitted to the Pi1 rotations. Finally, it is submitted to the Gamma function and at the end to the Pi2

rotations. After all the rounds, it is XORed with 2 constants and in the middle goes through the Theta

function one more time.

This cipher uses very simple operations in a well-structured manner to achieve good security. It has a small code size and memory footprint, since it needs to store neither an S-Box, because it is implemented in a bit-slice style, nor round keys, since the same key is always used. Furthermore, the encryption and decryption are very similar, which makes the code size smaller. These features make this cipher very lightweight and fast, with the Direct Mode being the more attractive one.

Figure 2.9: Illustration of NOEKEON Round

PRESENT

PRESENT [23] is the most well known lightweight cipher. It was presented in 2007 and standardized

in ISO/IEC 29192, like CLEFIA. Since its presentation, it has been widely studied and it is considered

by many as one of the best lightweight encryption algorithms created so far. The PRESENT cipher is

hardware oriented, has a SPN structure, and uses keys of 80/128 bits to encrypt a 64 bits block through

31 rounds.

The key scheduler of PRESENT expands the initial key into multiple round keys of 64-bits (number of

rounds plus one). In each round, the key is submitted to a rotation, some of its bits undergo a substitution

(using the PRESENT S-Box) and other bits are XORed with the round number. The key schedulers for

80 and 128 bits are similar, with small differences in the rotation and substitution operations.

PRESENT (Figure 2.10) operates in a bit-oriented way, but its state is the full block. For the encryption, the state is submitted to multiple rounds where 3 operations are performed: add round key, a XOR between the block and the round key; substitution of nibbles, using a 4 × 4 bits S-Box; and a bit-oriented permutation, where bits are moved to new positions.
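The round structure and the 80-bit key scheduler can be sketched as below. This is a straightforward, unoptimized rendering written for readability, not an implementation from [23]; the S-Box table, the rotation by 61 bits and the bit permutation rule are reproduced from the published specification as best understood here, and the function names are illustrative.

```python
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
ROUNDS = 31
MASK80 = (1 << 80) - 1


def round_keys_80(key):
    """Derive the 32 round keys (number of rounds plus one) from an 80-bit key."""
    keys = []
    for counter in range(1, ROUNDS + 2):
        keys.append(key >> 16)                                    # leftmost 64 bits
        key = ((key << 61) | (key >> 19)) & MASK80                # rotate left by 61
        key = (SBOX[key >> 76] << 76) | (key & ((1 << 76) - 1))   # S-Box on top nibble
        key ^= counter << 15                                      # round counter, bits 19..15
    return keys


def sbox_layer(state):
    """Apply the 4-bit S-Box to each of the 16 nibbles of the state."""
    out = 0
    for nibble in range(16):
        out |= SBOX[(state >> (4 * nibble)) & 0xF] << (4 * nibble)
    return out


def p_layer(state):
    """Move bit i to position 16*i mod 63 (bit 63 stays in place)."""
    out = 0
    for bit in range(64):
        target = 63 if bit == 63 else (16 * bit) % 63
        out |= ((state >> bit) & 1) << target
    return out


def present80_encrypt(block, key):
    keys = round_keys_80(key)
    state = block
    for r in range(ROUNDS):
        state = p_layer(sbox_layer(state ^ keys[r]))
    return state ^ keys[ROUNDS]   # final key addition


ciphertext = present80_encrypt(0x0000000000000000, 0x00000000000000000000)
```

The loop in `p_layer` handles one bit per iteration, which hints at why this permutation is free in hardware (it is only wiring) but costly in software. The result for the all-zero key and block can be checked against the test vectors published with the cipher.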

Because of this structure, with permutations that in hardware can be implemented using only simple wired connections, PRESENT achieves a small area and a fast performance. In software, on the other hand, these permutations are very heavy and difficult to implement, which leads to a large code size and a poorer performance. Halka [83] is another lightweight cipher, very similar to PRESENT, but it uses an 8 × 8 bits S-Box and suffers from the same implementation issues.

Figure 2.10: Illustration of PRESENT Round

RECTANGLE

RECTANGLE [61] was proposed in 2015 and is oriented for both hardware and software, since it uses

a bit-slice technique to increase its software efficiency. It has a SPN structure, supports a block size of

64 bits and key sizes of 80/128 bits, and uses 25 rounds. It is a 16-bit oriented bit-slice cipher.

In the key scheduler, the key is arranged in a matrix of 5 × 16 bits (for 80 bits key) or 4 × 32 bits (for 128

bits key). The key is iterated through several rounds and in each round a substitution is applied to some

of the columns. Then a Feistel transformation is applied, where some rows are permuted and the first

and the last rows are rotated and XORed with other rows. After that a round constant of 5 bits is XORed

with the last bits of the first row. The round keys are 64 bits wide and are extracted from the first 4 rows

(for a 80 bits key), or from the 16 rightmost columns of the matrix (for 128 bits key).

In the encryption (Figure 2.16), the 64 bits block is first arranged in a 4 × 16 array and then the

encryption is performed throughout multiple rounds repeating these 3 steps: add round key, where a

round key is XORed with the state; a substitution step applied to the columns in a bit-slice way with a

4× 4 bits S-Box; and finally the permutation performed by rotations on the rows.

This structure allows the software implementation to benefit from a bit-slice style that gives it a fast

performance, while the rotations of the permutation layer allow it to achieve very competitive results

compared with other lightweight algorithms when implemented in software.

RoadRunneR

RoadRunneR (RRR) [76] is one of the few lightweight block ciphers that are specifically designed for

software implementations on processors with 8-bit architecture. It was presented in 2015 and follows

an LS-Design⁸ [65] to optimize its performance. It has a FN structure with some changes to improve its

security. RRR is a 64-bit block cipher that supports keys of 80/128-bits. The number of rounds for an

80-bits key is 10, and 12 for a 128-bits key.

⁸ LS-Design [65] is a design that was proposed for ciphers composed of S-Boxes and L-Boxes. The S-Boxes follow the bit-slice technique and the L-Boxes are a type of linear P-Boxes that mix bits inside the registers (in place) and can be applied in parallel.

The key scheduler is really simple. It repeats the key cyclically, and each 32 bits of the resulting array of round keys are then used in the encryption. It uses 3 keys per round plus 2 whitening keys, one at the beginning and another at the end, to XOR with the block.

For encryption, the block of 64 bits is split into two parts of 32 bits. At the start and the end of

the encryption the left part is XORed with a whitening key. In the middle, the block is submitted to

multiple rounds. In each round the left part of the block goes through a Feistel Function. This function

is composed of 4 steps, where the 1st, 2nd and 4th steps are the SLK boxes that have a SPN structure

inside. The other step (3rd) is a XOR with a round constant. The SLK box has a 4×4 S-Box implemented

in bit-slice, plus a permutation layer and an add round key layer, where a round key is XORed with

the data. The output of the Feistel function is XORed with the right part and finally the two parts are

swapped, repeating the whole process in the next round. The RoadRunneR structure elements are illustrated in

Figure 2.11.

This structure allows RoadRunneR to have a small number of rounds while keeping high security. Also,

since the SP-network of the Feistel Function is implemented in bit-slice it will have a fast software per-

formance. Its key scheduler is also very simple and very fast to perform. These features make RRR a

cipher very suitable for software implementations.

Figure 2.11: Illustration of RoadRunneR Round, F Function and SLK Structure

SPARX

SPARX [78] is a family of ciphers based on the ARX (Addition-Rotation-XOR) structure. It is one of the few ARX lightweight ciphers that has not yet been broken, besides SPECK. SPARX was presented in 2016 as a cipher that combines the advantages of the ARX structure for lightweight encryption with a SPN structure to increase its security. It targets software and hardware implementations. SPARX supports 64/128-bit block sizes, 128/256-bit key lengths, and 24, 32 or 40 rounds. The relation between block size, key size and number of rounds can be seen in Figure 2.14.

The key scheduler takes the initial key and expands it into round keys of 32 bits. These round keys are derived from the initial key, with operations depending on the key size. For example, a 128-bit key is split into 4 parts of 32 bits. Then the ARX-based S-Box, Speckey, is applied to the first part and the result is added to the second part. The fourth part of the key is added to a round constant. After that, the parts are permuted.

The encryption (Figure 2.12) takes place in steps, with each step having multiple rounds. The block of data is split into words of 32 bits, and those words are then processed through several steps. In each round, the state words are XORed with the round keys and the ARX-based S-Box, Speckey, is then applied. The Speckey ARX-Box (Figure 2.13) splits the word in two and applies rotations, an addition (left part + right part) and a XOR (the right part XORed with the left part) between them. At the end of each step, there is a linear layer that is taken from NOEKEON: the left part is XORed with rotations of itself and then permuted with the right part.
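A minimal sketch of an ARX-Box of this kind is given below. The rotation amounts (7 to the right, 2 to the left) are an assumption here: they are the ones SPECK uses for 16-bit words and should be checked against [78]. The names `speckey` and `speckey_inverse` are illustrative.

```python
MASK16 = 0xFFFF


def rotl16(v, n):
    return ((v << n) | (v >> (16 - n))) & MASK16


def rotr16(v, n):
    return ((v >> n) | (v << (16 - n))) & MASK16


def speckey(word):
    """Apply one ARX-Box to a 32-bit word, seen as two 16-bit halves."""
    left, right = word >> 16, word & MASK16
    left = (rotr16(left, 7) + right) & MASK16   # rotation, then modular addition
    right = rotl16(right, 2) ^ left             # rotation, then XOR
    return (left << 16) | right


def speckey_inverse(word):
    """Undo speckey by reversing each step in the opposite order."""
    left, right = word >> 16, word & MASK16
    right = rotr16(right ^ left, 2)
    left = rotl16((left - right) & MASK16, 7)
    return (left << 16) | right


sample = 0x12345678
round_trip = speckey_inverse(speckey(sample))
```

No table is involved: the whole nonlinear layer is three register operations per half, which is where the small memory footprint mentioned in the text comes from.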

The structure of SPARX is a little confusing, but very lightweight, since it uses simple operations to create an S-Box without the need to use memory to store it. This new way of implementing an ARX-based cipher is also very interesting, since it keeps the advantages of ARX designs, which are traditionally weaker in terms of security, and makes them stronger, resulting in a secure algorithm with a small code size and memory usage.

Figure 2.12: Illustration of SPARX Round

Figure 2.13: Illustration of the Speckey S-Box

SPECK

SPECK [62][63] is one of the lightweight ciphers created by the NSA and was presented in 2013. SPECK is an ARX cipher oriented for software implementations, although it also performs well in hardware. SPECK supports block sizes of 32/48/64/96/128 bits and keys of 64/72/96/128/144/192/256 bits. The number of rounds goes from 22 to 34, depending on the block and key sizes, as illustrated in Figure 2.15.

Block Size    Key Size    Rounds
64            128         24
128           128         32
128           256         40

Figure 2.14: SPARX block sizes, key sizes and rounds

Block Size    Key Size    Rounds
32            64          22
48            72          22
48            96          23
64            96          26
64            128         27
96            96          28
96            144         29
128           128         32
128           192         33
128           256         34

Figure 2.15: SPECK block sizes, key sizes and rounds

SPECK uses the same round function for key scheduling and for the encryption of the data. This round function splits the data of the block size in half and performs additions, rotations and XOR operations between the two parts. In the key scheduler, the key is split into words of half the size of the block. So, for a 64-bit block, a key of 128 bits is split into 4 words of 32 bits. The number of round keys generated is equal to the number of rounds.

The encryption is a simple iteration over the round function (Figure 2.17), in which the block is split in half, the left slice is rotated, then added to the right slice, and finally XORed with the round key. To complete the procedure, the right slice is rotated and XORed with the left slice (resulting from the previous operations). These operations are performed through multiple rounds.
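The sketch below instantiates this description for a 64-bit block and a 128-bit key (27 rounds, as in Figure 2.15). It is a plain rendering for readability, not an optimized implementation; the rotation amounts 8 and 3 are those of the published specification for 32-bit words as best understood here, and the key and plaintext words are example values.

```python
MASK32 = 0xFFFFFFFF
ALPHA, BETA, ROUNDS = 8, 3, 27


def ror(v, n):
    return ((v >> n) | (v << (32 - n))) & MASK32


def rol(v, n):
    return ((v << n) | (v >> (32 - n))) & MASK32


def speck_round(x, y, k):
    """One round: rotate and add on the left word, rotate and XOR on the right."""
    x = ((ror(x, ALPHA) + y) & MASK32) ^ k
    y = rol(y, BETA) ^ x
    return x, y


def expand_key(key_words):
    """key_words = (l2, l1, l0, k0); the round function itself drives the schedule."""
    l2, l1, l0, k = key_words
    l = [l0, l1, l2]
    round_keys = [k]
    for i in range(ROUNDS - 1):
        new_l, k = speck_round(l[i], k, i)   # the round index plays the role of the key
        l.append(new_l)
        round_keys.append(k)
    return round_keys


def encrypt(x, y, round_keys):
    for k in round_keys:
        x, y = speck_round(x, y, k)
    return x, y


def decrypt(x, y, round_keys):
    for k in reversed(round_keys):
        y = ror(y ^ x, BETA)
        x = rol(((x ^ k) - y) & MASK32, ALPHA)
    return x, y


round_keys = expand_key((0x1B1A1918, 0x13121110, 0x0B0A0908, 0x03020100))
ct = encrypt(0x3B726574, 0x7475432D, round_keys)
pt = decrypt(ct[0], ct[1], round_keys)
```

Reusing `speck_round` inside `expand_key` is what keeps the code size small: there is a single routine, no tables, and every operation works in place on two words.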

This structure allows SPECK to use in-place operations, removing the overhead of moving values between registers. Also, since no tables are used and the same round function is used for the key scheduling and the encryption, it has a very small code size. The simple design and the use of light operations, like rotations, additions and XORs, make SPECK one of the most lightweight ciphers.

Figure 2.16: Illustration of RECTANGLE operations applied to the cipher state

Figure 2.17: Illustration of SPECK Round

2.4 Metrics and Tools

Metrics

With the increasing importance of lightweight cryptography, several new ciphers have been proposed. As a result, comparative analyses of such ciphers have been gaining more and more importance. However, depending on the target platform, different metrics should be considered.

Software

In what concerns software implementations, several different comparative analyses have been presented

in the literature. In [52] the goal was to evaluate the energy consumption of block ciphers

in memory constrained devices, for which the considered metrics were the number of clock cycles, the

RAM footprint, and the code size. The authors collected the amount of clock cycles for the encryption

and decryption procedures, as well as for the key expansion, which allowed them to assess the stages

that require more processing time. The energy consumption values were obtained with the simulation of

power models for the StrongARM SA-1100 processor.

Other evaluations based on software platforms were also reported in [46] [47] [21] [48] [49] [50] [22].

Although such studies addressed different target platforms, they considered the same set of metrics

mentioned before:

• RAM Footprint: A lightweight block cipher is expected to require little memory.

RAM usually contains data like the information to cipher, master keys, round keys and initialization

vectors. It is measured in bytes.

• Code Size (ROM): The code size requirement is computed in bytes and corresponds to the code

footprint that is stored in the device persistent memory.

• Execution Time: One of the most important performance metrics of a cipher algorithm is the

execution time, which is related to various parameters of the cipher, such as its structure or the

number of rounds and the target device. It is usually measured in processor clock cycles.

• Energy Consumption: This is a very important metric when considering devices with limited

power sources, like batteries. It is measured in Watts (W) (consumption per second) or Joules (J)

(total consumption). Typically, this value is an estimation computed using several approaches such

as the one in equation 2.3 used in [22] or using power models for the processors. For example, in

[52], the power model of the StrongARM SA-1100 was considered.

\[
\text{Energy (J)} = \frac{\text{Consumed Power (W)} \times \text{Clock Cycles}}{\text{Frequency (Hz)}} \tag{2.3}
\]
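As a small worked example of Equation 2.3, the sketch below computes the energy of a single operation. All input figures are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from this work or from [22].

```python
def energy_joules(power_watts, clock_cycles, frequency_hz):
    """Equation 2.3: energy = power x (clock cycles / frequency)."""
    return power_watts * clock_cycles / frequency_hz


# Hypothetical figures: 50 mW average power, 16 000 cycles per block, 8 MHz clock.
energy = energy_joules(0.050, 16_000, 8_000_000)
execution_time = 16_000 / 8_000_000   # seconds spent on the operation
```

The ratio of clock cycles to frequency is simply the execution time (2 ms here), so the estimate is the average power multiplied by the time spent, i.e. 0.1 mJ for these figures.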

Hardware

When considering evaluations based on hardware platforms [46] [47] [48] [18] [53], other metrics are

usually employed:

• Area: It is important that a hardware implementation of a cipher uses the lowest possible area, since constrained devices have a small area available to implement the ciphers. This area is measured in Gate Equivalents (GE), a unit of measure that allows the complexity of digital electronic circuits to be specified independently of the manufacturing technology. Nowadays, one GE is given by the silicon area of a two-input NAND gate [88].

• Throughput: Throughput is the maximum rate at which the cipher can encrypt/decrypt the data.

It is usually measured in bytes per second. Sometimes this metric is also used for software

implementations.

More complex metrics (formulas) based on the ones mentioned above have been reported in some

works [20] [76]. However, such metrics are not widely used.

Tools

With such a vast number of different metrics, the need for a consistent and simplified evaluation methodology started to grow within the research community. As a result, several tools have been developed for this purpose and have become widely used. Two good examples of these tools are BLOC [89] and FELICS [90].

BLOC

The BLOC project [89] aims at studying the design of block ciphers in constrained environments.

The underlying target device is the 16-bit MSP430F1611 microcontroller, which is commonly used in

sensor nodes. Three metrics are considered: the execution time, the RAM requirements and the code

size. The metric extraction is done automatically through Bash scripts and the results are exported into

LaTeX tables. Unfortunately, as mentioned in [90], a bug was found in the source code causing the RAM

footprint to be wrongly computed. Overall, the project has the merit of being one of the first attempts to

perform an automated evaluation of a set of lightweight block ciphers on an embedded device.

FELICS

The FELICS (Fair Evaluation of Lightweight Cryptographic Systems) tool was first introduced in [91]

but it was formally presented in [90]. It is a free, open source and flexible framework that can be used

to assess the performance of C and assembly software implementations of lightweight primitives (block

and stream ciphers) on a set of embedded devices. In fact, this framework has been widely used in the

most recent papers presenting analyses of ciphers, like in [50], [21] and [51].

Currently, it supports three widely used microcontrollers, i.e. the 8-bit AVR ATmega128 microcontroller, the 16-bit MSP430F1611 microcontroller and the 32-bit ARM Cortex-M3 microcontroller, and allows three metrics to be extracted: code size, RAM consumption and execution time. It has 3 usage scenarios⁹ (2 of them consist of simulations of the usage of lightweight ciphers):

• Scenario 0: Evaluation of basic operations of a block cipher. In this scenario, a block is en-

crypted/decrypted using the provided test vectors.

• Scenario 1: Evaluation of the performance on secure communications in sensor networks and

between IoT devices. In this scenario, several blocks of 128 bytes are encrypted/decrypted in

Cipher Block Chaining (CBC) mode.

• Scenario 2: Evaluation of the performance of challenge-handshake authentication in the IoT. It

assumes the round keys are stored in flash memory, so there is no need to perform the key schedule.

128 bits of data are encrypted in Counter (CTR) mode.
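To clarify what the two modes of operation in Scenarios 1 and 2 do, the following sketch applies CBC and CTR on top of a stand-in block cipher. It is not FELICS code (whose scenarios are written in C); `toy_encrypt_block` and `toy_decrypt_block` are invented placeholders with a 64-bit block, and the key, IV and counter values are arbitrary.

```python
BLOCK_BYTES = 8
MASK64 = (1 << 64) - 1


def toy_encrypt_block(block, key):
    """Stand-in 64-bit block cipher (NOT secure): XOR, rotate, add."""
    x = block ^ key
    x = ((x << 13) | (x >> 51)) & MASK64
    return (x + key) & MASK64


def toy_decrypt_block(block, key):
    x = (block - key) & MASK64
    x = ((x >> 13) | (x << 51)) & MASK64
    return x ^ key


def split_blocks(data):
    return [int.from_bytes(data[i:i + BLOCK_BYTES], "big")
            for i in range(0, len(data), BLOCK_BYTES)]


def join_blocks(blocks):
    return b"".join(b.to_bytes(BLOCK_BYTES, "big") for b in blocks)


def cbc_encrypt(data, key, iv):
    """CBC: each plaintext block is XORed with the previous ciphertext block."""
    previous, out = iv, []
    for block in split_blocks(data):
        previous = toy_encrypt_block(block ^ previous, key)
        out.append(previous)
    return join_blocks(out)


def cbc_decrypt(data, key, iv):
    previous, out = iv, []
    for block in split_blocks(data):
        out.append(toy_decrypt_block(block, key) ^ previous)
        previous = block
    return join_blocks(out)


def ctr_crypt(data, key, counter):
    """CTR: encrypt a running counter and XOR it with the data (same call decrypts)."""
    out = []
    for offset, block in enumerate(split_blocks(data)):
        keystream = toy_encrypt_block((counter + offset) & MASK64, key)
        out.append(block ^ keystream)
    return join_blocks(out)


key, iv = 0x0123456789ABCDEF, 0x1122334455667788
message = bytes(range(128))                      # 128 bytes, as in Scenario 1
cbc_round_trip = cbc_decrypt(cbc_encrypt(message, key, iv), key, iv)
challenge = bytes(range(16))                     # 128 bits, as in Scenario 2
ctr_round_trip = ctr_crypt(ctr_crypt(challenge, key, iv), key, iv)
```

CTR never calls the block decryption routine, which is one reason a scenario built on it can skip part of the cipher code; CBC needs both directions.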

The FELICS tool also presents the advantage of having a modular design that makes it easy to accommodate new target devices, metrics, and usage scenarios. For example, it is possible to assess the execution time of the cipher and the key generator separately to see which part is taking more time and, consequently, needs more optimization. The possibility to create new metrics, such as energy consumption, can also be very useful for the evaluation of more recent ciphers.

The project page [92] is another advantage because it contains all the documentation needed to

operate the tool, as well as lots of information and metrics for other lightweight cipher implementations

[93]. This can be extremely useful for performing comparative analyses of ciphers.

2.5 Optimizations

The optimization of the ciphers has also been a major concern of lightweight encryption (e.g. [51]).

Several different approaches can be used to optimize a cipher, each one targeting a specific goal. Still,

most optimizations aim at reducing the execution time (increase in performance), the code size (ROM

footprint), or the volatile memory requirements (RAM footprint). These optimizations can be grouped in

two main classes:

• Algorithm Optimizations

• Code Optimizations

2.5.1 Algorithm Optimizations

These optimizations involve modifications in the structure of the cipher, which consist in replacing

some of the operations by other less complex operations providing the same result. The most commonly

adopted techniques are:

⁹ The usage scenarios are written in C.

• Table based implementations;

• Bit-slice implementations;

• Vperm implementations, which consist in using vector permutation instructions to implement table lookups by taking advantage of the SIMD engine present inside modern CPUs. This technique was not addressed in this thesis, since it is not suitable for the platforms that are the focus of this work.

The goal of these optimizations is to increase the performance of the cipher.

Table Based Implementations

Tabulating operations for efficiency purposes is quite an old technique that is very well known by

programmers. When applied to block ciphers, the goal is to tabulate as much as possible the different

operations composing one round. So, with this optimization method, multiple operations are replaced

by a table lookup, leading to a much higher performance. The result of this implementation will

be a round that is composed of:

• the key addition layer (can be performed before or after the table lookups, depending on the cipher

structure);

• selection of slices from the cipher state using shift and mask operations;

• the round transformation, performed through several table lookups;

• the combination of the table lookup results to obtain the updated cipher state.

This approach is trivial to implement in ciphers that follow an SPN structure, particularly in ciphers like AES, whose rounds are based on a substitution layer (S-Box) and a permutation layer with a multiplier box (M-Box). The technique can also be applied to other SPN structures, but it is harder to implement and might not achieve a performance gain that justifies the trade-off. In [42], the authors of the AES cipher already proposed this table-based implementation to efficiently perform the AES operations on 32-bit processors. Also, in [24] a general approach for table-based implementations is proposed, and table implementations for LED [64], PRESENT [23] and Piccolo [77] are presented.

Table Based Reduction

The main problem of table-based implementations is that they involve a trade-off between memory and performance. The tables are usually big, and they need to be either stored in memory or calculated on the fly. Calculating them on the fly limits the performance improvement, because of the added overhead. Conversely, storing them requires a large amount of memory, which can be critical in constrained devices. For example, the proposed AES table-based implementation requires 8 tables (4 for encryption and 4 for decryption), each with an 8-bit input and a 32-bit output. Each table therefore takes 1 KB (256 entries of 4 bytes), which results in an increase of 8 KB in the cipher size. Luckily, when the tables are extracted from an SPN structure with a multiplier box (M-Box), a relation between the T-Boxes can be noticed: each row of the M-Box corresponds to a table, and usually the rows are rotations of each other. This makes it possible to obtain all the T-Boxes from a single T-Box by using shift operations. These shift operations add a small overhead but enable a large reduction in the size of the cipher, which makes it more suitable for constrained devices. With this approach, the size increase for a cipher like AES is reduced from 8 KB to only 2 KB, since only 1 T-Box is required for encryption and another for decryption. This property of the T-Box implementation is also presented in [42].
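The table reduction can be sketched in C on a toy 16-bit round, assuming a linear layer that commutes with rotations (as happens when the rows of the M-Box are rotations of each other). The toy round, the table name T0 and the function names are illustrative assumptions and not the AES tables of [42]. Only one 16-entry table is stored, and the tables for the other nibble positions are recovered by rotating its entries.

```c
#include <stdint.h>

/* PRESENT 4-bit S-Box, used here only as a convenient nonlinear layer. */
static const uint8_t SBOX[16] = {
    0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
    0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2
};

static uint16_t rotl16(uint16_t x, unsigned r) {
    r &= 15;
    return (uint16_t)(((uint32_t)x << r) | ((uint32_t)x >> ((16 - r) & 15)));
}

/* Hypothetical linear layer: XOR of rotations, so it commutes with rotl16. */
static uint16_t linear_layer(uint16_t x) {
    return (uint16_t)(x ^ rotl16(x, 4) ^ rotl16(x, 8));
}

/* Reference round: key addition, S-Box on each nibble, linear layer. */
uint16_t round_reference(uint16_t state, uint16_t round_key) {
    uint16_t s = (uint16_t)(state ^ round_key);
    uint16_t t = 0;
    for (unsigned i = 0; i < 4; i++)
        t |= (uint16_t)(SBOX[(s >> (4 * i)) & 0xF] << (4 * i));
    return linear_layer(t);
}

/* Single stored table (nibble position 0): 16 entries of 2 bytes. */
static uint16_t T0[16];

void build_t0(void) {
    for (unsigned n = 0; n < 16; n++)
        T0[n] = linear_layer(SBOX[n]);
}

/* The tables for positions 1..3 are rotations of T0 by 4, 8 and 12 bits,
 * so they are derived on the fly instead of being stored. */
uint16_t round_one_table(uint16_t state, uint16_t round_key) {
    uint16_t s = (uint16_t)(state ^ round_key);
    return (uint16_t)(T0[s & 0xF] ^
                      rotl16(T0[(s >> 4) & 0xF], 4) ^
                      rotl16(T0[(s >> 8) & 0xF], 8) ^
                      rotl16(T0[(s >> 12) & 0xF], 12));
}
```

In this toy case the storage drops from four tables to one, at the cost of three extra rotations per round, which mirrors the 8 KB to 2 KB reduction described for AES.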

Security Issues

The main security issue of table-based implementations is that they are susceptible to cache timing attacks10, which makes these implementations unreliable on processors with cache.

Bit-slice Implementations

The bit-slicing technique was first introduced by Biham in 1997 [95], who used it to speed up the software performance of DES. The optimization was then used for brute-force key search of DES in the late 1990s. More recently, it has been used to improve the performance of other ciphers. The fastest known software implementation of AES uses the bit-slicing technique and was implemented by Käsper and Schwabe [85] on an Intel Core 2, exploiting its enhanced SIMD architecture. For PRESENT, bit-slice implementations on an Intel Core 2 and other x86 processors have also been presented [54] [24].

The basic concept of bit-slicing is to simulate a hardware implementation in software. To achieve this goal, the entire algorithm is represented as a sequence of logical operations. Also, the state of the cipher changes from n-bit words to one-bit words, which allows n instances of each operation to be computed in parallel. So, in a bit-slice implementation one software logical instruction corresponds to the simultaneous execution of n hardware logic gates, where n is the size of a register. In the bit-slice approach, S-boxes are computed using bitwise logical instructions rather than table lookups. Since no tables need to be stored and the execution time of these instructions is independent of the input and key values, bit-slice implementations require less code size and are generally resistant to timing attacks. Hence, bit-slicing can be efficient when the overall hardware complexity of the target cipher is small and the target processor has many long registers. However, a conversion of the cipher state is required for compatibility with the bit-slice implementation, and this conversion can add an overhead to the cipher performance.
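The idea can be sketched in C with a toy 3-bit nonlinear map evaluated on 32 independent inputs at once. The map itself, the lane count and all function names are assumptions chosen for illustration; they are not taken from any of the ciphers discussed here. The pack and unpack functions correspond to the state conversion mentioned above, and the bitsliced function contains only logical instructions.

```c
#include <stdint.h>

#define LANES 32  /* one input per bit of a 32-bit register */
#define BITS  3   /* width of the toy map */

/* Scalar reference of the toy map, one input at a time. */
uint8_t toy_map_scalar(uint8_t x) {
    uint8_t x0 = x & 1, x1 = (x >> 1) & 1, x2 = (x >> 2) & 1;
    uint8_t y0 = x0 ^ ((x1 ^ 1) & x2);
    uint8_t y1 = x1 ^ ((x2 ^ 1) & x0);
    uint8_t y2 = x2 ^ ((x0 ^ 1) & x1);
    return (uint8_t)(y0 | (y1 << 1) | (y2 << 2));
}

/* Conversion to bitsliced form: slice[j] holds bit j of every input. */
void pack(const uint8_t in[LANES], uint32_t slice[BITS]) {
    for (unsigned j = 0; j < BITS; j++)
        slice[j] = 0;
    for (unsigned l = 0; l < LANES; l++)
        for (unsigned j = 0; j < BITS; j++)
            slice[j] |= (uint32_t)((in[l] >> j) & 1) << l;
}

/* Conversion back to one value per lane. */
void unpack(const uint32_t slice[BITS], uint8_t out[LANES]) {
    for (unsigned l = 0; l < LANES; l++) {
        out[l] = 0;
        for (unsigned j = 0; j < BITS; j++)
            out[l] |= (uint8_t)(((slice[j] >> l) & 1) << j);
    }
}

/* Bitsliced map: each logical instruction processes all 32 lanes. */
void toy_map_bitsliced(uint32_t s[BITS]) {
    uint32_t x0 = s[0], x1 = s[1], x2 = s[2];
    s[0] = x0 ^ (~x1 & x2);
    s[1] = x1 ^ (~x2 & x0);
    s[2] = x2 ^ (~x0 & x1);
}
```

Packing 32 inputs, calling toy_map_bitsliced() and unpacking is expected to give, in each lane, the same value as toy_map_scalar() applied to that lane's input.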

Bitslice in the Lightweight World

To enable the deployment of bit-sliced ciphers on processors with small registers, several ciphers have been proposed in recent years with designs that consider the bit-slice approach from the start. Such ciphers do not involve a conversion of the cipher state and focus on processors with smaller registers. The application of this technique to lightweight ciphers aims to produce ciphers with a small code size, because no S-Box tables are required, and with better performance. Some examples of ciphers that have been designed using the bit-slice approach are Fantomas/Robin [65], RECTANGLE [61], NOEKEON [60], Mysterion [66] and RoadRunneR [76] (the last two inspired by Fantomas/Robin).

10 A cache timing attack [94] is a side-channel attack in which the attacker attempts to compromise a cryptosystem by analyzing the time taken to access the memory.

2.5.2 Code Optimizations

This type of optimization involves small changes to the implementation of the algorithm, in order to improve its performance without changing the structure of the cipher. Such optimizations can be divided into different groups, but for this work the most relevant ones are:

• Code Cleanup;

• Changing architecture orientation (e.g.: 8-bits to 32-bits);

• Changing the size of the S-Box;

• Constants Calculation vs Constants Tables;

• Function Calls vs Function Inlining;

• Store the Cipher State in Registers;

• Partial/Full loop unrolling;

• Reordering of the operations;

Although the main objective of these optimizations is to improve the performance, some of them also make it possible to achieve a smaller code size or a smaller usage of RAM.

Code Cleanup

This type of optimization is practically mandatory when dealing with reference code written in C, because the reference code of the ciphers is kept generic so that it can be executed on any device and with any options. So, in order to get the best performance from the reference code, it is necessary to:

• Avoid Conditional branches: Conditional branches (if statements) are one of the main performance problems of algorithms, because they disrupt the pipelining of the operations and thus make the algorithms slower. So, they should be avoided whenever it is possible.

• Remove Unnecessary Memory Duplication: Because the data is constantly changing in ciphers,

some implementations work with local copies of the data blocks, which are modified and subse-

quently copied back to their original memory positions. This generates an unnecessary overhead

in the performance of the cipher that could be avoided if the data was changed in the original

memory spaces.

• Use Comparison with Zero: In some processors it is better to compare the loop variable with zero, because this comparison is faster and has a specific instruction for it. This approach is not usually adopted because the code becomes more difficult to understand, but changing the conditional statements to compare with zero can lead to a slight improvement in performance.

• Replace multiplications by shifting: As is well known, multiplication is a slow operation. Some processors use a hardware multiplier to improve the performance of multiplication operations; the ARM Cortex-M3 [96] is an example of that. To avoid multiplications when the multiplier is a power of 2, it is possible to shift left by the corresponding exponent. For example, n × 8 is equal to n << 3, since 8 = 2^3. This leads to a better software performance of the multiplication, since shifts are easier and faster to perform.

• Reutilization of results: It is very common for ciphers to contain code like this:

data[0] ^= roundKey[i*2];

data[1] ^= roundKey[i*2+1];

This type of code is inefficient, since it repeats the same operation (the computation of i*2) multiple times when it could be done only once. This can be improved in different ways, depending on the context where this happens. For example, a very simple way to avoid it is to store the result of the repeated operation and then use it as many times as needed. Something like:

temp = i*2;

roundKeyTemp = roundKey + temp;

data[0] ^= roundKeyTemp[0];

data[1] ^= roundKeyTemp[1];




These are all small code improvements that, together, can noticeably improve the performance of a cipher and may also reduce its code size.
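Some of the cleanup rules above can be combined in a short C sketch. The function names and the 8-byte round key layout are assumptions made for illustration. The first pair of functions shows how a conditional branch can be replaced by a mask (using the doubling in GF(2^8) with the AES reduction constant 0x1B as the operation), and the last function applies in-place modification, a shift instead of a multiplication, a result computed only once, and a loop that counts down to zero.

```c
#include <stdint.h>

/* Branching version: doubling in GF(2^8) with reduction constant 0x1B. */
uint8_t xtime_branch(uint8_t x) {
    if (x & 0x80)
        return (uint8_t)((x << 1) ^ 0x1B);
    return (uint8_t)(x << 1);
}

/* Branch-free version: the top bit is expanded into a 0x00/0xFF mask. */
uint8_t xtime_branchless(uint8_t x) {
    uint8_t mask = (uint8_t)(0u - (unsigned)(x >> 7));
    return (uint8_t)((x << 1) ^ (mask & 0x1B));
}

/* In-place key addition on an 8-byte block. The round key offset is
 * computed once with a shift (round * 8) and the loop compares with zero. */
void add_round_key(uint8_t data[8], const uint8_t *round_keys, unsigned round) {
    const uint8_t *rk = round_keys + (round << 3);
    for (unsigned i = 8; i != 0; i--)
        data[i - 1] ^= rk[i - 1];
}
```

Whether each of these changes actually pays off depends on the compiler and on the optimization level, so they should be checked against measurements on the target device.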

Changing Architecture Orientation

Different processors and architectures are used in the IoT world (e.g. 8-bit, 16-bit and 32-bit). The problem when working with ciphers is that they can vary a lot in their architecture orientation. Some algorithms are byte oriented, so variables of 8 bits are used. However, other algorithms are tailored for different architectures, e.g. 16-bit or 32-bit architectures. This can lead to inefficient implementations in some processors. For example, to perform an XOR between a 64-bit block and a 64-bit round key, an 8-bit oriented implementation needs 8 operations (8 × 8 = 64), whereas a 32-bit oriented implementation only needs 2 operations (2 × 32 = 64). On a 32-bit processor, this increases the performance by reducing the number of performed operations.
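The XOR example can be written in C as follows. This is a minimal sketch in which the function names are assumptions; the block and the round key are simply declared with the word size that matches the orientation of each implementation.

```c
#include <stdint.h>

/* 8-bit oriented: eight XOR operations for a 64-bit block. */
void add_key_8(uint8_t block[8], const uint8_t key[8]) {
    for (unsigned i = 0; i < 8; i++)
        block[i] ^= key[i];
}

/* 32-bit oriented: two XOR operations for the same 64-bit block. */
void add_key_32(uint32_t block[2], const uint32_t key[2]) {
    block[0] ^= key[0];
    block[1] ^= key[1];
}
```

Since XOR acts independently on every bit, both functions produce the same 64 bits in memory regardless of the byte order of the processor; only the number of operations differs.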


Changing the size of the S-Box

Most lightweight ciphers use 4-bit S-Boxes, which is good because an S-Box of that size only requires 8 bytes to be stored as a table (16 entries of 4 bits). The problem is that 4-bit table lookups are slow to implement in software, since most processors have an architecture of 8 bits or more. For example, to substitute the bits of a 64-bit block with a 4-bit S-Box it is necessary to perform 16 memory accesses, masking the bits before each substitution and putting them back in the block afterwards. This can lead to a slow cipher. One easy way to improve the performance of these ciphers is to expand the S-Box size to 8 bits. The S-Box then requires 256 bytes to be stored, but this can lead to a very good improvement in performance (in the previous example, an 8-bit S-Box only requires 8 memory accesses). As an example, in the reference implementation of CLEFIA [97] the 4-bit S-Boxes have been replaced by 8-bit S-Boxes for better software performance.
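A minimal C sketch of this expansion is given below, using the PRESENT 4-bit S-Box as the example table. The table and function names are assumptions, and the 64-bit state is held in a single uint64_t only to keep the sketch short. The 8-bit table is built by applying the 4-bit S-Box to both nibbles of every possible byte.

```c
#include <stdint.h>

/* PRESENT 4-bit S-Box. */
static const uint8_t SBOX4[16] = {
    0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
    0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2
};

/* Expanded 8-bit table: 256 bytes, two nibbles substituted per lookup. */
static uint8_t SBOX8[256];

void build_sbox8(void) {
    for (unsigned x = 0; x < 256; x++)
        SBOX8[x] = (uint8_t)((SBOX4[x >> 4] << 4) | SBOX4[x & 0xF]);
}

/* 4-bit version: 16 lookups for a 64-bit state. */
uint64_t sub_nibbles(uint64_t s) {
    uint64_t out = 0;
    for (unsigned i = 0; i < 16; i++)
        out |= (uint64_t)SBOX4[(s >> (4 * i)) & 0xF] << (4 * i);
    return out;
}

/* 8-bit version: 8 lookups for the same state. */
uint64_t sub_bytes(uint64_t s) {
    uint64_t out = 0;
    for (unsigned i = 0; i < 8; i++)
        out |= (uint64_t)SBOX8[(s >> (8 * i)) & 0xFF] << (8 * i);
    return out;
}
```

After build_sbox8() has been called, both functions are expected to return the same result for any state, with half the number of memory accesses in the 8-bit version.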

Constants Calculation vs Constants Tables

Ciphers usually require round constants for the key scheduling or for the encryption/decryption. These round constants can be the round number or some number that changes every round. Such constants can be computed on the fly or stored in tables, and the better option depends on the cipher implementation. Calculating the constants on the fly adds an overhead to the cipher execution time, whereas storing them in a table increases the data size. Therefore, to select the best option, the trade-off between computing these constants on the fly and storing them in a table must be evaluated, taking into account the characteristics of the target device.
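The two options can be compared with a small C sketch that uses the ten round constants of the AES-128 key schedule as the example. The function names are assumptions. The table version spends 10 bytes of data, while the on-the-fly version spends a few instructions per round.

```c
#include <stdint.h>

/* Option 1: round constants stored in a table (10 bytes of data). */
static const uint8_t RCON_TABLE[10] = {
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36
};

/* Returns the constant of a round numbered from 1 to 10. */
uint8_t rcon_from_table(unsigned round) {
    return RCON_TABLE[round - 1];
}

/* Option 2: next constant computed from the previous one, by doubling
 * in GF(2^8) with the AES reduction constant 0x1B. */
uint8_t rcon_next(uint8_t rc) {
    return (uint8_t)((rc << 1) ^ ((rc & 0x80) ? 0x1B : 0x00));
}
```

Starting from 0x01 and calling rcon_next() once per round reproduces the table, so the choice between the two options only affects the data size and the execution time.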

Function Calls vs Function Inlining

In [52], the finalist ciphers of the AES contest were optimized with memory in focus, by reducing the code size using functions to replace macros and other code repetitions. The problem with functions is that using many function calls adds the overhead of each call to the cipher's execution time and thus makes the cipher slower. So, one should always look for a good trade-off between having functions and inlining them in the code. If the cipher does not reuse any code, then using functions is just a waste of performance. However, if the algorithm reuses a lot of code, then using function calls can help to reduce the code size. This kind of optimization always depends on the target device and on its restrictions. If the memory is small, an implementation with function calls should be used. Otherwise, function inlining can be used to achieve a better performance.

Store Cipher State in Registers

A cipher can be seen as a set of operations that are performed over a state (the data block). Since this state is used in most of the operations, a good way to increase the performance of a cipher is to keep it local to the processor, in registers, therefore requiring fewer memory accesses. The state of a lightweight cipher is usually 64 or 128 bits long. So, in a 32-bit processor two or four registers must be used to hold the state value, which is not much for most IoT processors. When implementing ciphers in C, the best way to achieve this is to store the state of the cipher in a set of variables with the length of the registers, declared using the “register”11 keyword. The compiler can then choose whether or not to keep such variables in registers, but if the code is carefully designed (i.e. it does not contain a lot of variables, includes only simple operations and reuses the variables already declared), it is almost certain that the state variables will be kept in registers.

This technique can lead to a very good performance increase for most ciphers, but it is in ciphers with a bit-slice or ARX design that this optimization should show the best results, since these ciphers are mainly composed of simple operations between variables and do not perform table lookups.
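As an illustration, the sketch below keeps a 64-bit state in two 32-bit local variables for the whole encryption, using a SPECK-like ARX round (rotations by 8 and 3). The function names, the word order inside the block and the already expanded round key array rk are assumptions, and the key schedule is left out. The sketch relies on plain local variables, since whether they stay in registers is ultimately decided by the compiler.

```c
#include <stdint.h>

#define ROUNDS 27

static inline uint32_t rotl32(uint32_t x, unsigned r) {
    return (x << r) | (x >> (32 - r));
}

static inline uint32_t rotr32(uint32_t x, unsigned r) {
    return (x >> r) | (x << (32 - r));
}

/* The state is loaded once, updated in two local variables, and written
 * back once, so no memory access to the block happens inside the loop. */
void encrypt_block(uint32_t block[2], const uint32_t rk[ROUNDS]) {
    uint32_t y = block[0], x = block[1];
    for (unsigned i = 0; i < ROUNDS; i++) {
        x = (rotr32(x, 8) + y) ^ rk[i];
        y = rotl32(y, 3) ^ x;
    }
    block[0] = y;
    block[1] = x;
}

/* Inverse rounds applied in reverse order, with the same state handling. */
void decrypt_block(uint32_t block[2], const uint32_t rk[ROUNDS]) {
    uint32_t y = block[0], x = block[1];
    for (unsigned i = ROUNDS; i != 0; i--) {
        y = rotr32(y ^ x, 3);
        x = rotl32((x ^ rk[i - 1]) - y, 8);
    }
    block[0] = y;
    block[1] = x;
}
```

With only two state words, one round key word and a loop counter live at any time, the whole working set fits comfortably in the general purpose registers of a 32-bit processor.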

Loop unrolling

Loop unrolling [98] is a technique used by programmers to reduce data dependencies in loops and, with that, reduce pipelining issues, leading to an increase in performance. This can be achieved by either removing loops or reducing the number of loop iterations by repeating the code manually. When applied to ciphers, the unroll can be partial or full. It is partial if the number of loop iterations is reduced but the loop is not removed, or when only a layer (e.g. the permutation layer of a cipher like PRESENT [23]) is unrolled and not the full cipher. It is full if the whole cipher is unrolled, meaning that all the layers and rounds have been unrolled.

Loop unrolling leads to an increase in the cipher code size. The greater the unroll, the greater the code size impact. This increase in code size can be compensated by an increase in performance, since the overhead of the loop is eliminated and the pipelining of instructions is improved. Therefore, to successfully use unrolling, a good trade-off must be found.
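A partial unroll of the round loop can be sketched in C as follows. The toy round function, the number of rounds and the function names are assumptions made only for illustration. The second version executes four rounds per loop iteration, so the loop overhead (counter update, comparison and branch) is paid 6 times instead of 24, at the cost of a larger loop body.

```c
#include <stdint.h>

#define ROUNDS 24  /* must be a multiple of the unroll factor (4) */

/* Hypothetical toy round: key addition followed by a rotation. */
static inline uint32_t toy_round(uint32_t x, uint32_t k) {
    x += k;
    return (x << 7) | (x >> 25);
}

/* Rolled version: one round per loop iteration. */
uint32_t run_rolled(uint32_t x, const uint32_t rk[ROUNDS]) {
    for (unsigned i = 0; i < ROUNDS; i++)
        x = toy_round(x, rk[i]);
    return x;
}

/* Partially unrolled version: four rounds per loop iteration. */
uint32_t run_unrolled_by_4(uint32_t x, const uint32_t rk[ROUNDS]) {
    for (unsigned i = 0; i < ROUNDS; i += 4) {
        x = toy_round(x, rk[i]);
        x = toy_round(x, rk[i + 1]);
        x = toy_round(x, rk[i + 2]);
        x = toy_round(x, rk[i + 3]);
    }
    return x;
}
```

A full unroll would repeat the round call 24 times and remove the loop altogether, which is the point where the code size impact is the largest.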

Reordering of the Operations

The objective of this optimization is to reorder the operations in the algorithms, exploiting the characteristics of the microarchitecture of the processor, in order to achieve a better performance with no significant changes in the cipher structure. For example, rearranging the operations can be used to reduce the data dependencies between consecutive instructions.

A good example of this type of optimization is the use of the barrel shifter in the ARM Cortex-M3

architecture [96], which is placed before the ALU in the micro-architecture. So, instead of doing a shift

operation on a variable and saving the result to a register or memory position, the shift operation can be

performed only when that variable is used in another operation, like an ADD, SUB, etc. This approach

achieves the same result but reduces the number of operations that are performed, therefore reducing

the execution time.
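At the C level, this amounts to writing the code so that the shifted value is consumed directly by the following operation instead of being stored first. The sketch below uses hypothetical function names, and the instruction named in the comment is only the expected mapping; the code actually generated depends on the compiler and on the optimization level, and an optimizing compiler may already perform this folding by itself.

```c
#include <stdint.h>

/* The shift result is stored first and used afterwards: written as two
 * separate steps in the source code. */
uint32_t mix_separate(uint32_t a, uint32_t b) {
    uint32_t shifted = b << 3;
    return a + shifted;
}

/* The shift is consumed directly as the second operand of the addition.
 * On the Cortex-M3 this form is expected to map to a single instruction
 * with a shifted operand (for instance ADD r0, r0, r1, LSL #3). */
uint32_t mix_folded(uint32_t a, uint32_t b) {
    return a + (b << 3);
}
```

Both functions compute the same value, so the reordering does not change the cipher; it only gives the compiler a form that matches the placement of the barrel shifter before the ALU.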

11 The “register” keyword hints to the compiler that the referred variables are widely used, so that they can be placed in registers and kept there throughout the execution.

2.6 Summary

As presented, lightweight cryptography has become a very active research topic in the last few years. This has motivated a lot of research work and several different contributions from various authors. Still, there are some aspects that must be further investigated.

One of the best reviews on lightweight encryption is presented in [46], which covers a lot of algorithms and evaluates them in both hardware and software. However, it does not address the issues of algorithm and code optimizations, focusing only on the basic implementation of the ciphers.

Another important issue concerns software implementations, since in most cases the characteristics of the processors are not exploited to achieve efficient implementations. For example, ARM processors have wider registers than 8-bit and 16-bit processors. Moreover, the energy consumed by the ARM Cortex-M family of processors is quite low and similar to the energy consumed by such lower performance processors. However, the research that has been conducted using such processors is still quite limited. The work presented in [51] is one of the first attempts to evaluate and optimize algorithms targeting ARM processors, but it only considers 2 algorithms. Also, FELICS [90] is one of the first tools supporting this architecture.

Another problem relates to the excessive focus on SPN ciphers, given that this structure is one of the strongest in terms of security. Hence, most of the presented analyses and optimizations address these ciphers. However, ciphers with a Feistel Network or ARX structure have very good characteristics for implementations in lightweight devices, and studies considering these types of ciphers are lacking.

Finally, a big issue in the lightweight world is that the standard lightweight ciphers are mainly hardware oriented. PRESENT [23] is completely hardware oriented and CLEFIA is hardware oriented, although it can have a reasonable performance in software. Nonetheless, not many research works address software implementations of CLEFIA. Thus, there is a clear need for further study of software oriented ciphers.


Chapter 3

Proposed Improvements to Lightweight Encryption

This chapter presents all the work performed in the scope of this thesis aiming at the development of

optimized versions of the most relevant lightweight encryption algorithms targeting software implemen-

tations in constrained devices based on the ARM Cortex-M3 architecture.

First, the computing platform and the programming environment considered in this study are introduced. Then, the set of ciphers addressed by this research is presented and the criteria adopted for its selection are explained. An analysis of the best configurations, in terms of key size and block size, is then discussed, taking into consideration the characteristics of the adopted constrained processor. Finally, the proposed optimizations to the selected ciphers are presented and discussed regarding their impact on the memory footprint and execution time.

3.1 Target Platform

To fairly assess the advantages offered by the proposed cipher optimizations for practical IoT applications and products, it is mandatory to conduct a thorough experimental evaluation procedure involving a hardware platform containing a constrained processor widely used in the IoT world. ARM is one of the most popular designers of constrained processors and the dominant player in the IoT world. Furthermore, its 32-bit Cortex-M3 processors have been central to the development of the most recent and cutting-edge IoT products across several different market segments.

The main features that make this a revolutionary processor for the IoT and for the development of

embedded applications are the following:

• High Performance, because of its RISC-based [99] Harvard architecture, which allows data and instruction fetches to be performed in parallel, and the Thumb-2 instruction set, which takes the best of both the ARM and Thumb instruction sets by combining them, eliminating the overhead of switching between them and achieving a higher code density with reduced memory requirements and higher performance.

• Support for features like bit banding and hardware multiplication, which are performed in a single clock cycle.

• A barrel shifter placed before the ALU (see D.9), which allows register values to be shifted before an arithmetic or logical operation with no overhead.

• Advanced Interrupt-Handling, because of its Nested Vectored Interrupt Controller (NVIC) and the automatic load/store of registers in the stack that is performed by hardware, which enables interrupt and exception handlers to be fully developed in C, avoiding the need for assembly routines and the overhead they generate.

• Low Power Consumption, because of its low gate count and its sleep modes, which enable it to have a power consumption similar to that of 8-bit and 16-bit processors.

• Large Debug Support, because of its many debug features and different interfaces, which decisively help programmers in the development of new applications and reduce their time-to-market.

Given all these very important characteristics, and also the fact that the Cortex-M3 processors are supported by the FELICS framework [90] (which facilitates the evaluation process), the presented research work is focused on the Cortex-M3 processor. A more detailed overview of this processor is provided in Appendix D.

The hardware platform used to conduct the experimental procedures is the Arduino Due board [25], which is one of the most popular boards of the Arduino Project [11]. This board includes an Atmel SAM3X8E microcontroller that is powered by an ARM Cortex-M3 revision 2.0 processor with an MPU, running at up to 84 MHz. It has 2 flash memories of 256 Kbytes (512 Kbytes in total) and 96 Kbytes of SRAM (64 + 32).

Implementation Setup

The FELICS tool [90] was used to deploy the considered set of ciphers to this board, as well as to

perform its evaluation in terms of execution time (in clock cycles) and code size.

Both the standard and the proposed optimized versions of such ciphers were implemented in the C

programming language and compiled using the FELICS scripts for the ARM Cortex-M3 processor, which

use the GNU C/C++ Compiler (gcc) from the GNU ARM Embedded Toolchain [100].

Such compilation procedure included the libsam3x and libc libraries available in the FELICS frame-

work and the four optimization1 levels provided by the gcc [101]:

• -O1: The most basic optimization level, which tries to produce faster and smaller code without taking much compilation time.

• -O2: A step up from level -O1, in which the compiler attempts to increase code performance without compromising on size and without taking too much compilation time.

• -O3: The highest level of optimization possible, which enables optimizations that are expensive both in terms of compile time and memory usage.

• -Os: Optimizes code for size, which can be useful for platforms that have extremely limited storage space.

1 The optimization flags do not invalidate the optimizations made by hand to the code, like the ones described in Section 2.5.

3.2 Considered Ciphers

Despite the vast number of ciphers that have been proposed in the cryptographic literature and that exist in implementations and products worldwide, the study herein presented focuses on only eight distinct lightweight ciphers. This set of algorithms was selected based on the following criteria. Since a lot of work has already been done and published in this field, our study mostly addresses ciphers that have not been widely studied, not only to investigate their potential for optimizations but also to provide new contributions to the state of the art. Due to the enormous diversity of applications and constrained devices that exist in the IoT world, we chose to include in our study algorithms with different designs, small block sizes (typically 64 bits, at most 128 bits), key sizes between 80 and 128 bits, and a reduced number of rounds, which are all very important features to reduce the power consumption and obtain efficient implementations in software. Also, we decided not to consider ciphers with known security vulnerabilities.

Based on these criteria, the chosen ciphers were the following:

• AES, PRESENT and CLEFIA: These 3 ciphers were mandatory. AES is the most widely used

cipher due to having been established as the standard for encryption in 2002, which is why it is

also used by many IoT devices, despite the fact that it is not a lightweight cipher. PRESENT and

CLEFIA are two ciphers that have also been standardized, but as lightweight ciphers. Therefore,

it was mandatory to include these three ciphers in this study so that they can be used as anchors

when evaluating the other ciphers.

• NOEKEON, RECTANGLE and RoadRunneR: These 3 ciphers share a characteristic: their design makes use of the bit-slice technique. Bit-slice is also showing promising results in the lightweight world for the following reasons: it leads to a reduced code size, since no tables are stored in memory; most operations are performed in place using simple logical operators and rotations for permutations, which makes them very efficient in software; and these ciphers have a low number of rounds, which can lead to better performance and lower energy consumption.

• SPARX: This is one of the few ciphers with an ARX design that has not yet been broken. It uses ARX-Boxes as S-Boxes, taking advantage of the ARX design, i.e. light operations, fast speed and small code size. Because of these characteristics, SPARX is a very promising cipher that can achieve a good performance in software implementations. Moreover, it has received little attention in the state of the art.

• SPECK: This cipher is discussed in almost all comparative analyses that have been published in the literature due to its ultra-lightweight characteristics. It is a very simple and small ARX cipher that can achieve one of the best performances in software, but that has not yet managed to find a place in the lightweight world because of its origin (NSA). Thus, it was chosen because it may be helpful to see how well this cipher performs when optimized.

Table 3.1 presents an overview of the cryptographic properties of these algorithms.

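Cipher      Target             Structure  Block Size (bits)  Key Size (bits)  Rounds
AES         Hardware/Software  SPN        128                128              10
AES         Hardware/Software  SPN        128                192              12
AES         Hardware/Software  SPN        128                256              14
CLEFIA      Hardware/Software  GFN        128                128              18
CLEFIA      Hardware/Software  GFN        128                192              22
CLEFIA      Hardware/Software  GFN        128                256              26
NOEKEON     Hardware/Software  SPN        128                128              16
PRESENT     Hardware           SPN        64                 80               31
PRESENT     Hardware           SPN        64                 128              31
RECTANGLE   Hardware/Software  SPN        64                 80               25
RECTANGLE   Hardware/Software  SPN        64                 128              25
RoadRunneR  Software           FN         64                 80               10
RoadRunneR  Software           FN         64                 128              12
SPARX       Hardware/Software  ARX        64                 128              24
SPARX       Hardware/Software  ARX        128                128              32
SPARX       Hardware/Software  ARX        128                256              40
SPECK       Software           ARX        32                 64               22
SPECK       Software           ARX        48                 72               22
SPECK       Software           ARX        48                 96               23
SPECK       Software           ARX        64                 96               26
SPECK       Software           ARX        64                 128              27
SPECK       Software           ARX        96                 96               28
SPECK       Software           ARX        96                 144              29
SPECK       Software           ARX        128                128              32
SPECK       Software           ARX        128                192              33
SPECK       Software           ARX        128                256              34

Table 3.1: Block Ciphers Characteristics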

3.3 Ciphers Configuration Selection

In order to identify the best configuration for each cipher when implemented in the target platform, several different setups using distinct key sizes and block sizes were evaluated using the FELICS framework. This analysis focused on scenario 0 of FELICS, since it covers both the encryption and the decryption of the test vectors for a given cipher. The compilation procedure considered the -O1 optimization level, due to being the simplest (fastest) one. Only block sizes of 64 and 128 bits and key sizes of 80 and 128 bits were tested, since they are the most used in lightweight ciphers.

In the following, the results of such analysis are presented and discussed.


3.3.1 Performance Evaluation for Different Key Sizes

In what concerns the key size, the performed analysis focused only on the RECTANGLE and RoadRunneR ciphers. This results from the fact that, together with PRESENT, they are the only ciphers supporting both the 80-bit and 128-bit key sizes among the whole set of ciphers considered in this study.

Regarding PRESENT, only the 80-bit key size was considered, following its authors' recommendation. According to [23], for the lightweight world the 80-bit key size provides enough security and demands less processing power.

Similarly, for CLEFIA [74], SPECK [62][63] and SPARX [78], which also support different key sizes, a single key size was chosen (128 bits), due to being the only key size that they support in the considered 80 to 128-bit range.

RECTANGLE 80 vs RECTANGLE 128

The best configuration for the RECTANGLE [61] cipher was assessed using optimized implementations, produced by the authors, available on FELICS for the two key sizes. This source code is organized in three main functions that implement the key scheduler and the encryption and decryption algorithms. In these optimizations, all the code of these three functions is inlined.

Hence, the resulting compiled code for the two implementations is quite similar. In fact, the only

difference concerns the implementation of the key scheduler, as a result of the distinct key sizes. This

can be seen in Figure 3.1, which depicts the obtained implementation results.

The code size of the encryption and decryption functions is the same, since this cipher uses the same block size (64 bits) and number of rounds (25) for both configurations. For the same reasons, the amount of RAM required to store the round keys is also the same.

Figure 3.1: RECTANGLE Key Comparison - Code Size

Figure 3.2: RECTANGLE Key Comparison - Execution Time

In what concerns the performance, it was observed that the 128-bit version provides a slightly better execution time for the key scheduler. Conversely, the execution time of the encrypt and decrypt procedures is equal, as can be seen in Figure 3.2.

The observed performance improvement results from a more efficient use of the 32-bit registers available in the ARM Cortex-M3 processor. For 128-bit keys, the variables used to implement the key scheduler can make use of the full length of the processor registers, since 128 is a multiple of 32. However, the implementation of the key scheduler for 80-bit keys must involve 16-bit variables, which results in extra code size and higher execution times. Therefore, the 128-bit key version performs better than the 80-bit version.

In conclusion, the use of RECTANGLE with a key size of 128 bits provides better performance results and reduced code size. In addition, it increases the security, due to the use of a stronger (longer) key. The only disadvantage is an increase of 48 bits (6 bytes) in the RAM requirements to store the key, which is practically insignificant.

RoadRunneR 80 vs RoadRunneR 128

The RoadRunneR [76] analysis considered an implementation, available on FELICS, with some minor changes in the key scheduler and a modification to the encryption and decryption code to enable the use of a single key scheduler for both operations. Despite these modifications, the tested source code remained organized in several different functions, in order to reduce the size of the compiled code. However, it became possible to use the same source code to implement the two cipher configurations. As a result, the same code size was obtained for both implementations, as shown in Figure 3.3.

Although the code size is the same for both configurations, the RAM requirements are slightly different. In fact, the 128-bit configuration requires an extra 30 bytes of RAM. This is mostly owed to the larger size of the key (6 bytes) and to the two additional rounds that are required to perform the encryption when a 128-bit key is used, each of which requires an additional 12 bytes to store the round keys (6 + 2 × 12 = 30 bytes).

Figure 3.3: RoadRunneR Key Comparison - Code Size

Figure 3.4: RoadRunneR Key Comparison - Execution Time

In terms of execution time, the key schedule of both versions is almost identical, as can be seen in Figure 3.4. However, the encryption/decryption execution time is 19% higher in the 128-bit version, due to the computation of the two extra rounds.

Given these results, it can be concluded that it is preferable to use the RoadRunneR cipher with a 128-bit key, since it provides a higher security and does not perform much worse than the 80-bit version.

3.3.2 Performance Evaluation for Different Block Sizes

In the considered set of ciphers, only SPARX and SPECK support different block sizes. Consequently,

the following analysis is focused on the implementations of these two ciphers using block sizes with 64

and 128 bits.

SPARX 64 vs SPARX 128


The SPARX [78] tests considered the reference implementations provided in the FELICS framework. These implementations share the same structure and are based on the same set of functions and constants. So, the only code that changes from one implementation to the other is the code that supports the different block sizes.

In SPARX, an increase of the block size results in an increased number of rounds, which raises the RAM requirements. In addition, a different key permutation is performed by the key scheduler. The results obtained for the SPARX cipher are therefore not very surprising.

As can be seen in Figure 3.5, the code size of the 128-bit version is far greater (almost double) than that of the 64-bit version. This is a direct result of the increased complexity caused by doubling the block size: in the 128-bit version the block is split into 64-bit parts, which require twice as many operations as the 32-bit parts of the 64-bit block. Curiously, the key scheduler of the 128-bit version has a smaller code size than that of the 64-bit version, probably because its key scheduler operations are easier to perform on a 32-bit processor. Also, the RAM requirements are higher for the 128-bit block size, since larger blocks require round keys with more bits and more state words.

Figure 3.5: SPARX Block Comparison - Code Size

Figure 3.6: SPARX Block Comparison - Execution Time

Regarding performance, the execution time of the 128-bit version is almost three times higher than that of the 64-bit version, as shown in Figure 3.6. This is because the encryption/decryption of a 128-bit block requires 32 rounds and twice as many operations, whilst the same computations for a 64-bit block involve only 24 rounds. This ratio also holds for the execution time of the two key schedulers, since fewer rounds are performed for the 64-bit block and thus fewer round keys need to be computed.

According to these results, the 64-bit version of the SPARX cipher was chosen, since it provides better performance (even if it has to run twice to encrypt 128 bits) and smaller code size.
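A rough back-of-the-envelope model, assuming that the cost scales only with the number of rounds and with the amount of work per round (both taken from the paragraph above), is consistent with the "almost three times" observation and with the choice of the 64-bit block. The symbols T_{64} and T_{128} are introduced here for illustration only; the measured values in Figure 3.6 take precedence over this estimate.

```latex
\frac{T_{128}}{T_{64}} \approx \frac{32 \text{ rounds}}{24 \text{ rounds}} \times 2
  = \frac{8}{3} \approx 2.67,
\qquad
\underbrace{2\,T_{64}}_{\text{two 64-bit blocks}} \;<\; \underbrace{T_{128}}_{\text{one 128-bit block}} \approx 2.67\,T_{64}
```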

SPECK 64 vs SPECK 128

For the SPECK cipher, the comparative analysis was based on the algorithms proposed in [62]. In SPECK, an increase of the block size results in an increase of the size of the round keys (32 bits for 64-bit blocks, 64 bits for 128-bit blocks) and in an increase of the number of rounds. The number of operations performed also changes, because 32-bit state words are used for the 64-bit block version, while 64-bit state words are used for the 128-bit block version, which on a 32-bit processor requires twice as many operations to modify these state words.

Figure 3.7: SPECK Block Comparison - Code Size

Figure 3.8: SPECK Block Comparison - Execution Time

The implementation results obtained for the two cipher configurations are shown in Figures 3.7 and 3.8. In terms of code size, the 128-bit version requires 320 bytes, more than double the 144 bytes of the 64-bit version. The reason for this is the 32-bit datapath of the ARM Cortex-M3 architecture. In such a datapath, the 64-bit version of SPECK can be computed efficiently because all the data (including the state words and the round keys) are 32 bits wide and the involved operations work on 32 bits. However, the 128-bit version of this cipher works with 64-bit data. Consequently, more than twice as many instructions are required to perform all the necessary computations. Naturally, the execution time is also greatly influenced by this large difference in the number of instructions required to implement the cipher and its key schedule. The key schedule and the encryption/decryption procedures of the 128-bit version require more than twice as much time as those of the 64-bit version.

The RAM requirements are also influenced by the block size, since the 128-bit version involves not only bigger state words and round keys but also the computation of five additional rounds.

Based on these results, the block size chosen for the SPECK cipher was 64 bits.
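The effect of the 32-bit datapath can be illustrated with a minimal sketch of one rotate-add-XOR step written for 32-bit words and for 64-bit words split into two 32-bit halves. The type `u64_pair` and the functions `step32` and `step64` are hypothetical names introduced for this illustration; they are not taken from the FELICS sources, and the rotation amount of 8 is only an example.

```c
#include <stdint.h>

/* A 64-bit word held as two 32-bit halves, as a 32-bit core sees it. */
typedef struct { uint32_t lo, hi; } u64_pair;

/* Rotate-add-XOR on 32-bit words: one rotation, one addition, one XOR. */
static inline uint32_t step32(uint32_t x, uint32_t y, uint32_t k) {
    x = (x >> 8) | (x << 24);          /* rotate right by 8 */
    return (x + y) ^ k;
}

/* The same step on 64-bit words using 32-bit operations only:
 * every operation is performed twice, plus the carry handling. */
static inline u64_pair step64(u64_pair x, u64_pair y, u64_pair k) {
    u64_pair r, s;
    r.lo = (x.lo >> 8) | (x.hi << 24); /* rotation crosses the halves */
    r.hi = (x.hi >> 8) | (x.lo << 24);
    s.lo = r.lo + y.lo;                /* low half of the addition */
    s.hi = r.hi + y.hi + (s.lo < r.lo);/* high half plus carry */
    s.lo ^= k.lo;                      /* XOR on both halves */
    s.hi ^= k.hi;
    return s;
}
```

Counting the C-level operations gives roughly twice as many for `step64` as for `step32`, with the carry propagation as additional work, which matches the "more than twice" figure reported above.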

3.4 Optimized Implementations

This section discusses the optimizations that were applied to all the considered ciphers.

For each cipher, one or more optimization techniques were implemented in order to obtain the best possible implementation for an ARM Cortex-M3 processor. Although the ultimate goal of such an optimization procedure would be to improve the cipher implementations in terms of performance, memory usage and energy consumption, it is not possible to achieve an implementation that is optimal in all of these dimensions simultaneously. Consequently, the following priority ranking was adopted in the considered optimization procedure:

1. Execution Time

2. Code Size

3. RAM Footprint


4. Energy Consumption

Following this priority ranking, several different implementations were developed for each cipher, each one offering a distinct optimization level. Table B.1 in Appendix B lists the optimization techniques that were considered in each one of those implementations. All the developed optimized cipher implementations are available in the GitHub repository of this thesis (https://github.com/jcsf/LightweightBlockCiphers.Thesis.2018) and shall be made available to the FELICS project.

3.4.1 Algorithm Optimizations

As mentioned in Section 2.5.1, the two main algorithm optimizations are table-based implementations (T-Box) and bit-slice implementations.

T-Box Optimization Implementation

As explained before, table-based optimizations have been known for a long time, and many papers have presented this solution to increase the performance of ciphers. T-Box implementations trade memory space (code size) for performance: the memory occupied by the cipher increases, because of the size of the tables, in exchange for a faster execution time. The ciphers in this work that were chosen to be optimized with T-Boxes are AES [42] and CLEFIA [74]. The reason for selecting these two ciphers is their structure. AES is an SPN cipher with a substitution layer based on an S-Box table and a permutation layer based on shift rows and mix columns, the latter of which can be written as a matrix multiplication. The CLEFIA cipher is a GFN with two Feistel Functions that have an SPN structure similar to that of AES: each Feistel Function has a substitution layer with two S-Boxes and a permutation layer that also uses a matrix multiplication. The two S-Boxes are used in both Feistel Functions but are applied in a different order, and the multiplication matrix is different for each Feistel Function.

AES T-Box Implementation

The AES T-Box implementation is well known; it was presented in the AES article [42]. Its main drawback is its vulnerability to cache-timing attacks, which are not an issue on cacheless processors. The T-Box implementation condenses the substitution and permutation layers into a single set of table lookups, as explained before. A single column of the round output, e, is expressed in terms of the bytes of the round input, a, where a_{i,j} represents the byte of a in row i, column j. The AES round is thus described by a key addition, k, a substitution layer, S, and a permutation layer composed of a shift rows, with offsets C_i, and a mix columns represented by a matrix multiplication. For an AES round we have the following:

\[
\begin{bmatrix} e_{0,j} \\ e_{1,j} \\ e_{2,j} \\ e_{3,j} \end{bmatrix}
=
\begin{bmatrix}
\mathtt{0x2} & \mathtt{0x3} & \mathtt{0x1} & \mathtt{0x1} \\
\mathtt{0x1} & \mathtt{0x2} & \mathtt{0x3} & \mathtt{0x1} \\
\mathtt{0x1} & \mathtt{0x1} & \mathtt{0x2} & \mathtt{0x3} \\
\mathtt{0x3} & \mathtt{0x1} & \mathtt{0x1} & \mathtt{0x2}
\end{bmatrix}
\begin{bmatrix} S[a_{0,j}] \\ S[a_{1,j-C_1}] \\ S[a_{2,j-C_2}] \\ S[a_{3,j-C_3}] \end{bmatrix}
\oplus
\begin{bmatrix} k_{0,j} \\ k_{1,j} \\ k_{2,j} \\ k_{3,j} \end{bmatrix}
\]


By combining the substitution operation with the mix columns we obtain:

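\[
T_0[a] = \begin{bmatrix} \mathtt{0x2} \times S[a] \\ S[a] \\ S[a] \\ \mathtt{0x3} \times S[a] \end{bmatrix},\quad
T_1[a] = \begin{bmatrix} \mathtt{0x3} \times S[a] \\ \mathtt{0x2} \times S[a] \\ S[a] \\ S[a] \end{bmatrix},\quad
T_2[a] = \begin{bmatrix} S[a] \\ \mathtt{0x3} \times S[a] \\ \mathtt{0x2} \times S[a] \\ S[a] \end{bmatrix},\quad
T_3[a] = \begin{bmatrix} S[a] \\ S[a] \\ \mathtt{0x3} \times S[a] \\ \mathtt{0x2} \times S[a] \end{bmatrix}
\]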

This results in 4 tables with 256 elements each, where each element is 4 bytes long. Each table therefore occupies 1 KB, for a total of 4 KB. The round transformation can then be expressed as:

\[
e_j = T_0[a_{0,j}] \oplus T_1[a_{1,j-C_1}] \oplus T_2[a_{2,j-C_2}] \oplus T_3[a_{3,j-C_3}] \oplus k_j
\]

The same approach can be applied to decryption.

T-Box Reduction

As seen above, the values in the second, third and fourth tables are just rotations of those in the first. This allows the number of stored tables to be reduced to just one, with the other tables being obtained using simple rotation operations. The formulation can be seen below, where || represents the concatenation of the bytes:

Resulting in the following, where RotLeft(T, n) is the rotation of the value T by n bits to the left:

\[
T_1[a] = \mathrm{RotLeft}(T_0[a], 8),\quad
T_2[a] = \mathrm{RotLeft}(T_0[a], 16),\quad
T_3[a] = \mathrm{RotLeft}(T_0[a], 24)
\]

With:

\[
e_j = T_0[a_{0,j}] \oplus \mathrm{RotLeft}(T_0[a_{1,j-C_1}], 8) \oplus \mathrm{RotLeft}(T_0[a_{2,j-C_2}], 16) \oplus \mathrm{RotLeft}(T_0[a_{3,j-C_3}], 24) \oplus k_j
\]
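The reduced T-Box lookup can be sketched in C as follows. This is an illustration only, not the code of the thesis implementations. It assumes that each table entry packs row 0 of the column into the least significant byte, which is the packing under which a left rotation by 8 bits maps T0 onto T1. The S-box is computed from its definition (inversion in GF(2^8) followed by the affine map) only to keep the example self-contained; a real implementation would store the tables as constants. All function names are hypothetical.

```c
#include <stdint.h>

/* Multiply by 0x02 in GF(2^8) with the AES polynomial 0x11B. */
static uint8_t xtime(uint8_t v) {
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1B : 0x00));
}

/* Generic GF(2^8) multiplication (shift-and-add). */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return r;
}

static uint8_t rotl8(uint8_t v, unsigned n) {
    return (uint8_t)((v << n) | (v >> (8 - n)));
}

/* AES S-box from its definition: inverse (v^254), then affine map. */
static uint8_t aes_sbox(uint8_t v) {
    uint8_t inv = 0;
    if (v) {
        inv = 1;
        for (int i = 0; i < 254; i++) inv = gf_mul(inv, v);
    }
    return (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                     rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
}

static uint32_t rotl32(uint32_t v, unsigned n) {
    return (v << n) | (v >> (32 - n));   /* n must be in 1..31 */
}

static uint32_t T0[256];                 /* the only stored table (1 KB) */

/* T0[a] = (2*S[a], S[a], S[a], 3*S[a]), row 0 in the low byte. */
static void build_t0(void) {
    for (int a = 0; a < 256; a++) {
        uint8_t s = aes_sbox((uint8_t)a);
        uint8_t s2 = xtime(s), s3 = (uint8_t)(s2 ^ s);
        T0[a] = (uint32_t)s2 | ((uint32_t)s << 8) |
                ((uint32_t)s << 16) | ((uint32_t)s3 << 24);
    }
}

/* One output column e_j; in[i] is the already shifted input byte of row i. */
static uint32_t tbox_column(const uint8_t in[4], uint32_t kj) {
    return T0[in[0]] ^ rotl32(T0[in[1]], 8) ^
           rotl32(T0[in[2]], 16) ^ rotl32(T0[in[3]], 24) ^ kj;
}
```

With this layout the memory cost drops from 4 KB to 1 KB, at the price of three extra rotations per column. On processors where a rotated operand can be fed directly into the XOR, this extra cost can be small.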

CLEFIA T-Box Implementation

Similarly to AES, CLEFIA can also be implemented with table look-ups, but with some small differences. First, CLEFIA is not an SPN; the T-Box is implemented in the Feistel Functions of CLEFIA, which follow an SPN structure. Second, CLEFIA uses two S-Boxes rather than one, and they are organized in a different way in each of the Feistel Functions. Also, each Feistel Function has its own multiplication matrix.

The difference between the CLEFIA T-Box and the AES T-Box is that, while AES applies the T-Box to the full state of the cipher, CLEFIA only applies it to one-fourth of the cipher state (32 bits) in each Feistel Function. The other main difference is that CLEFIA does not require inverse tables for decryption, since it uses a Feistel Network, in which the decryption uses the same operations as the encryption. On the other hand, the table reduction is not as large as in AES, since two different S-Boxes are used in CLEFIA and the rows of the multiplication matrices are not all rotations of the first row.

For each of the CLEFIA operations we have:


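\[
\begin{bmatrix} f_0 \\ f_1 \\ f_2 \\ f_3 \end{bmatrix}
=
\begin{bmatrix}
\mathtt{0x1} & \mathtt{0x2} & \mathtt{0x4} & \mathtt{0x6} \\
\mathtt{0x2} & \mathtt{0x1} & \mathtt{0x6} & \mathtt{0x4} \\
\mathtt{0x4} & \mathtt{0x6} & \mathtt{0x1} & \mathtt{0x2} \\
\mathtt{0x6} & \mathtt{0x4} & \mathtt{0x2} & \mathtt{0x1}
\end{bmatrix}
\begin{bmatrix} S_0[b_0] \\ S_1[b_1] \\ S_0[b_2] \\ S_1[b_3] \end{bmatrix}
\]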

The values of S0[bi] and S1[bi] are obtained by XORing the key bytes, ki, with the input bytes, ai, and then performing a lookup in the S-box tables S0box[256] and S1box[256], resulting in:

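\[
T_0F_0[a] = \begin{bmatrix} S_0[a] \\ \mathtt{0x2} \times S_0[a] \\ \mathtt{0x4} \times S_0[a] \\ \mathtt{0x6} \times S_0[a] \end{bmatrix},\quad
T_1F_0[a] = \begin{bmatrix} \mathtt{0x2} \times S_1[a] \\ S_1[a] \\ \mathtt{0x6} \times S_1[a] \\ \mathtt{0x4} \times S_1[a] \end{bmatrix},\quad
T_2F_0[a] = \begin{bmatrix} \mathtt{0x4} \times S_0[a] \\ \mathtt{0x6} \times S_0[a] \\ S_0[a] \\ \mathtt{0x2} \times S_0[a] \end{bmatrix},\quad
T_3F_0[a] = \begin{bmatrix} \mathtt{0x6} \times S_1[a] \\ \mathtt{0x4} \times S_1[a] \\ \mathtt{0x2} \times S_1[a] \\ S_1[a] \end{bmatrix}
\]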

Each of these 4 tables has 256 elements of 4 bytes, thus occupying 1 KB, for a total of 4 KB per Feistel Function and 8 KB for both Feistel Functions. The Feistel Function 0 transformation can thus be expressed as:

\[
f = T_0F_0[b_0] \oplus T_1F_0[b_1] \oplus T_2F_0[b_2] \oplus T_3F_0[b_3], \quad \text{where } b_i = a_i \oplus k_i
\]

T-Box Reduction

As seen above, the values in the third and fourth tables are just rotations of those in the first and second tables, respectively. While this does not allow for a reduction as large as for AES, it reduces the number of tables stored for each Feistel Function from 4 to only 2, a reduction of 50%, using simple rotation operations:

\[
T_2F_0[a] = \mathrm{RotLeft}(T_0F_0[a], 16), \qquad T_3F_0[a] = \mathrm{RotLeft}(T_1F_0[a], 16)
\]

Resulting in:

\[
f = T_0F_0[b_0] \oplus T_1F_0[b_1] \oplus \mathrm{RotLeft}(T_0F_0[b_2], 16) \oplus \mathrm{RotLeft}(T_1F_0[b_3], 16), \quad \text{where } b_i = a_i \oplus k_i
\]
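The reduced tables for Feistel Function 0 can be sketched as below. This is an illustrative sketch under stated assumptions rather than the thesis code. The S-box contents are supplied by the caller, since the actual CLEFIA S0 and S1 tables are not reproduced here. The field multiplication is assumed to use the polynomial z^8 + z^4 + z^3 + z^2 + 1 (0x11D) given in the CLEFIA specification. Row 0 is packed into the least significant byte, and all function names are hypothetical.

```c
#include <stdint.h>

/* Multiply by 0x02 in GF(2^8), assuming CLEFIA's polynomial 0x11D. */
static uint8_t clefia_x2(uint8_t v) {
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1D : 0x00));
}

/* A rotation by 16 bits is the same in both directions. */
static uint32_t rot16(uint32_t v) {
    return (v << 16) | (v >> 16);
}

/* Pack a column (r0, r1, r2, r3) with r0 in the least significant byte. */
static uint32_t pack4(uint8_t r0, uint8_t r1, uint8_t r2, uint8_t r3) {
    return (uint32_t)r0 | ((uint32_t)r1 << 8) |
           ((uint32_t)r2 << 16) | ((uint32_t)r3 << 24);
}

/* Build the two stored tables of F0 from caller-supplied S-boxes:
 *   T0F0[a] = (S0, 2*S0, 4*S0, 6*S0),  T1F0[a] = (2*S1, S1, 6*S1, 4*S1). */
static void build_f0_tables(const uint8_t s0[256], const uint8_t s1[256],
                            uint32_t t0f0[256], uint32_t t1f0[256]) {
    for (int a = 0; a < 256; a++) {
        uint8_t u = s0[a], u2 = clefia_x2(u), u4 = clefia_x2(u2);
        uint8_t v = s1[a], v2 = clefia_x2(v), v4 = clefia_x2(v2);
        uint8_t u6 = (uint8_t)(u4 ^ u2), v6 = (uint8_t)(v4 ^ v2);
        t0f0[a] = pack4(u, u2, u4, u6);
        t1f0[a] = pack4(v2, v, v6, v4);
    }
}

/* F0 output from the key-mixed input bytes b[i] = a[i] ^ k[i],
 * using two tables and two 16-bit rotations instead of four tables. */
static uint32_t f0_reduced(const uint32_t t0f0[256], const uint32_t t1f0[256],
                           const uint8_t b[4]) {
    return t0f0[b[0]] ^ t1f0[b[1]] ^ rot16(t0f0[b[2]]) ^ rot16(t1f0[b[3]]);
}
```

The rotation identity does not depend on the S-box values, so the sketch can be checked with placeholder tables before the real S0 and S1 are plugged in.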

Bit-slice Optimization Implementation

As mentioned in [24], when regular ciphers are converted to bit-slice implementations there is an overhead caused by the packing of the bits into the registers. Therefore, only ciphers that were designed with bit-slice implementations in mind were implemented in the bit-slice style. Thus, the bit-slice optimization was only "applied" to RoadRunneR [76], a bit-slice cipher oriented to 8-bit architectures; RECTANGLE [61], a bit-slice cipher oriented to 16-bit architectures; and NOEKEON [60], a bit-slice cipher oriented to 32-bit architectures. All these ciphers share the same characteristics of the bit-slice approach: they do not require memory accesses and no tables are stored. All the substitutions are done using logical operations between registers, replacing the need for S-Boxes and table look-ups.
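As an illustration of a substitution layer built only from logical operations, the sketch below shows a bit-sliced nonlinear step operating on four 32-bit words, so that 32 four-bit S-boxes are evaluated in parallel without any table. The sequence of operations is believed to match the Gamma step of the public NOEKEON reference code, but it should be checked against [60] before being reused. The function name `gamma_bitsliced` is hypothetical.

```c
#include <stdint.h>

/* Bit-sliced nonlinear layer on a 128-bit state held in four 32-bit words.
 * Bit i of a[0..3] forms the input of the i-th 4-bit S-box, so all 32
 * S-boxes are computed at once using only AND, NOT and XOR. */
static void gamma_bitsliced(uint32_t a[4]) {
    uint32_t tmp;

    /* first nonlinear step */
    a[1] ^= ~a[3] & ~a[2];
    a[0] ^=  a[2] &  a[1];

    /* linear step: swap a[0] and a[3], then mix into a[2] */
    tmp  = a[3];
    a[3] = a[0];
    a[0] = tmp;
    a[2] ^= a[0] ^ a[1] ^ a[3];

    /* second nonlinear step */
    a[1] ^= ~a[3] & ~a[2];
    a[0] ^=  a[2] &  a[1];
}
```

The whole layer takes a handful of register-to-register instructions and no memory accesses, which is the property highlighted above. Applying the function twice returns the original state, which makes it convenient to check.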


3.4.2 Code Optimizations

In order to obtain better implementations of the considered ciphers for the ARM Cortex-M3 processor, both in terms of execution time and memory requirements, several changes were made to the source code of their reference implementations. To this end, the optimization techniques described in Section 2.5.2 were exploited. This section presents the set of optimizations that were applied to each cipher and discusses their motivation. Table B.1 in Appendix B summarizes the developed cipher implementations.

AES

Since the AES version available on the FELICS framework was poorly designed, a new AES reference implementation was designed. This implementation is based on the data presented in [42] and is oriented towards a 32-bit platform. It was named AES 128 128 v14. Based on this implementation, a new version was developed by modifying the mix columns algorithm of the encryption procedure. That version was named AES 128 128 v15 and reuses the value of the GF multiplication to reduce the number of performed multiplications. The other implementations that were developed exploit the use of T-Boxes.

First, a standard version was developed, which was named AES 128 128 v08. Then, an improved version was devised using reduced T-Boxes. In this implementation, which was named AES 128 128 v09, a loop was used to pass the four rows of the AES state through the T-Box algorithm. Two improved versions of this implementation that focus on minimizing the execution time were also developed, by performing the partial and the full unrolling of this loop. These implementations were named AES 128 128 v10 and AES 128 128 v11, respectively. Finally, the same modification was applied to these two implementations, in order to reduce the number of memory accesses. This modification consisted in maximizing the use of the processor registers to compute the state operations and resulted in the implementations named AES 128 128 v12 and AES 128 128 v13. In conclusion, the whole set of AES implementations that was devised is the following:

• AES 128 128 v14: Reference Implementation (32-bits Oriented).

• AES 128 128 v15: 32-bits Oriented + Rearrangement of the operations in the mix columns en-

cryption algorithm.

• AES 128 128 v08²: T-Box Optimization.

• AES 128 128 v09: T-Box Reduced Optimization.

• AES 128 128 v10: T-Box Reduced Optimization + Partial Unroll.

• AES 128 128 v11: T-Box Reduced Optimization + Full Unroll.

• AES 128 128 v12: T-Box Reduced Optimization + State in Registers + Partial Unroll.

• AES 128 128 v13: T-Box Reduced Optimization + State in Registers + Full Unroll.
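The idea behind reusing GF multiplication results in the mix columns step can be sketched as follows. This is a commonly used formulation and is shown only as an illustration; it is not necessarily the exact rearrangement adopted in AES 128 128 v15. The function names are hypothetical.

```c
#include <stdint.h>

/* Multiply by 0x02 in GF(2^8) with the AES polynomial 0x11B. */
static uint8_t xtime(uint8_t v) {
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1B : 0x00));
}

/* Direct evaluation of the mix columns matrix: every 0x02 and 0x03
 * product is computed separately (eight xtime calls per column). */
static void mix_column_naive(uint8_t c[4]) {
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    c[0] = (uint8_t)(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3);
    c[1] = (uint8_t)(a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3);
    c[2] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3));
    c[3] = (uint8_t)((xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3));
}

/* Rearranged evaluation: since 2*a ^ 3*b = 2*(a ^ b) ^ b, a single
 * xtime call per output byte is enough (four per column), and the
 * XOR of the whole column, t, is shared by all four outputs. */
static void mix_column_reuse(uint8_t c[4]) {
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    uint8_t t = (uint8_t)(a0 ^ a1 ^ a2 ^ a3);
    c[0] = (uint8_t)(a0 ^ t ^ xtime((uint8_t)(a0 ^ a1)));
    c[1] = (uint8_t)(a1 ^ t ^ xtime((uint8_t)(a1 ^ a2)));
    c[2] = (uint8_t)(a2 ^ t ^ xtime((uint8_t)(a2 ^ a3)));
    c[3] = (uint8_t)(a3 ^ t ^ xtime((uint8_t)(a3 ^ a0)));
}
```

Both functions produce the same column, so the rearranged form halves the number of field multiplications without changing the result.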

² The version number starts at 8 because FELICS already had 7 versions of AES implemented.

CLEFIA

The CLEFIA reference implementation was obtained from the CLEFIA website [97] and adapted to the FELICS framework. This implementation is 8-bit oriented and performs several memory copies that are unnecessary. Therefore, the first implemented optimizations consisted in removing this code and, subsequently, in adapting the implementation to a 32-bit architecture. These implementations were named CLEFIA 128 128 v02 and CLEFIA 128 128 v04, respectively. The CLEFIA reference algorithm computes the constants used by the key scheduler at run time, which leads to an unnecessarily high execution time. Therefore, an alternative version of the 32-bit implementation, in which the pre-computed values of all the constants are stored in a table, was devised. This version was named CLEFIA 128 128 v05. The remaining seven implementations that were developed exploit the use of T-Boxes.

While the CLEFIA 128 128 v03 and CLEFIA 128 128 v06 implementations apply standard T-Boxes to the 8-bit oriented reference algorithm and to its optimized 32-bit oriented version, respectively, all the other implementations use reduced T-Boxes. In four of these reduced T-Box implementations, the F0 and F1 functions were also converted to inline functions, in order to remove the overhead of the calls to these functions in each round of the algorithm. For these implementations, two additional optimization techniques were applied: full loop unrolling, to remove all the dependencies and push the cipher to its best execution time; and maximizing the use of the processor's registers, in order to reduce the memory accesses. The following list summarizes the whole set of CLEFIA implementations that were devised:

• CLEFIA 128 128 v01: Reference Implementation.

• CLEFIA 128 128 v02: Reference Implementation + Code Clean Optimization.

• CLEFIA 128 128 v03: T-Box Optimization (8-bits).

• CLEFIA 128 128 v04: 32-bits Oriented Optimization.

• CLEFIA 128 128 v05: 32-bits Oriented Optimization + Constants Stored in Table.

• CLEFIA 128 128 v06: T-Box Optimization (32-bits) + Constants Stored in Table.

• CLEFIA 128 128 v07: T-Box Reduced Optimization (32-bits) + Constants Stored in Table.

• CLEFIA 128 128 v08: T-Box Reduced Optimization (32-bits) + Constants Stored in Table + F0

and F1 Inlined.


and F1 Inlined + Full Unroll.


and F1 Inlined + State in Registers.


and F1 Inlined + Full Unroll + State in Registers

NOEKEON

The NOEKEON reference implementation was obtained from the NOEKEON website [102] and adapted to the FELICS framework. This implementation was already 32-bit oriented but targeted systems with a big-endian memory organization, which is not the default for the ARM Cortex-M3 architecture. Therefore, a little-endian version was implemented in order to reduce the overhead and improve the performance on the target platform. Since the only difference between the Direct and the Indirect modes is the key scheduler, most of the implemented versions use the Direct mode.

In the NOEKEON reference implementation, 2 different round constants are calculated on the fly. In order to reduce the overhead of these computations, alternative implementations using tables to store the pre-computed values of these constants were devised.

Due to the way the reference code is organized, more than four function calls are performed in each loop iteration implementing a round. So, to remove the overhead of these calls, the functions were converted to inline functions. To further improve the execution time, another two optimization techniques were implemented: loop unrolling and keeping the cipher state in the processor's registers most of the time.

The following list details the techniques that were applied to each one of the implementations that

were devised for the NOEKEON cipher:

• NOEKEON 128 128 v01: Reference Implementation Direct-Key (Big-Endian).

• NOEKEON 128 128 v02: Reference Implementation Indirect-Key (Big-Endian).

• NOEKEON 128 128 v03: Direct-Key (Little-Endian).

• NOEKEON 128 128 v04: Indirect-Key (Little-Endian).

• NOEKEON 128 128 v05: Direct-Key (Little-Endian) + Round Constants Stored in a Table.

• NOEKEON 128 128 v06: Direct-Key (Little-Endian) + Functions Inlined.

• NOEKEON 128 128 v07: Direct-Key (Little-Endian) + Round Constants Stored in a Table + Func-

tions Inlined.

• NOEKEON 128 128 v08: Direct-Key (Little-Endian) + Functions Inlined + Full Unroll.

• NOEKEON 128 128 v09: Direct-Key (Little-Endian) + Round Constants Stored in a Table + Func-

tions Inlined + State in Registers.

• NOEKEON 128 128 v10: Direct-Key (Little-Endian) + Functions Inlined + State in Registers.

• NOEKEON 128 128 v11: Direct-Key (Little-Endian) + Single Function for Encryption and Decryp-

tion (All code inlined in that function and also state kept on registers).

• NOEKEON 128 128 v12: Direct-Key (Little-Endian) + Functions Inlined + State in Registers + Full

Unroll.
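The trade-off between computing round constants on the fly and reading them from a table, as in the v05, v07 and v09 implementations listed above, can be sketched as follows. The update rule (start at 0x80, then double in GF(2^8) with reduction by 0x1B) is assumed to be the one used by the public NOEKEON reference code and should be checked against [102]. The names `rc_next`, `rc_on_the_fly`, `rc_from_table` and `RC_TABLE` are hypothetical.

```c
#include <stdint.h>

#define NOEKEON_ROUNDS 16

/* One update of the round constant (assumed LFSR-style rule). */
static uint8_t rc_next(uint8_t rc) {
    return (uint8_t)((rc << 1) ^ ((rc & 0x80) ? 0x1B : 0x00));
}

/* On-the-fly variant: a shift, a test and a conditional XOR per update.
 * It is written here as a function of the round index for clarity;
 * a round loop would normally update the constant incrementally. */
static uint8_t rc_on_the_fly(unsigned round) {
    uint8_t rc = 0x80;
    while (round--) rc = rc_next(rc);
    return rc;
}

/* Table variant: 17 bytes of constant data, one load per round. */
static const uint8_t RC_TABLE[NOEKEON_ROUNDS + 1] = {
    0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F,
    0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4
};

static inline uint8_t rc_from_table(unsigned round) {
    return RC_TABLE[round];
}
```

The table replaces a few arithmetic instructions per round with a single memory read, at the cost of a small amount of constant data. Whether this pays off depends on the memory access latency of the target.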

PRESENT

Although FELICS has an implementation of PRESENT, it is poorly designed. Therefore, a 32-bit oriented reference implementation was designed based on the information provided in [23]. This implementation was subsequently optimized using three different techniques.

First, the 4-bit S-Boxes were replaced by 8-bit S-Boxes to improve the software performance. Then, all the permutations were unrolled, due to their excessive cost in software implementations. Moreover, the loops were also fully unrolled to remove all the dependencies and push the cipher code to its fastest performance level. Finally, memory accesses were minimized by keeping the cipher state in the processor's registers for most of the cipher's execution time.

The techniques applied in each one of the devised PRESENT implementations are detailed in the

following list:

• PRESENT 64 80 v07³: Reference Implementation (32-bits oriented).

• PRESENT 64 80 v08: 8-bits S-Box Implementation.

• PRESENT 64 80 v09: 4-bits S-Box Implementation + Unroll of the permutation layer.

• PRESENT 64 80 v10: 8-bits S-Box Implementation + Unroll of the permutation layer.

• PRESENT 64 80 v11: 8-bits S-Box Implementation + State in Registers.

• PRESENT 64 80 v12: 8-bits S-Box Implementation + State in Registers + Unroll of the permuta-

tion layer.

• PRESENT 64 80 v13: 8-bits S-Box Implementation + State in Registers + Unroll of the permuta-

tion layer + Full Unroll of the rounds.
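The replacement of the 4-bit S-Box by an 8-bit S-Box can be sketched as below: two adjacent nibbles are substituted with a single lookup in a 256-entry table derived from the 16-entry one. The 4-bit table is the published PRESENT S-box; the function and table names are hypothetical and the code is only an illustration of the technique, not the thesis implementation.

```c
#include <stdint.h>

/* PRESENT 4-bit S-box. */
static const uint8_t SBOX4[16] = {
    0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
    0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2
};

/* 8-bit S-box: both nibbles of a byte are substituted at once. */
static uint8_t SBOX8[256];

static void build_sbox8(void) {
    for (int x = 0; x < 256; x++)
        SBOX8[x] = (uint8_t)((SBOX4[x >> 4] << 4) | SBOX4[x & 0x0F]);
}

/* S-box layer with the 4-bit table: 16 lookups per 64-bit state. */
static uint64_t sbox_layer4(uint64_t s) {
    uint64_t out = 0;
    for (int i = 0; i < 16; i++)
        out |= (uint64_t)SBOX4[(s >> (4 * i)) & 0xF] << (4 * i);
    return out;
}

/* S-box layer with the 8-bit table: 8 lookups per 64-bit state. */
static uint64_t sbox_layer8(uint64_t s) {
    uint64_t out = 0;
    for (int i = 0; i < 8; i++)
        out |= (uint64_t)SBOX8[(s >> (8 * i)) & 0xFF] << (8 * i);
    return out;
}
```

The number of lookups, shifts and masks is halved, in exchange for a table of 256 bytes instead of 16. In a real implementation the 8-bit table would be stored as a constant rather than built at run time.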

RECTANGLE

The reference implementation considered for RECTANGLE was also available on FELICS. Since RECTANGLE is a 16-bit oriented bit-slice cipher, the first implemented optimization consisted in adapting the code to a 32-bit architecture. Nevertheless, not all the code could be optimized for a 32-bit processor, due to the intrinsic 16-bit nature of this cipher. So, in the devised partially 32-bit oriented implementations, only the code sections involving 32-bit variables were optimized.

The RECTANGLE reference implementation has many functions that exist strictly for code organization purposes; none of these functions is reused in the cipher computation. As a result, these functions were converted to inline functions, in order to remove the overhead of the calls to them.

Since RECTANGLE is a bit-slice cipher, several operations are performed on the cipher state. Consequently, a couple of modifications to the code were also implemented, so that the cipher state can be kept in the processor's registers for most of the computation time.

In addition, the loops were also fully unrolled to remove all the dependencies and push the cipher code to its fastest performance level.

The techniques applied for each one of the devised implementations are listed as follows:

• RECTANGLE 64 128 v01: Reference Implementation provided by FELICS (Not Implemented).

• RECTANGLE 64 128 v11⁴: 32-bits partial implementation.

• RECTANGLE 64 128 v12: 32-bits partial implementation + Functions Inlined.

• RECTANGLE 64 128 v13: 32-bits partial implementation + Functions Inlined + Full Unroll.

• RECTANGLE 64 128 v14: 32-bits partial implementation + Functions Inlined + State in Registers.

³ The version number starts at 7 because FELICS already had 6 versions of PRESENT implemented.

⁴ The version number starts at 11 because FELICS already had 10 versions of RECTANGLE implemented.

• RECTANGLE 64 128 v15: 32-bits partial implementation + Functions Inlined + State in Registers

+ Full Unroll.

RoadRunneR

The reference implementation of the RoadRunneR cipher available on FELICS has two versions: one with two distinct key schedulers, for the encryption and the decryption procedures, and another one without a key scheduler, which presents only a small overhead due to the key being managed inside the cipher encryption and decryption tasks. So, the first optimization that was implemented consisted in the design of a single key scheduler for both the encryption and the decryption procedures. From that version, an implementation without a key scheduler was also devised.

Then, the reference implementation was optimized for the target platform. However, since RoadRunneR is an 8-bit oriented bit-slice cipher, it was only possible to develop a partially 32-bit oriented implementation, as was the case for RECTANGLE.

To reduce the execution time, three further optimizations were implemented. Firstly, all the functions that are called in the loop implementing the rounds were converted to inline functions. Secondly, the code was modified to keep the cipher state in the processor's registers, and thus reduce the number of memory accesses. Lastly, the loops were fully unrolled, not only to remove all the dependencies and the branches but also to minimize the overhead resulting from the computation of the iteration count.

The following list presents the set of implementations that were devised for the RoadRunneR cipher:

• RoadRunneR 64 128 v02: Reference Implementation provided by FELICS (Two Key Sched-

ulers).

• RoadRunneR 64 128 v07⁵: One Key Scheduler Implementation.

• RoadRunneR 64 128 v08: No Key Scheduler Implementation.

• RoadRunneR 64 128 v09: One Key Scheduler + 32-bits partial implementation.

• RoadRunneR 64 128 v10: No Key Scheduler + 32-bits partial implementation.

• RoadRunneR 64 128 v11: One Key Scheduler + 32-bits partial implementation + Functions In-

lined.

• RoadRunneR 64 128 v12: No Key Scheduler + 32-bits partial implementation + Functions In-

lined.


lined + Full Unroll.

• RoadRunneR 64 128 v14: No Key Scheduler + 32-bits partial implementation + Functions Inlined

+ Full Unroll.


lined + State in Registers.

⁵ The version number starts at 7 because FELICS already had 6 versions of RoadRunneR implemented.


+ State in Registers.


lined + State in Registers + Full Unroll of rounds.


+ State in Registers + Full Unroll of rounds.

SPARX

The SPARX cipher was proposed by the authors of FELICS, so its reference implementation was already available in the platform. However, this implementation is not fully 32-bit oriented. Consequently, the first optimization made to the SPARX source code consisted in adapting it to the 32-bit architecture of the target processor.

SPARX relies on the Speckey box to perform the substitution layer of the cipher, which is implemented as a function in the reference code for code reuse purposes. Two versions of this function were therefore created: one that receives the arguments by value and returns the result, and another in which pointers are used to pass the arguments to the function and to return the result of the Speckey transformation. The Speckey function is also called many times, so it was converted to an inline function to remove the overhead of its calls. As for the other ciphers, the full unrolling of the loops corresponding to the cipher rounds was another optimization that was implemented to reduce the execution time.

The SPARX reference implementation already tries to maximize the use of the processor registers, so there was not much work to do in this respect. Still, the source code was modified to include the register keyword wherever it was found to be necessary, so that the compiler keeps the variables in registers as much as possible. The last optimization technique aimed at reducing the code size and consisted in reorganizing the steps of the cipher into loops, since in the considered reference implementation of SPARX these steps are unrolled.

The following list details the techniques that were considered in the development of the optimized

implementations of the SPARX cipher:

• SPARX 64 128 v01: Reference Implementation provided by FELICS.

• SPARX 64 128 v37⁶: 32-bits oriented implementation + Speckey with Pointers.

• SPARX 64 128 v38: 32-bits oriented implementation + Speckey with Return Value.

• SPARX 64 128 v39: 32-bits oriented implementation + Speckey Inlined (Functions Inlined).

• SPARX 64 128 v40: 32-bits oriented implementation + Speckey Inlined (Functions Inlined) + Full

Unroll.

• SPARX 64 128 v41: 32-bits oriented implementation + Speckey Inlined (Functions Inlined) + State

in Registers + Steps in Cycles.

⁶ The version number starts at 37 because FELICS already had 36 versions of SPARX implemented.


in Registers.


in Registers + Full Unroll.
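The two calling conventions considered for the Speckey function can be sketched as follows. The round operation follows the published SPECKEY definition (rotate the left 16-bit word right by 7, add the right word, rotate the right word left by 2, XOR with the new left word). The function names and the packing of both halves into one 32-bit value are assumptions made for this illustration and are not taken from the thesis sources.

```c
#include <stdint.h>

static inline uint16_t rotr16(uint16_t v, unsigned n) {
    return (uint16_t)((v >> n) | (v << (16 - n)));
}

static inline uint16_t rotl16(uint16_t v, unsigned n) {
    return (uint16_t)((v << n) | (v >> (16 - n)));
}

/* Variant 1: arguments passed by pointer, result written in place.
 * Without inlining, each call implies loads and stores through memory. */
static void speckey_ptr(uint16_t *left, uint16_t *right) {
    *left  = (uint16_t)(rotr16(*left, 7) + *right);
    *right = (uint16_t)(rotl16(*right, 2) ^ *left);
}

/* Variant 2: both halves passed by value in one 32-bit word and the
 * result returned by value, so that everything can stay in registers.
 * Marking it inline additionally removes the call overhead. */
static inline uint32_t speckey_val(uint32_t lr) {
    uint16_t l = (uint16_t)(lr >> 16);
    uint16_t r = (uint16_t)lr;
    l = (uint16_t)(rotr16(l, 7) + r);
    r = (uint16_t)(rotl16(r, 2) ^ l);
    return ((uint32_t)l << 16) | r;
}
```

Both variants compute the same transformation; they differ only in how the operands reach the function, which is what the v37, v38 and v39 implementations compare.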

SPECK

Although some implementations of the SPECK cipher are already available in the FELICS platform, a reference implementation was designed based on the algorithm presented in [62][63]. SPECK is a very simple cipher that cannot be greatly optimized, because it has no functions, tables or constants. Therefore, the only possible optimizations were unrolling the rounds and changing the code to maximize the use of the registers, thus reducing the memory accesses.

The other modification to the source code focused on a more efficient implementation of the decryption rounds. This change consisted in rearranging the instructions to make better use of the position of the barrel shifter in the processor's datapath (i.e. before the ALU). Algorithms 1 and 2 show the original and the rearranged decryption procedures, respectively. It should be noted that this type of optimization is not required for the SPECK encryption rounds, since the algorithm already makes the best use of the barrel shifter (see Algorithm 10 in Appendix A).

The following list details the techniques applied in each one of the devised SPECK implementations:

Algorithm 1: SPECK Decryption
Data: LS = BLOCK[0]; RS = BLOCK[1]; RK → Round Keys Array;

for i = NUMBER OF ROUNDS − 1 to i >= 0 do
    RS = RS ⊕ LS;
    RS = RotationRight(RS, β);
    LS = LS ⊕ RK[i];
    LS = LS - RS;
    LS = RotationLeft(LS, α);
end
BLOCK[0] = RS;
BLOCK[1] = LS;

• SPECK 64 128 v07⁷: Reference Implementation.

• SPECK 64 128 v08: Full Unroll.

• SPECK 64 128 v09: State in Registers.

• SPECK 64 128 v10: State in Registers + Rearrange of the decryption operations, to take advan-

tage of shift before ALU of the ARM Cortex-M3.

• SPECK 64 128 v11: State in Registers + Full Unroll.

⁷ The version number starts at 7 because FELICS already had 6 versions of SPECK implemented.

Algorithm 2: SPECK Rearranged Decryption
Data: LS = BLOCK[0]; RS = BLOCK[1]; RK → Round Keys Array;

RS = RS ⊕ LS;
for i = NUMBER OF ROUNDS − 1 to i > 0 do
    LS = LS ⊕ RK[i];
    LS = LS - RotationRight(RS, β);
    LS = RotationLeft(LS, α);
    RS = LS ⊕ RotationRight(RS, β);
end
LS = LS ⊕ RK[0];
LS = LS - RotationRight(RS, β);
BLOCK[0] = RotationRight(RS, β);
BLOCK[1] = RotationLeft(LS, α);
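Algorithms 1 and 2 can be written in C as sketched below, for the 64-bit block configuration with 32-bit words. The parameters α = 8, β = 3 and 27 rounds are those of SPECK with a 64-bit block and a 128-bit key; the encryption routine uses the standard SPECK round and is included only so that the two decryption variants can be checked by a round trip. The function names are hypothetical, and the round keys are taken as an input array, since the key schedule is outside the scope of this sketch.

```c
#include <stdint.h>

#define SPECK_ROUNDS 27
#define ALPHA 8
#define BETA  3

static inline uint32_t ror32(uint32_t v, unsigned n) { return (v >> n) | (v << (32 - n)); }
static inline uint32_t rol32(uint32_t v, unsigned n) { return (v << n) | (v >> (32 - n)); }

/* Standard SPECK encryption rounds (used here only for checking). */
static void speck_encrypt(uint32_t *ls, uint32_t *rs, const uint32_t rk[SPECK_ROUNDS]) {
    for (int i = 0; i < SPECK_ROUNDS; i++) {
        *ls = (ror32(*ls, ALPHA) + *rs) ^ rk[i];
        *rs = rol32(*rs, BETA) ^ *ls;
    }
}

/* Algorithm 1: each rotation is a separate step whose result is stored. */
static void speck_decrypt(uint32_t *ls, uint32_t *rs, const uint32_t rk[SPECK_ROUNDS]) {
    uint32_t l = *ls, r = *rs;
    for (int i = SPECK_ROUNDS - 1; i >= 0; i--) {
        r ^= l;
        r = ror32(r, BETA);
        l ^= rk[i];
        l -= r;
        l = rol32(l, ALPHA);
    }
    *ls = l;
    *rs = r;
}

/* Algorithm 2: the right word is kept unrotated and the rotation by BETA
 * appears only as the second operand of SUB and XOR, where a barrel
 * shifter placed before the ALU can apply it within the same instruction. */
static void speck_decrypt_rearranged(uint32_t *ls, uint32_t *rs, const uint32_t rk[SPECK_ROUNDS]) {
    uint32_t l = *ls, r = *rs;
    r ^= l;
    for (int i = SPECK_ROUNDS - 1; i > 0; i--) {
        l ^= rk[i];
        l -= ror32(r, BETA);
        l = rol32(l, ALPHA);
        r = l ^ ror32(r, BETA);
    }
    l ^= rk[0];
    l -= ror32(r, BETA);
    *rs = ror32(r, BETA);
    *ls = rol32(l, ALPHA);
}
```

Whether the compiler actually folds the rotation into the subtraction and XOR instructions depends on the toolchain and optimization level, so the generated assembly should be inspected when reproducing the measurements.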

3.5 Summary

This chapter is focused on the development of efficient implementations of lightweight ciphers for

implementations targeting ARM Cortex-M3 processors.

The considered set of ciphers is composed of AES [42], CLEFIA [74], NOEKEON [60], PRESENT [23], RECTANGLE [61], RoadRunneR [76], SPARX [78] and SPECK [62][63], selected for three specific reasons: relevance to the lightweight cryptography topic (AES, CLEFIA and PRESENT), optimized designs (NOEKEON, RECTANGLE and RoadRunneR), and simple design with promising performance (SPARX and SPECK). Nevertheless, the conducted optimization procedure considered only the cipher configurations that were found to be the most suitable for the ARM Cortex-M3 architecture. For AES, CLEFIA and NOEKEON, the 128-bit key size was adopted. For RECTANGLE and RoadRunneR, the 128-bit key was the best option. For SPARX, the 64-bit block better matches the target platform and provides good performance and reduced code size. The 64-bit block is also the best option for SPECK considering a 32-bit architecture.

The optimizations proposed for the eight selected ciphers focused on both the algorithms and the code implementation. At the algorithm level, optimizations were presented for AES and CLEFIA, whilst at the code level the proposed optimizations addressed all the ciphers. Table B.1 in Appendix B summarizes the optimization techniques that were applied in each cipher implementation.

Chapter 4

Evaluation and Results

This chapter presents the evaluation of the implementations and respective optimizations of the considered lightweight ciphers. All results were obtained using the FELICS framework on scenario 0, which performs the encryption key schedule, the data encryption, the decryption key schedule (if needed), and the decryption of the test vectors. The considered results focus on the Execution Time (Clock Cycles) and Code Size (ROM). This chapter starts with an overview of the reference implementations. It then discusses the proposed algorithm optimizations and the results obtained, where the T-Box and T-Box Reduction optimizations are the most relevant. Next, the results of the code optimizations are analyzed for each cipher. After that, a comparative analysis of the proposed optimizations is presented, together with a comparison with the state of the art. Finally, the energy consumption experiments are explained and the obtained results are presented.

4.1 Reference Implementations

In order to evaluate the impact of the proposed optimizations in the performance and memory requirements of the considered set of ciphers, reference values were established for each cipher.

Such anchors were obtained by using the FELICS framework to characterize the implementations originally proposed by the authors of the eight ciphers. Some of these ciphers were already integrated into FELICS, namely RECTANGLE, RoadRunneR¹ and SPARX. For the other ciphers, i.e. AES, CLEFIA, NOEKEON, PRESENT and SPECK, new implementations were developed and integrated into FELICS. While the CLEFIA and NOEKEON implementations were based on the code provided by their authors, the AES, PRESENT and SPECK implementations had to be developed from scratch using the information provided in the papers presenting these ciphers [42][23][62]. None of these implementations were optimized. They were compiled with either -O1 or -O2, depending on which flag produced the best results.

¹ The RoadRunneR version used in this analysis is the reference implementation, so, despite using the configuration chosen in section 3.3, it does not include the proposed optimization of using only one key scheduler.

Figure 4.1: Code Size and Execution Time of the Reference Implementations

The results obtained for all the ciphers are depicted in figure 4.1. As can be seen, AES requires much more execution time than most of the other ciphers, which justifies the need for lightweight ciphers. CLEFIA shows a larger code size and a slower performance than AES because the AES reference implementation is oriented to a 32-bit architecture, while the CLEFIA reference implementation targets 8-bit processors. RoadRunneR is another cipher with a performance close to that of AES, because it is a bit-slice cipher designed for 8-bit processors.

The presented results also show that most of the new lightweight ciphers provide better results (for both code size and execution time) than PRESENT or CLEFIA, which are the standardized lightweight ciphers. Regarding SPECK, the obtained results demonstrate its superiority in terms of both execution time and code size, which is a consequence of its simplicity. The results also show that most of the lightweight ciphers have a code size < 1 KB.

Regarding the key schedule times, RoadRunneR is the only cipher that needs different key schedulers for encryption and decryption; this need is removed further in this work. All other ciphers have no decryption key schedule because they can use the same round keys, since their decryption algorithms are designed to use the round keys in the opposite order. NOEKEON is used in its direct-key version and therefore has no key schedule time. The execution time of the CLEFIA key schedule may seem surprising, but it is expected, since the key schedule computes not only the round keys but also the constants needed by the cipher. Further in this work, this overhead is removed by storing these constants in a table.

4.2 Algorithm Optimizations

In terms of algorithm optimizations, only the table-based (T-Box) optimization can be evaluated. This is because the bit-slice ciphers are only implemented in bit-slice form, so it is not possible to compare a non-bit-slice implementation with a bit-slice one. In the following section, the results obtained with the T-Box implementations are presented and compared with the normal implementations of the ciphers.

T-Box Implementations Improvements

For the analysis of the T-Box implementations, AES and CLEFIA were implemented on the Arduino Due, which is powered by an ARM Cortex-M3. All the implementations were 32-bit oriented. The optimization flag used was -O2 for AES and -O1 for CLEFIA, since these were the flags that showed the best results for these ciphers. The Execution Time is obtained as the average of the encryption and decryption times. The evaluation used the AES and CLEFIA reference versions (AES 128 128 v14, CLEFIA 128 128 v04), the versions using standard T-Box lookups (AES 128 128 v08, CLEFIA 128 128 v06) and the proposed reduced T-Box implementations (AES 128 128 v09, CLEFIA 128 128 v07).

Figure 4.2: T-Box Optimizations Evaluation

As depicted in figure 4.2, the proposed reduced T-Box implementation significantly improves the execution time but increases the code size. Nevertheless, the proposed T-Box reduction makes it possible to significantly reduce the code size when compared with the regular T-Box implementations. Although CLEFIA has a simpler structure, it is a little slower than AES. That happens because CLEFIA has 8 more rounds, which leads to a larger number of operations performed for the encryption/decryption.

Figure 4.3: AES T-Box Comparison

Figure 4.3 depicts the values of the AES T-Box implementations relative to the regular AES version. The results show that the AES T-Box implementation is almost 4× larger in code size than the reference implementation of AES but is 85% faster. The AES Reduced T-Box is only 39% larger in code size than the reference implementation, and is also 85% faster. This means that the AES Reduced T-Box is the fastest implementation with only a 39% increase in code size, which is much smaller than the increase of the regular T-Box implementation. Using shifts to reproduce the other T-Boxes adds no overhead to the execution time of the reduced version because of the structure of the ARM Cortex-M3, which allows an operand to be shifted before it is used in the ALU.
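The table reduction can be illustrated with a minimal Python sketch. It assumes the usual AES T-Box construction, in which the four encryption tables are byte rotations of a single table, so that only one table of 1 KB needs to be stored; the table layout used in the proposed implementation may differ. The names T0 to T3, ror32 and the two column functions are illustrative.

```python
def xtime(a):
    """Multiply by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a


def gmul(a, b):
    """Multiply two elements of GF(2^8) by shift-and-add."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r


def rotl8(b, n):
    """Rotate a byte left by n bits."""
    return ((b << n) | (b >> (8 - n))) & 0xFF


def make_sbox():
    """AES S-Box: multiplicative inverse followed by the affine map."""
    sbox = []
    for a in range(256):
        inv = next((b for b in range(1, 256) if gmul(a, b) == 1), 0)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


def ror32(w, n):
    """Rotate a 32-bit word right by n bits."""
    return ((w >> n) | (w << (32 - n))) & 0xFFFFFFFF


SBOX = make_sbox()
# T0 merges SubBytes with one column of MixColumns: bytes (2s, s, s, 3s).
T0 = [(gmul(s, 2) << 24) | (s << 16) | (s << 8) | gmul(s, 3) for s in SBOX]
# Regular T-Box version: three more tables, each a byte rotation of T0.
T1 = [ror32(t, 8) for t in T0]
T2 = [ror32(t, 16) for t in T0]
T3 = [ror32(t, 24) for t in T0]


def column_four_tables(b0, b1, b2, b3):
    """One output column using the four stored tables."""
    return T0[b0] ^ T1[b1] ^ T2[b2] ^ T3[b3]


def column_one_table(b0, b1, b2, b3):
    """Same column using only T0 and rotations of the looked-up words."""
    return T0[b0] ^ ror32(T0[b1], 8) ^ ror32(T0[b2], 16) ^ ror32(T0[b3], 24)


TABLE_BYTES_REGULAR = 4 * 256 * 4   # four tables of 256 32-bit words
TABLE_BYTES_REDUCED = 1 * 256 * 4   # a single table of 256 32-bit words
```

In column_one_table each rotation is applied to a value that is immediately XORed, so on the Cortex-M3 it can be expressed as a rotated second operand of the XOR instruction. This is consistent with the absence of overhead reported above.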

Figure 4.4: CLEFIA T-Box Comparison

Figure 4.4 shows that the CLEFIA T-Box implementation is almost 5× larger than the 32-bit oriented CLEFIA but is 81% faster. The CLEFIA Reduced T-Box is only 2.79× larger and is 83% faster than the 32-bit implementation of CLEFIA. The T-Box based versions of CLEFIA have a significant code size, even in the reduced version, because the table reduction is smaller than for AES: in AES every 4 tables can be reduced to 1, while in CLEFIA only every 2 tables can be reduced to 1. In the CLEFIA reduced T-Box version there is also no overhead in the execution time, which is again due to the structure of the ARM Cortex-M3, which allows an operand to be shifted before it is used in the ALU.

In conclusion, T-Box implementations are very interesting and can achieve very fast performances. The trade-off is that they require a lot of memory and are susceptible to cache timing attacks. However, most lightweight devices do not have a cache, so when speed matters more than memory usage, these implementations can be very useful. Among them, the AES reduced T-Box is clearly the better choice, because it achieves a much faster performance (85% faster) with a code size increase of only 39%, which is a worthwhile trade-off. Its only drawback is that it needs two different key schedulers, one for encryption and another for decryption. The CLEFIA reduced T-Box also achieved very good results in terms of performance (83% faster) but at a higher cost in code size (179% larger), which makes it less suitable if memory is a major constraint. The only advantage of the CLEFIA T-Box over the AES T-Box is that, because of the GFN structure of CLEFIA, it does not need different key schedulers for encryption and decryption; the round keys only need to be used in the inverse order.

4.3 Code Optimizations

The code optimizations were different for each cipher, depending on the code design of the cipher. In this section, all the implemented versions of each cipher are considered and analyzed. It also presents the relative gain of each optimization when compared with the reference implementation. All versions were evaluated using FELICS. The evaluation was performed on the ARM Cortex-M3 using scenario 0. The optimization flag used was -O1 or -O2, depending on which showed the better code size/execution time performance.

AES

For AES, all 8 implemented versions were evaluated with the optimization flag -O2.

Figure 4.5 depicts the results obtained for the implemented AES versions. An interesting fact is the discrepancy between the encryption and decryption times of the normal implementations (v14 and v15). That discrepancy exists because of the multiplication matrix of AES, which for decryption uses the values 9, 11, 13 and 14, rather than the values 1, 1, 2 and 3 used in encryption. These are not regular multiplications, but modular multiplications over polynomials whose coefficients are elements of GF(2^8). This means that the more complex coefficients require more operations to calculate the result, leading to this difference between encryption and decryption.
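The difference in cost can be made concrete by counting the operations of a naive shift-and-add multiplication by each coefficient. The sketch below is illustrative Python, not the evaluated C code; the function names are hypothetical, and the counts ignore the sharing of intermediate doublings that an optimized implementation would exploit.

```python
def xtime(a):
    """Multiply by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a


def gmul_const(a, c):
    """Return (a * c in GF(2^8), number of doublings, number of XORs)."""
    result, doublings, xors = 0, 0, 0
    started = False
    while c:
        if c & 1:
            if started:
                xors += 1          # combining a second, third, ... term
            result ^= a
            started = True
        c >>= 1
        if c:
            a = xtime(a)           # next power of two of the operand
            doublings += 1
    return result, doublings, xors


ENC_COEFFS = (2, 3, 1, 1)      # first row of the encryption matrix
DEC_COEFFS = (14, 11, 13, 9)   # first row of the decryption matrix


def mix_column(col, coeffs):
    """Multiply one 4-byte column by the circulant matrix built from coeffs."""
    out = []
    for i in range(4):
        acc = 0
        for j in range(4):
            acc ^= gmul_const(col[j], coeffs[(j - i) % 4])[0]
        out.append(acc)
    return out


def row_cost(coeffs):
    """Total (doublings, XORs) needed to form the four products of one row."""
    costs = [gmul_const(1, c)[1:] for c in coeffs]
    return sum(d for d, _ in costs), sum(x for _, x in costs)


print("encryption row:", row_cost(ENC_COEFFS))
print("decryption row:", row_cost(DEC_COEFFS))
```

Under these assumptions, the four products of one matrix row take 2 doublings and 1 XOR for encryption, against 12 doublings and 7 XORs for decryption; in both cases 3 further XORs combine the products.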

Another fact that was mentioned before is the key schedule execution time. The regular implementations of AES require only one key scheduler, with an execution time of 944 clock cycles, whereas the T-Box implementations require an encryption and a decryption key schedule. The encryption key schedule is equal to that of the normal version, so it has the same execution time of 944 clock cycles. The decryption key schedule is slower because it needs to perform the same operations as the encryption key scheduler plus table lookups to calculate the inverse keys, taking 2189 clock cycles.

Figure 4.5: AES Code Optimizations Results

Comparing the execution time and code size with the reference implementation (version 14), figure 4.6 shows that, in most cases, increasing the performance also increases the code size, which was expected given that table-based optimizations and unrolls were the most applied optimizations on AES. The small tweak to the encryption code in version 15 made it 1% larger than the reference implementation but also 5% faster. The T-Box implementations of versions 8 and 9 were already discussed in the previous section. Version 10, which unrolls the permutation (on top of version 9), speeds up the cipher 10× when compared with the reference implementation, which means that the permutation unroll provided a further 5% increase in performance over the T-Box reduction version, with a 20% increase in code size. The state in registers optimization of version 12 improved the execution time by a further 2% over its previous version (version 10), reaching in total an execution 92% faster than the reference implementation. This was expected since AES T-Box mainly performs table lookups, so the state in registers optimization would not bring a significant increase in performance; still, since it only adds 2% of code size over version 10, it is an acceptable trade-off. The T-Box implementations do not have many dependencies between operations; therefore, the full unroll optimizations (versions 11 and 13) did not bring any significant increase in performance, only in code size, making them unsuitable for constrained devices.

As mentioned before, AES T-Box is a very interesting optimization that can speed up the cipher 12×, but at a high cost: a code size of ≈ 4 KB (3852 bytes), which is 160% of that of the reference implementation. This is the best case using the proposed approach, which combines the reduced T-Box with the unroll of the permutations and the cipher state in registers optimizations.
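The results in this chapter are quoted both as a percentage ("92% faster") and as a speed-up factor ("12×"). The following sketch shows how the two relate, assuming that "x% faster" denotes an x% reduction of the execution time relative to the reference, which is the reading consistent with the paired values quoted for AES. The function names are illustrative.

```python
def speedup_from_reduction(percent_faster):
    """Speed-up factor when the execution time drops by percent_faster %."""
    return 100.0 / (100.0 - percent_faster)


def reduction_from_speedup(factor):
    """Execution time reduction, in %, for a given speed-up factor."""
    return 100.0 * (1.0 - 1.0 / factor)


# Values quoted for the AES versions discussed above.
for pct in (85, 90, 92):
    print(f"{pct}% faster -> {speedup_from_reduction(pct):.2f}x")
```

With this reading, a 90% reduction corresponds to the 10× speed-up quoted for version 10, and the 92% reduction of version 12 corresponds to a factor of 12.5, quoted above as 12×.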

Figure 4.6: AES Relative Gain/Loss

CLEFIA

For CLEFIA, 8 of the 11 implemented versions were evaluated. The two versions not presented in this evaluation are version 1, the reference implementation (obtained from the CLEFIA website [97]), because it had too much unnecessary code (memory copies, etc.), and version 3, which was a first test of the T-Box optimization. Considering this, version 2, in which the code was cleaned, is used as the reference. The optimization flag used was -O1, because it was the one showing the best code size/execution time results.

Figure 4.7 depicts the results obtained for the implemented CLEFIA versions. The encryption and decryption times are similar in all versions, because CLEFIA is a GFN cipher and so its encryption and decryption processes are very similar. It is clear that the optimizations with the most impact on the cipher are the change of architecture orientation, from 8 to 32 bits (from version 2 to version 4), and the T-Box optimization (from version 5 to version 6). Similar to AES, and as mentioned before, the table-based optimization leads to a large increase in code size given the size of the tables.

To better illustrate the gain of each optimization, the performance relative to the reference implementation (version 2, with cleaned code) is presented in figure 4.8. Changing the architecture from 8 bits to 32 bits speeds up the cipher 1.39× and reduces the code size by 4%, because the full length of the target device registers is used to reduce the number of operations. The addition of the constants table speeds up the cipher 1.5× but increases the code size by 3%. With the application of the T-Box Reduced optimization (version 7), the execution time is reduced to about 13% of that of the reference, but at a significant cost in code size, which increases by 169%. With the inlining of the F functions of CLEFIA (version 8), the cipher is sped up 11×, which is 4% faster than the T-Box Reduction implementation (version 7), at a cost of 17% in code size. The state in registers optimization (version 10) speeds up the cipher approximately 12×, which is an increase of only 1× when compared with the previous optimization (version 8). That is a small increase because, as in AES, CLEFIA T-Box mainly performs table lookups, so the state in registers optimization does not have a significant impact. The full unroll achieves a speed-up of 16× over the reference implementation, but at a significant cost: a 4.61× increase in code size.

Figure 4.7: CLEFIA Code Optimizations Results

The CLEFIA key schedule also underwent some changes. From version 2 to version 4, with the change of architecture orientation, the execution time of the key schedule dropped from 18195 to 10909 clock cycles, a reduction of about 40%. Then, from version 4 to version 5, where the constants changed from being computed to being stored in a table, the execution time dropped to 3801 clock cycles. This large reduction occurs because the constants were previously calculated during the key schedule phase. At a cost of only 240 bytes (60 constants of 32 bits each), it was possible to reduce the execution time of the key schedule to about 21% of the initial time.

Like AES T-Box, CLEFIA T-Box is a very interesting optimization that, combined with the function inlining and state in registers optimizations, reduced the cipher execution time to only 8% of that of the reference implementation. The drawback, similar to AES T-Box, is the high cost in code size, ≈ 5.6 KB (5684 bytes), which is 289% of the reference implementation, making it suitable only for devices where the internal memory (ROM) is not the main constraint and a fast execution is required. Compared to AES, the advantage of CLEFIA is that its T-Box version does not need different key schedulers for encryption and decryption, as mentioned before.

Figure 4.8: CLEFIA Relative Gain/Loss

NOEKEON

For NOEKEON, 10 of the 12 implemented versions were evaluated. The results of the big-endian versions (v01 and v02) are not presented since they are not useful for this comparison. The little-endian versions (v03 and v04) were considered as the reference implementations for comparison purposes. The optimization flag used was -O2, since it was the one showing the best code size/execution time results.

Figure 4.9 shows the results obtained for the NOEKEON implementations. The largest improvement was obtained with the function inlining optimization (version 6). Another interesting observation is that, when the functions are inlined, the discrepancy between the encryption and decryption times grows. This is due to the dependencies and the memory access order. When the state in registers optimization is applied (from version 9 onwards), the encryption/decryption difference becomes almost null again because the memory accesses are reduced. Finally, when the constants are stored in a table (versions 7 and 9) rather than computed on the fly (versions 6 and 10), the execution times are smaller.
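The two ways of obtaining the round constants can be sketched as follows. This Python sketch assumes the round constant recurrence of the NOEKEON specification (initial value 0x80, followed by repeated doubling in GF(2^8) with the reduction polynomial 0x11B, 17 constants in total), which is not restated in this section. The names are illustrative and the code is not the evaluated C implementation.

```python
def next_rc(rc):
    """Advance the round constant: doubling in GF(2^8) modulo 0x11B."""
    rc <<= 1
    if rc & 0x100:
        rc ^= 0x11B
    return rc


# Table variant: the constants are precomputed and stored in memory.
RC_TABLE = [0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F,
            0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4]


def constants_from_table():
    """One memory read per round and no arithmetic."""
    return [RC_TABLE[i] for i in range(len(RC_TABLE))]


def constants_on_the_fly():
    """A shift, a test and possibly an XOR per round, with no stored table."""
    rc, out = 0x80, []
    for _ in range(len(RC_TABLE)):
        out.append(rc)
        rc = next_rc(rc)
    return out


print(constants_on_the_fly() == constants_from_table())
```

The table variant trades a few bytes of ROM for the removal of the per-round update, which matches the direction of the trade-off observed in the results.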

As mentioned in the description of NOEKEON, it supports two modes of operation, Direct Key mode and Indirect Key mode. The main difference is that the Indirect mode transforms the initial key using a key scheduler, while the Direct mode does not perform any operation on the initial key. The cost of performing the key schedule, for the Indirect mode, is 3729 clock cycles. The only advantage of the Indirect mode is that it avoids exposing the key, by transforming it before using it in the encryption/decryption.

When comparing the optimized versions with the reference implementations, version 11 is the smallest, with only 89% of the reference implementation code size, while also being 65% faster. That is because this version uses a single function for encryption and decryption, which is possible because NOEKEON encryption and decryption are very similar. In addition, all the code is inlined inside that function and the state in registers optimization is also applied. Having the constants stored in a table instead of computed on the fly allows for a 5% faster execution but increases the code size by 5% (version 3 to 5). Function inlining (from version 3 to 6) reduces the execution time to only 35%, but the code size grows by 32%. Combining function inlining with the constants stored in a table has no impact on the performance (from version 6 to 7). Function inlining combined with state in registers (versions 9 and 10) reduces the code size (when compared with versions 6 and 7), because the number of memory accesses is reduced, and also provides a 4.28× faster execution with the constants table (version 9), or 3.69× faster with the constants computed on the fly (version 10). Full unroll (versions 8 and 12) can speed up the cipher even further, up to 5.11× (version 12), but at a cost too high to be useful, since it needs 9× more code size.

Figure 4.9: NOEKEON Code Optimizations Results

NOEKEON shows good results. In particular, the proposed optimization (version 10), which uses a single function with all the code inlined and the cipher state stored in registers, can reduce both code size and execution time, which makes NOEKEON a very interesting and promising cipher, with good performance and small code size. With the proposed optimizations it is also possible to achieve a performance 4.28× faster than the reference implementation, at a cost of only 17% in code size, which in a small cipher like this means an implementation of 580 bytes, which is still small. NOEKEON also helps to show that bit-slice ciphers benefit from keeping the cipher state in registers, because they perform many operations on the cipher state data. This improvement allows for an additional gain larger than 10% (from versions 6 and 7 to versions 9 and 10), while in AES and CLEFIA this optimization makes almost no difference. It also allows concluding that function inlining in ciphers that rely heavily on function calls, which is the case of the NOEKEON reference implementation, helps to significantly improve the performance (2.86× faster, versions 6 and 7).

Figure 4.10: NOEKEON Relative Gain/Loss

PRESENT

For PRESENT, the 7 implemented versions were evaluated. The flag used to obtain these results was -O2, because it was the one showing the best code size/execution time results.

Figure 4.11 shows the obtained results. The versions with an 8-bit S-Box (versions 8 and 10) are always faster than the versions with a 4-bit S-Box and the same optimizations (versions 7 and 9), and have a smaller gap between decryption and encryption. This is because of the memory accesses to the cipher state, which require fewer operations with the 8-bit S-Box than with the 4-bit S-Box, since memory can be accessed in 8-bit variables. The unroll of the PRESENT permutations does not bring a very large increase in code size but brings a very good improvement in performance, since the data dependencies are reduced. Similar to AES and CLEFIA, PRESENT does not have many operations on the cipher state, so the state in registers optimization does not bring a significant improvement to the execution time. The full unroll of PRESENT (version 13) creates a cipher of ≈ 35 KB, which is completely unusable, with almost no improvement in performance. The key scheduler execution time of PRESENT is 3168 clock cycles; it is the same for every version, since the key schedule always uses a 4-bit S-Box, even in the 8-bit S-Box versions.

Figure 4.11: PRESENT Code Optimizations Results

Figure 4.12: PRESENT Relative Gain/Loss

Considering the relative values shown in figure 4.12, using 8-bit S-Boxes leads to a 5% improvement in execution time but to a 51% increase in code size (version 8). That is because an 8-bit S-Box has a size of 256 bytes, while a 4-bit S-Box occupies only 8 bytes, so replacing the two 4-bit S-Boxes (one for encryption, and its inverse for decryption) with two 8-bit S-Boxes costs 512 bytes, which is more than half of the 928 bytes of PRESENT. The unroll of the permutations speeds up the cipher 2×, with an increase of 60% in code size (version 9). With the addition of an 8-bit S-Box (version 10), the cipher is sped up 2.5× but with an increase of 120% in code size. The state in registers optimization does not bring any significant improvement (versions 11 and 12): with an 8-bit S-Box it only reduces the execution time by 3%, and with the addition of the unroll of permutations it only speeds up the cipher 2.5×. The full unroll only achieves a speed-up of 2.6× and costs more than 38× in code size (version 13).
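The relation between the two S-Box variants can be sketched as follows. The Python code below is illustrative only: the 4-bit S-Box values are those of the PRESENT specification, while the way the 8-bit table is built (two 4-bit substitutions merged into one byte lookup) and all the names are assumptions about how such a table is typically derived, not a transcription of the evaluated C code.

```python
SBOX4 = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
         0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

# Packed storage of the 4-bit S-Box: two entries per byte, 8 bytes in total.
SBOX4_PACKED = bytes((SBOX4[2 * i] << 4) | SBOX4[2 * i + 1] for i in range(8))

# 8-bit S-Box: substitutes both nibbles of a byte with one lookup (256 bytes).
SBOX8 = bytes((SBOX4[b >> 4] << 4) | SBOX4[b & 0xF] for b in range(256))


def sbox_layer_4bit(state):
    """Apply the S-Box to a 64-bit state with 16 nibble lookups."""
    out = 0
    for i in range(16):
        out |= SBOX4[(state >> (4 * i)) & 0xF] << (4 * i)
    return out


def sbox_layer_8bit(state):
    """Apply the S-Box to a 64-bit state with 8 byte lookups."""
    out = 0
    for i in range(8):
        out |= SBOX8[(state >> (8 * i)) & 0xFF] << (8 * i)
    return out


sample = 0x0123456789ABCDEF
print(sbox_layer_4bit(sample) == sbox_layer_8bit(sample))
print(len(SBOX4_PACKED), len(SBOX8))
```

The 8-bit variant halves the number of lookups and removes the nibble extraction, at the price of a table 32 times larger. Two such tables (the S-Box and its inverse) take 2 × 256 = 512 bytes, which is the cost quoted above.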

Therefore, the best version of PRESENT seems to be the one with the unroll of permutations and the 4-bit S-Box, if memory is a major concern; if not, the 8-bit S-Box version is faster with a small cost in code size. Besides that, PRESENT is a small cipher but, when compared with other ciphers, it is really slow, as shown in the next section.

RECTANGLE

For RECTANGLE, 7 versions were evaluated: 5 are proposed here and the other 2 were obtained from the FELICS framework. Version 1 is the reference implementation, and version 10 is an optimized version of RECTANGLE, implemented in C by the authors. The flag -O1 was used, since it was the one showing the best code size/execution time results.

Figure 4.13: RECTANGLE Code Optimizations Results

Figure 4.13 depicts the obtained results. All versions, except the full unroll, have smaller code sizes and faster execution when compared to the reference implementation. It is also possible to conclude that function inlining is the optimization that brings the largest improvement to the execution time, with versions 10, 12 and 14 using it.

The key scheduler of RECTANGLE was also optimized by changing the architecture orientation to 32 bits, which increased the key scheduler performance by reducing the number of operations performed, through the use of the full register length of the target device. The execution time of the reference implementation (version 1) was 6324 clock cycles. The execution time of the optimized version available in the state of the art (version 10) was 2371 clock cycles. The proposed implementation (used in versions 11, 12, 13, 14 and 15) achieved an execution time of 2063 clock cycles, which is the smallest value among the three.

Considering the relative values given in figure 4.14, which compares the execution time and code size with the reference implementation (version 1), it is possible to conclude that changing the architecture orientation brings an improvement in both execution time and code size (21% smaller code size, 27% faster execution time). However, function inlining is the optimization that achieves the largest improvement and, because no code is reused in RECTANGLE, it does not increase the code size. It reached a 65% faster execution and a 33% smaller code size, which is very close to the 70% faster execution of the optimized C version created by the authors, and already better than its 26% smaller code size. When applying the cipher state stored in registers optimization, even better results are achieved. Because RECTANGLE is a bit-slice cipher, that optimization had more impact than in other ciphers like PRESENT. It reached a 71% faster execution time and a 36% smaller code size, because of the reduction of memory accesses. So, with that combination of optimizations, the proposed implementation (version 14) achieves a better execution time (71% faster vs 70% for the SotA) and a smaller code size (64% of the original code size vs 74% for the SotA) than the optimized C version of the authors (version 10) that was available in the state of the art. The full unroll versions achieve an execution time 75% faster than the reference implementation but at a very large cost in code size, which makes them unhelpful.

Figure 4.14: RECTANGLE Relative Gain/Loss

With these results, it seems clear that the best RECTANGLE version is the one that combines the 3 optimizations (version 14): 32-bit orientation, function inlining, and cipher state in registers. It sped up the cipher execution 3.41× with a code size of only 476 bytes, which is very small and suitable for constrained devices with an architecture similar to the one used in this work. The proposed implementation also performs better than the optimized version of the authors (version 10).

RoadRunneR

For RoadRunneR, 13 implementations were evaluated: the reference implementation available in FELICS and the 12 versions proposed here. The optimization flag used was -O1, because it was the one showing the best code size/execution time results.

Figure 4.15: RoadRunneR Code Optimizations Results

The results obtained are presented in figure 4.15. RoadRunneR is a very small cipher, but has large execution times, since it is 8-bit oriented. The first implemented version (version 7), with cleaned code and only one key scheduler, already shows a very good improvement in the execution time. It is also possible to notice that almost every version with a key scheduler (versions 7, 9, 11, 13, 15 and 17) is slightly faster than the corresponding version without a key scheduler (versions 8, 10, 12, 14, 16 and 18). This happens because not having a key scheduler adds a small overhead to the cipher for calculating the current key position to be used. Similar to RECTANGLE, RoadRunneR benefits from almost every optimization; only the full unroll is not very helpful. Nevertheless, because RoadRunneR is a Feistel Network and its Feistel function is used in both encryption and decryption, when its functions are inlined (versions 11 and 12) the code size increases, because the code that was reused (more specifically the Feistel function) is duplicated for the encryption and decryption.

As mentioned before, RoadRunneR has 3 different options for key scheduling: two different key schedulers (one for encryption, another for decryption); only one key scheduler (proposed here); and no key scheduler. The execution time of the encryption key schedule is equal for both versions that use a key schedule, and is 200 clock cycles. The decryption key schedule execution time is 1938 clock cycles. Because the two key schedule option has a large overhead in the decryption, the best options are the one key schedule and the no key schedule.

Figure 4.16: RoadRunneR Relative Gain/Loss

Comparing the implemented versions to the reference implementation, as shown in figure 4.16, it is possible to conclude that changing the implementation to a 32-bit architecture orientation reduced the code size by 30% when using a key schedule (version 9) and by 17% if no key scheduler is used (version 10). The execution time is reduced by 46% when using a key schedule (version 9), and by 42% if not (version 10). That happens because the number of instructions is reduced, since the full length of the registers is being used. Function inlining reduced the execution time by 48%/49% but increased the code size by 72%/74%, so it may not be a very good improvement (versions 11 and 12). As with other bit-slice ciphers, the cipher state stored in registers optimization shows very good results, leading the cipher to an execution time up to 73%/72% faster, but at the cost of doubling the code size. The full unroll can push the cipher to a 75% faster execution but at too high a cost in code size to be usable.

In conclusion, if speed is the main objective, RoadRunneR with 32-bit orientation, function inlining and state in registers is the most suitable choice (versions 15 or 16). If size is the main restriction, then the version with only 32-bit orientation is the best choice, since it speeds up the cipher 1.86× with a code size of only ≈ 400 bytes. Because the RoadRunneR key schedule is so simple, the best version seems to be the proposed one with one key scheduler (versions 9 or 15), because it has a smaller code size and a smaller execution time than the version with no key scheduler (versions 10 or 16), with an overhead of only 200 clock cycles (the cost of the key scheduling).

SPARX

For SPARX, 9 versions were evaluated: 2 already available in FELICS (the reference implementation and an optimized C version created by the authors of the cipher) and the 7 versions proposed here. The optimization flag used was -O1, since it was the one showing the best code size/execution time results.

Figure 4.17: SPARX Code Optimizations Results

Figure 4.17 presents the results obtained for the implemented versions. Versions 38, 39 and 40
have very similar results to versions 41, 42 and 43. As mentioned in the implementation section,
the existing SPARX implementation already tries to keep the cipher state in registers, so
applying that optimization leaves little work to do and the results of the corresponding
versions are very similar. It is also possible to see that having the speckey function return
its value (version 38), instead of using pointers to access and change the value in memory
(version 37), makes the cipher faster. In addition, the proposed implementations (versions 37
to 43) are all faster than the optimized version designed by the SPARX authors (version 36).

The key schedule of SPARX was also optimized, with code cleanup and a change to 32-bit
orientation. The reference implementation of the key schedule had an execution time of 3636
clock cycles, and the optimized version created by the authors took 1589 clock cycles. The
version proposed here achieved 869 clock cycles, the fastest among all the versions. It is 4.18×
faster than the reference implementation and 1.83× faster than the authors' optimized version.
This is because, when the orientation is changed, the full width of the registers of the target
device is used, so fewer operations are performed.
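The speedup factors quoted for the key schedule follow directly from the cycle counts. The short sketch below reproduces that arithmetic and also expresses it as a percentage reduction, the other form used in this chapter. The function names are hypothetical; the cycle counts are the ones given in the text.

```python
def speedup(reference_cycles, optimized_cycles):
    """How many times faster the optimized version is than the reference."""
    return reference_cycles / optimized_cycles


def reduction_percent(reference_cycles, optimized_cycles):
    """Execution time reduction relative to the reference, in percent."""
    return 100.0 * (1.0 - optimized_cycles / reference_cycles)


# SPARX key schedule execution times in clock cycles, as reported in the text.
REFERENCE, AUTHORS_OPTIMIZED, PROPOSED = 3636, 1589, 869

vs_reference = speedup(REFERENCE, PROPOSED)
vs_authors = speedup(AUTHORS_OPTIMIZED, PROPOSED)

print(f"{vs_reference:.2f}x faster than the reference implementation")
print(f"{vs_authors:.2f}x faster than the authors' optimized version")
print(f"{reduction_percent(REFERENCE, PROPOSED):.1f}% fewer cycles than the reference")
```

The two forms are related by speedup = 1 / (1 − reduction), so a reduction close to 50% corresponds to a speedup close to 2×.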

Comparing the execution time and code size with the reference implementation (version 1), shown
in figure 4.18, it is possible to conclude that changing the code to a 32-bit orientation with
pointers in the speckey² function (version 37) speeds up the reference implementation nearly 3×
and yields a 68% smaller code size. The version with the 32-bit orientation and the speckey
function returning values (version 38) is nearly 4× faster and 75% smaller in code size. This
means that, in this case, returning a value is faster and smaller than using pointers. Inlining
speckey (version 39) increased the performance of the cipher 6.6× but reduced the code size by
only 64%. Keeping the cipher state in registers, combined with rolling the steps of SPARX
(version 41), achieved a speedup of 4.3× and a 74% smaller code size. It is slightly bigger
than version 38 (1%) but its speedup is 0.3× higher, which is a good improvement for a small
overhead. The other versions with the cipher state in registers optimization (42 and 43) have
almost the same values as the corresponding versions without it (39 and 40). The full unroll
(versions 40 and 43) is only 1% faster than the inlining versions (39 and 42) while doubling the
code size, so again it is not viable.

² As referred in the state of the art (section 2.3), Speckey is the ARX-box that performs the substitutions in the SPARX cipher.

Figure 4.18: SPARX Relative Gain/Loss

In conclusion, if the focus is code size, the most suitable version is the proposed one with
speckey returning values (version 38), or the one with speckey inlined, the cipher state in
registers optimization and the steps rolled (version 41). If size is not the most important
aspect, the proposed version with speckey inlined (version 39) is the most interesting one.

SPECK

For SPECK, 5 versions were evaluated. The -O1 optimization flag was used because it showed the
best code size/execution time results.

Figure 4.19 depicts the results obtained for the implemented versions. As expected, SPECK is a
very compact cipher with very small execution times. A curious aspect of SPECK is the gap
between its encryption and decryption times: in encryption the shift is placed before the
arithmetic operations, which enables the use of the shifter that the ARM Cortex-M has before the
ALU. Comparing versions 9 and 10, this gap becomes smaller because of the decryption
optimization applied in version 10, as described in the implementation (section 3.4.2).

Figure 4.19: SPECK Code Optimizations Results
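The asymmetry between encryption and decryption can be seen in the SPECK round function itself. The sketch below is a plain Python model of one round on 32-bit words, using the rotation amounts 8 and 3 that SPECK defines for that word size. It is not the optimized code from this work, and the sample state and round key are hypothetical values. In the encryption round each rotation feeds an addition or XOR, which is the pattern a shift-before-ALU operand can absorb. In the decryption round the rotations are applied to the result of the arithmetic instead.

```python
MASK32 = 0xFFFFFFFF


def ror(x, r):
    """Rotate a 32-bit word right by r bits."""
    return ((x >> r) | (x << (32 - r))) & MASK32


def rol(x, r):
    """Rotate a 32-bit word left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def speck_round_enc(x, y, k):
    # Rotation comes first and its result feeds the add / XOR.
    x = ((ror(x, 8) + y) & MASK32) ^ k
    y = rol(y, 3) ^ x
    return x, y


def speck_round_dec(x, y, k):
    # Arithmetic comes first and the rotation is applied to its result.
    y = ror(y ^ x, 3)
    x = rol(((x ^ k) - y) & MASK32, 8)
    return x, y


# Hypothetical state words and round key, for illustration only.
x0, y0, k0 = 0x01234567, 0x89ABCDEF, 0x0F1E2D3C
cx, cy = speck_round_enc(x0, y0, k0)
px, py = speck_round_dec(cx, cy, k0)
print((px, py) == (x0, y0))
```

The decryption round simply undoes the encryption steps in reverse order, so the two functions form an exact inverse pair for any round key.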

Considering relative values (figure 4.20), the proposed cipher state stored in registers
optimization reduced the execution time by 35% at no cost in code size. This is because SPECK is
an ARX cipher which, like the bit-slice ciphers, is mainly composed of operations on the cipher
state. The version with the decryption optimization reduced the execution time by 44%, with an
increase of 8% in code size. The difference to the previous version is only 9% because this is
a comparison of the mean of the encryption and decryption execution times; considering only the
decryption time, the reduction reaches 22% when compared to the original. The full unrolls
reduced the execution time by 68%, but at a cost of 5× the code size. However, since SPECK is a
very small cipher, the code is still smaller than 1 KB. The full unrolls also show a small gap
between the encryption and decryption execution times, because the data dependencies are removed
and the pipelining of operations becomes easier.

In conclusion, the best implementation is the proposed one with the cipher state stored in
registers optimization plus the decryption optimization, since it gives the cipher a fast
average execution at a very low cost in code size.

4.4 Comparative Analysis

The previous section compared the results of the optimizations for each cipher. This section
presents a comparative analysis of the best results obtained for different trade-offs, namely
the smallest, the most balanced and the fastest optimized versions of each cipher. To better
evaluate the full cost of the encryption, the key schedule is also evaluated for the considered
algorithms. To put the results in context, an analysis of the proposed implementations against
the state of the art is also presented. This analysis considers the proposed balanced versions
and other state-of-the-art ciphers available from the FELICS project. The results were obtained
using the FELICS framework.

Figure 4.20: SPECK Relative Gain/Loss

The cipher versions and flags used to obtain these results are described in Appendix C. These
versions were selected based on the results presented in the previous section.

Finally, the performed energy consumption evaluation is also presented.

Figure 4.21 depicts the code size results for all the ciphers, considering the several
implementation options, including the reference versions and two state-of-the-art versions
[92]. The results make clear that AES and CLEFIA are consistently the largest ciphers across all
the implementations. Their proposed implementations are even bigger than the reference
implementations, given the use of the T-Box, which significantly increases the code size.
Nevertheless, the proposed reduced T-Box solutions lower that code size cost. The fastest
implementations are clearly the ones with the biggest code size. This was expected, since they
mostly use full unrolling or function inlining to achieve better execution times, and these
optimizations increase the code size. As expected, the small implementations have the lowest
code size of all the implementations. Also, the majority of the ciphers have a code size below
500 bytes, which is very good. Another positive point is that some of the balanced
implementations are close to the code size of the small implementations, which is very promising
since their execution times show very good results. SPECK is the smallest cipher in all
implementations, as expected, followed by RoadRunneR, SPARX and NOEKEON.

Figure 4.22 depicts the execution time results for all the proposed cipher implementations when
operating on 128 bits of data. As expected, the reference implementations show the slowest
results, except for PRESENT, where the slowest execution time belongs to the small code size
implementation. This shows that most of the optimizations used improved the execution time even
when the focus was reducing the code size. While the fastest implementations show the best
execution times, they are followed very closely by the balanced implementations, which have
significantly smaller code sizes. Interestingly, for SPARX all the proposed versions are faster
not only than the reference implementation but also than the optimized version available in the
state of the art. The ciphers with the biggest improvements in execution time are clearly AES
and CLEFIA, given the T-Box Reduced optimization. Another good result is that, for the balanced
implementations, the execution time is close to 1000 clock cycles for almost every cipher; only
RoadRunneR and PRESENT are far from that value. PRESENT has bit-oriented permutations that are
heavy to compute in software. RoadRunneR is an 8-bit oriented bit-slice cipher, which makes it
less interesting for 32-bit devices. On the other hand, SPECK is clearly the fastest cipher,
which was expected given how simple and small it is. It is followed by NOEKEON, which also has a
small code size. These good results are achieved because SPECK, an ARX cipher, and NOEKEON, a
bit-slice cipher, mainly perform arithmetic and logical operations rather than more costly
operations such as memory accesses. AES and CLEFIA have smaller execution times than SPARX and
RECTANGLE in the balanced implementations because AES and CLEFIA support blocks of 128 bits,
whereas SPARX and RECTANGLE only support 64-bit blocks and therefore need to perform the
encryption twice to process the same amount of data.

Figure 4.21: Code Size Results of the different implementations

Figure 4.22: Execution Time Results of the different implementations

Figure 4.23: Efficiency Results of the different implementations

Figure 4.23 compares the implementation versions in terms of efficiency, ordered from worst to
best. The efficiency (equation 4.1) was calculated by multiplying the inverse of the code size
by the inverse of the execution time. The closer the efficiency is to 1, the better.

$$\text{Efficiency} = \frac{1}{\text{Code Size}} \times \frac{1}{\text{Execution Time}} \tag{4.1}$$
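Equation 4.1 can be sketched in a few lines of Python. The function name and the two sets of figures below are hypothetical and are not measurements from this work. They only illustrate how the metric trades code size against execution time: a slightly slower but much smaller implementation scores higher than a faster, larger one.

```python
def efficiency(code_size_bytes, execution_cycles):
    """Equation 4.1: inverse code size times inverse execution time."""
    if code_size_bytes <= 0 or execution_cycles <= 0:
        raise ValueError("code size and execution time must be positive")
    return (1.0 / code_size_bytes) * (1.0 / execution_cycles)


# Hypothetical implementations of the same cipher (illustrative values only).
balanced = efficiency(code_size_bytes=600, execution_cycles=1000)
fastest = efficiency(code_size_bytes=2000, execution_cycles=800)

print(f"balanced: {balanced:.3e}")
print(f"fastest:  {fastest:.3e}")
print("balanced is more efficient" if balanced > fastest else "fastest is more efficient")
```

Because both factors are inverses, halving either the code size or the execution time doubles the efficiency, so the metric weighs the two costs equally.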


The results show that the balanced implementations are the ones with the best efficiency values,
validating the considered trade-off between code size and execution time. Most of these ciphers
use the optimizations that best suit their design. Curiously, the small implementations also had
very good efficiency results. This probably happens because lightweight ciphers are already
designed to have small execution times, so even when optimizing for code size the execution time
remains very small and not much efficiency is lost. The fastest implementations do not have good
efficiency because they trade a lot of code size for faster execution, which does not always pay
off.

PRESENT has very similar efficiency for almost all the considered implementations, since
improving its execution time requires spending a lot of code size. The same happens with
RoadRunneR, whose fastest implementation has very poor efficiency because of its large code size
(increased by 25×). AES and CLEFIA have low efficiency when compared to ciphers like NOEKEON,
RECTANGLE, SPARX and SPECK, since they need a considerable amount of memory to store the T-Box
(even in the reduced T-Box version) in order to achieve an execution time close to that of those
ciphers. The NOEKEON, RECTANGLE, SPARX and SPECK balanced implementations are the most
efficient, because they have an average execution time (encryption/decryption) below 1000 clock
cycles (RECTANGLE is above, but very close to that value) with a code size smaller than 700
bytes. From this point onwards, the balanced implementations are used as the reference in the
following comparisons, including those with the state of the art.

Figure 4.24: Proposed Key Schedule Implementations

Figure 4.24 depicts the setup time for the proposed cipher implementations. The proposed
NOEKEON implementation works in direct mode, so it has no key schedule. The RoadRunneR
implementation also has no key schedule. AES is the only cipher that requires a different key
schedule for encryption and decryption, and it is the one with the largest key schedule code
size. CLEFIA and PRESENT, the lightweight standards, show the slowest and largest key schedules
when compared to the other lightweight ciphers, while RECTANGLE, SPARX and SPECK have a key
schedule code size below 200 bytes. Finally, SPARX and SPECK are the fastest among the ciphers
that need key scheduling.

Table 4.1 summarizes the results obtained for the balanced implementations, plus the NOEKEON
small implementation. These implementations are the ones whose configuration achieved the best
efficiency values, and therefore have the best trade-off between code size and execution time.
The table also shows the optimizations used in each cipher implementation.

Cipher          | Code Size Difference | Code Size  | Speedup      | Optimizations
----------------+----------------------+------------+--------------+----------------------------------------------------------
AES             | +68%                 | 3852 bytes | 12× faster   | T-Box Reduced; Permutations Unrolling; State in Registers
CLEFIA          | +189%                | 5684 bytes | 11× faster   | T-Box Reduced; Constants Stored; Functions Inlined
NOEKEON         | +19%                 | 580 bytes  | 4.3× faster  | Constants Stored
NOEKEON (Small) | -21%                 | 385 bytes  | 3.2× faster  | Single Function
PRESENT         | +112%                | 1968 bytes | 2.5× faster  | 8-bits S-Box; Unrolling Permutations
RECTANGLE       | -36% ³               | 476 bytes  | 3.2× ⁴ faster | 32-bits orientation
RoadRunneR      | +129%                | 1264 bytes | 3.7× faster  | No Key Schedule
SPARX           | -64% ⁵               | 644 bytes  | 6.6× ⁶ faster | Speckey Function Inlined
SPECK           | +5%                  | 156 bytes  | 1.4× faster  | State in Registers; Decryption Optimization ⁷

Table 4.1: Summary of Best Proposed Implementations Results


Figure 4.25: Proposed Implementations vs State Of Art Results

Figure 4.25 presents an overview of the proposed balanced implementations and several ciphers
from the state of the art, namely Fantomas and Robin [65], LED [64], SIMON [62], HIGHT [69],
LBlock [70], LEA [79], TWINE [75], Piccolo [77], Chaskey [80], PRIDE [67], PRINCE [68] and
RobinStar [66]. These results were obtained using the best implementations available from the
FELICS project [92] on the ARM Cortex-M3. Some of them are the highest scored implementations in
the Triathlon competition [91], whose results can be seen on the Triathlon webpage [103]. The
ciphers are ordered by their throughput (MB/s), which makes it easy to see which ciphers are the
fastest when encrypting/decrypting data. Among the proposed implementations, SPECK is clearly
the fastest, followed by NOEKEON. The AES and CLEFIA T-Box implementations come next, followed
by SPARX and RECTANGLE. The slowest ones are RoadRunneR and PRESENT. RoadRunneR's big
disadvantage is that it is an 8-bit oriented bit-slice cipher.
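The relation between cycle counts, block size and throughput can be sketched as follows. Everything numeric here is an assumption made for illustration: the 84 MHz clock is the usual core frequency of the Arduino Due and is not stated in this section, 1 MB is taken as 10^6 bytes, and the cycle counts are hypothetical rather than measured. The sketch shows why a 64-bit block cipher must run in half the cycles per block to match the throughput of a 128-bit block cipher.

```python
ASSUMED_CLOCK_HZ = 84_000_000  # assumption: typical Arduino Due core clock


def throughput_mb_per_s(block_bits, cycles_per_block, clock_hz=ASSUMED_CLOCK_HZ):
    """Throughput in MB/s (1 MB = 10**6 bytes) for a given cost per block."""
    bytes_per_block = block_bits // 8
    blocks_per_second = clock_hz / cycles_per_block
    return bytes_per_block * blocks_per_second / 1e6


# Hypothetical cycle counts, not measurements from this work.
cipher_128 = throughput_mb_per_s(block_bits=128, cycles_per_block=1000)
cipher_64_fast = throughput_mb_per_s(block_bits=64, cycles_per_block=500)
cipher_64_same = throughput_mb_per_s(block_bits=64, cycles_per_block=1000)

print(f"128-bit block, 1000 cycles: {cipher_128:.3f} MB/s")
print(f" 64-bit block,  500 cycles: {cipher_64_fast:.3f} MB/s")
print(f" 64-bit block, 1000 cycles: {cipher_64_same:.3f} MB/s")
```

With the same cost per block, the 64-bit cipher delivers half the throughput, which matches the observation that 64-bit block ciphers need two encryptions to process 128 bits of data.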

In this figure it is possible to see that the throughput of the proposed optimized
implementations is among the best; only Chaskey, LEA and SIMON can match or challenge their
performance. That is because these are ARX-based ciphers and thus, like SPECK, have very simple
operations and are very small, which may compromise their security. On the other hand, the code
size of the proposed implementations is very similar to that of the other ciphers, depending on
the algorithm design.

³ Code size 10% smaller than the authors' optimized RECTANGLE
⁴ Execution time 1% faster than the authors' optimized RECTANGLE
⁵ Code size 1% smaller than the authors' optimized SPARX
⁶ Execution time 26% faster than the authors' optimized SPARX
⁷ Optimized taking advantage of the Shift + ALU


4.5 Energy Consumption

In the state of the art, no experimental measurements of energy consumption were found for
software implementations of lightweight ciphers. Energy consumption is only estimated, using
formulas or processor emulation to obtain values for the energy consumed by the cipher. For
example, in [22] an equation (eq. 4.2) is used, and in [52] a power model of the StrongARM
SA-1100 is used to calculate the energy consumed by the ciphers. To improve the optimization and
evaluation of the ciphers, the experimental measurement of the energy consumption is proposed.

$$\text{Energy (J)} = \frac{\text{Consumed Power (W)} \times \text{Clock Cycles}}{\text{Frequency (Hz)}} \tag{4.2}$$
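As a worked instance of equation 4.2, the sketch below estimates the energy of one operation from an assumed constant power draw. The function name and all numeric values are hypothetical: the power figure and cycle count are placeholders, and the 84 MHz frequency is an assumption about the target board rather than a value taken from [22].

```python
def estimated_energy_joules(power_watts, clock_cycles, frequency_hz):
    """Equation 4.2: power multiplied by the execution time (cycles / frequency)."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    execution_time_s = clock_cycles / frequency_hz
    return power_watts * execution_time_s


# Hypothetical values for illustration only.
energy = estimated_energy_joules(power_watts=0.1, clock_cycles=1000, frequency_hz=84e6)
print(f"{energy * 1e6:.3f} uJ")
```

With a fixed power figure the estimate is directly proportional to the cycle count, so under this model a cipher that takes half the cycles is estimated to use half the energy.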

For this, an oscilloscope (Picoscope [104]) connected to a computer was used to measure the
voltage variations caused by the processor when running the implemented ciphers. The setup is
illustrated in Figure 4.26.

Figure 4.26: Assembled Circuit for the Measurements

In this figure it is possible to observe that a 2 Ω resistor was placed in the VCC input of the
board, in order to have 2 points at which to measure the voltage. The point before the resistor
was connected to channel B of the oscilloscope and the other resistor terminal was connected to
channel D, so that the difference between the two channels gives the voltage drop ∆V(t) across
the resistor. Channel C of the oscilloscope was used as a trigger and was connected to digital
output pin 5 of the Arduino Due board.

The circuit was powered by an 8.23 V DC voltage supply. Channels B and D of the oscilloscope
were configured to measure voltage as follows:

• Mode: DC


• Range: ±200 mV

• Offset: -8 V

The configuration for channel C, used as the trigger, was:

• Mode: DC

• Range: ±500 mV

• Pre-Trigger: 5%

• Probe: 1/10

The used oscilloscope configuration was:

• Sample Frequency: 312.5 MHz

• Capture Duration: 20 000 µs

• Number of Samples: 200

The capture was done by running the targeted cipher code inside a while(true) loop. In each
iteration, the trigger was turned on at the start of each operation (namely key schedule,
encryption or decryption) and turned off after the operation finished. The code was compiled and
uploaded to the board using the Arduino IDE [105], because when using FELICS the board could not
be powered through the power input, only by USB.

Initially, the functions provided by the Arduino library to manage the state of the digital
output port (used for the trigger) were not working properly, leading to uncertainty in the
measured execution periods of the cipher. To reduce this problem, and after some research, an
alternative to these functions was found: direct port manipulation [106].

Recall that the goal was to detect the small fluctuations of the voltage across the resistor
(equation 4.3), and thus of the consumed power. Since the resistance is fixed, the current can
be obtained from equation 4.4, which in turn allows the power consumption to be computed using
equation 4.5. The energy consumed by the processor in a given time interval can be obtained by
integrating the power over time (equation 4.6). Finally, the average energy consumed per data
block can be obtained by dividing this total energy by the number of blocks the cipher processes
in that interval (equation 4.7).

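$$\Delta V(t) = V_B(t) - V_D(t) \tag{4.3}$$

$$I(t) = \frac{\Delta V(t)}{R} \tag{4.4}$$

$$P(t) = V_D(t) \times I(t) \tag{4.5}$$

$$E_{\text{total}} = \int P(t)\,dt \tag{4.6}$$

$$E_{\text{per block}} = \frac{E_{\text{total}}}{\text{Number of blocks}} \tag{4.7}$$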
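Equations 4.3 to 4.7 can be applied to sampled oscilloscope data as in the following minimal sketch. It is not the processing script used in this work: the function name, the use of the trapezoidal rule for the integral, and the sample values are all assumptions made for illustration. Only the 2 Ω resistance and the channel roles (B before the resistor, D after it) come from the text.

```python
def energy_per_block(v_b, v_d, dt, resistance=2.0, n_blocks=1):
    """Energy per block in joules from equally spaced voltage samples.

    v_b, v_d: voltages (V) on channels B and D; dt: sample spacing (s).
    """
    if len(v_b) != len(v_d) or len(v_b) < 2:
        raise ValueError("need two equally long traces with at least 2 samples")
    power = []
    for vb, vd in zip(v_b, v_d):
        delta_v = vb - vd               # equation 4.3
        current = delta_v / resistance  # equation 4.4
        power.append(vd * current)      # equation 4.5
    # Equation 4.6, approximated with the trapezoidal rule (an assumption).
    e_total = sum((power[i] + power[i + 1]) * 0.5 * dt for i in range(len(power) - 1))
    return e_total / n_blocks           # equation 4.7


# Hypothetical constant traces: 5 samples, 1 ms apart, 4 blocks processed.
trace_b = [8.23] * 5
trace_d = [8.03] * 5
e_block = energy_per_block(trace_b, trace_d, dt=1e-3, n_blocks=4)
print(f"{e_block * 1e6:.1f} uJ per block")
```

With constant traces the power is constant, so the result reduces to power multiplied by the capture duration and divided by the number of blocks, which is a convenient sanity check before using real captures.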

In each execution 200 samples were taken (to obtain a more precise value), each with a duration
of 20 ms. The experimental results obtained were somewhat unexpected, showing nearly identical
energy values for all the ciphers. Therefore, different approaches were tried to reduce the
error of the obtained values, namely disabling the trigger, differential reads, and test
executions of loops containing only arithmetic operations, nops or memory accesses. All the
approaches reached the same result. This suggests that the static power consumption of the board
is more significant than any fluctuations caused by the type of instructions being executed by
the processor. The existing noise also made the signal acquisition harder, particularly when
trying to correlate the small energy variations with the different instructions executed.

Figure 4.27: Energy Consumption per Block/Key Schedule (µJ)

Figure 4.27 shows the results obtained for the balanced implementations of the ciphers
presented in the previous section. As mentioned before, the energy spent by the processor in the
same time interval is very similar for all the ciphers, so the execution time is the factor that
most impacts the energy consumption. This means that the ciphers that process more data in the
same time interval are the ones that use less energy per computation (block
encryption/decryption and calculation of the round keys). As expected, the results show that the
fastest ciphers are the ones with the lowest energy consumption. It is possible that the Arduino
libraries were enabling all the peripherals, making the static consumption of the board higher.

4.6 Summary

The experimental results obtained allow the conclusion that the proposed optimizations achieve
good performance improvements. The proposed T-Box reduced optimizations improved the execution
time of AES and CLEFIA by 85% and 83%, with a code size increase of 39% and 179%, respectively.
This is a very small increase when compared with the normal T-Box optimization, which has the
same improvement in execution time but a code size cost of 298% and 398%, respectively. Also,
with the proposed code optimizations the execution time of AES improved by 12×, with a code size
increase of 68%. CLEFIA, with the proposed code optimizations, improved its execution time by
11×, with a code size increase of 189%.

On the other hand, the bit-slice ciphers show a very small code footprint and very good
performance when the implementation is oriented to the architecture of the processor. For
NOEKEON, the proposed code optimizations improved its performance by 4.3×, with a cost of only
19% in code size. Another version of NOEKEON, targeting a small code size, was also proposed; it
reduces the code size to only 79% of the reference implementation with a speedup of 3.2× in
execution time, which is a very good result. For RECTANGLE, the proposed optimizations allow a
speedup of 3.2× in execution time with a reduction of 36% in code size, which is 1% faster and
10% smaller than the optimized version proposed by the authors. RoadRunneR is the bit-slice
cipher with the worst performance; with the proposed code optimizations its execution improves
by 3.7×, while its code size increases by 129%. The bit-slice ciphers benefited a lot from
function inlining and from having the cipher state stored in registers.

The proposed PRESENT optimization sped up the execution by 2.5×, with a code size increase of
112%, but the execution time is still too high to be considered good software performance for a
lightweight cipher. That improvement in PRESENT was achieved with 8-bit S-Boxes and the
unrolling of the permutations.

Ciphers with an ARX structure showed good performance and small code sizes; however, these types
of ciphers raise significant security concerns. For SPARX, the proposed code optimizations
improved the execution time by 6.6× while reducing the code size by 64%, which are very good
results. The proposed optimization also beat the optimized version in the state of the art by
26% in execution time and 1% in code size. For SPECK, the proposed optimizations specifically
targeted the decryption code but also improved the encryption, speeding up the execution by 1.4×
with a cost of only 5% in code size. SPECK is the smallest and fastest cipher analyzed. SPARX,
which has an ARX-based S-Box, is the fourth smallest and fastest.

Almost every cipher had a good improvement in execution time, some of them at a cost in code
size. The presented analysis also showed that all the proposed implementations achieve good
results in performance, and some in code size, when compared with other lightweight ciphers.

To conclude this analysis, experimental measurements of the energy consumption were also
obtained, which cannot be found in the state of the art. The results obtained suggest that the
execution time is the main factor in the energy consumption: the faster the cipher, the less
energy it consumes.


Chapter 5

Conclusions

5.1 Achievements

With the IoT paradigm growing every year and the usage of constrained devices increasing, the
need for lightweight encryption and efficient implementations becomes clear, making it a hot
research topic with a lot of work done in the past years. This work takes into account the fact
that the current standard for lightweight encryption clearly does not fit software well, which
calls for a new standard for lightweight encryption that considers software implementations.

Given this, the work presented here focuses on analyzing, improving the implementations of, and
evaluating 7 different lightweight block ciphers plus AES, to find good alternatives for
lightweight encryption. These ciphers were carefully selected given their characteristics. The
proposed optimizations were applied based on what each cipher needed and on its structure. Each
cipher received different levels of optimization, considering which optimizations fitted well
with the cipher structure and code. In this way, different implementations of each cipher were
obtained, focusing on different requirements. Some implementations focus on reducing the code
size, while others focus on reducing the execution time, with the respective trade-offs. AES and
CLEFIA are the two clear examples of ciphers that spend a lot of code size in order to achieve
faster performance. However, they also take advantage of the ARM Cortex-M3 feature that allows a
register to be shifted before an arithmetic or logical operation with no overhead, which reduces
the code size. SPECK also takes advantage of that feature to increase the performance of
encryption and decryption when the rounds are properly rolled. Bit-slice ciphers proved to be
clearly good alternatives, with NOEKEON being one of the fastest lightweight ciphers presented
so far while having a small code size. RECTANGLE also achieved good results on a 32-bit
platform. RoadRunneR was a little below expectations, given its 8-bit orientation. Despite this,
all the bit-slice ciphers had very small code sizes. SPARX also showed very good and interesting
results for an ARX-based SPN cipher, which combines the security of SPN ciphers with the light
performance of ARX ciphers. PRESENT, even with several optimizations, is still a very heavy
cipher in software. The main conclusions from this analysis are:

• The current standard lightweight encryption algorithms do not show very good results in
software when compared with new lightweight algorithms.

• PRESENT is an unfriendly lightweight cipher when targeting software implementations.

• AES and CLEFIA with the proposed T-Box Reduced optimization showed very good performances,

but with a big trade-off on code size, being only viable for devices that have some memory to spend.

• The feature of shifting without generating overhead of the ARM architecture processors is very

useful for some ciphers, particularly, AES and CLEFIA T-Box, and SPECK.

• Bit-slice ciphers show very good results, in both performance and code size, and could be the best

way to go, on future standards, for the lightweight world.

• Among the ciphers studied, NOEKEON is the cipher with the best memory/execution time
performance apart from SPECK (which was created by the NSA and is not very well regarded by the
community).

The experimental energy consumption evaluation is another novel point of this work. This is
something that has not been done in the state of the art, adding value to this work. The results
were not ideal, given the overall processor power consumption. These results suggest that the
energy consumption is mostly determined by the time spent computing.

When comparing the results with the objectives outlined at the beginning of this thesis, it
becomes clear that most of them were achieved. Several different ciphers were studied and
different optimized versions were proposed for each one, most of them with better performance
and code size than the optimizations proposed by the authors. The implementations showed very
good results when compared with other state-of-the-art ciphers for the target processor. Good
alternatives for lightweight encryption were also found. The only drawback of this work is that
the RAM consumption measurement could not be performed, because the J-Link debugger needed for
it was too expensive to buy. Overall, a thorough evaluation of existing lightweight ciphers has
been presented, and several improved implementations have also been proposed.

5.2 Future Work

Future work can focus on improving the energy consumption evaluation of the ciphers, by building
on and complementing the conclusions obtained in this work and by understanding how the
different instructions affect the energy consumption of the processor. This is particularly
relevant given that energy is one of the main constraints in the IoT world. Also, an evaluation
on other platforms and devices would be very interesting, considering processors with cache and
8-bit and 16-bit architectures, in order to find a cipher that performs well on different
devices.


Bibliography

[1] F. Xia, L. T. Yang, L. Wang, and A. Vinel. Internet of things. International Journal of Communication

Systems, 25(9):1101–1102, 2012.

[2] S. Helal, W. Mann, H. El-Zabadani, J. King, Y. Kaddoura, and E. Jansen. The gator tech smart

house: A programmable pervasive space. Computer, 38(3):50–60, 2005.

[3] A. Caragliu, C. Del Bo, and P. Nijkamp. Smart cities in europe. Journal of urban technology, 18

(2):65–82, 2011.

[4] A. Zanella, N. Bui, A. Castellani, L. Vangelista, and M. Zorzi. Internet of things for smart cities.

IEEE Internet of Things journal, 1(1):22–32, 2014.

[5] J. Lee, B. Bagheri, and H.-A. Kao. A cyber-physical systems architecture for industry 4.0-based

manufacturing systems. Manufacturing Letters, 3:18–23, 2015.

[6] P. Varaiya. Smart cars on smart roads: problems of control. IEEE Transactions on automatic

control, 38(2):195–207, 1993.

[7] F. TongKe. Smart agriculture based on cloud computing and iot. Journal of Convergence Information Technology, 8(2), 2013.

[8] Gartner says the internet of things installed base will grow to 26 billion units by 2020, 2018.

https://www.gartner.com/newsroom/id/2636073, Accessed: 19/07/2018.

[9] L. Spencer. Internet of things market to hit $7.1 trillion by 2020: Idc, 2018. https://www.zdnet.com/article/internet-of-things-market-to-hit-7-1-trillion-by-2020-idc/, Accessed: 19/07/2018.

[10] Raspberry pi, 2018. https://www.raspberrypi.org/, Accessed: 19/07/2018.

[11] Arduino, 2018. https://www.arduino.cc/, Accessed: 19/07/2018.

[12] Nodemcu, 2018. http://nodemcu.com/index_en.html, Accessed: 19/07/2018.

[13] Z.-K. Zhang, M. C. Y. Cho, C.-W. Wang, C.-W. Hsu, C.-K. Chen, and S. Shieh. Iot security: ongoing

challenges and research opportunities. In Service-Oriented Computing and Applications (SOCA),

2014 IEEE 7th International Conference on, pages 230–234. IEEE, 2014.


[14] M. U. Farooq, M. Waseem, A. Khairi, and S. Mazhar. A critical analysis on the security concerns

of internet of things (iot). International Journal of Computer Applications, 111(7), 2015.

[15] T. Xu, J. B. Wendt, and M. Potkonjak. Security of iot systems: Design challenges and opportuni-

ties. In Proceedings of the 2014 IEEE/ACM International Conference on Computer-Aided Design,

pages 417–423. IEEE Press, 2014.

[16] R. Mahmoud, T. Yousuf, F. Aloul, and I. Zualkernan. Internet of things (iot) security: Current

status, challenges and prospective measures. In Internet Technology and Secured Transactions

(ICITST), 2015 10th International Conference for, pages 336–341. IEEE, 2015.

[17] S. Panasenko and S. Smagin. Lightweight cryptography: Underlying principles and approaches.

International Journal of Computer Theory and Engineering, 3(4):516, 2011.

[18] M. Katagi and S. Moriai. Lightweight cryptography for the internet of things. Sony Corporation,

pages 7–10, 2008.

[19] A. Moradi and A. Poschmann. Lightweight cryptography and dpa countermeasures: A survey. In

Financial Cryptography Workshops, pages 68–79. Springer, 2010.

[20] C. Manifavas, G. Hatzivasilis, K. Fysarakis, and K. Rantos. Lightweight cryptography for embed-

ded systems–a comparative analysis. In Data Privacy Management and Autonomous Sponta-

neous Security, pages 333–349. Springer, 2014.

[21] W. J. Okello, Q. Liu, F. A. Siddiqui, and C. Zhang. A survey of the current state of lightweight

cryptography for the internet of things. In Computer, Information and Telecommunication Systems

(CITS), 2017 International Conference on, pages 292–296. IEEE, 2017.

[22] J. Hosseinzadeh and A. G. Bafghi. Software implementation and evaluation of lightweight sym-

metric block ciphers of the energy perspectives and memory. arXiv preprint arXiv:1706.03909,

2017.

[23] A. Bogdanov, L. R. Knudsen, G. Leander, C. Paar, A. Poschmann, M. J. Robshaw, Y. Seurin, and

C. Vikkelsoe. Present: An ultra-lightweight block cipher. In CHES, volume 4727, pages 450–466.

Springer, 2007.

[24] R. Benadjila, J. Guo, V. Lomne, and T. Peyrin. Implementing lightweight block ciphers on x86

architectures. In International Conference on Selected Areas in Cryptography, pages 324–351.

Springer, 2013.

[25] Arduino due board, 2018. https://store.arduino.cc/arduino-due, Accessed: 01/08/2018.

[26] F. L. Lewis. Wireless sensor networks. Smart environments: technologies, protocols, and appli-

cations, pages 11–46, 2004.

[27] Amazon echo, 2018. https://www.amazon.com/all-new-amazon-echo-speaker-with-wifi-alexa-dark-charcoal/dp/B06XCM9LJ4, Accessed: 30/07/2018.

[28] Google home, 2018. https://store.google.com/us/product/google_home?hl=en-US, Accessed: 30/07/2018.

[29] U. of Washington. The arm architecture, 2018. https://courses.cs.washington.edu/courses/cse466/10au/pdfs/lectures/07-arm_overview.pdf, Accessed: 30/07/2018.

[30] R. W. Tech. Arm’s race to embedded world domination, 2018. https://www.realworldtech.com/arms-race/, Accessed: 30/07/2018.

[31] A. Limited. Arm cortex-m family, 2018. https://www.arm.com/products/processors/cortex-m,

Accessed: 30/07/2018.

[32] M. Bellare and P. Rogaway. Introduction to modern cryptography. Ucsd Cse, 207:207, 2005.

[33] S. M. Bellovin. Frank miller: Inventor of the one-time pad. Cryptologia, 35(3):203–222, 2011.

[34] F. Miller. Telegraphic code to insure privacy and secrecy in the transmission of telegrams. CM

Cornwell, 1882.

[35] G. S. Vernam. Cipher printing telegraph systems for secret wire and radio telegraphic communi-

cations. Transactions of the American Institute of Electrical Engineers, 45:295–301, 1926.

[36] R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public-key

cryptosystems. Communications of the ACM, 21(2):120–126, 1978.

[37] W. Diffie and M. Hellman. New directions in cryptography. IEEE transactions on Information

Theory, 22(6):644–654, 1976.

[38] J. W. Bos, J. A. Halderman, N. Heninger, J. Moore, M. Naehrig, and E. Wustrow. Elliptic curve

cryptography in practice. In International Conference on Financial Cryptography and Data Secu-

rity, pages 157–175. Springer, 2014.

[39] A. Canteaut. Open problems related to algebraic attacks on stream ciphers. In Coding and

cryptography, pages 120–134. Springer, 2006.

[40] A. Klein. Attacks on the rc4 stream cipher. Designs, codes and cryptography, 48(3):269–286,

2008.

[41] J. D. Golic. Cryptanalysis of alleged a5 stream cipher. In International Conference on the Theory

and Applications of Cryptographic Techniques, pages 239–255. Springer, 1997.

[42] J. Daemen and V. Rijmen. The design of Rijndael: AES-the advanced encryption standard.

Springer Science & Business Media, 2013.

[43] J. O. Grabbe. The des algorithm illustrated. Laissez Faire City Times, 2(28):12–15, 1992.

[44] Y. Kumar, R. Munjal, and H. Sharma. Comparison of symmetric and asymmetric cryptography

with existing vulnerabilities and countermeasures. International Journal of Computer Science and

Management Studies, 11(03), 2011.

[45] R. L. Rivest, M. Robshaw, R. Sidney, and Y. L. Yin. The rc6tm block cipher. In First Advanced

Encryption Standard (AES) Conference, page 16, 1998.

[46] G. Hatzivasilis, K. Fysarakis, I. Papaefstathiou, and C. Manifavas. A review of lightweight block

ciphers. Journal of Cryptographic Engineering, pages 1–44, 2017.

[47] T. Eisenbarth and S. Kumar. A survey of lightweight-cryptography implementations. IEEE Design

& Test of Computers, 24(6), 2007.

[48] M. Appel, A. Bossert, S. Cooper, T. Kußmaul, J. Loffler, C. Pauer, and A. Wiesmaier. Block ciphers

for the iot–simon, speck, katan, led, tea, present, and sea compared.

[49] W. K. Koo, H. Lee, Y. H. Kim, and D. H. Lee. Implementation and analysis of new lightweight cryp-

tographic algorithm suitable for wireless sensor networks. In Information Security and Assurance,

2008. ISA 2008. International Conference on, pages 73–76. IEEE, 2008.

[50] S. KOTEL, F. SBIAA, M. ZEGHID, M. MACHHOUT, A. BAGANNE, and R. TOURKI. Performance

evaluation and design considerations of lightweight block cipher for low-cost embedded devices.

[51] R. J. Cruz, T. B. Reis, D. F. Aranha, J. Lopez, and H. K. Patil. Lightweight cryptography on arm.

[52] J. Großschadl, S. Tillich, C. Rechberger, M. Hofmann, and M. Medwed. Energy evaluation of

software implementations of block ciphers under memory constraints. In Proceedings of the con-

ference on Design, automation and test in Europe, pages 1110–1115. EDA Consortium, 2007.

[53] S. Kerckhof, F. Durvaux, C. Hocquet, D. Bol, and F.-X. Standaert. Towards green cryptography:

a comparison of lightweight ciphers from the energy viewpoint. Cryptographic Hardware and

Embedded Systems–CHES 2012, pages 390–407, 2012.

[54] S. Matsuda and S. Moriai. Lightweight cryptography for the cloud: exploit the power of bitslice

implementation. Cryptographic Hardware and Embedded Systems–CHES 2012, pages 408–425,

2012.

[55] T. Caddy. Side-channel attacks. In Encyclopedia of Cryptography and Security, pages 1204–1204.

Springer, 2011.

[56] M. K. Pehlivanoglu, S. Akleylek, M. T. Sakallı, and N. Duru. On the design strategies of diffu-

sion layers and key schedule in lightweight block ciphers. In Computer Science and Engineering

(UBMK), 2017 International Conference on, pages 456–461. IEEE, 2017.

[57] G. Bansod, N. Pisharoty, and A. Patil. Boron: an ultra-lightweight and low power encryption

design for pervasive computing. Frontiers of Information Technology & Electronic Engineering, 18

(3):317–331, 2017.

[58] S. Banik, S. K. Pandey, T. Peyrin, Y. Sasaki, S. M. Sim, and Y. Todo. Gift: a small present. In

International Conference on Cryptographic Hardware and Embedded Systems, pages 321–345.

Springer, 2017.

[59] Z. Gong, S. Nikova, and Y. W. Law. Klein: A new family of lightweight block ciphers. RFIDSec,

7055:1–18, 2011.

[60] J. Daemen, M. Peeters, G. Van Assche, and V. Rijmen. Nessie proposal: Noekeon. In First Open

NESSIE Workshop, pages 213–230, 2000.

[61] W. Zhang, Z. Bao, D. Lin, V. Rijmen, B. Yang, and I. Verbauwhede. Rectangle: a bit-slice

lightweight block cipher suitable for multiple platforms. Science China Information Sciences, 2015.

[62] R. Beaulieu, D. Shors, J. Smith, S. Treatman-Clark, B. Weeks, and L. Wingers. The simon and

speck families of lightweight block ciphers. Cryptology ePrint Archive, Report 2013/404, 2013.

https://eprint.iacr.org/2013/404.

[63] R. Beaulieu, S. Treatman-Clark, D. Shors, B. Weeks, J. Smith, and L. Wingers. The simon

and speck lightweight block ciphers. In Design Automation Conference (DAC), 2015 52nd

ACM/EDAC/IEEE, pages 1–6. IEEE, 2015.

[64] J. Guo, T. Peyrin, A. Poschmann, and M. Robshaw. The led block cipher. In B. Preneel and

T. Takagi, editors, Cryptographic Hardware and Embedded Systems – CHES 2011, pages 326–

341, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.

[65] V. Grosso, G. Leurent, F.-X. Standaert, and K. Varıcı. Ls-designs: Bitslice encryption for efficient

masked software implementations. In International Workshop on Fast Software Encryption, pages

18–37. Springer, 2014.

[66] A. Journault, F.-X. Standaert, and K. Varici. Improving the security and efficiency of block ciphers

based on ls-designs. Designs, Codes and Cryptography, 82(1-2):495–509, 2017.

[67] M. R. Albrecht, B. Driessen, E. B. Kavun, G. Leander, C. Paar, and T. Yalcın. Block ciphers–focus

on the linear layer (feat. pride). In International Cryptology Conference, pages 57–76. Springer,

2014.

[68] J. Borghoff, A. Canteaut, T. Guneysu, E. B. Kavun, M. Knezevic, L. R. Knudsen, G. Leander,

V. Nikov, C. Paar, C. Rechberger, et al. Prince–a low-latency block cipher for pervasive comput-

ing applications. In International Conference on the Theory and Application of Cryptology and

Information Security, pages 208–225. Springer, 2012.

[69] D. Hong, J. Sung, S. Hong, J. Lim, S. Lee, B. Koo, C. Lee, D. Chang, J. Lee, K. Jeong, et al.

Hight: A new block cipher suitable for low-resource device. In CHES, volume 4249, pages 46–59.

Springer, 2006.

[70] W. Wu and L. Zhang. Lblock: a lightweight block cipher. In Applied Cryptography and Network

Security, pages 327–344. Springer, 2011.

[71] J. Patil, G. Bansod, and K. S. Kant. Lici: A new ultra-lightweight block cipher. In Emerging Trends

& Innovation in ICT (ICEI), 2017 International Conference on, pages 40–45. IEEE, 2017.

[72] S. Kotel, M. Zeghid, M. Machhout, and R. Tourki. Lightweight encryption algorithm based on mod-

ified xtea for low-resource embedded devices. In Proceedings of the 21st International Database

Engineering & Applications Symposium, pages 192–199. ACM, 2017.

[73] G. BANSOD, S. SUTAR, A. PATIL, and J. PATIL. Nux: A new lightweight block cipher for security

at wireless sensor node level.

[74] T. Shirai, K. Shibutani, T. Akishita, S. Moriai, and T. Iwata. The 128-bit blockcipher clefia. In FSE,

volume 4593, pages 181–195. Springer, 2007.

[75] T. Suzaki, K. Minematsu, S. Morioka, and E. Kobayashi. Twine: A lightweight block cipher for

multiple platforms. In Selected Areas in Cryptography, volume 7707, pages 339–354. Springer,

2012.

[76] A. Baysal and S. Sahin. Roadrunner: A small and fast bitslice block cipher for low cost 8-bit

processors. In International Workshop on Lightweight Cryptography for Security and Privacy,

pages 58–76. Springer, 2015.

[77] K. Shibutani, T. Isobe, H. Hiwatari, A. Mitsuda, T. Akishita, and T. Shirai. Piccolo: an ultra-

lightweight blockcipher. In International Workshop on Cryptographic Hardware and Embedded

Systems, pages 342–357. Springer, 2011.

[78] D. Dinu, L. Perrin, A. Udovenko, V. Velichkov, J. Großschadl, and A. Biryukov. Sparx: A family of

arx-based lightweight block ciphers provably secure against linear and differential attacks.

[79] D. Hong, J.-K. Lee, D.-C. Kim, D. Kwon, K. H. Ryu, and D.-G. Lee. Lea: A 128-bit block cipher

for fast encryption on common processors. In International Workshop on Information Security

Applications, pages 3–27. Springer, 2013.

[80] N. Mouha, B. Mennink, A. Van Herrewege, D. Watanabe, B. Preneel, and I. Verbauwhede.

Chaskey: an efficient mac algorithm for 32-bit microcontrollers. In International Workshop on

Selected Areas in Cryptography, pages 306–323. Springer, 2014.

[81] D. Engels, X. Fan, G. Gong, H. Hu, and E. M. Smith. Hummingbird: ultra-lightweight cryptography

for resource-constrained devices. In International Conference on Financial Cryptography and Data

Security, pages 3–18. Springer, 2010.

[82] D. W. Engels, M.-J. O. Saarinen, P. Schweitzer, and E. M. Smith. The hummingbird-2 lightweight

authenticated encryption algorithm. RFIDSec, 11:19–31, 2011.

[83] S. Das. Halka: A lightweight, software friendly block cipher using ultra-lightweight 8-bit s-box.

IACR Cryptology ePrint Archive, 2014:110, 2014.

[84] C. De Canniere, O. Dunkelman, and M. Knezevic. Katan and ktantan—a family of small and

efficient hardware-oriented block ciphers. In Cryptographic Hardware and Embedded Systems-

CHES 2009, pages 272–288. Springer, 2009.

[85] E. Kasper and P. Schwabe. Faster and timing-attack resistant aes-gcm. In Cryptographic Hard-

ware and Embedded Systems-CHES 2009, pages 1–17. Springer, 2009.

[86] L. R. Knudsen and H. Raddum. Nes/doc/uib/wp3/009. 2001.

[87] J. Daemen, M. Peeters, G. Assche, and V. Rijmen. On noekeon no!, 2001.

[88] Gate equivalent, 2018. https://www.jedec.org/standards-documents/dictionary/terms/gate-equivalent-1-cmos, Accessed: 10/10/2018.

[89] M. Cazorla, S. Gourgeon, K. Marquet, and M. Minier. Implementations of lightweight block ciphers

on a wsn430 sensor, 2015. http://bloc.project.citi-lab.fr/library.html.

[90] D. Dinu, A. Biryukov, J. Großschadl, D. Khovratovich, Y. Corre, and L. Perrin. Felics–fair evaluation

of lightweight cryptographic systems. In NIST Workshop on Lightweight Cryptography, volume

128, 2015.

[91] D. Dinu, Y. L. Corre, D. Khovratovich, L. Perrin, J. Großschadl, and A. Biryukov. Triathlon of

lightweight block ciphers for the internet of things. Technical report, IACR ePrint archive, 2015.

[92] CryptoLUX. Cryptolux > felics, 2017. https://www.cryptolux.org/index.php/FELICS.

[93] CryptoLUX. Felics - block ciphers brief evaluation results. https://www.cryptolux.org/index.php/FELICS_Block_Ciphers_Brief_Results.

[94] S. Banescu. Cache timing attacks, 2011.

[95] E. Biham. A fast new des implementation in software. In International Workshop on Fast Software

Encryption, pages 260–272. Springer, 1997.

[96] J. Yiu. The Definitive Guide to the ARM® Cortex-M3. Newnes, 2009.

[97] S. Corporation. Clefia website. https://www.sony.net/Products/cryptography/clefia/, Ac-

cessed: 25/07/2018.

[98] P. M. Knijnenburg, T. Kisuki, and M. F. O’Boyle. Combined selection of tile sizes and unroll factors

using iterative compilation. The Journal of Supercomputing, 24(1):43–67, 2003.

[99] Risc vs. cisc architectures: Which one is better?, 2018. https://www.microcontrollertips.com/risc-vs-cisc-architectures-one-better, Accessed: 30/07/2018.

[100] A. Limited. Gnu arm embedded toolchain. https://developer.arm.com/open-source/gnu-toolchain/gnu-rm, Accessed: 02/08/2018.

[101] Gcc gnu optimize options. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html,

Accessed: 02/08/2018.

[102] Noekeon website, 2018. https://http://gro.noekeon.org/, Accessed: 20/08/2018.

[103] CryptoLUX. Cryptolux > felics triathlon, 2017. https://www.cryptolux.org/index.php/FELICS_Triathlon.

[104] Picoscope, 2018. https://www.picotech.com/oscilloscope/9300/picoscope-9300-sampling-oscilloscopes, Accessed: 10/10/2018.

[105] Arduino ide, 2018. https://www.arduino.cc/en/Main/Software, Accessed: 10/10/2018.

[106] Arduino due pwm frequency, 2018. http://www.kerrywong.com/2014/09/21/on-arduino-due-pwm-frequency/, Accessed: 10/10/2018.

Appendix A

Pseudo-Code of the Ciphers

Encryption

Algorithm 3: AES Encryption Pseudo-Code
Data: STATE[4] → Cipher State; RK → Round Keys Array;
AddRoundKey(STATE, &RK[0]);
for i = 1 to NUMBER_OF_ROUNDS do
    SubBytes(STATE);
    ShiftRows(STATE);
    MixColumns(STATE);
    AddRoundKey(STATE, &RK[i × 4]);
end
SubBytes(STATE);
ShiftRows(STATE);
AddRoundKey(STATE, &RK[(NUMBER_OF_ROUNDS + 1) × 4]);

Algorithm 4: CLEFIA Encryption Pseudo-Code
Data: STATE[4] → Cipher State; RK → Round Keys Array; WK → Whitening Keys Array;
STATE[1] = STATE[1] ⊕ WK[0];
STATE[3] = STATE[3] ⊕ WK[1];
for i = 0 to i < NUMBER_OF_ROUNDS do
    STATE[1] = STATE[1] ⊕ F0(STATE[0], RK[i × 2]);
    STATE[3] = STATE[3] ⊕ F1(STATE[2], RK[(i + 1) × 2]);
    TEMP = STATE[0];
    STATE[0] = STATE[1];
    STATE[1] = STATE[2];
    STATE[2] = STATE[3];
    STATE[3] = TEMP;
end
STATE[1] = STATE[1] ⊕ WK[2];
STATE[3] = STATE[3] ⊕ WK[3];

Algorithm 5: NOEKEON Encryption Pseudo-Code
Data: STATE[4] → Cipher State; KEY → Key; CONSTANT → Constants Table;
for i = 0 to i < NUMBER_OF_ROUNDS do
    STATE[0] = STATE[0] ⊕ CONSTANT[i × 2];
    Theta(STATE, KEY);
    STATE[0] = STATE[0] ⊕ CONSTANT[(i + 1) × 2];
    Pi1(STATE);
    Gamma(STATE);
    Pi2(STATE);
end
STATE[0] = STATE[0] ⊕ CONSTANT[NUMBER_OF_ROUNDS × 2];
Theta(STATE, KEY);

Algorithm 6: PRESENT Encryption Pseudo-Code
Data: STATE[64] → Cipher State; RK → Round Keys Array;
for i = 0 to i < NUMBER_OF_ROUNDS do
    AddRoundKey(STATE, RK[i]);
    SBoxLayer(STATE);
    PLayer(STATE);
end
AddRoundKey(STATE, RK[NUMBER_OF_ROUNDS]);
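The structure of Algorithm 6 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation evaluated in this work. It assumes that the state is held in a single 64-bit integer instead of the STATE[64] array, that the 32 round keys have already been expanded (the key schedule is omitted), and that the S-box and bit permutation are the ones published in the PRESENT specification [23]. All function names are chosen here for illustration, and the inverse layers are included only so that the sketch can be checked with a round trip.

```python
# 4-bit S-box from the PRESENT specification [23].
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
INV_SBOX = [SBOX.index(v) for v in range(16)]

NUMBER_OF_ROUNDS = 31  # 31 rounds followed by a final key addition


def sbox_layer(state, box=SBOX):
    """Apply a 4-bit S-box to each of the 16 nibbles of the 64-bit state."""
    out = 0
    for nibble in range(16):
        out |= box[(state >> (4 * nibble)) & 0xF] << (4 * nibble)
    return out


def p_layer(state):
    """Move bit i to position 16*i mod 63 (bit 63 stays in place)."""
    out = 0
    for i in range(64):
        target = 63 if i == 63 else (16 * i) % 63
        out |= ((state >> i) & 1) << target
    return out


def inv_p_layer(state):
    """Inverse of p_layer: the bit at position 16*i mod 63 returns to i."""
    out = 0
    for i in range(64):
        source = 63 if i == 63 else (16 * i) % 63
        out |= ((state >> source) & 1) << i
    return out


def present_encrypt(state, round_keys):
    """Round structure of Algorithm 6 (round_keys holds 32 64-bit values)."""
    for i in range(NUMBER_OF_ROUNDS):
        state ^= round_keys[i]       # AddRoundKey
        state = sbox_layer(state)    # SBoxLayer
        state = p_layer(state)       # PLayer
    return state ^ round_keys[NUMBER_OF_ROUNDS]


def present_decrypt(state, round_keys):
    """Undo present_encrypt by applying the inverse layers in reverse order."""
    state ^= round_keys[NUMBER_OF_ROUNDS]
    for i in reversed(range(NUMBER_OF_ROUNDS)):
        state = inv_p_layer(state)
        state = sbox_layer(state, INV_SBOX)
        state ^= round_keys[i]
    return state
```

The bit-by-bit permutation loop is the part that the "Unroll of the Permutation Layer" versions listed in Appendix C target, since it dominates the cost of a straightforward software implementation.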

Algorithm 7: RECTANGLE Encryption Pseudo-Code
Data: STATE[4][16] → Cipher State; RK → Round Keys Array;
for i = 0 to i < NUMBER_OF_ROUNDS do
    AddRoundKey(STATE, RK[i]);
    SubColumn(STATE);
    ShiftRow(STATE);
end
AddRoundKey(STATE, RK[NUMBER_OF_ROUNDS]);

Algorithm 8: RoadRunneR Encryption Pseudo-Code
Data: STATE[2] → Cipher State; RK → Round Keys Array; WK → Whitening Keys Array;
STATE[0] = STATE[0] ⊕ WK[0];
for i = 0 to i < NUMBER_OF_ROUNDS − 1 do
    STATE[1] = STATE[1] ⊕ F(STATE[0], RK[i], i);
    TEMP = STATE[0];
    STATE[0] = STATE[1];
    STATE[1] = TEMP;
end
STATE[1] = STATE[1] ⊕ F(STATE[0], RK[i], NUMBER_OF_ROUNDS − 1);
STATE[0] = STATE[0] ⊕ WK[1];

Algorithm 9: SPARX Encryption Pseudo-Code
Data: STATE[2] → Cipher State; RK → Round Keys Array;
for i = 0 to i < NUMBER_OF_STEPS do
    RKT = &RK[i × (NUMBER_OF_ROUNDS × 2)];
    for j = 0 to j < NUMBER_OF_ROUNDS do
        STATE[0] = STATE[0] ⊕ RKT[j];
        Speckey(LeftBits(STATE[0]), RightBits(STATE[0]));
        STATE[1] = STATE[1] ⊕ RKT[j + NUMBER_OF_ROUNDS];
        Speckey(LeftBits(STATE[1]), RightBits(STATE[1]));
    end
    TEMP = STATE[0];
    STATE[0] = STATE[1] ⊕ (STATE[0] ⊕ RotationLeft(STATE[0], 8) ⊕ RotationRight(STATE[0], 8));
    STATE[1] = TEMP;
end
STATE[0] = STATE[0] ⊕ RK[NUMBER_OF_STEPS × NUMBER_OF_ROUNDS × 2];
STATE[1] = STATE[1] ⊕ RK[(NUMBER_OF_STEPS × NUMBER_OF_ROUNDS × 2) + 1];

Algorithm 10: SPECK Encryption Pseudo-Code
Data: LS = STATE[0]; RS = STATE[1]; RK → Round Keys Array;
for i = 0 to i < NUMBER_OF_ROUNDS do
    LS = RS + RotationRight(LS, α);
    LS = LS ⊕ RK[i];
    RS = LS ⊕ RotationLeft(RS, β);
end
STATE[0] = RS;
STATE[1] = LS;
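The round loop of Algorithm 10 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code evaluated in this work: it assumes 32-bit words with the rotation amounts α = 8 and β = 3 given for that word size in the SPECK specification [62], it takes the round keys as already expanded (the key schedule is omitted), and it returns the two words instead of writing them back to STATE. The function names and the decryption routine are added here only so that the sketch can be checked with a round trip.

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1
ALPHA, BETA = 8, 3  # rotation amounts for 32-bit words, as given in [62]


def rotr(x, r):
    """Rotate a 32-bit word right by r bits."""
    return ((x >> r) | (x << (WORD_BITS - r))) & MASK


def rotl(x, r):
    """Rotate a 32-bit word left by r bits."""
    return ((x << r) | (x >> (WORD_BITS - r))) & MASK


def speck_encrypt(ls, rs, round_keys):
    """Apply one add-rotate-xor round per round key, as in Algorithm 10."""
    for rk in round_keys:
        ls = (rs + rotr(ls, ALPHA)) & MASK   # modular addition on 32 bits
        ls ^= rk
        rs = ls ^ rotl(rs, BETA)
    return ls, rs


def speck_decrypt(ls, rs, round_keys):
    """Invert speck_encrypt by undoing each round in reverse key order."""
    for rk in reversed(round_keys):
        rs = rotr(rs ^ ls, BETA)
        ls = rotl(((ls ^ rk) - rs) & MASK, ALPHA)
    return ls, rs
```

Each round uses only an addition, two rotations and two exclusive-ors on 32-bit words, which map directly onto single instructions of a 32-bit processor with a barrel shifter.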

Appendix B

Implemented Versions

Cipher | Algorithm Optimizations | Code Optimizations

A B C D E F G H I J K L

AES 128 128 v08 × ×

AES 128 128 v09 × ×

AES 128 128 v10 × × ×

AES 128 128 v11 × × × ×

AES 128 128 v12 × × × ×

AES 128 128 v13 × × × × ×

AES 128 128 v14 Reference Implementation

AES 128 128 v15 × ×

CLEFIA 128 128 v01 Reference Implementation

CLEFIA 128 128 v02 ×

CLEFIA 128 128 v03 × ×

CLEFIA 128 128 v04 × ×

CLEFIA 128 128 v05 × × ×

CLEFIA 128 128 v06 × × × ×

CLEFIA 128 128 v07 × × × ×

CLEFIA 128 128 v08 × × × × ×

CLEFIA 128 128 v09 × × × × × ×

CLEFIA 128 128 v10 × × × × × ×

CLEFIA 128 128 v11 × × × × × × ×

NOEKEON 128 128 v01 Reference Implementation

NOEKEON 128 128 v02 Reference Implementation

NOEKEON 128 128 v03 × ×

NOEKEON 128 128 v04 × ×

NOEKEON 128 128 v05 × × ×

NOEKEON 128 128 v06 × × ×

NOEKEON 128 128 v07 × × × ×

NOEKEON 128 128 v08 × × × ×

NOEKEON 128 128 v09 × × × × ×

NOEKEON 128 128 v10 × × × ×

NOEKEON 128 128 v11 × × × ×

NOEKEON 128 128 v12 × × × × ×

PRESENT 64 80 v07 Reference Implementation

PRESENT 64 80 v08 × ×

PRESENT 64 80 v09 × ×

PRESENT 64 80 v10 × × ×

PRESENT 64 80 v11 × × ×

PRESENT 64 80 v12 × × × ×

PRESENT 64 80 v13 × × × × ×

RECTANGLE 64 128 v11 × × ×

RECTANGLE 64 128 v12 × × × ×

RECTANGLE 64 128 v13 × × × × ×

RECTANGLE 64 128 v14 × × × × ×

RECTANGLE 64 128 v15 × × × × × ×

RoadRunneR 64 128 v07 × ×

RoadRunneR 64 128 v08 × ×

RoadRunneR 64 128 v09 × × ×

RoadRunneR 64 128 v10 × × ×

RoadRunneR 64 128 v11 × × × ×

RoadRunneR 64 128 v12 × × × ×

RoadRunneR 64 128 v13 × × × × ×

RoadRunneR 64 128 v14 × × × × ×

RoadRunneR 64 128 v15 × × × × ×

RoadRunneR 64 128 v16 × × × × ×

RoadRunneR 64 128 v17 × × × × × ×

RoadRunneR 64 128 v18 × × × × × ×

SPARX 64 128 v37 × × ×

SPARX 64 128 v38 × × ×

SPARX 64 128 v39 × × ×

SPARX 64 128 v40 × × × ×

SPARX 64 128 v41 × × × ×

SPARX 64 128 v42 × × × ×

SPARX 64 128 v43 × × × × ×

SPECK 64 128 v07 Reference Implementation

SPECK 64 128 v08 × ×

SPECK 64 128 v09 × ×

SPECK 64 128 v10 × × ×

SPECK 64 128 v11 × × ×

Table B.1: Optimizations included in the Implemented Cipher Versions

A. T-Box
B. T-Box Reduction
C. Bit-Slice
D. Code Cleanup
E. Changing Architecture
F. Changing the size of the S-Box
G. Constants Calculation vs Constants Tables
H. Function Calls vs Function Inlining
I. Store the Cipher State in Registers
J. Partial loop unrolling
K. Full loop unrolling
L. Reordering of the operations

Appendix C

Small, Fast and Balanced Versions

C.1 Small Implementations

For the small implementations, the results obtained for all the implemented versions were evaluated and, for each cipher, the implementation with the smallest code size was selected. The compiler flag used is the one that provided the smallest code size. The most frequent one is -Os, since it is the flag that optimizes the code in terms of memory footprint.

The versions that had the smallest code size are:

• AES S: T-Box Reduced (Version 9), Flag -Os;

• CLEFIA S: T-Box Reduced + Constants Stored (Version 7), Flag -Os;

• NOEKEON S: Single Function (Version 11), Flag -Os;

• PRESENT S: 4-bit S-Box (Version 7), Flag -Os;

• RECTANGLE S: Functions Inlined + State in Registers (Version 14), Flag -Os;

• RoadRunneR S: One Key Scheduler + 32-bit orientation (Version 9), Flag -Os;

• SPARX S: 32-bit orientation + Speckey Function with return value (Version 38), Flag -O1;

• SPECK S: State in Registers (Version 9), Flag -Os.

C.2 Fast Implementations

Similarly to the small implementations, the fast implementations were obtained by evaluating the results for all the implemented versions and selecting the versions with the fastest execution times. The flag used is the one that provided the shortest execution time. Most of the selected versions include the full loop unrolling optimization, because it was the one that brought the ciphers' performance to their best results. Storing the cipher state in registers and function inlining are also present in almost every version, since both improved, in some way, the ciphers' performance.

The versions that had the fastest execution time are:

• AES F: T-Box Reduced + Full Unroll (Version 11), Flag -Os;

• CLEFIA F: T-Box Reduced + Function Inlining + State in Registers + Full Unroll (Version 11), Flag

-Os;

• NOEKEON F: Functions Inlined + State in Registers + Full Unroll (Version 12), Flag -O1;

• PRESENT F: 8-bits S-Box + Unroll of the Permutation Layer (Version 10), Flag -O3;

• RECTANGLE F: 32-bit orientation + Functions Inlined + State in Registers + Full Unroll (Version 15), Flag -O1;

• RoadRunneR F: One Key Scheduler + 32-bit orientation + Functions Inlined + State in Registers + Full Unroll (Version 17), Flag -O1;

• SPARX F: 32-bits orientation + Speckey Function Inlined + State in Registers + Full Unroll (Version

43), Flag -Os;

• SPECK F: State in Registers Optimization + Full Unroll (Version 11), Flag -O1;

C.3 Balanced Implementations

For the balanced implementations, the versions chosen are those with a good trade-off between code size and execution time. The criterion used to select these versions was to find results with good execution times that did not spend too much on code size. The flag used is the one that provided the best results, with -O1 being the predominant one.

The versions that had a good compromise between code size and execution time are:

• AES B: T-Box Reduced + Unroll of T-Box application + State in Registers (Version 12), Flag -O2;

• CLEFIA B: T-Box Reduced + Constants Stored + Functions Inlined + State in Registers (Version 10), Flag -O1;

• NOEKEON B: Constants Stored + Functions Inlined + State in Registers (Version 9), Flag -O2;

• PRESENT B: 8-bits S-Box + Unroll of Permutations (Version 10), Flag -O1;

• RECTANGLE B: 32-bit orientation + Functions Inlined + State in Registers (Version 14), Flag -O1;

• RoadRunneR B: No Key Scheduler + 32-bits orientation + Functions Inlined + State in Registers

(Version 16), Flag -O1;

• SPARX B: 32-bits orientation + Speckey Function Inlined + State in Registers (Version 42), Flag

-O1;

• SPECK B: State in Registers + Decryption Optimization (Version 10), Flag -O1;
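The three selections described in this appendix can be sketched in Python as follows. This is only a sketch: the small and fast selections follow directly from the text (smallest code size and shortest execution time), but the balanced criterion is stated only qualitatively, so the score used below (the product of the code size and execution time, each relative to the best value observed) is an assumption made here for illustration and not the procedure actually applied. The `Result` type, the function names and the measurements in `EXAMPLE` are all hypothetical.

```python
from typing import NamedTuple


class Result(NamedTuple):
    version: str     # implemented version, e.g. "v09"
    flag: str        # compiler optimization flag, e.g. "-Os"
    code_size: int   # code size in bytes
    cycles: int      # execution time in clock cycles


def smallest(results):
    """Small selection: the (version, flag) pair with the smallest code size."""
    return min(results, key=lambda r: r.code_size)


def fastest(results):
    """Fast selection: the (version, flag) pair with the shortest execution time."""
    return min(results, key=lambda r: r.cycles)


def balanced(results):
    """Assumed balanced criterion: minimise relative size times relative time."""
    min_size = min(r.code_size for r in results)
    min_cycles = min(r.cycles for r in results)
    return min(results,
               key=lambda r: (r.code_size / min_size) * (r.cycles / min_cycles))


# Made-up measurements, for illustration only.
EXAMPLE = [
    Result("v01", "-Os", 1000, 9000),
    Result("v02", "-O1", 1400, 5000),
    Result("v03", "-O3", 4000, 4500),
]
```

With the made-up values above, `smallest(EXAMPLE)` picks v01, `fastest(EXAMPLE)` picks v03, and `balanced(EXAMPLE)` picks v02, which is slightly slower than v03 but far smaller.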

Appendix D

ARM Cortex-M3

The ARM Cortex-M3 (Fig. D.1) is a 32-bit processor. It has a Harvard architecture, which means that the data bus and the instruction bus are separate. This allows data and instruction accesses to be performed in parallel, so the performance of the processor increases because data accesses do not affect the instruction pipeline. However, the instruction and data buses share the same memory space (a unified memory system). The data path, register bank, and memory interfaces are all 32 bits wide. This processor does not have a cache, but supports an external one if needed, and does not have any co-processor. It has a hardware multiplier that can perform 32-bit multiplications in one cycle and a hardware divider that takes between 2 and 12 cycles per division. Like other ARM processors, it has a barrel shifter rather than a conventional shifter, which allows shift operations to be performed in a single clock cycle.

Figure D.1: ARM Cortex-M3 Structure

D.1 Registers

The ARM Cortex-M3 has 21 registers. R0-R12 are general-purpose registers. R13 is the stack pointer, which is banked, with only one copy of it visible at a time. The two stack pointers are the Main Stack Pointer (MSP), which is the default stack pointer and is used by the operating system kernel and exception handlers, and the Process Stack Pointer (PSP), which is used by user application code. R14 is the link register, used to store the return address when a subroutine is called. R15 is the program counter, which stores the current program address. The last 5 are special registers that cannot be used for normal data processing: the program status, interrupt mask, and control registers.

D.2 Operating Modes

As mentioned above, the ARM Cortex-M3 supports an operating system (OS), so it has two operation modes and two privilege levels. The operation modes are thread mode and handler mode. These modes determine whether the processor is running a normal program or an exception handler, such as an interrupt or a system exception. The privilege levels are the privileged level and the user level. They provide a mechanism for safeguarding memory accesses to critical regions, as well as a basic security model. Exception handlers only run at the privileged level, while a normal program can run at either the user or the privileged level. When running at the privileged level, a program has access to all memory ranges (except when prohibited by the MPU settings) and can use all supported instructions. Software running at the privileged level can also switch to the user level using the control register. A program running at the user level cannot raise its own privilege through the control register. It needs to use an exception handler, which runs in handler mode at the privileged level, changes the control register, and then returns to thread mode.

Because of this separation between the user and privileged levels, the security of the processor is improved, enabling it to support an operating system. The operating system kernel runs at the privileged level and launches applications at the user level, protecting the system from unwanted accesses to the registers by untrusted programs that could crash the system.

D.3 Memory Mapping

The ARM Cortex-M3 has a predefined memory map. It has a 4 GB memory space that is divided into ranges, as shown in Fig. D.2. Because of this predefined memory map, the peripherals can be accessed using simple memory access instructions.

This processor also implements unaligned data access for the SRAM. This allows data to be packed together in a contiguous space, without leaving unused memory between items, which can reduce the amount of SRAM required by the application.

D.4 MPU

The ARM Cortex-M3 has an optional MPU (Memory Protection Unit). It allows access rules to be set for each privilege level (privileged and user). This feature is optional and depends on the implementation of the microcontroller. It can be used in many ways, for example to protect data used by the OS kernel or by privileged applications, to set some memory addresses to read-only in order to protect sensitive data from being erased or modified, or to isolate memory regions when working as a multitasking system. The MPU supports up to 8 regions, and each region can be divided into 8 subregions. The minimum region size is 32 bytes, and the size increases by factors of 2 up to the maximum memory size (4 GB). Regions are also allowed to overlap each other. Accesses to memory locations that are not defined in the MPU regions, or that are not permitted by the region settings, cause an exception.
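The region sizes described above can be enumerated with a short Python sketch. It only restates the numbers given in the text (a 32-byte minimum, doubling up to 4 GB, and 8 subregions per region). The function names are chosen here for illustration, and any additional constraints that the hardware places on regions, such as base address alignment, are deliberately left out.

```python
MIN_REGION_SIZE = 32            # bytes
MAX_REGION_SIZE = 4 * 2**30     # 4 GB
SUBREGIONS_PER_REGION = 8


def valid_region_sizes():
    """List every region size: 32 bytes, doubling until 4 GB is reached."""
    sizes = []
    size = MIN_REGION_SIZE
    while size <= MAX_REGION_SIZE:
        sizes.append(size)
        size *= 2
    return sizes


def smallest_region_for(length):
    """Return the smallest region size that covers `length` bytes."""
    for size in valid_region_sizes():
        if size >= length:
            return size
    raise ValueError("length exceeds the 4 GB address space")


def subregion_size(region_size):
    """Size of each of the 8 equal subregions of a region."""
    return region_size // SUBREGIONS_PER_REGION
```

For example, a 1000-byte buffer needs a 1024-byte region, so part of the region covers memory beyond the buffer. Splitting the region into 128-byte subregions is what allows protection to be applied at a finer granularity than the region size.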

D.5 Bus Interfaces

There are several bus interfaces in the ARM Cortex-M3 processor. Because of its Harvard architecture, it is able to perform several tasks at the same time, such as accessing data and fetching instructions. The three main buses, all 32 bits wide, are:

• Code Memory Buses

• System Bus

• Private Peripheral Buses

The code memory buses are responsible for instruction fetches and are optimized for instruction execution speed. There are two of them, one called I-Code and the other D-Code.

The system bus is responsible for the access to memory and peripherals. It is through this bus that the SRAM, external RAM, peripherals, devices and part of the system level memory regions are accessed.

The private peripheral buses provide access to the system level memory regions of the private peripherals, such as the debugging components.

The bus interfaces also implement the unaligned memory access and the bit banding feature of the ARM Cortex-M3.

D.6 Bit Banding

Bit banding is a special feature supported by the ARM Cortex-M3 that allows a single bit in memory to be read or changed using ordinary load and store instructions on special alias addresses. The memory map includes two 1 MB bit-band regions, one in the SRAM space and one in the peripheral space, each of which maps on to a 32 MB alias region. Load/store operations on an address in the alias region are translated directly into an operation on the bit aliased by that address. These operations are atomic and cannot be interrupted by other bus activities. A read issued to the alias address returns the value of the corresponding bit-band bit. A write to an address in the alias region with the least-significant bit set writes a 1 to the bit-band bit, and a write with the least-significant bit cleared writes a 0 to the bit. With this feature the ARM Cortex-M3 enables direct access to single bits of data in a single cycle, making bit masking a much faster operation.
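The mapping from a bit in a bit-band region to its alias word can be sketched as follows. The base addresses are an assumption taken from the standard Cortex-M3 memory map (SRAM bit-band region at 0x20000000 with alias at 0x22000000, peripheral bit-band region at 0x40000000 with alias at 0x42000000), and `bitband_alias` is a hypothetical helper name. Each bit occupies one 32-bit word in the alias region, so each byte of the bit-band region expands to 32 bytes of alias space.

```c
#include <stdint.h>

/* Assumed standard Cortex-M3 bit-band bases (not taken from the thesis). */
#define SRAM_BB_BASE     0x20000000u
#define SRAM_BB_ALIAS    0x22000000u
#define PERIPH_BB_BASE   0x40000000u
#define PERIPH_BB_ALIAS  0x42000000u
#define BB_REGION_SIZE   0x00100000u   /* 1 MB per bit-band region */

/* Hypothetical helper: return the alias address of bit `bit` (0..7) of
 * the byte at `byte_addr`, or 0 if the byte is outside both bit-band
 * regions or the bit number is invalid. */
uint32_t bitband_alias(uint32_t byte_addr, unsigned bit)
{
    uint32_t base;
    uint32_t alias;

    if (bit > 7u) {
        return 0u;
    }
    if (byte_addr >= SRAM_BB_BASE &&
        byte_addr - SRAM_BB_BASE < BB_REGION_SIZE) {
        base = SRAM_BB_BASE;
        alias = SRAM_BB_ALIAS;
    } else if (byte_addr >= PERIPH_BB_BASE &&
               byte_addr - PERIPH_BB_BASE < BB_REGION_SIZE) {
        base = PERIPH_BB_BASE;
        alias = PERIPH_BB_ALIAS;
    } else {
        return 0u;
    }
    /* 32 alias bytes per bit-band byte, 4 alias bytes per bit. */
    return alias + (byte_addr - base) * 32u + bit * 4u;
}
```

On the target, writing 1 or 0 through a `volatile uint32_t *` pointing at the returned address would set or clear that single bit; on a host machine the function only demonstrates the address arithmetic.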

D.7 Interrupt Controller

The ARM Cortex-M3 has a Nested Vectored Interrupt Controller (NVIC) that is closely coupled to the processor and provides the following features:

• Nested interrupt support, which enables a priority level to be assigned to each interrupt. If an interrupt with a higher priority arrives while a lower-priority interrupt is being serviced, the lower-priority interrupt is suspended and the higher-priority interrupt takes its place.

• Vectored interrupt support: a table in memory holds the addresses of all the interrupt service routines (ISR), which reduces the time the system takes to process an interrupt request.

• Dynamic priority changes, which enable the priorities of interrupts to be changed at run time.

• Interrupt masking, which allows interrupts to be masked using the interrupt masking registers. These can be used to ensure that time-critical tasks finish on time without being interrupted.

• Interrupt latency reduction, a group of advanced features that reduce the interrupt latency. As an example, the ARM Cortex-M3 can automatically save and restore some register contents.

This last feature enables the ARM Cortex-M3 to replace the shadow register exception model of the ARM7 processor with a stack-based exception model. When an exception takes place, the Program Counter, Program Status Register, Link Register and the R0-R3 and R12 general-purpose registers are pushed on to the stack. When the interrupt service routine or fault handler finishes, the processor automatically restores the registers so that the interrupted program can resume normal execution. By handling the stack operations in hardware, the ARM Cortex-M3 enables interrupt service routines to be written in C, avoiding the assembler wrappers that were usually required to perform stack manipulation. This feature makes the development of applications for this processor much easier.
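The registers pushed on exception entry can be pictured as a C structure. The sketch below assumes the standard ARMv7-M basic frame of eight words (R0-R3, R12, LR, PC, xPSR, in ascending address order from the stack pointer); the type `exception_frame_t` and the function `stacked_return_address` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of the frame pushed by hardware on exception entry,
 * lowest address first. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* Link Register of the interrupted code */
    uint32_t pc;    /* address at which execution resumes */
    uint32_t xpsr;  /* Program Status Register */
} exception_frame_t;

/* Hypothetical helper: read the address of the interrupted instruction
 * stream from a stacked frame, as a fault handler might do. */
uint32_t stacked_return_address(const exception_frame_t *frame)
{
    return frame->pc;
}
```

Since the hardware saves exactly the registers that the C calling convention treats as caller-saved, an ordinary C function can serve as a handler without an assembler prologue.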

The NVIC also integrates a System Tick (SysTick) timer, which is a 24-bit count-down timer that can be used to generate interrupts at regular time intervals, providing an ideal heartbeat to drive a Real Time OS or other scheduled tasks.
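Because the counter is only 24 bits wide, the achievable tick period depends on the clock frequency. The sketch below assumes the usual SysTick behaviour in which a reload value of N - 1 produces one interrupt every N clock cycles; `systick_reload` is a hypothetical helper, and the clock frequencies in the usage note are illustrative values, not measurements from this work.

```c
#include <stdbool.h>
#include <stdint.h>

#define SYSTICK_MAX_RELOAD 0x00FFFFFFu   /* largest 24-bit value */

/* Hypothetical helper: compute the reload value that gives `tick_hz`
 * interrupts per second from a `clock_hz` timer clock. Returns false
 * if the period needs fewer than 2 cycles or does not fit in 24 bits. */
bool systick_reload(uint32_t clock_hz, uint32_t tick_hz, uint32_t *reload)
{
    if (tick_hz == 0u) {
        return false;
    }
    uint32_t cycles = clock_hz / tick_hz;      /* cycles per tick */
    if (cycles < 2u || cycles - 1u > SYSTICK_MAX_RELOAD) {
        return false;
    }
    *reload = cycles - 1u;                     /* counts N-1 down to 0 */
    return true;
}
```

With a 72 MHz clock, a 1 kHz tick fits easily (reload 71999), whereas a 1 Hz tick would need about 72 million cycles and exceeds the 24-bit range, so a slower tick would have to be built by counting interrupts in software.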

D.8 Instruction Set Architecture

The ARM Cortex-M3 has a RISC architecture with the Thumb-2 instruction set. This is one of the most important features of this processor because it allows 32-bit and 16-bit instructions to be used together, leading to higher code density and efficiency.

Other ARM ISAs support ARM (32-bit) and Thumb (16-bit) instructions, which are executed in two different states. In the ARM state the instructions are 32 bits long and offer very good performance. In the Thumb state the instructions are 16 bits long, which leads to high code density; however, because the instructions are smaller, not all types of operations are available, and some operations need more instructions than they would in the ARM instruction set.

So, to take advantage of both instruction sets, applications developed for these older ARM processors mixed ARM and Thumb instructions, obtaining high code density without losing all the good performance of the ARM instructions. However, this led to an overhead in execution time and memory space, because the processor had to switch between states to read each type of instruction, and when a Thumb instruction was fetched half of the register size was wasted. The code also had to be compiled into different files for the different ISAs. This made the design of software for these processors more complex and reduced the efficiency of the CPU core.

To address these problems a new instruction set, called Thumb-2, was developed. It mixes the ARM and Thumb instruction sets, taking the best parts of each and combining them in a single instruction set. With Thumb-2 there is no need for two states: the different instruction sizes are handled in a single state, removing the overhead of switching between states and saving execution time and memory space. The need for different files for different instruction sets was also removed, so software development and maintenance became easier. With this new instruction set it becomes easier to achieve the best efficiency and performance when writing software for this ARM processor. In fact, the ARM Cortex-M3 does not support ARM instructions, only Thumb-2 instructions.

The Thumb-2 instructions are backward compatible with the Thumb instruction set. Some of the ARM instructions were ported to this new instruction set, but not all of them; for example, the ARM Cortex-M3 does not need instructions for co-processor operations since it does not have a co-processor.

Tests of the instruction sets showed (Fig. D.3) that Thumb-2 performance was 25% better than Thumb and almost the same as ARM. In terms of code size, Thumb-2 code was 26% smaller than ARM code and almost equal to Thumb code.

Figure D.2: ARM Cortex-M3 Memory Map

Figure D.3: Thumb vs Thumb-2 vs ARM ISAs

The only drawback of the Thumb-2 instruction set is that only registers R0-R7 are available in most 16-bit instructions; operations on registers above R7 need 32-bit instructions. So, to produce a smaller code size it may be necessary to optimize the code to use fewer registers. Also, for a better use of the pipeline, when two 16-bit instructions are found together they are both fetched in the same 32-bit word, improving the pipelining of instructions.

The Thumb-2 instruction set includes instructions that make it easier to write compact code for many different applications. For example, it has bit-field manipulation instructions, useful for applications such as network packet processing; instructions to insert or extract several bits from a register, useful in automotive applications; instructions to reverse the bits in a word; and table branch instructions that enable a balance of code compaction and high performance. It also has a new If-Then construct that enables conditional execution of up to four subsequent instructions.
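To make the bit-field operations concrete, the sketch below emulates in portable C what three Thumb-2 instructions compute: UBFX (unsigned bit-field extract), BFI (bit-field insert) and RBIT (reverse bits). The C function names mirror the mnemonics but are only illustrative; on the processor each operation is a single instruction, whereas this emulation needs several shifts and masks. The functions assume `lsb < 32`, `width >= 1` and `lsb + width <= 32`.

```c
#include <stdint.h>

/* Mask with the `width` lowest bits set (width in 1..32). */
static uint32_t field_mask(unsigned width)
{
    return (width >= 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
}

/* Emulates UBFX: extract `width` bits of `rn` starting at bit `lsb`. */
uint32_t ubfx(uint32_t rn, unsigned lsb, unsigned width)
{
    return (rn >> lsb) & field_mask(width);
}

/* Emulates BFI: copy the low `width` bits of `rn` into `rd` at `lsb`,
 * leaving the other bits of `rd` unchanged. */
uint32_t bfi(uint32_t rd, uint32_t rn, unsigned lsb, unsigned width)
{
    uint32_t mask = field_mask(width) << lsb;
    return (rd & ~mask) | ((rn << lsb) & mask);
}

/* Emulates RBIT: reverse the order of the 32 bits in `rn`. */
uint32_t rbit(uint32_t rn)
{
    uint32_t out = 0u;
    for (unsigned i = 0u; i < 32u; i++) {
        out = (out << 1) | (rn & 1u);   /* move lowest bit of rn into out */
        rn >>= 1;
    }
    return out;
}
```

The number of C operations needed for each helper gives a rough idea of the code-size and cycle savings these instructions offer to bit-oriented code such as cipher permutation layers.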

D.9 Data Path and Pipeline

The ARM Cortex-M3 data path is illustrated in Figure D.4. Its main features are the hardware multiplier and divider, which enables a multiplication to be performed in one clock cycle, and the barrel shifter, which allows a shift operation to be performed in one clock cycle and is placed before the ALU, enabling faster operation when a shift needs to be done before an arithmetic or logic operation. How this feature is used in code is explained in the Shift Before ALU subsection of Appendix D.9.

Figure D.4: ARM Cortex-M3 Datapath

The ARM Cortex-M3 has a 3-stage pipeline (Fig. D.5). The stages are instruction fetch, instruction decode and instruction execute, similar to the ARM7, but it does more operations in each stage to increase overall performance. When executing a program with multiple 16-bit instructions, the pipeline fetches two instructions at a time to waste less instruction space and improve performance. When an instruction takes more than one cycle to execute, the pipeline is stalled.

The ARM Cortex-M3 processor also has branch prediction. When a branch instruction is encountered, the decode stage also includes a speculative instruction fetch that could lead to faster execution: the processor fetches the branch destination instruction during the decode stage itself. After the branch is resolved, and the destination instruction is known, if the branch is not to be taken, the next sequential instruction is already available. If the branch is to be taken, the branch destination instruction is made available at the same time as the decision. This feature helps the pipeline to waste fewer clock cycles when a branch is encountered, leading to faster execution.

Figure D.5: ARM Cortex-M3 Pipeline

Shift Before ALU

The ARM Cortex-M3 has the barrel shifter placed before the ALU. Therefore, its ALU instructions allow a shift to be applied without overhead before the ALU operation. For example, in the ISA instructions:

• ADD Rd, Rn, Op2 (Add)

• SUB Rd, Rn, Op2 (Sub)

• AND Rd, Rn, Op2 (Logical And)

• EOR Rd, Rn, Op2 (Exclusive Or)

Op2 can be either:

• A constant

• A register with an optional shift (ASR, LSL, LSR, ROR, RRX)

As explained in [96], this additional shift does not add any extra cycles to the execution of the instruction, because the barrel shifter is placed before the ALU in the ARM Cortex-M3 data path. This also means that when the shift is omitted, an LSL #0 shift is applied to the register by default.
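The effect of the shifted second operand can be sketched in C. The helper `op2` models a register operand with an optional shift, and `add_op2` and `eor_op2` model ADD and EOR using it; all three names are hypothetical. This is a simplified emulation under stated assumptions: shift amounts are restricted to 0..31, and RRX is left out because it depends on the carry flag, which is not modelled here.

```c
#include <stdint.h>

typedef enum { SHIFT_LSL, SHIFT_LSR, SHIFT_ASR, SHIFT_ROR } shift_t;

/* Hypothetical model of Op2 as "register with optional shift".
 * A shift amount of 0 returns the register unchanged (LSL #0). */
uint32_t op2(uint32_t rm, shift_t type, unsigned n)
{
    n &= 31u;                       /* sketch only handles 0..31 */
    if (n == 0u) {
        return rm;
    }
    switch (type) {
    case SHIFT_LSL:
        return rm << n;
    case SHIFT_LSR:
        return rm >> n;
    case SHIFT_ASR:                 /* replicate the sign bit manually */
        return (rm >> n) | ((rm & 0x80000000u) ? ~(0xFFFFFFFFu >> n) : 0u);
    case SHIFT_ROR:
        return (rm >> n) | (rm << (32u - n));
    }
    return rm;
}

/* Models ADD Rd, Rn, Rm, <shift> #n */
uint32_t add_op2(uint32_t rn, uint32_t rm, shift_t type, unsigned n)
{
    return rn + op2(rm, type, n);
}

/* Models EOR Rd, Rn, Rm, <shift> #n */
uint32_t eor_op2(uint32_t rn, uint32_t rm, shift_t type, unsigned n)
{
    return rn ^ op2(rm, type, n);
}
```

For instance, `add_op2(x, x, SHIFT_LSL, 2)` computes 5x in what would be a single ADD instruction on the processor, and `eor_op2(a, b, SHIFT_ROR, n)` combines a rotation and an exclusive-or, a pattern that appears frequently in rotation-based ciphers.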

D.10 Debugging Support

The ARM Cortex-M3 supports several debug features, such as program execution controls (including halting and stepping), instruction breakpoints, data watchpoints, register and memory accesses, profiling, and traces. Debug access is made through the Debug Access Port (DAP), which can be implemented either as a Serial Wire Debug Port, for a two-pin (clock and data) interface, or as a Serial Wire JTAG Debug Port, which enables either the JTAG or the SW protocol to be used. The debug interface is decoupled from the core, and it is through the DAP bus interface that debugging is performed, using for example external debuggers.

In the ARM Cortex-M3, several events can be used to trigger debug actions. Debug events can be breakpoints, watchpoints, fault conditions, or external debugging request input signals. When a debug event takes place, the processor can either enter halt mode or execute the debug monitor exception handler.

The debug system has different units that enable the different debug features. The Data Watchpoint and Trace (DWT) unit can be used to generate data trace information and output the trace. There is also a Flash Patch and Breakpoint (FPB) unit that can provide a simple breakpoint function or remap an instruction access from Flash to a different location in SRAM, and an Instrumentation Trace Macrocell (ITM) that provides a new way for developers to output data to a debugger. All these debugging components are controlled via the DAP interface bus or by a program running on the processor core. All trace information is accessible from the Trace Port Interface Unit (TPIU).

All these features allow the ARM Cortex-M3 to be easily debugged, enabling fast software development with low-cost debugging. It also supports an optional Embedded Trace Macrocell (ETM) for instruction trace, which chip manufacturers can include to give designers excellent instruction trace capabilities with minimal cost impact.

D.11 Power Consumption

The ARM Cortex-M3 has some features that allow it to combine high processing efficiency with low power consumption. It has a low gate count and uses design techniques, such as smaller instructions, that provide a small program size and enable tasks to be completed in a short time, so that the processor can return to sleep modes as fast as possible, reducing the power consumption. It has a sleep mode and a deep sleep mode, which can be used to reduce the power consumption during idle periods.

In the more recent versions of the ARM Cortex-M3 a new feature called the Wakeup Interrupt Controller (WIC) is supported. This allows the processor state to be kept while the processor core is powered down, enabling the processor to return to an active state almost instantaneously when an interrupt occurs.

These features make the ARM Cortex-M3 a very good choice for the development of low-power applications that were usually implemented with 8-bit or 16-bit controllers. The ARM Cortex-M3 thus offers the efficiency of a 32-bit processor with the power consumption of an 8-bit or 16-bit processor, making it a good choice for microcontrollers.
