85
pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4 Matthias Kannwischer, Joost Rijneveld, Peter Schwabe, and Ko Stoffelen Radboud University, Nijmegen, The Netherlands [email protected] Aug 24, 2019, 2nd PQC Standardization Conference, Santa Barbara

pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

Page 1: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

pqm4: Testing and Benchmarking NIST PQC

on ARM Cortex-M4

Matthias Kannwischer, Joost Rijneveld, Peter Schwabe, and Ko Stoffelen Radboud University, Nijmegen, The Netherlands [email protected]

Aug 24, 2019, 2nd PQC Standardization Conference, Santa Barbara

Page 2: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4
Page 3: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ Round 1: Focus on Intel/AVX2 implementations

§ But: Majority of cryptographic devices is way smaller

‚ Limited RAM

‚ No/limited vector instructions

‚ Side-channels?

§ Challenges

‚ Do schemes even fit in limited RAM + flash?

‚ Are schemes efficient on small ARMs?

‚ What is the overhead of masking?

Motivation

§ “Performance will play a larger role in the second round”

2

Page 4: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ But: Majority of cryptographic devices is way smaller

‚ Limited RAM

‚ No/limited vector instructions

‚ Side-channels?

§ Challenges

‚ Do schemes even fit in limited RAM + flash?

‚ Are schemes efficient on small ARMs?

‚ What is the overhead of masking?

Motivation

§ “Performance will play a larger role in the second round”

§ Round 1: Focus on Intel/AVX2 implementations

2

Page 5: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ Limited RAM

‚ No/limited vector instructions

‚ Side-channels?

§ Challenges

‚ Do schemes even fit in limited RAM + flash?

‚ Are schemes efficient on small ARMs?

‚ What is the overhead of masking?

Motivation

§ “Performance will play a larger role in the second round”

§ Round 1: Focus on Intel/AVX2 implementations

§ But: Majority of cryptographic devices is way smaller

2

Page 6: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ No/limited vector instructions

‚ Side-channels?

§ Challenges

‚ Do schemes even fit in limited RAM + flash?

‚ Are schemes efficient on small ARMs?

‚ What is the overhead of masking?

Motivation

§ “Performance will play a larger role in the second round”

§ Round 1: Focus on Intel/AVX2 implementations

§ But: Majority of cryptographic devices is way smaller

‚ Limited RAM

2

Page 7: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ Side-channels?

§ Challenges

‚ Do schemes even fit in limited RAM + flash?

‚ Are schemes efficient on small ARMs?

‚ What is the overhead of masking?

Motivation

§ “Performance will play a larger role in the second round”

§ Round 1: Focus on Intel/AVX2 implementations

§ But: Majority of cryptographic devices is way smaller

‚ Limited RAM

‚ No/limited vector instructions

2

Page 8: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ Challenges

‚ Do schemes even fit in limited RAM + flash?

‚ Are schemes efficient on small ARMs?

‚ What is the overhead of masking?

Motivation

§ “Performance will play a larger role in the second round”

§ Round 1: Focus on Intel/AVX2 implementations

§ But: Majority of cryptographic devices is way smaller

‚ Limited RAM

‚ No/limited vector instructions

‚ Side-channels?

2

Page 9: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ Do schemes even fit in limited RAM + flash?

‚ Are schemes efficient on small ARMs?

‚ What is the overhead of masking?

Motivation

§ “Performance will play a larger role in the second round”

§ Round 1: Focus on Intel/AVX2 implementations

§ But: Majority of cryptographic devices is way smaller

‚ Limited RAM

‚ No/limited vector instructions

‚ Side-channels?

§ Challenges

2

Page 10: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ Are schemes efficient on small ARMs?

‚ What is the overhead of masking?

Motivation

§ “Performance will play a larger role in the second round”

§ Round 1: Focus on Intel/AVX2 implementations

§ But: Majority of cryptographic devices is way smaller

‚ Limited RAM

‚ No/limited vector instructions

‚ Side-channels?

§ Challenges

‚ Do schemes even fit in limited RAM + flash?

2

Page 11: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ What is the overhead of masking?

Motivation

§ “Performance will play a larger role in the second round”

§ Round 1: Focus on Intel/AVX2 implementations

§ But: Majority of cryptographic devices is way smaller

‚ Limited RAM

‚ No/limited vector instructions

‚ Side-channels?

§ Challenges

‚ Do schemes even fit in limited RAM + flash?

‚ Are schemes efficient on small ARMs?

2

Page 12: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

Motivation

§ “Performance will play a larger role in the second round”

§ Round 1: Focus on Intel/AVX2 implementations

§ But: Majority of cryptographic devices is way smaller

‚ Limited RAM

‚ No/limited vector instructions

‚ Side-channels?

§ Challenges

‚ Do schemes even fit in limited RAM + flash?

‚ Are schemes efficient on small ARMs?

‚ What is the overhead of masking?

2

Page 13: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

– everyone, always

§ STM32F4DISCOVERY

‚ ARM Cortex-M4

‚ 32-bit, ARMv7E-M

‚ 192KiB RAM, 168MHz

§ PQM4: test and optimize

on the Cortex-M4

‚ github.com/mupq/pqm4

Post-quantum on small devices

“It’s big and it’s slow”

3

Page 14: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ STM32F4DISCOVERY

‚ ARM Cortex-M4

‚ 32-bit, ARMv7E-M

‚ 192KiB RAM, 168MHz

§ PQM4: test and optimize

on the Cortex-M4

‚ github.com/mupq/pqm4

Post-quantum on small devices

“It’s big and it’s slow”

– everyone, always

3

Page 15: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

Post-quantum on small devices

“It’s big and it’s slow”

– everyone, always

§ STM32F4DISCOVERY

‚ ARM Cortex-M4

‚ 32-bit, ARMv7E-M

‚ 192 KiB RAM, 168 MHz

§ PQM4: test and optimize

on the Cortex-M4

‚ github.com/mupq/pqm4

3

Page 16: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ They are huge in terms of RAM and flash

‚ Great for PQC – many schemes fit

‚ Unfortunately, pqRSA did not terminate within round 1

§ ARMv7E-M more interesting for assembly optimizations

§ NIST recommended Cortex-M4 for PQC evaluation

§ We’re using it for teaching

‚ We have dozens of them lying around

‚ Our students know how to work with them

Rationale for using STM32F4DISCOVERY boards

§ They are cheap (ă $30)

4

Page 17: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ Great for PQC – many schemes fit

‚ Unfortunately, pqRSA did not terminate within round 1

§ ARMv7E-M more interesting for assembly optimizations

§ NIST recommended Cortex-M4 for PQC evaluation

§ We’re using it for teaching

‚ We have dozens of them lying around

‚ Our students know how to work with them

Rationale for using STM32F4DISCOVERY boards

§ They are cheap (ă $30)

§ They are huge in terms of RAM and flash

4

Page 18: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ Unfortunately, pqRSA did not terminate within round 1

§ ARMv7E-M more interesting for assembly optimizations

§ NIST recommended Cortex-M4 for PQC evaluation

§ We’re using it for teaching

‚ We have dozens of them lying around

‚ Our students know how to work with them

Rationale for using STM32F4DISCOVERY boards

§ They are cheap (ă $30)

§ They are huge in terms of RAM and flash

‚ Great for PQC – many schemes fit

4

Page 19: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ ARMv7E-M more interesting for assembly optimizations

§ NIST recommended Cortex-M4 for PQC evaluation

§ We’re using it for teaching

‚ We have dozens of them lying around

‚ Our students know how to work with them

Rationale for using STM32F4DISCOVERY boards

§ They are cheap (ă $30)

§ They are huge in terms of RAM and flash

‚ Great for PQC – many schemes fit

‚ Unfortunately, pqRSA did not terminate within round 1

4

Page 20: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ NIST recommended Cortex-M4 for PQC evaluation

§ We’re using it for teaching

‚ We have dozens of them lying around

‚ Our students know how to work with them

Rationale for using STM32F4DISCOVERY boards

§ They are cheap (ă $30)

§ They are huge in terms of RAM and flash

‚ Great for PQC – many schemes fit

‚ Unfortunately, pqRSA did not terminate within round 1

§ ARMv7E-M more interesting for assembly optimizations

4

Page 21: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ We’re using it for teaching

‚ We have dozens of them lying around

‚ Our students know how to work with them

Rationale for using STM32F4DISCOVERY boards

§ They are cheap (ă $30)

§ They are huge in terms of RAM and flash

‚ Great for PQC – many schemes fit

‚ Unfortunately, pqRSA did not terminate within round 1

§ ARMv7E-M more interesting for assembly optimizations

§ NIST recommended Cortex-M4 for PQC evaluation

4

Page 22: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ We have dozens of them lying around

‚ Our students know how to work with them

Rationale for using STM32F4DISCOVERY boards

§ They are cheap (ă $30)

§ They are huge in terms of RAM and flash

‚ Great for PQC – many schemes fit

‚ Unfortunately, pqRSA did not terminate within round 1

§ ARMv7E-M more interesting for assembly optimizations

§ NIST recommended Cortex-M4 for PQC evaluation

§ We’re using it for teaching

4

Page 23: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ Our students know how to work with them

Rationale for using STM32F4DISCOVERY boards

§ They are cheap (ă $30)

§ They are huge in terms of RAM and flash

‚ Great for PQC – many schemes fit

‚ Unfortunately, pqRSA did not terminate within round 1

§ ARMv7E-M more interesting for assembly optimizations

§ NIST recommended Cortex-M4 for PQC evaluation

§ We’re using it for teaching

‚ We have dozens of them lying around

4

Page 24: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

Rationale for using STM32F4DISCOVERY boards

§ They are cheap (ă $30)

§ They are huge in terms of RAM and flash

‚ Great for PQC – many schemes fit

‚ Unfortunately, pqRSA did not terminate within round 1

§ ARMv7E-M more interesting for assembly optimizations

§ NIST recommended Cortex-M4 for PQC evaluation

§ We’re using it for teaching

‚ We have dozens of them lying around

‚ Our students know how to work with them

4

Page 25: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

pqm4

§ Goals

‚ Framework that eases optimization for this platform

‚ Automate testing and benchmarking

‚ Include as many schemes as possible

§ 4 types of implementations

‚ ref: Reference C implementations from submission

packages

‚ clean: Slightly modified reference implementations to

satisfy basic code quality requirements 1

‚ opt: Optimized portable C implementations

‚ m4: Optimized using ARMv7E-M assembly

1see https://github.com/PQClean/PQClean

5

Page 26: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

pqm4

§ Goals

‚ Framework that eases optimization for this platform

‚ Automate testing and benchmarking

‚ Include as many schemes as possible

§ 4 types of implementations

‚ ref: Reference C implementations from submission

packages

‚ clean: Slightly modified reference implementations to

satisfy basic code quality requirements 1

‚ opt: Optimized portable C implementations

‚ m4: Optimized using ARMv7E-M assembly

1see https://github.com/PQClean/PQClean

5

Page 27: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

pqm4

§ Goals

‚ Framework that eases optimization for this platform

‚ Automate testing and benchmarking

‚ Include as many schemes as possible

§ 4 types of implementations

‚ ref: Reference C implementations from submission

packages

‚ clean: Slightly modified reference implementations to

satisfy basic code quality requirements 1

‚ opt: Optimized portable C implementations

‚ m4: Optimized using ARMv7E-M assembly

1see https://github.com/PQClean/PQClean

5

Page 28: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

pqm4

§ Goals

‚ Framework that eases optimization for this platform

‚ Automate testing and benchmarking

‚ Include as many schemes as possible

§ 4 types of implementations

‚ ref: Reference C implementations from submission

packages

‚ clean: Slightly modified reference implementations to

satisfy basic code quality requirements 1

‚ opt: Optimized portable C implementations

‚ m4: Optimized using ARMv7E-M assembly

1see https://github.com/PQClean/PQClean

5

Page 29: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ ref: Reference C implementations from submission

packages

‚ clean: Slightly modified reference implementations to

satisfy basic code quality requirements 1

‚ opt: Optimized portable C implementations

‚ m4: Optimized using ARMv7E-M assembly

pqm4

§ Goals

‚ Framework that eases optimization for this platform

‚ Automate testing and benchmarking

‚ Include as many schemes as possible

§ 4 types of implementations

1 see https://github.com/PQClean/PQClean

5

Page 30: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ clean: Slightly modified reference implementations to

satisfy basic code quality requirements 1

‚ opt: Optimized portable C implementations

‚ m4: Optimized using ARMv7E-M assembly

pqm4

§ Goals

‚ Framework that eases optimization for this platform

‚ Automate testing and benchmarking

‚ Include as many schemes as possible

§ 4 types of implementations

‚ ref: Reference C implementations from submission

packages

1 see https://github.com/PQClean/PQClean

5

Page 31: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ opt: Optimized portable C implementations

‚ m4: Optimized using ARMv7E-M assembly

pqm4

§ Goals

‚ Framework that eases optimization for this platform

‚ Automate testing and benchmarking

‚ Include as many schemes as possible

§ 4 types of implementations

‚ ref: Reference C implementations from submission

packages

‚ clean: Slightly modified reference implementations to

satisfy basic code quality requirements 1

1 see https://github.com/PQClean/PQClean

5

Page 32: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ m4: Optimized using ARMv7E-M assembly

pqm4

§ Goals

‚ Framework that eases optimization for this platform

‚ Automate testing and benchmarking

‚ Include as many schemes as possible

§ 4 types of implementations

‚ ref: Reference C implementations from submission

packages

‚ clean: Slightly modified reference implementations to

satisfy basic code quality requirements 1

‚ opt: Optimized portable C implementations

1 see https://github.com/PQClean/PQClean

5

Page 33: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

pqm4

§ Goals

‚ Framework that eases optimization for this platform

‚ Automate testing and benchmarking

‚ Include as many schemes as possible

§ 4 types of implementations

‚ ref: Reference C implementations from submission

packages

‚ clean: Slightly modified reference implementations to

satisfy basic code quality requirements 1

‚ opt: Optimized portable C implementations

‚ m4: Optimized using ARMv7E-M assembly

1 see https://github.com/PQClean/PQClean

5

Page 34: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

Schemes included in pqm4– KEMs

reference optimized BIKE 7Lib — Classic McEliece 7Key — CRYSTALS-Kyber Frodo-KEM

3 3

3 3

[BKS19] [BFM ` 18]

HQC 7Lib — LAC 3 — LEDAcrypt 7RAM WIP NewHope 3 3 [AJS16] NTRU 3 3 [KRS19] NTRU Prime 3 — NTS-KEM 7Key — ROLLO 7Lib — Round5 3 3 Round5 team RQC 7Lib — SABER 3 3 [KRS19] SIKE 3 — ThreeBears 3 3 ThreeBears team

7Key : keys too large 7RAM : implementation uses too much RAM

7Lib: available implementations depend on external libraries 6

Page 35: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

Schemes included in pqm4– Signatures

reference optimized

CRYSTALS-Dilithium 3 3 [GKOS18, RSGCB19]

FALCON 7RAM 3 Falcon team

GeMSS 7Key —

LUOV 3 —

MQDSS 7RAM —

Picnic 7RAM —

qTESLA 3 —

Rainbow 7Key —

SPHINCS+ 3 —

7Key : keys too large 7RAM : implementation uses too much RAM

7Lib: available implementations depend on external libraries

7

Page 36: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

Benchmarking: Cycle Counts and RNG

§ Cycle counts

‚ We don’t want to benchmark the memory controller

§ Downclock core to 24MHz Ñ no wait states

§ Allows to have comprehensible cycle count

§ randombytes

‚ We use the hardware RNG of our platform

‚ Most schemes only sample seed, so speed doesn’t matter

8

Page 37: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ Downclock core to 24MHz Ñ no wait states

§ Allows to have comprehensible cycle count

§ randombytes

‚ We use the hardware RNG of our platform

‚ Most schemes only sample seed, so speed doesn’t matter

Benchmarking: Cycle Counts and RNG

§ Cycle counts

‚ We don’t want to benchmark the memory controller

8

Page 38: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ Allows to have comprehensible cycle count

§ randombytes

‚ We use the hardware RNG of our platform

‚ Most schemes only sample seed, so speed doesn’t matter

Benchmarking: Cycle Counts and RNG

§ Cycle counts

‚ We don’t want to benchmark the memory controller

§ Downclock core to 24MHz Ñ no wait states

8

Page 39: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ randombytes

‚ We use the hardware RNG of our platform

‚ Most schemes only sample seed, so speed doesn’t matter

Benchmarking: Cycle Counts and RNG

§ Cycle counts

‚ We don’t want to benchmark the memory controller

§ Downclock core to 24MHz Ñ no wait states § Allows to have comprehensible cycle count

8

Page 40: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ We use the hardware RNG of our platform

‚ Most schemes only sample seed, so speed doesn’t matter

Benchmarking: Cycle Counts and RNG

§ Cycle counts

‚ We don’t want to benchmark the memory controller

§ Downclock core to 24MHz Ñ no wait states § Allows to have comprehensible cycle count

§ randombytes

8

Page 41: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ Most schemes only sample seed, so speed doesn’t matter

Benchmarking: Cycle Counts and RNG

§ Cycle counts

‚ We don’t want to benchmark the memory controller

§ Downclock core to 24MHz Ñ no wait states § Allows to have comprehensible cycle count

§ randombytes

‚ We use the hardware RNG of our platform

8

Page 42: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

Benchmarking: Cycle Counts and RNG

§ Cycle counts

‚ We don’t want to benchmark the memory controller

§ Downclock core to 24MHz Ñ no wait states § Allows to have comprehensible cycle count

§ randombytes

‚ We use the hardware RNG of our platform

‚ Most schemes only sample seed, so speed doesn’t matter

8

Page 43: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ We don’t want to benchmark those

§ Our approach: Replace those with a single fast

implementation to allow fair comparison

§ SHA-3: ARMv7-M assembly implementation from XKCP 1

§ SHA-2: Fast C implementation from SUPERCOP 2

§ AES: ARMv7-M assembly implementation from [SS16] 3

Benchmarking: Fast Hashing

§ Submission packages often come with different

implementations of SHA-2, SHA-3, or AES

1https://github.com/XKCP/XKCP 2https://bench.cr.yp.to/supercop.html 3Schwabe and Stoffelen, SAC 2016

9

Page 44: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ Our approach: Replace those with a single fast

implementation to allow fair comparison

§ SHA-3: ARMv7-M assembly implementation from XKCP 1

§ SHA-2: Fast C implementation from SUPERCOP 2

§ AES: ARMv7-M assembly implementation from [SS16] 3

Benchmarking: Fast Hashing

§ Submission packages often come with different

implementations of SHA-2, SHA-3, or AES

‚ We don’t want to benchmark those

1https://github.com/XKCP/XKCP 2https://bench.cr.yp.to/supercop.html 3Schwabe and Stoffelen, SAC 2016

9

Page 45: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ SHA-3: ARMv7-M assembly implementation from XKCP 1

§ SHA-2: Fast C implementation from SUPERCOP 2

§ AES: ARMv7-M assembly implementation from [SS16] 3

Benchmarking: Fast Hashing

§ Submission packages often come with different

implementations of SHA-2, SHA-3, or AES

‚ We don’t want to benchmark those

§ Our approach: Replace those with a single fast

implementation to allow fair comparison

1https://github.com/XKCP/XKCP 2https://bench.cr.yp.to/supercop.html 3Schwabe and Stoffelen, SAC 2016

9

Page 46: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ SHA-2: Fast C implementation from SUPERCOP 2

§ AES: ARMv7-M assembly implementation from [SS16] 3

Benchmarking: Fast Hashing

§ Submission packages often come with different

implementations of SHA-2, SHA-3, or AES

‚ We don’t want to benchmark those

§ Our approach: Replace those with a single fast

implementation to allow fair comparison

§ SHA-3: ARMv7-M assembly implementation from XKCP 1

1https://github.com/XKCP/XKCP 2https://bench.cr.yp.to/supercop.html 3Schwabe and Stoffelen, SAC 2016

9

Page 47: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ AES: ARMv7-M assembly implementation from [SS16] 3

Benchmarking: Fast Hashing

§ Submission packages often come with different

implementations of SHA-2, SHA-3, or AES

‚ We don’t want to benchmark those

§ Our approach: Replace those with a single fast

implementation to allow fair comparison

§ SHA-3: ARMv7-M assembly implementation from XKCP 1

§ SHA-2: Fast C implementation from SUPERCOP 2

1https://github.com/XKCP/XKCP 2https://bench.cr.yp.to/supercop.html 3Schwabe and Stoffelen, SAC 2016

9

Page 48: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

Benchmarking: Fast Hashing

§ Submission packages often come with different

implementations of SHA-2, SHA-3, or AES

‚ We don’t want to benchmark those

§ Our approach: Replace those with a single fast

implementation to allow fair comparison

§ SHA-3: ARMv7-M assembly implementation from XKCP 1

§ SHA-2: Fast C implementation from SUPERCOP 2

§ AES: ARMv7-M assembly implementation from [SS16] 3

1https://github.com/XKCP/XKCP 2https://bench.cr.yp.to/supercop.html 3Schwabe and Stoffelen, SAC 2016

9

Page 49: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ We tried to collect as many implementations as possible

§ If we missed something

‚ Send us an e-mail or talk to us

‚ Open a pull request

§ For the following results, we restrict to

‚ Only schemes that have been optimized

‚ NIST security level 1

‚ For KEMs: CCA variants

‚ Only SHA-3/SHAKE variants

§ For the full results

‚ see paper at https://ia.cr/2019/844

‚ see https://github.com/mupq/pqm4

Results

§ Not all schemes have been optimized for this platform yet

10

Page 50: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ If we missed something

‚ Send us an e-mail or talk to us

‚ Open a pull request

§ For the following results, we restrict to

‚ Only schemes that have been optimized

‚ NIST security level 1

‚ For KEMs: CCA variants

‚ Only SHA-3/SHAKE variants

§ For the full results

‚ see paper at https://ia.cr/2019/844

‚ see https://github.com/mupq/pqm4

Results

§ Not all schemes have been optimized for this platform yet

§ We tried to collect as many implementations as possible

10

Page 51: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ Send us an e-mail or talk to us

‚ Open a pull request

§ For the following results, we restrict to

‚ Only schemes that have been optimized

‚ NIST security level 1

‚ For KEMs: CCA variants

‚ Only SHA-3/SHAKE variants

§ For the full results

‚ see paper at https://ia.cr/2019/844

‚ see https://github.com/mupq/pqm4

Results

§ Not all schemes have been optimized for this platform yet

§ We tried to collect as many implementations as possible

§ If we missed something

10

Page 52: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ Open a pull request

§ For the following results, we restrict to

‚ Only schemes that have been optimized

‚ NIST security level 1

‚ For KEMs: CCA variants

‚ Only SHA-3/SHAKE variants

§ For the full results

‚ see paper at https://ia.cr/2019/844

‚ see https://github.com/mupq/pqm4

Results

§ Not all schemes have been optimized for this platform yet

§ We tried to collect as many implementations as possible

§ If we missed something

‚ Send us an e-mail or talk to us

10

Page 53: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ For the following results, we restrict to

‚ Only schemes that have been optimized

‚ NIST security level 1

‚ For KEMs: CCA variants

‚ Only SHA-3/SHAKE variants

§ For the full results

‚ see paper at https://ia.cr/2019/844

‚ see https://github.com/mupq/pqm4

Results

§ Not all schemes have been optimized for this platform yet

§ We tried to collect as many implementations as possible

§ If we missed something

‚ Send us an e-mail or talk to us

‚ Open a pull request

10

Page 54: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ Only schemes that have been optimized

‚ NIST security level 1

‚ For KEMs: CCA variants

‚ Only SHA-3/SHAKE variants

§ For the full results

‚ see paper at https://ia.cr/2019/844

‚ see https://github.com/mupq/pqm4

Results

§ Not all schemes have been optimized for this platform yet

§ We tried to collect as many implementations as possible

§ If we missed something

‚ Send us an e-mail or talk to us

‚ Open a pull request

§ For the following results, we restrict to

10

Page 55: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ NIST security level 1

‚ For KEMs: CCA variants

‚ Only SHA-3/SHAKE variants

§ For the full results

‚ see paper at https://ia.cr/2019/844

‚ see https://github.com/mupq/pqm4

Results

§ Not all schemes have been optimized for this platform yet

§ We tried to collect as many implementations as possible

§ If we missed something

‚ Send us an e-mail or talk to us

‚ Open a pull request

§ For the following results, we restrict to

‚ Only schemes that have been optimized

10

Page 56: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ For KEMs: CCA variants

‚ Only SHA-3/SHAKE variants

§ For the full results

‚ see paper at https://ia.cr/2019/844

‚ see https://github.com/mupq/pqm4

Results

§ Not all schemes have been optimized for this platform yet

§ We tried to collect as many implementations as possible

§ If we missed something

‚ Send us an e-mail or talk to us

‚ Open a pull request

§ For the following results, we restrict to

‚ Only schemes that have been optimized

‚ NIST security level 1

10

Page 57: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ Only SHA-3/SHAKE variants

§ For the full results

‚ see paper at https://ia.cr/2019/844

‚ see https://github.com/mupq/pqm4

Results

§ Not all schemes have been optimized for this platform yet

§ We tried to collect as many implementations as possible

§ If we missed something

‚ Send us an e-mail or talk to us

‚ Open a pull request

§ For the following results, we restrict to

‚ Only schemes that have been optimized

‚ NIST security level 1

‚ For KEMs: CCA variants

10

Page 58: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ For the full results

‚ see paper at https://ia.cr/2019/844

‚ see https://github.com/mupq/pqm4

Results

§ Not all schemes have been optimized for this platform yet

§ We tried to collect as many implementations as possible

§ If we missed something

‚ Send us an e-mail or talk to us

‚ Open a pull request

§ For the following results, we restrict to

‚ Only schemes that have been optimized

‚ NIST security level 1

‚ For KEMs: CCA variants

‚ Only SHA-3/SHAKE variants

10

Page 59: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ see paper at https://ia.cr/2019/844

‚ see https://github.com/mupq/pqm4

Results

§ Not all schemes have been optimized for this platform yet

§ We tried to collect as many implementations as possible

§ If we missed something

‚ Send us an e-mail or talk to us

‚ Open a pull request

§ For the following results, we restrict to

‚ Only schemes that have been optimized

‚ NIST security level 1

‚ For KEMs: CCA variants

‚ Only SHA-3/SHAKE variants

§ For the full results

10

Page 60: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ see https://github.com/mupq/pqm4

Results

§ Not all schemes have been optimized for this platform yet

§ We tried to collect as many implementations as possible

§ If we missed something

‚ Send us an e-mail or talk to us

‚ Open a pull request

§ For the following results, we restrict to

‚ Only schemes that have been optimized

‚ NIST security level 1

‚ For KEMs: CCA variants

‚ Only SHA-3/SHAKE variants

§ For the full results

‚ see paper at https://ia.cr/2019/844

10

Page 61: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

Results

§ Not all schemes have been optimized for this platform yet

§ We tried to collect as many implementations as possible

§ If we missed something

‚ Send us an e-mail or talk to us

‚ Open a pull request

§ For the following results, we restrict to

‚ Only schemes that have been optimized

‚ NIST security level 1

‚ For KEMs: CCA variants

‚ Only SHA-3/SHAKE variants

§ For the full results

‚ see paper at https://ia.cr/2019/844

‚ see https://github.com/mupq/pqm4

10

Page 62: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

KEM Speed

baby

bear-

opt

frodo

kem64

0shake

-m4

kyber5

12-m

4

lightsa

ber-m

4

ntruh

ps204

8509

-m4

ntruh

rss70

1-m4

r5nd-1

kemcca

-0d-m

4

r5nd-1

kemcca

-5d-m

40

20000

40000

60000

80000

100000

120000

140000

160000k

cycle

sKeyGenEncapsDecaps

11

Page 63: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

KEM Speed (2)

baby

bear-

opt

frodo

kem64

0shake

-m4

kyber5

12-m

4

lightsa

ber-m

4

ntruh

ps204

8509

-m4

ntruh

rss70

1-m4

r5nd-1

kemcca

-0d-m

4

r5nd-1

kemcca

-5d-m

40

250

500

750

1000

1250

1500

1750

2000k

cycle

sKeyGenEncapsDecaps

12

Page 64: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

Signature Speed

dilithi

um2-m

4

falcon

512-m

4-ct

0

25000

50000

75000

100000

125000

150000

175000

200000k

cycle

sKeyGenSignVerify

13

Page 65: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

Signature Speed (2)

dilithi

um2-m

4

falcon

512-m

4-ct

0

5000

10000

15000

20000

25000

30000

35000

40000k

cycle

sKeyGenSignVerify

14

Page 66: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

KEM RAM consumption

baby

bear-

opt

frodo

kem64

0shake

-m4

kyber5

12-m

4

lightsa

ber-m

4

ntruh

ps204

8509

-m4

ntruh

rss70

1-m4

r5nd-1

kemcca

-0d-m

4

r5nd-1

kemcca

-5d-m

40

10000

20000

30000

40000

50000

60000

70000by

tes

KeyGenEncapsDecaps

15

Page 67: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

KEM RAM consumption (2)

baby

bear-

opt

frodo

kem64

0shake

-m4

kyber5

12-m

4

lightsa

ber-m

4

ntruh

ps204

8509

-m4

ntruh

rss70

1-m4

r5nd-1

kemcca

-0d-m

4

r5nd-1

kemcca

-5d-m

40

1000

2000

3000

4000

5000

6000

7000

8000

9000by

tes

KeyGenEncapsDecaps

16

Page 68: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

Signature RAM consumption

dilithi

um2-m

4

falcon

512-m

4-ct

0

10000

20000

30000

40000

50000

60000by

tes

KeyGenSignVerify

17

Page 69: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

Signature RAM consumption (2)

dilithi

um2-m

4

falcon

512-m

4-ct

0

1000

2000

3000

4000

5000by

tes

KeyGenSignVerify

18

Page 70: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ 10 KEMs (6 optimized)

‚ 5 signature schemes (2 optimized)

§ Current implementations of Classic McEliece, LEDAcrypt,

NTS-KEM, GeMSS, MQDSS, Picnic, and Rainbow consume

more than 128 KiB of RAM

Ñ don’t fit

§ BIKE, HQC, ROLLO, RQC use OpenSSL/NTL/GMP

Ñ needs to be replaced to make it work

Conclusion

§ pqm4 currently includes

19

Page 71: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ 5 signature schemes (2 optimized)

§ Current implementations of Classic McEliece, LEDAcrypt,

NTS-KEM, GeMSS, MQDSS, Picnic, and Rainbow consume

more than 128 KiB of RAM

Ñ don’t fit

§ BIKE, HQC, ROLLO, RQC use OpenSSL/NTL/GMP

Ñ needs to be replaced to make it work

Conclusion

§ pqm4 currently includes

‚ 10 KEMs (6 optimized)

19

Page 72: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ Current implementations of Classic McEliece, LEDAcrypt,

NTS-KEM, GeMSS, MQDSS, Picnic, and Rainbow consume

more than 128 KiB of RAM

Ñ don’t fit

§ BIKE, HQC, ROLLO, RQC use OpenSSL/NTL/GMP

Ñ needs to be replaced to make it work

Conclusion

§ pqm4 currently includes

‚ 10 KEMs (6 optimized)

‚ 5 signature schemes (2 optimized)

19

Page 73: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ BIKE, HQC, ROLLO, RQC use OpenSSL/NTL/GMP

Ñ needs to be replaced to make it work

Conclusion

§ pqm4 currently includes

‚ 10 KEMs (6 optimized)

‚ 5 signature schemes (2 optimized)

§ Current implementations of Classic McEliece, LEDAcrypt,

NTS-KEM, GeMSS, MQDSS, Picnic, and Rainbow consume

more than 128 KiB of RAM

Ñ don’t fit

19

Page 74: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

Conclusion

§ pqm4 currently includes

‚ 10 KEMs (6 optimized)

‚ 5 signature schemes (2 optimized)

§ Current implementations of Classic McEliece, LEDAcrypt,

NTS-KEM, GeMSS, MQDSS, Picnic, and Rainbow consume

more than 128 KiB of RAM

Ñ don’t fit

§ BIKE, HQC, ROLLO, RQC use OpenSSL/NTL/GMP

Ñ needs to be replaced to make it work

19

Page 75: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ Level of optimization greatly differs

‚ Most implementations don’t optimize RAM consumption

‚ No implementations optimize code size

‚ What does NIST care about?

§ Currently, Round5 seems to be the fastest on this platform

‚ But Kyber, NTRU, Saber, ThreeBears very close

Conclusion

§ Still many schemes left to optimize Ñ please PR

20

Page 76: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ Most implementations don’t optimize RAM consumption

‚ No implementations optimize code size

‚ What does NIST care about?

§ Currently, Round5 seems to be the fastest on this platform

‚ But Kyber, NTRU, Saber, ThreeBears very close

Conclusion

§ Still many schemes left to optimize Ñ please PR

§ Level of optimization greatly differs

20

Page 77: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ No implementations optimize code size

‚ What does NIST care about?

§ Currently, Round5 seems to be the fastest on this platform

‚ But Kyber, NTRU, Saber, ThreeBears very close

Conclusion

§ Still many schemes left to optimize Ñ please PR

§ Level of optimization greatly differs

‚ Most implementations don’t optimize RAM consumption

20

Page 78: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ What does NIST care about?

§ Currently, Round5 seems to be the fastest on this platform

‚ But Kyber, NTRU, Saber, ThreeBears very close

Conclusion

§ Still many schemes left to optimize Ñ please PR

§ Level of optimization greatly differs

‚ Most implementations don’t optimize RAM consumption

‚ No implementations optimize code size

20

Page 79: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

§ Currently, Round5 seems to be the fastest on this platform

‚ But Kyber, NTRU, Saber, ThreeBears very close

Conclusion

§ Still many schemes left to optimize Ñ please PR

§ Level of optimization greatly differs

‚ Most implementations don’t optimize RAM consumption

‚ No implementations optimize code size

‚ What does NIST care about?

20

Page 80: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

‚ But Kyber, NTRU, Saber, ThreeBears very close

Conclusion

§ Still many schemes left to optimize Ñ please PR

§ Level of optimization greatly differs

‚ Most implementations don’t optimize RAM consumption

‚ No implementations optimize code size

‚ What does NIST care about?

§ Currently, Round5 seems to be the fastest on this platform

20

Page 81: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

Conclusion

§ Still many schemes left to optimize Ñ please PR

§ Level of optimization greatly differs

‚ Most implementations don’t optimize RAM consumption

‚ No implementations optimize code size

‚ What does NIST care about?

§ Currently, Round5 seems to be the fastest on this platform

‚ But Kyber, NTRU, Saber, ThreeBears very close

20

Page 82: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

https://github.com/mupq/pqm4

slides and paper available at kannwischer.eu

Thank you!

20

Page 83: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

References i

Erdem Alkim, Philipp Jakubeit, and Peter Schwabe.

A new hope on ARM Cortex-M.

In Security, Privacy, and Advanced Cryptography Engineering,

volume 10076 of Lecture Notes in Computer Science, pages

332–349. Springer-Verlag Berlin Heidelberg, 2016.

Joppe W. Bos, Simon Friedberger, Marco Martinoli, Elisabeth

Oswald, and Martijn Stam.

Fly, you fool! faster frodo for the ARM Cortex-M4. Cryptology ePrint Archive, Report 2018/1116, 2018. https://eprint.iacr.org/2018/1116.

21

Page 84: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

References ii

Leon Botros, Matthias J. Kannwischer, and Peter Schwabe.

Memory-efficient high-speed implementation of Kyber on

Cortex-M4.

In Progress in Cryptology – Africacrypt 2019, Lecture Notes in

Computer Science, pages 209–228. Springer-Verlag Berlin

Heidelberg, 2019.

Tim Guneysu, Markus Krausz, Tobias Oder, and Julian Speith.

Evaluation of lattice-based signature schemes in

embedded systems.

In 25th IEEE International Conference on Electronics, Circuits

and Systems, ICECS 2018, pages 385–388, 2018.

22

Page 85: pqm4: Testing and Benchmarking NIST PQC on ARM Cortex-M4

References iii

Matthias J. Kannwischer, Joost Rijneveld, and Peter Schwabe.

Faster multiplication in Z2m rxs on Cortex-M4 to speed up

NIST PQC candidates.

In Applied Cryptography and Network Security, Lecture Notes

in Computer Science, pages 281–301. Springer-Verlag Berlin

Heidelberg, 2019.

Prasanna Ravi, Sourav Sen Gupta, Anupam Chattopadhyay,

and Shivam Bhasin.

Improving speed of Dilithium’s signing procedure. Cryptology ePrint Archive, Report 2019/420, 2019. https://eprint.iacr.org/2019/420.

23