Upload
others
View
12
Download
0
Embed Size (px)
Citation preview
pqm4: Testing and Benchmarking NIST PQC
on ARM Cortex-M4
Matthias Kannwischer, Joost Rijneveld, Peter Schwabe, and Ko Stoffelen Radboud University, Nijmegen, The Netherlands [email protected]
Aug 24, 2019, 2nd PQC Standardization Conference, Santa Barbara
§ Round 1: Focus on Intel/AVX2 implementations
§ But: Majority of cryptographic devices is way smaller
‚ Limited RAM
‚ No/limited vector instructions
‚ Side-channels?
§ Challenges
‚ Do schemes even fit in limited RAM + flash?
‚ Are schemes efficient on small ARMs?
‚ What is the overhead of masking?
Motivation
§ “Performance will play a larger role in the second round”
2
§ But: Majority of cryptographic devices is way smaller
‚ Limited RAM
‚ No/limited vector instructions
‚ Side-channels?
§ Challenges
‚ Do schemes even fit in limited RAM + flash?
‚ Are schemes efficient on small ARMs?
‚ What is the overhead of masking?
Motivation
§ “Performance will play a larger role in the second round”
§ Round 1: Focus on Intel/AVX2 implementations
2
‚ Limited RAM
‚ No/limited vector instructions
‚ Side-channels?
§ Challenges
‚ Do schemes even fit in limited RAM + flash?
‚ Are schemes efficient on small ARMs?
‚ What is the overhead of masking?
Motivation
§ “Performance will play a larger role in the second round”
§ Round 1: Focus on Intel/AVX2 implementations
§ But: Majority of cryptographic devices is way smaller
2
‚ No/limited vector instructions
‚ Side-channels?
§ Challenges
‚ Do schemes even fit in limited RAM + flash?
‚ Are schemes efficient on small ARMs?
‚ What is the overhead of masking?
Motivation
§ “Performance will play a larger role in the second round”
§ Round 1: Focus on Intel/AVX2 implementations
§ But: Majority of cryptographic devices is way smaller
‚ Limited RAM
2
‚ Side-channels?
§ Challenges
‚ Do schemes even fit in limited RAM + flash?
‚ Are schemes efficient on small ARMs?
‚ What is the overhead of masking?
Motivation
§ “Performance will play a larger role in the second round”
§ Round 1: Focus on Intel/AVX2 implementations
§ But: Majority of cryptographic devices is way smaller
‚ Limited RAM
‚ No/limited vector instructions
2
§ Challenges
‚ Do schemes even fit in limited RAM + flash?
‚ Are schemes efficient on small ARMs?
‚ What is the overhead of masking?
Motivation
§ “Performance will play a larger role in the second round”
§ Round 1: Focus on Intel/AVX2 implementations
§ But: Majority of cryptographic devices is way smaller
‚ Limited RAM
‚ No/limited vector instructions
‚ Side-channels?
2
‚ Do schemes even fit in limited RAM + flash?
‚ Are schemes efficient on small ARMs?
‚ What is the overhead of masking?
Motivation
§ “Performance will play a larger role in the second round”
§ Round 1: Focus on Intel/AVX2 implementations
§ But: Majority of cryptographic devices is way smaller
‚ Limited RAM
‚ No/limited vector instructions
‚ Side-channels?
§ Challenges
2
‚ Are schemes efficient on small ARMs?
‚ What is the overhead of masking?
Motivation
§ “Performance will play a larger role in the second round”
§ Round 1: Focus on Intel/AVX2 implementations
§ But: Majority of cryptographic devices is way smaller
‚ Limited RAM
‚ No/limited vector instructions
‚ Side-channels?
§ Challenges
‚ Do schemes even fit in limited RAM + flash?
2
‚ What is the overhead of masking?
Motivation
§ “Performance will play a larger role in the second round”
§ Round 1: Focus on Intel/AVX2 implementations
§ But: Majority of cryptographic devices is way smaller
‚ Limited RAM
‚ No/limited vector instructions
‚ Side-channels?
§ Challenges
‚ Do schemes even fit in limited RAM + flash?
‚ Are schemes efficient on small ARMs?
2
Motivation
§ “Performance will play a larger role in the second round”
§ Round 1: Focus on Intel/AVX2 implementations
§ But: Majority of cryptographic devices is way smaller
‚ Limited RAM
‚ No/limited vector instructions
‚ Side-channels?
§ Challenges
‚ Do schemes even fit in limited RAM + flash?
‚ Are schemes efficient on small ARMs?
‚ What is the overhead of masking?
2
– everyone, always
§ STM32F4DISCOVERY
‚ ARM Cortex-M4
‚ 32-bit, ARMv7E-M
‚ 192KiB RAM, 168MHz
§ PQM4: test and optimize
on the Cortex-M4
‚ github.com/mupq/pqm4
Post-quantum on small devices
“It’s big and it’s slow”
3
§ STM32F4DISCOVERY
‚ ARM Cortex-M4
‚ 32-bit, ARMv7E-M
‚ 192KiB RAM, 168MHz
§ PQM4: test and optimize
on the Cortex-M4
‚ github.com/mupq/pqm4
Post-quantum on small devices
“It’s big and it’s slow”
– everyone, always
3
Post-quantum on small devices
“It’s big and it’s slow”
– everyone, always
§ STM32F4DISCOVERY
‚ ARM Cortex-M4
‚ 32-bit, ARMv7E-M
‚ 192 KiB RAM, 168 MHz
§ PQM4: test and optimize
on the Cortex-M4
‚ github.com/mupq/pqm4
3
§ They are huge in terms of RAM and flash
‚ Great for PQC – many schemes fit
‚ Unfortunately, pqRSA did not terminate within round 1
§ ARMv7E-M more interesting for assembly optimizations
§ NIST recommended Cortex-M4 for PQC evaluation
§ We’re using it for teaching
‚ We have dozens of them lying around
‚ Our students know how to work with them
Rationale for using STM32F4DISCOVERY boards
§ They are cheap (ă $30)
4
‚ Great for PQC – many schemes fit
‚ Unfortunately, pqRSA did not terminate within round 1
§ ARMv7E-M more interesting for assembly optimizations
§ NIST recommended Cortex-M4 for PQC evaluation
§ We’re using it for teaching
‚ We have dozens of them lying around
‚ Our students know how to work with them
Rationale for using STM32F4DISCOVERY boards
§ They are cheap (ă $30)
§ They are huge in terms of RAM and flash
4
‚ Unfortunately, pqRSA did not terminate within round 1
§ ARMv7E-M more interesting for assembly optimizations
§ NIST recommended Cortex-M4 for PQC evaluation
§ We’re using it for teaching
‚ We have dozens of them lying around
‚ Our students know how to work with them
Rationale for using STM32F4DISCOVERY boards
§ They are cheap (ă $30)
§ They are huge in terms of RAM and flash
‚ Great for PQC – many schemes fit
4
§ ARMv7E-M more interesting for assembly optimizations
§ NIST recommended Cortex-M4 for PQC evaluation
§ We’re using it for teaching
‚ We have dozens of them lying around
‚ Our students know how to work with them
Rationale for using STM32F4DISCOVERY boards
§ They are cheap (ă $30)
§ They are huge in terms of RAM and flash
‚ Great for PQC – many schemes fit
‚ Unfortunately, pqRSA did not terminate within round 1
4
§ NIST recommended Cortex-M4 for PQC evaluation
§ We’re using it for teaching
‚ We have dozens of them lying around
‚ Our students know how to work with them
Rationale for using STM32F4DISCOVERY boards
§ They are cheap (ă $30)
§ They are huge in terms of RAM and flash
‚ Great for PQC – many schemes fit
‚ Unfortunately, pqRSA did not terminate within round 1
§ ARMv7E-M more interesting for assembly optimizations
4
§ We’re using it for teaching
‚ We have dozens of them lying around
‚ Our students know how to work with them
Rationale for using STM32F4DISCOVERY boards
§ They are cheap (ă $30)
§ They are huge in terms of RAM and flash
‚ Great for PQC – many schemes fit
‚ Unfortunately, pqRSA did not terminate within round 1
§ ARMv7E-M more interesting for assembly optimizations
§ NIST recommended Cortex-M4 for PQC evaluation
4
‚ We have dozens of them lying around
‚ Our students know how to work with them
Rationale for using STM32F4DISCOVERY boards
§ They are cheap (ă $30)
§ They are huge in terms of RAM and flash
‚ Great for PQC – many schemes fit
‚ Unfortunately, pqRSA did not terminate within round 1
§ ARMv7E-M more interesting for assembly optimizations
§ NIST recommended Cortex-M4 for PQC evaluation
§ We’re using it for teaching
4
‚ Our students know how to work with them
Rationale for using STM32F4DISCOVERY boards
§ They are cheap (ă $30)
§ They are huge in terms of RAM and flash
‚ Great for PQC – many schemes fit
‚ Unfortunately, pqRSA did not terminate within round 1
§ ARMv7E-M more interesting for assembly optimizations
§ NIST recommended Cortex-M4 for PQC evaluation
§ We’re using it for teaching
‚ We have dozens of them lying around
4
Rationale for using STM32F4DISCOVERY boards
§ They are cheap (ă $30)
§ They are huge in terms of RAM and flash
‚ Great for PQC – many schemes fit
‚ Unfortunately, pqRSA did not terminate within round 1
§ ARMv7E-M more interesting for assembly optimizations
§ NIST recommended Cortex-M4 for PQC evaluation
§ We’re using it for teaching
‚ We have dozens of them lying around
‚ Our students know how to work with them
4
pqm4
§ Goals
‚ Framework that eases optimization for this platform
‚ Automate testing and benchmarking
‚ Include as many schemes as possible
§ 4 types of implementations
‚ ref: Reference C implementations from submission
packages
‚ clean: Slightly modified reference implementations to
satisfy basic code quality requirements 1
‚ opt: Optimized portable C implementations
‚ m4: Optimized using ARMv7E-M assembly
1see https://github.com/PQClean/PQClean
5
pqm4
§ Goals
‚ Framework that eases optimization for this platform
‚ Automate testing and benchmarking
‚ Include as many schemes as possible
§ 4 types of implementations
‚ ref: Reference C implementations from submission
packages
‚ clean: Slightly modified reference implementations to
satisfy basic code quality requirements 1
‚ opt: Optimized portable C implementations
‚ m4: Optimized using ARMv7E-M assembly
1see https://github.com/PQClean/PQClean
5
pqm4
§ Goals
‚ Framework that eases optimization for this platform
‚ Automate testing and benchmarking
‚ Include as many schemes as possible
§ 4 types of implementations
‚ ref: Reference C implementations from submission
packages
‚ clean: Slightly modified reference implementations to
satisfy basic code quality requirements 1
‚ opt: Optimized portable C implementations
‚ m4: Optimized using ARMv7E-M assembly
1see https://github.com/PQClean/PQClean
5
pqm4
§ Goals
‚ Framework that eases optimization for this platform
‚ Automate testing and benchmarking
‚ Include as many schemes as possible
§ 4 types of implementations
‚ ref: Reference C implementations from submission
packages
‚ clean: Slightly modified reference implementations to
satisfy basic code quality requirements 1
‚ opt: Optimized portable C implementations
‚ m4: Optimized using ARMv7E-M assembly
1see https://github.com/PQClean/PQClean
5
‚ ref: Reference C implementations from submission
packages
‚ clean: Slightly modified reference implementations to
satisfy basic code quality requirements 1
‚ opt: Optimized portable C implementations
‚ m4: Optimized using ARMv7E-M assembly
pqm4
§ Goals
‚ Framework that eases optimization for this platform
‚ Automate testing and benchmarking
‚ Include as many schemes as possible
§ 4 types of implementations
1 see https://github.com/PQClean/PQClean
5
‚ clean: Slightly modified reference implementations to
satisfy basic code quality requirements 1
‚ opt: Optimized portable C implementations
‚ m4: Optimized using ARMv7E-M assembly
pqm4
§ Goals
‚ Framework that eases optimization for this platform
‚ Automate testing and benchmarking
‚ Include as many schemes as possible
§ 4 types of implementations
‚ ref: Reference C implementations from submission
packages
1 see https://github.com/PQClean/PQClean
5
‚ opt: Optimized portable C implementations
‚ m4: Optimized using ARMv7E-M assembly
pqm4
§ Goals
‚ Framework that eases optimization for this platform
‚ Automate testing and benchmarking
‚ Include as many schemes as possible
§ 4 types of implementations
‚ ref: Reference C implementations from submission
packages
‚ clean: Slightly modified reference implementations to
satisfy basic code quality requirements 1
1 see https://github.com/PQClean/PQClean
5
‚ m4: Optimized using ARMv7E-M assembly
pqm4
§ Goals
‚ Framework that eases optimization for this platform
‚ Automate testing and benchmarking
‚ Include as many schemes as possible
§ 4 types of implementations
‚ ref: Reference C implementations from submission
packages
‚ clean: Slightly modified reference implementations to
satisfy basic code quality requirements 1
‚ opt: Optimized portable C implementations
1 see https://github.com/PQClean/PQClean
5
pqm4
§ Goals
‚ Framework that eases optimization for this platform
‚ Automate testing and benchmarking
‚ Include as many schemes as possible
§ 4 types of implementations
‚ ref: Reference C implementations from submission
packages
‚ clean: Slightly modified reference implementations to
satisfy basic code quality requirements 1
‚ opt: Optimized portable C implementations
‚ m4: Optimized using ARMv7E-M assembly
1 see https://github.com/PQClean/PQClean
5
Schemes included in pqm4– KEMs
reference optimized BIKE 7Lib — Classic McEliece 7Key — CRYSTALS-Kyber Frodo-KEM
3 3
3 3
[BKS19] [BFM ` 18]
HQC 7Lib — LAC 3 — LEDAcrypt 7RAM WIP NewHope 3 3 [AJS16] NTRU 3 3 [KRS19] NTRU Prime 3 — NTS-KEM 7Key — ROLLO 7Lib — Round5 3 3 Round5 team RQC 7Lib — SABER 3 3 [KRS19] SIKE 3 — ThreeBears 3 3 ThreeBears team
7Key : keys too large 7RAM : implementation uses too much RAM
7Lib: available implementations depend on external libraries 6
Schemes included in pqm4– Signatures
reference optimized
CRYSTALS-Dilithium 3 3 [GKOS18, RSGCB19]
FALCON 7RAM 3 Falcon team
GeMSS 7Key —
LUOV 3 —
MQDSS 7RAM —
Picnic 7RAM —
qTESLA 3 —
Rainbow 7Key —
SPHINCS+ 3 —
7Key : keys too large 7RAM : implementation uses too much RAM
7Lib: available implementations depend on external libraries
7
Benchmarking: Cycle Counts and RNG
§ Cycle counts
‚ We don’t want to benchmark the memory controller
§ Downclock core to 24MHz Ñ no wait states
§ Allows to have comprehensible cycle count
§ randombytes
‚ We use the hardware RNG of our platform
‚ Most schemes only sample seed, so speed doesn’t matter
8
§ Downclock core to 24MHz Ñ no wait states
§ Allows to have comprehensible cycle count
§ randombytes
‚ We use the hardware RNG of our platform
‚ Most schemes only sample seed, so speed doesn’t matter
Benchmarking: Cycle Counts and RNG
§ Cycle counts
‚ We don’t want to benchmark the memory controller
8
§ Allows to have comprehensible cycle count
§ randombytes
‚ We use the hardware RNG of our platform
‚ Most schemes only sample seed, so speed doesn’t matter
Benchmarking: Cycle Counts and RNG
§ Cycle counts
‚ We don’t want to benchmark the memory controller
§ Downclock core to 24MHz Ñ no wait states
8
§ randombytes
‚ We use the hardware RNG of our platform
‚ Most schemes only sample seed, so speed doesn’t matter
Benchmarking: Cycle Counts and RNG
§ Cycle counts
‚ We don’t want to benchmark the memory controller
§ Downclock core to 24MHz Ñ no wait states § Allows to have comprehensible cycle count
8
‚ We use the hardware RNG of our platform
‚ Most schemes only sample seed, so speed doesn’t matter
Benchmarking: Cycle Counts and RNG
§ Cycle counts
‚ We don’t want to benchmark the memory controller
§ Downclock core to 24MHz Ñ no wait states § Allows to have comprehensible cycle count
§ randombytes
8
‚ Most schemes only sample seed, so speed doesn’t matter
Benchmarking: Cycle Counts and RNG
§ Cycle counts
‚ We don’t want to benchmark the memory controller
§ Downclock core to 24MHz Ñ no wait states § Allows to have comprehensible cycle count
§ randombytes
‚ We use the hardware RNG of our platform
8
Benchmarking: Cycle Counts and RNG
§ Cycle counts
‚ We don’t want to benchmark the memory controller
§ Downclock core to 24MHz Ñ no wait states § Allows to have comprehensible cycle count
§ randombytes
‚ We use the hardware RNG of our platform
‚ Most schemes only sample seed, so speed doesn’t matter
8
‚ We don’t want to benchmark those
§ Our approach: Replace those with a single fast
implementation to allow fair comparison
§ SHA-3: ARMv7-M assembly implementation from XKCP 1
§ SHA-2: Fast C implementation from SUPERCOP 2
§ AES: ARMv7-M assembly implementation from [SS16] 3
Benchmarking: Fast Hashing
§ Submission packages often come with different
implementations of SHA-2, SHA-3, or AES
1https://github.com/XKCP/XKCP 2https://bench.cr.yp.to/supercop.html 3Schwabe and Stoffelen, SAC 2016
9
§ Our approach: Replace those with a single fast
implementation to allow fair comparison
§ SHA-3: ARMv7-M assembly implementation from XKCP 1
§ SHA-2: Fast C implementation from SUPERCOP 2
§ AES: ARMv7-M assembly implementation from [SS16] 3
Benchmarking: Fast Hashing
§ Submission packages often come with different
implementations of SHA-2, SHA-3, or AES
‚ We don’t want to benchmark those
1https://github.com/XKCP/XKCP 2https://bench.cr.yp.to/supercop.html 3Schwabe and Stoffelen, SAC 2016
9
§ SHA-3: ARMv7-M assembly implementation from XKCP 1
§ SHA-2: Fast C implementation from SUPERCOP 2
§ AES: ARMv7-M assembly implementation from [SS16] 3
Benchmarking: Fast Hashing
§ Submission packages often come with different
implementations of SHA-2, SHA-3, or AES
‚ We don’t want to benchmark those
§ Our approach: Replace those with a single fast
implementation to allow fair comparison
1https://github.com/XKCP/XKCP 2https://bench.cr.yp.to/supercop.html 3Schwabe and Stoffelen, SAC 2016
9
§ SHA-2: Fast C implementation from SUPERCOP 2
§ AES: ARMv7-M assembly implementation from [SS16] 3
Benchmarking: Fast Hashing
§ Submission packages often come with different
implementations of SHA-2, SHA-3, or AES
‚ We don’t want to benchmark those
§ Our approach: Replace those with a single fast
implementation to allow fair comparison
§ SHA-3: ARMv7-M assembly implementation from XKCP 1
1https://github.com/XKCP/XKCP 2https://bench.cr.yp.to/supercop.html 3Schwabe and Stoffelen, SAC 2016
9
§ AES: ARMv7-M assembly implementation from [SS16] 3
Benchmarking: Fast Hashing
§ Submission packages often come with different
implementations of SHA-2, SHA-3, or AES
‚ We don’t want to benchmark those
§ Our approach: Replace those with a single fast
implementation to allow fair comparison
§ SHA-3: ARMv7-M assembly implementation from XKCP 1
§ SHA-2: Fast C implementation from SUPERCOP 2
1https://github.com/XKCP/XKCP 2https://bench.cr.yp.to/supercop.html 3Schwabe and Stoffelen, SAC 2016
9
Benchmarking: Fast Hashing
§ Submission packages often come with different
implementations of SHA-2, SHA-3, or AES
‚ We don’t want to benchmark those
§ Our approach: Replace those with a single fast
implementation to allow fair comparison
§ SHA-3: ARMv7-M assembly implementation from XKCP 1
§ SHA-2: Fast C implementation from SUPERCOP 2
§ AES: ARMv7-M assembly implementation from [SS16] 3
1https://github.com/XKCP/XKCP 2https://bench.cr.yp.to/supercop.html 3Schwabe and Stoffelen, SAC 2016
9
§ We tried to collect as many implementations as possible
§ If we missed something
‚ Send us an e-mail or talk to us
‚ Open a pull request
§ For the following results, we restrict to
‚ Only schemes that have been optimized
‚ NIST security level 1
‚ For KEMs: CCA variants
‚ Only SHA-3/SHAKE variants
§ For the full results
‚ see paper at https://ia.cr/2019/844
‚ see https://github.com/mupq/pqm4
Results
§ Not all schemes have been optimized for this platform yet
10
§ If we missed something
‚ Send us an e-mail or talk to us
‚ Open a pull request
§ For the following results, we restrict to
‚ Only schemes that have been optimized
‚ NIST security level 1
‚ For KEMs: CCA variants
‚ Only SHA-3/SHAKE variants
§ For the full results
‚ see paper at https://ia.cr/2019/844
‚ see https://github.com/mupq/pqm4
Results
§ Not all schemes have been optimized for this platform yet
§ We tried to collect as many implementations as possible
10
‚ Send us an e-mail or talk to us
‚ Open a pull request
§ For the following results, we restrict to
‚ Only schemes that have been optimized
‚ NIST security level 1
‚ For KEMs: CCA variants
‚ Only SHA-3/SHAKE variants
§ For the full results
‚ see paper at https://ia.cr/2019/844
‚ see https://github.com/mupq/pqm4
Results
§ Not all schemes have been optimized for this platform yet
§ We tried to collect as many implementations as possible
§ If we missed something
10
‚ Open a pull request
§ For the following results, we restrict to
‚ Only schemes that have been optimized
‚ NIST security level 1
‚ For KEMs: CCA variants
‚ Only SHA-3/SHAKE variants
§ For the full results
‚ see paper at https://ia.cr/2019/844
‚ see https://github.com/mupq/pqm4
Results
§ Not all schemes have been optimized for this platform yet
§ We tried to collect as many implementations as possible
§ If we missed something
‚ Send us an e-mail or talk to us
10
§ For the following results, we restrict to
‚ Only schemes that have been optimized
‚ NIST security level 1
‚ For KEMs: CCA variants
‚ Only SHA-3/SHAKE variants
§ For the full results
‚ see paper at https://ia.cr/2019/844
‚ see https://github.com/mupq/pqm4
Results
§ Not all schemes have been optimized for this platform yet
§ We tried to collect as many implementations as possible
§ If we missed something
‚ Send us an e-mail or talk to us
‚ Open a pull request
10
‚ Only schemes that have been optimized
‚ NIST security level 1
‚ For KEMs: CCA variants
‚ Only SHA-3/SHAKE variants
§ For the full results
‚ see paper at https://ia.cr/2019/844
‚ see https://github.com/mupq/pqm4
Results
§ Not all schemes have been optimized for this platform yet
§ We tried to collect as many implementations as possible
§ If we missed something
‚ Send us an e-mail or talk to us
‚ Open a pull request
§ For the following results, we restrict to
10
‚ NIST security level 1
‚ For KEMs: CCA variants
‚ Only SHA-3/SHAKE variants
§ For the full results
‚ see paper at https://ia.cr/2019/844
‚ see https://github.com/mupq/pqm4
Results
§ Not all schemes have been optimized for this platform yet
§ We tried to collect as many implementations as possible
§ If we missed something
‚ Send us an e-mail or talk to us
‚ Open a pull request
§ For the following results, we restrict to
‚ Only schemes that have been optimized
10
‚ For KEMs: CCA variants
‚ Only SHA-3/SHAKE variants
§ For the full results
‚ see paper at https://ia.cr/2019/844
‚ see https://github.com/mupq/pqm4
Results
§ Not all schemes have been optimized for this platform yet
§ We tried to collect as many implementations as possible
§ If we missed something
‚ Send us an e-mail or talk to us
‚ Open a pull request
§ For the following results, we restrict to
‚ Only schemes that have been optimized
‚ NIST security level 1
10
‚ Only SHA-3/SHAKE variants
§ For the full results
‚ see paper at https://ia.cr/2019/844
‚ see https://github.com/mupq/pqm4
Results
§ Not all schemes have been optimized for this platform yet
§ We tried to collect as many implementations as possible
§ If we missed something
‚ Send us an e-mail or talk to us
‚ Open a pull request
§ For the following results, we restrict to
‚ Only schemes that have been optimized
‚ NIST security level 1
‚ For KEMs: CCA variants
10
§ For the full results
‚ see paper at https://ia.cr/2019/844
‚ see https://github.com/mupq/pqm4
Results
§ Not all schemes have been optimized for this platform yet
§ We tried to collect as many implementations as possible
§ If we missed something
‚ Send us an e-mail or talk to us
‚ Open a pull request
§ For the following results, we restrict to
‚ Only schemes that have been optimized
‚ NIST security level 1
‚ For KEMs: CCA variants
‚ Only SHA-3/SHAKE variants
10
‚ see paper at https://ia.cr/2019/844
‚ see https://github.com/mupq/pqm4
Results
§ Not all schemes have been optimized for this platform yet
§ We tried to collect as many implementations as possible
§ If we missed something
‚ Send us an e-mail or talk to us
‚ Open a pull request
§ For the following results, we restrict to
‚ Only schemes that have been optimized
‚ NIST security level 1
‚ For KEMs: CCA variants
‚ Only SHA-3/SHAKE variants
§ For the full results
10
‚ see https://github.com/mupq/pqm4
Results
§ Not all schemes have been optimized for this platform yet
§ We tried to collect as many implementations as possible
§ If we missed something
‚ Send us an e-mail or talk to us
‚ Open a pull request
§ For the following results, we restrict to
‚ Only schemes that have been optimized
‚ NIST security level 1
‚ For KEMs: CCA variants
‚ Only SHA-3/SHAKE variants
§ For the full results
‚ see paper at https://ia.cr/2019/844
10
Results
§ Not all schemes have been optimized for this platform yet
§ We tried to collect as many implementations as possible
§ If we missed something
‚ Send us an e-mail or talk to us
‚ Open a pull request
§ For the following results, we restrict to
‚ Only schemes that have been optimized
‚ NIST security level 1
‚ For KEMs: CCA variants
‚ Only SHA-3/SHAKE variants
§ For the full results
‚ see paper at https://ia.cr/2019/844
‚ see https://github.com/mupq/pqm4
10
KEM Speed
baby
bear-
opt
frodo
kem64
0shake
-m4
kyber5
12-m
4
lightsa
ber-m
4
ntruh
ps204
8509
-m4
ntruh
rss70
1-m4
r5nd-1
kemcca
-0d-m
4
r5nd-1
kemcca
-5d-m
40
20000
40000
60000
80000
100000
120000
140000
160000k
cycle
sKeyGenEncapsDecaps
11
KEM Speed (2)
baby
bear-
opt
frodo
kem64
0shake
-m4
kyber5
12-m
4
lightsa
ber-m
4
ntruh
ps204
8509
-m4
ntruh
rss70
1-m4
r5nd-1
kemcca
-0d-m
4
r5nd-1
kemcca
-5d-m
40
250
500
750
1000
1250
1500
1750
2000k
cycle
sKeyGenEncapsDecaps
12
Signature Speed
dilithi
um2-m
4
falcon
512-m
4-ct
0
25000
50000
75000
100000
125000
150000
175000
200000k
cycle
sKeyGenSignVerify
13
Signature Speed (2)
dilithi
um2-m
4
falcon
512-m
4-ct
0
5000
10000
15000
20000
25000
30000
35000
40000k
cycle
sKeyGenSignVerify
14
KEM RAM consumption
baby
bear-
opt
frodo
kem64
0shake
-m4
kyber5
12-m
4
lightsa
ber-m
4
ntruh
ps204
8509
-m4
ntruh
rss70
1-m4
r5nd-1
kemcca
-0d-m
4
r5nd-1
kemcca
-5d-m
40
10000
20000
30000
40000
50000
60000
70000by
tes
KeyGenEncapsDecaps
15
KEM RAM consumption (2)
baby
bear-
opt
frodo
kem64
0shake
-m4
kyber5
12-m
4
lightsa
ber-m
4
ntruh
ps204
8509
-m4
ntruh
rss70
1-m4
r5nd-1
kemcca
-0d-m
4
r5nd-1
kemcca
-5d-m
40
1000
2000
3000
4000
5000
6000
7000
8000
9000by
tes
KeyGenEncapsDecaps
16
Signature RAM consumption
dilithi
um2-m
4
falcon
512-m
4-ct
0
10000
20000
30000
40000
50000
60000by
tes
KeyGenSignVerify
17
Signature RAM consumption (2)
dilithi
um2-m
4
falcon
512-m
4-ct
0
1000
2000
3000
4000
5000by
tes
KeyGenSignVerify
18
‚ 10 KEMs (6 optimized)
‚ 5 signature schemes (2 optimized)
§ Current implementations of Classic McEliece, LEDAcrypt,
NTS-KEM, GeMSS, MQDSS, Picnic, and Rainbow consume
more than 128 KiB of RAM
Ñ don’t fit
§ BIKE, HQC, ROLLO, RQC use OpenSSL/NTL/GMP
Ñ needs to be replaced to make it work
Conclusion
§ pqm4 currently includes
19
‚ 5 signature schemes (2 optimized)
§ Current implementations of Classic McEliece, LEDAcrypt,
NTS-KEM, GeMSS, MQDSS, Picnic, and Rainbow consume
more than 128 KiB of RAM
Ñ don’t fit
§ BIKE, HQC, ROLLO, RQC use OpenSSL/NTL/GMP
Ñ needs to be replaced to make it work
Conclusion
§ pqm4 currently includes
‚ 10 KEMs (6 optimized)
19
§ Current implementations of Classic McEliece, LEDAcrypt,
NTS-KEM, GeMSS, MQDSS, Picnic, and Rainbow consume
more than 128 KiB of RAM
Ñ don’t fit
§ BIKE, HQC, ROLLO, RQC use OpenSSL/NTL/GMP
Ñ needs to be replaced to make it work
Conclusion
§ pqm4 currently includes
‚ 10 KEMs (6 optimized)
‚ 5 signature schemes (2 optimized)
19
§ BIKE, HQC, ROLLO, RQC use OpenSSL/NTL/GMP
Ñ needs to be replaced to make it work
Conclusion
§ pqm4 currently includes
‚ 10 KEMs (6 optimized)
‚ 5 signature schemes (2 optimized)
§ Current implementations of Classic McEliece, LEDAcrypt,
NTS-KEM, GeMSS, MQDSS, Picnic, and Rainbow consume
more than 128 KiB of RAM
Ñ don’t fit
19
Conclusion
§ pqm4 currently includes
‚ 10 KEMs (6 optimized)
‚ 5 signature schemes (2 optimized)
§ Current implementations of Classic McEliece, LEDAcrypt,
NTS-KEM, GeMSS, MQDSS, Picnic, and Rainbow consume
more than 128 KiB of RAM
Ñ don’t fit
§ BIKE, HQC, ROLLO, RQC use OpenSSL/NTL/GMP
Ñ needs to be replaced to make it work
19
§ Level of optimization greatly differs
‚ Most implementations don’t optimize RAM consumption
‚ No implementations optimize code size
‚ What does NIST care about?
§ Currently, Round5 seems to be the fastest on this platform
‚ But Kyber, NTRU, Saber, ThreeBears very close
Conclusion
§ Still many schemes left to optimize Ñ please PR
20
‚ Most implementations don’t optimize RAM consumption
‚ No implementations optimize code size
‚ What does NIST care about?
§ Currently, Round5 seems to be the fastest on this platform
‚ But Kyber, NTRU, Saber, ThreeBears very close
Conclusion
§ Still many schemes left to optimize Ñ please PR
§ Level of optimization greatly differs
20
‚ No implementations optimize code size
‚ What does NIST care about?
§ Currently, Round5 seems to be the fastest on this platform
‚ But Kyber, NTRU, Saber, ThreeBears very close
Conclusion
§ Still many schemes left to optimize Ñ please PR
§ Level of optimization greatly differs
‚ Most implementations don’t optimize RAM consumption
20
‚ What does NIST care about?
§ Currently, Round5 seems to be the fastest on this platform
‚ But Kyber, NTRU, Saber, ThreeBears very close
Conclusion
§ Still many schemes left to optimize Ñ please PR
§ Level of optimization greatly differs
‚ Most implementations don’t optimize RAM consumption
‚ No implementations optimize code size
20
§ Currently, Round5 seems to be the fastest on this platform
‚ But Kyber, NTRU, Saber, ThreeBears very close
Conclusion
§ Still many schemes left to optimize Ñ please PR
§ Level of optimization greatly differs
‚ Most implementations don’t optimize RAM consumption
‚ No implementations optimize code size
‚ What does NIST care about?
20
‚ But Kyber, NTRU, Saber, ThreeBears very close
Conclusion
§ Still many schemes left to optimize Ñ please PR
§ Level of optimization greatly differs
‚ Most implementations don’t optimize RAM consumption
‚ No implementations optimize code size
‚ What does NIST care about?
§ Currently, Round5 seems to be the fastest on this platform
20
Conclusion
§ Still many schemes left to optimize Ñ please PR
§ Level of optimization greatly differs
‚ Most implementations don’t optimize RAM consumption
‚ No implementations optimize code size
‚ What does NIST care about?
§ Currently, Round5 seems to be the fastest on this platform
‚ But Kyber, NTRU, Saber, ThreeBears very close
20
https://github.com/mupq/pqm4
slides and paper available at kannwischer.eu
Thank you!
20
References i
Erdem Alkim, Philipp Jakubeit, and Peter Schwabe.
A new hope on ARM Cortex-M.
In Security, Privacy, and Advanced Cryptography Engineering,
volume 10076 of Lecture Notes in Computer Science, pages
332–349. Springer-Verlag Berlin Heidelberg, 2016.
Joppe W. Bos, Simon Friedberger, Marco Martinoli, Elisabeth
Oswald, and Martijn Stam.
Fly, you fool! faster frodo for the ARM Cortex-M4. Cryptology ePrint Archive, Report 2018/1116, 2018. https://eprint.iacr.org/2018/1116.
21
References ii
Leon Botros, Matthias J. Kannwischer, and Peter Schwabe.
Memory-efficient high-speed implementation of Kyber on
Cortex-M4.
In Progress in Cryptology – Africacrypt 2019, Lecture Notes in
Computer Science, pages 209–228. Springer-Verlag Berlin
Heidelberg, 2019.
Tim Guneysu, Markus Krausz, Tobias Oder, and Julian Speith.
Evaluation of lattice-based signature schemes in
embedded systems.
In 25th IEEE International Conference on Electronics, Circuits
and Systems, ICECS 2018, pages 385–388, 2018.
22
References iii
Matthias J. Kannwischer, Joost Rijneveld, and Peter Schwabe.
Faster multiplication in Z2m rxs on Cortex-M4 to speed up
NIST PQC candidates.
In Applied Cryptography and Network Security, Lecture Notes
in Computer Science, pages 281–301. Springer-Verlag Berlin
Heidelberg, 2019.
Prasanna Ravi, Sourav Sen Gupta, Anupam Chattopadhyay,
and Shivam Bhasin.
Improving speed of Dilithium’s signing procedure. Cryptology ePrint Archive, Report 2019/420, 2019. https://eprint.iacr.org/2019/420.
23