Design of a Reconfigurable Hardware for Efficient Implementation of Secret Key and Public Key Cryptography
Presentation Outline
Introduction & Motivation
Related Work
Design Methodology
Design Description
Algorithm Implementations
Comparison with other Work
Programming Paradigm
Conclusion/Work in Progress
Motivating Factors
Need for high speed cryptography
Need for algorithm independence
Need for more secure implementations
Need for implementing both Symmetric and Asymmetric key encryption
Need for High Speed Implementations
Software implementations cannot provide real time rates
Hardware implementations essential for
  IPSec end points
  SSL servers
  VPN at rates exceeding ATM
Algorithm implementation must be able to sustain the network bandwidth
Need for Algorithm Independence
IPSec
  Cipher Algorithm Specified in Security Association (SA)
SSL Transactions
  Algorithm Negotiable for both Key Exchange & Encryption
Need for Both Secret Key and Public Key Encryption
Session establishment: Large Number of transactions
Dedicated hardware not cheap!
Hardware Implementation Benefits
More secure implementations
Implementing both algorithms in hardware removes the bottleneck associated with slow computations in key establishment
A single hardware implementation supporting both algorithms reduces the cost of separate hardware
Advantages of Reconfigurable Hardware Implementations
Algorithm Agility
Algorithm Upload/Modification
Architecture Efficiency/Throughput
Cost Efficiency
Comparison of Different Approaches
FPGAs?
Post Fabrication Customization
Low Cost Design Cycle
Fast turnaround time
Potential for Parallelism
  Instruction-level: Multiple operations
  Data-level: Multiple blocks of data
  Task-level: Parallel tasks (e.g. secret key)
FPGA: The basics
General purpose logic elements (LUTs)
Very flexible interconnect
Basically fine grained to support both data paths and random logic
FPGA: Disadvantages
Too flexible: inefficiencies
Too fine grained: again inefficiencies
Block ciphers are primarily data flow oriented, yet are implemented using a large number of small elements
Ciphers have a well defined data flow, so general purpose interconnect ends up being slow and overkill in terms of area
FPGA vs. Specialized Reconfigurable Logic
Coarse grained vs. Fine grained
Specialized interconnect vs. generic interconnect
Reduced reconfiguration times
End result
  Faster performance with reduced area while maintaining enough flexibility to support the application domain
Issues in Reconfigurable Hardware Designs
How much of what to support?
  How many functional units?
  What kinds of functional units?
  How much support for random logic?
  How much interconnect flexibility to allow?
Programming/CAD tools
  What kind of programming model to target
  How to design efficient automated tools
Custom Reconfigurable Hardware Design: What’s Involved?
Looking for commonalities/overlaps as well as disjoint elements
  Identify crucial components
  Utilize potential overlap or partial reuse
  Generic enough but fast components
  Minimizing the differences in component types
Balancing the resources
  Upper bounds/Lower bounds
  Logic units vs. memory blocks
  Determining exact number of each type of unit
Make the common case fast: IMPORTANT ALWAYS!
Related Work
Cavium Networks’ SSL & IPSEC Protocol Aware Security Processor
USC Mark II’s Advanced Cryptographic Engine for IPsec
Worcester Polytechnic Institute’s COBRA Architecture
SSL/IPsec Security Processor
Support for both public key and secret key encryption
Not Reconfigurable
Dedicated hardware blocks for each operation
Advanced Cryptographic Engine (ACE)
Designed to implement flexible cipher needs of IPsec
Only supports block ciphers
Support for any algorithm through a library of general purpose FPGA implementations
COBRA Architecture
Custom Reconfigurable Hardware for block ciphers
Each RCE is a macro block supporting various component operations
Configured using VLIW instructions
Design Methodology
Literature Survey
  Block cipher implementations
  Public key cipher implementations
  Identifying essential components of efficient implementations
Iterative Development of Architecture
  Validation by mapping several representative algorithms
Identification of Programming Methodology
Categorizing Implementation Requirements
Essential step to handle the design complexity
  Logic Requirements
  Interconnection Requirements
  Memory (RAM/ROM) Requirements
Area and Performance directly affected by these
Prioritizing Support
Ordered by importance and then by relative hardware complexity
AES (Rijndael)
DES
Modular Exponentiation (RSA)
Serpent
Twofish
RC6, MARS, and others
Block Ciphers: Key Elements
Bitwise XOR, AND, OR
Addition or subtraction modulo 2^n
Shift or rotation by a constant number of bits
Data-dependent rotation by a variable number of bits
Multiplication modulo the table entry value
Multiplication in the Galois field specified by the table entry value
Inversion modulo the table entry value
Look-up-table substitution
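A few of the primitives listed above can be sketched in software to make their bit-level behaviour concrete. This is only an illustration under stated assumptions: 32-bit words, and the AES polynomial 0x11B for the Galois field example (the slides do not fix either). The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit operands


def add_mod_2n(a, b, n=32):
    """Addition modulo 2^n: add, then drop any carry out of bit n-1."""
    return (a + b) & ((1 << n) - 1)


def rotl32(x, r):
    """Rotate a 32-bit word left by r bits; r may be a constant or data-dependent."""
    r &= 31  # only the low 5 bits of the rotation amount matter
    return ((x << r) | (x >> (32 - r))) & MASK32


def gf256_mul(a, b, poly=0x11B):
    """Multiply two bytes in GF(2^8), reducing by `poly` (0x11B is the AES polynomial)."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition in GF(2^8) is XOR
        a <<= 1              # multiply a by x
        if a & 0x100:
            a ^= poly        # reduce when the degree reaches 8
        b >>= 1
    return result
```

Every operation here reduces to XOR, AND, shifts and a conditional select, which is why a small set of configurable logic elements can cover most of the list.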
Public Key Ciphers: Core Operations
Modular Multiplication and Exponentiation
Modular Exponentiation implemented with the multiply and square algorithm
Montgomery Multiplication algorithm the most popular for modular multiplication
Various Approaches for Implementation
  Systolic Array
  Word Based
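The multiply and square algorithm named on this slide can be sketched as follows. This is a minimal left-to-right software version for illustration, not the hardware datapath; the function name `mod_exp` is hypothetical.

```python
def mod_exp(base, exponent, modulus):
    """Left-to-right square-and-multiply: base**exponent mod modulus."""
    result = 1 % modulus
    base %= modulus
    # Scan the exponent bits from most significant to least significant.
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus    # always square
        if bit == "1":
            result = (result * base) % modulus  # multiply only on a 1 bit
    return result
```

Each exponent bit costs one modular squaring plus, for a 1 bit, one modular multiplication, so the speed of the modular multiplier dominates the whole exponentiation.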
ME & MM
ME primarily requires fast adders
CSA based implementation most common
The highest throughput implementation used redundant representation with carry save adders for computation of partial results
The same implementation style thus selected for ME
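To see why adders dominate, a bit-serial (radix-2) Montgomery multiplication can be written using only additions and one-bit right shifts. The sketch below is a software illustration under assumptions: it uses ordinary Python integers rather than the carry-save redundant representation the slide describes, and the names `mont_mul` and `mod_mul_via_montgomery` are hypothetical.

```python
def mont_mul(a, b, n, k):
    """Return a*b*2^(-k) mod n, for odd n < 2^k and 0 <= a, b < n."""
    s = 0
    for i in range(k):
        if (a >> i) & 1:
            s += b           # add the partial product for bit i of a
        if s & 1:
            s += n           # make s even so the shift below is exact
        s >>= 1              # divide by 2
    if s >= n:
        s -= n               # single final correction; s < 2n holds here
    return s


def mod_mul_via_montgomery(a, b, n, k):
    """Compute a*b mod n by converting into and out of Montgomery form."""
    r2 = (1 << (2 * k)) % n          # R^2 mod n, with R = 2^k
    a_m = mont_mul(a, r2, n, k)      # a*R mod n
    b_m = mont_mul(b, r2, n, k)      # b*R mod n
    c_m = mont_mul(a_m, b_m, n, k)   # a*b*R mod n
    return mont_mul(c_m, 1, n, k)    # a*b mod n
```

The loop body contains no division or trial subtraction, only conditional additions followed by a shift, which is the pattern a carry-save adder array handles well.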
Our Design: Key Insight
CSA made up of 2 half adders with 1 OR gate
Each half adder itself 1 XOR & 1 AND
Add some configurability to the basic CSA
Result: A fast basic element with support for most of the primitive operations
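The decomposition above can be checked with a small bit-level model: two half adders (one XOR and one AND each) plus one OR gate form a one-bit carry-save cell, and a row of such cells compresses three operands into a sum word and a carry word. The 64-bit default width and the function names are assumptions made for illustration.

```python
def half_adder(a, b):
    """One XOR and one AND gate: returns (sum, carry)."""
    return a ^ b, a & b


def full_adder(a, b, c_in):
    """Two half adders plus one OR gate: returns (sum, carry_out)."""
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, c_in)
    return s, c1 | c2


def carry_save_add(x, y, z, width=64):
    """Compress three operands into (sum_word, carry_word) with no carry propagation."""
    sum_word = 0
    carry_word = 0
    for i in range(width):
        s, c = full_adder((x >> i) & 1, (y >> i) & 1, (z >> i) & 1)
        sum_word |= s << i
        carry_word |= c << (i + 1)   # a carry has the weight of the next bit position
    return sum_word, carry_word
```

Because the XOR, AND and OR gates are already present in each cell, exposing their intermediate outputs is one plausible way to obtain the bitwise primitives from the same element.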
[Figure: carry save adder cell built from two half adders. Labels: X1, X2, Ci, A, B, Co, SUM]
So What Else is needed?
Shifts between rounds of addition (for modular exponentiation)
Support for fixed length shifts, rotates & arbitrary permutes of 32-bit operands (for symmetric key)
Solution: A Permutation Unit!
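One way to picture a permutation unit is as a table that names, for every output bit, the input bit it takes (or a constant zero). Fixed shifts, rotates and arbitrary permutes then differ only in the table loaded. The sketch below is a hypothetical software model; the table encoding and function names are assumptions, not the configuration format of the actual design.

```python
def permute32(word, table):
    """Build a 32-bit output where bit i comes from input bit table[i] (None means 0)."""
    out = 0
    for dst, src in enumerate(table):
        if src is not None:
            out |= ((word >> src) & 1) << dst
    return out


def rotate_left_table(r):
    """Table for a fixed rotate left by r: output bit i takes input bit (i - r) mod 32."""
    return [(i - r) % 32 for i in range(32)]


def shift_left_table(r):
    """Table for a fixed shift left by r: the low r output bits are filled with zero."""
    return [i - r if i >= r else None for i in range(32)]
```

For example, `permute32(0x80000001, rotate_left_table(1))` routes input bit 31 to output bit 0 and input bit 0 to output bit 1.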
Structure of Proposed Design
Final Design arrived at by iterative refinement
Hierarchical Design
  Cell
  Block/Cluster
  Groups
  Top of Hierarchy
The Cell
[Figure: the cell. Labels: inputs A, B, C, D; two Half Adders; MUX M; Output Select Logic; outputs O1, O2. Inset: half adder pair with X1, X2, Ci, A, B, Co, SUM]
The Block/Cluster
[Figure: the block/cluster. Labels: Permute Unit; 4-Bit Random Logic; 64 Carry Save Adders; inputs A, B, C, D; outputs O1, O2; 64 Bit Registered Outputs; 64-bit buses shown alongside pairs of 32-bit buses]
Group
[Figure: a group. Labels: Block 1, Block 2, Block 3, Block 4, Block 5, Memory]
Interconnects In a Group
[Figure: interconnects in a group. Labels: Block 1, Block 2, Block 3, Block 4, Block 5, Memory; external input from other blocks (shown twice)]
Overall Structure
Random Logic Support
[Figure: random logic unit. Labels: 16 configuration bits; 4 input bits; 6 input bits; 8 input bits; 1 bit output; 2 bits output; 4 bits output]
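The smallest case labelled on this slide, 4 input bits and 1 output bit with 16 configuration bits, matches the size of a conventional 4-input look-up table, since 2^4 = 16 truth-table entries. The sketch below models that case only. Treating the unit as a plain LUT, the bit ordering of the configuration word, and the name `lut4` are all assumptions; the slide does not describe how the 6-input and 8-input cases are built.

```python
def lut4(config, a, b, c, d):
    """4-input look-up table: the 16-bit `config` word is the truth table."""
    index = a | (b << 1) | (c << 2) | (d << 3)   # inputs select one of 16 entries
    return (config >> index) & 1


def make_config(func):
    """Build a 16-bit configuration word from a Python function of four bits."""
    config = 0
    for index in range(16):
        bits = [(index >> i) & 1 for i in range(4)]
        config |= (func(*bits) & 1) << index
    return config


# Example configurations: 4-input XOR (parity) and 4-input AND.
XOR4 = make_config(lambda a, b, c, d: a ^ b ^ c ^ d)
AND4 = make_config(lambda a, b, c, d: a & b & c & d)
```

Loading a different 16-bit word changes the function without changing the hardware, which is the sense in which such a unit supports random logic.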