Efficient Content Addressable Memory Design
Using RAM
Charith P. Wedage Millennium Information Technologies, Colombo, Sri Lanka
Email: [email protected]
Abstract—Content Addressable Memory (CAM) is a storage element similar to Random Access Memory (RAM) used in digital circuits, through which search operations can be done at extremely high speed. However, the use of CAMs in electronic circuits has been limited by their high complexity and high resource usage. This paper proposes an efficient design for CAM implementation using traditional RAM. In the proposed method, a primitive CAM block is defined, and a CAM of the required data width and address width is created by combining multiples of these CAM primitives. In this research, it was found that resource usage can be minimized by reducing the data width of this CAM primitive. Investigations were done into how to implement variable sized CAM primitives using fixed sized RAM. It was found that CAMs implemented using CAM sub-blocks with lower data widths consume significantly less memory while having a higher latency. This paper shows that the user of the CAM can trade off resource usage against latency by varying the data width of the primitive CAM, resulting in a more optimized and efficient CAM structure.
Index Terms—CAM, RAM, memory, FPGA
I. INTRODUCTION
A. Content Addressable Memory
Content Addressable Memory (CAM) is a storage element similar to Random Access Memory (RAM). In write mode, both CAMs and RAMs store the given data at the given address. In read mode, a RAM gives the data stored at the given address; a CAM, in contrast, gives the address at which the given data is stored.
For example, if a user needs to check whether a particular data word is stored in memory, or needs to find the location of particular data in memory, a CAM will give the location of the stored data in one operation. If a RAM were used for the same purpose, the user would have to read through all the addresses of the RAM until the matching data is found.
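The contrast can be sketched behaviourally in software (all names here are illustrative; a Python dict stands in for the CAM's parallel match hardware):

```python
# Behavioural sketch, not hardware: a RAM search must scan address by
# address, while a CAM-style lookup keys directly on the data itself.

def ram_search(ram, data):
    """Linear scan: up to len(ram) read operations in the worst case."""
    for address, word in enumerate(ram):
        if word == data:
            return address
    return None  # data not stored anywhere

def cam_search(cam, data):
    """Single lookup, independent of the memory size."""
    return cam.get(data)

ram = [0b0011, 0b1010, 0b0110]
cam = {word: address for address, word in enumerate(ram)}

assert ram_search(ram, 0b0110) == cam_search(cam, 0b0110) == 2
```

Both calls return the same address; the difference is that `ram_search` costs a number of operations that grows with the memory depth.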
Search operations can therefore be done in constant time using CAMs, whereas the time consumed by search operations in RAMs increases rapidly with the size of the memory. Fig. 1 describes the worst-case search operation latencies of RAMs and CAMs.
Manuscript sent June 10, 2015; revised January 14, 2016.
Figure 1. Worst case clock latency of CAMs and RAMs
Since a CAM can do search operations much faster
than its software and hardware counterparts, CAMs are
widely used in applications like address filters of network
switches and routers.
In [1]-[3], methods to implement CAMs as Application Specific Integrated Circuits (ASICs) are proposed. These CAMs are commonly referred to as native CAMs. These methods require match logic for each bit of the memory, which makes the designs very expensive; the match logic also makes the power consumption of each cell extremely high. The result is a higher cost and a higher power consumption per bit. It should also be noted that Field Programmable Gate Arrays (FPGAs), which are widely used in low-latency systems, do not have CAMs implemented as ASICs. Because of these drawbacks, ASIC implementations of CAMs have not been widely used.
B. CAM Implementation Using RAM
A more flexible and power efficient method of obtaining CAM functionality has therefore been sought. A CAM implemented using traditional RAM is considered a solution to this problem. This approach has gained popularity because RAM is a more mature technology and is widely implemented in many digital systems, including FPGAs.
CAM implementation using RAM was proposed in [4].
In [5], a more efficient method which uses dual-port
RAM was proposed. Both the major FPGA vendors in the
world, namely Xilinx and Altera have provided methods
[6], [7] to implement CAMs using RAM. The method
given in [5] and [6] defines a primitive CAM sub-block.
This method uses the principle of representing each
possible data-address combination using a bit in the RAM.
Both the methods provide the output in a single clock
cycle. The size of the CAM primitive is fixed to a particular data width and address width, so that the CAM primitive exactly fits into a RAM block available in FPGAs. A CAM with the required data width and address width is formed by combining multiple numbers of these CAM primitives.

International Journal of Electronics and Electrical Engineering Vol. 4, No. 5, October 2016
©2016 Int. J. Electron. Electr. Eng. 459 doi: 10.18178/ijeee.4.5.459-462
However, the main drawback of this method is its high resource usage. As the data width rises, the memory requirement rises exponentially as a result of the rise in the number of possible data-address combinations. Table I shows the memory requirements of a traditional RAM and of a CAM implemented using the method described in [6], in order to store various data widths in 32 addresses.
TABLE I. MEMORY REQUIREMENT FOR CAMS AND RAMS

Data width (bits)   Required memory for RAM (bits)   Required memory for CAM (bits)
10                  320                              32768
15                  480                              65536
20                  640                              65536
25                  800                              98304
30                  960                              98304
It can be seen from Table I that to store 320 bits in a CAM, this method consumes 32768 bits of memory, effectively wasting 32448 bits. It should also be noted that when the data width is not a multiple of 10, the CAM implementation wastes part of the memory as unused memory. This is due to the lack of configurability of the size of the primitive CAM sub-block defined in [6].

It is also important to note that this method provides the output in one clock cycle; with a 250 MHz clock, the output is given in just 4 nanoseconds. However, many electronic circuits do not need search operations, or look-ups, to be done at such high speed.
So, it can be seen that CAMs implemented using this method provide the results of search operations with extremely low latency while consuming a large amount of memory. It would be ideal if the designer of the CAM could give the user the ability to trade off latency against resource usage.

A methodology to configure the resource usage and latency of a CAM implemented using RAM is proposed in this paper.
The rest of the paper is organized as follows. First, the CAM architecture used in this research, which is adopted from the method given in [5] and [6], is described. Then, the method to configure the resource usage and latency is proposed. Finally, the resource consumption and latency differences between the proposed method and the method given in [5] and [6] are described.
II. ARCHITECTURE OF CAM IMPLEMENTED USING
RAM
This research uses the method of having a separate bit to represent the presence of each data-address combination in the CAM, which is the same method described in [5] and [6]. For a particular CAM design, a primitive CAM block with a fixed address width and data width is defined.

For each data-address combination in the primitive CAM block, a bit in the RAM block is used to represent the presence of that particular data at that particular address. For example, if a primitive CAM block with a 4 bit data width and a 3 bit address width is to be implemented, a RAM block of 16 (2^4) depth and 8 (2^3) bits width is used. Each of the 16 RAM addresses corresponds to a particular 4 bit CAM data combination, and each bit in the 8 bit RAM data field indicates the presence of the CAM data specified by the RAM address in the corresponding location of the CAM. Table II shows how the presence of CAM data is stored in the corresponding RAM data in this example.
TABLE II. RAM DATA BITS REPRESENTING THE PRESENCE OF A PARTICULAR CAM DATA IN CAM LOCATIONS

CAM data /          CAM address / RAM data bit
RAM address         0   1   2   3   4   5   6   7
0000                0   0   0   1   0   0   0   0
0001                0   0   0   0   0   0   0   0
0010                1   0   0   0   1   0   0   0
....                ..  ..  ..  ..  ..  ..  ..  ..
1111                0   0   0   0   0   0   0   0
In the above example, data "0000" is located at CAM address 3, while data "0001" and "1111" are not stored in the CAM. Data "0010" is located at two locations, namely 0 and 4.
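This read behaviour can be sketched in software as follows (a behavioural model of the bit matrix in Table II, not an HDL description; the names are illustrative):

```python
# Behavioural model of the RAM-based CAM read: the RAM is addressed by
# the CAM data, and each bit of the RAM word marks whether that data is
# present at the corresponding CAM address. Values follow the Table II
# example (4 bit data width, 3 bit address width).

DEPTH, WIDTH = 2 ** 4, 2 ** 3       # 16 RAM addresses x 8 presence bits
ram = [[0] * WIDTH for _ in range(DEPTH)]

ram[0b0000][3] = 1                  # data "0000" stored at CAM address 3
ram[0b0010][0] = 1                  # data "0010" stored at addresses 0 and 4
ram[0b0010][4] = 1

def cam_read(data):
    """Return every CAM address at which `data` is stored."""
    return [address for address, bit in enumerate(ram[data]) if bit]

assert cam_read(0b0000) == [3]
assert cam_read(0b0010) == [0, 4]
assert cam_read(0b1111) == []       # "1111" is not stored in the CAM
```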
In a CAM read operation, the data to be matched is given as the RAM address, and the output RAM data indicates the matching CAM addresses. However, when writing data to the CAM, it must be ensured that the presence of two or more data words is not recorded at the same CAM address; that is, no column of Table II may contain more than one 1. To get this functionality, an ordinary RAM, referred to as the erase RAM, is implemented, which returns the data located at a given address. In a write operation, the presence of the data currently located at the given write address is first cleared, using the erase RAM to find that data. Then the presence of the new data at the given address is asserted.
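The write procedure can be sketched as follows (again a behavioural model with illustrative names, assuming an erase RAM that maps each CAM address to the data last written there):

```python
# Sketch of a CAM write: before asserting the new presence bit, the old
# data's bit at that address is cleared, so no column ever holds two 1's.

DEPTH, WIDTH = 2 ** 4, 2 ** 3
presence = [[0] * WIDTH for _ in range(DEPTH)]   # data -> presence bits
erase_ram = [None] * WIDTH                       # address -> stored data

def cam_write(address, data):
    old = erase_ram[address]
    if old is not None:
        presence[old][address] = 0   # step 1: clear the old presence bit
    presence[data][address] = 1      # step 2: assert the new presence bit
    erase_ram[address] = data        # keep the erase RAM in step

cam_write(3, 0b0000)
cam_write(3, 0b0101)                 # overwrite CAM address 3
assert presence[0b0000][3] == 0      # old entry cleared
assert presence[0b0101][3] == 1      # new entry asserted
```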
After defining the size of the CAM primitive, a CAM with the required data width and address width is created by combining a set of these primitive blocks. Fig. 2 shows how a CAM of 10 bit data width and 128 addresses is implemented using four CAM primitives of 10 bit data width and 32 addresses.
Figure 2. Architecture of a CAM with 128 addresses implemented using CAM primitives with 32 addresses
Similarly, CAMs of higher data width can be
implemented by combining the outputs of the primitive
CAM blocks using AND gates. Fig. 3 shows how a CAM
with 20 bit data width is constructed using two 10 bit
CAM primitives.
Figure 3. Architecture of a CAM with 20 bit data width implemented using CAM primitives with 10 bit data width
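The widening scheme of Fig. 3 can be sketched as follows (a behavioural model with hypothetical match values; each primitive produces a per-address match vector, and the vectors are ANDed so an address matches only when every data slice matches there):

```python
# Combining primitives for a wider data word: bitwise AND of the
# per-address match vectors produced by each CAM primitive.

NUM_ADDRESSES = 32

def combine(match_vectors):
    result = (1 << NUM_ADDRESSES) - 1   # start with all addresses matching
    for vector in match_vectors:
        result &= vector                # keep only addresses every slice matched
    return result

# Hypothetical example: the low data slice matches addresses {2, 7} and
# the high slice matches {7, 9}; only address 7 matches the full word.
low_slice_matches = (1 << 2) | (1 << 7)
high_slice_matches = (1 << 7) | (1 << 9)
assert combine([low_slice_matches, high_slice_matches]) == (1 << 7)
```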
III. ACHIEVING CONFIGURABILITY IN RESOURCE
USAGE AND LATENCY
In this research, it was found that the number of bits (M) required to implement a CAM with a data width of W and an address width of a is given by the following equation, where d is the data width of the primitive CAM sub-block and W/d is rounded up to the nearest integer:

M = (W/d) × 2^(a+d)  (1)
By differentiating M with respect to d, it was found that the minimum memory requirement is achieved when d is 1 or 2, and that the memory requirement falls steadily as d is reduced. However, RAM blocks available in most digital circuits and FPGAs are of fixed size, so if a CAM primitive with a lower data width is implemented directly on a fixed sized RAM, there is a huge wastage of memory. In this research, investigations were done into how the data width of the CAM primitive (d) can be reduced while being implemented on fixed sized RAM without significant memory wastage.
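Equation (1) can be checked numerically; the following sketch reproduces several rows of Table III for a CAM with 32 addresses (a = 5):

```python
# Numerical check of Eq. (1): M = (W/d) * 2^(a+d), with W/d rounded up.
from math import ceil

def cam_bits(W, a, d):
    """Memory bits for data width W, address width a, primitive data width d."""
    return ceil(W / d) * 2 ** (a + d)

assert cam_bits(W=8, a=5, d=10) == 32768
assert cam_bits(W=8, a=5, d=8) == 8192
assert cam_bits(W=8, a=5, d=4) == 1024
assert cam_bits(W=64, a=5, d=8) == 65536
```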
In order to implement variable sized CAM primitives in fixed sized RAM blocks, the RAM blocks are broken into smaller virtual RAM blocks. The size of the virtual RAM block is selected so that a multiple of these virtual RAM blocks exactly fits into a physical RAM block. The virtual RAM blocks implemented in the same physical RAM share the same access ports, so two or more virtual RAM blocks cannot be accessed in parallel.
Fig. 4 shows how a basic RAM block of 1024×32 size is divided into four virtual RAM blocks, and how it can be accessed sequentially through multiplexers, de-multiplexers and some control logic. In this example, a physical RAM with a 10 bit address width and a 32 bit data width is divided into four Virtual RAM (VRAM) blocks, each having an 8 bit address width and a 32 bit data width.
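The sharing of one physical port can be sketched behaviourally as follows (illustrative names; the block index plays the role of the select lines driving the multiplexers):

```python
# Behavioural sketch of Fig. 4: one physical RAM carved into virtual RAM
# blocks that share a single access port, so reads targeting different
# virtual blocks must be issued on successive cycles.

PHYSICAL_DEPTH, VIRTUAL_DEPTH = 1024, 256   # four 256-deep virtual blocks
physical_ram = [0] * PHYSICAL_DEPTH

def vram_read(block, address):
    """The block index supplies the high bits of the physical address."""
    assert 0 <= address < VIRTUAL_DEPTH
    return physical_ram[block * VIRTUAL_DEPTH + address]

# Reading all four virtual blocks costs four sequential accesses, which
# is the source of the added latency of the resulting CAM.
results = [vram_read(block, 0) for block in range(4)]
assert len(results) == 4
```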
When implementing CAM primitives on the virtual
RAM blocks, the size of the CAM primitive is chosen
such that a primitive CAM sub-block would exactly fit
into a virtual RAM sub-block. In order to get an output
from the CAM, all the primitive CAM blocks have to be
accessed. However, as mentioned earlier, the virtual
RAM blocks cannot be accessed in parallel. Therefore,
the primitive CAM blocks also cannot be accessed in
parallel. So, some control logic should be implemented in
order to sequentially access the different CAM primitives
located in the same physical RAM.
Figure 4. Basic RAM block broken into 4 virtual RAM blocks
Fig. 5 shows how four CAM primitives of 8 bit data
width and 32 addresses can be implemented in a 1024 ×
32 physical RAM block.
Figure 5. Four CAM primitives implemented using one physical RAM block
As described earlier, a number of these CAM
primitives can be combined to implement a CAM with
the required data width.
It should be noted that a higher number of clock cycles is consumed because multiple CAM blocks are accessed sequentially. The latency of the CAM increases in proportion to the number of CAM sub-blocks implemented in a single RAM block.
It is clear from the above analysis that a decrease in the data width of the primitive CAM block results in a reduction in the total memory required and an increase in the latency of the CAM. So, by varying the data width of the CAM primitive, the user can trade off resource usage against latency, allowing the user to have a CAM which is optimized for the intended use.
IV. RESULTS
Table III shows how the memory requirement (in bits) of CAMs of various data widths changes when the data width of the CAM sub-block is varied. The given data is for CAMs with 32 addresses.
TABLE III. MEMORY REQUIREMENT (BITS) OF CAMS IMPLEMENTED USING CAM SUB-BLOCKS OF DIFFERENT DATA WIDTHS

Total data width     Sub-block         Sub-block        Sub-block
of CAM (bits)        data width = 10   data width = 8   data width = 4
8                    32768             8192             1024
16                   65536             16384            2048
20                   65536             24576            2560
25                   98304             32768            3584
32                   131072            32768            4096
40                   131072            40960            10240
64                   229376            65536            16384
It should be noted that the data width of the CAM primitive described in [6] is 10 bits. By comparing the first data column with the second and third, it is evident that a large saving in memory can be achieved using the method described in this research compared to the method described in [6].
18K block RAMs are the smallest RAM blocks in the Xilinx Virtex [8] FPGA series. Table IV shows the consumption of these blocks for CAM implementations using CAM sub-blocks of various data widths.
TABLE IV. CONSUMPTION OF 18K MEMORY BLOCKS FOR CAMS IMPLEMENTED USING CAM SUB-BLOCKS OF DIFFERENT DATA WIDTHS

Total data width     Sub-block         Sub-block        Sub-block
of CAM (bits)        data width = 10   data width = 8   data width = 4
10                   2                 1                1
16                   4                 1                1
20                   4                 2                1
25                   6                 2                1
32                   8                 2                1
40                   10                3                1
64                   14                4                1
Table V shows the latencies of CAM implementations using CAM sub-blocks of various data widths, assuming 18K RAM blocks are used.
TABLE V. CLOCK CYCLE LATENCY OF CAMS IMPLEMENTED USING CAM SUB-BLOCKS OF DIFFERENT DATA WIDTHS

Total data width     Sub-block         Sub-block        Sub-block
of CAM (bits)        data width = 10   data width = 8   data width = 4
10                   1                 4                8
16                   1                 4                8
20                   1                 4                8
25                   1                 4                8
32                   1                 4                8
40                   1                 4                8
64                   1                 4                8
V. CONCLUSION
In this research, it was found that by reducing the data-
width of the primitive CAM sub-block, the memory
requirement of the CAM can be reduced. A methodology
to implement CAM sub-blocks with reduced data-widths
on fixed sized RAM blocks was introduced in this
research. Using the given methodology, the user of the
CAM can have a trade-off between the memory usage
and the latency of the CAM. This will allow the user to
optimize the resource usage of the CAM for the intended
purpose. In the future, investigations should be done on optimizing CAMs with large address widths. Investigations should also be done on using quad-port RAMs to further reduce the memory requirement of CAMs.
REFERENCES

[1] N. Correa, A. Garcia, M. C. Duarte, and F. Gonzalez, "An ASIC CAM design for associative set processors," in Proc. 4th Annual IEEE International ASIC Conference and Exhibit, Rochester, NY, 1991.
[2] D. H. Le, K. Inoue, and C. K. Pham, "Design a fast CAM-based information detection system on FPGA and 0.18µm ASIC technology," in Proc. IEEE International Conference of Electron Devices and Solid-State Circuits, Hong Kong, 2013, pp. 1-2.
[3] Y. C. Shin, R. Sridhar, V. Demjanenko, P. W. Palumbo, and S. N. Srihari, "A special-purpose content addressable memory chip for real-time image processing," IEEE Journal of Solid-State Circuits, vol. 27, no. 5, pp. 737-744, 1992.
[4] K. McLaughlin, N. O'Connor, and S. Sezer, "Exploring CAM design for network processing using FPGA technology," in Proc. International Conference on Internet and Web Applications and Services/Advanced International Conference on Telecommunications, 2006, p. 84.
[5] M. M. Soni and P. K. Dakhole, "FPGA implementation of content addressable memory based information detection system," in Proc. International Conference on Communications and Signal Processing, Melmaruvathur, 2014, pp. 930-933.
[6] K. Locke. (2011). XAPP1151 - Parameterizable content-addressable memory. [Online]. Available: http://www.xilinx.com/support/documentation/application_notes/xapp1151_Param_CAM.pdf
[7] Altera Corporation, San Jose, CA. (July 2011). Advanced Synthesis Cookbook. [Online]. Available: https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/manual/stx_cookbook.pdf
[8] Xilinx Inc. (2014). Virtex-6 FPGA memory resources user guide. [Online]. Available: http://www.xilinx.com/support/documentation/user_guides/ug363.
Charith P. Wedage was born in Colombo, Sri Lanka, on September 28th, 1988. Mr. Wedage received a Bachelor of Science in Engineering degree with Honours in the field of electronics and telecommunication from the University of Moratuwa, Sri Lanka, in 2013.

Since 2013, he has worked as an Electronics Engineer at MillenniumIT, which is a part of the London Stock Exchange Group. At MillenniumIT, he has done research in developing FPGA based accelerated systems for trading systems. His research interests include reconfigurable computing, computer architecture, and digital signal processing.

Mr. Wedage is a member of the Institute of Engineers Sri Lanka.