
Efficient Content Addressable Memory Design Using RAM

Charith P. Wedage
Millennium Information Technologies, Colombo, Sri Lanka
Email: [email protected]

Abstract—Content Addressable Memory (CAM) is a storage element, similar to Random Access Memory (RAM), used in digital circuits, through which search operations can be performed at extremely high speed. However, the use of CAMs in electronic circuits has been limited by their high complexity and high resource usage. This paper proposes an efficient design for CAM implementation using traditional RAM. In the proposed method, a primitive CAM block is defined, and a CAM of the required data width and address width is created by combining multiples of these CAM primitives. This research found that resource usage can be minimized by reducing the data width of the CAM primitive. Investigations were carried out into how variable-sized CAM primitives can be implemented using fixed-sized RAM. It was found that CAMs built from sub-blocks with lower data widths consume significantly less memory while having higher latency. This paper shows that the user of the CAM can trade resource usage against latency by varying the data width of the primitive CAM, resulting in a more optimized and efficient CAM structure.

Index Terms—CAM, RAM, memory, FPGA

I. INTRODUCTION

A. Content Addressable Memory

Content Addressable Memory (CAM) is a storage element similar to Random Access Memory (RAM). In write mode, both CAMs and RAMs store the given data at the given address. In read mode, a RAM returns the data stored at the given address; in contrast, a CAM returns the address at which the given data is stored.

For example, if a user needs to check whether a particular data word is stored in memory, or needs to find the location of particular data in memory, the user can employ a CAM, which returns the location of the stored data in a single operation. If a RAM were used for this purpose, the user would have to read through all the addresses of the RAM until the matching data is found.

It follows that search operations can be done in constant time using CAMs, whereas the time consumed by search operations in RAMs increases rapidly with the size of the memory. Fig. 1 shows the worst-case search latencies of RAMs and CAMs.

Manuscript received June 10, 2015; revised January 14, 2016.

Figure 1. Worst case clock latency of CAMs and RAMs

Since a CAM can do search operations much faster than its software and hardware counterparts, CAMs are widely used in applications like address filters of network switches and routers.

In [1]-[3], methods to implement CAMs as Application Specific Integrated Circuits (ASICs) are proposed. These CAMs are commonly referred to as native CAMs. These methods require match logic for each bit of the memory, which makes the designs very expensive, and the match logic also makes the power consumption of each cell extremely high. The result is higher cost and higher power consumption per bit. It should also be noted that Field Programmable Gate Arrays (FPGAs), which are widely used in low-latency systems, do not contain ASIC-implemented CAMs. Because of these drawbacks, ASIC implementations of CAMs have not been widely used.

B. CAM Implementation Using RAM

A flexible and more power-efficient way of obtaining CAM functionality is therefore sought, and a CAM implemented using traditional RAM is considered a solution to this problem. This approach has gained popularity because RAM is a more mature technology and is widely available in many digital systems, including FPGAs.

CAM implementation using RAM was proposed in [4]. In [5], a more efficient method that uses dual-port RAM was proposed. Both of the major FPGA vendors, Xilinx and Altera, provide methods [6], [7] to implement CAMs using RAM. The method given in [5] and [6] defines a primitive CAM sub-block and uses the principle of representing each possible data-address combination by one bit in the RAM. Both methods provide the output in a single clock cycle.

The size of the CAM primitive is fixed to a particular data width and address width so that the primitive exactly fits into a RAM block available in FPGAs. A CAM with the required data width and address width is then formed by combining multiple of these CAM primitives.

However, the main drawback of this method is its high resource usage. As the data width rises, the memory requirement rises exponentially because the number of possible data-address combinations grows. Table I shows the memory required by a traditional RAM and by a CAM implemented with the method described in [6] to store data of various widths across 32 addresses.

TABLE I. MEMORY REQUIREMENT FOR CAMS AND RAMS

Data width (bits)   Required memory for RAM (bits)   Required memory for CAM (bits)
10                  320                              32768
15                  480                              65536
20                  640                              65536
25                  800                              98304
30                  960                              98304

It can be seen from Table I that, to store 320 bits of data, this method consumes 32768 bits of memory, effectively wasting 32448 bits. It should also be noted that when the data width is not a multiple of 10, part of the CAM memory is left unused. This is due to the lack of configurability in the size of the primitive CAM sub-block defined in [6].

It is also important to note that this method provides the output in one clock cycle; with a 250 MHz clock, the output is available in just 4 nanoseconds. However, many electronic circuits do not need search operations, or look-ups, to be performed at such high speed.

It can be seen, then, that CAMs implemented using this method deliver search results with extremely low latency while consuming a large amount of memory. It would be ideal if the designer of the CAM could give its user the ability to trade off latency against resource usage.

A methodology to configure the resource usage and latency of a CAM implemented using RAM is proposed in this paper.

The rest of the paper is organized as follows. First, the CAM architecture used in this research, adopted from the method given in [5] and [6], is described. Then, the method to configure the resource usage and latency is proposed. Finally, the differences in resource consumption and latency between the proposed method and the method given in [5] and [6] are described.

II. ARCHITECTURE OF CAM IMPLEMENTED USING RAM

This research uses the method of having a separate bit to represent the presence of each data-address combination in the CAM, as described in [5] and [6]. For a particular CAM design, a primitive CAM block with a fixed address width and data width is defined.

For each data-address combination in the primitive CAM block, one bit in the RAM block represents the presence of that particular data at that particular address. For example, if a primitive CAM block with a 4-bit data width and a 3-bit address width is to be implemented, a RAM block with a depth of 16 (2^4) and a width of 8 (2^3) bits is used. Each of the 16 RAM addresses corresponds to a particular 4-bit CAM data combination, and each bit of the 8-bit RAM data field indicates the presence of the CAM data specified by the RAM address at the corresponding CAM location. Table II shows how the presence of CAM data is stored in the corresponding RAM data for this example.

TABLE II. RAM DATA BITS REPRESENTING THE PRESENCE OF A PARTICULAR CAM DATA IN CAM LOCATIONS

CAM data /       CAM address / RAM data bit
RAM address      0    1    2    3    4    5    6    7
0000             0    0    0    1    0    0    0    0
0001             0    0    0    0    0    0    0    0
0010             1    0    0    0    1    0    0    0
....             ..   ..   ..   ..   ..   ..   ..   ..
1111             0    0    0    0    0    0    0    0

In this example, data "0000" is located at CAM address 3, while data "0001" and "1111" are not stored in the CAM. Data "0010" is located at two addresses, namely 0 and 4.

In a CAM read operation, the data to be matched is given as the RAM address, and the output RAM data is the matching CAM address. When writing data to the CAM, however, it must be ensured that the presence of two or more data words is not recorded at the same address; for example, no column of Table II may contain two 1's. To obtain this behaviour, an ordinary RAM, referred to here as the erase RAM, is also implemented; it returns the data currently stored at a given address. In a write operation, the presence bit of the data currently located at the given write address is first cleared, using the erase RAM to find that data. Then the presence of the new data at the given address is asserted.
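To make this concrete, a minimal behavioural sketch in Python is given below. It models only the running example (a primitive with 4-bit data and 3-bit address) and uses invented names (ram, erase_ram, cam_write, cam_read); it illustrates the bit-per-combination scheme, not the RTL of [5] or [6].

    # Behavioural model of one CAM primitive (4-bit data, 3-bit address),
    # following the bit-per-(data, address) scheme described above.
    DATA_WIDTH = 4                      # d: data width of the CAM primitive
    ADDR_WIDTH = 3                      # a: number of CAM address bits

    # "RAM": one row per possible data value, one presence bit per CAM address.
    ram = [[0] * (1 << ADDR_WIDTH) for _ in range(1 << DATA_WIDTH)]
    # "Erase RAM": remembers which data word currently occupies each address.
    erase_ram = [None] * (1 << ADDR_WIDTH)

    def cam_write(address, data):
        """Store `data` at `address`, clearing the previous entry first."""
        old = erase_ram[address]
        if old is not None:
            ram[old][address] = 0       # clear the presence bit of the old data
        ram[data][address] = 1          # assert the presence bit of the new data
        erase_ram[address] = data

    def cam_read(data):
        """Return the list of CAM addresses where `data` is stored."""
        return [addr for addr, bit in enumerate(ram[data]) if bit]

    # Reproduce Table II: "0000" at address 3, "0010" at addresses 0 and 4.
    cam_write(3, 0b0000)
    cam_write(0, 0b0010)
    cam_write(4, 0b0010)
    print(cam_read(0b0000))             # [3]
    print(cam_read(0b0010))             # [0, 4]
    print(cam_read(0b1111))             # [] (not stored)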

After defining the size of the CAM primitive, a CAM with the required data width and address width is created by combining a set of these primitive blocks. Fig. 2 shows how a CAM with a 10-bit data width and 128 addresses is implemented using four CAM primitives of 10-bit data width and 32 addresses.

Figure 2. Architecture of a CAM with 128 addresses implemented using CAM primitives with 32 addresses
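A behavioural sketch of this address-width extension is given below, as illustrative Python with invented names rather than the vendor RTL: the upper CAM address bits select the primitive that a write goes to, and a read queries all primitives and maps each local match back to a full 128-address result.

    # Behavioural sketch of Fig. 2: a 128-address CAM built from four
    # 32-address CAM primitives, each covering one 32-address segment.
    SEGMENT_ADDRESSES = 32
    NUM_SEGMENTS = 4
    PRIM_DATA_WIDTH = 10

    def cam128_write(primitives, erase_rams, address, data):
        """Route the write to the primitive selected by the upper address bits."""
        seg, local = divmod(address, SEGMENT_ADDRESSES)
        old = erase_rams[seg][local]
        if old is not None:
            primitives[seg][old][local] = 0
        primitives[seg][data][local] = 1
        erase_rams[seg][local] = data

    def cam128_read(primitives, data):
        """Query every primitive; translate local matches into 7-bit CAM addresses."""
        hits = []
        for seg, prim in enumerate(primitives):
            for local, bit in enumerate(prim[data]):
                if bit:
                    hits.append(seg * SEGMENT_ADDRESSES + local)
        return hits

    # Four 10-bit primitives: 1024 rows of 32 presence bits, plus erase RAMs.
    prims = [[[0] * SEGMENT_ADDRESSES for _ in range(1 << PRIM_DATA_WIDTH)]
             for _ in range(NUM_SEGMENTS)]
    erase = [[None] * SEGMENT_ADDRESSES for _ in range(NUM_SEGMENTS)]
    cam128_write(prims, erase, 100, 0x155)
    print(cam128_read(prims, 0x155))    # [100]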


Similarly, CAMs of higher data width can be implemented by combining the outputs of the primitive CAM blocks using AND gates. Fig. 3 shows how a CAM with a 20-bit data width is constructed from two 10-bit CAM primitives.

Figure 3. Architecture of a CAM with 20 bit data width implemented using CAM primitives with 10 bit data width
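The data-width extension can be sketched in the same behavioural style, again as illustrative Python with invented names: the input word is sliced into primitive-width fields, each slice is looked up in its own primitive, and the per-address match vectors are AND-ed so that only addresses matching every slice survive, exactly as the AND gates of Fig. 3 do.

    # Behavioural sketch of Fig. 3: reading a 20-bit-wide CAM built from
    # two 10-bit CAM primitives that share the same 32 addresses.
    NUM_ADDRESSES = 32
    SLICE_WIDTH = 10                    # data width of one CAM primitive

    def primitive_match_vector(primitive_ram, data_slice):
        """One primitive returns 32 match bits: bit i is 1 when `data_slice`
        is stored at CAM address i (one RAM row per possible data value)."""
        return primitive_ram[data_slice]

    def wide_cam_read(primitives, data):
        """Slice `data` into 10-bit fields and AND the per-slice match vectors."""
        match = [1] * NUM_ADDRESSES
        for i, prim in enumerate(primitives):
            data_slice = (data >> (i * SLICE_WIDTH)) & ((1 << SLICE_WIDTH) - 1)
            vector = primitive_match_vector(prim, data_slice)
            match = [m & v for m, v in zip(match, vector)]
        return [addr for addr, m in enumerate(match) if m]

    # Example: store the 20-bit word 0x2A7F3 at CAM address 5 and look it up.
    prims = [[[0] * NUM_ADDRESSES for _ in range(1 << SLICE_WIDTH)] for _ in range(2)]
    word = 0x2A7F3
    prims[0][word & 0x3FF][5] = 1           # lower 10-bit slice, address 5
    prims[1][(word >> 10) & 0x3FF][5] = 1   # upper 10-bit slice, address 5
    print(wide_cam_read(prims, word))       # [5]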

III. ACHIEVING CONFIGURABILITY IN RESOURCE USAGE AND LATENCY

In this research, it was found that the number of bits (M) required to implement a CAM with a data width of W and an address width of a is given by the following equation, where d is the data width of the primitive CAM sub-block and W/d is rounded up to the nearest whole number of sub-blocks:

M = ⌈W/d⌉ × 2^(a+d)    (1)

By differentiating M with respect to d (treating W/d as continuous), it was found that the minimum memory requirement is achieved when d is 1 or 2; the continuous minimum lies at d = 1/ln 2 ≈ 1.44, and the integer widths d = 1 and d = 2 yield the same memory figure. The memory requirement therefore keeps dropping as d is reduced. However, the RAM blocks available in most digital circuits and FPGAs are of fixed size, so if a CAM primitive with a lower data width is mapped directly onto a fixed-sized RAM, a large portion of the memory is wasted. In this research, investigations were carried out into how the data width of the CAM primitive (d) can be reduced while still being implemented on fixed-sized RAM without significant memory wastage.
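The trade-off expressed by (1) can be checked numerically. The short Python script below is an illustrative calculation only (the function name is invented); it evaluates M for a 32-address CAM (a = 5) over the data widths of Table III and several sub-block widths d, and reproduces the d = 10 and d = 8 columns of Table III.

    import math

    def cam_memory_bits(W, a, d):
        """Memory (bits) of a CAM with data width W and 2**a addresses,
        built from primitives of data width d: M = ceil(W/d) * 2**(a + d)."""
        return math.ceil(W / d) * 2 ** (a + d)

    # 32-address CAM (a = 5), total data widths taken from Table III.
    for W in (8, 16, 20, 25, 32, 40, 64):
        print(W, [cam_memory_bits(W, 5, d) for d in (10, 8, 4, 2, 1)])
    # For each W the figure shrinks (or stays the same) as d is reduced.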

In order to implement variable-sized CAM primitives in fixed-sized RAM primitives, each RAM primitive is divided into smaller virtual RAM primitives. The size of the virtual RAM primitive is selected so that a whole number of these virtual RAM blocks fits exactly into a physical RAM block. The virtual RAM blocks implemented in the same physical RAM share the same access ports, so two or more virtual RAM blocks cannot be accessed in parallel.

Fig. 4 shows how a basic RAM block of size 1024 × 32 is divided into four virtual RAM blocks and how it can be accessed sequentially through multiplexers, de-multiplexers, and some control logic. In this example, a physical RAM with a 10-bit address width and a 32-bit data width is divided into four Virtual RAM (VRAM) blocks, each having an 8-bit address width and a 32-bit data width.

When implementing CAM primitives on the virtual RAM blocks, the size of the CAM primitive is chosen such that a primitive CAM sub-block fits exactly into a virtual RAM sub-block. To obtain an output from the CAM, all the primitive CAM blocks have to be accessed; however, as mentioned earlier, the virtual RAM blocks, and therefore the CAM primitives, cannot be accessed in parallel. Some control logic is therefore required to access the different CAM primitives located in the same physical RAM sequentially.

Figure 4. Basic RAM block broken into 4 virtual RAM blocks

Fig. 5 shows how four CAM primitives of 8-bit data width and 32 addresses can be implemented in a 1024 × 32 physical RAM block.

Figure 5. Four CAM primitives implemented using one physical RAM block
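The sequential access can be sketched as follows, in the same illustrative Python style with invented names. One possible arrangement, consistent with the latencies reported later in Table V, is modelled: the four 8-bit CAM primitives of Fig. 5 hold the four byte slices of a 32-bit CAM word in consecutive 256-word segments of one 1024 × 32 RAM, so a look-up reads one segment per clock cycle and takes four cycles in total.

    # Behavioural sketch of Fig. 5: four CAM primitives (8-bit data,
    # 32 addresses) packed into one 1024 x 32 physical RAM and read
    # sequentially through its single access port.
    SUBBLOCKS = 4                        # CAM primitives per physical RAM
    SUB_DATA_WIDTH = 8                   # d: data width of one CAM primitive
    SEGMENT_DEPTH = 1 << SUB_DATA_WIDTH  # 256 RAM words per virtual block
    NUM_ADDRESSES = 32                   # CAM addresses (= RAM word width in bits)

    # One 1024-entry RAM; each word is a 32-bit match vector stored as an int.
    physical_ram = [0] * (SUBBLOCKS * SEGMENT_DEPTH)

    def cam_lookup(data32):
        """Look up a 32-bit word; returns (match_vector, cycles_used)."""
        match = (1 << NUM_ADDRESSES) - 1        # start with all addresses matching
        for block in range(SUBBLOCKS):          # one clock cycle per iteration
            data_slice = (data32 >> (block * SUB_DATA_WIDTH)) & 0xFF
            ram_addr = block * SEGMENT_DEPTH + data_slice   # select the virtual block
            match &= physical_ram[ram_addr]     # AND in this slice's match vector
        return match, SUBBLOCKS

    # Example: record the 32-bit word 0xDEADBEEF at CAM address 7, then look it up.
    word = 0xDEADBEEF
    for block in range(SUBBLOCKS):
        data_slice = (word >> (block * SUB_DATA_WIDTH)) & 0xFF
        physical_ram[block * SEGMENT_DEPTH + data_slice] |= 1 << 7
    print(cam_lookup(word))                     # (128, 4): bit 7 set, 4 cycles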

As described earlier, a number of these CAM primitives can be combined to implement a CAM with the required data width.

It should be noted that a higher number of clock cycles is consumed because multiple CAM sub-blocks are accessed sequentially; the latency of the CAM increases in proportion to the number of CAM sub-blocks implemented in a single RAM block.

It is clear from the above analysis that decreasing the data width of the primitive CAM block reduces the total memory required and increases the latency of the CAM.


So, by varying the data width of the CAM primitive, the user can trade off resource usage against latency, obtaining a CAM that is optimized for its intended use.

IV. RESULTS

Table III shows how the memory requirement (in bits) of CAMs of different data widths varies with the data width of the CAM sub-block. The figures given are for CAMs with 32 addresses.

TABLE III. MEMORY REQUIREMENT (BITS) OF CAMS IMPLEMENTED USING CAM SUB-BLOCKS OF DIFFERENT DATA WIDTHS

Total data width    Sub-block data    Sub-block data    Sub-block data
of CAM (bits)       width = 10        width = 8         width = 4
8                   32768             8192              1024
16                  65536             16384             2048
20                  65536             24576             2560
25                  98304             32768             3584
32                  131072            32768             4096
40                  131072            40960             10240
64                  229376            65536             16384

It should be noted that the data width of the CAM primitive described in [6] is 10 bits. Comparing the sub-block width 10 column of Table III with the width 8 and width 4 columns therefore shows that a large saving in memory can be achieved using the method described in this research compared to the method described in [6].

18K block RAMs are the smallest RAM blocks used in the Xilinx Virtex [8] FPGA series. Table IV shows the consumption of these blocks by CAM implementations using CAM sub-blocks of various data widths.

TABLE IV. CONSUMPTION OF 18K MEMORY BLOCKS FOR CAMS IMPLEMENTED USING CAM SUB-BLOCKS OF DIFFERENT DATA WIDTHS

Total data width    Sub-block data    Sub-block data    Sub-block data
of CAM (bits)       width = 10        width = 8         width = 4
10                  2                 1                 1
16                  4                 1                 1
20                  4                 2                 1
25                  6                 2                 1
32                  8                 2                 1
40                  10                3                 1
64                  14                4                 1

Table V shows the latencies of CAM implementations using CAM sub-blocks of various data widths. It is assumed that 18K RAM blocks are used.

TABLE V. CLOCK CYCLE LATENCY OF CAMS IMPLEMENTED USING CAM SUB-BLOCKS OF DIFFERENT DATA WIDTHS

Total data width    Sub-block data    Sub-block data    Sub-block data
of CAM (bits)       width = 10        width = 8         width = 4
10                  1                 4                 8
16                  1                 4                 8
20                  1                 4                 8
25                  1                 4                 8
32                  1                 4                 8
40                  1                 4                 8
64                  1                 4                 8

V. CONCLUSION

In this research, it was found that the memory requirement of a CAM can be reduced by reducing the data width of the primitive CAM sub-block. A methodology to implement CAM sub-blocks with reduced data widths on fixed-sized RAM blocks was introduced. Using this methodology, the user of the CAM can trade memory usage against latency, allowing the resource usage of the CAM to be optimized for its intended purpose. In future work, investigations should be carried out into optimizing CAMs with large address widths and into using quad-port RAMs to further reduce the memory requirement of CAMs.

REFERENCES

[1] N. Correa, A. Garcia, M. C. Duarte, and F. Gonzalez, "An ASIC CAM design for associative set processors," in Proc. 4th Annual IEEE International ASIC Conference and Exhibit, Rochester, NY, 1991.
[2] D. H. Le, K. Inoue, and C. K. Pham, "Design a fast CAM-based information detection system on FPGA and 0.18µm ASIC technology," in Proc. IEEE International Conference of Electron Devices and Solid-State Circuits, Hong Kong, 2013, pp. 1-2.
[3] Y. C. Shin, R. Sridhar, V. Demjanenko, P. W. Palumbo, and S. N. Srihari, "A special-purpose content addressable memory chip for real-time image processing," IEEE Journal of Solid-State Circuits, vol. 27, no. 5, pp. 737-744, 1992.
[4] K. McLaughlin, N. O'Connor, and S. Sezer, "Exploring CAM design for network processing using FPGA technology," in Proc. International Conference on Internet and Web Applications and Services / Advanced International Conference on Telecommunications, 2006, p. 84.
[5] M. M. Soni and P. K. Dakhole, "FPGA implementation of content addressable memory based information detection system," in Proc. International Conference on Communications and Signal Processing, Melmaruvathur, 2014, pp. 930-933.
[6] K. Locke. (2011). XAPP1151 - Parameterizable content-addressable memory. [Online]. Available: http://www.xilinx.com/support/documentation/application_notes/xapp1151_Param_CAM.pdf
[7] Altera Corporation, San Jose, CA. (July 2011). Advanced Synthesis Cookbook. [Online]. Available: https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/manual/stx_cookbook.pdf
[8] Xilinx Inc. (2014). Virtex-6 FPGA Memory Resources User Guide. [Online]. Available: http://www.xilinx.com/support/documentation/user_guides/ug363.pdf

Charith P. Wedage was born in Colombo, Sri Lanka on September 28th, 1988. Mr. Wedage received a Bachelor of Science in Engineering honours degree in the field of electronics and telecommunication from the University of Moratuwa, Sri Lanka, in 2013. Since 2013, he has worked as an Electronics Engineer at MillenniumIT, which is part of the London Stock Exchange Group. At MillenniumIT, he has carried out research into developing FPGA-based accelerated systems for trading systems. His research interests include reconfigurable computing, computer architecture, and digital signal processing. Mr. Wedage is a member of the Institute of Engineers Sri Lanka.

