
TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS 1

An Efficient and Scalable Semiconductor Architecture for Parallel Automata Processing
Paul Dlugosch, Dave Brown, Paul Glendenning, Michael Leventhal, Harold Noyes, Member, IEEE

Abstract—We present the design and development of the automata processor, a massively parallel non-von Neumann semiconductor architecture that is purpose-built for automata processing. This architecture can directly implement non-deterministic finite automata in hardware and can be used to implement complex regular expressions, as well as other types of automata which cannot be expressed as regular expressions. We demonstrate that this architecture exceeds the capabilities of high-performance FPGA-based implementations of regular expression processors. We report on the development of an XML-based language for describing automata for easy compilation targeted to the hardware. The automata processor can be effectively utilized in a diverse array of applications driven by pattern matching, such as cyber security and computational biology.

Index Terms—automata, parallel architectures, high performance computing, hardware, accelerator architectures, reconfigurable architectures

1 INTRODUCTION

EFFICIENT implementation of automata-based processing has been studied for several decades and has been applied to diverse fields such as network security, computational biology, image processing, and text search [1]. Research has been primarily focused on the use of traditional CPU architectures as the engine for automata processing [2]. In the last few years, this work has been expanded to include multi-core CPUs, network processors, and GPUs (for example, [3], [4], [5], [6]).

Software-based automata usually employ a DFA-based approach over NFA, e.g., see Liu et al. [7], trading speed for relatively plentiful memory, though good results using simulated NFA have been reported in some situations [8]. Various techniques to improve the performance of automata processing on von Neumann architectures have been explored. Refinement of DFA [9] or reduction of acuity in pattern matching capability by restricting constructs such as bounded quantifications, Kleene stars and logical operations [10] may reduce the problem of state explosion. Modified NFA, which reduce the problem of backtracking, and hybrid finite automata have also been proposed ([6], [9], [11], [12], [13]). In von Neumann software-based automata processing systems, DFA and NFA can be thought of as extremities of a spectrum of space and time complexity tradeoffs, as illustrated in Table 1 [14].

• The authors are with Micron Technology, DRAM Solutions Group, Architecture Development Group, 8000 S. Federal Way, Boise, ID. E-mail: {pddlugosch, dbrown, pglendenning, mleventhal, hnoyes}@micron.com

TABLE 1
Storage Complexity and per-Symbol Processing Complexity of an n-State NFA and Equivalent DFA

        Processing Complexity   Storage Cost
NFA     O(n^2)                  O(n)
DFA     O(1)                    O(Σ^n)

Direct implementation of automata in hardware has the potential to be more efficient than software executing on a von Neumann architecture. Hardware-based automata can effect simultaneous, parallel exploration of all possible valid paths in an NFA, thereby achieving the processing complexity of a DFA without being subject to DFA state explosion. FPGAs offer, to some extent, the requisite parallelism, and a number of relatively recent efforts have explored this direction ([15], [16], [17], [18], [19]). However, there remain significant limitations in FPGA-based implementations.

We have created an architecture purpose-built for direct implementation of NFA which achieves significantly improved processing efficiency, capacity, expressiveness and computational power. We also expect dramatic improvements in cost and power consumption compared to other approaches from the silicon implementation, in fabrication at the time of writing.

We present the theoretical model of the architecture in relation to the theory of bit parallelism in Section 2 and its semiconductor implementation in Section 3. In Section 4, we evaluate the architecture’s ability to implement complex regular expressions and other types of automata in a time- and space-efficient manner. An overview of related work is presented in Section 5. We conclude in Section 6 with a discussion on future work on the architecture. Additional detail on the semiconductor implementation, a more extensive comparison to related work, and videos of example automata presented in the paper are included in supplementary material available at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TPDS.2014.8 1045-9219/14/$31.00 © 2014 IEEE

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

2 AUTOMATA PROCESSOR EXECUTION MODEL

We make use of the formal definition of non-deterministic finite automata, extending its execution model to account for the unique properties of the automata processor. An NFA is described by the 5-tuple 〈Q, Σ, δ, q0, F〉 where Q is the set of automaton states, Σ is the alphabet, δ is the transition function, q0 ∈ Q is the start state, and F ⊆ Q is the set of final states. The transition function δ(q, α) defines the set of states that can be reached from state q when the symbol α is received at the automaton input.

An automaton can be modeled as a labeled graph, where the vertices are labeled with the states and the edges are labeled with the symbols of the transition function.

The transition function can be extended to multiple states in any class C ⊆ Q, by the following definition:

δ(C, α) = ⋃_{q∈C} δ(q, α)

The closure of the class C is the set of states that are reached from any member of C, on any input symbol. The closure is defined as:

δ(C) = ⋃_{α∈Σ} ⋃_{q∈C} δ(q, α)

A homogeneous automaton is a restriction on the automaton definition above, such that all transitions entering a state must occur on the same input symbol(s), i.e. for any two symbols (α, β) ∈ Σ and any two states (p, q) ∈ Q, δ(p, α) ∩ δ(q, β) = δ(q, α) ∩ δ(p, β). This definition has its roots in Glushkov’s position automata [20]. The position automaton obtained from an n-length regular expression (excluding operators) is an (n+1)-state homogeneous automaton. Each unique symbol and position pair (or character class), within the regular expression, maps to a homogeneous state in Q.

We now introduce the concept of symbol acceptance. A homogeneous automaton state q is said to accept the symbol α, if α labels a transition entering the state, or is a subset of any transition label entering the state. Formally, we define the symbol acceptance function as symbols(α) = ⋃_{q∈Q} δ(q, α), i.e. state q accepts symbol α iff q ∈ symbols(α). We can now define the transition function as:

δ(C,α) = δ(C) ∩ symbols(α)

A sequence of alphabet symbols, on the input of the homogeneous automaton, will cause the automaton to transition through a path of states in Q. The automaton runtime state C ⊆ Q is defined as the states that will potentially transition on the next input symbol. T = C ∩ symbols(α) are the states that accept the next input symbol. The automaton is said to match the sequence, presented to its input, when T ∩ F ≠ ∅. The automaton execution model for an input string S is shown below.

 1: C = δ(q0)
 2: if q0 ∈ F then
 3:   match the empty string
 4: end if
 5: for each input character α in S do
 6:   T = C ∩ symbols(α)
 7:   if T ∩ F ≠ ∅ then
 8:     we have a match
 9:   end if
10:   if T is empty then
11:     stop processing S
12:   end if
13:   C = δ(T)
14: end for

The runtime complexity of the execution model depends on how efficiently we can execute lines 6-13. A DFA is a special case where |T| ≤ 1 and |δ(T)| ≤ |Σ|, however the space cost |Q| can be exponential, in the worst case.
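As a concrete illustration, the execution model above can be sketched as a set-based simulation in Python. This is a hypothetical software analogue (the function name and the tiny two-state example automaton for the pattern "ab" are ours), not the hardware implementation:

```python
def run_homogeneous_nfa(follow, symbols, start, final, s):
    """Set-based simulation of the homogeneous-automaton execution model.
    follow[q]  : delta(q), the states reachable from state q on any symbol
    symbols[a] : the states that accept symbol a
    start      : delta(q0), the set of initial states
    final      : F, the set of final states
    Returns the input positions at which a match is reported."""
    matches = []
    c = set(start)                                # line 1: C = delta(q0)
    for i, a in enumerate(s):
        t = c & symbols.get(a, set())             # line 6: T = C ∩ symbols(a)
        if t & final:                             # line 7: T ∩ F is non-empty
            matches.append(i)                     # line 8: we have a match
        if not t:                                 # line 10: T is empty
            break                                 # line 11: stop processing S
        c = set().union(*(follow[q] for q in t))  # line 13: C = delta(T)
    return matches

# Position automaton for the pattern "ab":
# state A accepts 'a'; state B accepts 'b' and is final.
follow = {'A': {'B'}, 'B': set()}
symbols = {'a': {'A'}, 'b': {'B'}}
print(run_homogeneous_nfa(follow, symbols, {'A'}, {'B'}, "ab"))  # → [1]
```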

We now describe the theoretical foundation of the automata processor in relation to bit parallelism. Bit-parallelism and its application to string matching was first introduced by Ricardo Baeza-Yates [21]. The general concept is to encode states as an m-bit word. Typically m is chosen to be a multiple of the machine word size of the CPU executing the bit-parallel algorithm. For this reason bit-parallelism is usually reserved for small m-values. The bit-parallel homogeneous automaton is defined by the 5-tuple 〈2^Q, Σ, Δ, 2^I, 2^F〉. 2^Q is an m-bit word, where each bit position represents a state in Q = {0..m−1}; 2^Q = |_{q∈Q} 2^q, and the | operation, over all Q, is equivalent to a bitwise OR. 2^I = |_{q∈δ(0)} 2^q is an m-bit word, where each bit position represents the set of initial states reachable from the start state q0 = 0. 2^F = |_{q∈F} 2^q is an m-bit word, where each bit position represents a final state. Δ is the transition function such that Δ(2^q, α) is an m-bit word, where each bit represents the state bits reachable from bit position 2^q on symbol α.

The bit-parallel closure on a state q is defined as:

Δ(2^q) = |_{α∈Σ} Δ(2^q, α)

For small m, the transition function for a bit-parallel homogeneous automaton can be implemented efficiently in O(1) time by the bitwise AND (designated as operator &) of two lookups:

Δ(2^q, α) ≡ follow[q] & symbols[α]

where follow[q] = Δ(2^q) and symbols[α] = |_{q∈Q} Δ(2^q, α).


One benefit of the lookup method above is that it inherently supports character classes. The bit-parallel execution model for an input string S is shown below.

 1: 2^C = 2^I
 2: if 2^F & 0x1 ≠ 0 then
 3:   match the empty string
 4: end if
 5: for each input character α in S do
 6:   2^T = 2^C & symbols[α]
 7:   if 2^T & 2^F ≠ 0 then
 8:     we have a match
 9:   end if
10:   set 2^C = 0; ∀q ∈ T, 2^C = 2^C | follow[q]
11:   if 2^C = 0 then
12:     stop processing S
13:   end if
14: end for

Line 2 tests if the start state 2^0 is a member of the final states 2^F. We include this to provide a direct comparison with the traditional NFA; however, we do not do this in hardware since it has no practical value. Lines 6-9 can be executed in O(m/w) time, where w is the machine word size implementing the execution model above. Navarro and Raffinot [22] describe a method, using k tables, to implement line 6 in O(mk/w) time. The size of k, in Navarro’s algorithm, depends on the space complexity of the equivalent DFA, since each table must store O(2^{(m+1)/k}) entries, in the worst case. We have developed, to our knowledge, the first practical method implementing the bit-parallel execution model described above for large m = 48K. The practical method used to achieve this is described in the next section on the Architectural Design.
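For small automata, the bit-parallel model above maps directly onto machine words. The following Python sketch is a hypothetical illustration (Python arbitrary-precision ints stand in for the m-bit words; the example automaton for "ab" is ours) of lines 1-14 with the AND-of-two-lookups transition:

```python
def bitparallel_match(follow, symbols, init, final, s):
    """Bit-parallel simulation of a homogeneous automaton.
    States are bit positions in a Python int (so m is unbounded here).
    follow[q]  : bit mask of states reachable from state q
    symbols[a] : bit mask of states that accept symbol a
    init, final: bit masks 2^I and 2^F
    Returns the input positions at which a match is reported."""
    matches = []
    c = init                            # line 1: 2^C = 2^I
    for i, a in enumerate(s):
        t = c & symbols.get(a, 0)       # line 6: 2^T = 2^C & symbols[a]
        if t & final:                   # line 7: 2^T & 2^F != 0
            matches.append(i)
        c, q, tt = 0, 0, t              # line 10: OR follow[q] for all q in T
        while tt:
            if tt & 1:
                c |= follow[q]
            tt >>= 1
            q += 1
        if c == 0:                      # line 11: no active states remain
            break
    return matches

# Same "ab" automaton: bit 0 accepts 'a'; bit 1 accepts 'b' and is final.
follow = {0: 0b10, 1: 0b00}
symbols = {'a': 0b01, 'b': 0b10}
print(bitparallel_match(follow, symbols, 0b01, 0b10, "ab"))  # → [1]
```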

3 ARCHITECTURAL DESIGN

3.1 A memory-derived architecture

The automata processor is based on an adaptation of memory array architecture, exploiting the inherent bit-parallelism of traditional SDRAM. Conventional SDRAM, organized into a two-dimensional array of rows and columns, accesses a memory cell for any read or write operation using both a row address and a column address. The "row address" for the automata processor is the input symbol. The 8-bit input symbol is decoded (8-to-256 decode) and then provided to the memory array. In place of memory’s column address and decode operation, the automata processor invokes automata operations through its routing matrix structure. The memory array portion of the architecture is illustrated in Figure 1.

The architecture provides the ability to program independent automata into a single silicon device. Each automaton and all automata routing matrix paths run in parallel, operating on the same input symbol simultaneously. Memory arrays are distributed


Fig. 1. Memory Array

throughout the silicon, providing O(1) lookup for an m = 48K-bit memory word. This first implementation, derived from Micron’s DDR3 SDRAM memory array technology, has an 8-bit DDR3 bus interface. It is capable of processing 8-bit input symbols at 1 Gbps per chip.

3.2 The routing matrix

The routing matrix controls the distribution of signals to and from the automata elements, as programmed by the application. The routing matrix is a complex lattice of programmable switches, buffers, routing lines, and cross-point connections. While in an ideal theoretical model of automata every element can potentially be connected to every other element, the actual physical implementation in silicon imposes routing capacity limits related to tradeoffs in clock rate, propagation delay, die size, and power consumption. From our initial design, deeply informed by our experience in memory architectures, we progressively refined our model of connectivity until we were satisfied that the target automata could be compiled to the fabric and that our performance objectives would be met.

The routing matrix is a hierarchical structure of groups of elements, with switches controlling the interconnection between the levels of the hierarchy. This is illustrated in Figure 2.

A number of the various elements are grouped together into rows, rows are grouped into blocks, and blocks are laid out in a grid of block rows and block columns. The routing matrix provides the interconnections at the various levels of this hierarchy: within rows, within blocks, and within the grid of blocks. A summary of these routing matrix signals and their respective functions is given in Table 2.

The transitions between the different signal groups are controlled through the programming of different switch buffers. These switch buffers are bi-directional,


Fig. 2. Routing Matrix

TABLE 2
Routing Matrix Signals

Signal Name    Function
row            Connections to all elements within a given row.
block          Connections to rows within a given block.
column         Interconnections of blocks in one column, in North and South directions.
inter-column   Interconnection of block columns, in East and West directions.

tri-state-able, multiplexing buffers. A given signal can be selectively connected to several different inputs of adjacent levels of the hierarchy.

The maximum number of transitions from a state at maximum clock frequency is 2304, corresponding to 256 states in each of 9 blocks (a block and its eight compass-point neighbors). A larger fan-out is achievable at a slower clock rate, albeit at the cost of reduced performance. While any state can have the maximum number of transitions, only 24 states total out of the 256 in a block can have that many transitions. The potential fan-out from any state is otherwise 16. Additional information regarding fan-out, fan-in, and other routing complexities may be found in the paper’s supplementary material available at http://ieeexplore.ieee.org. The implemented selective and programmable connectivity has been sized and load-tested to have sufficient connectivity for representative classes of target automata and in practice has proven reasonably robust. The design does not preclude, of course, routing congestion issues or simply inability to route, especially with highly connected graph-type automata.

The routing matrix controls signal routing while the elements implement the various logical functions needed to compute the automata. Programming the routing matrix and the elements actually implements the desired automata.

3.3 The automata processor elements

The various functional components, called elements, are woven into the hierarchy of the routing matrix. All elements share four features:
1) Each receives sixteen inputs from the routing matrix
2) Each performs a function within an automaton
3) Each is programmable
4) Each originates (drives) one signal back into the routing matrix

The number of sixteen inputs arises as a consequence of the design of the routing matrix, described above, including consideration of physical feasibility and experimentation with target automata. The layout of automata may be modified if the number of inputs exceeds 16 during compilation. We transfer the in-degree congestion to out-degree by state splitting. Any state can be split into two where the inputs are partitioned between the split pair.

The state transition element is at the heart of the design and the element with the highest population density, having a one-to-one correspondence to the state bits of the bit-parallel automaton model described in the previous section. Counter and boolean elements are used with state transition elements to increase the space efficiency of automata implementations, as a design convenience, and to extend computational capabilities beyond NFA.

3.3.1 The state transition element

The state transition element consists of the current state memory and the next state decoder. In terms of classic hardware state machine design it can be associated with the next state transition function. State transition elements are implemented as a memory array with control and computational logic. Each column of the memory array includes logic that contains a single state bit, enable inputs, and an output decoder/driver. The output is determined by the logical AND of the state bit and the output of the associated column of memory. Each state bit is either set (1) or reset (0), depending on whether that state transition element is in an active or inactive state. A conceptual model is shown in Figure 3.

The height of this memory column is determined by the number of bits (n) in the input symbol. For example, for a byte-wide (8-bit) input symbol, each state transition element must have a column of 2^8 = 256 bits of memory.

State bits are pre-loaded prior to processing any input symbols. This makes it possible for the initial state of every state transition element to be either active or inactive. Any state transition element and any number of state transition elements may therefore



Fig. 3. State Transition Element Memory Array

be a start state. This capability allows independent automata to each have their own start state and also permits individual automata to have multiple start states. This allows additional flexibility in the design compared to conventional automata, which are limited to a single start state. For example, εNFAs can be implemented directly with ease using the automata processor. The εNFA on the left side in Figure 4, with start state s_i (indicated by the triangle) and an epsilon transition from s_i to s_j, can be realized directly in the automata processor by also making s_j a start-enabled state transition element, as shown on the right side of Figure 4.

Fig. 4. Epsilon NFA

The 256 bits of symbol recognition memory can be thought of as a single column of 256 rows of memory. The input symbol is analogous to a memory row address, which is decoded (an 8-to-256 decode) to select one of the 256 memory cells of the symbol recognition memory. If the selected bit was programmed to recognize the input symbol, the symbol recognition memory outputs a 1. If it is not programmed to recognize the input symbol, the symbol recognition memory outputs a 0. An important consequence of this design is that it allows any subset of all possible 8-bit symbols (2^256 combinations) to be programmed to match. This provides the ability to handle full character classes in every state transition element.

A simple example of a state transition element recognizing a character class is shown in Figure 5. The character class in the example is any value which is not an ASCII upper or lower case letter or a numeric character. The state transition element on the left is start-enabled, receiving every input byte, and reports on a match (the former indicated by the ∞ symbol in the upper left and the latter by the R in the lower left). This single state transition element could be a complete machine for reporting when unexpected values are found in an input stream. The encoding of the character class is shown on the right side of Figure 5. Each of the 256 bits which is not 0-9a-zA-Z is set. Each input symbol is decoded from 8 to 256 bits and compared against all set character class bits to determine if there is a match.

[The 256-bit encoding: the bits for 0-9, A-Z, and a-z are cleared; all other bits are set.]

Fig. 5. Character Class
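The symbol-recognition column lends itself to a direct software model: a 256-bit word with one bit per 8-bit symbol. The following Python sketch is a hypothetical analogue of the Fig. 5 class (function names are ours):

```python
def make_char_class(predicate):
    """Program the 256-bit symbol-recognition column: bit i is set
    iff the 8-bit symbol i belongs to the character class."""
    bits = 0
    for i in range(256):
        if predicate(i):
            bits |= 1 << i
    return bits

def ste_output(char_class, symbol):
    """One STE lookup: the decoded input symbol selects a row;
    the programmed bit (1 or 0) is the recognition output."""
    return (char_class >> symbol) & 1

# The class from Fig. 5: any byte that is NOT 0-9, A-Z, or a-z.
not_alnum = make_char_class(
    lambda i: not (0x30 <= i <= 0x39 or 0x41 <= i <= 0x5A or 0x61 <= i <= 0x7A))
print(ste_output(not_alnum, ord('#')), ste_output(not_alnum, ord('7')))  # → 1 0
```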

3.3.2 The counter element

The counter element is a 12-bit binary counter. Every time one of the count enable signals is asserted, the counter counts by one. When the count equals the target value, an output event is triggered.

The counter also implements several features that provide greater flexibility in designing various automata. These features are:
1) the ability to cascade up to four counters to achieve up to a 48-bit counter
2) output and reload mode control (pulse, continuous, or pulse-and-reload)
3) an over-riding asynchronous reset function
4) the ability to choose different row signal inputs for both the count and reset functions

Counters can be an efficient way to count sub-expressions, such as those that occur routinely in applications using regular expressions with quantifications. Figure 6 shows the use of two counters to implement a range quantification. The automaton implements the regular expression /#[0-9]{500,999}#/. The state transition element on the left receives every input symbol (indicated by the ∞ symbol in the upper left). If it recognizes a '#' symbol, recognition of digits is begun by activation of the state transition element to its left. This self-looping state transition element remains asserted as long as digits continue to be recognized. This element also activates the two counters as well as the state transition element below it. The upper of the two counters counts the minimum range


value of 500 and the lower counts the maximum range value of 999. If a non-digit is seen in the input stream, the lower state transition element will reset both counters (indicated by the connection to the R terminal on the right side of the counters) and recognition of the current input sequence will terminate. Each time a digit is received, the count advances in both counters (indicated by the connection to the C terminal on the upper side of the counters). When the upper counter counts 500, it will continuously activate (indicated by the clock edge symbol on the right side of the counter) the following state transition element. This element will report recognition of the sequence (indicated by the R in the lower left corner) if a terminating # is received. The lower counter enforces the maximum range value by resetting the upper counter once the maximum range value is exceeded, causing activation of the final state transition element to cease and with it the ability for the element to report on receiving the terminating #.
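The behavior of the two-counter construction for /#[0-9]{500,999}#/ can be mimicked in software with a single digit count checked against both bounds. This Python sketch is a hypothetical analogue of the automaton's externally visible behavior, not of the hardware's two cascading counters:

```python
def match_hash_digits_hash(s, lo=500, hi=999):
    """Report end positions of '#' followed by between lo and hi
    digits followed by '#' (the range quantification of Fig. 6)."""
    matches = []
    count = None                     # None: no '#' seen / sequence reset
    for i, ch in enumerate(s):
        if ch == '#':
            if count is not None and lo <= count <= hi:
                matches.append(i)    # terminating '#' within the range
            count = 0                # this '#' may also start a sequence
        elif ch.isdigit():
            if count is not None:
                count += 1           # both hardware counters advance here
        else:
            count = None             # a non-digit resets the counters
    return matches

# '#' followed by exactly 500 digits and a terminating '#':
print(match_hash_digits_hash("#" + "5" * 500 + "#"))  # → [501]
```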

Fig. 6. Counter Element Example

Counters are also a type of restricted memory device, enabling creation of non-deterministic counter automata. Counter automata can implement proper subsets of pushdown automata and, therefore, some context-free languages [23] and have been theoretically demonstrated to even be capable of implementing Turing machines [24]. We have been able to construct many practical automata using counters as a scratchpad within NFA. A simple example is the use of the counter to prune paths and prevent looping through cycles, as shown in Figure 7. The three connected state transition elements in the example form a cycle. While the example is designed to be self-contained, the principle can be extended to any graph with multiple cycles. The left-most state transition element is the start of the cycle, receiving the first input symbol (indicated by the 1 symbol in the upper left). The counter prevents the cycle from being traversed more than once. It is configured as a one-shot, that is, it is set to count to 1. The first time it counts it activates the next state transition element, and, without a reset, stays dormant (this is the behavior of pulse mode, the setting indicated by the clock edge symbol on the right side of the counter). Subsequent activations will not result in an output activation to the following state transition element. This behavior allows the cycle to be traversed only once.

Fig. 7. Pruning with Counters Example

3.3.3 The boolean element

The boolean element is a function-programmable combinatorial element with an optional synchronized enable. The boolean element can be programmed to perform the following logical functions: OR, AND, NAND, NOR, sum of products, and products of sums. Booleans have no state (and do not use memory), unlike the state transition and counter elements.

Boolean elements are routinely used to simplify designs where a combination of the results of sub-expressions is required.

The synchronized enable is triggered by a special signal propagated simultaneously to all boolean elements. This signal is controlled by instructions to the automata processor and can occur at any desired position(s) in the symbol input data set, including at the end of the symbol set. When the synchronized enable is used, booleans introduce positional dependence to the operation of the automata, since the boolean only computes its combinatorial function when the special signal is asserted. One use of this feature is to implement a construct commonly used in regular expressions: evaluation only on end-of-data (right-anchored expressions). More general usage with automata includes gating data into chunks sized for specific automata and reduction or aggregation of results. This capability of the boolean element introduces a feature beyond the formal definition of NFA, adding a dynamic aspect to automata processing.

An example of the boolean element combining sub-expressions with use of the synchronized enable is illustrated in Figure 8. The automaton reports, at assertion of the synchronized enable, whether the input stream up to that point contained both a lower-case ASCII letter and an ASCII digit. The state transition elements on the left each receive all input symbols (indicated by the ∞ symbol in the upper left corner), the top one checking for the letter, the bottom for the digit. Each state transition element is latched (indicated by the clock edge symbol at the bottom of the state transition element symbol), meaning that if it matches it will continue to assert until reset. The two state transition elements are combined in an AND, effective on assertion of the synchronized enable (indicated by the E symbol in the upper right corner). If the AND generates a high value, it will report that at least one letter and one digit were seen in the input stream (indicated by the R in the lower left corner).
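The Figure 8 automaton can be sketched in software as follows. This is a hedged model under the semantics described above: two latched flags stand in for the latched state transition elements, and the AND is evaluated once, at the point where the synchronized enable would be asserted (here, end of stream).

```python
def report_at_enable(stream: str) -> bool:
    """Model of Fig. 8: latched letter/digit detectors ANDed at the enable."""
    saw_letter = saw_digit = False   # latched STEs: stick once matched
    for ch in stream:
        saw_letter |= ch.islower()   # top STE: lower-case ASCII letter
        saw_digit |= ch.isdigit()    # bottom STE: ASCII digit
    return saw_letter and saw_digit  # AND sampled at the synchronized enable
```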

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.


Fig. 8. Boolean Element Example

3.4 Reconfigurability

The automata processor is a programmable and reconfigurable device. The element arrays are fixed, but the operations of individual elements and the connections between them are programmable. The automata processor is also partially dynamically reconfigurable, as the operation of elements can be reprogrammed during runtime. The connections between elements are optimized for resource utilization and require placement and routing analysis, done in a more time-consuming compilation phase. However, once place-and-route has been done for an automaton, that structure may be incrementally loaded dynamically alongside existing automata to a chip with unused capacity.

3.5 Intra-rank bus: scaling performance and capacity

The automata processor architecture includes an intra-rank bus. It enables symbols to be distributed to a connected set (rank) of automata processor chips, allowing expansion of both the capacity and the performance of automata processor systems. The intra-rank bus allows a range of configurations. For example, in a 64-bit system that has eight automata processor chips in a rank, the intra-rank bus allows configurations with up to eight times the automata capacity or eight times the processing throughput of a single chip.
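The capacity/throughput trade-off can be illustrated with a small sketch. This is an assumption-laden model, not a configuration API: it assumes each chip contributes unit capacity at 1 Gbps, that chips within a group share one input stream (pooling automata capacity), and that groups process independent streams in parallel.

```python
def rank_config(chips: int = 8, groups: int = 1,
                per_chip_gbps: float = 1.0,
                per_chip_capacity: float = 1.0):
    """Hypothetical model of intra-rank configurations for a rank of chips.

    Returns (automata capacity per input stream, aggregate throughput).
    """
    assert chips % groups == 0, "groups must evenly divide the rank"
    capacity_per_stream = (chips // groups) * per_chip_capacity
    throughput = groups * per_chip_gbps
    return capacity_per_stream, throughput
```

One group of eight chips yields eight times the capacity on a single stream; eight groups of one chip yield eight independent streams and eight times the aggregate throughput.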

4 EVALUATION

At the time of this writing, a chip design for the architecture has been completed in DRAM process technology and is currently in fabrication. An SDK has been developed which supports configuration of the chip with automata designs and runtime control of sets of automata processors. Evaluation of the architecture is limited to the results from compilation of automata and simulation.

The automata processor may be configured with automata using either a list of regular expressions in PCRE syntax or a direct description of the automata in an XML-based high-level language that we have created, called the Automata Network Markup Language (ANML). Evaluation of the architecture is discussed for each type of input, PCRE and ANML.

4.1 Configuration by PCRE

The automata processor has been designed to have a high degree of compatibility with PCRE. Features of PCRE which exceed the computational power of regular languages, such as lookahead and lookbehind assertions and backreferences, cannot be implemented using NFA alone (see [14]). While counters raise the computational power of the automata processor beyond pure NFA, PCRE expressions using the aforementioned constructs must still be postprocessed in software. The software imposes a small number of restrictions on these constructs in order to ensure that effective use is made of the hardware. This design allows the automata processor to represent PCREs of any complexity as compact and efficient automata. Figure 9 provides an example of a complex rule, taken from Snort [25], and shows how it is converted to an automaton that runs directly on the automata processor. This particular rule is designed to capture a buffer overflow attack on an Apache web server. A video showing operation of the automaton in a visual simulation tool has been included in the paper's supplementary material available at http://ieeexplore.ieee.org.
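As a sanity check on the Figure 9 rule, the pattern behaves as follows under Python's re engine (a stand-in for PCRE here; \x2e is a literal '.' and \x29 a literal ')'). The test strings are illustrative inputs, not Snort traffic.

```python
import re

# The Snort PICS-version rule shown in Figure 9: literal text, then 5-8
# digits or a digit followed by 10+ ".digit" repetitions, then ')'.
rule = re.compile(r'PICS-version\s+(\d{5,8}|\d(\x2e\d){10,})\s*\x29\s+')

# The 5-8 digit branch matches; a 4-digit run does not.
hit = rule.search('PICS-version 123456) ')
miss = rule.search('PICS-version 1234) ')
```

The unbounded quantification branch accepts arbitrarily long dotted version strings, which is what the automaton's cyclic structure implements.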

Fig. 9. Snort PCRE Rule as Automata. The rule is PICS-version\s+(\d{5,8}|\d(\x2e\d){10,})\s*\x29\s+, annotated in the figure with its constituent constructs: literal text, one or more "whitespace" characters, digit character classes, open/close groups, bounded quantification (5 to 8 inclusive), alternation, zero-or-more repetition, unbounded quantification, and ASCII hex codes.

We have run compilation and simulation tests from [26], in which Becchi and Yu assembled large and complex rule sets designed to test the capabilities of a wide range of regular expression engines. The tests were extracted from Snort and modified to increase the number of patterns including character classes and to increase the percentage of patterns using unbounded repetitions of wildcards. We report the number of regular expressions in the dataset, the number of NFA states needed to implement the datasets evaluated by Becchi and Yu [26], the number of state transition


elements used after configuration of the chip by our SDK's compiler, and the percentage of chip capacity that this represents (Table 3).

TABLE 3
PCRE Compilation Results

Ruleset       Description                          Num. of   NFA      STEs    % chip
                                                   Regex     States   Used    used
Backdoor      real                                 226       4.3k     4.5k    13
Spyware       real                                 462       7.7k     7.8k    23
EM            only exact match patterns            1k        28.7k    29.7k   78
Range.5       50% of patterns have char-classes    1k        28.5k    29.3k   78
Range1        100% of patterns have char-classes   1k        29.6k    30.4k   80
Dotstar.0.5   5% of patterns have .*               1k        29.1k    30.0k   77
Dotstar.1     10% of patterns have .*              1k        29.2k    30.0k   77
Dotstar.2     20% of patterns have .*              1k        28.7k    29.6k   77
Dotstar.3     30% of patterns have .*              1k        unavail  29.3k   76

The results show that the usage of state transition elements corresponds nearly 1-to-1 with the number of NFA states and that resource utilization does not grow with expression complexity. All rule sets will fit in a single automata processor chip and will compute results at exactly 1 Gbps per chip. A rank of 8 chips configured as 8 groups would run at 8 Gbps, and further scaling can be obtained by multiplying ranks.

4.2 Configuration by ANML

Our architecture, when configured through ANML, provides, as far as we know, the first hardware implementation to allow direct programming of automata structures. While an extended discussion of ANML applications is beyond the scope of this paper, we provide here two abbreviated examples which show how ANML can be used for problems that cannot be readily formulated using regular expressions. Videos showing operation of both automata in a visual simulation tool have been included in the paper's supplementary material available at http://ieeexplore.ieee.org.

Our first abbreviated example reports when a sequence contains one or more a symbols, followed by one or more b symbols, followed by one or more c symbols, and the total number of symbols in the sequence is equal to 17. A regex solution requires that every possible combination of symbols a, b and c be enumerated. This problem can be solved directly with the following simple ANML machine (fig. 10).

Fig. 10. ANML Pattern Match Example

The starting element (with the all-input ∞ symbol) will begin counting once an a is received. The upper three state transition elements recognize one or more of symbol a, followed by one or more of symbol b, followed by one or more of symbol c, and each time an a, b or c is recognized the count advances. If something other than an a, b, or c is received, the state transition elements implementing the negated character classes [^ab], [^bc] and [^c] will reset the counter, restarting the sequence. When 16 of a, b or c have been counted, the state transition element connected to the counter output is activated. If the seventeenth symbol is a c, both the lower and upper state transition elements will input into the AND element, causing the AND to report that the input sequence conforms to the pattern.
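A hedged software analogue of the Figure 10 automaton makes the division of labor explicit: a regex checks the a+b+c+ shape, while a length check stands in for the hardware counter. A pure-regex solution would instead have to enumerate every split (i, j, k) with i + j + k = 17.

```python
import re

def matches(seq: str) -> bool:
    """Model of Fig. 10: a+ b+ c+ with total length exactly 17."""
    # re.fullmatch checks the shape; len() plays the role of the counter.
    return len(seq) == 17 and re.fullmatch(r'a+b+c+', seq) is not None
```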

The second example illustrates how graph data structures can be implemented directly in ANML, using the parallel evaluation and activation of automata processor elements to naturally perform a breadth-first descent of the tree. The graphic representation of a small ANML tree is shown in Figure 11, where each square represents a tree node consisting of several state transition elements which have the task of reporting when a 4-byte search key matches the 4-byte value stored in the node. The search key is broadcast n times, where n is the depth of the tree. Each broadcast takes 4 symbol cycles. In the worst case, all nodes matching the search key will be identified in 4n symbol cycles.
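The breadth-first descent can be sketched in software. This is a hypothetical model (the dict-based tree and function name are illustrative, not part of ANML): each iteration of the loop corresponds to one broadcast of the key, during which every node at that level compares against the key simultaneously in hardware.

```python
def find_matches(root: dict, key: bytes):
    """Model of Fig. 11: level-by-level key broadcast over a tree.

    Returns (matching nodes, worst-case symbol cycles = 4 * depth).
    """
    hits, level, depth = [], [root], 0
    while level:
        depth += 1                    # one broadcast: 4 symbol cycles
        nxt = []
        for node in level:            # all nodes at this level compare in parallel
            if node['value'] == key:
                hits.append(node)
            nxt.extend(node.get('children', []))
        level = nxt
    return hits, 4 * depth

tree = {'value': b'\x00\x00\x00\x01',
        'children': [{'value': b'\x00\x00\x00\xa3'},
                     {'value': b'\x00\x00\x00\xf8'}]}
```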

Fig. 11. ANML Graph Example. The figure shows a tree of Head, Node, and Leaf macros connected through I prev and O next ports; the leaves hold 4-byte values such as A3, F8, AA, B3, 0A, C7, 12 and 72.

The contents of an internal tree node are shown in Figure 12. The small pentagonal shapes labeled I at the top and O at the bottom are macro ports, i.e., connection points into and out of the macro structure used to encapsulate a node. The state transition


elements on the right side form a chain which will report in the last element if the search key matches the node's value. If one or more byte values do not match, activation will follow the chain in the middle or the chain on the left, and the right-most chain will not report.

Fig. 12. ANML Graph Internal Node

5 RELATED WORK

The first effort to create a direct hardware implementation of automata (NFA) goes back to Floyd and Ullman in 1982 [27], and this has been an active area of research since then. All recent work that we are aware of has involved the use of regular expressions as the means for expressing automata and implementation as NFA in high-capacity FPGAs. Becchi [19] implemented a multi-stride (simultaneous input bytes) NFA compiled from Snort [25] pattern sets on Xilinx Virtex-4 and Virtex-5 FPGAs, obtaining between 2 and 7 Gbps of throughput. She estimated that about 1000 complex regular expressions implemented as multi-stride NFA could be deployed on a Virtex-5 XC5VLX50 device. Nakahara et al. [18] implemented a modular NFA (MNFA) on a Virtex-6, reporting system throughput of 3.2 Gbps on a compiled subset of Snort regular expressions. Yang and Prasanna [15] reported obtaining up to 11 Gbps on a Virtex-5 using various regex sets taken from Snort, implementing a compiled RE-NFA structure with multi-stride capability and enhanced character class processing. They reported being unable to fit the entire regex portion of Snort, consisting of 2630 expressions, into a single FPGA, and broke their test runs into six rule sets. Kaneta et al. [16] developed a dynamically reconfigurable NFA on a Virtex-5, running regexes in different complexity classes at 2.9 and 1.6 Gbps for simple regexes and 0.5 Gbps for complex regexes. The major potential advantage of the design is that universal control logic is compiled into the FPGA and specific regexes are run by modifying memory, eliminating the cost of regex-specific compilation. Kaneta's design, however, is restricted in the complexity of regexes that can be implemented with this method. Wang et al. [17]

created a counter-based NFA design that solves complex processing of character classes and developed a methodology for recognizing when overlapping character classes are inherently collision-free. The design is capable of updating regexes through memory writes. A maximum system throughput of 2.57 Gbps was reported, but this was dependent on the complexity of the regex, and the capacity of the FPGA only allowed for testing of partial rule sets derived from Snort.

A detailed comparison of the automata processor to the related works cited above is presented in the paper's supplementary material available at http://ieeexplore.ieee.org. The primary relevance of this prior work is that it has established NFA as the most efficient automata implementation for hardware. Our device is the only direct semiconductor architecture purpose-built for automata processing, as far as we know, and targets a different type of domain than high-capacity FPGAs relative to cost, footprint, power consumption, and system integration. Much of the existing research has focused on adaptation of the building blocks of FPGAs, primarily LUTs and block memory, to regular expression-specific automata processing. Our architecture, designed to provide native support for PCRE, has addressed the challenges described in the prior work (for example, in character class handling, as extensively treated in [17]). It is not, however, regular expression specific. The architecture allows for the concise implementation of NFA difficult or impossible to formulate with regular expressions, and also allows creation of pushdown automata and dynamic control of automata at runtime. Partial dynamic modification of automata and incremental addition of automata at runtime are also supported. Other major differentiators with prior work include the ability to interrupt and resume input streams with a mechanism for saving and restoring machine state, and the ability to distribute input data within a rank of automata processors to increase system scale.

6 CONCLUSION

The automata processor is, to our knowledge, the first semiconductor architecture of its kind: a non-von Neumann architecture designed for efficient and scalable automata processing. We have shown that the chip matches or exceeds the performance of high-capacity FPGAs programmed with comparable functionality while providing greater logic capacity and offering superior capabilities in incremental and dynamic updates, the ability to interrupt and resume processing input streams, performance scaling by grouping processors into ranks, and expressiveness and computational power beyond that of regular expressions. Routine silicon process shrinks, by moving to a new process node, bring significant increases in capacity and higher throughput, even without improvements in the architecture. We expect, however,


in addition to a process shrink, many improvements to the architecture which will improve speed, routing capacity, and output capacity, and increase the computational power of the automata elements. We believe that our architecture also implies a new parallel programming paradigm, which we have begun to see evolving from ANML and the tools created for automata processor development.

REFERENCES

[1] A. V. Aho and M. J. Corasick, "Efficient string matching: an aid to bibliographic search," Communications of the ACM, vol. 18, no. 6, pp. 333–340, 1975.

[2] M. Becchi, C. Wiseman, and P. Crowley, "Evaluating regular expression matching engines on network and general purpose processors," in Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems. ACM, 2009, pp. 30–39.

[3] M. Gongora-Blandon and M. Vargas-Lombardo, "State of the art for string analysis and pattern search using CPU and GPU based programming," J. Information Security, vol. 3, no. 4, pp. 314–318, 2012.

[4] Y. Zu, M. Yang, Z. Xu, L. Wang, X. Tian, K. Peng, and Q. Dong, "GPU-based NFA implementation for memory efficient high speed regular expression matching," ACM SIGPLAN Notices, vol. 47, no. 8, pp. 129–140, 2012.

[5] G. Vasiliadis, M. Polychronakis, and S. Ioannidis, "Parallelization and characterization of pattern matching using GPUs," in Workload Characterization (IISWC), 2011 IEEE International Symposium on. IEEE, 2011, pp. 216–225.

[6] N. Cascarano, P. Rolando, F. Risso, and R. Sisto, "iNFAnt: NFA pattern matching on GPGPU devices," ACM SIGCOMM Computer Communication Review, vol. 40, no. 5, pp. 20–26, 2010.

[7] S. Pu, C.-C. Tan, and J.-C. Liu, "SA2PX: A tool to translate SpamAssassin regular expression rules to POSIX," in Proceedings of the Sixth Conference on Email and Anti-Spam, 2009.

[8] R. Baeza-Yates and G. Navarro, "Faster approximate string matching," Algorithmica, vol. 23, no. 2, pp. 127–158, 1999.

[9] Y.-H. Yang and V. K. Prasanna, "Space-time tradeoff in regular expression matching with semi-deterministic finite automata," in INFOCOM, 2011 Proceedings IEEE. IEEE, 2011, pp. 1853–1861.

[10] M. Becchi, M. Franklin, and P. Crowley, "A workload for evaluating deep packet inspection architectures," in Workload Characterization, 2008. IISWC 2008. IEEE International Symposium on. IEEE, 2008, pp. 79–89.

[11] M. Becchi and P. Crowley, "A hybrid finite automaton for practical deep packet inspection," in Proceedings of the 2007 ACM CoNEXT Conference, ser. CoNEXT '07. New York, NY, USA: ACM, 2007, pp. 1:1–1:12. [Online]. Available: http://doi.acm.org/10.1145/1364654.1364656

[12] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner, "Algorithms to accelerate multiple regular expressions matching for deep packet inspection," in ACM SIGCOMM Computer Communication Review, vol. 36, no. 4. ACM, 2006, pp. 339–350.

[13] M. Becchi and P. Crowley, "An improved algorithm to accelerate regular expression evaluation," in Proceedings of the 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems. ACM, 2007, pp. 145–154.

[14] J. E. Hopcroft, R. Motwani, and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation, 2nd ed., 2001.

[15] Y.-H. E. Yang and V. K. Prasanna, "High-performance and compact architecture for regular expression matching on FPGA," IEEE Trans. Computers, vol. 61, no. 7, pp. 1013–1025, 2012.

[16] Y. Kaneta, S. Yoshizawa, S.-i. Minato, and H. Arimura, "High-speed string and regular expression matching on FPGA."

[17] H. Wang, S. Pu, G. Knezek, and J. Liu, "Min-max: A counter-based algorithm for regular expression matching," 2013.

[18] H. Nakahara, T. Sasao, and M. Matsuura, "A regular expression matching circuit based on a modular non-deterministic finite automaton with multi-character transition," in The 16th Workshop on Synthesis and System Integration of Mixed Information Technologies, 2010, pp. 359–364.

[19] M. Becchi, "Data structures, algorithms and architectures for efficient regular expression evaluation," Ph.D. dissertation, Department of Computer Science and Engineering, Washington University, 2009.

[20] V. M. Glushkov, "The abstract theory of automata," Russian Mathematical Surveys, vol. 16, no. 5, pp. 1–53, 1961.

[21] R. A. Baeza-Yates, "Efficient text searching," Ph.D. dissertation, University of Waterloo, 1989.

[22] G. Navarro and M. Raffinot, "New techniques for regular expression searching," Algorithmica, vol. 41, no. 2, pp. 89–116, 2005.

[23] P. C. Fischer, "Turing machines with restricted memory access," Information and Control, vol. 9, no. 4, pp. 364–379, 1966.

[24] M. L. Minsky, "Recursive unsolvability of Post's problem of 'tag' and other topics in theory of Turing machines," The Annals of Mathematics, vol. 74, no. 3, pp. 437–455, 1961.

[25] "Snort," http://www.snort.org.

[26] X. Yu and M. Becchi, "GPU acceleration of regular expression matching for large datasets: exploring the implementation space," in Proceedings of the ACM International Conference on Computing Frontiers. ACM, 2013, p. 18.

[27] R. W. Floyd and J. D. Ullman, "The compilation of regular expressions into integrated circuits," Journal of the ACM (JACM), vol. 29, no. 3, pp. 603–622, 1982.

Paul Dlugosch received a B.S. degree in electrical and electronics engineering from North Dakota State University, Fargo, ND in 1989. He currently works at Micron Technology as Director of Automata Processor Technology Development. He has held previous positions in ASIC development and product engineering at Rosemount Industrial and Hewlett Packard. He has been awarded several patents in the field of system and semiconductor architectures. His current interests include automata-based computing applications, hybrid logic memory architectures and cognitive processing techniques using artificial neural networks and systolic array based architectures.

Dave Brown received BS and MS degrees in electrical engineering from Clarkson University in 1988 and 1990, respectively. He currently works at Micron Technology in Allen, TX, as the lead chip designer for the automata processor. He has been lead designer on many DRAM devices including SDRAM, mobile DDR, RDRAM and RLDRAM.

Paul Glendenning received a B.E. degree in engineering from James Cook University (Australia) in 1986. He joined Micron Technology in 2011 and is currently working in the DRAM Solutions Group as the Software Development Manager and Chief Scientist, Automata Processor Technology Development. He is responsible for developing the software tool chain to support the automata processor and for modeling next generation automata processor architectures.


Michael Leventhal received his BS-EECS degree from U.C. Berkeley in 1994. As a member of the Automata Processor Technology Development team at Micron he has been responsible for the creation of the ANML language used to program the automata processor, developer's tools, and investigation of automata processor applications. Prior to joining Micron he worked in the area of FPGA-based accelerators and was deeply involved in the creation of XML and the development of many internet-based software applications built around the use of XML. His current interests include automata theory, automata-based computing applications, and bioinformatics.

Harold Noyes received his BS degree in electrical engineering from Utah State University in 1981. He is the Senior Architect (Hardware), Automata Processor Technology Development, at Micron Technology. Prior to joining Micron, he worked in various research and development roles for Hewlett-Packard Company and contract engineering for Freescale Semiconductor. His experience encompasses engineering and project management, including printed circuit assembly design, electromagnetic and regulatory compliance engineering, modem design, full-custom silicon SRAM design, ASIC design, and technical writing, for portable computers and LaserJet printers. He is a member of the IEEE and Tau Beta Pi.
