
COGNISERVE: HETEROGENEOUS SERVER ARCHITECTURE FOR LARGE-SCALE RECOGNITION

AS SMART MOBILE DEVICES BECOME PERVASIVE, VENDORS ARE OFFERING RICH FEATURES SUPPORTED BY CLOUD-BASED SERVERS TO ENHANCE THE USER EXPERIENCE. SUCH SERVERS IMPLEMENT LARGE-SCALE COMPUTING ENVIRONMENTS, WHERE TARGET DATA IS COMPARED TO A MASSIVE PRELOADED DATABASE. COGNISERVE IS A HIGHLY EFFICIENT RECOGNITION SERVER FOR LARGE-SCALE RECOGNITION THAT EMPLOYS A HETEROGENEOUS ARCHITECTURE TO PROVIDE LOW-POWER, HIGH-THROUGHPUT CORES, ALONG WITH APPLICATION-SPECIFIC ACCELERATORS.

Ravi Iyer, Sadagopan Srinivasan, Omesh Tickoo, Zhen Fang, Ramesh Illikkal, Steven Zhang, Vineet Chadha, and Paul M. Stillwell Jr., Intel Labs
Seung Eun Lee, Seoul National University of Science and Technology

Over the past decade, Web search has become an important large-scale workload for cloud servers, making it necessary for architects to rethink server architectures from power, cost, and performance perspectives. [1] In the next decade, we expect recognition (of images, speech, gestures, and text) to engender the new wave of features supported on smart mobile devices. For example, mobile augmented reality (MAR) is an upcoming application that lets users point their handheld cameras at an object and, with the help of a cloud server, recognize that object and overlay relevant metadata on it. [2] Similarly, along with the widespread use of handheld devices, there has been a resurgence of interest in enabling users to initiate searches and dictation by speaking as opposed to typing. This also requires the device to work with the cloud server to perform speech recognition efficiently and in real time. Typical operation at the server involves extracting the most relevant information from target data (image, voice, and text) and mathematically comparing it to a very large database of similar information precomputed from possible matches. The mathematical operations involved can scale to very large databases, depending on the range of inputs supported and the accuracy required. However, providing such recognition services at a large scale on cloud servers requires understanding these workloads' characteristics and designing a server architecture that is highly efficient in terms of power, throughput, and response time.

In this article, we describe the key characteristics of two open-source recognition workloads: image recognition based on Speeded-Up Robust Features (SURF) [3] and OpenCV, and speech recognition based on Sphinx (http://sourceforge.net/projects/cmusphinx/files). On the basis of these characteristics, we explore a recognition server

design (CogniServe) that employs a heterogeneous architecture with the following key features:

• several small cores for low power and high throughput,

• application-specific recognition accelerators to improve response latency and energy efficiency, and

• architectural support for general-purpose programming and efficient communication between cores and accelerators.

Our small cores are based on Intel's Atom core, which is already a very low-power core. To demonstrate acceleration of some of the key hot spots in image and speech recognition, we designed and implemented three hardware accelerators: Gaussian mixture model (GMM), match, and interest point detection (IPD).

From a programming-model and communications perspective, accelerators have traditionally been treated as devices in the platform. The programming model is typically based on a clear demarcation of general-purpose core functionality versus special-purpose device functionality. General-purpose user applications have relied on mechanisms such as system calls to isolate the user-space processes from directly controlling the hardware. This model is well suited for I/O devices and coarse-grained accelerators with execution times far higher than the interface overheads. However, for efficient offloading to fine-grained accelerators in heterogeneous recognition servers, a more efficient, intuitive programming interface is needed.

Our proposed CogniServe programming model allows for offloading functions to accelerators without incurring significant overhead in system calls and data copies. The architectural support we provide lets cores and accelerators operate in the same virtual-memory domain. We also enable a direct user interface to accelerators via new instructions. By introducing this architectural support, we can provide a general-purpose programming model for accelerators. Moreover, reducing overhead allows offloading at a far lower granularity than has been possible thus far. In this article, we restrict our description of the programming model to such fine-grained accelerators. However, it's possible to use this same architecture with different accelerators for different target segments in large-scale computing.

Recognition workloads and characteristics

As users interact more with devices (smartphones, smart displays or TVs, and so on), it's possible to significantly enhance their personalized experience by enabling more natural modes of input and output than with a traditional keyboard and mouse, and by enabling new usage models to provide information customized to the user context. Recognition applications are emerging in both of these areas:

• Various applications recognize speech and gestures as a new, more natural mode of input and output. For example, some applications use speech recognition to make voice calls, dictate notes, or transcribe data. Others use gesture recognition to replace remote controls for smart TVs. Still others are used to enable interactive gaming.

• Some applications recognize objects and text in images and video to enrich context and experience for users by providing additional information—for example, mobile augmented reality.

In this article, we use two of these emerging recognition applications: object recognition in images and speech recognition. We present execution time and hot-spot characteristics for these workloads; these characteristics highlight the value of our recognition server architecture.

Image recognition characteristics

Object recognition in images can be used in various applications. Here, we focus on MAR, a usage model that's rapidly emerging on smartphones and other mobile devices. [2] Figure 1a shows the basic flow of MAR image recognition, which begins with a query image that the user takes with the camera. The intent is to compare this query image against a set of preexisting images in a database for a potential match. This comparison involves three major steps:

1. Interest-point detection. Identify interest points in the query image to uniquely characterize the object.

2. Descriptor generation. Create descriptor vectors for these identified interest points.

3. Descriptor matching. Compare descriptor vectors of the query image against descriptor vectors of stored images in a database (narrowed down by the user context—that is, GPS location or other information).
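In code, this three-step flow looks roughly like the sketch below. The stage functions are declarations standing in for the SURF/OpenCV implementations, and all names here are illustrative rather than the actual API:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Image { int width, height; std::vector<std::uint8_t> pixels; };
struct InterestPoint { float x, y, scale; };
struct Descriptor { float v[64]; };  // SURF descriptors are 64-dimensional

// Placeholders for the three stages (implemented by SURF/OpenCV).
std::vector<InterestPoint> detect_interest_points(const Image& img);
std::vector<Descriptor> generate_descriptors(const Image& img,
                                             const std::vector<InterestPoint>& ips);
int count_matches(const std::vector<Descriptor>& query,
                  const std::vector<Descriptor>& db);

// Returns the index of the best-matching database image, or -1 if no
// candidate reaches min_matches.
int recognize(const Image& query,
              const std::vector<std::vector<Descriptor>>& db_descriptors,
              int min_matches) {
    auto ips  = detect_interest_points(query);     // step 1: IPD
    auto desc = generate_descriptors(query, ips);  // step 2: descriptors
    int best = -1, best_count = min_matches - 1;
    for (std::size_t i = 0; i < db_descriptors.size(); ++i) {
        int m = count_matches(desc, db_descriptors[i]);  // step 3: matching
        if (m > best_count) { best_count = m; best = static_cast<int>(i); }
    }
    return best;
}
```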

Researchers have proposed several algorithms to detect interest points and generate descriptors. The most popular are variants of the Scale-Invariant Feature Transform (SIFT) [4] and SURF [3] algorithms. We chose the SURF algorithm for our MAR application because it's faster and sufficiently accurate for the usage model of interest. Researchers have also used SURF successfully on mobile phones for MAR. [2]

Figure 2a shows SURF's execution time when running on two types of platforms: a server platform with traditional large-core microprocessors (Intel's Xeon processor running at 3 GHz) and a netbook platform with small-core microprocessors (Intel's Atom core running at 1.6 GHz). Because we started with open-source code, we also performed significant software optimizations (multithreading, vectorization, and so forth), which Figure 2a labels as "opt." As Figure 2a shows, our software optimizations have already improved image recognition performance by as much as 2×. In addition, the performance difference between large and small cores was roughly 4×. Despite this difference, we explore recognition servers built from small cores, for the following reasons:

• A significant part of this difference comes from the 2× frequency difference between the cores when measured.

• Multiple small cores can fit within a large core's footprint and power envelope.

• These algorithms' server power consumption and application execution time must be reduced significantly through hardware offloading anyway.

[Figure 1. Large-scale recognition for image and speech: image recognition flow detailing various stages, from image capture to database matching (a); recognition architecture employing mobile clients and recognition servers (b); and speech recognition flow detailing various stages for real-time speech identification (c). Thus, we extend the basic recognition architecture to image and speech recognition. (ADC: analog-to-digital converter; AM: acoustic model; DCT: discrete cosine transform; FFT: fast Fourier transform; GMM: Gaussian mixture model; LM: language model; SURF: Speeded-Up Robust Features.)]

Figure 2b shows the key primitives and hot spots in SURF-based image recognition when running on the Intel Atom core. Descriptor matching takes the most time, followed by IPD (for 50 image matches). The execution-time breakdown depends on the number of images to match against. (Extensive workload characteristics and optimizations of image recognition are available elsewhere. [5])

Speech recognition characteristics

Automatic speech recognition is particularly attractive as an input method on handheld platforms because of their small form factors. The rapid adoption of mobile devices and the need for enriching context have made it even more important to enable natural-language navigation, accurate transcriptions, and so on. In fact, cloud service providers have already begun enabling a rich user experience based on voice, including voice-based search and other features.

In our work, we target the more difficult speech recognition applications that focus on large-vocabulary continuous speech recognition (LVCSR). We employ the most widely researched LVCSR software: Carnegie Mellon University's Sphinx 3.0 (http://sourceforge.net/projects/cmusphinx/files). Figure 1c illustrates the basic flow of Sphinx 3, which uses hidden Markov models (HMMs) as statistical models to represent subphonemic speech sounds. There are basically three steps in HMM-based speech recognition systems, including Sphinx 3: the acoustic front end, GMM scoring, and acoustic-model and language-model searches.
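To make the middle step concrete: GMM scoring evaluates, for each senone s, the log-likelihood of a 39-dimensional feature vector x under a diagonal-covariance Gaussian mixture. Its standard form (a sketch; Sphinx's internal fixed-point scaling constants are omitted) is:

$$\log b_s(\mathbf{x}) = \log \sum_{k=1}^{K} w_{s,k}\,\exp\!\left( C_{s,k} - \frac{1}{2} \sum_{i=1}^{39} \frac{(x_i - \mu_{s,k,i})^2}{\sigma^2_{s,k,i}} \right), \qquad C_{s,k} = -\tfrac{39}{2}\log(2\pi) - \tfrac{1}{2}\sum_{i=1}^{39} \log \sigma^2_{s,k,i},$$

where the $w_{s,k}$ are mixture weights and $\mu$, $\sigma^2$ are per-dimension means and variances. The weighted-square sum in the exponent is the computation that the GMM accelerator described later implements in fixed point.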

Figure 2c shows the execution time of speech recognition on large cores (Intel's Xeon processor) and small cores (Intel's Atom core) when running on speech captured from the Wall Street Journal and HUB-4 broadcast news. Again, we show the impact of software optimizations that we've already performed on the open-source code to speed it up on Intel microprocessors (including multithreading, vectorization, and significant code and algorithmic optimizations). We achieved a speedup of 1.2× when applying these optimizations. We also found that the Atom core was about 5× to 6× slower than the Xeon core when processing speech using the selected open-source engine. As with image recognition, however, we still argue for small cores to be supported in recognition servers, and for the same reasons. In addition, much of this difference comes from large caches (8 Mbytes on the Xeon core versus 512 Kbytes on the Atom).

[Figure 2. Recognition hot-spot sensitivity: image recognition hot-spot sensitivity on Atom- and Xeon-based platforms (a), image recognition hot-spot sensitivity with database size for database scaling (b), speech recognition hot-spot sensitivity on Atom- and Xeon-based platforms (c), and speech recognition hot-spot sensitivity with vocabulary size for database scaling (d). (opt: optimized.)]

Figure 2d shows hot-spot profiling of Sphinx 3 on an Atom processor. As the figure shows, GMM scoring takes the most time (62 to 82 percent of processing), followed by the HMM search (16 to 36 percent of the total execution time). These profiles were collected with Wall Street Journal material with no background noise. Thus, even without noise, Sphinx 3 decoding on the Atom takes 1.25× real time; that is, decoding a 10-second utterance takes about 12.5 seconds.

In real-world use, environmental noise is unavoidable. Compared with clean audio, decoding time for noisy audio can easily increase by 50 to 300 percent. This shows that hardware acceleration is necessary, not only for power consumption reasons, but also for real-time decoding.

Recognition server architecture

Figure 3 shows the progression of the recognition server architecture design space that we explored when developing CogniServe. To provide a low-cost, energy-efficient architecture, we developed a recognition server with small cores (like the Intel Atom). Figure 3a shows this basic server architecture, consisting of multiple small cores connected via an on-die interconnect to an integrated memory controller for attaching to DRAM. Various commercial and research efforts are already attempting to provide large-scale servers based on Atom cores. [6-8] Our basic design represents a simple homogeneous chip multiprocessor (CMP) architecture, which is a general-purpose CMP from the perspective of programmability ease.

To further accelerate recognition execution time, we explored the potential of integrating application-specific accelerators for the key recognition primitives. Figure 3b shows this architecture at a high level, in which we replaced some of the small cores with a set of hardware recognition accelerators (RA1, RA2). We included a GMM accelerator for speech recognition because GMM scoring consumed more than 60 percent of the execution time, a match accelerator for image recognition because matching consumed more than 50 percent of the execution time, and an IPD accelerator because interest-point detection consumed more than 40 percent of the execution time. All three accelerators are key primitives in such algorithms, and therefore are reusable across a range of recognition workloads.

Another important criterion that we used to choose these accelerators is that they primarily involve integer or fixed-point computations. The resulting architecture, integrating accelerators and small cores on the die, is essentially heterogeneous. This architecture provides the significant performance improvement and energy efficiency that are inherent in hardware accelerators as opposed to general-purpose cores. Traditional models essentially treat accelerators like any other I/O device and let developers employ accelerators through a device-driver abstraction, as Figure 3b shows.

[Figure 3. Recognition server: architecture and acceleration on small-core servers (a), driver-based accelerator integration (b), and instruction set architecture (ISA)-based accelerator integration (c). We extended the generic programming model from homogeneous architectures to heterogeneous CPU and other device platforms. (RA: recognition accelerator.)]

Unlike the small cores, which employ general-purpose instructions and virtual memory, accelerators are programmed using firmware and device-driver interfaces, and they typically use physical addresses in memory. This makes the architecture difficult to program and also inefficient, because it requires a control transfer between user space and kernel space before accessing the accelerator. Moreover, it requires copying data back and forth between the core and the accelerator for processing.

As Figure 3c shows, to address the lack of general-purpose programming and the inefficiency of traditional (device-driver-based) models of heterogeneous architectures, we designed architectural support in the heterogeneous recognition server that enables direct user access from the core to the accelerator via an instruction set architecture (ISA) extension, thus eliminating user-to-kernel transition overhead. We also designed common memory management units (MMUs), which let the accelerator share the virtual-memory space with the core so that data-movement overheads can be eliminated as well.

Recognition accelerators in CogniServe

To improve energy efficiency and execution time for a class of special-purpose applications, architects have been exploring the integration of domain-specific hardware accelerators. The most popular hardware accelerator over the past decade has been graphics. However, over the years, researchers have developed application-specific accelerators for video processing, image processing, security and cryptography, and networking. We've designed and implemented the following accelerators for speech and image recognition:

• GMM accelerator. This speech recognition accelerator performs GMM scoring for each of the feature vectors of an audio frame.

• Match accelerator. This image recognition accelerator computes distance calculations for matching descriptors from a query image to descriptors from a set of database images.

• IPD accelerator. This accelerator identifies the interest points in an input image.

Here, we describe the design of these accelerators in more detail.

GMM accelerator

Figure 4a shows the GMM accelerator architecture, which includes weighted-square-sum calculation of the distances between audio feature vectors and the mean values in the Gaussian table, followed by score generation. The GMM accelerator has a five-stage pipeline, processing 39 different computations concurrently. The arithmetic logic units (ALUs) are 32-bit and 64-bit adders; 32-bit × 16-bit → 48-bit and 32-bit × 32-bit → 64-bit multipliers; and two shifters, all in fixed-point format. The control unit consists of a set of state machines that generate addresses both for fetches from the static RAM (SRAM) and for direct memory access (DMA) transfers that copy data from DRAM to SRAM. The DMAs ensure that the audio feature vectors and Gaussian-table values are available to the ALUs for the next set of operations.

The Gaussian table for the more than 8,000 senones accounts for most of the memory traffic. (A senone is a set of probability density functions implemented as a Gaussian mixture.) We adopt an audio-frame-grouping technique to save main-memory bandwidth: we group four 10-ms audio frames for batch processing, which cuts memory traffic by 4× at the cost of 40 ms of initial delay.
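To make the data path concrete, here is a floating-point software sketch of the scoring loop that the accelerator implements in fixed point. The struct layout and names are illustrative, not the hardware's; the hardware replaces log/exp with a LogAdd lookup unit and a >>14 scaling step:

```cpp
#include <cmath>
#include <limits>
#include <utility>
#include <vector>

constexpr int kDims = 39;  // feature dimensions, matching the 39-wide pipeline

struct Gaussian {
    float mean[kDims];  // m_i in Figure 4a
    float prec[kDims];  // v_i = 1/sigma^2, precomputed precisions
    float log_weight;   // log mixture weight (W)
    float lrd;          // log-reciprocal-determinant term (LRD)
};

// log(exp(a) + exp(b)) without overflow; the hardware uses a LogAdd unit.
static float log_add(float a, float b) {
    if (a < b) std::swap(a, b);
    return a + std::log1p(std::exp(b - a));
}

// Score one senone (a Gaussian mixture) against one feature frame x.
// The inner loop matches Figure 4a: sum = sum_i (x_i - m_i)^2 * v_i.
float gmm_score(const std::vector<Gaussian>& senone, const float x[kDims]) {
    float score = -std::numeric_limits<float>::infinity();
    for (const Gaussian& g : senone) {
        float sum = 0.0f;
        for (int i = 0; i < kDims; ++i) {
            const float d = x[i] - g.mean[i];
            sum += d * d * g.prec[i];
        }
        // Per-mixture log density up to constants: LRD + log w - sum/2.
        score = log_add(score, g.lrd + g.log_weight - 0.5f * sum);
    }
    return score;
}
```

Batching four frames, as the accelerator does, lets one pass over the Gaussian table score four feature vectors, which is where the 4× memory-traffic saving comes from.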

Match accelerator

The match accelerator for image recognition computes the Euclidean distance between every pair of descriptor vectors (one interest point from the query image, and one from the database image) and computes the number of matches between two images. Each descriptor has 64 dimensions, represented by 64 8-bit values (see Figure 4b). The match accelerator employs SRAM as a staging buffer in the address space, without which all subsequent descriptor values would need to be read from DRAM one at a time. Our design uses an SRAM that supports a 64 × 64 descriptor computation (requiring only 8 Kbytes). We rearranged the loop in the

brute-force algorithm for 64 descriptor vectors by using a loop-blocking technique. The match accelerator data path computes the squared Euclidean distance for every pair of descriptor vectors.

[Figure 4. Accelerator designs for recognition hot spots: GMM block for speech recognition (a), mobile augmented reality (MAR) match implementation (b), and MAR interest point detection (IPD) accelerator (c). The GMM data path computes sum = Σ_{i=0}^{38} (x_i − m_i)² × v_i; the match data path computes sum = Σ_{i=0}^{63} (q_i − d_i)². (AGU: address generation unit; DB: database; EU: Euclidean distance square unit; LRD: logarithmic reciprocal determinant; M, X, V, W: input and database vectors; SRAM: static RAM.)]

The match accelerator has a five-stage pipeline, processing 64 descriptor vectors in 8-bit unsigned-integer format concurrently. The Load stage loads 64 descriptor vectors from the internal SRAM into the registers. The execution stage calculates the squared Euclidean distance between descriptors by computing the difference between a pair of descriptor-vector elements and multiplying it by itself. The Sum1 and Sum2 stages accumulate the 64 squared distances. The accumulated value is checked against the minimum and second-minimum values in the Min stage, and the results are stored to the internal SRAM for the match operation. The final result, the number of matches between two images, is calculated by accessing the Min and Min2 memory obtained from the Euclidean-distance-square unit. When the minimum value is less than half the second minimum, the number of matches increases by 1 for the pair of descriptors.
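In software terms, the distance and Min-stage logic amount to the following sketch (names are illustrative; the hardware blocks the doubly nested loop 64 × 64 descriptors at a time in SRAM rather than streaming each pair from DRAM):

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// One SURF descriptor: 64 dimensions stored as 8-bit unsigned values,
// as in the match accelerator's data path.
struct Descriptor { std::uint8_t v[64]; };

// Squared Euclidean distance between two descriptors (the EU stage).
static std::uint32_t dist2(const Descriptor& q, const Descriptor& d) {
    std::uint32_t sum = 0;
    for (int i = 0; i < 64; ++i) {
        const std::int32_t diff = std::int32_t(q.v[i]) - std::int32_t(d.v[i]);
        sum += std::uint32_t(diff * diff);
    }
    return sum;
}

// Count matches between a query image and one database image.
// Mirrors the Min stage: track the minimum and second-minimum distance
// per query descriptor, and count a match when min < min2 / 2.
int count_matches(const std::vector<Descriptor>& query,
                  const std::vector<Descriptor>& db) {
    constexpr std::uint32_t kInf = std::numeric_limits<std::uint32_t>::max();
    int matches = 0;
    for (const Descriptor& q : query) {
        std::uint32_t min1 = kInf, min2 = kInf;
        for (const Descriptor& d : db) {
            const std::uint32_t s = dist2(q, d);
            if (s < min1)      { min2 = min1; min1 = s; }
            else if (s < min2) { min2 = s; }
        }
        if (min2 != kInf && min1 < min2 / 2) ++matches;
    }
    return matches;
}
```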

IPD accelerator

IPD is the first phase in image recognition and comprises steps such as integral-image computation and Hessian-matrix computation. We designed the IPD accelerator to efficiently implement these steps and provide significant execution-time reduction at ultralow power. Figure 4c shows the block-filter points for Dxx, Dyy, and Dxy, which are needed to compute one Hessian-matrix result point when an "octave" equals 1 and an "octave layer" equals 1. The IPD accelerator implements the following steps:

1. The accelerator reads a vector of an integral image from the integral-image buffer and sends it to the computation pipeline block.

2. The accelerator reads an entry of the control-information table from the control-information memory, which includes one associated entry for each block-filter point.

3. With the vector from step 1 and the control information from step 2, the computation pipeline block performs shifting, coefficient multiplication, and accumulation. The accumulation buffers store the accumulated results for each block filter.

4. Steps 1 through 3 are repeated until all block-filter points are processed. Then, the final Hessian-matrix results are calculated and written into either an on-chip buffer or system memory for the next processing step.

5. Steps 1 through 4 are repeated for the entire image.
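The integral image that feeds this pipeline is simple to state in software. A minimal sketch follows; with it, the box filters Dxx, Dyy, and Dxy each reduce to a handful of rectangle sums:

```cpp
#include <cstdint>
#include <vector>

// Integral image: entry (y, x) holds the sum of all pixels above and to
// the left of (y, x). Padded with a zero row and column so boundary
// sums need no branches. (Use a wider type for very large images.)
std::vector<std::uint32_t> integral_image(const std::uint8_t* px,
                                          int w, int h) {
    std::vector<std::uint32_t> ii((w + 1) * (h + 1), 0);
    for (int y = 0; y < h; ++y) {
        std::uint32_t row = 0;  // running sum of the current row
        for (int x = 0; x < w; ++x) {
            row += px[y * w + x];
            ii[(y + 1) * (w + 1) + (x + 1)] = ii[y * (w + 1) + (x + 1)] + row;
        }
    }
    return ii;
}

// Sum of the pixel rectangle [x0, x1) x [y0, y1) in four lookups;
// this is what each block-filter point evaluates.
std::uint32_t box_sum(const std::vector<std::uint32_t>& ii, int w,
                      int x0, int y0, int x1, int y1) {
    const int W = w + 1;
    return ii[y1 * W + x1] - ii[y0 * W + x1]
         - ii[y1 * W + x0] + ii[y0 * W + x0];
}
```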

Accelerator design evaluation

To evaluate the design of the three accelerators, we conducted extensive register-transfer-level (RTL) simulations and integrated each accelerator into a field-programmable gate array (FPGA) platform. We functionally verified that the accelerators operate correctly and produce the correct results. We also synthesized the designs on a 45-nm process and gathered area, frequency, and power statistics.

As Table 1 shows, the accelerators operate at frequencies ranging from 400 MHz to nearly 600 MHz and achieve speedups ranging from 3× to 21× for the kernel. Significant savings in power consumption accompany these speedups: the accelerator logic consumes only about 25 mW to 100 mW.

Table 1. Accelerator performance and speedup over the software implementation.

    Accelerator   Frequency (MHz)   Logic area (mm²)   Power (mW)   Speedup
    GMM           542               0.63               98.0         6×
    Match         568               0.08               23.5         21×
    IPD           400               0.36               39.1         13×

Figures 5a and 5b show the performance gains of image and speech recognition for various hardware configurations, processor frequency scales, and comparison databases. The Atom-based accelerator-integrated platform is about 2× faster than an Intel Xeon processor for 20 images, rising to about 5× for 50 images. This enables the accelerator-integrated platform to process more queries with lower power consumption. Although our accelerator-integrated platform is about 2× or 3× slower than the Intel Xeon for speech recognition, it delivers one-third the performance at an order-of-magnitude lower power consumption.

[Figure 5. Performance scaling with frequency and database size: performance improvement of image recognition (a) and of speech recognition (b). (SW: software; W/A: with acceleration.)]

CogniServe accelerator programming

Traditional interfacing for accelerators has been through device drivers, but this model poses two challenges: performance loss due to overhead latencies, and lack of a standardized interface for software programming. Extracting the maximum performance and enabling a standardized programming interface requires a new approach to accessing accelerators and communicating with them.

Conventional approaches for accelerator exposure are based on device-driver abstraction. Figure 6a shows the overheads of the device-driver interfacing model:

• System-call overheads range from 100 to 1,000 cycles for each call into the driver (because each call requires a transition from user mode to kernel mode).

• Data-copy overheads are incurred because the device driver must copy the data from the user buffer to the kernel buffer. The copy overhead is around 2,100 cycles for a 4-Kbyte data transfer. Another approach is to pin the application page and get the physical address using kernel services; although this approach reduces copy overhead, it adds the cost of address translation.

[Figure 6. CogniServe accelerator programming and communication: device driver interface (a) and virtual-memory-based acceleration (b). The proposed programming model achieves significant overhead reduction by eliminating data-copy, buffer-allocation, and system-call cycles. (AISA: accelerator ISA; AMMU: accelerator memory management unit; DMA: direct memory access.)]

We address these overheads by introducing the architectural support shown in Figure 6b, which has the performance benefits listed in Table 2.
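For contrast, the classical Figure 6a path looks roughly like this from user space; the device node and ioctl command shown are hypothetical stand-ins for a driver interface:

```cpp
#include <fcntl.h>      // open
#include <sys/ioctl.h>  // ioctl
#include <unistd.h>     // close

int classic_offload(void* job, unsigned long gmm_offload_cmd) {
    int fd = open("/dev/cogniserve-gmm0", O_RDWR);  // hypothetical node
    if (fd < 0) return -1;
    // Each ioctl is a user-to-kernel transition (~100-1,000 cycles), and
    // the driver copies the job's buffers into kernel memory (~2,100
    // cycles per 4-Kbyte transfer) before starting the DMA.
    int rc = ioctl(fd, gmm_offload_cmd, job);
    close(fd);
    return rc;
}
```

Every such call pays the user-to-kernel transition and per-buffer copy costs that Table 2 quantifies.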

AISA extensions

We introduced new accelerator ISA (AISA) extensions in the host core to reduce unnecessary device-driver overheads when interfacing accelerators to the servers. These extensions also provide a standard interface for programs to access the accelerator hardware from user space, letting programs access an accelerator much as they would a processor functional unit or coprocessor. We define two new ISA extensions to create split-instruction semantics (which avoid blocking the processor during the accelerator transaction):

• PUTTXN. This instruction provides a process with an atomic method to send data to an accelerator, along with the context information (returning a unique transaction ID that can be used to query the status asynchronously).

• GETTXN. This instruction provides a process with a method for querying the hardware for a given transaction's completion status.

We also define an accelerator ID instruction (AID) to enumerate and configure all the available accelerators in the processor.
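Here is a sketch of how an application thread might drive an accelerator through these instructions. The puttxn/gettxn wrappers, the job-descriptor layout, and the status encoding are hypothetical illustrations, not a shipping ISA:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical intrinsics wrapping PUTTXN/GETTXN. PUTTXN atomically
// hands the accelerator a job descriptor and returns a transaction ID;
// GETTXN polls that ID for completion.
std::uint64_t puttxn(std::uint32_t accel_id, const void* job, std::size_t len);
bool          gettxn(std::uint64_t txn_id);  // true when complete

struct MatchJob {
    const void* query_descs;  // user-space virtual addresses: with the
    const void* db_descs;     // AMMU there is no pinning or kernel copy
    std::uint32_t n_query, n_db;
    int* result;              // destination buffer for the match count
};

int offload_match(std::uint32_t match_accel_id, MatchJob* job) {
    const std::uint64_t txn = puttxn(match_accel_id, job, sizeof *job);
    // Split-instruction semantics: the core is free to do other work
    // here instead of blocking on the accelerator.
    while (!gettxn(txn)) {
        // e.g., service another request, then poll again
    }
    return *job->result;
}
```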

AMMU translation services

We designed a simplified memory model to avoid data-copying requirements and enable the accelerators to operate in the virtual-memory domain. The key component of this new model is the accelerator memory management unit (AMMU). The AMMU is integrated on the interconnect and offers address-translation services to the accelerators so that they can execute in the virtual-memory domain. The AMMU also lets programs access the accelerator functions directly from user space and communicate using virtual-memory addresses. When an accelerator tries to access application memory with a virtual address, the AMMU intercepts the request and automatically translates the virtual address into the corresponding physical address. If the translation isn't present in the operating-system page tables, the AMMU initiates a page fault through an interrupt. The operating system then services the page fault before the AMMU again attempts the translation.
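A behavioral sketch of that translation path follows. This is a software model for illustration only; real hardware walks the OS page tables and signals faults through an interrupt, both simplified to stubs here:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Simplified behavioral model of the AMMU's translation path. A map
// stands in for the OS page tables; 4-Kbyte pages are assumed.
class Ammu {
public:
    std::uint64_t translate(std::uint64_t vaddr) {
        const std::uint64_t vpn = vaddr >> 12, off = vaddr & 0xFFF;
        if (auto it = tlb_.find(vpn); it != tlb_.end())
            return (it->second << 12) | off;             // TLB hit
        std::optional<std::uint64_t> pfn = page_walk(vpn);
        while (!pfn) {                                   // not mapped yet
            raise_page_fault(vaddr);  // interrupt; OS services the fault
            pfn = page_walk(vpn);     // then the walk is retried
        }
        tlb_[vpn] = *pfn;                                // fill the TLB
        return (*pfn << 12) | off;
    }

private:
    std::optional<std::uint64_t> page_walk(std::uint64_t vpn) {
        auto it = page_table_.find(vpn);
        if (it == page_table_.end()) return std::nullopt;
        return it->second;
    }
    void raise_page_fault(std::uint64_t vaddr);  // stub: signal the OS
    std::unordered_map<std::uint64_t, std::uint64_t> tlb_, page_table_;
};
```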

Table 2. Performance benefits of the proposed accelerator-programming model. Cycle counts for the classical driver model (data copy, buffer allocation, system calls and register setup, and their total) versus the proposed model.

    Accelerator         Data copy   Buffer alloc.   Syscalls/setup   Driver total   Proposed (cycles)   Reduction (%)
    MAR IPD             1,110,657   616,355         726              1,727,738      550                 99.97
    MAR match           33,081      31,757          726              65,565         550                 99.16
    Sphinx GMM, N = 8   3,735       6,378           726              10,840         550                 94.93
    Sphinx GMM, N = 1   3,200       6,096           726              10,022         550                 94.51

Table 2 shows the overheads associated with offloading the computationally intensive portions of prominent recognition suites using a traditional device-driver approach on the one hand, and using the proposed programming-model-based approach with the AISA and AMMU on the other. It also shows the absolute overhead (in CPU cycles) for the IPD, match, and GMM accelerators. As is evident from the table, the proposed interfacing model considerably decreases the total data-offload cycles compared with the classical driver model. The reason for this benefit is that the overheads for data copying and kernel-buffer allocation are avoided when the offload occurs directly from user space. The only remaining interfacing overheads are those for invoking the offload from software and those due to the hardware delays in queuing up the job for the accelerator. Both of these overheads are small (about 550 cycles) and, unlike in the driver model, independent of the nature of the offload.

Other overheads associated with virtual-address-based execution on the accelerators are translation look-aside buffer (TLB) misses and page-fault handling. The AMMU keeps a cache of virtual-to-physical address translations in its TLB, which is managed like the processor TLB. Although the analysis of performance data is beyond the scope of this article, our preliminary work in this area suggests that the overheads to service a TLB miss aren't very large when compared to the benefits gained by shortening the software offload path. This is especially true in light of the spatial locality of recognition workloads, which can potentially make intelligent TLB prefetching effective. Increasing the interconnect speeds also reduces the TLB-miss service latencies. Page-fault servicing should add minimal cost to recognition-accelerator offloading, thanks to the use of source and destination buffers allocated by the offloading program.

The accelerators use source buffers to fetch the input data provided by the offloading program for computation. The nature of recognition workloads dictates that comparison operations are performed between a target and a source database. Both of these database buffers are prefilled by the software before control is handed to the hardware accelerators through offloading. This temporal locality makes it improbable for the AMMU to take a page fault when accessing the source buffer.

The accelerators use destination buffers to communicate the results of recognition offloading to software. Some applications allocate a destination buffer on demand. This implies that, although the pointer to the buffer is valid, the memory isn't allocated until the accelerator accesses it to store the result. This can lead to a page fault. However, the number of page faults in such cases is limited (usually just one) because most recognition accelerators communicate a "match or no match" result along with an optional index into the database. This result can easily fit into the smallest page allocated by the operating system. The examples discussed in this article, MAR and speech recognition, fall into this category.

As this article shows, the CogniServe recognition server architecture is well suited for recognition applications and will foster further technology development in this area. With recognition-based applications on smart devices constantly emerging, we will continue studying additional algorithms and further refine the CogniServe recognition server architecture with the right set of accelerators and associated architectural support.

References

1. L.A. Barroso, J. Dean, and U. Hölzle, "Web Search for a Planet: The Google Cluster Architecture," IEEE Micro, vol. 23, no. 2, 2003, pp. 22-28.

2. G. Takacs et al., "Outdoors Augmented Reality on Mobile Phone Using Loxel-Based Visual Feature Organization," Proc. ACM 1st Int'l Conf. Multimedia Information Retrieval (MIR 08), ACM Press, 2008, pp. 427-434.

3. H. Bay et al., "Speeded-Up Robust Features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, 2008, pp. 346-359.

4. D.G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," Int'l J. Computer Vision, vol. 60, no. 2, 2004, pp. 91-110.

5. S. Srinivasan et al., "Performance Characterization and Acceleration of Optical Character Recognition on Handheld Platforms," Proc. IEEE Int'l Symp. Workload Characterization (IISWC 10), IEEE Press, 2010, doi:10.1109/IISWC.2010.5648852.

6. D.G. Andersen et al., "FAWN: A Fast Array of Wimpy Nodes," Proc. ACM SIGOPS 22nd Symp. Operating Systems Principles (SOSP 09), ACM Press, 2009, pp. 1-14.

7. V. Reddi et al., "Web Search Using Mobile Cores: Quantifying and Mitigating the Price of Efficiency," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), ACM Press, 2010, pp. 314-325.

8. SeaMicro, http://www.seamicro.com.

Ravi Iyer is a senior principal engineer and director of the SoC Platform Architecture Research Group at Intel Labs. His research focuses on future system-on-chip (SoC) and chip multiprocessor (CMP) architectures, especially small cores, accelerators, cache and memory hierarchies, fabrics, quality of service, emerging applications, and performance evaluation. Iyer has a PhD in computer science from Texas A&M University.

Sadagopan Srinivasan is an engineer at Intel Labs, where he works on architecture and applications for mobile and heterogeneous systems. His research interests include processor architecture, memory subsystems, and performance modeling and analysis. Srinivasan has a PhD in computer engineering from the University of Maryland, College Park.

Omesh Tickoo is a research scientist at Intel Labs. His research interests include computer system architecture, multimedia, and networking. Tickoo has a PhD in electrical and computer systems engineering from Rensselaer Polytechnic Institute.

Zhen Fang is an engineer at Intel Labs, where he works on architecture and applications for mobile and embedded systems. His research interests include processor architecture, memory subsystems, parallel processing, and performance modeling and analysis. Fang has a PhD in computer science from the University of Utah.

Ramesh Illikkal is a principal engineer in the Integrated Platform Research Lab at Intel Labs. His research interests include SoCs, CMPs, server architectures, virtualization, and memory hierarchies. Illikkal has an MTech in electronics from Cochin University of Science and Technology in India.

Steven Zhang is a senior engineer in the Integrated Platform Research Lab at Intel Labs. His research interests include SoC architecture, reconfigurable computing, and quick prototyping. Zhang has a PhD in communication engineering from Beijing University of Posts and Telecommunications.

Vineet Chadha is a research scientist at Intel Labs. His research interests include computer architecture, virtualization, and distributed computing. Chadha has a PhD in computer and information science and engineering from the University of Florida. He's a member of the IEEE Computer Society and the ACM.

Paul M. Stillwell Jr. is a senior engineer in the Integrated Platform Research Lab at Intel Labs. His research interests include SoCs and deeply embedded architectures. Stillwell has a BS in electrical engineering from North Carolina State University.

Seung Eun Lee is an assistant professor in the Department of Electronic and Information Engineering at Seoul National University of Science and Technology. His research interests include computer architecture, multiprocessor SoCs, networks on chips (NoCs), and very large-scale integration (VLSI) design. Lee has a PhD in electrical and computer engineering from the University of California, Irvine. He's a member of IEEE.

Direct questions and comments about this article to Ravi Iyer, JF2-58, 2111 NE 25th Ave., Hillsboro, OR 97124; [email protected].
