Demonstration and architectural analysis of complementary metal-oxide semiconductor/multiple-quantum-well smart-pixel array cellular logic processors for single-instruction multiple-data parallel-pipeline processing

Jen-Ming Wu, Charles B. Kuznia, Bogdan Hoanca, Chih-Hao Chen, and Alexander A. Sawchuk

We present an optoelectronic-VLSI system that integrates complementary metal-oxide semiconductor/multiple-quantum-well smart pixels for high-throughput computation and signal processing. The system uses 5 × 10 cellular smart-pixel arrays with intrachip electrical mesh interconnections and interchip optical point-to-point interconnections. Each smart pixel is a fine-grain microprocessor that executes binary image algebra instructions. There is one dual-rail optical modulator output and one dual-rail optical detector input in each pixel. These optical input–output arrays provide chip-to-chip optical interconnects. Cascading these smart-pixel array chips permits direct transfer of two-dimensional data or images in parallel. We present laboratory demonstrations of the system for digital image edge detection and digital video motion estimation. We also analyze the performance of the system compared with that of conventional single-instruction–multiple-data processors. © 1999 Optical Society of America

OCIS codes: 200.2610, 200.4650, 200.4690, 100.2000.


1. Introduction

As the digital age evolves, various media such as images, videos, audio, and data are digitized for storage, processing, and transmission. The digitized information creates the need for processing a huge amount of data in real time. User applications are moving to sophisticated features such as multimedia, video conferencing, three-dimensional (3D) graphics rendering, and high-resolution images, resulting in systems that require high data bandwidths.1 These applications require the system to transfer rapidly a large amount of data to perform signal processing at high speed. Significant advancements in complementary metal-oxide semiconductor (CMOS) technology have made extremely fast microprocessors possible. By the year 2001 the integration density of CMOS logic is expected to be more than $40 \times 10^6$ transistors per chip, and the projected frequency is expected to be 1.4 GHz.2 Performance limits in many systems today are due not to processor clock speed but rather to input/output (I/O) bottlenecks and system architectures. The signal I/O's or interconnections exist between processors and input devices, between processors for multiprocessor systems, and between processors and storage devices.

When this research was performed, the authors were with the Signal and Image Processing Institute, Department of Electrical Engineering, University of Southern California, Los Angeles, California 90089-2564. J.-M. Wu is now with Sun Microsystems, Inc., Palo Alto, California 94303. A. A. Sawchuk's e-mail address is [email protected].

Received 1 April 1998; revised manuscript received 21 September 1998.

0003-6935/99/112270-12$15.00/0 © 1999 Optical Society of America

2270 APPLIED OPTICS / Vol. 38, No. 11 / 10 April 1999

With recent progress in smart-pixel technologies and development in bump-bonding techniques, it has become possible to attach large numbers of optical I/O devices to foundry-grade CMOS VLSI's.3 With this method small multiple-quantum-well (MQW) modulators and detectors are attached to CMOS chips by flip-chip bonding with subsequent substrate removal. This technology has made possible many optical I/O's normal to the surface of a VLSI chip.4,5 It thus creates large two-dimensional (2D) information transfer capabilities between VLSI chips and potentially alleviates the integrated circuit I/O communication bottleneck.

In this paper we present an n-stage smart-pixel array cellular logic (nSPARCL) processor system that attempts to overcome the I/O bottleneck by using free-space digital optical interconnects. The SPARCL chip is a single-instruction–multiple-data (SIMD) processor element (PE) array in which all PE's are identical and execute the same instruction set on multiple data elements in lock step. They efficiently execute so-called data-level parallel applications, which are programs in which the same algorithm or instruction sequence is applied to a large data set. Matrix-vector multiplication as well as image convolution and filtering operations are some examples of data parallel operations. In the SPARCL chip each PE is implemented with one smart pixel. We designed this optoelectronic chip and had it fabricated through the CO-OP program at George Mason University sponsored by the Defense Advanced Research Projects Agency.6 The 0.8-μm CMOS circuitry was fabricated by a Hewlett-Packard process through the Metal-Oxide Semiconductor Implementation Service (MOSIS). Then AlGaAs/GaAs MQW p–i–n structures were bonded onto it by Bell Laboratories/Lucent Technologies with its optoelectronic VLSI process.6 The 1.95 mm × 1.95 mm area contains 200 MQW diodes that can operate as either optical detectors or modulators. The SPARCL chip contains a 5 × 10 array of smart pixels, each with area of 125 μm × 250 μm. Each smart pixel contains 182 transistors to execute binary image algebra (BIA) operations with a 3-bit local memory. Each pixel also detects or transmits one optical data bit on each clock cycle. The smart-pixel array operates as a mesh-connected SIMD processor. Operation of the SPARCL chip was simulated at more than 100 MHz and has been tested at 90 MHz.

We constructed a demonstration system that interconnects as many as three SPARCL chips in a 2D pipeline processing array. Data flow unidirectionally through the SPARCL pipeline on a 5 × 10 array of digital optical free-space channels. The system is packaged on a 10 in. × 14 in. (25.4 cm × 35.56 cm) slotted base plate that houses polarization-sensitive and diffractive optical components. A host computer sends instructions to the SPARCL chips to perform data processing routines. We have successfully verified operation of the SPARCL prototype system. Specifically, we tested several image and data processing routines, such as parallel numerical processing (10-bit-wide addition, subtraction, and multiplication), image edge detection, noise filtering, and digital video motion estimation.

We describe the nSPARCL system architecture in Section 2 and the SPARCL chip architecture in Section 3. In Section 4 we present the experimental test results of the SPARCL system and demonstrate some applications of the system for digital image processing and digital video motion estimation. Finally, in Section 5 we analyze the system architecture and show the advantage of the SPARCL system compared with conventional SIMD systems.

2. System Architecture

The system integrates several SPARCL chips in parallel pipelines, using free-space digital optics technology, as shown in Fig. 1. Each chip has a 5 × 10 array of PE's that are electrically mesh connected. The processing elements are optically interconnected point to point between the chip planes, thus creating a 3D massively parallel processing system for data processing and communication. The prototype system uses a host computer as a controller to send instructions as well as data blocks to SPARCL chips. The input datum, e.g., an image, is usually much larger than the 5 × 10 SPARCL array size and therefore is partitioned into 5 × 10 blocks. These blocks are pipelined into the system from the electrical data input pads of the first-stage chip. Each chip in this multistage system can be programmed to carry out a different set of instructions. A processing routine usually contains a sequence of instructions written as a BIA sequence.7 The host computer analyzes the instructions and shares the computation load among the SPARCL stages to optimize the computation efficiency. The processed blocks leave the system from the last stage of the SPARCL system. The host computer then collects the processed blocks and assembles the result. Thus the SPARCL system loads, processes, and unloads the data blocks in a pipelined fashion.
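The partitioning step performed by the host computer can be sketched as follows. This is a minimal illustration only: the function name `partition`, the row-by-column orientation of the 5 × 10 block, and the zero padding of blocks that overhang the image edge are assumptions made here, not details given for the prototype.

```python
def partition(image, rows=5, cols=10, fill=0):
    """Split a 2D list into rows x cols blocks, left to right, top to bottom.

    Blocks that overhang the image edge are padded with `fill`
    (an assumption; edge handling is not specified in the text).
    Returns a list of ((top, left), block) pairs.
    """
    height, width = len(image), len(image[0])
    blocks = []
    for top in range(0, height, rows):
        for left in range(0, width, cols):
            block = [
                [image[r][c] if r < height and c < width else fill
                 for c in range(left, left + cols)]
                for r in range(top, top + rows)
            ]
            blocks.append(((top, left), block))
    return blocks


# A 256 x 256 checkerboard test image (hypothetical input).
image = [[(r + c) % 2 for c in range(256)] for r in range(256)]
blocks = partition(image)
print(len(blocks))  # 52 * 26 = 1352 blocks
```

Keeping the (top, left) origin with each block lets the host reassemble the processed blocks in their original positions.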

Fig. 1. SPARCL system prototype.

In general, free-space digital optics technologies offer a promising solution for I/O bottlenecks in SIMD systems. Figure 2 shows a comparison of the conventional SIMD architecture and two types of SPARCL system, a one-dimensional (1D) parallel data access nSPARCL and a 2D parallel data access nSPARCL, where the prefix n represents the number of cascaded SPARCL stages. The 1D nSPARCL system reads and writes data with the same bus bandwidth as a conventional SIMD machine. The 2D nSPARCL system permits reading from input devices and writing to output devices optically in 2D parallel and hence with much larger I/O bandwidth. All three systems assume the same total number of processing elements. In later sections of this paper we analyze the system performance and make comparisons among these three systems.

3. Binary Image Algebra and SPARCL Chip Architecture

The SPARCL chip is designed to execute binary image algebra. Each SPARCL pixel is a 1-bit processor for binary image processing. In this section we briefly describe the BIA and show how it is implemented in the chip architecture.

A. Binary Image Algebra

BIA, derived from mathematical morphology, is a systematic mathematical tool for general morphological image processing and data manipulation.7,8 It defines three fundamental operations:

• Complement, $\bar{X}$:

$$\bar{X} = \{(x, y) \mid (x, y) \in W,\ (x, y) \notin X\} = W - X; \qquad (1)$$

• Union, $\cup$:

$$X \cup Y = \{(x, y) \mid (x, y) \in X \text{ or } (x, y) \in Y\}; \qquad (2)$$

• Dilation, $\oplus$:

$$X \oplus R = \{(p, q) \mid R_{p,q} \cap X \neq \emptyset\}, \qquad (3)$$

in which X and Y denote the raw data sets, W denotes the universal set in which all pixels have the value 1, and $R_{p,q}$ denotes the translation of structuring element R such that its origin is located at (p, q).

Fig. 2. Architectural comparison of (a) a conventional SIMD machine, (b) a 1D nSPARCL with electrically loaded input and output, and (c) a 2D nSPARCL with optically loaded input and output. All systems are assumed to have the same total number of processing elements.


It has been proved that any binary morphological image processing routine can be decomposed into these three fundamental BIA operations.7 By the combination and repetition of these three operations, any arithmetic or symbolic function of a binary data array can be synthesized. For more information about BIA, please refer to Refs. 7 and 8.
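The three fundamental operations can be illustrated with a small set-based sketch. It assumes that a binary image is represented as a set of (row, column) coordinates of its 1-valued pixels, that the structuring element is given as a set of offsets from its origin, and that the result of a dilation is clipped to the universal set W; the function names are illustrative and are not SPARCL instructions.

```python
from itertools import product


def universe(rows, cols):
    """Universal set W: every pixel coordinate of a rows x cols image."""
    return frozenset(product(range(rows), range(cols)))


def complement(x, w):
    """Eq. (1): pixels of W that are not in X."""
    return w - x


def union(x, y):
    """Eq. (2): pixels that are in X or in Y."""
    return x | y


def dilation(x, r, w):
    """Eq. (3): points (p, q) of W at which R, translated so that its
    origin sits at (p, q), overlaps X. `r` holds (row, col) offsets."""
    return frozenset(
        (p, q) for (p, q) in w
        if any((p + dr, q + dc) in x for (dr, dc) in r)
    )


W = universe(5, 5)
X = frozenset({(2, 2)})                                          # single pixel
CROSS = frozenset({(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)})    # 3 x 3 cross

print(sorted(dilation(X, CROSS, W)))
# [(1, 2), (2, 1), (2, 2), (2, 3), (3, 2)]
```

The set representation mirrors the definitions directly; a bit-plane or array representation would be closer to the hardware but would obscure the correspondence with Eqs. (1)–(3).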

B. SPARCL Chip Architecture

To implement the electronics of the cellular image processor, a VLSI architecture has been developed that maps the three fundamental BIA operations into each smart pixel. Figure 3 is a block diagram of the BIA smart pixel. Each pixel contains a 3-bit local memory (M1–M3), a union section, and a dilation section. At the input port, a multiplexer (MUX) selects the input either from the optical receiver or from the electrical feedback, permitting recursive operations. The input data bit is then routed into one of the three available local memories under control of a 3-bit memory-select command. Each memory is made of a flip-flop register that outputs both the value of the data and the complement value of the data. A 6-bit union command chooses outputs from the memory modules and performs a union operation among selected values. The result of the union operation is sent to the dilation section and then distributed to north, west, south, and east neighbor smart pixels as a local interconnection for dilation. The dilation section takes a reference image pattern from the control unit and performs dilation with values from the local neighbor pixels.

Fig. 3. Block diagram of a BIA smart-pixel design: Q and $\bar{Q}$, a memory output and its complement, respectively.

Optical signals transmitted from or received by each pixel are encoded as two separate channels (dual-rail encoding). The power ratio of the two spatial channels determines the 0 and 1 logic levels. A schematic of a single smart pixel with one dual-rail optical receiver and one dual-rail optical transmitter is shown in Fig. 4. The receiver contains GaAs MQW self-electro-optical effect device detectors and a CMOS transimpedance receiver. Similarly, the transmitter contains a CMOS modulator driver and MQW modulators. The GaAs and CMOS chips are fabricated separately and flip-chip bonded, and then the MQW GaAs substrate is removed. The receiver and modulator driver circuitry are standard cells designed by Bell Labs/Lucent Technologies.6

Fig. 4. Optical I/O of a smart pixel.

The silicon chip is fabricated by 0.8-μm HP-CMOS26G technology at the MOSIS foundry. Figure 5 shows the physical layout of the SPARCL chip design. Each chip contains a total of 12,863 transistors. The chip contains an array of 5 × 10 pixels within a 1.95 mm × 1.95 mm die. Each SPARCL pixel is a relatively simple processing element that uses only 182 transistors, and it can implement a large number of operations. This chip is adequate for prototyping and testing purposes. A companion GaAs chip containing a 10 × 20 array of MQW diodes (which operate as either detectors or modulators) was fabricated at the Bell Laboratories/Lucent Technologies MQW foundry and flip-chip bonded to the CMOS chip. The operating wavelength of the MQW's is 850 nm. Table 1 summarizes the specifications of the prototype SPARCL chip. With smaller CMOS feature sizes and larger chips, many more pixels per array can be fabricated.

Fig. 5. Physical layout of the 5 × 10 SPARCL chip.

Table 1. Prototype Chip Parameters

Parameter                            Description
Application                          Digital cellular logic parallel processing
Algorithm                            BIA
Architecture                         Intrachip mesh connected SIMD processor array
CMOS process                         0.8-μm HP CMOS26G process through MOSIS foundry
GaAs process (optical I/O devices)   Bell Labs/Lucent Technologies MQW foundry
Flip-chip bonding                    Bell Labs/Lucent Technologies Service
Total number of transistors          12,863
Die size                             1.95 mm × 1.95 mm
Array size                           5 × 10 cells
Number of pads                       40
Throughput rate                      20 Mbytes/s per cell × 50 cells = 1 Gbyte/s
Number of optical I/O's              50 dual-rail inputs, 50 dual-rail outputs
Operation wavelength                 850 nm

4. Experimental Results and Demonstration

We constructed a testbed for optoelectronic testing and demonstration purposes. The system is packaged upon a 10 in. × 14 in. slotted base plate housing polarization-sensitive and diffractive optical elements. The demonstrator that we designed is able to house three SPARCL chips and, at present, two chips have been built. A host computer sends instructions to the SPARCL chip to perform data processing routines. A 4-kbyte first-in–first-out buffer is used to interface a slower (100-kbyte/s) data acquisition board on the host computer to the SPARCL chip's input data rate of 20 Mbytes/s. Therefore, with five parallel electrical data input pads and a 5 × 10 array of optical I/O's, each SPARCL chip achieved a 100-Mbyte/s electrical I/O data rate or a 1-Gbyte/s optical I/O data rate.

The MQW modulator contrast ratio had a measured average of 1.96 and 2.13 for logic levels 0 and 1, respectively. The optical switching power is the minimum difference in optical power between the dual-rail detectors that switches the logic states. The optical switching power of the detector MQW is ~1.5 mW per diode at 20 MHz. The chip consumed approximately 400 mW of static power dissipation at 5-V operation voltage because of the transimpedance receivers.5 Dynamic power dissipation was measured at ~100 mW at 20-MHz operation. The total chip power dissipation was measured to be ~500 mW. References 9 and 10 contain many additional details about the chip design and its optoelectronic characterization.

SPARCL is a programmable cellular logic processor for general morphological operations. It has a wide range of applications, including

• Mathematical morphological processing: basic operations (e.g., dilation, erosion, closing, opening, thinning, skeleton), image feature extraction (e.g., edge detection, shape, size and location verification), image enhancement (e.g., salt and pepper noise removal), parallel pattern recognition (e.g., hit–miss transform, template matching),

• Parallel numerical computation (addition, subtraction, multiplication),

• Combinatorial logic functions, and

• Serial-to-parallel or parallel-to-serial data format conversion and buffering.
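As an illustration of how the basic morphological operations in the first item can be composed from complement and dilation, the following sketch builds erosion, opening, and closing. It assumes a set-of-coordinates image representation, a symmetric structuring element, and that pixels outside the array never count as background; the function names are hypothetical and do not correspond to SPARCL instructions.

```python
from itertools import product


def universe(rows, cols):
    """Universal set W of a rows x cols image."""
    return frozenset(product(range(rows), range(cols)))


def complement(x, w):
    return w - x


def dilation(x, r, w):
    """Points of W at which the translated structuring element overlaps X."""
    return frozenset(
        (p, q) for (p, q) in w
        if any((p + dr, q + dc) in x for (dr, dc) in r)
    )


def erosion(x, r, w):
    """Complement of the dilated complement: points where R fits inside X."""
    return complement(dilation(complement(x, w), r, w), w)


def opening(x, r, w):
    """Erosion followed by dilation (R assumed symmetric)."""
    return dilation(erosion(x, r, w), r, w)


def closing(x, r, w):
    """Dilation followed by erosion (R assumed symmetric)."""
    return erosion(dilation(x, r, w), r, w)


W = universe(7, 7)
CROSS = frozenset({(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)})
SQUARE = frozenset(product(range(2, 5), range(2, 5)))  # 3 x 3 block of ones

print(sorted(erosion(SQUARE, CROSS, W)))  # [(3, 3)]
print(len(opening(SQUARE, CROSS, W)))     # 5
```

Erosion is obtained here purely by duality, so no primitive beyond the three fundamental BIA operations is needed.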

Classic linear operators have been powerful in various numerical analysis and signal processing applications. However, when they are applied to image analysis they do not directly address the fundamental issues of how to quantify image shape or geometrical structures. In contrast, mathematical morphology, which is a set-theoretical methodology for image analysis, can rigorously quantify many aspects of the geometrical structure in a way that agrees with human intuition and perception. Morphological image analysis is done by operating on images with some structuring elements.11 Different structural information is extracted by interaction with selected structuring elements and different combinations of operators. Here we demonstrate two examples of image analysis that use SPARCL instructions for image edge detection and digital video motion estimation.

A. Image Edge Detection

Figure 6 shows an example of the application of the SPARCL system for image edge detection. Although the current version of the SPARCL chip utilizes binary image algebra, it is also possible to process a gray-level image as a set of binary images by use of top-surface and umbra encoding.10 Here we simply binarize the gray image by 64 × 64 block quantization at the mean value of the block. The host computer then partitions the binarized 256 × 256 image into 5 × 10 blocks and pipelines these blocks into the SPARCL system. Let X be the image input to the SPARCL chip and R be the reference image, which is

$$R = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$

in the example. Then the resultant edge detected image is

$$Z = X \cup X \oplus R, \qquad (4)$$

where $\bar{X}$ represents the complement of X, $\cup$ represents the union operation, and $\oplus$ represents the dilation operation. The edge detection routine takes only three clock cycles for operation. The same routine repeats for every block in the pipeline.
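One way in which an edge map can be composed from complement, union, and dilation alone is sketched below. The sketch assumes the inner-boundary form $X \cap (\bar{X} \oplus R)$, rewritten with De Morgan's law so that no intersection is needed; this particular form, the set-based image representation, the treatment of pixels outside the array, and the function names are illustrative assumptions and are not taken from the SPARCL instruction sequence.

```python
from itertools import product


def universe(rows, cols):
    """Universal set W of a rows x cols image."""
    return frozenset(product(range(rows), range(cols)))


def complement(x, w):
    return w - x


def union(x, y):
    return x | y


def dilation(x, r, w):
    """Points of W at which the translated structuring element overlaps X."""
    return frozenset(
        (p, q) for (p, q) in w
        if any((p + dr, q + dc) in x for (dr, dc) in r)
    )


def inner_edge(x, r, w):
    """X AND (complement of X, dilated by R), written with
    complement, union, and dilation only (assumed edge definition)."""
    not_x = complement(x, w)
    grown_background = dilation(not_x, r, w)
    return complement(union(not_x, complement(grown_background, w)), w)


W = universe(7, 7)
CROSS = frozenset({(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)})  # reference image R
SQUARE = frozenset(product(range(2, 5), range(2, 5)))          # 3 x 3 block of ones

edge = inner_edge(SQUARE, CROSS, W)
print(len(edge))  # 8: the square without its center pixel
```

With the cross-shaped reference image, a foreground pixel is kept only if at least one of its four neighbors is background.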

Fig. 6. Demonstration of the SPARCL system for image edge detection.

B. Digital Video Motion Estimation

The transmission of digital video sequences contains highly redundant information, and there is considerable correlation between adjacent frames. Most of the change from frame to frame occurs to the moving objects in the picture, and most of the background information remains unchanged or little changed. Instead of transmitting the whole current frame and wasting precious channel bandwidth, the MPEG encoding algorithm (shown in block diagram form in Fig. 7) transmits the difference between the frames along with the motion vectors by the following procedure: The frames are partitioned into small blocks, and a search is made in the previous frame for the block that best matches a block in the current frame. When the best-matched block is found, the index offset is coded as a motion vector. Collecting these best-matched blocks, we obtain a motion-compensated image frame. Subtracting the motion-compensated image frame from the current frame, we can obtain the difference image. The resultant difference image is compressed with a JPEG encoder and transmitted along with the motion vectors.12 At the receiver site, the recovery of the current frame is straightforward. The motion-compensated image frame is reconstructed from the motion vectors and the previous frame. The current frame is then recovered from the addition of the motion-compensated image frame and the JPEG decoded difference image. In the overall digital video transmission system, searching for the best matched block is highly computation intensive, especially when real-time operation is desired. For example, running a full search in a 48 × 48 neighborhood area at 30 frames/s for an MPEG system with a frame size of 288 × 322 and a 16 × 16 block requires more than $3 \times 10^9$ operations per second.

Fig. 7. SPARCL for motion estimation. (a) Encoding and transmission, (b) receiving and frame recovery. At the transmitter site, two consecutive video frames (at the left) and the difference image and motion vectors (at the right) are computed by the SPARCL system. Player number 18 moves upward in the image frames, leaving his imprint in the difference image. The current frame can be recovered easily at the receiver site.

A parallel-pipeline smart-pixel array system such as SPARCL offers a method for running digital video motion estimation efficiently. Figure 7 shows a system level simulation in which the SPARCL chip performs the digital video motion estimation functions. A necessary step in motion estimation is computation of the difference between two video frames and image block matching. First the current frame is partitioned into data blocks of 5 × 10 pixels that match the array size of the SPARCL chip. The SPARCL system searches the neighborhood area of 15 × 30 pixels in the previous frame to find a block that is best matched to the data block in the current frame. To perform this search we load the current frame into the SPARCL chip in the second stage of the SPARCL system. Then we scroll the search area data through the first chip, one column at a time. For every new column the search data are updated in the first chip and transmitted optically to the second chip in 2D parallel. The second chip receives the search data optically and compares them with the data block that already resides in its memory. The second chip then performs the difference operation

$$D = B_0 \cup B_1 \qquad (5)$$

to match the data block and the search data, where $B_0$ is the search block in the previous frame, $B_1$ is the data block in the current frame, and D is the difference between $B_0$ and $B_1$. The search block that is least different from the data block is chosen as the best-matched block. This block is used as an estimate of the current block, and the index offset is coded as a motion vector. The system also subtracts the best-matched block from the current block and obtains the difference block. By collecting these difference blocks and motion offsets we can then create, encode, and transmit the difference image and motion vectors for digital video applications.
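The block search described above can be sketched in software as follows. The sketch assumes that the difference between two binary blocks is their pixelwise exclusive OR, that the match score is the number of differing pixels, and that the 15 × 30 search area is centered on the 5 × 10 data block; all names are illustrative. The last function reproduces the operation count quoted for a full search under the further assumptions of $(48 - 16 + 1)^2$ candidate offsets per block and one operation per pixel comparison.

```python
import random


def xor_difference(b0, b1):
    """Pixelwise difference of two equal-size binary blocks (1 where they differ)."""
    return [[a ^ b for a, b in zip(row0, row1)] for row0, row1 in zip(b0, b1)]


def crop(image, top, left, rows, cols):
    """Extract a rows x cols block whose upper-left corner is (top, left)."""
    return [row[left:left + cols] for row in image[top:top + rows]]


def best_match(search_area, block):
    """Exhaustive search; returns (score, (top, left)) of the least-different block."""
    rows, cols = len(block), len(block[0])
    best = None
    for top in range(len(search_area) - rows + 1):
        for left in range(len(search_area[0]) - cols + 1):
            candidate = crop(search_area, top, left, rows, cols)
            score = sum(map(sum, xor_difference(candidate, block)))
            if best is None or score < best[0]:
                best = (score, (top, left))
    return best


def full_search_ops_per_second(frame_h, frame_w, block, area, fps):
    """Pixel comparisons per second for a full search (assumed cost model)."""
    positions = (area - block + 1) ** 2
    return frame_h * frame_w * positions * fps


random.seed(0)
area = [[random.randint(0, 1) for _ in range(30)] for _ in range(15)]  # previous frame
block = crop(area, 7, 12, 5, 10)                                       # current block
score, (top, left) = best_match(area, block)
motion_vector = (top - 5, left - 10)  # offset from the centered position
print(score, motion_vector)
print(full_search_ops_per_second(288, 322, 16, 48, 30))  # 3029685120
```

In the SPARCL system the candidate blocks are produced by scrolling the search area through the first chip; the nested loops above stand in for that scrolling.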

5. Architecture Analysis and Performance Scaling

A. SIMD I/O Problem

SIMD systems contain two types of I/O traffic between the PE's and external devices: instructions and data elements. The system delivers identical copies of the instruction to every PE, and each PE executes the same instruction at the same time. There are several methods for delivering the instructions (e.g., sequential loading and broadcast). The most efficient method is simply to broadcast the instruction to every PE simultaneously.

On the other hand, the data elements delivered to every PE are different. Thus we cannot simply broadcast the data to every PE as we do in instruction delivery. Because of the limited number of I/O pads on a chip, the system has to load the data block from the border of the chip, and the data elements flow through the PE array interconnection network (e.g., mesh) step by step until the data block is registered with the PE array. Moreover, the size of the data set is usually much larger than that of the instruction set, e.g., in image processing. The same instruction set applies to a large number of data blocks repeatedly. Here, the data element's I/O becomes the most critical bottleneck for the SIMD system. Also, after processing, the system has to unload the processed data elements from the PE array by the slow step-by-step method again. Therefore there are usually separate I/O channels for loading and unloading data elements. However, the I/O bottleneck still exists.

To perform an image or data processing routine on a computing system requires three distinct steps: (1) loading the data from the input device (such as memory or digital camera) to the processor(s), (2) executing the instructions for the application routine, and (3) unloading the data from the processor(s) and storing them to an output device (memory or display device).12 To evaluate the performance of the system we define the processing speed as the number of data elements or pixels that are processed over the total processing time, described by

$$S_{\mathrm{pr}} = \frac{N^2}{T_{\mathrm{load}} + T_{\mathrm{exe}} + T_{\mathrm{unload}}}, \qquad (6)$$

where $T_{\mathrm{load}}$, $T_{\mathrm{exe}}$, and $T_{\mathrm{unload}}$ are the times required for loading, executing, and unloading the N × N data field or image.13 The value of $T_{\mathrm{exe}}$ is the number of instructions required by an SIMD algorithm multiplied by the SIMD clock period. Here we assume that each instruction requires one clock cycle.

A SIMD machine with P × Q PE's processes an N × N image by sequentially processing image blocks of size P × Q, where N is usually much larger than P and Q. The computing system addresses the external device to load input image blocks through a 1D parallel bus. In all current SIMD architectures the time required for loading and unloading each of the P × Q blocks depends on this I/O bus's bandwidth and can easily dominate the total processing time. This I/O bottleneck occurs for two reasons: The first is that data enter the P × Q processing array through one of its borders on a 1D column parallel data bus. If the data bus is P bits wide, Q clock cycles are required for loading or unloading the processor array. The processing speed in such a system grows with the bus's width rather than with the number of processing elements. The second reason is that when the SIMD array is fully loaded and operating on a data block, the data I/O lines are idle; this results in an underutilized data bus.
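The dependence of processing speed on bus width rather than on the number of PE's can be checked numerically with Eq. (6). The sketch below assumes a P × Q array loaded through a P-bit-wide bus (Q cycles per block), one clock cycle per instruction, and, by default, unloading that is fully overlapped with loading on a separate channel; the function name and the example numbers are illustrative.

```python
def simd_pixels_per_cycle(n_side, p, q, n_instr, unload_overlapped=True):
    """Processing speed of Eq. (6), in pixels per clock cycle, for a
    P x Q SIMD array fed by a P-bit-wide bus (assumed loading model)."""
    pixels = n_side * n_side
    blocks = pixels / (p * q)
    t_load = blocks * q          # Q cycles to shift one block in
    t_exe = blocks * n_instr     # one cycle per instruction, per block
    t_unload = 0 if unload_overlapped else blocks * q
    return pixels / (t_load + t_exe + t_unload)


# Growing the array along Q (more PE's, same 5-bit bus) for a 10-instruction routine.
for q in (10, 100, 1000):
    print(q, round(simd_pixels_per_cycle(500, 5, q, 10), 3))
# 10 2.5
# 100 4.545
# 1000 4.95
```

Under these assumptions the speed equals PQ/(Q + number of instructions) and saturates at P pixels per cycle, the bus width, however many PE's are added.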

Faced with this I/O bottleneck problem, architecture designers have developed a prefetch technique.14–16 The elapsed time associated with loading data from memory is called memory latency. The system deals with the memory latency by adding an extra register within each PE and adding on-chip circuitry that performs data I/O and registration in the background. These registers are interconnected with a network similar to the PE array, e.g., mesh, and match the size and location of the PE array. Instead of loading data through the PE array, the system prefetches data elements through the register array, while the PE's are dedicated for instruction execution. The technique hides the memory latency through data caching. The designers optimize the SIMD chip by balancing the amount of VLSI real estate used for PE circuitry versus registers and background I/O circuitry to maximize the processing speed.13 The trend is revealed when we consider current devices for SIMD image processing, such as the video signal processor,17 the integrated memory array processor,18 and the GLiTCH.19 These systems use PE array sizes of only 16 × 16 or fewer per chip on chips of size greater than 1 cm². Because the PE's are simple 1-bit processors, the processing array itself uses only a small portion of the chip area. The majority of the chip area is used for memory and data I/O.

The two architectures based on an nSPARCL processor system,20,21 1D nSPARCL and 2D nSPARCL, that we are examining in this paper are shown in Figs. 2(b) and 2(c) and compared with a conventional SIMD machine. With the smart-pixel optical detectors and transmitters, a SPARCL chip can optically transfer its entire data block to another SPARCL chip in a single clock cycle. The 1D nSPARCL uses this feature in the system shown in Fig. 2(b), which has the first SPARCL stage as a dedicated input device, n − 2 intermediate SPARCL's for processing, and the last SPARCL stage as an output device. The 2D nSPARCL assumes that the I/O devices are implemented with smart-pixel technology that permits 2D parallel optical I/O. The I/O devices can be a photonically accessed page-oriented memory,22,23 a video camera, a display device, or a network connection. In this case, data enter the SPARCL chip in a 2D parallel format and the I/O bottleneck is eliminated. When a SIMD system has no I/O bottleneck, the processing speed scales linearly with the number of processing elements.

The SPARCL chip itself is a SIMD system. Cascading these SPARCL chips to a multistage nSPARCL system makes a multiple-instruction multiple-data stream system in the sense that different SPARCL stages can execute different instruction sets simultaneously. By scheduling the instruction phases among the nSPARCL stages we can improve the efficiency of the system. Figure 8 shows a timing chart of data block processing for a SIMD system, a four-stage 1D nSPARCL system, and a four-stage 2D nSPARCL system. All these systems contain the same total number of PE's. The SIMD system contains 8 × 8 PE's on a single chip. Both SPARCL systems have four stages of SPARCL chips that have 4 × 4 PE's. In this example, the conventional SIMD system takes 16 clock cycles to load one data block and executes the processing in 8 clock cycles. It loads a data block and executes the processing separately. Also, we assume that the system has separate buses for data load and unload so that data unloading and loading occur simultaneously in a pipeline manner.

The 1D nSPARCL system partitions the data into smaller blocks, four times smaller than the conventional SIMD block in this example. The system loads the block into the first stage in only four clock cycles because the block size is smaller. After the data block is loaded in the first SPARCL stage, it takes another clock cycle to transfer the block from chip 1 to chip 2. The system also shares the execution commands evenly between chip 2 and chip 3, so each chip takes four cycles for execution. The processed block is then transferred to the last chip for unloading, and the unloading again takes four clock cycles. The system overlaps the loading time, the execution time, and the unloading time between blocks. For example, when the fourth chip is loading data block 1, chip 3 is executing commands on block 2, chip 2 is executing commands on block 3, and chip 1 is busy loading block 4 from the input device. The resultant total processing time is reduced because of the pipeline processing.

The 2D nSPARCL loads data blocks in 2D parallel in a single clock cycle. For the four-stage 2D nSPARCL, for example, the system shares the eight execution commands evenly over the four stages. Each stage uses two clock cycles to finish the operation. Again these operations are done in pipeline fashion. While chip 4 is executing commands on block 1, chip 3 is executing block 2, chip 2 is executing block 3, and chip 1 is executing block 4. The system uses all the PE resources for operation and therefore the fewest clock cycles of the three systems.

Fig. 8. Timing chart of data block processing for a SIMD system, a four-stage 1D nSPARCL, and a four-stage 2D nSPARCL.
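The timing example can be summarized by the steady-state cost per pixel of each system. The sketch below assumes that, once a pipeline is full, the slowest stage sets the interval between blocks, and that each stage that hands a block to the next one is charged one extra clock cycle for the optical transfer; the per-stage cycle counts follow the example in the text, whereas this bookkeeping and the function name are assumptions of the sketch.

```python
def cycles_per_pixel(stage_cycles, pixels_per_block):
    """Steady-state cost of a pipeline: the slowest stage sets the block interval."""
    return max(stage_cycles) / pixels_per_block


# Conventional SIMD, 8 x 8 PE's: load (16) and execute (8) in separate time slots.
simd = cycles_per_pixel([16 + 8], 8 * 8)

# Four-stage 1D nSPARCL, 4 x 4 PE's per stage:
# load + transfer, execute + transfer, execute + transfer, unload.
nsparcl_1d = cycles_per_pixel([4 + 1, 4 + 1, 4 + 1, 4], 4 * 4)

# Four-stage 2D nSPARCL: one optical load/transfer cycle plus two execution cycles per stage.
nsparcl_2d = cycles_per_pixel([1 + 2] * 4, 4 * 4)

print(simd, nsparcl_1d, nsparcl_2d)  # 0.375 0.3125 0.1875
```

Under these assumptions the ordering agrees with the text: the 2D nSPARCL needs the fewest clock cycles per pixel and the conventional SIMD machine the most.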

B. Performance Comparison of SIMD and 1D nSPARCL Systems

We compare the performance of nSPARCL and conventional SIMD systems given that both have the same number of total processors in the system. Here we compare and discuss the 1D nSPARCL and conventional SIMD architectures in terms of processing time, scalability, bus utilization, flexible multiple speeds, and unbalanced bandwidth applications. Because loading and unloading occur simultaneously, we can hide the unloading latency and set $T_{\mathrm{unload}}$ to 0 for our simulations.

1. Comparison of Processing Times

Assume that each of the n chips in the 1D nSPARCL system has an array size of p × q. The equivalent SIMD system has a total size of npq (we assume that this is equivalent in size to the P × Q blocks discussed above). Both SIMD and 1D nSPARCL systems have the same bus bandwidth of p bits per second. The total processing time needed for the SIMD system is

$$T_{\mathrm{SIMD}} = \left(\frac{N^2}{npq}\right)(nq + T_{\mathrm{exe}}), \qquad (7)$$

where $(N^2/npq)$ represents the number of blocks to be processed and $(nq + T_{\mathrm{exe}})$ represents the loading time and the execution time required for each block. Note that the SIMD requires separate time slots for loading and execution.

For the 1D nSPARCL system we normally use the first and last chips as input and output devices for data I/O. The intermediate n − 2 chips execute the data processing instructions. The total processing time needed for the 1D nSPARCL system is then

$$T_{\mathrm{1D\ nSPARCL}} =
\begin{cases}
\dfrac{N^2}{pq}\,(q + 1), & \dfrac{T_{\mathrm{exe}}}{n - 2} \le q, \\[2ex]
\dfrac{N^2}{pq}\left(\dfrac{T_{\mathrm{exe}} + 2q + n}{n}\right), & \dfrac{T_{\mathrm{exe}}}{n - 2} > q.
\end{cases} \tag{8}$$

There are two cases of 1D nSPARCL system operation when tasks with different lengths of instructions are applied. For the first case, or Texe/(n − 2) ≤ q, the intermediate SPARCL's finish processing before the first chip finishes loading data. Thus the total processing time is dominated by the loading time and is independent of Texe. For the other case, or Texe/(n − 2) > q, all n stages are used to run the processing routine after the first stage is finished loading. Thus the workloads are spread properly over n stages.
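The two cases of Eq. (8) can be expressed as a short piecewise function. This is a minimal sketch; the name `t_1d_nsparcl` and the sample parameters are assumptions made for illustration.

```python
def t_1d_nsparcl(N, n, p, q, t_exe):
    """Total 1D nSPARCL processing time in clock cycles, following Eq. (8).

    Assumes n >= 3: the first and last chips handle I/O and the
    remaining n - 2 chips share the execution instructions.
    """
    num_blocks = N * N / (p * q)          # blocks of p*q pixels in an N x N image
    if t_exe / (n - 2) <= q:
        # Load-dominated case: the result does not depend on t_exe.
        return num_blocks * (q + 1)
    # Execution-dominated case: the work is spread over all n stages.
    return num_blocks * (t_exe + 2 * q + n) / n


N, n, p, q = 512, 4, 5, 10
boundary = q * (n - 2)                    # t_exe at which the two cases meet
print(t_1d_nsparcl(N, n, p, q, boundary))
print(t_1d_nsparcl(N, n, p, q, boundary + 1))
```

At Texe = q(n − 2) the second expression reduces to (qn + n)/n = q + 1, so the two branches join continuously.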

To compare the performance between 1D nSPARCL and conventional SIMD systems we investigate the ratio of total processing time for 1D nSPARCL systems to that of the equivalent SIMD systems for the same data I/O bandwidth at different lengths of instruction sets, as shown in Fig. 9. 1D nSPARCL's with stage numbers n = 3, 4, 9, 25 are simulated. For the simulation, the image size is 512 × 512 pixels and each SPARCL system is a 5 × 10 processing array (p = 5, q = 10). When Texe is small and Texe ≪ q(n − 2) the loading time dominates the total processing time, and both systems perform roughly the same. In fact, when the instruction set is very small (Texe < number of nSPARCL stages), the extra steps of optically transferring data between SPARCL stages reduce the SPARCL system efficiency. As Texe increases to be approximately equal to the loading time, the SIMD data I/O bus must stop frequently when the array is loaded with the data block and is busy executing instructions. On the other hand, the nSPARCL system moves loaded data blocks down the multistage system pipeline for processing instead of halting data I/O. The nSPARCL performance is optimized when the distributed execution time equals the system depth, because the loading time and the execution time are balanced. For Texe > q(n − 2) the execution time plays an increasingly more critical role in the system performance than does the data I/O. When Texe becomes much greater than q(n − 2), both systems are dominated by the time required for executing processing instructions, and the loading and unloading times become insignificant.
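Dividing Eq. (8) by Eq. (7) makes the three regimes described above explicit. The following is a derived sketch rather than a formula stated in the original analysis; the symbol R for the ratio is introduced here for illustration only.

```latex
R = \frac{T_{\mathrm{1D\ nSPARCL}}}{T_{\mathrm{SIMD}}} =
\begin{cases}
\dfrac{n(q + 1)}{nq + T_{\mathrm{exe}}}, & \dfrac{T_{\mathrm{exe}}}{n - 2} \le q, \\[2ex]
\dfrac{T_{\mathrm{exe}} + 2q + n}{nq + T_{\mathrm{exe}}}, & \dfrac{T_{\mathrm{exe}}}{n - 2} > q.
\end{cases}
```

The factor N²/pq cancels, so under these two equations the ratio does not depend on the image size. At Texe = 0 the ratio is (q + 1)/q, slightly above 1, which reflects the extra transfer step. For the parameters used here it falls to its minimum, n(q + 1)/[2q(n − 1)], at Texe = q(n − 2) and then rises toward 1 as Texe grows. With q = 10 the minimum is 0.825 for n = 3 and approximately 0.573 for n = 25.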

2. Scalability of the 1D nSPARCL System

Here we compare the processing time for 1D nSPARCL systems with 3, 4, 9, and 25 stages, as shown in Fig. 9. The optimum ratio of processing time decreases as the number of stages increases. In general, the performance of the 1D nSPARCL system scales up as the size of the system increases, where the system size is defined as the number of PE's of the system. As the problem size (= number of instructions per SIMD algorithm) increases, we can improve the system performance by increasing the size of the nSPARCL tailored to the size of the problem. However, a larger nSPARCL system is not always better than a smaller nSPARCL system for any size of problem. For a fixed-size problem the matched size of the 1D nSPARCL system would optimize the efficiency. From the example shown in Fig. 9, for problems that need fewer than 10 instruction cycles, n = 3 nSPARCL is better than any n ≥ 4 nSPARCL. For problems that need more than 10 and fewer than 20 instruction cycles, n = 4 nSPARCL is better than n = 3 and n ≥ 5 nSPARCL systems. In summary, for a problem that needs an instruction set of Texe cycles, the best number of 1D nSPARCL stages nopt that optimizes the efficiency is

$$n_{\mathrm{opt}} = \left\lceil \frac{T_{\mathrm{exe}}}{T_{\mathrm{load}}} \right\rceil + 2, \tag{9}$$

where ⌈·⌉ represents the next-larger integer.

Fig. 9. Comparison of ratio of total processing times of n = 3, 4, 9, 25 nSPARCL's and equivalent SIMD systems plotted against the time needed for processing operations, Texe.

On the other hand, in a parallel-processing system with multiple users it is also desirable for users to be able to share the processors.15 Because of the single-instruction nature of SIMD systems, complicated mechanisms are needed to handle the scheduling. In contrast, the multistage 1D nSPARCL is indeed a multiple-instruction multiple-data system in that different SPARCL stages are able to execute different sets of instructions. It is easy to partition the multistage 1D nSPARCL system into two or more subsystems in terms of SPARCL stages. Each subsystem is an independent SIMD system, running application programs from different end users. With predictions of problem size and instruction length, we can also assign the optimum number of SPARCL stages to a subsystem dynamically according to Eq. (9) and optimize the processing efficiency individually.
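The stage-count rule that such dynamic assignment relies on, nopt = ⌈Texe/Tload⌉ + 2, can be sketched in a few lines. The function name `optimal_stages` is hypothetical, and the lower bound of three stages (two I/O chips plus one processing chip) is an added assumption for the degenerate case Texe = 0.

```python
def optimal_stages(t_exe, t_load):
    """Best number of 1D nSPARCL stages: ceil(t_exe / t_load) + 2.

    Both arguments are integer cycle counts with t_load > 0. The floor
    of 3 stages is an assumption added here for t_exe = 0.
    """
    ceil_ratio = -(-t_exe // t_load)      # integer ceiling of t_exe / t_load
    return max(3, ceil_ratio + 2)


# With t_load = q = 10 cycles, as for the simulated 5 x 10 array:
for t_exe in (5, 15, 60):
    print(t_exe, optimal_stages(t_exe, t_load=10))
```

For Tload = 10 this yields three stages for tasks of up to 10 instruction cycles and four stages for tasks of 11 to 20 cycles, in line with the Fig. 9 example.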

3. Comparison of Bus Utilization

We can also approach the comparison of 1D nSPARCL and conventional SIMD systems from the bus utilization of the two systems for different cases of Texe and Tload. The bus utilization is defined as the volume of data flowing through the bus interface of the processor array and external devices over a period of time. Because of equilibrium, the utilization of the input bus and the output bus should be equal. Bus utilization represents the data throughput rate of the system and is therefore a good measure of the system performance. The bus utilization ratio of 1D nSPARCL is compared in Fig. 10 with that of the SIMD system at different relative values of Texe/Tload. The nSPARCL has the greatest advantage over the SIMD architecture when Texe ≈ Tload. This illustrates the ability of the 1D nSPARCL system to utilize the bus bandwidth better by moving loaded blocks to open SPARCL processor arrays in the pipeline. It also shows the scalability advantage of the nSPARCL system over its equivalent SIMD machine. Given a task with a certain length of instructions, we can scale up the stages of the SPARCL system properly such that Texe ≈ q(n − 2) and the data throughput rate is optimized.
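One way to make the utilization comparison concrete is to normalize the data volume by the bus capacity. The sketch below assumes one bit per pixel and a bus that carries p bits per clock cycle, so that utilization is N²/(p·T) with T taken from Eq. (7) or Eq. (8). This normalization and the function names are assumptions made for illustration and may differ from the measure plotted in Fig. 10.

```python
def bus_utilization_simd(n, q, t_exe):
    """Fraction of cycles in which the SIMD bus carries data (from Eq. (7))."""
    return n * q / (n * q + t_exe)


def bus_utilization_1d(n, q, t_exe):
    """Same fraction for the 1D nSPARCL (from Eq. (8)); assumes n >= 3."""
    if t_exe / (n - 2) <= q:
        return q / (q + 1)
    return n * q / (t_exe + 2 * q + n)


q = 10
for n in (3, 4, 9, 25):
    t_exe = q * (n - 2)                   # balanced point, t_exe = q(n - 2)
    gain = bus_utilization_1d(n, q, t_exe) / bus_utilization_simd(n, q, t_exe)
    print(n, round(gain, 3))
```

Under these assumptions the utilization gain at the balanced point grows with the number of stages, which is consistent with the scalability argument above.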


4. Hybrid Speeds with the 1D nSPARCL System

At the system level, a multistage nSPARCL also offers opportunities for high-speed data I/O. In a VLSI chip, electrical signals enter and leave through electrical I/O pads at the side of the chip. In practice, because of the off-chip parasitics from the package and the printed circuit board, the off-chip clock suffers from limited signal bandwidth. To overcome this problem it is common practice to have VLSI chips designed with slow (tens of megahertz) off-chip clocks synchronized to the high-speed on-chip clocks with on-chip phase-locked loop circuitry. However, for the SIMD array I/O bottleneck that we have discussed, doing this helps only to shorten the execution time on-chip but not the loading–unloading time. The fundamental problem of data element traffic still exists. Although some special VLSI components fabricated in GaAs can have higher-speed I/O, the design of such VLSI's may be more difficult and less dense than that of CMOS. On the other hand, SPARCL with optical interconnects offers opportunities at the system level to avoid these problems. It is possible for the multistage nSPARCL system to have multiple off-chip speeds. In the system we can dedicate high-speed chips (e.g., GaAs) for the first and the last stages of the system for data I/O. The loaded data elements are then transferred to the following stages optically down the pipe for processing at a high-speed on-chip clock.

Fig. 10. Comparison of bus utilization of n = 3, 4, 9, 25 nSPARCL's and equivalent SIMD systems.

5. 1D nSPARCL for Bandwidth Unbalanced Applications

The 1D nSPARCL has a basic internal chip-to-chip bandwidth of O(N²) and an external I/O bandwidth of O(N). Because of this bandwidth mismatch, applying the system to general-purpose problems takes a certain amount of effort. However, the system nicely fits problems that require only modest external bandwidths [O(N)] and internal bandwidths of O(N²). For example, matrix-vector multiplication of an on-chip N × N matrix and an off-chip N-element vector is an application that requires only O(N) bandwidth externally and O(N²) bandwidth internally. To do this multiplication we have the matrix residing in the second chip and load the N-element vector from the first chip in column parallel. Every data element of the vector is then broadcast to a row of the matrix in 1-to-N fanout. Another special example meeting these conditions is the video motion estimation described above. In the search for the best-matched block, the desired data block resides on the second chip and the search area scrolls over the first chip in column parallel. Every time that one column of the search area is loaded to the first chip, every column in the chip shifts laterally one column to the side and creates a new array of O(N²) internally for the matching operation. There are other systems (e.g., neural networks) that have this type of unbalanced external–internal traffic and are suited to the 1D nSPARCL architecture.
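The matrix-vector example can be illustrated in software by counting external transfers and internal products separately. This is a sketch only: the orientation (element i of the vector fanned out to row i, with products summed down each column) is an assumption, and `broadcast_multiply` is a hypothetical name rather than anything defined for the SPARCL hardware.

```python
def broadcast_multiply(matrix, vector):
    """Multiply an on-chip N x N matrix by an off-chip N-element vector.

    Assumed orientation: vector[i] is fanned out to all N elements of
    row i, and the products are summed down each column.
    Returns (result, external_words, internal_products).
    """
    size = len(vector)
    external_words = 0
    internal_products = 0
    result = [0] * size
    for i in range(size):
        x = vector[i]                     # one element crosses the external bus
        external_words += 1
        for j in range(size):             # 1-to-N fanout across row i
            result[j] += x * matrix[i][j]
            internal_products += 1
    return result, external_words, internal_products


matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
vector = [1, 0, 2]
result, ext, internal = broadcast_multiply(matrix, vector)
print(result, ext, internal)
```

For N = 3 the sketch moves three words across the external interface while performing nine internal products, which is the O(N) versus O(N²) imbalance that suits the 1D nSPARCL.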

C. Performance Comparison of SIMD, 1D nSPARCL, and 2D nSPARCL Systems

Integrated with input and output devices that support 2D parallel I/O's, the nSPARCL system can load and unload an entire p × q data block in a single clock cycle. The same technology used to create SPARCL can be used to make dense memory chips, data buffers, video relay systems, and network interface devices.22–24 For this system the total processing time becomes

$$T_{\mathrm{2D\ nSPARCL}} = \left(\frac{N^2}{pq}\right)\left(1 + \frac{T_{\mathrm{exe}}}{n}\right). \tag{10}$$

For each block the loading time is always a constant of 1 because it requires only a single clock cycle for loading and unloading, and the execution time is Texe/n because the execution instructions are shared evenly over n stages. Figure 11 compares the processing speed Spr of 1D nSPARCL, 2D nSPARCL, and SIMD systems when a 256 × 256 image is processed over various numbers of PE's up to 256 (= 16 × 16 array). In this simulation the same number of PE's are used for all three systems, and they exercise the same task with an instruction length of 20 clock cycles. The four-stage case is assumed for both 1D and 2D nSPARCL's. In the simulation result, both the conventional SIMD and the 1D nSPARCL processing speeds Spr tend to saturate as the number of PE's increases. This is so because the loading time dominates the system as the array size grows too large. In contrast, the processing speed of a 2D nSPARCL increases linearly with the number of PE's because the I/O bottleneck is eliminated and all the processors are dedicated to performing the application routine.

Fig. 11. Comparison of processing speed (in terms of pixels/clock cycle) with the number of processing elements for SIMD, 1D nSPARCL, and 2D nSPARCL systems. The 2D nSPARCL eliminates the data I/O bottleneck by performing 2D parallel data I/O with input and output devices.
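The saturation and linear-growth behavior can be read directly from Eqs. (7), (8), and (10) by writing the processing speed as the N² pixels of the image divided by the total processing time. The expressions below are a derived sketch under that definition; the superscripted symbols are introduced here for illustration and do not appear in the original analysis.

```latex
\begin{aligned}
S_{\mathrm{pr}}^{\mathrm{SIMD}} &= \frac{N^2}{T_{\mathrm{SIMD}}}
  = \frac{npq}{nq + T_{\mathrm{exe}}} \le p, \\[1ex]
S_{\mathrm{pr}}^{\mathrm{1D}} &= \frac{N^2}{T_{\mathrm{1D\ nSPARCL}}}
  \le \frac{pq}{q + 1} < p, \\[1ex]
S_{\mathrm{pr}}^{\mathrm{2D}} &= \frac{N^2}{T_{\mathrm{2D\ nSPARCL}}}
  = \frac{npq}{n + T_{\mathrm{exe}}}.
\end{aligned}
```

The first two speeds are bounded by the bus width p no matter how many PE's are added, because in the second case of Eq. (8) the condition Texe > q(n − 2) implies Texe + 2q + n > n(q + 1). The third speed is proportional to the total PE count npq for fixed n and Texe; with n = 4 and Texe = 20 it equals npq/24 pixels per clock cycle.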

A commonly cited advantage of SIMD systems is their scaling properties. The larger the SIMD array, the more data elements can be processed simultaneously. Decreasing VLSI feature sizes allows for higher-density PE implementation and thus for larger processing arrays per chip. Ideally the processing speed per chip, defined as the number of data elements processed divided by the processing time, increases linearly with the processing array size. However, because the processing time includes the time required for loading the data into the processing array, processing the data, and then unloading the data, the processing speed is also sensitive to the data I/O bandwidth of the chip. The fundamental problem of data I/O in conventional SIMD systems is the 2D nature of the processing array and the 1D nature of the data I/O ports of electronic buses. Ideally the computation bandwidth increases proportionally to the processor array size. However, 2D data fields enter the processing array in a row-parallel format along the edge of the array and flow into the array on the mesh network. As a result, as the PE array size grows in O(N²), the I/O bandwidth grows only in O(N). This causes an I/O bottleneck as the PE array size grows. Consequently, it greatly reduces the overall system throughput and limits the SIMD system array size. The 1D nSPARCL deals with the problem by hiding the memory latency by prefetching. However, this helps only when the lengths of the loading–unloading cycles and the execution cycles are comparable. As the PE array size grows, the length of the loading–unloading cycle becomes much larger than that of the execution cycle. The data I/O overwhelms the system, and the memory latency dominates the system performance as well. This occurs because of the fundamental limit of the I/O bandwidth. On the other hand, the 2D nSPARCL I/O bandwidth grows in O(N²) and thus scales well with the size of the PE array.

So far we have compared the systems under the assumption of the same number of PE's. On the other hand, considering the fact that the yield of a VLSI chip decreases as the die size increases, it would be difficult to build a large SIMD chip. As the SPARCL system decomposes a large SIMD array into multiple SIMD stages, it presents an opportunity for building a multiprocessor system with a large number of PE's distributed over several stages.



6. Conclusion

We have described an optoelectronic VLSI architecture for a SIMD computing system, the SPARCL. The device uses novel hybrid CMOS–MQW smart-pixel technology. We constructed an experimental system for testing the devices as well as for demonstrating the system. This prototype system utilizes BIA for general-purpose morphological image processing. We have demonstrated applications of the system to image edge detection and estimation of digital video motion. We compared the performance of the conventional SIMD machine and 1D and 2D nSPARCL systems under the assumption that the total number of PE's in the systems was the same. The results illustrate that, given the same task, the nSPARCL system outperforms the SIMD system in terms of processing time, bus utilization, and processing speed. The nSPARCL system also has many system aspect advantages in scalability to optimize the computation efficiency, flexibility in hybrid speeds and multiple-instruction systems, and utility for construction of large-number PE systems. The optoelectronic VLSI technology has the potential to improve the performance of multiprocessor computing systems significantly. However, major efforts are still needed for the integration of efficient, reliable, and cost-effective systems.

The authors thank Lily Cheng for help with chip design; Allan G. Weber for help with circuit board packaging; and Matt Derstine and Sue Wakelin of Optical Networks, Inc., for help with optomechanical packaging and tunable external-cavity laser diode sources. This study was supported by the Joint Services Electronics Program through the U.S. Air Force Office of Scientific Research under contract F49620-97-10238; by the National Center for Integrated Photonic Technology program funded by the Defense Advanced Research Projects Agency under contract MDA972-94-1-0001; and by the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, with additional support from the Annenberg Center for Communication of the University of Southern California and the California Trade and Commerce Agency.

References

1. A. A. Sawchuk, “Smart pixel devices and free-space digital optics applications,” in Proceedings of 1995 IEEE/LEOS Annual Meeting (Institute of Electrical and Electronics Engineers, Piscataway, N.J., 1995), pp. 268–269.

2. Semiconductor Industry Association, The National Technology Roadmap for Semiconductors (Sematech, Inc., San Jose, Calif., 1997).

3. K. W. Goossen, J. A. Walker, L. A. D’Asaro, S. P. Hui, B. Tseng, R. Leibenguth, D. Kossives, D. D. Bacon, D. Dahringer, L. M. F. Chirovsky, A. L. Lentine, and D. A. B. Miller, “GaAs MQW modulators integrated with silicon CMOS,” IEEE Photon. Technol. Lett. 7, 360–362 (1995).

4. A. V. Krishnamoorthy, A. L. Lentine, K. W. Goossen, J. A. Walker, T. K. Woodward, J. E. Ford, G. F. Aplin, L. A. D’Asaro, S. P. Hui, and B. Tseng, “3-D integration of MQW modulators over active sub-micron CMOS circuits: 375 Mb/s transimpedance receiver–transmitter circuit,” IEEE Photon. Technol. Lett. 7, 1288–1290 (1995).

5. T. K. Woodward, A. V. Krishnamoorthy, A. L. Lentine, K. W. Goossen, J. A. Walker, J. E. Cunningham, W. Y. Jan, L. A. D’Asaro, and L. M. F. Chirovsky, “1-Gb/s two-beam transimpedance smart pixel optical receivers made from hybrid GaAs MQW modulators bonded to 0.8 μm silicon CMOS,” IEEE Photon. Technol. Lett. 8, 422–424 (1996).

6. A. V. Krishnamoorthy and K. W. Goossen, “Progress in optoelectronic-VLSI smart pixel technology based on GaAs/AlGaAs MQW modulators,” Int. J. Optoelectron. 11, 181–198 (1997).

7. K.-S. Huang, B. K. Jenkins, and A. A. Sawchuk, “Image algebra representation of parallel optical binary arithmetic,” Appl. Opt. 28, 1263–1278 (1989).

8. K.-S. Huang, A. A. Sawchuk, B. K. Jenkins, P. Chavel, J.-M. Wang, A. G. Weber, C.-H. Wang, and I. Glaser, “Digital optical cellular image processor (DOCIP): experimental implementation,” Appl. Opt. 32, 166–173 (1993).

9. C. B. Kuznia, J.-M. Wu, C.-H. Chen, A. A. Sawchuk, and L. Cheng, “Hybrid CMOS/SEED smart pixel array for 2D parallel pipeline operations,” in Digest IEEE/LEOS 1996 Summer Topical Meetings: Smart Pixels (Institute of Electrical and Electronics Engineers, Piscataway, N.J., 1996), pp. 80–81.

10. C. B. Kuznia, J.-M. Wu, C.-H. Chen, B. Hoanca, L. Cheng, A. G. Weber, and A. A. Sawchuk, “Two-dimensional parallel pipeline processing with smart pixel array cellular logic (SPARCL) processors: system implementation,” submitted to J. Lightwave Technol.

11. P. Maragos and R. Shafer, “Morphological systems for multidimensional signal processing,” Proc. IEEE 78, 690–709 (1990).

12. D. Le Gall, “MPEG: a video compression standard for multimedia applications,” Commun. ACM 34, 46–58 (1991).

13. A. Broggi and F. Gregoretti, “Performance evaluation and optimization in low-cost cellular SIMD systems,” Microprocess. Microprogramm. 41, 659–678 (1996).

14. K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability (McGraw-Hill, New York, 1994).

15. J. M. del Rosario and A. K. Choudhary, “High-performance I/O for massively parallel computers: problems and prospects,” IEEE Comput. 27(3), 59–68 (1994).

16. J. D. Allen and D. E. Schimmel, “Issues in the design of high performance SIMD architectures,” IEEE Trans. Parallel Distr. Syst. 7, 818–829 (1996).

17. J. Goodenough, R. J. Meacham, J. D. Morris, N. L. Seed, and P. A. Ivey, “A single chip video signal processing architecture for image processing, coding, and computer vision,” IEEE Trans. Circ. Syst. Video Technol. 5, 436–445 (1995).

18. S. Okazaki, Y. Fujita, and N. Yamashita, “A compact real-time vision system using integrated memory array processor architecture,” IEEE Trans. Circ. Syst. Video Technol. 5, 446–452 (1995).

19. H. D. Santos, J. C. Ramalho, J. M. Fernandes, and A. J. Proenca, “A heterogeneous computer vision architecture: implementation issues,” Comput. Syst. Eng. 6, 401–408 (1995).

20. J.-M. Wu, C. B. Kuznia, B. Hoanca, C.-H. Chen, L. Cheng, A. G. Weber, and A. A. Sawchuk, “Smart pixel array cellular logic (SPARCL) processor for eliminating SIMD I/O bottlenecks: system demonstration and performance scaling,” in Optics in Computing, Vol. 8 of 1997 OSA Technical Digest Series (Optical Society of America, Washington, D.C., 1997), pp. 152–154.

21. J.-M. Wu, C. B. Kuznia, B. Hoanca, C.-H. Chen, and A. A. Sawchuk, “Integration of CMOS/MQW smart pixel array cellular logic (SPARCL) processors for SIMD parallel pipeline processing,” presented at the 1997 North American Chinese Photonics Technology Conference, Los Angeles, Calif., 17–19 October 1997.

22. A. A. Sawchuk, “Optoelectronic memory applications for VCSEL-based smart pixels,” in Proceedings, IEEE Lasers and Electro-Optics Society 1997 Annual Meeting (Institute of Electrical and Electronics Engineers, Piscataway, N.J., 1997), pp. 149–150.

23. A. V. Krishnamoorthy, R. G. Rozier, J. E. Ford, and F. E. Kiamilev, “Demonstration of a CMOS static RAM chip with high-speed optical read and write,” in Spatial Light Modulators, G. Burdge and S. Esener, eds., Vol. 14 of OSA Trends in Optics and Photonics Series (Optical Society of America, Washington, D.C., 1997), pp. 23–26.

24. F. E. Kiamilev and R. G. Rozier, “Design of optoelectronic-VLSI ICs for optically accessed SRAMs,” in Spatial Light Modulators, G. Burdge and S. Esener, eds., Vol. 14 of OSA Trends in Optics and Photonics Series (Optical Society of America, Washington, D.C., 1997), pp. 11–13.

