
PM4-A reconfigurable multiprocessor system for pattern recognition and image processing*

by FAYE A. BRIGGS, KING-SUN FU, KAI HWANG and JANAK H. PATEL
Purdue University, West Lafayette, Indiana

MOTIVATION AND OBJECTIVES

Pictorial information is often described by digitized arrays, syntactic (and semantic) strings and high-dimensional trees or graphs. The analysis and extraction of meaningful information from pictorial patterns by digital computers is called pictorial pattern analysis. Pattern analysis tasks require a wide variety of processing techniques and mathematical tools. In most machine intelligence systems, large computers are employed to process pictorial information. Because most image processing tasks require only repetitive Boolean operations or simple arithmetic operations defined over extremely large arrays of picture elements (pixels),1 the use of large computers with rigidly structured sequential or parallel processors may result in an intolerable waste of resources.2 For example, the array-structured ILLIAC IV3 and STARAN4 are efficient for processing fixed-length vectors, but are very inefficient for mixed scalar and vector operations, because multiple instruction streams do not exist simultaneously in these supercomputers.

In the application domain, explosive amounts of pictorial information need to be processed. For example, a single frame of LANDSAT imagery contains 30 million bytes of information, and it takes 13 such images to cover the state of Alabama. What is even more demanding is that an entirely new set of images is produced for the entire earth surface every nine days. Conventional parallel computers are not tailored for such large-scale image processing. A computer system is needed that maximizes the utilization of the parallelism embedded in repetitive image operations. A simple example may help to quantitatively justify such a need. Suppose that we are interested in performing the texture analysis of an image of size 500×500 pixels. A 10×10 pixel window size is selected. Assume that, on the average, ten assembly instructions are required to perform one texture analysis (neighborhood) operation. It then requires 500×500×10×10×10 = 2.5×10^8 instructions to analyze the whole image. For a computer system with one MIPS, it will take 2.5×10^8/10^6 = 250 sec. = 4.17 min. to perform each texture analysis operation on the whole image. An increase of

* This work is partially supported by the National Science Foundation Grant ENG 78-16970.


machine speed to 100 MIPS will reduce the time required to perform one texture analysis operation on the whole image to 2.5 seconds. Similar examples can easily be found in cluster analysis and statistical classification in high-dimensional pattern recognition problems.
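The arithmetic above can be checked directly; the figures below are exactly those stated in the text:

```python
# Workload estimate for one texture-analysis pass over the whole image,
# using the figures in the text: a 500x500 image, a 10x10 window, and
# ten instructions per neighborhood operation.
pixels = 500 * 500
window = 10 * 10
instructions_per_op = 10

total_instructions = pixels * window * instructions_per_op  # 2.5 x 10^8

seconds_at_1_mips = total_instructions / 1e6    # 250 s, about 4.17 min
seconds_at_100_mips = total_instructions / 1e8  # 2.5 s

print(total_instructions, seconds_at_1_mips, seconds_at_100_mips)
```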

To meet the needs of the 80s and beyond, a versatile computer system must be able to execute more than 100 MIPS with a memory bandwidth of 256 megabytes per second or greater. With rapidly advancing IC technology, it is now possible to consider the use of a large number of microprocessors as the processing elements of a computer system for pattern recognition. Such a system derives its high performance from the multiplicity of processing elements and the high level of concurrency of processing.

In this paper, we report a powerful computer system that is currently under development at the Advanced Automation Research Laboratory (AARL) of Purdue University. The system consists of hundreds of LSI bit-slice microprocessors with a large number of shared memory modules and flexible interconnection networks for efficient image processing and pattern recognition applications. The system is designed to reconfigure its resources under system control to assume four different operation modes: SIMD, MIMD,2 multiple SIMD and distributed mixed modes. Fast interactive I/O and a large image data base are incorporated into the system. A cost-effective system architecture and a wide range of applications are the main development concerns.

An overview of various existing special computer architectures for pattern information processing can be found in Fu.5 The system presented here offers a relatively new architectural configuration with high application flexibility and high system throughput at only a moderate system cost.

THE PM4 SYSTEM ARCHITECTURE

The architecture of the Purdue Multi-mode Multimicroprocessor (PM4) system grew from a consideration of existing system organizations like the C.mmp,6 ILLIAC IV and Cm*.7 We wanted a system that would reconfigure itself to execute MIMD or SIMD processes. In fact, the flexibility was extended so that the system can be partitioned into groups of processors which may be assigned to different

From the collection of the Computer History Museum (www.computerhistory.org)


256 National Computer Conference, 1979

SIMD processes. Hence, multiple SIMD (MSIMD) and MIMD processes can be in execution concurrently. Moreover, since we wanted dynamic system reconfiguration of resources and a high level of concurrency of processes, each processor is designed to handle such system requirements. The reconfiguration is mostly software controlled. This architecture differs in many respects from the restructurable computer system proposed in Reference 8 and the partitionable multiprocessor system discussed in Reference 9, as will be seen shortly. The architecture of the PM4 was configured by considering some of the major problems involving multiprocessor systems, as discussed in Reference 10.

The basic components of the PM4 consist of N identical Processor-Memory Units (PMU), K identical Vector Control Units (VCU), a three-level hierarchical memory connected by a set of interconnection networks, and memory management units. Figure 1 shows a block diagram of the PM4. We will give a brief description of each of the individual components and their interrelationship in the system.

[Figure 1-The PM4 architecture. Legend: MP = Monitor Processor; VCU = Vector Control Unit; PMU = Processor-Memory Unit; IPCN = Interprocessor Communication Network; SMMU = Shared Memory Management Unit; PMIN = Processor-Memory Interconnection Network; LRB = Line Request Buffer; SM = Shared Memory Module; FMCU = File Memory Control Unit.]

The Vector Control Units (VCU) are used in the SIMD mode of operation. Each of these units has a microproces­sor, a local memory (LM) which is managed by its own Local Memory Management Unit (LMMU) as shown in Figure 2a. The vector control instructions and program of an SIMD process are loaded into the VCU local memory prior to execution. When the SIMD process is ready-to-run, the VCU broadcasts instruction to all of the Processor­Memory Units (PMU) that are assigned to the SIMD proc­ess. The VCU may also send permutation function com­mands to the Interprocessor Communication Network (IPCN) to permute the data in a group of PMUs. Further­more, the VCU has the ability to mask or disable PM Us so that only the active or unmasked PMU s execute the broad­casted instructions.
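The broadcast-and-mask behavior can be sketched in software. The `PMU` class, accumulator register, mask vector and instruction format below are illustrative assumptions, not the actual hardware interface:

```python
# Toy model of a VCU broadcasting one instruction to a group of PMUs.
# Masked (disabled) PMUs ignore the broadcast; active ones execute it.

class PMU:
    def __init__(self, pmu_id, value):
        self.pmu_id = pmu_id
        self.acc = value  # stand-in for a local register

    def execute(self, opcode, operand):
        if opcode == "ADD":
            self.acc += operand

def vcu_broadcast(pmus, mask, opcode, operand):
    """Broadcast (opcode, operand); mask[i] == True means PMU i is active."""
    for pmu, active in zip(pmus, mask):
        if active:
            pmu.execute(opcode, operand)

pmus = [PMU(i, i) for i in range(8)]
vcu_broadcast(pmus, [True, False] * 4, "ADD", 10)  # even PMUs only
print([p.acc for p in pmus])
```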

[Figure 2a-Details of Vector Control Unit (VCU).]

Each of the PMUs consists of three functional units, namely, a microprocessor (P), a local memory (LM) and a local memory management unit (LMMU), as shown in Figure 2b. The LMMU in the PMU is similar to that in the VCU. Each local memory, which acts as a cache memory to its associated processor, is interleaved to allow it to meet the speed requirements of the high speed LMMU.

Both the VCU and the PMU operate under a virtual memory system and hence have hardware facilities to map virtual addresses to physical addresses. We have decided to implement each processor so that it can directly address all of shared memory.

The LMMU of the PMU is also used to load and unload the local memory of the PMU. Furthermore, it can act as a channel to transfer a block of shared memory to any VCU memory that is associated with that PMU. Each LMMU in a PMU or VCU handles the page replacement policy for its local memory. In both cases, the transfer may be initiated by a command from the processor of the PMU to its LMMU. A multiplexor is located between each PMU and the Vector Control busses to switch the signal path from any VCU to either the processor (P) or the LMMU of the PMU. Hence, the program for an SIMD process may be transferred from shared memory to the VCU's local memory through the LMMU of an assigned PMU. Moreover, during the execution of the SIMD process, the multiplexor can route the broadcasted instructions from a VCU to the PMU.

The Interprocessor Communications Network (IPCN) is used to implement the permutation functions needed during execution of SIMD processes. This network permits permutations of data from multiple SIMD processes, assigned to distinct subsets of PMUs, to be performed simultaneously. The IPCN is controlled by the VCUs over a time-shared bus and contains its own internal conflict resolution logic.

[Figure 2b-Details of Processor Memory Unit (PMU).]

The Shared Memory Management Unit (SMMU) is connected to each LMMU of a PMU via the Memory Management (MM) Bus. The SMMU controls the use of the shared memory by communicating with each LMMU or the File Memory Control Unit (FMCU) and effecting the appropriate page replacement policy in the shared memory.

The Processor-Memory Interconnection Network (PMIN) is used to transfer information between the shared memory and the LMMUs. Transfers are made in burst mode on this network. Hence, once a path through the network has been established, it is held (for the most part) until the transaction is completed. Further discussion of the PMIN is given in a later section.

The file memory control unit (FMCU) controls the transfer of information between the shared memory and the file memory. We defer the discussion of the shared memory to a later section.

For performance measurement purposes, we have incorporated a monitor processor (MP) to monitor the activities of the various modules of the system, as shown in Figure 1. The information collected will be used to determine the operating characteristics of the system.

We have incorporated fault-tolerance capabilities into the architecture of the PM4 by modularizing the system structure. This, for example, may permit a PMU or VCU to be logically isolated from the rest of the system once a fault is detected in the unit. The logical isolation of a unit will permit its diagnosis to be carried out without appreciably affecting system performance.

CHARACTERISTICS OF THE MICROPROCESSORS

The processors used in the multiprocessor system must have certain desirable characteristics in order to handle the multi-modal requirements of the PM4 system architecture. We have investigated the characteristics of existing LSI microprocessors such as the LSI-11, Intel 8086, Z8000 and Motorola 68000, only to find that they either do not meet our operating system requirements or fall somewhat short of our speed requirements. We will discuss some of the processor characteristics we find desirable in our system configuration. We feel that such processor requirements can be attained by using bit-slice microprocessors.

For swift reconfiguration of system resources, a processor must be capable of holding more than one active process state. Hence, the VCU and PMU are multiprogrammed. The degree of multiprogramming in the PMU is tentatively chosen to be four. Hence, the processor of a PMU should contain four register arrays. Each register array may be used to hold the state of an active process. For example, one array may be assigned to the kernel of the operating system while the other three are assigned to user MIMD and SIMD processes. A register array which is designated for an SIMD process when the PMU is allocated to a VCU will retain the state of the SIMD process until the PMU is deallocated. The coexistence of SIMD and MIMD processes in a PMU will permit efficient switching from a current SIMD process to an active MIMD process, and vice versa. Further studies will be needed to determine the degree of multiprogramming in the VCU required to maintain a high level of concurrency efficiently.

Each processor of a VCU or PMU has a Status Output Register (SOR), the contents of which can be read by any other processor. Traps and interrupts are also needed to handle fault and communication problems. For example, a page fault from either a VCU or a PMU should cause a page fault trap which will abort the execution of the current instruction. The instruction may be re-executed when the page fault condition is resolved. In the case when the trap condition occurs in a PMU which is assigned to a VCU for an SIMD process, the VCU is signalled to suspend its instruction broadcasting. Furthermore, all PMUs in that group will be interrupted to suspend the current SIMD process and switch to a ready-to-run MIMD process until the SIMD process is awakened. The context switching of processes can be performed, in this case, simply by modifying a Current Process Pointer (CPP) register in the PMU to point to a ready-to-run MIMD process whose process state is resident in a register array of the PMU.

Note that when an SIMD process is suspended due to, say, a page fault, the PMU group is still allocated to the suspended SIMD process. When the VCU is ready to resume its suspended process, it may do so by broadcasting an instruction to the allocated PMU group. If a PMU is within an instruction cycle of a current MIMD process when the vector instruction is broadcasted, it sets an internal VIP (Vector Instruction Present) flip-flop. Hence the broadcasting of an instruction may be asynchronous with respect to a group of PMUs. The VIP flip-flop is checked at the end of the instruction cycle if the PMU is allocated to an SIMD process. If the VIP is set, the CPP register may be modified to point to the state vector of the resumed SIMD process, which is resident in the processor of the PMU. Further, this puts the processor in the instruction fetch state in order to receive the broadcasted instruction.
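The CPP-based context switch can be sketched as follows. The four register arrays, the process names, and the fields of a process state are illustrative assumptions; the point is that a switch changes only the pointer, never the saved state:

```python
# Toy model of a PMU with four register arrays and a Current Process
# Pointer (CPP): a context switch just retargets the CPP at another array.

class PMU:
    def __init__(self):
        # Degree of multiprogramming = 4: four resident process states.
        self.register_arrays = [
            {"proc": "kernel", "pc": 0},
            {"proc": "simd",   "pc": 0},
            {"proc": "mimd-a", "pc": 0},
            {"proc": "mimd-b", "pc": 0},
        ]
        self.cpp = 1  # start by running the SIMD process

    def current(self):
        return self.register_arrays[self.cpp]

    def switch_to(self, name):
        """Context switch: no state copying, only the CPP changes."""
        self.cpp = next(i for i, r in enumerate(self.register_arrays)
                        if r["proc"] == name)

pmu = PMU()
pmu.current()["pc"] = 42   # the SIMD process makes some progress
pmu.switch_to("mimd-a")    # page fault suspends it; run an MIMD process
pmu.switch_to("simd")      # resume: the SIMD state is still intact
print(pmu.current()["pc"])
```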

When a processor of a PMU is executing broadcasted instructions in the SIMD mode, an instruction completion signal is sent to the associated VCU at the end of each instruction cycle. This permits the VCU to broadcast the next instruction to the PMUs. In general, instruction fetch and execution may be overlapped in both the VCU and PMU by prefetching instructions in both SIMD and MIMD modes. We are currently investigating other characteristics of the processors.

MEMORY HIERARCHY

The memory hierarchy consists of three levels of memory. The highest level consists of the local memories in the Vector Control Units (VCU) and the Processor-Memory Units (PMU). The next level is the shared memory, which is shared by all the processors in the system. The lowest level is the file memory, which is essential for the data base. Generally, the higher the level of the memory, the faster its speed, the higher its cost per byte and the smaller its capacity. Transfer of information between adjacent levels of memory in the hierarchy is entirely controlled by activities in the first level. The first level in this case consists of the set of VCUs and the set of PMUs. However, this does not imply that the memory access time for the local memory in a VCU is identical to that in a PMU.

One of the advantages of the hierarchical memory organization is that the working set11 of a process accumulates rapidly in the fastest level. Hence, accesses to memory words in a process are completed at nearly the speed of the local memories, but the total cost of the storage system approaches that of the lowest level. Another advantage is that the mechanism which effects the transfer of pages between adjacent levels of memory can be readily implemented with very little intervention by the operating system.

The local memory of a PMU or VCU acts as a "cache" memory for its local processor. The wide usage of cache memories has shown that a process' memory references tend to cluster in a small portion of its address space within a space-time window.12,13 Each local memory is logically partitioned into several pages and is large enough to hold the working sets of the several active processes. The local memory also has the appropriate hardware to keep track of the usage of each page. This information may be collected by the Local Memory Management Unit (LMMU) and used to implement the page replacement policy.

The shared memory uses a buffered version of the L-M memory organization studied by Briggs and Davidson.14 This memory consists of l memory lines and m memory modules on each line. A line refers to a bus within the shared memory. Each memory module has an address and data latch, so that the address cycle (hold time), a, is much shorter than the memory cycle time, c. Hence, if a = 1 and c = 4, four memory requests can be in different stages of service concurrently on the same line, thereby increasing the memory bandwidth without increasing the cost of the PMIN.

Multiple-access conflicts, which occur when simultaneous (parallel) requests reference the same line, are resolved in the PMIN as discussed in the next section.
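The overlap that the latches buy can be sketched with a toy timing model. The a = 1, c = 4 figures are those in the text; the scheduling function is an illustrative simplification that assumes the four requests address distinct modules:

```python
# Toy timing model of one shared-memory line in the L-M organization:
# address hold time a = 1, module cycle time c = 4, so up to c/a = 4
# requests to distinct modules can be in service on the line at once.

a, c = 1, 4

def schedule(requests):
    """requests: module numbers, one request issued every a time units.
    Returns (module, start, finish) triples, assuming distinct modules."""
    times = []
    for i, module in enumerate(requests):
        start = i * a  # the line is held only for the address cycle
        times.append((module, start, start + c))
    return times

for module, start, finish in schedule([0, 1, 2, 3]):
    print(f"module {module}: busy {start}..{finish}")
# At t = 3, all four modules are in different stages of their cycles.
```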

In our studies, we assume that the number of lines, l, is equal to the number of parallel PMUs, N, so that the cost of the processor-memory interconnection network (PMIN) is kept to a minimum with respect to l.

Each line has a Line Request Buffer (LRB) which buffers and resolves the conflicts of the memory requests made to modules on the same line. The LRB may subsequently issue a request to a referenced memory module which is idle, and initiate a return path through the PMIN for the referenced data. A memory request for a page can be made to the LRB, which will eventually generate the series of sequential accesses required to retrieve or store the requested block. Furthermore, the LRB may be interrupted by the File Memory Control Unit (FMCU) when it performs DMA (direct memory access) transfers between shared memory and file memory.

The performance of the L-M organization was discussed in Reference 15 for nonbuffered requests with random address references. The analysis of buffered requests is currently being investigated, but intuitively the implementation of block transfers on such an organization should result in a higher memory bandwidth than discussed in Reference 15.

The file memory is a very large data base and serves as backup for the programs and data in the system.

INTERCONNECTION NETWORKS

Several communication paths exist between different components of the PM4 system. A glance at the block diagram of Figure 1 shows the explicit connections between the vector control units (VCUs) and the processors. Other principal connections, shown simply as black boxes, are the interprocessor communication network (IPCN), the processor-memory interconnection network (PMIN), and the implied connection in the file memory control unit (FMCU). Of these three, the connection in the FMCU is the simplest and the least demanding. For this reason, and because of the slow transfer rates of the file memory, a single high-speed time-shared bus is chosen as the communication path between the shared memory and the file memory. The other two networks, namely the PMIN and the IPCN, are quite complex, and if their designs are not properly chosen these networks can either become the bottleneck of the system or become its most expensive parts.

[Figure 3-An 8×8 delta network.]

For the processor-memory interconnection network (PMIN) we have investigated the delta networks.16 These networks are easy to design and control. The networks use 2×2 crossbars as the basic building blocks. The logic for arbitration between conflicting requests is distributed throughout the network. A connection between a processor's LMMU and a shared-memory module is established at the request of the processor, which sends the address of the requested module on the control lines. This address acts as the pathfinder through the network, and the path is established locally at each 2×2 crossbar module. Each module requires a single bit from the address to establish a path; thus the control is completely distributed. An example of an 8×8 delta network is shown in Figure 3. The 2×2 modules used are sketched in Figure 4. The complexity of an N×N delta network grows as N log2 N

[Figure 4-Details of the 2×2 modules for delta networks. Each module connects its two inputs to its two outputs straight or crossed under control of a single destination-address bit x: if x = 0, input i0 connects to output I0 and i1 to I1; if x = 1, i0 connects to I1 and i1 to I0.]

as opposed to N^2 for a full crossbar. However, the bandwidth of the delta network, assuming completely random requests, is not substantially less than that of a crossbar. For example, for a delta network of size 256×256, the expected bandwidth is 77 requests per memory cycle; for a full crossbar of the same size the bandwidth is 162; however, the crossbar costs about 20 times as much as the delta network. Once a path between a processor and a memory module is established, words can be transferred at a continuous rate without any conflict. Thus, subsequent word transfers do not suffer the initial delay to establish a path, and, therefore, the effectiveness of a delta network is higher in block transfer mode than in single word transfer mode. We will study the use of delta networks for block transfers between the processor memory and the shared memory, where a block may consist of 64, 128, 256 or 512 bytes.
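The quoted figures can be reproduced from the standard random-request analysis of delta networks built from 2×2 switches, in which each stage thins the request rate by p' = 1 - (1 - p/2)^2; a sketch, with that recurrence as the only assumption:

```python
import math

def delta_bandwidth(n, p=1.0):
    """Expected requests accepted per memory cycle by an n x n delta
    network of 2x2 crossbars, assuming each processor issues an
    independent, uniformly random request every cycle."""
    for _ in range(int(math.log2(n))):  # log2(n) stages of 2x2 switches
        p = 1 - (1 - p / 2) ** 2        # per-stage thinning recurrence
    return n * p

def crossbar_bandwidth(n):
    """Expected acceptances of a full n x n crossbar under the same
    random-request model: n * (1 - (1 - 1/n)^n)."""
    return n * (1 - (1 - 1 / n) ** n)

print(round(delta_bandwidth(256)))     # 77 requests per memory cycle
print(round(crossbar_bandwidth(256)))  # 162
```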

The interprocessor communication network (IPCN) is still under investigation. When it is finally designed, it will have the following characteristics.

The network will be a low-cost, recirculating type. The design will be such that the permutations most frequently used in an SIMD environment can be generated in a single pass through the network. Other, less frequently used permutations may require several passes. Furthermore, the network will be partitionable into fixed-size blocks so that several small independent SIMD operations may be executed in parallel. For example, a network of size 64×64 can be partitioned into four networks of size 16×16, and each of these networks will be partitionable into two networks of size 8×8. To reduce cost and delay, arbitrary partitions will not be implemented. Again for cost reasons, the network will not necessarily be of the same dimension as the PM4 system. For example, if we build the system with 256 processors, we may have an IPCN of size 64×64. The design and cost-performance trade-offs of the IPCN will be reported at a later date.

PARALLEL PROGRAMMING LANGUAGES

In order to have an operational PM4 system, a multiple-mode operating system must be developed for the multiprocessor system. The resident UNIX system on the PDP-11/45 is currently under extensive revision at Purdue to handle the following four operation modes of PM4. Special high-level programming languages for parallel processing need to be developed, most probably by extension of the C programming language, concurrent PASCAL17 or APL. The parallel programming language will provide the user the power to exploit the full capacity of the system. This will include vector operations (SIMD mode) and two or more distinct vector operations in parallel (multiple SIMD mode). MIMD mode will permit concurrent execution of several scalar processes, and distributed mixed mode allows part of the PM4 system to operate as an SIMD computer and part as an MIMD computer. The user will not be burdened with the layout of the vectors in memory, or with the allocation, deallocation and synchronization of processors.

The four fundamental operation modes of PM4 and the corresponding user programming requirements are briefly described as follows:

1. SIMD Mode-Vector instructions with a Single Instruction stream and Multiple Data streams (SIMD) must be explicitly declared by the user's program. The compiler is responsible for the layout of vectors, and the VCU is responsible for broadcasting the instructions. The VCU executes control or non-vector instructions without passing them to the PMUs. In this mode, each SIMD vector statement is executed in parallel, but subsequent vector statements are executed sequentially. In other words, no multiple SIMD statements can be executed simultaneously, as the following example demonstrates.

Example 1: Consider the use of a 32-processor PM4 system for SIMD operations.

   Begin
      Integer Vector A, B, C [0:31];
      Real Vector X, Y, Z [0:31];
      Integer I, J;
      A ← B + C;
      X ← Y + (Z/2);
   End
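The execution model of Example 1 can be sketched in Python. The `vcu_broadcast` function and the lambda instructions below are hypothetical stand-ins for the VCU and the PMU arithmetic units; the point is that one instruction is issued and 32 data streams are processed in lockstep, with the two vector statements executed one after the other.

```python
P = 32  # number of allocated PMUs

def vcu_broadcast(op, *vectors):
    """Model of SIMD execution: the VCU broadcasts one instruction `op`;
    each of the P PMUs applies it to its own element of the vectors."""
    assert all(len(v) == P for v in vectors)
    return [op(*(v[i] for v in vectors)) for i in range(P)]

B = list(range(P))
C = list(range(P, 2 * P))
A = vcu_broadcast(lambda b, c: b + c, B, C)       # A <- B + C

Y = [float(i) for i in range(P)]
Z = [2.0 * i for i in range(P)]
X = vcu_broadcast(lambda y, z: y + z / 2, Y, Z)   # X <- Y + (Z/2)
```

Note that the second broadcast begins only after the first completes, matching the sequential execution of successive vector statements in this mode.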

2. Multiple SIMD Mode-In this mode, multiple SIMD operations are executed in parallel. With a 64-processor system, typical Multiple SIMD instructions may assume the following form. In this example, A, B and C can be considered arrays of 128 vectors each, where each vector has 16 elements. Similarly, X, Y and Z are arrays of vectors with 32 elements in each vector. The notation X[I,*] signifies a vector: X[I,0], X[I,1], ..., X[I,31].

Example 2:

   Begin
      Integer Vector A, B, C [0:127, 0:15];
      Real Vector X, Y, Z [0:127, 0:31];
      Parbegin
         For I = 0 until 127 do A[I,*] ← B[I,*] + C[I,*],
         For J = 0 until 127 do X[J,*] ← Y[J,*] + Z[J,*],
      Parend
   End

The two vector processes between the parbegin (parallel begin) and parend (parallel end) may be executed simultaneously by two VCUs in this mode.
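As a hedged illustration of Example 2, the two vector processes can be modeled as two concurrent workers, one per "VCU". The thread pool below only models the concurrency (on the real machine the two VCUs run truly in parallel); the data values and function names are our own choices, not the paper's.

```python
from concurrent.futures import ThreadPoolExecutor

# B, C: 128 vectors of 16 elements; Y, Z: 128 vectors of 32 elements.
B = [[i + j for j in range(16)] for i in range(128)]
C = [[1] * 16 for _ in range(128)]
Y = [[float(j) for j in range(32)] for _ in range(128)]
Z = [[2.0] * 32 for _ in range(128)]

def vector_process_1():
    # For I = 0 until 127 do A[I,*] <- B[I,*] + C[I,*]
    return [[b + c for b, c in zip(B[i], C[i])] for i in range(128)]

def vector_process_2():
    # For J = 0 until 127 do X[J,*] <- Y[J,*] + Z[J,*]
    return [[y + z for y, z in zip(Y[j], Z[j])] for j in range(128)]

# parbegin ... parend: both vector processes run concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(vector_process_1)
    f2 = pool.submit(vector_process_2)
    A, X = f1.result(), f2.result()
```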

3. MIMD Mode-Multiple Instruction streams and Multiple Data streams (MIMD) operations are the most generalized parallel programs. Each individual instruction stream must have a sequence of scalar operations. These parallel processes may be interdependent. System deadlock would be a major problem to be solved for MIMD operations. Vector instructions may not appear in strict MIMD mode, but may appear in the mixed mode described in 4.

Example 3:

   Parbegin
      Subprocess 1,
      Subprocess 2,
      ...
      Subprocess n,
   Parend

4. Distributed Mixed Mode-In this mode, SIMD vector instructions and parallel MIMD processes are simultaneously executed, as declared by the following statements.

Example 4:

   Parbegin
      Vector statement 1,  }
      Vector statement 2,  }  SIMD mode
      Subprocess 1,        }
      ...                  }  MIMD mode
      Subprocess n,        }
   Parend

The above operation modes are only the fundamental ones to be implemented. Many combinations of these modes are possible. Only after we implement the basic modes can we attempt more sophisticated operation modes to upgrade the system throughput and enhance its flexibility.

Special system control instructions must be developed to make the above operations possible. Listed below are several typical system command instructions that may be implemented in the system.

1. INITIALIZE-Set the program counters of allocated processors to specific values.

2. SYNCHRONIZE-Put the allocated PMUs in the WAIT or FETCH state.

3. Vector issue, mask, routing, etc.

4. Memory management, interrupts and I/O commands, etc.
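The first two commands can be sketched, purely illustratively, as operations on a PMU record. The `PMU` class and the Python function signatures are hypothetical; the paper does not specify the actual instruction formats, only the effects quoted in the comments.

```python
class PMU:
    """Minimal stand-in for a processor-memory unit's control state."""
    def __init__(self, ident):
        self.ident = ident
        self.pc = 0
        self.state = "IDLE"

def initialize(pmus, pc_values):
    """INITIALIZE: set the program counters of allocated processors
    to specific values."""
    for pmu, pc in zip(pmus, pc_values):
        pmu.pc = pc

def synchronize(pmus, state="WAIT"):
    """SYNCHRONIZE: put the allocated PMUs in the WAIT or FETCH state."""
    assert state in ("WAIT", "FETCH")
    for pmu in pmus:
        pmu.state = state

group = [PMU(i) for i in range(8)]
initialize(group, [0x100] * 8)   # all eight PMUs start at the same address
synchronize(group, "FETCH")
```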

For special parallel-processing applications, such as multiple-frame image processing or pattern classification, special programming or query languages must be developed to handle very large scale data bases. There always exists a trade-off between the complexity of the user programming language and the capabilities of the operating system.

OPERATING SYSTEM REQUIREMENTS

The operating strategy for the PM4 system has to be decided from the following choices:

1. Multiprogramming versus uniprogramming on vector control and processor-memory units.

2. Distributed versus dedicated processor operating system.


Based on the architectural features of PM4, a Distributed Multiprogramming Operating System (DMOS) is under development for the PM4 machine. The DMOS will be developed based on operating system design trade-offs of existing MIMD machines, such as C.mmp's Hydra18 and the Cm* operating system.19 We will also consider incorporating some operating system aspects of the multiple SIMD machines proposed by Nutt for the MAP system20 and the DIMO for the DAMP system discussed in Hwang and Ni.21

The DMOS is to handle all four operation modes described in the preceding section, with emphasis on the multiple SIMD and MIMD modes. We have considered the case in which the operating system is distributed over all PMUs. Each PMU contains a local kernel operating system which resides partly in its local memory and partly in shared memory. This kernel will be used for scheduling MIMD and SIMD processes, as well as for supervising the execution of MIMD and component vector processes.

In the DMOS, processes are scheduled to processors on the basis of processor availability and workload. In addition, the availability of other system resources demanded by the processes will influence the process scheduling algorithm. We have investigated a flexible vector (SIMD) process scheduling procedure. In this scheme, a PMU acting as a temporary master scheduler schedules the vector process to an available VCU. In the following illustration we assume that the VCUs are uniprogrammed systems and the PMUs are preemptible. Hence a vector process always has higher priority than a user's MIMD process in being assigned to a PMU.

Let us assume that a vector process which was allocated to a (VCUi, Ω_{j,k}) pair has just been completed. Ω_{j,k} is a set of 2^k consecutively numbered PMUs starting at PMUp, where p = j·2^k, for j = 0, 1, 2, ... and k a nonnegative integer. For example, Ω_{4,3} = {PMU32, PMU33, ..., PMU39}. Furthermore, assume that there exists a PMUm ∈ Ω_{j,k} which is the master for the Ω_{j,k} group of PMUs. The completion of the old vector process causes PMUm to search the vector process queues (VPQi's), which may reside in the SMMU, for a schedulable process. A vector process of vector size V is schedulable if it is independent or if all the dependencies for that process have been satisfied. The schedulable process is scheduled by PMUm to VCUi if there exists a set Ω_{α,β} of PMUs which is not currently allocated to an SIMD process and which consists of 2^β PMUs such that β = ⌈log2 V⌉. If the vector process is scheduled to VCUi, PMUm requests allocation of the processors in Ω_{α,β} for the new SIMD process by transmitting the request signal to the processors via the MM bus, as shown in Figure 2. Of course, such a scheduling process would be implemented as a critical section.22
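The group-allocation rule above can be sketched as follows. The `omega` function is a direct transcription of the Ω_{j,k} definition; `schedule_vector`, the `busy` set, and the linear search over j are our illustrative choices (the MM bus signaling, VPQ bookkeeping and critical-section locking are omitted).

```python
from math import ceil, log2

def omega(j, k):
    """Omega(j, k): the 2**k consecutively numbered PMUs starting at
    PMU p, where p = j * 2**k."""
    p = j * 2 ** k
    return list(range(p, p + 2 ** k))

def schedule_vector(v, busy, n_pmus=64):
    """For a vector of size v, find a free group of 2**beta PMUs with
    beta = ceil(log2 v), or return None (the process stays in its VPQ)."""
    beta = ceil(log2(v)) if v > 1 else 0
    size = 2 ** beta
    for j in range(n_pmus // size):
        group = omega(j, beta)
        if not any(p in busy for p in group):
            return group
    return None
```

For example, `omega(4, 3)` is the group {PMU32, ..., PMU39}, and a vector of size 9 needs a free, aligned group of 2^⌈log2 9⌉ = 16 PMUs.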

The PMUm also gives the processors in Ω_{α,β} the necessary information to load the multiple data streams required for the SIMD process into their respective local memories. While these PMUs are being loaded with their data, the PMUm may also initiate the loading of the program segment of the SIMD process, through its LMMU, into the VCUi's local memory. After the loading of VCUi is complete, the PMUm


262 National Computer Conference, 1979

[Figure 5-Queuing model for performance evaluation of scheduling disciplines: vector process arrivals feed the vector process queues VPQ1 through VPQn, and scalar (MIMD) process arrivals feed the MIMD process queue SPQ.]

may assign any PMUn in Ω_{α,β} as the new master of Ω_{α,β}. Henceforth, the current master, PMUn, will coordinate and monitor the activities of VCUi and handle page fault traps from VCUi. Notice that the loading of the multiple data streams in Ω_{α,β} may be overlapped with MIMD process execution in these processors. Now the new pair (VCUi, Ω_{α,β}) is allocated to execute the new SIMD process.

In case a vector process is not scheduled because of the unavailability of an appropriately sized Ω group of PMUs, the master of the next set of PMUs released may check the vector process queue for a schedulable vector process. However, if the VP queues are empty, any PMU not currently assigned to an SIMD process may periodically check the VP queues for a process. Alternatively, when an empty VP queue becomes non-empty, it may signal PMUs not assigned to an SIMD process for service.

Figure 5 shows a typical queuing model which may be used to study the performance of various scheduling disciplines for SIMD and MIMD processes in the PM4. In this diagram, VPQi is a queue which buffers vector processes that require a set of 2^βi PMUs for their execution. SPQ is a queue which buffers MIMD processes. Each process may carry tags that indicate its dependency on another process.

Control and scalar instructions in an SIMD process are executed directly by the VCU with no need to broadcast them to the PMUs. However, some information may be broadcast to the PMUs during such executions to inform them of current activities in the VCU. During the execution of a sequence of vector instructions, the VCU may

issue a MASKing instruction to select the necessary subset of PMUs from the PMU group allocated to the VCU. Only the masked (enabled) PMUs will execute the broadcast instructions, while the remaining allocated PMUs can continue executing their resident MIMD processes. A DEALLOCATE or RELEASE instruction is needed to release part or all of the allocated PMUs when the SIMD process or subprocess is completed.
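The masking behavior can be sketched as follows. The `broadcast` function, `mask` list and `mimd_step` callback are hypothetical names for illustration: a masked PMU executes the broadcast instruction on its own datum, while an unmasked PMU instead advances its resident MIMD process by one step.

```python
def broadcast(op, data, mask, mimd_step):
    """Each PMU i applies the broadcast instruction `op` to data[i] if
    mask[i] is set; otherwise it runs one step of its own MIMD process."""
    out = []
    for i, x in enumerate(data):
        out.append(op(x) if mask[i] else mimd_step(i, x))
    return out

data = list(range(8))
mask = [i % 2 == 0 for i in range(8)]          # enable the even-numbered PMUs
result = broadcast(lambda x: x * 10, data, mask,
                   mimd_step=lambda i, x: x)   # unmasked PMUs left unchanged here
```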

Figure 6 illustrates a simplified state transition diagram for each PMU, assuming the VCU is operated in uniprogramming mode. States φ, K, V and S correspond to the PMU being idle, executing a kernel process, a vector process, or an MIMD or scalar process, respectively. The events which trigger the state transitions are listed below.

[Figure 6-State transitions in PMU for SIMD and MIMD modes. States: φ (idle), K (kernel), V (SIMD), S (MIMD).]

E_φK-Arrival of a new process
E_Kφ-Departure of the last process in the PMU
E_KV-Vector process initiation on an allocated PMU
E_VK-Trap condition in an SIMD process
E_KS-MIMD process scheduled
E_SK-Trap condition in an MIMD process
E_VS-Suspended SIMD process causes a process switch to MIMD
E_SV-Process switch to a ready-to-run SIMD process
E_SS-Process switch from one MIMD process to another
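The event list above can be written as a transition table, a sketch of Figure 6 in code. The ASCII state and event names ("IDLE" for φ, `E_phiK` for E_φK, and so on) are our renderings.

```python
# (state, event) -> next state, transcribed from the event list above.
TRANSITIONS = {
    ("IDLE", "E_phiK"): "K",     # arrival of a new process
    ("K",    "E_Kphi"): "IDLE",  # departure of the last process in the PMU
    ("K",    "E_KV"):   "V",     # vector process initiation on allocated PMU
    ("V",    "E_VK"):   "K",     # trap condition in an SIMD process
    ("K",    "E_KS"):   "S",     # MIMD process scheduled
    ("S",    "E_SK"):   "K",     # trap condition in an MIMD process
    ("V",    "E_VS"):   "S",     # suspended SIMD process switches to MIMD
    ("S",    "E_SV"):   "V",     # switch to a ready-to-run SIMD process
    ("S",    "E_SS"):   "S",     # switch between two MIMD processes
}

def run(events, state="IDLE"):
    """Drive one PMU through a sequence of events."""
    for e in events:
        state = TRANSITIONS[(state, e)]
    return state
```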

It is expected that the utilization factors of the PMUs will


be high if we assume that deadlock problems are eliminated in this multi-mode multiprogrammed operating system.

APPLICATION AREAS AND COMPUTATIONAL TASKS

The PM4 system was designed for the following applications. Both statistical and syntactic methods23,24 are to be used in image enhancement, feature extraction, picture segmentation, pattern recognition and scene analysis.

1. Industrial automation (automatic assembly and inspection)

2. Medical diagnosis of X-ray pictures and cytology analysis

TABLE I.-Computational Tasks Required for Various Application Problems in Intelligent Systems

[Table I is a matrix: its rows list typical computational tasks (solving linear systems of equations; solving nonlinear systems of equations; matrix multiplication and inversion; polynomial evaluation; nonlinear mapping; interpolation methods; iterative relaxation methods; transformation and convolution methods; linear and nonlinear programming; searching and sorting; counting, backtracking and estimating; graph operations), its columns list the image processing and pattern classification problems, and an X marks each task a problem requires.]


3. Remote sensing of LANDSAT pictures

4. Automated cartography and stereo compilation

5. Target identification and change detection

6. Computer vision and three-dimensional scene analysis

7. Recognition of human faces, fingerprints and handwritten characters

8. Speech recognition and understanding

9. Pollution control, archaeology and socio-economics

Typical computational tasks associated with the above application problems are summarized in Table I. Both numerical and combinatorial (syntactic) algorithms need to be efficiently implemented. An illustrative example is given here to show the advantages of using parallel processing in syntactic pattern recognition and image analysis.

It has been demonstrated that tree languages are efficient in describing and analyzing two-dimensional pictorial patterns.25 Application examples include classification of bubble chamber events,26 fingerprint identification,27 texture analysis28 and recognition of objects in LANDSAT images.29,30 In order to effectively analyze noisy and distorted images, the use of error-correcting tree automata has been suggested.30,31 The price to pay for the error-correcting capability in image analysis is the increase in computation time. However, with a parallel processing computer system, such an increase in computation time can be greatly reduced. An SIMD parallel parsing algorithm for tree languages has recently been proposed.32 Computer simulations based on the analysis of five tree languages (L(G1), for highway recognition in a LANDSAT image, and four for texture analysis with a window size of 9x9 pixels) have produced the interesting results shown in Figure 7. Furthermore, since all the windows in an image are independent in this problem, they can certainly be processed in parallel to improve the processing speed.

[Figure 7-Speed-up for a 9x9 window (speed-up versus number of processors).]

[Figure 8-Speed-up for 100 windows (speed-up versus number of processors; the parameters indicate the number of processors per window).]

With 100 window patterns processed in parallel, the speed-up result for highway recognition (L(G1)) is shown in Figure 8. It is noticed from Figures 7 and 8 that either the parallel parsing of tree languages or the parallel processing of image windows results in a significant speed-up of image analysis and recognition.
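The window-level parallelism admits a simple back-of-the-envelope model, which we sketch here as an assumption-laden illustration (it ignores the per-window parsing speed-up of Figure 7 and all scheduling overhead): W independent windows on P window-slots finish in ⌈W/P⌉ window-times, so the speed-up from window parallelism alone is W/⌈W/P⌉.

```python
from math import ceil

def window_speedup(w, p):
    """Speed-up from processing w independent windows on p parallel
    window-slots, assuming equal per-window cost and no overhead."""
    return w / ceil(w / p)
```

With 100 windows and 100 slots this gives the ideal 100-fold speed-up; with fewer slots the speed-up degrades in steps, since a partially filled last batch costs a full window-time.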

ACKNOWLEDGMENT

We would like to extend our appreciation to Richard W. Bishop for his direct contributions to the PM4 architecture.

REFERENCES

1. Fu, K. S., and A. Rosenfeld, "Pattern Recognition and Image Processing," IEEE Trans. Comput., Vol. C-25, 1976, pp. 1336-1346.
2. Stone, H. S. (ed.), Introduction to Computer Architecture, Science Research Assn., Inc., Chicago, Illinois, 1975.
3. Barnes, G., et al., "The ILLIAC IV Computer," IEEE Trans. Comput., Vol. C-17, August 1968, pp. 746-757.
4. Thurber, K. J., Large Scale Computer Architecture, Hayden Book Co., 1976.
5. Fu, K. S., "Special Computer Architectures for Pattern Recognition and Image Processing," Proc. 1978 National Computer Conference, pp. 1003-1013.
6. Wulf, W. A., and C. G. Bell, "C.mmp-A multi-mini-processor," AFIPS Conf. Proc., Vol. 41, Pt. II, FJCC 1972, pp. 765-777.
7. Swan, R. J., S. H. Fuller, and D. P. Siewiorek, "Cm*-A modular, multi-microprocessor," Proc. 1977 National Comp. Conf., pp. 637-644.
8. Reddi, S. S., and E. A. Feustel, "A Restructurable Computer System," IEEE Trans. Comput., Vol. C-27, January 1978, pp. 1-20.
9. Bogdanowicz, J. F., "Preliminary Design of a Partitionable Multi-microprogrammable Microprocessor System for Image Processing," Tech. Report EE 7711, School of Electrical Engineering, Purdue University, Nov. 1977.
10. Fuller, S. H., et al., "Multi-microprocessors: An Overview and Working Examples," Proc. IEEE, Vol. 66, No. 2, Feb. 1978, pp. 216-228.
11. Denning, P. J., "Virtual Memory," Comput. Surveys, Vol. 2, No. 3, September 1970, pp. 153-189.
12. Meade, R. M., "On Memory System Design," Fall Joint Comput. Conf., 1970, pp. 33-43.
13. Strecker, W. D., "Cache Memories for PDP-11 Family Computers," 3rd Annual Symp. Comput. Arch., January 1976, pp. 155-158.
14. Briggs, F. A., and E. S. Davidson, "Organization of Semiconductor Memories for Parallel-Pipelined Processors," IEEE Trans. Comput., Vol. C-26, No. 2, pp. 162-169, February 1977.
15. Briggs, F. A., "Performance of Memory Configurations for Parallel-Pipelined Computers," 5th Annual Symp. Comput. Arch., April 1978, pp. 202-209.
16. Patel, J. H., "Design and Analysis of Processor-Memory Interconnections for Multiprocessors," Tech. Report EE 78-40, School of Elec. Engr., Purdue University, October 1978.
17. Brinch Hansen, P., "The Programming Language Concurrent Pascal," IEEE Trans. Software Eng., Vol. SE-1, No. 2, pp. 199-207, June 1975.
18. Wulf, W., et al., "HYDRA: The Kernel of a Multiprocessor Operating System," CACM, Vol. 17, No. 6, pp. 337-345, June 1974.
19. Jones, A. K., et al., "Software Management of Cm*-A Distributed Multiprocessor," National Comput. Conf., 1977, pp. 657-663.
20. Nutt, G. J., "A Parallel Processor Operating System Comparison," IEEE Trans. on Software Eng., Vol. SE-3, pp. 467-475, November 1977.
21. Hwang, K., and L. M. Ni, "Modeling and Analysis of Multiple SIMD/SISD Computer Systems," Proc. of the Third International Computer Symposium, Taipei, Taiwan, China, December 1978.
22. Habermann, A. N., Introduction to Operating System Design, Science Research Assn., Inc., Chicago, Illinois, 1976.
23. Fu, K. S., Sequential Methods in Pattern Recognition, Academic Press, 1968.
24. Fu, K. S., Syntactic Methods in Pattern Recognition, Academic Press, 1974.
25. Fu, K. S., "Tree Languages and Syntactic Pattern Recognition," Pattern Recognition and Artificial Intelligence, C. H. Chen (ed.), Academic Press, 1977.
26. Fu, K. S., and B. K. Bhargava, "Tree Systems for Syntactic Pattern Recognition," IEEE Trans. on Computers, Vol. C-22, December 1973.
27. Moayer, B., and K. S. Fu, "A Tree System Approach for Fingerprint Pattern Recognition," IEEE Trans. on Computers, Vol. C-25, March 1976.
28. Lu, S. Y., and K. S. Fu, "A Syntactic Approach to Texture Analysis," Computer Graphics and Image Processing, Vol. 7, June 1978.
29. Li, R. Y., and K. S. Fu, "Tree System Approach for LANDSAT Data Interpretation," Proc. Symposium on Machine Processing of Remotely Sensed Data, 1976, West Lafayette, Indiana.
30. Lu, S. Y., and K. S. Fu, "Structure-Preserved Error-Correcting Tree Automata for Syntactic Pattern Recognition," Proc. 1976 IEEE Conference on Decision and Control, Clearwater Beach, Florida.
31. Lu, S. Y., and K. S. Fu, "Error-Correcting Tree Automata for Syntactic Pattern Recognition," IEEE Trans. on Computers, Vol. C-27, November 1978.
32. Chang, N. S., and K. S. Fu, "Parallel Parsing of Tree Languages for Syntactic Pattern Recognition," Proc. 1978 IEEE Computer Society Conference on Pattern Recognition and Image Processing, May 31-June 2, Chicago, Illinois.


From the collection of the Computer History Museum (www.computerhistory.org)