Types of DSP Architectures

Types of Architectures

Von Neumann Architecture.

Harvard Architecture.

Super Harvard Architecture.

VLIW Architecture.

The Von Neumann Architecture

Developed in 1946 by John von Neumann. Development took place at the University of Pennsylvania’s Moore School of Electrical Engineering.

This is the most widely used architecture today, and is implemented by the majority of microprocessors on the market.

In this architecture, all the elements of the computer are interconnected by a single system of three buses: the data bus, the address bus, and the control bus.

The Von Neumann Architecture

A model for designing and building computers, based on the following three characteristics:

1) The computer consists of four main sub-systems:
   Memory
   ALU (Arithmetic/Logic Unit)
   Control Unit
   Input/Output System (I/O)

2) The program is stored in memory during execution.

3) Program instructions are executed sequentially.

The Von Neumann Architecture

[Figure: block diagram. Memory (stores data and program), the Processor (CPU), containing the Control Unit (executes the program) and the ALU (does the arithmetic/logic operations requested by the program), and Input-Output (communicates with the "outside world", e.g. screen, keyboard, storage devices), all connected by the Bus.]

Von Neumann Architecture

The Data Bus: Transports data between the CPU and its peripherals. It is bi-directional: the CPU can read or write data in the peripherals.

The Address Bus: The CPU uses the address bus to indicate which peripheral it wants to access and, within each peripheral, which specific register. The address bus is unidirectional: the CPU always writes the address, which is read by the peripherals.

The Control Bus: This bus carries signals that are used to manage and synchronize the exchanges between the CPU and its peripherals. Special lines of the control bus, such as interrupt or wait-state signals, carry information from the peripherals to the CPU.

Memory Subsystem

Memory, also called RAM (Random Access Memory):

Consists of many memory cells (storage units) of a fixed size. Each cell has an address associated with it: 0, 1, …

All accesses to memory are to a specified address.

A cell is the minimum unit of access (fetch/store a complete cell).

The time it takes to fetch/store a cell is the same for all cells.

When the computer is running, both the program and its data (variables) are stored in the memory.

Memory Size / Speed

Typical memory in a personal computer (PC): 64 MB - 256 MB

Memory sizes:
   Kilobyte (KB) = 2^10 = 1,024 bytes ~ 1 thousand
   Megabyte (MB) = 2^20 = 1,048,576 bytes ~ 1 million
   Gigabyte (GB) = 2^30 = 1,073,741,824 bytes ~ 1 billion

Memory access time (read from / write to memory): 50-75 nanoseconds (1 nsec. = 0.000000001 sec.)

RAM is volatile (can only store when power is on) and relatively expensive.

Operations on Memory

Fetch (address): Fetch a copy of the content of the memory cell with the specified address. Non-destructive: copies the value in the memory cell.

Store (address, value): Store the specified value into the memory cell specified by address. Destructive: overwrites the previous value of the memory cell.

The memory system is interfaced via:
   Memory Address Register (MAR)
   Memory Data Register (MDR)
   Fetch/Store signal

Structure of the Memory Subsystem

Fetch(address):
   Load the address into MAR.
   Decode the address in MAR.
   Copy the content of the memory cell with the specified address into MDR.

Store(address, value):
   Load the address into MAR.
   Load the value into MDR.
   Decode the address in MAR.
   Copy the content of MDR into the memory cell with the specified address.

[Figure: MAR and MDR connected to the memory decoder circuit and the Fetch/Store controller, which is driven by the F/S signal.]
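The Fetch and Store sequences can be sketched in a few lines of Python. This is a minimal illustration, not a description of real hardware: the class name `Memory`, the helper `_decode`, and the memory size of 256 cells are assumptions made for the example.

```python
class Memory:
    """Toy memory subsystem with a MAR, an MDR, and fetch/store operations."""

    def __init__(self, size):
        self.cells = [0] * size   # fixed-size cells, addressed 0..size-1
        self.mar = 0              # Memory Address Register
        self.mdr = 0              # Memory Data Register

    def _decode(self):
        # Stand-in for the decoder circuit (hypothetical helper):
        # checks the address in MAR and selects the matching cell.
        if not 0 <= self.mar < len(self.cells):
            raise IndexError(f"address {self.mar} out of range")
        return self.mar

    def fetch(self, address):
        self.mar = address             # load the address into MAR
        cell = self._decode()          # decode the address in MAR
        self.mdr = self.cells[cell]    # copy the cell content into MDR
        return self.mdr                # the cell itself is left unchanged

    def store(self, address, value):
        self.mar = address             # load the address into MAR
        self.mdr = value               # load the value into MDR
        cell = self._decode()          # decode the address in MAR
        self.cells[cell] = self.mdr    # overwrite the previous cell value


mem = Memory(256)
mem.store(99, 42)
print(mem.fetch(99))   # 42
```

Fetching twice returns the same value (non-destructive), while a second store to the same address replaces the first (destructive).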

Input/Output Subsystem

Handles devices that allow the computer system to:
   Communicate and interact with the outside world: screen, keyboard, printer, ...
   Store information (mass storage): hard drives, floppies, CDs, tapes, …

Mass-Storage Device Access Methods:
   Direct Access Storage Devices (DASDs): hard drives, floppy disks, CD-ROMs, ...
   Sequential Access Storage Devices (SASDs): tapes (for example, used as backup devices)

The ALU Subsystem

The ALU (Arithmetic/Logic Unit) performs:
   mathematical operations (+, -, x, /, …)
   logic operations (=, <, >, and, or, not, ...)

In today's computers it is integrated into the CPU.

Consists of:
   Circuits to do the arithmetic/logic operations.
   Registers (fast storage units) to store intermediate computational results.
   A bus that connects the two.

Structure of the ALU

Registers: Very fast local memory cells that store operands of operations and intermediate results. The CCR (condition code register) is a special-purpose register that stores the result of <, =, > operations.

ALU circuitry: Contains an array of circuits to do mathematical/logic operations.

Bus: Data path interconnecting the registers to the ALU circuitry.

[Figure: registers R0, R1, R2, ..., Rn and the condition code bits GT, EQ, LT connected to the ALU circuitry.]

The Control Unit

The program is stored in memory as machine language instructions, in binary.

The task of the control unit is to execute programs by repeatedly:
   Fetching from memory the next instruction to be executed.
   Decoding it, that is, determining what is to be done.
   Executing it by issuing the appropriate signals to the ALU, memory, and I/O subsystems.

This continues until the HALT instruction.

Machine Language Instructions

A machine language instruction consists of:
   Operation code, telling which operation to perform.
   Address field(s), telling the memory addresses of the values on which the operation works.

Example: ADD X, Y (add the content of memory locations X and Y, and store the result back in memory location Y).

Assume: the opcode for ADD is 9, and the addresses are X=99, Y=100.

   Opcode (8 bits)   Address 1 (16 bits)   Address 2 (16 bits)
   00001001          0000000001100011      0000000001100100
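The bit layout of the ADD example can be reproduced with shifts and masks. The sketch below assumes the three fields are simply concatenated into one 40-bit word, opcode first; the function names `encode` and `decode` are made up for illustration.

```python
def encode(opcode, addr1, addr2):
    """Pack an 8-bit opcode and two 16-bit addresses into one 40-bit word."""
    if not (0 <= opcode < 2**8 and 0 <= addr1 < 2**16 and 0 <= addr2 < 2**16):
        raise ValueError("field does not fit in its bit width")
    return (opcode << 32) | (addr1 << 16) | addr2


def decode(word):
    """Split a 40-bit word back into (opcode, address 1, address 2)."""
    return (word >> 32) & 0xFF, (word >> 16) & 0xFFFF, word & 0xFFFF


word = encode(9, 99, 100)            # ADD X, Y with X=99, Y=100
bits = format(word, "040b")
print(bits[:8], bits[8:24], bits[24:])
# 00001001 0000000001100011 0000000001100100
print(decode(word))                  # (9, 99, 100)
```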

Instruction Set Design

Two different approaches:

Reduced Instruction Set Computers (RISC):
   Instruction set as small and simple as possible.
   Minimizes amount of circuitry --> faster computers.

Complex Instruction Set Computers (CISC):
   More instructions, many very complex.
   Each instruction can do more work, but requires more circuitry.

Structure of the Control Unit

PC (Program Counter): stores the address of the next instruction to fetch.

IR (Instruction Register): stores the instruction fetched from memory.

Instruction Decoder: decodes the instruction and activates the necessary circuitry.

[Figure: the PC with its +1 incrementer, and the IR feeding the Instruction Decoder.]

von Neumann Architecture

[Figure: von Neumann architecture diagram]

How does this all work together?

Program Execution:

The PC is set to the address where the first program instruction is stored in memory.

Repeat until HALT instruction or fatal error:
   Fetch instruction
   Decode instruction
   Execute instruction
End of loop

Program Execution (cont.)

Fetch Phase:
   PC --> MAR (put the address in PC into MAR)
   Fetch signal (signal memory to fetch the value into MDR)
   MDR --> IR (move the value to the Instruction Register)
   PC + 1 --> PC (increase the address in the program counter)

Decode Phase:
   IR --> Instruction decoder (decode the instruction in IR)
   The instruction decoder will then generate the signals to activate the circuitry to carry out the instruction.

Program Execution (cont.)

Execute Phase: Differs from one instruction to the next.

Example: LOAD X (load the value at address X into the register)
   IR_address --> MAR
   Fetch signal
   MDR --> R

ADD X is left as an exercise.
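The register transfers of the fetch phase and of LOAD X can be traced in a small Python sketch. The machine state below (a `memory` dict, a `reg` dict, instructions stored as tuples) is hypothetical, and `execute_add` is only one possible answer to the exercise, not one given in the source.

```python
# Hypothetical machine state: cells 0-1 hold instructions as
# (mnemonic, address) tuples, cells 10-11 hold data.
memory = {0: ("LOAD", 10), 1: ("ADD", 11), 10: 5, 11: 7}
reg = {"PC": 0, "MAR": 0, "MDR": 0, "IR": None, "R": 0}


def fetch_phase():
    reg["MAR"] = reg["PC"]             # PC --> MAR
    reg["MDR"] = memory[reg["MAR"]]    # Fetch signal
    reg["IR"] = reg["MDR"]             # MDR --> IR
    reg["PC"] += 1                     # PC + 1 --> PC


def execute_load():
    reg["MAR"] = reg["IR"][1]          # IR_address --> MAR
    reg["MDR"] = memory[reg["MAR"]]    # Fetch signal
    reg["R"] = reg["MDR"]              # MDR --> R


def execute_add():
    # Assumed solution to the exercise: same memory access as LOAD,
    # but the fetched value is added to R instead of replacing it.
    reg["MAR"] = reg["IR"][1]          # IR_address --> MAR
    reg["MDR"] = memory[reg["MAR"]]    # Fetch signal
    reg["R"] = reg["R"] + reg["MDR"]   # R + MDR --> R (through the ALU)


fetch_phase()
execute_load()
fetch_phase()
execute_add()
print(reg["R"], reg["PC"])   # 12 2
```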

Instruction Set for Our Von Neumann Machine

Opcode   Operation      Meaning
0000     LOAD X         CON(X) --> R
0001     STORE X        R --> CON(X)
0010     CLEAR X        0 --> CON(X)
0011     ADD X          R + CON(X) --> R
0100     INCREMENT X    CON(X) + 1 --> CON(X)
0101     SUBTRACT X     R - CON(X) --> R
0110     DECREMENT X    CON(X) - 1 --> CON(X)
0111     COMPARE X      If CON(X) > R then GT = 1 else 0
                        If CON(X) = R then EQ = 1 else 0
                        If CON(X) < R then LT = 1 else 0
1000     JUMP X         Get next instruction from memory location X
1001     JUMPGT X       Get next instruction from memory loc. X if GT=1
...      JUMPxx X       xx = LT / EQ / NEQ
1101     IN X           Input an integer value and store in X
1110     OUT X          Output, in decimal notation, content of mem. loc. X
1111     HALT           Stop program execution
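To see the instruction set in action, here is a minimal interpreter sketch. It dispatches on mnemonics rather than binary opcodes, spells the JUMPxx variants as `JUMPLT`, `JUMPEQ` and `JUMPNEQ`, and keeps instructions and data in one Python dict; all of these are simplifying assumptions for illustration. The sample program (summing 1 to N) and its addresses are made up.

```python
def run(memory, inputs=()):
    """Run the program stored from address 0 of `memory` (address -> cell)."""
    mem = dict(memory)            # one memory holds both instructions and data
    inputs, outputs = list(inputs), []
    pc, r = 0, 0                  # program counter and the single register R
    gt = eq = lt = 0              # condition code bits set by COMPARE
    while True:
        op, x = mem[pc]           # fetch and decode
        pc += 1
        if op == "LOAD":
            r = mem[x]
        elif op == "STORE":
            mem[x] = r
        elif op == "CLEAR":
            mem[x] = 0
        elif op == "ADD":
            r = r + mem[x]
        elif op == "INCREMENT":
            mem[x] = mem[x] + 1
        elif op == "SUBTRACT":
            r = r - mem[x]
        elif op == "DECREMENT":
            mem[x] = mem[x] - 1
        elif op == "COMPARE":
            gt, eq, lt = int(mem[x] > r), int(mem[x] == r), int(mem[x] < r)
        elif op == "JUMP":
            pc = x
        elif op == "JUMPGT":
            pc = x if gt else pc
        elif op == "JUMPEQ":
            pc = x if eq else pc
        elif op == "JUMPLT":
            pc = x if lt else pc
        elif op == "JUMPNEQ":
            pc = x if not eq else pc
        elif op == "IN":
            mem[x] = inputs.pop(0)
        elif op == "OUT":
            outputs.append(mem[x])
        elif op == "HALT":
            return outputs
        else:
            raise ValueError(f"unknown instruction {op!r}")


# Sum 1..N. Data cells: 100 = N, 101 = SUM, 102 = I.
program = [
    ("IN", 100), ("CLEAR", 101), ("CLEAR", 102),
    ("LOAD", 102), ("COMPARE", 100),      # GT = 1 while N > I
    ("JUMPGT", 7), ("JUMP", 12),
    ("INCREMENT", 102), ("LOAD", 101), ("ADD", 102), ("STORE", 101),
    ("JUMP", 3),
    ("OUT", 101), ("HALT", None),
]
memory = dict(enumerate(program))
print(run(memory, inputs=[4]))   # [10]
```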

HARVARD ARCHITECTURE

A DSP chip is a microprocessor specially designed for DSP applications.

Harvard architecture allows multiple memory reads.

Architecture optimized to provide rapid processing of discrete-time signals, e.g. Multiply and Accumulate (MAC) in one cycle.

[Figure: Program Memory and Data Memory, each with its own data bus (Program Memory Data Bus, Data Memory Data Bus), feeding the ALU and Accumulator through two multiplexers.]


The Harvard architecture – as the name implies – was developed at Harvard University.

In contrast to the Von Neumann architecture, it uses two separate bus systems to transport the instruction codes and the data being processed by the CPU.

The Program Bus System: Used exclusively to transport instruction codes from the program memory to the CPU during the fetch cycle.

The Data Bus System: Used exclusively to transport data between the CPU and the memory and peripherals.

PROS AND CONS OF EACH ARCHITECTURE

The Harvard architecture (HA) is capable of simultaneously reading an instruction code and reading or writing a memory or peripheral as part of the execution of the previous instruction.

It therefore has a speed advantage over the Von Neumann architecture.

The HA is also safer, since it is not possible for the CPU to mistakenly write codes into the program memory and therefore corrupt the code while it is executing.

The HA is less flexible. It needs two independent memory banks (one for program and another one for data). These two resources are not interchangeable.

For an embedded system that always runs the same application, the Harvard architecture is most suitable (memory needs can be easily anticipated).

PROS AND CONS OF EACH ARCHITECTURE

A computer system, however, may run a wide variety of applications, requiring large data and program memories.

The Von Neumann architecture is better suited to this type of situation, because program and data memories are interchangeable, which leads to better usage of the memory resources.
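The speed advantage of the Harvard architecture can be made concrete with a deliberately idealized cycle count. The model below is an assumption for illustration only: every memory access takes one bus cycle, every instruction makes exactly one data access, and there are no caches or stalls.

```python
def von_neumann_cycles(n_instructions):
    """Shared bus: each instruction fetch and each data access takes its own cycle."""
    return 2 * n_instructions


def harvard_cycles(n_instructions):
    """Separate buses: the fetch of instruction i+1 overlaps the data access of
    instruction i, so only the last data access adds an extra cycle."""
    return n_instructions + 1 if n_instructions else 0


for n in (1, 10, 1000):
    print(n, von_neumann_cycles(n), harvard_cycles(n))
# 1 2 2
# 10 20 11
# 1000 2000 1001
```

Under these assumptions the Harvard machine approaches twice the throughput for long instruction streams; real processors narrow or widen that gap depending on caches and memory layout.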

The SHARC

Super Harvard Architecture Computer

The SHARC

Developed by Analog Devices.

Optimized for demanding DSP and imaging applications.

32-bit floating point, with 40-bit extended floating point capabilities.

Large on-chip memory.

Ideal for scalable multi-processing applications.

Super Harvard Architecture

Many processors employ the Harvard architecture by having two separate memories or caches integrated into the processor chip.

The SHARC is unique in that its internal memory is capable of holding a large program as well as a large amount of data. This is what makes it "super".

SHARC DSP

The SHARC incorporates features aimed at optimizing the loops typical of DSP algorithms:
   High-speed floating point capability.
   Extended floating point.

These features are DSP-specific, meaning that when applied to a non-DSP application, performance may not be as good.

Floating Point and Extended Floating Point

The SHARC supports floating point, extended floating point, and non-floating-point (fixed-point) operations.

No additional clock cycles for floating point computations.

Data automatically truncated and zero padded when moved between 32-bit memory and internal registers.

Excellent signal to noise ratio.

SHARC’s Internal Memory

Makes the SHARC unique.

Size:
   Allows many complex functions to be performed on-chip, eliminating the need to move data between internal and external memory.
   Memory size is significantly larger than most other high-speed computational devices.

Dual-block, dual-port:
   Optimizes the Harvard architecture by allowing the fetch of instructions while performing data memory accesses.

Multiply and Accumulate Instructions on the SHARC

Like most DSPs, the SHARC is able to compute a product and add the product to a running total in a single clock cycle.

The SHARC’s "super instruction" is that it can multiply and accumulate while adding, subtracting, or averaging data in two other registers.

These instructions give the SHARC its 120 megaflop rating.
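A multiply-accumulate loop is the core of many DSP kernels, such as an FIR filter. The Python sketch below only shows the arithmetic pattern; it says nothing about how the SHARC schedules it, and the function name and sample values are made up for illustration.

```python
def fir_output(coeffs, samples):
    """One FIR output sample computed as a chain of multiply-accumulate steps."""
    acc = 0.0
    for c, x in zip(coeffs, samples):
        acc += c * x      # one MAC: multiply, then add to the running total
    return acc


print(fir_output([0.5, 0.25, 0.25], [4.0, 8.0, 12.0]))   # 7.0
```

On a DSP with a single-cycle MAC, each loop iteration maps to one instruction, which is why the number of filter taps largely determines the cycle count per output sample.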

DAGs on the SHARC

Data Address Generators (DAGs) are integer computation units that manage the indexing of registers.

They allow the SHARC to fetch a value and update the index value.

If the updated value exceeds a limit, the DAG adjusts the index so that it wraps.

This occurs in the same clock cycle as the read or write.
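The fetch-then-update behaviour with wrap-around can be modelled in software. This is a sketch of the idea only: the class `AddressGenerator` and its `base`, `length` and `modify` parameters are hypothetical names, not the SHARC's actual register set, and on the chip the update happens in hardware alongside the memory access.

```python
class AddressGenerator:
    """Toy DAG: returns the current index, then post-updates it with wrap-around."""

    def __init__(self, base, length, modify=1):
        self.base = base          # first address of the buffer
        self.length = length      # number of entries before the index wraps
        self.modify = modify      # step added after every access
        self.index = base

    def next_address(self):
        addr = self.index
        offset = (self.index - self.base + self.modify) % self.length
        self.index = self.base + offset   # wrap back into the buffer
        return addr


dag = AddressGenerator(base=100, length=4)
print([dag.next_address() for _ in range(6)])
# [100, 101, 102, 103, 100, 101]
```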

DAG Capabilities

Circular Buffering:
   Rather than actually moving data in and out of a vector, circular buffers are used.
   By updating the index modulo the buffer length, the oldest entry can be conveniently replaced by the newest entry.

Bit-Reverse Addressing:
   The bit pattern of a vector index is reversed.
   Done automatically by the SHARC.
   Required for the Fast Fourier Transform (FFT), which is often critical to DSP applications.
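Bit-reverse addressing is easy to illustrate in software, even though the SHARC does it in hardware. The function below is a plain Python sketch (the name `bit_reverse` is an assumption); for an 8-point FFT the index is 3 bits wide.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lowest bit to the top
        index >>= 1
    return result


print([bit_reverse(i, 3) for i in range(8)])
# [0, 4, 2, 6, 1, 5, 3, 7]
```

This is the order in which a radix-2 FFT reads or writes its samples, which is why having the address generator produce it for free saves a separate reordering pass.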

SHARC DSP

What makes the SHARC unique?
   It also has some features not directly related to optimizing numeric computations:
   Pipelining.
   Handling branches.

Why has this not emerged sooner?
   Technology has only recently become available to make it economical to integrate general single computing devices.

SHARC’s Pipeline

3 stages:
   1. Instruction Fetch
   2. Decode
   3. Execution

It takes three clock cycles for an instruction to propagate through the pipeline.

The processor's execution speed is still one instruction per clock cycle, even though each instruction requires three clock cycles, because up to three instructions are in flight at once.
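The difference between latency (three cycles per instruction) and throughput (one instruction per cycle) shows up in a simple cycle count. The sketch assumes an ideal pipeline with no stalls or branch penalties; the function name is made up.

```python
def pipeline_cycles(n_instructions, stages=3):
    """Cycles to finish n instructions on an ideal pipeline with no stalls."""
    if n_instructions == 0:
        return 0
    # The first instruction needs `stages` cycles; each later one
    # completes one cycle after its predecessor.
    return stages + (n_instructions - 1)


print(pipeline_cycles(1), pipeline_cycles(100))   # 3 102
```

For 100 instructions the pipeline needs 102 cycles rather than 300, so the average cost approaches one cycle per instruction as the program gets longer.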

Multi-processing

The SHARC is uniquely equipped for multi-processing:
   Link ports provide powerful multi-processing capabilities.
   Two main program models, depending on the application.
   Adapts well to different multi-processing architectures.

Multi-processing: SHARC Links

The SHARC has 6 link ports that can transport data at rates up to 40 Mbytes/sec.

Links designed for point-to-point connections.

Data can be transmitted in either direction but not both simultaneously.

Multi-processing Program Model: MIMD

Multiple instruction, multiple data.

Good for applications that require multiple instruction threads to execute concurrently.

Processors operate individually. Each processor executes different code.

Typically used for image reconstruction and multi-channel DSP.

Multi-processing Program Model: SIMD

Single instruction, multiple data.

Works best when all processors execute identical instruction sequences.

Does not require overhead for inter-processor synchronization.

Typically used for synthetic aperture radar (SAR) and automatic target recognition.

(SAR is a form of radar which is used to create images of an object, such as a landscape.)

http://en.wikipedia.org/wiki/Radar

http://en.wikipedia.org/wiki/Radar_imaging




Summary of what makes the SHARC Super

It performs excellently for DSP applications.

Employs a Harvard architecture with very large on-chip memory.

Respectable megaflop rating.

Its multiprocessing capabilities.

VLIW (Very Long Instruction Word) Architecture

VLIW architectures execute multiple instructions per cycle and use simple, regular instruction sets:
   More parallelism, higher performance.
   Better compiler target.

Multiple independent instructions per cycle, packed into a single large "instruction word" or "packet".

Large, uniform register sets.

Wide program and data buses.

Invented by Josh Fisher in his research group at Yale University in the early 1980s.

http://en.wikipedia.org/wiki/Josh_Fisher

http://en.wikipedia.org/wiki/Yale_University

Very Long Instruction Word (VLIW)

• A technique for instruction-level parallelism by executing instructions without dependencies (known at compile-time) in parallel

• Example of a single VLIW instruction:

F=a+b; c=e/g; d=x&y; w=z*h;

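[Figure: one VLIW instruction containing the four operations F=a+b, c=e/g, d=x&y and w=z*h, each dispatched to its own processing unit (PU); the PUs read the operands a, b, e, g, x, y, z, h and write the results F, c, d, w.]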
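Whether operations may share one VLIW instruction depends on their being free of dependencies, which the compiler checks ahead of time. The sketch below is a simplified version of such a check; the tuple format `(dest, src1, src2)` and the function name are assumptions for illustration.

```python
def is_independent(bundle):
    """bundle: list of (dest, src1, src2). True if no operation reads or
    writes a destination that another operation in the bundle writes."""
    for i, (dest, *_srcs) in enumerate(bundle):
        for j, (other_dest, *other_srcs) in enumerate(bundle):
            if i != j and (dest == other_dest or dest in other_srcs):
                return False
    return True


# F=a+b; c=e/g; d=x&y; w=z*h
bundle = [("F", "a", "b"), ("c", "e", "g"), ("d", "x", "y"), ("w", "z", "h")]
print(is_independent(bundle))                                # True
print(is_independent([("F", "a", "b"), ("c", "F", "g")]))    # False
```

The second bundle fails because c=F/g needs the result of F=a+b, so the two operations cannot execute in the same cycle.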

VLIW – Simplified Architecture Example

[Figure: Program Memory supplies a 256-bit word consisting of 8 instructions, each 32 bits wide, to 8 Execution Units, with each unit executing one instruction.]
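The 256-bit fetch packet of this example can be modelled as eight 32-bit slots packed into one integer. The slot ordering (first instruction in the most significant bits) and the names `pack` and `unpack` are assumptions made for this sketch; a real VLIW machine defines its own layout.

```python
SLOTS, SLOT_BITS = 8, 32    # 8 instructions x 32 bits = 256 bits


def pack(instructions):
    """Pack eight 32-bit instruction words into one 256-bit packet."""
    if len(instructions) != SLOTS:
        raise ValueError("a packet holds exactly 8 instructions")
    word = 0
    for instr in instructions:
        if not 0 <= instr < 2**SLOT_BITS:
            raise ValueError("instruction does not fit in 32 bits")
        word = (word << SLOT_BITS) | instr
    return word


def unpack(word):
    """Split a packet into the 8 instructions, one per execution unit."""
    mask = (1 << SLOT_BITS) - 1
    return [(word >> (SLOT_BITS * (SLOTS - 1 - i))) & mask for i in range(SLOTS)]


packet = pack(list(range(1, 9)))
print(packet.bit_length() <= 256, unpack(packet))
# True [1, 2, 3, 4, 5, 6, 7, 8]
```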
