DSP Processor Architecture and Features
Types of Architectures
VON-Neumann Architecture.
Harvard Architecture.
Super Harvard Architecture
VLIW Architecture.
The Von Neumann Architecture
• Developed in 1946 by John von Neumann, at the University of Pennsylvania's Moore School of Electrical Engineering.
• This is the most widely used architecture today, and is implemented by the majority of microprocessors on the market.
• In this architecture, all the elements of the computer are interconnected by a single system of three buses: the Data, Address and Control buses.
The Von Neumann Architecture
A model for designing and building computers, based on the following three characteristics:
1) The computer consists of four main sub-systems:
   • Memory
   • ALU (Arithmetic/Logic Unit)
   • Control Unit
   • Input/output System (I/O)
2) Program is stored in memory during execution.
3) Program instructions are executed sequentially.
[Figure: block diagram of the Von Neumann architecture. Memory (stores data and program), the Processor (CPU), containing the Control Unit (executes the program) and the ALU (does the arithmetic/logic operations requested by the program), and Input-Output (communicates with the "outside world", e.g. screen, keyboard, storage devices), all connected by a bus.]
Von Neumann Architecture
• The Data Bus: Transports data between the CPU and its peripherals. It is bi-directional: the CPU can read or write data in the peripherals.
• The Address Bus: The CPU uses the address bus to indicate which peripheral it wants to access, and within each peripheral which specific register. The address bus is unidirectional: the CPU always writes the address, which is read by the peripherals.
• The Control Bus: Carries signals that are used to manage and synchronize the exchanges between the CPU and its peripherals. Special lines of the control bus, such as interrupt or wait-state signals, carry information from the peripherals to the CPU.
Memory Subsystem
• Memory, also called RAM (Random Access Memory), consists of many memory cells (storage units) of a fixed size. Each cell has an address associated with it: 0, 1, …
• All accesses to memory are to a specified address. A cell is the minimum unit of access (fetch/store a complete cell).
• The time it takes to fetch/store a cell is the same for all cells.
• When the computer is running, both the program and its data (variables) are stored in the memory.
Memory Size / Speed
• Typical memory in a personal computer (PC): 64MB - 256MB
• Memory sizes:
   Kilobyte (KB) = 2^10 = 1,024 bytes ~ 1 thousand
   Megabyte (MB) = 2^20 = 1,048,576 bytes ~ 1 million
   Gigabyte (GB) = 2^30 = 1,073,741,824 bytes ~ 1 billion
• Memory access time (read from / write to memory): 50-75 nanoseconds (1 nsec. = 0.000000001 sec.)
• RAM is volatile (can only store when power is on) and relatively expensive.
Operations on Memory
• Fetch (address):
   Fetch a copy of the content of the memory cell with the specified address.
   Non-destructive: copies the value in the memory cell.
• Store (address, value):
   Store the specified value into the memory cell specified by address.
   Destructive: overwrites the previous value of the memory cell.
• The memory system is interfaced via:
   Memory Address Register (MAR)
   Memory Data Register (MDR)
   Fetch/Store signal
Structure of the Memory Subsystem
• Fetch(address)
   1. Load the address into MAR.
   2. Decode the address in MAR.
   3. Copy the content of the memory cell with the specified address into MDR.
• Store(address, value)
   1. Load the address into MAR.
   2. Load the value into MDR.
   3. Decode the address in MAR.
   4. Copy the content of MDR into the memory cell with the specified address.
[Figure: memory subsystem. The MAR and MDR registers, the memory decoder circuit, and the Fetch/Store controller driven by the F/S signal.]
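The Fetch and Store sequences can be sketched in a few lines of Python. This is only an illustrative model: the class name `Memory`, the attribute names and the 256-cell size are assumptions made for the example, and address decoding is reduced to a list index.

```python
class Memory:
    """Toy memory subsystem with a MAR and an MDR (illustrative only)."""

    def __init__(self, size):
        self.cells = [0] * size  # fixed-size cells, addressed 0, 1, ...
        self.mar = 0             # Memory Address Register
        self.mdr = 0             # Memory Data Register

    def fetch(self, address):
        self.mar = address               # load the address into MAR
        self.mdr = self.cells[self.mar]  # decode MAR, copy the cell into MDR
        return self.mdr                  # non-destructive: the cell is unchanged

    def store(self, address, value):
        self.mar = address               # load the address into MAR
        self.mdr = value                 # load the value into MDR
        self.cells[self.mar] = self.mdr  # decode MAR, copy MDR into the cell


mem = Memory(256)
mem.store(99, 7)
first = mem.fetch(99)   # reading does not change the cell
second = mem.fetch(99)
mem.store(99, 12)       # destructive: the old value 7 is overwritten
third = mem.fetch(99)
```

Fetching twice returns the same value, while a second store replaces it, which is the non-destructive/destructive distinction made on the slide.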
Input/output Subsystem
• Handles devices that allow the computer system to:
   Communicate and interact with the outside world (screen, keyboard, printer, ...)
   Store information (mass storage): hard drives, floppies, CDs, tapes, …
• Mass-Storage Device Access Methods:
   Direct Access Storage Devices (DASDs): hard drives, floppy disks, CD-ROMs, ...
   Sequential Access Storage Devices (SASDs): tapes (for example, used as backup devices)
The ALU Subsystem
• The ALU (Arithmetic/Logic Unit) performs
   mathematical operations (+, -, x, /, …)
   logic operations (=, <, >, and, or, not, ...)
• In today's computers it is integrated into the CPU.
• Consists of:
   Circuits to do the arithmetic/logic operations.
   Registers (fast storage units) to store intermediate computational results.
   A bus that connects the two.
Structure of the ALU
• Registers:
   Very fast local memory cells that store operands of operations and intermediate results.
   CCR (condition code register), a special-purpose register that stores the result of <, =, > operations.
• ALU circuitry:
   Contains an array of circuits to do mathematical/logic operations.
• Bus:
   Data path interconnecting the registers to the ALU circuitry.
[Figure: ALU structure. Registers R0, R1, R2, ..., Rn connected by a bus to the ALU circuitry, with the condition code bits GT, EQ, LT.]
The Control Unit
• The program is stored in memory as machine language instructions, in binary.
• The task of the control unit is to execute programs by repeatedly:
   Fetching from memory the next instruction to be executed.
   Decoding it, that is, determining what is to be done.
   Executing it by issuing the appropriate signals to the ALU, memory, and I/O subsystems.
• This continues until the HALT instruction.
Machine Language Instructions
• A machine language instruction consists of:
   Operation code, telling which operation to perform.
   Address field(s), telling the memory addresses of the values on which the operation works.
• Example: ADD X, Y (add the content of memory locations X and Y, and store the result back in memory location Y).
• Assume: opcode for ADD is 9, and addresses X=99, Y=100

Opcode (8 bits)   Address 1 (16 bits)   Address 2 (16 bits)
00001001          0000000001100011      0000000001100100
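The bit layout of the ADD X, Y example can be reproduced with a short Python sketch. The function names `encode`, `decode` and `to_fields` are hypothetical helpers introduced here; only the field widths (8 + 16 + 16 bits) and the values 9, 99 and 100 come from the example.

```python
def encode(opcode, addr1, addr2):
    """Pack an 8-bit opcode and two 16-bit addresses into one 40-bit word."""
    if not 0 <= opcode < 2**8:
        raise ValueError("opcode must fit in 8 bits")
    if not (0 <= addr1 < 2**16 and 0 <= addr2 < 2**16):
        raise ValueError("addresses must fit in 16 bits")
    return (opcode << 32) | (addr1 << 16) | addr2


def decode(word):
    """Split a 40-bit word back into (opcode, address 1, address 2)."""
    return (word >> 32) & 0xFF, (word >> 16) & 0xFFFF, word & 0xFFFF


def to_fields(word):
    """Render the word as three space-separated binary fields."""
    bits = format(word, "040b")
    return f"{bits[:8]} {bits[8:24]} {bits[24:]}"


word = encode(9, 99, 100)   # ADD X, Y with X=99, Y=100
text = to_fields(word)      # "00001001 0000000001100011 0000000001100100"
fields = decode(word)       # (9, 99, 100)
```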
Instruction Set Design
Two different approaches:
• Reduced Instruction Set Computers (RISC)
   Instruction set as small and simple as possible.
   Minimizes the amount of circuitry --> faster computers.
• Complex Instruction Set Computers (CISC)
   More instructions, many of them very complex.
   Each instruction can do more work, but requires more circuitry.
Structure of the Control Unit
• PC (Program Counter): stores the address of the next instruction to fetch.
• IR (Instruction Register): stores the instruction fetched from memory.
• Instruction Decoder: decodes the instruction and activates the necessary circuitry.
[Figure: control unit. The PC with its +1 incrementer, and the IR feeding the Instruction Decoder.]
[Figure: the complete von Neumann architecture.]
How does this all work together?
Program Execution:
• PC is set to the address where the first program instruction is stored in memory.
• Repeat until HALT instruction or fatal error:
   Fetch instruction
   Decode instruction
   Execute instruction
• End of loop
Program Execution (cont.)
• Fetch phase
   PC --> MAR (put address in PC into MAR)
   Fetch signal (signal memory to fetch value into MDR)
   MDR --> IR (move value to Instruction Register)
   PC + 1 --> PC (increase address in program counter)
• Decode phase
   IR --> Instruction decoder (decode instruction in IR)
   The instruction decoder then generates the signals that activate the circuitry to carry out the instruction.
Program Execution (cont.)
• Execute phase
   Differs from one instruction to the next.
• Example: LOAD X (load value in addr. X into register)
   IR_address --> MAR
   Fetch signal
   MDR --> R
• ADD X is left as an exercise.
Instruction Set for Our Von Neumann Machine

Opcode  Operation     Meaning
0000    LOAD X        CON(X) --> R
0001    STORE X       R --> CON(X)
0010    CLEAR X       0 --> CON(X)
0011    ADD X         R + CON(X) --> R
0100    INCREMENT X   CON(X) + 1 --> CON(X)
0101    SUBTRACT X    R - CON(X) --> R
0110    DECREMENT X   CON(X) - 1 --> CON(X)
0111    COMPARE X     If CON(X) > R then GT = 1 else 0
                      If CON(X) = R then EQ = 1 else 0
                      If CON(X) < R then LT = 1 else 0
1000    JUMP X        Get next instruction from memory location X
1001    JUMPGT X      Get next instruction from memory loc. X if GT=1
...     JUMPxx X      xx = LT / EQ / NEQ
1101    IN X          Input an integer value and store in X
1110    OUT X         Output, in decimal notation, content of mem. loc. X
1111    HALT          Stop program execution
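To see the fetch-decode-execute loop and this instruction set working together, here is a minimal Python sketch of the machine. It is an illustration under stated assumptions, not the slides' implementation: instructions are stored in memory as `(mnemonic, address)` tuples rather than binary words, the function name `run` is hypothetical, and the conditional jumps are assumed to be named JUMPGT, JUMPLT, JUMPEQ and JUMPNEQ.

```python
def run(memory, start=0, inputs=()):
    """Fetch-decode-execute loop over a single memory holding program and data."""
    pc, r = start, 0          # program counter and the single register R
    gt = eq = lt = 0          # condition code bits
    pending = list(inputs)
    outputs = []
    while True:
        op, x = memory[pc]    # fetch the instruction at PC and decode it
        pc += 1               # PC + 1 --> PC
        if op == "LOAD":
            r = memory[x]
        elif op == "STORE":
            memory[x] = r
        elif op == "CLEAR":
            memory[x] = 0
        elif op == "ADD":
            r = r + memory[x]
        elif op == "INCREMENT":
            memory[x] += 1
        elif op == "SUBTRACT":
            r = r - memory[x]
        elif op == "DECREMENT":
            memory[x] -= 1
        elif op == "COMPARE":
            gt, eq, lt = int(memory[x] > r), int(memory[x] == r), int(memory[x] < r)
        elif op == "JUMP":
            pc = x
        elif op in ("JUMPGT", "JUMPEQ", "JUMPLT", "JUMPNEQ"):
            taken = {"JUMPGT": gt, "JUMPEQ": eq, "JUMPLT": lt, "JUMPNEQ": 1 - eq}[op]
            if taken:
                pc = x
        elif op == "IN":
            memory[x] = pending.pop(0)
        elif op == "OUT":
            outputs.append(memory[x])
        elif op == "HALT":
            return outputs
        else:
            raise ValueError(f"unknown instruction {op!r}")


# Program at addresses 0-8 sums n + (n-1) + ... + 1; data lives at 20-22.
memory = [0] * 32
memory[0:9] = [
    ("LOAD", 21), ("ADD", 20), ("STORE", 21),     # sum = sum + n
    ("DECREMENT", 20),                            # n = n - 1
    ("LOAD", 22), ("COMPARE", 20), ("JUMPGT", 0), # loop while n > 0
    ("OUT", 21), ("HALT", None),
]
memory[20], memory[21], memory[22] = 5, 0, 0      # n, sum, constant zero
result = run(memory)                              # [15]
```

Program and data share one memory here, which is the defining property of the stored-program model described above.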
HARVARD ARCHITECTURE
• A DSP chip is a microprocessor specially designed for DSP applications.
• Harvard architecture allows multiple memory reads.
• Architecture optimized to provide rapid processing of discrete-time signals, e.g. Multiply and Accumulate (MAC) in one cycle.
[Figure: Harvard architecture. Program Memory and Data Memory, each with its own data bus (Program Memory Data Bus, Data Memory Data Bus), feeding two multiplexers, the ALU and the Accumulator.]
HARVARD ARCHITECTURE
The Harvard architecture – as the name implies – was developed at Harvard University.
By contrast to the Von Neumann architecture, it uses two separate bus systems to transport the instruction codes and the data being processed by the CPU.
The Program Bus System: Is used exclusively to transport instruction codes from the program memory to the CPU during the fetch cycle.
The Data Bus System: Is used exclusively to transport data from/to the CPU, to/from the memory and peripherals.
PROS AND CONS OF EACH ARCHITECTURE
• The Harvard architecture (HA) is capable of simultaneously reading an instruction code, and reading or writing a memory or peripheral as part of the execution of the previous instruction.
• It has a speed advantage over the Von Neumann architecture.
• The HA is also safer, since it is not possible for the CPU to mistakenly write codes into the program memory and therefore corrupt the code while it is executing.
• The HA is less flexible. It needs two independent memory banks (one for program and another one for data). These two resources are not interchangeable.
• For an embedded system that always runs the same application, the Harvard architecture is most suitable (memory needs can be easily anticipated).
PROS AND CONS OF EACH ARCHITECTURE
• A computer system, however, may run a wide variety of applications, requiring large data and program memories.
• The Von Neumann architecture is better suited for this type of situation, because program and data memories are interchangeable, which leads to better usage of the memory resources.
The SHARC
Super Harvard Architecture Computer
• Developed by Analog Devices.
• Optimized for demanding DSP and imaging applications.
• 32-bit floating point, with 40-bit extended floating point capabilities.
• Large on-chip memory.
• Ideal for scalable multi-processing applications.
Super Harvard Architecture
• Many processors employ a Harvard architecture by having two separate memories or caches integrated into the processor chip.
• The SHARC is unique in that its internal memory is capable of holding a large program as well as a large amount of data. This is what makes it SUPER!
SHARC DSP
• The SHARC incorporates features aimed at optimizing the computation loops of DSP algorithms:
   High-speed floating point capability.
   Extended floating point.
• These features are DSP specific: when applied to a non-DSP application, performance may not be as good.
Floating Point and Extended Floating Point
The SHARC supports Floating, Extended-Floating and Non-Floating point operations.
No additional clock cycles for floating point computations.
Data automatically truncated and zero padded when moved between 32-bit memory and internal registers.
Excellent signal to noise ratio.
SHARC's Internal Memory
Makes the SHARC unique.
• Size
   Allows many complex functions to be performed on-chip, eliminating the need to move data between internal and external memory.
   Memory size is significantly larger than in most other high-speed computational devices.
• Dual-block, dual-port
   Optimizes the Harvard architecture by allowing the fetch of instructions while performing data memory accesses.
Multiply and Accumulate Instructions on the SHARC
• Like most DSPs, the SHARC is able to compute a product and add the product to a running total in a single clock cycle.
• The SHARC's "super" instruction can multiply and accumulate while adding, subtracting, or averaging data in two other registers.
• These instructions give the SHARC its 120 megaflop rating.
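The multiply-accumulate pattern is easiest to see in a dot product, the core of an FIR filter. The sketch below is plain Python and says nothing about cycle counts; the function name `mac_dot` and the sample values are assumptions for illustration. On a DSP, each loop iteration would correspond to one MAC instruction.

```python
def mac_dot(coeffs, samples):
    """Accumulate coefficient * sample products into a running total."""
    acc = 0.0
    for c, s in zip(coeffs, samples):
        acc += c * s  # one multiply-accumulate step per tap
    return acc


# Three-tap weighted average of three input samples (made-up values).
result = mac_dot([0.5, 0.25, 0.25], [4.0, 8.0, 12.0])  # 2.0 + 2.0 + 3.0
```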
DAGs on the SHARC
• Data Address Generators (DAGs) are integer computation units that manage the indexing of registers.
• They allow the SHARC to fetch a value and update the index value.
• If the updated value exceeds a limit, the DAG adjusts the index so that it wraps.
• This occurs in the same clock cycle as the read or write.
DAG Capabilities
• Circular Buffering
   Rather than actually moving data in and out of a vector, circular buffers are used.
   By updating the index modulo the buffer length, the oldest entry can be conveniently replaced by the newest entry.
• Bit Reverse Addressing
   The bit pattern of a vector index is reversed.
   Done automatically by the SHARC.
   Required for the Fast Fourier Transform (FFT), which is often critical to DSP applications.
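Both addressing modes can be imitated in software, which shows what the DAG hardware saves. The following is a minimal Python sketch; the class `CircularBuffer` and the function `bit_reverse` are hypothetical names, and on the SHARC the index update happens in hardware alongside the memory access rather than as separate instructions.

```python
class CircularBuffer:
    """Fixed-length buffer where the newest sample overwrites the oldest."""

    def __init__(self, length):
        self.data = [0] * length
        self.index = 0  # position of the oldest entry (next to be replaced)

    def push(self, value):
        self.data[self.index] = value                   # replace the oldest entry
        self.index = (self.index + 1) % len(self.data)  # modulo update: wrap at the end

    def newest_first(self):
        n = len(self.data)
        return [self.data[(self.index - 1 - k) % n] for k in range(n)]


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (FFT input/output ordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


buf = CircularBuffer(4)
for sample in [1, 2, 3, 4, 5, 6]:
    buf.push(sample)            # no data is shifted, only the index moves
history = buf.newest_first()    # [6, 5, 4, 3]

order = [bit_reverse(i, 3) for i in range(8)]  # [0, 4, 2, 6, 1, 5, 3, 7]
```

The bit-reversed order for 8 points is the index permutation used by a radix-2 FFT.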
SHARC DSP
• What makes the SHARC unique?
   It also has some features not directly related to optimizing numeric computations:
   Pipelining
   Handling branches
• Why has this not emerged sooner?
   Technology has only recently become available to make it economical to integrate general single computing devices.
SHARC's Pipeline
• 3 stages:
   1. Instruction Fetch
   2. Decode
   3. Execution
• It takes three clock cycles for an instruction to propagate through the pipeline.
• The processor execution speed is one instruction per clock cycle, even though each instruction requires three clock cycles.
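The difference between a three-cycle latency and a throughput of one instruction per cycle can be checked with a small schedule generator. This Python sketch assumes an ideal pipeline with no stalls or branches; the function name `pipeline_schedule` and the stage labels are introduced only for the example.

```python
STAGES = ["Fetch", "Decode", "Execute"]


def pipeline_schedule(n_instructions):
    """For each clock cycle, report which instruction number is in each stage."""
    total_cycles = n_instructions + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            instr = cycle - depth  # instruction i enters stage `depth` at cycle i + depth
            row[stage] = instr if 0 <= instr < n_instructions else None
        schedule.append(row)
    return schedule


sched = pipeline_schedule(5)
total = len(sched)   # 5 instructions finish in 5 + 3 - 1 = 7 cycles, not 15
full = sched[2]      # pipeline is full: {"Fetch": 2, "Decode": 1, "Execute": 0}
```

Once the pipeline is full, one instruction completes on every cycle, so N instructions need N + 2 cycles in this idealized model.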
Multi-processing
• The SHARC is uniquely equipped for multi-processing.
• Link ports provide very powerful multi-processing capabilities.
• Two main program models, depending on the application.
• Adapts well to different multi-processing architectures.
Multi-processing: SHARC Links
SHARC has 6 link ports that can transport data at rates up to 40Mbytes/sec.
Links designed for point-to-point connections.
Data can be transmitted in either direction but not both simultaneously.
Multi-processing Program Model: MIMD
Multiple instruction, multiple data.
Good for applications that require multiple instruction threads to execute concurrently.
Processors operate individually. Each processor executes different code.
Typically used for image reconstruction and multi-channel DSP.
Multi-processing Program Model: SIMD
• Single instruction, multiple data.
• Works best when all processors execute identical instruction sequences.
• Does not require overhead for inter-processor synchronization.
• Typically used for synthetic aperture radar (SAR) and automatic target recognition.
(SAR is a form of radar which is used to create images of an object, such as a landscape.)
Summary of what makes the SHARC Super
It performs excellently for DSP applications.
Employs a Harvard Architecture with very large on chip memory.
Respectable Megaflop rating.
Its multiprocessing capabilities.
VLIW (Very Long Instruction Word) Architecture
• VLIW architectures execute multiple instructions per cycle and use simple, regular instruction sets.
   More parallelism, higher performance.
   Better compiler target.
• Multiple independent instructions per cycle, packed into a single large "instruction word" or "packet".
• Large, uniform register sets.
• Wide program and data buses.
Invented by Josh Fisher in his research group at Yale University in the early 1980s.
Very Long Instruction Word (VLIW)
• A technique for instruction-level parallelism by executing instructions without dependencies (known at compile-time) in parallel
• Example of a single VLIW instruction:
F=a+b; c=e/g; d=x&y; w=z*h;
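[Figure: a single VLIW instruction holding F=a+b, c=e/g, d=x&y and w=z*h. Each operation is dispatched to its own processing unit (PU), which reads its operands (a, b; e, g; x, y; z, h) and writes its result (F, c, d, w).]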
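The compile-time dependency check behind this example can be sketched in Python. This is a deliberately simplified, in-order greedy packer, not how a production VLIW compiler schedules code: the helper names `reads_writes`, `independent` and `pack` are hypothetical, variables are assumed to be single letters, and the bundle width of 4 matches the four processing units of the example.

```python
def reads_writes(stmt):
    """Split a statement such as 'F=a+b' into (destination, set of sources)."""
    dest, expr = stmt.split("=")
    return dest.strip(), {ch for ch in expr if ch.isalpha()}


def independent(a, b):
    """True if neither statement reads or overwrites the other's result."""
    dest_a, src_a = reads_writes(a)
    dest_b, src_b = reads_writes(b)
    return dest_a != dest_b and dest_a not in src_b and dest_b not in src_a


def pack(stmts, width=4):
    """Greedily pack statements, in order, into instruction words of `width` slots."""
    bundles = []
    for stmt in stmts:
        current = bundles[-1] if bundles else None
        if current and len(current) < width and all(independent(stmt, s) for s in current):
            current.append(stmt)    # no dependency: share the instruction word
        else:
            bundles.append([stmt])  # dependency or full word: start a new one
    return bundles


one_word = pack(["F=a+b", "c=e/g", "d=x&y", "w=z*h"])
two_words = pack(["F=a+b", "c=F/g", "d=x&y", "w=z*h"])
```

The four statements from the example fit in one instruction word, whereas changing the second one to read F forces it into a later word.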
VLIW – Simplified Architecture Example
[Figure: Program Memory supplies a 256-bit word consisting of 8 instructions of 32 bits each to 8 Execution Units, each unit executing one instruction.]