Arab Academy for Science, Technology & Maritime
Transport
Advanced Computer
Architectures
CC721
Magdy Saeb
Ad. Comp Arch. CC721
• Deeper understanding of:
• Computer Architecture concepts
• design trade-offs for cost/performance
• Advanced Architectures
• trends for the future
• Why?
• match/choose hardware and software to solve a
problem
• design better software (for many programmers)
• design better hardware (for a chosen few)
Course planning
8 Lectures
2 Written exams, project, report and
presentation
Course Outline
Course objectives: This course gives a thorough knowledge of advanced computer architecture concepts, parallel
architectures, and parallel processing. The main aim is to develop the students' research skills and knowledge of state-of-the-art architectures. This topic is strongly related to areas such as computer graphics acceleration, cryptography, coding, hardware design, etc.
Department Home page: www.aast-compeng.info
(Here you will find many course handouts, VHDL lectures, solutions to homework problems, and sample exams.)
Topics: 1. Course Overview, Computational Models,
2. ILP-Processors, Instruction Set, Area, Cost,
3. Pipelined Processors,
4. VLIW Processors,
5. Superscalar Processors,
6. Code Scheduling for ILP-Processors,
7. Branch Processing,
8. SIMD Architectures,
9. MIMD Architectures, Memory Systems,
10. Dataflow Architecture,
11. Processor-in-Memory Architecture
Texts & Grading
Text:
Sima, Fountain, Kacsuk, Advanced Computer Architectures: A Design Space Approach, Addison-Wesley, 1998.
References:
J.L. Hennessy, D. A. Patterson, Computer Architecture, 3rd Edition, Morgan Kaufmann, 2003.
Kai Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993.
Grading:
Homework 10%
Project 20%
Midterm1 30%
Final 40%
Lecturer: Magdy Saeb, Ph.D.
Main text:
Advanced Computer Architectures: A Design Space Approach
Sima, Fountain, Kacsuk
Supplementary text:
Computer Architecture: A Quantitative Approach, Hennessy, Patterson
See
http://www.aast-compeng.info
for information on:
• News
• Lectures
• Sample Exams
• Lab Status
• Grading
Computational Models
Advanced Computer Architectures:
Part I
• Computational Models
• The Concept of Computer Architecture
• Introduction To Parallel Processing
Sima et al. introduce a design space approach
to Computer Architecture: design aspects
are broken down into atoms, or tiny pieces.
Computational Models

  Computational Model     Language Class               Architectural Class
  Turing                  Type-0 language: Assembly    -
  Von Neumann             Imperative: C                Von Neumann
  Data Flow               Single Assignment            Data Flow
  Applicative             Functional: LISP             Reduction
  Predicate Logic-based   Logic Programming: Prolog    N/A
  Object-oriented         Object-oriented: C++         Object-oriented
Part I, Computational Models
• Turing, Type-0 language
• Infinite memory, not feasible
• Von Neumann, Imperative (C)
• Traditional architecture, Control/Memory
• Finite State Machine (FSM)
• Multiple Assignment gives side effects
• Sequential in nature
• Control statements
Part I, Computational Models
• Dataflow, Single Assignment language
  • Dataflow machines
• Applicative, Functional (Haskell/ML)
  • Reduction machines
• Object Based, Object Oriented (C++)
  • Object-oriented computers (similar to Von Neumann, but depend on message passing)
• Predicate Logic Based, Logic Based (Prolog)
  • Has not been realized
Part I, Computational Models
Computational models can be “emulated” on
Von Neumann machines.
Hard to beat on cost/performance!
Part I, Sima, The Concept of Computer
Architecture
• Abstract architecture
• Deals with functional specification
• For example: programmers model/instruction set
• Concrete architecture
• Deals with aspects of the implementation
• For example: logic design as block diagram of
functional units
• DS Trees to define design space
  • con: consists of
  • pex: can be exclusively performed by
  • per: can be performed by
  • example = con(pex(A,B), per(C, con(D,E)))
    (the corresponding tree has leaves A, B, C, D, E)
Part I, The Concept of Computer
Architecture
• Process/Process trees/Threads
  • Process Control Block (PCB)
  • Resource mapping per process
  • Threads (lightweight processes) inherit/share resources
• Concurrent/Parallel execution
  • Concurrent (time-sliced): multi-threaded architectures
  • Parallel (multiple CPUs): parallel architectures, multiprocessors, multicomputers (clusters)
Part I, Introduction to Parallel
Processing
• Types of Parallelism
  • Available: inherent in the problem
  • Utilized: by the architecture implementation
  • Functional: from the problem solution (usually irregular); exploited by ILP, multi-threading, MIMD
  • Data: from computations (regular, like vectors); exploited by SIMD

Flynn's Classification:
  SISD   SIMD
  MISD   MIMD
Introduction to Instruction-Level Parallelism
(ILP)
Introduction to Instruction-Level Parallelism (ILP)

  Traditional Von Neumann processors
    (sequential issue, sequential execution)
    Typical implementation: non-pipelined processors

  Scalar ILP processors
    (sequential issue, parallel execution)
    Typical implementation: processors with multiple non-pipelined EUs, and pipelined processors

  Superscalar ILP processors
    (parallel issue, parallel execution)
    Typical implementation: VLIW and superscalar processors with multiple pipelined EUs

  Moving across this spectrum first adds parallelism of instruction execution, then parallelism of instruction issue.
Some Definitions
• A pipelined processor:
  • Has instruction-level parallelism by having one instruction in each stage of the pipeline
• An execution unit (EU) is a block that performs some function which helps complete an instruction:
  • Integer ALU, Floating-Point Unit (FPU), Branch Unit (BU), and Load/Store Unit are examples of execution units.
Methods of achieving parallelism
There are two major methods of achieving
parallelism:
• Pipelining
• Replication
More Definitions
• A superscalar processor:
• Issues multiple instructions per clock cycle from a
sequential stream
• Dynamic scheduling of execution units (scheduling
done in hardware)
• A Very Long Instruction Word (VLIW)
processor:
• Issues one very 'wide' instruction per clock cycle; this
instruction contains multiple operations
• Static scheduling of execution units (done by the
compiler).
Pipelined vs. VLIW/Superscalar
Pipelined operation
EU1 EU2 EU3
Pipelined Processors
Parallel operation
EU1 EU2 EU3
VLIW and Superscalar
processors
Execution units in VLIW and
Superscalar processors can be
pipelined!
Typical Pipeline
        Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Load:   Ifetch    Reg/Dec   Exec      Mem       Wr
• Ifetch: Instruction Fetch
• Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file
Execution Units can be pipelined
(PowerPC 601 Example)
Branch instructions:     Fetch -> Issue/Decode -> Execute/Predict
Integer instructions:    Fetch -> Issue/Decode -> Execute -> Writeback
Load/Store instructions: Fetch -> Issue/Decode -> Addr Gen -> Cache -> Writeback
Execution Units can be pipelined
(PowerPC 601 Example) (cont.)
FP Instructions
FP instructions:         Fetch -> Issue/Decode -> Execute 1 -> Execute 2 -> Writeback
Data Dependencies
• Data Dependencies present problems for
instruction level parallelism
• Types of data dependencies:
• Straight line code
• Read After Write (RAW)
• Write After Read (WAR)
• Write After Write (WAW)
• Loops
• Recurrence or inter-iteration dependencies
Straight Line Dependencies
Read After Write (RAW)
i1: load r1, a;
i2: add r2, r1, r1;
Assume a pipeline of Fetch/Decode/Execute/Mem/Writeback
When add is in the DECODE stage (which fetches r1), the load is
in the EXECUTE stage and the true value of r1 has not been
fetched yet! (r1 is fetched in the Mem stage)
Solve this by either stalling the ‘add’ until the value of r1 is ready,
or by forwarding the value of r1 from the Mem stage to the
Execute stage.
Straight Line Dependencies
(cont)
Write after Read (WAR)
i1: mul r1, r2, r3; r1 <= r2 * r3
i2: add r2, r4 , r5; r2 <= r4 + r5
If instruction i2 (add) is executed before instruction i1 (mul) for
some reason, then i1 (mul) could read the wrong value for r2.
One reason for delaying i1 would be a stall for the ‘r3’ value being
produced by a previous instruction. Instruction i2 could proceed
because it has all its operands, thus causing the WAR hazard.
Use register renaming to eliminate WAR dependency. Replace r2
with some other register that has not been used yet.
Straight Line Dependencies
(cont.)
Write after Write (WAW)
i1: mul r1, r2, r3; r1 <= r2 * r3
i2: add r1, r4 , r5; r1 <= r4 + r5
If instruction i1 (mul) finishes AFTER instruction i2 (add), then
register r1 would get the wrong value. Instruction i1 could finish
after instruction i2 if separate execution units were used for
instructions i1 and i2.
One way to solve this hazard is to simply let instruction i1 proceed
normally, but disable its write stage.
Loop Dependencies
Recurrences:
do I = 2, n
X(I) = A * X(I-1) + B;
enddo
One way to parallelize this loop would be to 'unroll' it (create
copies of the loop body). However, a dependency exists between the
current X value and the previous iteration's value, so loop
unrolling will not give us any more parallelism.
This type of data dependency cannot be solved at the
implementation level, but must be addressed at the compiler level.
Control Dependencies
• Control Dependencies (i.e. branches) are a
major obstacle to instruction level parallelism
• In a pipelined machine, normally have branch
condition computation done as EARLY as possible in
the pipeline in order to lessen the impact of incorrect
branch prediction (taken or not taken)
• Conditional branch instructions make up roughly 20% of
general-purpose code and 5-10% of scientific code.
Branch Strategies
• Static
• Always predict taken or not-taken
• Dynamic
• Keep a history of code execution and modify
predictions based on execution history
• Multi-way
• Execute both branch paths and kill incorrect
path as soon as branch condition is resolved.
Control Dependency Graph
i0
i1
i2
i3
i4
i5 i6
i8
i7
i0: r1 = op1;
i1: r2 = op2;
i2: r3 = op3;
i3: if (r2 > r1) {
i4:   if (r3 > r1)
i5:     r4 = r3;
i6:   else r4 = r1;
i7: } else r4 = r2;
i8: r5 = r4 * r4;
Resource Dependencies
• A resource dependency arises when an instruction
requires a hardware resource being used by a
previously issued instruction (also known as a
structural hazard)
• Execution units, busses (e.g., the external address/data
bus)
• A resource dependency can only be solved by
resource duplication
• The Harvard architecture has separate address/data
busses for instructions and data
Instruction Scheduling
• Instruction Scheduling is the assignment of
instructions to hardware resources.
• Hardware resources are busses, registers, and
execution units
• Static scheduling is done by compiler or by
human
• Hardware assumes that ALL hazards have been
eliminated.
• Lessens the amount of control logic needed which
hopefully speeds up maximum clock speed
Instruction Scheduling (cont).
• Dynamic Scheduling is implemented in
hardware inside of processor.
• All instruction streams are 'legal'
• Control logic and hardware resources needed
for dynamic scheduling can be significant.
• If trying to execute legacy code streams,
then dynamic scheduling may be the only
option.
Pipelined Processors
Definitions
• FX pipeline - Fixed point pipeline (integer
pipeline)
• FP pipeline - Floating Point pipeline
• Cycle time - length of clock period for pipeline,
determined by slowest stage.
• Latency (used in reference to RAW hazards) -
the amount of time that the result of a particular
instruction takes to become available in the
pipeline for a subsequent dependent instruction
(measured in multiples of clock cycles)
RAW Dependency, Latencies
• Define-use Latency is the time delay after
decoding and issue of an instruction until the
result becomes available for a subsequent RAW
dependent instruction.
  add r1, r2, r3
  add r5, r1, r6    ; define-use dependency
Usually one cycle for simple instructions.
• Define-use Delay of an instruction is the time a
subsequent RAW-dependent instruction has to
be stalled in the pipeline. It is one less cycle
than the define-use latency.
RAW Dependency, Latencies
(cont)
• If define-use latency = 1, then define-use delay
is 0 and the pipeline is not stalled.
• This is the case for most simple instructions in the FX
pipeline
• Non-pipelined FP operations can have define-use
latencies from a few cycles to tens of cycles.
• Load-use dependency, Load-use latency, load-
use delay refer to load instructions
load r1, 4(r2)
add r3, r1, r2
Definitions are the same as define-use
dependency, latency, and delay.
More Definitions
Repetition Rate R (throughput) - shortest
possible time interval between subsequent
independent instructions in the pipeline
Performance Potential of a Pipeline - the number of independent
instructions which can be executed in a unit interval of time:
P = 1 / (R * tc )
R: repetition rate in clock cycles
tc : cycle time of the pipeline
Table 5.1 from Text
(latency / repetition rate)

  Processor     Cycle Time  Prec  Fadd  FMult  FDiv   Fsqrt
  Alpha 21064   7/5/2       s     6/1   6/1    34     -
                            d     6/1   6/1    63     -
  Pentium       6/5/3.3     s     3/1   3/1    39     70
                            d     3/1   3/1    30     70
  Pentium Pro   6.7/5/3.3   s     3/1   5/2    18     29
                            d     3/1   5/2
  HP PA 8000    5.6         s     3/1   3/1    17     17
                            d     3/1   3/1    31     31
  SuperSparc    20/17       s     1/1   3/1    6/4    8/6
                            d                  9/7    12/10
How many stages?
• The more stages, the less combinational logic within a
stage, the higher the possible clock frequency
• More stages can complicate control. The DEC Alpha has 7 stages for
FX instructions, and even basic FX instructions have a define-use delay
of one cycle
• Becomes difficult to divide up logic evenly between stages
• Clock skew between stages becomes harder to manage
• Diminishing returns as the number of stages grows
• Superpipelining is a term used for processors that use a
high number of stages.
Dedicated Pipelines versus
Multifunctional Pipelines
• Trend in current high-performance CPUs
is to use different logical AND physical
pipelines for different instruction classes
• FX pipeline (integer)
• FP pipeline (floating point)
• L/S pipeline (Load/Store)
• B pipeline (Branch)
• Allows more concurrency, more
optimization
• Silicon area more plentiful
Sequential Consistency
• With multiple pipelines, how do we maintain
sequential consistency when instructions are
finishing at different times?
• With just two pipelines (FX and FP), we can lengthen
the shorter pipeline either statically or dynamically.
Dynamic lengthening would be used only when
hazards are detected.
• We can force the pipelines to write to a special unit
called a Renaming Buffer or Reordering Buffer. It is
the job of this unit to maintain sequential consistency.
Will look at this in detail in Chapter 7 (superscalar).
RISC versus CISC pipelines
• Pipelines for CISC are required to handle complex
memory to register addressing
• mov r4, 4(r3, r2)    ; EA is r3 + r2 + 4
• Will have an extra stage for Effective address calculation (see
Figures 5.40, 5.41, 5.43)
• Some CISC pipelines avoid a load-use delay penalty (Fig 5.54,
5.56)
• RISC pipelines have a load-use penalty of at least one cycle
• Determining load-use penalties when multiple pipelines
are in action is instruction-sequence dependent (i.e., 1,
2, or more than 2 cycles)
Some other important Figures in
Chapter 5
• Figure 5.26 (illustrates use of both clock phases
for performing pipeline tasks)
• Figure 5.31, Figure 5.32 (Pentium Pipeline,
shows difference between logical and physical
pipelines)
• Figure 5.33, Figure 5.34 (PowerPC 604 - first
look at a modern superscalar processor)
CC721
Computing Systems
Part 3: VLIW Architecture
Basic Working Principles of VLIW
• Aims at speeding up computation by exploiting instruction-level parallelism.
• Same hardware core as superscalar processors, having multiple execution units (EUs) working in parallel.
• An instruction consists of multiple operations; typical word lengths run from a few hundred bits up to 1 Kbit.
• All operations in an instruction are executed in lock-step mode.
• One or multiple register files for FX and FP data.
• Relies on the compiler to find parallelism and schedule dependency-free program code.
Basic VLIW Approach
Register File Structure for
VLIW
What is the challenge to the register file in VLIW? The number of R/W ports.
Differences Between VLIW & Superscalar
Architecture (I)
• Instruction formulation:
• Superscalar:
• Receives conventional instructions conceived for sequential
processors.
• VLIW:
• Receives (very) long instruction words, each comprising a
field (or opcode) for each execution unit.
• Instruction word length depends on (a) the number of execution
units, and (b) the code length needed to control each unit (such as
opcode length, register names, ...).
• Typical word length is 256 - 1024 bits, much longer than a
conventional machine word length.
Differences Between VLIW & Superscalar Architecture (II)
• Instruction scheduling:
• Superscalar:
• Performed dynamically at run-time by the hardware.
• Data dependency is checked and resolved in hardware.
• Need a look-ahead hardware window for instruction fetch.
Differences Between VLIW & Superscalar Architecture (III)
• Instruction scheduling (cont'd):
• VLIW:
• Static scheduling done at compile-time by
the compiler.
• Advantages:
  • Reduced hardware complexity.
  • Tasks such as decoding, data dependency detection,
    instruction issue, etc. become simple.
  • Potentially higher clock rate.
  • Higher degree of parallelism with global program
    information.
Differences Between VLIW & Superscalar Architecture (IV)
Instruction scheduling (cont'd):
VLIW:
Disadvantages:
Higher complexity of the compiler.
Compiler optimization needs to consider
technology-dependent parameters such as
latencies and the load-use time of the cache.
(Question: What happens to the software if the
hardware is updated?)
The non-deterministic problem of cache misses,
resulting in worst-case assumptions for code
scheduling.
In case of unfilled opcode fields in a (V)LIW, memory
space and instruction bandwidth are wasted.
Differences Between VLIW & Superscalar Architecture (V)
Development history of Proposed/Commercial
VLIWs
Case Study of VLIW: Trace 200 Family (I)
Case Study of VLIW: Trace 200 Family (II)
Only two branches might be used in the Trace 7/200.
• It is found that code in VLIW is expanded roughly by a factor of three.
• For "long" VLIW words, more opcode fields will be left empty. This results in wasted bandwidth and storage space.
Can you propose a solution for it?
Code Expansion in VLIW