
Arab Academy for Science, Technology & Maritime Transport

Advanced Computer Architectures
CC721

Magdy Saeb

Ad. Comp Arch. CC721

• Deeper understanding of:
  • Computer architecture concepts
  • Design trade-offs for cost/performance
  • Advanced architectures
  • Trends for the future
• Why?
  • Match/choose hardware and software to solve a problem
  • Design better software (for many programmers)
  • Design better hardware (for a chosen few)

Ad. Comp Arch. CC721

Course planning

8 lectures
2 written exams, project, report and presentation

Course Outline

Course objectives: This course gives thorough knowledge of advanced computer architecture concepts, parallel architectures and parallel processing. The main aim is to develop the students' research skills and knowledge of state-of-the-art architectures. This topic is strongly related to areas such as computer graphics acceleration, cryptography, coding, hardware design, etc.

Department home page: www.aast-compeng.info
(Here you will find many course handouts, VHDL lectures, solutions to homework problems, and sample exams.)

Topics:
1. Course Overview, Computational Models
2. ILP-Processors, Instruction Set, Area, Cost
3. Pipelined Processors
4. VLIW Processors
5. Superscalar Processors
6. Code Scheduling for ILP-Processors
7. Branch Processing
8. SIMD Architectures
9. MIMD Architectures, Memory Systems
10. Dataflow Architecture
11. Processor-in-Memory Architecture

Texts & Grading

Text:
Sima, Fountain, Kacsuk, Advanced Computer Architectures: A Design Space Approach, Addison-Wesley, 1998.

References:
J. L. Hennessy, D. A. Patterson, Computer Architecture: A Quantitative Approach, 3rd Edition, Morgan Kaufmann, 2003.
Kai Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993.

Grading:
Homework 10%
Project 20%
Midterm 1 30%
Final 40%

Lecturer: Magdy Saeb, Ph.D.

Ad. Comp Arch. CC721

Main text:
Advanced Computer Architectures: A Design Space Approach
Sima, Fountain, Kacsuk

Supplementary text:
Computer Architecture: A Quantitative Approach
Hennessy, Patterson

Ad. Comp Arch. CC721

See http://www.aast-compeng.info for information on:
• News
• Lectures
• Sample Exams
• Lab Status
• Grading


Computational Models

Advanced Computer Architectures: Part I

• Computational Models
• The Concept of Computer Architecture
• Introduction To Parallel Processing

Sima et al. introduce a design space approach to computer architecture (design aspects are broken down into atoms, or tiny pieces).

Computational Models

Computational Model     Language Class               Architectural Class
Turing                  Type 0 language: Assembly
Von Neumann             Imperative: C                Von Neumann
Data Flow               Single Assignment            Data Flow
Applicative             Functional: LISP             Reduction
Predicate Logic-based   Logic Programming: Prolog    N/A
Object-oriented         Object-oriented: C++         Object-oriented

Part I, Computational Models

• Turing, Type 0 language
  • Infinite memory, not feasible
• Von Neumann, Imperative (C)
  • Traditional architecture, Control/Memory
  • Finite State Machine (FSM)
  • Multiple assignment gives side effects
  • Sequential in nature
  • Control statements

Part I, Computational Models

• Dataflow, Single Assignment language
  • Dataflow machines
• Applicative, Functional (Haskell/ML)
  • Reduction machines
• Object Based, Object Oriented (C++)
  • Object-oriented computers (similar to Von Neumann, but dependent on message passing)
• Predicate Logic Based, Logic Based (Prolog)
  • Has not been realized

Part I, Computational Models

The other computational models can be "emulated" on Von Neumann machines, which are hard to beat on cost/performance!

Part I, Sima, The Concept of Computer Architecture

• Abstract architecture
  • Deals with functional specification
  • For example: programmer's model/instruction set
• Concrete architecture
  • Deals with aspects of the implementation
  • For example: logic design as a block diagram of functional units

Part I, The Concept of Computer Architecture

• DS trees to define the design space
  • con: consists of
  • pex: can be exclusively performed by
  • per: can be performed by
  • example = con(pex(A,B),per(C,con(D,E)))

[Figure: DS tree of the example, with leaves A, B, C, D, E]
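The DS-tree notation maps naturally onto a nested data structure. The sketch below is only an illustration: the tuple encoding and the helper names `render` and `leaves` are assumptions made here, not part of the notation in the text.

```python
# Hypothetical encoding of a DS tree: (relation, child, child, ...),
# where relation is "con", "pex" or "per" and leaves are plain strings.
example = ("con", ("pex", "A", "B"), ("per", "C", ("con", "D", "E")))

def render(node):
    """Rebuild the textual form of a DS tree from the nested tuples."""
    if isinstance(node, str):
        return node
    return node[0] + "(" + ",".join(render(child) for child in node[1:]) + ")"

def leaves(node):
    """Return the leaf design aspects of a DS tree, left to right."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node[1:]:
        result.extend(leaves(child))
    return result

print(render(example))
print(leaves(example))
```

Rendering the structure back to text is a quick way to check that the parentheses in a hand-written DS expression balance.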

Part I, Introduction to Parallel Processing

• Process/Process trees/Threads
  • Process Control Block (PCB)
  • Resource mapping per process
  • Threads/lightweight processes, which inherit/share resources
• Concurrent/Parallel execution
  • Concurrent (time sliced): multithreaded architectures
  • Parallel (multiple CPUs): parallel architectures, multiprocessors, multicomputers (clusters)

Part I, Introduction to Parallel Processing

• Types of parallelism
  • Available, inherent in the problem
  • Utilized by the architecture implementation
  • Functional, from the problem solution (usually irregular): ILP, multi-threading, MIMD
  • Data, from computations (regular, like vectors…): SIMD

Flynn's Classification:
SISD  SIMD
MISD  MIMD

Introduction to Instruction-Level Parallelism (ILP)

Introduction to Instruction-Level Parallelism (ILP)

[Figure: processor classes arranged by parallelism of instruction issue and parallelism of instruction execution]

• Traditional Von Neumann processors (sequential issue, sequential execution)
  • Typical implementation: non-pipelined processors
• Scalar ILP processors (sequential issue, parallel execution)
  • Typical implementation: processors with multiple non-pipelined EUs, and pipelined processors
• Superscalar ILP processors (parallel issue, parallel execution)
  • Typical implementation: VLIW and superscalar processors with multiple pipelined EUs

Some Definitions

• A pipelined processor:
  • Has instruction-level parallelism by having one instruction in each stage of the pipeline
• An execution unit (EU) is a block that performs some function which helps complete an instruction:
  • Integer ALU, Floating Point Unit (FPU), Branch Unit (BU), and Load/Store Unit are examples of execution units.

Methods of achieving parallelism

There are two major methods of achieving parallelism:
• Pipelining
• Replication

More Definitions

• A superscalar processor:
  • Issues multiple instructions per clock cycle from a sequential stream
  • Dynamic scheduling of execution units (scheduling done in hardware)
• A Very Long Instruction Word (VLIW) processor:
  • Issues one very 'wide' instruction per clock cycle; this instruction contains multiple operations
  • Static scheduling of execution units (done by the compiler).

Pipelined vs. VLIW/Superscalar

[Figure: pipelined operation of EU1, EU2, EU3 (pipelined processors)]
[Figure: parallel operation of EU1, EU2, EU3 (VLIW and superscalar processors)]

Execution units in VLIW and superscalar processors can be pipelined!

Typical Pipeline

        Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Load    Ifetch    Reg/Dec   Exec      Mem       Wr

• Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file
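The overlap that this five-stage pipeline provides can be tabulated with a few lines of Python. This is a minimal sketch assuming an ideal pipeline with no stalls, in which one new load enters every cycle; the function names are made up for the illustration.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]

def schedule(num_instructions):
    """Map each instruction to {cycle: stage} for an ideal pipeline
    with no stalls, where one new instruction enters every cycle."""
    table = []
    for i in range(num_instructions):
        # Instruction i enters Ifetch in cycle i + 1 (cycles count from 1).
        table.append({i + 1 + s: name for s, name in enumerate(STAGES)})
    return table

def total_cycles(num_instructions):
    # The first instruction needs len(STAGES) cycles; each later one adds 1.
    return len(STAGES) + num_instructions - 1

for i, row in enumerate(schedule(3)):
    print(f"load {i}:", row)
print("cycles for 3 loads:", total_cycles(3))
```

In any one cycle of the table, each stage holds at most one instruction, which is the sense in which a pipelined processor has one instruction in each stage.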

Execution Units can be pipelined (PowerPC 601 Example)

[Figure: pipeline diagrams per instruction class; the stage labels are listed below]

Branch instructions:     Fetch, Issue, Decode, Execute, Predict
Integer instructions:    Fetch, Issue, Decode, Execute, Writeback
Load/Store instructions: Fetch, Issue, Decode, Addr Gen, Cache, Writeback
FP instructions:         Fetch, Issue, Decode, Execute 1, Execute 2, Writeback

Data Dependencies

• Data dependencies present problems for instruction-level parallelism
• Types of data dependencies:
  • Straight-line code
    • Read After Write (RAW)
    • Write After Read (WAR)
    • Write After Write (WAW)
  • Loops
    • Recurrence or inter-iteration dependencies

Straight Line Dependencies

Read After Write (RAW)

i1: load r1, a;
i2: add r2, r1, r1;

Assume a pipeline of Fetch/Decode/Execute/Mem/Writeback.

When the add is in the Decode stage (which fetches r1), the load is in the Execute stage and the true value of r1 has not been fetched yet! (The load fetches r1 in the Mem stage.)

Solve this by either stalling the 'add' until the value of r1 is ready, or by forwarding the value of r1 from the Mem stage to the Execute stage.

Straight Line Dependencies (cont.)

Write After Read (WAR)

i1: mul r1, r2, r3;   r1 <= r2 * r3
i2: add r2, r4, r5;   r2 <= r4 + r5

If instruction i2 (add) is executed before instruction i1 (mul) for some reason, then i1 (mul) could read the wrong value for r2.

One reason for delaying i1 would be a stall for the 'r3' value being produced by a previous instruction. Instruction i2 could proceed because it has all its operands, thus causing the WAR hazard.

Use register renaming to eliminate the WAR dependency: replace the destination r2 of i2 with some other register that has not been used yet.
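Register renaming can be sketched in a few lines. The version below is a simplified illustration, not a description of any particular processor: it assumes an unlimited supply of fresh registers, and the `p0`, `p1`, ... names and the `(op, dest, src1, src2)` tuple format are made up for the example.

```python
def rename(instructions):
    """Give every destination a fresh register name (hypothetical p-names).

    Each instruction is (op, dest, src1, src2) with register names as strings.
    Sources are mapped through the current renaming table first, so true
    (RAW) dependencies are preserved while WAR/WAW name clashes disappear.
    """
    mapping = {}      # architectural name -> latest renamed name
    next_free = 0
    renamed = []
    for op, dest, src1, src2 in instructions:
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        new_dest = f"p{next_free}"
        next_free += 1
        mapping[dest] = new_dest
        renamed.append((op, new_dest, s1, s2))
    return renamed

code = [("mul", "r1", "r2", "r3"),   # i1: r1 <= r2 * r3
        ("add", "r2", "r4", "r5")]   # i2: r2 <= r4 + r5
print(rename(code))
```

After renaming, i2 writes `p1` instead of `r2`, so it can complete before i1 without disturbing the `r2` value that i1 still has to read.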

Straight Line Dependencies (cont.)

Write After Write (WAW)

i1: mul r1, r2, r3;   r1 <= r2 * r3
i2: add r1, r4, r5;   r1 <= r4 + r5

If instruction i1 (mul) finishes AFTER instruction i2 (add), then register r1 would get the wrong value. Instruction i1 could finish after instruction i2 if separate execution units were used for instructions i1 and i2.

One way to solve this hazard is to simply let instruction i1 proceed normally, but disable its write stage.
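The three straight-line dependency types can be told apart by comparing destination and source registers. The sketch below assumes a simplified instruction format of one destination plus a tuple of sources; the function name `hazards` is chosen for the illustration.

```python
def hazards(first, second):
    """Classify data dependencies of `second` on the earlier `first`.

    Instructions are (dest, sources) pairs, e.g. ("r1", ("r2", "r3")).
    """
    d1, s1 = first
    d2, s2 = second
    found = []
    if d1 in s2:
        found.append("RAW")   # second reads what first writes
    if d2 in s1:
        found.append("WAR")   # second overwrites what first reads
    if d1 == d2:
        found.append("WAW")   # both write the same register
    return found

# load r1, a        ; add r2, r1, r1
print(hazards(("r1", ("a",)), ("r2", ("r1", "r1"))))
# mul r1, r2, r3    ; add r2, r4, r5
print(hazards(("r1", ("r2", "r3")), ("r2", ("r4", "r5"))))
# mul r1, r2, r3    ; add r1, r4, r5
print(hazards(("r1", ("r2", "r3")), ("r1", ("r4", "r5"))))
```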

Loop Dependencies

Recurrences:

do I = 2, n
   X(I) = A * X(I-1) + B;
enddo

One way to parallelize this loop would be to 'unroll' it (create (N-2) copies of the loop body). However, a dependency exists between the current X value and the value from the previous iteration, so loop unrolling will not give us any more parallelism.

This type of data dependency cannot be solved at the implementation level, but must be addressed at the compiler level.
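As an illustration of what a compiler-level treatment could look like, the sketch below applies back-substitution to the recurrence, so that X(I) depends on X(I-2) instead of X(I-1). This particular transformation is an example chosen here, not one named in the slides, and the function names are hypothetical.

```python
def recurrence(x1, a, b, n):
    """X(I) = A * X(I-1) + B for I = 2..n; returns [X(1), ..., X(n)]."""
    x = [x1]
    for _ in range(2, n + 1):
        x.append(a * x[-1] + b)   # each step needs the previous result
    return x

def back_substituted(x1, a, b, n):
    """Same values, but X(I) is computed from X(I-2):
    X(I) = A*A*X(I-2) + A*B + B, so the odd and even chains are independent."""
    x = [x1, a * x1 + b]
    for i in range(2, n):
        x.append(a * a * x[i - 2] + a * b + b)
    return x[:n]

print(recurrence(1, 2, 1, 5))
print(back_substituted(1, 2, 1, 5))
```

The rewritten loop does more arithmetic per element, but two consecutive elements no longer depend on each other and could be computed in parallel.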

Control Dependencies

• Control dependencies (i.e. branches) are a major obstacle to instruction-level parallelism
  • In a pipelined machine, the branch condition is normally computed as EARLY as possible in the pipeline in order to lessen the impact of incorrect branch prediction (taken or not taken)
• Conditional branch instructions account for 20% of general-purpose code and 5-10% of scientific code.

Branch Strategies

• Static
  • Always predict taken or not-taken
• Dynamic
  • Keep a history of code execution and modify predictions based on execution history
• Multi-way
  • Execute both branch paths and kill the incorrect path as soon as the branch condition is resolved.
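A dynamic strategy can be illustrated with a 2-bit saturating counter. The slides do not name a specific predictor, so this scheme, the class name `TwoBitPredictor` and the sample branch history are assumptions chosen for the sketch.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""

    def __init__(self, state=2):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step towards the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

def accuracy(outcomes, predictor):
    """Fraction of branches predicted correctly over a history."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)

# A loop branch: taken 9 times, then falls through once, repeated twice.
history = ([True] * 9 + [False]) * 2
print(accuracy(history, TwoBitPredictor()))
```

With two bits of state, a single loop exit does not flip the prediction, so the next execution of the loop starts with a correct "taken" guess.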

Control Dependency Graph

[Figure: control dependency graph over instructions i0 to i8]

i0: r1 = op1;
i1: r2 = op2;
i2: r3 = op3;
i3: if (r2 > r1) {
i4:     if (r3 > r1)
i5:         r4 = r3;
i6:     else r4 = r1;
i7: } else r4 = r2;
i8: r5 = r4 * r4;

Resource Dependencies

• A resource dependency occurs when an instruction requires a hardware resource being used by a previously issued instruction (also known as a structural hazard)
  • Execution units, busses (e.g., external address/data bus)
• A resource dependency can only be solved by resource duplication
  • The Harvard architecture has separate address/data busses for instructions and data

Instruction Scheduling

• Instruction scheduling is the assignment of instructions to hardware resources.
  • Hardware resources are busses, registers, and execution units
• Static scheduling is done by the compiler or by a human
  • Hardware assumes that ALL hazards have been eliminated.
  • Lessens the amount of control logic needed, which hopefully raises the maximum clock speed

Instruction Scheduling (cont.)

• Dynamic scheduling is implemented in hardware inside the processor.
  • All instruction streams are 'legal'
• Control logic and hardware resources needed for dynamic scheduling can be significant.
• If trying to execute legacy code streams, then dynamic scheduling may be the only option.


Pipelined Processors

Definitions

• FX pipeline - fixed-point pipeline (integer pipeline)
• FP pipeline - floating-point pipeline
• Cycle time - length of the clock period for the pipeline, determined by the slowest stage.
• Latency (used in reference to RAW hazards) - the amount of time that the result of a particular instruction takes to become available in the pipeline for a subsequent dependent instruction (measured in multiples of clock cycles)

RAW Dependency, Latencies

• Define-use latency is the time delay after decoding and issue of an instruction until the result becomes available for a subsequent RAW-dependent instruction.

  add r1, r2, r3
  add r5, r1, r6     define-use dependency

  Usually one cycle for simple instructions.
• Define-use delay of an instruction is the time a subsequent RAW-dependent instruction has to be stalled in the pipeline. It is one cycle less than the define-use latency.

RAW Dependency, Latencies (cont.)

• If define-use latency = 1, then the define-use delay is 0 and the pipeline is not stalled.
  • This is the case for most simple instructions in the FX pipeline
  • Non-pipelined FP operations can have define-use latencies from a few cycles to tens of cycles.
• Load-use dependency, load-use latency, and load-use delay refer to load instructions

  load r1, 4(r2)
  add r3, r1, r2

  Definitions are the same as for define-use dependency, latency, and delay.
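The relation between latency and delay is simple arithmetic, sketched below. The second helper goes slightly beyond the definitions above: it assumes that independent instructions placed between producer and consumer each hide one stall cycle, which is an added assumption for illustration.

```python
def define_use_delay(latency_cycles):
    """Stall cycles seen by a RAW-dependent instruction issued right after."""
    if latency_cycles < 1:
        raise ValueError("latency is at least one cycle")
    return latency_cycles - 1

def stall_cycles(latency_cycles, independent_between=0):
    """Stalls left when `independent_between` unrelated instructions are
    scheduled between the producer and its consumer (an added assumption)."""
    return max(0, define_use_delay(latency_cycles) - independent_between)

print(define_use_delay(1))   # simple FX instruction
print(stall_cycles(3, 1))    # 3-cycle latency, one instruction in between
```

The same two functions apply to load-use latency and load-use delay, since the definitions are identical.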

More Definitions

Repetition rate R (throughput) - shortest possible time interval between subsequent independent instructions in the pipeline

Performance potential of a pipeline - the number of independent instructions which can be executed in a unit interval of time:

P = 1 / (R * tc)

R: repetition rate in clock cycles
tc: cycle time of the pipeline
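A worked instance of the formula, with illustrative numbers only (R = 1 cycle and tc = 5 ns are not taken from any processor in the text):

```python
def performance_potential(repetition_rate_cycles, cycle_time_s):
    """P = 1 / (R * tc): independent instructions per second."""
    return 1.0 / (repetition_rate_cycles * cycle_time_s)

# Illustrative numbers only: R = 1 cycle, tc = 5 ns.
p = performance_potential(1, 5e-9)
print(f"{p / 1e6:.0f} million instructions per second")
```

Doubling R to 2 cycles at the same cycle time halves the performance potential.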

Table 5.1 from Text (latency/repetition rate)

Processor     Cycle time   Prec   FAdd   FMult   FDiv   FSqrt
α21064        7/5/2        s      6/1    6/1     34     -
                           d      6/1    6/1     63     -
Pentium       6/5/3.3      s      3/1    3/1     39     70
                           d      3/1    3/1     30     70
Pentium Pro   6.7/5/3.3    s      3/1    5/2     18     29
                           d      3/1    5/2
HP PA 8000    5.6          s      3/1    3/1     17     17
                           d      3/1    3/1     31     31
SuperSparc    20/17        s      1/1    3/1     6/4    8/6
                           d                     9/7    12/10

How many stages?

• The more stages, the less combinational logic within a stage, and the higher the possible clock frequency
  • More stages can complicate control. The DEC Alpha has 7 stages for FX instructions, and these instructions have a define-use delay of one cycle even for basic FX instructions
  • It becomes difficult to divide up the logic evenly between stages
  • Clock skew between stages becomes more difficult to manage
  • Diminishing returns as the number of stages becomes large
• Superpipelining is a term used for processors that use a high number of stages.
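The diminishing returns can be made concrete with a toy timing model. The model is an assumption made for this sketch, not something given in the slides: the logic divides evenly among the stages, and each stage adds a fixed overhead for pipeline registers and clock skew.

```python
def cycle_time(total_logic_ns, stages, latch_overhead_ns):
    """Cycle time if logic divides evenly and each stage adds a fixed
    register/skew overhead (both are simplifying assumptions)."""
    return total_logic_ns / stages + latch_overhead_ns

def speedup(total_logic_ns, stages, latch_overhead_ns):
    """Clock-rate gain relative to a single-stage implementation."""
    base = cycle_time(total_logic_ns, 1, latch_overhead_ns)
    return base / cycle_time(total_logic_ns, stages, latch_overhead_ns)

# Hypothetical numbers: 20 ns of logic, 1 ns of overhead per stage.
for n in (1, 2, 5, 10, 20, 40):
    print(n, round(speedup(20.0, n, 1.0), 2))
```

In this model the gain always stays below the number of stages and can never exceed (logic + overhead) / overhead, however many stages are added.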

Dedicated Pipelines versus Multifunctional Pipelines

• The trend in current high-performance CPUs is to use different logical AND physical pipelines for different instruction classes
  • FX pipeline (integer)
  • FP pipeline (floating point)
  • L/S pipeline (Load/Store)
  • B pipeline (Branch)
• Allows more concurrency, more optimization
• Silicon area is more plentiful

Sequential Consistency

• With multiple pipelines, how do we maintain sequential consistency when instructions are finishing at different times?
  • With just two pipelines (FX and FP), we can lengthen the shorter pipeline either statically or dynamically. Dynamic lengthening would be used only when hazards are detected.
  • We can force the pipelines to write to a special unit called a Renaming Buffer or Reordering Buffer. It is the job of this unit to maintain sequential consistency. We will look at this in detail in Chapter 7 (superscalar).
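The idea of a reordering buffer can be sketched as a queue kept in program order. This is a minimal illustration, not the design described in Chapter 7: the class and method names are hypothetical, and `None` is used to mark a result that has not arrived yet.

```python
from collections import deque

class ReorderBuffer:
    """Results may arrive out of order; they retire in program order."""

    def __init__(self):
        self.entries = deque()        # [tag, value or None], program order

    def issue(self, tag):
        self.entries.append([tag, None])

    def complete(self, tag, value):
        for entry in self.entries:
            if entry[0] == tag:
                entry[1] = value
                return
        raise KeyError(tag)

    def retire(self):
        """Pop finished entries from the head only; stop at the first gap."""
        done = []
        while self.entries and self.entries[0][1] is not None:
            done.append(tuple(self.entries.popleft()))
        return done

rob = ReorderBuffer()
rob.issue("i1: fmul")           # long FP operation, issued first
rob.issue("i2: add")            # short FX operation, issued second
rob.complete("i2: add", 7)      # the add finishes first
print(rob.retire())             # nothing retires, i1 is still pending
rob.complete("i1: fmul", 6.0)
print(rob.retire())             # both retire, in program order
```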

RISC versus CISC pipelines

• Pipelines for CISC are required to handle complex memory-to-register addressing
  • mov r4, (r3, r2)4     EA is r3 + r2 + 4
  • Will have an extra stage for effective address calculation (see Figures 5.40, 5.41, 5.43)
  • Some CISC pipelines avoid a load-use delay penalty (Figures 5.54, 5.56)
• RISC pipelines have a load-use penalty of at least one cycle
• Load-use penalties when multiple pipelines are in action are instruction-sequence dependent (i.e., 1, 2, or more than 2 cycles)

Some other important Figures in Chapter 5

• Figure 5.26 (illustrates use of both clock phases for performing pipeline tasks)
• Figure 5.31, Figure 5.32 (Pentium pipeline, shows the difference between logical and physical pipelines)
• Figure 5.33, Figure 5.34 (PowerPC 604 - first look at a modern superscalar processor)


CC721

Computing Systems

Part 3: VLIW Architecture

Basic Working Principles of VLIW

• Aim at speeding up computation by exploiting instruction-level parallelism.
• Same hardware core as superscalar processors, having multiple execution units (EUs) working in parallel.
• An instruction consists of multiple operations; typical word length is from 52 bits to 1 Kbit.
• All operations in an instruction are executed in lock-step mode.
• One or multiple register files for FX and FP data.
• Rely on the compiler to find parallelism and schedule dependency-free program code.


Basic VLIW Approach

Register File Structure for VLIW

What is the challenge for the register file in VLIW? R/W ports

Differences Between VLIW & Superscalar Architecture (I)

Differences Between VLIW & Superscalar Architecture (II)

• Instruction formulation:
  • Superscalar:
    • Receive conventional instructions conceived for sequential processors.
  • VLIW:
    • Receive (very) long instruction words, each comprising a field (or opcode) for each execution unit.
    • Instruction word length depends on (a) the number of execution units, and (b) the code length needed to control each unit (such as opcode length, register names, …).
    • Typical word length is 256 to 1024 bits, much longer than conventional machine word length.
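The dependence of word length on factors (a) and (b) can be written out as a small calculation. The field sizes below are hypothetical and chosen only for illustration; real VLIW formats also carry immediates and other control bits.

```python
def vliw_word_bits(num_units, opcode_bits, register_fields, register_bits):
    """Word length = units * (opcode + register specifier bits).

    All field sizes are hypothetical inputs, not taken from a real machine.
    """
    per_unit = opcode_bits + register_fields * register_bits
    return num_units * per_unit

# 16 execution units, 8-bit opcodes, three 6-bit register names per operation.
print(vliw_word_bits(16, 8, 3, 6))
```

With these assumed sizes the word comes to 416 bits, inside the typical range quoted above.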

Differences Between VLIW & Superscalar Architecture (III)

• Instruction scheduling:
  • Superscalar:
    • Performed dynamically at run-time by the hardware.
    • Data dependency is checked and resolved in hardware.
    • Need a look-ahead hardware window for instruction fetch.

Differences Between VLIW & Superscalar Architecture (IV)

• Instruction scheduling (cont'd):
  • VLIW:
    • Static scheduling done at compile-time by the compiler.
    • Advantages:
      • Reduce hardware complexity.
      • Tasks such as decoding, data dependency detection, instruction issue, etc. become simple.
      • Potentially higher clock rate.
      • Higher degree of parallelism with global program information.

Differences Between VLIW & Superscalar Architecture (V)

• Instruction scheduling (cont'd):
  • VLIW:
    • Disadvantages:
      • Higher complexity of the compiler.
      • Compiler optimization needs to consider technology-dependent parameters such as latencies and load-use time of cache. (Question: What happens to the software if the hardware is updated?)
      • Non-deterministic problem of cache misses, resulting in worst-case assumptions for code scheduling.
      • In case of unfilled opcodes in a (V)LIW, memory space and instruction bandwidth are wasted.

Development history of Proposed/Commercial VLIWs


Case Study of VLIW: Trace 200 Family (I)

Case Study of VLIW: Trace 200 Family (II)

Only two branches might be used in Trace 7/200

Code Expansion in VLIW

• It is found that code in VLIW is expanded roughly by a factor of three.
• For "long" VLIW, more opcode fields will be empty. This results in wasted bandwidth and storage space.

Can you propose a solution for it?
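The effect of empty opcode fields can be measured on a toy example. The sketch assumes the compiler has already grouped operations that are independent of each other; the packing routine, the operation names and the slot counts are all hypothetical.

```python
def pack(operations, slots):
    """Greedy in-order packing: `operations` is a list of groups, each group
    holding operations that the compiler found independent. Every group is
    split into words of `slots` fields; unused fields are filled with "nop"."""
    words = []
    for group in operations:
        for start in range(0, len(group), slots):
            chunk = list(group[start:start + slots])
            words.append(chunk + ["nop"] * (slots - len(chunk)))
    return words

def empty_fraction(words):
    """Share of all instruction fields that hold a nop."""
    fields = sum(len(w) for w in words)
    nops = sum(w.count("nop") for w in words)
    return nops / fields

groups = [["ld", "ld", "add"], ["mul"], ["add", "st"]]
for slots in (4, 8):
    print(slots, empty_fraction(pack(groups, slots)))
```

For the same six operations, widening the word from 4 to 8 fields raises the share of empty fields, which is the waste of bandwidth and storage noted above.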