Computer Organization and Architecture
(3 Credits/SKS)
Prof. Dr. Bagio Budiardjo
Even Semester, 2010/2011
About the Course :
Course objectives: After completing this course, students are expected to understand and be able to analyze computer architecture, in particular instruction-set design (e.g. addressing modes) and its influence on performance. Students are also expected to understand the meaning of computer organization, that is, the interconnection of the computer subsystems (CPU, memory, bus and I/O) in a computing system, and to understand a more advanced technique in processor design: pipelining.
Key words: architecture, instruction-set design, computer organization, performance, processor design and pipelining techniques
About the grading scheme :
• This part is not rigid; grades will combine homework, quizzes, exercises, the mid-test and the final test, whenever possible.
• One possible scheme:
Homework : 15% (4)
Mid test : 40%
Final test : 45%
• Grading the homework: maximum 5 points each, with three levels: Good (5), OK (3) and Bad (2).
The books and supporting materials :
• William Stallings’s book Computer Organization and Architecture, Seventh Edition (Prentice Hall, 2006) will be used as the main reference for this lecture. A new edition of this book was issued in 2010 but is still unavailable in Jakarta.
• The classic book Logic and Computer Design Fundamentals, by M. Morris Mano and Charles Kime (Pearson Asia, 2004), is good but puts too much stress on digital logic. We use material from this book to explain the hardware design of computer components whenever possible.
• Chapters covered: 1, 2, 3, 4, 5, 10, 11 and 13 (Stallings). Additional material about pipelining is taken from another book.
Books and supporting materials - continued
• There will be no handouts (unless something is very important).
• Lecture notes are distributed via memory stick/CD; the SAP can be downloaded from SIAK-NG.
• Students are encouraged to read books/papers in this field of study.
Schedule of class:
• At the scheduled time and place (K-102), for about 120 minutes
• Lectures are given mainly using an LCD projector
About the “course direction”
Why do we study Computer Architecture?
History:
A course under this name was taught in many universities long before microprocessors existed. Years ago, people studied mainframe architectures: IBM S/370, CDC Cyber, CRAY, Amdahl, etc. Since microprocessors emerged, the course has changed slightly to cope with more advanced topics: computer design and performance issues.
About the “course direction”
[Diagram: how this course, Computer Organization & Architecture (OAK), relates to neighboring areas]
• Microprocessors: applications of the µprocessor
• Embedded Systems: embedding µprocessor-based intelligence into new systems/devices
• Processor Architecture & Design: analyzing and implementing computer systems to achieve the best processing speed and cost effectiveness; analyzing processor designs with emphasis on how to obtain better processing speed (cost effectiveness)
• Parallel & Distributed Computing Systems: organizing processors/computing systems to obtain better speed-up with different processing paradigms
About the “course direction” - continued
This course is aimed at:
1. Explaining the phenomena of computer architecture and computer design; knowing the basic instruction cycle and its implications for processing speed
2. Studying the "key" problems:
   a. the CPU-memory bottleneck
   b. CPU-I/O device problems
3. Studying how "performance" can be improved (example: CPU-memory: cache memory)
4. Asking how we can improve execution speed with other techniques (example: pipelining)
Reasons for studying Computer Architecture (Stallings’s arguments)
• To be able to select a "proper" computer system for a particular environment (cost and effectiveness)
• To be able to analyze a processor "embedded" in an environment, e.g. the use of a processor in an automobile, and to use proper tools for that analysis
• To be able to choose proper software for a particular computer system
View of a Computer System
– Processor Organization : Another view
[Diagram: processor organization. Latches ALU1 and ALU2 feed the ADDER, whose result goes to ALU3; registers R1-R3, the Control Unit, PC, MBR, MAR (to/from memory) and IR all connect to a single internal BUS. Issues: clock speed, gating signals.]
FPU : Floating Point Unit
MMU : Memory Management Unit
CPU : Central Processing Unit
The CPU, FPU, MMU and cache memory are implemented in a single chip.
Frequently Asked Questions
• What is the role of the CPU clock?
• What is the difference between a P IV/2.4G and a P IV/3.0G? (CPU clock speeds of 2.4 and 3.0 GHz)
Consider a CPU instruction:
AR R1, R2 (add register: add the contents of R1 and R2, placing the result in R1)
– Execution steps of AR R1, R2
The "possible" micro-execution steps are:
a. ALU1 ← [R1] {content of R1 is moved to ALU1}
b. ALU2 ← [R2] {content of R2 is moved to ALU2}
c. ADD {ALU3 ← ALU1 + ALU2}
d. R1 ← [ALU3] {result of the addition is moved to R1}
If each micro-step executes in "one" clock cycle,
then this AR instruction needs 4 clock cycles.
For the time being, we ignore the fetch cycle.
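The four micro-steps above can be sketched as a toy simulation. The register and latch names follow the figure, and the one-transfer-per-clock rule is the slide's assumption; the code itself is illustrative, not from the lecture:

```python
# Toy single-bus CPU: each bus transfer consumes one clock cycle.
regs = {"R1": 7, "R2": 5}
alu = {"ALU1": 0, "ALU2": 0, "ALU3": 0}

clock = 0

def transfer(dst_bank, dst, src_bank, src):
    """Move one value over the bus; costs one clock cycle."""
    global clock
    dst_bank[dst] = src_bank[src]
    clock += 1

# AR R1, R2  (R1 <- R1 + R2), ignoring the fetch cycle
transfer(alu, "ALU1", regs, "R1")        # a. ALU1 <- [R1]
transfer(alu, "ALU2", regs, "R2")        # b. ALU2 <- [R2]
alu["ALU3"] = alu["ALU1"] + alu["ALU2"]  # c. ADD (one cycle, no bus)
clock += 1
transfer(regs, "R1", alu, "ALU3")        # d. R1 <- [ALU3]

print(regs["R1"], clock)  # 12 4
```

Running it confirms the count: four micro-steps, four clock cycles.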
– Processor Organization – continued.1
[Diagram: the single-bus datapath during step a, ALU1 ← [R1]; greyed lines mark inactive paths/units ("jalur/unit tidak aktif").]
ADD R1, R2: a. ALU1 ← [R1]  b. ALU2 ← [R2]  c. ADD  d. R1 ← [ALU3]
– Processor Organization – continued.2
[Diagram: the single-bus datapath during step b, ALU2 ← [R2]; greyed lines mark inactive paths/components ("jalur/komponen tdk aktif").]
ADD R1, R2: a. ALU1 ← [R1]  b. ALU2 ← [R2]  c. ADD  d. R1 ← [ALU3]
– Processor Organization – continued.3
[Diagram: the single-bus datapath during step c, ADD; greyed lines mark inactive paths/components.]
ADD R1, R2: a. ALU1 ← [R1]  b. ALU2 ← [R2]  c. ADD  d. R1 ← [ALU3]
– Processor Organization – continued.4
[Diagram: the single-bus datapath during step d, R1 ← [ALU3]; greyed lines mark inactive paths/components.]
ADD R1, R2: a. ALU1 ← [R1]  b. ALU2 ← [R2]  c. ADD  d. R1 ← [ALU3]
– Processor Organization – Microprogram
[Diagram: the single-bus datapath with control gates. Gates A1-A3 connect R1-R3 to the BUS, gates B1-B3 connect ALU1-ALU3 to the BUS, and ADD fires the adder; the Control Unit, PC, MBR, MAR and IR are as before.]

Microprogram for ADD R1, R2 (1 = gate open, 0 = gate closed):

Step   A1 A2 A3 B1 B2 B3 ADD
a       1  0  0  1  0  0  0     (ALU1 ← [R1])
b       0  1  0  0  1  0  0     (ALU2 ← [R2])
c       0  0  0  0  0  0  1     (ADD)
d       1  0  0  0  0  1  0     (R1 ← [ALU3])
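A microprogram like this can be executed mechanically, one control word per clock. The sketch below reads the gates as A1-A3 connecting R1-R3 to the bus and B1-B3 connecting ALU1-ALU3 to it (an interpretation of the figure; the bidirectional-gate behaviour is my assumption):

```python
# Execute the ADD R1,R2 microprogram as a table of gate-control words.
# Assumed gate map: A1..A3 gate R1..R3 onto/off the bus, B1..B3 gate
# ALU1..ALU3 onto/off the bus, ADD fires the adder.
GATES = ["A1", "A2", "A3", "B1", "B2", "B3", "ADD"]

microprogram = [            # A1 A2 A3 B1 B2 B3 ADD
    (1, 0, 0, 1, 0, 0, 0),  # a. ALU1 <- [R1]
    (0, 1, 0, 0, 1, 0, 0),  # b. ALU2 <- [R2]
    (0, 0, 0, 0, 0, 0, 1),  # c. ADD
    (1, 0, 0, 0, 0, 1, 0),  # d. R1 <- [ALU3]
]

def run(mp, regs):
    """One control word per clock. In steps a/b a register drives the
    bus and an ALU latch reads it; in step d ALU3 drives the bus and
    a register reads it (gates are bidirectional in this sketch)."""
    alu = [0, 0, 0]                       # ALU1..ALU3
    for word in mp:
        ctl = dict(zip(GATES, word))
        if ctl["ADD"]:
            alu[2] = alu[0] + alu[1]      # ALU3 <- ALU1 + ALU2
            continue
        if ctl["B3"]:                     # ALU3 drives the bus
            bus = alu[2]
            for i in (1, 2, 3):
                if ctl[f"A{i}"]:
                    regs[f"R{i}"] = bus
        else:                             # a register drives the bus
            bus = next(regs[f"R{i}"] for i in (1, 2, 3) if ctl[f"A{i}"])
            if ctl["B1"]: alu[0] = bus
            if ctl["B2"]: alu[1] = bus
    return regs

regs = run(microprogram, {"R1": 7, "R2": 5, "R3": 0})
print(regs)  # {'R1': 12, 'R2': 5, 'R3': 0}
```

The control unit of a real microprogrammed processor does essentially this: step through control words and open the gates each word names.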
Analysis of Instruction Cycle
• With single bus, it is slow, since in each “clock” only one transfer could be executed
• Is there any other way to “improve” the speed?• Dual bus processor may be faster• Additional processor cost
Dual processor-bus : A way to improve speed
[Diagram: the datapath with a DUAL BUS (buses 1 and 2); the other components (Control Unit, IR, PC, MAR, MBR) are unchanged.]

1. ALU1 ← [R1] (bus 1); ALU2 ← [R2] (bus 2)
2. ADD
3. R1 ← [ALU3] (bus 1)
Only 3 clock cycles needed: 25% fewer cycles.

How about this:
1. ALU1 ← [R1] (bus 1); ALU2 ← [R2] (bus 2); ADD
2. R1 ← [ALU3] (bus 1)
Only 2 clock cycles needed: 50% fewer cycles.
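The cycle counts above can be checked with a small scheduler sketch. The rules (one bus per transfer per clock; ADD uses only the adder, so it can share a cycle with the transfers that deliver its operands) follow the slides, while the code and operation names are illustrative:

```python
# Count clock cycles when micro-operations are grouped per cycle.
# Each data transfer needs one bus for one cycle; "ADD" needs no bus.
def cycles(schedule, n_buses):
    for step, group in enumerate(schedule, 1):
        transfers = [op for op in group if op != "ADD"]
        if len(transfers) > n_buses:
            raise ValueError(f"step {step} needs more than {n_buses} buses")
    return len(schedule)

single_bus = [["ALU1<-R1"], ["ALU2<-R2"], ["ADD"], ["R1<-ALU3"]]
dual_bus   = [["ALU1<-R1", "ALU2<-R2"], ["ADD"], ["R1<-ALU3"]]
dual_bus2  = [["ALU1<-R1", "ALU2<-R2", "ADD"], ["R1<-ALU3"]]

print(cycles(single_bus, 1), cycles(dual_bus, 2), cycles(dual_bus2, 2))
# 4 3 2
```

Grouping the two operand transfers onto separate buses saves one cycle; folding ADD into the same cycle saves another, which is exactly the 4 → 3 → 2 progression above.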
Dual processor-bus : Microprogram level representation
[Diagram: the dual-bus datapath at microprogram level, with gates A1-A6 and B1-B6 controlling the register and ALU connections to buses 1 and 2; other components (Control Unit, IR, PC, MAR, MBR) as before.]

How do we create the microprogram for the instruction SUB R3, R2?
Microprogram for SUB R3, R2 on the dual-bus processor

1. Assuming the subtraction and the transfer back of its result are done in separate clocks:

Step   A1 A2 A3 A4 A5 A6 B1 B2 B3 B4 B5 B6 SUB
a       0  0  1  0  1  0  1  0  1  0  0  0   0
b       0  0  0  0  0  0  0  0  0  0  0  0   1
c       0  0  0  0  0  1  0  0  0  0  0  1   0

2. Assuming the subtraction and the transfer back of its result are done in the same clock:

Step   A1 A2 A3 A4 A5 A6 B1 B2 B3 B4 B5 B6 SUB
a       0  0  1  0  1  0  1  0  1  0  0  0   0
b       0  0  0  0  0  1  0  0  0  0  0  1   1
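Whatever the exact A1-A6/B1-B6 gate assignments in the figure, a valid control word must never ask two sources to drive the same bus in one clock. That property can be checked mechanically; in this sketch the (source, bus) pairs are my illustrative rendering of SUB R3, R2 (variant 2), not the lecture's actual gate map:

```python
# Each step lists (source, bus) pairs: which unit drives which bus.
# A step is legal only if every bus has at most one driver.
def check(schedule):
    for step, transfers in enumerate(schedule, 1):
        buses = [bus for _source, bus in transfers]
        if len(buses) != len(set(buses)):
            raise ValueError(f"bus conflict in step {step}")
    return len(schedule)  # number of clock cycles

# SUB R3, R2, variant 2: subtract and write back in the same clock.
sub_v2 = [
    [("R3", "bus1"), ("R2", "bus2")],  # a. ALU1<-[R3], ALU2<-[R2], SUB
    [("ALU3", "bus1")],                # b. R3 <- [ALU3]
]
print(check(sub_v2))  # 2
```

With two buses, the two operand reads can share a cycle because they drive different buses; putting both on bus 1 would raise the conflict error.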
Triple processor-bus: Can the processing speed be improved?

[Diagram: the datapath with a TRIPLE BUS (buses 1, 2 and 3); please notice the direction of the arrows. Other components (Control Unit, IR, PC, MAR, MBR) as before.]

Exercise: If all the CPU components (registers, ALUs and adder) could do their work (transferring bits, adding numbers) in one third (1/3) of a clock cycle, how many clocks would be needed to complete an addition operation (ADD R1, R2)? Write down the register transfers and the microprogram for your register-transfer language.
Program Execution
• A scientific program written in assembly language runs on a microprocessor with a 1 GHz clock. To complete, the program needs to execute:
a. 150,000 arithmetic instructions (e.g. ADD R1,R2; MUL R1,R3; etc.)
b. 250,000 register-transfer instructions (e.g. MOV R1,R2; etc.)
c. 100,000 memory-access instructions (e.g. LOAD R1,X; STORE R2,Y; etc.)
If, on average, an arithmetic instruction needs 2 clocks to complete, a register-transfer instruction needs 1 clock and a memory-access instruction needs 10 clocks, calculate the average CPI (clocks per instruction) of the program. How long does the program take to complete (in seconds)?
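One way to work the exercise, with the instruction counts and per-class clock costs taken directly from the problem statement (variable names are mine):

```python
# Worked CPI solution for the exercise above, 1 GHz clock.
counts = {"arith": 150_000, "reg": 250_000, "mem": 100_000}
clocks = {"arith": 2, "reg": 1, "mem": 10}

total_instr = sum(counts.values())                         # 500,000
total_clocks = sum(counts[k] * clocks[k] for k in counts)  # 1,550,000
cpi = total_clocks / total_instr
seconds = total_clocks / 1e9    # 1 GHz = 1e9 clocks per second

print(cpi, seconds)  # 3.1 0.00155
```

The average CPI is a count-weighted mean, 1,550,000 / 500,000 = 3.1 clocks per instruction, and the program finishes in 1.55 ms.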
Can it be one clock? Yes it can!
Views of other books on "micro-operations":
• The bus is called the "data path"
• It consists not only of a bus (a bunch of wires) but of other digital devices as well
• Enable signals are used to speed up execution
• At additional (processor) cost
• Four parallel-load registers
• Two mux-based register selectors
• Register destination decoder
• Mux B for external constant input
• Buses A and B with external address and data outputs
• ALU and Shifter with Mux F for output select
• Mux D for external data input
• Logic for generating status bits V, C, N, Z
[Figure: the example datapath. A register file (R0-R3) drives Bus A and Bus B through the A select and B select muxes; MUX B can substitute a constant input onto Bus B, which also feeds Address Out and Data Out. The function unit (ALU and shifter, chosen by MF select through MUX F) produces F; MUX D (MD select) chooses between F and external Data In for Bus D, which is written back to the register file through the destination decoder under Load enable. The function unit also generates the status bits V, C, N and Z.]
Datapath Example: taken from M. Morris Mano’s book
Microoperation: R0 ← R1 + R2
Datapath Example: Performing a Microoperation
• Apply 01 to A select to place the contents of R1 onto Bus A
• Apply 10 to B select to place the contents of R2 onto B data, and apply 0 to MB select to place B data onto Bus B
• Apply 0010 to G select to perform the addition G = Bus A + Bus B
• Apply 0 to MF select and 0 to MD select to place the value of G onto Bus D
• Apply 00 to Destination select to select the Load input of R0
• Apply 1 to Load enable to force the Load input of R0 to 1, so that R0 is loaded on the clock pulse (not shown)
The overall microoperation requires 1 clock cycle (!)
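The signal settings listed above form a single control word. The sketch below encodes that word and interprets it against a 4-register file; the field names follow Mano's datapath, while the Python rendering and the register contents are mine:

```python
# Control word for R0 <- R1 + R2, using the signal values listed above.
control = {
    "A_select": 0b01,    # R1 onto Bus A
    "B_select": 0b10,    # R2 onto B data
    "MB_select": 0,      # Bus B <- B data (not the constant)
    "G_select": 0b0010,  # ALU function: G = A + B
    "MF_select": 0,      # MUX F passes G (not the shifter)
    "MD_select": 0,      # Bus D <- F (not external Data In)
    "Dest_select": 0b00, # destination decoder points at R0
    "Load_enable": 1,    # latch Bus D into R0 on the clock edge
}

# Minimal interpretation of the word against a 4-register file.
regs = [0, 7, 5, 0]                # R0..R3 (example contents)
bus_a = regs[control["A_select"]]  # R1 -> Bus A
bus_b = regs[control["B_select"]]  # R2 -> B data -> Bus B (MB_select=0)
assert control["G_select"] == 0b0010   # addition in this encoding
bus_d = bus_a + bus_b              # G through MUX F and MUX D to Bus D
if control["Load_enable"]:
    regs[control["Dest_select"]] = bus_d

print(regs)  # [12, 7, 5, 0]
```

Because all the muxes and the ALU are combinational, every field takes effect within the same clock period, which is why the whole microoperation fits in one cycle.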
Lesson Learned
• We can improve instruction execution speed by increasing the processor clock speed (can we?)
• We can improve instruction execution speed by implementing a dual bus (can we?)
• We can (partly) overcome the CPU-memory bottleneck by inserting a cache memory between the CPU and main memory (can we?)
• Is there any other way to improve instruction execution speed (increase performance)? Pipelining
• Do these improvements need extra cost? (the cost vs. performance issue)
What do we get after studying Computer Architecture?
• It is always a complicated question to answer.
• Basically we learn about processor design issues, namely the hardware of a computer, but taught through "software" logic.
• At least we learn the basic building blocks of a computer.
• We learn the design and development trends.
Question: How do we fetch the instruction (from memory)?
• The procedure that brings an instruction from memory to the CPU (IR) is called the instruction fetch
• The PC always holds the address of the (next) instruction in memory
• The PC transfers the address to the MAR, and memory is READ
• The PC is usually incremented by 1 (to point to the next instruction)
• The instruction is placed by memory in the MBR
• The content of the MBR is transferred to the IR
(the instruction is fetched, ready to be executed)
Question: How do we fetch the instruction (from memory)? - continued
• In register-transfer language, we can express the fetch cycle as:
1. MAR ← [PC]
2. READ (memory) and wait for completion
3. IR ← [MBR]
In terms of CPU clocks, these steps may take up to 50 clocks, depending on the memory clock speed.
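The three fetch steps can be sketched over a toy memory. The register names follow the slides; the addresses, memory contents and PC-increment placement are illustrative assumptions:

```python
# Fetch cycle sketch: MAR <- [PC]; READ; IR <- [MBR].
memory = {100: "AR R1,R2", 101: "MOV R2,R3"}   # illustrative contents

cpu = {"PC": 100, "MAR": 0, "MBR": None, "IR": None}

cpu["MAR"] = cpu["PC"]            # 1. MAR <- [PC]
cpu["MBR"] = memory[cpu["MAR"]]   # 2. READ memory, wait for completion
cpu["PC"] += 1                    #    PC now points at the next instruction
cpu["IR"] = cpu["MBR"]            # 3. IR <- [MBR]

print(cpu["IR"], cpu["PC"])  # AR R1,R2 101
```

After step 3 the instruction sits in the IR ready to execute, and the PC already addresses the next instruction, so the cycle can repeat.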
What is our topic? Instruction Set Architecture (ISA)

[Diagram: a layer stack with the Application Program on top, then the Compiler and OS, then the ISA, and below it CPU Design, Circuit Design and Chip Layout.]
Chapter 1 : Introduction
1. 1. Introduction : Organization & Architecture
• Organization and architecture: two terms that are often confused
• Computer organization refers to the operational units and their interconnections that realize the architectural specification (!)
• Computer architecture refers to those attributes of a system visible to a programmer or, put another way, those attributes that have a direct impact on the logical execution of a program (!)
• The latter definition (architecture) is more concerned with performance than the first (organization)
1. 1. Introduction - continued
• Architecture is more concerned with the basic instruction design that may lead to better system performance
• Organization is the implementation of a computer system, in terms of the interconnection of its functional units: CPU, memory, bus and I/O devices
• Example: the IBM S/370 family architecture. There are plenty of IBM products having the same architecture (S/370) but different organizations, depending on their price/performance measures. Cost and performance differentiate the organizations
• So, the organization of a computer is the implementation of its architecture, tailored to fit the intended price and performance measures
Chapter 2 :
Computer Evolution and Performance
ENIAC - background
• Electronic Numerical Integrator And Computer
• Eckert and Mauchly
• University of Pennsylvania
• Trajectory tables for weapons
• Started 1943
• Finished 1946 - too late for the war effort
• Used until 1955
ENIAC - details
• Decimal (not binary)
• 20 accumulators of 10 digits
• Programmed manually by switches
• 18,000 vacuum tubes
• 30 tons
• 15,000 square feet
• 140 kW power consumption
• 5,000 additions per second
ENIAC
Another View of ENIAC
Structure of von Neumann machine
IAS - details
• 1,000 x 40-bit words
– binary numbers
– 2 x 20-bit instructions
• Set of registers (storage in the CPU)
– Memory Buffer Register
– Memory Address Register
– Instruction Register
– Instruction Buffer Register
– Program Counter
– Accumulator
– Multiplier Quotient
2. 1.Evolution and Performance - history
• 1946: von Neumann and his group proposed the IAS machine (at the Institute for Advanced Studies)
• The design included:
– main memory
– ALU
– Control Unit
– I/O
• The first stored-program computer, able to perform +, −, ×, ÷
• The "father" of all modern computers/processors
Structure of IAS
IAS
2. 1. Evolution and Performance -history
The IAS components are:
• MBR (memory buffer register), MAR (memory address register), IR (instruction register), IBR (instruction buffer register), PC (program counter), AC (accumulator) and MQ (multiplier quotient), plus memory (1,000 locations)
• 20-bit instructions: 8-bit opcode and 12-bit address (addressing one of the 1,000 memory locations, 0 to 999)
• 40-bit data words (1 sign bit plus 39 value bits)
• Operations: data transfer between registers and ALU, unconditional branch, conditional branch, arithmetic, address modification
2.1. Evolution - History of Commercial computers
• First generation: in 1950 Mauchly & Eckert developed UNIVAC I, used by the Census Bureau
• Then appeared UNIVAC II, which later grew into the UNIVAC 1100 series (1103, 1104, 1105, 1106, 1108) - vacuum tubes and, later, transistors
• Second generation: transistors; the IBM 7094 (although NCR, RCA and others tried their own versions, commercially unsuccessful)
• Third generation: integrated circuits (ICs) - SSI. The IBM S/360 was the successful example
• Later generations (arguably fourth and fifth): LSI and VLSI technology
2.1. Evolution - history of commercial computers
Table 2.1
--------------------------------------------------------------------------
Generation    Time        Technology      Approx. speed (opr/sec)
--------------------------------------------------------------------------
1             1946-57     Vacuum tube              40,000
2             1958-64     Transistor              200,000
3             1965-71     SSI & MSI             1,000,000
4             1972-77     LSI                  10,000,000
5             1978-       VLSI                100,000,000
--------------------------------------------------------------------------
Vacuum Tubes
Transistor
2.1. Evolution - System 360 Family
Characteristic                 Model 30  Model 40  Model 50  Model 65  Model 75
------------------------------------------------------------------------------------------
Max memory size (bytes)            64K      256K      256K      512K      512K
Memory data rate (MB/s)            0.5       0.8       2.0       8.0      16.0
Processor cycle time (µs)          1.0     0.625       0.5      0.25       0.2
Relative speed                       1       3.5        10        21        50
Max number of data channels          3         3         4         6         6
Max channel data rate (KB/s)       250       400       800      1250      1250
------------------------------------------------------------------------------------------
• The family architecture gave rise to the terms "upward compatible" and "downward compatible"
Generations of Computer
• Vacuum tube - 1946-1957
• Transistor - 1958-1964
• Small-scale integration - 1965 on
– up to 100 devices on a chip
• Medium-scale integration - to 1971
– 100-3,000 devices on a chip
• Large-scale integration - 1971-1977
– 3,000-100,000 devices on a chip
• Very-large-scale integration - 1978 to date
– 100,000-100,000,000 devices on a chip
• Ultra-large-scale integration
– over 100,000,000 devices on a chip
Moore’s Law
• Increased density of components on a chip
• Gordon Moore - cofounder of Intel
• The number of transistors on a chip will double every year
• Since the 1970s development has slowed a little
– the number of transistors doubles every 18 months
• The cost of a chip has remained almost unchanged
• Higher packing density means shorter electrical paths, giving higher performance
• Smaller size gives increased flexibility
• Reduced power and cooling requirements
• Fewer interconnections increase reliability
Moore’s Law
Growth in CPU Transistor Count
Growth in CPU Transistor Count
IBM 360 series
• 1964
• Replaced (& not compatible with) 7000 series
• First planned “family” of computers
– Similar or identical instruction sets
– Similar or identical O/S
– Increasing speed
– Increasing number of I/O ports (i.e. more terminals)
– Increased memory size
– Increased cost
• Multiplexed switch structure
2.1. Evolution - Later generations
• Semiconductor memories: 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16 Mbit on a single chip; at present, 256 Mbit to 512 Mbit per chip
• Microprocessors appeared: Intel 4004 (1971), Intel 8008 (1972), Intel 8080 (8-bit, 1974), 8086 (16-bit, 1978), 80386 (32-bit, 1985) and onward
• At almost the same time: Motorola 6800 (8-bit), 68000 (16-bit), 68010 (16-bit), 68020 (32-bit), 68030/40 (32-bit)
• Then Motorola's products disappeared commercially
• Intel products have dominated the market since the appearance of the IBM PC
2.1. Evolution of Microprocessors
Table 2.2
------------------------------------------------------------------------------------------
Feature                    8008     8080     8086     80386     80486
------------------------------------------------------------------------------------------
Year introduced            1972     1974     1978     1985      1989
# of instructions            66      111      133      154       235
Address bus width             8       16       20       32        32
Data bus width                8        8       16       32        32
# of registers                8        8       16        8         8
Memory addressability      16KB     64KB     1 MB     4 GB      4 GB
Bus bandwidth (MB/s)          -      0.75        5       32        32
Reg-reg add time (µs)         -       1.3      0.3    0.125      0.06
------------------------------------------------------------------------------------------
2.2 Designing for Performance
• The price of processors continues to drop every year
• $1000 buys an advanced system at today's prices, in which you may find more than 100 million transistors!
• Even 100 million pieces of toilet paper cost more!!
• Computing power is practically free!!
• People solve problems never thought possible before: image processing, speech recognition, videoconferencing, multimedia authoring, etc.
• We need more and more computing power
• The organization and architecture of today's processors remain (basically) the same as those of the IAS!
• The algorithms to improve speed and efficiency differ!
2.2. Designing - processor speed
• Intel Pentium and PowerPC follow Moore's Law: by shrinking the size of lines in IC chips by 10%, industry can produce new ICs with 4 times the transistor density every 3 years!
• The same holds for DRAM (Dynamic Random Access Memory)
• Even if capacity increases, speed does not increase automatically
• More work is needed in designing instructions
• Techniques for faster instruction execution must also be developed: branch prediction, data-flow analysis and speculative execution
Pentium Evolution (1)
• 8080
– first general-purpose microprocessor
– 8-bit data path
– used in the first personal computer, the Altair
• 8086
– much more powerful
– 16-bit
– instruction cache, prefetching a few instructions
– the 8088 (8-bit external bus) was used in the first IBM PC
• 80286
– 16 MB of memory addressable
– up from 1 MB
• 80386
– 32-bit
– support for multitasking
Pentium Evolution (2)
• 80486
– sophisticated, powerful cache and instruction pipelining
– built-in maths co-processor
• Pentium
– superscalar
– multiple instructions executed in parallel
• Pentium Pro
– increased superscalar organization
– aggressive register renaming
– branch prediction
– data-flow analysis
– speculative execution
Pentium Evolution (3)
• Pentium II
– MMX technology
– graphics, video & audio processing
• Pentium III
– additional floating-point instructions for 3D graphics
• Pentium 4
– note Arabic rather than Roman numerals
– further floating-point and multimedia enhancements
• Itanium
– 64-bit
– see Chapter 15
• See the Intel web pages for detailed information on processors
Intel Microprocessor Performance
Summary: Important Points
• Organization and architecture
• Family architectures
• Functions of a computer (data processing, control, data movement)
• Birth of computers (ENIAC - decimal, IAS - binary), Mauchly-Eckert
• Microprocessors (Intel 4004, 8008, 8080, 8086/16-bit, 80386/32-bit)
• IAS instructions
• The von Neumann bottleneck
• Increasing clock speed, making the bus wider, cache memory
• Losers: e.g. the Motorola microprocessors, Radio Shack
• Denser transistors on a single chip (4 times every 3 years, by shrinking lines by 10%)