Computer Organization and Architecture
(3 Credits/SKS)
Prof. Dr. Bagio Budiardjo
Even Semester, 2010/2011
About the Course :
Course objectives: After completing this course, students are expected to understand and be able to analyze computer architecture, in particular instruction-set design (e.g. addressing modes) and its influence on performance. Students are also expected to understand the meaning of computer organization, that is, the interconnection of the computer subsystems (CPU, memory, bus and I/O) in a computing system, and to understand a more advanced technique in processor design: pipelining.
Key words: architecture, instruction-set design, computer organization, performance, processor design and pipelining techniques
About the grading scheme :
• This part is not rigid; grades will combine homework, quizzes, exercises, the mid-test and the final test, whenever possible.
• One possible scheme:
Homework : 15% (4)
Mid test : 40%
Final test : 45%
• Grading the homework: maximum 5 points each, with three levels: Good (5), OK (3) and Bad (2).
The books and supporting materials :
• William Stallings’s book Computer Organization and Architecture, Seventh Edition (Prentice Hall, 2006) will be used as the main reference for this lecture. A new edition of this book was issued in 2010 but is still unavailable in Jakarta.
• The classic book Logic and Computer Design Fundamentals, by M. Morris Mano and Charles Kime (Pearson Asia, 2004), is good but puts too much stress on digital logic. We use material from this book to explain the hardware design of computer components whenever possible.
• Chapters covered: 1, 2, 3, 4, 5, 10, 11 and 13 (Stallings). Additional material about pipelining is taken from another book.
Books and supporting materials - continued
• There will be no handouts (unless something is very important).
• Lecture notes are distributed via memory stick/CD; the SAP can be downloaded from SIAK-NG.
• Students are encouraged to read books/papers in this field of study.
Schedule of class:
• At the scheduled time and place (K-102), for about 120 minutes
• Lectures are given mainly using an LCD projector
About the “course direction”
Why do we study Computer Architecture?
History:
A course under this name was taught in many universities long before microprocessors existed. Years ago, people studied mainframe architectures: IBM S/370, CDC Cyber, CRAY, Amdahl, etc. Since microprocessors emerged, the course has changed slightly to cope with more advanced topics: computer design and performance issues.
About the “course direction”
[Diagram: how this course, Computer Organization & Architecture (OAK), relates to neighboring areas]
• Microprocessors: applications of the µprocessor
• Embedded Systems: embedding µprocessor-based intelligence into new systems/devices
• Processor Architecture & Design: analyzing and implementing computer systems to achieve the best processing speed and cost effectiveness; analyzing processor designs with emphasis on how to obtain better processing speed (cost effectiveness)
• Parallel & Distributed Computing Systems: organizing processors/computing systems to obtain better speed-up with different processing paradigms
About the “course direction” - continued
This course is aimed at:
1. Explaining the phenomena of computer architecture and computer design; knowing the basic instruction cycle and its implications for processing speed
2. Studying the "key" problems:
   a. the CPU-memory bottleneck
   b. CPU-I/O device problems
3. Studying how "performance" can be improved (example: CPU-memory: cache memory)
4. Asking how we can improve execution speed with other techniques (example: pipelining)
Reasons for studying Computer Architecture (Stallings’s arguments)
• To be able to select a "proper" computer system for a particular environment (cost and effectiveness)
• To be able to analyze a processor "embedded" in an environment, e.g. the use of a processor in an automobile, and to use proper tools for that analysis
• To be able to choose proper software for a particular computer system
View of a Computer System
– Processor Organization : Another view
[Diagram: processor organization. Latches ALU1 and ALU2 feed the ADDER, whose result goes to ALU3; registers R1-R3, the Control Unit, PC, MBR, MAR (to/from memory) and IR all connect to a single internal BUS. Issues: clock speed, gating signals.]
FPU : Floating Point Unit
MMU : Memory Management Unit
CPU : Central Processing Unit
The CPU, FPU, MMU and cache memory are implemented in a single chip.
Frequently Asked Questions
• What is the role of the CPU clock?
• What is the difference between a P IV/2.4G and a P IV/3.0G? (CPU clock speeds of 2.4 and 3.0 GHz)
Consider a CPU instruction:
AR R1, R2 (add register: add the contents of R1 and R2, placing the result in R1)
– Execution steps of AR R1, R2
The "possible" micro-execution steps are:
a. ALU1 ← [R1] {content of R1 is moved to ALU1}
b. ALU2 ← [R2] {content of R2 is moved to ALU2}
c. ADD {ALU3 ← ALU1 + ALU2}
d. R1 ← [ALU3] {result of the addition is moved to R1}
If each micro-step executes in "one" clock cycle,
then this AR instruction needs 4 clock cycles.
For the time being, we ignore the fetch cycle.
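The four micro-steps above can be sketched as a toy simulation. The register and latch names follow the figure, and the one-transfer-per-clock rule is the slide's assumption; the code itself is illustrative, not from the lecture:

```python
# Toy single-bus CPU: each bus transfer consumes one clock cycle.
regs = {"R1": 7, "R2": 5}
alu = {"ALU1": 0, "ALU2": 0, "ALU3": 0}

clock = 0

def transfer(dst_bank, dst, src_bank, src):
    """Move one value over the bus; costs one clock cycle."""
    global clock
    dst_bank[dst] = src_bank[src]
    clock += 1

# AR R1, R2  (R1 <- R1 + R2), ignoring the fetch cycle
transfer(alu, "ALU1", regs, "R1")        # a. ALU1 <- [R1]
transfer(alu, "ALU2", regs, "R2")        # b. ALU2 <- [R2]
alu["ALU3"] = alu["ALU1"] + alu["ALU2"]  # c. ADD (one cycle, no bus)
clock += 1
transfer(regs, "R1", alu, "ALU3")        # d. R1 <- [ALU3]

print(regs["R1"], clock)  # 12 4
```

Running it confirms the count: four micro-steps, four clock cycles.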
– Processor Organization – continued.1
[Diagram: the single-bus datapath during step a, ALU1 ← [R1]; greyed lines mark inactive paths/units ("jalur/unit tidak aktif").]
ADD R1, R2: a. ALU1 ← [R1]  b. ALU2 ← [R2]  c. ADD  d. R1 ← [ALU3]
– Processor Organization – continued.2
[Diagram: the single-bus datapath during step b, ALU2 ← [R2]; greyed lines mark inactive paths/components ("jalur/komponen tdk aktif").]
ADD R1, R2: a. ALU1 ← [R1]  b. ALU2 ← [R2]  c. ADD  d. R1 ← [ALU3]
– Processor Organization – continued.3
[Diagram: the single-bus datapath during step c, ADD; greyed lines mark inactive paths/components.]
ADD R1, R2: a. ALU1 ← [R1]  b. ALU2 ← [R2]  c. ADD  d. R1 ← [ALU3]
– Processor Organization – continued.4
[Diagram: the single-bus datapath during step d, R1 ← [ALU3]; greyed lines mark inactive paths/components.]
ADD R1, R2: a. ALU1 ← [R1]  b. ALU2 ← [R2]  c. ADD  d. R1 ← [ALU3]
– Processor Organization – Microprogram
[Diagram: the single-bus datapath with control gates. Gates A1-A3 connect R1-R3 to the BUS, gates B1-B3 connect ALU1-ALU3 to the BUS, and ADD fires the adder; the Control Unit, PC, MBR, MAR and IR are as before.]

Microprogram for ADD R1, R2 (1 = gate open, 0 = gate closed):

Step   A1 A2 A3 B1 B2 B3 ADD
a       1  0  0  1  0  0  0     (ALU1 ← [R1])
b       0  1  0  0  1  0  0     (ALU2 ← [R2])
c       0  0  0  0  0  0  1     (ADD)
d       1  0  0  0  0  1  0     (R1 ← [ALU3])
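A microprogram like this can be executed mechanically, one control word per clock. The sketch below reads the gates as A1-A3 connecting R1-R3 to the bus and B1-B3 connecting ALU1-ALU3 to it (an interpretation of the figure; the bidirectional-gate behaviour is my assumption):

```python
# Execute the ADD R1,R2 microprogram as a table of gate-control words.
# Assumed gate map: A1..A3 gate R1..R3 onto/off the bus, B1..B3 gate
# ALU1..ALU3 onto/off the bus, ADD fires the adder.
GATES = ["A1", "A2", "A3", "B1", "B2", "B3", "ADD"]

microprogram = [            # A1 A2 A3 B1 B2 B3 ADD
    (1, 0, 0, 1, 0, 0, 0),  # a. ALU1 <- [R1]
    (0, 1, 0, 0, 1, 0, 0),  # b. ALU2 <- [R2]
    (0, 0, 0, 0, 0, 0, 1),  # c. ADD
    (1, 0, 0, 0, 0, 1, 0),  # d. R1 <- [ALU3]
]

def run(mp, regs):
    """One control word per clock. In steps a/b a register drives the
    bus and an ALU latch reads it; in step d ALU3 drives the bus and
    a register reads it (gates are bidirectional in this sketch)."""
    alu = [0, 0, 0]                       # ALU1..ALU3
    for word in mp:
        ctl = dict(zip(GATES, word))
        if ctl["ADD"]:
            alu[2] = alu[0] + alu[1]      # ALU3 <- ALU1 + ALU2
            continue
        if ctl["B3"]:                     # ALU3 drives the bus
            bus = alu[2]
            for i in (1, 2, 3):
                if ctl[f"A{i}"]:
                    regs[f"R{i}"] = bus
        else:                             # a register drives the bus
            bus = next(regs[f"R{i}"] for i in (1, 2, 3) if ctl[f"A{i}"])
            if ctl["B1"]: alu[0] = bus
            if ctl["B2"]: alu[1] = bus
    return regs

regs = run(microprogram, {"R1": 7, "R2": 5, "R3": 0})
print(regs)  # {'R1': 12, 'R2': 5, 'R3': 0}
```

The control unit of a real microprogrammed processor does essentially this: step through control words and open the gates each word names.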
Analysis of Instruction Cycle
• With single bus, it is slow, since in each “clock” only one transfer could be executed
• Is there any other way to “improve” the speed?• Dual bus processor may be faster• Additional processor cost
Dual processor-bus : A way to improve speed
[Diagram: the datapath with a DUAL BUS (buses 1 and 2); the other components (Control Unit, IR, PC, MAR, MBR) are unchanged.]

1. ALU1 ← [R1] (bus 1); ALU2 ← [R2] (bus 2)
2. ADD
3. R1 ← [ALU3] (bus 1)
Only 3 clock cycles needed: 25% fewer cycles.

How about this:
1. ALU1 ← [R1] (bus 1); ALU2 ← [R2] (bus 2); ADD
2. R1 ← [ALU3] (bus 1)
Only 2 clock cycles needed: 50% fewer cycles.
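The cycle counts above can be checked with a small scheduler sketch. The rules (one bus per transfer per clock; ADD uses only the adder, so it can share a cycle with the transfers that deliver its operands) follow the slides, while the code and operation names are illustrative:

```python
# Count clock cycles when micro-operations are grouped per cycle.
# Each data transfer needs one bus for one cycle; "ADD" needs no bus.
def cycles(schedule, n_buses):
    for step, group in enumerate(schedule, 1):
        transfers = [op for op in group if op != "ADD"]
        if len(transfers) > n_buses:
            raise ValueError(f"step {step} needs more than {n_buses} buses")
    return len(schedule)

single_bus = [["ALU1<-R1"], ["ALU2<-R2"], ["ADD"], ["R1<-ALU3"]]
dual_bus   = [["ALU1<-R1", "ALU2<-R2"], ["ADD"], ["R1<-ALU3"]]
dual_bus2  = [["ALU1<-R1", "ALU2<-R2", "ADD"], ["R1<-ALU3"]]

print(cycles(single_bus, 1), cycles(dual_bus, 2), cycles(dual_bus2, 2))
# 4 3 2
```

Grouping the two operand transfers onto separate buses saves one cycle; folding ADD into the same cycle saves another, which is exactly the 4 → 3 → 2 progression above.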
Dual processor-bus : Microprogram level representation
[Diagram: the dual-bus datapath at microprogram level, with gates A1-A6 and B1-B6 controlling the register and ALU connections to buses 1 and 2; other components (Control Unit, IR, PC, MAR, MBR) as before.]

How do we create the microprogram for the instruction SUB R3, R2?
Microprogram for SUB R3, R2 on the dual-bus processor

1. Assuming the subtraction and the transfer back of its result are done in separate clocks:

Step   A1 A2 A3 A4 A5 A6 B1 B2 B3 B4 B5 B6 SUB
a       0  0  1  0  1  0  1  0  1  0  0  0   0
b       0  0  0  0  0  0  0  0  0  0  0  0   1
c       0  0  0  0  0  1  0  0  0  0  0  1   0

2. Assuming the subtraction and the transfer back of its result are done in the same clock:

Step   A1 A2 A3 A4 A5 A6 B1 B2 B3 B4 B5 B6 SUB
a       0  0  1  0  1  0  1  0  1  0  0  0   0
b       0  0  0  0  0  1  0  0  0  0  0  1   1
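Whatever the exact A1-A6/B1-B6 gate assignments in the figure, a valid control word must never ask two sources to drive the same bus in one clock. That property can be checked mechanically; in this sketch the (source, bus) pairs are my illustrative rendering of SUB R3, R2 (variant 2), not the lecture's actual gate map:

```python
# Each step lists (source, bus) pairs: which unit drives which bus.
# A step is legal only if every bus has at most one driver.
def check(schedule):
    for step, transfers in enumerate(schedule, 1):
        buses = [bus for _source, bus in transfers]
        if len(buses) != len(set(buses)):
            raise ValueError(f"bus conflict in step {step}")
    return len(schedule)  # number of clock cycles

# SUB R3, R2, variant 2: subtract and write back in the same clock.
sub_v2 = [
    [("R3", "bus1"), ("R2", "bus2")],  # a. ALU1<-[R3], ALU2<-[R2], SUB
    [("ALU3", "bus1")],                # b. R3 <- [ALU3]
]
print(check(sub_v2))  # 2
```

With two buses, the two operand reads can share a cycle because they drive different buses; putting both on bus 1 would raise the conflict error.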
Triple processor-bus: Can the processing speed be improved?

[Diagram: the datapath with a TRIPLE BUS (buses 1, 2 and 3); please notice the direction of the arrows. Other components (Control Unit, IR, PC, MAR, MBR) as before.]

Exercise: If all the CPU components (registers, ALUs and adder) could do their work (transferring bits, adding numbers) in one third (1/3) of a clock cycle, how many clocks would be needed to complete an addition operation (ADD R1, R2)? Write down the register transfers and the microprogram for your register-transfer language.
Program Execution
• A scientific program written in assembly language runs on a microprocessor with a 1 GHz clock. To complete, the program needs to execute:
a. 150,000 arithmetic instructions (e.g. ADD R1,R2; MUL R1,R3; etc.)
b. 250,000 register-transfer instructions (e.g. MOV R1,R2; etc.)
c. 100,000 memory-access instructions (e.g. LOAD R1,X; STORE R2,Y; etc.)
If, on average, an arithmetic instruction needs 2 clocks to complete, a register-transfer instruction needs 1 clock and a memory-access instruction needs 10 clocks, calculate the average CPI (clocks per instruction) of the program. How long does the program take to complete (in seconds)?
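One way to work the exercise, with the instruction counts and per-class clock costs taken directly from the problem statement (variable names are mine):

```python
# Worked CPI solution for the exercise above, 1 GHz clock.
counts = {"arith": 150_000, "reg": 250_000, "mem": 100_000}
clocks = {"arith": 2, "reg": 1, "mem": 10}

total_instr = sum(counts.values())                         # 500,000
total_clocks = sum(counts[k] * clocks[k] for k in counts)  # 1,550,000
cpi = total_clocks / total_instr
seconds = total_clocks / 1e9    # 1 GHz = 1e9 clocks per second

print(cpi, seconds)  # 3.1 0.00155
```

The average CPI is a count-weighted mean, 1,550,000 / 500,000 = 3.1 clocks per instruction, and the program finishes in 1.55 ms.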
Can it be one clock? Yes it can!
Views of other books on "micro-operations":
• The bus is called the "data path"
• It consists not only of a bus (a bunch of wires) but of other digital devices as well
• Enable signals are used to speed up execution
• At additional (processor) cost
• Four parallel-load registers
• Two mux-based register selectors
• Register destination decoder
• Mux B for external constant input
• Buses A and B with external address and data outputs
• ALU and Shifter with Mux F for output select
• Mux D for external data input
• Logic for generating status bits V, C, N, Z
[Figure: the example datapath. A register file (R0-R3) drives Bus A and Bus B through the A select and B select muxes; MUX B can substitute a constant input onto Bus B, which also feeds Address Out and Data Out. The function unit (ALU and shifter, chosen by MF select through MUX F) produces F; MUX D (MD select) chooses between F and external Data In for Bus D, which is written back to the register file through the destination decoder under Load enable. The function unit also generates the status bits V, C, N and Z.]
Datapath Example: taken from M. Morris Mano’s book
Microoperation: R0 ← R1 + R2
Datapath Example: Performing a Microoperation
• Apply 01 to A select to place the contents of R1 onto Bus A
• Apply 10 to B select to place the contents of R2 onto B data, and apply 0 to MB select to place B data onto Bus B
• Apply 0010 to G select to perform the addition G = Bus A + Bus B
• Apply 0 to MF select and 0 to MD select to place the value of G onto Bus D
• Apply 00 to Destination select to select the Load input of R0
• Apply 1 to Load enable to force the Load input of R0 to 1, so that R0 is loaded on the clock pulse (not shown)
The overall microoperation requires 1 clock cycle (!)
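The signal settings listed above form a single control word. The sketch below encodes that word and interprets it against a 4-register file; the field names follow Mano's datapath, while the Python rendering and the register contents are mine:

```python
# Control word for R0 <- R1 + R2, using the signal values listed above.
control = {
    "A_select": 0b01,    # R1 onto Bus A
    "B_select": 0b10,    # R2 onto B data
    "MB_select": 0,      # Bus B <- B data (not the constant)
    "G_select": 0b0010,  # ALU function: G = A + B
    "MF_select": 0,      # MUX F passes G (not the shifter)
    "MD_select": 0,      # Bus D <- F (not external Data In)
    "Dest_select": 0b00, # destination decoder points at R0
    "Load_enable": 1,    # latch Bus D into R0 on the clock edge
}

# Minimal interpretation of the word against a 4-register file.
regs = [0, 7, 5, 0]                # R0..R3 (example contents)
bus_a = regs[control["A_select"]]  # R1 -> Bus A
bus_b = regs[control["B_select"]]  # R2 -> B data -> Bus B (MB_select=0)
assert control["G_select"] == 0b0010   # addition in this encoding
bus_d = bus_a + bus_b              # G through MUX F and MUX D to Bus D
if control["Load_enable"]:
    regs[control["Dest_select"]] = bus_d

print(regs)  # [12, 7, 5, 0]
```

Because all the muxes and the ALU are combinational, every field takes effect within the same clock period, which is why the whole microoperation fits in one cycle.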
Lesson Learned
• We can improve instruction execution speed by increasing the processor clock speed (can we?)
• We can improve instruction execution speed by implementing a dual bus (can we?)
• We can (partly) overcome the CPU-memory bottleneck by inserting a cache memory between the CPU and main memory (can we?)
• Is there any other way to improve instruction execution speed (increase performance)? Pipelining
• Do these improvements need extra cost? (the cost vs. performance issue)
What do we get after studying Computer Architecture?
• It is always a complicated question to answer.
• Basically we learn about processor design issues, namely the hardware of a computer, but taught through "software" logic.
• At least we learn the basic building blocks of a computer.
• We learn the design and development trends.
Question: How do we fetch the instruction (from memory)?
• The procedure that brings an instruction from memory to the CPU (IR) is called the instruction fetch
• The PC always holds the address of the (next) instruction in memory
• The PC transfers the address to the MAR, and memory is READ
• The PC is usually incremented by 1 (to point to the next instruction)
• The instruction is placed by memory in the MBR
• The content of the MBR is transferred to the IR
(the instruction is fetched, ready to be executed)
Question: How do we fetch the instruction (from memory)? - continued
• In register-transfer language, we can express the fetch cycle as:
1. MAR ← [PC]
2. READ (memory) and wait for completion
3. IR ← [MBR]
In terms of CPU clocks, these steps may take up to 50 clocks, depending on the memory clock speed.
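The three fetch steps can be sketched over a toy memory. The register names follow the slides; the addresses, memory contents and PC-increment placement are illustrative assumptions:

```python
# Fetch cycle sketch: MAR <- [PC]; READ; IR <- [MBR].
memory = {100: "AR R1,R2", 101: "MOV R2,R3"}   # illustrative contents

cpu = {"PC": 100, "MAR": 0, "MBR": None, "IR": None}

cpu["MAR"] = cpu["PC"]            # 1. MAR <- [PC]
cpu["MBR"] = memory[cpu["MAR"]]   # 2. READ memory, wait for completion
cpu["PC"] += 1                    #    PC now points at the next instruction
cpu["IR"] = cpu["MBR"]            # 3. IR <- [MBR]

print(cpu["IR"], cpu["PC"])  # AR R1,R2 101
```

After step 3 the instruction sits in the IR ready to execute, and the PC already addresses the next instruction, so the cycle can repeat.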
What is our topic? Instruction Set Architecture (ISA)

[Diagram: a layer stack with the Application Program on top, then the Compiler and OS, then the ISA, and below it CPU Design, Circuit Design and Chip Layout.]
Chapter 1 : Introduction
1. 1. Introduction : Organization & Architecture
• Organization and architecture: two terms that are often confused
• Computer organization refers to the operational units and their interconnections that realize the architectural specification (!)
• Computer architecture refers to those attributes of a system visible to a programmer or, put another way, those attributes that have a direct impact on the logical execution of a program (!)
• The latter definition (architecture) is more concerned with performance than the first (organization)
1. 1. Introduction - continued
• Architecture is more concerned with the basic instruction design that may lead to better system performance
• Organization is the implementation of a computer system, in terms of the interconnection of its functional units: CPU, memory, bus and I/O devices
• Example: the IBM S/370 family architecture. There are plenty of IBM products having the same architecture (S/370) but different organizations, depending on their price/performance measures. Cost and performance differentiate the organizations
• So, the organization of a computer is the implementation of its architecture, tailored to fit the intended price and performance measures
Chapter 2 :
Computer Evolution and Performance
ENIAC - background
• Electronic Numerical Integrator And Computer
• Eckert and Mauchly
• University of Pennsylvania
• Trajectory tables for weapons
• Started 1943
• Finished 1946 - too late for the war effort
• Used until 1955
ENIAC - details
• Decimal (not binary)
• 20 accumulators of 10 digits
• Programmed manually by switches
• 18,000 vacuum tubes
• 30 tons
• 15,000 square feet
• 140 kW power consumption
• 5,000 additions per second
ENIAC
Another View of ENIAC
Structure of von Neumann machine
IAS - details
• 1,000 x 40-bit words
– binary numbers
– 2 x 20-bit instructions
• Set of registers (storage in the CPU)
– Memory Buffer Register
– Memory Address Register
– Instruction Register
– Instruction Buffer Register
– Program Counter
– Accumulator
– Multiplier Quotient
2. 1.Evolution and Performance - history
• 1946: von Neumann and his group proposed the IAS machine (at the Institute for Advanced Studies)
• The design included:
– main memory
– ALU
– Control Unit
– I/O
• The first stored-program computer, able to perform +, −, ×, ÷
• The "father" of all modern computers/processors
Structure of IAS
IAS
2. 1. Evolution and Performance -history
The IAS components are:
• MBR (memory buffer register), MAR (memory address register), IR (instruction register), IBR (instruction buffer register), PC (program counter), AC (accumulator) and MQ (multiplier quotient), plus memory (1,000 locations)
• 20-bit instructions: 8-bit opcode and 12-bit address (addressing one of the 1,000 memory locations, 0 to 999)
• 40-bit data words (1 sign bit plus 39 value bits)
• Operations: data transfer between registers and ALU, unconditional branch, conditional branch, arithmetic, address modification
2.1. Evolution - History of Commercial computers
• First generation: in 1950 Mauchly & Eckert developed UNIVAC I, used by the Census Bureau
• Then appeared UNIVAC II, which later grew into the UNIVAC 1100 series (1103, 1104, 1105, 1106, 1108) - vacuum tubes and, later, transistors
• Second generation: transistors; the IBM 7094 (although NCR, RCA and others tried their own versions, commercially unsuccessful)
• Third generation: integrated circuits (ICs) - SSI. The IBM S/360 was the successful example
• Later generations (arguably fourth and fifth): LSI and VLSI technology
2.1. Evolution - history of commercial computers
Table 2.1
--------------------------------------------------------------------------
Generation    Time        Technology      Approx. speed (opr/sec)
--------------------------------------------------------------------------
1             1946-57     Vacuum tube              40,000
2             1958-64     Transistor              200,000
3             1965-71     SSI & MSI             1,000,000
4             1972-77     LSI                  10,000,000
5             1978-       VLSI                100,000,000
--------------------------------------------------------------------------
Vacuum Tubes
Transistor
2.1. Evolution - System 360 Family
Characteristic                 Model 30  Model 40  Model 50  Model 65  Model 75
------------------------------------------------------------------------------------------
Max memory size (bytes)            64K      256K      256K      512K      512K
Memory data rate (MB/s)            0.5       0.8       2.0       8.0      16.0
Processor cycle time (µs)          1.0     0.625       0.5      0.25       0.2
Relative speed                       1       3.5        10        21        50
Max number of data channels          3         3         4         6         6
Max channel data rate (KB/s)       250       400       800      1250      1250
------------------------------------------------------------------------------------------
• The family architecture gave rise to the terms "upward compatible" and "downward compatible"
Generations of Computer
• Vacuum tube - 1946-1957
• Transistor - 1958-1964
• Small-scale integration - 1965 on
– up to 100 devices on a chip
• Medium-scale integration - to 1971
– 100-3,000 devices on a chip
• Large-scale integration - 1971-1977
– 3,000-100,000 devices on a chip
• Very-large-scale integration - 1978 to date
– 100,000-100,000,000 devices on a chip
• Ultra-large-scale integration
– over 100,000,000 devices on a chip
Moore’s Law
• Increased density of components on a chip
• Gordon Moore - cofounder of Intel
• The number of transistors on a chip will double every year
• Since the 1970s development has slowed a little
– the number of transistors doubles every 18 months
• The cost of a chip has remained almost unchanged
• Higher packing density means shorter electrical paths, giving higher performance
• Smaller size gives increased flexibility
• Reduced power and cooling requirements
• Fewer interconnections increase reliability
Moore’s Law
Growth in CPU Transistor Count
Growth in CPU Transistor Count
IBM 360 series
• 1964
• Replaced (& not compatible with) 7000 series
• First planned “family” of computers
– Similar or identical instruction sets
– Similar or identical O/S
– Increasing speed
– Increasing number of I/O ports (i.e. more terminals)
– Increased memory size
– Increased cost
• Multiplexed switch structure
2.1. Evolution - Later generations
• Semiconductor memories: 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16 Mbit on a single chip; at present, 256 Mbit to 512 Mbit per chip
• Microprocessors appeared: Intel 4004 (1971), Intel 8008 (1972), Intel 8080 (8-bit, 1974), 8086 (16-bit, 1978), 80386 (32-bit, 1985) and onward
• At almost the same time: Motorola 6800 (8-bit), 68000 (16-bit), 68010 (16-bit), 68020 (32-bit), 68030/40 (32-bit)
• Then Motorola's products disappeared commercially
• Intel products have dominated the market since the appearance of the IBM PC
2.1. Evolution of Microprocessors
Table 2.2
------------------------------------------------------------------------------------------
Feature                    8008     8080     8086     80386     80486
------------------------------------------------------------------------------------------
Year introduced            1972     1974     1978     1985      1989
# of instructions            66      111      133      154       235
Address bus width             8       16       20       32        32
Data bus width                8        8       16       32        32
# of registers                8        8       16        8         8
Memory addressability      16KB     64KB     1 MB     4 GB      4 GB
Bus bandwidth (MB/s)          -      0.75        5       32        32
Reg-reg add time (µs)         -       1.3      0.3    0.125      0.06
------------------------------------------------------------------------------------------
2.2 Designing for Performance
• The price of processors continues to drop every year
• $1000 buys an advanced system at today's prices, in which you may find more than 100 million transistors!
• Even 100 million pieces of toilet paper cost more!!
• Computing power is practically free!!
• People solve problems never thought possible before: image processing, speech recognition, videoconferencing, multimedia authoring, etc.
• We need more and more computing power
• The organization and architecture of today's processors remain (basically) the same as those of the IAS!
• The algorithms to improve speed and efficiency differ!
2.2. Designing - processor speed
• Intel Pentium and PowerPC follow Moore's Law: by shrinking the size of lines in IC chips by 10%, industry can produce new ICs with 4 times the transistor density every 3 years!
• The same holds for DRAM (Dynamic Random Access Memory)
• Even if capacity increases, speed does not increase automatically
• More work is needed in designing instructions
• Techniques for faster instruction execution must also be developed: branch prediction, data-flow analysis and speculative execution
Pentium Evolution (1)
• 8080
– first general-purpose microprocessor
– 8-bit data path
– used in the first personal computer, the Altair
• 8086
– much more powerful
– 16-bit
– instruction cache, prefetching a few instructions
– the 8088 (8-bit external bus) was used in the first IBM PC
• 80286
– 16 MB of memory addressable
– up from 1 MB
• 80386
– 32-bit
– support for multitasking
Pentium Evolution (2)
• 80486
– sophisticated, powerful cache and instruction pipelining
– built-in maths co-processor
• Pentium
– superscalar
– multiple instructions executed in parallel
• Pentium Pro
– increased superscalar organization
– aggressive register renaming
– branch prediction
– data-flow analysis
– speculative execution
Pentium Evolution (3)
• Pentium II
– MMX technology
– graphics, video & audio processing
• Pentium III
– additional floating-point instructions for 3D graphics
• Pentium 4
– note Arabic rather than Roman numerals
– further floating-point and multimedia enhancements
• Itanium
– 64-bit
– see Chapter 15
• See the Intel web pages for detailed information on processors
Intel Microprocessor Performance
Summary: Important Points
• Organization and architecture
• Family architectures
• Functions of a computer (data processing, control, data movement)
• Birth of computers (ENIAC - decimal, IAS - binary), Mauchly-Eckert
• Microprocessors (Intel 4004, 8008, 8080, 8086/16-bit, 80386/32-bit)
• IAS instructions
• The von Neumann bottleneck
• Increasing clock speed, making the bus wider, cache memory
• Losers: e.g. the Motorola microprocessors, Radio Shack
• Denser transistors on a single chip (4 times every 3 years, by shrinking lines by 10%)