Arch Book Solution Ch5 Sep


  • 8/9/2019 Arch Book Solution Ch5 Sep

    1/27


Chapter 5

5.1 A bus transaction is a sequence of actions to complete a well-defined activity. Some examples of such activities are memory read, memory write, I/O read, burst read, and so on.


5.3 The address bus width determines the memory-addressing capacity of the system. Typically, 32-bit processors such as the Pentium use 32-bit addresses, and 64-bit processors use 64-bit addresses.


5.4 System performance improves with a wider data bus because we can move more bytes in parallel. Thus, a wider data bus increases the data transfer rate. For example, the Pentium uses a 64-bit data bus, whereas the Itanium uses a 128-bit data bus. Therefore, if all other parameters are the same, the Itanium has double the bandwidth of the Pentium processor.
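This scaling can be checked with simple arithmetic. The sketch below assumes one transfer per clock; the 100 MHz clock value is illustrative, not from the text.

```python
# Peak bus bandwidth = (data bus width in bytes) * (transfers per second).
# Assumption: one transfer per clock; the clock frequency is illustrative.

def peak_bandwidth_mb(bus_width_bits: int, clock_hz: float) -> float:
    """Peak bandwidth in MB/s, assuming one transfer per clock cycle."""
    return bus_width_bits / 8 * clock_hz / 1e6

print(peak_bandwidth_mb(64, 100e6))    # 800.0 MB/s for a 64-bit bus
print(peak_bandwidth_mb(128, 100e6))   # 1600.0 MB/s: doubling the width doubles the bandwidth
```

With all other parameters fixed, bandwidth is directly proportional to the data bus width, which is the point of the answer above.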


5.5 As the name implies, dedicated buses have separate buses to carry data and address information. For example, a 64-bit processor with 64 data and 64 address lines requires 128 pins just for these two buses. If we want to move 128 bits of data, as the Itanium does, we need 192 pins! The obvious advantage of these designs is the performance we can get out of them. To reduce the cost of such systems, we might use multiplexed bus designs, in which buses are not dedicated to a function. Instead, both data and address information are time-multiplexed on a shared bus. Multiplexed bus designs reduce the cost, but they also reduce system performance.
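The pin counts in the example follow directly; a minimal sketch (control, power, and ground pins are deliberately not counted):

```python
# Pin counts for dedicated vs. multiplexed address/data buses.
# Only address and data lines are counted, as in the example above.

def dedicated_pins(addr_bits: int, data_bits: int) -> int:
    return addr_bits + data_bits          # separate lines for each bus

def multiplexed_pins(addr_bits: int, data_bits: int) -> int:
    return max(addr_bits, data_bits)      # address and data share one set of lines

print(dedicated_pins(64, 64))    # 128 pins, the 64-bit example
print(dedicated_pins(64, 128))   # 192 pins for an Itanium-like 128-bit data bus
print(multiplexed_pins(64, 128)) # 128 pins when the two are time-multiplexed
```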


5.6 In synchronous buses, a bus clock signal provides the timing information for all actions on the bus. Changes in other signals are relative to the falling or rising edge of the clock. In asynchronous buses, there is no clock signal. Instead, they use four-way handshaking to perform a bus transaction. Asynchronous buses allow more flexibility in timing.

The main advantage of asynchronous buses is that they eliminate this dependence on the bus clock. However, synchronous buses are easier to implement, as they do not use handshaking. Almost all system buses are synchronous, partly for historical reasons. In the early days, the difference between the speeds of various devices was not as great as it is now. Since synchronous buses are simpler to implement, designers chose them.


5.8 The READY signal is needed in synchronous buses because the default timing allowed by the processor is sometimes insufficient for a slow device to respond. The processor samples the READY line when it expects data on the data bus. It reads the data on the data bus only if the READY signal is active; otherwise, it waits. Thus, slower devices can use this line to indicate that they need more time to respond to the request.

For example, in a memory read cycle, a slow memory may not be able to supply data when the processor expects it. In this case, the processor should not presume that whatever is present on the data bus is the actual data supplied by the memory. That is why the processor always reads the value of the READY line to see whether the memory has actually placed the data on the data bus. If this line is inactive (indicating not-ready status), the processor waits one more cycle and samples the READY line again. The processor inserts wait states as long as the READY line is inactive. Once this line is active, it reads the data and terminates the read cycle.

Asynchronous buses do not require the READY signal, as they use handshaking to perform a bus transaction (i.e., there is no default timing as in synchronous buses).
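The wait-state insertion described above can be sketched as a loop. This is a hedged illustration, not a hardware description; `ready_sampler` and `data_bus` are hypothetical callables standing in for sampling the READY line and latching the data bus at each clock edge.

```python
# Sketch of the wait-state loop in a synchronous read cycle.
# ready_sampler and data_bus are hypothetical stand-ins, not from the text.

def read_cycle(ready_sampler, data_bus):
    """Insert wait states until READY is active, then latch the data."""
    wait_states = 0
    while not ready_sampler():   # READY inactive: the device needs more time
        wait_states += 1         # insert one wait state and sample again
    return data_bus(), wait_states

# A slow memory that needs two extra cycles before its data is valid:
levels = iter([False, False, True])
data, waits = read_cycle(lambda: next(levels), lambda: 0xAB)
print(hex(data), waits)   # 0xab 2
```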


5.9 Processors provide block transfer operations that read or write several contiguous locations of a memory block. Such block transfers are more efficient than transferring each word individually. A cache line fill is an example that requires reading several contiguous memory locations. Data movement between the cache and main memory is in units of the cache line size. If the cache line size is 32 bytes, each cache line fill requires 32 bytes of data from memory. The Pentium uses 32-byte cache lines. It provides a block transfer operation that transfers four 64-bit values from memory. Thus, by using this block transfer, we can fill a 32-byte cache line. Without block transfer, we need four separate memory read cycles to fill a cache line, which takes more time than the block transfer operation.
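The arithmetic, plus a hedged cycle-count comparison (the one-cycle-per-phase costs are illustrative assumptions, not figures from the text):

```python
# Cache-line-fill arithmetic, with an assumed cycle-cost comparison.

LINE_SIZE_BYTES = 32          # Pentium cache line
BEAT_BYTES = 64 // 8          # one 64-bit transfer moves 8 bytes

beats = LINE_SIZE_BYTES // BEAT_BYTES
print(beats)                  # 4 transfers of 64 bits fill one 32-byte line

ADDRESS_CYCLES, DATA_CYCLES = 1, 1                   # assumed costs per phase
individual = beats * (ADDRESS_CYCLES + DATA_CYCLES)  # an address sent for every word
block = ADDRESS_CYCLES + beats * DATA_CYCLES         # one address, then a burst
print(individual, block)      # 8 vs. 5 cycles under these assumptions
```

The block transfer wins because it pays the address-phase overhead once for the whole line rather than once per word.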


5.10 The main disadvantage of the static mechanism is that its bus allocation follows a predetermined pattern rather than the actual need. In this mechanism, a master may be given the bus even if it does not need it. This kind of allocation leads to inefficient use of the bus. This inefficiency is avoided by dynamic bus arbitration, which uses a demand-driven allocation scheme.


5.11 A fair allocation policy does not allow starvation. Fairness can be defined in several ways. For example, fairness can be defined to handle bus requests within a priority class, or requests from several priority classes. Some examples of fairness are: (i) all bus requests in a predefined window must be satisfied before granting requests from the next window; (ii) a bus request should not be pending for more than a specified number of milliseconds. For example, in the PCI bus, we can specify fairness by indicating the maximum delay to grant a request.


5.12 A potential disadvantage of nonpreemptive policies is that a bus master may hold the bus for a long time, depending on the transaction type. For example, long block transfers can hold the bus for extended periods of time. This may cause problems for some types of services where the bus is needed immediately. Preemptive policies force the current master to release the bus without completing its current bus transaction.


5.13 A drawback of the transaction-based release policy is that if there is only one master requesting the bus most of the time, we unnecessarily incur arbitration overhead for each bus transaction. This is typically the case in single-processor systems. In these systems, the CPU uses the bus most of the time; DMA requests are relatively infrequent. In demand-based release, the current master releases the bus only if there is a request from another bus master; otherwise, it continues to use the bus. Typically, this check is done at the completion of each transaction. This policy leads to more efficient use of the bus.
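The difference in overhead can be illustrated by counting arbitration rounds for the same request stream. This is a sketch under stated assumptions; the function names and the example stream are illustrative, not from the text.

```python
# Counting arbitration rounds under the two release policies for one
# CPU-dominated request stream (an illustrative assumption).

def arbitrations_transaction_based(num_transactions: int) -> int:
    # The master re-arbitrates after every transaction, needed or not.
    return num_transactions

def arbitrations_demand_based(other_master_requests: list) -> int:
    # The master keeps the bus; arbitration happens only when another
    # master has asserted its request line at a transaction boundary.
    return sum(other_master_requests)

stream = [False, False, True, False, False]   # one DMA request in five slots
print(arbitrations_transaction_based(len(stream)))  # 5 arbitrations
print(arbitrations_demand_based(stream))            # 1 arbitration
```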


5.14 Centralized implementations suffer from single-point failures due to the presence of the central arbiter. This causes two main problems:

1. If the central arbiter fails, there will be no arbitration.

2. The central arbiter can become a bottleneck, limiting the performance of the whole system.

The distributed implementation avoids these problems. However, the arbitration logic has to be distributed among the masters. In contrast, in the centralized organization, the bus masters don't have the arbitration logic.


5.15 The daisy-chaining scheme has three potential problems:

• It implements a fixed-priority policy, which can lead to starvation problems. From our discussion, it should be clear that the master closer to the arbiter (in the chain) has higher priority.

• The bus arbitration time varies and is proportional to the number of masters. The reason is that the grant signal has to propagate from master to master. If each master takes t time units to propagate the bus grant signal from its input to its output, a master that is in the i-th position in the chain would experience a delay of i·t time units before receiving the grant signal.

• This scheme is not fault tolerant. If a master fails, it may fail to pass the bus grant signal to the master down the chain.
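All three problems can be seen in a small simulation of grant propagation down the chain. This is a sketch, not a hardware model; the per-master delay value is the t from the answer above with an illustrative value.

```python
# Daisy-chain grant propagation: fixed priority by position, grant delay
# proportional to position, and no fault tolerance.

PROPAGATION_DELAY = 1  # t time units per master (illustrative value)

def grant(requests, failed=frozenset()):
    """Return (winning master's position, its grant delay), or None.

    The grant signal enters at master 0 and is passed down the chain, so
    the lowest-numbered requester always wins (fixed priority). A failed
    master breaks the chain for everyone after it.
    """
    for position, requesting in enumerate(requests):
        if position in failed:
            return None                      # grant never propagates past here
        if requesting:
            return position, position * PROPAGATION_DELAY
    return None                              # nobody requested the bus

print(grant([False, True, True]))        # (1, 1): master 1 always beats master 2
print(grant([False, False, True], {1}))  # None: failed master 1 blocks master 2
```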


5.16 The hybrid scheme limits the three potential problems associated with the daisy-chaining scheme, as explained below:

• The hybrid scheme limits the fixed-priority policy of the daisy-chaining scheme to a class.

• In the daisy-chaining scheme, the bus arbitration time varies and is proportional to the number of masters. The hybrid scheme limits this delay, as the class size is small.

• The daisy-chaining scheme is not fault tolerant. If a master fails, it may fail to pass the bus grant signal to the master down the chain. However, in the hybrid scheme, this problem is limited to the class in which the node failure occurs.

The independent request/grant lines scheme rectifies the problems associated with the daisy-chaining scheme. However, it is expensive to implement. The hybrid scheme reduces the cost by applying this scheme at the class level.


5.17 The ISA bus was closely associated with the system bus used by the IBM PC. It operates at an 8.33 MHz clock with a maximum bandwidth of about 8 MB/s. This bandwidth was sufficient at the time, as memories were slower and there were no multimedia, windowed GUIs, and the like to worry about. However, in current systems, the ISA bus can only support slow I/O devices. Even this limited use of the ISA bus is disappearing due to the presence of the USB. Current systems use the PCI bus, which is processor independent. It can provide a peak bandwidth of up to 528 MB/s.
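The two peak figures can be reconstructed with a hedged assumption about cycles per transfer: 16-bit ISA needs at least two clocks per transfer, while the 528 MB/s PCI figure corresponds to a 64-bit, 66 MHz bus moving one 8-byte word per clock in burst mode. These cycle counts are this sketch's assumptions, not statements from the text.

```python
# Reconstructing the quoted peak bandwidths under assumed cycle counts.

def peak_mb_per_s(clock_mhz: float, bytes_per_transfer: int, clocks_per_transfer: int) -> float:
    return clock_mhz * bytes_per_transfer / clocks_per_transfer

isa = peak_mb_per_s(8.33, 2, 2)   # 16-bit ISA, assumed 2 clocks/transfer: ~8.33 MB/s
pci = peak_mb_per_s(66, 8, 1)     # 64-bit, 66 MHz PCI burst: 528.0 MB/s
print(isa, pci)
```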


5.18 The main reason is to save connector pins. Even with multiplexing, the 32-bit PCI uses 120-pin connectors, while the 64-bit version needs an additional 64 pins. A drawback of a multiplexed address/data bus is that it needs additional time to turn the bus around.


5.19 The four byte enable lines identify the bytes of data to be transferred. Each BE# line identifies one byte of the 32-bit data: BE0# identifies byte 0, BE1# identifies byte 1, and so on. Thus, we can specify any combination of the bytes to be transferred. Two extreme cases are: C/BE# = 0000, which indicates transfer of all four bytes, and C/BE# = 1111, which indicates a null data phase (no byte transfer). In a multiple data phase bus transaction, the byte enable value can be specified for each data phase. Thus, we can transfer the bytes of interest in each data phase. The extreme case of a null data phase is useful, for example, if you want to skip one or more 32-bit values in the middle of a burst data transfer. If the null data phase were not allowed, we would have to terminate the current bus transaction, request the bus again via the arbiter, and restart the transfer with a new address.
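The byte enables are active-low, so a 0 bit selects a byte. A small decoding sketch (the helper name is hypothetical):

```python
# Decoding a 4-bit active-low C/BE# byte enable value: a 0 bit in
# position i means "transfer byte i" of the 32-bit data.

def enabled_bytes(c_be: int) -> list:
    """Return the byte numbers selected by a 4-bit active-low C/BE# value."""
    return [byte for byte in range(4) if not (c_be >> byte) & 1]

print(enabled_bytes(0b0000))   # [0, 1, 2, 3]: all four bytes transferred
print(enabled_bytes(0b1111))   # []: a null data phase, no bytes transferred
print(enabled_bytes(0b1100))   # [0, 1]: only the low two bytes
```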


5.20 The current bus master drives the FRAME# signal to indicate the start of a bus transaction (when the FRAME# signal goes low). This signal is also used to indicate the length of the bus transaction cycle: it is held low until the final data phase of the bus transaction.


5.21 PCI uses centralized bus arbitration with independent grant and request lines. Each device has separate grant (GNT#) and request (REQ#) lines connected to the central arbiter. The PCI specification does not mandate a particular arbitration policy. However, it mandates that the policy should be a fair one to avoid starvation.

A device that is not the current master can request the bus by asserting the REQ# line. The arbitration takes place while the current bus master is using the bus. When the arbiter notifies a master that it can use the bus for the next transaction, the master must wait until the current bus master has released the bus (i.e., the bus is idle). The bus idle condition is indicated when both FRAME# and IRDY# are high.

PCI uses hidden bus arbitration in the sense that the arbiter works while another bus master is running its transaction on the PCI bus. This overlapped bus arbitration increases PCI bus utilization by not keeping the bus idle during arbitration.

PCI devices should request the bus for each transaction. However, a transaction may consist of an address phase and one or more data phases. For efficiency, data should be transferred in burst mode. The PCI specification has safeguards to prevent a single master from monopolizing the bus and to force a master to release the bus.


5.22 PCI allows hierarchical PCI bus systems, which are typically built using PCI-to-PCI bridges (e.g., using the Intel 21152 PCI-to-PCI bridge chip). This chip connects two independent PCI buses: a primary and a secondary. The bridge improves performance for the following reasons:

1. It allows concurrent operation of the two PCI buses. For example, a master and target on the same PCI bus can communicate while the other PCI bus is busy.

2. The bridge also provides traffic filtering, which minimizes the traffic crossing over to the other side.

Obviously, this traffic separation, along with concurrent operation, improves overall system performance for bandwidth-hungry applications such as multimedia.


5.23 Most PCI buses tend to operate at a 33 MHz clock speed due to serious challenges in implementing the 66 MHz design. To understand this problem, look at the timing of the two buses. The 33 MHz bus cycle of 30 ns leaves about 7 ns of setup time for the target. When we double the clock frequency, all values are cut in half. The reduction in setup time is critical, as only about 3 ns remain for the target to respond. As a result of this difficulty, most PCI buses operate at 33 MHz.

PCI-X solves this problem by using a register-to-register protocol, as opposed to the immediate protocol implemented by PCI. In the PCI-X register-to-register protocol, the signal sent by the master device is stored in a register until the next clock. Thus, the receiver has one full clock cycle to respond to the master's request. This makes it possible to increase the frequency to 133 MHz. At this frequency, one clock period corresponds to about 7.5 ns, about the same period allowed for the decode phase in the 33 MHz PCI implementation. We get this increase in frequency by adding one additional cycle to each bus transaction. This increased overhead is more than compensated for by the increase in frequency.
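The timing figures quoted above can be checked directly:

```python
# Checking the clock-period arithmetic behind the PCI vs. PCI-X argument.

def period_ns(clock_mhz: float) -> float:
    return 1000 / clock_mhz

print(round(period_ns(33)))      # 30 ns: the 33 MHz PCI bus cycle
print(round(period_ns(133), 1))  # 7.5 ns: one full PCI-X clock for the target

# Doubling the clock halves the ~7 ns setup window left for the target:
setup_33mhz_ns = 7
print(setup_33mhz_ns / 2)        # 3.5, i.e., only about 3 ns at 66 MHz
```

So the one full PCI-X clock (about 7.5 ns) restores roughly the response window that 33 MHz PCI offered, even though the bus runs four times faster.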


5.24 With the increasing demand for high-performance video due to applications such as 3D graphics and full-motion video, the PCI bus is reaching its performance limit. In response to these demands, Intel introduced the AGP exclusively to support high-performance 3D graphics and full-motion video applications. The AGP is not a bus in the sense that it does not connect multiple devices. The AGP is a port that connects precisely two devices: the CPU and the video card.

To see the bandwidth demand of full-motion video, let us look at a 640 × 480 resolution screen. For true color, we need three bytes per pixel. Thus, each frame requires 640 * 480 * 3 = 920 KB. Full-motion video should use a frame rate of 30 frames/second. Therefore, the required bandwidth is 920 * 30/1000 = 27.6 MB/s. If we consider a higher resolution of 1024 × 768, it goes up to 70.7 MB/s. We actually need twice this bandwidth when displaying video from hard disks or DVDs. This is because the data have to traverse the bus twice: once from the disk to the system memory, and again from the memory to the graphics adapter. The 32-bit, 33 MHz PCI with 133 MB/s bandwidth can barely support this data transfer rate. The 64-bit PCI can comfortably handle full-motion video, but the video data transfer uses half its bandwidth. Since the video unit is a specialized subsystem, there is no reason for it to be attached to a general-purpose bus like the PCI. We can solve many of the bandwidth problems by designing a special interconnect to supply the video data. By taking the video load off the PCI bus, we also improve the performance of the overall system. Intel proposed the AGP precisely for these reasons.
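The bandwidth arithmetic above can be reproduced exactly (using decimal megabytes, which is why the higher-resolution figure comes out at about 70.8 rather than the quoted 70.7 MB/s):

```python
# Full-motion-video bandwidth arithmetic from the answer above.

def video_bandwidth_mb(width: int, height: int, bytes_per_pixel: int = 3, fps: int = 30) -> float:
    """Required bandwidth in MB/s (decimal megabytes)."""
    return width * height * bytes_per_pixel * fps / 1e6

vga = video_bandwidth_mb(640, 480)     # ~27.6 MB/s
xga = video_bandwidth_mb(1024, 768)    # ~70.8 MB/s (quoted as 70.7 above)
print(round(vga, 1), round(xga, 1))
print(round(2 * vga, 1))  # disk-to-memory plus memory-to-adapter doubles the load
```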


5.25 As we have seen in the text, AGP is targeted at 3D graphical display applications, which have high memory bandwidth requirements. One of the performance enhancements AGP uses is pipelining. AGP pipelined transmission can be interrupted by PCI transactions. This ability to intervene in a pipelined AGP transfer allows the bus master to maintain a high pipeline depth for improved performance.


5.26 The STSCHG# signal gives I/O status change information for multifunction PC cards. In a pure I/O PC card, we do not normally require this signal. However, in multifunction PC cards containing memory and I/O functions, this signal is needed to report the status signals removed from the memory interface (READY, WP, BVD1, and BVD2). A configuration register (called the pin replacement register) in the attribute memory maintains the status of the signals removed from the memory interface. For example, since the BVD signals are removed from the memory interface, this register keeps the BVD information to report the status of the battery. When a status change occurs, this signal is asserted. The host can read the pin replacement register to get the status.