Arch Book Solution Ch5 Sep


  • 8/9/2019 Arch Book Solution Ch5 Sep

    1/27


Chapter 5

5.1 A bus transaction is a sequence of actions to complete a well-defined activity. Some examples of such activities are memory read, memory write, I/O read, burst read, and so on.


5.3 The address bus width determines the memory-addressing capacity of the system. Typically, 32-bit processors such as the Pentium use 32-bit addresses, and 64-bit processors use 64-bit addresses.


5.4 System performance improves with a wider data bus because we can move more bytes in parallel. Thus, a wider data bus increases the data transfer rate. For example, the Pentium uses a 64-bit data bus, whereas the Itanium uses a 128-bit data bus. Therefore, if all other parameters are the same, the Itanium has double the bandwidth of the Pentium processor.
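This scaling can be checked with simple arithmetic. The sketch below assumes one transfer per clock; the 100 MHz clock value is illustrative, not from the text.

```python
# Peak bus bandwidth = (data bus width in bytes) * (transfers per second).
# Assumption: one transfer per clock; the clock frequency is illustrative.

def peak_bandwidth_mb(bus_width_bits: int, clock_hz: float) -> float:
    """Peak bandwidth in MB/s, assuming one transfer per clock cycle."""
    return bus_width_bits / 8 * clock_hz / 1e6

print(peak_bandwidth_mb(64, 100e6))    # 800.0 MB/s for a 64-bit bus
print(peak_bandwidth_mb(128, 100e6))   # 1600.0 MB/s: doubling the width doubles the bandwidth
```

With all other parameters fixed, bandwidth is directly proportional to the data bus width, which is the point of the answer above.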


5.5 As the name implies, dedicated buses have separate buses to carry data and address information. For example, a 64-bit processor with 64 data and 64 address lines requires 128 pins just for these two buses. If we want to move 128 bits of data, as the Itanium does, we need 192 pins! The obvious advantage of these designs is the performance we can get out of them. To reduce the cost of such systems, we might use multiplexed bus designs, in which buses are not dedicated to a function. Instead, both data and address information are time-multiplexed on a shared bus. Multiplexed bus designs reduce the cost, but they also reduce system performance.
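The pin counts in the example follow directly; a minimal sketch (control, power, and ground pins are deliberately not counted):

```python
# Pin counts for dedicated vs. multiplexed address/data buses.
# Only address and data lines are counted, as in the example above.

def dedicated_pins(addr_bits: int, data_bits: int) -> int:
    return addr_bits + data_bits          # separate lines for each bus

def multiplexed_pins(addr_bits: int, data_bits: int) -> int:
    return max(addr_bits, data_bits)      # address and data share one set of lines

print(dedicated_pins(64, 64))    # 128 pins, the 64-bit example
print(dedicated_pins(64, 128))   # 192 pins for an Itanium-like 128-bit data bus
print(multiplexed_pins(64, 128)) # 128 pins when the two are time-multiplexed
```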


5.6 In synchronous buses, a bus clock signal provides the timing information for all actions on the bus. Changes in other signals are relative to the falling or rising edge of the clock. In asynchronous buses, there is no clock signal. Instead, they use four-way handshaking to perform a bus transaction. Asynchronous buses allow more flexibility in timing.

The main advantage of asynchronous buses is that they eliminate this dependence on the bus clock. However, synchronous buses are easier to implement, as they do not use handshaking. Almost all system buses are synchronous, partly for historical reasons. In the early days, the difference between the speeds of various devices was not as great as it is now. Since synchronous buses are simpler to implement, designers chose them.


5.8 The READY signal is needed in synchronous buses because the default timing allowed by the processor is sometimes insufficient for a slow device to respond. The processor samples the READY line when it expects data on the data bus. It reads the data on the data bus only if the READY signal is active; otherwise, it waits. Thus, slower devices can use this line to indicate that they need more time to respond to the request.

For example, in a memory read cycle, a slow memory may not be able to supply data when the processor expects it. In this case, the processor should not presume that whatever is present on the data bus is the actual data supplied by the memory. That is why the processor always reads the value of the READY line to see whether the memory has actually placed the data on the data bus. If this line is inactive (indicating not-ready status), the processor waits one more cycle and samples the READY line again. The processor inserts wait states as long as the READY line is inactive. Once this line is active, it reads the data and terminates the read cycle.

Asynchronous buses do not require the READY signal, as they use handshaking to perform a bus transaction (i.e., there is no default timing as in synchronous buses).
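The wait-state insertion described above can be sketched as a loop. This is a hedged illustration, not a hardware description; `ready_sampler` and `data_bus` are hypothetical callables standing in for sampling the READY line and latching the data bus at each clock edge.

```python
# Sketch of the wait-state loop in a synchronous read cycle.
# ready_sampler and data_bus are hypothetical stand-ins, not from the text.

def read_cycle(ready_sampler, data_bus):
    """Insert wait states until READY is active, then latch the data."""
    wait_states = 0
    while not ready_sampler():   # READY inactive: the device needs more time
        wait_states += 1         # insert one wait state and sample again
    return data_bus(), wait_states

# A slow memory that needs two extra cycles before its data is valid:
levels = iter([False, False, True])
data, waits = read_cycle(lambda: next(levels), lambda: 0xAB)
print(hex(data), waits)   # 0xab 2
```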


5.9 Processors provide block transfer operations that read or write several contiguous locations of a memory block. Such block transfers are more efficient than transferring each word individually. A cache line fill is an example that requires reading several contiguous memory locations. Data movement between the cache and main memory is in units of the cache line size. If the cache line size is 32 bytes, each cache line fill requires 32 bytes of data from memory. The Pentium uses 32-byte cache lines. It provides a block transfer operation that transfers four 64-bit values from memory. Thus, by using this block transfer, we can fill a 32-byte cache line. Without block transfer, we need four separate memory read cycles to fill a cache line, which takes more time than the block transfer operation.
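The arithmetic, plus a hedged cycle-count comparison (the one-cycle-per-phase costs are illustrative assumptions, not figures from the text):

```python
# Cache-line-fill arithmetic, with an assumed cycle-cost comparison.

LINE_SIZE_BYTES = 32          # Pentium cache line
BEAT_BYTES = 64 // 8          # one 64-bit transfer moves 8 bytes

beats = LINE_SIZE_BYTES // BEAT_BYTES
print(beats)                  # 4 transfers of 64 bits fill one 32-byte line

ADDRESS_CYCLES, DATA_CYCLES = 1, 1                   # assumed costs per phase
individual = beats * (ADDRESS_CYCLES + DATA_CYCLES)  # an address sent for every word
block = ADDRESS_CYCLES + beats * DATA_CYCLES         # one address, then a burst
print(individual, block)      # 8 vs. 5 cycles under these assumptions
```

The block transfer wins because it pays the address-phase overhead once for the whole line rather than once per word.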


5.10 The main disadvantage of the static mechanism is that its bus allocation follows a predetermined pattern rather than the actual need. In this mechanism, a master may be given the bus even if it does not need it. This kind of allocation leads to inefficient use of the bus. This inefficiency is avoided by dynamic bus arbitration, which uses a demand-driven allocation scheme.


5.11 A fair allocation policy does not allow starvation. Fairness can be defined in several ways. For example, fairness can be defined to handle bus requests within a priority class, or requests from several priority classes. Some examples of fairness are: (i) all bus requests in a predefined window must be satisfied before granting requests from the next window; (ii) a bus request should not be pending for more than a specified number of milliseconds. For example, in the PCI bus, we can specify fairness by indicating the maximum delay to grant a request.


5.12 A potential disadvantage of nonpreemptive policies is that a bus master may hold the bus for a long time, depending on the transaction type. For example, long block transfers can hold the bus for extended periods of time. This may cause problems for some types of services where the bus is needed immediately. Preemptive policies force the current master to release the bus without completing its current bus transaction.


5.13 A drawback of the transaction-based release policy is that if there is only one master requesting the bus most of the time, we unnecessarily incur arbitration overhead for each bus transaction. This is typically the case in single-processor systems. In these systems, the CPU uses the bus most of the time; DMA requests are relatively infrequent. In demand-based release, the current master releases the bus only if there is a request from another bus master; otherwise, it continues to use the bus. Typically, this check is done at the completion of each transaction. This policy leads to more efficient use of the bus.
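The difference in overhead can be illustrated by counting arbitration rounds for the same request stream. This is a sketch under stated assumptions; the function names and the example stream are illustrative, not from the text.

```python
# Counting arbitration rounds under the two release policies for one
# CPU-dominated request stream (an illustrative assumption).

def arbitrations_transaction_based(num_transactions: int) -> int:
    # The master re-arbitrates after every transaction, needed or not.
    return num_transactions

def arbitrations_demand_based(other_master_requests: list) -> int:
    # The master keeps the bus; arbitration happens only when another
    # master has asserted its request line at a transaction boundary.
    return sum(other_master_requests)

stream = [False, False, True, False, False]   # one DMA request in five slots
print(arbitrations_transaction_based(len(stream)))  # 5 arbitrations
print(arbitrations_demand_based(stream))            # 1 arbitration
```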


5.14 Centralized implementations suffer from single-point failures due to the presence of the central arbiter. This causes two main problems:

1. If the central arbiter fails, there will be no arbitration.

2. The central arbiter can become a bottleneck, limiting the performance of the whole system.

The distributed implementation avoids these problems. However, the arbitration logic has to be distributed among the masters. In contrast, in the centralized organization, the bus masters don't have the arbitration logic.


5.15 The daisy-chaining scheme has three potential problems:

• It implements a fixed-priority policy, which can lead to starvation problems. From our discussion, it should be clear that the master closer to the arbiter (in the chain) has higher priority.

• The bus arbitration time varies and is proportional to the number of masters. The reason is that the grant signal has to propagate from master to master. If each master takes t time units to propagate the bus grant signal from its input to its output, a master that is in the i-th position in the chain would experience a delay of i·t time units before receiving the grant signal.

• This scheme is not fault tolerant. If a master fails, it may fail to pass the bus grant signal to the master down the chain.
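All three problems can be seen in a small simulation of grant propagation down the chain. This is a sketch, not a hardware model; the per-master delay value is the t from the answer above with an illustrative value.

```python
# Daisy-chain grant propagation: fixed priority by position, grant delay
# proportional to position, and no fault tolerance.

PROPAGATION_DELAY = 1  # t time units per master (illustrative value)

def grant(requests, failed=frozenset()):
    """Return (winning master's position, its grant delay), or None.

    The grant signal enters at master 0 and is passed down the chain, so
    the lowest-numbered requester always wins (fixed priority). A failed
    master breaks the chain for everyone after it.
    """
    for position, requesting in enumerate(requests):
        if position in failed:
            return None                      # grant never propagates past here
        if requesting:
            return position, position * PROPAGATION_DELAY
    return None                              # nobody requested the bus

print(grant([False, True, True]))        # (1, 1): master 1 always beats master 2
print(grant([False, False, True], {1}))  # None: failed master 1 blocks master 2
```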


5.16 The hybrid scheme limits the three potential problems associated with the daisy-chaining scheme, as explained below:

• The hybrid scheme limits the fixed-priority policy of the daisy-chaining scheme to a class.

• In the daisy-chaining scheme, the bus arbitration time varies and is proportional to the number of masters. The hybrid scheme limits this delay, as the class size is small.

• The daisy-chaining scheme is not fault tolerant. If a master fails, it may fail to pass the bus grant signal to the master down the chain. However, in the hybrid scheme, this problem is limited to the class in which the node failure occurs.

The independent request/grant lines scheme rectifies the problems associated with the daisy-chaining scheme. However, it is expensive to implement. The hybrid scheme reduces the cost by applying this scheme at the class level.


5.17 The ISA bus was closely associated with the system bus used by the IBM PC. It operates at an 8.33 MHz clock with a maximum bandwidth of about 8 MB/s. This bandwidth was sufficient at the time, as memories were slower and there were no multimedia, windowed GUIs, and the like to worry about. However, in current systems, the ISA bus can only support slow I/O devices. Even this limited use of the ISA bus is disappearing due to the presence of the USB. Current systems use the PCI bus, which is processor independent. It can provide a peak bandwidth of up to 528 MB/s.
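The two peak figures can be reconstructed with a hedged assumption about cycles per transfer: 16-bit ISA needs at least two clocks per transfer, while the 528 MB/s PCI figure corresponds to a 64-bit, 66 MHz bus moving one 8-byte word per clock in burst mode. These cycle counts are this sketch's assumptions, not statements from the text.

```python
# Reconstructing the quoted peak bandwidths under assumed cycle counts.

def peak_mb_per_s(clock_mhz: float, bytes_per_transfer: int, clocks_per_transfer: int) -> float:
    return clock_mhz * bytes_per_transfer / clocks_per_transfer

isa = peak_mb_per_s(8.33, 2, 2)   # 16-bit ISA, assumed 2 clocks/transfer: ~8.33 MB/s
pci = peak_mb_per_s(66, 8, 1)     # 64-bit, 66 MHz PCI burst: 528.0 MB/s
print(isa, pci)
```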


5.18 The main reason is to save connector pins. Even with multiplexing, the 32-bit PCI uses 120-pin connectors, while the 64-bit version needs an additional 64 pins. A drawback of a multiplexed address/data bus is that it needs additional time to turn the bus around.


5.19 The four byte enable lines identify the bytes of data to be transferred. Each BE# line identifies one byte of the 32-bit data: BE0# identifies byte 0, BE1# identifies byte 1, and so on. Thus, we can specify any combination of the bytes to be transferred. Two extreme cases are: C/BE# = 0000, which indicates transfer of all four bytes, and C/BE# = 1111, which indicates a null data phase (no byte transfer). In a multiple data phase bus transaction, the byte enable value can be specified for each data phase. Thus, we can transfer the bytes of interest in each data phase. The extreme case of a null data phase is useful, for example, if you want to skip one or more 32-bit values in the middle of a burst data transfer. If the null data phase were not allowed, we would have to terminate the current bus transaction, request the bus again via the arbiter, and restart the transfer with a new address.
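The byte enables are active-low, so a 0 bit selects a byte. A small decoding sketch (the helper name is hypothetical):

```python
# Decoding a 4-bit active-low C/BE# byte enable value: a 0 bit in
# position i means "transfer byte i" of the 32-bit data.

def enabled_bytes(c_be: int) -> list:
    """Return the byte numbers selected by a 4-bit active-low C/BE# value."""
    return [byte for byte in range(4) if not (c_be >> byte) & 1]

print(enabled_bytes(0b0000))   # [0, 1, 2, 3]: all four bytes transferred
print(enabled_bytes(0b1111))   # []: a null data phase, no bytes transferred
print(enabled_bytes(0b1100))   # [0, 1]: only the low two bytes
```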


5.20 The current bus master drives the FRAME# signal to indicate the start of a bus transaction (when the FRAME# signal goes low). This signal is also used to indicate the length of the bus transaction cycle: it is held low until the final data phase of the bus transaction.


5.21 PCI uses centralized bus arbitration with independent grant and request lines. Each device has separate grant (GNT#) and request (REQ#) lines connected to the central arbiter. The PCI specification does not mandate a particular arbitration policy. However, it mandates that the policy should be a fair one to avoid starvation.

A device that is not the current master can request the bus by asserting the REQ# line. The arbitration takes place while the current bus master is using the bus. When the arbiter notifies a master that it can use the bus for the next transaction, the master must wait until the current bus master has released the bus (i.e., the bus is idle). The bus idle condition is indicated when both FRAME# and IRDY# are high.

PCI uses hidden bus arbitration in the sense that the arbiter works while another bus master is running its transaction on the PCI bus. This overlapped bus arbitration increases PCI bus utilization by not keeping the bus idle during arbitration.

PCI devices should request the bus for each transaction. However, a transaction may consist of an address phase and one or more data phases. For efficiency, data should be transferred in burst mode. The PCI specification has safeguards to prevent a single master from monopolizing the bus and to force a master to release the bus.


5.22 PCI allows hierarchical PCI bus systems, which are typically built using PCI-to-PCI bridges (e.g., using the Intel 21152 PCI-to-PCI bridge chip). This chip connects two independent PCI buses: a primary and a secondary. The bridge improves performance for the following reasons:

1. It allows concurrent operation of the two PCI buses. For example, a master and target on the same PCI bus can communicate while the other PCI bus is busy.

2. The bridge also provides traffic filtering, which minimizes the traffic crossing over to the other side.

Obviously, this traffic separation, along with concurrent operation, improves overall system performance for bandwidth-hungry applications such as multimedia.


5.23 Most PCI buses tend to operate at a 33 MHz clock speed due to serious challenges in implementing the 66 MHz design. To understand this problem, look at the timing of the two buses. The 33 MHz bus cycle of 30 ns leaves about 7 ns of setup time for the target. When we double the clock frequency, all values are cut in half. The reduction in setup time is critical, as only about 3 ns remain for the target to respond. As a result of this difficulty, most PCI buses operate at 33 MHz.

PCI-X solves this problem by using a register-to-register protocol, as opposed to the immediate protocol implemented by PCI. In the PCI-X register-to-register protocol, the signal sent by the master device is stored in a register until the next clock. Thus, the receiver has one full clock cycle to respond to the master's request. This makes it possible to increase the frequency to 133 MHz. At this frequency, one clock period corresponds to about 7.5 ns, about the same period allowed for the decode phase in the 33 MHz PCI implementation. We get this increase in frequency by adding one additional cycle to each bus transaction. This increased overhead is more than compensated for by the increase in frequency.
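The timing figures quoted above can be checked directly:

```python
# Checking the clock-period arithmetic behind the PCI vs. PCI-X argument.

def period_ns(clock_mhz: float) -> float:
    return 1000 / clock_mhz

print(round(period_ns(33)))      # 30 ns: the 33 MHz PCI bus cycle
print(round(period_ns(133), 1))  # 7.5 ns: one full PCI-X clock for the target

# Doubling the clock halves the ~7 ns setup window left for the target:
setup_33mhz_ns = 7
print(setup_33mhz_ns / 2)        # 3.5, i.e., only about 3 ns at 66 MHz
```

So the one full PCI-X clock (about 7.5 ns) restores roughly the response window that 33 MHz PCI offered, even though the bus runs four times faster.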


5.24 With the increasing demand for high-performance video due to applications such as 3D graphics and full-motion video, the PCI bus is reaching its performance limit. In response to these demands, Intel introduced the AGP exclusively to support high-performance 3D graphics and full-motion video applications. The AGP is not a bus in the sense that it does not connect multiple devices. The AGP is a port that connects precisely two devices: the CPU and the video card.

To see the bandwidth demand of full-motion video, let us look at a 640 × 480 resolution screen. For true color, we need three bytes per pixel. Thus, each frame requires 640 * 480 * 3 = 920 KB. Full-motion video should use a frame rate of 30 frames/second. Therefore, the required bandwidth is 920 * 30/1000 = 27.6 MB/s. If we consider a higher resolution of 1024 × 768, it goes up to 70.7 MB/s. We actually need twice this bandwidth when displaying video from hard disks or DVDs. This is because the data have to traverse the bus twice: once from the disk to the system memory, and again from the memory to the graphics adapter. The 32-bit, 33 MHz PCI with 133 MB/s bandwidth can barely support this data transfer rate. The 64-bit PCI can comfortably handle full-motion video, but the video data transfer uses half its bandwidth. Since the video unit is a specialized subsystem, there is no reason for it to be attached to a general-purpose bus like the PCI. We can solve many of the bandwidth problems by designing a special interconnect to supply the video data. By taking the video load off the PCI bus, we also improve the performance of the overall system. Intel proposed the AGP precisely for these reasons.
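The bandwidth arithmetic above can be reproduced exactly (using decimal megabytes, which is why the higher-resolution figure comes out at about 70.8 rather than the quoted 70.7 MB/s):

```python
# Full-motion-video bandwidth arithmetic from the answer above.

def video_bandwidth_mb(width: int, height: int, bytes_per_pixel: int = 3, fps: int = 30) -> float:
    """Required bandwidth in MB/s (decimal megabytes)."""
    return width * height * bytes_per_pixel * fps / 1e6

vga = video_bandwidth_mb(640, 480)     # ~27.6 MB/s
xga = video_bandwidth_mb(1024, 768)    # ~70.8 MB/s (quoted as 70.7 above)
print(round(vga, 1), round(xga, 1))
print(round(2 * vga, 1))  # disk-to-memory plus memory-to-adapter doubles the load
```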


5.25 As we have seen in the text, AGP is targeted at 3D graphical display applications, which have high memory bandwidth requirements. One of the performance enhancements AGP uses is pipelining. AGP pipelined transmission can be interrupted by PCI transactions. This ability to intervene in a pipelined AGP transfer allows the bus master to maintain a high pipeline depth for improved performance.


5.26 The STSCHG# signal gives I/O status change information for multifunction PC cards. In a pure I/O PC card, we do not normally require this signal. However, in multifunction PC cards containing memory and I/O functions, this signal is needed to report the status signals removed from the memory interface (READY, WP, BVD1, and BVD2). A configuration register (called the pin replacement register) in the attribute memory maintains the status of the signals removed from the memory interface. For example, since the BVD signals are removed from the memory interface, this register keeps the BVD information to report the status of the battery. When a status change occurs, this signal is asserted. The host can read the pin replacement register to get the status.