
CMP Design Choices
To appear in the grading of CS838

Sam Koblenski Peter McClone

Abstract

In recent years, as feature sizes have decreased, placing multiple processors on a chip has become an increasingly viable option. Obtaining simulation results for this complex design space is extremely time consuming. We propose the use of two methods for narrowing the CMP design space: Plackett and Burman statistical analysis and analytical modeling. Our results show that CMPs complicate the relationship between processor cores and the memory subsystem, and that the number of cores on chip is the most critical design parameter.

SECTION I - INTRODUCTION

As the performance limits of superscalar processors become more evident and relative chip area increases, the industry is looking to alternative ways to exploit parallelism and improve performance. One of the more intriguing possibilities is the Chip Multiprocessor (CMP): as feature sizes decrease, placing multiple processors on a single chip has become increasingly viable. CMPs can be an attractive option for systems designed for high throughput. However, the complexity of even a small multiprocessor design is aggravated significantly by placing the entire system on a single chip.

Designs for such systems have been proposed [4][1] using simulation data as a reference for design decisions. While accurate, obtaining results from even the best multiprocessor simulators is time consuming, and the simulations must be repeated for each design. This time penalty often limits the number of simulations that can be performed, which in turn can limit the design.

Faced with this obstacle, we propose the use of two methods for exploring the CMP design space: Plackett and Burman statistical analysis [6][5] and analytical modeling. Through limited simulation runs, a Plackett and Burman (PB) Design can determine which design parameters are most important to performance, while analytical modeling provides the advantage of obtaining performance estimates for a wide range of systems by running simple, fast computations.

Together, these methods can provide cross-validation by showing that the same parameters found to be most critical in a PB Design can quickly be shown to provide the greatest changes in performance in the analytical model. In addition, the PB Design can provide the parameters necessary to quickly develop an accurate analytical model. Since the PB Design uses extreme values of design parameters, it can also verify the analytical model at design corners, where the model may be most likely to break down. Unfortunately, due to time constraints, none of these potential advantages could be exercised in this project.

Assumptions

The time constraints of this project also forced several assumptions. Due to the current inability of the Multifacet simulator to model an in-order core, an in-order core was approximated as an out-of-order core with a small issue window: two entries for a 1-way issue pipeline, and four entries for a 4-way issue pipeline. This approximation should be refined with an in-order addition to the simulator in future work. We assumed a die size of 300 mm² in a 65 nm technology. According to [4], this die size can accommodate a 16 MB L2 cache, and core sizes can be related directly to cache sizes using the idea of the Cache Byte Equivalent (CBE): a particular core organization occupies an amount of die area that can be equated to the number of bytes of L2 cache that would occupy the same area. Estimates of core sizes from that work were used to approximate the core sizes used here, listed in Table 1.

Table 1 includes the pipeline organization, pipeline width, Instruction Queue size, pipeline depth, and integer and floating point ALU counts. The Int/Float ALUs column specifies the number of each type of ALU. Notice that the pipeline depths are the same for single-issue in-order and out-of-order cores. Since the in-order core is an approximation using an out-of-order core, the simulator still executes the dispatch and issue stages, and they cannot be excluded. However, pipeline depth would not affect the simulations significantly at these depths.

Organization   Pipeline Width   IQ Size   Pipeline Depth   Int/Float ALUs   CBE
In-Order       1                2         8                1                50 KB
In-Order       4                4         10               4                100 KB
Out-of-Order   1                16        8                2                75 KB
Out-of-Order   4                64        13               4                250 KB

Table 1: Cache Byte Equivalents of studied pipeline organizations

The 4-way out-of-order core size, not including the L1 caches, was used unmodified. The 1-way out-of-order core size was approximated by reducing the number of functional units and the Instruction Window size to values appropriate for a single-issue core. Area approximations from [3][2] were used to subtract the sizes of the removed functional units and branch predictor from the core size, and to estimate the size of the Instruction Window. The sizes of the register files, branch predictor, and functional units were removed from the core size; the remaining Instruction Window area was reduced by a factor of 4, on the reasoning that the window's area increases quadratically with the number of entries. Adding the other component sizes back into the core size gives the final estimate.

For the in-order core sizes, the 2-way core size estimate was used as a starting point: the window size and integer ALU count were reduced, while the branch predictor was increased to equal that of the out-of-order core. We did not want to vary the predictor in this study, and even though a 16 kB predictor may seem unreasonable in an in-order pipeline, it was retained so as not to taint the results. This modification resulted in the same CBE estimate for the single-issue core. The instruction window and functional units of the 4-way core were increased to give its CBE estimate.

These core sizes were used to find appropriate L2 cache sizes for each simulation configuration. The aggregate core size was subtracted from the total die size, and the remaining die area was devoted to L2 cache. Since arbitrary cache sizes could not be simulated, the discrete sizes of 4, 8, and 12 MB were used. Reaching a 12 MB cache required a non-power-of-two associativity, so the associativity varies slightly between the 12 MB configurations and the others; this should not change the results significantly.
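The sizing arithmetic above can be sketched in a few lines, using the CBE estimates from Table 1 and the 16 MB die budget from the assumptions. Charging each core for two L1 caches (split instruction/data) is an assumption of this sketch, not a detail spelled out in the text.

```python
# Sketch of the L2 sizing described above. CBE values come from Table 1;
# charging each core for two L1 caches (split I/D) is an assumption of
# this sketch.

CBE_KB = {
    ("in-order", 1): 50,
    ("in-order", 4): 100,
    ("out-of-order", 1): 75,
    ("out-of-order", 4): 250,
}
TOTAL_CBE_MB = 16  # a 300 mm^2 die at 65 nm holds ~16 MB of L2 [4]

def l2_size_mb(org, width, n_cores, l1_kb):
    """Subtract the aggregate core area from the die budget and round
    down to the nearest available discrete L2 size (4, 8, or 12 MB)."""
    per_core_kb = CBE_KB[(org, width)] + 2 * l1_kb  # core + I/D L1s
    remaining_mb = TOTAL_CBE_MB - n_cores * per_core_kb / 1024
    return max(s for s in (4, 8, 12) if s <= remaining_mb)

# Two 4-way out-of-order cores with 128 kB L1s leave room for 12 MB of L2.
print(l2_size_mb("out-of-order", 4, 2, 128))  # -> 12
```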

Several assumptions were also made to simplify the analytical model. The time between memory requests of a non-stalled processor was assumed to be exponentially distributed. Processor cores were assumed homogeneous: each core in the model exhibits the same average behavior with respect to service times and miss rates. Finally, each doubling of the cache size was assumed to change the miss rate by a factor of 1/√2.

The rest of this paper is organized as follows. Section II discusses the Plackett and Burman method. Section III discusses an analytical model and two solutions. Section IV presents results, and Section V concludes.

SECTION II - METHOD 1: PLACKETT AND BURMAN DESIGN

Background and Motivation

Exploring a design space consisting of numerous parameters, each with a range of legitimate values, is inherently difficult and time intensive. Given N parameters and L values for each parameter, a full exploration of the design space would require L^N simulations. Simulation time quickly becomes excessive, and the complete exploration impractical. In 1956 Plackett and Burman [6] published a method of gathering useful information from these multifactorial designs without resorting to an exponentially large number of experiments. Instead, the PB Design method requires only N+1 simulations for N parameters to find the parameters that impact performance the most. The designer can then focus simulation time on these parameters, since maximizing the performance of the system with respect to them will have the greatest effect. Almost 50 years later, the ARCTiC group at the University of Minnesota [7] was the first to apply this statistical technique to architecture design, using the SimpleScalar Toolset parameters.

The PB Design uses extreme values of parameters in the simulation runs. Extreme values are values at or just outside the normal range of parameter values used in realistic designs. The design must also have a measure of performance, such as execution time or, in the case of this study, throughput. Finally, the number of parameters must be one less than a multiple of four; thus seven is a valid number of parameters, since eight is a multiple of four. If a design does not meet this criterion, dummy parameters can be added to meet the restriction; they are simply not varied in the simulations. However, the full number of simulations for the real and dummy parameters combined must still be run.

To generate the simulation configurations, Plackett and Burman provide the configuration for the first simulation for designs ranging from seven parameters up to 99 parameters [6]. For example, the configuration for a seven-parameter design is + + + - + - -, where '+' denotes the high extreme value and '-' the low extreme value for each parameter. Each of the next six configurations is a right rotation of the previous one, and the last simulation consists of all '+'s. It was later found that inverting all of these configurations and running 2(N+1) simulations improves the accuracy of the method [5].
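The construction just described can be sketched in a few lines: the seven-parameter seed row quoted above, successive right rotations, a final all-'+' run, and the sign-inverting foldover from [5].

```python
# Sketch of the PB matrix construction described above: seed row,
# right rotations, an all-'+' run, then the foldover (all signs
# inverted), for 2(N+1) runs in total.

def pb_matrix(first_row, foldover=True):
    n = len(first_row)
    rows = [list(first_row)]
    for _ in range(n - 1):
        prev = rows[-1]
        rows.append(prev[-1:] + prev[:-1])  # right rotation of previous row
    rows.append(["+"] * n)                  # final all-high run
    if foldover:
        flip = {"+": "-", "-": "+"}
        rows += [[flip[s] for s in row] for row in rows]
    return rows

seed7 = list("+++-+--")  # Plackett and Burman's row for seven parameters
design = pb_matrix(seed7)
print(len(design))  # -> 16, i.e. 2 * (7 + 1) runs
```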

Using the performance metric from the results of each simulation, the designer can assemble an importance ranking of the parameters. For each parameter, the performance result of a run is added to a running total if the parameter was at its high value and subtracted if it was at its low value. If the performance metric is to be maximized, the final ranking is in descending order of the magnitudes of the running totals, with the largest-magnitude total corresponding to the parameter most critical to performance. If the performance metric is to be minimized, the ranking is reversed.
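This tally can be sketched as follows, using a toy two-parameter design and made-up throughput numbers purely for illustration (a real PB design would use the N+1-run matrices described above).

```python
# Sketch of the PB importance ranking: signed running totals per
# parameter, then descending order of magnitude. The design matrix and
# throughput values below are illustrative only.

def pb_effects(design, metric):
    """Add the metric when a parameter was high, subtract when low."""
    effects = [0.0] * len(design[0])
    for signs, y in zip(design, metric):
        for j, s in enumerate(signs):
            effects[j] += y if s == "+" else -y
    return effects

def rank(effects, labels):
    """Importance ranking: descending magnitude of the running totals."""
    order = sorted(range(len(effects)), key=lambda j: abs(effects[j]),
                   reverse=True)
    return [labels[j] for j in order]

design = [list("+-"), list("-+"), list("++"), list("--")]
throughput = [9.0, 3.0, 10.0, 2.0]        # made-up metric values
effects = pb_effects(design, throughput)  # [14.0, 2.0]
print(rank(effects, ["A", "B"]))          # -> ['A', 'B']
```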

The Plackett and Burman Design method is intuitively satisfying: no configuration is repeated, and varying the parameters at their extremes naturally shows how large an effect each parameter has on performance. To maximize the effectiveness of the method, Plackett and Burman place several constraints on the initial configuration to make sure that the simulations are representative of the corners of the design space. Assuring that each configuration is suitably distinct further strengthens confidence that the simulations are uncovering the quantitative effect of each parameter on system performance. These constraints result in the restriction on the number of parameters noted above, plus a unique initial configuration for each parameter count N, which Plackett and Burman enumerate.

The Design

One disadvantage of the PB Design is that the extreme values of the parameters must be chosen carefully. If the range of any parameter is much larger than its typical design range, the importance of that parameter will be artificially inflated in the results due to its increased effect on system performance. The judicious choice of high and low parameter values is an integral part of a properly constructed PB Design.

Table 2 lists the CMP parameters explored in this study. Some of these parameter choices deserve a short explanation. The pipeline organization has a seemingly restricted range, since it can only assume two values, but the Instruction Queue was given a large range to compensate for this constraint, as shown in Table 1. The large cache associativities for the extreme high values may have been overkill, but since increasing associativity above 8-way does not change cache performance dramatically, this choice will not affect the results significantly. Additionally, the high and low associativities each have two values due to the 12 MB L2 cache size in some configurations. Varying the low value for associativity will have a greater effect on the results than varying the high value, but this configuration artifact could not be avoided. Also, the L2 cache size depends on the core configuration, so it is not an independent parameter and does not have a reference label associated with it.

Reference   Parameter               Low Value (-)       High Value (+)
A           Number of Cores         2                   16
B           Pipeline Organization   In-Order            Out-of-Order
C           Pipeline Width          1                   4
D           L1 Cache Size           16 kB               128 kB
E           L1 Associativity        1-way               32-way
            L2 Cache Size           Die Area - Core Area
F           L2 Associativity        1-way / 3-way       24-way / 32-way
G           L2 Banks                2                   32
H           L2 Latency              50 cycles           12 cycles
I           L2 Directory Latency    25 cycles           6 cycles
J           Link Bandwidth          25.6 B/cycle        640 B/cycle
K           Memory Latency          300 cycles          100 cycles

Table 2: Low and High Values of Parameters

Run   A  B  C  D  E  F  G  H  I  J  K   L2 $
01    +  +  -  +  +  +  -  -  -  +  -   8 MB
02    -  +  +  -  +  +  +  -  -  -  +   12 MB
03    +  -  +  +  -  +  +  +  -  -  -   8 MB
04    -  +  -  +  +  -  +  +  +  -  -   12 MB
05    -  -  +  -  +  +  -  +  +  +  -   12 MB
06    -  -  -  +  -  +  +  -  +  +  +   12 MB
07    +  -  -  -  +  -  +  +  -  +  +   12 MB
08    +  +  -  -  -  +  -  +  +  -  +   12 MB
09    +  +  +  -  -  -  +  -  +  +  -   8 MB
10    -  +  +  +  -  -  -  +  -  +  +   12 MB
11    +  -  +  +  +  -  -  -  +  -  +   8 MB
12    +  +  +  +  +  +  +  +  +  +  +   4 MB
13    -  -  +  -  -  -  +  +  +  -  +   12 MB
14    +  -  -  +  -  -  -  +  +  +  -   8 MB
15    -  +  -  -  +  -  -  -  +  +  +   12 MB
16    +  -  +  -  -  +  -  -  -  +  +   12 MB
17    +  +  -  +  -  -  +  -  -  -  +   8 MB
18    +  +  +  -  +  -  -  +  -  -  -   8 MB
19    -  +  +  +  -  +  -  -  +  -  -   12 MB
20    -  -  +  +  +  -  +  -  -  +  -   12 MB
21    -  -  -  +  +  +  -  +  -  -  +   12 MB
22    +  -  -  -  +  +  +  -  +  -  -   12 MB
23    -  +  -  -  -  +  +  +  -  +  -   12 MB
24    -  -  -  -  -  -  -  -  -  -  -   12 MB

Table 3: Plackett and Burman Design Matrix showing simulation configurations for each of 24 simulation runs


The selection of values for the number of L2 banks came into question during the later stages of the project. It is clear now that the number of banks should have been made partially dependent on the number of cores, by making it a multiple of the core count. Undoubtedly, implementing a design with 16 processors and only two L2 cache banks, as some configurations suggest, would be reckless. Lastly, the link bandwidth is the bandwidth supported by each connection in the memory hierarchy. In future work this link bandwidth could be broken into its constituent parts: L1 cache bandwidth, L2 cache link bandwidth, directory bandwidth, and pin bandwidth. In addition, the effects of prefetching, TBE size, outstanding request capacity, L2 network structure, and L2 inclusion/non-inclusion on performance could be studied.

The reference labels in Table 2 correspond to the column labels in Table 3, which specifies the simulation configurations for the 11-parameter PB Design used in this study. Only one core configuration was so large that no more than 4 MB of L2 cache fit on the die. However, this anomaly did not have much effect on the results, as discussed in Section IV. The three discrete L2 cache sizes further complicated matters by leaving some percentage of empty space on the 300 mm² die, varying with the configuration. This problem could be alleviated in future work by refining the cache size for each configuration, using a wider variety of associativities.

Simulation Methodology

After getting up to speed on the Multifacet simulator using Ruby and Opal, we selected the OLTP benchmark for generating the first set of results because OLTP does a good job of stressing the memory subsystem. In a CMP, the design of the memory subsystem is believed to be paramount to good performance, with off-chip bandwidth to memory becoming a critical resource as the number of cores on the chip increases. Unfortunately, pin bandwidth was not a parameter that could be changed easily without learning more about the simulator, so varying it independently will have to be postponed to future work. Nonetheless, the link bandwidth parameter does encompass pin bandwidth, and we expect the OLTP results to partially verify that intuitive conclusion.

Initially we found that the OLTP benchmark did not complete for all simulation runs. The total memory usage of a simulation would increase continuously until Condor killed the job at about 2.8 GB. All simulations showed this increase in memory, and about half could not complete. We then moved to SPECjbb to see if we could produce results with a different benchmark. SPECjbb does not stress the memory system as much, but it does run quickly. Again, we could not get all simulations to complete. After reducing the number of transactions, though, we were able to get a complete set of throughput numbers. Reducing the number of transactions for OLTP also worked, for the most part. A couple of simulations did not complete and appeared to be experiencing a different problem: memory usage would freeze and the job would run for days with no apparent change. This difficulty is discussed further in Section IV.

Benchmark   Processors   Transactions
OLTP        2            200
OLTP        16           100
JBB         2            20000
JBB         16           10000

Table 4: Number of Transactions run for combinations of benchmarks and number of processors

Table 4 shows the number of transactions run for each benchmark and processor count. The runs with 16 processor cores used cache warmup files, and consequently needed to be run for fewer transactions due to the absence of warmup effects. The two-processor simulations were run for as many transactions as could be completed for all configurations, or almost all in the case of OLTP. Multiple runs for each configuration were not performed due to time constraints. It would be beneficial to run each simulation ten times, both to reach a reasonable number of transactions for each benchmark and to compute the variation across runs. Simulating other available benchmarks would allow more analysis of parameter characteristics, and generating (or obtaining access to) cache warmup files for two-processor configurations would permit stronger assertions because the simulation runs would be uniform. These improvements would help answer the question at hand more definitively, but for the purposes of this study, the current results suffice.

SECTION III - METHOD 2: ANALYTICAL MODELING

The analytical model is composed of three types of service centers and n job classes, where n is the number of cores in the CMP. Customers within the model are memory requests. The first type of service center is the core/L1 center. Each of these centers has an exponentially distributed service time with mean µ. The second center is the L2. The service time at this center is deterministic (L2 latency) but has a variable queue length. The final center is the Memory. This service time is also deterministic, and it is perfectly pipelined such that there is no queuing. Each job class is identical in nature, except for which core/L1 service center the job class returns to. The separate job classes guarantee that there is no queuing at the core/L1 centers.

The interconnection of these centers into the model is shown in Figure 1. Customers spend exp(µ) time at their respective core/L1 centers and with a probability equal to the L1 hit rate, immediately return and begin a new service. With a probability equal to the L1 miss rate (1 - hit rate) a customer will continue to the shared L2 center. All customer classes (all customers from each L1/core) continue to the same L2 center. Here they experience an average wait time W in the queue and a deterministic service time (L2 latency). After this service time a customer will return to its core/L1 with probability equal to the L2 hit rate or continue to the Memory with probability equal to the L2 miss rate (1 – hit rate). Finally, at the Memory center it will experience a deterministic service time with no wait and return immediately to its respective core/L1. This is a simple closed network that can be solved using mean value analysis (MVA) equations.

Figure 1: Interconnection of the service centers in the closed queueing network model

Exact MVA Model

To find the throughput of this model, we must solve for the residence time of a customer at each center. The inputs to the model are the mean core/L1 service time µ, L1 miss rate M1, L2 latency, L2 miss rate M2, and average off-chip memory latency. With these inputs, solving the model for one customer is simple: average throughput is 1 divided by the sum of the average residence times.

The solution for any number of customers can be built up from a simple base case. The residence time at a center p for n customers (R_{n,p}) is its service time D_p plus the service time of all the customers ahead of it in the queue when it arrives. Since a customer can never be ahead of itself in the queue, the average number it sees is the average queue length when there are n-1 customers in the model, Q_{n-1,p}. Thus the solution can be computed incrementally for each additional core/L1 in the system.

R_{n,p} = D_p * (1 + Q_{n-1,p})

Since exactly one request is generated by each core, the queue length at the processor is always 0. Also, since the memory is modeled as perfectly pipelined, its queue length is also always 0.

R_{n,core} = D_core * (1 + 0) = D_core

R_{n,M} = D_M * (1 + 0) = D_M

Once the residence times are known, the average queue lengths (to be used in computing n+1 customers) can be solved.

X_n = n / (Z + R_{n,core} + R_{n,L2} + R_{n,M})

Q_{n,L2} = X_n * R_{n,L2}

Since the L2 and Memory service centers are only visited a fraction of the time, their service demands are reduced accordingly: each D is scaled by the respective miss rates.

D_L2 = M1 * (L2 latency)

D_M = M1 * M2(n) * (average off-chip memory latency)

The L2 miss rate, as noted above and in Section II, is a function of the number of core/L1 centers. Since each additional center takes area away from the L2, the miss rate increases with each core added. The base miss rate is measured from a base simulation, and M2(n) is then calculated using the stated assumption that doubling the cache scales the miss rate by 1/√2:

MissRate_{n,L2} = M0_miss / (√2)^(i_n)

i_n = log2[1 - (n - 8) * (core area) / (base L2 area)]

This calculation of i_n from the core and base L2 areas yields the incremental change in miss rate implied by the assumption.
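The recursion and miss-rate scaling above can be sketched as follows. All numeric inputs (service times, miss rates, areas) are hypothetical placeholders, not the paper's measured values; the 8-core baseline corresponds to the n-8 term in the formula above.

```python
import math

# Sketch of the exact MVA recursion described above. Only the L2 can
# queue; the core/L1 centers and the perfectly pipelined memory are
# delay centers. All numeric inputs below are hypothetical placeholders.

def l2_miss_rate(n, m0, core_area, base_l2_area):
    """M2(n): base miss rate m0 measured at the 8-core baseline; each
    cache doubling scales the miss rate by 1/sqrt(2)."""
    i = math.log2(1 - (n - 8) * core_area / base_l2_area)
    return m0 / (math.sqrt(2) ** i)

def exact_mva(n, mu, m1, l2_lat, mem_lat, m2, Z=0.0):
    d_core = mu                # mean core/L1 service time
    d_l2 = m1 * l2_lat         # demand scaled by the L1 miss rate
    d_mem = m1 * m2 * mem_lat  # reached only on L1 and L2 misses
    q_l2 = 0.0
    x = 0.0
    for k in range(1, n + 1):
        r_core = d_core                # R_{k,core} = D_core
        r_l2 = d_l2 * (1 + q_l2)       # R_{k,L2} = D_L2 * (1 + Q_{k-1,L2})
        r_mem = d_mem                  # R_{k,M} = D_M
        x = k / (Z + r_core + r_l2 + r_mem)
        q_l2 = x * r_l2                # Q_{k,L2} by Little's law
    return x                           # system throughput

m2 = l2_miss_rate(16, m0=0.2, core_area=0.5, base_l2_area=12.0)
print(exact_mva(16, mu=50.0, m1=0.05, l2_lat=20.0, mem_lat=200.0, m2=m2))
```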

With this solvable set of equations established, throughput can be calculated and compared to simulation results to determine the validity of the model. However, due to the inaccuracy of these results (discussed in Section IV), a second solution to the model became necessary.

Approximate MVA

The inability of the exact MVA equations to produce accurate results pointed toward a flaw in the model's computation of each customer's residence time at a core/L1 service center. Maintaining the assumption that this service time distribution is exponential, an iterative method can be used to converge on a service time that matches the model's total residence time to that measured in simulation. Approximate mean value analysis (AMVA) suits such a method because it finds a solution for throughput and mean queue length iteratively.

The AMVA solution is found by initially guessing the mean queue length for each center (in this case, only the L2 center has a nonzero queue length) and solving for mean residence time and throughput. This solution is then used to compute a better guess at the mean queue lengths, and the model is solved again. These iterations continue until the queue length guesses converge and an approximate result is found. The procedure can be extended to also match the total mean residence time to an input parameter: the measured total mean residence time from simulation. The AMVA algorithm is as follows:


while (!done) {
    while (cont) {
        cont = false
        totalR = 0
        for (each service center k) {
            oldQ[k] = Q[k]
            A[k] = ((n - 1) / n) * oldQ[k]
            R[k] = D[k] * (1 + A[k])
            totalR += R[k]
        }
        X = n / totalR
        for (each service center k) {
            Q[k] = X * R[k]
            if (|(Q[k] - oldQ[k]) / Q[k]| > 0.01)
                cont = true
        }
    }
    cont = true
    done = true
    if (|totalR - realR| / realR > 0.0001) {
        done = false
        D1 = D1 - D1 * (totalR - realR) / realR
    }
}
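A runnable rendering of this calibration loop, under the assumptions above: only the L2 queues, while the core/L1 and memory act as delay centers, and the outer loop adjusts the core/L1 demand D1 until the model's total residence time matches the measured value. The demands and measured residence time below are hypothetical placeholders.

```python
# Runnable sketch of the AMVA calibration above. Demands and the
# measured residence time (real_r) are hypothetical placeholders.

QUEUEING = [False, True, False]  # core/L1, L2, memory: only the L2 queues

def amva(n, demands, tol=1e-2, max_iter=10000):
    """Schweitzer approximate MVA; returns (throughput, residence times)."""
    q = [0.0] * len(demands)
    x, r = 0.0, list(demands)
    for _ in range(max_iter):
        r = [d * (1 + (n - 1) / n * qk) if qu else d
             for d, qk, qu in zip(demands, q, QUEUEING)]
        x = n / sum(r)
        new_q = [x * rk if qu else 0.0 for rk, qu in zip(r, QUEUEING)]
        if all(abs(a - b) <= tol * max(a, 1e-9) for a, b in zip(new_q, q)):
            break
        q = new_q
    return x, r

def calibrate(n, demands, real_r, tol=1e-4, max_iter=1000):
    """Outer loop above: adjust the core/L1 demand D1 until the model's
    total residence time matches the measured value real_r."""
    d = list(demands)
    x = 0.0
    for _ in range(max_iter):
        x, r = amva(n, d)
        err = (sum(r) - real_r) / real_r
        if abs(err) <= tol:
            break
        d[0] -= d[0] * err  # shrink or grow D1 toward the target
    return d, x

# Hypothetical demands (core/L1, L2, memory) and measured residence time.
d, x = calibrate(4, [50.0, 1.0, 2.5], real_r=60.0)
```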

SECTION IV - RESULTS

Plackett & Burman Design

The results for SPECjbb are discussed first because the data for this benchmark is complete, and it is much easier to draw conclusions from complete data. The results gathered so far for OLTP are also presented to discuss a few first impressions, but we tread lightly with that analysis since the results from four runs are missing. Figure 3 shows the relative importance of the 11 parameters explored for SPECjbb.

Figure 3: The relative importance of the CMP parameters varied while simulating the SPECjbb benchmark

Clearly, the number of processor cores on the chip is a critical parameter. The extreme difference between this parameter and the rest brings a couple of things into question. First, one of the known issues with the PB Design may have come into play here: the number of cores is varied between 2 and 16, a factor of eight, while most other parameters do not vary by nearly the same amount, so the large variation in the number of cores could be accentuating the parameter's perceived importance. However, even if varying the cores between 4 and 16 cut the relative importance in half (and it is doubtful such a change could have a greater effect than that), it would still be the principal parameter by about two points. Second, the importance of the parameters does depend on the benchmark, and SPECjbb has relatively low memory system requirements compared to benchmarks such as OLTP. Maximizing the number of cores at the expense of other parameters appears to be the way to maximize SPECjbb performance.

The number of L2 banks and the link bandwidth also lead the remaining parameters in importance. The significance of link bandwidth is intuitively satisfying, while alternative causes may be hidden behind the way the number of L2 banks was varied. Increasing the L2 bank count increases the aggregate bandwidth of the L2 cache network, but it also increases the maximum latency to service a request. Similar effects follow from increasing the number of cores, so a CMP with 16 cores and 2 L2 banks will behave completely differently from one with 16 L2 banks, and attributing the difference solely to the change in bank count is not exactly correct. Making the high and low values for the number of banks dependent on the number of cores would partially alleviate this artifact; carefully designing a custom L2 network for each configuration would help even more. In fact, making the type of L2 network a parameter and exploring its importance would be intriguing as well, and is being considered as future work.

To wrap up this discussion, it is interesting to note that the L2 latency comes in last of all parameters in importance. This outcome is not likely due to an under-emphasis of the parameter, since it was varied by almost a factor of five. Two effects working in tandem can account for it: first, enough parallelism exists to easily tolerate the L2 latency, and second, the L1 miss rate is relatively low for SPECjbb. The latter explanation is supported by the higher importance of the L1 cache size: decreasing the miss rate of the L1 cache makes the L2 latency insignificant. Also noteworthy is the low importance of the pipeline organization and width. Most likely this area is better spent on extra L2 cache for those times when requests miss in the L1, since the bulk of the available parallelism exists at a coarser granularity than those techniques can exploit.

Figure 4: The relative importance of the CMP parameters varied while simulating the OLTP benchmark

Figure 4 shows the preliminary relative importance of the parameters for the OLTP benchmark. It appears that once again the number of cores is among the critical parameters, although not to the same extent as for SPECjbb. This prediction is likely to hold, since the four simulations that did not complete are all 16-processor configurations, so the importance of the number of cores will increase with each additional completed simulation; no other parameter would exhibit an increase for all four simulations. According to this graph, the number of L2 banks is also highly significant, yet the graph plots the magnitudes of the calculated values, and the L2 bank total is actually negative. Given that three of the four remaining simulations have high L2 bank counts, its relative importance will tend toward zero. Without the finished simulation runs, not much more can be said about the trends in these results with any certainty.

Analytical Model

The exact MVA solution was able to compute throughput and residence time estimates. These estimates, however, were wildly inaccurate. Computed with parameters matching several simulation runs, the model solution produced large errors, with no discernible pattern. Initially it was hypothesized that the error resulted from an incorrect calculation of the mean service time for the core/L1 service center. This led to the development of the AMVA solution.

In the AMVA solution, an attempt was made to converge upon this mean service time so as to match simulation results. Convergence did not occur, however, yielding no solution and no throughput estimate. In executing the modified AMVA algorithm given above, the core/L1 residence time dropped to zero without the total residence time ever matching the measured value. This result eliminated the mean service time calculation as the source of error, leaving two other possibilities: the input parameters to the model, or the structure of the model itself. Since the model solution can be computed in negligible time, a wide variety of parameters could be tried to rule out parameter choices as the cause; all of these attempts failed to converge. Thus we conclude that the structure of the model is the source of the error.
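The failure mode described here can be illustrated with a simple calibration loop: adjust the core/L1 service demand until the model's total residence time matches a measured target, reporting failure when no demand can match it. The helper, names, and numbers below are illustrative assumptions, not the study's actual AMVA formulation:

```python
def total_residence(demands, n):
    """Total residence time from exact MVA for n circulating requests."""
    queue = [0.0] * len(demands)
    for k in range(1, n + 1):
        resid = [d * (1.0 + q) for d, q in zip(demands, queue)]
        x = k / sum(resid)
        queue = [x * r for r in resid]
    return sum(resid)

def calibrate_core_demand(measured_total, other_demands, n,
                          max_iters=500, tol=1e-6):
    """Search for a core/L1 demand whose modeled total residence time
    matches measured_total; return None on nonconvergence."""
    demand = measured_total          # initial guess for the core/L1 demand
    for _ in range(max_iters):
        total = total_residence([demand] + other_demands, n)
        if abs(total - measured_total) < tol:
            return demand
        demand *= measured_total / total   # proportional adjustment
    return None

# A target below the minimum imposed by the other centers cannot be matched:
# the core/L1 demand is driven toward zero while the total residence time
# stays above the target, so the loop never converges.
print(calibrate_core_demand(2.5, [0.4], 8))   # None
```

This mirrors the behavior observed with the modified AMVA algorithm: when the model's structure cannot reproduce the measured residence time at any service-time setting, the iteration drives the core/L1 term to zero without ever closing the gap.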

SECTION V - CONCLUSION

In this study of the importance of CMP parameters, we have seen that the Plackett and Burman Design methodology puts more strain on the simulator than a typical study in which realistic system configurations are simulated. The PB Design hits the corners of the design space, configurations for which the simulator may not have been optimized, or even thoroughly tested. Moreover, making changes to a PB Design during its execution is extremely time-consuming, especially because changing parameter values or adding new ones entails discarding data from any simulations that have already completed. Getting the design of the experiment right the first time, or within as few iterations as possible, is even more essential here than in a typical architecture study.

The large number of simulations further aggravates these problems, but the benefits can far outweigh the costs. Once the statistical design is correct and the data has been collected, invaluable system trends can be discovered, allowing simulation time to be focused on the design parameters that will have the greatest impact on system performance. In the case of this study, the number of processor cores, the number of L2 banks, and the L2 cache network link bandwidth are among the most important parameters to optimize. Refining the PB Design in the numerous ways discussed in this paper and exploring additional CMP design parameters will make significant progress toward a definitive answer to the question of which parameters matter in CMP design.

Similarly, analytical modeling can provide key insights into the trends in CMP performance. The two model solutions explored here offer potential for alleviating the cost of simulating a vast design space. Though neither the exact MVA solution nor the AMVA solution produced throughput that matched simulation results, some useful observations can be drawn from them. The failure of the model demonstrates that there is a complex relationship between the memory system and the rate at which a core issues requests, and that this relationship must be captured in an analytical model to produce accurate results. It is clear that CMPs complicate this relationship between the cores and the memory subsystem.

References


[1] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. Piranha: A scalable architecture based on single-chip multiprocessing. In The 27th Annual International Symposium on Computer Architecture, pages 282–293, June 2000.

[2] A. N. Eden and T. Mudge. The YAGS branch prediction scheme. In The 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 69–77, December 1998.

[3] S. Gupta, S. W. Keckler, and D. Burger. Technology independent area and delay estimates for microprocessor building blocks. Technical Report 2000-5, Department of Computer Sciences, University of Texas at Austin, April 2000.

[4] J. Huh, D. Burger, and S. W. Keckler. Exploring the design space of future CMPs. In The 10th International Conference on Parallel Architectures and Compilation Techniques, pages 199-210, September 2001.

[5] D. C. Montgomery. Design and Analysis of Experiments, Third Edition. Wiley, 1991.

[6] R. Plackett and J. Burman. The design of optimum multifactorial experiments. Biometrika, Vol. 33, Issue 4, pages 305–325, June 1946.

[7] J. Yi, D. Lilja, and D. Hawkins. A statistically rigorous approach for improving simulation methodology. In The Ninth International Symposium on High-Performance Computer Architecture (HPCA-9), February 2003.