
Evaluating Thread Level Parallelism Based on Optimum Cache Architecture

Mehdi Alipour, Ali Khorramshahi.B, Fatemeh Karimi, Zhila Mirzaei, and Aryo Vaghari Allameh Rafiei Higher Education Institute of Qazvin, Iran

[email protected]

Abstract— With feature-size scaling and the emergence of multi-cores, which are usually multithreaded processors, performance requirements are almost guaranteed to be met. Despite the ubiquity of multi-cores, it remains as important as ever to deliver high single-thread performance. Multithreaded processors, by simultaneously exploiting both the thread-level parallelism and the instruction-level parallelism of applications, achieve a higher instructions-per-cycle rate than single-thread processors. In recent multi-core, multi-thread systems, performance and power consumption are strongly tied to the average memory access time and its power cost, which makes the cache a major component in the design of multi-thread, multi-core embedded processor architectures. In this paper we perform a comprehensive design space exploration to find the cache sizes that create the best tradeoffs between the performance, power, and area of the processor. Finally, we run multiple threads on the proposed optimum architecture to find the maximum thread-level parallelism achievable on a performance-per-power- and area-efficient uni-thread architecture.

Keywords— embedded processor; cache; multithread architecture; hyper-threading; performance per power; MIPS

I. INTRODUCTION

While a general-purpose processor is designed to be flexible and to execute a wide range of applications, embedded processors are designed to perform dedicated functions, often under real-time computing constraints. Embedded systems are used to control many devices in common use today [1]: more than 10 billion embedded processors were sold in 2008 and more than 10.75 billion in 2009 [2]. In recent years, embedded applications and internet traffic have become heavier and more sophisticated; therefore, future embedded processors will face more computation-intensive applications, and designing high-performance embedded processors is inevitable. With feature-size scaling and the emergence of chip multiprocessors (CMPs), which are usually hyper-threaded processors, the user's performance requirements are largely met. Recently, numerous studies have used hyper-threading processors to design fast processors, especially in the network and embedded domains [3-7, 17]. In [18], a Markov model based on fine-grain hardware multithreading is implemented; the analytical Markov model is faster than simulation and its inaccuracy is negligible. In their method, stalled threads are defined as states, and transitions indicate the cache contention between threads [18]. The cache is one of the most important parts of a multithreaded CMP design, because processor performance is strongly tied to cache access. Cache memories are usually used to improve performance and power consumption by bridging the gap between the speed and power consumption of the main memory and the CPU [28]. Therefore, system performance and power consumption are strongly tied to the average memory access time and its power cost, which makes the cache a major component of embedded processor architectures. In [3, 4], cache misses are identified as a factor that reduces memory-level parallelism between threads. Thread criticality prediction has been used in [5-8]; in these methods, to achieve better performance, resources are given to the so-called most critical threads, i.e., those with the most L2 cache misses. To improve packet processing in network processors, [4, 8] apply direct cache access (DCA) techniques.

In [4-9], processor architectures are based on simultaneous multithreading (SMT), and the cache miss rate is used to evaluate the performance improvement. To determine the effect of cache access delay on performance, a comparison between multi-core and multi-thread processors is performed in [10]. Likewise, a victim cache is one approach to improving the performance of a multi-thread processor [5]. Most recent studies compare multithreaded results against single-core, single-thread processors; in other words, multi-thread processors are the heirs of single-thread processors [6-7, 17].

Generally, one of the easiest ways to improve the performance of embedded and network processors is to increase the cache size [6, 7, 14-20], but doing so severely increases the occupied area and power consumption of the processor. It is therefore necessary to find a cache size that creates the best tradeoff between the performance, power, and area of the processor.
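The performance side of this tradeoff follows directly from the standard average-memory-access-time relation of [28]; in our notation, for a two-level hierarchy:

```latex
% Average memory access time for a two-level cache hierarchy,
% following the standard formulation in [28] (notation ours):
\[
\mathrm{AMAT} = t_{L1} + m_{L1}\left(t_{L2} + m_{L2}\, t_{\mathrm{mem}}\right)
\]
% t_{Li}: hit time of cache level i; m_{Li}: local miss rate of level i;
% t_mem: main-memory penalty. Growing a cache lowers its miss rate m but
% raises its hit time t, its energy, and its area, which is exactly the
% tension this paper explores.
```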

From another point of view, in terms of the performance-per-area parameter, higher performance within a specified area budget is one of the most important requirements of a high-performance embedded processor. A weakness of recent studies is that they place no constraint on cache size; their main similarity is that they tolerate the hardware cost created by hyper-threading and place no area or power constraint on other parts to make room for the hardware needed to run multiple threads. Because of the limited area budget of embedded processors, in this paper we find the optimum sizes of the L1 and L2 caches. Embedded processors are generally not implemented as multi-issue architectures with out-of-order (OOO) execution and its renaming logic [7], [11], [20-22], so we set the simulation parameters to satisfy this constraint.


Figure 1. (a) Cache size overlapping. (b) LCE and HCP points. (c) Average performance penalty and energy saving.

II. SIMULATION ENVIRONMENT AND BENCHMARKS

For simulation we used Multi2Sim version 2.3.1 [27], a superscalar, multithreaded, multi-core simulation platform with a six-stage pipeline for the x86 instruction set architecture; it can run programs on a multi-issue platform. We modified and compiled the simulator source code on a 2.4 GHz dual-core processor with 4 GB of RAM and 6 MB of cache running Fedora 10, and set the parameters to create a scalar architecture, which is appropriate for embedded systems.
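As a rough illustration of this constraint (the key names below are ours, not Multi2Sim 2.3.1 option names), the single-thread baseline can be pictured as:

```python
# Illustrative scalar baseline; key names are ours, not Multi2Sim options.
# Forcing every pipeline stage width to 1 makes the superscalar simulator
# behave like the scalar pipelines typical of embedded processors.
baseline_core = {
    "isa": "x86",
    "pipeline_stages": 6,
    "fetch_width": 1,
    "decode_width": 1,
    "issue_width": 1,
    "commit_width": 1,
    "hardware_threads": 1,
}
```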

Because embedded applications are so pervasive, a homogeneous set of applications is not a good choice for design space exploration (DSE). We therefore performed our DSE with a heterogeneous set of applications from PacketBench [25] and MiBench [26]; the selected applications are listed in figure 1.b. Although embedded processors are designed for the specific domain in which they are used, using a heterogeneous application set lets us present an exhaustive DSE for a wide range of applications.

III. EXPLORING THE CACHE ARCHITECTURE SPACE

This part builds on [23, 32]. Alipour et al. [23] performed an exhaustive exploration of cache sizes for embedded applications; however, those works explored the design space considering only performance, and introduced the cache size that yields the fewest cycles when running an embedded application. Their research showed that there is a suitable range of L1 and L2 sizes for heterogeneous embedded applications, and that although performance improves as cache size grows, beyond a threshold it saturates and then decreases. Their proposed range is too large, so in this paper, by considering another important parameter for embedded processors, namely power or energy consumption, we narrow that range to just a few cache sizes for heterogeneous embedded applications. The exploration of [23] reduced roughly 300 cache configurations to 36 (6 sizes for L1 and 6 sizes for L2). In this paper, by considering both the dynamic and static power consumption of each configuration, we reduce the set of cache configurations further.
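A minimal sketch of that exploration loop, assuming a simulate() callback that returns the cycle count for one application on one cache configuration (the candidate size lists below are placeholders, not necessarily the exact six sizes of [23]):

```python
from itertools import product

# Placeholder candidates: 6 L1 sizes x 6 L2 sizes give the 36
# configurations retained by [23]; the actual lists may differ.
L1_KB = [4, 8, 16, 32, 64, 128]
L2_KB = [16, 32, 64, 128, 256, 512]

def best_performance_cfg(app, simulate):
    """Return the (L1, L2) pair with the fewest simulated cycles for one
    application (formalized as the HCP point in the text below)."""
    cycles = {cfg: simulate(app, cfg) for cfg in product(L1_KB, L2_KB)}
    return min(cycles, key=cycles.get)
```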

We calculated the best cache size for each application based on performance evaluation; the best performance for each application is then obtained at that size. From now on we call this cache size the highest cache performance (HCP) point; the HCP point yields the lowest simulated cycle count. The HCPs of all selected embedded applications are shown in the rightmost columns of figure 1.b. The authors of [19] did similar research based on performance, but they did not consider the power constraint of the cache, which is an important concern in embedded processors. To account for power effects, we used the model introduced in [32]. The authors of [32] used CACTI 5.0 [24], an HP tool for extracting parameters relevant to cache size for a given fabrication technology, and introduced a model that considers both dynamic and static power based on detailed parameters such as the number of reads and writes, cache accesses including misses and hits, idle cycles, all cache hierarchy levels (e.g., level-1 and level-2 instruction and data caches), and other related parameters.

The total energy dissipated by a hardware module (here, a cache) is the sum of its dynamic and static energy. In the model of [32], each access consumes energy that depends on the cache configuration and the miss penalty; whether an access hits or misses, every event dissipates some energy [31]. Based on the energy analysis results, we introduce another cache configuration point called the lowest cache energy (LCE) point. LCEs are shown in the middle columns of figure 1.b and indicate the lowest energy consumption for each application. The results in figure 1.b show that for all applications the LCE size is smaller than the HCP size; LCE and HCP are thus the left and right margins of the cache size range, respectively, and together they define a range for L1 and L2 that considers both performance and energy consumption. To give a better sense of these ranges, figure 1.a shows the cache size overlap of all applications for the L1 and L2 ranges. Based on this figure, the L1 (L2) range runs from the minimum L1 (L2) size in the LCE column to the maximum L1 (L2) size in the HCP column; thus L1 ranges from 8 KB to 128 KB and L2 from 16 KB to 128 KB, and we choose the cache sizes with the most overlap across all benchmarks. Based on the performance analysis of [23], there are 36 cache configurations for the selected embedded applications.
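A minimal sketch of this energy accounting, assuming per-event dynamic energies and a leakage power figure taken from CACTI-style tables (the names and units are ours, not the exact model of [32]):

```python
def cache_energy_nj(hits, misses, e_hit_nj, e_miss_nj,
                    p_leak_mw, runtime_s):
    """Total cache energy = dynamic + static, in nanojoules.

    hits, misses        -- event counts from simulation
    e_hit_nj, e_miss_nj -- per-event dynamic energy (CACTI-style)
    p_leak_mw           -- leakage power of this configuration
    runtime_s           -- total execution time of the application
    """
    dynamic = hits * e_hit_nj + misses * e_miss_nj
    static = p_leak_mw * 1e-3 * runtime_s * 1e9   # W * s -> nJ
    return dynamic + static

# The LCE point is then simply the configuration minimizing this total,
# e.g. min(configs, key=lambda c: cache_energy_nj(*stats[c])).
```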

Using this simple overlap algorithm, the 36 cache configurations are reduced to 12. These ranges make an important point: no L1 or L2 size outside this range is recommended, because the right side of the range yields the maximum performance and the left side yields the minimum cache power consumption for the selected embedded applications. Based on [23, 32], the configuration L1 = 32 KB, L2 = 64 KB is an applicable cache size for the selected heterogeneous embedded applications. Figure 1.c shows the average performance penalty (Pp column) of all 12 configurations relative to this size, which is labeled cfg5 in the leftmost columns of figure 1.c. As shown in figure 1.a, we can choose 16, 32, 64, and 128 KB for L1 and 32, 64, and 128 KB for L2, so we have at most 12 points (4 sizes for L1 and 3 for L2) that are candidates for the performance-per-energy-optimal cache sizes of embedded processors. All 12 configurations are listed in figure 1.c. Since performance per energy is one of the most important parameters in designing caches for embedded processors, a cache size is applicable only if it satisfies these constraints. To find the best cache hierarchy for running multiple threads, we explored several other important parameters using the McPAT tool [29], an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multi-core and many-core processor configurations from 90 nm down to 22 nm and beyond; McPAT can model both a reservation-station model and a physical-register-file model based on real architectures.
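For concreteness, the overlap step described above can be sketched as follows, under the simplifying assumption that each application contributes one [LCE, HCP] size interval per cache level (an illustration of ours, not the paper's exact procedure):

```python
def most_overlapping(ranges_kb, candidates_kb, keep):
    """For each candidate size, count how many applications' [LCE, HCP]
    intervals contain it, and keep the `keep` most-covered sizes."""
    cover = {s: sum(lo <= s <= hi for lo, hi in ranges_kb)
             for s in candidates_kb}
    return sorted(sorted(cover, key=cover.get, reverse=True)[:keep])

# e.g. keeping the 4 most-covered L1 sizes and the 3 most-covered L2
# sizes yields the 4 x 3 = 12 configurations examined below.
```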

IV. ANALYSIS OF DIFFERENT CACHE CONFIGURATIONS

A. Leakage power

Figure 2.a shows the percentage of cache leakage power relative to the selected core for all 12 configurations. Based on these results, up to 17% of total core leakage power is consumed in the cache hierarchy. If the most important selection parameter is leakage power, which has not been considered as a separate parameter in recent studies, then cfg1 is the best cache configuration for the selected embedded applications at deep-submicron technology nodes, e.g., 65 nm, 45 nm, or 32 nm. However, based on the MIPS results (figure 2.d) and on [23], this size incurs an average performance penalty of about 7.85% relative to the ninth configuration, which delivers the highest MIPS. The designer must weigh such tradeoffs when selecting a configuration, so the following sections quantify the performance parameters needed to do so.

B. Dynamic power

Figure 2.b shows the percentage of cache dynamic power relative to the core for all 12 configurations. Taken together, figures 2.a and 2.b show that when the selection parameter is dynamic power alone, choosing a configuration is harder than with leakage power, because many configurations have nearly the same relative cache dynamic power. According to figure 2.b, configuration 1 is again the best cache size for the selected embedded applications. Based on the dynamic power results, at the 90 nm feature size up to 40% of core dynamic power is consumed in the cache hierarchy.

C. Total power

Figure 2.c shows the percentage of cache hierarchy total power (leakage + dynamic) relative to the core for all 12 configurations. As in the previous sections, the best cache hierarchy is configuration 1, because it consumes the lowest percentage of core total power. Based on the total power results, at 90 nm up to 32% of core total power is consumed in the cache hierarchy. On the other hand, performance is also one of the most important parameters when tuning a cache configuration for embedded applications, so cost functions that consider power and performance simultaneously deliver better results. In the next section we explore several such parameters, based on the power and performance of all 12 configurations, that are more meaningful for embedded applications.

D. Million instructions per second (MIPS)

As mentioned in previous sections, although power is one of the most important constraints for embedded applications, designers should consider performance and power together. We first use the very popular performance metric MIPS, a measure of the execution speed of a computer's CPU.
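In the usual formulation (notation ours):

```latex
% MIPS from instruction count and execution time, or equivalently
% from clock frequency and cycles per instruction:
\[
\mathrm{MIPS} = \frac{N_{\mathrm{inst}}}{t_{\mathrm{exec}} \times 10^{6}}
              = \frac{f_{\mathrm{clock}}}{\mathrm{CPI} \times 10^{6}}
\]
```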

Figure 2.d compares the MIPS of all 12 configurations. The most important result of this figure is that configurations 5 and 9 (L1 = 32 KB, L2 = 64 KB and L1 = 64 KB, L2 = 128 KB) are the best; to reach the highest MIPS, and thus the highest performance, the designer should select configuration 5 or 9. As mentioned before, from one point of view we want the highest performance per power for the selected embedded applications, and from another we must consider performance per area.

E. Power-delay product (PDP) and MIPS per power

Since power consumption varies with the program being executed, benchmarking is also relevant when assigning an average power rating. To measure power and performance together for a given program execution, we can use a fused metric such as the power-delay product (PDP) or the energy-delay product (EDP). PDP-based formulations are generally more appropriate for low-power portable systems in which battery life is the primary energy-efficiency concern; PDP, being dimensionally equal to energy, is the natural metric for such systems, and a lower PDP means a better architecture for this kind of system.
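Formally (notation ours), the PDP equals the energy of the run, and MIPS per power is, up to a constant, the inverse of the per-instruction PDP:

```latex
% Power-delay product and its relation to MIPS per power:
\[
\mathrm{PDP} = P_{\mathrm{avg}} \cdot t_{\mathrm{exec}} = E, \qquad
\frac{\mathrm{MIPS}}{P_{\mathrm{avg}}}
  = \frac{N_{\mathrm{inst}}}{10^{6}\, E}
  = \frac{10^{-6}}{E / N_{\mathrm{inst}}}
\]
% Minimizing energy per instruction therefore maximizes MIPS per watt.
```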

Figure 2.e shows the PDP results for all 12 configurations. Interestingly, at all feature sizes the minimum PDP is reached at configuration 5 (L1 = 32 KB, L2 = 64 KB). The MIPS-per-power metric is the inverse of the PDP formulation, where delay refers to the average execution time per instruction; configurations with higher MIPS per power are good choices for embedded systems. The results of this metric appear in figure 2.f: consistent with the PDP results, configurations 1 to 5 deliver better MIPS per power, and the best configuration at every explored feature size is configuration 5.


Figure 2. Analysis of 12 cache configurations relative to the core: (a) leakage power, (b) dynamic power, (c) total power, (d) MIPS, (e) PDP, (f) MIPS per power, (g) area, (h) MIPS per area.

F. Area analysis

As mentioned before, area is one of the most important parameters for embedded applications; changing the cache hierarchy configuration changes the area cost of the embedded core. Figure 2.g shows the effect of this parameter. As with the power results, the best point at all feature sizes is configuration 1, but there is a more important result in this figure: at 90 nm, up to 76% of core area is occupied by the cache hierarchy. Another important factor in embedded system design is the MIPS-per-area parameter, which shows the area efficiency of different configurations under the limited area budget of multicore embedded designs. Figure 2.h shows the MIPS-per-area exploration for all 12 cache hierarchies. The maximum MIPS per area is achieved by configuration 1, but, as noted in the performance analysis, this configuration carries a roughly 7.8% performance penalty across the selected embedded applications.
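One way to make such tradeoffs concrete is a simple fused ranking over McPAT-style outputs; this is an illustrative figure of merit of ours, not the paper's selection rule:

```python
def rank_configs(metrics):
    """metrics: {cfg: (mips, power_w, area_mm2)} from McPAT-style runs.
    Rank configurations by the product of MIPS/W and MIPS/mm^2, one
    illustrative way to fuse the two efficiency metrics."""
    def merit(cfg):
        mips, power_w, area_mm2 = metrics[cfg]
        return (mips / power_w) * (mips / area_mm2)
    return sorted(metrics, key=merit, reverse=True)
```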

V. EXPLORING OPTIMUM MULTITHREAD ARCHITECTURE

Based on the optimum cache sizes, namely configuration 5 chosen by considering four important factors in embedded application evaluation, we now introduce an optimum performance-per-power hyper-threaded architecture for the selected embedded applications. That is, we build multithreading on top of the uni-thread processor by running multiple threads on the optimum, area-minimized, single-thread/single-core embedded processor. There are three types of multithreading: interleaved multithreading (IMT), blocked multithreading (BMT), and simultaneous multithreading (SMT) [30].

IMT: An instruction from a different thread is fetched and fed into the execution pipeline at each processor cycle.

BMT: The instructions of one thread are executed successively until an event occurs that may cause latency; this event induces a context switch.

SMT: Instructions are issued simultaneously from multiple threads to the execution units of a superscalar processor, combining wide superscalar instruction issue with the multiple-context approach. Research shows that superscalar SMT delivers the highest performance improvement [3-7], but at the highest hardware cost. We therefore use an SMT architecture to reach the highest performance for embedded applications, but the resources are shared and the parameters are set so that a scalar architecture is simulated: the superscalar width is set to 1 (the issue width and the widths of the other pipeline stages all equal 1). This normalizes the hardware redundancy caused by SMT; here, "SMT" simply means executing instructions from different threads in each cycle, as the sketch below illustrates.
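The toy sketch below (ours, not the simulator's logic) contrasts the three fetch disciplines on a width-1 pipeline; each policy only decides which thread supplies the next instruction.

```python
from itertools import cycle

def imt_fetch(threads):
    """IMT: round-robin, a different thread feeds the pipe each cycle."""
    yield from cycle(threads)

def bmt_fetch(threads, stalled):
    """BMT: stay on one thread until stalled() reports a long-latency
    event (e.g. a cache miss), then context-switch to the next thread."""
    i = 0
    while True:
        if stalled(threads[i]):
            i = (i + 1) % len(threads)
        yield threads[i]

# SMT on a true superscalar issues from several threads in the same
# cycle; with all widths forced to 1, as in this paper, it degenerates
# to choosing one ready thread per cycle, keeping the hardware overhead
# of multithreading minimal.
```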

Table 2. Processor parameters and sharing strategy.

The parameters used to create the multithreaded architecture are listed in table 2. As the table shows, we use a maximum-sharing strategy to reach the highest feasible performance improvement within a limited power and area budget. Because multithreading introduces some hardware redundancy, we share the L1 and L2 caches, the register file, and every other parameter that can be shared between threads. This means using the minimum hardware of the single-thread/single-core architecture, which on average incurs just under a 3% performance penalty while saving up to 7% energy for the selected embedded applications; a sketch of this sharing strategy follows.
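For illustration (key names are ours, not simulator options), the maximum-sharing multithread setup can be pictured as:

```python
# Illustrative maximum-sharing multithread configuration; key names are
# ours. Everything that can be shared is shared, so the multithreaded
# core adds almost no hardware over the scalar baseline.
smt_config = {
    "issue_width": 1,          # scalar pipeline, as in the baseline
    "hardware_threads": 2,     # increased until performance saturates
    "l1_cache": "shared",      # cfg5: L1 = 32 KB
    "l2_cache": "shared",      # cfg5: L2 = 64 KB
    "register_file": "shared",
}
```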

In this paper we answer the question: is it feasible to run multiple threads on a single-thread processor with a limited area and power budget? If so, what is the maximum number of threads that can be executed on this architecture, and what is the maximum performance improvement? The simulation results are shown in figure 3. They show that we can run up to 3 threads on a single-thread, single-core architecture, reaching up to a 31% performance improvement for the susan corners benchmark and a 17% average performance improvement for the selected embedded applications when running 2 threads on a single-thread processor with limited area and power budget.

Figure 3. Effect of multithreading on the area-minimized, optimum uni-thread architecture.


VI. CONCLUSION

In this paper we performed an exhaustive design space exploration to find multithreaded architectural guidelines for embedded applications. We explored the optimum single-thread architecture across different cache configurations, since caches are the most important components of embedded processors, and introduced the optimum cache size based on performance, power, performance per power, area, MIPS, and PDP. We then executed multiple threads on a single-thread processor with a limited area and power budget; the results show that it is feasible to provide multithreading on a uni-thread architecture under such a budget.

Because multithreading introduces hardware redundancy, we first found the cache sizes that keep the processor's area low while incurring only a negligible performance penalty (under 3%) and saving about 7% energy compared to the highest-performance configuration. This creates room for hardware multithreading, so the final processor has very little area overhead when converting a uni-thread processor into a hyper-threaded one. Our explorations show that running two threads on a single-thread processor with limited area and power budget leads, on average, to a 17% performance improvement for the selected representative embedded applications.

VII. REFERENCES

[1] D. D. Gajski, S. Abdi, A. Gerstlauer, and G. Schirner, Embedded System Design: Modeling, Synthesis and Verification, Springer, 2009.
[2] "Embedded processors top 10 billion units in 2008," available online: http://www.vdcresearch.com/_documents/pressrelease/press-attachment-1503.pdf
[3] S. Chaudhry, R. Cypher, M. Ekman, M. Karlsson, A. Landin, S. Yip, H. Zeffer, and M. Tremblay, "Simultaneous speculative threading: a novel pipeline architecture implemented in Sun's Rock processor," International Symposium on Computer Architecture (ISCA), pp. 484-495, 2009.
[4] K. Yi and J.-L. Gaudiot, "Network applications on simultaneous multithreading processors," IEEE Transactions on Computers, pp. 1200-1209, 2009.
[5] C. B. Colohan, A. C. Ailamaki, J. G. Steffan, and T. C. Mowry, "CMP support for large and dependent speculative threads," IEEE Transactions on Parallel and Distributed Systems (TPDS), pp. 1041-1054, 2007.
[6] E. S. Chung and J. C. Hoe, "High-level design and validation of the BlueSPARC multithreaded processor," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 29, no. 10, pp. 1459-1470, 2010.
[7] C. E. Moron and A. Malony, "Development of embedded multicore systems," 16th IEEE Conference on Emerging Technologies & Factory Automation (ETFA), pp. 1-4, 2011.
[8] A. Bhattacharjee and M. Martonosi, "Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors," International Symposium on Computer Architecture (ISCA), pp. 168-176, 2009.
[9] R. Huggahalli, R. Iyer, and S. Tetrick, "Direct cache access for high bandwidth network I/O," International Symposium on Computer Architecture (ISCA), pp. 50-59, 2005.
[10] Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson, and U. C. Weiser, "Many-core vs. many-thread machines: stay away from the valley," IEEE Computer Architecture Letters (L-CA), pp. 25-28, 2009.
[11] H. Wang, I. Koren, and C. Krishna, "Utilization-based resource partitioning for power-performance efficiency in SMT processors," IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 22, no. 99, pp. 191-216, 2010.
[12] N. Davanam and B. K. Lee, "Towards smaller-sized cache for mobile processors using shared set-associativity," International Conference on Information Technology, pp. 1-6, 2010.
[13] C. Bienia, S. Kumar, and K. Li, "PARSEC vs. SPLASH-2: a quantitative comparison of two multithreaded benchmark suites on chip multiprocessors," International Symposium on Workload Characterization (IISWC), pp. 47-56, 2008.
[14] G. Jiang, D. Chen, B. Wu, Y. Zhao, T. Chen, and J. Liu, "CMP thread assignment based on group sharing L2 cache," International Conference on Embedded Computing, pp. 298-303, 2009.
[15] C. McNairy and R. Bhatia, "Montecito: a dual-core, dual-thread Itanium processor," IEEE Micro, pp. 10-20, 2005.
[16] H. Lee, S. Cho, and B. R. Childers, "StimulusCache: boosting performance of chip multiprocessors with excess cache," IEEE 16th International Symposium on High Performance Computer Architecture (HPCA), pp. 1-12, 2010.
[17] N. Davanam and B. K. Lee, "Towards smaller-sized cache for mobile processors using shared set-associativity," International Conference on Information Technology, pp. 1-6, 2010.
[18] Y.-Y. Tsai and C.-H. Chen, "Energy-efficient trace reuse cache for embedded processors," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 9, pp. 1681-1694, 2011.
[19] S. Przybylski, M. Horowitz, and J. Hennessy, "Performance tradeoffs in cache design," 15th Annual International Symposium on Computer Architecture (ISCA), pp. 290-298, 1988.
[20] A. Yamamoto, Y. Tanaka, H. Ando, and T. Shimada, "Data prefetching and address pre-calculation through instruction pre-execution with two-step physical register deallocation," MEDEA-8, pp. 41-48, 2007.
[21] A. Yamamoto, Y. Tanaka, H. Ando, and T. Shimada, "Two-step physical register deallocation for data prefetching and address precalculation," IPSJ Transactions on Advanced Computing Systems, vol. 1, no. 2, pp. 34-46, 2008.
[22] Y. Tanaka and H. Ando, "Reducing register file size through instruction pre-execution enhanced by value prediction," IEEE International Conference on Computer Design, pp. 238-245, 2009.
[23] M. Alipour, M. E. Salehi, and H. Shojaei Baghini, "Design space exploration to find the optimum cache and register file size for embedded applications," Int'l Conf. on Embedded Systems and Applications (ESA'11), pp. 214-219, 2011.
[24] S. Thoziyoor, N. Muralimanohar, and N. P. Jouppi, "CACTI 5.0 technical report," Advanced Architecture Laboratory, HP Laboratories, HPL-2007. Available online: www.hpl.hp.com/research/cacti/
[25] R. Ramaswamy and T. Wolf, "PacketBench: a tool for workload characterization of network processing," 6th IEEE Annual Workshop on Workload Characterization (WWC-6), Austin, TX, pp. 42-50, Oct. 2003.
[26] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, "MiBench: a free, commercially representative embedded benchmark suite," IEEE International Workshop on Workload Characterization, pp. 3-14, 2001.
[27] R. Ubal, J. Sahuquillo, S. Petit, and P. López, "Multi2Sim: a simulation framework to evaluate multicore-multithreaded processors," 19th Int'l Symposium on Computer Architecture and High Performance Computing, Oct. 2007.
[28] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, 2007.
[29] S. Li, J. H. Ahn, J. B. Brockman, and N. P. Jouppi, "McPAT 1.0: an integrated power, area, and timing modeling framework for multicore architectures," available online: http://www.hpl.hp.com/research/mcpat/McPATAlpha_TechRep.pdf
[30] T. Ungerer, B. Robič, and J. Šilc, "A survey of processors with explicit multithreading," ACM Computing Surveys, vol. 35, no. 1, pp. 29-63, 2003.
[31] A. G. Silva-Filho, F. R. Cordeiro, C. C. Araújo, A. Sarmento, M. Gomes, E. Barros, and M. E. Lima, "An ESL approach for energy consumption analysis of cache memories in SoC platforms," International Journal of Reconfigurable Computing, 2011.
[32] M. Alipour, M. E. Salehi, and K. Moshari, "Cache power and performance tradeoffs for embedded applications," International Conference on Computer Applications and Industrial Electronics (ICCAIE), pp. 363-368, 2011.
