

Selecting Microarchitecture Configuration of Processors for Internet of Things

Prasanna Kansakar, Student Member, IEEE and Arslan Munir, Senior Member, IEEE

Abstract—The Internet of Things (IoT) makes use of ubiquitous internet connectivity to form a network of everyday physical objects for purposes of automation, remote data sensing and centralized management/control. IoT objects need to be embedded with processing capabilities to fulfill these services. The design of processing units for IoT objects is constrained by various stringent requirements, such as performance, power, thermal dissipation etc. In order to meet these diverse requirements, a multitude of processor design parameters need to be tuned accordingly. In this paper, we propose a temporally efficient design space exploration methodology which determines power and performance optimized microarchitecture configurations. We also discuss the possible combinations of these microarchitecture configurations to form an effective two-tiered heterogeneous processor for IoT applications. We evaluate our design space exploration methodology using a cycle-accurate simulator (ESESC) and a standard set of PARSEC and SPLASH2 benchmarks. The results show that our methodology determines microarchitecture configurations which are within 2.23%–3.69% of the configurations obtained from fully exhaustive exploration while only exploring 3%–5% of the design space. Our methodology achieves on average 24.16× speedup in design space exploration as compared to fully exhaustive exploration in finding power and performance optimized microarchitecture configurations for processors.

Index Terms—Internet of Things (IoT), design space exploration, microarchitecture, tunable processor parameters, cycle-accurate simulator (ESESC), PARSEC and SPLASH2 benchmarks

I. INTRODUCTION AND MOTIVATION

THE internet has grown rapidly in both enterprise and consumer markets. This has given rise to the Internet of Things (IoT) wherein everyday physical objects are interconnected through a communication network for purposes of automation, remote data sensing and centralized management/control. The IoT creates an intelligent, invisible network fabric that can be sensed, controlled and programmed, which allows objects in the IoT ecosystem to communicate, directly or indirectly, with each other or the Internet [1]. The “things”, in the scope of IoT, are IoT enabled objects containing sensing and actuating elements along with embedded hardware and software components which facilitate data aggregation, network connectivity and security. Each IoT enabled object is designed to perform an application specific task using data gathered by itself or using information made available to it through other objects in the network. There has been widespread deployment of IoT objects in recent years in various applications like healthcare, industry, transportation, etc. It is estimated that 6.4 billion connected end-devices are in use in the year 2016 [2], with the number expected to rise to 26 billion by the year 2020 [1].

The authors are with the Department of Computer Science, Kansas State University, Manhattan, KS. E-mail: {[email protected], [email protected]}

The massive deployment of IoT objects results in generation of large volumes of data. Data communication, processing, real-time analysis and security of such large volumes of data are important issues that need to be resolved for efficient growth of the IoT ecosystem in the years to come. In the current IoT model, IoT end-devices are designed to be as simple and as cost effective as possible. Thus, they are designed with limited processing capabilities, just enough to securely connect and offload data to the cloud. Almost all complex data management functionalities, such as data filtering and analysis, are delegated to cloud datacenters, the core units of the IoT model. With the growth in data volume in the IoT ecosystem, several significant challenges arise which render this model infeasible. We list here three such challenges.

• Network overload - Core network bandwidth is a vital resource in the IoT ecosystem which must be used efficiently. With an ever increasing number of IoT objects relaying data over the core network to the cloud, the network becomes severely overloaded. Network overloads introduce latency in critical data processing operations, which impacts IoT applications such as healthcare and transportation that require real time data processing.

• Data security - Data communication in the IoT ecosystem mostly occurs over the public network infrastructure. In order to ensure secure data communication, several complex security protocols must be applied to the data. The volume of data requiring security increases as the number of IoT objects deployed in the IoT ecosystem increases. Applying complex security protocols to large volumes of data requires extensive computing operations which cannot be matched by the energy budget of IoT objects.

• Upgradability - As the IoT landscape continues to evolve, it becomes necessary to upgrade IoT deployments frequently. IoT objects must be designed to support hassle free addition of new features via remote access. In an ideal IoT model, IoT objects must be able to upgrade to new, more complex features without deployment of new IoT objects and without any direct human involvement. With limited processing ability, addition of new features to existing IoT objects may be challenging or even infeasible.

The challenges posed by the current IoT model can be overcome by adding processing capabilities inside or local to IoT objects [3]. With the added processing units, data management operations such as filtering and analysis can be carried out within the local network. IoT objects can thus communicate summaries of information, obtained from filtering the aggregated data, to the cloud. This contributes significantly to freeing up the core network bandwidth. The reduction in data volume also reduces the energy expenditure on data security, as less data requires fewer computing operations to secure. Having more processing ability also makes IoT deployments more flexible to upgrades, as newer features can be added without significantly burdening the system.

Processing units interfaced with IoT objects require an optimal balance between power and performance [4]. Since many IoT objects are battery powered, it is desirable that these objects operate for their entire lifetime with the battery they are deployed with (e.g. medical sensors implanted into a patient’s body via invasive surgical process). Although great progress has been made in battery technology, batteries are still not able to keep pace with the demands of modern electronics [5]. So, power optimization must be considered in parallel with performance optimization.

[Figure: block diagram with the labels TIER 1, TIER 2, INTERCONNECT, HOST PROCESSOR (HIGH PERFORMANCE OPTIMIZED), and three INTERFACE PROCESSOR blocks (LOW POWER OPTIMIZED), each with SENSING ELEMENTS and ACTUATION CONTROL]

Fig. 1. Two-tiered heterogeneous processor architecture model for IoT

For incorporating higher levels of power optimized performance in IoT deployments, a two-tiered heterogeneous processor architecture is suitable [3] [6]. This two-tiered architecture, shown in Figure 1, consists of a host processor, optimized for high performance, interfaced with a number of interface processors, optimized for low power operation. The interface processors collect data from data-sensing elements and control actuating elements. These processors are always operated in active mode because their low power operation does not severely impact battery life. Higher end functions, such as filtering and analysis of data and implementation of complex security protocols, are performed by the host processor. Since these operations are infrequent, the power hungry host processor is mostly operated in sleep state and only activated intermittently for limited durations.

Designing efficient embedded processors with power-optimized performance, for use in IoT objects, is a tedious process. Preventing high performance processors from violating the power budget requirements dictated by the market is an enormous design challenge [7]. The opportunities for optimizing a processor design for power are the greatest at the architecture level [7]. Thus, power and performance optimizations should be performed while defining the microarchitecture configuration of processors. The microarchitecture configuration consists of several processor design parameters, each of which has to be tuned based on the impact it has on the overall power and performance of the processor. Selecting a microarchitecture configuration involves rigorous design space exploration over a search space consisting of all possible settings for tunable processor design parameters. There are two main challenges that need to be addressed in this process.

Firstly, the design space exploration methodology employed to select microarchitecture configurations of processors for IoT objects must be temporally efficient. Long processor design time leads to long time to market, which results in lowered profits [8] [9] and shorter product life cycle [9]. The IoT market also lacks accepted industry standards, so those who get to the market first have the greatest opportunity to influence those standards [9].

Secondly, the design space exploration methodology must balance processor power consumption with performance, which are conflicting design metrics [10]. It is not possible to have a single optimal solution for optimization problems with conflicting design metrics. The optimization problem should instead be modeled as an Optimal Production Frontier problem, also known as a Pareto Efficiency [11] problem. Multiple solutions are obtained for such problems, where each solution favors one of the conflicting metrics. The design space exploration methodology must intelligently choose the best trade-off solution based on application specific requirements.
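To illustrate the notion of Pareto efficiency used above, the following minimal sketch filters a set of (power, execution time) pairs down to its non-dominated points. The function names and the numeric values are hypothetical and are not taken from our experiments; both metrics are assumed to be minimized.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (power, execution_time), both to be minimized


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in both metrics and strictly better in one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (power, time) measurements for four configurations.
candidates = [(1.0, 9.0), (2.0, 5.0), (3.0, 6.0), (4.0, 2.0)]
front = pareto_front(candidates)
print(front)
```

Here the configuration (3.0, 6.0) is dropped because (2.0, 5.0) is better in both metrics; the remaining three points each favor a different trade-off, and choosing among them requires application specific preferences.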

In this paper, we propose a temporally efficient design space exploration methodology for determining power and performance optimized microarchitecture configurations of embedded processors used in IoT objects. We use a combination of exhaustive, greedy and one-shot search methods to perform design space exploration. We verify the effectiveness of our methodology by testing it on a cycle-accurate simulator using a large set of standard benchmarks with varying workloads.

The main contributions of our paper are:

• We propose a temporally efficient design space exploration methodology to find microarchitecture configurations for low-power and high-performance optimized embedded processors used in IoT objects.

• We include a threshold parameter in the design space exploration methodology which can be manipulated by the system designer to control design time based on time to market constraints.

• We propose exhaustive, greedy and one-shot search algorithms which yield microarchitecture configurations that are within 2.23%–3.69% of the microarchitecture configurations obtained from fully exhaustive search.

• We distinguish between different microarchitecture configurations based on the size and type of benchmark used, and relate them to potential use cases in IoT.

The remainder of the paper is organized as follows. In Section II, we present a review of related work. We describe our design space exploration methodology in Section III and elaborate on its different phases in Section IV. In Section V, we describe the cycle-accurate simulator and benchmarks used to test our methodology. We discuss the results in Section VI and present our conclusions and future research directions in Section VII.

II. RELATED WORK

Several SoC design companies have released articles on techniques of increasing processing capabilities in IoT objects. Some articles guide the selection of processors for IoT objects while others describe low power optimized processor architectures for IoT deployments.

ARM proposed a processor architecture consisting of multiple homogeneous processors in a single IoT object, each serving a different purpose [6]. They defined a system with three Cortex-M processors, one to handle network connectivity, one to manage interface with sensors and actuators and one as a host processor controlling the other two. They stated that multiple processors are better for lowering power consumption in IoT objects since only the processor serving the current task would be in active mode while the rest would be in sleep mode. ARM also proposed a guide to selecting microcontrollers for IoT objects [12]. In this guide, they argued that high-end microcontrollers were suitable for IoT deployments for two reasons. Firstly, high-end microcontrollers complete processing tasks sooner and can enter sleep mode to conserve power and secondly, larger flash and RAM sizes available with high-end microcontrollers facilitate implementation of complex networking protocols without addition of any new processors in the system. These articles clearly demonstrate the need for having more power-optimized performance in IoT deployments.

Synopsys also proposed the use of multiple processors in IoT deployments [13]. They described the use of a two-tiered processor architecture in IoT objects: ultra low power embedded processors used to interface with sensing elements to collect, filter and process data, and a host processor used to manage the embedded processors. Their processor architecture lowered power consumption by keeping the power hungry host processor mostly in sleep mode, similar to the concept used by ARM. Synopsys also discussed optimization of processors using configurable hardware extensions for sensor applications [13]. They stated that adding custom hardware extensions for executing typical sensor functions reduces the processor cycle count required to execute sensor applications. The reduction in cycle count lowers energy consumption either by lowering the clock frequency and keeping the same execution time, or by having the same power but shorter execution time.

Apart from research carried out by SoC design companies, processor design has also been extensively studied in academia [14] [15]. There are many research works in literature involving optimized processor design. Most works employ design space exploration [16] [17] techniques utilizing search methods like exhaustive and greedy search and optimizing algorithms like genetic and evolutionary algorithms. Givargis et al. [18] developed an exploration methodology named PLATUNE (PLATform TUNEr) that carried out exhaustive searches in two stages: first, over clusters of strongly interconnected parameters to obtain Pareto-optimal configurations local to each cluster, and second, over all the clusters to obtain a global Pareto-optimal solution. The approach could explore design spaces as large as $10^{14}$ configurations, but it took on the order of 1–3 days to complete. Palesi et al. [19] argued that the high exploration time for PLATUNE was due to the formation of large partial search spaces in the clustering process. Palesi et al. improved the PLATUNE exploration methodology by introducing a new threshold value that distinguished between clusters based on the size of their partial search-space. The exhaustive search method was used for clusters with partial search-spaces smaller than the threshold value and a genetic exploration algorithm was used for larger spaces. Through this improvement, they were able to achieve 80% reduction in simulation time while still remaining within 1% of the results obtained from exhaustive search. Genetic algorithms were also used in the system MULTICUBE, by Silvano et al. [20]. The MULTICUBE system defined an automatic design space exploration algorithm that could quickly determine an approximate Pareto front for given design requirements.

Munir et al. [21] proposed another alternative to overcome the overhead of exhaustive search in their work on dynamic optimization of wireless sensor networks. Their approach was divided into two phases. In the first phase, a one-shot search algorithm selected initial parameter settings and further ordered the parameters based on their significance towards the application requirements. In the second phase, a greedy algorithm was used to search the design space. Their approach yielded a design configuration that was within 8% of the optimal configuration while only exploring 1% of the design space.

In this paper, we improve on the work carried out by Munir et al. [21]. We leverage a similar approach to design space exploration but add two new phases: a set-partitioning phase and an exhaustive search phase. The addition of the exhaustive search phase aims at increasing the degree of closeness to the optimal solution by exploring a larger portion of the design space, as argued by Silvano et al. [20]. The limit on the number of configurations considered in the exhaustive search is determined by the set-partitioning phase that uses a threshold value [19].

III. METHODOLOGY

Our design space exploration methodology for determining optimal microarchitecture configuration of embedded processors for IoT is shown in Figure 2. Our methodology is implemented in four phases: initial one-shot search configuration tuning and parameter significance, set-partitioning, exhaustive search configuration tuning, and greedy search configuration tuning.

The initial one-shot search configuration tuning and parameter significance phase is carried out by the initial one-shot search configuration tuning module and the parameter significance ordering module. The microarchitecture configuration parameter settings set, which consists of all the possible settings for each tunable microarchitecture parameter, is provided as input to the initial one-shot search configuration tuning module by the system designer. This module uses the parameter settings set to generate initial test configurations. Each initial configuration is passed to a cycle-accurate simulator. The test benchmarks for evaluating the microarchitecture configurations are provided as input to the simulator by the system designer. The simulator executes each initial test configuration separately for each test benchmark specified. The test benchmarks provide varying workloads for testing the initial test configurations. The system designer also provides the weights for balancing design metrics as input to the simulator. These weights are used to specify the preferred tradeoff between conflicting design metrics.

[Figure: flow diagram. Inputs: microarchitecture configuration parameter settings set, test benchmarks, weights for design metrics, exploration threshold. Blocks: initial one-shot configuration tuning module, cycle-accurate simulator, parameter significance ordering module (producing the significance ordered parameter set), set partitioning (producing the separated exhaustive search set and the separated greedy search set), exhaustive search configuration tuning module (producing the best settings of parameters in the exhaustive search set), greedy search configuration tuning module. Output: optimized microarchitecture configurations for processor for IoT]

Fig. 2. Design space exploration methodology for determining optimal microarchitecture configuration of embedded processor for IoT

The simulator module evaluates the initial test configurations supplied by the initial one-shot search configuration tuning module to determine the best initial setting for each tunable microarchitecture parameter. The simulation results are forwarded to the parameter significance ordering module where the tunable microarchitecture parameters are ordered based on their significance to the design metrics considered.

The ordered set of significance values is communicated to the set-partitioning module which separates the parameters into two search sets: exhaustive and greedy. The parameters are separated based on an exploration threshold value provided by the system designer. The exploration threshold value is used to control the search space for the exhaustive search phase of our design space exploration methodology. The exhaustive search phase is the longest phase in the design space exploration methodology, and processor design time can be significantly altered by varying this exploration threshold value.
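The role of the exploration threshold can be pictured with the following sketch. It is only an illustration under stated assumptions: we assume here that the threshold is an upper bound on the number of configurations the exhaustive phase may simulate, that parameters are considered most significant first, and that every remaining parameter goes to the greedy set. The function `partition` and the parameter names are hypothetical.

```python
from typing import Dict, List, Tuple


def partition(ordered: List[str], sizes: Dict[str, int],
              threshold: int) -> Tuple[List[str], List[str]]:
    """Split significance-ordered parameters into (exhaustive, greedy) sets.

    Assumption: parameters join the exhaustive set, most significant first,
    while the number of setting combinations stays within the threshold.
    """
    exhaustive: List[str] = []
    combos = 1
    for name in ordered:
        if combos * sizes[name] > threshold:
            break  # adding this parameter would exceed the budget
        combos *= sizes[name]
        exhaustive.append(name)
    greedy = ordered[len(exhaustive):]
    return exhaustive, greedy


# Hypothetical parameters, already ordered by significance, with |P_i| each.
ordered = ["l1_cache_kb", "core_freq_mhz", "rob_entries", "l2_cache_kb"]
sizes = {"l1_cache_kb": 4, "core_freq_mhz": 3, "rob_entries": 2, "l2_cache_kb": 5}

exhaustive, greedy = partition(ordered, sizes, threshold=12)
print(exhaustive, greedy)
```

Raising the threshold in this sketch moves more parameters into the exhaustive set, which lengthens exploration; lowering it shortens exploration at the cost of searching fewer combinations exhaustively.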

The microarchitecture parameters separated out in the exhaustive search set are communicated to the exhaustive search configuration tuning module. This module generates test configurations using all possible combinations of tunable processor design parameters. The parameters which are not in the exhaustive search set retain their best settings from the initial one-shot search configuration tuning process. These test configurations are evaluated on the cycle-accurate simulator to determine a test configuration possessing the best tradeoff between the conflicting design metrics considered. The best settings for the microarchitecture parameters in the exhaustive search set are then communicated to the greedy search configuration tuning module.
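A minimal sketch of this step is given below, assuming a generic cost function in place of the cycle-accurate simulator. The names `exhaustive_tune` and `toy_cost`, and all parameter values, are hypothetical; the point is only that every combination of the exhaustive-set parameters is tried while the other parameters keep their one-shot settings.

```python
from itertools import product
from typing import Callable, Dict, List


def exhaustive_tune(settings: Dict[str, List[int]], exhaustive: List[str],
                    baseline: Dict[str, int],
                    evaluate: Callable[[Dict[str, int]], float]) -> Dict[str, int]:
    """Try all combinations of the exhaustive-set parameters.

    Parameters outside the exhaustive set keep their baseline (one-shot)
    settings. `evaluate` stands in for simulation plus objective function.
    """
    best_cfg, best_cost = dict(baseline), evaluate(baseline)
    for combo in product(*(settings[name] for name in exhaustive)):
        cfg = dict(baseline)
        cfg.update(zip(exhaustive, combo))
        cost = evaluate(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg


def toy_cost(cfg: Dict[str, int]) -> float:
    """Hypothetical stand-in for the simulator; lower is better."""
    return (abs(cfg["l1_cache_kb"] - 32)
            + abs(cfg["core_freq_mhz"] - 1000) / 100
            + cfg["rob_entries"] / 64)


settings = {"l1_cache_kb": [8, 16, 32, 64],
            "core_freq_mhz": [500, 1000, 1500],
            "rob_entries": [32, 64]}
baseline = {"l1_cache_kb": 8, "core_freq_mhz": 500, "rob_entries": 32}

tuned = exhaustive_tune(settings, ["l1_cache_kb", "core_freq_mhz"],
                        baseline, toy_cost)
print(tuned)
```

In this toy run only 4 × 3 = 12 configurations are evaluated instead of the full 24, because `rob_entries` is held at its baseline setting.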

The greedy search configuration tuning module generates test configurations using the processor design parameters separated out in the greedy search set. A greedy search algorithm (see Section IV-D) is used to generate these test configurations. The microarchitecture parameters in the exhaustive search set retain their best settings obtained from the exhaustive search simulation process. The parameters which are in neither of the two search sets retain their best settings from the initial one-shot search configuration tuning process. The best configuration obtained at the end of the greedy search configuration tuning process is communicated back to the processor designer as the optimal microarchitecture of the processor with the preferred tradeoff between the conflicting design metrics.

A. Defining the Design Space

Consider that $n$ tunable parameters are available to describe the microarchitecture configuration of an embedded processor for IoT. Let $P$ be the list of these tunable parameters defined as the following set:

$$P = \{P_1, P_2, P_3, \ldots, P_n\} \tag{1}$$

Each tunable parameter $P_i$ [where $i \in \{1, 2, \ldots, n\}$] in the list $P$ is the set of possible settings for the $i$th parameter. Let $L$ be the set containing the size of the set of possible settings for each parameter in list $P$.

$$L = \{L_1, L_2, L_3, \ldots, L_n\} \tag{2}$$

such that,

$$L_i = |P_i| \quad \forall\, i \in \{1, 2, \ldots, n\} \tag{3}$$

where $|P_i|$ is the cardinal value of set $P_i$. So, each parameter setting set $P_i$ in the list $P$ is defined as follows:

$$P_i = \{P_{i1}, P_{i2}, P_{i3}, \ldots, P_{iL_i}\} \quad \forall\, i \in \{1, 2, \ldots, n\} \tag{4}$$

The values in the set $P_i$ are arranged in ascending order. The state space for design space exploration is the collection of all the possible configurations that can be obtained using the $n$ parameters.

$$S = P_1 \times P_2 \times P_3 \times \cdots \times P_n \tag{5}$$

Here, $\times$ represents the Cartesian product of lists in $P$. Throughout this paper, we use the term $S$ to denote the state space composed of all $n$ tunable parameters. To maintain generality, when referring to a state space composed of $a$ tunable parameters where $a < n$, we attach a subscript to the term $S$.

$$S_a = P_1 \times P_2 \times P_3 \times \cdots \times P_a \quad \forall\, a < n \tag{6}$$

We note that the state space of $a$ tunable parameters does not constitute a complete design configuration and is only used as an intermediate when defining our methodology.

We also reserve the use of the $\times$ operator in the following manner:

$$S_a = S_a \times P_i \quad \forall\, i \in \{1, 2, \ldots, n\} \tag{7}$$

This represents the extension of the state space $S_a$ to include one new set of parameter settings $P_i$ from the list $P$. This operation increases the number of tunable parameters $a$ in the state space by one.

When referring to a design configuration that belongs to the state space $S$, we use the term $s$. We attach subscripts to $s$ to refer to specific design configurations. For example, a state $s_f$ that consists of the first setting of each tunable parameter can be written as:

$$s_f = (P_{11}, P_{21}, P_{31}, \ldots, P_{n1}) \tag{8}$$

Similarly, to denote an incomplete/partial design configuration of $a$ tunable parameters we use the term $\delta s_a$.
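The definitions in this subsection map directly onto a Cartesian product. The sketch below is illustrative only: the parameter names and settings are hypothetical, and the variable names mirror the symbols $P$, $L$, $S$, $S_a$ and $s_f$.

```python
from itertools import product
from math import prod

# Hypothetical parameter setting sets P_i, each listed in ascending order.
P = {
    "core_freq_mhz": [500, 1000, 1500],
    "l1_cache_kb": [8, 16, 32, 64],
    "rob_entries": [32, 64],
}

# L_i = |P_i| for every tunable parameter (Eq. 3).
L = [len(settings) for settings in P.values()]

# |S| is the product of the L_i, since S = P_1 x P_2 x ... x P_n (Eq. 5).
state_space_size = prod(L)
S = list(product(*P.values()))

# Partial state space S_a over the first a = 2 parameters (Eq. 6).
S_2 = list(product(*list(P.values())[:2]))

# s_f: the state made of the first setting of each parameter (Eq. 8).
s_f = tuple(settings[0] for settings in P.values())

print(L, state_space_size, s_f)
```

Even with three small parameter sets the full space has 24 states; the size grows multiplicatively with every added parameter, which is what makes fully exhaustive exploration expensive.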

B. Benchmarks

Each of the configurations selected from the state space $S$ by our methodology is tested on $m$ test benchmarks. The design metrics for each simulated configuration are collected separately for each benchmark.

C. Objective Function

In our methodology, design configurations are compared with each other based on their objective functions. The objective function of a design configuration is the weighted sum of the normalized design metrics obtained after simulating that design configuration. Let $o$ be the number of design metrics and $V$ be the set of normalized values of design metrics which are obtained from the simulation.

$$V_s^k = \{V_{s1}^k, V_{s2}^k, V_{s3}^k, \ldots, V_{so}^k\} \quad \forall\, k = 1, 2, \ldots, m \tag{9}$$

Let $w$ be the set of weights for the design metrics based on the requirements of the targeted application. These weights are set by the system designer.

$$w = \{w_1, w_2, w_3, \ldots, w_o\} \tag{10}$$

such that,

$$0 \le w_l \le 1 \quad \forall\, l = 1, 2, \ldots, o \tag{11}$$

and,

$$\sum_{l=1}^{o} w_l = 1 \tag{12}$$

TABLE I
LIST OF SYMBOLS

| Symbol | Description |
|---|---|
| $n$ | Number of tunable microarchitecture parameters |
| $P$ | List of tunable microarchitecture parameters |
| $P_i$ | Set of possible settings for $i$th tunable microarchitecture parameter |
| $L$ | Size of set of possible settings for each tunable microarchitecture parameter |
| $L_i$ | Cardinal value of set $P_i$ |
| $S$ | State space for design space exploration |
| $S_a$ | Partial/Incomplete state space |
| $s_{tag}$ | State in state space $S$ with ‘tag’ identifier |
| $\delta s_a$ | State in partial state space $S_a$ |
| $m$ | Number of test benchmarks |
| $o$ | Number of design metrics |
| $V_s^k$ | Set of normalized values obtained for design metrics from simulation of state $s$ for $k$th benchmark |
| $w$ | Set of weights for design metrics |
| $w_l$ | Weight for $l$th design metric |
| $F_s^k$ | Objective function obtained from simulating state $s$ for $k$th benchmark |

The objective function F of a design configuration s for a test benchmark k is defined as follows:

\[ F_s^k = \sum_{l=1}^{o} w_l V_{sl}^k \tag{13} \]

The optimization problem considered in this paper is to minimize the value of the objective function F. The design metrics are chosen such that minimizing their values is the favorable design choice. For example, when considering the performance metric, the design goal is to maximize performance. To model this in the objective function, we use execution time to measure performance. Minimizing execution time fits with minimizing the objective function while still modeling the design goal of maximizing performance. The optimization problem for each test benchmark k is defined as follows:

\[ \min\; F_s^k \quad \text{s.t.} \quad s \in S \tag{14} \]
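As an illustration of Equations 11 to 14, the following minimal Python sketch computes the weighted-sum objective for a handful of configurations and selects the minimizer. The function names, the configuration labels and the normalized metric values are hypothetical and chosen only for this example; the authors' own implementation is in PERL and is not reproduced here.

```python
from typing import Mapping, Sequence


def objective(weights: Sequence[float], metrics: Sequence[float]) -> float:
    """Weighted sum F = sum_l w_l * V_l over normalized design metrics."""
    if len(weights) != len(metrics):
        raise ValueError("one weight per design metric is required")
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("each weight must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * v for w, v in zip(weights, metrics))


def best_configuration(
    weights: Sequence[float], simulated: Mapping[str, Sequence[float]]
) -> str:
    """Return the label of the configuration with the smallest objective value."""
    return min(simulated, key=lambda s: objective(weights, simulated[s]))


# Hypothetical normalized (power, execution time) pairs for three configurations.
simulated = {"s1": (0.20, 0.90), "s2": (0.50, 0.40), "s3": (0.95, 0.15)}

low_power = best_configuration((0.9, 0.1), simulated)   # power-dominated weights
high_perf = best_configuration((0.1, 0.9), simulated)   # time-dominated weights
```

With power-dominated weights the sketch selects the low-power configuration, and with time-dominated weights it selects the fastest one, which is the behavior the weights are meant to express.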

Table I presents the symbols established in this section in list form.

IV. PHASES OF METHODOLOGY

Our proposed design space exploration methodology consists of four distinct phases. In this section, we elaborate on the steps involved in each phase using the notation set up in Section III.

A. Phase I : Initial One-Shot Search Configuration Tuning and Parameter Significance

In this phase of our methodology, the best initial setting for each tunable microarchitecture parameter in set P is determined by using a one-shot search configuration tuning process. The one-shot search process is based on single factor analysis, which is an effective heuristic approach used in design space exploration [22]. Unlike single factor analysis, wherein parameters can have only two settings, a zero value and a non-zero value setting, one-shot search works on parameters with more than two non-zero value settings. In the one-shot search process, parameters are evaluated on a one by one basis. Two test configurations are generated for each parameter, one with the first setting and one with the last setting from the list of settings for the current parameter. The remaining parameters are arbitrarily set to their first setting from their corresponding list of settings.

Algorithm 1: Initial One-Shot Search Configuration Tuning and Parameter Significance
Input: P - List of Tunable Parameters
Output: B - Set of Best Settings; D - Significance of Parameters with respect to Objective Function

 1  for i ← 1 to n do
 2      s_f = {P_i1}
 3      s_l = {P_iL[i]}
 4      for j ← 1 to n do
 5          if i ≠ j then
 6              s_f = s_f ∪ {P_j1}
 7              s_l = s_l ∪ {P_j1}
 8          end
 9      end
10      for k ← 1 to m do
11          Explore kth benchmark using configuration s_f
12          Calculate F^k_{s_f}
13          Explore kth benchmark using configuration s_l
14          Calculate F^k_{s_l}
15          D^k_i = F^k_{s_l} − F^k_{s_f}
16          if D^k_i > 0 then
17              B^k_i = P_i1
18          else
19              B^k_i = P_iL[i]
20          end
21      end
22  end

The steps involved in initial one-shot search configuration tuning and determining parameter significance are detailed in Algorithm 1. The first and last test configurations generated for evaluating a tunable microarchitecture parameter, P_i in set P, are denoted by s_f and s_l, respectively. These configurations are tested on the cycle-accurate simulator. From the results of the simulation, objective functions, F_{s_f} and F_{s_l}, corresponding to s_f and s_l, respectively, are determined. The objective function values are used to determine the best initial setting as well as the significance of each microarchitecture parameter. The magnitude of the difference between F_{s_f} and F_{s_l}, which is stored in parameter significance set D (line 15), is used as parameter significance. The higher the magnitude of a difference D^k_i, i ∈ {1, 2, 3, ..., n}, for a benchmark k, k ∈ {1, 2, 3, ..., m}, the higher is the significance of parameter P_i to the workload characterized by benchmark k. The sign of the difference between F_{s_f} and F_{s_l} is used to pick the best initial setting for parameter P_i. If the difference is positive (that is, F_{s_l} exceeds F_{s_f}), then the first setting of parameter P_i is chosen as the best setting, otherwise the last setting is chosen. The best settings for the parameters are stored in the set of best settings B^k_i (lines 17 and 19).
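The one-shot search described above can be sketched in Python for a single benchmark as follows. This is a minimal illustration, not the authors' PERL implementation: `one_shot_search`, `toy_objective` and the parameter names are hypothetical, and `toy_objective` is a made-up stand-in for a simulator run that returns an objective value.

```python
from typing import Callable, Dict, Sequence, Tuple

Config = Dict[str, float]


def one_shot_search(
    settings: Dict[str, Sequence[float]],
    evaluate: Callable[[Config], float],
) -> Tuple[Config, Dict[str, float]]:
    """Return (best initial settings, signed significance) for one benchmark."""
    # s_f holds every parameter at its first setting, so it is the same
    # configuration for every parameter and only needs one evaluation.
    s_f = {name: values[0] for name, values in settings.items()}
    f_first = evaluate(s_f)

    best: Config = {}
    significance: Dict[str, float] = {}
    for name, values in settings.items():
        # s_l moves only the current parameter to its last setting.
        s_l = {**s_f, name: values[-1]}
        d = evaluate(s_l) - f_first          # D_i = F(s_l) - F(s_f)
        significance[name] = d
        best[name] = values[0] if d > 0 else values[-1]
    return best, significance


def toy_objective(cfg: Config) -> float:
    # Hypothetical stand-in for a simulation; not a real power/performance model.
    return 100.0 / cfg["freq"] + 0.01 * cfg["cores"]


settings = {"cores": [1, 2, 4], "freq": [75, 100, 125, 150]}
best, significance = one_shot_search(settings, toy_objective)
```

In this toy case the frequency has the larger magnitude of significance and its last setting is retained, while the core count keeps its first setting.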

B. Phase II : Set-Partitioning

Algorithm 2: Set-Partitioning
Input: D - Significance of Parameters towards Objective Function; I - Index Set; T - Exhaustive Search Threshold Factor
Output: E - Set of Parameters for Exhaustive Search; G - Set of Parameters for Greedy Search

 1  E = ∅ and G = ∅
 2  for k ← 1 to m do
 3      sortDescending(|D^k|) - s.t. index information of the sorted values is preserved in I^k
 4      sort(P^k) and sort(L^k) w.r.t. index information in I^k
 5      numE = 1 and i = 1
 6      while numE ≤ T do
 7          numE = numE × L^k_i
 8          if numE ≤ T then
 9              E^k = E^k ∪ {P_i}
10              i = i + 1
11          else
12              break
13          end
14      end
15      numG = ceil((|P^k| − |E^k|) / 2)
16      while numG > 0 do
17          G^k = G^k ∪ {P^k_i}
18          numG = numG − 1
19          i = i + 1
20      end
21  end

The set-partitioning phase, presented in Algorithm 2, shows how the parameter significance values determined in the first phase of our methodology are used to separate the list of tunable microarchitecture parameters into exhaustive and greedy search sets. First, the parameter significance set |D^k| for each benchmark k, k ∈ {1, 2, 3, ..., m}, is sorted in descending order of magnitude using the sortDescending(|D^k|) function. The index information of the sorted values is preserved in a set of indexes I^k (line 3). For example, if the fifth entry D^k_5 has the greatest value, D^k_5 will become the first entry in the set D^k and the first entry in the set of indexes I^k will be 5, that is, I^k_1 = 5. The set of indexes, I^k, is used to sort the list of tunable microarchitecture parameters, P^k, and the list of set sizes, L^k. After sorting, the parameters with higher significance lie towards the start of the set and the parameters with lower significance lie towards the end of the set. The list of parameters is then divided into three subsets: exhaustive search, greedy search and one-shot search sets. The exhaustive search set gets the parameters with the highest significance. The number of parameters separated into the exhaustive search set depends on the exploration threshold value, T, provided by the system designer. The threshold value T limits the size of the partial search space of the exhaustive search set, numE (line 6).

After separating out the exhaustive search set, the parameters remaining in the parameter list are separated into greedy search and one-shot search sets. The list of remaining parameters is divided into two halves (line 15): the upper half, of size ceil((|P| − |E^k|)/2), is separated as the greedy search set and the lower half is separated as the one-shot search set. We observe empirically that dividing the list of remaining parameters into halves provides efficient design space exploration without significantly compromising the solution quality. The parameters separated as the one-shot search set are not explored further and are left at the best settings determined for them in Algorithm 1.
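The partitioning logic can be sketched in Python for one benchmark as shown below. The function name, the parameter labels and the significance values are hypothetical; the set sizes are those of the high-performance column of Table II and the threshold T = 150 is the value used in Section VI.

```python
import math
from typing import Dict, List, Tuple


def partition_parameters(
    significance: Dict[str, float],
    set_sizes: Dict[str, int],
    threshold: int,
) -> Tuple[List[str], List[str], List[str]]:
    """Split parameters into (exhaustive, greedy, one_shot) lists."""
    # Rank parameters by |D| in descending order.
    ranked = sorted(significance, key=lambda p: abs(significance[p]), reverse=True)

    # Grow the exhaustive set while its partial state space stays within T.
    exhaustive: List[str] = []
    num_e = 1
    for name in ranked:
        if num_e * set_sizes[name] > threshold:
            break
        num_e *= set_sizes[name]
        exhaustive.append(name)

    # Upper half of the remainder is searched greedily; the rest is left
    # at its one-shot setting.
    remaining = ranked[len(exhaustive):]
    num_g = math.ceil(len(remaining) / 2)
    return exhaustive, remaining[:num_g], remaining[num_g:]


# Hypothetical significance values for a single benchmark.
significance = {"cores": -0.5, "freq": 0.9, "l1i": -0.1,
                "l1d": 0.2, "l2": 0.05, "l3": 0.01}
set_sizes = {"cores": 3, "freq": 4, "l1i": 5, "l1d": 5, "l2": 3, "l3": 3}
exhaustive, greedy, one_shot = partition_parameters(significance, set_sizes, 150)
```

Here the three most significant parameters span 4 × 3 × 5 = 60 partial configurations, which fits within the threshold, whereas adding a fourth would give 300 and exceed it.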

C. Phase III : Exhaustive Search Configuration Tuning

Algorithm 3: Exhaustive Search
Input: P - List of Tunable Parameters; B - Set of Best Settings for One-shot Search; E - List of Parameters for Exhaustive Search
Output: B - List of Best Settings for One-shot and Exhaustive Search

 1  s_E = ∅
 2  δs_E = ∅ and δs_{E'} = ∅
 3  for k ← 1 to m do
 4      F^k_{s_b} = ∞
 5      for i ← 1 to n do
 6          if P_i ∉ E^k then
 7              δs^k_{E'} = δs^k_{E'} ∪ {B^k_i}
 8          end
 9      end
10      for i ← 1 to n do
11          if P_i ∈ E^k then
12              S^k_E = S^k_E × P_i
13          end
14      end
15      for j ← 1 to |S^k_E| do
16          δs^k_{Ej} is a partial configuration in state space S^k_E
17          s^k_E = δs^k_{Ej} ∪ δs^k_{E'}
18          Explore kth benchmark using configuration s^k_E
19          Calculate F^k_{s_E}
20          if F^k_{s_E} < F^k_{s_b} then
21              F^k_{s_b} = F^k_{s_E}
22              B^k = s^k_E
23          end
24      end
25  end

Algorithm 3 details the steps involved in the exhaustive search process. The exhaustive search process determines the best settings for the parameters in the exhaustive search set E. First, the settings for the parameters that are not in the exhaustive search set E are assigned (line 7). These parameters are assigned their best settings from the set of best settings B^k_i as determined in the initial one-shot search configuration tuning process described in Algorithm 1. These settings make up the partial test design configuration δs_{E'}.

Next, a partial state space S_E is formed for the parameters in the exhaustive search set E (line 12). Every possible partial test design configuration, δs_{Ej} (line 16), in the partial state space S_E, is combined with the partial test design configuration δs_{E'} to form complete simulatable test design configurations. Each complete test design configuration is evaluated on the simulator. An objective function value, F_{s_E}, is obtained for each complete test design configuration, s_E, from the simulator. The algorithm keeps track of the smallest objective function value encountered in the search process in F_{s_b}, which represents the best objective function value. When a design configuration results in an objective function that has a value less than F_{s_b} (line 20), F_{s_b} is changed to the new minimum value and the set of best settings B is updated with the corresponding design configuration.
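A minimal Python sketch of this phase for one benchmark is given below. It assumes the partial state space can be enumerated as a Cartesian product of the exhaustive-set settings; `exhaustive_search`, `toy_objective` and the parameter names are hypothetical, and `toy_objective` merely stands in for a simulator run.

```python
import itertools
from typing import Callable, Dict, List, Sequence, Tuple

Config = Dict[str, float]


def exhaustive_search(
    settings: Dict[str, Sequence[float]],
    exhaustive: List[str],
    best: Config,
    evaluate: Callable[[Config], float],
) -> Tuple[Config, float]:
    """Try every combination of the exhaustive-set parameters."""
    # Parameters outside the exhaustive set keep their one-shot best settings.
    fixed = {p: v for p, v in best.items() if p not in exhaustive}
    best_value = float("inf")
    best_config = dict(best)
    # Cartesian product of the exhaustive-set settings (the partial state space).
    for combo in itertools.product(*(settings[p] for p in exhaustive)):
        config = {**fixed, **dict(zip(exhaustive, combo))}
        value = evaluate(config)
        if value < best_value:
            best_value, best_config = value, config
    return best_config, best_value


calls: List[Config] = []


def toy_objective(cfg: Config) -> float:
    # Hypothetical stand-in for a simulation; records each evaluated configuration.
    calls.append(dict(cfg))
    return abs(cfg["freq"] - 125) / 100 + 0.1 * abs(cfg["cores"] - 2)


settings = {"cores": [1, 2, 4], "freq": [75, 100, 125, 150], "l2": [256, 512, 1024]}
one_shot_best = {"cores": 1, "freq": 150, "l2": 256}
config, value = exhaustive_search(settings, ["freq", "cores"], one_shot_best, toy_objective)
```

Only the 4 × 3 = 12 combinations of the two exhaustive-set parameters are evaluated; the L2 size is never varied in this phase.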

D. Phase IV : Greedy Search Configuration Tuning

In the final phase of our methodology, described in Algorithm 4, the best settings for the parameters in greedy search set G are determined. For each parameter in the set G, the sign of the parameter significance is checked to determine whether the first setting or the last setting was chosen as the best setting in the first phase of our methodology. If the sign of the parameter significance is positive, the first setting for that parameter yields a smaller objective function than the last. If the sign is negative, the last setting for that parameter yields a smaller objective function than the first. We assume that the setting that yields the smallest objective function overall lies closer to the setting that yielded the smaller objective function in the initial one-shot search configuration tuning process. To ensure that the search process starts from that setting, the set of parameter settings P_i is either sorted in descending order (when the last setting was the best setting) or left unchanged in its default ascending order (when the first setting was the best setting) (line 8).

In the greedy search process, the parameters in the greedy search set are considered one at a time. First, a partial test design configuration δs_{G_P'} is formed using the exhaustive search set, the one-shot search set and the non-current parameters in the greedy search set. The parameters in the exhaustive search set, E, are assigned their best values as determined in the exhaustive search configuration tuning process. The parameters in the one-shot search set retain the best settings determined in the initial one-shot configuration tuning process. The non-current parameters in the greedy search set, G, are assigned best settings in one of two ways. If the non-current parameter has already been processed by the greedy search optimization process, then the parameter is assigned the best setting obtained from that process. If the non-current parameter has not been processed yet, then the parameter is assigned the best setting obtained from the initial one-shot search configuration tuning process.


Algorithm 4: Greedy Search
Input: P - List of Tunable Parameters, D - Significance of Parameters towards Objective Function, B - Set of Best Settings for One-shot and Exhaustive Search, E - Set of Parameters for Exhaustive Search, G - Set of Parameters for Greedy Search
Output: B - Complete set of Best Settings

 1  s_G = ∅
 2  δs_{G'} = ∅
 3  G_P = ∅
 4  for k ← 1 to m do
 5      F^k_{s_b} = ∞
 6      for i ← 1 to n do
 7          if P_i ∈ G^k then
 8              if D^k_i < 0 then
 9                  G_P = sortDescending(P_i)
10              end
11              for j ← 1 to n do
12                  if P_j ≠ G_P then
13                      δs^k_{G_P'} = δs^k_{G_P'} ∪ {B^k_j}
14                  end
15              end
16              for l ← 1 to L_i do
17                  s^k_G = δs^k_{G_P'} ∪ {G_{Pl}}
18                  Explore kth benchmark using configuration s^k_G
19                  Calculate F^k_{s_G}
20                  if F^k_{s_G} < F^k_{s_b} then
21                      F^k_{s_b} = F^k_{s_G}
22                      B^k_i = G_{Pl}
23                  else
24                      break
25                  end
26              end
27          end
28      end
29  end

The partial test design configuration δs_{G_P'} is then combined with the settings for the current parameter being processed to form the complete simulatable test design configuration s_G (line 17). This configuration is evaluated on the cycle-accurate simulator. The resulting objective function, F_{s_G}, is compared with the best objective function F_{s_b}, which holds the smallest objective function value encountered thus far in the search process. Similar to the exhaustive search process, when a design configuration results in an objective function that has a value less than F_{s_b} (line 20), F_{s_b} is changed to the new minimum value and the set of best settings B^k_i is updated with the corresponding design configuration. However, when the search process encounters a design configuration that results in an objective function that has a value greater than F_{s_b}, the search process for the current parameter is terminated and the next parameter in the parameter list G is explored.
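The greedy phase can be sketched in Python for one benchmark as follows. Two points are assumptions of this sketch rather than statements of the authors' implementation: the running best value is reset for each parameter (Algorithm 4 initialises it once per benchmark), and a setting that merely ties the current best also ends the search for that parameter. All names and the toy objective are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Config = Dict[str, float]


def greedy_search(
    settings: Dict[str, Sequence[float]],
    greedy: List[str],
    significance: Dict[str, float],
    best: Config,
    evaluate: Callable[[Config], float],
) -> Config:
    """Tune greedy-set parameters one at a time, stopping at the first
    setting that does not improve the objective."""
    best = dict(best)
    for name in greedy:
        # Start from the end that won the one-shot search: descending order
        # when the last setting was best (negative significance).
        if significance[name] < 0:
            values = sorted(settings[name], reverse=True)
        else:
            values = list(settings[name])
        best_value = float("inf")  # assumption: reset per parameter
        for value in values:
            # All other parameters keep their current best settings.
            f = evaluate({**best, name: value})
            if f < best_value:
                best_value = f
                best[name] = value
            else:
                break
    return best


calls: List[Config] = []


def toy_objective(cfg: Config) -> float:
    # Hypothetical stand-in for a simulation; records each evaluated configuration.
    calls.append(dict(cfg))
    return abs(cfg["l1i"] - 32) / 100 + cfg["l2"] / 10000


settings = {"l1i": [8, 16, 32, 64], "l2": [256, 512, 1024]}
significance = {"l1i": -0.1, "l2": 0.05}
start = {"l1i": 64, "l2": 256, "freq": 3200}
result = greedy_search(settings, ["l1i", "l2"], significance, start, toy_objective)
```

In this toy run the search over the L1-I size stops after three evaluations and the search over the L2 size after two, instead of the seven evaluations a full sweep of both parameters would need.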

V. EXPERIMENTAL SETUP

We used the ESESC [23] (Enhanced Super EScalar) simulator to simulate all the test microarchitecture configurations generated by our methodology. The ESESC simulator is a fast cycle-accurate chip multiprocessor simulator. It models an out-of-order RISC (Reduced Instruction Set Computing) processor running the ARM instruction set.

We used benchmarks from the PARSEC and SPLASH2 [24], [25] benchmark suites to test our methodology. The PARSEC and SPLASH2 benchmark suites are collections of standardized benchmarks which provide a diverse range of workloads for evaluation of processors.

We used the following benchmarks from the PARSEC and SPLASH2 suites to test our methodology.

PARSEC Benchmarks: Blackscholes, Canneal, Facesim, Fluidanimate, Freqmine, x264

SPLASH2 Benchmarks: Cholesky, FFT, LU cb, LU ncb, Ocean cp, Ocean ncp, Radiosity, Radix, Raytrace

The methodology phases were implemented using PERL [26]. The results from the simulation processes were collected in MS Excel using the Excel-Writer-XLSX [27] tool for PERL.

We tested our design space exploration methodology separately for low-power and high-performance processor design. We combined the microarchitecture configurations obtained from these tests to form a two-tiered heterogeneous processor architecture. The microarchitecture configurations obtained from the low-power processor design tests were used to implement the low-power optimized interface processors, the lower tier of the two-tiered architecture. The microarchitecture configurations obtained from the high-performance processor design tests were used to implement the high-performance optimized host processor, the upper tier of the two-tiered architecture.

TABLE II: MICROARCHITECTURE CONFIGURATION PARAMETER SETTINGS SET

Parameter Name          Low-Power Settings    High-Performance Settings
Cores                   1, 2, 4               2, 4, 8
Frequency (MHz)         75, 100, 125, 150     1700, 2200, 2800, 3200
L1-I Cache Size (kB)    8, 16, 32, 64         8, 16, 32, 64, 128
L1-D Cache Size (kB)    8, 16, 32, 64         8, 16, 32, 64, 128
L2 Cache Size (kB)      256, 512, 1024        256, 512, 1024
L3 Cache Size (kB)      2048, 4096            2048, 4096, 8192

The microarchitecture parameters considered for testing our methodology, along with the set of possible settings for each parameter, are listed in Table II. We used different ranges of settings for low-power and high-performance processor design. The range of settings listed in Table II under low-power design was used for the design of low-power optimized interface processors. The design space cardinality for low-power processor design was 1,152 configurations. The range of settings listed in Table II under high-performance design was used for the design of the high-performance optimized host processor. The design space cardinality for high-performance processor design was 2,700 configurations.
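The two cardinalities follow from multiplying the number of settings per parameter in Table II. A short Python check is given below; the dictionary and function names are illustrative only.

```python
import math
from typing import Dict, List

# Settings per parameter, copied from Table II.
LOW_POWER: Dict[str, List[int]] = {
    "Cores": [1, 2, 4],
    "Frequency (MHz)": [75, 100, 125, 150],
    "L1-I Cache Size (kB)": [8, 16, 32, 64],
    "L1-D Cache Size (kB)": [8, 16, 32, 64],
    "L2 Cache Size (kB)": [256, 512, 1024],
    "L3 Cache Size (kB)": [2048, 4096],
}
HIGH_PERFORMANCE: Dict[str, List[int]] = {
    "Cores": [2, 4, 8],
    "Frequency (MHz)": [1700, 2200, 2800, 3200],
    "L1-I Cache Size (kB)": [8, 16, 32, 64, 128],
    "L1-D Cache Size (kB)": [8, 16, 32, 64, 128],
    "L2 Cache Size (kB)": [256, 512, 1024],
    "L3 Cache Size (kB)": [2048, 4096, 8192],
}


def design_space_size(settings: Dict[str, List[int]]) -> int:
    """Number of complete configurations: product of per-parameter set sizes."""
    return math.prod(len(values) for values in settings.values())


low_power_size = design_space_size(LOW_POWER)                # 3*4*4*4*3*2
high_performance_size = design_space_size(HIGH_PERFORMANCE)  # 3*4*5*5*3*3
```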


TABLE III: WEIGHTS FOR DESIGN METRICS

Configuration       Power    Performance
Low-Power           0.9      0.1
High-Performance    0.1      0.9

We used power and performance as design metrics to evaluate the microarchitecture configurations for both low-power and high-performance optimized processors. We used the normalized value of total dynamic power and leakage power [28] across all the cores in the processor as the power metric and the normalized value of total execution time as the performance metric. We used the weights presented in Table III to specify the preference for the conflicting design metrics of power and performance. The linear objective function used for the evaluation of the test microarchitecture configurations was:

\[ F = w_P \cdot P + w_E \cdot E \tag{15} \]

where,

\[ P = \text{Dynamic Power} + \text{Leakage Power}, \qquad E = \text{Total Execution Time} \tag{16} \]

VI. RESULTS

In this section, we present the results obtained while testing our methodology. This section is divided into two subsections. In the first subsection, we present results to validate our design space exploration methodology and in the second subsection, we discuss the applicability of some of the microarchitecture configurations to important IoT use cases.

A. Evaluation of design space exploration methodology

For evaluating our methodology, we compared our microarchitecture configuration results with those obtained from a fully exhaustive search of the design space. We tested our methodology with an exploration threshold of T = 150. This threshold value is an upper bound which limits the partial state space for the exhaustive search phase of our methodology.

1) Parameter significance: Figure 3 shows the normalized values of parameter significance for different PARSEC benchmarks. The normalization is carried out using the maximum values for total power and total execution time obtained in the initial one-shot search configuration tuning process. The parameter significance values are calculated in the first phase of our methodology, initial one-shot search configuration tuning. We observe that the significance of each of the tunable processor design parameters varies based on the type of workload offered by the test benchmarks. For each of the test benchmarks, there are at most three significant processor design parameters. We note that the operating frequency is the processor design parameter with the highest significance for most of the test benchmarks, followed by core count, which is the second most significant design parameter. For certain test benchmarks, the sizes of the L1-I cache and L1-D cache are also highly significant to the overall design. The large significance of cache sizes is a result of large working sets with fine data-parallel granularity offered by those test benchmarks.

[Figure 3 omitted. Labels: benchmarks Blackscholes, Canneal, Facesim, Fluidanimate, Freqmine, x264; parameters Cores, Frequency, L1-I Cache, L1-D Cache, L2 Cache, L3 Cache; scale Parameter Significance |D| from 0.0 to 1.0.]

Fig. 3. Significance of microarchitecture configuration parameters for PARSEC benchmarks for high-performance optimized processor for IoT

[Figure 4 omitted. Axes: Total Power and Execution Time, each from 0.0 to 1.0. Legend: x264 Pareto front, objective function line, point of intersection.]

Fig. 4. Linear objective function plotted with Pareto front for x264 (PARSEC) benchmark for high-performance optimized processor for IoT

TABLE IV: COMPARISON OF MICROARCHITECTURE CONFIGURATIONS OBTAINED FOR X264 (PARSEC) BENCHMARK FOR HIGH-PERFORMANCE OPTIMIZED PROCESSOR FOR IOT

Parameter Name          Proposed Methodology    Fully Exhaustive Search
Cores                   2                       2
Frequency (MHz)         3200                    3200
L1-I Cache Size (kB)    64                      128
L1-D Cache Size (kB)    64                      128
L2 Cache Size (kB)      1024                    256
L3 Cache Size (kB)      2048                    8192
Total Power (W)         1.597                   1.600
Execution Time (ms)     35.142                  34.152

2) Selecting a favorable tradeoff solution: Figure 4 shows the Pareto front obtained for the x264 (PARSEC) benchmark for the high-performance optimization requirement. The Pareto front is generated using the normalized values of the total power and execution time design metrics. The front represents the conflicting interdependency between power and performance in a processor. It shows that increasing the performance of a processor degrades its power efficiency whereas increasing power efficiency degrades performance. It is thus impossible to determine a microarchitecture configuration which results in both these metrics having optimal values. The goal of the design space exploration methodology is to determine a balance between these conflicting design metrics. A suitable tradeoff between these metrics is selected by using the preference specified using the weights assigned to each metric. In our experiments, we specified w_P and w_E as the weights for power and performance metrics respectively to define a linear objective function (Equation 15). Figure 4 shows the objective function plotted along with the Pareto front. We note that the objective function forms a straight line in the power-performance graph with the slope −w_P/w_E. We observe that the objective function is tangent to the Pareto front at the power-performance value pair of the microarchitecture configuration obtained as solution by our methodology.
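As a short worked check, the slope of the objective function line follows from Equation 15 by holding F constant and solving for E, assuming total power P is plotted on the horizontal axis and execution time E on the vertical axis:

\[ F = w_P P + w_E E \;\Longrightarrow\; E = \frac{F}{w_E} - \frac{w_P}{w_E}\, P \]

With the high-performance weights of Table III (w_P = 0.1, w_E = 0.9), the slope is −0.1/0.9 ≈ −0.11. Lowering F shifts this line towards the origin without changing its slope, so the smallest attainable F corresponds to the line that just touches the Pareto front.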

3) Comparison with fully exhaustive search: We verified the microarchitecture configuration obtained as solution from our methodology by comparing it against the solution obtained by running a fully exhaustive search of the design space. We present a comparison for the x264 (PARSEC) benchmark as an example in Table IV. The table shows a side-by-side comparison of the microarchitecture configuration obtained from our proposed methodology with that obtained from fully exhaustive search. Comparing these values, we see that significant parameters like operating frequency and core count match exactly while other parameters only differ slightly. The table also contains the values of the total power and execution time obtained for both configurations. Comparing the values of these design metrics, we see that the total power and execution time values obtained from our methodology are within -0.18% and 2.89% respectively of the total power and execution time values obtained from fully exhaustive exploration.
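The two percentages can be recomputed from the last two rows of Table IV. The sketch below assumes the deviation is taken relative to the fully exhaustive result; the function name is illustrative.

```python
def relative_deviation(found: float, reference: float) -> float:
    """Signed deviation of `found` from `reference`, in percent."""
    return 100.0 * (found - reference) / reference


# Values from Table IV: proposed methodology vs. fully exhaustive search.
power_dev = relative_deviation(1.597, 1.600)    # total power (W)
time_dev = relative_deviation(35.142, 34.152)   # execution time (ms)
```

Under this assumption the power deviation is slightly negative and the execution time deviation is just under 2.9%, in line with the figures quoted above up to the last digit.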

Using our methodology, on average we achieve microarchitecture configurations with total power values within 2.23% for low-power optimized processors and execution time within 3.69% for high-performance optimized processors as compared to fully exhaustive search. These configurations are obtained by exploring only 3%–5% of the processor design space, which results in our methodology having an average speedup of 24.16× as compared to fully exhaustive exploration of the design space.

B. Application scopes in IoT

Based on the type and size of workload offered by the test benchmarks, we separate them into four different categories, each of which relates to an IoT application or process. Table V shows the categorization of some of the key test benchmarks. The Cholesky and Radix benchmarks from the SPLASH2 benchmark suite are categorized under data sensing and aggregation. The Cholesky benchmark is a sparse matrix factorization kernel and the Radix benchmark is an integer sort kernel [29]. The Cholesky benchmark is representative of data sensing in IoT applications, where data is acquired from multiple sensor sources and transformed into a more useful format. The Radix benchmark is representative of data aggregation, where indexing, sorting and storing operations are carried out on sensed data. These benchmarks are useful in determining the microarchitecture configurations of low-power optimized interface processors for the two-tiered heterogeneous processor architecture.

TABLE V: CATEGORIZATION OF TEST BENCHMARKS ACCORDING TO IOT APPLICATION

IoT Application                         Benchmarks
Data sensing and aggregation            Cholesky, Radix
Data analysis and Data mining           Blackscholes, Freqmine
Graphics                                Facesim, Fluidanimate
Signal processing and Communication     FFT

TABLE VI: MICROARCHITECTURE CONFIGURATIONS FOR LOW-POWER OPTIMIZED PROCESSORS FOR IOT

Parameter Name          Cholesky    Radix
Cores                   1           1
Frequency (MHz)         75          75
L1-I Cache Size (kB)    8           8
L1-D Cache Size (kB)    32          64
L2 Cache Size (kB)      256         256
L3 Cache Size (kB)      2048        4096
Total Power (W)         0.0934      0.0935
Execution Time (ms)     327.958     332.535

The remaining categories all model more complex applications requiring a high level of processing capability. The Blackscholes and Freqmine benchmarks from the PARSEC benchmark suite are listed under data analysis and data mining. The Blackscholes benchmark is a financial analysis benchmark that analytically solves large sets of partial differential equations [24]. The Freqmine benchmark is a data mining kernel which implements Frequent Itemset Mining [24]. These benchmarks are representative of data analysis and filtering operations that need to be carried out on large volumes of sensor data in an IoT network.

The Facesim and Fluidanimate benchmarks from the PARSEC benchmark suite are listed under graphics. The Facesim benchmark generates a visually realistic model of a human face and the Fluidanimate benchmark simulates an incompressible fluid for interactive animation purposes [24]. Graphical applications are important in IoT objects which need to interact with users via graphical user interfaces.

The FFT benchmark from the SPLASH2 benchmark suite is listed under signal processing and communication. The FFT benchmark is an implementation of the Fast Fourier Transform algorithm which is optimized to minimize interprocess communication [29]. Signal processing and communication is one of the most common applications in an IoT network. FFT is an important Digital Signal Processing (DSP) algorithm which is required in communication of data over Software Defined Radios (SDR) [14].

These benchmarks, which require higher processing capabilities, are useful in determining the microarchitecture configurations of the high-performance optimized host processor for the two-tiered heterogeneous processor architecture.


TABLE VII: MICROARCHITECTURE CONFIGURATION FOR HIGH-PERFORMANCE OPTIMIZED PROCESSORS FOR IOT

Parameter Name          Blackscholes    Freqmine    Facesim    Fluidanimate    FFT
Cores                   8               2           2          2               4
Frequency (MHz)         3200            3200        3200       3200            3200
L1-I Cache Size (kB)    64              32          8          8               128
L1-D Cache Size (kB)    128             128         64         64              32
L2 Cache Size (kB)      256             1024        1024       1024            512
L3 Cache Size (kB)      8192            2048        8192       4096            2048
Total Power (W)         4.549           1.565       1.546      1.546           2.563
Execution Time (ms)     28.1239         67.319      60.072     55.605          29.986

1) Microarchitecture configurations for low-power optimized processors for IoT: Table VI shows the microarchitecture configurations obtained for the Cholesky and Radix benchmarks from the SPLASH2 benchmark suite. In these configurations, we note that for a low-power optimized processor, the lowest operating frequency and core count are selected. This result can be interpreted intuitively, because a high operating frequency and a high number of cores increase the power consumption of the processor. We also note that these configurations have large L1-D cache sizes. This is because of the large workload offered by the test benchmarks. This is representative of the growing IoT ecosystem in which large volumes of data are gathered from a large number of sensing elements. The values of total power and execution time for the microarchitecture configurations are also shown in Table VI. We observe that the power values are in the range of a hundred milliwatts and the execution time is in the range of a few hundred milliseconds. These values are within the operational requirements of most IoT deployments. These configurations implement the interface processors in the two-tiered heterogeneous processor architecture. With low-power requirements, these processors can always be operated in active mode without impacting the power budget of IoT deployments.

2) Microarchitecture configuration for high-performance optimized processors for IoT: Table VII shows the microarchitecture configurations obtained for the Blackscholes, Freqmine, Facesim and Fluidanimate benchmarks from the PARSEC benchmark suite and the FFT benchmark from the SPLASH2 benchmark suite. We analyze the microarchitecture configurations obtained for these test benchmarks according to the categorization discussed in subsection VI-B. We observe that for data analysis and data mining applications, represented by the Blackscholes and Freqmine benchmarks, higher performance is achieved primarily by the increase in operating frequency. We note that the size of the L1-D cache for these applications is also high, which is because both are highly data-parallel benchmarks. The size of the L2 cache, for Blackscholes, and, L3 cache, for Freqmine, is also high, which is also a result of data-parallelism in these benchmarks. For graphics applications, represented by the Facesim and Fluidanimate benchmarks, higher performance can again be attributed to the increase in operating frequency. These benchmarks are also highly data-parallel, which explains the large L1-D cache, L2 cache and L3 cache in the resulting microarchitecture configurations. In signal processing and communication applications, represented by the FFT benchmark, performance improvement, similar to other applications, is attained by an increase in operating frequency. However, FFT requires a larger instruction cache as compared to larger data caches for other applications. The higher L1-I cache size could be a result of the FFT benchmark being optimized for low interprocess communication.

The total power and execution time of each microarchitecture configuration are also listed in Table VII. These configurations have high total power values in the range of one to a few watts but significantly low execution time values in the range of a few tens of milliseconds. These configurations implement the host processor in the two-tiered heterogeneous processor architecture. Due to their high-power requirement, these processors are mostly kept in sleep mode and are activated intermittently for short durations to save energy and prolong battery life. Because these processors have shorter execution times, they can execute their tasks quickly and go to sleep, thus decreasing the duration for which they are active.

VII. CONCLUSION AND FUTURE WORK

In this paper, we proposed a temporally efficient design space exploration methodology for selecting microarchitecture configurations of processors for IoT. Our exploration methodology consisted of four phases. In the first phase, we determined the best initial settings for tunable processor design parameters using the initial one-shot search method. We also calculated the significance of each design parameter on the overall design in this phase. The results of this phase were used in the second phase to separate the processor design parameters into distinct search sets using an exploration threshold value supplied by the system designer. The third and the fourth phases of the methodology implemented exhaustive and greedy search methods to prune these search sets to determine the best microarchitecture configuration of the processor.

We tested our methodology over two design spaces, one for determining low-power optimized and the other for determining high-performance optimized processors for IoT. We validated the results obtained from our methodology by comparing them with solutions obtained from fully exhaustive exploration of the design spaces. Our results revealed that our methodology obtained microarchitecture configurations within 2.23%–3.69% of the configurations obtained from fully exhaustive search. Our methodology explored only 3%–5% of the overall design space to determine these high-quality solutions. This resulted in a 24.16× average speedup in design space exploration as compared to the time required for fully exhaustive exploration.

We also described a two-tiered heterogeneous processor architecture for incorporating power-optimized performance in IoT objects. We used the results obtained from the evaluation of our design space exploration methodology to describe the two-tiered architecture. We categorized the test benchmarks into four different categories, relating them to possible IoT use cases, and analyzed the microarchitecture configurations determined for these benchmarks to make our assertions on processors for IoT objects. We determined that for low-power optimization, microarchitecture configurations with lower core count and lower operating frequency are more suitable. For high-performance optimization, improvement in performance primarily results from an increase in operating frequency. We also analyzed the cache hierarchy for different microarchitecture configurations and related it to the type and size of workloads offered by the test benchmarks.

In the future, we plan to investigate microarchitecture configurations of ultra-low power processors for IoT. We also intend to test our design space exploration methodology using standard IoT benchmarks. We further aim to improve our methodology by incorporating better optimization techniques such as genetic and evolutionary algorithms and machine learning. Finally, we plan to study the practical applicability of the two-tiered heterogeneous processor model for processors for IoT objects, and to compare the model with processor architecture models currently in use in the IoT market.

REFERENCES

[1] J. Chase, “The evolution of the internet of things - from connected things to living in the data, preparing for challenges and IoT readiness,” Texas Instruments, Tech. Rep., Sep 2013.

[2] (2015, Nov) Gartner says 6.4 billion connected “things” will be in use in 2016, up 30 percent from 2015. [Online]. Available: http://www.gartner.com/newsroom/id/3165317

[3] S. Bath. (2016, Aug) Developing solutions for the internet of things. [Online]. Available: https://www.intrinsyc.com/increasing-solution-differentiation-edge-based-heterogeneous-computing/

[4] “Developing solutions for the internet of things,” Intel, Tech. Rep., 2014.

[5] S. Matalon, R. Klein, and C. Walls, “Embedded system power consumption: A software or hardware issue?” Mentor Graphics, Tech. Rep., Jun 2011.

[6] “Intelligent flexible IoT nodes,” ARM, Tech. Rep., Oct 2015.

[7] Y. Veller and S. Matalon, “Why you should optimize power at the electronic system level,” Mentor Graphics, Tech. Rep., Aug 2010.

[8] C. Rommel, “Architecting success with heterogenous systems,” VDC Research, Mentor Graphics, Tech. Rep., 2016.

[9] “IoT opportunity demands new approach to mcu-based embedded designs - rapidly moving market requires integrated silicon/software platform,” Renesas and Synergy, Tech. Rep., Oct 2015.

[10] J. Branke, K. Deb, K. Miettinen, and R. Slowinski, Multiobjective Optimization - Interactive and Evolutionary Approaches. Verlag Berlin Heidelberg: Springer, 2008.

[11] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY, USA: Cambridge University Press, 2004.

[12] K. Char, “Internet of things system design with integrated wireless MCUs,” Silicon Labs, ARM, Tech. Rep., Oct 2015.

[13] J. Geuzebroek and A. Vaassen, “Building an efficient, tightly coupled embedded system using an extensible processor,” Synopsys, Tech. Rep., Jun 2014.

[14] T. Adegbija, A. Rogacs, C. Patel, and A. Gordon-Ross, “Enabling right-provisioned microprocessor architectures for the internet of things,” in ASME Proceedings of International Mechanical Engineering Congress and Exposition, Houston, Texas, USA, Nov 2015.

[15] J. Michanan, R. Dewri, and M. J. Rutherford, “Understanding the power-performance tradeoff through pareto analysis of live performance data,” in Proceedings of International Green Computing Conference (IGCC), Dallas, Texas, USA, Nov 2014.

[16] Q. Guo, T. Chen, Y. Chen, Z.-H. Zhou, W. Hu, and Z. Xu, “Effective and efficient microprocessor design space exploration using unlabeled design configurations,” ACM Transactions on Intelligent Systems and Technology, vol. 5, no. 1, pp. 20:1–20:18, Jan 2014.

[17] M. Monchiero, R. Canal, and A. Gonzalez, “Power/performance/thermal design-space exploration for multicore architectures,” IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 5, pp. 666–681, May 2008.

[18] T. Givargis and F. Vahid, “Platune: A tuning framework for system-on-a-chip platforms,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 21, no. 11, pp. 1317–1327, Nov 2002.

[19] M. Palesi and T. Givargis, “Multi-objective design space exploration using genetic algorithms,” in Proceedings of the 10th International Symposium on Hardware/Software Codesign (CODES), Estes Park, CO, USA, May 2002.

[20] C. Silvano, W. Fornaciari, G. Palermo, V. Zaccaria, F. Castro, M. Martinez, S. Bocchio, R. Zafalon, P. Avasare, G. Vanmeerbeeck, C. Ykman-Couvreur, M. Wouters, C. Kavka, L. Onesti, A. Turco, U. Bondi, G. Mariani, H. Posadas, E. Villar, C. Wu, F. Dongrui, Z. Hao, and T. Shibin, “MULTICUBE: Multi-objective design space exploration of multi-core architectures,” in Proceedings of IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Lixouri, Kefalonia, Jul 2010.

[21] A. Munir, A. Gordon-Ross, S. Lysecky, and R. Lysecky, “A lightweight dynamic optimization methodology and application metrics estimation model for wireless sensor networks,” Sustainable Computing: Informatics and Systems, vol. 3, no. 2, pp. 94–108, Jun 2013.

[22] D. Sheldon, “Design space exploration of parameterized systems using design of experiments,” Ph.D. dissertation, Department of Computer Science, Dec 2011.

[23] E. K. Ardestani and J. Renau, “ESESC: A fast multicore simulator using time-based sampling,” in Proceedings of IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), Washington, DC, USA, Feb 2013.

[24] C. Bienia, “Benchmarking modern multiprocessors,” Ph.D. dissertation, Department of Computer Science, Jan 2011.

[25] Y. Bao, C. Bienia, and K. Li, The PARSEC Benchmark Suite Tutorial - PARSEC 3.0, San Jose, CA, USA, Jun 2011.

[26] (2015) Perl reference. [Online]. Available: http://perlmaven.com/

[27] J. McNamara. (2015, Apr) Excel-writer-XLSX. [Online]. Available: http://search.cpan.org/dist/Excel-Writer-XLSX/

[28] A. F. Lorenzon, M. C. Cera, and A. C. S. Beck, “On the influence of static power consumption in multicore embedded systems,” in 2015 IEEE International Symposium on Circuits and Systems (ISCAS), Lisbon, Portugal, May 2015.

[29] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, “The SPLASH-2 programs: Characterization and methodological considerations,” in Proceedings of 22nd Annual International Symposium on Computer Architecture (ISCA), Santa Margherita Ligure, Italy, Jun 1995.

Prasanna Kansakar is a PhD student in the Department of Computer Science (CS) at Kansas State University (K-State), Manhattan, KS. His research interests include Internet of Things, embedded and cyber-physical systems, computer architecture, multicore, secure and trustworthy systems, and hardware-based security. Kansakar has an MS degree in computer science and engineering from the University of Nevada, Reno (UNR). He is a student member of the IEEE.

Arslan Munir is currently an Assistant Professor in the Department of Computer Science (CS) at Kansas State University (K-State). He holds a Michelle Munson-Serban Simu Keystone Research Faculty Scholarship from the College of Engineering. He was a postdoctoral research associate in the Electrical and Computer Engineering (ECE) department at Rice University, Houston, Texas, USA from May 2012 to June 2014. He received his M.A.Sc. in ECE from the University of British Columbia (UBC), Vancouver, Canada, in 2007 and his Ph.D. in ECE from the University of Florida (UF), Gainesville, Florida, USA, in 2012. From 2007 to 2008, he worked as a software development engineer at Mentor Graphics in the Embedded Systems Division.

Munir’s current research interests include embedded and cyber-physical systems, secure and trustworthy systems, hardware-based security, computer architecture, multicore, parallel computing, distributed computing, reconfigurable computing, artificial intelligence (AI) safety and security, data analytics, and fault tolerance. Munir has received many academic awards, including the doctoral fellowship from the Natural Sciences and Engineering Research Council (NSERC) of Canada. He earned gold medals for best performance in electrical engineering, and gold medals and academic roll of honor for securing rank one in pre-engineering provincial examinations (out of approximately 300,000 candidates). He is a Senior Member of IEEE.